# IMDB ratings prediction 

Objective: to predict ratings of new shows <br> 
Datasets: https://datasets.imdbws.com/ <br> 
Datasets legend: https://www.imdb.com/interfaces/ <br> 

- *Datasets for this project downloaded on 1 June 2021*



**Introduction** 

In this workbook, I attempt to find the best of four models (Linear Regression, K Nearest Neighbors (KNN), Bayesian Ridge and Random Forest) for predicting the ratings of new shows.

The selected features are genres, title types and run time. I acknowledge that there are shortfalls in using only 3 features. Nevertheless, I am still quite satisfied with the outcome of this project, given that I am just 2 months into learning Python and ML. I hope to come back to this and explore more in the future.

The commented-out code was used for data cleaning. Only the cleaned datasets are uploaded with this workbook. To start from the original IMDb datasets, download the entire notebook, uncomment that code on your local machine and run it against the raw files.

**Table of contents** 

<a id='100'></a>
1. [Loading data](#1) 
2. [Data cleaning](#2)
3. [Analysing genres and ratings](#3)
    - Model 1: [Linear Regression](#3.1)
    - Model 2: [K nearest neighbors](#3.2)
    - Model 3: [Bayesian Ridge](#3.3)
    - Model 4: [Random Forest](#3.4) 
    - [Optimal Genre VS Rating model](#3.5)<br>
<br>
4. [Analysing title type and ratings](#4)
    - Model 1: [Linear Regression](#4.1)
    - Model 2: [K nearest neighbors](#4.2)
    - Model 3: [Bayesian Ridge](#4.3)
    - Model 4: [Random Forest](#4.4) 
    - [Optimal Title Type VS Rating model](#4.5)<br>
<br>
5. [Analysing runtime and ratings](#5)
    - Model 1: [Linear Regression](#5.1)
    - Model 2: [K nearest neighbors](#5.2) 
    - Model 3: [Bayesian Ridge](#5.3)
    - Model 4: [Random Forest](#5.4)
    - [Optimal Runtime VS Rating model](#5.5)<br>
<br>
6. [Overview of models](#6)
7. [Overall Prediction](#7)
8. [Optimization](#8)
9. [Predictions](#9)
10. [Interesting findings](#10)
11. [Future opportunities](#11)

In [None]:
# importing libraries
import pandas as pd 
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt
import itertools

from sklearn.linear_model import LinearRegression, BayesianRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score 

SEED = 0 

<a id='1'></a>
# [Loading data](#100) 

In [None]:
# titles = pd.read_csv('title.basics.tsv.gz', sep='\t', low_memory=False)
# crews = pd.read_csv('title.crew.tsv.gz', sep='\t')
# principals = pd.read_csv('title.principals.tsv.gz', sep='\t')
# names = pd.read_csv('name.basics.tsv.gz', sep='\t')
# ratings = pd.read_csv('title.ratings.tsv.gz', sep='\t')

# # saving ratings as csv
# ratings.to_csv('ratings.csv', index=False)

ratings = pd.read_csv('../input/imdb-dataset/ratings.csv')
### ratings = pd.read_csv('ratings.csv') # use this instead for local machine

<a id='2'></a>
# [Data cleaning](#100)  

### Cleaning up titles dataframe

In [None]:
# # drop originalTitle, endYear and isAdult columns in titles
# titles.drop(['originalTitle', 'endYear','isAdult'], axis=1, inplace=True)

# # drop rows with null values
# titles.dropna(inplace=True)

# # drop rows with \N
# titles.drop(titles[titles.startYear=='\\N'].index, inplace=True) 
# titles.drop(titles[titles.runtimeMinutes=='\\N'].index, inplace=True) 
# titles.drop(titles[titles.genres=='\\N'].index, inplace=True) 

# # drop rows with startYear<1990
# titles["startYear"] = titles["startYear"].astype(int)
# titles.drop(titles[titles.startYear<1990].index, inplace=True)

# # drop rows with runtime<30 and runtime>1500
# titles["runtimeMinutes"] = titles["runtimeMinutes"].astype(int)
# titles.drop(titles[titles.runtimeMinutes<30].index, inplace=True)
# titles.drop(titles[titles.runtimeMinutes>1500].index, inplace=True)

In [None]:
# # saving as csv 
# titles.to_csv('titles_clean.csv', index=False) 

# loading cleaned file
titles = pd.read_csv('../input/imdb-dataset/titles_clean.csv')
### titles = pd.read_csv('titles_clean.csv') # use this instead for local machine
titles.head() 

### Merging titles with ratings

In [None]:
# # merge titles with ratings 
# title_ratings = titles.merge(ratings, how='left', left_on='tconst', right_on='tconst')

# # convert numVotes to int and drop rows with <100 votes
# title_ratings.dropna(inplace=True)
# title_ratings["numVotes"] = title_ratings["numVotes"].astype(int)
# title_ratings.drop(title_ratings[title_ratings.numVotes<100].index, inplace=True)

In [None]:
# # saving as csv
# title_ratings.to_csv('title_ratings.csv', index=False)

# loading cleaned file
title_ratings = pd.read_csv('../input/imdb-dataset/title_ratings.csv')
### title_ratings = pd.read_csv('title_ratings.csv') # use this instead for local machine
title_ratings.head() 

### Merging names with ratings

In [None]:
# # merge principals with ratings
# principals.drop(['ordering', 'job', 'characters'], axis=1, inplace=True)
# principals_ratings = principals.merge(ratings, how='left', left_on='tconst', right_on='tconst')

# # calculate average rating and average numVotes per person 
# principals_avg_ratings = principals_ratings.groupby(['nconst']).mean(numeric_only=True)

In [None]:
# # drop column
# names.drop(['knownForTitles'], axis=1, inplace=True)

# # merging names with principals
# names_ratings = names.merge(principals_avg_ratings, how='left', left_on='nconst', right_on='nconst')

# # drop null values
# names_ratings = names_ratings.drop(names_ratings[names_ratings.averageRating.isnull() == True].index)

In [None]:
# # saving as csv
# names_ratings.to_csv('names_ratings.csv', index=False)

# loading cleaned file
names_ratings = pd.read_csv('../input/imdb-dataset/names_ratings.csv')
### names_ratings = pd.read_csv('names_ratings.csv') # use this instead for local machine
names_ratings.head() 

### Creating one hot encoding for genres 
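
A minimal sketch of the encoding idea, using only the standard library: each title's comma-separated genre string is split into whole genre names before being matched against the vocabulary. The helper name `multi_hot` and the sample strings are hypothetical and only for illustration. Splitting first matters because a plain substring test treats `'Music'` as present in `'Comedy,Musical'`.

```python
def multi_hot(genres_field, vocabulary):
    """Return a list of 0/1 flags, one per genre in `vocabulary`."""
    # split into whole genre names so 'Music' does not match 'Musical'
    tokens = set(genres_field.split(','))
    return [1 if genre in tokens else 0 for genre in vocabulary]


# hypothetical vocabulary and title, for illustration only
vocabulary = ['Comedy', 'Music', 'Musical']
exact_flags = multi_hot('Comedy,Musical', vocabulary)

# the same row encoded with a substring test, for comparison
substring_flags = [1 if genre in 'Comedy,Musical' else 0 for genre in vocabulary]

print(exact_flags)
print(substring_flags)
```

The two printed lists differ only in the `Music` position, which is the case to watch for when a genre name is a prefix of another.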

In [None]:
# listing unique genres
unique_genres = []
for show_genres in title_ratings.genres.str.split(pat=','): 
    for genre in show_genres: 
        if genre not in unique_genres: 
            unique_genres.append(genre) 
print(unique_genres) 
print(len(unique_genres))

In [None]:
# each show can have >1 genre 
# -> creating a list of 0s and 1s to represent genres for that title
df = title_ratings[['tconst', 'genres']]

lst = []
for i in range(len(df)): 
    sub_lst = []
    sub_lst.append(df.tconst.iloc[i]) 
    for g in unique_genres: 
        # match whole genre names; a substring test would flag 'Music' for 'Musical'
        if g in df.genres.iloc[i].split(','): 
            sub_lst.append(1)
        else: 
            sub_lst.append(0)
    lst.append(sub_lst) 

In [None]:
# converting it to a dataframe
# column order follows unique_genres, the same order used to build lst
genre = pd.DataFrame(lst,
                    columns=['tconst'] + unique_genres)

In [None]:
genre.head() 

### Creating one hot encoding for title type 

In [None]:
# listing unique title type
unique_titleType = [] 
for titleType in title_ratings.titleType: 
    if titleType not in unique_titleType: 
        unique_titleType.append(titleType)
print(unique_titleType)
print(len(unique_titleType))

In [None]:
# this is easier to do one hot because there is only 1 title type for each show
titleType = pd.get_dummies(title_ratings.titleType)
titleType = pd.concat([title_ratings.tconst,titleType], axis=1)

In [None]:
titleType.head() 

<a id='3'></a>
### [Analysing genres and ratings](#100) 

For the analysis, I will train, test and evaluate all 4 models. As this is a regression problem, I will use Mean Absolute Error (MAE) to evaluate each model's accuracy and select the model with the lowest MAE.
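
As a small illustration of the metric, the sketch below computes MAE by hand: the average of the absolute differences between actual and predicted ratings. The function name `mean_absolute_error_manual` and the sample ratings are hypothetical; the notebook itself uses `mean_absolute_error` from scikit-learn.

```python
def mean_absolute_error_manual(actual, predicted):
    """Average absolute difference between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError('actual and predicted must have the same length')
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# hypothetical ratings, for illustration only
actual = [7.0, 6.5, 8.0, 5.5]
predicted = [6.5, 6.5, 7.0, 6.0]

# absolute errors are 0.5, 0.0, 1.0 and 0.5, so their mean is 0.5
mae = mean_absolute_error_manual(actual, predicted)
print(mae)
```

Because MAE is in the same units as the target, a value of 0.5 here means the predictions are off by half a rating point on average.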

In [None]:
# merging one hot encoded genre with ratings
genre_ratings = genre.merge(ratings, how='left') 
genre_ratings.head() 

In [None]:
# top 10 genres
total_by_genre = genre.drop('tconst', axis=1).sum().sort_values(ascending=False) 
total_by_genre.head(10) 

In [None]:
# visualizing top 10 in violin plot
genre_type = ['Comedy', 'Fantasy', 'Romance', 'Short', 
              'Western', 'Drama', 'Thriller', 'Documentary', 
              'Musical', 'Crime', 'Family', 'Biography', 
              'History', 'Animation', 'Sci-Fi', 'Horror', 
              'Action', 'Music', 'Mystery', 'Adventure', 
              'Sport', 'War', 'Adult', 'Game-Show', 
              'News', 'Talk-Show', 'Reality-TV']

top_10_genres = ['Drama', 'Comedy', 'Crime', 'Action',
                 'Mystery', 'Romance', 'Thriller', 'Adventure',
                 'Documentary', 'Horror']

unpivot_genre_ratings = pd.melt(genre_ratings, 
                                id_vars=['averageRating'], 
                                value_vars=top_10_genres)

unpivot_genre_ratings = unpivot_genre_ratings.loc[unpivot_genre_ratings.value>0]
unpivot_genre_ratings.rename(columns={'averageRating': 'ratings', 'variable': 'genres'}, inplace=True)

plt.figure(figsize=(16, 6))
sns.violinplot(data=unpivot_genre_ratings, 
               x='genres', 
               y='ratings', 
               gridsize=120,
               width=1.2)
plt.xlabel('Genres', size=20) 
plt.ylabel('Ratings', size=20)
plt.title('Genres VS Ratings', size=30)
# plt.savefig('genres_vs_ratings.png', dpi=300)
plt.show() 

From the chart, we can see that drama, crime, mystery and adventure shows are generally rated higher. Thriller and horror shows, on the other hand, have lower ratings.

In [None]:
# correlation heatmap
fig, ax = plt.subplots(figsize=(30,30))  
# drop the string id column so corr() only sees numeric columns
sns.heatmap(genre_ratings.drop(columns='tconst').corr(), annot=True, annot_kws={"size":10}, fmt=".2%")

In [None]:
# data preparation for training and testing
X_gen = genre_ratings[['Comedy', 'Fantasy', 'Romance', 'Short', 
              'Western', 'Drama', 'Thriller', 'Documentary', 
              'Musical', 'Crime', 'Family', 'Biography', 
              'History', 'Animation', 'Sci-Fi', 'Horror', 
              'Action', 'Music', 'Mystery', 'Adventure', 
              'Sport', 'War', 'Adult', 'Game-Show', 
              'News', 'Talk-Show', 'Reality-TV']]
y_gen = genre_ratings['averageRating']

In [None]:
# splitting the data into training and testing sets
X_gen_train, X_gen_test, y_gen_train, y_gen_test = train_test_split(X_gen, y_gen,
                                                                    test_size = 0.2, 
                                                                    shuffle=True,
                                                                    random_state=SEED)

<a id='3.1'></a>
##### [Linear Regression](#100)

In [None]:
# train the model
LR_gen_regressor = LinearRegression() 
LR_gen_regressor.fit(X_gen_train, y_gen_train) 

In [None]:
# test the model
LR_y_gen_pred = LR_gen_regressor.predict(X_gen_test)

LR_compare_gen_df = pd.DataFrame({'Actual': y_gen_test, 
                                  'Predicted Output': LR_y_gen_pred})
LR_compare_gen_df.head()

In [None]:
# evaluate the model
LR_gen_r2 = r2_score(y_gen_test, LR_y_gen_pred)
LR_gen_MAE = mean_absolute_error(y_gen_test, LR_y_gen_pred)

print(f'Coefficients= {LR_gen_regressor.coef_}')

print(f'MAE            = {LR_gen_MAE}')
print(f'MSE            = {mean_squared_error(y_gen_test, LR_y_gen_pred)}')
print(f'RMSE           = {np.sqrt(mean_squared_error(y_gen_test, LR_y_gen_pred))}')
print(f'r2             = {LR_gen_r2}')


print(f'Training score = {LR_gen_regressor.score(X_gen_train, y_gen_train)}')
print(f'Test score     = {LR_gen_regressor.score(X_gen_test, y_gen_test)}')

<a id='3.2'></a>
##### [KNN](#100) 

In [None]:
# train the model
KNN_gen_regressor = KNeighborsRegressor() 
KNN_gen_regressor.fit(X_gen_train, y_gen_train) 

In [None]:
%%time 
# %%time reports how long the cell takes to run; added to cells that take longer

# test the model
KNN_y_gen_pred = KNN_gen_regressor.predict(X_gen_test)

KNN_compare_gen_df = pd.DataFrame({'Actual': y_gen_test, 
                                   'Predicted Output': KNN_y_gen_pred})
KNN_compare_gen_df.head()

In [None]:
%%time 

# evaluate the model
KNN_gen_r2 = r2_score(y_gen_test, KNN_y_gen_pred)
KNN_gen_MAE = mean_absolute_error(y_gen_test, KNN_y_gen_pred)

print(f'MAE            = {KNN_gen_MAE}')
print(f'MSE            = {mean_squared_error(y_gen_test, KNN_y_gen_pred)}')
print(f'RMSE           = {np.sqrt(mean_squared_error(y_gen_test, KNN_y_gen_pred))}')
print(f'r2             = {KNN_gen_r2}')

<a id='3.3'></a>
##### [Bayesian Ridge](#100)

In [None]:
# train the model 
BR_gen_regressor = BayesianRidge() 
BR_gen_regressor.fit(X_gen_train, y_gen_train) 

In [None]:
# test the model
BR_y_gen_pred = BR_gen_regressor.predict(X_gen_test)

BR_compare_gen_df = pd.DataFrame({'Actual': y_gen_test, 
                                  'Predicted Output': BR_y_gen_pred})
BR_compare_gen_df.head()

In [None]:
# evaluate the model
BR_gen_r2 = r2_score(y_gen_test, BR_y_gen_pred)
BR_gen_MAE = mean_absolute_error(y_gen_test, BR_y_gen_pred)


print(f'Coefficients = {BR_gen_regressor.coef_}')
print(f'MAE            = {BR_gen_MAE}')
print(f'MSE            = {mean_squared_error(y_gen_test, BR_y_gen_pred)}')
print(f'RMSE           = {np.sqrt(mean_squared_error(y_gen_test, BR_y_gen_pred))}')
print(f'r2             = {BR_gen_r2}')

<a id='3.4'></a>
##### [Random Forest ](#100)

In [None]:
# train the model
RF_gen_regressor = RandomForestRegressor(random_state=SEED)  # seeded so the scores are reproducible
RF_gen_regressor.fit(X_gen_train, y_gen_train) 

In [None]:
# test the model
RF_y_gen_pred = RF_gen_regressor.predict(X_gen_test)

RF_compare_gen_df = pd.DataFrame({'Actual': y_gen_test, 
                                  'Predicted Output': RF_y_gen_pred})
RF_compare_gen_df.head()

In [None]:
# evaluate the model
RF_gen_r2 = r2_score(y_gen_test, RF_y_gen_pred)
RF_gen_MAE = mean_absolute_error(y_gen_test, RF_y_gen_pred)

print(f'MAE            = {RF_gen_MAE}')
print(f'MSE            = {mean_squared_error(y_gen_test, RF_y_gen_pred)}')
print(f'RMSE           = {np.sqrt(mean_squared_error(y_gen_test, RF_y_gen_pred))}')
print(f'r2             = {RF_gen_r2}')

<a id='3.5'></a>
### [Optimal Genre VS Rating model ](#100)
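
The selection rule used here is simply "keep the model with the lowest MAE". A compact sketch of that rule, assuming the metrics are collected in a dictionary; the helper name `pick_lowest_mae` and all numbers are hypothetical and are not results from this notebook.

```python
def pick_lowest_mae(metrics):
    """Return (model name, metrics) for the entry with the smallest MAE."""
    # min() keeps the first entry on a tie, like a chain of strict '<' checks
    best_name = min(metrics, key=lambda name: metrics[name]['MAE'])
    return best_name, metrics[best_name]


# hypothetical scores, for illustration only
metrics = {
    'Linear Regression': {'MAE': 0.95, 'r2': 0.10},
    'K Nearest Neighbors': {'MAE': 0.98, 'r2': 0.05},
    'Bayesian Ridge': {'MAE': 0.95, 'r2': 0.10},
    'Random Forest': {'MAE': 0.93, 'r2': 0.13},
}

best_name, best_metrics = pick_lowest_mae(metrics)
print(best_name, best_metrics)
```

Keeping the scores in one dictionary means the same helper can be reused for the genre, title type and runtime comparisons.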

In [None]:
# optimal model search for genre
optimal_gen_MAE = 100 
optimal_gen_r2 = 0
optimal_gen_model = ''

# determined using lowest MAE score
if LR_gen_MAE < optimal_gen_MAE:
    optimal_gen_MAE = LR_gen_MAE
    optimal_gen_r2 = LR_gen_r2
    optimal_gen_model = 'Linear Regression' 
if KNN_gen_MAE < optimal_gen_MAE:
    optimal_gen_r2 = KNN_gen_r2
    optimal_gen_MAE = KNN_gen_MAE
    optimal_gen_model = 'K Nearest Neighbors' 
if BR_gen_MAE < optimal_gen_MAE:
    optimal_gen_r2 = BR_gen_r2
    optimal_gen_MAE = BR_gen_MAE
    optimal_gen_model = 'Bayesian Ridge' 
if RF_gen_MAE < optimal_gen_MAE:
    optimal_gen_MAE = RF_gen_MAE
    optimal_gen_r2 = RF_gen_r2
    optimal_gen_model = 'Random Forest' 
    

print(f'Optimal model is \033[1m{optimal_gen_model}\033[0m with MAE of \033[1m{optimal_gen_MAE}\033[0m and r2 of \033[1m{optimal_gen_r2}\033[0m')

<a id='4'></a>
### [Analysing title type and ratings](#100)

In [None]:
# merging one hot encoded title type with ratings
titleType_ratings = titleType.merge(ratings, how='left') 
titleType_ratings.head() 

In [None]:
total_by_tt = titleType.drop('tconst', axis=1).sum().sort_values(ascending=False) 
total_by_tt

In [None]:
# visualizing with violin plot
title_type = ['movie', 'short', 'tvEpisode', 'tvMiniSeries', 
              'tvMovie', 'tvSeries', 'tvShort', 'tvSpecial', 
              'video', 'videoGame']

top_tt = ['movie', 'tvEpisode', 'tvSeries', 'tvMovie',
          'video', 'tvMiniSeries', 'tvSpecial']

unpivot_tt_ratings = pd.melt(titleType_ratings, 
                             id_vars=['averageRating'], 
                             value_vars=top_tt)

unpivot_tt_ratings = unpivot_tt_ratings.loc[unpivot_tt_ratings.value>0]
unpivot_tt_ratings.rename(columns={'averageRating': 'ratings', 'variable': 'titletypes'}, inplace=True)

plt.figure(figsize=(16, 6))
sns.violinplot(data=unpivot_tt_ratings, 
               x='titletypes', 
               y='ratings', 
               gridsize=120,
               width=1.2)
plt.xlabel('Title Types', size=20) 
plt.ylabel('Ratings', size=20)
plt.title('Title Types VS Ratings', size=30)
plt.savefig('titletypes_vs_ratings.png', dpi=300)
plt.show() 

In [None]:
# correlation heatmap
fig, ax = plt.subplots(figsize=(30,30))  
# drop the string id column so corr() only sees numeric columns
sns.heatmap(titleType_ratings.drop(columns='tconst').corr(), annot=True, annot_kws={"size":20}, fmt=".2%")

In [None]:
# data preparation for training and testing
X_tt = titleType_ratings[['movie', 'short', 'tvEpisode', 'tvMiniSeries',
                          'tvMovie', 'tvSeries', 'tvShort', 'tvSpecial', 
                          'video', 'videoGame']]
y_tt = titleType_ratings['averageRating']

In [None]:
# splitting the data into training and testing sets
X_tt_train, X_tt_test, y_tt_train, y_tt_test = train_test_split(X_tt, y_tt,
                                                                test_size = 0.2, 
                                                                shuffle=True,
                                                                random_state=SEED)

<a id='4.1'></a>
##### [Linear Regression](#100)

In [None]:
# train the model
LR_tt_regressor = LinearRegression() 
LR_tt_regressor.fit(X_tt_train, y_tt_train) 

In [None]:
# test the model
LR_y_tt_pred = LR_tt_regressor.predict(X_tt_test)

LR_compare_tt_df = pd.DataFrame({'Actual': y_tt_test, 
                                 'Predicted Output': LR_y_tt_pred})
LR_compare_tt_df.head()

In [None]:
# evaluate the model
LR_tt_r2 = r2_score(y_tt_test, LR_y_tt_pred)
LR_tt_MAE = mean_absolute_error(y_tt_test, LR_y_tt_pred)

print(f'Coefficients= {LR_tt_regressor.coef_}')

print(f'MAE            = {LR_tt_MAE}')
print(f'MSE            = {mean_squared_error(y_tt_test, LR_y_tt_pred)}')
print(f'RMSE           = {np.sqrt(mean_squared_error(y_tt_test, LR_y_tt_pred))}')

print(f'r2             = {LR_tt_r2}')


print(f'Training score = {LR_tt_regressor.score(X_tt_train, y_tt_train)}')
print(f'Test score     = {LR_tt_regressor.score(X_tt_test, y_tt_test)}')

<a id='4.2'></a>
##### [KNN](#100)

In [None]:
# train the model
KNN_tt_regressor = KNeighborsRegressor() 
KNN_tt_regressor.fit(X_tt_train, y_tt_train) 

In [None]:
%%time 

# test the model
KNN_y_tt_pred = KNN_tt_regressor.predict(X_tt_test)

KNN_compare_tt_df = pd.DataFrame({'Actual': y_tt_test, 
                                  'Predicted Output': KNN_y_tt_pred})
KNN_compare_tt_df.head()

In [None]:
%%time 

# evaluate the model
KNN_tt_r2 = r2_score(y_tt_test, KNN_y_tt_pred)
KNN_tt_MAE = mean_absolute_error(y_tt_test, KNN_y_tt_pred)

print(f'MAE            = {KNN_tt_MAE}')
print(f'MSE            = {mean_squared_error(y_tt_test, KNN_y_tt_pred)}')
print(f'RMSE           = {np.sqrt(mean_squared_error(y_tt_test, KNN_y_tt_pred))}')
print(f'r2             = {KNN_tt_r2}')

<a id='4.3'></a>
##### [Bayesian Ridge](#100)

In [None]:
# train the model
BR_tt_regressor = BayesianRidge() 
BR_tt_regressor.fit(X_tt_train, y_tt_train) 

In [None]:
# test the model
BR_y_tt_pred = BR_tt_regressor.predict(X_tt_test)

BR_compare_tt_df = pd.DataFrame({'Actual': y_tt_test, 
                                 'Predicted Output': BR_y_tt_pred})
BR_compare_tt_df.head()

In [None]:
# evaluate the model
BR_tt_r2 = r2_score(y_tt_test, BR_y_tt_pred)
BR_tt_MAE = mean_absolute_error(y_tt_test, BR_y_tt_pred)


print(f'Coefficients = {BR_tt_regressor.coef_}')
print(f'MAE            = {BR_tt_MAE}')
print(f'MSE            = {mean_squared_error(y_tt_test, BR_y_tt_pred)}')
print(f'RMSE           = {np.sqrt(mean_squared_error(y_tt_test, BR_y_tt_pred))}')
print(f'r2             = {BR_tt_r2}')

<a id='4.4'></a>
##### [Random Forest ](#100)

In [None]:
# train the model
RF_tt_regressor = RandomForestRegressor(random_state=SEED)  # seeded so the scores are reproducible
RF_tt_regressor.fit(X_tt_train, y_tt_train) 

In [None]:
# test the model
RF_y_tt_pred = RF_tt_regressor.predict(X_tt_test)

RF_compare_tt_df = pd.DataFrame({'Actual': y_tt_test, 
                                 'Predicted Output': RF_y_tt_pred})
RF_compare_tt_df.head()

In [None]:
# evaluate the model
RF_tt_r2 = r2_score(y_tt_test, RF_y_tt_pred)
RF_tt_MAE = mean_absolute_error(y_tt_test, RF_y_tt_pred)

print(f'MAE            = {RF_tt_MAE}')
print(f'MSE            = {mean_squared_error(y_tt_test, RF_y_tt_pred)}')
print(f'RMSE           = {np.sqrt(mean_squared_error(y_tt_test, RF_y_tt_pred))}')
print(f'r2             = {RF_tt_r2}')

<a id='4.5'></a>
### [Optimal Title Type VS Rating model](#100)

In [None]:
# optimal model search for title type
optimal_tt_MAE = 100 
optimal_tt_r2 = 0 
optimal_tt_model = ''

# determined using lowest MAE score
if LR_tt_MAE < optimal_tt_MAE:
    optimal_tt_MAE = LR_tt_MAE
    optimal_tt_r2 = LR_tt_r2
    optimal_tt_model = 'Linear Regression' 
if KNN_tt_MAE < optimal_tt_MAE:
    optimal_tt_MAE = KNN_tt_MAE
    optimal_tt_r2 = KNN_tt_r2
    optimal_tt_model = 'K Nearest Neighbors' 
if BR_tt_MAE < optimal_tt_MAE:
    optimal_tt_MAE = BR_tt_MAE
    optimal_tt_r2 = BR_tt_r2
    optimal_tt_model = 'Bayesian Ridge' 
if RF_tt_MAE < optimal_tt_MAE:
    optimal_tt_r2 = RF_tt_r2
    optimal_tt_MAE = RF_tt_MAE
    optimal_tt_model = 'Random Forest' 
    
    
print(f'Optimal model is \033[1m{optimal_tt_model}\033[0m with MAE of \033[1m{optimal_tt_MAE}\033[0m and r2 of \033[1m{optimal_tt_r2}\033[0m.')

<a id='5'></a>
### [Analysing Runtime and ratings ](#100)

In [None]:
# extract features
runtime_ratings = title_ratings[['runtimeMinutes', 'averageRating']]
runtime_ratings.head() 

In [None]:
runtime_ratings.info() 

In [None]:
# visualizing the data
plt.figure(figsize=(16, 10))
plt.scatter(title_ratings.runtimeMinutes, title_ratings.averageRating, c='orange') 
plt.xlabel('Runtime (minutes)', size=20) 
plt.ylabel('Ratings', size=20)
plt.title('Runtime VS Ratings', size=30)
plt.savefig('runtime_vs_ratings.png', dpi=300)
plt.show() 

From the chart, we can see that most shows run for less than 200 minutes. Shows that run longer than 200 minutes generally have a rating above 5.

In [None]:
# data preparation for training and testing
X_rt = runtime_ratings['runtimeMinutes'].values.reshape(-1,1)
y_rt = runtime_ratings['averageRating']

In [None]:
# splitting the data into training and testing sets
X_rt_train, X_rt_test, y_rt_train, y_rt_test = train_test_split(X_rt, y_rt,
                                                                test_size = 0.2, 
                                                                shuffle=True,
                                                                random_state=SEED)

<a id='5.1'></a>
##### [Linear Regression](#100)

In [None]:
# train the model
LR_rt_regressor = LinearRegression() 
LR_rt_regressor.fit(X_rt_train, y_rt_train) 

In [None]:
# test the model
LR_y_rt_pred = LR_rt_regressor.predict(X_rt_test)

LR_compare_rt_df = pd.DataFrame({'Actual': y_rt_test, 
                                 'Predicted Output': LR_y_rt_pred})
LR_compare_rt_df.head()

In [None]:
# evaluate the model
LR_rt_r2 = r2_score(y_rt_test, LR_y_rt_pred)
LR_rt_MAE = mean_absolute_error(y_rt_test, LR_y_rt_pred)

print(f'Coefficients= {LR_rt_regressor.coef_}')

print(f'MAE            = {LR_rt_MAE}')
print(f'MSE            = {mean_squared_error(y_rt_test, LR_y_rt_pred)}')
print(f'RMSE           = {np.sqrt(mean_squared_error(y_rt_test, LR_y_rt_pred))}')

print(f'r2             = {LR_rt_r2}')


print(f'Training score = {LR_rt_regressor.score(X_rt_train, y_rt_train)}')
print(f'Test score     = {LR_rt_regressor.score(X_rt_test, y_rt_test)}')

<a id='5.2'></a>
##### [KNN ](#100)

In [None]:
# train the model
KNN_rt_regressor = KNeighborsRegressor() 
KNN_rt_regressor.fit(X_rt_train, y_rt_train) 

In [None]:
%%time 

# test the model
KNN_y_rt_pred = KNN_rt_regressor.predict(X_rt_test)

KNN_compare_rt_df = pd.DataFrame({'Actual': y_rt_test, 
                                  'Predicted Output': KNN_y_rt_pred})
KNN_compare_rt_df.head()

In [None]:
%%time 

# evaluate the model
KNN_rt_r2 = r2_score(y_rt_test, KNN_y_rt_pred)
KNN_rt_MAE = mean_absolute_error(y_rt_test, KNN_y_rt_pred)


print(f'MAE            = {KNN_rt_MAE}')
print(f'MSE            = {mean_squared_error(y_rt_test, KNN_y_rt_pred)}')
print(f'RMSE           = {np.sqrt(mean_squared_error(y_rt_test, KNN_y_rt_pred))}')
print(f'r2             = {KNN_rt_r2}')

<a id='5.3'></a>
##### [Bayesian Ridge](#100)

In [None]:
# train the model
BR_rt_regressor = BayesianRidge() 
BR_rt_regressor.fit(X_rt_train, y_rt_train) 

In [None]:
# test the model
BR_y_rt_pred = BR_rt_regressor.predict(X_rt_test)

BR_compare_rt_df = pd.DataFrame({'Actual': y_rt_test, 
                                 'Predicted Output': BR_y_rt_pred})
BR_compare_rt_df.head()

In [None]:
# evaluate the model
BR_rt_r2 = r2_score(y_rt_test, BR_y_rt_pred)
BR_rt_MAE = mean_absolute_error(y_rt_test, BR_y_rt_pred)

print(f'Coefficients = {BR_rt_regressor.coef_}')
print(f'MAE            = {BR_rt_MAE}')
print(f'MSE            = {mean_squared_error(y_rt_test, BR_y_rt_pred)}')
print(f'RMSE           = {np.sqrt(mean_squared_error(y_rt_test, BR_y_rt_pred))}')
print(f'r2             = {BR_rt_r2}')

<a id='5.4'></a>
##### [Random Forest ](#100)

In [None]:
# train the model
RF_rt_regressor = RandomForestRegressor(random_state=SEED)  # seeded so the scores are reproducible
RF_rt_regressor.fit(X_rt_train, y_rt_train) 

In [None]:
# test the model
RF_y_rt_pred = RF_rt_regressor.predict(X_rt_test)

RF_compare_rt_df = pd.DataFrame({'Actual': y_rt_test, 
                                 'Predicted Output': RF_y_rt_pred})
RF_compare_rt_df.head()

In [None]:
# evaluate the model
RF_rt_r2 = r2_score(y_rt_test, RF_y_rt_pred)
RF_rt_MAE = mean_absolute_error(y_rt_test, RF_y_rt_pred)


print(f'MAE            = {RF_rt_MAE}')
print(f'MSE            = {mean_squared_error(y_rt_test, RF_y_rt_pred)}')
print(f'RMSE           = {np.sqrt(mean_squared_error(y_rt_test, RF_y_rt_pred))}')
print(f'r2             = {RF_rt_r2}')

<a id='5.5'></a>
### [Optimal Runtime VS Rating model](#100)

In [None]:
# optimal model search for run time
optimal_rt_MAE = 100 
optimal_rt_r2 = 0 
optimal_rt_model = ''

# determined using lowest MAE score
if LR_rt_MAE < optimal_rt_MAE:
    optimal_rt_MAE = LR_rt_MAE
    optimal_rt_r2 = LR_rt_r2
    optimal_rt_model = 'Linear Regression' 
if KNN_rt_MAE < optimal_rt_MAE:
    optimal_rt_MAE = KNN_rt_MAE
    optimal_rt_r2 = KNN_rt_r2
    optimal_rt_model = 'K Nearest Neighbors' 
if BR_rt_MAE < optimal_rt_MAE:
    optimal_rt_MAE = BR_rt_MAE
    optimal_rt_r2 = BR_rt_r2
    optimal_rt_model = 'Bayesian Ridge' 
if RF_rt_MAE < optimal_rt_MAE:
    optimal_rt_MAE = RF_rt_MAE
    optimal_rt_r2 = RF_rt_r2
    optimal_rt_model = 'Random Forest' 
    
print(f'Optimal model is \033[1m{optimal_rt_model}\033[0m with MAE of \033[1m{optimal_rt_MAE}\033[0m and r2 of \033[1m{optimal_rt_r2}\033[0m.')

<a id='6'></a>
# [Overview of models](#100)  

In [None]:
# putting it together
models = ['Linear Regression', 'K nearest neighbors', 'Bayesian Ridge', 'Random Forest']
genres_r2 = [LR_gen_r2, KNN_gen_r2, BR_gen_r2, RF_gen_r2]
genres_MAE = [LR_gen_MAE, KNN_gen_MAE, BR_gen_MAE, RF_gen_MAE]
genre_metrics = pd.DataFrame(zip(models, genres_r2, genres_MAE), 
                            columns=['Genre Model', 'r2', 'MAE'])
genre_metrics.sort_values('MAE') 

In [None]:
titletypes_r2 = [LR_tt_r2, KNN_tt_r2, BR_tt_r2, RF_tt_r2]
titletypes_MAE = [LR_tt_MAE, KNN_tt_MAE, BR_tt_MAE, RF_tt_MAE]
titletypes_metrics = pd.DataFrame(zip(models, titletypes_r2, titletypes_MAE), 
                            columns=['Title Types Model', 'r2', 'MAE'])
titletypes_metrics.sort_values('MAE') 

In [None]:
runtime_r2 = [LR_rt_r2, KNN_rt_r2, BR_rt_r2, RF_rt_r2]
runtime_MAE = [LR_rt_MAE, KNN_rt_MAE, BR_rt_MAE, RF_rt_MAE]
runtime_metrics = pd.DataFrame(zip(models, runtime_r2, runtime_MAE), 
                            columns=['Runtime Model', 'r2', 'MAE'])
runtime_metrics.sort_values('MAE') 

<a id='7'></a>
## [Overall prediction](#100)

From the above models, Random Forest performed best for genres and runtime, and Linear Regression performed best for title types. I calculated a weight for each of these 3 models by dividing its r2 score by the sum of the three r2 scores. Using the weights, I calculated the weighted average of the three predicted ratings. The result is a better MAE than any of the individual models achieved. This is plausible because the combined prediction draws on all three features, although adding features does not guarantee a better score.
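
The weighting scheme described above can be sketched with plain Python. The helper names `r2_weights` and `weighted_prediction`, the r2 scores and the predicted ratings are all hypothetical and only show the arithmetic. The scheme assumes every r2 score is positive; r2 can be negative on a test set, in which case these weights would no longer be meaningful.

```python
def r2_weights(r2_scores):
    """Turn r2 scores into weights that sum to 1."""
    total = sum(r2_scores.values())
    if total <= 0:
        raise ValueError('the r2 scores must sum to a positive number')
    return {name: score / total for name, score in r2_scores.items()}


def weighted_prediction(predictions, weights):
    """Weighted average of the per-model predictions for one title."""
    return sum(weights[name] * predictions[name] for name in predictions)


# hypothetical r2 scores and predicted ratings for a single title
r2_scores = {'genre': 0.2, 'title_type': 0.1, 'runtime': 0.1}
predictions = {'genre': 7.0, 'title_type': 6.0, 'runtime': 6.4}

weights = r2_weights(r2_scores)
overall = weighted_prediction(predictions, weights)
print(weights)
print(overall)
```

With these numbers the genre model gets half of the weight, so the combined rating sits closer to its prediction than to the other two.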

In [None]:
# calculating weights of each model
gen_weights = optimal_gen_r2 / (optimal_gen_r2+optimal_tt_r2+optimal_rt_r2)
tt_weights = optimal_tt_r2 / (optimal_gen_r2+optimal_tt_r2+optimal_rt_r2)
rt_weights = optimal_rt_r2 / (optimal_gen_r2+optimal_tt_r2+optimal_rt_r2)

# running the best models
if optimal_gen_model == 'Linear Regression': 
    gen_regressor = LinearRegression().fit(X_gen_train, y_gen_train) 
if optimal_gen_model == 'K Nearest Neighbors' : 
    gen_regressor = KNeighborsRegressor().fit(X_gen_train, y_gen_train) 
if optimal_gen_model == 'Bayesian Ridge': 
    gen_regressor = BayesianRidge().fit(X_gen_train, y_gen_train) 
if optimal_gen_model == 'Random Forest': 
    gen_regressor = RandomForestRegressor().fit(X_gen_train, y_gen_train) 
y_gen_pred = gen_regressor.predict(X_gen_test)

if optimal_tt_model == 'Linear Regression': 
    tt_regressor = LinearRegression().fit(X_tt_train, y_tt_train) 
if optimal_tt_model == 'K Nearest Neighbors' : 
    tt_regressor = KNeighborsRegressor().fit(X_tt_train, y_tt_train) 
if optimal_tt_model == 'Bayesian Ridge': 
    tt_regressor = BayesianRidge().fit(X_tt_train, y_tt_train) 
if optimal_tt_model == 'Random Forest': 
    tt_regressor = RandomForestRegressor().fit(X_tt_train, y_tt_train) 
y_tt_pred = tt_regressor.predict(X_tt_test)

if optimal_rt_model == 'Linear Regression': 
    rt_regressor = LinearRegression().fit(X_rt_train, y_rt_train) 
if optimal_rt_model == 'K Nearest Neighbors' : 
    rt_regressor = KNeighborsRegressor().fit(X_rt_train, y_rt_train) 
if optimal_rt_model == 'Bayesian Ridge': 
    rt_regressor = BayesianRidge().fit(X_rt_train, y_rt_train) 
if optimal_rt_model == 'Random Forest': 
    rt_regressor = RandomForestRegressor().fit(X_rt_train, y_rt_train) 
y_rt_pred = rt_regressor.predict(X_rt_test)

In [None]:
# calculating the weighted average of each rating and summing them up. 
compare_df = pd.DataFrame({'Actuals': y_gen_test, 
                           'Weighted Genre': gen_weights*y_gen_pred,
                           'Weighted Title Type': tt_weights*y_tt_pred, 
                           'Weighted Run Time': rt_weights*y_rt_pred, 
                           'Overall': gen_weights*y_gen_pred+tt_weights*y_tt_pred+rt_weights*y_rt_pred})
compare_df.head() 

In [None]:
overall_MAE = sum(abs(compare_df['Actuals']-compare_df['Overall']))/len(compare_df)

print('\033[1mMAE score\033[0m')
# label each line with the model that was actually selected above
print(f'Genres ({optimal_gen_model}): {optimal_gen_MAE}')
print(f'Title type ({optimal_tt_model}): {optimal_tt_MAE}') 
print(f'Run time ({optimal_rt_model}): {optimal_rt_MAE}') 
print(f'Overall                       : {overall_MAE}') 

<a id='8'></a>
## [Optimization ](#100)

From the overall prediction, the overall MAE is 0.826. I will run a grid search on all 3 models to look for the best hyperparameters for each. Let's see if it further improves the MAE score.
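
To get a feel for the cost of an exhaustive grid search, the sketch below enumerates every parameter combination with `itertools.product` and counts the cross-validation fits. The grid copies the ranges used in the genre search cell of this notebook; the variable names `candidates`, `n_candidates` and `n_cv_fits` are hypothetical.

```python
import itertools

# same ranges as the genre grid in this notebook
gen_parameters = {'max_depth': range(1, 10),
                  'n_estimators': range(1, 10),
                  'max_leaf_nodes': range(2, 10)}
cv_folds = 5

# one dictionary per combination, in the order the keys are listed
candidates = [dict(zip(gen_parameters, values))
              for values in itertools.product(*gen_parameters.values())]

n_candidates = len(candidates)       # 9 * 9 * 8 combinations
n_cv_fits = n_candidates * cv_folds  # each combination is fitted once per fold

print(n_candidates, n_cv_fits)
```

The count grows multiplicatively with every added parameter, which is why the search cells are timed.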

##### Genre (Random Forest) 

In [None]:
%%time 

gen_parameters = {'max_depth': range(1,10),         
                  'n_estimators': range(1,10),
                  'max_leaf_nodes': range(2,10)}

RF_gen_gs_classifier = GridSearchCV(RF_gen_regressor,
                                    gen_parameters,
                                    scoring='r2', 
                                    cv=5)
RF_gen_gs_classifier.fit(X_gen_train, y_gen_train)
print(f'{RF_gen_gs_classifier.best_params_} gives the best r2 score at: {RF_gen_gs_classifier.best_score_}')

In [None]:
RF_optimized_gen_regressor = RandomForestRegressor(max_depth=7, 
                                                   max_leaf_nodes=9, 
                                                   n_estimators=4)
RF_optimized_gen_regressor.fit(X_gen_train, y_gen_train) 
RF_optimized_y_gen_pred = RF_optimized_gen_regressor.predict(X_gen_test)

RF_optimized_gen_MAE = mean_absolute_error(y_gen_test, RF_optimized_y_gen_pred)
RF_optimized_gen_r2 = r2_score(y_gen_test, RF_optimized_y_gen_pred)
print(f'MAE            = {RF_optimized_gen_MAE}')
print(f'r2             = {RF_optimized_gen_r2}')

print(f'Original model: {optimal_gen_model}, {optimal_gen_MAE}')

Gridsearch for Genre analysis **did not** improve the score. 

##### Title Type (Linear Regression) 

In [None]:
%%time
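# Note: n_jobs only sets the number of parallel jobs and does not change the fitted model,
# so fit_intercept is the only setting in this grid that can affect the score.
tt_parameters = {'n_jobs': range(1,10), 
                 'fit_intercept': [True, False]}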

LR_tt_gs_classifier = GridSearchCV(LR_tt_regressor,
                                    tt_parameters,
                                    scoring='neg_mean_absolute_error', 
                                    cv=5)
LR_tt_gs_classifier.fit(X_tt_train, y_tt_train)
# neg MAE is in rating points, not a percentage, so print it as a plain number
print("'{}' gives the best neg MAE score at: {:.4f}".format(LR_tt_gs_classifier.best_params_, LR_tt_gs_classifier.best_score_))

In [None]:
LR_optimized_tt_regressor = LinearRegression(n_jobs=1, fit_intercept=False)
LR_optimized_tt_regressor.fit(X_tt_train, y_tt_train) 
LR_optimized_y_tt_pred = LR_optimized_tt_regressor.predict(X_tt_test)

LR_optimized_tt_MAE = mean_absolute_error(y_tt_test, LR_optimized_y_tt_pred)
LR_optimized_tt_r2 = r2_score(y_tt_test, LR_optimized_y_tt_pred)
print(f'MAE            = {LR_optimized_tt_MAE}')
print(f'r2             = {LR_optimized_tt_r2}')

print(f'Original model: {optimal_tt_model}, {optimal_tt_MAE}')

Gridsearch for Title Type analysis **did not** improve the score.

##### Runtime (Random Forest) 

In [None]:
%%time
rt_parameters = {'max_depth': range(1,10),         
                  'n_estimators': range(1,10),
                  'max_leaf_nodes': range(2,10)}

RF_rt_gs_classifier = GridSearchCV(RF_rt_regressor,
                                    rt_parameters,
                                    scoring='neg_mean_absolute_error', 
                                    cv=5)
RF_rt_gs_classifier.fit(X_rt_train, y_rt_train)
print(f'{RF_rt_gs_classifier.best_params_} gives the best neg MAE score at: {RF_rt_gs_classifier.best_score_}')

In [None]:
RF_optimized_rt_regressor = RandomForestRegressor(max_depth=6, 
                                                  max_leaf_nodes=9, 
                                                  n_estimators=9)
RF_optimized_rt_regressor.fit(X_rt_train, y_rt_train) 
RF_optimized_y_rt_pred = RF_optimized_rt_regressor.predict(X_rt_test)

RF_optimized_rt_MAE = mean_absolute_error(y_rt_test, RF_optimized_y_rt_pred)
RF_optimized_rt_r2 = r2_score(y_rt_test, RF_optimized_y_rt_pred)
print(f'MAE            = {RF_optimized_rt_MAE}')
print(f'r2             = {RF_optimized_rt_r2}')

print(f'Original model: {optimal_rt_model}, {optimal_rt_MAE}')

Gridsearch for Run time analysis **did not** improve the score. 

**Conclusion** <br> 
None of the 3 grid searches gave better results, so I will use the original optimal models for the next steps. 

<a id='9'></a>
## [Predictions](#100) 

Based on the optimal models above, I have written a function that predicts the ratings of future shows from the 3 features (genres, title type and run time). 

I have also included 2 upcoming releases. Let's see how accurate this model is in a few months' time! 

##### Shows ratings prediction

In [None]:
# function to predict a show's rating 
# based on the genres, title type and run time being passed in
def rating_prediction(movie_genres, movie_titletype, movie_runtime):
    # accept a single genre passed as a string as well as a list of genres
    if isinstance(movie_genres, str):
        movie_genres = [movie_genres]

    # multi-hot encode the genres in the same order as unique_genres
    movie_genre_binary = [1 if g in movie_genres else 0 for g in unique_genres]

    # one-hot encode the title type; exact match, because `in` on a string
    # is a substring test (e.g. 'video' in 'videoGame' is True)
    movie_titletype_binary = [1 if tt == movie_titletype else 0 for tt in unique_titleType]

    movie_gen_pred = RF_gen_regressor.predict([movie_genre_binary])
    # tt_regressor is the optimal title type model fitted above (Linear Regression)
    movie_tt_pred = tt_regressor.predict([movie_titletype_binary])
    movie_rt_pred = RF_rt_regressor.predict([[movie_runtime]])

    predicted_rating = (gen_weights*movie_gen_pred + tt_weights*movie_tt_pred + rt_weights*movie_rt_pred)[0]
    return predicted_rating

In [None]:
# these are the different types to choose from
print(unique_genres) 
print(unique_titleType)

Note: A show can have multiple genres but only 1 title type. 
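
The difference between the two encodings can be sketched as follows. This is a minimal illustration with hypothetical category lists standing in for `unique_genres` and `unique_titleType`; the helper name `multi_hot` is also hypothetical. It also shows why a title type should be compared by exact match: `in` applied to a plain string is a substring test.

```python
def multi_hot(categories, selected):
    """Return a 0/1 list marking which categories appear in `selected` (a list)."""
    selected = set(selected)
    return [1 if c in selected else 0 for c in categories]


# Hypothetical category lists (stand-ins for unique_genres / unique_titleType)
genres = ['Action', 'Comedy', 'Music', 'Musical']
title_types = ['movie', 'video', 'videoGame']

genre_vector = multi_hot(genres, ['Comedy', 'Musical'])   # several 1s allowed
title_vector = multi_hot(title_types, ['videoGame'])      # exactly one 1

# Pitfall: `in` on a plain string is a substring test, so 'video' also matches
substring_vector = [1 if t in 'videoGame' else 0 for t in title_types]

print(genre_vector, title_vector, substring_vector)
```

Wrapping the single title type in a list keeps the membership test exact, so only one position is set.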

In [None]:
# Hitman's Wife's Bodyguard
movie_genres = ['Comedy', 'Action']
movie_titletype = 'movie' 
movie_runtime = 100
rating_prediction(movie_genres, movie_titletype, movie_runtime)

In [None]:
# Black widow
movie_genres = ['Adventure', 'Action']
movie_titletype = 'movie' 
movie_runtime = 133
rating_prediction(movie_genres, movie_titletype, movie_runtime)

In [None]:
# input your own show here:
movie_genres = ['Adventure', 'Comedy']
movie_titletype = 'movie' 
movie_runtime = 133
rating_prediction(movie_genres, movie_titletype, movie_runtime)

##### Finding the best show combination

In [None]:
# listing down all variables for consideration
print(top_10_genres)
print(top_tt) 

unique_runtime = [100, 120, 140, 160, 180, 200]
print(unique_runtime)

In [None]:
# listing down all possible combinations 
all_variables = [top_10_genres, top_tt, unique_runtime]
all_combi = list(itertools.product(*all_variables))
print(all_combi[:10])

In [None]:
# looping through all combinations to find the combination with the best rating 
best_rating = 0 
best_combi = []
for combi in all_combi: 
    predicted_rating = rating_prediction(combi[0], combi[1], combi[2]) 
    if predicted_rating > best_rating: 
        best_rating = predicted_rating
        best_combi = combi 
print(f'The best combination is a \033[1m{best_combi[0]}\033[0m \033[1m{best_combi[1]}\033[0m of \033[1m{best_combi[2]}\033[0mmins long with a rating of \033[1m{best_rating}\033[0m.')
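
The same search can be written with `max` and a key function, which avoids seeding `best_rating` with 0 and does not need the full list of combinations in memory. The sketch below is only an illustration: `toy_rating` is a hypothetical stand-in for `rating_prediction`, and its scores are made up.

```python
import itertools


def toy_rating(genre, title_type, runtime):
    """Hypothetical stand-in for rating_prediction; the numbers are made up."""
    genre_score = {'Drama': 7.0, 'Comedy': 6.5}[genre]
    type_bonus = {'movie': 0.0, 'tvSeries': 0.5}[title_type]
    # penalise run times that are far from 120 minutes
    return genre_score + type_bonus - abs(runtime - 120) / 100


genres = ['Drama', 'Comedy']
title_types = ['movie', 'tvSeries']
runtimes = [100, 120, 140]

# itertools.product yields combinations lazily; max keeps the highest-scoring one
best_combi = max(itertools.product(genres, title_types, runtimes),
                 key=lambda combi: toy_rating(*combi))
best_rating = toy_rating(*best_combi)
print(best_combi, best_rating)
```

With a real model the key function is called once per combination, so the cost is the same as the explicit loop.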

<a id='10'></a>
## [Interesting findings](#100) 

In [None]:
names_ratings.head() 

In [None]:
# names with at least 10,000 votes (rows with fewer than 10,000 are dropped)
names_ratings_10000votes = names_ratings.drop(names_ratings[names_ratings.numVotes<10000].index)
names_ratings_10000votes.head() 

In [None]:
plt.figure(figsize=[15,6])
plt.scatter(names_ratings_10000votes.numVotes, names_ratings_10000votes.averageRating, c='orange')
plt.xlabel('Average Votes', size=15) 
plt.ylabel('Ratings', size=15)
plt.title('Average votes VS Ratings', size=25)
plt.savefig('averagevotes_vs_ratings.png', dpi=300)
plt.show() 

In [None]:
# top names with most number of votes 
names_ratings_10000votes.sort_values('numVotes', ascending=False).head()

In [None]:
# Phyllis Carlyle
print(names_ratings[names_ratings.nconst=='nm0138287'])
print(titles[titles.tconst=='tt0114369'])

In [None]:
# Lawrence A. Bonney
print(names_ratings[names_ratings.nconst=='nm0095029'])
print(titles[titles.tconst=='tt0102926'])

In [None]:
# Jonas Rivera
print(names_ratings[names_ratings.nconst=='nm0729304'])
# print(titles[titles.tconst=='tt10484166'])
print(titles[titles.tconst=='tt1049413'])
# print(titles[titles.tconst=='tt10559884 '])
# print(titles[titles.tconst=='tt1702223'])

In [None]:
# top names with highest ratings 
names_ratings_10000votes.sort_values('averageRating', ascending=False).head()

<a id='11'></a>
## [Future opportunities](#100) 

There are many more areas of improvement for this model that can be explored in the future. I would like to include additional features, such as the crew involved and production cost, and see how much they affect ratings. I would also like to delve into deep learning models, ultimately to further improve the predictions and scores. 

**End notes** <br> 
This project has been a really fun start to my journey into ML. I am surprised by how much I can achieve after just 2 months of study. There is still a lot more to study and learn (I believe I have only scratched the surface), but it has altered my mindset towards coding, programming, ML and even AI. They are no longer mysterious and far-fetched to me. I look forward to learning and exploring more in this area! 