# A Statistical Approach to Predicting IMDB Scores

1. [Introduction](#introduction)<br>
    1.1 [Background](#background)<br>
    1.2 [Data Description](#datadescription)<br>
    1.3 [Problem Statement](#problemstatement)<br>
2. [Data Exploration](#dataexploration)<br>
    2.1 [Data Loading](#dataloading)<br>
    2.2 [Data Profile](#dataprofile)<br>
    2.3 [Data Cleaning](#datacleaning)<br>
3. [Regression Model Building](#rmb)<br>
    3.1 [Splitting the Dataset](#std)<br>
    3.2 [Scaling to avoid Euclidean Distance problem](#s)<br>
    3.3 [Feature Elimination](#fe)<br>
    3.4 [Simple Linear Regression](#slr)<br>
    3.5 [Support Vector Machines with Linear, Polynomial and RBF Kernels](#svmr)<br>
    3.6 [Ensemble Models](#em)<br>
     3.6.1 [Gradient Boosting with Hyperparameter Tuning](#gbr)<br>
     3.6.2 [Random Forest with Hyperparameter Tuning](#rbr)<br>
    3.7 [XGBoost with Hyperparameter Tuning](#xgbr)<br>
    3.8 [Interpreting Results of a Regression Model](#irr)<br>
4. [Building a Classification Model](#bc)<br>
    4.1 [Logistic Regression](#lr)<br>
    4.2 [Support Vector Machines with Linear, Polynomial and RBF Kernels](#svmc)<br>
    4.3 [Ensemble Models](#emc)<br>
     4.3.1 [Random Forest with Hyperparameter Tuning](#rfc)<br>
     4.3.2 [Gradient Boosting with Hyperparameter Tuning](#gbc)<br>
    4.4 [XGBoost with Hyperparameter Tuning](#xgbc)<br>
    4.5 [Interpreting Results of Classification Model](#ircm)
5. [Conclusion](#conclusion)<br>

<a id='introduction'></a>

# 1. Introduction

<a id='background'></a>

## 1.1 Background

A commercially successful movie not only entertains its audience but also earns film companies substantial profit. Many factors, such as good directors and experienced actors, contribute to making a good movie. However, famous directors and actors can usually bring the expected box-office income but cannot guarantee a high IMDB score.

<a id='datadescription'></a>

## 1.2 Data Description

The dataset comes from Kaggle. It contains 28 variables for 5043 movies, spanning 100 years and 66 countries. There are 2399 unique director names and thousands of actors and actresses. “imdb_score” is the response variable, while the other 27 variables are possible predictors.

|Variable Name |	Description|
| --- | --- |
|movie_title	 | Title of the Movie|
|duration	| Duration in minutes|
|director_name	| Name of the Director of the Movie|
|director_facebook_likes |	Number of likes of the Director on his Facebook Page|
|actor_1_name |	Primary actor starring in the movie|
|actor_1_facebook_likes |	Number of likes of the Actor_1 on his/her Facebook Page|
|actor_2_name |	Other actor starring in the movie|
|actor_2_facebook_likes	| Number of likes of the Actor_2 on his/her Facebook Page|
|actor_3_name |	Other actor starring in the movie|
|actor_3_facebook_likes |	Number of likes of the Actor_3 on his/her Facebook Page|
|num_user_for_reviews |	Number of users who gave a review|
|num_critic_for_reviews |	Number of critical reviews on imdb|
|num_voted_users | 	Number of people who voted for the movie|
|cast_total_facebook_likes |	Total number of facebook likes of the entire cast of the movie|
|movie_facebook_likes |	Number of Facebook likes in the movie page|
|plot_keywords |	Keywords describing the movie plot|
|facenumber_in_poster | Number of actors' faces featured in the movie poster|
|color |	Film colorization. ‘Black and White’ or ‘Color’|
|genres |	Film categorization like ‘Animation’, ‘Comedy’, ‘Romance’, ‘Horror’, ‘Sci-Fi’, ‘Action’, ‘Family’|
|title_year |	The year in which the movie is released (1916:2016)|
|language |	English, Arabic, Chinese, French, German, Danish, Italian, Japanese etc|
|country |	Country where the movie is produced|
|content_rating |	Content rating of the movie|
|aspect_ratio |	Aspect ratio the movie was made in|
|movie_imdb_link |	IMDB link of the movie|
|gross |	Gross earnings of the movie in Dollars|
|budget |	Budget of the movie in Dollars|
|imdb_score |	IMDB Score of the movie on IMDB|

<a id='problemstatement'></a>

## 1.3 Problem Statement

Given this wealth of movie information, it would be interesting to understand which factors make one movie more successful than another. So we would like to analyze what kinds of movies are more successful, in other words, which ones get a higher IMDB score.

In this notebook we are going to build two different kinds of models, Regression and Classification. For each kind we start from a basic model and move on to more advanced ones, describing why we choose the advanced one.

Under Regression we are going to fit a regression model to our data and predict the continuous target variable imdb_score.

Under Classification we are going to fit a classification model to our data and classify the imdb_score into three categories.

|imdb_score | Classify |
| --- | ---|
|1-3 | Flop Movie|
|3-6 | Average Movie|
|6-10 | Hit Movie|

<a id='dataexploration'></a>

# 2. Data Exploration

<a id='dataloading'></a>

## 2.1 Data Loading

In [None]:
#importing the libraries that we use
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling as pp

In [None]:
#importing the dataset
dataset = pd.read_csv('../input/imdb-5000-movie-dataset/movie_metadata.csv')
dataset.head()

In [None]:
dataset.shape

In [None]:
dataset.columns

<a id='dataprofile'></a>

## 2.2 Data Profile

In [None]:
dataset.profile_report()

In [None]:
dataset.drop_duplicates(inplace = True)
dataset.shape

<a id='datacleaning'></a>

## 2.3 Data Cleaning

Data cleaning is one of the most important parts of building a model. Here we apply the standard preprocessing steps to make sure our model is not fed bad data.

### 2.3.1 Missing Value Treatment

In [None]:
numerical_cols = [col for col in dataset.columns if dataset[col].dtype != 'object']
categorical_cols = [col for col in dataset.columns if dataset[col].dtype == 'object']

In [None]:
categorical_cols, numerical_cols

In [None]:
dataset[numerical_cols].describe()

In [None]:
dataset[categorical_cols].describe()

In [None]:
dataset.isnull().sum()

In [None]:
dataset.color.unique()

In [None]:
color_mode = dataset['color'].mode().iloc[0]
dataset.color.fillna(color_mode, inplace = True)
dataset.color.isnull().sum()
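
The same kind of imputation can also be written with plain assignment instead of calling `fillna(..., inplace=True)` on an attribute-accessed column, a pattern that newer pandas releases warn about. A minimal sketch on a toy frame (`toy` and its values are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the movie data (values are illustrative only)
toy = pd.DataFrame({
    "color": ["Color", None, "Color", " Black and White"],
    "duration": [120.0, np.nan, 95.0, 101.0],
})

# Categorical column: fill gaps with the most frequent value
color_mode = toy["color"].mode().iloc[0]
toy["color"] = toy["color"].fillna(color_mode)

# Numerical column: fill gaps with the median
duration_median = toy["duration"].median()
toy["duration"] = toy["duration"].fillna(duration_median)

print(toy.isnull().sum())
```

Assigning the result back to the column makes it explicit that the frame itself is being updated.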

In [None]:
dataset.director_name.nunique(), dataset.director_name.isnull().sum()

In [None]:
dataset = dataset.dropna(axis = 0, subset = ['director_name'] )

In [None]:
dataset.num_critic_for_reviews.min(), dataset.num_critic_for_reviews.max(), dataset.num_critic_for_reviews.median()

In [None]:
num_critic_for_reviews_median = dataset['num_critic_for_reviews'].median()
dataset.num_critic_for_reviews.fillna(num_critic_for_reviews_median, inplace = True)
dataset.num_critic_for_reviews.isnull().sum()

In [None]:
dataset.duration.min(), dataset.duration.max(), dataset.duration.median()

In [None]:
duration_median = dataset.duration.median()
dataset.duration.fillna(duration_median, inplace = True)
dataset.duration.isnull().sum()

In [None]:
dataset.director_facebook_likes.min(), dataset.director_facebook_likes.max(), dataset.director_facebook_likes.median(),dataset.director_facebook_likes.mean()

In [None]:
director_facebook_likes_mean = dataset.director_facebook_likes.mean()
dataset.director_facebook_likes.fillna(director_facebook_likes_mean, inplace = True)
dataset.director_facebook_likes.isnull().sum()

In [None]:
dataset.actor_3_facebook_likes.min(), dataset.actor_3_facebook_likes.max(), dataset.actor_3_facebook_likes.median(),dataset.actor_3_facebook_likes.mean()

In [None]:
actor_3_facebook_likes_mean = dataset.actor_3_facebook_likes.mean()
dataset.actor_3_facebook_likes.fillna(actor_3_facebook_likes_mean, inplace = True)
dataset.actor_3_facebook_likes.isnull().sum()

In [None]:
dataset = dataset.dropna(axis = 0, subset = ['actor_2_name'])
dataset.actor_2_name.isnull().sum()

In [None]:
dataset.actor_1_facebook_likes.min(), dataset.actor_1_facebook_likes.max(), dataset.actor_1_facebook_likes.median(),dataset.actor_1_facebook_likes.mean()

In [None]:
actor_1_facebook_likes_mean = dataset.actor_1_facebook_likes.mean()
dataset.actor_1_facebook_likes.fillna(actor_1_facebook_likes_mean, inplace = True)
dataset.actor_1_facebook_likes.isnull().sum()

In [None]:
dataset.gross.describe()

In [None]:
dataset.gross.isnull().sum()

In [None]:
dataset = dataset.dropna(axis = 0, subset = ['gross'])
dataset.gross.isnull().sum()

In [None]:
dataset.shape

In [None]:
dataset.isnull().sum()

In [None]:
dataset = dataset.dropna(axis = 0, subset = ['budget'])
dataset.budget.isnull().sum()

In [None]:
dataset.isnull().sum()

In [None]:
dataset.shape

In [None]:
dataset = dataset.dropna(axis = 0, subset = ['actor_3_name'])
dataset.actor_3_name.isnull().sum()

In [None]:
facenumber_in_poster_median = dataset.facenumber_in_poster.median()
dataset.facenumber_in_poster.fillna(facenumber_in_poster_median, inplace = True)
dataset.facenumber_in_poster.isnull().sum()

In [None]:
dataset.plot_keywords.unique()

In [None]:
dataset.language.unique()

In [None]:
dataset.language.value_counts()

In [None]:
language_mode = dataset.language.mode().iloc[0]
dataset.language.fillna(language_mode, inplace = True)
dataset.language.isnull().sum()

In [None]:
dataset = dataset.dropna(axis = 0, subset = ['plot_keywords'])
dataset.plot_keywords.isnull().sum()

In [None]:
dataset.content_rating.unique()

In [None]:
dataset.content_rating.fillna('Not Rated', inplace = True)

In [None]:
dataset.aspect_ratio.unique()

In [None]:
aspect_ratio_mode = dataset.aspect_ratio.mode().iloc[0]
dataset.aspect_ratio.fillna(aspect_ratio_mode, inplace = True)                                                    

In [None]:
dataset.isnull().sum()

In [None]:
dataset.reset_index(inplace = True, drop = True)

### 2.3.2 Profile Report after missing value treatment 

In [None]:
dataset.profile_report()

In dealing with the null data we have lost about 25% of the given data. Next, let's convert the data into numerical form to feed our model.

### 2.3.3 Converting Categoricals to Numericals to feed our model

In [None]:
numerical_cols, categorical_cols

Let us deal with the categorical_cols first by converting them into numericals.

In [None]:
dataset.color.unique(), dataset.color.nunique()

As we see, the color variable has only 2 distinct categories. We can simply map 'Color' to 1 and 'Black and White' to 0 (note that the raw value ' Black and White' carries a leading space).

In [None]:
dataset['color'] = dataset.color.map({'Color' : 1 , ' Black and White' : 0})

In [None]:
dataset.director_name.unique(), dataset.director_name.nunique()

In [None]:
director_name_value_counts = dataset.director_name.value_counts()

In [None]:
director_name_value_counts  = pd.DataFrame(director_name_value_counts).reset_index().rename(columns = {'index': 'director_name', 'director_name':'director_name_value_counts'})

In [None]:
dataset = pd.merge(dataset, director_name_value_counts,left_on = 'director_name', right_on = 'director_name', how = 'left')

In [None]:
dataset = dataset.drop(columns = 'director_name')
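
The cells above (count, merge, drop) amount to frequency encoding: each director name is replaced by the number of times it occurs. As a sketch of an alternative that does not depend on the column names produced by `reset_index()` (these differ between pandas versions), the counts can be mapped straight onto the column. The toy frame below is hypothetical:

```python
import pandas as pd

# Toy frame with made-up director names and scores
toy = pd.DataFrame({
    "director_name": ["A", "B", "A", "C", "A"],
    "imdb_score": [7.1, 6.3, 8.0, 5.5, 6.9],
})

# Count how often each name occurs, then look up the count for every row
counts = toy["director_name"].value_counts()
toy["director_name_value_counts"] = toy["director_name"].map(counts)

# The original high-cardinality column is no longer needed
toy = toy.drop(columns="director_name")
print(toy)
```

The same pattern would apply to the actor, genre and plot-keyword columns handled later in this section.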

In [None]:
dataset.actor_2_name.unique(), dataset.actor_2_name.nunique()

In [None]:
actor_2_name_value_counts = dataset.actor_2_name.value_counts()

In [None]:
actor_2_name_value_counts  = pd.DataFrame(actor_2_name_value_counts).reset_index().rename(columns = {'index': 'actor_2_name', 'actor_2_name':'actor_2_name_value_counts'})

In [None]:
dataset = pd.merge(dataset, actor_2_name_value_counts,left_on = 'actor_2_name', right_on = 'actor_2_name', how = 'left')

In [None]:
dataset = dataset.drop(columns = 'actor_2_name')

In [None]:
dataset.genres.unique(), dataset.genres.nunique()

The genres column has a huge number of unique values. Let us derive a second feature from it, main_genre, and keep both main_genre and genres.

In [None]:
dataset['main_genre'] = dataset.genres.str.split('|').str[0]

In [None]:
dataset.main_genre.unique(), dataset.main_genre.nunique()

Let's convert both columns, main_genre and genres, into numericals.

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
dataset['main_genre'] = le.fit_transform(dataset.main_genre)
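
For reference, `LabelEncoder` assigns integer codes following the sorted order of the labels, so the resulting numbers carry an ordering that has no real meaning for genres. A small illustration with made-up genre values:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up genre values, not taken from the dataset
genres = ["Drama", "Action", "Comedy", "Action"]

encoder = LabelEncoder()
codes = encoder.fit_transform(genres)

print(list(encoder.classes_))                  # labels in sorted order
print(list(codes))                             # integer code for each input value
print(list(encoder.inverse_transform(codes)))  # codes mapped back to labels
```

Tree-based models are usually tolerant of such arbitrary codes, while linear models and SVMs may read them as magnitudes; one-hot encoding is a common alternative in that case.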

In [None]:
genres_value_counts = dataset.genres.value_counts()

In [None]:
genres_value_counts  = pd.DataFrame(genres_value_counts).reset_index().rename(columns = {'index' : 'genres', 'genres' : 'genres_value_counts'})

In [None]:
dataset = pd.merge(dataset, genres_value_counts,left_on = 'genres', right_on = 'genres', how = 'left')

In [None]:
dataset = dataset.drop(columns = 'genres')

In [None]:
dataset.actor_1_name.unique(), dataset.actor_1_name.nunique()

The variable actor_1_name also has high cardinality, hence we replace it with its value counts.

In [None]:
actor_1_name_value_counts = dataset.actor_1_name.value_counts()

In [None]:
actor_1_name_value_counts = pd.DataFrame(actor_1_name_value_counts).reset_index().rename(columns = {'index' : 'actor_1_name', 'actor_1_name' : 'actor_1_name_value_counts'})

In [None]:
dataset = pd.merge(dataset, actor_1_name_value_counts,left_on = 'actor_1_name', right_on = 'actor_1_name', how = 'left')

In [None]:
dataset = dataset.drop(columns = 'actor_1_name')

In [None]:
dataset.movie_title.unique(), dataset.movie_title.nunique()

As we see, out of 3816 records we have 3749 unique titles, which is not helpful for making predictions. So we drop the column from our dataframe.

In [None]:
dataset = dataset.drop(columns = 'movie_title')

In [None]:
dataset.actor_3_name.unique(), dataset.actor_3_name.nunique()

This variable also has high cardinality, so we replace it with its value counts.

In [None]:
actor_3_name_value_counts = dataset.actor_3_name.value_counts()

In [None]:
actor_3_name_value_counts = pd.DataFrame(actor_3_name_value_counts).reset_index().rename(columns = {'index' : 'actor_3_name', 'actor_3_name' : 'actor_3_name_value_counts'})

In [None]:
dataset= pd.merge(dataset, actor_3_name_value_counts,left_on = 'actor_3_name', right_on = 'actor_3_name', how = 'left')

In [None]:
dataset = dataset.drop(columns = 'actor_3_name')

In [None]:
dataset.plot_keywords.unique(), dataset.plot_keywords.nunique()

Looking into the variable, we can see that it has very high cardinality, which makes it unstable as a feature. Rather than keep it as it is, we extract the main (first) plot keyword from each entry and drop the original column.

In [None]:
dataset['main_plot_keyword'] = dataset.plot_keywords.str.split('|').str[0]

In [None]:
dataset = dataset.drop(columns = 'plot_keywords')

In [None]:
dataset.main_plot_keyword.unique(), dataset.main_plot_keyword.nunique()

As we see, the extracted main plot keyword also has high cardinality but is more stable. We can replace it with its value counts.

In [None]:
main_plot_keyword_value_counts = dataset.main_plot_keyword.value_counts()

In [None]:
main_plot_keyword_value_counts = pd.DataFrame(main_plot_keyword_value_counts).reset_index().rename(columns = {'index' : 'main_plot_keyword', 'main_plot_keyword' : 'main_plot_keyword_value_counts'})

In [None]:
dataset = pd.merge(dataset, main_plot_keyword_value_counts, left_on = 'main_plot_keyword', right_on = 'main_plot_keyword', how = 'left')

In [None]:
dataset = dataset.drop(columns = 'main_plot_keyword')

In [None]:
dataset.movie_imdb_link.unique(), dataset.movie_imdb_link.nunique()

The variable movie_imdb_link is unique for every record. Since it will not help in predicting the target variable, we drop it.

In [None]:
dataset = dataset.drop(columns = 'movie_imdb_link')

In [None]:
dataset.language.unique(), dataset.language.nunique()

Language variable has only 38 unique values and is consistent. So, we just do label encoding.

In [None]:
from sklearn.preprocessing import LabelEncoder
le1 = LabelEncoder()
dataset['language'] = le1.fit_transform(dataset.language)

In [None]:
dataset.country.unique(), dataset.country.nunique()

Country variable has only 47 unique values and is consistent. So, we just do label encoding.

In [None]:
from sklearn.preprocessing import LabelEncoder
le2 = LabelEncoder()
dataset['country'] = le2.fit_transform(dataset.country)

In [None]:
dataset.content_rating.unique(),dataset.content_rating.nunique()

Content rating has only 12 unique values, so we can label encode it.

In [None]:
from sklearn.preprocessing import LabelEncoder
le3 = LabelEncoder()
dataset['content_rating'] = le3.fit_transform(dataset.content_rating)

In [None]:
dataset.head().T

### 2.3.4 Profile Report after data cleaning 

In [None]:
dataset.profile_report()

Looking at the profile report, we now have warnings about skewness and zeros. We address these by scaling the features after splitting the dataset. The unwanted variables will also be removed during feature elimination.

<a id='rmb'></a>

# 3. Regression Model Building

In [None]:
datasetR = dataset.copy() #lets keep our original dataset for reference. Here datasetR is for Regression model
datasetC = dataset.copy() #Here datasetC is for classification model

<a id='std'></a>

## 3.1 Splitting the Dataset

In [None]:
from sklearn.model_selection import train_test_split
y = datasetR.pop('imdb_score')
X = datasetR
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, test_size = 0.2, random_state = 42)

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

<a id='s'></a>

## 3.2 Scaling to avoid Euclidean Distance Problem

We scale after we split the dataset because we do not want statistics from the test set to leak into the scaler fitted on the training set.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train.values), columns=X_train.columns, index=X_train.index)

In [None]:
X_test = pd.DataFrame(scaler.transform(X_test.values), columns = X_train.columns, index = X_test.index)
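
To illustrate what fitting the scaler on the training set only means in practice: the test set is transformed with the training minimum and maximum, so test values can fall outside [0, 1]. A toy sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[5.0], [20.0]])

toy_scaler = MinMaxScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns min=0 and max=10 from train
test_scaled = toy_scaler.transform(test)        # reuses the train min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The test value 20 lies beyond the training maximum, so it is mapped above 1; this is expected and is the price of keeping test-set information out of the scaler.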

Now to building our model. We have a large number of features, of which only some will be useful, so let's do some feature selection for our Regression model.

In [None]:
X_train.shape

<a id='fe'></a>

## 3.3 Feature Elimination 

We don't want to feed our model variables that might not help in prediction. We remove variables with high collinearity and then keep only the variables useful for our model by doing Recursive Feature Elimination.

In [None]:
#removing variables with high collinearity
def correlation(dataset, threshold):
    col_corr = set() # Set of all the names of deleted columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i] # getting the name of column
                col_corr.add(colname)
                if colname in dataset.columns:
                    del dataset[colname] # deleting the column from the dataset
correlation(X_train,0.90)
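
The function above compares signed correlations, so a strongly negative pair (for example -0.95) is not flagged. Below is a sketch of a variant that uses absolute correlations and returns the column names instead of deleting in place. `high_corr_columns` is a hypothetical helper, not part of the notebook, and it flags every column that is highly correlated with any earlier column, which differs slightly from the loop above:

```python
import numpy as np
import pandas as pd

def high_corr_columns(frame, threshold):
    """Return columns whose absolute correlation with an earlier column is >= threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] >= threshold).any()]

# Made-up data for illustration
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # perfectly correlated with a
    "c": [5, 4, 3, 2, 1],    # perfectly anti-correlated with a
    "d": [3, 1, 4, 1, 5],    # only weakly correlated with the others
})
to_drop = high_corr_columns(toy, 0.90)
print(to_drop)
```

Returning the names rather than mutating the frame makes it easy to drop the same columns from both the training and the test set.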

In [None]:
X_train.shape

In [None]:
#importing the required libraries
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [None]:
# Running RFE with the output number of the variable equal to 15
lm = LinearRegression()
lm.fit(X_train, y_train)

rfe = RFE(lm, n_features_to_select=15)            # running RFE
rfe = rfe.fit(X_train, y_train)

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
col_rfe = X_train.columns[rfe.support_]
col_rfe

In [None]:
X_train.columns[~rfe.support_]

In [None]:
#Creating an X_train dataframe with the RFE-selected variables
X_train_rfe = X_train[col_rfe]

<a id='slr'></a>

## 3.4 Simple Linear Regression

In [None]:
# Adding a constant variable for using the stats model
import statsmodels.api as sm
X_train_rfe_constant = sm.add_constant(X_train_rfe)

In [None]:
lm = sm.OLS(y_train,X_train_rfe_constant).fit()   # Running the linear model

In [None]:
#Let's see the summary of our linear model
print(lm.summary())

In [None]:
X_test_rfe = X_test[col_rfe]
X_test_rfe_constant = sm.add_constant(X_test_rfe)

In [None]:
y_pred_linear = lm.predict(X_test_rfe_constant)

In [None]:
y_pred_linear.values

In [None]:
y_pred_linear.min(), y_pred_linear.max()

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
mean_squared_error(y_pred_linear, y_test)
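
A note on argument order: `mean_squared_error` is symmetric in its two arguments, so passing the predictions first gives the same value, but `r2_score` is not symmetric and expects the true values first. A tiny worked example with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true values and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

mse = mean_squared_error(y_true, y_hat)          # (1 + 0 + 4) / 3
mse_swapped = mean_squared_error(y_hat, y_true)  # same value as mse
rmse = np.sqrt(mse)                              # error in the units of the score
r2 = r2_score(y_true, y_hat)                     # 1 - 5/8
r2_swapped = r2_score(y_hat, y_true)             # differs from r2

print(mse, rmse, r2, r2_swapped)
```

The root of the MSE is often easier to read, since it is expressed on the same 1 to 10 scale as imdb_score.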

Looking at the stats, we observe that the R² score is low, about 0.37, even with all the retained variables, so the regression line is not fitting the data well. We therefore move on to more flexible, non-linear models, such as support vector machines and ensemble algorithms, to fit the data better.

<a id='svmr'></a>

## 3.5 Support Vector Machines with Linear, Polynomial, RBF Kernels

In [None]:
from sklearn.svm import SVR
svr_rbf = SVR(kernel='rbf', gamma=0.1)
svr_lin = SVR(kernel='linear', gamma='auto')
svr_poly = SVR(kernel='poly', gamma='auto', degree=3)

In [None]:
svr_rbf.fit(X_train_rfe, y_train)
y_pred_svm_rbf = svr_rbf.predict(X_test_rfe)

In [None]:
y_pred_svm_rbf

In [None]:
y_pred_svm_rbf.min(), y_pred_svm_rbf.max()

In [None]:
mean_squared_error(y_pred_svm_rbf, y_test)

In [None]:
svr_lin.fit(X_train_rfe, y_train)
y_pred_svm_lin = svr_lin.predict(X_test_rfe)

In [None]:
y_pred_svm_lin

In [None]:
y_pred_svm_lin.min(), y_pred_svm_lin.max()

In [None]:
mean_squared_error(y_pred_svm_lin, y_test)

In [None]:
svr_poly.fit(X_train_rfe, y_train)
y_pred_svm_poly = svr_poly.predict(X_test_rfe)

In [None]:
y_pred_svm_poly

In [None]:
y_pred_svm_poly.min(), y_pred_svm_poly.max()

In [None]:
mean_squared_error(y_pred_svm_poly, y_test)

<a id='em'></a>

## 3.6 Ensemble Models

<a id='gbr'></a>

### 3.6.1 Gradient Boosting with Hyper Parameter Tuning

In [None]:
from sklearn import ensemble
n_trees=200
gradientboost = ensemble.GradientBoostingRegressor(learning_rate=0.03, n_estimators=n_trees, max_depth=4)  # default loss is least squares
gradientboost.fit(X_train_rfe,y_train)

In [None]:
y_pred_gb=gradientboost.predict(X_test_rfe)
error = mean_squared_error(y_test, y_pred_gb)  # mean squared error on the test set
print("MSE:%.3f" % error)

In [None]:
mean_squared_error(y_pred_gb, y_test)

In [None]:
y_pred_gb.min(), y_pred_gb.max()

In [None]:
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search 
param_grid = {
    'max_depth' : [3, 4, 5],
    'learning_rate' : [0.01, 0.001],
    'n_estimators': [100, 200, 500]
}
# Create a base model (the default loss is least squares)
gb = ensemble.GradientBoostingRegressor()
# Instantiate the grid search model
grid_search_gb = GridSearchCV(estimator = gb, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)

In [None]:
grid_search_gb.fit(X_train_rfe, y_train)
grid_search_gb.best_params_
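
By default `GridSearchCV` ranks candidates with the estimator's own `score` method, which for regressors is R². Since this notebook compares models by mean squared error, one option is to pass `scoring='neg_mean_squared_error'`. A sketch on synthetic data (the data and the small grid are assumptions chosen so that the example runs quickly):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, used only for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

demo_grid = {"max_depth": [2, 3], "n_estimators": [20, 50]}
demo_search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=0),
    param_grid=demo_grid,
    scoring="neg_mean_squared_error",  # higher (closer to 0) is better
    cv=3,
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(-demo_search.best_score_)  # cross-validated MSE of the best candidate
```

The score is negated because scikit-learn always maximizes the scoring function during the search.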

In [None]:
grid_search_gb_pred = grid_search_gb.predict(X_test_rfe)

In [None]:
mean_squared_error(y_test.values, grid_search_gb_pred)

<a id='rbr'></a>

### 3.6.2 Random Forest with Hyper Parameter Tuning 

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor(n_estimators = 500)
rf_regressor.fit(X_train_rfe, y_train)
rf_pred = rf_regressor.predict(X_test_rfe)

In [None]:
mean_squared_error(rf_pred, y_test)

Let's tune the hyperparameters of the RandomForestRegressor to find the best parameters for the model.

In [None]:
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'max_depth': [90, 100],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4],
    'min_samples_split': [8, 10],
    'n_estimators': [100, 500, 1000]
}
# Create a base model
rf = RandomForestRegressor()
# Instantiate the grid search model
grid_search_rf = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)

In [None]:
grid_search_rf.fit(X_train_rfe, y_train)
grid_search_rf.best_params_

In [None]:
y_grid_pred_rf = grid_search_rf.predict(X_test_rfe)

In [None]:
mean_squared_error(y_grid_pred_rf, y_test.values)

<a id='xgbr'></a>

## 3.7 XGBoost with Hyperparameter tuning

In [None]:
import xgboost as xgb
xg_model = xgb.XGBRegressor(n_estimators = 500)
xg_model.fit(X_train_rfe, y_train)

In [None]:
results = xg_model.predict(X_test_rfe)

In [None]:
mean_squared_error(results, y_test.values)

In [None]:
xg_model.score(X_train_rfe, y_train)

In [None]:
from sklearn.metrics import r2_score
r2_score(y_test, results)

In [None]:
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search 
param_grid = {
    'max_depth': [3, 4],
    'learning_rate' : [0.1, 0.01, 0.05],
    'n_estimators' : [100, 500, 1000]
}
# Create a base model
model_xgb= xgb.XGBRegressor()
# Instantiate the grid search model
grid_search_xgb = GridSearchCV(estimator = model_xgb, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)

In [None]:
grid_search_xgb.fit(X_train_rfe, y_train)
grid_search_xgb.best_params_

In [None]:
y_pred_xgb = grid_search_xgb.predict(X_test_rfe)

In [None]:
mean_squared_error(y_test.values, y_pred_xgb)

<a id='irr'></a>

## 3.8 Interpreting Results of Regression Model

We take XGBoost as the final model, as it has the lowest error.

In [None]:
feature_importance = grid_search_xgb.best_estimator_.feature_importances_
sorted_importance = np.argsort(feature_importance)
pos = np.arange(len(sorted_importance))
plt.figure(figsize=(12,5))
plt.barh(pos, feature_importance[sorted_importance],align='center')
plt.yticks(pos, X_train_rfe.columns[sorted_importance],fontsize=15)
plt.title('Feature Importance ',fontsize=18)
plt.show()

After looking at all the metrics, we have seen that XGBRegressor with the parameters {'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 500} has given the best results, with a mean squared error of 0.404. The feature importance given by this model is shown above.

<a id='bc'></a>

# 4. Building a Classification Model

In [None]:
datasetC.head()

To build a classification model I would like to reuse the preprocessed data from the Regression Model.
However, I am going to replace the target variable and create a new target variable for our classification model.

|imdb_score | Classify |
| --- | ---|
|1-3 | Flop Movie|
|3-6 | Average Movie|
|6-10 | Hit Movie|

In [None]:
y_train_classification = y_train.copy()

In [None]:
y_train_classification = pd.cut(y_train_classification, bins=[1, 3, 6, float('Inf')], labels=['Flop Movie', 'Average Movie', 'Hit Movie'])

In [None]:
y_test_classification = y_test.copy()

In [None]:
y_test_classification = pd.cut(y_test_classification, bins=[1, 3, 6, float('Inf')], labels=['Flop Movie', 'Average Movie', 'Hit Movie'])
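
One detail of `pd.cut` worth knowing: bins are right-inclusive by default, so with these edges a score of exactly 3.0 is a 'Flop Movie', 6.0 is an 'Average Movie', and a score of exactly 1.0 would fall outside the first bin and become NaN. A small check with made-up scores:

```python
import pandas as pd

# Made-up scores chosen to sit on and around the bin edges
scores = pd.Series([1.0, 3.0, 3.1, 6.0, 6.1, 9.3])
classes = pd.cut(
    scores,
    bins=[1, 3, 6, float("inf")],
    labels=["Flop Movie", "Average Movie", "Hit Movie"],
)
print(pd.DataFrame({"imdb_score": scores, "class": classes}))
```

Passing `include_lowest=True` would make the first bin include its left edge, should a score of exactly 1.0 ever occur.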

We have created the target variable and now we will reuse the independent variables from the Regression Model.

In [None]:
X_train_rfe_classification = X_train_rfe.copy()
X_test_rfe_classification = X_test_rfe.copy()

<a id='lr'></a>

## 4.1 Logistic Regression

Logistic Regression is a linear algorithm that is at heart a binary classifier. To use it for multiclass classification, we choose a solver that supports the multinomial case, here 'saga'. Other solvers can also handle multiclass classification; I used saga as it also supports L2 regularisation.

In [None]:
from sklearn.linear_model import LogisticRegression
logit_model = LogisticRegression(solver = 'saga', random_state = 0)
logit_model.fit(X_train_rfe_classification, y_train_classification)

In [None]:
y_logit_pred = logit_model.predict(X_test_rfe_classification)

In [None]:
y_logit_pred

In [None]:
from sklearn import metrics
count_misclassified = (y_test_classification != y_logit_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(y_test_classification, y_logit_pred)
print('Accuracy: {:.2f}'.format(accuracy))
precision = metrics.precision_score(y_test_classification, y_logit_pred, average= 'macro')
print('Precision: {:.2f}'.format(precision))
recall = metrics.recall_score(y_test_classification, y_logit_pred, average= 'macro')
print('Recall: {:.2f}'.format(recall))
f1_score = metrics.f1_score(y_test_classification, y_logit_pred, average = 'macro')
print('F1 score: {:.2f}'.format(f1_score))
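
The same evaluation block is repeated for every classifier in this section. As a sketch, it could be wrapped in a small helper; `summarize_classification` is a hypothetical name, not part of the notebook or of scikit-learn:

```python
import numpy as np
from sklearn import metrics

def summarize_classification(y_true, y_pred):
    """Return the macro-averaged metrics used in this notebook as a dict."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return {
        "misclassified": int((y_true != y_pred).sum()),
        "accuracy": metrics.accuracy_score(y_true, y_pred),
        "precision": metrics.precision_score(y_true, y_pred, average="macro"),
        "recall": metrics.recall_score(y_true, y_pred, average="macro"),
        "f1": metrics.f1_score(y_true, y_pred, average="macro"),
    }

# Made-up labels for illustration
demo_true = ["Hit Movie", "Hit Movie", "Average Movie", "Flop Movie"]
demo_pred = ["Hit Movie", "Average Movie", "Average Movie", "Flop Movie"]
summary = summarize_classification(demo_true, demo_pred)
for name, value in summary.items():
    print(name, round(float(value), 2))
```

Macro averaging gives every class the same weight regardless of how many movies it contains, which matters when one class is much larger than the others.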

<a id='svmc'></a>

## 4.2 Support Vector Classifier with Linear, Polynomial and RBF Kernels

Support Vector Classifier is also at heart a binary classifier. scikit-learn's SVC handles multiclass problems with a one-vs-one scheme internally. Setting decision_function_shape to 'ovo' returns the original one-vs-one decision function of libsvm, which has shape (n_samples, n_classes * (n_classes - 1) / 2).
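
As an illustration of how `decision_function_shape` affects the output, the sketch below fits `SVC` with 'ovo' and with 'ovr' on synthetic four-class data (an assumption for illustration; the notebook's problem has three classes, for which both shapes happen to have three columns):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic data with four classes, used only for illustration
X_demo, y_demo = make_blobs(n_samples=80, centers=4, random_state=0)

svc_ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X_demo, y_demo)
svc_ovr = SVC(kernel="linear", decision_function_shape="ovr").fit(X_demo, y_demo)

# 4 classes -> 4 * 3 / 2 = 6 pairwise columns for 'ovo', 4 columns for 'ovr'
print(svc_ovo.decision_function(X_demo).shape)
print(svc_ovr.decision_function(X_demo).shape)

# The predicted labels do not depend on this setting
same_labels = np.array_equal(svc_ovo.predict(X_demo), svc_ovr.predict(X_demo))
print(same_labels)
```

In other words, the setting changes how the decision values are reported, not how the classifier is trained.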

In [None]:
from sklearn.svm import SVC
svc_linear_model = SVC(kernel='linear', C=100, gamma= 'scale', decision_function_shape='ovo', random_state = 42)

In [None]:
svc_linear_model.fit(X_train_rfe_classification, y_train_classification)
y_svc_linear_pred = svc_linear_model.predict(X_test_rfe_classification)

In [None]:
y_svc_linear_pred

In [None]:
from sklearn import metrics
count_misclassified = (y_test_classification != y_svc_linear_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(y_test_classification, y_svc_linear_pred)
print('Accuracy: {:.2f}'.format(accuracy))
precision = metrics.precision_score(y_test_classification, y_svc_linear_pred, average= 'macro')
print('Precision: {:.2f}'.format(precision))
recall = metrics.recall_score(y_test_classification, y_svc_linear_pred, average= 'macro')
print('Recall: {:.2f}'.format(recall))
f1_score = metrics.f1_score(y_test_classification, y_svc_linear_pred, average = 'macro')
print('F1 score: {:.2f}'.format(f1_score))

In [None]:
from sklearn.svm import SVC
svc_poly_model = SVC(kernel='poly', C=100, gamma= 'scale', degree = 3, decision_function_shape='ovo', random_state = 42)

In [None]:
svc_poly_model.fit(X_train_rfe_classification, y_train_classification)
y_svc_poly_pred = svc_poly_model.predict(X_test_rfe_classification)

In [None]:
y_svc_poly_pred

In [None]:
from sklearn import metrics
count_misclassified = (y_test_classification != y_svc_poly_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(y_test_classification, y_svc_poly_pred)
print('Accuracy: {:.2f}'.format(accuracy))
precision = metrics.precision_score(y_test_classification, y_svc_poly_pred, average= 'macro')
print('Precision: {:.2f}'.format(precision))
recall = metrics.recall_score(y_test_classification, y_svc_poly_pred, average= 'macro')
print('Recall: {:.2f}'.format(recall))
f1_score = metrics.f1_score(y_test_classification, y_svc_poly_pred, average = 'macro')
print('F1 score: {:.2f}'.format(f1_score))

In [None]:
from sklearn.svm import SVC
svc_rbf_model = SVC(kernel='rbf', C=100, gamma= 'scale', decision_function_shape='ovo', random_state = 42)

In [None]:
svc_rbf_model.fit(X_train_rfe_classification, y_train_classification)
y_svc_rbf_pred = svc_rbf_model.predict(X_test_rfe_classification)

In [None]:
y_svc_rbf_pred

In [None]:
from sklearn import metrics
count_misclassified = (y_test_classification != y_svc_rbf_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(y_test_classification, y_svc_rbf_pred)
print('Accuracy: {:.2f}'.format(accuracy))
precision = metrics.precision_score(y_test_classification, y_svc_rbf_pred, average= 'macro')
print('Precision: {:.2f}'.format(precision))
recall = metrics.recall_score(y_test_classification, y_svc_rbf_pred, average= 'macro')
print('Recall: {:.2f}'.format(recall))
f1_score = metrics.f1_score(y_test_classification, y_svc_rbf_pred, average = 'macro')
print('F1 score: {:.2f}'.format(f1_score))

<a id='emc'></a>

## 4.3 Ensemble Models

<a id='rfc'></a>

### 4.3.1 Random Forest Classifier with Hyper Parameter tuning 

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'max_depth': [90, 100],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4],
    'min_samples_split': [8, 10],
    'n_estimators': [100, 500, 1000],
    'random_state' :[0]
}
# Create a base model
rf_model_classification = RandomForestClassifier()
# Instantiate the grid search model
grid_search_rf_model_classificaiton = GridSearchCV(estimator = rf_model_classification, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)

In [None]:
grid_search_rf_model_classificaiton.fit(X_train_rfe_classification, y_train_classification)

In [None]:
y_rf_classification_pred = grid_search_rf_model_classificaiton.predict(X_test_rfe_classification)

In [None]:
y_rf_classification_pred

In [None]:
from sklearn import metrics
count_misclassified = (y_test_classification != y_rf_classification_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(y_test_classification, y_rf_classification_pred)
print('Accuracy: {:.2f}'.format(accuracy))
precision = metrics.precision_score(y_test_classification, y_rf_classification_pred, average= 'macro')
print('Precision: {:.2f}'.format(precision))
recall = metrics.recall_score(y_test_classification, y_rf_classification_pred, average= 'macro')
print('Recall: {:.2f}'.format(recall))
f1_score = metrics.f1_score(y_test_classification, y_rf_classification_pred, average = 'macro')
print('F1 score: {:.2f}'.format(f1_score))

<a id='gbc'></a>

### 4.3.2 Gradient Boosting Classifier with Hyperparameter Tuning

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search 
param_grid = {
    'max_depth': [10, 50, 90],
    'max_features': [3],
    'min_samples_leaf': [3],
    'min_samples_split': [8, 10],
    'n_estimators': [100, 500],
    'learning_rate' : [0.1, 0.2],
    'random_state' : [0]
}
# Create a base model
gbc_model_classification = GradientBoostingClassifier()
# Instantiate the grid search model
grid_search_gbc_model_classificaiton = GridSearchCV(estimator = gbc_model_classification, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)

In [None]:
grid_search_gbc_model_classificaiton.fit(X_train_rfe_classification, y_train_classification)

In [None]:
y_gbc_model_pred = grid_search_gbc_model_classificaiton.predict(X_test_rfe_classification)

In [None]:
y_gbc_model_pred

In [None]:
from sklearn import metrics
count_misclassified = (y_test_classification != y_gbc_model_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(y_test_classification, y_gbc_model_pred)
print('Accuracy: {:.2f}'.format(accuracy))
precision = metrics.precision_score(y_test_classification, y_gbc_model_pred, average= 'macro')
print('Precision: {:.2f}'.format(precision))
recall = metrics.recall_score(y_test_classification, y_gbc_model_pred, average= 'macro')
print('Recall: {:.2f}'.format(recall))
f1_score = metrics.f1_score(y_test_classification, y_gbc_model_pred, average = 'macro')
print('F1 score: {:.2f}'.format(f1_score))

<a id='xgbc'></a>

## 4.4 XGBoost Classifier with Hyperparameter Tuning

In [None]:
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {
     'objective' : ['multi:softmax', 'multi:softprob'],
     'n_estimators': [100, 500, 1000],
     'random_state': [0]
}
# Create a base model
xgb_model_classification = XGBClassifier()
# Instantiate the grid search model
grid_search_xgb_model_classificaiton = GridSearchCV(estimator = xgb_model_classification, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)

In [None]:
grid_search_xgb_model_classificaiton.fit(X_train_rfe_classification, y_train_classification)

In [None]:
y_xgb_classification_pred = grid_search_xgb_model_classificaiton.predict(X_test_rfe_classification)

In [None]:
y_xgb_classification_pred

In [None]:
from sklearn import metrics
count_misclassified = (y_test_classification != y_xgb_classification_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(y_test_classification, y_xgb_classification_pred)
print('Accuracy: {:.2f}'.format(accuracy))
precision = metrics.precision_score(y_test_classification, y_xgb_classification_pred, average= 'macro')
print('Precision: {:.2f}'.format(precision))
recall = metrics.recall_score(y_test_classification, y_xgb_classification_pred, average= 'macro')
print('Recall: {:.2f}'.format(recall))
f1_score = metrics.f1_score(y_test_classification, y_xgb_classification_pred, average = 'macro')
print('F1 score: {:.2f}'.format(f1_score))

Gradient Boosting with hyperparameter tuning gives the best results on the test set. Ensemble models tend to overfit, so these scores should be read with some caution. Nevertheless, we take the Gradient Boosting Classifier as the final model for our classification task.
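
One simple way to look for overfitting is to compare accuracy on the training data with accuracy on the held-out test data: a large gap suggests the model has memorised the training set. The sketch below illustrates the idea on synthetic data. The helper `accuracy_gap`, the `*_demo` variables and the small model settings are assumptions made for illustration, not the notebook's tuned model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def accuracy_gap(model, X_train, y_train, X_test, y_test):
    """Return (train accuracy, test accuracy, train minus test) for a fitted model."""
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc, train_acc - test_acc


# Synthetic three-class data standing in for the real features (hypothetical)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=4,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

demo_model = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
demo_model.fit(X_tr, y_tr)

train_acc, test_acc, gap = accuracy_gap(demo_model, X_tr, y_tr, X_te, y_te)
print('Train accuracy: {:.2f}'.format(train_acc))
print('Test accuracy: {:.2f}'.format(test_acc))
print('Gap: {:.2f}'.format(gap))
```

In this notebook the same helper could be called with `grid_search_gbc_model_classificaiton.best_estimator_` and the `X_train_rfe_classification` / `X_test_rfe_classification` splits.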

<a id='ircm'></a>

## 4.5 Interpreting Results of Classification Model

We take the Gradient Boosting classifier, with 83% accuracy on the test set, as the final model.

In [None]:
feature_importance = grid_search_gbc_model_classificaiton.best_estimator_.feature_importances_
sorted_importance = np.argsort(feature_importance)
pos = np.arange(len(sorted_importance))
plt.figure(figsize=(12,5))
plt.barh(pos, feature_importance[sorted_importance],align='center')
plt.yticks(pos, X_train_rfe.columns[sorted_importance],fontsize=15)
plt.title('Feature Importance ',fontsize=18)
plt.show()
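
Since the final model comes out of a grid search, it can also be useful to record which hyperparameter combination was selected and what its cross-validated score was. The sketch below shows the relevant `GridSearchCV` attributes on a small synthetic problem. The data, the reduced grid and the `demo_search` name are assumptions for illustration and do not reproduce the notebook's search.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the real training set (hypothetical)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, n_informative=4,
                                     n_classes=3, random_state=0)

# Deliberately tiny grid so the example runs quickly
demo_grid = {
    'max_depth': [2, 3],
    'n_estimators': [20, 40],
}
demo_search = GridSearchCV(estimator=GradientBoostingClassifier(random_state=0),
                           param_grid=demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)

# Hyperparameters of the winning combination
print('Best parameters: {}'.format(demo_search.best_params_))
# Mean cross-validated accuracy of that combination
print('Best CV accuracy: {:.2f}'.format(demo_search.best_score_))
# The refitted model itself, which exposes feature_importances_
print('Importances: {}'.format(demo_search.best_estimator_.feature_importances_))
```

The cross-validated score is computed on the training folds only, so it can differ from the test-set accuracy reported above.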

<a id='conclusion'></a>

# 5 Conclusion

After looking into the feature importances of the best regression model (XGBoost Regressor) and the best classification model (Gradient Boosting Classifier), we see that both models assign almost the same importance to the respective features. The results of all regression and classification models are as follows:

|Regression Model|Mean Squared Error|
| --- | --- |
|Simple Linear Regression |0.70|
|SVRegressor Linear|0.72|
|SVRegressor Polynomial|0.93|
|SVRegressor RBF|0.68|
|Gradient Boost|0.43|
|Random Forest|0.45|
|XGBoost|0.40|

|Classification Model|Misclassifications|Accuracy|Precision|Recall|F1-Score|
| --- | --- | --- | --- | --- | --- |
| Logistic Regression | 190 | 0.75 | 0.47 | 0.40 | 0.41 |
| SVC Linear | 181 | 0.76 | 0.47 | 0.45 | 0.46 |
| SVC Polynomial | 143 | 0.81 | 0.52 | 0.50 | 0.51 |
| SVC RBF | 146 | 0.81 | 0.51 | 0.50 | 0.50 |
| Random Forest | 130 | 0.83 | 0.54 | 0.50 | 0.51 |
| Gradient Boosting | 127 | 0.83 | 0.54 | 0.51 | 0.52 |
| XGBoost | 139 | 0.82 | 0.52 | 0.51 | 0.51 |
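
The two tables can be read off programmatically. The short sketch below copies the reported numbers into plain dictionaries and picks the regressor with the lowest mean squared error and the classifier with the highest F1 score. The variable names are illustrative, and the values are transcribed from the tables above rather than recomputed.

```python
# Mean squared error per regression model (values from the table above)
regression_mse = {
    'Simple Linear Regression': 0.70,
    'SVRegressor Linear': 0.72,
    'SVRegressor Polynomial': 0.93,
    'SVRegressor RBF': 0.68,
    'Gradient Boost': 0.43,
    'Random Forest': 0.45,
    'XGBoost': 0.40,
}

# (misclassifications, accuracy, precision, recall, F1) per classification model
classification_results = {
    'Logistic Regression': (190, 0.75, 0.47, 0.40, 0.41),
    'SVC Linear': (181, 0.76, 0.47, 0.45, 0.46),
    'SVC Polynomial': (143, 0.81, 0.52, 0.50, 0.51),
    'SVC RBF': (146, 0.81, 0.51, 0.50, 0.50),
    'Random Forest': (130, 0.83, 0.54, 0.50, 0.51),
    'Gradient Boosting': (127, 0.83, 0.54, 0.51, 0.52),
    'XGBoost': (139, 0.82, 0.52, 0.51, 0.51),
}

# Lower error is better for regression; higher F1 is better for classification
best_regressor = min(regression_mse, key=regression_mse.get)
best_classifier = max(classification_results,
                      key=lambda name: classification_results[name][4])

print('Best regression model: {}'.format(best_regressor))       # XGBoost
print('Best classification model: {}'.format(best_classifier))  # Gradient Boosting
```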
