## Introduction

A commercially successful movie not only entertains audiences but also earns film companies substantial profit. Many factors, such as good directors and experienced actors, contribute to making a good movie. However, famous directors and actors can bring the expected box-office income but cannot guarantee a highly rated IMDb score.

## Data Description 

The dataset (movie_metadata.csv) contains 28 variables for 5043 movies, spanning 100 years and 66 countries. There are 2399 unique director names and thousands of actors and actresses. `imdb_score` is the response variable, while the other 27 variables are possible predictors.

## Problem Statement

Build a model to predict which kinds of movies are more successful. Take `imdb_score` as the response variable and predict it by analyzing the remaining variables in the movie data.

## Importing Libraries

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

## Data loading  

In [None]:
#Reading the dataset
data = pd.read_csv('../input/imdb-5000-movie-dataset/movie_metadata.csv')
data.head()

In [None]:
# Name of 28 Columns in the dataset
data.columns

In [None]:
#Shape: Number of columns & Number of rows in the dataset
data.shape

In [None]:
#To check non-null values out of total values & datatype of columns
data.info()

In [None]:
#Used for calculating some statistical data like percentile, mean and std of the numerical values
data.describe()

In [None]:
#This will include count, unique, top and freq. The top is the most common value. The freq is the most common value’s frequency.
data.describe(include = 'object')

## Data Cleaning

In [None]:
#Finding Columns with Missing Values
data.isna().any()

In [None]:
#Finding the number of missing values in all columns
data.isnull().sum()

In [None]:
plt.figure(figsize=(20,6))
heatmap = sns.heatmap(data.isnull(),cmap='Oranges',cbar=False,yticklabels=False)
heatmap.set_xticklabels(heatmap.get_xmajorticklabels(), fontsize = 18)

In [None]:
#bifurcating numerical as well as categorical columns
Num_columns = [column for column in data.columns if data[column].dtype != 'object']
Cat_columns = [column for column in data.columns if data[column].dtype == 'object']

In [None]:
Num_columns

In [None]:
Cat_columns

In [None]:
#Replacing missing values in some of the categorical variables with the mode
#(assigning the result back avoids relying on inplace=True on a column selection)
data['color'] = data['color'].fillna(data['color'].mode()[0])
data['country'] = data['country'].fillna(data['country'].mode()[0])
data['language'] = data['language'].fillna(data['language'].mode()[0])

In [None]:
data['content_rating'].value_counts()

In [None]:
#Filling null values with 'Not Rated'
data['content_rating'] = data['content_rating'].fillna('Not Rated')

In [None]:
#Replacing missing numerical values with the median (not the mean, because there may be outliers)
median_columns = ['num_critic_for_reviews', 'duration', 'director_facebook_likes',
                  'actor_3_facebook_likes', 'actor_1_facebook_likes', 'actor_2_facebook_likes',
                  'gross', 'facenumber_in_poster', 'num_user_for_reviews', 'budget']
for column in median_columns:
    data[column] = data[column].fillna(data[column].median())
#title_year and aspect_ratio are filled with the mode instead
for column in ['title_year', 'aspect_ratio']:
    data[column] = data[column].fillna(data[column].mode()[0])
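
The choice of median over mean can be illustrated on a small example. This is a minimal sketch with made-up budget values (not taken from the dataset): a single extreme value pulls the mean far from the typical row, while the median stays close to it.

```python
import numpy as np
import pandas as pd

# Hypothetical budgets: three ordinary values, one missing, one extreme outlier.
budget = pd.Series([10.0, 12.0, 11.0, np.nan, 1000.0])

mean_value = budget.mean()      # dominated by the outlier
median_value = budget.median()  # stays near the typical values

# Filling the gap with the median keeps the imputed row realistic.
median_fill = budget.fillna(median_value)

print("mean  :", mean_value)
print("median:", median_value)
print(median_fill.tolist())
```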

In [None]:
#Checking number of unique values 
data['plot_keywords'].nunique(), data['director_name'].nunique(), data['actor_2_name'].nunique(),data['actor_1_name'].nunique(),data['actor_3_name'].nunique()

In [None]:
#Dropping rows with null values in these columns because they have many unique values, so they can't be replaced with the mode or median
data = data.dropna(axis = 0, subset = ['plot_keywords','director_name','actor_2_name','actor_1_name','actor_3_name'])

In [None]:
#Dropping duplicate rows 
data.drop_duplicates(inplace = True)
data.shape

We lost around 305 rows, which is about 6% of the whole dataset.
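
The row loss can be computed rather than read off two `shape` outputs (305 of 5043 rows is roughly 6%). The sketch below shows the bookkeeping on a toy frame; the column values are invented for illustration.

```python
import pandas as pd

# Toy frame: one row with a missing director and one exact duplicate row.
toy = pd.DataFrame({
    "director_name": ["A", None, "C", "C", "D"],
    "imdb_score": [7.1, 6.0, 5.5, 5.5, 8.2],
})

rows_before = len(toy)
cleaned = toy.dropna(subset=["director_name"]).drop_duplicates()

rows_lost = rows_before - len(cleaned)
pct_lost = 100 * rows_lost / rows_before

print(f"Lost {rows_lost} of {rows_before} rows ({pct_lost:.1f}%)")
```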

In [None]:
data.isnull().sum()

## Data Visualization 

In [None]:
plt.figure(figsize=(16, 8))
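#sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable plot
sns.histplot(data['imdb_score'], color='g', bins=100, kde=True)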

The plot shows that the IMDb scores are left-skewed.
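
Skewness can also be checked numerically instead of by eye. A minimal sketch, using invented scores rather than the real column: pandas' `Series.skew()` returns a negative value when the long tail is on the low side, which is what "left-skewed" means.

```python
import pandas as pd

# Hypothetical scores: most values sit around 6 to 8, with one low outlier.
scores = pd.Series([2.0, 6.0, 6.5, 7.0, 7.0, 7.5, 8.0])

skewness = scores.skew()

# For a left-skewed sample the mean is dragged below the median.
print("skewness:", skewness)
print("mean < median:", scores.mean() < scores.median())
```

On the notebook's data the equivalent call would be `data['imdb_score'].skew()`.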

In [None]:
data.hist(bins=30,figsize=(16,16),color='Orange',xlabelsize=8, ylabelsize=8)

In [None]:
plt.figure(figsize=(16, 6))
plot = sns.countplot(x='color', data=data)

In [None]:
plt.figure(figsize=(16, 6))
plot = sns.countplot(x='country', data=data)
plt.xticks(rotation = 90)

In [None]:
plt.figure(figsize=(16, 6))
plot = sns.countplot(x='language', data=data)
plt.xticks(rotation = 90)

In [None]:
plt.figure(figsize=(16, 6))
plot = sns.countplot(x='content_rating', data=data)
plt.xticks(rotation = 90)

In [None]:
plt.figure(figsize = (16, 6))
ax = sns.boxplot(x='color', y='imdb_score', data=data)
plt.setp(ax.artists, alpha=.5, linewidth=2, edgecolor="k")
plt.xticks(rotation=45)

In [None]:
plt.figure(figsize = (16, 6))
ax = sns.boxplot(x='content_rating', y='imdb_score', data=data)
plt.setp(ax.artists, alpha=.5, linewidth=2, edgecolor="k")
plt.xticks(rotation=45)

In [None]:
plt.figure(figsize = (16, 6))
ax = sns.boxplot(x='country', y='imdb_score', data=data)
plt.setp(ax.artists, alpha=.5, linewidth=2, edgecolor="k")
plt.xticks(rotation=90)

In [None]:
plt.figure(figsize = (16, 6))
ax = sns.boxplot(x='language', y='imdb_score', data=data)
plt.setp(ax.artists, alpha=.5, linewidth=2, edgecolor="k")
plt.xticks(rotation=90)

In [None]:
fig, axes = plt.subplots(round(len(Num_columns) / 3), 3, figsize = (18, 20))

#One regression plot per numerical column against imdb_score
for i, ax in enumerate(fig.axes):
    if i < len(Num_columns) - 1:
        sns.regplot(x=Num_columns[i],y='imdb_score', data=data[Num_columns], ax=ax, label = Num_columns[i])

In [None]:
for i in range(0, len(Num_columns), 5):
    sns.pairplot(data=data[Num_columns],
                x_vars=Num_columns[i:i+5],
                y_vars=['imdb_score'])

## Data Pre-Processing 

In [None]:
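#Replacing some special characters with comma
#regex=False treats '|' as a literal character rather than as regex alternation
data['plot_keywords'] = data['plot_keywords'].str.replace('|',',', regex=False)
data['genres'] = data['genres'].str.replace('|',',', regex=False)
data['movie_title'] = data['movie_title'].str.replace('Â',' ', regex=False)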

In [None]:
#New column(Profit) to calculate the net profit made by the movie (Gross-Budget)
data['Profit']=data['gross'] - data['budget']

In [None]:
#New column(Profit%) to express the profit as a percentage of the gross
data['Profit%']=(data['Profit']/data['gross'])*100
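
A quick worked example helps confirm the sign convention. This sketch assumes profit is gross minus budget, as the comments state, and uses two invented movies: one that earns more than it cost and one that does not.

```python
import pandas as pd

# Hypothetical figures: the first movie is profitable, the second is not.
toy = pd.DataFrame({"budget": [100.0, 200.0], "gross": [150.0, 160.0]})

toy["Profit"] = toy["gross"] - toy["budget"]
toy["Profit%"] = toy["Profit"] / toy["gross"] * 100

print(toy)
```

A profitable movie should end up with a positive `Profit`; if the subtraction is reversed, the signs flip for every row.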

In [None]:
data['country'].value_counts()

From the above output we can observe that most of the movies were produced in the USA and UK (around 85%).
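
A share such as this can be computed directly with `value_counts(normalize=True)`. The sketch below uses a made-up country column, so the resulting number is illustrative only.

```python
import pandas as pd

# Hypothetical country column: 7 USA, 2 UK, 1 France.
country = pd.Series(["USA"] * 7 + ["UK"] * 2 + ["France"])

shares = country.value_counts(normalize=True)  # fractions, sorted descending
top_two_share = shares.head(2).sum()

print(shares)
print(f"Top two countries: {top_two_share:.0%}")
```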

In [None]:
#Replacing countries other than USA & UK with 'other'
countries = ['USA','UK']
data['country'] = data['country'].where(data['country'].isin(countries), 'other')
data['country'].value_counts()

In [None]:
data['language'].value_counts()

From the above output we can observe that most of the movies are in English (around 94%).

In [None]:
#Replacing all languages other than English with 'other'
most_occurred_language = ['English']
data['language'] = data['language'].where(data['language'].isin(most_occurred_language), 'other')
data['language'].value_counts()

In [None]:
#Correlation_Matrix- Finding Correlation between variables
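#numeric_only=True skips the remaining text columns, which recent pandas versions no longer drop silently
correlation = data.corr(numeric_only=True)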
f,ax = plt.subplots(figsize=(15,15))
sns.heatmap(correlation, annot=True, cmap="YlGnBu", linewidths=.5,fmt='.2f')

From the above heatmap, actor_1_facebook_likes and cast_total_facebook_likes are highly correlated with each other.

In [None]:
#Dropping highly correlated columns
data.drop('cast_total_facebook_likes',axis=1,inplace=True)
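
Highly correlated pairs can also be found programmatically instead of by reading the heatmap. The following is a sketch on synthetic data; the column names and the 0.9 threshold are assumptions chosen for illustration, not values from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Synthetic frame: cast_likes is nearly a rescaled copy of actor_likes.
toy = pd.DataFrame({
    "actor_likes": base,
    "cast_likes": base * 1.5 + rng.normal(scale=0.05, size=200),
    "duration": rng.normal(size=200),
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is ignored.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cutoff
high_pairs = [(r, c, upper.loc[r, c])
              for r in upper.index for c in upper.columns
              if upper.loc[r, c] > threshold]
# Drop the second member of each highly correlated pair.
to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]

print(high_pairs)
print(to_drop)
```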

In [None]:
data.shape

Converting categorical features into numerical features by using label encoder 

In [None]:
#Converting the column labels into numeric form
labelencoding = LabelEncoder()
categories=['color', 'director_name', 'actor_2_name',
        'genres', 'actor_1_name',
        'actor_3_name',
        'plot_keywords',
        'language', 'country', 'content_rating',
       'title_year', 'aspect_ratio','movie_title','movie_imdb_link']
data[categories]=data[categories].apply(lambda x:labelencoding.fit_transform(x))
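
To see what label encoding produces, here is a small sketch on invented values. It keeps one `LabelEncoder` per column (the `encoders` dictionary is a hypothetical helper, not part of the notebook) so that the integer codes can be mapped back to the original labels later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "country": ["USA", "UK", "other", "USA"],
    "color": ["Color", "Black and White", "Color", "Color"],
})

encoders = {}
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])  # codes follow sorted label order
    encoders[col] = enc

print(encoded)
print(list(encoders["country"].classes_))
# Codes can be turned back into labels with the stored encoder.
print(list(encoders["country"].inverse_transform(encoded["country"])))
```

The integer codes follow the sorted order of the labels, so they carry an ordering that has no real meaning. Tree-based models such as gradient boosting tend to tolerate this better than linear models do.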

In [None]:
#Scales all the data features in the range [0, 1]
y = data['imdb_score']
X = data.drop(['imdb_score'], axis = 1)
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
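
Scaling before the split, as above, lets the minimum and maximum of rows that later land in the test set influence the training features. A common alternative, sketched here on synthetic data with hypothetical variable names, is to split first and fit the scaler on the training portion only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 3))  # synthetic features
y_toy = rng.normal(size=100)       # synthetic target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

toy_scaler = MinMaxScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)  # learn min/max from training rows only
X_te_scaled = toy_scaler.transform(X_te)      # reuse those statistics on test rows

print(X_tr_scaled.min(), X_tr_scaled.max())
print(X_te_scaled.min(), X_te_scaled.max())
```

With this approach the scaled test values may fall slightly outside [0, 1], which is expected. For tree-based models the effect on accuracy is usually small, since splits do not depend on feature scale.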

## Model Building 

In [None]:
#Splitting the data-set into train and test data-set
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state = 42)

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

In [None]:
#Training the model
n_trees=300
#'ls' was renamed to 'squared_error' in scikit-learn 1.0 and later removed
gbregressor = GradientBoostingRegressor(loss='squared_error',learning_rate=0.02,n_estimators=n_trees,max_depth=3)
gbregressor.fit(X_train,y_train)

In [None]:
#Predicting the output on test data-set
y_pred=gbregressor.predict(X_test)

In [None]:
#Calculating the error on the basis of actual & predicted output
print('The mean squared error using Gradient boosting regressor is: ',mean_squared_error(y_test,y_pred))
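
The mean squared error is expressed in squared score units, so it can be easier to read alongside the root mean squared error (in IMDb points) and R². A minimal sketch with invented true and predicted scores:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted scores; every prediction is off by 0.5.
y_true = np.array([6.0, 7.0, 8.0, 5.0])
y_hat = np.array([6.5, 6.5, 7.5, 5.5])

mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)            # same units as imdb_score
r2 = r2_score(y_true, y_hat)   # share of variance explained

print("MSE :", mse)
print("RMSE:", rmse)
print("R2  :", r2)
```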

In [None]:
#Constructing a new dataframe with column names and feature importance
featureimp = pd.DataFrame()
datanew = data.drop(['imdb_score'], axis = 1)
featureimp['columns'] = datanew.columns

featureimp['Feature_importance'] = gbregressor.feature_importances_
#Sorting with feature importance column
featureimp = featureimp.sort_values(by='Feature_importance', ascending=True)

#Barplot indicating Feature Importance
plt.figure(figsize=(16, 16))
plt.barh(y=featureimp['columns'], width=featureimp['Feature_importance'], color='blue')
plt.title('Feature Importance', fontsize=20, fontweight='bold', pad=20)
plt.xlabel('Importance', fontsize=14, labelpad=20)
plt.show()

In [None]:
#Using Cross-validation for Hyperparameter tuning
param_grid = {
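    'loss' : ['squared_error'],  #'ls' was renamed to 'squared_error' in scikit-learn 1.0
    'max_depth' : [3,4,5],
    'learning_rate' : [0.05, 0.01,0.001],
    'n_estimators': [300,500,1000],
    'min_samples_split' : [2,4],  #an integer min_samples_split must be at least 2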
    'min_samples_leaf' : [0.5,1],
    'max_features' : [15,20,25]}
gbregressor = GradientBoostingRegressor()
gb_gridsearch = GridSearchCV(estimator = gbregressor, param_grid = param_grid, 
                          cv = 5, n_jobs = -1, verbose = 2)
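
Before launching a grid search it helps to know how many models will be trained. The sketch below counts the combinations with scikit-learn's `ParameterGrid`. The `example_grid` dictionary is an assumption that mirrors the shape of the grid above (same number of values per parameter) rather than a copy of it.

```python
from sklearn.model_selection import ParameterGrid

# Assumed grid with the same number of options per parameter as the cell above.
example_grid = {
    "loss": ["squared_error"],
    "max_depth": [3, 4, 5],
    "learning_rate": [0.05, 0.01, 0.001],
    "n_estimators": [300, 500, 1000],
    "min_samples_split": [2, 4],
    "min_samples_leaf": [0.5, 1],
    "max_features": [15, 20, 25],
}

n_candidates = len(ParameterGrid(example_grid))
n_folds = 5
n_fits = n_candidates * n_folds  # one fit per candidate per fold

print(n_candidates, "candidates,", n_fits, "fits")
```

With up to 1000 trees per model, a search of this size can take a long time, so trimming the grid or using a randomized search may be worth considering.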

In [None]:
#Training the model with best hyperparameter and printing the best hyperparameters used
gb_gridsearch.fit(X_train, y_train)
gb_gridsearch.best_params_

In [None]:
#Predicting the output after the cross-validation with best hyperparameter & calculating the error
y_pred_gridsearch = gb_gridsearch.predict(X_test)

In [None]:
#Output
y_pred_gridsearch

In [None]:
print('The mean squared error using Gradient boosting regressor after hyperparameter tuning is: ',mean_squared_error(y_test,y_pred_gridsearch))