## IMDb Movie Rating Prediction - Sindhura Nadendla

## Load the dataset 

In [1]:
from warnings import filterwarnings
filterwarnings('ignore')

In [2]:
import pandas as pd
df = pd.read_csv('/kaggle/input/imdb-india-movies/IMDb Movies India.csv',na_values=(' '),encoding='latin-1')
df.head()

| | Name | Year | Duration | Genre | Rating | Votes | Director | Actor 1 | Actor 2 | Actor 3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | Drama | NaN | NaN | J.S. Randhawa | Manmauji | Birbal | Rajendra Bhatia |
| 1 | #Gadhvi (He thought he was Gandhi) | (2019) | 109 min | Drama | 7.0 | 8.0 | Gaurav Bakshi | Rasika Dugal | Vivek Ghamande | Arvind Jangid |
| 2 | #Homecoming | (2021) | 90 min | Drama, Musical | NaN | NaN | Soumyajit Majumdar | Sayani Gupta | Plabita Borthakur | Roy Angana |
| 3 | #Yaaram | (2019) | 110 min | Comedy, Romance | 4.4 | 35.0 | Ovais Khan | Prateik | Ishita Raj | Siddhant Kapoor |
| 4 | ...And Once Again | (2010) | 105 min | Drama | NaN | NaN | Amol Palekar | Rajat Kapoor | Rituparna Sengupta | Antara Mali |


In [3]:
df.shape

(15509, 10)

In [4]:
df.columns

Index(['Name', 'Year', 'Duration', 'Genre', 'Rating', 'Votes', 'Director',
       'Actor 1', 'Actor 2', 'Actor 3'],
      dtype='object')

## Perform basic data quality checks

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      15508 non-null  object 
 1   Year      14981 non-null  object 
 2   Duration  7240 non-null   object 
 3   Genre     13632 non-null  object 
 4   Rating    7919 non-null   float64
 5   Votes     7920 non-null   object 
 6   Director  14984 non-null  object 
 7   Actor 1   13892 non-null  object 
 8   Actor 2   13125 non-null  object 
 9   Actor 3   12365 non-null  object 
dtypes: float64(1), object(9)
memory usage: 1.2+ MB


In [6]:
df.isna().sum()

Name           1
Year         528
Duration    8269
Genre       1877
Rating      7590
Votes       7589
Director     525
Actor 1     1617
Actor 2     2384
Actor 3     3144
dtype: int64

In [7]:
df.duplicated().sum()

6

There are missing values and duplicated rows in the dataset.

In [8]:
# drop the duplicated rows
df.drop_duplicates(inplace=True)
df.duplicated().sum()

0

## Feature Engineering 
Steps performed under this section
1. Dropping statistically insignificant columns

    The movie name doesn't have much significance here, so let's drop that column.
    Movie Duration has many missing values, so let's discard this feature.
2. Handling missing values

    Rating and Votes have many missing values: almost half of the rows in the dataset are missing them.
    Rating is our target feature, so let's handle its missing values.
    On IMDb, the final rating is an aggregate of user votes, so let's handle missing Votes as well.
3. Convert Year,Votes to int
4. Encoding Categorical features using Target encoding-Mean encoding
5. Separate X and Y features
6. Split the dataset into training data and testing data
-------------------------------------------------------------------------------------------------

1. Dropping statistically insignificant columns

In [9]:
df.drop(columns=['Name','Duration'],axis=1,inplace=True)
df.head()

| | Year | Genre | Rating | Votes | Director | Actor 1 | Actor 2 | Actor 3 |
|---|---|---|---|---|---|---|---|---|
| 0 | NaN | Drama | NaN | NaN | J.S. Randhawa | Manmauji | Birbal | Rajendra Bhatia |
| 1 | (2019) | Drama | 7.0 | 8.0 | Gaurav Bakshi | Rasika Dugal | Vivek Ghamande | Arvind Jangid |
| 2 | (2021) | Drama, Musical | NaN | NaN | Soumyajit Majumdar | Sayani Gupta | Plabita Borthakur | Roy Angana |
| 3 | (2019) | Comedy, Romance | 4.4 | 35.0 | Ovais Khan | Prateik | Ishita Raj | Siddhant Kapoor |
| 4 | (2010) | Drama | NaN | NaN | Amol Palekar | Rajat Kapoor | Rituparna Sengupta | Antara Mali |


In [10]:
df.shape

(15503, 8)

--------------------------------------------------------------------------------
2. Handling missing values

In [11]:
# Dropping Name and Duration can leave previously distinct rows identical, so drop duplicates again
df.drop_duplicates(inplace=True)

In [12]:
df.duplicated().sum()

0

In [13]:
df.nunique()

Year         102
Genre        485
Rating        84
Votes       2034
Director    5938
Actor 1     4718
Actor 2     4891
Actor 3     4820
dtype: int64

In [14]:
cat = list(df.columns[df.dtypes=='object'])

In [15]:
# Fill missing ratings with the column mean (assign back instead of a chained inplace fill)
mn = df['Rating'].mean()
df['Rating'] = df['Rating'].fillna(mn)

In [16]:
# Fill missing values in each categorical column with its most frequent value
for i in cat:
    m = df[i].mode()[0]
    df[i] = df[i].fillna(m)

In [17]:
df.isna().sum()

Year        0
Genre       0
Rating      0
Votes       0
Director    0
Actor 1     0
Actor 2     0
Actor 3     0
dtype: int64
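As a minimal sketch of what the two imputation cells above do, here is the same mean/mode filling applied to a tiny made-up frame. The `toy` frame and its values are assumptions for illustration only, not rows from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values): one numeric target and one categorical column
toy = pd.DataFrame({
    "Rating": [7.0, None, 5.0, None],
    "Genre": ["Drama", None, "Drama", "Comedy"],
})

# Numeric column: missing entries take the mean of the observed values (7.0 and 5.0)
toy["Rating"] = toy["Rating"].fillna(toy["Rating"].mean())

# Categorical column: missing entries take the most frequent value
toy["Genre"] = toy["Genre"].fillna(toy["Genre"].mode()[0])

print(toy)
```

Note that every row with a missing rating ends up with the same value. In this notebook that is presumably why rows 0, 2 and 4 show a rating of 5.8 in the later outputs.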

--------------------------------------------------------------------------------
3. Convert Year,Votes to int

In [18]:
# Let's convert the Year column from string to int, e.g. '(2019)' -> 2019
df['Year'] = df['Year'].str.strip('()').astype(int)
# Let's convert the Votes column from string to int
df['Votes'] = df['Votes'].str.replace(',','')
# regex=False so that '.' is removed as a literal dot, not treated as "any character"
df['Votes'] = df['Votes'].str.replace('.','',regex=False)
df['Votes'] = df['Votes'].str.strip('$')
df['Votes'] = df['Votes'].str.strip('M')
df['Votes'] = df['Votes'].astype(int)
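The string clean-up above can be checked on a couple of sample values. This is a small sketch; the sample strings are assumptions about what the raw columns look like (years wrapped in parentheses, votes with thousands separators), not values copied from the file.

```python
import pandas as pd

# Hypothetical raw values in the same shape as the Year and Votes columns
year = pd.Series(["(2019)", "(2021)", "(1970)"])
votes = pd.Series(["8", "1,086", "35"])

# Strip the surrounding parentheses, then cast to int
year_int = year.str.strip("()").astype(int)

# Remove the thousands separator as a literal string, then cast to int
votes_int = votes.str.replace(",", "", regex=False).astype(int)

print(year_int.tolist())
print(votes_int.tolist())
```

Passing `regex=False` makes the intent explicit: the pattern is a plain character, not a regular expression.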

In [19]:
df.nunique()

Year         102
Genre        485
Rating        85
Votes       2034
Director    5938
Actor 1     4718
Actor 2     4891
Actor 3     4820
dtype: int64

--------------------------------------------------------------------------------
4. Encoding Categorical features using Target encoding-Mean encoding

As the `nunique()` output above shows, the following features have high cardinality:

    Votes
    Director
    Actor 1
    Actor 2
    Actor 3

High cardinality: a feature has a large number of unique values (categories).

If we used one-hot encoding for such features, it would create a very large number of binary columns, resulting in high memory usage and long computation times. That is not feasible here.
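To put a rough number on that, the sketch below adds up the unique-value counts reported by `df.nunique()` for the five features listed above. It assumes one binary column per unique value and a dense `float64` matrix; a sparse representation would be much smaller, so treat the figure as an upper-bound estimate.

```python
# Unique-value counts copied from the df.nunique() output
unique_counts = {
    "Votes": 2034,
    "Director": 5938,
    "Actor 1": 4718,
    "Actor 2": 4891,
    "Actor 3": 4820,
}

# One-hot encoding creates roughly one binary column per unique value
one_hot_columns = sum(unique_counts.values())

# Row count after de-duplication (11325 training rows + 3776 test rows)
n_rows = 11325 + 3776

# Assumption: dense storage at 8 bytes per cell (float64)
dense_bytes = n_rows * one_hot_columns * 8

print(f"{one_hot_columns} columns, about {dense_bytes / 1e9:.1f} GB if stored densely")
```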

Hence, for this scenario, we will use target encoding, where we calculate a group aggregate (the mean of Rating) for each category of a column and substitute the category with that mean.

I will perform target encoding for the features listed above, and for Genre as well.
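A minimal sketch of mean encoding on a made-up frame may help make the idea concrete. The `toy` data is hypothetical. The second half shows a variant that is not what this notebook does: it learns the category means from training rows only and falls back to the overall training mean for unseen categories, which is one common way to keep the target of a test row out of its own features.

```python
import pandas as pd

# Hypothetical data: a categorical feature and the target
toy = pd.DataFrame({
    "Director": ["A", "A", "B", "B", "C"],
    "Rating":   [6.0, 8.0, 4.0, 5.0, 7.0],
})

# Mean encoding as used in this notebook: each row gets the mean
# rating of its category, computed over the whole frame
toy["Director encoded"] = toy.groupby("Director")["Rating"].transform("mean")
print(toy)

# Variant (an assumption, not the notebook's method): fit on training rows only
train = toy.iloc[:3]            # directors A, A, B
test = toy.iloc[3:]             # directors B, C
means = train.groupby("Director")["Rating"].mean()
global_mean = train["Rating"].mean()

# Unseen categories (director C) fall back to the overall training mean
test_encoded = test["Director"].map(means).fillna(global_mean)
print(test_encoded.tolist())
```

In the first variant, a category that occurs only once is encoded as its own rating (director C above). The notebook's output is consistent with this: for rows 1 and 3 the encoded Director value equals the Rating itself, so the scores reported later should be read with that in mind.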
    

In [20]:
## performing mean encoding
df['Genre encoded'] = round(df.groupby('Genre')['Rating'].transform('mean'),1)
df['Votes encoded'] = round(df.groupby('Votes')['Rating'].transform('mean'),1)
df['Director encoded'] = round(df.groupby('Director')['Rating'].transform('mean'),1)
df['Actor 1 encoded'] = round(df.groupby('Actor 1')['Rating'].transform('mean'),1)
df['Actor 2 encoded'] = round(df.groupby('Actor 2')['Rating'].transform('mean'),1)
df['Actor 3 encoded'] = round(df.groupby('Actor 3')['Rating'].transform('mean'),1)

df.drop(columns=['Genre','Votes','Director','Actor 1','Actor 2','Actor 3'],inplace=True)
df['Rating'] = round(df['Rating'],1)
df.head()

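| | Year | Rating | Genre encoded | Votes encoded | Director encoded | Actor 1 encoded | Actor 2 encoded | Actor 3 encoded |
|---|---|---|---|---|---|---|---|---|
| 0 | 2019 | 5.8 | 6.0 | 5.8 | 5.8 | 5.8 | 5.5 | 5.8 |
| 1 | 2019 | 7.0 | 6.0 | 5.8 | 7.0 | 6.8 | 7.0 | 7.0 |
| 2 | 2021 | 5.8 | 6.3 | 5.8 | 5.8 | 6.2 | 6.8 | 5.8 |
| 3 | 2019 | 4.4 | 5.7 | 5.9 | 4.4 | 5.4 | 4.4 | 4.4 |
| 4 | 2010 | 5.8 | 6.0 | 5.8 | 6.3 | 6.8 | 5.8 | 5.5 |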


--------------------------------------------------------------------------------
5. Separate X and Y features

In [21]:
X = df.drop(columns='Rating')
Y =df[['Rating']]

In [22]:
X.head()

| | Year | Genre encoded | Votes encoded | Director encoded | Actor 1 encoded | Actor 2 encoded | Actor 3 encoded |
|---|---|---|---|---|---|---|---|
| 0 | 2019 | 6.0 | 5.8 | 5.8 | 5.8 | 5.5 | 5.8 |
| 1 | 2019 | 6.0 | 5.8 | 7.0 | 6.8 | 7.0 | 7.0 |
| 2 | 2021 | 6.3 | 5.8 | 5.8 | 6.2 | 6.8 | 5.8 |
| 3 | 2019 | 5.7 | 5.9 | 4.4 | 5.4 | 4.4 | 4.4 |
| 4 | 2010 | 6.0 | 5.8 | 6.3 | 6.8 | 5.8 | 5.5 |


In [23]:
Y.head()

| | Rating |
|---|---|
| 0 | 5.8 |
| 1 | 7.0 |
| 2 | 5.8 |
| 3 | 4.4 |
| 4 | 5.8 |


--------------------------------------------------------------------------------
6. Split the dataset into training data and testing data

In [24]:
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(X,Y,test_size=0.25,random_state=21)

In [25]:
xtrain.shape

(11325, 7)

In [26]:
xtest.shape

(3776, 7)

In [27]:
ytrain.shape

(11325, 1)

In [28]:
ytest.shape

(3776, 1)

## Build the model

Performing algorithm evaluation to check which regression models give the best results

In [29]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV

In [30]:
dct = {
    'Linear':LinearRegression(),
    'DecisionTree':DecisionTreeRegressor(),
    'RandomForest':RandomForestRegressor(),
    'GradientBoosting':GradientBoostingRegressor(),
    'KNN':KNeighborsRegressor(),
    'SVR':SVR()
}

In [31]:
dct.items()

dict_items([('Linear', LinearRegression()), ('DecisionTree', DecisionTreeRegressor()), ('RandomForest', RandomForestRegressor()), ('GradientBoosting', GradientBoostingRegressor()), ('KNN', KNeighborsRegressor()), ('SVR', SVR())])

In [32]:
from sklearn.metrics import mean_squared_error,r2_score
from sklearn.model_selection import cross_val_score

In [33]:
train_mse = []
train_r2 = []
test_mse = []
test_r2 = []
train_cv = []

for name,model in dct.items():
    # fit the model
    m = model.fit(xtrain,ytrain)
    ypred_train = m.predict(xtrain)
    ypred_test = m.predict(xtest)
    # calculate MSE
    mse_train = mean_squared_error(ytrain,ypred_train)
    mse_test = mean_squared_error(ytest,ypred_test)
    # calculate R2 (as a percentage)
    r2_train = (r2_score(ytrain,ypred_train))*100
    r2_test = (r2_score(ytest,ypred_test))*100
    # calculate cross validated scores
    cv = cross_val_score(m,xtrain,ytrain,cv=5,scoring='r2')
    scores = (cv.mean())*100

    # add these values to the respective list to compare the output
    train_mse.append(mse_train)
    train_r2.append(r2_train)
    test_mse.append(mse_test)
    test_r2.append(r2_test)
    train_cv.append(scores)

    # print the results (the lists accumulate, so each print shows every model fitted so far)
    print(f'Scores for {name}')
    print("Training Scores")
    print(f'MSE:{train_mse}')
    print(f'R2:{train_r2}')
    print("Testing Scores")
    print(f'MSE:{test_mse}')
    print(f'R2:{test_r2}')
    print(f'CV:{train_cv}')
    print("===============================")

Scores for Linear
Training Scores
MSE:[0.24667540151040804]
R2:[74.87422109028985]
Testing Scores
MSE:[0.2715908987362406]
R2:[74.3997681809067]
CV:[74.73950258907246]
Scores for DecisionTree
Training Scores
MSE:[0.24667540151040804, 0.0002136865342163355]
R2:[74.87422109028985, 99.97823438988311]
Testing Scores
MSE:[0.2715908987362406, 0.42923728813559314]
R2:[74.3997681809067, 59.53997673411789]
CV:[74.73950258907246, 59.29568239953734]
Scores for RandomForest
Training Scores
MSE:[0.24667540151040804, 0.0002136865342163355, 0.027653109254059362]
R2:[74.87422109028985, 99.97823438988311, 97.18331902966726]
Testing Scores
MSE:[0.2715908987362406, 0.42923728813559314, 0.22030625239818738]
R2:[74.3997681809067, 59.53997673411789, 79.23387286233573]
CV:[74.73950258907246, 59.29568239953734, 79.86668794872595]
Scores for GradientBoosting
Training Scores
MSE:[0.24667540151040804, 0.0002136865342163355, 0.027653109254059362, 0.1901835197448489]
R2:[74.87422109028985, 99.97823438988311, 97.18

In [34]:
res = {'Name':list(dct.keys()),
       'MSE Training Scores':train_mse,
       'MSE Testing Scores':test_mse,
       'R2 Training Scores':train_r2,
       'R2 Testing Scores':test_r2,
       'CV Training Scores':train_cv}

In [35]:
df_res = pd.DataFrame(res)
df_res.sort_values('CV Training Scores',ascending=False)

| | Name | MSE Training Scores | MSE Testing Scores | R2 Training Scores | R2 Testing Scores | CV Training Scores |
|---|---|---|---|---|---|---|
| 2 | RandomForest | 0.027653 | 0.220306 | 97.183319 | 79.233873 | 79.866688 |
| 3 | GradientBoosting | 0.190184 | 0.23723 | 80.628352 | 77.638619 | 78.407248 |
| 0 | Linear | 0.246675 | 0.271591 | 74.874221 | 74.399768 | 74.739503 |
| 4 | KNN | 0.181077 | 0.308273 | 81.555872 | 70.942065 | 71.396913 |
| 1 | DecisionTree | 0.000214 | 0.429237 | 99.978234 | 59.539977 | 59.295682 |
| 5 | SVR | 0.979322 | 1.057331 | 0.24854 | 0.335717 | 0.0237 |


#### Let's select the Random Forest regressor, as it gives the best cross-validated R2 score

In [36]:
params = {'n_estimators':[200,300],
          'max_depth':[5,6,7,8],
          'min_samples_split':[2,3,4,5,6],
          'criterion':['squared_error','absolute_error']}

In [37]:
rfr = RandomForestRegressor()
rscv = RandomizedSearchCV(rfr,params,cv=3,scoring='neg_mean_squared_error')
rscv.fit(xtrain,ytrain)

In [38]:
rscv.best_params_

{'n_estimators': 200,
 'min_samples_split': 5,
 'max_depth': 6,
 'criterion': 'squared_error'}

In [39]:
best_rfr = rscv.best_estimator_
best_rfr

The Random Forest model gives scores of around 77%.

Next, I will use XGBoost to see whether it improves the prediction scores.

In [40]:
from xgboost import XGBRegressor

In [41]:
model = XGBRegressor()
model.fit(xtrain,ytrain)

In [42]:
model.score(xtrain,ytrain)

0.9250681221304905

In [43]:
model.score(xtest,ytest)

0.7841636539213964

In [44]:
from sklearn.model_selection import GridSearchCV

In [45]:
params = {'n_estimators':[200,300,500,600,800,1000],
          'learning_rate':[0.05,0.1,0.2,0.3],
          'max_depth':[5,6,7,8,9,10],
          'min_child_weight':[1,2,3],
          'objective':['reg:squarederror'],
          'gamma':[0.1,0.2,0.3,0.4]}

In [46]:
gscv = GridSearchCV(model,params,scoring='neg_mean_squared_error',cv=5)
gscv.fit(xtrain,ytrain)
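To get a feel for the cost of this grid search, the sketch below counts the parameter combinations in the `params` grid above and multiplies by the number of folds. It only does arithmetic on the grid definition and does not fit anything; the final refit on the full training set that `GridSearchCV` performs by default is not included in the count.

```python
from math import prod

# Same grid as passed to GridSearchCV above
params = {'n_estimators': [200, 300, 500, 600, 800, 1000],
          'learning_rate': [0.05, 0.1, 0.2, 0.3],
          'max_depth': [5, 6, 7, 8, 9, 10],
          'min_child_weight': [1, 2, 3],
          'objective': ['reg:squarederror'],
          'gamma': [0.1, 0.2, 0.3, 0.4]}

# An exhaustive grid search tries every combination of the listed values
n_candidates = prod(len(values) for values in params.values())

# Each candidate is fitted once per cross-validation fold
cv_folds = 5
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} model fits")
```

With this many fits, a randomized search over the same grid is a cheaper alternative when run time matters.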

In [47]:
gscv.best_params_

{'gamma': 0.1,
 'learning_rate': 0.05,
 'max_depth': 5,
 'min_child_weight': 3,
 'n_estimators': 500,
 'objective': 'reg:squarederror'}

In [48]:
best_xgb = gscv.best_estimator_
best_xgb

In [49]:
best_xgb.score(xtrain,ytrain)

0.8574709771358586

In [50]:
best_xgb.score(xtest,ytest)

0.8021907473809164

Let's tune a few more parameters of this model.

In [51]:
params1 = {'subsample':[0.5,0.6,0.7,0.8,0.9,1],
           'colsample_bytree':[0.5,0.6,0.7,0.8,0.9,1]}

In [52]:
gscv1 = GridSearchCV(best_xgb,params1,cv=5,scoring='neg_mean_squared_error')
gscv1.fit(xtrain,ytrain)

In [53]:
gscv1.best_params_

{'colsample_bytree': 0.6, 'subsample': 0.9}

In [54]:
best_xgb2 = gscv1.best_estimator_
best_xgb2

In [55]:
best_xgb2.score(xtrain,ytrain)

0.8765155698655448

In [56]:
best_xgb2.score(xtest,ytest)

0.8048450929698301

## Evaluate the models : Random Forest and XGBoost

In [57]:
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score

In [58]:
def eval_model(model,xtrain,ytrain):
    # Refit the model on the training data
    model.fit(xtrain,ytrain)
    # Predict on the hold-out set (uses the global xtest/ytest from the split above)
    ypred_test = model.predict(xtest)
    # Calculate MSE, RMSE, MAE and R2 on the test set
    mse = mean_squared_error(ytest,ypred_test)
    rmse = mse**(1/2)
    mae = mean_absolute_error(ytest,ypred_test)
    r2 = r2_score(ytest,ypred_test)
    return mse,rmse,mae,r2
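For reference, the four metrics returned by `eval_model` can be computed by hand. The sketch below uses plain Python on a few made-up actual and predicted ratings (the numbers are assumptions chosen to keep the arithmetic easy to follow) and mirrors the standard definitions that the scikit-learn functions implement.

```python
from math import sqrt

# Hypothetical actual and predicted ratings
y_true = [7.0, 5.0, 6.0, 4.0]
y_pred = [6.5, 5.5, 6.0, 5.0]
n = len(y_true)

# Residuals: actual minus predicted
errors = [t - p for t, p in zip(y_true, y_pred)]

# MSE: mean of squared residuals; RMSE: its square root (same units as Rating)
mse = sum(e ** 2 for e in errors) / n
rmse = sqrt(mse)

# MAE: mean of absolute residuals
mae = sum(abs(e) for e in errors) / n

# R2: 1 - (residual sum of squares / total sum of squares around the mean)
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MSE: {mse}\nRMSE: {rmse}\nMAE: {mae}\nR2: {r2}")
```

Lower MSE, RMSE and MAE are better, while R2 closer to 1 is better, which is how the two models are compared below.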

Random Forest Evaluation Metrics

In [59]:
(MSE,RMSE,MAE,r2) = eval_model(best_rfr,xtrain,ytrain)
print(f'Evaluation Metrics: \nMSE: {MSE}\nRMSE:{RMSE}\nMAE:{MAE}\nR2:{r2}')

Evaluation Metrics: 
MSE: 0.2697500948553083
RMSE:0.5193747152637567
MAE:0.290918996329441
R2:0.7457328285427991


XGBoost Evaluation Metrics

In [60]:
(MSE,RMSE,MAE,r2) = eval_model(best_xgb2,xtrain,ytrain)
print(f'Evaluation Metrics: \nMSE: {MSE}\nRMSE:{RMSE}\nMAE:{MAE}\nR2:{r2}')

Evaluation Metrics: 
MSE: 0.20703834624489967
RMSE:0.45501466596682316
MAE:0.2611911814786115
R2:0.8048450929698301


#### XGBoost gives the best test score of all the models, so I will use XGBoost for the final prediction
--------------------------------------------------------------------------------------

## Model Prediction

In [61]:
ypred_test = best_xgb2.predict(xtest)
ypred_test[:10]

array([7.0099277, 5.7735763, 4.096054 , 7.996697 , 5.9080467, 6.88522  ,
       6.0211606, 5.5722013, 5.8215423, 4.764713 ], dtype=float32)

In [62]:
ytest.head(10)

| | Rating |
|---|---|
| 4806 | 7.1 |
| 10573 | 5.8 |
| 3674 | 3.7 |
| 9179 | 8.0 |
| 6229 | 7.3 |
| 13993 | 6.9 |
| 7125 | 6.2 |
| 2874 | 5.8 |
| 7049 | 5.8 |
| 6739 | 4.9 |


In [63]:
# Work on a copy so that the prediction column is not added to xtest itself
df_final = xtest.copy()
df_final['Predicted_Rating'] = ypred_test
df_final

| | Year | Genre encoded | Votes encoded | Director encoded | Actor 1 encoded | Actor 2 encoded | Actor 3 encoded | Predicted_Rating |
|---|---|---|---|---|---|---|---|---|
| 4806 | 2015 | 5.4 | 7.1 | 6.4 | 5.5 | 6.2 | 5.7 | 7.009928 |
| 10573 | 1989 | 5.7 | 5.8 | 5.8 | 5.6 | 5.8 | 5.8 | 5.773576 |
| 3674 | 2000 | 5.9 | 3.7 | 4.9 | 5.2 | 5.8 | 5.8 | 4.096054 |
| 9179 | 1970 | 5.9 | 8.0 | 7.1 | 6.8 | 6.8 | 6.8 | 7.996697 |
| 6229 | 2003 | 5.7 | 6.1 | 5.7 | 5.1 | 5.6 | 6.9 | 5.908047 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5126 | 2007 | 5.8 | 5.7 | 5.6 | 5.7 | 5.8 | 5.6 | 5.541561 |
| 5934 | 1971 | 5.8 | 6.2 | 6.0 | 6.4 | 6.0 | 6.6 | 6.672915 |
| 15146 | 1972 | 5.3 | 5.7 | 5.8 | 5.3 | 6.1 | 6.0 | 5.360028 |
| 14280 | 2010 | 6.0 | 6.2 | 5.7 | 5.1 | 5.7 | 5.7 | 5.692091 |


Save the results to a CSV file

In [64]:
df_final.to_csv('Predicted Ratings.csv',index=False)