# Prediction of Engagement Score

## Problem Statement

ABC is an online content-sharing platform where users create, upload, and share videos. It hosts videos from different genres such as entertainment, education, sports, and technology. The maximum duration of a video is 10 minutes.
Users can like, comment on, and share the videos on the platform.
Based on a user's interaction with a video, an engagement score is assigned to that video with respect to that user. The engagement score measures how engaging the content of the video is.
Understanding the engagement score of a video helps improve users' interaction with the platform: it indicates which type of content appeals to a user and engages a larger audience.


### Steps

Based on the POC, the following will be the final method:

1. Train SVD on a subset of the data.
2. Get the initial predictions on the remaining records.
3. Calculate the error (initial prediction minus actual score).
4. Train a regressor model on the error.
5. Calculate the final prediction by subtracting the error estimate from the initial prediction.
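
The steps above can be sketched end to end on toy data. This is a minimal illustration, not the notebook's pipeline: a constant baseline stands in for the SVD, a one-feature least-squares line (`fit_line`, a hypothetical helper) stands in for the regressor, and the data is made up. The error is taken as initial prediction minus actual score, the convention used later in this notebook, so the correction in step 5 is a subtraction.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (stand-in for the error regressor)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

# Toy data (hypothetical): one feature per record and the true score.
features = [1.0, 2.0, 3.0, 4.0]
actual = [1.5, 2.0, 2.5, 3.0]

# Steps 1-2: a constant baseline stands in for the SVD's initial predictions.
initial = [mean(actual)] * len(actual)

# Step 3: error = initial prediction - actual score.
errors = [p - y for p, y in zip(initial, actual)]

# Step 4: fit the regressor on the errors.
a, b = fit_line(features, errors)

# Step 5: remove the estimated error from the initial prediction.
final = [p - (a + b * x) for p, x in zip(initial, features)]

for p, f, y in zip(initial, final, actual):
    print(f"initial={p:.2f}  final={f:.2f}  actual={y:.2f}")
```

In this toy case the error is exactly linear in the feature, so the second stage recovers the true scores; on real data it can only remove the part of the error that the features explain.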

**Final Questions to Answer**

1. On what percentage of the total records should the SVD be trained? 
2. What should be the hyperparameters of the SVD and the regressors? The SVD hyperparameters were already settled in the POC.
3. While training the regressors, should we include category id? If yes, which encoder should be used?
4. Can we use weighted sums to get the final prediction, rather than the simple sum?
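
For question 4, one possible way to pick a weight (not the approach this notebook ends up using) is to fit it by least squares on held-out records. The sketch below assumes the final prediction has the form `initial - w * error_estimate`; `best_weight` and the toy numbers are hypothetical.

```python
def best_weight(initial, error_estimates, actual):
    """Least-squares weight w minimising sum((initial - w*err_est - actual)^2)."""
    num = sum(e * (p - y) for p, e, y in zip(initial, error_estimates, actual))
    den = sum(e * e for e in error_estimates)
    if den == 0:
        return 0.0  # no error signal, so leave the initial prediction untouched
    return num / den

# Toy held-out data (made up): the regressor overestimates every error by 2x.
initial = [3.0, 4.0, 2.0]
error_estimates = [1.0, 2.0, 0.0]
actual = [2.5, 3.0, 2.0]

w = best_weight(initial, error_estimates, actual)
final = [p - w * e for p, e in zip(initial, error_estimates)]
print(w, final)
```

A weight fitted this way should be estimated on records that neither the SVD nor the error regressor was trained on; otherwise it will look better than it is.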

In [76]:
# MANY UNUSED IMPORTS. CLEAN UP
import pandas as pd
import numpy as np
import seaborn as sb
%matplotlib inline

from sklearn.metrics import r2_score, accuracy_score, mean_squared_error
from category_encoders.leave_one_out import LeaveOneOutEncoder
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from category_encoders import CatBoostEncoder

from surprise.prediction_algorithms import SVD as SVD_sp
from surprise import Reader as Reader_sp
from surprise import Dataset as Dataset_sp
from surprise.model_selection import train_test_split as train_test_split_sp
from surprise.accuracy import rmse as rmse_sp
from surprise.model_selection import GridSearchCV as GridSearchCV_sp
from sklearn.linear_model import LinearRegression, Lasso, Ridge

In [6]:
df = pd.read_csv('train_0OECtn8.csv')

# General Hyperparameters

In [348]:
# Fraction of the data on which the SVD is trained. The regressors are trained on the remaining 1 - svd_subset.
svd_subset = 0.95

# SVD

In [349]:
# Convert the dataset into one that works with Surprise
df_surprise = df[['user_id', 'video_id', 'engagement_score']].copy()
df_surprise = df_surprise.sample(frac=1).reset_index(drop=True)
reader = Reader_sp(rating_scale=(0, 5))
df_surprise = Dataset_sp.load_from_df(df_surprise, reader=reader)
train_sp, test_sp = train_test_split_sp(df_surprise, train_size=svd_subset, test_size=1-svd_subset)


In [350]:
# Train on the train set
svd = SVD_sp(n_factors=100, n_epochs=500, lr_all=0.05)
svd.fit(train_sp)


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fdb3d08d7c0>

In [351]:
# Get the initial estimates for the test set
preds_sp = svd.test(test_sp)
y_pred = list(map(lambda x: x.est, preds_sp))
y_true = list(map(lambda x: x.r_ui, preds_sp))

In [352]:
r2_score(y_true, y_pred)

0.4655336895459784

# Error Prediction

We train the regressor model on the held-out split (`test_sp`) created earlier, i.e. the records the SVD was not trained on.

In [353]:
# Filter out only those records from the source dataframe that are in test_sp
test_sp = pd.DataFrame(test_sp, columns=['user_id', 'video_id', 'engagement_score'])
# Attach the estimates before merging: an inner merge follows the row order of df,
# not of test_sp, so assigning y_pred afterwards would misalign the rows.
test_sp['initial_estimate'] = y_pred
df_v2 = df.merge(test_sp, on=['user_id', 'video_id', 'engagement_score'], how='inner')

In [354]:
# Add the error to the new train set (the initial estimates came along with the merge)
df_v2['error'] = df_v2['initial_estimate'] - df_v2['engagement_score']

Because the error is defined as the initial estimate minus the actual score, we **subtract** the error from the initial estimate to get the correct value.
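
Written out, with symbols introduced here only for illustration ($\hat{y}_0$ for the initial estimate, $y$ for the actual engagement score, $e$ for the error, and $\hat{e}$ for the regressor's estimate of it):

$$
e = \hat{y}_0 - y \quad\Longrightarrow\quad y = \hat{y}_0 - e, \qquad \hat{y} = \hat{y}_0 - \hat{e}
$$

For example, if the initial estimate is 3.5 and the actual score is 3.0, the error is 0.5, and subtracting it from the initial estimate returns 3.0.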

In [356]:
# Drop row id and engagement score, cannot use those
df_v2 = df_v2.drop(['row_id', 'engagement_score'], axis=1)

In [357]:
# Dummy columns for gender and profession
df_v2 = pd.get_dummies(df_v2, columns=['gender', 'profession'], drop_first=True)

In [358]:
# Split df_v2 into features (X) and the target (Y), which is the error
X = df_v2.drop(['error'], axis=1)
Y = df_v2['error'].values

Decided against using the category id in the training set, as no encoding method worked well. *Perhaps I need to study encoding methods in more detail.*
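
For reference, the idea behind leave-one-out target encoding (one of the encoders imported at the top) can be sketched in plain Python. This is a simplified illustration, not the `category_encoders` implementation: `leave_one_out_encode` is a hypothetical helper, the data is made up, and falling back to the global mean for categories seen only once is an assumption of this sketch.

```python
from collections import defaultdict

def leave_one_out_encode(categories, targets):
    """Replace each category with the mean target of the *other* rows in that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    global_mean = sum(targets) / len(targets)

    encoded = []
    for c, t in zip(categories, targets):
        if counts[c] > 1:
            # Exclude the row's own target to limit leakage.
            encoded.append((sums[c] - t) / (counts[c] - 1))
        else:
            # Singleton category: no other rows to average (assumed fallback).
            encoded.append(global_mean)
    return encoded

# Toy example: category ids and the error each row should help predict.
category_id = [5, 5, 5, 8]
error = [0.2, 0.4, 0.6, -0.4]
print(leave_one_out_encode(category_id, error))
```

Excluding the row's own target is what separates this from plain mean encoding; with a high-cardinality column and few rows per category, the encoded values remain noisy, which may be one reason the encoders did not help here.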

In [359]:
# # Encode the other categorical variables
# cbe = CatBoostEncoder(cols=['category_id', 'video_id', 'user_id'])
# cbe.fit(X, Y)
# X = cbe.transform(X)

# Drop categorical variables
X = X.drop(['user_id', 'category_id', 'video_id'], axis=1)

**Train a Few Models** - Finally going with Linear Regression.

In [360]:
# # Train RF
# rf = RandomForestRegressor(n_estimators=500)
# rf.fit(X, Y)

RandomForestRegressor(n_estimators=500)

In [361]:
# # Train XGBoost
# xgb = XGBRegressor(n_estimators = 100)
# xgb.fit(X, Y)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.300000012,
             max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=100, n_jobs=12,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)

Use some basic error estimators

In [363]:
lr = LinearRegression(fit_intercept=True)
lr.fit(X, Y)

LinearRegression()

# Join All

Run the entire pipeline on the test set


In [364]:
df_test = pd.read_csv('test_1zqHu22.csv')

In [365]:
df_test_predictions = df_test.drop(['row_id'], axis=1)

In [366]:
df_test_predictions = pd.get_dummies(df_test_predictions, columns=['gender', 'profession'], drop_first=True)

In [367]:
# Make the predictions with SVD
initial_predictions = [
    svd.predict(user_id, video_id).est
    for user_id, video_id in zip(df_test_predictions['user_id'], df_test_predictions['video_id'])
]

df_test_predictions['initial_estimate'] = initial_predictions

In [369]:
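# Get the error estimate using the regressor.
df_test_predictions = df_test_predictions.drop(['user_id', 'category_id', 'video_id'], axis=1)
# Reorder the columns to match the training features: here 'initial_estimate' was
# added after get_dummies, so the column order differs from X.
error_predictions = lr.predict(df_test_predictions[X.columns])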
df_test_predictions['error_estimate'] = error_predictions

The final engagement score is the initial estimate minus a weighted error estimate. The weight is `1-svd_subset`.

Currently, `svd_subset` is set to 0.95, so the error estimate gets a weight of only 5%.

In [372]:
df_test_predictions['engagement_score'] = df_test_predictions['initial_estimate'] - ((1-svd_subset)*(df_test_predictions['error_estimate']))

In [373]:
df_test['engagement_score'] = df_test_predictions['engagement_score'].values

In [374]:
# In case the engagement score crosses 5 or drops below 0, clip it back to the threshold.
df_test['engagement_score'] = df_test['engagement_score'].clip(lower=0, upper=5)
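
The weighting and thresholding steps can be checked on single values. A minimal sketch, with `corrected_score` as a hypothetical helper and made-up inputs:

```python
def corrected_score(initial_estimate, error_estimate, weight, low=0.0, high=5.0):
    """Subtract the weighted error estimate, then clamp the result to [low, high]."""
    score = initial_estimate - weight * error_estimate
    return max(low, min(high, score))

svd_subset = 0.95
weight = 1 - svd_subset  # roughly 0.05

print(corrected_score(3.2, 0.4, weight))    # small downward correction
print(corrected_score(4.99, -1.0, weight))  # would exceed 5 before clamping
print(corrected_score(0.01, 1.0, weight))   # would drop below 0 before clamping
```

With a 5% weight the correction is small, so the final score stays close to the SVD's initial estimate.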

In [382]:
df_test[['row_id', 'engagement_score']].to_csv('final_submission.csv', index=False)