# Project 1

Project 1 DATA 612 - Semyon Toybis

This project requires building a basic recommender system to develop an understanding of baseline recommender techniques.

For this project, I will create a recommender that predicts movie ratings for viewers based on their ratings for prior movies.
Specifically, I will use a simple average of movie ratings and a simple average plus viewer and movie bias to predict ratings for movies by viewer.

This is a global baseline recommender. The core idea is to predict the rating a user would give to a movie they haven't seen in two ways: with the global average alone, and with the global average adjusted by the user's bias and the movie's bias. The purpose is to create a baseline against which other models can be compared. For example, if we build a more sophisticated model but find that it performs worse than the global baseline recommender, then we know that the more sophisticated model performs poorly. Interestingly, this simple model, which makes predictions from the average movie rating and the average user rating, was only 3% worse than Netflix's CineMatch algorithm, the algorithm that participants were seeking to beat in Netflix's $1,000,000 recommender challenge.
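
In symbols, the two predictors described above can be written as follows. The notation is my own shorthand and is an assumption, not something defined elsewhere in this notebook: `\mu` is the global (raw) average of the training ratings, `\bar{r}_u` is viewer u's average rating, `\bar{r}_i` is movie i's average rating, and `\hat{r}_{ui}` is the predicted rating of movie i by viewer u.

```latex
\hat{r}_{ui}^{\text{raw}} = \mu
\qquad
\hat{r}_{ui}^{\text{bias}} = \mu + b_u + b_i,
\quad b_u = \bar{r}_u - \mu,
\quad b_i = \bar{r}_i - \mu
```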

## Dataset

I will create a sample data set using randomly generated integers as my user-item matrix. I will create an 8x8 matrix, which I will then use to create train and test sets.

In [66]:
import numpy as np
import pandas as pd

In [67]:
np.random.seed(10)
random_ratings = np.random.randint(0,5, size = (8,8))

In [68]:
ratings_df = pd.DataFrame(random_ratings, columns = ['movie1',
                                                    'movie2',
                                                    'movie3',
                                                    'movie4',
                                                    'movie5',
                                                    'movie6',
                                                    'movie7',
                                                    'movie8'])

In [69]:
ratings_df

,movie1,movie2,movie3,movie4,movie5,movie6,movie7,movie8
0,1,4,0,1,3,4,1,0
1,1,2,0,1,0,2,0,4
2,3,0,4,3,0,3,2,1
3,0,4,1,3,3,1,4,1
4,4,1,1,4,3,2,0,3
5,4,2,0,1,2,0,0,3
6,1,3,4,1,4,2,0,0
7,4,4,0,0,2,4,2,0


This is the user-item matrix, which I will split into train and test sets. I will also replace zeros with NaNs to represent movies that a viewer did not watch. Thus, we will have a matrix that holds each viewer's rating for a movie and NaN for movies that a viewer did not watch.

In [70]:
ratings_df.replace(0, np.nan, inplace = True)

In [71]:
ratings_df_long = ratings_df.melt(ignore_index=False).reset_index()

In [72]:
ratings_df_long

,index,variable,value
0,0,movie1,1.0
1,1,movie1,1.0
2,2,movie1,3.0
3,3,movie1,
4,4,movie1,4.0
...,...,...,...
59,3,movie8,1.0
60,4,movie8,3.0
61,5,movie8,3.0
62,6,movie8,


In [73]:
ratings_df_long.shape

(64, 3)

Below I create train and test sets via an 80/20 split. I sample within each viewer by grouping the data frame by the index, which represents each viewer, and then check that both the train and test sets contain every viewer to avoid the "cold-start" problem.
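
The split-and-check step can be sketched on a small toy frame. The frame `toy_long` and its values are made up for illustration, as are the helper names. The last check is my own suggestion rather than something the cells below do: it looks only at non-missing ratings, because a viewer whose sampled training rows were all NaN would still show up in `train['index'].unique()`.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format frame: 3 viewers x 5 movies, NaN = not watched.
toy_long = pd.DataFrame({
    "index": np.repeat([0, 1, 2], 5),
    "variable": [f"movie{j}" for j in range(1, 6)] * 3,
    "value": [1.0, 2.0, np.nan, 4.0, 3.0,
              2.0, np.nan, 3.0, 1.0, 4.0,
              4.0, 4.0, 2.0, np.nan, 1.0],
})

# Sample 80% of each viewer's rows for training; the remainder is the test set.
toy_train = toy_long.groupby("index").sample(frac=0.8, random_state=10)
toy_test = toy_long.drop(toy_train.index)

all_viewers = set(toy_long["index"])

# The two sets should share no rows and should each contain every viewer.
no_overlap = toy_train.index.intersection(toy_test.index).empty
train_covers = set(toy_train["index"]) == all_viewers
test_covers = set(toy_test["index"]) == all_viewers

# Stricter check: viewers with at least one observed rating in train.
rated_in_train = set(toy_train.dropna(subset=["value"])["index"])

print(no_overlap, train_covers, test_covers, rated_in_train == all_viewers)
```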

In [74]:
train = ratings_df_long.groupby(['index']).sample(frac = 0.8,random_state=10)

In [75]:
test = ratings_df_long.drop(train.index)

In [76]:
train['index'].unique()

array([0, 1, 2, 3, 4, 5, 6, 7])

In [77]:
test['index'].unique()

array([1, 2, 4, 0, 6, 3, 5, 7])

In [78]:
train_user_item = train.pivot(index = 'index', columns = 'variable', values = 'value')

In [79]:
train_user_item

index,movie1,movie2,movie3,movie4,movie5,movie6,movie7,movie8
0,1.0,,,1.0,3.0,,1.0,
1,,2.0,,1.0,,,,4.0
2,,,,3.0,,3.0,2.0,1.0
3,,4.0,1.0,,3.0,1.0,4.0,
4,,1.0,1.0,4.0,,2.0,,3.0
5,4.0,2.0,,,2.0,,,3.0
6,1.0,,4.0,1.0,,2.0,,
7,4.0,4.0,,,2.0,,,


In [80]:
test_user_item = test.pivot(index = 'index', columns = 'variable', values = 'value')

In [81]:
test_user_item

index,movie1,movie2,movie3,movie4,movie5,movie6,movie7,movie8
0,,4.0,,,,4.0,,
1,1.0,,,,,2.0,,
2,3.0,,4.0,,,,,
3,,,,3.0,,,,1.0
4,4.0,,,,3.0,,,
5,,,,1.0,,,,
6,,3.0,,,4.0,,,
7,,,,,,4.0,2.0,


## Raw average

First, I will use the raw average to predict ratings in the training and test sets. Below I calculate the raw average from the training set.

In [82]:
raw_avg = np.nanmean(train_user_item)

In [83]:
raw_avg

np.float64(2.34375)

Next, I convert the data frames to long format, which makes it easier to add columns and perform operations on them.

In [84]:
train_user_item_long = train_user_item.melt(ignore_index=False).reset_index()

In [85]:
train_user_item_long.head()

,index,variable,value
0,0,movie1,1.0
1,1,movie1,
2,2,movie1,
3,3,movie1,
4,4,movie1,


I start by adding a column with the predicted value based on the raw average.

In [86]:
train_user_item_long['predicted_raw_avg'] = raw_avg

In [87]:
train_user_item_long

,index,variable,value,predicted_raw_avg
0,0,movie1,1.0,2.34375
1,1,movie1,,2.34375
2,2,movie1,,2.34375
3,3,movie1,,2.34375
4,4,movie1,,2.34375
...,...,...,...,...
59,3,movie8,,2.34375
60,4,movie8,3.0,2.34375
61,5,movie8,3.0,2.34375
62,6,movie8,,2.34375


Below, I add a column for the squared error, which takes the difference between the observed value and the average and squares it.

In [88]:
train_user_item_long['se_raw_avg'] = (train_user_item_long['value'] - train_user_item_long['predicted_raw_avg'])**2

In [89]:
train_user_item_long

,index,variable,value,predicted_raw_avg,se_raw_avg
0,0,movie1,1.0,2.34375,1.805664
1,1,movie1,,2.34375,
2,2,movie1,,2.34375,
3,3,movie1,,2.34375,
4,4,movie1,,2.34375,
...,...,...,...,...,...
59,3,movie8,,2.34375,
60,4,movie8,3.0,2.34375,0.430664
61,5,movie8,3.0,2.34375,0.430664
62,6,movie8,,2.34375,


Next, I calculate the RMSE which is the square root of the mean of the squared errors (the difference between the actual value and the predicted value, squared to avoid negative and positive values cancelling each other out when summing). The RMSE is a summary metric that tells us how accurate our model is: a higher RMSE means our model is less accurate (predicted values are far away from actual values) while a lower RMSE means our model is more accurate (predicted values are close to actual values). A perfect model would have an RMSE of zero.

It can be difficult to evaluate an RMSE on its own, which is why it is useful to have a baseline model to compare to. We can use the RMSE of the global baseline average model as a comparison point for more sophisticated models - if a more sophisticated model has a lower RMSE than the global baseline model, this means this model makes predictions that are closer to the actual values than the global baseline model.

An alternative measure would be the MAE, which measures the average absolute difference between predicted and actual ratings. RMSE penalizes larger errors more heavily than smaller ones, while MAE weights every error in proportion to its size. Thus, for this assignment we will use RMSE, as we want to penalize larger errors more heavily.
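
The difference between the two metrics can be illustrated with a small sketch. The ratings and predictions below are invented for illustration, and the helper functions `rmse` and `mae` are my own names, not part of this notebook. Both sets of predictions have the same MAE, but the one that concentrates its error in a single large miss has a higher RMSE.

```python
import math

def observed_pairs(actual, predicted):
    """Keep only the pairs where an actual rating exists (skip NaN)."""
    return [(a, p) for a, p in zip(actual, predicted) if not math.isnan(a)]

def rmse(actual, predicted):
    """Root mean squared error over the observed ratings."""
    pairs = observed_pairs(actual, predicted)
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

def mae(actual, predicted):
    """Mean absolute error over the observed ratings."""
    pairs = observed_pairs(actual, predicted)
    return sum(abs(a - p) for a, p in pairs) / len(pairs)

# Hypothetical ratings; the last movie was not watched.
actual = [3.0, 3.0, 3.0, 5.0, float("nan")]
pred_even = [2.0, 2.0, 2.0, 4.0, 2.0]   # every prediction is off by 1
pred_spiky = [3.0, 3.0, 3.0, 1.0, 2.0]  # one prediction is off by 4

print(mae(actual, pred_even), rmse(actual, pred_even))    # 1.0 1.0
print(mae(actual, pred_spiky), rmse(actual, pred_spiky))  # 1.0 2.0
```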

In [90]:
train_rmse_raw_avg = np.sqrt(np.mean(train_user_item_long['se_raw_avg']))

In [91]:
train_rmse_raw_avg

np.float64(1.1887329126006396)

I perform the same calculation for the test set.

In [92]:
test_user_item_long = test_user_item.melt(ignore_index=False).reset_index()

In [93]:
test_user_item_long['predicted_raw_avg'] = raw_avg

In [94]:
test_user_item_long

,index,variable,value,predicted_raw_avg
0,0,movie1,,2.34375
1,1,movie1,1.0,2.34375
2,2,movie1,3.0,2.34375
3,3,movie1,,2.34375
4,4,movie1,4.0,2.34375
...,...,...,...,...
59,3,movie8,1.0,2.34375
60,4,movie8,,2.34375
61,5,movie8,,2.34375
62,6,movie8,,2.34375


In [95]:
test_user_item_long['se_raw_avg'] = (test_user_item_long['value'] - test_user_item_long['predicted_raw_avg'])**2

In [96]:
test_user_item_long

,index,variable,value,predicted_raw_avg,se_raw_avg
0,0,movie1,,2.34375,
1,1,movie1,1.0,2.34375,1.805664
2,2,movie1,3.0,2.34375,0.430664
3,3,movie1,,2.34375,
4,4,movie1,4.0,2.34375,2.743164
...,...,...,...,...,...
59,3,movie8,1.0,2.34375,1.805664
60,4,movie8,,2.34375,
61,5,movie8,,2.34375,
62,6,movie8,,2.34375,


In [97]:
test_rmse_raw_avg = np.sqrt(np.mean(test_user_item_long['se_raw_avg']))

In [98]:
test_rmse_raw_avg

np.float64(1.260554400188002)

## Raw average with bias

Next, I calculate the viewer bias and the movie bias. Each bias is the difference between the average rating for that viewer or movie and the raw average (for example, viewer bias = the viewer's average rating - the raw average). These biases will be added to the raw average to generate predictions.

Incorporating viewer and movie biases can potentially improve our model, as it adds more information to the average. For example, if a viewer has a negative bias (the viewer's average is lower than the global average), the viewer is a harsh critic, and we may expect them to grade a movie more harshly than someone whose viewer bias is zero or positive (a lenient critic). This is relevant information for recommending items to viewers and may improve our predictions.

The same applies to movie biases. A movie with a negative bias (an average rating lower than the global average) was reviewed poorly by most viewers (i.e. it is a bad movie), and thus we can expect that a viewer who hasn't seen the movie may rate it poorly as well.

There is an interaction between viewer and movie bias as well. For example, a very lenient critic may rate a very poorly rated movie in line with the global average because the magnitude of the viewer's leniency cancels out the negativity associated with the movie. Incorporating these biases can lead to surprisingly good models - as mentioned earlier, Netflix's algorithm was only 3% better than a model that used the global average and incorporated biases.
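
As a compact sketch of the same idea, the biases and the bias-adjusted predictions can be computed directly on the wide training matrix with NumPy. This is an alternative to the step-by-step long-format approach used in the cells below, not a replacement for it. The array is copied from the pivoted training set shown earlier; the names `train_matrix`, `global_avg`, `viewer_bias_vec`, `movie_bias_vec`, and `baseline_pred` are my own.

```python
import numpy as np

nan = np.nan
# Training user-item matrix (rows = viewers 0-7, columns = movie1-movie8).
train_matrix = np.array([
    [1,   nan, nan, 1,   3,   nan, 1,   nan],
    [nan, 2,   nan, 1,   nan, nan, nan, 4],
    [nan, nan, nan, 3,   nan, 3,   2,   1],
    [nan, 4,   1,   nan, 3,   1,   4,   nan],
    [nan, 1,   1,   4,   nan, 2,   nan, 3],
    [4,   2,   nan, nan, 2,   nan, nan, 3],
    [1,   nan, 4,   1,   nan, 2,   nan, nan],
    [4,   4,   nan, nan, 2,   nan, nan, nan],
], dtype=float)

# Global average over all observed training ratings.
global_avg = np.nanmean(train_matrix)

# Bias = (row or column average) - global average, ignoring NaNs.
viewer_bias_vec = np.nanmean(train_matrix, axis=1) - global_avg
movie_bias_vec = np.nanmean(train_matrix, axis=0) - global_avg

# Broadcast to an 8x8 grid of predictions: global_avg + viewer bias + movie bias.
baseline_pred = global_avg + viewer_bias_vec[:, None] + movie_bias_vec[None, :]

print(global_avg)           # 2.34375
print(baseline_pred[0, 0])  # viewer 0, movie1: 2.34375 - 0.84375 + 0.15625 = 1.65625
```

Broadcasting produces a prediction for every viewer-movie pair at once, including the pairs that are missing from the training set.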


In [99]:
viewer_bias = train_user_item_long.groupby('index').mean(numeric_only=True)

In [100]:
viewer_bias

index,value,predicted_raw_avg,se_raw_avg
0,1.5,2.34375,1.461914
1,2.333333,2.34375,1.555664
2,2.25,2.34375,0.696289
3,2.6,2.34375,1.905664
4,2.2,2.34375,1.380664
5,2.75,2.34375,0.852539
6,2.0,2.34375,1.618164
7,3.333333,2.34375,1.868164


In [101]:
viewer_bias['viewer_bias'] = np.subtract(viewer_bias['value'],raw_avg)

In [102]:
viewer_bias

index,value,predicted_raw_avg,se_raw_avg,viewer_bias
0,1.5,2.34375,1.461914,-0.84375
1,2.333333,2.34375,1.555664,-0.010417
2,2.25,2.34375,0.696289,-0.09375
3,2.6,2.34375,1.905664,0.25625
4,2.2,2.34375,1.380664,-0.14375
5,2.75,2.34375,0.852539,0.40625
6,2.0,2.34375,1.618164,-0.34375
7,3.333333,2.34375,1.868164,0.989583


In [103]:
movie_bias = train_user_item_long.groupby('variable').mean(numeric_only=True)

In [104]:
movie_bias

variable,index,value,predicted_raw_avg,se_raw_avg
movie1,3.5,2.5,2.34375,2.274414
movie2,3.5,2.6,2.34375,1.505664
movie3,3.5,2.0,2.34375,2.118164
movie4,3.5,2.0,2.34375,1.718164
movie5,3.5,2.5,2.34375,0.274414
movie6,3.5,2.0,2.34375,0.618164
movie7,3.5,2.333333,2.34375,1.555664
movie8,3.5,2.75,2.34375,1.352539


In [105]:
movie_bias.drop('index', axis = 1, inplace = True)

In [106]:
movie_bias['movie_bias'] = np.subtract(movie_bias['value'],raw_avg)

In [107]:
movie_bias

variable,value,predicted_raw_avg,se_raw_avg,movie_bias
movie1,2.5,2.34375,2.274414,0.15625
movie2,2.6,2.34375,1.505664,0.25625
movie3,2.0,2.34375,2.118164,-0.34375
movie4,2.0,2.34375,1.718164,-0.34375
movie5,2.5,2.34375,0.274414,0.15625
movie6,2.0,2.34375,0.618164,-0.34375
movie7,2.333333,2.34375,1.555664,-0.010417
movie8,2.75,2.34375,1.352539,0.40625


Next, I merge the viewer bias and movie bias values into the training and test data frames.

In [108]:
train_user_item_long

,index,variable,value,predicted_raw_avg,se_raw_avg
0,0,movie1,1.0,2.34375,1.805664
1,1,movie1,,2.34375,
2,2,movie1,,2.34375,
3,3,movie1,,2.34375,
4,4,movie1,,2.34375,
...,...,...,...,...,...
59,3,movie8,,2.34375,
60,4,movie8,3.0,2.34375,0.430664
61,5,movie8,3.0,2.34375,0.430664
62,6,movie8,,2.34375,


In [109]:
viewer_bias.reset_index(inplace = True)

In [110]:
movie_bias.reset_index(inplace = True)

In [111]:
viewer_bias

,index,value,predicted_raw_avg,se_raw_avg,viewer_bias
0,0,1.5,2.34375,1.461914,-0.84375
1,1,2.333333,2.34375,1.555664,-0.010417
2,2,2.25,2.34375,0.696289,-0.09375
3,3,2.6,2.34375,1.905664,0.25625
4,4,2.2,2.34375,1.380664,-0.14375
5,5,2.75,2.34375,0.852539,0.40625
6,6,2.0,2.34375,1.618164,-0.34375
7,7,3.333333,2.34375,1.868164,0.989583


In [112]:
movie_bias

,variable,value,predicted_raw_avg,se_raw_avg,movie_bias
0,movie1,2.5,2.34375,2.274414,0.15625
1,movie2,2.6,2.34375,1.505664,0.25625
2,movie3,2.0,2.34375,2.118164,-0.34375
3,movie4,2.0,2.34375,1.718164,-0.34375
4,movie5,2.5,2.34375,0.274414,0.15625
5,movie6,2.0,2.34375,0.618164,-0.34375
6,movie7,2.333333,2.34375,1.555664,-0.010417
7,movie8,2.75,2.34375,1.352539,0.40625


In [113]:
train_user_item_long = pd.merge(train_user_item_long,viewer_bias[['index','viewer_bias']],on='index', how='left')
train_user_item_long = pd.merge(train_user_item_long,movie_bias[['variable','movie_bias']],on='variable', how='left')

In [114]:
train_user_item_long

,index,variable,value,predicted_raw_avg,se_raw_avg,viewer_bias,movie_bias
0,0,movie1,1.0,2.34375,1.805664,-0.843750,0.15625
1,1,movie1,,2.34375,,-0.010417,0.15625
2,2,movie1,,2.34375,,-0.093750,0.15625
3,3,movie1,,2.34375,,0.256250,0.15625
4,4,movie1,,2.34375,,-0.143750,0.15625
...,...,...,...,...,...,...,...
59,3,movie8,,2.34375,,0.256250,0.40625
60,4,movie8,3.0,2.34375,0.430664,-0.143750,0.40625
61,5,movie8,3.0,2.34375,0.430664,0.406250,0.40625
62,6,movie8,,2.34375,,-0.343750,0.40625


In [115]:
test_user_item_long = pd.merge(test_user_item_long,viewer_bias[['index','viewer_bias']],on='index', how='left')
test_user_item_long = pd.merge(test_user_item_long,movie_bias[['variable','movie_bias']],on='variable', how='left')

In [116]:
test_user_item_long

,index,variable,value,predicted_raw_avg,se_raw_avg,viewer_bias,movie_bias
0,0,movie1,,2.34375,,-0.843750,0.15625
1,1,movie1,1.0,2.34375,1.805664,-0.010417,0.15625
2,2,movie1,3.0,2.34375,0.430664,-0.093750,0.15625
3,3,movie1,,2.34375,,0.256250,0.15625
4,4,movie1,4.0,2.34375,2.743164,-0.143750,0.15625
...,...,...,...,...,...,...,...
59,3,movie8,1.0,2.34375,1.805664,0.256250,0.40625
60,4,movie8,,2.34375,,-0.143750,0.40625
61,5,movie8,,2.34375,,0.406250,0.40625
62,6,movie8,,2.34375,,-0.343750,0.40625


Next, I calculate the predicted value when incorporating viewer and movie bias.

In [117]:
train_user_item_long['predicted_raw_avg_bias'] = train_user_item_long['predicted_raw_avg'] + train_user_item_long['viewer_bias'] + train_user_item_long['movie_bias']

In [118]:
train_user_item_long

,index,variable,value,predicted_raw_avg,se_raw_avg,viewer_bias,movie_bias,predicted_raw_avg_bias
0,0,movie1,1.0,2.34375,1.805664,-0.843750,0.15625,1.656250
1,1,movie1,,2.34375,,-0.010417,0.15625,2.489583
2,2,movie1,,2.34375,,-0.093750,0.15625,2.406250
3,3,movie1,,2.34375,,0.256250,0.15625,2.756250
4,4,movie1,,2.34375,,-0.143750,0.15625,2.356250
...,...,...,...,...,...,...,...,...
59,3,movie8,,2.34375,,0.256250,0.40625,3.006250
60,4,movie8,3.0,2.34375,0.430664,-0.143750,0.40625,2.606250
61,5,movie8,3.0,2.34375,0.430664,0.406250,0.40625,3.156250
62,6,movie8,,2.34375,,-0.343750,0.40625,2.406250


Now I calculate the squared errors by taking the difference between the observed value and the sum of the average and movie and viewer biases.

In [119]:
train_user_item_long['se_raw_avg_with_bias'] = (train_user_item_long['value'] - train_user_item_long['predicted_raw_avg_bias'])**2

In [120]:
train_user_item_long

,index,variable,value,predicted_raw_avg,se_raw_avg,viewer_bias,movie_bias,predicted_raw_avg_bias,se_raw_avg_with_bias
0,0,movie1,1.0,2.34375,1.805664,-0.843750,0.15625,1.656250,0.430664
1,1,movie1,,2.34375,,-0.010417,0.15625,2.489583,
2,2,movie1,,2.34375,,-0.093750,0.15625,2.406250,
3,3,movie1,,2.34375,,0.256250,0.15625,2.756250,
4,4,movie1,,2.34375,,-0.143750,0.15625,2.356250,
...,...,...,...,...,...,...,...,...,...
59,3,movie8,,2.34375,,0.256250,0.40625,3.006250,
60,4,movie8,3.0,2.34375,0.430664,-0.143750,0.40625,2.606250,0.155039
61,5,movie8,3.0,2.34375,0.430664,0.406250,0.40625,3.156250,0.024414
62,6,movie8,,2.34375,,-0.343750,0.40625,2.406250,


In [121]:
train_rmse_raw_avg_with_bias = np.sqrt(np.mean(train_user_item_long['se_raw_avg_with_bias']))

In [122]:
train_rmse_raw_avg_with_bias

np.float64(1.0872664979203581)

In [123]:
test_user_item_long['predicted_raw_avg_bias'] = test_user_item_long['predicted_raw_avg'] + test_user_item_long['viewer_bias'] + test_user_item_long['movie_bias']

In [124]:
test_user_item_long

,index,variable,value,predicted_raw_avg,se_raw_avg,viewer_bias,movie_bias,predicted_raw_avg_bias
0,0,movie1,,2.34375,,-0.843750,0.15625,1.656250
1,1,movie1,1.0,2.34375,1.805664,-0.010417,0.15625,2.489583
2,2,movie1,3.0,2.34375,0.430664,-0.093750,0.15625,2.406250
3,3,movie1,,2.34375,,0.256250,0.15625,2.756250
4,4,movie1,4.0,2.34375,2.743164,-0.143750,0.15625,2.356250
...,...,...,...,...,...,...,...,...
59,3,movie8,1.0,2.34375,1.805664,0.256250,0.40625,3.006250
60,4,movie8,,2.34375,,-0.143750,0.40625,2.606250
61,5,movie8,,2.34375,,0.406250,0.40625,3.156250
62,6,movie8,,2.34375,,-0.343750,0.40625,2.406250


In [125]:
test_user_item_long['se_raw_avg_with_bias'] = (test_user_item_long['value'] - test_user_item_long['predicted_raw_avg_bias'])**2

In [126]:
test_user_item_long

,index,variable,value,predicted_raw_avg,se_raw_avg,viewer_bias,movie_bias,predicted_raw_avg_bias,se_raw_avg_with_bias
0,0,movie1,,2.34375,,-0.843750,0.15625,1.656250,
1,1,movie1,1.0,2.34375,1.805664,-0.010417,0.15625,2.489583,2.218859
2,2,movie1,3.0,2.34375,0.430664,-0.093750,0.15625,2.406250,0.352539
3,3,movie1,,2.34375,,0.256250,0.15625,2.756250,
4,4,movie1,4.0,2.34375,2.743164,-0.143750,0.15625,2.356250,2.701914
...,...,...,...,...,...,...,...,...,...
59,3,movie8,1.0,2.34375,1.805664,0.256250,0.40625,3.006250,4.025039
60,4,movie8,,2.34375,,-0.143750,0.40625,2.606250,
61,5,movie8,,2.34375,,0.406250,0.40625,3.156250,
62,6,movie8,,2.34375,,-0.343750,0.40625,2.406250,


In [127]:
test_rmse_raw_avg_with_bias = np.sqrt(np.mean(test_user_item_long['se_raw_avg_with_bias']))

In [128]:
test_rmse_raw_avg_with_bias

np.float64(1.5593229737851213)

## Summary

Below I create a data frame to compare the RMSE values.

In [129]:
summary_list = [['train',train_rmse_raw_avg,train_rmse_raw_avg_with_bias],
                ['test',test_rmse_raw_avg,test_rmse_raw_avg_with_bias]]

summary_df = pd.DataFrame(summary_list, columns = ['Dataset','Raw_Avg','Raw_Avg_with_Bias'])

In [130]:
summary_df

,Dataset,Raw_Avg,Raw_Avg_with_Bias
0,train,1.188733,1.087266
1,test,1.260554,1.559323


The RMSE for the training set was lower when incorporating bias than when using the global average alone (1.087 versus 1.189); however, the RMSE for the test set was higher (1.559 versus 1.261). While incorporating bias is usually expected to improve predictions, the result depends on the data, and this small, randomly generated dataset is an exception. Because the ratings are random, they contain no genuine viewer or movie tendencies, so biases estimated from a handful of training ratings mostly capture noise that does not carry over to the test set. Randomly generated ratings are also unlikely to resemble how real viewers would rate movies.
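
To put the comparison in relative terms, a short sketch can compute the percentage change in RMSE from adding biases. The RMSE values are copied from the summary table above; the dictionary names `rmse_values` and `pct_change` are my own.

```python
# RMSE values copied from the summary table (rounded to six decimals there).
rmse_values = {
    "train": {"raw_avg": 1.188733, "with_bias": 1.087266},
    "test": {"raw_avg": 1.260554, "with_bias": 1.559323},
}

# Percentage change in RMSE when biases are added (negative = improvement).
pct_change = {
    split: 100 * (vals["with_bias"] - vals["raw_avg"]) / vals["raw_avg"]
    for split, vals in rmse_values.items()
}

for split, change in pct_change.items():
    print(f"{split}: {change:+.1f}%")
# train: -8.5%
# test: +23.7%
```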

There are methods available to improve the global baseline model with biases. One could incorporate regularization to shrink the bias of movies or users that have fewer ratings. For example, the global baseline model may make inaccurate predictions when a movie has only one rating (e.g. one viewer rated it a 5, but a larger sample might result in a lower bias), and the same applies to a viewer (e.g. a viewer who rated only one movie, as a 5, is not necessarily a lenient critic - a larger sample for that viewer might result in a lower bias as well). Incorporating regularization involves weighting the bias, as in the formula below:

bias = (mean - global mean) * (number of ratings/ (number of ratings + regularization parameter))

This bias would be calculated for viewers and movies and added to the global average. Viewers or movies with fewer ratings would have their bias shrunk more strongly toward zero than viewers or movies with more ratings.
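
A minimal sketch of the regularized bias formula above follows. The function name `shrunk_bias` and the regularization value of 5 are assumptions chosen for illustration; in practice the regularization parameter would need to be tuned, for example on held-out data. The global mean is the raw average computed earlier in this notebook.

```python
def shrunk_bias(ratings, global_mean, reg):
    """Regularized bias: (mean - global mean) * n / (n + reg)."""
    n = len(ratings)
    mean_rating = sum(ratings) / n
    return (mean_rating - global_mean) * n / (n + reg)

global_mean = 2.34375  # raw average from the training set
reg = 5                # hypothetical regularization parameter

# A movie with a single rating of 5 versus a movie with twenty ratings of 5.
one_rating = shrunk_bias([5], global_mean, reg)
many_ratings = shrunk_bias([5] * 20, global_mean, reg)
unshrunk = shrunk_bias([5], global_mean, 0)  # reg = 0 recovers the plain bias

print(round(one_rating, 4))    # 0.4427
print(round(many_ratings, 4))  # 2.125
print(unshrunk)                # 2.65625
```

Both movies have the same unregularized bias (2.65625), but the movie with a single rating is pulled much closer to zero because there is less evidence behind it.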