# Project 1

Project 1 DATA 612 - Semyon Toybis

This project requires building a basic recommender system and testing baseline recommender techniques for prediction.
For this project, I will create a recommender that recommends movies to viewers based on their ratings of prior movies.
Specifically, I will predict each viewer's rating for a movie in two ways: with a simple average of all movie ratings, and with that average plus a viewer bias and a movie bias.
This is a baseline recommender model against which more advanced techniques can be compared to see whether they produce better recommendations.

## Dataset

I will create a sample data set by randomly generating integers from 0 to 4 and using them as my user-item matrix. The matrix is 8x8 (eight viewers by eight movies), and I will then split it into train and test sets.

In [1]:
import numpy as np
import pandas as pd

In [2]:
np.random.seed(10)
random_ratings = np.random.randint(0,5, size = (8,8))

In [3]:
ratings_df = pd.DataFrame(random_ratings, columns = ['movie1',
                                                    'movie2',
                                                    'movie3',
                                                    'movie4',
                                                    'movie5',
                                                    'movie6',
                                                    'movie7',
                                                    'movie8'])

In [4]:
ratings_df

Unnamed: 0,movie1,movie2,movie3,movie4,movie5,movie6,movie7,movie8
0,1,4,0,1,3,4,1,0
1,1,2,0,1,0,2,0,4
2,3,0,4,3,0,3,2,1
3,0,4,1,3,3,1,4,1
4,4,1,1,4,3,2,0,3
5,4,2,0,1,2,0,0,3
6,1,3,4,1,4,2,0,0
7,4,4,0,0,2,4,2,0


This is the user-item matrix, which I will split into train and test sets. First, I replace zeros with NaN to mark movies that a viewer did not watch.

In [5]:
ratings_df.replace(0, np.nan, inplace = True)
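The reason for using NaN rather than leaving the zeros in place can be seen in a small sketch. The three ratings below are hypothetical and are not taken from the matrix above; the point is only that a zero is averaged like a real rating while NaN can be skipped.

```python
import numpy as np

# Hypothetical ratings for one viewer: two watched movies, one unwatched (coded 0).
ratings_with_zero = np.array([4.0, 0.0, 2.0])
ratings_with_nan = np.where(ratings_with_zero == 0, np.nan, ratings_with_zero)

mean_with_zero = ratings_with_zero.mean()     # treats the unwatched movie as a rating of 0
mean_with_nan = np.nanmean(ratings_with_nan)  # ignores the unwatched movie

print(mean_with_zero, mean_with_nan)  # 2.0 3.0
```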

In [6]:
ratings_df_long = ratings_df.melt(ignore_index=False).reset_index()

In [7]:
ratings_df_long

Unnamed: 0,index,variable,value
0,0,movie1,1.0
1,1,movie1,1.0
2,2,movie1,3.0
3,3,movie1,
4,4,movie1,4.0
...,...,...,...
59,3,movie8,1.0
60,4,movie8,3.0
61,5,movie8,3.0
62,6,movie8,


In [8]:
ratings_df_long.shape

(64, 3)

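Below I create train and test sets via an 80/20 split, sampled within each viewer. Because each viewer has eight rows, `frac = 0.8` rounds to six training rows and two test rows per viewer. I also check that the train and test sets each contain every viewer to avoid the "cold-start" problem.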

In [9]:
train = ratings_df_long.groupby(['index']).sample(frac = 0.8,random_state=10)

In [10]:
test = ratings_df_long.drop(train.index)

In [11]:
train['index'].unique()

array([0, 1, 2, 3, 4, 5, 6, 7])

In [12]:
test['index'].unique()

array([1, 2, 4, 0, 6, 3, 5, 7])
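The `unique()` check above counts a viewer as present even if all of that viewer's sampled rows are NaN. A slightly stricter check, sketched below on hypothetical data, only counts viewers with at least one non-missing rating. The frame `ratings_long` and the helper `viewers_with_ratings` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format ratings: 2 viewers x 4 movies, NaN = not watched.
ratings_long = pd.DataFrame({
    "index": [0, 1] * 4,
    "variable": ["movie1", "movie1", "movie2", "movie2",
                 "movie3", "movie3", "movie4", "movie4"],
    "value": [1.0, np.nan, 4.0, 2.0, 3.0, 3.0, np.nan, 1.0],
})

# Sample half of each viewer's rows for training; the rest become the test set.
train = ratings_long.groupby("index").sample(frac=0.5, random_state=0)
test = ratings_long.drop(train.index)

def viewers_with_ratings(df):
    """Return the viewers that have at least one non-missing rating in df."""
    return set(df.loc[df["value"].notna(), "index"])

all_viewers = set(ratings_long["index"])
missing_from_train = all_viewers - viewers_with_ratings(train)
missing_from_test = all_viewers - viewers_with_ratings(test)
print(missing_from_train, missing_from_test)
```

An empty set for both means every viewer has at least one real rating on each side of the split.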

In [13]:
train_user_item = train.pivot(index = 'index', columns = 'variable', values = 'value')

In [14]:
train_user_item

index,movie1,movie2,movie3,movie4,movie5,movie6,movie7,movie8
0,1.0,,,1.0,3.0,,1.0,
1,,2.0,,1.0,,,,4.0
2,,,,3.0,,3.0,2.0,1.0
3,,4.0,1.0,,3.0,1.0,4.0,
4,,1.0,1.0,4.0,,2.0,,3.0
5,4.0,2.0,,,2.0,,,3.0
6,1.0,,4.0,1.0,,2.0,,
7,4.0,4.0,,,2.0,,,


In [15]:
test_user_item = test.pivot(index = 'index', columns = 'variable', values = 'value')

In [16]:
test_user_item

index,movie1,movie2,movie3,movie4,movie5,movie6,movie7,movie8
0,,4.0,,,,4.0,,
1,1.0,,,,,2.0,,
2,3.0,,4.0,,,,,
3,,,,3.0,,,,1.0
4,4.0,,,,3.0,,,
5,,,,1.0,,,,
6,,3.0,,,4.0,,,
7,,,,,,4.0,2.0,


## Raw average

First, I will use the raw average as the prediction for every rating in the training and test sets. Below I calculate the raw average from the training set.

In [17]:
raw_avg = np.nanmean(train_user_item)

In [18]:
raw_avg

np.float64(2.34375)

Next, I convert the data frames to long format, which makes it easier to add columns and perform column-wise operations.

In [19]:
train_user_item_long = train_user_item.melt(ignore_index=False).reset_index()

In [20]:
train_user_item_long.head()

Unnamed: 0,index,variable,value
0,0,movie1,1.0
1,1,movie1,
2,2,movie1,
3,3,movie1,
4,4,movie1,
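For readers unfamiliar with the wide-to-long conversion, here is a minimal sketch on a hypothetical 2x2 matrix (the frame `wide` is made up for illustration). Each viewer-movie cell becomes one row, so the long frame has viewers times movies rows.

```python
import pandas as pd

# Hypothetical wide matrix: rows are viewers, columns are movies.
wide = pd.DataFrame({"movie1": [1.0, 3.0], "movie2": [4.0, None]})

# ignore_index=False keeps the viewer id, which reset_index turns into a column.
long_df = wide.melt(ignore_index=False).reset_index()

print(long_df)
print(long_df.shape)  # (4, 3): 2 viewers x 2 movies, columns index/variable/value
```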


Below, I add a column for the squared error: the difference between the observed value and the raw average, squared.

In [21]:
train_user_item_long['se_raw_avg'] = (np.subtract(train_user_item_long['value'],raw_avg))**2

In [22]:
train_user_item_long

Unnamed: 0,index,variable,value,se_raw_avg
0,0,movie1,1.0,1.805664
1,1,movie1,,
2,2,movie1,,
3,3,movie1,,
4,4,movie1,,
...,...,...,...,...
59,3,movie8,,
60,4,movie8,3.0,0.430664
61,5,movie8,3.0,0.430664
62,6,movie8,,


Next, I calculate the RMSE, which is the square root of the mean of the squared errors. Missing ratings are skipped when the mean is taken.

In [23]:
train_rmse_raw_avg = np.sqrt(np.mean(train_user_item_long['se_raw_avg']))

In [24]:
train_rmse_raw_avg

np.float64(1.1887329126006396)
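The same calculation can be wrapped in a small helper that makes the handling of missing ratings explicit. This is a sketch with hypothetical data; the function name `rmse` is my own and is not used elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

def rmse(observed, predicted):
    """Root mean squared error over the entries where a rating was observed."""
    squared_error = (np.asarray(observed, dtype=float) - predicted) ** 2
    # nanmean skips the NaN entries, i.e. movies the viewer did not watch.
    return float(np.sqrt(np.nanmean(squared_error)))

# Hypothetical ratings; NaN marks a movie the viewer did not watch.
observed = pd.Series([1.0, np.nan, 3.0, 5.0])
prediction = 3.0  # a constant prediction, like the raw average

print(rmse(observed, prediction))  # sqrt((4 + 0 + 4) / 3), about 1.633
```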

I perform the same calculation for the test set, still using the raw average from the training set as the prediction.

In [25]:
test_user_item_long = test_user_item.melt(ignore_index=False).reset_index()

In [26]:
test_user_item_long['se_raw_avg'] = (np.subtract(test_user_item_long['value'],raw_avg))**2

In [27]:
test_rmse_raw_avg = np.sqrt(np.mean(test_user_item_long['se_raw_avg']))

In [28]:
test_rmse_raw_avg

np.float64(1.260554400188002)

## Raw average with bias

Next, I calculate the viewer bias and the movie bias. The viewer bias is a viewer's average rating minus the raw average, and the movie bias is a movie's average rating minus the raw average. These biases are added to the raw average to generate predictions.
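In symbols (this notation is my own shorthand, introduced only for this note): let $\mu$ be the raw average of the training ratings, $\bar{r}_u$ the average training rating given by viewer $u$, and $\bar{r}_i$ the average training rating received by movie $i$. Then

$$
b_u = \bar{r}_u - \mu, \qquad b_i = \bar{r}_i - \mu, \qquad \hat{r}_{ui} = \mu + b_u + b_i
$$

and the error on an observed rating $r_{ui}$ is $r_{ui} - \hat{r}_{ui}$. For instance, with the training values in this notebook ($\mu = 2.34375$, a viewer bias of $-0.84375$ for viewer 0, and a movie bias of $0.15625$ for movie1), the predicted rating is $2.34375 - 0.84375 + 0.15625 = 1.65625$.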

In [29]:
viewer_bias = train_user_item_long.groupby('index').mean(numeric_only=True)

In [30]:
viewer_bias['viewer_bias'] = np.subtract(viewer_bias['value'],raw_avg)

In [31]:
viewer_bias

index,value,se_raw_avg,viewer_bias
0,1.5,1.461914,-0.84375
1,2.333333,1.555664,-0.010417
2,2.25,0.696289,-0.09375
3,2.6,1.905664,0.25625
4,2.2,1.380664,-0.14375
5,2.75,0.852539,0.40625
6,2.0,1.618164,-0.34375
7,3.333333,1.868164,0.989583


In [32]:
movie_bias = train_user_item_long.groupby('variable').mean(numeric_only=True)

In [33]:
movie_bias.drop('index', axis = 1, inplace = True)

In [34]:
movie_bias['movie_bias'] = np.subtract(movie_bias['value'],raw_avg)

In [35]:
movie_bias

variable,value,se_raw_avg,movie_bias
movie1,2.5,2.274414,0.15625
movie2,2.6,1.505664,0.25625
movie3,2.0,2.118164,-0.34375
movie4,2.0,1.718164,-0.34375
movie5,2.5,0.274414,0.15625
movie6,2.0,0.618164,-0.34375
movie7,2.333333,1.555664,-0.010417
movie8,2.75,1.352539,0.40625
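The bias tables above also carry an averaged `se_raw_avg` column, which is not needed for the biases themselves. A more compact alternative, sketched here on hypothetical data, selects only the `value` column before averaging. The frame `train_long` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format training ratings (NaN = not watched).
train_long = pd.DataFrame({
    "index": [0, 0, 1, 1],
    "variable": ["movie1", "movie2", "movie1", "movie2"],
    "value": [1.0, 3.0, 4.0, np.nan],
})
raw_avg = train_long["value"].mean()  # NaN is skipped by default

# Group average minus raw average, one value per viewer and one per movie.
viewer_bias = (train_long.groupby("index")["value"].mean() - raw_avg).rename("viewer_bias")
movie_bias = (train_long.groupby("variable")["value"].mean() - raw_avg).rename("movie_bias")

print(viewer_bias)
print(movie_bias)
```

Both results are Series indexed by viewer or movie, so they can be turned into merge-ready frames with `reset_index()`.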


Next, I merge the viewer bias and movie bias values into the training and test data frames. Both sets receive the biases estimated from the training set.

In [36]:
train_user_item_long

Unnamed: 0,index,variable,value,se_raw_avg
0,0,movie1,1.0,1.805664
1,1,movie1,,
2,2,movie1,,
3,3,movie1,,
4,4,movie1,,
...,...,...,...,...
59,3,movie8,,
60,4,movie8,3.0,0.430664
61,5,movie8,3.0,0.430664
62,6,movie8,,


In [37]:
viewer_bias.reset_index(inplace = True)

In [38]:
movie_bias.reset_index(inplace = True)

In [39]:
viewer_bias

Unnamed: 0,index,value,se_raw_avg,viewer_bias
0,0,1.5,1.461914,-0.84375
1,1,2.333333,1.555664,-0.010417
2,2,2.25,0.696289,-0.09375
3,3,2.6,1.905664,0.25625
4,4,2.2,1.380664,-0.14375
5,5,2.75,0.852539,0.40625
6,6,2.0,1.618164,-0.34375
7,7,3.333333,1.868164,0.989583


In [40]:
movie_bias

Unnamed: 0,variable,value,se_raw_avg,movie_bias
0,movie1,2.5,2.274414,0.15625
1,movie2,2.6,1.505664,0.25625
2,movie3,2.0,2.118164,-0.34375
3,movie4,2.0,1.718164,-0.34375
4,movie5,2.5,0.274414,0.15625
5,movie6,2.0,0.618164,-0.34375
6,movie7,2.333333,1.555664,-0.010417
7,movie8,2.75,1.352539,0.40625


In [41]:
train_user_item_long = pd.merge(train_user_item_long,viewer_bias[['index','viewer_bias']],on='index', how='left')
train_user_item_long = pd.merge(train_user_item_long,movie_bias[['variable','movie_bias']],on='variable', how='left')
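A left merge like the ones above silently duplicates rows if the bias table ever contains a repeated key. One optional safeguard is the `validate` argument of `merge`, sketched below on hypothetical frames (`ratings` and the bias values are made up).

```python
import pandas as pd

# Hypothetical long-format ratings and a per-viewer bias table.
ratings = pd.DataFrame({
    "index": [0, 0, 1],
    "variable": ["movie1", "movie2", "movie1"],
    "value": [1.0, 3.0, 4.0],
})
viewer_bias = pd.DataFrame({"index": [0, 1], "viewer_bias": [-0.5, 0.5]})

# validate="many_to_one" raises an error if a viewer appears twice in viewer_bias.
merged = ratings.merge(viewer_bias, on="index", how="left", validate="many_to_one")
print(merged)
```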

In [42]:
train_user_item_long

Unnamed: 0,index,variable,value,se_raw_avg,viewer_bias,movie_bias
0,0,movie1,1.0,1.805664,-0.843750,0.15625
1,1,movie1,,,-0.010417,0.15625
2,2,movie1,,,-0.093750,0.15625
3,3,movie1,,,0.256250,0.15625
4,4,movie1,,,-0.143750,0.15625
...,...,...,...,...,...,...
59,3,movie8,,,0.256250,0.40625
60,4,movie8,3.0,0.430664,-0.143750,0.40625
61,5,movie8,3.0,0.430664,0.406250,0.40625
62,6,movie8,,,-0.343750,0.40625


In [43]:
test_user_item_long = pd.merge(test_user_item_long,viewer_bias[['index','viewer_bias']],on='index', how='left')
test_user_item_long = pd.merge(test_user_item_long,movie_bias[['variable','movie_bias']],on='variable', how='left')

In [44]:
test_user_item_long

Unnamed: 0,index,variable,value,se_raw_avg,viewer_bias,movie_bias
0,0,movie1,,,-0.843750,0.15625
1,1,movie1,1.0,1.805664,-0.010417,0.15625
2,2,movie1,3.0,0.430664,-0.093750,0.15625
3,3,movie1,,,0.256250,0.15625
4,4,movie1,4.0,2.743164,-0.143750,0.15625
...,...,...,...,...,...,...
59,3,movie8,1.0,1.805664,0.256250,0.40625
60,4,movie8,,,-0.143750,0.40625
61,5,movie8,,,0.406250,0.40625
62,6,movie8,,,-0.343750,0.40625


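Now I calculate the squared errors. The prediction for each viewer-movie pair is the sum of the raw average, the viewer bias, and the movie bias, so the error is the observed value minus that sum. Note that the expression below subtracts only the raw average and then adds the two biases, which is not the same quantity; the biases would need to be subtracted as well for `se_raw_avg_with_bias` to measure the model's error.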

In [45]:
train_user_item_long['se_raw_avg_with_bias'] = (np.subtract(train_user_item_long['value'],raw_avg)+
                                               train_user_item_long['viewer_bias']+
                                               train_user_item_long['movie_bias'])**2

In [46]:
train_user_item_long

Unnamed: 0,index,variable,value,se_raw_avg,viewer_bias,movie_bias,se_raw_avg_with_bias
0,0,movie1,1.0,1.805664,-0.843750,0.15625,4.125977
1,1,movie1,,,-0.010417,0.15625,
2,2,movie1,,,-0.093750,0.15625,
3,3,movie1,,,0.256250,0.15625,
4,4,movie1,,,-0.143750,0.15625,
...,...,...,...,...,...,...,...
59,3,movie8,,,0.256250,0.40625,
60,4,movie8,3.0,0.430664,-0.143750,0.40625,0.844102
61,5,movie8,3.0,0.430664,0.406250,0.40625,2.157227
62,6,movie8,,,-0.343750,0.40625,


In [47]:
train_rmse_raw_avg_with_bias = np.sqrt(np.mean(train_user_item_long['se_raw_avg_with_bias']))

In [48]:
train_rmse_raw_avg_with_bias

np.float64(1.559033655024804)

In [49]:
test_user_item_long['se_raw_avg_with_bias'] = (np.subtract(test_user_item_long['value'],raw_avg)+
                                               test_user_item_long['viewer_bias']+
                                               test_user_item_long['movie_bias'])**2

In [50]:
test_user_item_long

Unnamed: 0,index,variable,value,se_raw_avg,viewer_bias,movie_bias,se_raw_avg_with_bias
0,0,movie1,,,-0.843750,0.15625,
1,1,movie1,1.0,1.805664,-0.010417,0.15625,1.435004
2,2,movie1,3.0,0.430664,-0.093750,0.15625,0.516602
3,3,movie1,,,0.256250,0.15625,
4,4,movie1,4.0,2.743164,-0.143750,0.15625,2.784727
...,...,...,...,...,...,...,...
59,3,movie8,1.0,1.805664,0.256250,0.40625,0.464102
60,4,movie8,,,-0.143750,0.40625,
61,5,movie8,,,0.406250,0.40625,
62,6,movie8,,,-0.343750,0.40625,


In [51]:
test_rmse_raw_avg_with_bias = np.sqrt(np.mean(test_user_item_long['se_raw_avg_with_bias']))

In [52]:
test_rmse_raw_avg_with_bias

np.float64(1.1293412892855663)
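As a cross-check of the bias model, here is a minimal sketch that recomputes the RMSEs directly from the wide matrices, taking the error as observed minus predicted, where predicted is `raw_avg + viewer_bias + movie_bias`. This differs from the cells above, which add the biases to `value - raw_avg`. The two matrices are copied from the train and test tables shown earlier; the names `train_m`, `test_m`, `pred`, and `rmse` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

nan = np.nan
movies = [f"movie{i}" for i in range(1, 9)]

# Train and test matrices copied from the tables earlier in this notebook.
train_m = pd.DataFrame([
    [1, nan, nan, 1, 3, nan, 1, nan],
    [nan, 2, nan, 1, nan, nan, nan, 4],
    [nan, nan, nan, 3, nan, 3, 2, 1],
    [nan, 4, 1, nan, 3, 1, 4, nan],
    [nan, 1, 1, 4, nan, 2, nan, 3],
    [4, 2, nan, nan, 2, nan, nan, 3],
    [1, nan, 4, 1, nan, 2, nan, nan],
    [4, 4, nan, nan, 2, nan, nan, nan],
], columns=movies)
test_m = pd.DataFrame([
    [nan, 4, nan, nan, nan, 4, nan, nan],
    [1, nan, nan, nan, nan, 2, nan, nan],
    [3, nan, 4, nan, nan, nan, nan, nan],
    [nan, nan, nan, 3, nan, nan, nan, 1],
    [4, nan, nan, nan, 3, nan, nan, nan],
    [nan, nan, nan, 1, nan, nan, nan, nan],
    [nan, 3, nan, nan, 4, nan, nan, nan],
    [nan, nan, nan, nan, nan, 4, 2, nan],
], columns=movies)

# All statistics come from the training set only.
raw_avg = np.nanmean(train_m.to_numpy())
viewer_bias = train_m.mean(axis=1) - raw_avg  # one value per viewer (row)
movie_bias = train_m.mean(axis=0) - raw_avg   # one value per movie (column)

# Prediction for every cell: raw_avg + viewer_bias[u] + movie_bias[i].
pred = raw_avg + viewer_bias.to_numpy()[:, None] + movie_bias.to_numpy()[None, :]

def rmse(observed, predicted):
    """RMSE over observed (non-NaN) cells; error is observed minus predicted."""
    err = observed.to_numpy() - predicted
    return float(np.sqrt(np.nanmean(err ** 2)))

train_rmse_raw = rmse(train_m, raw_avg)
test_rmse_raw = rmse(test_m, raw_avg)
train_rmse_bias = rmse(train_m, pred)
test_rmse_bias = rmse(test_m, pred)

print(train_rmse_raw, test_rmse_raw)
print(train_rmse_bias, test_rmse_bias)
```

The raw-average figures from this sketch should match the values computed earlier, which gives a quick way to confirm the matrices were copied correctly before comparing the bias figures.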

## Summary

Below I create a data frame to compare the RMSE values for the two approaches on the train and test sets.

In [53]:
summary_list = [['train',train_rmse_raw_avg,train_rmse_raw_avg_with_bias],
                ['test',test_rmse_raw_avg,test_rmse_raw_avg_with_bias]]

summary_df = pd.DataFrame(summary_list, columns = ['Dataset','Raw_Avg','Raw_Avg_with_Bias'])

In [54]:
summary_df

Unnamed: 0,Dataset,Raw_Avg,Raw_Avg_with_Bias
0,train,1.188733,1.559034
1,test,1.260554,1.129341


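In the table above, the RMSE for the training set is higher when incorporating viewer and movie bias than when using just the raw average, while the RMSE for the test set is lower. These bias figures should be read with caution: the squared-error cells add the biases to `value - raw_avg` rather than subtracting the full prediction `raw_avg + viewer_bias + movie_bias` from the observed value, so they do not measure the bias model's true error. It would seem that incorporating bias should improve results; however, for a small, randomly generated data set there can be considerable variance as to whether the raw average or the raw average with bias is the better approach.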