# data612 - Group Project 3 : Recommender System (Matrix Factorization)
# date: 2019-06-25
# by: Sang Yoon (Andy) Hwang, Santosh Cheruku, Anthony Munoz

The purpose of this project is to compare three algorithms, SVD, NMF and ALS, and find the one that predicts ratings best, measured by the lowest RMSE.
We will then use the best-performing algorithm to predict user ratings and recommend items.

# Data Preparation

We are going to use the MovieLens dataset with roughly 100k ratings.

In [87]:
import pandas as pd
import numpy as np
import surprise
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
from surprise import Dataset
from surprise import KNNBasic
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split

Specifically, we use ml-latest-small from https://grouplens.org/datasets/movielens/.

In [88]:
df = pd.read_csv('https://raw.githubusercontent.com/wheremagichappens/an.dy/master/data612/ml-100k/ratings.csv')

In [89]:
df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


To make our lives a little easier, we will rename the columns.

In [90]:
df.columns = ['user','item','rating','timestamp']

In [91]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
user         100836 non-null int64
item         100836 non-null int64
rating       100836 non-null float64
timestamp    100836 non-null int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


# Modelling

We will find the best algorithm based on RMSE. 
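For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The sketch below is a minimal plain-Python version; the function name `rmse` and the sample values are illustrative assumptions, not part of the notebook's results.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values only (not taken from the notebook's results)
example_actual = [4.0, 3.5, 2.0]
example_predicted = [3.5, 3.5, 3.0]
example_rmse = rmse(example_actual, example_predicted)
```

A lower value means the predicted ratings sit closer to the actual ones, which is why we rank the algorithms by it.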

In [92]:
reader = Reader(rating_scale=(df.rating.min(), df.rating.max()))
data = Dataset.load_from_df(df[['user', 'item', 'rating']], reader)

Let's split the data into 5 folds.

In [93]:
data.split(n_folds=5)
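To make the idea of 5 folds concrete, here is a rough sketch of how row indices could be partitioned so that every rating is used for testing exactly once. The helper `k_fold_indices` is hypothetical and written in plain Python; it is not how Surprise implements its split.

```python
import random

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal groups
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

# 100836 is the number of rows reported by df.info() above
splits = list(k_fold_indices(n_samples=100836, n_folds=5))
```

Each model is then fitted on the training part of a split and scored on the test part, and the per-fold RMSE values are averaged.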

# SVD

In [94]:
algo = surprise.SVD()
rmse_svd = surprise.evaluate(algo, data, measures=['RMSE'])

Evaluating RMSE of algorithm SVD.

------------
Fold 1
RMSE: 0.8829
------------
Fold 2
RMSE: 0.8820
------------
Fold 3
RMSE: 0.8719
------------
Fold 4
RMSE: 0.8683
------------
Fold 5
RMSE: 0.8713
------------
------------
Mean RMSE: 0.8753
------------
------------


In [95]:
# result = cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=False)
# result['test_rmse'] 

# Non-negative Matrix Factorization

In [96]:
algo = surprise.NMF()
rmse_nmf = surprise.evaluate(algo, data, measures=['RMSE'])

Evaluating RMSE of algorithm NMF.

------------
Fold 1
RMSE: 0.9271
------------
Fold 2
RMSE: 0.9241
------------
Fold 3
RMSE: 0.9221
------------
Fold 4
RMSE: 0.9157
------------
Fold 5
RMSE: 0.9207
------------
------------
Mean RMSE: 0.9219
------------
------------


In [97]:
#result = cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=False)
#result['test_rmse'] 

# BaselineOnly

In [98]:
algo = surprise.BaselineOnly()
rmse_bo = surprise.evaluate(algo, data, measures=['RMSE'])

Evaluating RMSE of algorithm BaselineOnly.

------------
Fold 1
Estimating biases using als...
RMSE: 0.8811
------------
Fold 2
Estimating biases using als...
RMSE: 0.8791
------------
Fold 3
Estimating biases using als...
RMSE: 0.8702
------------
Fold 4
Estimating biases using als...
RMSE: 0.8657
------------
Fold 5
Estimating biases using als...
RMSE: 0.8686
------------
------------
Mean RMSE: 0.8729
------------
------------


In [99]:
#result = cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=False)
#result['test_rmse'] 

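Surprisingly, BaselineOnly is the winner, with a mean RMSE of 0.8729 against 0.8753 for SVD and 0.9219 for NMF. As the output shows, BaselineOnly estimates its biases using ALS, so this is the ALS-based model we will use. Next, we fit it on a training set and measure its accuracy on a held-out test set.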

# BaselineOnly - ALS

We will perform an 80/20 train/test split to build the model.

In [100]:
trainset, testset = train_test_split(data, test_size=.20)

We will set the ALS method explicitly in the baseline options for BaselineOnly.

In [101]:
bsl_options = {'method': 'als'}
algo = surprise.BaselineOnly(bsl_options=bsl_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_als = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions_als)

Estimating biases using als...
RMSE: 0.8723


0.8723260528049739
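To illustrate what "estimating biases using als" involves, the sketch below fits a baseline of the form mu + b_u + b_i by alternating closed-form updates of item and user biases. It is a simplified plain-Python illustration under our own assumptions, not Surprise's code: the function names and the toy ratings are hypothetical, and the regularization and epoch values are what we believe to be Surprise's documented ALS defaults.

```python
from collections import defaultdict

def als_baselines(ratings, n_epochs=10, reg_u=15.0, reg_i=10.0):
    """Estimate global mean mu, user biases b_u and item biases b_i.

    ratings: list of (user, item, rating) tuples.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    b_u = {u: 0.0 for u in by_user}
    b_i = {i: 0.0 for i in by_item}
    for _ in range(n_epochs):
        # Hold user biases fixed and solve for each item bias
        for i, rows in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rows) / (reg_i + len(rows))
        # Hold item biases fixed and solve for each user bias
        for u, rows in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rows) / (reg_u + len(rows))
    return mu, b_u, b_i

def predict_baseline(mu, b_u, b_i, user, item, low=0.5, high=5.0):
    """Baseline estimate mu + b_u + b_i, clipped to the rating scale."""
    est = mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)
    return min(high, max(low, est))

# Toy ratings for illustration only
toy = [("a", "x", 5.0), ("a", "y", 4.0), ("b", "x", 3.0), ("b", "y", 2.0), ("c", "x", 4.0)]
mu, b_u, b_i = als_baselines(toy)
```

Note that this model has no latent factors: a prediction depends only on how generous the user is and how well liked the item is, relative to the global mean.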

Let's create a DataFrame from the prediction results.

In [102]:
prediction_df = pd.DataFrame(predictions_als)
prediction_df.head()

Unnamed: 0,uid,iid,r_ui,est,details
0,594,8636,5.0,4.225218,{u'was_impossible': False}
1,306,72998,3.5,3.367205,{u'was_impossible': False}
2,229,357,4.0,3.753616,{u'was_impossible': False}
3,57,442,2.0,2.892028,{u'was_impossible': False}
4,222,1682,4.0,3.415373,{u'was_impossible': False}


Let's select the 10 worst and 10 best predictions on the test set, ranked by absolute error.

In [103]:
worst_10 = abs(prediction_df['r_ui'] - prediction_df['est']).sort_values(ascending=False).head(10)
top_10 = abs(prediction_df['r_ui'] - prediction_df['est']).sort_values(ascending=True).head(10)

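Worst 10 - the predictions were way off: users gave these movies 0.5 or 1.0 stars, while the model estimated roughly 3.9 to 4.7.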

In [104]:
prediction_df.iloc[worst_10.index]

Unnamed: 0,uid,iid,r_ui,est,details
10665,573,44199,0.5,4.372386,{u'was_impossible': False}
15293,413,1198,1.0,4.706567,{u'was_impossible': False}
8013,258,122886,0.5,4.1291,{u'was_impossible': False}
358,413,2858,1.0,4.605462,{u'was_impossible': False}
19061,258,87232,0.5,4.101852,{u'was_impossible': False}
2957,129,1358,0.5,4.084332,{u'was_impossible': False}
4886,421,61132,0.5,3.98859,{u'was_impossible': False}
16217,111,2571,0.5,3.983922,{u'was_impossible': False}
8090,393,1240,0.5,3.939089,{u'was_impossible': False}
10440,154,86644,0.5,3.936395,{u'was_impossible': False}


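Best 10 - the predictions match the actual ratings almost exactly. The estimates of exactly 5.0 are most likely the result of estimates being clipped to the upper bound of the rating scale.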

In [105]:
prediction_df.iloc[top_10.index]

Unnamed: 0,uid,iid,r_ui,est,details
5731,122,260,5.0,5.0,{u'was_impossible': False}
1541,122,318,5.0,5.0,{u'was_impossible': False}
9532,171,318,5.0,5.0,{u'was_impossible': False}
1528,452,1221,5.0,5.0,{u'was_impossible': False}
4949,523,58559,5.0,5.0,{u'was_impossible': False}
12228,599,6625,2.5,2.499975,{u'was_impossible': False}
18377,38,317,3.0,2.999883,{u'was_impossible': False}
20099,610,111663,3.5,3.500148,{u'was_impossible': False}
7736,290,1333,4.0,3.999803,{u'was_impossible': False}
3721,105,3977,3.5,3.500221,{u'was_impossible': False}


We can also use an individual uid and iid to check a single prediction result. Let's select one of the entries in the test set.

In [106]:
test_10 = testset[0:10]

In [107]:
# predict with the model fitted on the training set and compare against the test set rating
uid = test_10[1][0]   # raw user id (as in the ratings file).
iid = test_10[1][1]   # raw item id (as in the ratings file).

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=test_10[1][2], verbose=True)

user: 306        item: 72998      r_ui = 3.50   est = 3.37   {u'was_impossible': False}


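We will then recommend the top 5 movies by predicted rating for a selected user. To show titles, we merge the movie title data set with the prediction results. Note that the candidates come from the test-set predictions, so they are movies the user has already rated.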

In [108]:
item_names = pd.read_csv('https://raw.githubusercontent.com/wheremagichappens/an.dy/master/data612/ml-100k/movies.csv')
item_names = item_names.iloc[:,[0,1]]
item_names.columns = ['iid', 'title']
prediction_df = prediction_df.merge(item_names, on = 'iid', how = 'inner')

In [109]:
def recommender_user(uid):
    top_5 = prediction_df[prediction_df.uid == uid].sort_values(by = 'est', ascending = False).head(5)
    for i in range(len(top_5)):
        print('top 5 movies for uid: {0} are iid: {1} - {2} with predicted ratings of {3}'.format(uid, top_5.iid.values[i], top_5.title.values[i], top_5.est.values[i]))

recommender_user(uid)

top 5 movies for uid: 306 are iid: 60069 - WALL·E (2008) with predicted ratings of 3.78461397825
top 5 movies for uid: 306 are iid: 68954 - Up (2009) with predicted ratings of 3.74534008302
top 5 movies for uid: 306 are iid: 2762 - Sixth Sense, The (1999) with predicted ratings of 3.58480396366
top 5 movies for uid: 306 are iid: 134853 - Inside Out (2015) with predicted ratings of 3.53874987782
top 5 movies for uid: 306 are iid: 140110 - The Intern (2015) with predicted ratings of 3.43628738475
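The same top-N idea can be expressed for all users at once. The sketch below groups predictions by user and keeps the highest estimates; `get_top_n` is a hypothetical helper in plain Python, and the sample rows reuse values from the outputs above, with ratings we did not print set to `None`.

```python
from collections import defaultdict

def get_top_n(predictions, n=5):
    """Map each user id to its n highest-estimated (item id, estimate) pairs.

    predictions: iterable of (uid, iid, r_ui, est, details) tuples, the same
    field order as the prediction DataFrame columns above.
    """
    per_user = defaultdict(list)
    for uid, iid, _r_ui, est, _details in predictions:
        per_user[uid].append((iid, est))
    return {
        uid: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for uid, items in per_user.items()
    }

# Rows taken from the outputs above; r_ui is None where it was not shown
sample = [
    (306, 72998, 3.5, 3.367205, None),
    (306, 60069, None, 3.78461397825, None),
    (306, 68954, None, 3.74534008302, None),
    (594, 8636, 5.0, 4.225218, None),
]
top = get_top_n(sample, n=2)
```

To recommend movies a user has not rated yet, the predictions passed in would need to cover unseen user and item pairs rather than the test set.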


# Conclusion - Evaluation

In [110]:
rmse_models = [np.mean(rmse_svd['rmse']),np.mean(rmse_nmf['rmse']),np.mean(rmse_bo['rmse'])]
rmse_df = pd.DataFrame(rmse_models)
rmse_df.index = ['SVD','NMF','BaselineOnly']
rmse_df.columns = ['RMSE - CV 5 folds']

In [111]:
df.rating.hist()

<matplotlib.axes._subplots.AxesSubplot at 0x1390da90>

In [112]:
rmse_df

Unnamed: 0,RMSE - CV 5 folds
SVD,0.875289
NMF,0.921909
BaselineOnly,0.872945


Our first thought was that SVD would give us the best result, but it did not: the RMSE for SVD is slightly higher than BaselineOnly's (0.8753 vs 0.8729). We cannot, however, strictly say that one should always prefer the ALS-based BaselineOnly to the SGD-based SVD. Depending on the case, SVD may still be the more suitable choice.

We then have to think about why SVD had the higher RMSE.

Our guess is that it has something to do with feature scaling, that is, how the ratings are scaled.
From the documentation of the dataset we used (http://files.grouplens.org/datasets/movielens/ml-latest-small-README.html) and from the rating histogram, we know that ratings come in half-star increments. On the other hand, the alternative dataset with 1 million ratings (http://files.grouplens.org/datasets/movielens/ml-1m-README.txt), on which we could not run SVD due to memory issues, contains whole-star ratings only.

According to the benchmark RMSE examples on http://surpriselib.com/, the RMSE for SVD is around 0.934 on the 100k dataset but only 0.873 on the 1m dataset.

Since SVD is trained with SGD (https://scikit-learn.org/stable/modules/sgd.html), it inherits several disadvantages: 1) SGD requires a number of hyperparameters, such as the regularization parameter and the number of iterations. 2) SGD is sensitive to feature scaling.
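To show where those hyperparameters enter, here is a small plain-Python sketch of a biased matrix factorization trained with SGD. It is an illustration under our own assumptions rather than Surprise's implementation: `sgd_svd`, `train_rmse` and the toy ratings are hypothetical, and the default `lr`, `reg` and `n_epochs` values are what we believe to be Surprise's documented SVD defaults.

```python
import math
import random

def sgd_svd(ratings, n_factors=2, n_epochs=20, lr=0.005, reg=0.02, seed=0):
    """Fit r_hat = mu + b_u + b_i + q_i . p_u by SGD on (user, item, rating) tuples.

    Returns a function estimate(user, item) for users and items seen in training.
    """
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    b_u = {u: 0.0 for u in users}
    b_i = {i: 0.0 for i in items}
    # Small random factors, so the first estimates are close to mu
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def estimate(u, i):
        return mu + b_u[u] + b_i[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - estimate(u, i)
            # Gradient steps with L2 regularization on biases and factors
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return estimate

def train_rmse(estimate, ratings):
    """RMSE of the fitted model on the ratings it was trained on."""
    return math.sqrt(sum((r - estimate(u, i)) ** 2 for u, i, r in ratings) / len(ratings))

# Toy ratings for illustration only
toy = [("a", "x", 5.0), ("a", "y", 4.0), ("b", "x", 3.0), ("b", "y", 2.0), ("c", "x", 4.0)]
rmse_1_epoch = train_rmse(sgd_svd(toy, n_epochs=1), toy)
rmse_200_epochs = train_rmse(sgd_svd(toy, n_epochs=200), toy)
```

On the toy data, the fit after 200 epochs is tighter than after 1 epoch, which illustrates how the result depends on the number of iterations. The same applies to `lr` and `reg`, so SVD with default settings may not be at its best.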

For future reference, it would be worth comparing the RMSE of each model across different rating scales to see whether there is any improvement.
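One possible way to set up such a comparison is to map the half-star ratings onto a whole-star scale and evaluate each model on both versions. The sketch below rounds half-star ratings up; both the helper name `to_whole_stars` and the rounding rule are assumptions for illustration, and other mappings are equally plausible.

```python
import math

def to_whole_stars(rating):
    """Round a half-star rating (0.5 to 5.0) up to the next whole star (1 to 5)."""
    return int(math.ceil(rating))

# Every value that appears on the half-star scale would be mapped the same way
half_star_ratings = [0.5, 1.0, 2.5, 3.5, 4.0, 5.0]
whole_star_ratings = [to_whole_stars(r) for r in half_star_ratings]
```

The converted ratings could then be loaded with a rating scale of 1 to 5 and run through the same cross-validation as above.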