# data612 - Group Project 3 : Recommender System (Matrix Factorization)
# date: 2019-06-25
# by: Sang Yoon (Andy) Hwang, Santosh Cheruku, Anthony Munoz

The purpose of this project is to compare three approaches to rating prediction (SVD, NMF and ALS) and identify the one with the lowest RMSE.
We will then use the best-performing algorithm to predict user ratings and recommend items.

# Data Preparation

We are going to use the 100k-ratings dataset from MovieLens.

In [1]:
import pandas as pd
import surprise
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
from surprise import Dataset
from surprise import KNNBasic
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split

Specifically, we will use ml-latest-small from https://grouplens.org/datasets/movielens/.

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/wheremagichappens/an.dy/master/data612/ml-100k/ratings.csv')

In [3]:
df.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


To make the data easier to work with, we will rename the columns.

In [4]:
df.columns = ['user','item','rating','timestamp']

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
user         100836 non-null int64
item         100836 non-null int64
rating       100836 non-null float64
timestamp    100836 non-null int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


# Modelling

We will find the best algorithm based on RMSE. 
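
As a reminder, RMSE is the square root of the mean squared difference between actual and predicted ratings, so lower is better. A minimal sketch in plain Python (the function name `rmse` and the sample numbers are illustrative and not taken from this notebook):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative ratings only
actual = [4.0, 3.0, 5.0]
predicted = [3.5, 3.0, 4.0]
print(round(rmse(actual, predicted), 4))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.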

In [6]:
reader = Reader(rating_scale=(df.rating.min(), df.rating.max()))
data = Dataset.load_from_df(df[['user', 'item', 'rating']], reader)

Let's split data into 5 folds

In [7]:
data.split(n_folds=5)

# SVD

In [8]:
# result = cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=False)
# result['test_rmse'] 

In [14]:
algo = surprise.SVD()
surprise.evaluate(algo, data, measures=['RMSE'])


The evaluate() method is deprecated. Please use model_selection.cross_validate() instead.


Using data.split() or using load_from_folds() without using a CV iterator is now deprecated. 



Evaluating RMSE of algorithm SVD.

------------
Fold 1
RMSE: 0.8760
------------
Fold 2
RMSE: 0.8784
------------
Fold 3
RMSE: 0.8672
------------
Fold 4
RMSE: 0.8659
------------
Fold 5
RMSE: 0.8770
------------
------------
Mean RMSE: 0.8729
------------
------------


CaseInsensitiveDefaultDict(list,
                           {'rmse': [0.8759879771735134,
                             0.878406272272054,
                             0.8672146012465194,
                             0.8659068338633159,
                             0.8770347484336569]})

# SVDpp

In [42]:
#algo = surprise.SVDpp()
#surprise.evaluate(algo, data, measures=['RMSE'])

# Non-negative Matrix Factorization

In [11]:
algo = surprise.NMF()
surprise.evaluate(algo, data, measures=['RMSE'])


The evaluate() method is deprecated. Please use model_selection.cross_validate() instead.


Using data.split() or using load_from_folds() without using a CV iterator is now deprecated. 



Evaluating RMSE of algorithm NMF.

------------
Fold 1
RMSE: 0.9224
------------
Fold 2
RMSE: 0.9193
------------
Fold 3
RMSE: 0.9184
------------
Fold 4
RMSE: 0.9166
------------
Fold 5
RMSE: 0.9250
------------
------------
Mean RMSE: 0.9204
------------
------------


CaseInsensitiveDefaultDict(list,
                           {'rmse': [0.9224254904167606,
                             0.9193270641437427,
                             0.9183989469358387,
                             0.9165932121998853,
                             0.9250273783815011]})

# BaselineOnly

In [15]:
algo = surprise.BaselineOnly()
surprise.evaluate(algo, data, measures=['RMSE'])


The evaluate() method is deprecated. Please use model_selection.cross_validate() instead.


Using data.split() or using load_from_folds() without using a CV iterator is now deprecated. 



Evaluating RMSE of algorithm BaselineOnly.

------------
Fold 1
Estimating biases using als...
RMSE: 0.8743
------------
Fold 2
Estimating biases using als...
RMSE: 0.8759
------------
Fold 3
Estimating biases using als...
RMSE: 0.8674
------------
Fold 4
Estimating biases using als...
RMSE: 0.8674
------------
Fold 5
Estimating biases using als...
RMSE: 0.8757
------------
------------
Mean RMSE: 0.8721
------------
------------


CaseInsensitiveDefaultDict(list,
                           {'rmse': [0.8742732306244616,
                             0.8758519033828231,
                             0.8674367781096043,
                             0.8673539243629983,
                             0.8756883647725023]})

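Surprisingly, BaselineOnly is the winner! It has the lowest mean RMSE (0.8721, against 0.8729 for SVD and 0.9204 for NMF), although its margin over SVD is very small. BaselineOnly estimates its biases with ALS by default, so we will carry ALS forward. We will now fit the model on a training set and measure its accuracy on a held-out test set.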
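
To make the comparison explicit, the per-fold RMSE values printed above can be summarised in a few lines of plain Python. The numbers are copied from the outputs in this notebook; the variable names are only illustrative.

```python
from statistics import mean, stdev

# Per-fold RMSE values copied from the evaluate() outputs above
fold_rmse = {
    'SVD': [0.8759879771735134, 0.878406272272054, 0.8672146012465194,
            0.8659068338633159, 0.8770347484336569],
    'NMF': [0.9224254904167606, 0.9193270641437427, 0.9183989469358387,
            0.9165932121998853, 0.9250273783815011],
    'BaselineOnly': [0.8742732306244616, 0.8758519033828231, 0.8674367781096043,
                     0.8673539243629983, 0.8756883647725023],
}

# Mean and sample standard deviation across the 5 folds for each algorithm
summary = {name: (mean(values), stdev(values)) for name, values in fold_rmse.items()}
best = min(summary, key=lambda name: summary[name][0])

for name, (m, s) in sorted(summary.items(), key=lambda kv: kv[1][0]):
    print('{0:<13} mean RMSE = {1:.4f}  (std across folds = {2:.4f})'.format(name, m, s))
print('lowest mean RMSE:', best)
```

The gap between SVD and BaselineOnly is smaller than the fold-to-fold spread of either one, so the ranking of those two should be read with caution.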

# BaselineOnly - ALS

We will perform an 80/20 train/test split to build the model.

In [16]:
trainset, testset = train_test_split(data, test_size=.20)

We will configure BaselineOnly to use ALS explicitly.

In [37]:
# 'sgd' is the other option; in this case sgd has the same output. Not too sure why.
bsl_options = {'method': 'als'}
algo = surprise.BaselineOnly(bsl_options=bsl_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_als = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions_als)

Estimating biases using als...
RMSE: 0.8702


0.8701576448753574
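
For intuition, BaselineOnly predicts a rating as the global mean plus a user bias plus an item bias, and ALS alternates between solving for the item biases with the user biases fixed and the reverse. The sketch below is a simplified plain-Python version of that idea, not Surprise's actual implementation. The function names are hypothetical, and the defaults (`n_epochs=10`, `reg_u=15`, `reg_i=10`) are assumptions meant to mirror the library's documented ALS defaults.

```python
from collections import defaultdict

def als_baselines(ratings, n_epochs=10, reg_u=15, reg_i=10):
    """Estimate the global mean, user biases and item biases by alternating least squares.

    ratings: list of (user, item, rating) tuples.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))

    b_u = defaultdict(float)  # user biases, start at 0
    b_i = defaultdict(float)  # item biases, start at 0
    for _ in range(n_epochs):
        # Hold user biases fixed and solve each item bias in closed form (shrunk by reg_i).
        for i, rows in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rows) / (reg_i + len(rows))
        # Hold item biases fixed and solve each user bias in closed form (shrunk by reg_u).
        for u, rows in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rows) / (reg_u + len(rows))
    return mu, dict(b_u), dict(b_i)

def baseline_estimate(mu, b_u, b_i, user, item):
    # Unknown users or items contribute a zero bias.
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)

# Illustrative ratings only
toy = [('a', 'x', 5.0), ('a', 'y', 4.0), ('b', 'x', 3.0), ('b', 'y', 2.0), ('c', 'x', 4.0)]
mu, b_u, b_i = als_baselines(toy)
print(round(mu, 2), baseline_estimate(mu, b_u, b_i, 'a', 'x'))
```

The regularization terms shrink the biases of users and items with few ratings toward zero, which is why a model this simple can generalise reasonably well.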

Let's create a DataFrame from the prediction results.

In [27]:
prediction_df = pd.DataFrame(predictions_als)
prediction_df.head()

,uid,iid,r_ui,est,details
0,177,1411,5.0,3.939666,{'was_impossible': False}
1,226,88,3.0,3.14104,{'was_impossible': False}
2,465,1240,4.0,4.2864,{'was_impossible': False}
3,610,156726,4.5,3.345206,{'was_impossible': False}
4,307,5500,4.0,3.198321,{'was_impossible': False}


Let's select the 10 worst and 10 best predictions on the test set, ranked by absolute error.

In [28]:
worst_10 = abs(prediction_df['r_ui'] - prediction_df['est']).sort_values(ascending=False).head(10)
top_10 = abs(prediction_df['r_ui'] - prediction_df['est']).sort_values(ascending=True).head(10)
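
An alternative sketch of the same selection keeps the absolute error as a column, so the error is visible next to each row. The small `toy_pred` frame and the `abs_err` column name are made up for illustration; in this notebook the same steps would apply to `prediction_df`.

```python
import pandas as pd

# Illustrative predictions only
toy_pred = pd.DataFrame({
    'uid':  [1, 1, 2, 2, 3],
    'iid':  [10, 20, 10, 30, 20],
    'r_ui': [0.5, 4.0, 5.0, 3.0, 4.5],
    'est':  [4.2, 3.9, 5.0, 3.6, 3.0],
})

# Absolute difference between the actual rating and the estimate
toy_pred['abs_err'] = (toy_pred['r_ui'] - toy_pred['est']).abs()

worst = toy_pred.nlargest(2, 'abs_err')   # largest errors first
best = toy_pred.nsmallest(2, 'abs_err')   # smallest errors first
print(worst)
print(best)
```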

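Worst 10 - the estimates were way off: in every case the user gave a very low rating (0.5 or 1.0) while the model predicted above 4.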

In [29]:
prediction_df.loc[worst_10.index]

,uid,iid,r_ui,est,details
14400,543,213,0.5,4.665999,{'was_impossible': False}
8592,594,7982,0.5,4.393276,{'was_impossible': False}
888,125,4973,0.5,4.376447,{'was_impossible': False}
1412,413,1198,1.0,4.757497,{'was_impossible': False}
17506,594,799,0.5,4.241575,{'was_impossible': False}
15000,594,5909,0.5,4.211212,{'was_impossible': False}
17018,258,122886,0.5,4.155418,{'was_impossible': False}
1610,495,106489,0.5,4.147496,{'was_impossible': False}
15240,543,59900,0.5,4.108457,{'was_impossible': False}
15495,594,6140,0.5,4.091105,{'was_impossible': False}


Best 10 - the estimates matched the actual ratings exactly (all are 5.0).

In [30]:
prediction_df.loc[top_10.index]

,uid,iid,r_ui,est,details
16461,337,318,5.0,5.0,{'was_impossible': False}
12017,122,50,5.0,5.0,{'was_impossible': False}
13828,452,1197,5.0,5.0,{'was_impossible': False}
16753,452,356,5.0,5.0,{'was_impossible': False}
1288,543,2571,5.0,5.0,{'was_impossible': False}
18021,452,541,5.0,5.0,{'was_impossible': False}
12456,452,3147,5.0,5.0,{'was_impossible': False}
301,475,260,5.0,5.0,{'was_impossible': False}
15685,543,7361,5.0,5.0,{'was_impossible': False}
18899,452,1221,5.0,5.0,{'was_impossible': False}
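
One likely reason these estimates are exactly 5.0 is that Surprise clips estimates to the rating scale by default (`clip=True` in `predict()`), so a raw estimate above 5 is reported as 5.0. This is an interpretation rather than something verified in this notebook. The helper below is a hypothetical illustration of the clipping step, using the 0.5 to 5.0 scale of this dataset.

```python
def clip_to_scale(raw_estimate, low=0.5, high=5.0):
    """Clamp a raw estimate to the rating scale (hypothetical helper)."""
    return min(high, max(low, raw_estimate))

# Illustrative raw estimates: above the scale, inside it, below it
for raw in (5.37, 3.14, 0.2):
    print(raw, '->', clip_to_scale(raw))
```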


We can also use an individual uid and iid to check a single prediction result. Let's select one of the entries in the test set.

In [31]:
test_10 = testset[0:10]

In [32]:
# predict with the model fitted on the training set and compare with the actual rating in the test set
uid = test_10[1][0]   # raw user id (as in the ratings file).
iid = test_10[1][1]   # raw item id (as in the ratings file).

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=test_10[1][2], verbose=True)

user: 226        item: 88         r_ui = 3.00   est = 3.14   {'was_impossible': False}


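We will then recommend the top 5 movies by predicted rating for any selected user. This time, we will merge the movie title data with the prediction DataFrame so that titles can be shown. Note that the candidates are limited to the movies that appear for that user in the test set.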

In [33]:
item_names = pd.read_csv('https://raw.githubusercontent.com/wheremagichappens/an.dy/master/data612/ml-100k/movies.csv')
item_names = item_names.iloc[:,[0,1]]
item_names.columns = ['iid', 'title']
prediction_df = prediction_df.merge(item_names, on = 'iid', how = 'inner')

In [34]:
def recommender_user(uid):
    top_5 = prediction_df[prediction_df.uid == uid].sort_values(by = 'est', ascending = False).head(5)
    for i in range(len(top_5)):
        print('top 5 movies for uid: {0} are iid: {1} - {2} with predicted ratings of {3}'.format(uid, top_5.iid.values[i], top_5.title.values[i], top_5.est.values[i]))

recommender_user(uid)

top 5 movies for uid: 226 are iid: 2959 - Fight Club (1999) with predicted ratings of 4.201579916232384
top 5 movies for uid: 226 are iid: 1196 - Star Wars: Episode V - The Empire Strikes Back (1980) with predicted ratings of 4.1738171600491665
top 5 movies for uid: 226 are iid: 48516 - Departed, The (2006) with predicted ratings of 4.145885411293636
top 5 movies for uid: 226 are iid: 2571 - Matrix, The (1999) with predicted ratings of 4.1291402074059596
top 5 movies for uid: 226 are iid: 2398 - Miracle on 34th Street (1947) with predicted ratings of 4.11552459093689
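
The same top-N idea can be applied to all users at once by grouping predictions per user and keeping the N highest estimates. The sketch below is plain Python. The `Prediction` tuple is a simplified stand-in for the objects in `predictions_als`, and the function name and sample values are illustrative.

```python
import heapq
from collections import defaultdict, namedtuple

# Simplified stand-in for Surprise's prediction objects (illustrative)
Prediction = namedtuple('Prediction', ['uid', 'iid', 'r_ui', 'est'])

def top_n_per_user(predictions, n=5):
    """Return {uid: [(iid, est), ...]} with the n highest estimates per user."""
    per_user = defaultdict(list)
    for p in predictions:
        per_user[p.uid].append((p.iid, p.est))
    return {uid: heapq.nlargest(n, items, key=lambda pair: pair[1])
            for uid, items in per_user.items()}

# Illustrative predictions only
toy = [
    Prediction(1, 10, 4.0, 4.2),
    Prediction(1, 20, 3.0, 3.1),
    Prediction(1, 30, 5.0, 4.6),
    Prediction(2, 10, 2.0, 2.9),
]
top = top_n_per_user(toy, n=2)
print(top)
```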


# Conclusion - Evaluation

The RMSE for SVD is only slightly higher than BaselineOnly's (0.8729 against 0.8721 in 5-fold cross-validation). Thus, we cannot really say that one should always prefer ALS to SVD; depending on the use case, SVD may still be the more suitable choice.