# data612 - Group Project 3 : Recommender System (Matrix Factorization)
# date: 2019-06-25
# by: Sang Yoon (Andy) Hwang, Santosh Cheruku, Anthony Munoz

The purpose of this project is to find which algorithm gives the lowest RMSE for rating prediction: SVD, NMF, or ALS (through Surprise's BaselineOnly).
We will then use the best algorithm to predict user ratings and recommend items.

# Data Preparation

We are going to use the MovieLens dataset with about 100k ratings.

In [9]:
import pandas as pd
import surprise
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
from surprise import Dataset
from surprise import KNNBasic
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split

We use ml-latest-small from https://grouplens.org/datasets/movielens/.

In [10]:
df = pd.read_csv('https://raw.githubusercontent.com/wheremagichappens/an.dy/master/data612/ml-100k/ratings.csv')

In [11]:
df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


To make the later steps easier, we rename the columns.

In [12]:
df.columns = ['user','item','rating','timestamp']

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
user         100836 non-null int64
item         100836 non-null int64
rating       100836 non-null float64
timestamp    100836 non-null int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


# Modelling

We will compare the algorithms by their 5-fold cross-validated RMSE; the lowest RMSE wins.
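As a reminder of what is being compared, RMSE is the square root of the mean squared difference between true and estimated ratings. The sketch below is a minimal pure-Python version; the function name `rmse` and the toy ratings are illustrative assumptions, not part of the notebook's run.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    if not pairs:
        raise ValueError("need at least one prediction")
    squared = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical toy ratings, not taken from the MovieLens data.
toy_pairs = [(4.0, 3.5), (2.0, 2.5), (5.0, 4.0)]
print(round(rmse(toy_pairs), 4))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.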

In [14]:
reader = Reader(rating_scale=(df.rating.min(), df.rating.max()))
data = Dataset.load_from_df(df[['user', 'item', 'rating']], reader)

Let's split the data into 5 folds.

In [15]:
data.split(n_folds=5)
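To make the fold idea concrete, here is a rough sketch of a 5-fold split in plain Python. It only illustrates the principle (shuffle once, then let every rating appear in exactly one test fold); the helper name `k_fold_indices` and its parameters are assumptions for illustration and do not reproduce Surprise's internals.

```python
import random

def k_fold_indices(n_items, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) lists; each index is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_items, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx
        start = stop

# Hypothetical example with 12 ratings.
folds = list(k_fold_indices(n_items=12, n_folds=5))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```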

# SVD

In [16]:
# Non-deprecated equivalent of the evaluate() call in the next cell:
# result = cross_validate(surprise.SVD(), data, measures=['RMSE'], cv=5, verbose=False)
# result['test_rmse']

In [17]:
algo = surprise.SVD()
surprise.evaluate(algo, data, measures=['RMSE'])


The evaluate() method is deprecated. Please use model_selection.cross_validate() instead.


Using data.split() or using load_from_folds() without using a CV iterator is now deprecated. 



Evaluating RMSE of algorithm SVD.

------------
Fold 1
RMSE: 0.8700
------------
Fold 2
RMSE: 0.8763
------------
Fold 3
RMSE: 0.8759
------------
Fold 4
RMSE: 0.8769
------------
Fold 5
RMSE: 0.8683
------------
------------
Mean RMSE: 0.8735
------------
------------


CaseInsensitiveDefaultDict(list,
                           {'rmse': [0.870026239585584,
                             0.8762645843403616,
                             0.8759016127337722,
                             0.876948704668583,
                             0.8682730515796463]})

# SVDpp

In [34]:
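# The run was interrupted manually during fold 1 (see the KeyboardInterrupt below),
# so SVDpp is left out of the comparison.
# algo = surprise.SVDpp()
# surprise.evaluate(algo, data, measures=['RMSE'])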

Evaluating RMSE of algorithm SVDpp.

------------
Fold 1


KeyboardInterrupt: 

# Non-negative Matrix Factorization

In [19]:
algo = surprise.NMF()
surprise.evaluate(algo, data, measures=['RMSE'])

Evaluating RMSE of algorithm NMF.

------------
Fold 1
RMSE: 0.9189
------------
Fold 2
RMSE: 0.9225
------------
Fold 3
RMSE: 0.9205
------------
Fold 4
RMSE: 0.9206
------------
Fold 5
RMSE: 0.9196
------------
------------
Mean RMSE: 0.9204
------------
------------


CaseInsensitiveDefaultDict(list,
                           {'rmse': [0.9189394197993441,
                             0.9224562058196945,
                             0.9204832703038095,
                             0.9205707476542334,
                             0.9195520865463978]})

# BaselineOnly

In [20]:
algo = surprise.BaselineOnly()
surprise.evaluate(algo, data, measures=['RMSE'])

Evaluating RMSE of algorithm BaselineOnly.

------------
Fold 1
Estimating biases using als...
RMSE: 0.8692
------------
Fold 2
Estimating biases using als...
RMSE: 0.8744
------------
Fold 3
Estimating biases using als...
RMSE: 0.8730
------------
Fold 4
Estimating biases using als...
RMSE: 0.8753
------------
Fold 5
Estimating biases using als...
RMSE: 0.8697
------------
------------
Mean RMSE: 0.8723
------------
------------


CaseInsensitiveDefaultDict(list,
                           {'rmse': [0.8691502017928444,
                             0.8744398196375575,
                             0.8729733142669314,
                             0.8752502587316723,
                             0.8696910682816448]})

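Surprisingly, BaselineOnly is the winner, with a mean RMSE of 0.8723 against 0.8735 for SVD and 0.9204 for NMF. BaselineOnly predicts a rating from the global mean plus a user bias and an item bias, and the output above shows those biases being estimated with ALS, so this is the ALS model we will use. We will now fit it on a training set and measure its accuracy on a held-out test set.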

# BaselineOnly - ALS

We will use an 80/20 train/test split to fit and evaluate the model.

In [21]:
trainset, testset = train_test_split(data, test_size=.20)

We set the baseline estimation method of BaselineOnly to ALS.

In [25]:
# Switching 'method' to 'sgd' gave the same output in our run; we are not sure why.
bsl_options = {'method': 'als'}
algo = surprise.BaselineOnly(bsl_options=bsl_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_als = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions_als)

Estimating biases using als...
RMSE: 0.8634


0.8634267017490722
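To show roughly what "estimating biases using als" means, the sketch below fits a baseline of the form `mu + b_u + b_i` by alternating between item biases and user biases. It is an illustrative pure-Python version, not Surprise's implementation; the function names are hypothetical, and the defaults (10 epochs, `reg_u=15`, `reg_i=10`) are assumed here to mirror Surprise's documented ALS defaults.

```python
from collections import defaultdict

def fit_baseline_als(ratings, n_epochs=10, reg_u=15.0, reg_i=10.0):
    """Estimate mu, user biases and item biases by alternating updates.

    ratings: list of (user, item, rating) tuples.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u = defaultdict(float)
    b_i = defaultdict(float)
    by_user = defaultdict(list)
    by_item = defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    for _ in range(n_epochs):
        # Hold user biases fixed and solve for each item bias.
        for i, rows in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rows) / (reg_i + len(rows))
        # Hold item biases fixed and solve for each user bias.
        for u, rows in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rows) / (reg_u + len(rows))
    return mu, dict(b_u), dict(b_i)

def predict_baseline(mu, b_u, b_i, user, item):
    """Unknown users or items contribute a bias of 0."""
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)

# Hypothetical toy ratings, not taken from the MovieLens data.
toy_ratings = [("a", "x", 5.0), ("a", "y", 4.0), ("b", "x", 3.0), ("b", "y", 1.0)]
mu, b_u, b_i = fit_baseline_als(toy_ratings)
print(predict_baseline(mu, b_u, b_i, "a", "x"))
```

The regularization terms shrink the biases of users and items with few ratings toward zero, which is one plausible reason such a simple model holds up well on sparse data.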

Let's build a DataFrame from the prediction results.

In [26]:
prediction_df = pd.DataFrame(predictions_als)
prediction_df.head()

Unnamed: 0,uid,iid,r_ui,est,details
0,160,1375,2.0,2.673909,{u'was_impossible': False}
1,480,27879,4.0,3.413758,{u'was_impossible': False}
2,274,60072,3.5,2.994028,{u'was_impossible': False}
3,51,965,5.0,3.902123,{u'was_impossible': False}
4,448,2380,2.0,2.503108,{u'was_impossible': False}


Let's select the 10 worst and 10 best predictions on the test set, ranked by absolute error.

In [27]:
# Absolute error between the true rating and the estimate, computed once
abs_err = (prediction_df['r_ui'] - prediction_df['est']).abs()
worst_10 = abs_err.sort_values(ascending=False).head(10)
top_10 = abs_err.sort_values(ascending=True).head(10)

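Worst 10: the estimates are off by roughly 3.5 to 4 stars. In every case the user gave 0.5 or 1.0 to a movie with a high estimated rating.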

In [28]:
prediction_df.iloc[worst_10.index]

Unnamed: 0,uid,iid,r_ui,est,details
8602,256,5618,0.5,4.56,{u'was_impossible': False}
8920,573,3996,0.5,4.436986,{u'was_impossible': False}
4241,413,1198,1.0,4.724251,{u'was_impossible': False}
19537,393,778,0.5,4.187817,{u'was_impossible': False}
14547,239,48394,0.5,4.141754,{u'was_impossible': False}
16609,594,799,0.5,4.123658,{u'was_impossible': False}
12762,393,589,0.5,4.101478,{u'was_impossible': False}
11468,393,74458,0.5,4.090345,{u'was_impossible': False}
12213,419,1073,0.5,4.04954,{u'was_impossible': False}
9637,111,593,0.5,4.018095,{u'was_impossible': False}


Best 10: the estimates match the true ratings almost exactly.

In [29]:
prediction_df.iloc[top_10.index]

Unnamed: 0,uid,iid,r_ui,est,details
987,122,858,5.0,5.0,{u'was_impossible': False}
4873,169,318,5.0,5.0,{u'was_impossible': False}
7560,122,1197,5.0,5.0,{u'was_impossible': False}
14794,523,318,5.0,5.0,{u'was_impossible': False}
14777,171,318,5.0,5.0,{u'was_impossible': False}
6045,560,4993,4.0,3.999874,{u'was_impossible': False}
5839,20,4027,4.0,3.99984,{u'was_impossible': False}
18361,599,2968,3.0,2.999806,{u'was_impossible': False}
17530,610,91535,3.5,3.499727,{u'was_impossible': False}
19174,503,3545,3.0,3.000309,{u'was_impossible': False}
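Several estimates above are exactly 5.0. A likely explanation is that Surprise clips estimates to the rating scale by default, so any raw estimate above the maximum is reported as the maximum. A minimal sketch of that clipping, assuming the 0.5 to 5.0 scale seen in this data (the helper name `clip_rating` is hypothetical):

```python
def clip_rating(estimate, low=0.5, high=5.0):
    """Force an estimate into the closed interval [low, high]."""
    return min(max(estimate, low), high)

# Hypothetical raw estimates, not taken from the notebook's run.
for raw in (5.3, 3.2, 0.1):
    print(raw, "->", clip_rating(raw))
```

This means a "perfect" 5.0 prediction may hide a raw estimate that overshot the scale.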


We can also pass an individual uid and iid to check a single prediction result. Let's pick one of the entries in the test set.

In [30]:
test_10 = testset[0:10]

In [31]:
# Predict with the model fitted on the training set and compare with the test-set rating
uid = test_10[1][0]   # raw user id (as in the ratings file).
iid = test_10[1][1]   # raw item id (as in the ratings file).

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=test_10[1][2], verbose=True)

user: 480        item: 27879      r_ui = 4.00   est = 3.41   {u'was_impossible': False}


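We will then recommend the top 5 movies by predicted rating for a selected user. To show titles, we merge the movie title data set into the prediction results. Note that prediction_df only contains test-set pairs, so the candidates are movies the user actually rated but that were held out from training.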

In [32]:
item_names = pd.read_csv('https://raw.githubusercontent.com/wheremagichappens/an.dy/master/data612/ml-100k/movies.csv')
item_names = item_names.iloc[:,[0,1]]
item_names.columns = ['iid', 'title']
prediction_df = prediction_df.merge(item_names, on = 'iid', how = 'inner')

In [33]:
def recommender_user(uid):
    top_5 = prediction_df[prediction_df.uid == uid].sort_values(by = 'est', ascending = False).head(5)
    for i in range(len(top_5)):
        print('top 5 movies for uid: {0} are iid: {1} - {2} with predicted ratings of {3}'.format(uid, top_5.iid.values[i], top_5.title.values[i], top_5.est.values[i]))

recommender_user(uid)

top 5 movies for uid: 480 are iid: 356 - Forrest Gump (1994) with predicted ratings of 3.91209598816
top 5 movies for uid: 480 are iid: 6016 - City of God (Cidade de Deus) (2002) with predicted ratings of 3.78791296156
top 5 movies for uid: 480 are iid: 6874 - Kill Bill: Vol. 1 (2003) with predicted ratings of 3.77721872062
top 5 movies for uid: 480 are iid: 1247 - Graduate, The (1967) with predicted ratings of 3.74832719893
top 5 movies for uid: 480 are iid: 2324 - Life Is Beautiful (La Vita è bella) (1997) with predicted ratings of 3.74437344082
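The same top-N selection can be done for every user at once without a DataFrame. The sketch below groups predictions by user and keeps the highest estimates; it assumes predictions are plain `(uid, iid, r_ui, est)` tuples, and both the helper name `get_top_n` and the toy data are illustrative.

```python
from collections import defaultdict

def get_top_n(predictions, n=5):
    """Map each user id to its n highest (item id, estimate) pairs."""
    per_user = defaultdict(list)
    for uid, iid, _true_rating, est in predictions:
        per_user[uid].append((iid, est))
    top_n = {}
    for uid, scored in per_user.items():
        scored.sort(key=lambda pair: pair[1], reverse=True)
        top_n[uid] = scored[:n]
    return top_n

# Hypothetical predictions as (uid, iid, r_ui, est); not from the notebook's run.
toy_predictions = [
    (1, 10, 4.0, 3.9),
    (1, 11, 3.0, 4.2),
    (1, 12, 5.0, 2.7),
    (2, 10, 2.0, 3.1),
]
print(get_top_n(toy_predictions, n=2))
```

To recommend movies the user has not seen, the predictions would have to cover unrated user and item pairs rather than the held-out test set.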


# Conclusion - Evaluation

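The mean RMSE for SVD (0.8735) is only slightly higher than BaselineOnly's (0.8723), and the gap is smaller than the variation between folds, while NMF is clearly behind at 0.9204. Thus, we cannot say that one should always prefer ALS to SVD. All models here ran with default parameters, and SVD can still be the better choice when it suits the use case.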
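To put the gap in perspective, the per-fold RMSE values reported above can be summarized with the standard library. This is a small sketch using the fold scores printed earlier in this notebook; the variable names are illustrative.

```python
from statistics import mean, stdev

# Per-fold RMSE values copied from the evaluation outputs above.
svd_folds = [0.870026239585584, 0.8762645843403616, 0.8759016127337722,
             0.876948704668583, 0.8682730515796463]
baseline_folds = [0.8691502017928444, 0.8744398196375575, 0.8729733142669314,
                  0.8752502587316723, 0.8696910682816448]

svd_mean = mean(svd_folds)
baseline_mean = mean(baseline_folds)
gap = svd_mean - baseline_mean
svd_sd = stdev(svd_folds)

print("SVD mean RMSE:      %.4f" % svd_mean)
print("Baseline mean RMSE: %.4f" % baseline_mean)
print("Gap: %.4f, SVD fold std dev: %.4f" % (gap, svd_sd))
```

The difference between the two means is smaller than the standard deviation across SVD's folds, which supports treating the two models as roughly tied on this data.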