# Part I
# data612 - Group Project 4 : Recommender System (Accuracy and Beyond)
# date: 2019-07-02
# by: Sang Yoon (Andy) Hwang, Santosh Cheruku, Anthony Munoz, Neil Hwang

The purpose of this project is to show which algorithm predicts ratings best in terms of accuracy measurements: SVD, KNNBaseline, NMF or ALS (via BaselineOnly). The project has two parts: Part I (Python, this notebook) and Part II (R), available at http://rpubs.com/neilhwang/project4.

# Data Preparation

We are going to use the 100k ratings dataset from MovieLens.

In [1]:
import pandas as pd
import numpy as np
import surprise
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
from surprise import Dataset
from surprise import KNNBasic
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split

We will use ml-latest-small from https://grouplens.org/datasets/movielens/.

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/wheremagichappens/an.dy/master/data612/ml-100k/ratings.csv')

In [3]:
df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


Just to make our lives a little bit easier, we will rename the columns.

In [4]:
df.columns = ['user','item','rating','timestamp']

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
user         100836 non-null int64
item         100836 non-null int64
rating       100836 non-null float64
timestamp    100836 non-null int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


# Modelling

In [6]:
reader = Reader(rating_scale=(df.rating.min(), df.rating.max()))
data = Dataset.load_from_df(df[['user', 'item', 'rating']], reader)

Let's split the data into 5 folds.

In [7]:
data.split(n_folds=5)
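
As a rough illustration of what a 5-fold split means, the sketch below partitions row indices so that every rating lands in exactly one test fold. It is a plain-Python sketch of the idea only, not Surprise's internal implementation; the function name `k_fold_indices` and the `seed` parameter are hypothetical.

```python
import random

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        start += size
        yield train_idx, test_idx

# 100836 is the number of ratings reported by df.info() above.
fold_sizes = [len(test) for _, test in k_fold_indices(100836, n_folds=5)]
print(fold_sizes)
```

Each model is then trained on four folds and scored on the remaining one, which is why five RMSE values appear per algorithm.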

# SVD

In [8]:
algo = surprise.SVD()
rmse_svd = surprise.evaluate(algo, data, measures=['RMSE'])


The evaluate() method is deprecated. Please use model_selection.cross_validate() instead.


Using data.split() or using load_from_folds() without using a CV iterator is now deprecated. 



Evaluating RMSE of algorithm SVD.

------------
Fold 1
RMSE: 0.8681
------------
Fold 2
RMSE: 0.8822
------------
Fold 3
RMSE: 0.8709
------------
Fold 4
RMSE: 0.8704
------------
Fold 5
RMSE: 0.8803
------------
------------
Mean RMSE: 0.8744
------------
------------


In [9]:
# result = cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=False)
# result['test_rmse'] 

# KNNBaseline

In [10]:
algo = surprise.KNNBaseline()
rmse_knn = surprise.evaluate(algo, data, measures=['RMSE'])

Evaluating RMSE of algorithm KNNBaseline.

------------
Fold 1
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8707
------------
Fold 2
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8821
------------
Fold 3
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8675
------------
Fold 4
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8719
------------
Fold 5
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8767
------------
------------
Mean RMSE: 0.8738
------------
------------


# Non-negative Matrix Factorization

In [11]:
algo = surprise.NMF()
rmse_nmf = surprise.evaluate(algo, data, measures=['RMSE'])

Evaluating RMSE of algorithm NMF.

------------
Fold 1
RMSE: 0.9149
------------
Fold 2
RMSE: 0.9275
------------
Fold 3
RMSE: 0.9172
------------
Fold 4
RMSE: 0.9145
------------
Fold 5
RMSE: 0.9287
------------
------------
Mean RMSE: 0.9206
------------
------------


In [12]:
#result = cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=False)
#result['test_rmse'] 

# BaselineOnly

In [13]:
algo = surprise.BaselineOnly()
rmse_bo = surprise.evaluate(algo, data, measures=['RMSE'])

Evaluating RMSE of algorithm BaselineOnly.

------------
Fold 1
Estimating biases using als...
RMSE: 0.8682
------------
Fold 2
Estimating biases using als...
RMSE: 0.8793
------------
Fold 3
Estimating biases using als...
RMSE: 0.8691
------------
Fold 4
Estimating biases using als...
RMSE: 0.8674
------------
Fold 5
Estimating biases using als...
RMSE: 0.8764
------------
------------
Mean RMSE: 0.8721
------------
------------


In [14]:
#result = cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=False)
#result['test_rmse'] 

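Surprisingly, BaselineOnly is the winner in terms of RMSE (mean 0.8721), narrowly ahead of KNNBaseline (0.8738) and SVD (0.8744). NMF is clearly worse (0.9206).

The top 3 models by RMSE are therefore BaselineOnly, KNNBaseline and SVD. We will only consider these 3 models when computing the train/test RMSE and the classification reports used for the recall/precision/F1 calculation.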
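
The comparison can be reproduced from the printed outputs alone. The sketch below re-derives the SVD mean from its five fold values and ranks the four algorithms by mean RMSE; all numbers are copied from the outputs above, and the variable names are only illustrative.

```python
from statistics import mean

# Per-fold RMSE values copied from the SVD output above.
svd_folds = [0.8681, 0.8822, 0.8709, 0.8704, 0.8803]
svd_mean = round(mean(svd_folds), 4)

# Mean RMSE values as reported in the outputs above.
mean_rmse = {
    "SVD": 0.8744,
    "KNNBaseline": 0.8738,
    "NMF": 0.9206,
    "BaselineOnly": 0.8721,
}

# Lower RMSE is better, so sort ascending and keep the first three.
ranking = sorted(mean_rmse, key=mean_rmse.get)
top_3 = ranking[:3]
print(svd_mean, top_3)
```

The gap between the top three is under 0.003, so the ordering among them could plausibly change with a different random split.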

# DataFrame for further calculation

# BaselineOnly - ALS

We will perform an 80/20 train/test split to fit each model.

In [15]:
trainset, testset = train_test_split(data, test_size=.20)

We will configure the ALS method for BaselineOnly.

In [16]:
bsl_options = {'method': 'als'}
algo = surprise.BaselineOnly(bsl_options=bsl_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_als = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions_als)

Estimating biases using als...
RMSE: 0.8715


0.8714877722739226
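
To make the baseline model more concrete, here is a minimal plain-Python sketch of a baseline estimate of the form mu + b_u + b_i, with the biases fitted by alternating least squares. It assumes the regularized update described in the Surprise documentation; the default values used here (`n_epochs=10`, `reg_u=15`, `reg_i=10`), the clipping to the rating scale, and the toy ratings are assumptions for illustration, and this is not the library's actual code.

```python
from collections import defaultdict

def fit_baseline_als(ratings, n_epochs=10, reg_u=15.0, reg_i=10.0):
    """ratings: list of (user, item, rating). Returns (mu, b_u, b_i)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    b_u, b_i = defaultdict(float), defaultdict(float)
    for _ in range(n_epochs):
        # Fix user biases and solve for item biases, then the reverse.
        for i, rows in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rows) / (reg_i + len(rows))
        for u, rows in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rows) / (reg_u + len(rows))
    return mu, b_u, b_i

def predict(mu, b_u, b_i, user, item, scale=(0.5, 5.0)):
    """Baseline estimate, clipped to the rating scale (unknown ids get bias 0)."""
    est = mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)
    return min(max(est, scale[0]), scale[1])

# Hypothetical toy ratings, not taken from the MovieLens data.
toy = [("a", "x", 5.0), ("a", "y", 4.0), ("b", "x", 4.0),
       ("b", "y", 2.0), ("c", "x", 4.5)]
mu, b_u, b_i = fit_baseline_als(toy)
print(predict(mu, b_u, b_i, "c", "y"))
```

Because the model has only one bias per user and per item, it cannot capture user-item interactions, which makes its competitive RMSE here all the more notable.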

# KNNBaseline - pearson_baseline: item-item

In [17]:
# We'll use KNNBaseline with pearson_baseline (item_based)
sim_options = {'name': 'pearson_baseline',
               'user_based': False  # compute similarities between items
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_pear_b_2 = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions_pear_b_2)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 0.8533


0.8532720604669735

# SVD

In [18]:
algo = surprise.SVD()

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_svd = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions_svd)

RMSE: 0.8734


0.8733947532482592

Let's create a DataFrame from the prediction results.

In [19]:
prediction_df = pd.DataFrame(predictions_als)
prediction_df.head()

Unnamed: 0,uid,iid,r_ui,est,details
0,304,3114,5.0,4.160503,{'was_impossible': False}
1,282,1953,4.0,4.134578,{'was_impossible': False}
2,438,7376,3.0,3.271846,{'was_impossible': False}
3,275,593,5.0,4.491543,{'was_impossible': False}
4,245,581,3.0,3.0299,{'was_impossible': False}
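
Both the RMSE and any ranking of good and bad predictions derive from the per-row error r_ui - est. The sketch below computes them in plain Python on the five rows shown above; the helper name `rmse` is illustrative and the result applies to this five-row sample only, not to the full test set.

```python
from math import sqrt

# (uid, iid, r_ui, est) copied from the prediction_df.head() output above.
rows = [
    (304, 3114, 5.0, 4.160503),
    (282, 1953, 4.0, 4.134578),
    (438, 7376, 3.0, 3.271846),
    (275, 593, 5.0, 4.491543),
    (245, 581, 3.0, 3.0299),
]

def rmse(rows):
    """Root mean squared error between true ratings and estimates."""
    return sqrt(sum((r_ui - est) ** 2 for _, _, r_ui, est in rows) / len(rows))

# Rank rows by absolute error, largest first.
by_abs_error = sorted(rows, key=lambda row: abs(row[2] - row[3]), reverse=True)

print(rmse(rows))
print("largest error:", by_abs_error[0])
print("smallest error:", by_abs_error[-1])
```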


# Worst 10 - BaselineOnly - ALS

In [20]:
worst_10 = abs(prediction_df['r_ui'] - prediction_df['est']).sort_values(ascending=False).head(10)
top_10 = abs(prediction_df['r_ui'] - prediction_df['est']).sort_values(ascending=True).head(10)
prediction_df.iloc[worst_10.index]

Unnamed: 0,uid,iid,r_ui,est,details
2288,543,89904,0.5,4.699433,{'was_impossible': False}
7319,495,5952,0.5,4.464456,{'was_impossible': False}
8692,256,7099,0.5,4.360765,{'was_impossible': False}
17740,105,4027,0.5,4.353514,{'was_impossible': False}
10594,413,1198,1.0,4.817739,{'was_impossible': False}
7363,393,541,0.5,4.238366,{'was_impossible': False}
14241,393,778,0.5,4.206726,{'was_impossible': False}
7174,413,2858,1.0,4.687129,{'was_impossible': False}
7197,59,2571,1.0,4.656048,{'was_impossible': False}
6082,594,4902,0.5,4.155857,{'was_impossible': False}


# Best 10

In [21]:
prediction_df.iloc[top_10.index]

Unnamed: 0,uid,iid,r_ui,est,details
2267,413,318,5.0,5.0,{'was_impossible': False}
17421,43,356,5.0,5.0,{'was_impossible': False}
11265,452,1197,5.0,5.0,{'was_impossible': False}
19277,169,318,5.0,5.0,{'was_impossible': False}
17494,452,260,5.0,5.0,{'was_impossible': False}
4787,122,318,5.0,5.0,{'was_impossible': False}
5552,522,141,3.5,3.500105,{'was_impossible': False}
6367,474,2130,3.5,3.49987,{'was_impossible': False}
2546,382,4632,3.5,3.49966,{'was_impossible': False}
4282,380,6503,3.0,2.999546,{'was_impossible': False}


In [22]:
prediction_df_2 = pd.DataFrame(predictions_pear_b_2)
prediction_df_2.head()

Unnamed: 0,uid,iid,r_ui,est,details
0,304,3114,5.0,4.430537,"{'actual_k': 40, 'was_impossible': False}"
1,282,1953,4.0,4.10216,"{'actual_k': 40, 'was_impossible': False}"
2,438,7376,3.0,3.426573,"{'actual_k': 40, 'was_impossible': False}"
3,275,593,5.0,4.485454,"{'actual_k': 40, 'was_impossible': False}"
4,245,581,3.0,3.0299,"{'actual_k': 0, 'was_impossible': False}"


# Worst 10 - KNNBaseline (Pearson): item-item

In [23]:
worst_10 = abs(prediction_df_2['r_ui'] - prediction_df_2['est']).sort_values(ascending=False).head(10)
top_10 = abs(prediction_df_2['r_ui'] - prediction_df_2['est']).sort_values(ascending=True).head(10)
prediction_df_2.iloc[worst_10.index]

Unnamed: 0,uid,iid,r_ui,est,details
3373,89,110130,5.0,0.671736,"{'actual_k': 3, 'was_impossible': False}"
2288,543,89904,0.5,4.526596,"{'actual_k': 21, 'was_impossible': False}"
10594,413,1198,1.0,5.0,"{'actual_k': 30, 'was_impossible': False}"
5156,495,86911,0.5,4.421272,"{'actual_k': 40, 'was_impossible': False}"
7174,413,2858,1.0,4.911856,"{'actual_k': 28, 'was_impossible': False}"
14241,393,778,0.5,4.376813,"{'actual_k': 40, 'was_impossible': False}"
112,393,5902,0.5,4.356017,"{'actual_k': 40, 'was_impossible': False}"
6950,472,4226,1.0,4.838299,"{'actual_k': 15, 'was_impossible': False}"
17054,413,1246,1.0,4.822768,"{'actual_k': 23, 'was_impossible': False}"
7945,527,527,1.0,4.815377,"{'actual_k': 40, 'was_impossible': False}"


# Best 10

In [24]:
prediction_df_2.iloc[top_10.index]

Unnamed: 0,uid,iid,r_ui,est,details
9196,53,1256,5.0,5.0,"{'actual_k': 5, 'was_impossible': False}"
20088,53,4019,5.0,5.0,"{'actual_k': 4, 'was_impossible': False}"
7298,122,1136,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
1674,106,47099,5.0,5.0,"{'actual_k': 6, 'was_impossible': False}"
251,53,1125,5.0,5.0,"{'actual_k': 5, 'was_impossible': False}"
13324,515,318,5.0,5.0,"{'actual_k': 10, 'was_impossible': False}"
4579,53,203,5.0,5.0,"{'actual_k': 1, 'was_impossible': False}"
11265,452,1197,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
19277,169,318,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
20126,475,1198,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"


In [25]:
prediction_df_3 = pd.DataFrame(predictions_svd)
prediction_df_3.head()

Unnamed: 0,uid,iid,r_ui,est,details
0,304,3114,5.0,4.103298,{'was_impossible': False}
1,282,1953,4.0,4.120505,{'was_impossible': False}
2,438,7376,3.0,3.212172,{'was_impossible': False}
3,275,593,5.0,4.476227,{'was_impossible': False}
4,245,581,3.0,2.831506,{'was_impossible': False}


# Worst 10 - SVD

In [26]:
worst_10 = abs(prediction_df_3['r_ui'] - prediction_df_3['est']).sort_values(ascending=False).head(10)
top_10 = abs(prediction_df_3['r_ui'] - prediction_df_3['est']).sort_values(ascending=True).head(10)
prediction_df_3.iloc[worst_10.index]

Unnamed: 0,uid,iid,r_ui,est,details
2288,543,89904,0.5,4.886576,{'was_impossible': False}
8692,256,7099,0.5,4.526367,{'was_impossible': False}
17054,413,1246,1.0,5.0,{'was_impossible': False}
10594,413,1198,1.0,4.947029,{'was_impossible': False}
17740,105,4027,0.5,4.333372,{'was_impossible': False}
7174,413,2858,1.0,4.828784,{'was_impossible': False}
7319,495,5952,0.5,4.317683,{'was_impossible': False}
1837,154,86644,0.5,4.299732,{'was_impossible': False}
16395,258,122886,0.5,4.229622,{'was_impossible': False}
6075,580,1262,0.5,4.186956,{'was_impossible': False}


# Best 10

In [27]:
prediction_df_3.iloc[top_10.index]

Unnamed: 0,uid,iid,r_ui,est,details
10847,43,1084,5.0,5.0,{'was_impossible': False}
20059,246,2571,5.0,5.0,{'was_impossible': False}
8193,543,2571,5.0,5.0,{'was_impossible': False}
17562,523,168252,5.0,5.0,{'was_impossible': False}
3142,93,1704,5.0,5.0,{'was_impossible': False}
3277,1,1136,5.0,5.0,{'was_impossible': False}
7872,441,2571,5.0,5.0,{'was_impossible': False}
17421,43,356,5.0,5.0,{'was_impossible': False}
19127,1,1197,5.0,5.0,{'was_impossible': False}
17260,43,1,5.0,5.0,{'was_impossible': False}


# Evaluation - RMSE

In [28]:
rmse_models = [accuracy.rmse(predictions_als),accuracy.rmse(predictions_pear_b_2),accuracy.rmse(predictions_svd)]
rmse_df = pd.DataFrame(rmse_models)
rmse_df.index = ['BaselineOnly - ALS','KNNBaseline - pearson_baseline - Item based','SVD']
rmse_df.columns = ['RMSE - TRAIN/TEST']

RMSE: 0.8715
RMSE: 0.8533
RMSE: 0.8734


In [29]:
df.rating.hist()

<matplotlib.axes._subplots.AxesSubplot at 0x2291b821e10>

In [30]:
rmse_df

Unnamed: 0,RMSE - TRAIN/TEST
BaselineOnly - ALS,0.871488
KNNBaseline - pearson_baseline - Item based,0.853272
SVD,0.873395


# Evaluation - Precision/Recall

# BaselineOnly - ALS

In [31]:
y_true = np.round(prediction_df['r_ui'])
y_pred = np.round(prediction_df['est'])

In [32]:
# Note: true ratings of 0.5 become 0.0 because np.round rounds halves to the nearest even value
# (0.5 -> 0.0, 2.5 -> 2.0). The minimum of the rounded y_pred is 1.0, so class 0.0 is never predicted.
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

         0.0       0.00      0.00      0.00       288
         1.0       1.00      0.00      0.00       592
         2.0       0.46      0.12      0.18      3020
         3.0       0.28      0.62      0.39      4008
         4.0       0.60      0.63      0.61      9690
         5.0       0.57      0.08      0.14      2570

   micro avg       0.45      0.45      0.45     20168
   macro avg       0.49      0.24      0.22     20168
weighted avg       0.52      0.45      0.42     20168




Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
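
The class 0.0 in the report is a side effect of rounding half-star ratings. The sketch below shows the tie rule using Python's built-in `round()`, which rounds halves to the nearest even integer just as `np.round` does. The round-half-up variant is shown only as a possible alternative; it is an assumption and not what this notebook uses.

```python
import math

half_star_ratings = [0.5, 1.5, 2.5, 3.5, 4.5]

# Built-in round() uses round-half-to-even, the same tie rule as np.round.
half_to_even = [round(r) for r in half_star_ratings]

# Alternative (an assumption, not what the notebook does): round half up.
half_up = [math.floor(r + 0.5) for r in half_star_ratings]

print(half_to_even)  # [0, 2, 2, 4, 4]
print(half_up)       # [1, 2, 3, 4, 5]
```

With half-to-even rounding, 3.5 and 4.5 both map to 4, which helps explain why class 4.0 has by far the largest support.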



# KNNBaseline (Pearson): item-item

In [33]:
print(classification_report(np.round(prediction_df_2['r_ui']), np.round(prediction_df_2['est'])))

              precision    recall  f1-score   support

         0.0       0.00      0.00      0.00       288
         1.0       0.14      0.01      0.01       592
         2.0       0.42      0.10      0.17      3020
         3.0       0.28      0.58      0.38      4008
         4.0       0.61      0.66      0.63      9690
         5.0       0.59      0.16      0.25      2570

   micro avg       0.47      0.47      0.47     20168
   macro avg       0.34      0.25      0.24     20168
weighted avg       0.49      0.47      0.44     20168



# SVD

In [34]:
print(classification_report(np.round(prediction_df_3['r_ui']), np.round(prediction_df_3['est'])))

              precision    recall  f1-score   support

         0.0       0.00      0.00      0.00       288
         1.0       0.29      0.01      0.02       592
         2.0       0.45      0.15      0.23      3020
         3.0       0.28      0.60      0.38      4008
         4.0       0.60      0.61      0.60      9690
         5.0       0.53      0.14      0.22      2570

   micro avg       0.45      0.45      0.45     20168
   macro avg       0.36      0.25      0.24     20168
weighted avg       0.49      0.45      0.43     20168
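
The macro and weighted averages in these reports can be recomputed from the per-class rows. The sketch below does so for the SVD recall column; the numbers are copied from the report above, and because the per-class values are already rounded to two decimals, the recomputed averages only agree approximately. The helper names are illustrative.

```python
# (class, precision, recall, support) copied from the SVD report above.
svd_report = [
    (0.0, 0.00, 0.00, 288),
    (1.0, 0.29, 0.01, 592),
    (2.0, 0.45, 0.15, 3020),
    (3.0, 0.28, 0.60, 4008),
    (4.0, 0.60, 0.61, 9690),
    (5.0, 0.53, 0.14, 2570),
]

def macro_avg(values):
    """Unweighted mean: every class counts equally."""
    return sum(values) / len(values)

def weighted_avg(values, supports):
    """Mean weighted by the number of true samples in each class."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)

recalls = [row[2] for row in svd_report]
supports = [row[3] for row in svd_report]
print(macro_avg(recalls), weighted_avg(recalls, supports))
```

The weighted average is dominated by classes 3.0 and 4.0, which together account for roughly two thirds of the test ratings.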



# Designing an online evaluation recommender system

If online evaluation were possible, we could experiment with trends that change dynamically in real time. For example, if a movie is being rated highly in the user's geographical location, it could be surfaced as one of the user's top recommended movies.

When we design an online evaluation, we should try to include quality factors such as variance and serendipity without compromising accuracy. The user would not be impressed with the recommended choices if we miss either. The recommender system should also react to dynamically changing statistics: for instance, if a user is not engaging with the proposed choices, the system should adapt by increasing the variance of the choices it recommends.

# Conclusion
Our first thought was that SVD would give us the best result, but it did not: the RMSE for SVD is slightly higher than BaselineOnly's. We cannot, however, strictly say that one should always prefer ALS-based BaselineOnly to SGD-based SVD. Depending on the case, SVD may still be the more suitable choice.

We then have to think about why SVD had higher RMSE. 

Our guess is that it has something to do with feature scaling, that is, how the ratings are scaled.
From the documentation of the dataset we used (http://files.grouplens.org/datasets/movielens/ml-latest-small-README.html) and from the rating histogram, we know that ratings come in half-star increments. On the other hand, the alternative dataset with 1 million ratings (http://files.grouplens.org/datasets/movielens/ml-1m-README.txt), on which we could not run SVD due to memory issues, contains whole-star ratings only.

According to the benchmark RMSE examples on http://surpriselib.com/, the RMSE for SVD is usually around 0.934 on the 100k dataset but only 0.873 on the 1m dataset.

Since SVD is trained with SGD (https://scikit-learn.org/stable/modules/sgd.html), it has several disadvantages: 1) SGD requires a number of hyperparameters, such as the regularization parameter and the number of iterations; 2) SGD is sensitive to feature scaling.
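
To show where those hyperparameters enter, here is a minimal plain-Python sketch of one SGD update for a biased matrix factorization of the form mu + b_u + b_i + q_i . p_u. It assumes the standard regularized squared-error update; the learning rate `lr=0.005` and regularization `reg=0.02` are assumed values, and the starting point below is hypothetical. It is not Surprise's actual implementation.

```python
def estimate(mu, b_u, b_i, p_u, q_i):
    """Predicted rating: global mean + biases + dot product of the factors."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """One SGD update for a single rating; returns updated (b_u, b_i, p_u, q_i)."""
    err = r_ui - estimate(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor updates use the old values of p_u and q_i.
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i

# Hypothetical starting point: 3.5 stands in for the global mean rating.
mu, b_u, b_i = 3.5, 0.0, 0.0
p_u, q_i = [0.1, -0.1], [0.1, 0.1]

before = abs(5.0 - estimate(mu, b_u, b_i, p_u, q_i))
for _ in range(20):  # repeated passes over the same rating
    b_u, b_i, p_u, q_i = sgd_step(5.0, mu, b_u, b_i, p_u, q_i)
after = abs(5.0 - estimate(mu, b_u, b_i, p_u, q_i))
print(before, after)
```

The size of each step scales with the error `err`, which is measured in rating units, so the same `lr` and `reg` can behave differently on differently scaled ratings.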

From the classification reports, the weighted-average recall (0.47) and F1 score (0.44) are highest for KNNBaseline (item-item), while BaselineOnly has the highest weighted-average precision (0.52 versus 0.49 for the other two). Together with its lowest train/test RMSE (0.8533), this supports KNNBaseline (pearson_baseline, item-item) as the most accurate model overall. Since we have a somewhat uneven class distribution, we chose to use precision/recall to measure the accuracy of each model.