# data612 - Group Project 2 : Recommender System - movielens
# date: 2019-06-18
# by: Sang Yoon (Andy) Hwang, Santosh Cheruku, Anthony Munoz

# Data Preparation

We are going to use the 100k ratings dataset from MovieLens.

In [240]:
import pandas as pd
import surprise
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
from surprise import Dataset
from surprise import KNNBasic
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split

From the Surprise source at https://surprise.readthedocs.io/en/v1.0.0/_modules/surprise/dataset.html, we know that u.data is the ratings file we need.

In [241]:
df = pd.read_table('C:/Users/ahwang/Desktop/Cuny/DATA612/movielens/ml-100k/u.data',header = None )

In [242]:
df.columns = ['user','item','rating','timestamp']

In [243]:
df.head()

Unnamed: 0,user,item,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [244]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
user         100000 non-null int64
item         100000 non-null int64
rating       100000 non-null int64
timestamp    100000 non-null int64
dtypes: int64(4)
memory usage: 3.1 MB


# EDA

# Movie ratings distribution - want to know common ratings

In [245]:
init_notebook_mode(connected=True)

data = df['rating'].value_counts().sort_index(ascending=False)
trace = go.Bar(x = data.index,
               text = ['{:.1f} %'.format(val) for val in (data.values / df.shape[0] * 100)],
               textposition = 'auto',
               textfont = dict(color = '#000000'),
               y = data.values,
               )

# Create layout
layout = dict(title = 'Distribution Of {} movie ratings'.format(round(df.shape[0],1)),
              xaxis = dict(title = 'Rating'),
              yaxis = dict(title = 'Count'))

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

In [246]:
# examine distribution statistics in numbers
df['rating'].describe().apply(lambda x: format(x, 'f'))

count    100000.000000
mean          3.529860
std           1.125674
min           1.000000
25%           3.000000
50%           4.000000
75%           4.000000
max           5.000000
Name: rating, dtype: object

Notice that the median rating is 4, so at least half of all ratings are 4 or 5, and the mean is about 3.53. Ratings skew positive. This suggests that users rate generously, or mostly rate movies they liked, rather than that most movies are objectively good.
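The share of positive ratings can be checked directly. Below is a minimal sketch in plain Python, using a short hypothetical list of ratings in place of `df['rating']`; with the real dataframe the same figure would come from `(df['rating'] >= 4).mean()`.

```python
from collections import Counter

# Hypothetical ratings, standing in for df['rating'].
ratings = [5, 4, 4, 3, 3, 4, 2, 5, 1, 4]

counts = Counter(ratings)
total = len(ratings)

# Percentage of ratings at each value, highest rating first.
share_by_rating = {r: 100 * counts[r] / total for r in sorted(counts, reverse=True)}

# Percentage of ratings that are 4 or 5.
share_positive = 100 * sum(1 for r in ratings if r >= 4) / total

print(share_by_rating)
print(share_positive)
```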

# Rating distribution by movie - want to know common count of ratings

In [247]:
# Number of ratings by title - clipped at 100
data = df.groupby('item')['rating'].count()

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'Ratings',
                     xbins = dict(start = 0,
                                  end = 100,
                                  size = 2))
# Create layout
layout = go.Layout(title = 'Distribution Of Number of Ratings By Movie',
                   xaxis = dict(title = 'Number of Ratings By Movie'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

In [248]:
df.groupby('item')['rating'].count().reset_index().sort_values('rating', ascending=False)[:10]

Unnamed: 0,item,rating
49,50,583
257,258,509
99,100,508
180,181,507
293,294,485
285,286,481
287,288,478
0,1,452
299,300,431
120,121,429


141 movies received only 1 rating, while the most-rated movie (item 50) received 583 ratings.
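The count of single-rating movies is not shown in the output above. Here is a minimal sketch of how such a count can be obtained, in plain Python with hypothetical (user, item) pairs in place of the rows of `df`; with the real dataframe the equivalent would be `(df.groupby('item')['rating'].count() == 1).sum()`.

```python
from collections import Counter

# Hypothetical (user, item) pairs standing in for the rows of df.
pairs = [(1, 10), (2, 10), (3, 10), (1, 20), (2, 30), (3, 30), (4, 40)]

# Number of ratings each item received.
ratings_per_item = Counter(item for _, item in pairs)

# Items that were rated exactly once.
single_rating_items = [item for item, n in ratings_per_item.items() if n == 1]

# The most-rated item and its rating count.
most_rated_item, most_rated_count = ratings_per_item.most_common(1)[0]

print(len(single_rating_items), most_rated_item, most_rated_count)
```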

# Modelling

We will find the best KNN model based on RMSE. 

In [249]:
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user', 'item', 'rating']], reader)

In [250]:
# Compare the four KNN variants with their default parameters.
# (sim_options configures the similarity measure; bsl_options configures the baseline estimates.)
df_algo = []
list_algo = [surprise.KNNBaseline(), surprise.KNNBasic(), surprise.KNNWithMeans(), surprise.KNNWithZScore()]
# Iterate over all algorithms
for algo in list_algo:
    # Perform 3-fold cross validation
    results = cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=False)

    # Average the metrics over the folds & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([type(algo).__name__], index=['Algorithm'])])
    df_algo.append(tmp)

pd.DataFrame(df_algo).set_index('Algorithm').sort_values('test_rmse')

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


Algorithm,test_rmse,fit_time,test_time
KNNBaseline,0.936804,0.752868,8.990795
KNNWithZScore,0.956754,0.61995,8.174964
KNNWithMeans,0.957017,0.526675,7.187005
KNNBasic,0.988416,0.508019,6.855855


KNNBaseline has the lowest cross-validated RMSE (0.9368), so we will use this model: we fit it on a training set and measure its accuracy on a held-out test set.
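To our understanding of the Surprise documentation, KNNBaseline starts from a baseline estimate for the (user, item) pair and adds a similarity-weighted average of how far each neighbor's rating deviates from that neighbor's own baseline. The sketch below illustrates that idea with hypothetical numbers; the function name and inputs are ours, and it leaves out details such as `min_k` and the selection of the k nearest neighbors.

```python
def knn_baseline_estimate(b_ui, neighbors):
    """Estimate a rating from a baseline and weighted neighbor deviations.

    b_ui      -- baseline estimate for the target (user, item) pair
    neighbors -- list of (similarity, neighbor_rating, neighbor_baseline)
    """
    numerator = sum(sim * (r - b) for sim, r, b in neighbors)
    denominator = sum(sim for sim, _, _ in neighbors)
    if denominator == 0:
        # No usable neighbors: fall back to the baseline alone.
        return b_ui
    return b_ui + numerator / denominator


# Hypothetical values: a baseline of 3.5 and two equally similar neighbors.
b_ui = 3.5
neighbors = [
    (0.5, 5.0, 4.0),  # rated 1.0 above their own baseline
    (0.5, 3.0, 3.0),  # rated exactly at their own baseline
]
estimate = knn_baseline_estimate(b_ui, neighbors)
print(estimate)
```

The baseline term is what separates this model from KNNBasic, which averages the neighbors' raw ratings.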

# KNNBaseline with cosine similarity - User_Based 

We will perform an 80/20 train/test split, fit the model on the training set, and evaluate it on the test set.

In [252]:
trainset, testset = train_test_split(data, test_size=.20)

# We'll use KNNBaseline with cosine similarity (user_based)
sim_options = {'name': 'cosine',
               'user_based': True  # compute similarities between users
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9375


0.937462186497259
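For reference, cosine similarity between two users is computed over the items both have rated. The following is a minimal sketch in plain Python, assuming each user is a dict mapping item id to rating; the users and ratings are hypothetical, and the function is our own illustration rather than Surprise's implementation.

```python
import math


def cosine_similarity(ratings_u, ratings_v):
    """Cosine similarity over the items rated by both users."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


# Hypothetical users: item id -> rating.
user_a = {242: 3, 302: 4, 377: 1}
user_b = {242: 3, 302: 4, 51: 5}   # agrees with user_a on both shared items
user_c = {242: 4, 302: 3}          # shares the same items, ratings swapped

sim_ab = cosine_similarity(user_a, user_b)
sim_ac = cosine_similarity(user_a, user_c)
print(sim_ab, sim_ac)
```

With `'user_based': False` the same computation applies to pairs of items, with each item represented by the ratings it received from users.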

Let's create a prediction dataframe so that we can inspect each prediction on the test set.

In [253]:
prediction_df = pd.DataFrame(predictions)
prediction_df.head()

Unnamed: 0,uid,iid,r_ui,est,details
0,151,181,5.0,4.309899,"{'actual_k': 40, 'was_impossible': False}"
1,407,729,4.0,3.271319,"{'actual_k': 40, 'was_impossible': False}"
2,457,120,2.0,2.674745,"{'actual_k': 40, 'was_impossible': False}"
3,406,23,4.0,4.101048,"{'actual_k': 40, 'was_impossible': False}"
4,480,527,4.0,3.957917,"{'actual_k': 40, 'was_impossible': False}"
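RMSE is the square root of the mean squared difference between `r_ui` and `est`. Here is a minimal sketch in plain Python, applied to the five rows shown above; because it covers only five predictions, the result will differ from the RMSE over the full test set.

```python
import math


def rmse(actual, estimated):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - e) ** 2 for a, e in zip(actual, estimated)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# r_ui and est columns from the five rows of prediction_df.head() above.
r_ui = [5.0, 4.0, 2.0, 4.0, 4.0]
est = [4.309899, 3.271319, 2.674745, 4.101048, 3.957917]

sample_rmse = rmse(r_ui, est)
print(sample_rmse)
```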


Let's select the 10 worst and 10 best predictions on the test set, ranked by absolute error.

In [254]:
worst_10 = abs(prediction_df['r_ui'] - prediction_df['est']).sort_values(ascending=False).head(10)
top_10 = abs(prediction_df['r_ui'] - prediction_df['est']).sort_values(ascending=True).head(10)

Worst 10: these estimates miss the true rating by roughly 3.4 to 4 points.

In [255]:
prediction_df.iloc[worst_10.index]

Unnamed: 0,uid,iid,r_ui,est,details
10709,405,575,5.0,1.033521,"{'actual_k': 35, 'was_impossible': False}"
6031,405,376,5.0,1.179005,"{'actual_k': 19, 'was_impossible': False}"
12439,887,1473,1.0,4.795042,"{'actual_k': 4, 'was_impossible': False}"
12310,312,265,1.0,4.588611,"{'actual_k': 40, 'was_impossible': False}"
16624,850,56,1.0,4.533574,"{'actual_k': 40, 'was_impossible': False}"
7933,405,38,5.0,1.47192,"{'actual_k': 40, 'was_impossible': False}"
11234,472,584,1.0,4.525237,"{'actual_k': 40, 'was_impossible': False}"
8864,436,132,1.0,4.436251,"{'actual_k': 40, 'was_impossible': False}"
13345,405,842,5.0,1.576311,"{'actual_k': 21, 'was_impossible': False}"
6599,777,127,1.0,4.422314,"{'actual_k': 40, 'was_impossible': False}"


Best 10: the estimate equals the true rating in every row. All ten sit at an end of the scale (1.0 or 5.0).

In [256]:
prediction_df.iloc[top_10.index]

Unnamed: 0,uid,iid,r_ui,est,details
1389,181,1360,1.0,1.0,"{'actual_k': 1, 'was_impossible': False}"
14753,907,520,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
5831,181,1259,1.0,1.0,"{'actual_k': 2, 'was_impossible': False}"
14957,118,134,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
15085,774,254,1.0,1.0,"{'actual_k': 40, 'was_impossible': False}"
5560,774,758,1.0,1.0,"{'actual_k': 15, 'was_impossible': False}"
15578,774,122,1.0,1.0,"{'actual_k': 40, 'was_impossible': False}"
10220,405,788,1.0,1.0,"{'actual_k': 2, 'was_impossible': False}"
15607,405,1239,1.0,1.0,"{'actual_k': 12, 'was_impossible': False}"
10248,295,483,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
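One plausible reason the exact matches all fall at 1.0 or 5.0 is clipping: as far as we know, Surprise's `predict` takes a `clip` argument that defaults to `True` and limits estimates to the rating scale. Below is a minimal sketch of that behavior with hypothetical raw estimates; `clip_to_scale` is our own helper, not part of Surprise.

```python
def clip_to_scale(estimate, low=1.0, high=5.0):
    """Limit an estimate to the rating scale [low, high]."""
    return min(max(estimate, low), high)


# Hypothetical raw estimates before clipping.
raw_estimates = [5.37, 0.62, 3.96]
clipped = [clip_to_scale(x) for x in raw_estimates]
print(clipped)
```

Under this assumption, any raw estimate above 5 for a movie the user rated 5 (or below 1 for a movie rated 1) shows up as a zero-error prediction.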


We can also use an individual uid and iid to check a single prediction. Let's pick one of the entries in the test set.

In [262]:
test_10 = testset[0:10]

In [266]:
# Predict with the fitted model and compare the result with the true rating in testset
# Each testset entry is (raw user id, raw item id, true rating); here 151, 181, 5.0
uid, iid, r_ui = test_10[0]

# get a prediction for a specific user and item.
pred = algo.predict(uid, iid, r_ui=r_ui, verbose=True)

user: 151        item: 181        r_ui = 5.00   est = 4.31   {'actual_k': 40, 'was_impossible': False}


Not bad, residual is about 0.69.

# KNNBaseline with cosine similarity - Item_Based 

In [267]:
trainset, testset = train_test_split(data, test_size=.20)

# We'll use KNNBaseline with cosine similarity (item_based)
sim_options = {'name': 'cosine',
               'user_based': False  # compute  similarities between items
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9389


0.9389258196945625

Just as we did for the user-based model, let's create prediction_df.

In [268]:
prediction_df = pd.DataFrame(predictions)
prediction_df.head()

Unnamed: 0,uid,iid,r_ui,est,details
0,240,313,5.0,4.579711,"{'actual_k': 18, 'was_impossible': False}"
1,666,11,4.0,3.443993,"{'actual_k': 40, 'was_impossible': False}"
2,92,947,4.0,3.220677,"{'actual_k': 40, 'was_impossible': False}"
3,385,631,3.0,3.884152,"{'actual_k': 40, 'was_impossible': False}"
4,405,1110,1.0,1.077511,"{'actual_k': 40, 'was_impossible': False}"


In [269]:
worst_10 = abs(prediction_df['r_ui'] - prediction_df['est']).sort_values(ascending=False).head(10)
top_10 = abs(prediction_df['r_ui'] - prediction_df['est']).sort_values(ascending=True).head(10)

Worst 10

In [270]:
prediction_df.iloc[worst_10.index]

Unnamed: 0,uid,iid,r_ui,est,details
17681,405,67,5.0,1.0,"{'actual_k': 40, 'was_impossible': False}"
17347,239,190,1.0,4.98939,"{'actual_k': 40, 'was_impossible': False}"
6401,239,64,1.0,4.987196,"{'actual_k': 40, 'was_impossible': False}"
2501,850,98,1.0,4.98698,"{'actual_k': 40, 'was_impossible': False}"
5971,405,376,5.0,1.026875,"{'actual_k': 40, 'was_impossible': False}"
6698,405,842,5.0,1.136975,"{'actual_k': 40, 'was_impossible': False}"
11687,127,750,1.0,4.760091,"{'actual_k': 16, 'was_impossible': False}"
14980,127,268,1.0,4.695952,"{'actual_k': 16, 'was_impossible': False}"
15627,405,1053,5.0,1.391685,"{'actual_k': 40, 'was_impossible': False}"
13033,239,203,1.0,4.550081,"{'actual_k': 40, 'was_impossible': False}"


Best 10

In [271]:
prediction_df.iloc[top_10.index]

Unnamed: 0,uid,iid,r_ui,est,details
6279,628,242,5.0,5.0,"{'actual_k': 21, 'was_impossible': False}"
6228,181,21,1.0,1.0,"{'actual_k': 40, 'was_impossible': False}"
11061,181,931,1.0,1.0,"{'actual_k': 40, 'was_impossible': False}"
6808,173,302,5.0,5.0,"{'actual_k': 37, 'was_impossible': False}"
12431,628,302,5.0,5.0,"{'actual_k': 21, 'was_impossible': False}"
6241,507,315,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
17243,774,452,1.0,1.0,"{'actual_k': 40, 'was_impossible': False}"
12721,939,127,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
18064,330,427,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
9291,296,483,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"


We can also use an individual uid and iid to check a single prediction. Let's pick one of the entries in the test set.

In [273]:
test_10 = testset[0:10]

In [274]:
# Predict with the fitted model and compare the result with the true rating in testset
# Each testset entry is (raw user id, raw item id, true rating); here 240, 313, 5.0
uid, iid, r_ui = test_10[0]

# get a prediction for a specific user and item.
pred = algo.predict(uid, iid, r_ui=r_ui, verbose=True)

user: 240        item: 313        r_ui = 5.00   est = 4.58   {'actual_k': 18, 'was_impossible': False}


# Conclusion - Evaluation

User-based KNNBaseline with cosine similarity scored a slightly lower test RMSE than the item-based version (0.9375 vs. 0.9389). The gap is only about 0.0014, and the two models were evaluated on different random 80/20 splits, so it should not be read as a clear win for either approach.
For a future project, we may want to try SVD and other model-based approaches.
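Fixing the random seed would let both models be scored on an identical split. The sketch below shows a repeatable 80/20 split in plain Python, using hypothetical row ids; `split_80_20` is our own helper. In Surprise, we believe the same effect comes from passing a `random_state` value to `train_test_split`.

```python
import random


def split_80_20(rows, seed):
    """Shuffle rows with a fixed seed and split them 80/20."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]


# Hypothetical row ids standing in for the ratings.
rows = list(range(100))

train_a, test_a = split_80_20(rows, seed=42)
train_b, test_b = split_80_20(rows, seed=42)

# The same seed yields the same split, so two models can share one test set.
print(test_a == test_b)
```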