# data612 - Group Project 2 : Recommender System - movielens
# date: 2019-06-18
# by: Sang Yoon (Andy) Hwang, Santosh Cheruku, Anthony Munoz

# Data Preparation

We are going to use the MovieLens 100k ratings dataset.

In [182]:
import pandas as pd
import surprise
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
from surprise import Dataset
from surprise import KNNBasic
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split

From the Surprise source (https://surprise.readthedocs.io/en/v1.0.0/_modules/surprise/dataset.html), we know that u.data is the ratings file we need.

In [184]:
df = pd.read_table('https://raw.githubusercontent.com/wheremagichappens/an.dy/master/data612/proj%202/u.data',header = None )

In [185]:
df.columns = ['user','item','rating','timestamp']

In [186]:
df.head()

index,user,item,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [187]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
user         100000 non-null int64
item         100000 non-null int64
rating       100000 non-null int64
timestamp    100000 non-null int64
dtypes: int64(4)
memory usage: 3.1 MB


# EDA

# Movie ratings distribution - want to know common ratings

In [188]:
init_notebook_mode(connected=True)

data = df['rating'].value_counts().sort_index(ascending=False)
trace = go.Bar(x = data.index,
               text = ['{:.1f} %'.format(val) for val in (data.values / df.shape[0] * 100)],
               textposition = 'auto',
               textfont = dict(color = '#000000'),
               y = data.values,
               )

# Create layout
layout = dict(title = 'Distribution Of {} movie ratings'.format(round(df.shape[0],1)),
              xaxis = dict(title = 'Rating'),
              yaxis = dict(title = 'Count'))

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

In [189]:
# examine distribution statistics in numbers
df['rating'].describe().apply(lambda x: format(x, 'f'))

count    100000.000000
mean          3.529860
std           1.125674
min           1.000000
25%           3.000000
50%           4.000000
75%           4.000000
max           5.000000
Name: rating, dtype: object

Notice that the median rating is 4, so at least half of all ratings are 4 or 5, and the mean (3.53) sits above the scale midpoint of 3. This suggests that users mostly rate movies they liked; in other words, the ratings are a bit biased towards the "positive" side.
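
The share of high ratings can be checked directly rather than read off the quartiles. The sketch below uses a small made-up `ratings` Series (an assumption for illustration); in this notebook the same two expressions would be applied to `df['rating']`.

```python
import pandas as pd

# Toy ratings for illustration only; in the notebook this would be df['rating'].
ratings = pd.Series([5, 4, 4, 3, 2, 4, 1, 5, 3, 4])

# Share of each rating value, as fractions that sum to 1.
share = ratings.value_counts(normalize=True).sort_index()

# Fraction of ratings that are 4 or higher.
high_share = (ratings >= 4).mean()

print(share)
print("share of ratings >= 4: {:.0%}".format(high_share))
```

Taking the mean of a boolean Series gives the fraction of `True` values, which avoids dividing counts by hand.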

# Rating distribution by movie - want to know common count of ratings

In [190]:
# Number of ratings by title - clipped at 100
data = df.groupby('item')['rating'].count()

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'Ratings',
                     xbins = dict(start = 0,
                                  end = 100,
                                  size = 2))
# Create layout
layout = go.Layout(title = 'Distribution Of Number of Ratings By Movie',
                   xaxis = dict(title = 'Number of Ratings By Movie'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

In [191]:
df.groupby('item')['rating'].count().reset_index().sort_values('rating', ascending=False)[:10]

index,item,rating
49,50,583
257,258,509
99,100,508
180,181,507
293,294,485
285,286,481
287,288,478
0,1,452
299,300,431
120,121,429


141 movies received only 1 rating, while the most-rated movie (item 50) received 583 ratings.
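
A count like this can be derived from the per-movie rating counts. The following is a minimal sketch on a made-up `toy` DataFrame (hypothetical values, same column names as `df`); it shows the technique and does not reproduce the figures above.

```python
import pandas as pd

# Toy ratings table (hypothetical values); the notebook's df has the same columns.
toy = pd.DataFrame({
    'user':   [1, 1, 2, 2, 3, 3, 3],
    'item':   [10, 20, 10, 30, 10, 20, 40],
    'rating': [4, 3, 5, 2, 4, 4, 1],
})

# Number of ratings each item received.
counts = toy.groupby('item')['rating'].count()

single_rated = int((counts == 1).sum())  # items rated exactly once
most_rated = int(counts.max())           # rating count of the most-rated item

print(single_rated, most_rated)
```

Items with very few ratings matter for KNN models because their similarities and neighborhoods rest on little evidence.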

# Modelling

We will compare four KNN models from Surprise using 3-fold cross-validation and pick the one with the lowest RMSE.
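
For reference, RMSE is the square root of the mean squared difference between the true rating and the estimate. The helper below is a plain-Python sketch of that definition; the function name `rmse` and the sample pairs are illustrative and not part of Surprise.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimate) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one prediction")
    squared_errors = [(r - est) ** 2 for r, est in pairs]
    return math.sqrt(sum(squared_errors) / len(pairs))

# Made-up (true rating, estimate) pairs.
sample = [(4, 3.5), (2, 3.0), (5, 5.0)]
print(rmse(sample))
```

Because errors are squared before averaging, a few large misses raise RMSE more than many small ones.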

In [192]:
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user', 'item', 'rating']], reader)

In [193]:
# TODO: learn how to configure each model (e.g. bsl_options vs sim_options; SVD may have its own settings).
# All models below are run with their default parameters.
df_algo = []
list_algo = [surprise.KNNBaseline(), surprise.KNNBasic(), surprise.KNNWithMeans(), surprise.KNNWithZScore()]
# Iterate over all algorithms
for algo in list_algo:
    # Perform 3-fold cross validation
    results = cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=False)

    # Average the metrics over the folds & append the algorithm's class name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    name = type(algo).__name__
    # pd.concat replaces Series.append, which was removed in pandas 2.0
    tmp = pd.concat([tmp, pd.Series([name], index=['Algorithm'])])
    df_algo.append(tmp)

pd.DataFrame(df_algo).set_index('Algorithm').sort_values('test_rmse')

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


Algorithm,test_rmse,fit_time,test_time
KNNBaseline,0.935334,0.706348,8.488771
KNNWithZScore,0.955641,0.607615,7.722909
KNNWithMeans,0.95654,0.536007,7.529695
KNNBasic,0.987837,0.49003,6.796481
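
One possible reading of this ranking, based on how the Surprise documentation defines the models rather than on anything measured here: KNNBasic predicts a similarity-weighted average of the neighbors' raw ratings, while KNNBaseline applies the same weighting to deviations from a baseline estimate $b_{ui} = \mu + b_u + b_i$ (global mean plus user and item biases). In user-based form, with $N_i^k(u)$ the $k$ nearest neighbors of user $u$ that rated item $i$:

$$
\hat{r}_{ui}^{\text{KNNBasic}} = \frac{\sum_{v \in N_i^k(u)} \text{sim}(u, v)\, r_{vi}}{\sum_{v \in N_i^k(u)} \text{sim}(u, v)}
\qquad
\hat{r}_{ui}^{\text{KNNBaseline}} = b_{ui} + \frac{\sum_{v \in N_i^k(u)} \text{sim}(u, v)\,(r_{vi} - b_{vi})}{\sum_{v \in N_i^k(u)} \text{sim}(u, v)}
$$

The "Estimating biases using als..." lines in the log correspond to fitting $b_u$ and $b_i$.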


KNNBaseline has the lowest RMSE, so we will use this model: we fit it on a training set and measure its accuracy on a held-out test set.
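
On the configuration question raised in the code comment above: in Surprise, the KNN models take a `sim_options` dict (how similarities are computed), and KNNBaseline additionally takes a `bsl_options` dict (how the baseline biases are estimated). SVD, as far as we know, has no such dict and takes hyperparameters such as `n_factors` directly as constructor arguments. The values below are illustrative assumptions, not tuned settings.

```python
# Similarity options: which measure to use and whether neighbors are users or items.
sim_options = {
    'name': 'pearson_baseline',  # other choices: 'cosine', 'msd', 'pearson'
    'user_based': False,         # False = item-item similarities
    'min_support': 3,            # illustrative: common ratings needed for a non-zero similarity
}

# Baseline options: how the biases b_u and b_i are estimated (used by KNNBaseline).
bsl_options = {
    'method': 'als',  # alternating least squares; 'sgd' is the other choice
    'n_epochs': 10,   # illustrative value
    'reg_u': 12,      # user regularization (illustrative)
    'reg_i': 5,       # item regularization (illustrative)
}

# In this notebook the dicts would be passed as:
#   algo = surprise.KNNBaseline(sim_options=sim_options, bsl_options=bsl_options)
print(sim_options, bsl_options)
```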

# KNNBaseline with cosine similarity - User_Based 

We will perform an 80/20 train/test split to build and evaluate the model.

In [194]:
trainset, testset = train_test_split(data, test_size=.20)

# We'll use KNNBaseline with cosine similarity (user_based)
sim_options = {'name': 'cosine',
               'user_based': True  # compute similarities between users
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9334


0.933351281653701
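
To make "cosine similarity (user_based)" concrete: the similarity between two users is the cosine of the angle between their rating vectors, restricted to the items both have rated. The function below is a plain-Python sketch of that idea, not Surprise's implementation; `cosine_sim` and the two sample users are hypothetical.

```python
import math

def cosine_sim(ratings_u, ratings_v):
    """Cosine similarity between two users over the items both have rated.

    ratings_u, ratings_v: dicts mapping item id -> rating.
    Returns 0.0 when there are no common items or a norm is zero.
    """
    common = ratings_u.keys() & ratings_v.keys()
    if not common:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical users; item ids and ratings are made up.
alice = {'a': 5, 'b': 3, 'c': 4}
bob = {'a': 4, 'b': 2, 'd': 1}
print(cosine_sim(alice, bob))
```

Since ratings on a 1-5 scale are all positive, cosine similarity on raw ratings is always positive and often close to 1, which is one reason mean-centered measures such as Pearson are sometimes preferred.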

Let's create a prediction DataFrame in order to check each prediction result on the testset.

In [195]:
prediction_df = pd.DataFrame(predictions)
prediction_df.head()

index,uid,iid,r_ui,est,details
0,314,106,2.0,3.269508,"{'actual_k': 40, 'was_impossible': False}"
1,314,1085,1.0,3.867234,"{'actual_k': 8, 'was_impossible': False}"
2,846,1004,3.0,3.634876,"{'actual_k': 6, 'was_impossible': False}"
3,83,739,5.0,3.718542,"{'actual_k': 40, 'was_impossible': False}"
4,833,93,4.0,3.394573,"{'actual_k': 40, 'was_impossible': False}"


Let's select the 10 worst and 10 best predictions on the testset, ranked by absolute error |r_ui - est|.

In [196]:
worst_10 = abs(prediction_df['r_ui'] - prediction_df['est']).sort_values(ascending=False).head(10)
top_10 = abs(prediction_df['r_ui'] - prediction_df['est']).sort_values(ascending=True).head(10)

Worst 10 - the estimates are way off, by up to 4 points, which is the largest possible error on a 1-5 scale.

In [197]:
prediction_df.iloc[worst_10.index]

index,uid,iid,r_ui,est,details
8379,405,571,5.0,1.0,"{'actual_k': 22, 'was_impossible': False}"
10504,405,575,5.0,1.008425,"{'actual_k': 37, 'was_impossible': False}"
10097,405,452,5.0,1.157295,"{'actual_k': 40, 'was_impossible': False}"
11951,887,718,1.0,4.702353,"{'actual_k': 12, 'was_impossible': False}"
11964,451,884,1.0,4.59702,"{'actual_k': 3, 'was_impossible': False}"
19874,517,1177,5.0,1.52638,"{'actual_k': 6, 'was_impossible': False}"
13438,551,423,1.0,4.468994,"{'actual_k': 40, 'was_impossible': False}"
8711,239,514,1.0,4.46307,"{'actual_k': 40, 'was_impossible': False}"
8238,91,56,1.0,4.435369,"{'actual_k': 40, 'was_impossible': False}"
1217,38,195,1.0,4.428918,"{'actual_k': 40, 'was_impossible': False}"


Best 10 - it predicted perfectly.

In [198]:
prediction_df.iloc[top_10.index]

index,uid,iid,r_ui,est,details
7568,118,134,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
8852,774,122,1.0,1.0,"{'actual_k': 40, 'was_impossible': False}"
18013,774,548,1.0,1.0,"{'actual_k': 10, 'was_impossible': False}"
555,405,777,1.0,1.0,"{'actual_k': 3, 'was_impossible': False}"
611,686,318,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
11195,507,50,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
15763,405,545,1.0,1.0,"{'actual_k': 10, 'was_impossible': False}"
9576,405,383,1.0,1.0,"{'actual_k': 25, 'was_impossible': False}"
7872,405,1539,1.0,1.0,"{'actual_k': 2, 'was_impossible': False}"
2965,405,1030,1.0,1.0,"{'actual_k': 17, 'was_impossible': False}"
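
Several estimates in the tables above are exactly 1.0 or 5.0. That is consistent with estimates being clipped to the rating scale: as far as we know, Surprise's `predict` does this by default (`clip=True`), so an exact match at the edge of the scale may come from a raw estimate that fell outside [1, 5]. The helper below is a hypothetical illustration of such clipping, not Surprise's code.

```python
def clip_estimate(raw_est, lower=1.0, upper=5.0):
    """Clamp a raw estimate into the rating scale [lower, upper]."""
    return min(upper, max(lower, raw_est))

# Made-up raw estimates: below, inside, and above the scale.
for raw in (0.3, 3.7, 5.6):
    print(raw, '->', clip_estimate(raw))
```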


We can also pass an individual uid and iid to check a single prediction result. Let's select one of the entries in the testset.

In [199]:
test_10 = testset[0:10]

In [200]:
# predict with the model fit on the training set and compare the result with the testset rating
uid = test_10[4][0]   # raw user id (as in the ratings file).
iid = test_10[4][1]   # raw item id (as in the ratings file).

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=test_10[4][2], verbose=True)

user: 833        item: 93         r_ui = 4.00   est = 3.39   {'actual_k': 40, 'was_impossible': False}


We will then recommend the top 5 movies by predicted rating for any selected user, based on user-user CF. Because the model was fit on the trainset only, the user's testset movies are "not yet watched" from the model's point of view, so we only recommend iids that appear in the testset for that user, not those in the trainset.

In [201]:
item_names = pd.read_table('https://raw.githubusercontent.com/wheremagichappens/an.dy/master/data612/proj%202/u.item', header = None, sep = '|', encoding='ISO-8859-1')
item_names = item_names[[0,1]]
item_names.columns = ['iid', 'title']
prediction_df = prediction_df.merge(item_names, on = 'iid', how = 'inner')

In [202]:
def recommender_user(uid):
    top_5 = prediction_df[prediction_df.uid == uid].sort_values(by = 'est', ascending = False).head(5)
    for i in range(len(top_5)):
        print('top 5 movies for uid: {0} are iid: {1} - {2} with predicted ratings of {3}'.format(uid, top_5.iid.values[i], top_5.title.values[i], top_5.est.values[i]))

recommender_user(uid)

top 5 movies for uid: 833 are iid: 127 - Godfather, The (1972) with predicted ratings of 3.968747268033744
top 5 movies for uid: 833 are iid: 479 - Vertigo (1958) with predicted ratings of 3.877709052047832
top 5 movies for uid: 833 are iid: 12 - Usual Suspects, The (1995) with predicted ratings of 3.85881995894063
top 5 movies for uid: 833 are iid: 488 - Sunset Blvd. (1950) with predicted ratings of 3.781142414543197
top 5 movies for uid: 833 are iid: 515 - Boot, Das (1981) with predicted ratings of 3.7702475188142066
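
The recommender above ranks only the pairs that happen to be in the testset. A more general pattern is to group any list of predictions by user and keep the n highest estimates. The sketch below does this in plain Python on made-up 4-tuples `(uid, iid, r_ui, est)`; the name `get_top_n` and the sample data are assumptions, and Surprise's own prediction objects also carry a `details` field. For truly unrated movies, `trainset.build_anti_testset()` is one way to generate the (user, item) pairs to score.

```python
from collections import defaultdict

def get_top_n(predictions, n=5):
    """Return {uid: [(iid, est), ...]} with the n highest estimates per user.

    predictions: iterable of (uid, iid, r_ui, est) tuples.
    """
    by_user = defaultdict(list)
    for uid, iid, _r_ui, est in predictions:
        by_user[uid].append((iid, est))
    # Sort each user's items by estimated rating, highest first, and keep n.
    return {
        uid: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for uid, items in by_user.items()
    }

# Made-up predictions for two users.
preds = [
    (1, 'A', None, 3.2), (1, 'B', None, 4.5), (1, 'C', None, 4.1),
    (2, 'A', None, 2.0), (2, 'D', None, 3.9),
]
top = get_top_n(preds, n=2)
print(top)
```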


# KNNBaseline with cosine similarity - Item_Based 

In [203]:
trainset, testset = train_test_split(data, test_size=.20)

# We'll use KNNBaseline with cosine similarity (item_based)
sim_options = {'name': 'cosine',
               'user_based': False  # compute  similarities between items
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9413


0.9413449095340447

Just like we did with user_based, let's create prediction_df.

In [204]:
prediction_df = pd.DataFrame(predictions)
prediction_df.head()

index,uid,iid,r_ui,est,details
0,794,249,3.0,3.946599,"{'actual_k': 27, 'was_impossible': False}"
1,615,678,1.0,3.085502,"{'actual_k': 40, 'was_impossible': False}"
2,709,633,3.0,3.949614,"{'actual_k': 40, 'was_impossible': False}"
3,503,662,3.0,3.663846,"{'actual_k': 40, 'was_impossible': False}"
4,550,304,3.0,3.873793,"{'actual_k': 31, 'was_impossible': False}"


In [205]:
worst_10 = abs(prediction_df['r_ui'] - prediction_df['est']).sort_values(ascending=False).head(10)
top_10 = abs(prediction_df['r_ui'] - prediction_df['est']).sort_values(ascending=True).head(10)

Worst 10

In [206]:
prediction_df.iloc[worst_10.index]

index,uid,iid,r_ui,est,details
10324,405,94,5.0,1.0,"{'actual_k': 40, 'was_impossible': False}"
875,405,1041,5.0,1.135037,"{'actual_k': 40, 'was_impossible': False}"
17838,405,288,5.0,1.240978,"{'actual_k': 40, 'was_impossible': False}"
6853,405,376,5.0,1.243393,"{'actual_k': 40, 'was_impossible': False}"
19146,481,318,1.0,4.723356,"{'actual_k': 40, 'was_impossible': False}"
19159,312,265,1.0,4.613504,"{'actual_k': 40, 'was_impossible': False}"
11540,239,190,1.0,4.437622,"{'actual_k': 40, 'was_impossible': False}"
13858,620,100,1.0,4.43309,"{'actual_k': 40, 'was_impossible': False}"
6210,405,1063,5.0,1.603688,"{'actual_k': 40, 'was_impossible': False}"
7204,312,157,1.0,4.384106,"{'actual_k': 40, 'was_impossible': False}"


Best 10

In [207]:
prediction_df.iloc[top_10.index]

index,uid,iid,r_ui,est,details
12460,366,98,5.0,5.0,"{'actual_k': 25, 'was_impossible': False}"
568,137,50,5.0,5.0,"{'actual_k': 36, 'was_impossible': False}"
8568,181,266,1.0,1.0,"{'actual_k': 40, 'was_impossible': False}"
17181,296,50,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
12866,152,483,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
8604,405,1179,1.0,1.0,"{'actual_k': 40, 'was_impossible': False}"
18409,181,1079,1.0,1.0,"{'actual_k': 40, 'was_impossible': False}"
4742,90,483,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
17024,445,831,1.0,1.0,"{'actual_k': 40, 'was_impossible': False}"
17199,457,318,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"


We can also pass an individual uid and iid to check a single prediction result. Let's select one of the entries in the testset.

In [208]:
test_10 = testset[0:10]

In [209]:
# predict with the model fit on the training set and compare the result with the testset rating
uid = test_10[1][0]  # raw user id (as in the ratings file).
iid = test_10[1][1]  # raw item id (as in the ratings file).

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=test_10[1][2], verbose=True)

user: 615        item: 678        r_ui = 1.00   est = 3.09   {'actual_k': 40, 'was_impossible': False}


We will now recommend top 5 items based on item-item CF.

In [210]:
item_names = pd.read_table('https://raw.githubusercontent.com/wheremagichappens/an.dy/master/data612/proj%202/u.item', header = None, sep = '|', encoding='ISO-8859-1')
item_names = item_names[[0,1]]
item_names.columns = ['iid', 'title']
prediction_df = prediction_df.merge(item_names, on = 'iid', how = 'inner')

In [211]:
def recommender_user(uid):
    top_5 = prediction_df[prediction_df.uid == uid].sort_values(by = 'est', ascending = False).head(5)
    for i in range(len(top_5)):
        print('top 5 movies for uid: {0} are iid: {1} - {2} with predicted ratings of {3}'.format(uid, top_5.iid.values[i], top_5.title.values[i], top_5.est.values[i]))

recommender_user(uid)

top 5 movies for uid: 615 are iid: 127 - Godfather, The (1972) with predicted ratings of 4.526367868519045
top 5 movies for uid: 615 are iid: 23 - Taxi Driver (1976) with predicted ratings of 4.4429992222106565
top 5 movies for uid: 615 are iid: 514 - Annie Hall (1977) with predicted ratings of 4.312281480900541
top 5 movies for uid: 615 are iid: 48 - Hoop Dreams (1994) with predicted ratings of 4.118391806104453
top 5 movies for uid: 615 are iid: 238 - Raising Arizona (1987) with predicted ratings of 4.098468571051248


# Conclusion - Evaluation

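As you can see, User_Based performs slightly better than Item_Based (both using KNNBaseline with cosine similarity): RMSE on the testset is 0.9334 for User_Based versus 0.9413 for Item_Based. Note that the two models were evaluated on different random 80/20 splits, so part of this small gap may come from the split rather than from the model. For future projects, we may want to use SVD and other model-based approaches.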