# data612 - Group Project 2 : Recommender System - movielens
# date: 2019-06-18
# by: Sang Yoon (Andy) Hwang, Santosh Cheruku, Anthony Munoz

# Data Preparation

We are going to use the 100k ratings dataset from MovieLens.

In [8]:
import pandas as pd
import surprise
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
from surprise import Dataset
from surprise import KNNBasic
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split

We will use ml-latest-small from https://grouplens.org/datasets/movielens/.

In [9]:
df = pd.read_csv('https://raw.githubusercontent.com/wheremagichappens/an.dy/master/data612/ml-100k/ratings.csv')

In [10]:
df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


Just to make our life a little bit easier, we will change the names of the columns.

In [11]:
df.columns = ['user','item','rating','timestamp']

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
user         100836 non-null int64
item         100836 non-null int64
rating       100836 non-null float64
timestamp    100836 non-null int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


# EDA

# Movie ratings distribution - want to know common ratings

In [13]:
init_notebook_mode(connected=True)

data = df['rating'].value_counts().sort_index(ascending=False)
trace = go.Bar(x = data.index,
               text = ['{:.1f} %'.format(val) for val in (data.values / df.shape[0] * 100)],
               textposition = 'auto',
               textfont = dict(color = '#000000'),
               y = data.values,
               )

# Create layout
layout = dict(title = 'Distribution Of {} movie ratings'.format(round(df.shape[0],1)),
              xaxis = dict(title = 'Rating'),
              yaxis = dict(title = 'Count'))

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

In [14]:
# examine distribution statistics in numbers
df['rating'].describe().apply(lambda x: format(x, 'f'))

count    100836.000000
mean          3.501557
std           1.042529
min           0.500000
25%           3.000000
50%           3.500000
75%           4.000000
max           5.000000
Name: rating, dtype: object

Notice that the 25th percentile is 3.0 and the median is 3.5, so at least half of the ratings are higher than 3. This suggests either that most rated movies are well liked or that users are biased towards "positive" ratings; the distribution alone cannot tell the two apart.

# Rating distribution by movie - want to know common count of ratings

In [15]:
data = df.groupby('item')['rating'].count()

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'Ratings',
                     xbins = dict(start = 0,
                                  end = 100,
                                  size = 2))
# Create layout
layout = go.Layout(title = 'Distribution Of Number of Ratings By Movie',
                   xaxis = dict(title = 'Number of Ratings By Movie'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

In [16]:
df.groupby('item')['rating'].count().reset_index().sort_values('rating', ascending=False)[:10]

Unnamed: 0,item,rating
314,356,329
277,318,317
257,296,307
510,593,279
1938,2571,278
224,260,251
418,480,238
97,110,237
507,589,224
461,527,220


3446 movies received only 1 rating, while the most-rated movie received 329 ratings.

# Modelling

We will find the best KNN model based on RMSE. 
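
For reference, RMSE is the square root of the mean squared difference between the true and the estimated ratings, so lower is better. A minimal sketch in plain Python; the `rmse` helper and the three sample pairs are illustrative and not part of the notebook:

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    if not pairs:
        raise ValueError("need at least one prediction")
    squared_errors = [(r_ui - est) ** 2 for r_ui, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical (true, estimated) pairs for illustration only.
sample = [(3.0, 3.5), (4.0, 4.0), (5.0, 4.0)]
sample_rmse = rmse(sample)
```

Because the errors are squared before averaging, a single prediction that is off by 1.0 contributes four times as much as one that is off by 0.5.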

In [20]:
reader = Reader(rating_scale=(df.rating.min(), df.rating.max()))
data = Dataset.load_from_df(df[['user', 'item', 'rating']], reader)

In [21]:
df_algo = []
list_algo = [surprise.KNNBaseline(), surprise.KNNBasic(), surprise.KNNWithMeans(), surprise.KNNWithZScore()]
# Iterate over all algorithms
for algo in list_algo:
    # Perform cross validation
    results = cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=False)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0, so add the label by assignment
    tmp['Algorithm'] = type(algo).__name__
    df_algo.append(tmp)
    
pd.DataFrame(df_algo).set_index('Algorithm').sort_values('test_rmse')    

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


Algorithm,test_rmse,fit_time,test_time
KNNBaseline,0.882006,0.406126,4.046272
KNNWithZScore,0.905465,0.268841,3.81874
KNNWithMeans,0.906794,0.213541,3.525581
KNNBasic,0.95559,0.181225,2.835989


KNNBaseline has the lowest RMSE, so we will use this model: we fit it on a training set and measure its accuracy on a held-out test set.
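
To make the model less of a black box: as described in the Surprise documentation, KNNBaseline starts from a baseline estimate for the (user, item) pair and corrects it with the similarity-weighted residuals of the nearest neighbors. The sketch below is a simplified illustration in plain Python, not Surprise's implementation; the function name, the tuple layout, and the numbers are assumptions, and it presumes all similarities are positive.

```python
def knn_baseline_estimate(b_ui, neighbors):
    """Baseline estimate corrected by similarity-weighted neighbor residuals.

    neighbors: list of (similarity, neighbor_rating, neighbor_baseline)
    tuples for the k most similar users (or items) that rated the target.
    """
    weight_sum = sum(sim for sim, _, _ in neighbors)
    if weight_sum == 0:
        # No usable neighbors: fall back to the baseline alone.
        return b_ui
    residual = sum(sim * (rating - baseline) for sim, rating, baseline in neighbors)
    return b_ui + residual / weight_sum

# Hypothetical numbers for illustration only.
est = knn_baseline_estimate(
    b_ui=3.5,
    neighbors=[(0.5, 4.0, 3.0), (0.5, 5.0, 4.0)],
)
```

Both neighbors rated one point above their own baselines, so the estimate moves one point above the target baseline. The number of neighbors actually used corresponds to the `actual_k` field in the prediction details.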

# KNNBaseline - User_Based 

We will perform an 80/20 train/test split to build the model.
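
As an illustration of what such a split does, here is a minimal sketch in plain Python, assuming ratings are stored as (user, item, rating) triples. The `split_ratings` helper, its `seed` parameter, and the toy data are hypothetical.

```python
import random

def split_ratings(ratings, test_size=0.2, seed=0):
    """Shuffle a copy of the ratings and cut it into (train, test) lists."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical (user, item, rating) triples for illustration only.
toy_ratings = [(u, i, 3.0) for u in range(5) for i in range(4)]
train, test = split_ratings(toy_ratings)
```

Fixing the seed makes the split repeatable. Surprise's `train_test_split` accepts a `random_state` argument for the same purpose; without it, the RMSE values reported in this notebook will vary slightly from run to run.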

In [22]:
trainset, testset = train_test_split(data, test_size=.20)

# cosine similarity

In [31]:
# We'll use KNNBaseline with cosine similarity (user_based)
sim_options = {'name': 'cosine',
               'user_based': True  # compute similarities between users
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_cos = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions_cos)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.8751


0.8750892516915244
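
For intuition about what the similarity matrix contains, cosine similarity between two users can be sketched over the items both have rated. This is a plain Python illustration under that assumption, not Surprise's code; the user dictionaries are hypothetical.

```python
import math

def cosine_sim(ratings_u, ratings_v):
    """Cosine similarity between two users over the items both have rated.

    ratings_u, ratings_v: dicts mapping item id -> rating.
    """
    common = ratings_u.keys() & ratings_v.keys()
    if not common:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

# Hypothetical users for illustration only.
alice = {1: 4.0, 2: 2.0, 3: 5.0}
bob = {1: 2.0, 2: 1.0, 4: 3.0}
sim_ab = cosine_sim(alice, bob)
```

On the two shared items, alice's ratings are exactly twice bob's, so the similarity comes out at 1.0 (up to floating-point error) even though the rating levels differ. Cosine on raw ratings measures direction, not level.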

# pearson

In [29]:
# We'll use KNNBaseline with pearson (user_based)
sim_options = {'name': 'pearson',
               'user_based': True  # compute similarities between users
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_pear = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions_pear)

Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 0.8786


0.8786443805689138
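
Pearson similarity differs from cosine in that each user's ratings are mean-centered first, so it reacts to agreement in taste rather than to rating level. A minimal sketch in plain Python, assuming the means are taken over the co-rated items; the helper and the sample users are hypothetical.

```python
import math

def pearson_sim(ratings_u, ratings_v):
    """Pearson correlation between two users over the items both have rated."""
    common = sorted(ratings_u.keys() & ratings_v.keys())
    if len(common) < 2:
        return 0.0
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    mean_v = sum(ratings_v[i] for i in common) / len(common)
    cov = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    var_u = sum((ratings_u[i] - mean_u) ** 2 for i in common)
    var_v = sum((ratings_v[i] - mean_v) ** 2 for i in common)
    denom = math.sqrt(var_u * var_v)
    if denom == 0:
        # A user with constant ratings has no direction to correlate with.
        return 0.0
    return cov / denom

# Hypothetical users with opposite tastes, for illustration only.
alice = {1: 4.0, 2: 2.0, 3: 5.0}
carol = {1: 2.0, 2: 4.0, 3: 1.0}
sim_ac = pearson_sim(alice, carol)
```

Here carol likes what alice dislikes, so the correlation is -1.0, whereas cosine on the same raw ratings would still be positive because all ratings are positive.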

# pearson_baseline

In [30]:
# We'll use KNNBaseline with pearson_baseline (user_based)
sim_options = {'name': 'pearson_baseline',
               'user_based': True  # compute similarities between users
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_pear_b = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions_pear_b)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 0.8780


0.8779955851123548

Cosine similarity gives the lowest RMSE for user-user CF (0.8751, versus 0.8780 for pearson_baseline and 0.8786 for pearson), so we will use it as the final selection. Let's create a prediction dataframe for cosine in order to check each prediction result on the test set.

In [71]:
prediction_df = pd.DataFrame(predictions_cos)
prediction_df.head()

Unnamed: 0,uid,iid,r_ui,est,details
0,580,4526,3.0,3.137373,"{'actual_k': 3, 'was_impossible': False}"
1,610,43679,3.5,2.54533,"{'actual_k': 4, 'was_impossible': False}"
2,474,5799,3.5,3.30357,{'was_impossible': False}
3,448,5339,4.0,3.251725,"{'actual_k': 5, 'was_impossible': False}"
4,368,3127,3.0,2.137756,"{'actual_k': 2, 'was_impossible': False}"


Let's select the 10 worst and 10 best predictions on the test set, ranked by absolute error.

In [72]:
worst_10 = abs(prediction_df['r_ui'] - prediction_df['est']).sort_values(ascending=False).head(10)
top_10 = abs(prediction_df['r_ui'] - prediction_df['est']).sort_values(ascending=True).head(10)

Worst 10 - these predictions are off by more than 3.5 points.

In [73]:
prediction_df.iloc[worst_10.index]

Unnamed: 0,uid,iid,r_ui,est,details
11255,543,213,0.5,4.850658,"{'actual_k': 3, 'was_impossible': False}"
16645,210,296,0.5,4.633714,"{'actual_k': 40, 'was_impossible': False}"
8978,156,3223,1.0,5.0,"{'actual_k': 1, 'was_impossible': False}"
15382,305,158783,0.5,4.36805,"{'actual_k': 2, 'was_impossible': False}"
1063,413,2858,1.0,4.815773,"{'actual_k': 40, 'was_impossible': False}"
13569,129,1358,0.5,4.264005,"{'actual_k': 31, 'was_impossible': False}"
15104,393,5902,0.5,4.240609,"{'actual_k': 34, 'was_impossible': False}"
20056,594,799,0.5,4.21302,"{'actual_k': 14, 'was_impossible': False}"
775,111,158872,0.5,4.210425,"{'actual_k': 5, 'was_impossible': False}"
11583,89,73042,5.0,1.293921,"{'actual_k': 2, 'was_impossible': False}"


Best 10 - the estimates match the true ratings exactly.

In [74]:
prediction_df.iloc[top_10.index]

Unnamed: 0,uid,iid,r_ui,est,details
13785,156,3473,5.0,5.0,"{'actual_k': 1, 'was_impossible': False}"
760,191,99,5.0,5.0,"{'actual_k': 1, 'was_impossible': False}"
9952,452,260,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
12211,601,166024,5.0,5.0,"{'actual_k': 1, 'was_impossible': False}"
8498,594,1733,5.0,5.0,"{'actual_k': 1, 'was_impossible': False}"
18309,122,86345,5.0,5.0,"{'actual_k': 5, 'was_impossible': False}"
15156,122,50,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
1829,122,1197,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
6757,122,121231,5.0,5.0,"{'actual_k': 2, 'was_impossible': False}"
8474,122,2571,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
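
One likely reason so many estimates are exactly 5.0: Surprise clips estimates to the rating scale by default, so any raw estimate above the upper bound is reported as 5.0. The sketch below illustrates the idea with the (0.5, 5.0) scale passed to `Reader` earlier; the helper name and the raw values are hypothetical.

```python
def clip_to_scale(raw_estimate, lower=0.5, upper=5.0):
    """Force an estimate into the [lower, upper] rating scale."""
    return min(upper, max(lower, raw_estimate))

# Hypothetical raw estimates for illustration only.
clipped = [clip_to_scale(x) for x in (5.37, 4.2, 0.1)]
```

This means a "perfect" match on a 5.0 rating may hide a raw estimate that overshot the scale, which is worth keeping in mind when reading the table above.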


We can also use an individual uid and iid to check a single prediction result. Using the model with cosine similarity, let's select one of the entries in the test set.

In [75]:
test_10 = testset[0:10]

In [89]:
# We'll use KNNBaseline with cosine similarity (user_based)
sim_options = {'name': 'cosine',
               'user_based': True  # compute similarities between users
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_cos = algo.fit(trainset).test(testset)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.


In [90]:
# predict with the model fit on the training set and compare with the test set rating
uid = test_10[1][0]   # raw user id (as in the ratings file).
iid = test_10[1][1]   # raw item id (as in the ratings file).

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=test_10[1][2], verbose=True)

user: 610        item: 43679      r_ui = 3.50   est = 2.55   {'actual_k': 4, 'was_impossible': False}


We will then recommend the top 5 movies by predicted rating for any selected user using cosine similarity (user-user CF). The recommendations are drawn from the test set only: these are (user, movie) pairs whose ratings were held out of training, so the model has not seen them and we treat them as movies the user has not watched yet. For this step, we also merge the movie title dataset with the predictions.
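
The selection logic itself is simple: group predictions by user and keep the n highest estimates. A minimal sketch in plain Python, assuming predictions are (uid, iid, est) triples; the helper name is hypothetical and the sample values are rounded from outputs shown in this notebook.

```python
from collections import defaultdict

def top_n_per_user(predictions, n=5):
    """Group (uid, iid, est) triples by user and keep the n highest estimates."""
    by_user = defaultdict(list)
    for uid, iid, est in predictions:
        by_user[uid].append((iid, est))
    return {
        uid: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for uid, items in by_user.items()
    }

# Sample triples, rounded from outputs shown in this notebook.
toy_predictions = [
    (610, 1201, 4.53), (610, 1221, 4.46), (610, 43679, 2.55),
    (580, 4526, 3.14),
]
top = top_n_per_user(toy_predictions, n=2)
```

Building the mapping once for all users avoids re-filtering the full prediction table for every user we want recommendations for.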

In [78]:
item_names = pd.read_csv('https://raw.githubusercontent.com/wheremagichappens/an.dy/master/data612/ml-100k/movies.csv')
item_names = item_names.iloc[:,[0,1]]
item_names.columns = ['iid', 'title']
prediction_df = prediction_df.merge(item_names, on = 'iid', how = 'inner')

In [79]:
def recommender_user(uid):
    top_5 = prediction_df[prediction_df.uid == uid].sort_values(by = 'est', ascending = False).head(5)
    for i in range(len(top_5)):
        print('top 5 movies for uid: {0} are iid: {1} - {2} with predicted ratings of {3}'.format(uid, top_5.iid.values[i], top_5.title.values[i], top_5.est.values[i]))

recommender_user(uid)

top 5 movies for uid: 610 are iid: 132422 - Da Sweet Blood of Jesus (2014) with predicted ratings of 4.716104308135289
top 5 movies for uid: 610 are iid: 71899 - Mary and Max (2009) with predicted ratings of 4.643632052819533
top 5 movies for uid: 610 are iid: 1201 - Good, the Bad and the Ugly, The (Buono, il brutto, il cattivo, Il) (1966) with predicted ratings of 4.529569150044176
top 5 movies for uid: 610 are iid: 7981 - Infernal Affairs (Mou gaan dou) (2002) with predicted ratings of 4.4852613463063955
top 5 movies for uid: 610 are iid: 1221 - Godfather: Part II, The (1974) with predicted ratings of 4.457222995362181


# KNNBaseline - Item_Based 

# cosine similarity

In [80]:
# We'll use KNNBaseline with cosine similarity (item_based)
sim_options = {'name': 'cosine',
               'user_based': False  # compute similarities between items
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_cos_2 = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions_cos_2)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.8921


0.8921173945632737

# pearson

In [81]:
# We'll use KNNBaseline with pearson (item_based)
sim_options = {'name': 'pearson',
               'user_based': False  # compute similarities between items
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_pear_2 = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions_pear_2)

Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 0.8792


0.8791761294699607

# pearson_baseline

In [92]:
# We'll use KNNBaseline with pearson_baseline (item_based)
sim_options = {'name': 'pearson_baseline',
               'user_based': False  # compute similarities between items
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_pear_b_2 = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions_pear_b_2)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 0.8538


0.8537676693881011

Just like we did with user_based, let's create prediction_df for the model with the lowest RMSE, which in this case is pearson_baseline.

In [83]:
prediction_df_2 = pd.DataFrame(predictions_pear_b_2)
prediction_df_2.head()

Unnamed: 0,uid,iid,r_ui,est,details
0,580,4526,3.0,3.733361,"{'actual_k': 12, 'was_impossible': False}"
1,610,43679,3.5,3.055715,"{'actual_k': 40, 'was_impossible': False}"
2,474,5799,3.5,3.30357,{'was_impossible': False}
3,448,5339,4.0,3.546093,"{'actual_k': 40, 'was_impossible': False}"
4,368,3127,3.0,2.905527,"{'actual_k': 17, 'was_impossible': False}"


In [84]:
worst_10_2 = abs(prediction_df_2['r_ui'] - prediction_df_2['est']).sort_values(ascending=False).head(10)
top_10_2 = abs(prediction_df_2['r_ui'] - prediction_df_2['est']).sort_values(ascending=True).head(10)

Worst 10

In [85]:
prediction_df_2.iloc[worst_10_2.index]

Unnamed: 0,uid,iid,r_ui,est,details
13511,416,26810,5.0,0.593296,"{'actual_k': 2, 'was_impossible': False}"
16645,210,296,0.5,4.733476,"{'actual_k': 40, 'was_impossible': False}"
11255,543,213,0.5,4.59468,"{'actual_k': 0, 'was_impossible': False}"
4115,543,59900,0.5,4.394793,"{'actual_k': 19, 'was_impossible': False}"
16340,147,168,0.5,4.362093,"{'actual_k': 7, 'was_impossible': False}"
11579,527,527,1.0,4.798976,"{'actual_k': 40, 'was_impossible': False}"
16302,472,4226,1.0,4.775728,"{'actual_k': 12, 'was_impossible': False}"
7319,239,48394,0.5,4.272711,"{'actual_k': 40, 'was_impossible': False}"
10670,393,72998,0.5,4.247237,"{'actual_k': 40, 'was_impossible': False}"
14962,594,7650,0.5,4.209043,"{'actual_k': 10, 'was_impossible': False}"


Best 10

In [86]:
prediction_df_2.iloc[top_10_2.index]

Unnamed: 0,uid,iid,r_ui,est,details
11415,122,1213,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
12600,154,1704,5.0,5.0,"{'actual_k': 11, 'was_impossible': False}"
15156,122,50,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
2793,276,1210,5.0,5.0,"{'actual_k': 18, 'was_impossible': False}"
14228,122,2959,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
2077,97,4011,5.0,5.0,"{'actual_k': 8, 'was_impossible': False}"
1829,122,1197,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
5291,162,356,5.0,5.0,"{'actual_k': 18, 'was_impossible': False}"
16322,543,2571,5.0,5.0,"{'actual_k': 30, 'was_impossible': False}"
11247,122,858,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"


We can also use an individual uid and iid to check a single prediction result. This time, we are using pearson_baseline. Let's select one of the entries in the test set.

In [93]:
# predict with the model fit on the training set and compare with the test set rating
uid = test_10[1][0]  # raw user id (as in the ratings file).
iid = test_10[1][1]  # raw item id (as in the ratings file).

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=test_10[1][2], verbose=True)

user: 610        item: 43679      r_ui = 3.50   est = 3.06   {'actual_k': 40, 'was_impossible': False}


We will now recommend the top 5 items based on item-item CF using pearson_baseline.

In [94]:
prediction_df_2 = prediction_df_2.merge(item_names, on = 'iid', how = 'inner')

In [96]:
def recommender_user(uid):
    top_5 = prediction_df_2[prediction_df_2.uid == uid].sort_values(by = 'est', ascending = False).head(5)
    for i in range(len(top_5)):
        print('top 5 movies for uid: {0} are iid: {1} - {2} with predicted ratings of {3}'.format(uid, top_5.iid.values[i], top_5.title.values[i], top_5.est.values[i]))

recommender_user(uid)

top 5 movies for uid: 610 are iid: 1196 - Star Wars: Episode V - The Empire Strikes Back (1980) with predicted ratings of 4.801451056893102
top 5 movies for uid: 610 are iid: 2571 - Matrix, The (1999) with predicted ratings of 4.7342894919841285
top 5 movies for uid: 610 are iid: 1200 - Aliens (1986) with predicted ratings of 4.724743338715447
top 5 movies for uid: 610 are iid: 1201 - Good, the Bad and the Ugly, The (Buono, il brutto, il cattivo, Il) (1966) with predicted ratings of 4.689011286800603
top 5 movies for uid: 610 are iid: 2858 - American Beauty (1999) with predicted ratings of 4.657198075437559


# Conclusion - Evaluation

As the results show, item-item CF using pearson_baseline performs better than any other model we tried, with the lowest RMSE on the test set (0.8538). For a future project, we may want to try SVD and other model-based approaches.