# data612 - Group Project 2 : Recommender System - movielens
# date: 2019-06-18
# by: Sang Yoon (Andy) Hwang, Santosh Cheruku, Anthony Munoz

# Data Preparation

We will use the MovieLens ratings dataset, which contains about 100k ratings.

In [1]:
import pandas as pd
import surprise
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
from surprise import Dataset
from surprise import KNNBasic
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split

We use the ml-latest-small dataset from https://grouplens.org/datasets/movielens/.

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/san123i/CUNY/master/Semester2/612-RecommenderSystem/Week2/data_assignment/ml-100k/ratings.csv')

In [3]:
df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


To simplify the later code, we rename the columns.

In [4]:
df.columns = ['user','item','rating','timestamp']

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
user         100836 non-null int64
item         100836 non-null int64
rating       100836 non-null float64
timestamp    100836 non-null int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


# EDA

# Movie ratings distribution - want to know common ratings

In [6]:
init_notebook_mode(connected=True)

data = df['rating'].value_counts().sort_index(ascending=False)
trace = go.Bar(x = data.index,
               text = ['{:.1f} %'.format(val) for val in (data.values / df.shape[0] * 100)],
               textposition = 'auto',
               textfont = dict(color = '#000000'),
               y = data.values,
               )

# Create layout
layout = dict(title = 'Distribution Of {} movie ratings'.format(round(df.shape[0],1)),
              xaxis = dict(title = 'Rating'),
              yaxis = dict(title = 'Count'))

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

In [7]:
# examine distribution statistics in numbers
df['rating'].describe().apply(lambda x: format(x, 'f'))

count    100836.000000
mean          3.501557
std           1.042529
min           0.500000
25%           3.000000
50%           3.500000
75%           4.000000
max           5.000000
Name: rating, dtype: object

Notice that the median rating is 3.5 and the 25th percentile is 3.0, so at least half of the ratings are 3.5 or higher. Ratings are skewed toward the positive end, which may reflect well-liked movies, users who rate generously, or both.
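The share of ratings above a given value can be checked directly instead of being read off the quartiles. A minimal sketch, using a small made-up list of ratings (not the MovieLens data) and a hypothetical helper named `share_above`:

```python
from collections import Counter


def share_above(ratings, threshold):
    """Return the fraction of ratings strictly greater than threshold."""
    counts = Counter(ratings)
    above = sum(n for value, n in counts.items() if value > threshold)
    return above / len(ratings)


# Toy ratings for illustration only; not taken from the dataset.
toy_ratings = [0.5, 2.0, 3.0, 3.0, 3.5, 4.0, 4.0, 4.5, 5.0, 5.0]
toy_share = share_above(toy_ratings, 3.0)
print(f"{toy_share:.0%} of the toy ratings are above 3")
```

On the dataframe used in this notebook, `(df['rating'] > 3).mean()` should give the same quantity for the real data.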

# Rating distribution by movie - want to know common count of ratings

In [8]:
data = df.groupby('item')['rating'].count()

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'Ratings',
                     xbins = dict(start = 0,
                                  end = 100,
                                  size = 2))
# Create layout
layout = go.Layout(title = 'Distribution Of Number of Ratings By Movie',
                   xaxis = dict(title = 'Number of Ratings By Movie'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

In [9]:
df.groupby('item')['rating'].count().reset_index().sort_values('rating', ascending=False)[:10]

Unnamed: 0,item,rating
314,356,329
277,318,317
257,296,307
510,593,279
1938,2571,278
224,260,251
418,480,238
97,110,237
507,589,224
461,527,220


3446 movies received only 1 rating, while the most-rated movie received 329 ratings.
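The long tail described above can be counted with a simple tally per item. The sketch below uses made-up (user, item) pairs and hypothetical names; it only illustrates the counting, not the actual MovieLens figures:

```python
from collections import Counter


def ratings_per_item(pairs):
    """Count how many ratings each item received from (user, item) pairs."""
    return Counter(item for _user, item in pairs)


# Toy (user, item) pairs for illustration only.
toy_pairs = [(1, 'A'), (2, 'A'), (3, 'A'), (1, 'B'), (2, 'C'), (3, 'D')]
per_item = ratings_per_item(toy_pairs)

# Items rated exactly once, and the single most-rated item.
single_rated = sorted(item for item, n in per_item.items() if n == 1)
most_rated, most_count = per_item.most_common(1)[0]
print(single_rated, most_rated, most_count)
```

With the dataframe from this notebook, `(df.groupby('item')['rating'].count() == 1).sum()` should return the number of movies with a single rating.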

# Modelling

We will compare the KNN algorithms in Surprise using 3-fold cross-validation and pick the one with the lowest RMSE.
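For reference, RMSE is the square root of the mean squared difference between actual and estimated ratings. A minimal sketch with made-up rating pairs; the helper name `rmse` here is our own and is separate from `surprise.accuracy.rmse`:

```python
import math


def rmse(pairs):
    """Root mean squared error over (actual, estimated) rating pairs."""
    if not pairs:
        raise ValueError("need at least one (actual, estimated) pair")
    squared_errors = [(actual - est) ** 2 for actual, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy (actual, estimated) pairs for illustration only.
toy_pairs_rmse = [(4.0, 3.5), (3.0, 3.0), (5.0, 4.0), (1.0, 2.5)]
toy_rmse = rmse(toy_pairs_rmse)
print(round(toy_rmse, 4))
```

Because errors are squared before averaging, a few large misses raise RMSE more than many small ones.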

In [10]:
reader = Reader(rating_scale=(df.rating.min(), df.rating.max()))
data = Dataset.load_from_df(df[['user', 'item', 'rating']], reader)

In [11]:
df_algo = []
list_algo = [surprise.KNNBaseline(), surprise.KNNBasic(), surprise.KNNWithMeans(), surprise.KNNWithZScore()]
# Iterate over all algorithms
for algo in list_algo:
    # Perform cross validation
    results = cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=False)
    
    # Average the metrics over the folds and append the algorithm name.
    # Series.append was removed in pandas 2.0, so pd.concat is used instead.
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([type(algo).__name__], index=['Algorithm'])])
    df_algo.append(tmp)

pd.DataFrame(df_algo).set_index('Algorithm').sort_values('test_rmse')

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


Algorithm,test_rmse,fit_time,test_time
KNNBaseline,0.882585,0.687826,8.429881
KNNWithMeans,0.903477,0.322585,5.389311
KNNWithZScore,0.904951,0.570522,5.669243
KNNBasic,0.958221,0.287922,4.700163


KNNBaseline has the lowest RMSE, so we will use this model: we fit it on a training set and measure its accuracy on a held-out test set.
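As we understand the Surprise documentation, KNNBaseline estimates a rating as a baseline plus a similarity-weighted average of the neighbors' deviations from their own baselines. The sketch below illustrates that idea only; the function name and the toy numbers are assumptions, and details such as clipping to the rating scale are left out:

```python
def knn_baseline_estimate(baseline_ui, neighbors, k=40):
    """Estimate a rating from a baseline plus weighted neighbor residuals.

    neighbors: list of (similarity, neighbor_rating, neighbor_baseline).
    Only the k most similar neighbors with positive similarity are used.
    """
    usable = sorted(
        (n for n in neighbors if n[0] > 0),
        key=lambda n: n[0],
        reverse=True,
    )[:k]
    sim_total = sum(sim for sim, _rating, _baseline in usable)
    if sim_total == 0:
        # No usable neighbors: fall back to the baseline estimate alone.
        return baseline_ui
    residual = sum(sim * (rating - baseline) for sim, rating, baseline in usable)
    return baseline_ui + residual / sim_total


# Toy neighbors: (similarity, rating, baseline). The last one is ignored
# because its similarity is not positive.
toy_neighbors = [(0.5, 4.0, 3.5), (0.5, 5.0, 4.0), (-0.2, 1.0, 3.0)]
toy_estimate = knn_baseline_estimate(3.5, toy_neighbors)
print(toy_estimate)
```

The fallback branch corresponds to rows in the prediction output whose details contain no `actual_k`, where only the baseline could be used.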

# KNNBaseline - User_Based 

We will perform an 80/20 train/test split to build the model.
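The split is random, so the RMSE values reported in this notebook depend on which ratings land in the test set. A minimal sketch of a seeded 80/20 split on toy rows, with a hypothetical helper `split_ratings`, shows how a fixed seed makes the split repeatable:

```python
import random


def split_ratings(rows, test_size=0.2, seed=0):
    """Shuffle rows with a fixed seed and split them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


# 20 toy (user, item, rating) rows for illustration only.
toy_rows = [(u, i, 3.0) for u in range(5) for i in range(4)]
toy_train, toy_test = split_ratings(toy_rows)
print(len(toy_train), len(toy_test))
```

In Surprise, `train_test_split` accepts a `random_state` argument that serves the same purpose.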

In [12]:
trainset, testset = train_test_split(data, test_size=.20)

# cosine similarity

In [13]:
# We'll use KNNBaseline with cosine similarity (user_based)
sim_options = {'name': 'cosine',
               'user_based': True  # compute similarities between users
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_cos = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions_cos)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.8804


0.8803955012639307
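For intuition, cosine similarity between two users can be computed over the items both have rated. This is a simplified sketch with made-up ratings and a hypothetical function name, not the Surprise implementation:

```python
import math


def cosine_sim(ratings_u, ratings_v):
    """Cosine similarity between two users over the items both have rated.

    ratings_u, ratings_v: dicts mapping item id -> rating (assumed positive).
    Returns 0.0 when the users share no items.
    """
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


# Toy users: on the two shared items, v's ratings are exactly half of u's.
toy_u = {1: 4.0, 2: 2.0, 3: 5.0}
toy_v = {1: 2.0, 2: 1.0, 4: 3.0}
toy_cosine = cosine_sim(toy_u, toy_v)
print(round(toy_cosine, 4))
```

Because raw ratings are not mean-centered here, two users whose ratings are proportional get a similarity of 1 even if one of them rates everything lower.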

# pearson

In [14]:
# We'll use KNNBaseline with pearson (user_based)
sim_options = {'name': 'pearson',
               'user_based': True  # compute similarities between users
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_pear = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions_pear)

Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 0.8771


0.877082625343741
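Pearson correlation differs from cosine in that each user's ratings are centered on their mean before comparison. A simplified sketch over common items, with made-up ratings and a hypothetical function name:

```python
import math


def pearson_sim(ratings_u, ratings_v):
    """Pearson correlation between two users over the items both have rated.

    Means are taken over the common items only. Returns 0.0 when there are
    fewer than two common items or when either user has zero variance.
    """
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    mean_v = sum(ratings_v[i] for i in common) / len(common)
    dev_u = [ratings_u[i] - mean_u for i in common]
    dev_v = [ratings_v[i] - mean_v for i in common]
    denom = math.sqrt(sum(x * x for x in dev_u)) * math.sqrt(sum(y * y for y in dev_v))
    if denom == 0:
        return 0.0
    return sum(x * y for x, y in zip(dev_u, dev_v)) / denom


# Toy users with exactly opposite tastes on three shared items.
toy_a = {1: 5.0, 2: 3.0, 3: 1.0}
toy_b = {1: 1.0, 2: 3.0, 3: 5.0}
toy_pearson = pearson_sim(toy_a, toy_b)
print(round(toy_pearson, 4))
```

Unlike cosine on raw ratings, this measure can be negative, so users with opposite tastes are distinguished from users with similar tastes.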

# pearson_baseline

In [15]:
# We'll use KNNBaseline with pearson_baseline (user_based)
sim_options = {'name': 'pearson_baseline',
               'user_based': True  # compute similarities between users
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_pear_b = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions_pear_b)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 0.8752


0.8751532009110881
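The Surprise documentation describes pearson_baseline as a Pearson correlation centered on baseline estimates rather than means, with a shrinkage term that discounts pairs supported by few common ratings. The sketch below covers only the shrinkage step, assuming the documented form and a shrinkage value of 100; the helper name `shrink` is hypothetical:

```python
def shrink(correlation, n_common, shrinkage=100):
    """Shrink a correlation toward zero when it rests on few common ratings."""
    if n_common < 2:
        return 0.0
    return (n_common - 1) / (n_common - 1 + shrinkage) * correlation


# The same raw correlation, supported by 2 versus 101 common ratings.
weak_support = shrink(0.9, n_common=2)
strong_support = shrink(0.9, n_common=101)
print(round(weak_support, 4), round(strong_support, 4))
```

A correlation built on two common ratings is almost entirely discounted, while one built on many keeps much more of its weight.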

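Among the user-based variants, pearson_baseline has the lowest RMSE (0.8752), followed by pearson (0.8771) and cosine (0.8804); the differences are small. We continue with cosine similarity for the user-user CF walkthrough. Let's create a prediction dataframe for cosine in order to check each prediction result on the testset.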

In [16]:
prediction_df = pd.DataFrame(predictions_cos)
prediction_df.head()

Unnamed: 0,uid,iid,r_ui,est,details
0,70,318,4.5,4.793018,"{'actual_k': 40, 'was_impossible': False}"
1,603,781,4.0,3.54105,"{'actual_k': 6, 'was_impossible': False}"
2,28,63033,3.5,2.878751,{'was_impossible': False}
3,46,160,4.0,3.137027,"{'actual_k': 40, 'was_impossible': False}"
4,103,116799,4.0,3.883817,"{'actual_k': 5, 'was_impossible': False}"


Let's select the 10 worst and the 10 best predictions on the testset.

In [17]:
worst_10 = abs(prediction_df['r_ui'] - prediction_df['est']).sort_values(ascending=False).head(10)
top_10 = abs(prediction_df['r_ui'] - prediction_df['est']).sort_values(ascending=True).head(10)

Worst 10 - these estimates are far off; half of them rely on a single neighbor (actual_k of 1).

In [18]:
prediction_df.iloc[worst_10.index]

Unnamed: 0,uid,iid,r_ui,est,details
5462,482,2068,0.5,5.0,"{'actual_k': 1, 'was_impossible': False}"
16462,610,4794,5.0,0.5,"{'actual_k': 1, 'was_impossible': False}"
9429,382,73042,1.0,5.0,"{'actual_k': 1, 'was_impossible': False}"
17902,140,3030,1.0,4.757924,"{'actual_k': 10, 'was_impossible': False}"
7014,573,1127,0.5,4.225543,"{'actual_k': 40, 'was_impossible': False}"
1311,580,1262,0.5,4.17007,"{'actual_k': 36, 'was_impossible': False}"
808,282,48322,1.0,4.669248,"{'actual_k': 4, 'was_impossible': False}"
20079,89,356,0.5,4.161695,"{'actual_k': 40, 'was_impossible': False}"
7029,594,8521,0.5,4.095975,"{'actual_k': 1, 'was_impossible': False}"
18558,594,8482,0.5,4.095975,"{'actual_k': 1, 'was_impossible': False}"


Best 10 - these estimates match the actual ratings exactly.

In [19]:
prediction_df.iloc[top_10.index]

Unnamed: 0,uid,iid,r_ui,est,details
15186,523,55167,5.0,5.0,"{'actual_k': 1, 'was_impossible': False}"
1018,523,4993,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
6481,122,2959,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
6009,610,78836,5.0,5.0,"{'actual_k': 1, 'was_impossible': False}"
11054,305,6818,5.0,5.0,"{'actual_k': 1, 'was_impossible': False}"
1141,298,55292,0.5,0.5,"{'actual_k': 1, 'was_impossible': False}"
16638,53,922,5.0,5.0,"{'actual_k': 12, 'was_impossible': False}"
10889,171,1201,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
10876,251,142115,5.0,5.0,"{'actual_k': 1, 'was_impossible': False}"
15180,348,318,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"


We can also check an individual prediction by its uid and iid. Using the cosine similarity model, let's pick one entry from the testset.

In [20]:
test_10 = testset[0:10]

In [21]:
# We'll use KNNBaseline with cosine similarity (user_based)
sim_options = {'name': 'cosine',
               'user_based': True  # compute similarities between users
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_cos = algo.fit(trainset).test(testset)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.


In [22]:
# predict with the model fit on the training set and compare with the testset rating
uid = test_10[1][0]   # raw user id (as in the ratings file).
iid = test_10[1][1]   # raw item id (as in the ratings file).

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=test_10[1][2], verbose=True)

user: 603        item: 781        r_ui = 4.00   est = 3.54   {'actual_k': 6, 'was_impossible': False}


Next, we recommend the top 5 movies by predicted rating for a selected user, using cosine similarity. User-user CF should recommend movies whose ratings by that user were not seen during training, so we only consider iids that appear for the user in the testset, not in the trainset. To show titles, we merge the movie title data set with the prediction data set.

In [23]:
item_names = pd.read_csv('https://raw.githubusercontent.com/san123i/CUNY/master/Semester2/612-RecommenderSystem/Week2/data_assignment/ml-100k/movies.csv')
item_names = item_names.iloc[:,[0,1]]
item_names.columns = ['iid', 'title']
prediction_df = prediction_df.merge(item_names, on = 'iid', how = 'inner')

In [24]:
def recommender_user(uid):
    top_5 = prediction_df[prediction_df.uid == uid].sort_values(by = 'est', ascending = False).head(5)
    for i in range(len(top_5)):
        print('top 5 movies for uid: {0} are iid: {1} - {2} with predicted ratings of {3}'.format(uid, top_5.iid.values[i], top_5.title.values[i], top_5.est.values[i]))

recommender_user(uid)

top 5 movies for uid: 603 are iid: 187 - Party Girl (1995) with predicted ratings of 5.0
top 5 movies for uid: 603 are iid: 1596 - Career Girls (1997) with predicted ratings of 4.914636233419947
top 5 movies for uid: 603 are iid: 3494 - True Grit (1969) with predicted ratings of 4.556960431163943
top 5 movies for uid: 603 are iid: 3046 - Incredibly True Adventure of Two Girls in Love, The (1995) with predicted ratings of 4.505937990801941
top 5 movies for uid: 603 are iid: 2290 - Stardust Memories (1980) with predicted ratings of 4.4333323147821275
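The same top-N idea can be applied to all users at once by grouping predictions per user. A minimal sketch on made-up (uid, iid, est) tuples; the function name and the toy values are assumptions, not part of the notebook's pipeline:

```python
from collections import defaultdict


def top_n_by_user(predictions, n=5):
    """Group (uid, iid, est) predictions by user and keep the n highest estimates."""
    by_user = defaultdict(list)
    for uid, iid, est in predictions:
        by_user[uid].append((iid, est))
    return {
        uid: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for uid, items in by_user.items()
    }


# Toy predictions for illustration only.
toy_predictions = [
    (603, 187, 5.0),
    (603, 781, 3.54),
    (603, 1596, 4.91),
    (70, 318, 4.79),
]
toy_top = top_n_by_user(toy_predictions, n=2)
print(toy_top[603])
```

A dictionary keyed by uid avoids filtering the whole dataframe once per user when recommendations are needed for many users.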


# KNNBaseline - Item_Based 

# cosine similarity

In [25]:
# We'll use KNNBaseline with cosine similarity (item_based)
sim_options = {'name': 'cosine',
               'user_based': False  # compute similarities between items
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_cos_2 = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions_cos_2)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.8911


0.8910569063808026

# pearson

In [26]:
# We'll use KNNBaseline with pearson (item_based)
sim_options = {'name': 'pearson',
               'user_based': False  # compute similarities between items
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_pear_2 = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions_pear_2)

Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 0.8762


0.8762242516024136

# pearson_baseline

In [27]:
# We'll use KNNBaseline with pearson_baseline (item_based)
sim_options = {'name': 'pearson_baseline',
               'user_based': False  # compute similarities between items
               }
algo = surprise.KNNBaseline(sim_options=sim_options)

# Train the algorithm on the trainset, and predict ratings for the testset
predictions_pear_b_2 = algo.fit(trainset).test(testset)

# Then compute RMSE
accuracy.rmse(predictions_pear_b_2)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 0.8516


0.8516121042047731

As with user_based, let's create a prediction dataframe, this time for the item-based model with the lowest RMSE: pearson_baseline.

In [28]:
prediction_df_2 = pd.DataFrame(predictions_pear_b_2)
prediction_df_2.head()

Unnamed: 0,uid,iid,r_ui,est,details
0,70,318,4.5,4.751467,"{'actual_k': 32, 'was_impossible': False}"
1,603,781,4.0,3.263603,"{'actual_k': 40, 'was_impossible': False}"
2,28,63033,3.5,2.878751,{'was_impossible': False}
3,46,160,4.0,3.625819,"{'actual_k': 13, 'was_impossible': False}"
4,103,116799,4.0,3.968108,"{'actual_k': 40, 'was_impossible': False}"


In [29]:
worst_10_2 = abs(prediction_df_2['r_ui'] - prediction_df_2['est']).sort_values(ascending=False).head(10)
top_10_2 = abs(prediction_df_2['r_ui'] - prediction_df_2['est']).sort_values(ascending=True).head(10)

Worst 10

In [30]:
prediction_df_2.iloc[worst_10_2.index]

Unnamed: 0,uid,iid,r_ui,est,details
2640,578,79879,0.5,4.520363,"{'actual_k': 2, 'was_impossible': False}"
7014,573,1127,0.5,4.342934,"{'actual_k': 40, 'was_impossible': False}"
1251,543,59900,0.5,4.280046,"{'actual_k': 17, 'was_impossible': False}"
245,539,31878,0.5,4.277522,"{'actual_k': 13, 'was_impossible': False}"
12791,85,1358,1.0,4.734281,"{'actual_k': 12, 'was_impossible': False}"
1164,567,26171,0.5,4.19318,"{'actual_k': 1, 'was_impossible': False}"
6840,259,4993,0.5,4.168975,"{'actual_k': 9, 'was_impossible': False}"
12921,111,62439,4.5,0.845317,"{'actual_k': 3, 'was_impossible': False}"
8850,419,5480,0.5,4.144571,"{'actual_k': 1, 'was_impossible': False}"
3618,472,4226,1.0,4.598376,"{'actual_k': 12, 'was_impossible': False}"


Best 10

In [31]:
prediction_df_2.iloc[top_10_2.index]

Unnamed: 0,uid,iid,r_ui,est,details
10524,122,1197,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
16638,53,922,5.0,5.0,"{'actual_k': 4, 'was_impossible': False}"
1018,523,4993,5.0,5.0,"{'actual_k': 29, 'was_impossible': False}"
3142,276,1210,5.0,5.0,"{'actual_k': 20, 'was_impossible': False}"
16996,53,203,5.0,5.0,"{'actual_k': 2, 'was_impossible': False}"
14066,515,48516,5.0,5.0,"{'actual_k': 16, 'was_impossible': False}"
15180,348,318,5.0,5.0,"{'actual_k': 29, 'was_impossible': False}"
19780,276,260,5.0,5.0,"{'actual_k': 17, 'was_impossible': False}"
16491,3,72378,0.5,0.5,"{'actual_k': 4, 'was_impossible': False}"
14548,515,260,5.0,5.0,"{'actual_k': 15, 'was_impossible': False}"


We can also check an individual prediction by its uid and iid. This time, we use the pearson_baseline model. Let's pick one entry from the testset.

In [32]:
# predict with the model fit on the training set and compare with the testset rating
uid = test_10[1][0]  # raw user id (as in the ratings file).
iid = test_10[1][1]  # raw item id (as in the ratings file).

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=test_10[1][2], verbose=True)

user: 603        item: 781        r_ui = 4.00   est = 3.26   {'actual_k': 40, 'was_impossible': False}


We will now recommend top 5 items based on item-item CF using pearson_baseline.

In [33]:
prediction_df_2 = prediction_df_2.merge(item_names, on = 'iid', how = 'inner')

In [34]:
def recommender_user(uid):
    top_5 = prediction_df_2[prediction_df_2.uid == uid].sort_values(by = 'est', ascending = False).head(5)
    for i in range(len(top_5)):
        print('top 5 movies for uid: {0} are iid: {1} - {2} with predicted ratings of {3}'.format(uid, top_5.iid.values[i], top_5.title.values[i], top_5.est.values[i]))

recommender_user(uid)

top 5 movies for uid: 603 are iid: 1248 - Touch of Evil (1958) with predicted ratings of 4.615225675008942
top 5 movies for uid: 603 are iid: 187 - Party Girl (1995) with predicted ratings of 4.557077574320296
top 5 movies for uid: 603 are iid: 913 - Maltese Falcon, The (1941) with predicted ratings of 4.435407671781854
top 5 movies for uid: 603 are iid: 3732 - Fury, The (1978) with predicted ratings of 4.337713485328569
top 5 movies for uid: 603 are iid: 2318 - Happiness (1998) with predicted ratings of 4.3091439958179


# Conclusion - Evaluation

As the results show, item-item CF using pearson_baseline performs best: it has the lowest RMSE on the testset (0.8516) of all the configurations we tried. For a future project, we may want to try SVD and other model-based approaches.
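To make the comparison explicit, the six test-set RMSE values reported in the KNNBaseline runs of this notebook can be collected and ranked. The dictionary layout and variable names are our own; the numbers are the rounded values printed by `accuracy.rmse` in the cells above:

```python
# Test-set RMSE values copied from the KNNBaseline runs in this notebook.
rmse_by_model = {
    ('user', 'cosine'): 0.8804,
    ('user', 'pearson'): 0.8771,
    ('user', 'pearson_baseline'): 0.8752,
    ('item', 'cosine'): 0.8911,
    ('item', 'pearson'): 0.8762,
    ('item', 'pearson_baseline'): 0.8516,
}

# Rank from lowest (best) to highest RMSE.
ranked = sorted(rmse_by_model.items(), key=lambda kv: kv[1])
best_model, best_rmse = ranked[0]

for (mode, sim), value in ranked:
    print(f"{mode}-based {sim:<17} RMSE {value:.4f}")
```

These values come from a single random 80/20 split, so small gaps between configurations may not hold on a different split.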