# Collaborative Filtering

In this notebook we build a memory-based recommender system using collaborative filtering. We work with the 'Movie Recommendation System' dataset from Kaggle.com. The question we want to answer is: can we name the top 5 movie recommendations for a user, and can we predict the rating that user would give each one? By the end of the notebook we should be able to list the 5 movies with the highest predicted ratings for a particular user, along with those predicted ratings. We start by importing the two data sets and merging them into one. The Surprise library (a scikit for recommender systems, imported as 'surprise') only needs three columns, 'user_id', 'item_id', and 'rating', so we pass it just those and leave the title column out. We can then use the item id to map a recommendation back to the title of the movie if we choose to do so.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# read in data
df = pd.read_csv("C:/Users/ryans/Documents/Predictive Modeling/Collaborative Filtering/Data/Dataset.csv")

In [3]:
# view first 5 rows
df.head(n=5)

Unnamed: 0,user_id,item_id,rating,timestamp
0,0,50,5,881250949
1,0,172,5,881250949
2,0,133,1,881250949
3,196,242,3,881250949
4,186,302,3,891717742


In [4]:
# read in titles data frame 
df_titles = pd.read_csv("C:/Users/ryans/Documents/Predictive Modeling/Collaborative Filtering/Data/Movie_Id_Titles.csv")

In [5]:
# view the titles data frame
df_titles

Unnamed: 0,item_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)
...,...,...
1677,1678,Mat' i syn (1997)
1678,1679,B. Monkey (1998)
1679,1680,Sliding Doors (1998)
1680,1681,You So Crazy (1994)


## Preprocessing

In [6]:
# merge the two using item_id
data = pd.merge(df, df_titles, on = 'item_id' )
# view the merged data frame
data

Unnamed: 0,user_id,item_id,rating,timestamp,title
0,0,50,5,881250949,Star Wars (1977)
1,290,50,5,880473582,Star Wars (1977)
2,79,50,4,891271545,Star Wars (1977)
3,2,50,5,888552084,Star Wars (1977)
4,8,50,5,879362124,Star Wars (1977)
...,...,...,...,...,...
99998,840,1674,4,891211682,Mamma Roma (1962)
99999,655,1640,3,888474646,"Eighth Day, The (1996)"
100000,655,1637,3,888984255,Girls Town (1996)
100001,655,1630,3,887428735,"Silence of the Palace, The (Saimt el Qusur) (1..."


In [7]:
# drop timestamp 
data.drop('timestamp', axis = 1, inplace = True)

In [8]:
# view our data frame
data

Unnamed: 0,user_id,item_id,rating,title
0,0,50,5,Star Wars (1977)
1,290,50,5,Star Wars (1977)
2,79,50,4,Star Wars (1977)
3,2,50,5,Star Wars (1977)
4,8,50,5,Star Wars (1977)
...,...,...,...,...
99998,840,1674,4,Mamma Roma (1962)
99999,655,1640,3,"Eighth Day, The (1996)"
100000,655,1637,3,Girls Town (1996)
100001,655,1630,3,"Silence of the Palace, The (Saimt el Qusur) (1..."


In [9]:
# get info 
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100003 entries, 0 to 100002
Data columns (total 4 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   user_id  100003 non-null  int64 
 1   item_id  100003 non-null  int64 
 2   rating   100003 non-null  int64 
 3   title    100003 non-null  object
dtypes: int64(3), object(1)
memory usage: 3.8+ MB


In [10]:
# sort by user_id
data.sort_values(by = ['user_id'], inplace = True)
data

Unnamed: 0,user_id,item_id,rating,title
0,0,50,5,Star Wars (1977)
952,0,133,1,Gone with the Wind (1939)
584,0,172,5,"Empire Strikes Back, The (1980)"
69223,1,72,4,"Mask, The (1994)"
35889,1,82,5,Jurassic Park (1993)
...,...,...,...,...
55019,943,58,4,Quiz Show (1994)
81124,943,205,5,Patton (1970)
53804,943,720,1,First Knight (1995)
24257,943,794,3,It Could Happen to You (1994)


In [11]:
# check for any duplicates
len(data)-len(data.drop_duplicates())

0

In [13]:
# create a pivot table to view ratings of each movie
movie_pivot = data.pivot_table(index = 'user_id', columns = 'title', values = 'rating')
# view pivot table
movie_pivot

title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,,,,,,,,,,,...,,,,,,,,,,
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,,,,,,,,,,,...,,,,,,,,,,
940,,,,,,,,,,,...,,,,,,,,,,
941,,,,,,,,,,,...,,,,,,,,,,
942,,,,,,,,3.0,,3.0,...,,,,,,,,,,


### Importing Surprise Library

The next step is to import the Reader, Dataset, and model classes from the Surprise library. We will be using KNNWithMeans, a nearest-neighbors algorithm that takes mean ratings into account, to produce our model.
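To make the idea behind KNNWithMeans concrete, here is a minimal pure-Python sketch of a mean-centered nearest-neighbors estimate. It is a simplification written for illustration, not Surprise's implementation: the function name `knn_with_means_estimate` and the tuple layout are assumptions, and details such as clipping to the rating scale or a minimum neighbor count are left out.

```python
def knn_with_means_estimate(target_mean, neighbors, k=40):
    """Mean-centered kNN estimate for one (user, item) pair.

    neighbors: list of (similarity, neighbor_rating, neighbor_mean) tuples.
    This layout is hypothetical and chosen only for this sketch.
    """
    # keep the k most similar neighbors, then drop non-positive similarities
    top_k = sorted(neighbors, key=lambda t: t[0], reverse=True)[:k]
    top_k = [(s, r, m) for s, r, m in top_k if s > 0]

    sim_total = sum(s for s, _, _ in top_k)
    if sim_total == 0:
        return target_mean  # no usable neighbors: fall back to the mean

    # similarity-weighted average of each neighbor's deviation from its own mean
    offset = sum(s * (r - m) for s, r, m in top_k) / sim_total
    return target_mean + offset


# item-based toy case: the target movie averages 3.0, and the user rated two
# similar movies above those movies' averages, so the estimate moves above 3.0
estimate = knn_with_means_estimate(
    target_mean=3.0,
    neighbors=[(0.8, 5.0, 4.0), (0.4, 4.0, 3.5), (-0.2, 1.0, 3.0)],
)
print(round(estimate, 3))
```

The design point is that neighbors contribute how far they sit above or below their own average, not their raw ratings, which keeps a generally high-rated movie from inflating every estimate.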

In [21]:
from surprise import Reader
from surprise import Dataset
from surprise import KNNWithMeans
from surprise.model_selection import train_test_split

In [22]:
# create reader
reader = Reader(rating_scale = (1, 5))
# load df
movie = Dataset.load_from_df(data[['user_id', 'item_id', 'rating']], reader=reader)

In [23]:
# item-item Pearson similarity (a mean-centered cosine similarity)
sim_options = {
    "name": "pearson",
    "user_based": False,  # compute  similarities between items
}
algo = KNNWithMeans(sim_options=sim_options)

In [24]:
# create a training and testing set that our model will be fit and tested on.
trainset, testset = train_test_split(movie, test_size = 0.2)

In [25]:
# build a trainset from all of the ratings; used later to refit the model on the full data
train_set_full = movie.build_full_trainset() 

#### RAW AND INNER IDS

One thing to be aware of with the Surprise library is that it re-indexes our identifiers internally. The ids we pass in are called raw ids, and the integers Surprise assigns to them are called inner ids. For example, a raw user id (raw_uid) is the id of a user as it appears in our data, such as user 178, while the matching inner id is an integer starting from 0 that Surprise assigns within the trainset. To see which users and movies ended up in our training set, we can run the following code to convert the inner ids back to raw ids. This will help us when making our predictions, because the predict method expects raw ids.
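The raw-to-inner mapping can be pictured with two plain dictionaries. This is a toy illustration under the assumption that inner ids are handed out in the order raw ids are first seen; it is not Surprise's code, and the names `raw_to_inner` and `inner_to_raw` are made up for the sketch. In the notebook itself the equivalent lookups are the trainset's `to_inner_uid` / `to_raw_uid` (and `to_inner_iid` / `to_raw_iid`) methods.

```python
# raw user ids as they might appear, row by row, in a ratings table
raw_user_ids = [196, 186, 22, 196, 244]

# assign inner ids 0, 1, 2, ... in order of first appearance (assumption)
raw_to_inner = {}
for raw in raw_user_ids:
    if raw not in raw_to_inner:
        raw_to_inner[raw] = len(raw_to_inner)

# the reverse lookup turns an inner id back into the id we recognise
inner_to_raw = {inner: raw for raw, inner in raw_to_inner.items()}

print(raw_to_inner)
print(inner_to_raw[2])
```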

In [26]:
# raw and inner ids - items
trainset_iids = list(trainset.all_items()) # integers starting from 0
# convert to raw - these are the movies used in the trainset
iid_converter = lambda x: trainset.to_raw_iid(x)
trainset_raw_ids = list(map(iid_converter, trainset_iids))
trainset_raw_ids

[293,
 921,
 153,
 98,
 496,
 1197,
 864,
 265,
 421,
 210,
 1514,
 648,
 531,
 62,
 476,
 193,
 286,
 693,
 327,
 172,
 553,
 65,
 665,
 443,
 381,
 251,
 29,
 157,
 795,
 470,
 367,
 259,
 235,
 883,
 494,
 716,
 232,
 276,
 97,
 722,
 199,
 202,
 294,
 181,
 499,
 174,
 308,
 15,
 436,
 622,
 173,
 845,
 187,
 64,
 280,
 169,
 188,
 879,
 1536,
 323,
 930,
 511,
 673,
 588,
 121,
 333,
 640,
 321,
 182,
 258,
 475,
 273,
 1095,
 515,
 239,
 809,
 1048,
 629,
 1225,
 135,
 1040,
 111,
 21,
 378,
 58,
 408,
 479,
 27,
 465,
 257,
 562,
 481,
 23,
 962,
 42,
 201,
 566,
 313,
 127,
 636,
 1097,
 1011,
 614,
 12,
 699,
 196,
 924,
 86,
 696,
 413,
 11,
 418,
 186,
 357,
 458,
 628,
 231,
 4,
 551,
 617,
 330,
 506,
 291,
 720,
 237,
 195,
 794,
 827,
 649,
 632,
 155,
 471,
 1098,
 735,
 461,
 331,
 569,
 482,
 625,
 303,
 324,
 1,
 227,
 223,
 249,
 124,
 44,
 763,
 207,
 1139,
 736,
 2,
 328,
 1169,
 90,
 988,
 512,
 606,
 1183,
 302,
 1001,
 646,
 960,
 95,
 91,
 402,
 300,
 346,
 74

In [27]:
# raw and inner ids - users
trainset_uids = list(trainset.all_users())
# convert the users to raw - these are the users in the trainset
uid_converter = lambda x: trainset.to_raw_uid(x)
trainset_raw_uids = list(map(uid_converter, trainset_uids))
trainset_raw_uids

[293,
 405,
 788,
 700,
 566,
 151,
 612,
 7,
 116,
 664,
 716,
 806,
 450,
 76,
 452,
 862,
 804,
 681,
 381,
 177,
 232,
 122,
 707,
 932,
 18,
 64,
 379,
 174,
 269,
 314,
 271,
 618,
 532,
 779,
 696,
 6,
 345,
 561,
 200,
 821,
 210,
 401,
 158,
 650,
 712,
 535,
 332,
 489,
 391,
 28,
 267,
 621,
 630,
 399,
 498,
 198,
 533,
 514,
 687,
 385,
 576,
 852,
 567,
 312,
 749,
 907,
 179,
 545,
 65,
 568,
 22,
 538,
 393,
 658,
 279,
 839,
 274,
 244,
 872,
 458,
 539,
 910,
 435,
 743,
 16,
 298,
 432,
 207,
 878,
 774,
 90,
 864,
 889,
 646,
 645,
 144,
 653,
 901,
 57,
 234,
 601,
 795,
 297,
 99,
 851,
 771,
 459,
 303,
 406,
 313,
 70,
 780,
 168,
 833,
 270,
 553,
 592,
 43,
 197,
 663,
 934,
 145,
 288,
 709,
 870,
 239,
 468,
 635,
 94,
 911,
 149,
 456,
 408,
 60,
 738,
 826,
 643,
 243,
 869,
 500,
 597,
 1,
 135,
 660,
 194,
 442,
 328,
 373,
 75,
 758,
 342,
 254,
 348,
 322,
 509,
 374,
 344,
 766,
 828,
 577,
 617,
 648,
 445,
 762,
 868,
 462,
 36,
 181,
 644,
 472,
 3

In [28]:
# now we can fit our algo model to our trainset.
algo.fit(trainset)

Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x19eb646a740>

In [29]:
# check the accuracy of our model
from surprise import accuracy

In [30]:
# get predictions
predictions = algo.test(testset)
# RMSE
rmse = accuracy.rmse(predictions)
# MAE
mae = accuracy.mae(predictions)

RMSE: 0.9459
MAE:  0.7431
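As a reminder of what the two metrics above measure, here is a small sketch that computes them by hand on made-up (true rating, estimate) pairs. The data and the helper names `rmse` and `mae` are illustrative assumptions, not part of the notebook.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true, estimated) pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)

# made-up predictions: three near misses and one large miss
pairs = [(5, 4.5), (3, 3.5), (1, 3.0), (4, 4.0)]

print(f"RMSE: {rmse(pairs):.4f}")
print(f"MAE:  {mae(pairs):.4f}")
```

Because RMSE squares each error before averaging, the single large miss pulls it above the MAE. The same effect may explain why the notebook's RMSE is higher than its MAE.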


### Cross Validation

In [31]:
# import cross validation 
from surprise.model_selection import cross_validate

In [32]:
# set up a variable for cross validation
cross_val = cross_validate(
    algo = algo,
    data = movie,
    measures = ['RMSE'],
    cv = 5,
    return_train_measures = True
)

Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.


In [33]:
# analyze the results - view the test rmse for each fold
cross_val['test_rmse']

array([0.93638822, 0.94037869, 0.94220828, 0.94134178, 0.94774672])
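One way to summarize how stable the folds are is to look at their mean and spread. The sketch below copies the five fold scores printed above into a plain list; the variable names are assumptions.

```python
import statistics

# test RMSE for each of the 5 folds, copied from the output above
fold_rmse = [0.93638822, 0.94037869, 0.94220828, 0.94134178, 0.94774672]

mean_rmse = statistics.mean(fold_rmse)
spread = statistics.stdev(fold_rmse)  # sample standard deviation across folds

print(f"mean RMSE {mean_rmse:.4f}, std {spread:.4f}")
```

A small standard deviation relative to the mean suggests the score does not depend much on which ratings happen to land in the test fold.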

In [34]:
# train algo on full data set
algo.fit(train_set_full)

Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x19eb646a740>

### Predictions

Predictions in the Surprise library are made for a specific user and a specific item, and both must already appear in the data the model was trained on. This means we cannot make predictions for new users, which is a drawback of the memory-based KNNWithMeans model. However, for a user who is in our data set and has not rated a movie, we can use the predict method to estimate the rating that user would give it. For example, we can predict the rating that user 7 would give Star Wars (item 50). The predict method takes raw ids. The prediction is labeled 'est', and r_ui = None tells us that no true rating was supplied for this user and movie.

The output below needs to be read with care. The estimate is 3.53, but details reports 'was_impossible': True with the reason 'User and/or item is unknown.' The ids were passed as the strings '7' and '50', while the ids in our data frame are integers, so Surprise did not recognize them and fell back to its default prediction, the mean of all ratings in the trainset. The 3.53 is therefore not a personalized estimate for user 7; the integers 7 and 50 would need to be passed to get one.

Our RMSE of roughly 0.95 and MAE of roughly 0.74 mean that a typical prediction is off by a little under one rating point. On a 1-5 scale that is a sizable error. It matters least at the extremes: a movie predicted at 5 or 1 still lands on the right side of like or dislike even if we are off by a point. It matters most in the middle: if we predict 3.5, are off by 1, and do not recommend the movie, the user may actually have rated it a 4.5 and we missed out on recommending a movie they would have loved.

In [37]:
# make a prediction of user 7 for star wars
algo.predict(uid = '7', iid = '50')

Prediction(uid='7', iid='50', r_ui=None, est=3.5298641040768777, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})
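The 'was_impossible': True flag in the output above is what one would expect when the ids passed in do not match the type of the ids in the data: '7' (a string) is not the same key as 7 (an integer). The toy function below imitates that fallback with plain Python; it is a stand-in written for illustration, not Surprise's code, and the ratings and the name `toy_predict` are made up.

```python
# (user, item) -> rating, with integer ids as in the notebook's data frame
ratings = {(7, 1): 4.0, (7, 2): 5.0, (8, 50): 3.0, (9, 50): 5.0}

known_users = {u for u, _ in ratings}
known_items = {i for _, i in ratings}
global_mean = sum(ratings.values()) / len(ratings)

def toy_predict(uid, iid):
    """Return (estimate, was_impossible) for one user/item pair."""
    if uid not in known_users or iid not in known_items:
        # unknown user or item: fall back to the mean of all ratings
        return global_mean, True
    # placeholder for a real neighborhood estimate: the item's average rating
    item_ratings = [r for (_, i), r in ratings.items() if i == iid]
    return sum(item_ratings) / len(item_ratings), False

print(toy_predict("7", "50"))  # string ids are not found
print(toy_predict(7, 50))      # integer ids are found
```

In the notebook, the corresponding change would be to call `algo.predict(uid=7, iid=50)`, assuming user 7 and item 50 are both in the trainset the model was fit on.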

### GridSearchCV

The Surprise library also provides a GridSearchCV class that builds on cross-validation. Earlier we used cross_validate to evaluate one configuration of our model across 5 folds, and the RMSE was very similar from fold to fold. We can now use GridSearchCV to run the same kind of cross-validation over several parameter combinations and see which parameters are best for our model.

In [38]:
# import the grid search cv
from surprise.model_selection import GridSearchCV

In [46]:
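# set parameters to test
# note: 'True' and 'False' are strings here, and any non-empty string is truthy,
# so both settings behave as verbose=True and every combination is fit twice;
# verbose only controls log messages and does not affect the scores
param_grid = {
    'k': [10, 20, 30, 40, 50],
    'sim_options': {
        'name': ['cosine', 'pearson']
    },
    'verbose': ['True', 'False']
}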
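To get a feel for how much work this grid implies, the sketch below expands it into individual parameter combinations with `itertools.product`. It reflects the general idea of a grid search rather than Surprise's internal code, and the variable names are assumptions.

```python
from itertools import product

# the same grid as in the cell above
param_grid = {
    "k": [10, 20, 30, 40, 50],
    "sim_options": {"name": ["cosine", "pearson"]},
    "verbose": ["True", "False"],
}

# expand the nested sim_options dict into a list of concrete dicts
sim = param_grid["sim_options"]
sim_choices = [dict(zip(sim, values)) for values in product(*sim.values())]

# expand the whole grid: one dict per parameter combination
flat = {**param_grid, "sim_options": sim_choices}
combos = [dict(zip(flat, values)) for values in product(*flat.values())]

n_folds = 5
print(len(combos), "combinations,", len(combos) * n_folds, "model fits")

# both 'verbose' entries are non-empty strings, so both are truthy
print(bool("True"), bool("False"))
```

Since the two 'verbose' values behave the same, half of these fits repeat work; a single boolean such as `[False]` would likely halve the run time without changing the scores.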


In [49]:
# run grid search cv to be tested on our k neighbors params
gs = GridSearchCV(KNNWithMeans, param_grid, measures = ['rmse', 'mae'], cv = 5)

In [50]:
gs.fit(movie)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing

After waiting for the GridSearchCV cell above to execute (it takes about 5-10 minutes), we can view our best params. The waiting time is something to keep in mind: a memory-based recommender system is computationally expensive, because every fit recomputes a full similarity matrix, and here that happens for every parameter combination and every fold. For the extremely large datasets that enterprises work with, a model-based approach to recommender systems is often required. Our results tell us that k = 50 neighbors (the default is 40) and the pearson similarity option give the lowest RMSE. This does not come as a surprise, because the pearson similarity score is essentially a centered cosine similarity, which is often more accurate than plain cosine similarity. The best params also show user_based = True, but our grid did not include user_based, so this is Surprise's default setting rather than the winner of a comparison with the item-based option. To test whether item-based recommendations do better here, as they often do, 'user_based': [True, False] would need to be added under sim_options in the grid.
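The relationship between the two similarity options can be checked on a toy example: Pearson correlation is the cosine similarity of the mean-centered vectors. The sketch below assumes both rating vectors cover the same movies; Surprise computes its similarities over co-rated entries only, so this is an illustration of the idea, not of the library's exact computation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def center(v):
    """Subtract the vector's own mean from every entry."""
    mean = sum(v) / len(v)
    return [x - mean for x in v]

def pearson(a, b):
    """Pearson correlation, written as a centered cosine similarity."""
    return cosine(center(a), center(b))

# two made-up raters with opposite preferences on the same three movies
rater_a = [5, 4, 5]
rater_b = [4, 5, 4]

print(f"cosine:  {cosine(rater_a, rater_b):.3f}")
print(f"pearson: {pearson(rater_a, rater_b):.3f}")
```

Plain cosine sees two vectors of uniformly high ratings and calls them similar, while the centered version picks up that the raters disagree about which movie is better.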

In [53]:
# view best params for rmse
best_params = gs.best_params['rmse']
best_params

{'k': 50,
 'sim_options': {'name': 'pearson', 'user_based': True},
 'verbose': 'True'}

In [54]:
# view best RMSE score after grid search
best_rmse = gs.best_score['rmse']
best_rmse

0.9493739430975392

In [56]:
# now that we have the best params we can fit this to our data for a final model
final_model = gs.best_estimator['rmse']
final_model.fit(movie.build_full_trainset())

Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x19ebe6d04c0>

### Top 5 Recommended Movies

The predict method in the Surprise library makes a prediction for one particular user id and item id at a time. To get each user's top 5 movies, we train a model on the full trainset and then use the build_anti_testset method, which creates a test set of every user and movie pair that has no rating in the trainset, that is, all the movies each user did not rate. We then pass the resulting predictions to a function that takes the predictions and the number of movies we want returned for each user. This function has been largely adapted from the Surprise documentation and can be found here: https://surprise.readthedocs.io/en/stable/FAQ.html
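The idea of an anti-testset can be sketched in a few lines of plain Python: list every (user, item) pair that does not appear in the ratings. The tiny ratings list is made up, and using the global mean as the placeholder rating is an assumption about the default behavior rather than something shown in this notebook.

```python
# (user, item, rating) triples standing in for the trainset
ratings = [(0, 50, 5), (0, 172, 5), (1, 50, 4), (1, 133, 1)]

users = sorted({u for u, _, _ in ratings})
items = sorted({i for _, i, _ in ratings})
seen = {(u, i) for u, i, _ in ratings}

# placeholder "true" rating for the unrated pairs (assumed: the global mean)
global_mean = sum(r for _, _, r in ratings) / len(ratings)

# every user/item pair with no rating in the data
anti_testset = [
    (u, i, global_mean) for u in users for i in items if (u, i) not in seen
]
print(anti_testset)
```

The number of pairs grows with users times items, so on the full data this set is far larger than the ratings table, which is part of why the prediction step takes a while.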

In [58]:
from collections import defaultdict

In [59]:
def get_top_n(predictions, n=5):
    """Return a dict mapping each user id to its n highest (item id, estimate) pairs."""

    # map the predictions to each user
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # sort the predictions for each user and keep the n highest ones
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    # return the top n for every user
    return top_n


# train a KNNWithMeans model on the full dataset
# note: this uses the default settings (MSD similarity, user-based), not the
# grid search's best params; final_model from above could be used instead
trainset = movie.build_full_trainset()
final_algo = KNNWithMeans()
final_algo.fit(trainset)

# predict ratings for all pairs user and item that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = final_algo.test(testset)

# call our function passing in the predictions and number we want
top_n = get_top_n(predictions, n=5)

# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])

Computing the msd similarity matrix...
Done computing similarity matrix.
0 [814, 1233, 1368, 1367, 1347]
1 [814, 1467, 1500, 1536, 1599]
2 [814, 1467, 1500, 1536, 1599]
3 [814, 1536, 1467, 1500, 1599]
4 [172, 174, 64, 178, 251]
5 [814, 1463, 1536, 1467, 1599]
6 [814, 1467, 1500, 1536, 1599]
7 [814, 1189, 1293, 1398, 1449]
8 [814, 1463, 1467, 1500, 1536]
9 [172, 174, 114, 12, 119]
10 [119, 169, 318, 814, 851]
11 [814, 1500, 1536, 1467, 1599]
12 [64, 22, 178, 251, 272]
13 [1536, 1467, 1599, 1500, 1642]
14 [119, 814, 851, 1122, 1189]
15 [814, 1536, 1467, 1599, 1500]
16 [251, 272, 119, 169, 313]
17 [814, 1536, 1306, 1500, 1467]
18 [814, 1293, 1398, 1449, 1467]
19 [814, 1467, 1500, 1536, 1599]
20 [814, 1536, 1463, 1467, 1500]
21 [814, 1536, 1467, 1500, 1599]
22 [814, 1536, 1467, 1463, 1599]
23 [814, 1467, 1500, 1536, 1599]
24 [22, 251, 114, 197, 119]
25 [119, 814, 1189, 1293, 1367]
26 [814, 1536, 1467, 1500, 1599]
27 [814, 1536, 1500, 1467, 1599]
28 [814, 1467, 1500, 1536, 1599]
29 [814, 14

In [76]:
# top 5 recommended movies for user 0.
print(top_n[0])

[(814, 5), (1233, 5), (1368, 5), (1367, 5), (1347, 5)]


In [81]:
# View movie ID 814
data.loc[data['item_id'] == 814]

Unnamed: 0,user_id,item_id,rating,title
96402,13,814,5,"Great Day in Harlem, A (1994)"


In [83]:
# View movie ID 1233
data.loc[data['item_id'] == 1233]

Unnamed: 0,user_id,item_id,rating,title
99784,100,1233,3,Nénette et Boni (1996)
99786,139,1233,5,Nénette et Boni (1996)
99789,206,1233,1,Nénette et Boni (1996)
99787,355,1233,4,Nénette et Boni (1996)
99785,637,1233,5,Nénette et Boni (1996)
99788,655,1233,3,Nénette et Boni (1996)


In [86]:
# View movie ID 1368
data.loc[data['item_id'] == 1368]

Unnamed: 0,user_id,item_id,rating,title
97250,181,1368,1,Mina Tannenbaum (1994)
97247,243,1368,2,Mina Tannenbaum (1994)
97246,342,1368,5,Mina Tannenbaum (1994)
97248,655,1368,5,Mina Tannenbaum (1994)
97251,858,1368,4,Mina Tannenbaum (1994)
97249,936,1368,5,Mina Tannenbaum (1994)


In [87]:
# View movie ID 1367
data.loc[data['item_id'] == 1367]

Unnamed: 0,user_id,item_id,rating,title
97760,181,1367,2,Faust (1994)
97759,385,1367,5,Faust (1994)
97761,449,1367,4,Faust (1994)
97763,581,1367,5,Faust (1994)
97762,773,1367,5,Faust (1994)


In [88]:
# View movie ID 1347
data.loc[data['item_id'] == 1347]

Unnamed: 0,user_id,item_id,rating,title
94190,181,1347,1,"Ballad of Narayama, The (Narayama Bushiko) (1958)"
94193,420,1347,3,"Ballad of Narayama, The (Narayama Bushiko) (1958)"
94191,430,1347,5,"Ballad of Narayama, The (Narayama Bushiko) (1958)"
94192,585,1347,2,"Ballad of Narayama, The (Narayama Bushiko) (1958)"


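Now we have created a list of the top 5 predicted movie ratings for every user. User 0, the first user in the list, is projected to give a 5 rating to the following movies: 
- A Great Day in Harlem
- Nénette et Boni
- Mina Tannenbaum
- Faust
- Ballad of Narayama, The (Narayama Bushiko)

Each of these movies has only a handful of ratings in our data (item 814 has a single rating, a 5), so these top picks rest on very little evidence and should be treated with caution.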
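Rather than looking up each item id by hand, the recommendations can be mapped to titles in one step. The sketch below uses a plain dictionary filled with the five titles shown above; the helper name `with_titles` is an assumption. With the notebook's data, a full lookup could be built with something like `dict(zip(df_titles['item_id'], df_titles['title']))`.

```python
# item id -> title, copied from the lookups above
id_to_title = {
    814: "Great Day in Harlem, A (1994)",
    1233: "Nénette et Boni (1996)",
    1368: "Mina Tannenbaum (1994)",
    1367: "Faust (1994)",
    1347: "Ballad of Narayama, The (Narayama Bushiko) (1958)",
}

# user 0's top 5 as (item id, estimated rating), copied from top_n[0]
top_5_user_0 = [(814, 5), (1233, 5), (1368, 5), (1367, 5), (1347, 5)]

def with_titles(recommendations, titles):
    """Replace each item id with its title, keeping the estimated rating."""
    return [
        (titles.get(iid, f"unknown item {iid}"), est)
        for iid, est in recommendations
    ]

for title, est in with_titles(top_5_user_0, id_to_title):
    print(f"{est:.1f}  {title}")
```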