## Collaborative Filtering

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from surprise import Dataset
from surprise import Reader, KNNWithMeans, SVD
from ast import literal_eval
from surprise.model_selection import train_test_split, KFold, GridSearchCV, cross_validate
from surprise import accuracy
from collections import defaultdict

Load the data from the files `movies_metadata.csv` and `ratings_small.csv`.

In [2]:
df = pd.read_csv('movies_metadata.csv',usecols=['id','title'])
links = pd.to_numeric(df['id'],errors="coerce")
ratings = pd.read_csv('ratings_small.csv')
df.head()

| | id | title |
|---|---|---|
| 0 | 862 | Toy Story |
| 1 | 8844 | Jumanji |
| 2 | 15602 | Grumpier Old Men |
| 3 | 31357 | Waiting to Exhale |
| 4 | 11862 | Father of the Bride Part II |


A few ratings have no corresponding entry in the movies dataframe. To avoid that, we keep only the ratings whose `movieId` matches a valid `id`.

In [3]:
ratings = ratings[ratings['movieId'].isin(links)]
print(ratings.shape)
ratings.head()

(44989, 4)


| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 10 | 1 | 1371 | 2.5 | 1260759135 |
| 11 | 1 | 1405 | 1.0 | 1260759203 |
| 13 | 1 | 2105 | 4.0 | 1260759139 |
| 15 | 1 | 2193 | 2.0 | 1260759198 |
| 16 | 1 | 2294 | 2.0 | 1260759108 |


#### Use the Surprise Library for the Recommendation System

There are two types of Collaborative Filtering: Memory Based Filtering and Model Based Filtering. Memory Based Filtering can be further divided into User based and Item based Filtering.

User Collaborative Filtering: find the neighbours most similar to the active user and recommend the movies those similar users liked.

Item Collaborative Filtering: find movies similar to the ones the user has already liked, with similarity computed from past ratings.

In this movie dataset, the number of items/movies is much larger than the number of users, so Item based Filtering is the variant favoured here. For the Memory Based Filtering we use the KNN algorithm.
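To make the item-based idea concrete, here is a minimal pure-Python sketch of a "KNN with means" prediction using the MSD similarity. The toy ratings, user names, and helper functions are all made up for illustration; it mirrors the general technique, not Surprise's actual implementation, and it omits the clipping to the rating scale.

```python
from collections import defaultdict

# Toy ratings (user, item, rating); every value here is hypothetical.
ratings = [
    ("u1", "A", 4.0), ("u1", "B", 5.0), ("u1", "C", 1.0),
    ("u2", "A", 3.0), ("u2", "B", 4.0), ("u2", "C", 2.0),
    ("u3", "A", 4.0), ("u3", "C", 2.0),
]

by_item = defaultdict(dict)  # item -> {user: rating}
for user, item, r in ratings:
    by_item[item][user] = r

def item_mean(item):
    vals = by_item[item].values()
    return sum(vals) / len(vals)

def msd_similarity(i, j, min_support=2):
    """1 / (1 + mean squared difference) over users who rated both items."""
    common = by_item[i].keys() & by_item[j].keys()
    if len(common) < min_support:
        return 0.0
    msd = sum((by_item[i][u] - by_item[j][u]) ** 2 for u in common) / len(common)
    return 1.0 / (1.0 + msd)

def predict(user, item, k=2):
    """Item mean plus the similarity-weighted mean-centred ratings of the
    k most similar items that this user has rated."""
    neighbours = [
        (msd_similarity(item, j), by_item[j][user] - item_mean(j))
        for j in by_item
        if j != item and user in by_item[j]
    ]
    neighbours = sorted((n for n in neighbours if n[0] > 0), reverse=True)[:k]
    total_sim = sum(s for s, _ in neighbours)
    if total_sim == 0:
        return item_mean(item)  # no usable neighbours: fall back to the item mean
    return item_mean(item) + sum(s * d for s, d in neighbours) / total_sim

print(predict("u3", "B"))
```

Here user `u3` has not rated item `B`, so the estimate starts from the mean rating of `B` and is shifted by how far `u3`'s ratings of the similar items `A` and `C` sit above those items' means.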


In [4]:
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

#### Percentage-Split 

In [5]:
# Hold out 25% of the ratings for testing.
trainset_knn, testset_knn = train_test_split(data, test_size=.25)

# Item-based KNN (user_based=False) with MSD similarity.
algo_knn = KNNWithMeans(k=50, sim_options={'name': 'msd', 'min_support': 2, 'user_based': False})
algo_knn.fit(trainset_knn)
predictions_knn = algo_knn.test(testset_knn)
accuracy.rmse(predictions_knn)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9138


0.9138344385026723
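For reference, the RMSE reported above is the square root of the mean squared difference between true and estimated ratings. The sketch below computes it, together with MAE, on a few made-up (true, estimated) pairs; the helper names and the toy values are illustrative only.

```python
import math

def rmse(pairs):
    """pairs: iterable of (true_rating, estimated_rating)."""
    pairs = list(pairs)
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over the same (true, estimated) pairs."""
    pairs = list(pairs)
    return sum(abs(t - e) for t, e in pairs) / len(pairs)

# Hypothetical predictions, not taken from the dataset.
toy = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5)]
print(rmse(toy), mae(toy))
```

Because the errors are squared before averaging, RMSE penalises large misses more heavily than MAE does, which is why the two metrics can rank models differently.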

#### Cross-Validation

For cross-validation, we can call the cross_validate() function directly. To find the parameters that give the best results, we can instead use GridSearchCV, which tries every combination of the given parameters and reports the best one for each measure.

In [6]:
sim_options = {
    "name": ["msd", "cosine"],
    "min_support": [2, 3, 4, 5],
    "user_based": [False, True],
}

param_grid = {'k': [10, 20, 50],"sim_options": sim_options}

gs = GridSearchCV(KNNWithMeans, param_grid, measures=["rmse", "mae"], cv=10)
gs.fit(data)

print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
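The length of this log follows from the size of the grid: every combination of `k` and the `sim_options` entries is fitted once per fold. A small sketch of that enumeration, using the same values as the grid above (the variable names are illustrative):

```python
from itertools import product

k_values = [10, 20, 50]
sim_names = ["msd", "cosine"]
min_supports = [2, 3, 4, 5]
user_based = [False, True]
cv_folds = 10

# One dict per parameter combination, in the shape an algorithm constructor would take.
combos = [
    {"k": k, "sim_options": {"name": n, "min_support": m, "user_based": u}}
    for k, n, m, u in product(k_values, sim_names, min_supports, user_based)
]

print(len(combos))             # 48 parameter combinations
print(len(combos) * cv_folds)  # 480 model fits in total
```

With 480 fits, each computing a similarity matrix, the search is slow; a smaller grid or fewer folds would be a reasonable trade-off when iterating.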

#### Model Based Filtering - SVD

In [7]:
svd_algo = SVD()
# Run 10-fold cross-validation and print results
cross_validate(svd_algo, data, measures=['RMSE', 'MAE'], cv=10, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 10 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Fold 8  Fold 9  Fold 10 Mean    Std     
RMSE (testset)    0.8959  0.9085  0.8963  0.8790  0.8864  0.9019  0.8979  0.8899  0.8994  0.8984  0.8954  0.0079  
MAE (testset)     0.6916  0.6982  0.6880  0.6746  0.6836  0.6923  0.6873  0.6839  0.6996  0.6957  0.6895  0.0072  
Fit time          3.42    3.25    3.16    3.25    3.69    3.55    3.38    3.44    3.54    4.02    3.47    0.24    
Test time         0.13    0.04    0.10    0.04    0.04    0.04    0.03    0.04    0.06    0.05    0.06    0.03    


{'test_rmse': array([0.89586447, 0.90854157, 0.89625268, 0.87898444, 0.8864358 ,
        0.90193516, 0.89791502, 0.88992037, 0.89935865, 0.89842657]),
 'test_mae': array([0.69160876, 0.69821188, 0.6880448 , 0.67462562, 0.68358393,
        0.69232457, 0.6872796 , 0.68393109, 0.69961545, 0.69566259]),
 'fit_time': (3.415785551071167,
  3.251150608062744,
  3.164181709289551,
  3.2541110515594482,
  3.689880132675171,
  3.547959089279175,
  3.382077693939209,
  3.4420037269592285,
  3.5449516773223877,
  4.023216485977173),
 'test_time': (0.13105368614196777,
  0.03597402572631836,
  0.09893918037414551,
  0.040976762771606445,
  0.04299139976501465,
  0.03597617149353027,
  0.03497576713562012,
  0.040976524353027344,
  0.06495475769042969,
  0.05496072769165039)}

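Based on the RMSE values, SVD performs better than the KNN algorithm here: its mean RMSE over the 10 folds is 0.8954, compared with 0.9138 for KNNWithMeans on the 25% hold-out split.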
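The comparison can be checked by hand from the numbers printed above. This sketch recomputes the SVD mean and standard deviation from the per-fold RMSE array and sets them against the KNN hold-out RMSE; note that the two figures come from different evaluation protocols (10-fold versus a single split), so the gap is indicative rather than a strict like-for-like test.

```python
from statistics import mean, pstdev

# Per-fold RMSE values copied from the cross_validate output above.
svd_rmse = [0.89586447, 0.90854157, 0.89625268, 0.87898444, 0.8864358,
            0.90193516, 0.89791502, 0.88992037, 0.89935865, 0.89842657]
# RMSE of KNNWithMeans on the 25% hold-out split.
knn_holdout_rmse = 0.9138344385026723

svd_mean = mean(svd_rmse)
svd_std = pstdev(svd_rmse)  # population standard deviation, matching the table
gap = knn_holdout_rmse - svd_mean

print(round(svd_mean, 4), round(svd_std, 4), round(gap, 4))
```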

### Top N Recommended Movies    

In [9]:
def get_top_n(predictions, n=10):
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

trainset = data.build_full_trainset()
svd_algo.fit(trainset)

# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = svd_algo.test(testset)

top_n = get_top_n(predictions, n=10)
print(top_n)


defaultdict(<class 'list'>, {1: [(922, 4.124047273318432), (527, 4.095410974559923), (905, 4.0809701608524165), (608, 4.070709501576042), (4993, 4.070351660882528), (2973, 3.986232113964312), (318, 3.9757877117517295), (58559, 3.9732391630718857), (912, 3.969359408898296), (750, 3.9435560998372887)], 2: [(318, 4.549468694081954), (6016, 4.500942581132046), (1939, 4.4906862372827305), (926, 4.421529320067411), (1252, 4.420818401813018), (913, 4.394550574195787), (922, 4.374884753540144), (2019, 4.37066487367658), (858, 4.3445344965406925), (1259, 4.318831719310683)], 3: [(6016, 4.2742891006399475), (858, 4.254858598007161), (31658, 4.151234527606904), (745, 4.124579551580607), (2064, 4.1028916924209184), (926, 4.099388920634311), (2132, 4.094856566566093), (994, 4.0783344971199345), (1939, 4.063813688079266), (750, 4.055027318877748)], 4: [(527, 5), (593, 5), (318, 5), (111, 5), (745, 5), (4226, 5), (608, 5), (912, 5), (3683, 5), (2959, 4.999856572423964)], 5: [(260, 4.7913771191752454)

The results above show each user along with their recommended movies. Taking user 1 as an example, the output below shows the movies recommended for that user.

In [10]:
res = top_n.get(1)
mov_ids = dict(res).keys()
mov_list = [ str(item) for item in mov_ids]
print(df[df.id.isin(mov_list)][['id','title']] )

          id                                 title
286      527                    Once Were Warriors
700      922                              Dead Man
2649     912               The Thomas Crown Affair
4020     318              The Million Dollar Hotel
5295    4993                           5 Card Stud
5325     608                       Men in Black II
8316     905                         Pandora's Box
8546     750                       Murder She Said
14332   2973                 Aelita: Queen of Mars
37719  58559  Confession of a Child of the Century
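One caveat about this lookup: it matches `movieId` from the ratings directly against `id` in the movie metadata, and `isin` returns rows in dataframe order rather than in ranking order. If the two files use different ID systems (for example MovieLens IDs in the ratings and TMDB IDs in the metadata, which is an assumption about this dataset, not something shown above), the titles would be mismatched. The sketch below shows one way to translate IDs through a links table while keeping the ranking order; the frames, the `tmdbId` column name, and the titles are all hypothetical stand-ins.

```python
import pandas as pd

# Toy stand-ins; the `movieId`/`tmdbId` link columns are an assumption.
links_df = pd.DataFrame({"movieId": [1, 2, 3], "tmdbId": [101, 102, 103]})
movies = pd.DataFrame({"id": ["101", "102", "103", "bad-id"],
                       "title": ["Movie X", "Movie Y", "Movie Z", "Broken row"]})

# Coerce metadata ids to numbers and drop rows that fail to parse.
movies["tmdbId"] = pd.to_numeric(movies["id"], errors="coerce")
movies = movies.dropna(subset=["tmdbId"]).astype({"tmdbId": "int64"})

# Build a movieId -> title lookup through the links table.
id_to_title = (links_df.merge(movies, on="tmdbId", how="inner")
                       .set_index("movieId")["title"])

recommended = [3, 1]  # movieIds, highest estimate first
print(id_to_title.reindex(recommended).tolist())
```

Using `reindex` with the recommended IDs keeps the titles in the same order as the predicted ratings.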


In [11]:
uid = str(1)
iid = str(11152)
# get a prediction for specific users and items.
pred = svd_algo.predict(uid, iid, r_ui=4, verbose=True)
pred

user: 1          item: 11152      r_ui = 4.00   est = 3.56   {'was_impossible': False}


Prediction(uid='1', iid='11152', r_ui=4, est=3.5609926871012916, details={'was_impossible': False})

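For user ID '1' and movie ID 11152, the estimated rating is 3.56, compared with the supplied true rating (`r_ui`) of 4. Note that the IDs are passed here as strings, whereas the ratings were loaded from a dataframe with integer IDs. Surprise matches raw IDs exactly, so this estimate is likely the fallback for an unknown user and item rather than a personalised prediction; passing `uid = 1` and `iid = 11152` as integers would avoid this.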
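To see where such an estimate comes from, here is a simplified sketch of the biased matrix-factorisation rule that SVD-style models use: global mean plus user bias plus item bias plus the dot product of the user and item factors, clipped to the rating scale. The function name and all numeric values are hypothetical; `None` stands for a user or item that the model has not seen, for instance a raw ID whose type does not match the training data.

```python
def svd_estimate(global_mean, user_bias=None, item_bias=None,
                 user_factors=None, item_factors=None, scale=(1.0, 5.0)):
    """Biased matrix-factorisation estimate; None marks an unknown user or item."""
    est = global_mean
    if user_bias is not None:
        est += user_bias
    if item_bias is not None:
        est += item_bias
    # The interaction term needs both the user and the item to be known.
    if user_factors is not None and item_factors is not None:
        est += sum(p * q for p, q in zip(user_factors, item_factors))
    low, high = scale
    return min(high, max(low, est))  # clip to the rating scale

# Known user and item (made-up parameters).
print(svd_estimate(3.5, 0.2, 0.3, [0.1, -0.2], [0.5, 0.5]))
# Unknown user and item: only the global mean remains.
print(svd_estimate(3.5))
# An estimate above the scale is clipped to the maximum rating.
print(svd_estimate(4.8, 0.4, 0.3))
```

The clipping step would also explain why several of the top-N estimates for user 4 are exactly 5.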