# Movie Recommender System using Surprise Library.

## Please star/upvote if you like it.

## CONTENTS

#### 1 ) Importing Modules & Loading the Dataset

#### 2 ) Preparing the Data

#### 3 ) Modelling

#### 4 ) Evaluating and Comparing Various Models' Performances

#### 5 ) Parameter Tuning with GridSearchCV

## 1 ) Importing Modules & Loading the Dataset

## 1.1 ) Importing Various Modules

In [0]:
import pandas as pd
import numpy as np

# general modules.
from surprise.dataset import Dataset
from surprise import trainset


# modelling algos based on MF.
from surprise.prediction_algorithms.matrix_factorization import SVD
from surprise.prediction_algorithms.matrix_factorization import SVDpp
from surprise.prediction_algorithms.matrix_factorization import NMF

#modelling algos based on kNNs
from surprise.prediction_algorithms.knns import KNNBasic
from surprise.prediction_algorithms.knns import KNNWithMeans
from surprise.prediction_algorithms.knns import KNNWithZScore
from surprise.prediction_algorithms.knns import KNNBaseline

#model selection.
from surprise.model_selection.split import train_test_split
from surprise import accuracy
from surprise.model_selection.search import GridSearchCV

## 1.2 ) Loading the 'built-in' Dataset

The Surprise package has 3 built-in datasets:

1. The movielens-100k dataset.
2. The movielens-1m dataset.
3. The Jester dataset 2.

For this notebook I am using the standard movielens-100K dataset.

In [0]:
data=Dataset.load_builtin(name='ml-100k')

In [6]:
data
type(data)
print(data)

<surprise.dataset.DatasetAutoFolds object at 0x7f50ad7c7f28>


Note that the returned object is of type 'DatasetAutoFolds'. We will later convert it to a 'Trainset' object, whose ratings can then be copied into a NumPy array or a pandas DataFrame so that exploratory analysis can be performed.
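
For intuition about what the loaded object holds: each MovieLens 100K record is a tab-separated line of user id, item id, rating and timestamp. Below is a minimal sketch of parsing such lines by hand, assuming that layout. The sample lines and the `parse_line` helper are made up for illustration and are not part of Surprise.

```python
# Hypothetical sample lines in the "user item rating timestamp" layout
# (tab-separated); the values are illustrative, not copied from the dataset.
sample_lines = [
    "1\t10\t4\t880000000",
    "1\t20\t5\t880000100",
    "2\t10\t3\t880000200",
]


def parse_line(line):
    """Split one record into (user_id, item_id, rating, timestamp)."""
    user_id, item_id, rating, timestamp = line.split("\t")
    # Ids stay as strings (raw ids); the rating becomes a float.
    return user_id, item_id, float(rating), int(timestamp)


raw_ratings = [parse_line(line) for line in sample_lines]
print(raw_ratings[0])
```

Keeping the ids as strings mirrors what we see later in the 'test' list, where users and movies appear as '391', '591' and so on.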

## 2 ) Preparing the Data

## 2.1 ) Splitting into Training and Validation Sets.

In [0]:
train,test=train_test_split(data,test_size=0.25,random_state=42,shuffle=True)

This gives us the training and the validation sets. 'train_test_split' is a function from Surprise's 'model_selection' module.

In [8]:
print(type(test))
print(type(train))

<class 'list'>
<class 'surprise.trainset.Trainset'>


#### Note here that the types of 'train' and 'test' are different.

1. 'train' belongs to the 'Trainset' class (which has quite useful methods and attributes, as we shall see).

2. 'test' is a normal list where each element is a tuple of the form (user_id, movie_id, rating).
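
To make the idea of the split concrete, here is a rough sketch of a shuffled 75/25 split on a plain list of rating tuples. This is not Surprise's implementation (which additionally wraps the training part in a 'Trainset'); `split_ratings` and the toy data are hypothetical names used only for illustration.

```python
import random


def split_ratings(ratings, test_size=0.25, seed=42):
    """Shuffle a list of (user_id, movie_id, rating) tuples and split it."""
    shuffled = list(ratings)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


# 100 made-up ratings: 10 users x 10 movies, ratings between 1.0 and 5.0.
toy_ratings = [(str(u), str(m), float(1 + (u + m) % 5))
               for u in range(10) for m in range(10)]

toy_train, toy_test = split_ratings(toy_ratings)
print(len(toy_train), len(toy_test))
```

With 100 toy ratings and test_size=0.25, 75 tuples end up in the training part and 25 in the test part, the same proportions as the 75000/25000 split of the real data.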

## 2.2 ) Exploring the 'train' & 'test' sets.

As mentioned before, the 'train' set here belongs to the 'Trainset' class of Surprise. The Trainset class has many useful
attributes which can be used to get details such as the number of users and items, the average rating, etc.

Below is an example:

In [9]:
print("No of unique users in train set: ",train.n_users)
print("No of unique movies in train set: ",train.n_items)
print("No of ratings in train set: ",train.n_ratings)
print("Range of ratings in train set: ",train.rating_scale)
print("Mean Rating of train set: ",train.global_mean)   

No of unique users in train set:  943
No of unique movies in train set:  1644
No of ratings in train set:  75000
Range of ratings in train set:  (1, 5)
Mean Rating of train set:  3.53064


#### Note that we can also obtain the above details by first copying the ratings of the 'Trainset' object into a pandas DataFrame or a NumPy array. Then we can use the usual 'describe' function on the DataFrame to get details such as the count, minimum rating and average rating.

In [10]:
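# all_ratings() yields (inner user id, inner item id, rating) tuples.
iterator = train.all_ratings()
rat_df= pd.DataFrame(columns=['uid', 'iid', 'rating'])
# Append one row per rating (simple, but slow for 75,000 rows).
for i, (uid, iid, rating) in enumerate(iterator):
    rat_df.loc[i] = [uid, iid, rating]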

rat_df.head(10)

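| | uid | iid | rating |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 4.0 |
| 1 | 0.0 | 344.0 | 4.0 |
| 2 | 0.0 | 817.0 | 4.0 |
| 3 | 0.0 | 310.0 | 4.0 |
| 4 | 0.0 | 154.0 | 5.0 |
| 5 | 0.0 | 809.0 | 4.0 |
| 6 | 0.0 | 376.0 | 4.0 |
| 7 | 0.0 | 83.0 | 4.0 |
| 8 | 0.0 | 200.0 | 5.0 |
| 9 | 0.0 | 74.0 | 3.0 |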
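
Filling a DataFrame one row at a time with `.loc` gets slow as the number of ratings grows. As a hedged alternative, the frame can be built in a single call from the list of tuples. The sketch below uses a small made-up iterator (`toy_all_ratings`) as a stand-in for `train.all_ratings()`.

```python
import pandas as pd

# Stand-in for train.all_ratings(): an iterator of (uid, iid, rating) tuples.
toy_all_ratings = iter([(0, 0, 4.0), (0, 344, 4.0), (1, 12, 3.0)])

# Build the frame in one call instead of appending row by row with .loc.
toy_rat_df = pd.DataFrame(list(toy_all_ratings), columns=['uid', 'iid', 'rating'])
print(toy_rat_df)
```

One difference to be aware of: built this way, the id columns keep their integer type, whereas the row-by-row version stores them as floats (0.0, 344.0, ...).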


In [11]:
rat_df.describe()

| | uid | iid | rating |
|---|---|---|---|
| count | 75000.0 | 75000.0 | 75000.0 |
| mean | 329.230547 | 433.553333 | 3.53064 |
| std | 239.044598 | 331.873714 | 1.12433 |
| min | 0.0 | 0.0 | 1.0 |
| 25% | 125.75 | 166.0 | 3.0 |
| 50% | 293.0 | 362.0 | 4.0 |
| 75% | 486.0 | 633.0 | 4.0 |
| max | 942.0 | 1643.0 | 5.0 |


Note that the average, minimum and maximum ratings are consistent with those obtained from the attributes of the 'Trainset' class of the Surprise package.

#### Similarly, we can do the same analysis for the 'test' set. Just note that 'test' is a list of tuples and NOT a 'Trainset' instance, so we have to write the corresponding routines ourselves.
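
The uid and iid columns above start at 0 because a 'Trainset' stores inner ids, while the 'test' list keeps the raw ids from the data file (strings such as '391'). Surprise exposes conversion methods such as `to_raw_uid` and `to_inner_uid` on the Trainset. The sketch below only illustrates the idea of such a mapping in plain Python; `build_id_maps` is a hypothetical helper, not Surprise code.

```python
def build_id_maps(raw_ids):
    """Assign consecutive inner ids (0, 1, 2, ...) in order of first appearance."""
    raw_to_inner = {}
    for raw_id in raw_ids:
        if raw_id not in raw_to_inner:
            raw_to_inner[raw_id] = len(raw_to_inner)
    # Reverse lookup: inner id -> raw id.
    inner_to_raw = {inner: raw for raw, inner in raw_to_inner.items()}
    return raw_to_inner, inner_to_raw


toy_raw_users = ['391', '181', '391', '637']
raw_to_inner, inner_to_raw = build_id_maps(toy_raw_users)
print(raw_to_inner)
print(inner_to_raw[2])
```

This is why the ids in the training DataFrame and in the 'test' tuples cannot be compared directly without converting one to the other.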

In [12]:
print(test[:7])

# Unpack the (user_id, movie_id, rating) tuples into separate lists.
uid=[]
mid=[]
ratings=[]
for row in test:
  uid.append(row[0])
  mid.append(row[1])
  ratings.append(row[2])
r=list(ratings)   # copy of the ratings, used later to build the test DataFrame

print("No of unique users in test set: ",len(set(uid)))
print("No of unique movies in test set: ",len(set(mid)))
print("No of ratings in test set: ",len(test))

print("Range of ratings in test set: ",min(ratings),max(ratings))
print("Mean Rating of test set: ",sum(ratings)/len(ratings))

[('391', '591', 4.0), ('181', '1291', 1.0), ('637', '268', 2.0), ('332', '451', 5.0), ('271', '204', 4.0), ('27', '286', 3.0), ('387', '663', 4.0)]
No of unique users in test set:  943
No of unique movies in test set:  1463
No of ratings in test set:  25000
Range of ratings in test set:  1.0 5.0
Mean Rating of test set:  3.52752


In [13]:
# Build the test DataFrame directly from the three parallel lists.
test_rat_df=pd.DataFrame({'uid':uid,'iid':mid,'rating':r})
test_rat_df.head(10)


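| | uid | iid | rating |
|---|---|---|---|
| 0 | 391 | 591 | 4.0 |
| 1 | 181 | 1291 | 1.0 |
| 2 | 637 | 268 | 2.0 |
| 3 | 332 | 451 | 5.0 |
| 4 | 271 | 204 | 4.0 |
| 5 | 27 | 286 | 3.0 |
| 6 | 387 | 663 | 4.0 |
| 7 | 92 | 722 | 3.0 |
| 8 | 820 | 347 | 4.0 |
| 9 | 479 | 1444 | 1.0 |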


In [14]:
test_rat_df.describe()

| | rating |
|---|---|
| count | 25000.0 |
| mean | 3.52752 |
| std | 1.129714 |
| min | 1.0 |
| 25% | 3.0 |
| 50% | 4.0 |
| 75% | 4.0 |
| max | 5.0 |


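As before, the DataFrame shows the minimum and maximum ratings etc. for the test set. Only the 'rating' column is summarized because the user and movie ids in 'test' are strings, and 'describe' covers only numeric columns by default.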

## 2.3 ) Creating Utility Matrix.

In [0]:
# creating utility matrix: rows are users, columns are movies, cells are ratings.
df=rat_df.copy()

# pivot_table sorts the row and column labels itself, so no manual sorting is needed.
util_df=pd.pivot_table(data=df,values='rating',index='uid',columns='iid')

In [None]:
# util_df.head(100)

#### Notice that at this point the utility matrix has 'NaN' values (run the above cell in case you need to see them). To fill these missing ratings we can use either the 'row average', the 'column average' or simply 0, as I have done for now. Note that 'fillna' returns a new DataFrame; 'util_df' itself is unchanged unless the result is assigned back.

In [17]:
util_df.fillna(0) 

iid,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,...,1634.0,1635.0,1636.0,1637.0,1638.0,1639.0,1640.0,1641.0,1642.0,1643.0
uid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0.0,4.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2.0,0.0,0.0,4.0,0.0,0.0,2.0,4.0,3.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3.0,0.0,0.0,0.0,2.0,4.0,5.0,5.0,5.0,4.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4.0,4.0,0.0,0.0,0.0,2.0,0.0,1.0,0.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5.0,0.0,0.0,0.0,0.0,0.0,3.0,4.0,0.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7.0,2.0,0.0,0.0,0.0,4.0,0.0,0.0,4.0,3.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9.0,0.0,0.0,4.0,0.0,3.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Notice that most of the cells hold '0', which simply means that the utility matrix is sparse. This is what we would expect: not every user-movie pair has a rating, because not all users watch all movies. Where the user has watched the corresponding movie, we have a rating on the scale of '1-5'.
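
To put a number on the sparsity: the training set has 75000 ratings for 943 x 1644 = 1,550,292 user-movie pairs, so only about 4.8% of the cells are filled. The sketch below shows, on a tiny made-up matrix (`toy_util`), how that density can be computed and how the 'row average' fill mentioned earlier differs from filling with 0. It is an illustration only, not a recommendation of one fill strategy over another.

```python
import numpy as np
import pandas as pd

# Toy utility matrix: rows are users, columns are movies, NaN = not rated.
toy_util = pd.DataFrame(
    [[4.0, np.nan, 2.0],
     [np.nan, 3.0, np.nan],
     [5.0, np.nan, np.nan]],
    index=[0, 1, 2], columns=[0, 1, 2],
)

# Fraction of user-movie pairs that actually have a rating.
density = toy_util.notna().to_numpy().mean()

# Fill each user's missing ratings with that user's own average rating.
filled = toy_util.apply(lambda row: row.fillna(row.mean()), axis=1)

print(density)
print(filled)
```

Filling with 0 treats an unrated movie like a rating below the minimum of the 1-5 scale, whereas the row average keeps each filled cell at the user's typical rating level.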

## 3 ) Modelling

## 3.1 ) Matrix Factorization  Based Algorithms

#### SINGULAR VALUE DECOMPOSITION (SVD).

In [20]:
rec_svd=SVD()
rec_svd.fit(train)
rat_pred=rec_svd.test(test)

print(rat_pred[0])
print(accuracy.rmse(rat_pred, verbose=True))


user: 391        item: 591        r_ui = 4.00   est = 3.53   {'was_impossible': False}
RMSE: 0.9429
0.9429260800626247


#### Notice that we use the 'Root Mean Squared Error (RMSE)' as the evaluation metric.

I have also printed a single prediction from the returned list to show what it contains: the user, the item, the true rating (r_ui) and the estimated rating (est).
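
For reference, RMSE is the square root of the average squared difference between the true and the estimated ratings, and MAE (used later in the comparison) is the average absolute difference. A minimal sketch on made-up (true, estimated) pairs follows; `rmse_of` and `mae_of` are hypothetical helper names chosen so they do not clash with anything in this notebook.

```python
import math


def rmse_of(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))


def mae_of(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)


toy_pairs = [(4.0, 3.5), (1.0, 2.0), (5.0, 5.0), (3.0, 2.5)]
print(rmse_of(toy_pairs), mae_of(toy_pairs))
```

Because the errors are squared before averaging, RMSE penalizes a few large misses more heavily than MAE does, which is why the RMSE values reported here are always at least as large as the MAE values.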

#### SVD plus-plus.

In [21]:
rec_svdpp=SVDpp()
rec_svdpp.fit(train)
rat_pred=rec_svdpp.test(test)

print(rat_pred[0]) # returns a list.
print(accuracy.rmse(rat_pred, verbose=True))


user: 391        item: 591        r_ui = 4.00   est = 3.32   {'was_impossible': False}
RMSE: 0.9252
0.9251522025687643


#### NON-NEGATIVE MATRIX FACTORIZATION (NMF).

In [22]:
rec_nmf=NMF()
rec_nmf.fit(train)
rat_pred=rec_nmf.test(test)

print(rat_pred[0])
print(accuracy.rmse(rat_pred, verbose=True))


user: 391        item: 591        r_ui = 4.00   est = 3.48   {'was_impossible': False}
RMSE: 0.9678
0.9678219246431873


## 3.2 ) Nearest Neighbor Based (kNN) Algorithms

#### In this section I have used the default similarity measure, 'Mean Squared Difference (MSD)', to calculate the similarity. This can of course be tuned using GridSearchCV, as shown in the last section. Other options include 'cosine', 'pearson' etc.
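
As a rough illustration of the MSD idea: take the items both users have rated, average the squared rating differences, and turn that distance into a similarity with 1 / (msd + 1). The sketch below is plain Python on made-up users, not Surprise's implementation; returning 0 when there are no co-rated items is an assumption made here for simplicity.

```python
def msd_similarity(ratings_u, ratings_v):
    """MSD-based similarity between two users given {item_id: rating} dicts."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0  # assumption: no co-rated items means no similarity
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)  # the +1 avoids division by zero when msd == 0


alice = {'m1': 4.0, 'm2': 5.0, 'm3': 1.0}
bob = {'m1': 5.0, 'm2': 5.0, 'm4': 2.0}
print(msd_similarity(alice, bob))
```

Here the two users share 'm1' and 'm2', the squared differences are 1 and 0, so the MSD is 0.5 and the similarity is 1 / 1.5. Identical ratings on the shared items give the maximum similarity of 1.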

#### BASIC kNN.

In [24]:
rec_knnb=KNNBasic()
rec_knnb.fit(train)
rat_pred=rec_knnb.test(test)

print(rat_pred[0])
print(accuracy.rmse(rat_pred, verbose=True))

Computing the msd similarity matrix...
Done computing similarity matrix.
user: 391        item: 591        r_ui = 4.00   est = 3.64   {'actual_k': 40, 'was_impossible': False}
RMSE: 0.9854
0.9853685167499322
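
The 'actual_k' field in the prediction is the number of neighbours that were actually used. Conceptually, the basic kNN estimate is a similarity-weighted average of the ratings that the k most similar users gave to the item. A hedged sketch of that formula on made-up numbers follows; `knn_basic_predict` is a hypothetical helper, not the library routine.

```python
def knn_basic_predict(neighbours, k=40):
    """neighbours: list of (similarity, rating) pairs from users who rated the item."""
    # Keep only the k most similar neighbours.
    top_k = sorted(neighbours, key=lambda pair: pair[0], reverse=True)[:k]
    sim_sum = sum(sim for sim, _ in top_k)
    if sim_sum == 0:
        return None  # no usable neighbour; a real library would fall back to a default
    return sum(sim * rating for sim, rating in top_k) / sim_sum


toy_neighbours = [(0.5, 4.0), (0.25, 2.0), (0.25, 5.0), (0.1, 1.0)]
print(knn_basic_predict(toy_neighbours, k=3))
```

With k=3 the least similar neighbour is dropped, and the estimate is (0.5*4 + 0.25*2 + 0.25*5) / 1.0 = 3.75.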


#### kNN WITH MEANS.

In [25]:
rec_knnm=KNNWithMeans()
rec_knnm.fit(train)
rat_pred=rec_knnm.test(test)

print(rat_pred[0])
print(accuracy.rmse(rat_pred, verbose=True))

Computing the msd similarity matrix...
Done computing similarity matrix.
user: 391        item: 591        r_ui = 4.00   est = 3.91   {'actual_k': 40, 'was_impossible': False}
RMSE: 0.9553
0.9552766748728951
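
The 'with means' variant differs from basic kNN in that it works with deviations from each user's mean rating: the neighbours' deviations are averaged (weighted by similarity) and added to the target user's own mean. A small sketch under that reading, with made-up numbers and a hypothetical helper name:

```python
def knn_with_means_predict(user_mean, neighbours, k=40):
    """neighbours: list of (similarity, neighbour_rating, neighbour_mean) triples."""
    top_k = sorted(neighbours, key=lambda t: t[0], reverse=True)[:k]
    sim_sum = sum(sim for sim, _, _ in top_k)
    if sim_sum == 0:
        return user_mean  # nothing to adjust with, so fall back to the user's mean
    # Weighted average of how far each neighbour's rating is from their own mean.
    offset = sum(sim * (rating - mean) for sim, rating, mean in top_k) / sim_sum
    return user_mean + offset


print(knn_with_means_predict(3.5, [(0.5, 5.0, 3.0), (0.5, 3.0, 3.0)]))
```

In this toy case the neighbours rate the item on average 1.0 above their own means, so a user whose mean is 3.5 gets an estimate of 4.5. This corrects for users who rate systematically high or low.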


#### kNN WITH z-SCORE.

In [26]:
rec_knnz=KNNWithZScore()
rec_knnz.fit(train)
rat_pred=rec_knnz.test(test)

print(rat_pred[0])
print(accuracy.rmse(rat_pred, verbose=True))

Computing the msd similarity matrix...
Done computing similarity matrix.
user: 391        item: 591        r_ui = 4.00   est = 3.86   {'actual_k': 40, 'was_impossible': False}
RMSE: 0.9555
0.9554783226088124


#### BASELINE kNN .

In [49]:
rec_knn_base=KNNBaseline()
rec_knn_base.fit(train)
rat_pred=rec_knn_base.test(test)

print(rat_pred[0])
print(accuracy.rmse(rat_pred))

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
user: 391        item: 591        r_ui = 4.00   est = 3.68   {'actual_k': 40, 'was_impossible': False}
RMSE: 0.9341
0.9341380311333354


## 4 ) Evaluating and Comparing Various Models' Performances

In [59]:
models=[KNNBasic(),KNNWithMeans(),KNNWithZScore(),KNNBaseline(),SVD(),SVDpp()]
model_names=['Basic KNN','KNN With Means','KNN With z-score','KNN Baseline','Single Value Decomposition(SVD)','SVD plus-plus']
rmse=[]
mae=[]
# Fit each model on the same training set and score it on the same test set.
for mod in models:
  mod.fit(train)
  pred=mod.test(test)
  rmse.append(accuracy.rmse(pred))
  mae.append(accuracy.mae(pred))

d={'Modelling Algo':model_names,'Root Mean Squared Error(RMSE)':rmse,'Mean Absolute Error(MAE)':mae}
comp_df=pd.DataFrame(d)
comp_df

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9854
MAE:  0.7775
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9553
MAE:  0.7519
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9555
MAE:  0.7489
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9341
MAE:  0.7354
RMSE: 0.9419
MAE:  0.7412
RMSE: 0.9263
MAE:  0.7259


| | Mean Absolute Error(MAE) | Modelling Algo | Root Mean Squared Error(RMSE) |
|---|---|---|---|
| 0 | 0.777509 | Basic KNN | 0.985369 |
| 1 | 0.751856 | KNN With Means | 0.955277 |
| 2 | 0.748878 | KNN With z-score | 0.955478 |
| 3 | 0.735428 | KNN Baseline | 0.934138 |
| 4 | 0.741209 | Single Value Decomposition(SVD) | 0.941929 |
| 5 | 0.725865 | SVD plus-plus | 0.926277 |


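#### Notice that the data frame summarizes the performance of the kNN, SVD and SVD++ models fitted above. This is especially useful for comparing the algorithms and choosing a particular one for making predictions on the test set. The SVD and SVD++ scores differ slightly from those in section 3 because these models are initialized randomly and no 'random_state' was set.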
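
To pick a winner from such a table, it is enough to sort by the metric of interest. A small sketch using the RMSE values from the comparison above (lower is better); the dictionary name `rmse_by_model` is just for this example.

```python
# RMSE values taken from the comparison table above.
rmse_by_model = {
    'Basic KNN': 0.985369,
    'KNN With Means': 0.955277,
    'KNN With z-score': 0.955478,
    'KNN Baseline': 0.934138,
    'Single Value Decomposition(SVD)': 0.941929,
    'SVD plus-plus': 0.926277,
}

# Sort from lowest (best) to highest (worst) RMSE.
ranked = sorted(rmse_by_model.items(), key=lambda item: item[1])
for name, score in ranked:
    print(f"{name:35s} {score:.4f}")

best_model = ranked[0][0]
print("Lowest RMSE:", best_model)
```

On this split SVD++ has the lowest RMSE and basic kNN the highest, with kNN Baseline coming second.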

## 5 ) Parameter Tuning with GridSearchCV

#### The performance of the above models can be improved by properly tuning their parameters. In this section I have tuned the parameters of two models: 'kNN Basic' and 'SVD'.

#### BASIC kNN.

In [0]:
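#knn basic

# Note: 'n_epochs' is not among KNNBasic's documented parameters (k, min_k,
# sim_options, verbose), so it most likely has no effect on these fits.
param_dict= {'n_epochs': [5, 10,20],'k':[30,40,50,80,100],'verbose':[False],
            'sim_options': {'name': ['msd', 'cosine','pearson'],
                               'user_based': [True]}}
clf_knnb = GridSearchCV(KNNBasic,param_grid=param_dict, measures=['rmse', 'mae'], cv=5)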

In [41]:
clf_knnb.fit(data)
print(clf_knnb.best_score)
print(clf_knnb.best_params)

{'rmse': 0.9763568596887279, 'mae': 0.7702263317066342}
{'rmse': {'n_epochs': 5, 'k': 30, 'verbose': False, 'sim_options': {'name': 'msd', 'user_based': True}}, 'mae': {'n_epochs': 5, 'k': 30, 'verbose': False, 'sim_options': {'name': 'msd', 'user_based': True}}}


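#### Notice how GridSearchCV returns the best parameters out of the ones specified.

#### Also note that the RMSE has decreased from 0.9854 (kNN Basic with default parameters, on the single hold-out split) to 0.9764 (mean over 5 cross-validation folds) after tuning with GridSearchCV from Surprise's model selection. The two figures are measured on different splits, so the comparison is only approximate.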
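
A grid search simply tries every combination of the listed values, and with cross-validation each combination is fitted once per fold. The sketch below enumerates a flattened version of the kNN grid with `itertools.product` to show how quickly the number of fits grows. The flat `toy_grid` layout is a simplification for illustration and not the nested format that GridSearchCV expects.

```python
from itertools import product

# The varying parts of the kNN grid from the cell above, flattened for illustration.
toy_grid = {
    'n_epochs': [5, 10, 20],
    'k': [30, 40, 50, 80, 100],
    'sim_name': ['msd', 'cosine', 'pearson'],
}

keys = list(toy_grid)
# One dict per combination of values.
combinations = [dict(zip(keys, values)) for values in product(*toy_grid.values())]

n_folds = 5
print(len(combinations), "combinations,", len(combinations) * n_folds, "model fits")
```

For this grid that is 3 x 5 x 3 = 45 combinations and 225 fits with 5 folds, which is why a grid search over many parameters can take a long time.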

#### SINGULAR VALUE DECOMPOSITION (SVD).

In [0]:
# svd

param_dict= {'n_epochs': [5, 10,20],
             'n_factors':[50,64,80,96,128],
             'lr_all':[0.001,0.002,0.005,0.01],
             'reg_all':[0.1,0.2],
             'random_state':[42],
             'verbose':[False]}
clf_svd= GridSearchCV(SVD,param_grid=param_dict, measures=['rmse', 'mae'], cv=5)

In [46]:
clf_svd.fit(data)
print(clf_svd.best_score)
print(clf_svd.best_params)

{'rmse': 0.9200231819795561, 'mae': 0.7281027454569351}
{'rmse': {'n_epochs': 20, 'n_factors': 128, 'lr_all': 0.01, 'reg_all': 0.1, 'random_state': 42, 'verbose': False}, 'mae': {'n_epochs': 20, 'n_factors': 128, 'lr_all': 0.01, 'reg_all': 0.1, 'random_state': 42, 'verbose': False}}


#### Similarly, note that the RMSE has decreased from roughly 0.94 with the default parameters to 0.92 (mean over 5 cross-validation folds) after tuning the SVD model.

### In the same way, tuning the other models and other parameters may further reduce the RMSE.

# THE END.

# Please star/upvote if you liked it.