## Using surprise 
- Read data
- Build basic user or item based models
- Grid Search 
- Top N users or items

In [None]:
conda install -c conda-forge scikit-surprise

In [1]:
import os
import pandas as pd
import surprise

# data_dir="E:\\Work\\Machine Learning Course\\Python\\Module 7 Reccomendation Engines\\Data"
# os.chdir(data_dir)

Now first I will demonstrate how we can read data on our system inside Python such that surprise is able to use the dataset to build various kinds of recommendation engines. Now broadly there are two ways to read dataframes inside surprise. 

One is to first create a dataframe using pandas, clean it, manipulate it and then convert the dataframe into an object that surprise would understand. 

The second is to directly read data from a text file into surprise. For that to happen we’ll have to make sure that our text file is in a specific format.

First I’ll demonstrate how we can read data from a dataframe. Below I am reading a CSV file as a dataframe. 

In [2]:
## To read a file using surprise, one needs to make sure that data is in a specific format, there are two common ways to read a dataset, to be used in the library
# Reading from dataframe
# Reading from a text file

# Reading data from a dataframe
df=pd.read_csv("sample_data.csv")
df.head()

Unnamed: 0,user,rating,item
0,1,2,1
1,2,2,1
2,3,3,2
3,4,3,2
4,5,1,1


And this file contains userIDs, their ratings for different item IDs. 

Now to convert this dataframe into an object that surprise would understand, I will need to create a reader object which I am doing below using the reader method within the surprise module. 

I first need to specify line format. This is nothing but specifying the sequence in which the user ID, the rating and the item IDs occur. 

Now in my raw dataframe you can see that the first column is of user IDs, the second column is of ratings and the third column is of item IDs.

And also I know that my data is on a scale of 1 to 5, the ratings have been done on a scale of 1 to 5.

In [3]:
# Surprise expects data to have three columns, user, rating and item. The spelling of these columns names should be as given. If your dataframe has other columns or column names are differen, remember to change them before trying to reading them in surprise

# We will need to create a reader object before we can load our dataframe into surprise 
reader=surprise.dataset.Reader(line_format='user rating item',rating_scale=(1,5))

Once I create the reader object, then I will use the load from dataframe method within the surprise library to convert this dataframe into an object that surprise will be able to build different recommendation engines.

In [4]:
data=surprise.dataset.Dataset.load_from_df(df,reader=reader)

Within this object I have an attribute called raw_ratings. This object is just a list of tuples which contain this particular data now in this format.

In [5]:
data.raw_ratings

[(1, 2, 1.0, None),
 (2, 2, 1.0, None),
 (3, 3, 2.0, None),
 (4, 3, 2.0, None),
 (5, 1, 1.0, None)]

We can also read data directly from a text file. Now for this to work we need to make sure that our text file is in a specific format. 

Take a look at a raw text file (open and show). 

Now you can see that the first row is nothing but a row of column labels. Then I have user IDs, I have their ratings on a scale of 1 to 5, I have their item IDs which is what I am specifying in this reader object. 

I am specifying the line format. Now since this is a csv file, the separator is a comma, the rating scale is on 1 to 5. I am skipping the first line because the first line is just the name of the columns, it doesn’t really contain my data.

In [6]:
# We can load the dataset from a text file as well, directly, just make sure the text file 
# has three columns named as user, rating and item
reader=surprise.dataset.Reader(line_format='user rating item',sep=",", 
                               rating_scale=(1,5),skip_lines=1)

Once the reader object is created, I will use the load from file method to load data from the csv file based on the metadata stored in this reader. 

In [7]:
data1=surprise.dataset.Dataset.load_from_file("sample_data.csv",reader=reader)

And I can again take a look at raw ratings which is nothing but my original data, now is in this form which is just the list of various tuples.

In [8]:
data1.raw_ratings

[('1', '1', 2.0, None),
 ('2', '1', 2.0, None),
 ('3', '2', 3.0, None),
 ('4', '2', 3.0, None),
 ('5', '1', 1.0, None)]

Now let’s work on a slightly bigger dataset. For that I am changing my current working directory to this directory on my system. 

In [9]:
## Let's now, work with a slightly larger dataset and train memory based collaborative filtering models
data_dir=r"C:\Users\VK\Documents\Poojastuff\dono\Manipal-Deloitte\MachineLearning\RecommendationSystems\ml-latest-small"
os.chdir(data_dir)

And within this directory I have a file called ratings.csv, so I am reading that file first as a pandas dataframe and then I will printout the first few observations. 

In [11]:
mr=pd.read_csv("ratings.csv")
print(mr.head())
mr.drop('timestamp',axis=1,inplace=True)
mr.rename(columns={'userId':'user','movieId':'item','rating':'rating'},inplace=True)

   userId  movieId  rating   timestamp
0       1       31     2.5  1260759144
1       1     1029     3.0  1260759179
2       1     1061     3.0  1260759182
3       1     1129     2.0  1260759185
4       1     1172     4.0  1260759205


Now you can see that the first column is the column of user IDs, I have the movie ID, I have rating and I have a timestamp. Since timestamp will not be used by my recommendation engine, I am dropping this. And also I am renaming my column because surprise expects my column to have these names only.

In [13]:
mr.head()

Unnamed: 0,user,item,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0


Now I will define a reader object. I will tell that the first column is the column of user IDs, the second column is the column of item IDs and ratings and the ratings are on the scale of 1 to 5. 

In [14]:
# user, item, rating on scale of 1 to 5
reader=surprise.dataset.Reader(line_format='user item rating', rating_scale=(1,5))

Here I am reading the full dataset from the dataframe mr that I had read in here based on the reader object that I have created here. 

In [15]:
mr_train=surprise.dataset.Dataset.load_from_df(mr,reader=reader)
mr_trainset=mr_train.build_full_trainset()

Now once this line of code is executed I will create an object called mr_train. This will have all the data that I’ll need to build a recommendation engine. 

After I create this object, I will have to create a training set object using this method called build_full trainset. Now we are doing all of these steps because surprise expects all of these steps to be done for it to build recommendation engines.

Now let’s use this training set that we have created to build some collaborative filtering models both user based and item based. 

Now within surprise you have a prediction algorithm module, within the prediction algorithm module, you have various prediction algorithms and some of the algorithms are based on the neighborhood approach. 

So I am importing the module which contains libraries that are based on neighborhood approach and within this we have a KNNBasic which implements the very basic collaborative filtering model. 

In [16]:
## Create a neighbourhood based user and item based collaborative filtering model
import surprise.prediction_algorithms.knns as knns
knnbasic=knns.KNNBasic(k=40,min_k=1,sims_options={'name':'cosine','user_based':True})

Here the k stands for the number of neighbors that I would consider for the closeness of a user or an item. 

Then I have this parameter called similarity options and herein I am saying that I will be using cosine similarity and this will be a user based collaborative filtering model. 

So I’ll instantiate an object of KNNBasic class and then I will use the fit method to work on my train dataset that I created to build a user based collaborative filtering model based on the cosine similarity.

In [17]:
knnbasic.fit(mr_trainset)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x1e22d7c2ac8>

Now once it is done creating the model, let me go back to my original dataframe and take a look at some of the data points in that. 

In [18]:
mr.head()

Unnamed: 0,user,item,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0


Now once this model is trained I can use it to predict the rating for a certain user or for a certain item ID. 

Now I will be making a rating prediction for user ID 1, for item ID 31 and I also know that the actual rating here is 2.5 which is what I have written in the predict function. Let’s run this. 

So the predicted rating comes out to be 2.99. And if you think about it, it’s not very far away from what the actual rating was. 

In [19]:
knnbasic.predict(uid=1,iid=31,r_ui=2.5)

Prediction(uid=1, iid=31, r_ui=2.5, est=2.986320319817485, details={'actual_k': 40, 'was_impossible': False})

Lets’ build an item based collaborative filtering model. 

Now I’ll still work with the KNNBasic method. I will be using 40 neighbors to predict the rating. The only thing that will change is to the user based parameter I need to provide a value of FALSE for me to build an item based collaborative filtering model. 

In [20]:
## Lets build an item based collaborative filter
knnbasic=knns.KNNBasic(k=40,min_k=1,sims_options={'name':'cosine','user_based':False})

In [21]:
knnbasic.fit(mr_trainset)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x1e22d7c2c88>

Let me run this and let me train this and now let me do a prediction for user ID 1, for item 31 based on item based collaborative filtering. Now my estimated rating is 2.99 very similar to what I computed earlier.

In [22]:
knnbasic.predict(uid=1,iid=31)

Prediction(uid=1, iid=31, r_ui=None, est=2.986320319817485, details={'actual_k': 40, 'was_impossible': False})

Now we can also build collaborative filtering models taking into account the average ratings of users and items. 

For that I will be using the KNNWithMeans method. Here I am building an item based model because I have given a value of False to the user based parameter. Let me fit this. 

In [23]:
## Collaborative filter with average effects
knnbasic=knns.KNNWithMeans(k=40,min_k=1,sims_options={'name':'pearson','user_based':False})
knnbasic.fit(mr_trainset)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x1e22d841470>

Let’s make a prediction for user ID 1 and item ID 31. Now you can see that the predicted rating is 2.1. The actual rating was 2.5

In [24]:
knnbasic.predict(uid=1,iid=31)

Prediction(uid=1, iid=31, r_ui=None, est=2.1033560498894315, details={'actual_k': 40, 'was_impossible': False})

Now another thing that I can do which is made possible by this framework that I am using, I can split my total data into three folds and I can get an estimate of model performance. 

So what will happen is I will train my data on two folds and test on one fold. 

For that I will instantiate an object of the KFold class within the model_selection module of surprise. 

Then I will run a for loop for every combination of trainset and testset to evaluate the predictions of my KNNBasic algorithm on three folds. Let me run this.

In [25]:
from surprise.model_selection import KFold
from surprise import accuracy

In [26]:
## Instead of  using just one train set, we can split the data into parts
# and then evaluate the model performance out of sample
kf = KFold(n_splits=3)


knnbasic=knns.KNNBasic(k=40,sims_options={'name':'cosine','user_based':False})

for trainset, testset in kf.split(mr_train):
    knnbasic.fit(trainset)
    predictions = knnbasic.test(testset)
    
    accuracy.rmse(predictions, verbose=True)
    accuracy.mae(predictions, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9814
MAE:  0.7548
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9780
MAE:  0.7525
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9784
MAE:  0.7521


Now as you see above this trains model on two folds and tests on one fold, then trains on the other two fold and tests on one fold, trains on the other two fold and tests on the one fold. 

So this gives us an idea of the out of sample performance of this model which here is being measured by Root Mean Squared Error and Mean Absolute Error. 


Now I can repeat the same process with the model which takes into account the average effects. I can evaluate this model on three folds out of sample.

In [27]:
## Build a collaborative filter model with average effects
# surprise.evaluate(knns.KNNWithMeans(k=40,sims_options={'name':'cosine','user_based':False}),mr_train)

kf = KFold(n_splits=3)


knnwithmeans=knns.KNNWithMeans(k=40,sims_options={'name':'cosine','user_based':False})

for trainset, testset in kf.split(mr_train):
    knnwithmeans.fit(trainset)
    predictions = knnwithmeans.test(testset)
    
    accuracy.rmse(predictions, verbose=True)
    accuracy.mae(predictions, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9215
MAE:  0.7059
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9257
MAE:  0.7074
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9324
MAE:  0.7143


So now if I compare the accuracy of a model which takes into account means and the model which doesn’t, it seems like the model which takes into account the mean effects is slightly more accurate compared to the model which doesn’t.



Now let’s see how we can do grid search. 

For this to work we’ll first define a grid. Now for a simple demo, I am defining a grid that takes into account the number of neighbors. I will be iterating over 10 neighbors or 20 neighbors. 

And within the similarity options I will be searching between the cosine similarity and the msd and I will only be building an item based collaborative filtering model. So this is my parameter grid. 

In [29]:
## Doing Grid Search
param_grid = {'k': [10, 20],
              'sim_options': {'name': ['msd', 'cosine'],
                              'user_based': [False]}
              }
## MSD: Mean Squared Difference similarity between all pairs of users

I will have to define an estimator first which is what I am doing here and then the surprise module has a GridSearch method to which I will supply the algorithm and the parameter grid and the accuracy measures that I want to do this grid search for me. 

In [30]:
algo=knns.KNNWithMeans

In [31]:
from surprise.model_selection import GridSearchCV

In [32]:
# grid_search = surprise.GridSearchCV(algo,param_grid=param_grid, measures=['RMSE', 'MAE'])

grid_search = GridSearchCV(algo,param_grid=param_grid, measures=['RMSE', 'MAE'])

In [33]:
grid_search.fit(mr_train)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix.

Now doing this grid search will take some time as we have many permutations and combinations that this algorithm will go through. 

Now once this grid search procedure is run, we can check what were the best parameters based on the RMSE and MAE as a metric or what were the best parameters based on the RMSE and MAE as a metric. 

In [34]:
print(grid_search.best_params['rmse'])
print(grid_search.best_params['mae'])

{'k': 20, 'sim_options': {'name': 'msd', 'user_based': False}}
{'k': 20, 'sim_options': {'name': 'msd', 'user_based': False}}


So it turns out that for both, the parameters were similar, it turns out that after we do grid search, a model with 20 neighbors which is a item based model and uses msd as similarity metric is the best model.

Also look at best scores for RMSE and MAE

In [35]:
print(grid_search.best_score['rmse'])
print(grid_search.best_score['mae'])

0.9243719812746256
0.7089037329981174


Now let’s create this model once again. And using this model, we will see how we can extract top 5 recommendations for a given item. 

So I will be again building this model based on my grid search results. 

In [36]:
## Top 5 recommendations for an item
model=knns.KNNWithMeans(k=20,sim_options={'name': 'msd', 'user_based': False})
model.fit(mr_trainset)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x1e22d841208>

Now let’s take a look at our raw data. 

In [37]:
mr.head()

Unnamed: 0,user,item,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0


Now to obtain the top 5 recommendations for a given item we will need to do a bit of work. 

Now what happens is surprise assigns a unique integer ID to each item that it stores as an item set. Now that unique item ID can be recovered using this method called to_inner id. Let’s say I want to understand what is the integer ID that surprise has given to the item whose actual ID is 1061. And it turns out that it has given it an integer ID of 2.

In [38]:
mr_trainset.to_inner_iid(1061)

2

Now we can use this method called model.getneighbors and what we need to do is we will need to provide the inner IDs or the IDs given by the surprise module to a given item ID to find the most similar IDs corresponding to this particular item. 

So let’s say I want to look at the 5 most similar items. Now this is returning the internal ID of the most 5 similar items to the item whose internal ID is 2. 

In [39]:
model.get_neighbors(mr_trainset.to_inner_iid(1061),5)

[51, 80, 95, 269, 292]

Let’s see what would be the raw IDs which are the IDs in my dataset of these items. For that I will run a loop and I will use this method called to_raw_iid. And it turns out that these are the items that are most similar to item number 1061.

In [40]:
for i in [51, 80, 95, 269, 292]:
   print(mr_trainset.to_raw_iid(i))

314
537
720
2348
2867


I can similarly also create top 5 predictions for a user. For that I will create a user based collaborative filtering model first. 

In [41]:
## Top 5 recommendations for a user
model=knns.KNNWithMeans(k=20,sim_options={'name': 'msd', 'user_based': True})
model.fit(mr_trainset)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x1e22d841400>

In [42]:
mr.head()

Unnamed: 0,user,item,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0


Let’s take a look at the head of the data again. Let’s say for the user 1, I need to find out the most similar users. So first I will get the inner ID of this user. I will use the inner ID of this user to predict the inner IDs of the user similar to user 1. 

Let’s see what do we get. So users whose inner ID is 8, 32, 67, 95 and 98 are most similar to user whose raw ID is 1. 

In [43]:
mr_trainset.to_inner_uid(1)
model.get_neighbors(mr_trainset.to_inner_uid(1),5)

[8, 32, 67, 95, 98]

Now let’s convert these inner IDs into raw IDs using this for loop. So users who have a label of 9, 33, 68, 96 and 99 in our data are more similar to user 1.

In [44]:
for i in [8, 32, 67, 95, 98]:
    print(mr_trainset.to_raw_uid(i))

9
33
68
96
99
