## Install surpise libraries:

In [None]:
!pip install scikit-surprise

Collecting scikit-surprise
[?25l  Downloading https://files.pythonhosted.org/packages/97/37/5d334adaf5ddd65da99fc65f6507e0e4599d092ba048f4302fe8775619e8/scikit-surprise-1.1.1.tar.gz (11.8MB)
[K     |████████████████████████████████| 11.8MB 342kB/s 
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... [?25l[?25hdone
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp36-cp36m-linux_x86_64.whl size=1618272 sha256=217db30157466009721ae15db4c6fc2671a1a349340a4e44a803f3505ad30a23
  Stored in directory: /root/.cache/pip/wheels/78/9c/3d/41b419c9d2aff5b6e2b4c0fc8d25c538202834058f9ed110d0
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1


## Load movies dataset from built-in suprise libraries:

And do some exploration of the data


In [13]:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate

data = Dataset.load_builtin('ml-100k')

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


Load as a pandas dataframe for convenience to manipulate data:

In [14]:
import pandas as pd
dataframe = pd.DataFrame(data.__dict__['raw_ratings'], columns=['user_id', 'item_id', 'rating', 'timestamp'])
dataframe.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,196,242,3.0,881250949
1,186,302,3.0,891717742
2,22,377,1.0,878887116
3,244,51,2.0,880606923
4,166,346,1.0,886397596


Dimension of the dataframe

In [15]:
dataframe.shape

(100000, 4)

In [16]:
print("Sample row in the dataset:")
print(dataframe.loc[0])
print("\n")
print("Sample column (item_id) in the dataset (first 10 values):")
print(dataframe["item_id"].head(10))
print("\n")

Sample row in the dataset:
user_id            196
item_id            242
rating               3
timestamp    881250949
Name: 0, dtype: object


Sample column (item_id) in the dataset (first 10 values):
0    242
1    302
2    377
3     51
4    346
5    474
6    265
7    465
8    451
9     86
Name: item_id, dtype: object




Check how many users and and items are there in the dataset:

In [17]:
print("Number of users in the database = " + str(dataframe["user_id"].nunique()))
print("Number of items in the database = " + str(dataframe["item_id"].nunique()))

Number of users in the database = 943
Number of items in the database = 1682


In [18]:
raw_data = data.raw_ratings
print("Each entry has the columns: user_id, item_id,  rating and timestamp")
print(raw_data[0])
print(raw_data[1])
print(raw_data[2])
print("\n")

Each entry has the columns: user_id, item_id,  rating and timestamp
('196', '242', 3.0, '881250949')
('186', '302', 3.0, '891717742')
('22', '377', 1.0, '878887116')




## Fit different recommender systems:

First, create a trainable set using the build_full_trainset() method from surpise libraries:

https://surprise.readthedocs.io/en/stable/dataset.html#surprise.dataset.DatasetAutoFolds.build_full_trainset



In [21]:
trainset = data.build_full_trainset()

Now fit different recommender algorithms:



In [43]:
from surprise import SVD

rec_svd = SVD()
rec_svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f030118a3c8>

Predict the ranking of a certain item item_id from a certain user user_id

In [44]:
user_id = str(196)  # raw user id (as in the ratings file). They are **strings**!
item_id = str(302)  # raw item id (as in the ratings file). They are **strings**!
# the true ranking is useful to see if the prediction given by the system is accurate:
true_ranking = 4

pred_svd = rec_svd.predict(user_id, item_id, r_ui=true_ranking, verbose=True)


user: 196        item: 302        r_ui = 4.00   est = 4.30   {'was_impossible': False}


Evaluate performance using training set:

In [45]:
from surprise import accuracy

testset = trainset.build_testset()
train_pred = rec_svd.test(testset)
accuracy.rmse(train_pred)
accuracy.mae(train_pred)
accuracy.mse(train_pred)

RMSE: 0.7155
MAE:  0.5727
MSE: 0.5120


0.5119744670979155

If we evaluate the performance with the training set we get a very low prediction error since the observations user to construct the model are also used to test it. This is not the normal procedure in machine learning. Instead, 
we split the dataset in training and test subsets. Let's see what happens when applying different propostions for the training and test subsets. 


Now split training and test:

In [46]:

from surprise.model_selection import train_test_split

print('90% training 10% test:')
trainset, testset = train_test_split(data, test_size=.1)
rec_svd.fit(trainset)
train_pred = rec_svd.test(testset)
accuracy.rmse(train_pred)
accuracy.mae(train_pred)
accuracy.mse(train_pred)

print('75% training 25% test:')
trainset, testset = train_test_split(data, test_size=.25)
rec_svd.fit(trainset)
train_pred = rec_svd.test(testset)
accuracy.rmse(train_pred)
accuracy.mae(train_pred)
accuracy.mse(train_pred)

print('10% training 90% test:')
trainset, testset = train_test_split(data, test_size=.9)
rec_svd.fit(trainset)
train_pred = rec_svd.test(testset)
accuracy.rmse(train_pred)
accuracy.mae(train_pred)
accuracy.mse(train_pred)


90% training 10% test:
RMSE: 0.9381
MAE:  0.7387
MSE: 0.8800
75% training 25% test:
RMSE: 0.9287
MAE:  0.7329
MSE: 0.8625
10% training 90% test:
RMSE: 1.0048
MAE:  0.8028
MSE: 1.0096


1.009561760635906

It It is clear that the larger the training subset, the better the performance (i.e lower RMSE, MAE and MSE) when evaluated in the test subset. 

# Cross-validation:

Let's evaluate the performance using a 5-fold cross-validation (cv=5) procedure:


In [47]:
cross_validate(rec_svd, data, measures=['RMSE', 'MAE','MSE'], cv=5, verbose=True)

Evaluating RMSE, MAE, MSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9269  0.9386  0.9420  0.9360  0.9395  0.9366  0.0052  
MAE (testset)     0.7291  0.7416  0.7416  0.7390  0.7405  0.7384  0.0047  
MSE (testset)     0.8591  0.8811  0.8873  0.8760  0.8826  0.8772  0.0098  
Fit time          5.12    5.09    5.11    5.08    5.13    5.11    0.02    
Test time         0.14    0.52    0.14    0.14    0.15    0.22    0.15    


{'fit_time': (5.1180994510650635,
  5.093207359313965,
  5.114635705947876,
  5.07509708404541,
  5.129374742507935),
 'test_mae': array([0.729072  , 0.74157167, 0.74162703, 0.73900787, 0.74050806]),
 'test_mse': array([0.85910454, 0.88106306, 0.88732791, 0.87601938, 0.88260475]),
 'test_rmse': array([0.92687892, 0.93864959, 0.94198084, 0.93595907, 0.93947046]),
 'test_time': (0.14036345481872559,
  0.5181748867034912,
  0.13877081871032715,
  0.14102721214294434,
  0.14541029930114746)}

The average results of the cv procedure indicate that the performance is closer to the one obtained with a train-test split of the 90% - 10% 

# Explore different parameters of the SVD recommender algorithm:

https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD


Compare biased SVD versus unbiased SVD: 

In [49]:
rec_svd_unbiased = SVD(biased = 'false')
cross_validate(rec_svd_unbiased, data, measures=['RMSE', 'MAE','MSE'], cv=5, verbose=True)

Evaluating RMSE, MAE, MSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9324  0.9426  0.9289  0.9305  0.9401  0.9349  0.0054  
MAE (testset)     0.7347  0.7425  0.7324  0.7320  0.7416  0.7366  0.0045  
MSE (testset)     0.8694  0.8885  0.8629  0.8658  0.8837  0.8741  0.0101  
Fit time          5.18    5.10    5.16    5.06    5.11    5.12    0.04    
Test time         0.16    0.15    0.15    0.15    0.16    0.15    0.00    


{'fit_time': (5.178188800811768,
  5.0956408977508545,
  5.15776801109314,
  5.0560736656188965,
  5.111429214477539),
 'test_mae': array([0.73468855, 0.74249372, 0.73241233, 0.73197725, 0.74156734]),
 'test_mse': array([0.86936501, 0.88846342, 0.86290565, 0.86583297, 0.88370405]),
 'test_rmse': array([0.93239745, 0.94258337, 0.92892715, 0.93050146, 0.94005535]),
 'test_time': (0.15842747688293457,
  0.1498558521270752,
  0.15247678756713867,
  0.151885986328125,
  0.15920543670654297)}

The unbiased version is slightly better than the default biased one. 

# Improve performance using grid search CV:


https://surprise.readthedocs.io/en/stable/getting_started.html#tune-algorithm-parameters-with-gridsearchcv

In [50]:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV

param_grid1 = {'biased': ['true', 'false']}
gs = GridSearchCV(SVD, param_grid1, measures=['rmse', 'mae','mse'], cv=5)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])
print(gs.best_score['mae'])
print(gs.best_score['mse'])


# combination of parameters that gave the best RMSE score
print('Best SVD parameters:')
print(gs.best_params['rmse'])

0.9360084446620462
0.7377343697072805
0.8761250438932393
{'biased': 'false'}
