## Install surpise libraries:

In [3]:
!pip install scikit-surprise

Collecting scikit-surprise
[?25l  Downloading https://files.pythonhosted.org/packages/97/37/5d334adaf5ddd65da99fc65f6507e0e4599d092ba048f4302fe8775619e8/scikit-surprise-1.1.1.tar.gz (11.8MB)
[K     |████████████████████████████████| 11.8MB 6.4MB/s 
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... [?25l[?25hdone
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp36-cp36m-linux_x86_64.whl size=1618268 sha256=e4118f073ab8531229c259c4887f94d55f389e88f1589471514d825415c42edf
  Stored in directory: /root/.cache/pip/wheels/78/9c/3d/41b419c9d2aff5b6e2b4c0fc8d25c538202834058f9ed110d0
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1


## Load jester database

And some exploration of the data

https://www.kaggle.com/raidevesh05/movie-ratings-dataset?select=movie_ratings.csv

In [13]:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate

data = Dataset.load_builtin('jester')

Dataset jester could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from http://eigentaste.berkeley.edu/dataset/archive/jester_dataset_2.zip...
Done! Dataset jester has been saved to /root/.surprise_data/jester


Access the ratings from the raw data:

In [58]:
raw_data = data.raw_ratings
print("Each entry has the columns: user_id, item_id,  rating and timestamp")
print(raw_data[0])
print(raw_data[1])
print(raw_data[2])
print("\n")

Each entry has the columns: user_id, item_id,  rating and timestamp
('1', '5', 0.219, None)
('1', '7', -9.281, None)
('1', '8', -9.281, None)




Load as a pandas dataframe for convenience to manipulate data:

In [104]:
import pandas as pd
df = pd.DataFrame(data.__dict__['raw_ratings'], columns=['user_id', 'item_id', 'rating', 'timestamp'])
df.head()

ValueError: ignored

Dimension of the dataframe

In [17]:
dataframe.shape

(1761439, 4)

Check the average rating score

In [76]:
print("Mean value of the rating column: " + str(dataframe["rating"].mean())) 

Mean value of the rating column: 1.6186024017864769


Remove items with less than 20 ratings and users that have rated less than 10 times:


In [53]:
min_item_ratings = 200
filter_items = df['item_id'].value_counts() > min_item_ratings
filter_items = filter_items[filter_items].index.tolist()

min_user_ratings = 130
filter_users = df['user_id'].value_counts() > min_user_ratings
filter_users = filter_users[filter_users].index.tolist()

df_small = df[(df['item_id'].isin(filter_items)) & (df['user_id'].isin(filter_users))]
print('The original data frame shape:\t{}'.format(df.shape))
print('The new data frame shape:\t{}'.format(df_small.shape))

The original data frame shape:	(1761439, 4)
The new data frame shape:	(1325, 4)


In [54]:
df_small.head()

Unnamed: 0,user_id,item_id,rating,timestamp
1568,46,5,6.469,
1569,46,7,-6.781,
1570,46,8,8.625,
1571,46,13,-8.062,
1572,46,15,-4.156,


In [72]:
print("Sample row in the dataset:")
print(df_small.loc[1568])
print("\n")
print("Sample column (item_id) in the dataset (first 10 values):")
print(df_small["item_id"].head(10))
print("\n")

Sample row in the dataset:
user_id         46
item_id          5
rating       6.469
timestamp     None
Name: 1568, dtype: object


Sample column (item_id) in the dataset (first 10 values):
1568     5
1569     7
1570     8
1571    13
1572    15
1573    16
1574    17
1575    18
1576    19
1577    20
Name: item_id, dtype: object




Check how many users and and items are there in the dataset:

In [77]:
print("Number of users in the reduced database = " + str(df_small["user_id"].nunique()))
print("Number of items in the reduced database = " + str(df_small["item_id"].nunique()))

Number of users in the reduced database = 10
Number of items in the reduced database = 133


## Load data again from the reduced dataframe:

In [79]:
from surprise import Reader 
reader = Reader(rating_scale=(-10, 10))
data_small = Dataset.load_from_df(df_small[['user_id', 'item_id', 'rating']], reader)

## Fit different recommender systems:

First, create a trainable set using the build_full_trainset() method from surpise libraries:

https://surprise.readthedocs.io/en/stable/dataset.html#surprise.dataset.DatasetAutoFolds.build_full_trainset



In [80]:
trainset = data_small.build_full_trainset()

Now fit a SVD model-based recommender algorithm:



In [81]:
from surprise import SVD

rec_svd = SVD()
rec_svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f41c2712828>

Predict the ranking of a certain item item_id from a certain user user_id

In [101]:
raw_data = data_small.raw_ratings
print(raw_data[10])

('46', '44', 6.75, None)


In [103]:
user_id = str(46)  # raw user id (as in the ratings file). They are **strings**!
item_id = str(44)  # raw item id (as in the ratings file). They are **strings**!
true_ranking = 6.75

pred_svd = rec_svd.predict(user_id, item_id,r_ui=true_ranking,verbose=True)


user: 46         item: 44         r_ui = 6.75   est = 6.63   {'was_impossible': False}


In [None]:
df_small.loc

Evaluate performance using training set:

In [91]:
from surprise import accuracy

testset = trainset.build_testset()
train_pred = rec_svd.test(testset)
accuracy.rmse(train_pred)
accuracy.mae(train_pred)
accuracy.mse(train_pred)

RMSE: 0.2835
MAE:  0.2056
MSE: 0.0804


0.08038869522189881

If we evaluate the performance with the training set we get a very low prediction error since the observations user to construct the model are also used to test it. This is not the normal procedure in machine learning. Instead, 
we split the dataset in training and test subsets. Let's see what happens when applying different propostions for the training and test subsets. 


Now split training and test:

In [93]:
from surprise.model_selection import train_test_split

print('90% training 10% test:')
trainset, testset = train_test_split(data_small, test_size=.1)
rec_svd.fit(trainset)
train_pred = rec_svd.test(testset)
accuracy.rmse(train_pred)
accuracy.mae(train_pred)
accuracy.mse(train_pred)

print('75% training 25% test:')
trainset, testset = train_test_split(data_small, test_size=.25)
rec_svd.fit(trainset)
train_pred = rec_svd.test(testset)
accuracy.rmse(train_pred)
accuracy.mae(train_pred)
accuracy.mse(train_pred)

print('10% training 90% test:')
trainset, testset = train_test_split(data_small, test_size=.9)
rec_svd.fit(trainset)
train_pred = rec_svd.test(testset)
accuracy.rmse(train_pred)
accuracy.mae(train_pred)
accuracy.mse(train_pred)


90% training 10% test:
RMSE: 4.2181
MAE:  3.2538
MSE: 17.7921
75% training 25% test:
RMSE: 4.3183
MAE:  3.4456
MSE: 18.6480
10% training 90% test:
RMSE: 4.7076
MAE:  3.8466
MSE: 22.1619


22.16187268708119

It It is clear that the larger the training subset, the better the performance (i.e lower RMSE, MAE and MSE) when evaluated in the test subset. 

# Cross-validation:

Let's evaluate the performance using a 5-fold cross-validation (cv=5) procedure:


In [94]:
cross_validate(rec_svd, data_small, measures=['RMSE', 'MAE','MSE'], cv=5, verbose=True)

Evaluating RMSE, MAE, MSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    4.4457  4.7694  4.1768  4.5401  4.5205  4.4905  0.1906  
MAE (testset)     3.5332  3.7774  3.2808  3.6047  3.4237  3.5239  0.1674  
MSE (testset)     19.7641 22.7476 17.4458 20.6127 20.4349 20.2010 1.7024  
Fit time          0.07    0.06    0.06    0.06    0.07    0.06    0.01    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    


{'fit_time': (0.07156753540039062,
  0.05886530876159668,
  0.05785179138183594,
  0.05864143371582031,
  0.06502032279968262),
 'test_mae': array([3.53321299, 3.77741846, 3.28076105, 3.60465084, 3.42369504]),
 'test_mse': array([19.76405981, 22.74756866, 17.44576225, 20.612653  , 20.43488166]),
 'test_rmse': array([4.44567878, 4.76944113, 4.17681245, 4.54011597, 4.52049573]),
 'test_time': (0.002393960952758789,
  0.0017268657684326172,
  0.0017704963684082031,
  0.0017855167388916016,
  0.0018718242645263672)}

The average results of the cv procedure indicate that the performance is closer to the one obtained with a train-test split of the 90% - 10% 

# Explore different parameters of the SVD recommender algorithm:

https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD


Compare biased SVD versus unbiased SVD: 

In [95]:
rec_svd_unbiased = SVD(biased = 'false')
cross_validate(rec_svd_unbiased, data_small, measures=['RMSE', 'MAE','MSE'], cv=5, verbose=True)

Evaluating RMSE, MAE, MSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    4.3226  4.2460  4.2779  4.8780  4.3656  4.4180  0.2335  
MAE (testset)     3.3504  3.4220  3.4320  3.7910  3.3948  3.4781  0.1590  
MSE (testset)     18.6849 18.0286 18.3003 23.7951 19.0581 19.5734 2.1394  
Fit time          0.07    0.06    0.06    0.06    0.06    0.06    0.01    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    


{'fit_time': (0.07158851623535156,
  0.060149431228637695,
  0.0580744743347168,
  0.06408882141113281,
  0.05853152275085449),
 'test_mae': array([3.35042352, 3.42202714, 3.4320174 , 3.79097535, 3.39483064]),
 'test_mse': array([18.68489237, 18.02858057, 18.30030867, 23.7950596 , 19.05806687]),
 'test_rmse': array([4.3226025 , 4.2460076 , 4.27788601, 4.878018  , 4.36555459]),
 'test_time': (0.0029599666595458984,
  0.001783132553100586,
  0.0017902851104736328,
  0.00185394287109375,
  0.0017931461334228516)}

The unbiased version is slightly better than the default biased one. 

# Improve performance using grid search CV:


https://surprise.readthedocs.io/en/stable/getting_started.html#tune-algorithm-parameters-with-gridsearchcv

In [96]:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV

param_grid = {'biased': ['true', 'false']}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae','mse'], cv=5)

gs.fit(data_small)

# best RMSE score
print(gs.best_score['rmse'])
print(gs.best_score['mae'])
print(gs.best_score['mse'])


# combination of parameters that gave the best RMSE score
print('Best SVD parameters:')
print(gs.best_params['rmse'])

4.433704681618332
3.483266044438421
19.684983126754226
Best SVD parameters:
{'biased': 'true'}
