# Surprise library for Collaborative Filtering - Recommender System


The name Surprise stands for Simple Python Recommendation System Engine.
Surprise is an easy-to-use Python library for building and testing recommender systems.

Users can use both built-in datasets (Movielens, Jester), and their own custom datasets.


* Provides various ready-to-use prediction algorithms, such as baseline algorithms, neighborhood methods, matrix factorization-based methods (SVD, PMF, SVD++, NMF), and many others.

* Provides various built-in similarity measures (cosine, MSD, Pearson, …).

* Provides tools to evaluate, analyse, and compare the algorithms’ performance. Cross-validation procedures can be run easily using the built-in CV iterators.
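To make the similarity measures listed above more concrete, here is a plain-Python sketch of the mean squared difference (MSD) similarity between two users, following the definition given in the Surprise documentation: the MSD over co-rated items, turned into a similarity as `1 / (msd + 1)`. The helper name `msd_similarity` and the toy ratings are made up for illustration; this is not Surprise's own implementation.

```python
def msd_similarity(ratings_u, ratings_v):
    """MSD-based similarity between two users, given {item: rating} dicts."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        # No co-rated items: treat the two users as not similar at all.
        return 0.0
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    # Identical ratings give msd == 0 and therefore a similarity of 1.
    return 1.0 / (msd + 1.0)


# Hypothetical toy ratings for two users.
alice = {"i1": 5.0, "i2": 3.0, "i3": 4.0}
bob = {"i1": 4.0, "i2": 3.0, "i4": 2.0}

sim_alice_bob = msd_similarity(alice, bob)
print(sim_alice_bob)
```

Only the two co-rated items (`i1` and `i2`) contribute here, so users who share few items get a similarity based on very little evidence.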


**Please note that Surprise does not support implicit ratings or content-based information.**


**What can the Surprise library do when building recommender system models?**

1. Load data

2. Split the dataset into train and test subsets

3. Build models by using different algorithms

4. Cross validate

5. Tune hyperparameters

6. Model evaluation


References: 

http://surpriselib.com/

Surprise documentation: https://surprise.readthedocs.io/en/stable/



## Part 1: Installation

In [None]:
$ pip install numpy
$ pip install scikit-surprise
 
    OR 
    
$ conda install -c conda-forge scikit-surprise

## Part 2: Dataset module

Three built-in datasets are available:

1. The movielens-100k dataset.

2. The movielens-1m dataset.

3. The Jester dataset 2.


In [6]:
from surprise import Dataset

# Load the movielens-100k dataset
data = Dataset.load_builtin('ml-100k')

import pandas as pd

# raw_ratings holds (user_id, item_id, rating, timestamp) tuples with raw (string) ids
df = pd.DataFrame(data.raw_ratings, columns=['user_id','item_id','rating','timestamp'])
df.head(10)


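| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3.0 | 881250949 |
| 1 | 186 | 302 | 3.0 | 891717742 |
| 2 | 22 | 377 | 1.0 | 878887116 |
| 3 | 244 | 51 | 2.0 | 880606923 |
| 4 | 166 | 346 | 1.0 | 886397596 |
| 5 | 298 | 474 | 4.0 | 884182806 |
| 6 | 115 | 265 | 2.0 | 881171488 |
| 7 | 253 | 465 | 5.0 | 891628467 |
| 8 | 305 | 451 | 3.0 | 886324817 |
| 9 | 6 | 86 | 3.0 | 883603013 |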



## Part 3: Prediction_algorithms package

### 1. Basic algorithms:

`NormalPredictor`

The NormalPredictor algorithm predicts a random rating drawn from the distribution of the training set, which is assumed to be normal. It does not learn anything about individual users or items, so it is mainly useful as a reference point for the other algorithms.
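The idea can be sketched in a few lines of plain Python: estimate the mean and standard deviation of the training ratings, draw from the corresponding normal distribution, and clip the result to the rating scale. The function name, the toy ratings, and the 1 to 5 scale are assumptions for illustration; this is not Surprise's own code.

```python
import random
import statistics


def normal_predictor(train_ratings, rating_scale=(1, 5), rng=None):
    """Draw a random rating from N(mean, std) of the training ratings."""
    rng = rng or random.Random()
    mu = statistics.fmean(train_ratings)
    sigma = statistics.pstdev(train_ratings)  # population std (maximum likelihood)
    estimate = rng.gauss(mu, sigma)
    low, high = rating_scale
    # Clip so the prediction stays inside the valid rating range.
    return min(high, max(low, estimate))


# Hypothetical training ratings.
toy_train_ratings = [5.0, 4.0, 3.0, 4.0, 2.0, 5.0, 1.0, 3.0]
rng = random.Random(42)
samples = [normal_predictor(toy_train_ratings, rng=rng) for _ in range(5)]
print(samples)
```

Note that the prediction ignores which user and which item are being asked about, which is why this algorithm ranks last in most benchmarks.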

`BaselineOnly`

BaselineOnly algorithm predicts the baseline estimate for given user and item.
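The baseline estimate has the form `mu + b_u + b_i`: the global mean plus a user bias and an item bias. The sketch below fits the biases with a simplified alternating least squares loop. The function names and toy data are hypothetical, and the regularization values and epoch count are assumptions chosen to mirror commonly cited defaults, so treat it as an illustration of the idea rather than Surprise's exact procedure.

```python
from collections import defaultdict


def fit_baselines(ratings, reg_i=10.0, reg_u=15.0, n_epochs=10):
    """ratings: list of (user, item, rating). Returns (mu, b_u, b_i)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))

    b_u, b_i = defaultdict(float), defaultdict(float)
    for _ in range(n_epochs):
        # Update item biases with user biases held fixed, then the reverse.
        for i, rs in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rs) / (reg_i + len(rs))
        for u, rs in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rs) / (reg_u + len(rs))
    return mu, dict(b_u), dict(b_i)


def baseline_estimate(mu, b_u, b_i, user, item):
    # Unknown users or items contribute a bias of zero.
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)


toy_ratings = [("u1", "i1", 5.0), ("u1", "i2", 4.0),
               ("u2", "i1", 3.0), ("u2", "i2", 2.0)]
mu, b_u, b_i = fit_baselines(toy_ratings)
print(baseline_estimate(mu, b_u, b_i, "u1", "i1"))
print(baseline_estimate(mu, b_u, b_i, "new_user", "new_item"))
```

In this toy data `u1` rates higher than `u2` and `i1` is rated higher than `i2`, so the fitted biases have the corresponding signs, and a completely unknown user/item pair falls back to the global mean.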

### 2. k-NN algorithms

`KNNBasic`

KNNBasic is a basic collaborative filtering algorithm.

`KNNWithMeans`

KNNWithMeans is a basic collaborative filtering algorithm, taking into account the mean ratings of each user.

`KNNWithZScore`

KNNWithZScore is a basic collaborative filtering algorithm, taking into account the z-score normalization of each user.

`KNNBaseline`

KNNBaseline is a basic collaborative filtering algorithm taking into account a baseline rating.
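The difference between the plain and the mean-centered variants is easiest to see in the prediction formula. The sketch below assumes the similarities to the k nearest neighbors who rated the target item are already known; the function names and numbers are made up for illustration.

```python
def knn_basic(neighbors):
    """neighbors: list of (similarity, neighbor_rating) pairs."""
    total_sim = sum(sim for sim, _ in neighbors)
    # Similarity-weighted average of the neighbors' ratings.
    return sum(sim * rating for sim, rating in neighbors) / total_sim


def knn_with_means(user_mean, neighbors):
    """neighbors: list of (similarity, neighbor_rating, neighbor_mean) triples."""
    total_sim = sum(sim for sim, _, _ in neighbors)
    # Average the neighbors' deviations from their own means,
    # then add the result to the target user's mean rating.
    deviation = sum(sim * (rating - mean) for sim, rating, mean in neighbors) / total_sim
    return user_mean + deviation


basic_est = knn_basic([(0.8, 4.0), (0.2, 2.0)])
means_est = knn_with_means(3.0, [(0.8, 4.0, 3.5), (0.2, 2.0, 3.0)])
print(basic_est, means_est)
```

With these numbers the plain estimate is about 3.6, while the mean-centered estimate is about 3.2, because the most similar neighbor tends to rate everything a little higher than the target user does.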

### 3. Matrix Factorization-based algorithms

`SVD`

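The SVD algorithm is the matrix factorization method popularized by Simon Funk during the Netflix Prize. When baselines (user and item biases) are not used, it is equivalent to Probabilistic Matrix Factorization (PMF).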
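As a rough sketch of what the model computes, a rating is predicted as `mu + b_u + b_i + q_i · p_u`, and the parameters are learned by stochastic gradient descent on the regularized squared error. The plain-Python code below shows the prediction and a single SGD update for one observed rating. The function names, starting values, learning rate, and regularization term are illustrative assumptions.

```python
def svd_predict(mu, b_u, b_i, p_u, q_i):
    """Global mean + user bias + item bias + dot product of latent factors."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))


def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """One SGD update on a single rating; returns the updated parameters."""
    err = r_ui - svd_predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor vectors are updated from the old values of the other one.
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i


# Hypothetical starting point with two latent factors.
mu, b_u, b_i = 3.5, 0.0, 0.0
p_u, q_i = [0.1, -0.2], [0.3, 0.1]
r_ui = 5.0

before = svd_predict(mu, b_u, b_i, p_u, q_i)
b_u, b_i, p_u, q_i = sgd_step(r_ui, mu, b_u, b_i, p_u, q_i)
after = svd_predict(mu, b_u, b_i, p_u, q_i)
print(before, after)
```

A single step only nudges the prediction towards the observed rating; training repeats this over all ratings for `n_epochs` passes.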

`SVDpp`

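The SVDpp algorithm is an extension of SVD that takes into account implicit ratings. Here, an implicit rating only means the fact that a user rated an item, regardless of the rating value.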

`NMF`

NMF is a collaborative filtering algorithm based on Non-negative Matrix Factorization. It is very similar to SVD.

### 4. Other Collaborative Filtering Algorithms

`SlopeOne`

SlopeOne is a straightforward implementation of the SlopeOne algorithm.

`CoClustering`

CoClustering is a collaborative filtering algorithm based on co-clustering.

Reference: https://towardsdatascience.com/building-and-testing-recommender-systems-with-surprise-step-by-step-d4ba702ef80b

## Part 4: Hyperparameters Optimization - GridSearchCV / RandomizedSearchCV

1. The `GridSearchCV` class computes accuracy metrics for an algorithm on various combinations of parameters, over a cross-validation procedure. This is useful for finding the best set of parameters for a prediction algorithm.

**Default parameters:**

`surprise.model_selection.search.GridSearchCV(algo_class, param_grid, measures=['rmse', 'mae'], cv=None, refit=False, return_train_measures=False, n_jobs=1, pre_dispatch='2*n_jobs', joblib_verbose=0)`


2. The `RandomizedSearchCV` class computes accuracy metrics for an algorithm on various combinations of parameters, over a cross-validation procedure. As opposed to GridSearchCV, which uses an exhaustive combinatorial approach, RandomizedSearchCV samples randomly from the parameter space. 

**Default parameters:**

`surprise.model_selection.search.RandomizedSearchCV(algo_class, param_distributions, n_iter=10, measures=['rmse', 'mae'], cv=None, refit=False, return_train_measures=False, n_jobs=1, pre_dispatch='2*n_jobs', random_state=None, joblib_verbose=0)`
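The difference between the two search strategies comes down to how candidate parameter sets are generated. The standard-library sketch below illustrates only that step, not the cross-validation that follows it, and it is not Surprise's internal code. The parameter values are made up for illustration.

```python
import itertools
import random

# Hypothetical search space for an SVD-like model.
param_grid = {
    "n_factors": [50, 100, 150],
    "lr_all": [0.002, 0.005],
    "reg_all": [0.02, 0.1],
}

# Grid search: every combination of the listed values (3 * 2 * 2 candidates).
names = list(param_grid)
grid_candidates = [
    dict(zip(names, values))
    for values in itertools.product(*param_grid.values())
]

# Randomized search: only n_iter candidates, drawn without replacement.
n_iter = 5
rng = random.Random(0)
random_candidates = rng.sample(grid_candidates, n_iter)

print(len(grid_candidates), len(random_candidates))
```

Each candidate is then scored by cross-validation, so the randomized variant trades completeness for a fixed, smaller number of model fits.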

**Import library:**

`from surprise.model_selection import GridSearchCV`

`from surprise.model_selection import RandomizedSearchCV`


More details: https://surprise.readthedocs.io/en/stable/model_selection.html



## Part 5: Accuracy module

The `surprise.accuracy` module provides tools for computing accuracy metrics on a set of predictions.

Available accuracy metrics:

`rmse`	Compute RMSE (Root Mean Squared Error).

`mse`	Compute MSE (Mean Squared Error).

`mae`	Compute MAE (Mean Absolute Error).

`fcp`	Compute FCP (Fraction of Concordant Pairs).
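For reference, the three error-based metrics can be written out in a few lines of plain Python. The sketch below works on hypothetical `(true_rating, estimated_rating)` pairs rather than on Surprise `Prediction` objects, and the function names simply mirror the metric names.

```python
import math


def mse(pairs):
    """Mean squared error over (true_rating, estimate) pairs."""
    return sum((r - est) ** 2 for r, est in pairs) / len(pairs)


def rmse(pairs):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(pairs))


def mae(pairs):
    """Mean absolute error."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)


# Hypothetical (true rating, estimated rating) pairs.
pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5), (3.0, 3.0)]
print(mse(pairs), rmse(pairs), mae(pairs))
```

RMSE penalizes large errors more heavily than MAE, which is why the single one-star miss in this toy data pushes RMSE above MAE.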

**Evaluate a model with cross-validation:**

`from surprise.model_selection import cross_validate`

**Split the dataset into train and test subsets:**

`from surprise.model_selection import train_test_split`
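A single train/test split holds out one random subset, while k-fold cross-validation rotates the held-out part so that every rating is tested exactly once. The standard-library sketch below illustrates the k-fold idea on made-up ratings; the function name and data are hypothetical and this is not how Surprise implements its iterators.

```python
import random


def k_fold_splits(ratings, n_splits=3, seed=0):
    """Yield (train, test) lists so that each rating is tested exactly once."""
    indices = list(range(len(ratings)))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits folds.
    folds = [indices[k::n_splits] for k in range(n_splits)]
    for test_idx in folds:
        held_out = set(test_idx)
        train = [ratings[j] for j in indices if j not in held_out]
        test = [ratings[j] for j in test_idx]
        yield train, test


# Twelve hypothetical (user, item, rating) triples.
toy_ratings = [("u%d" % (n % 4), "i%d" % (n % 3), float(1 + n % 5)) for n in range(12)]
splits = list(k_fold_splits(toy_ratings, n_splits=3))
for train, test in splits:
    print(len(train), len(test))
```

Averaging a metric over the folds gives a more stable estimate than a single split, at the cost of fitting the model once per fold.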


In [8]:
from surprise.model_selection import cross_validate
from surprise import NormalPredictor
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import KNNBaseline
from surprise import SVD
from surprise import BaselineOnly
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
from surprise.accuracy import rmse
from surprise import accuracy
from surprise.model_selection import train_test_split

## Part 6: Example

I'll use the movielens-100k dataset as an example. 


Helpful articles:

https://bmanohar16.github.io/blog/recsys-evaluation-in-surprise

https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Building%20Recommender%20System%20with%20Surprise.ipynb



In [11]:
# code adapted from https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Building%20Recommender%20System%20with%20Surprise.ipynb
# Use RMSE as the accuracy metric for the predictions.

benchmark = []
# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    # Perform 3-fold cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)

    # Average the per-fold results and append the algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0, so use pd.concat instead
    tmp = pd.concat([tmp, pd.Series([type(algorithm).__name__], index=['Algorithm'])])
    benchmark.append(tmp)
    

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...


In [12]:
surprise_results = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')
surprise_results


| Algorithm | test_rmse | fit_time | test_time |
|---|---|---|---|
| SVDpp | 0.92697 | 399.482269 | 19.333255 |
| KNNBaseline | 0.93667 | 1.154948 | 16.031015 |
| SVD | 0.946027 | 9.619627 | 1.052136 |
| BaselineOnly | 0.946891 | 1.959221 | 1.091207 |
| SlopeOne | 0.949064 | 1.303426 | 9.972574 |
| KNNWithMeans | 0.955356 | 0.792057 | 12.801589 |
| KNNWithZScore | 0.955367 | 0.835488 | 18.919293 |
| NMF | 0.974776 | 9.044773 | 0.56705 |
| CoClustering | 0.979337 | 4.955861 | 0.884986 |
| KNNBasic | 0.989335 | 0.769511 | 16.129125 |


### Tuning Hyperparameters - GridSearchCV / RandomizedSearchCV

In [22]:
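# code from https://www.kaggle.com/sunyuanxi/surprise
from surprise.model_selection import GridSearchCV
from surprise.model_selection import RandomizedSearchCV

# Candidate values: only n_factors varies; n_epochs and reg_all are fixed
param_grid = {'n_factors': [70, 80, 90, 100, 110, 120, 130, 140, 150, 160], 'n_epochs': [100], 'reg_all': [0.1]}

# Exhaustive alternative:
#gs_svd = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)
# Randomized search: samples n_iter=10 (the default) combinations from param_grid
gs_svd = RandomizedSearchCV(SVD, param_grid, measures=['rmse'], cv=3)

# Run 3-fold cross-validation for every sampled combination
gs_svd.fit(data)
# SVD instance configured with the best parameters by RMSE
svd = gs_svd.best_estimator['rmse']

print(gs_svd.best_score['rmse'])   # best mean RMSE across folds
print(gs_svd.best_params['rmse'])  # parameters that achieved it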


0.9184025288375731
{'n_factors': 100, 'n_epochs': 100, 'reg_all': 0.1}


In [31]:
from surprise.model_selection import train_test_split
from surprise import accuracy

trainset, testset = train_test_split(data, test_size=0.25)

# Use the best parameters found by the search above
algo = SVD(n_factors=100, n_epochs=100, reg_all=0.1)
algo.fit(trainset)

cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)


Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9151  0.9199  0.9206  0.9185  0.0024  
MAE (testset)     0.7239  0.7253  0.7283  0.7258  0.0018  
Fit time          71.21   58.07   54.85   61.38   7.08    
Test time         2.58    3.08    0.82    2.16    0.97    


{'test_rmse': array([0.91513931, 0.91992049, 0.9205576 ]),
 'test_mae': array([0.72390717, 0.72530278, 0.72828191]),
 'fit_time': (71.21115684509277, 58.07243514060974, 54.84719777107239),
 'test_time': (2.5757861137390137, 3.0809407234191895, 0.8213958740234375)}

In [32]:
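# Caution: cross_validate above refit `algo` on folds of the full dataset,
# and those folds overlap with `testset`. The RMSE below is therefore likely
# optimistic compared with the cross-validated RMSE (about 0.92).
# Calling algo.fit(trainset) again before testing avoids this leakage.
predictions = algo.test(testset)

accuracy.rmse(predictions)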


RMSE: 0.7275


0.7275058692357854

In [33]:
predictions

[Prediction(uid='695', iid='313', r_ui=2.0, est=2.8795105597703667, details={'was_impossible': False}),
 Prediction(uid='177', iid='96', r_ui=3.0, est=3.7690651444091765, details={'was_impossible': False}),
 Prediction(uid='301', iid='67', r_ui=2.0, est=3.3914121038054805, details={'was_impossible': False}),
 Prediction(uid='867', iid='257', r_ui=4.0, est=4.11060646137873, details={'was_impossible': False}),
 Prediction(uid='145', iid='176', r_ui=5.0, est=4.193793647675448, details={'was_impossible': False}),
 Prediction(uid='207', iid='203', r_ui=3.0, est=3.219610105686778, details={'was_impossible': False}),
 Prediction(uid='104', iid='268', r_ui=3.0, est=3.1207051655454285, details={'was_impossible': False}),
 Prediction(uid='64', iid='746', r_ui=5.0, est=3.7810094014555453, details={'was_impossible': False}),
 Prediction(uid='782', iid='989', r_ui=3.0, est=2.804569550415154, details={'was_impossible': False}),
 Prediction(uid='184', iid='1167', r_ui=5.0, est=3.3724741310503377, det

In [34]:
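# Note: raw ids in ml-100k are strings (see the predictions above). Integer ids
# are not found in the trainset, so the estimate below appears to be just the
# global mean. Use uid='10', iid='100' to query user 10 and item 100.
algo.predict(uid=10, iid=100)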


Prediction(uid=10, iid=100, r_ui=None, est=3.533397333013335, details={'was_impossible': False})