# Recommender systems performance comparison

In this notebook we compare the performance of several models on the [Restaurant Data with Consumer Ratings](https://www.kaggle.com/uciml/restaurant-data-with-consumer-ratings) dataset. We use [Surprise](http://surpriselib.com/) for its ease of use and good documentation. The ratings file used here (`rating_final`) has only 1161 rows, so fitting time is not a major concern.

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import random

# Note that there are no NaNs in this data; '?' is
# used to mark missing information
rating = pd.read_csv('../input/rating_final.csv')
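
Because missing values are encoded as `'?'`, pandas would otherwise read the affected columns as plain strings. The sketch below shows one way to turn them into NaN at load time with the `na_values` argument of `pd.read_csv`. The inline CSV and its column names are made up for illustration and are not taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical sample: '?' marks a missing value, as in the dataset files.
sample_csv = io.StringIO(
    "userID,smoker,weight\n"
    "U1001,false,69\n"
    "U1002,?,40\n"
    "U1003,true,?\n"
)

# na_values tells pandas to treat '?' as missing instead of as a string.
sample = pd.read_csv(sample_csv, na_values='?')
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Counting the missing entries per column this way is a quick check before deciding whether a column is usable.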

In [None]:
rating.head()

In [None]:
# Surprise expects a DataFrame with user, item and rating columns, in that order
overall_rating = rating[['userID','placeID','rating']]
overall_rating.head()

In [None]:
from surprise import SVD, SVDpp, KNNBasic, KNNWithZScore
from surprise import Reader, Dataset
from surprise.model_selection import LeaveOneOut
from surprise import accuracy

To use a DataFrame in Surprise, we load it into a `surprise.Dataset` with `Dataset.load_from_df(dataframe, reader)`, where `reader` specifies the range of ratings in the dataset.

⚠️ Since we are validating a recommender system, we use a different kind of validation: LeaveOneOut. From the [Surprise documentation on cross-validators](https://surprise.readthedocs.io/en/stable/model_selection.html#surprise.model_selection.split.LeaveOneOut) :
> Cross-validation iterator where each user has exactly one rating in the testset.
> 
> Contrary to other cross-validation strategies, LeaveOneOut does not guarantee that all folds will be different, although this is still very likely for sizeable datasets.

Concretely, we hold out one rating from each user as the test set (users who would be left with fewer than `min_n_ratings` ratings for training are discarded). That way we test the recommender system in a realistic situation: predicting an unseen rating for a user it already knows.
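
To make the splitting rule concrete, here is a simplified, plain-Python sketch of a leave-one-out split. It is an illustration of the idea only, not Surprise's implementation. The function name `leave_one_out_split` and the toy ratings are made up, and the handling of `min_n_ratings` is our reading of the documented behaviour.

```python
import random
from collections import defaultdict

def leave_one_out_split(ratings, min_n_ratings=0, seed=0):
    """Hold out one random rating per user.

    ratings: list of (user, item, rating) tuples.
    Users who would keep fewer than min_n_ratings ratings in the
    train set are dropped entirely (assumed, simplified behaviour).
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    train, test = [], []
    for user_rows in by_user.values():
        if len(user_rows) <= min_n_ratings:
            continue  # too few ratings would remain for training: skip this user
        held_out = rng.randrange(len(user_rows))
        test.append(user_rows[held_out])
        train.extend(r for j, r in enumerate(user_rows) if j != held_out)
    return train, test

# Made-up ratings on the notebook's 0-2 scale.
toy = [
    ('U1', 'A', 2), ('U1', 'B', 1), ('U1', 'C', 0),
    ('U2', 'A', 1), ('U2', 'C', 2),
    ('U3', 'B', 2),  # single rating: dropped when min_n_ratings=1
]
train, test = leave_one_out_split(toy, min_n_ratings=1)
print(len(train), len(test))
```

Each remaining user contributes exactly one rating to the test set and the rest to the train set, so the test set has as many rows as there are eligible users.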

In [None]:
reader = Reader(rating_scale=(0, 2))
ds = Dataset.load_from_df(overall_rating,reader)
loo = LeaveOneOut(n_splits=1,min_n_ratings=1)

In [None]:
LR = [0.0001,0.0005,0.001,0.005,0.01,0.05,0.1,0.5]

## Classical use of Surprise models

Fitting models, computing and plotting metrics, and making predictions.

### Fitting model using different parameters

Steps followed :

1. Initializing the model :
    - Call the model's class with the chosen parameters
2. Splitting the dataset :
    - Loop over the cross-validation splits, extracting a trainset and a testset
3. Fitting the model :
    - Fit the model on the trainset
4. Testing :
    - Test the model on both the trainset and the testset

In [None]:
train_loss = []
test_loss = []
models = []
for lr_all in LR:
    algo = SVD(n_epochs=50, reg_all=0.01, lr_all=lr_all)
    models.append(algo)
    # Train/validation split with leave-one-out; loo has n_splits=1,
    # so exactly one score per learning rate is appended below.
    for trainset, testset in loo.split(ds):
        # Fit on the trainset, then predict on both sets
        algo.fit(trainset)
        train_pred = algo.test(trainset.build_testset())
        test_pred = algo.test(testset)

        # Compute the Root Mean Squared Error (verbose=False: nothing is printed)
        train_rmse = accuracy.rmse(train_pred, verbose=False)
        test_rmse = accuracy.rmse(test_pred, verbose=False)
        train_loss.append(train_rmse)
        test_loss.append(test_rmse)
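
For reference, the metric computed above is the root mean squared error, $\text{RMSE} = \sqrt{\frac{1}{N}\sum (r - \hat{r})^2}$, where $r$ is the true rating and $\hat{r}$ the estimate. A minimal plain-Python sketch on made-up (true, estimated) pairs follows. The helper name `rmse` is ours and is not part of Surprise.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    if not pairs:
        raise ValueError("need at least one prediction")
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up predictions on the notebook's 0-2 rating scale.
toy_predictions = [(2, 1.5), (0, 0.5), (1, 1.0), (2, 2.0)]
toy_rmse = rmse(toy_predictions)
print(round(toy_rmse, 4))
```

Because the errors are squared before averaging, a few large misses weigh more than many small ones.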

### Plotting test and train rmse lines

In [None]:
plt.plot(LR, train_loss, label='train')
plt.plot(LR, test_loss, label='test')
# The learning rates span several orders of magnitude, so use a log axis
plt.xscale('log')
plt.xlabel('learning_rate')
plt.ylabel('rmse')
plt.legend()

### Predicting a rating

In [None]:
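# Index of the lowest test RMSE
i = test_loss.index(min(test_loss))
# Use the corresponding (best) model
algo = models[i]
# Predict a rating; raw ids must have the same type as in the DataFrame,
# and placeID is read from the CSV as an integer
algo.predict('U1077', 132825)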

## Grid search using GridSearchCV

Using Grid Search to try different parameters

### Workflow :

1. Preparing the parameter map :
    - Create a dict of all the parameters you want to try (a dict of lists)
2. Splitting the dataset :
    - Prepare an instance of your preferred cross-validator
3. Fitting the model :
    - Fit the grid search instance on the whole dataset (splitting is handled internally)
4. Testing :
    - Handled internally
5. Visualization :
    - Use the `cv_results` attribute of your grid search instance to visualize the results.

In [None]:
from surprise.model_selection import GridSearchCV
param_grid = {
    'n_factors' : [10, 20, 50, 100, 130, 150, 200],
    'n_epochs': [10, 15, 30, 50, 100], 
    'lr_all': [0.001, 0.005, 0.007, 0.01, 0.05, 0.07, 0.1],
    'reg_all': [0.01, 0.05, 0.07, 0.1, 0.2, 0.4, 0.6]
}
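
An exhaustive grid search fits one model per parameter combination, so it helps to know how many combinations a grid contains before launching it. The sketch below enumerates them with `itertools.product`. It mirrors how exhaustive grid search works in general and is not Surprise's internal code.

```python
from itertools import product

param_grid = {
    'n_factors': [10, 20, 50, 100, 130, 150, 200],
    'n_epochs': [10, 15, 30, 50, 100],
    'lr_all': [0.001, 0.005, 0.007, 0.01, 0.05, 0.07, 0.1],
    'reg_all': [0.01, 0.05, 0.07, 0.1, 0.2, 0.4, 0.6],
}

# Every combination of one value per parameter (Cartesian product).
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combinations))  # number of models fitted per cross-validation split
print(combinations[0])
```

With 7 x 5 x 7 x 7 = 1715 combinations per algorithm, trimming the grid is the easiest way to shorten the search if it turns out to be too slow.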

### SVD and SVD++ :

In [None]:
gs_svd = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=loo,return_train_measures = True)
gs_svdpp = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=loo,return_train_measures = True)

In [None]:
gs_svd.fit(ds)
gs_svdpp.fit(ds)

#### Scores and best parameters

In [None]:
# best RMSE score
print ("Best RMSE Scores")
print(f'SVD : {gs_svd.best_score["rmse"]}')
print(f'SVDpp : {gs_svdpp.best_score["rmse"]}')


# combination of parameters that gave the best RMSE score
print("Parameters")
print(f"SVD : {gs_svd.best_params['rmse']}")
print(f"SVDpp : {gs_svdpp.best_params['rmse']}")

#### Saving results

In [None]:
results_frame_svd = pd.DataFrame.from_dict(gs_svd.cv_results)
results_frame_svd['model'] = 'SVD'
results_frame_svdpp = pd.DataFrame.from_dict(gs_svdpp.cv_results)
results_frame_svdpp['model'] = 'SVDpp'
results_frame = pd.concat([results_frame_svd, results_frame_svdpp])

In [None]:
results_frame.sort_values(by='mean_test_rmse').head()

### KNN and KNN With Z Score :

In [None]:
sim_options = {'name': ['pearson', 'cosine', 'msd'],
               'user_based': [True, False]  # True: user-user similarities, False: item-item
               }
param_grid_knn = {
    'sim_options' : sim_options,
    'k' : [10, 20, 40, 100],
    'min_k' : [1, 5 , 10],
    'verbose' : [False]
}
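
The KNN grid is nested: `sim_options` is itself a dict of lists. As far as we understand, the grid search first expands it into every similarity configuration, then combines those with the other parameters. The sketch below reproduces that counting in plain Python. The helper `expand` is hypothetical and only illustrates the idea.

```python
from itertools import product

def expand(grid):
    """All dicts obtained by picking one value per key of a dict of lists."""
    return [dict(zip(grid, values)) for values in product(*grid.values())]

sim_options = {'name': ['pearson', 'cosine', 'msd'], 'user_based': [True, False]}
param_grid_knn = {
    'sim_options': expand(sim_options),  # 3 x 2 = 6 similarity configurations
    'k': [10, 20, 40, 100],
    'min_k': [1, 5, 10],
    'verbose': [False],
}
knn_combinations = expand(param_grid_knn)
print(len(knn_combinations))
```

This gives 6 x 4 x 3 x 1 = 72 configurations per KNN algorithm, far fewer than the SVD grid.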

In [None]:
gs_knn= GridSearchCV(KNNBasic, param_grid=param_grid_knn, measures=['rmse', 'mae'], cv=loo,return_train_measures = True)
gs_knnZ= GridSearchCV(KNNWithZScore, param_grid=param_grid_knn, measures=['rmse', 'mae'], cv=loo,return_train_measures = True)

In [None]:
gs_knn.fit(ds)
gs_knnZ.fit(ds)

#### Scores and best parameters

In [None]:
# best RMSE score
print ("Best RMSE Scores")
print(f'KNN : {gs_knn.best_score["rmse"]}')
print(f'KNNWithZScore : {gs_knnZ.best_score["rmse"]}')


# combination of parameters that gave the best RMSE score
print("Parameters")
print(f"KNN : {gs_knn.best_params['rmse']}")
print(f"KNNWithZScore : {gs_knnZ.best_params['rmse']}")

#### Saving results

In [None]:
results_frame_knn = pd.DataFrame.from_dict(gs_knn.cv_results)
results_frame_knn['model'] = 'KNN'
results_frame_knnZ = pd.DataFrame.from_dict(gs_knnZ.cv_results)
results_frame_knnZ['model'] = 'KNNWithZScore'
results_frame = pd.concat([results_frame, results_frame_knn, results_frame_knnZ],sort=False)

### Finalize results set :

#### Creating global ranks for metrics

In [None]:
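# cv_results already holds per-grid-search ranks under these names;
# overwrite them with ranks computed across all models (1 = lowest error)
results_frame['rank_test_mae'] = results_frame['mean_test_mae'].rank()
results_frame['rank_test_rmse'] = results_frame['mean_test_rmse'].rank()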
results_frame.head()
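
One detail worth knowing about `Series.rank()`: by default, tied values share the average of their positions, so ranks are not always whole numbers. The sketch below uses made-up scores to show the default next to `method='min'`, which gives tied rows the best rank of their group.

```python
import pandas as pd

# Made-up scores; two configurations tie at 0.80.
toy = pd.DataFrame({
    'model': ['SVD', 'SVDpp', 'KNN', 'KNNWithZScore'],
    'mean_test_rmse': [0.90, 0.80, 0.80, 1.00],
})

# Default: ascending ranks, ties share the average of their positions.
toy['rank_average'] = toy['mean_test_rmse'].rank()
# method='min': tied rows all receive the lowest rank of their group.
toy['rank_min'] = toy['mean_test_rmse'].rank(method='min')
print(toy)
```

Either choice works for sorting; `method='min'` may read more naturally when the ranks are shown in a chart or table.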

#### Export model performance for visualization

In [None]:
# Export for visualization
results_frame.to_csv('results.csv',index=False)