# Benchmarking Collaborative Filtering Recommendation Algorithms

This benchmark covers the collaborative filtering algorithms available in the Microsoft/Recommenders repository, such as Spark ALS, Surprise SVD, and Microsoft SAR.

## Experimentation setup
* Objective
  * To compare how well each collaborative filtering algorithm predicts ratings and recommends relevant items.
* Datasets
  * Movielens 100K.
  * Movielens 1M.
  * Movielens 10M.
  * Movielens 20M.
* Data split
  * The data is split into train and test sets.
  * The split ratio is 75/25 (75% train, 25% test).
  * The splitting is random. 
* Model training
  * A recommendation model is trained by using each of the collaborative filtering algorithms. 
  * Exhaustive search of the hyperparameter space is cumbersome. Instead, each algorithm uses empirical parameter values that are reported in the literature to give optimal results.
* Evaluation metrics
  * Ranking metrics:
    * Precision@k.
    * Recall@k.
    * Normalized discounted cumulative gain@k (NDCG@k).
    * Mean average precision (MAP).
    * In the evaluation metrics above, k = 10. 
  * Rating metrics:
    * Root mean squared error (RMSE).
    * Mean absolute error (MAE).
    * R squared.
    * Explained variance.
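
The split and a few of the metrics listed above can be illustrated on toy data. This is a minimal sketch using textbook definitions and only the standard library; the function names (`random_split`, `precision_recall_at_k`, `rmse`, `mae`) and the toy inputs are made up for illustration, and the evaluation code in the repository may differ in details such as how users are aggregated.

```python
import math
import random


def random_split(rows, ratio=0.75, seed=42):
    """Shuffle rows and cut them into train/test parts at the given ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * ratio))
    return shuffled[:cut], shuffled[cut:]


def precision_recall_at_k(recommended, relevant, k=10):
    """Precision@k and recall@k for a single user.

    recommended: ranked list of item ids, best first.
    relevant: set of item ids the user interacted with in the test set.
    """
    hits = len(set(recommended[:k]) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy (user, item, rating) rows: 4 users x 5 items = 20 rows.
ratings = [(user, item, 3.0) for user in range(4) for item in range(5)]
train, test = random_split(ratings, ratio=0.75)

# Two of the top-4 recommendations ("b" and "d") are relevant.
precision, recall = precision_recall_at_k(["a", "b", "c", "d"], {"b", "d", "z"}, k=4)

print(len(train), len(test))
print(precision, recall)
print(rmse([4.0, 3.0], [3.0, 5.0]), mae([4.0, 3.0], [3.0, 5.0]))
```

Ranking metrics are computed per user and then averaged, whereas rating metrics are computed over all (user, item) pairs in the test set.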

## 0 Global settings

In [22]:
import sys
sys.path.append("../../")
import os
import shutil
import tempfile
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns
import papermill as pm
import pyspark

print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))
print("PySpark version: {}".format(pyspark.__version__))

System version: 3.6.0 | packaged by conda-forge | (default, Feb  9 2017, 14:36:55) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
Pandas version: 0.23.4
PySpark version: 2.3.1


A temporary directory is created to hold the output notebook while the benchmark runs.

In [23]:
# Put temp results in a temp folder.
temp_path = tempfile.mkdtemp()
output_path = os.path.join(temp_path, 'output.ipynb')

In [24]:
# Global set-up parameters used to run the notebooks.
k = 10

# data_sizes = ['100k', '1m', '10m', '20m']
# algorithms = ['als', 'sar', 'svd']

data_sizes = ['20m']
algorithms = ['sar']

notebooks = {
    'als': '../00_quick_start/als_pyspark_movielens.ipynb',
    'sar': '../00_quick_start/sar_single_node_movielens.ipynb',
    'svd': '../02_modeling/surprise_svd_deep_dive.ipynb'
}
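
The commented-out lists in the cell above describe the full benchmark. As a rough sketch, the grid they imply can be enumerated up front, which also catches an algorithm name that has no notebook entry before any run starts. The names `all_data_sizes`, `all_algorithms`, and `runs` are introduced here for illustration only.

```python
from itertools import product

# Values copied from the commented-out lines in the settings cell.
all_data_sizes = ['100k', '1m', '10m', '20m']
all_algorithms = ['als', 'sar', 'svd']

notebooks = {
    'als': '../00_quick_start/als_pyspark_movielens.ipynb',
    'sar': '../00_quick_start/sar_single_node_movielens.ipynb',
    'svd': '../02_modeling/surprise_svd_deep_dive.ipynb'
}

# Fail early if an algorithm has no notebook to execute.
unknown = [a for a in all_algorithms if a not in notebooks]
if unknown:
    raise ValueError("No notebook configured for: {}".format(unknown))

# Every (data size, algorithm) pair is one notebook execution.
runs = list(product(all_data_sizes, all_algorithms))
print("{} notebook runs in the full grid".format(len(runs)))
```

With four data sizes and three algorithms, the full grid is twelve notebook executions; the cell above restricts it to a single run.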

## 1 Run notebooks to generate benchmarking results

In [None]:
# For each data size and each algorithm, a recommender is evaluated. 
df_results = pd.DataFrame()

for data_size in data_sizes:
    for algorithm in algorithms:
        # Execute the notebook
        pm.execute_notebook(
            notebooks[algorithm],
            output_path,
            parameters=dict(TOP_K=k, MOVIELENS_DATA_SIZE=data_size),
            kernel_name="reco_full"
        )
        
        # Read records from the notebook.
        nb = pm.read_notebook(output_path)
        
        # Arrange results and save them into dataframe.
        df_eval = nb.dataframe.transpose()
        df_eval = df_eval.rename(columns=df_eval.iloc[0]).drop(['name', 'type', 'filename'])
        df_eval.columns = [x.lower() for x in list(df_eval.columns)]
        
        if algorithm in ["als", "svd"]:
            df_result = pd.DataFrame(
                {
                    "Data": data_size,
                    "Algo": algorithm,
                    "K": k,
                    "MAP": df_eval['map'].item(),
                    "nDCG@k": df_eval['ndcg'].item(),
                    "Precision@k": df_eval['precision'].item(),
                    "Recall@k": df_eval['recall'].item(),
                    "RMSE": df_eval['rmse'].item(),
                    "MAE": df_eval['mae'].item(),
                    "R2": df_eval['rsquared'].item(),
                    "Explained Variance": df_eval['exp_var'].item()
                }, 
                index=[0]
            )
        elif algorithm == "sar":
            df_result = pd.DataFrame(
                {
                    "Data": data_size,
                    "Algo": algorithm,
                    "K": k,
                    "MAP": df_eval['map'].item(),
                    "nDCG@k": df_eval['ndcg'].item(),
                    "Precision@k": df_eval['precision'].item(),
                    "Recall@k": df_eval['recall'].item(),
                    "RMSE": np.nan,
                    "MAE": np.nan,
                    "R2": np.nan,
                    "Explained Variance": np.nan
                }, 
                index=[0]
            )
        else:
            raise ValueError("{} is not a recognized algorithm".format(algorithm))
        # pd.concat also works on pandas versions where DataFrame.append is no longer available.
        df_results = pd.concat([df_results, df_result], ignore_index=True)
        
df_results

Input Notebook:  ../00_quick_start/sar_single_node_movielens.ipynb
Output Notebook: /tmp/tmpr7yywr4e/output.ipynb
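
The reshaping step in the cell above (transpose, rename, drop) can be hard to follow without seeing the intermediate frame. The sketch below reproduces it on a hand-built stand-in for `nb.dataframe`. It assumes that frame has the columns `name`, `value`, `type`, and `filename`, which is inferred from the rows the cell drops; the numbers are placeholders, not benchmark results. It also shows one possible way to fill missing rating metrics with NaN without a separate branch per algorithm.

```python
import pandas as pd

# Hypothetical stand-in for nb.dataframe: one row per recorded value.
records = pd.DataFrame(
    {
        "name": ["MAP", "NDCG", "Precision", "Recall"],
        "value": [0.1, 0.2, 0.3, 0.4],  # placeholder numbers, not results
        "type": ["record"] * 4,
        "filename": ["output.ipynb"] * 4,
    }
)

# Same three steps as in the benchmark loop.
df_eval = records.transpose()  # rows: name, value, type, filename
df_eval = df_eval.rename(columns=df_eval.iloc[0]).drop(["name", "type", "filename"])
df_eval.columns = [x.lower() for x in df_eval.columns]  # only the 'value' row remains

# Metrics that were not recorded (here the rating metrics) become NaN.
metric_keys = ["map", "ndcg", "precision", "recall", "rmse", "mae", "rsquared", "exp_var"]
row = {
    key: df_eval[key].item() if key in df_eval.columns else float("nan")
    for key in metric_keys
}
print(row)
```

After the three steps, `df_eval` is a single-row frame indexed by `value` with one lowercase column per recorded metric, which is why `df_eval['map'].item()` returns a scalar.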



The temporary directory is removed after all runs have finished.

In [None]:
import errno

try:
    shutil.rmtree(temp_path)  # delete the temporary directory and its contents
except OSError as exc:
    if exc.errno != errno.ENOENT:  # ENOENT: no such file or directory
        raise  # re-raise any other error
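
As an alternative to creating the directory with `tempfile.mkdtemp()` and removing it by hand, the standard library's `tempfile.TemporaryDirectory` context manager handles the cleanup when the block exits. This is a sketch only: the placeholder file stands in for the notebook runs, and the pattern fits only if all runs happen inside a single `with` block rather than across several notebook cells.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as temp_path:
    output_path = os.path.join(temp_path, "output.ipynb")

    # Stand-in for the benchmark loop: write a placeholder output file.
    with open(output_path, "w") as f:
        f.write("{}")
    existed_inside = os.path.exists(output_path)

# The directory and its contents are removed once the block exits.
removed_after = not os.path.exists(temp_path)
print(existed_inside, removed_after)
```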