# Benchmarking Collaborative Filtering Recommendation Algorithms

This benchmark covers the collaborative filtering algorithms available in the Microsoft/Recommenders repository, such as Spark ALS, Surprise SVD, and Microsoft SAR.

## Experimentation setup
* Objective
  * To compare how well each collaborative filtering algorithm performs at recommending a list of items.
* Datasets
  * MovieLens 100K.
  * MovieLens 1M.
  * MovieLens 10M.
  * MovieLens 20M.
* Data split
  * The data is split into train and test sets.
  * The split ratio is 70-30 for the train and test sets, matching the `ratio=0.7` used in the split cell of this notebook.
  * The splitting is random. 
* Model training
  * A recommendation model is trained by using each of the collaborative filtering algorithms. 
  * An exhaustive search of the hyperparameter space is cumbersome. Instead, parameter values that the literature reports as giving good results are used.
* Evaluation metrics
  * Precision@k.
  * Recall@k.
  * Normalized discounted cumulative gain@k (NDCG@k).
  * Mean average precision (MAP).
  * In the evaluation metrics above, k = 10. 
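
As a rough illustration of the metrics listed above, the following sketch computes Precision@k, Recall@k, and a binary-relevance NDCG@k for a single user in plain Python. The function names and the toy item ids are hypothetical and not part of `reco_utils`; the library's `precision_at_k`, `recall_at_k`, and `ndcg_at_k` operate on whole DataFrames and may differ in detail.

```python
import math


def user_precision_recall_at_k(recommended, relevant, k=10):
    """Return (precision@k, recall@k) for one user.

    recommended: item ids ordered from highest to lowest predicted score.
    relevant: item ids the user interacted with in the test set.
    """
    top_k = recommended[:k]
    relevant = set(relevant)
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def user_ndcg_at_k(recommended, relevant, k=10):
    """Return NDCG@k for one user, treating relevance as binary (hit or miss)."""
    relevant = set(relevant)
    # Discounted gain of the hits actually found in the top k.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    # Best achievable DCG: all relevant items ranked first.
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy data, for illustration only.
recommended = [1, 2, 3, 4, 5]
relevant = [2, 5, 9]
precision, recall = user_precision_recall_at_k(recommended, relevant, k=5)
ndcg = user_ndcg_at_k(recommended, relevant, k=5)
print(precision, recall, ndcg)
```

In this toy case two of the five recommended items are relevant, so precision is 2/5 and recall is 2/3. The benchmark reports these quantities averaged over users.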

## 0 Global settings

In [66]:
# set the environment path to find Recommenders
import sys
sys.path.append("../../")
import os
import numpy as np
import pandas as pd
from zipfile import ZipFile
import papermill as pm
import time
import itertools

import pyspark
from pyspark.ml.recommendation import ALS
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import StringType, FloatType, IntegerType

import surprise

from reco_utils.dataset.url_utils import maybe_download
from reco_utils.dataset.movielens import load_spark_df, load_pandas_df
from reco_utils.dataset.python_splitters import python_random_split
from reco_utils.dataset.spark_splitters import spark_chrono_split
from reco_utils.recommender.sar.sar_singlenode import SARSingleNodeReference
from reco_utils.evaluation.spark_evaluation import SparkRankingEvaluation
from reco_utils.evaluation.python_evaluation import (
    map_at_k,
    ndcg_at_k,
    precision_at_k,
    recall_at_k
)
from reco_utils.evaluation.parameter_sweep import generate_param_grid

print("System version: {}".format(sys.version))
print("Spark version: {}".format(pyspark.__version__))

System version: 3.6.0 | packaged by conda-forge | (default, Feb  9 2017, 14:36:55) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
Spark version: 2.3.1


In [2]:
%env PYSPARK_PYTHON=/anaconda/envs/recommender/bin/python3
%env PYSPARK_DRIVER_PYTHON=/anaconda/envs/recommender/bin/python3

env: PYSPARK_PYTHON=/anaconda/envs/recommender/bin/python3
env: PYSPARK_DRIVER_PYTHON=/anaconda/envs/recommender/bin/python3


In [3]:
# Configure Spark
spark = SparkSession \
    .builder \
    .appName("ALS pySpark") \
    .master("local[*]") \
    .config("spark.driver.memory", "2g")\
    .config("spark.executor.cores", "32")\
    .config("spark.executor.memory", "8g")\
    .config("spark.memory.fraction", "0.9")\
    .config("spark.memory.storageFraction", "0.3")\
    .config("spark.executor.instances", 1)\
    .config("spark.executor.heartbeatInterval", "36000s")\
    .config("spark.network.timeout", "10000000s")\
    .config("spark.driver.maxResultSize", "50g")\
    .getOrCreate()

In [4]:
# top k items to recommend
TOP_K = 10

# Select Movielens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '100k'

# Metric for selection
SELECTION_METRIC = "recall"

## 1 Prepare data

In [5]:
# Set data schema
headers = {
    "col_user": "UserId",
    "col_item": "MovieId",
    "col_timestamp": "Timestamp"
}

In [6]:
# Download data
data = load_pandas_df(size=MOVIELENS_DATA_SIZE)

In [79]:
# Split data w.r.t the experimentation protocol.
df_train, df_test = python_random_split(data, ratio=0.7)
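
For readers without `reco_utils` at hand, a random split of this kind can be sketched with pandas alone. This is an approximation under stated assumptions, not the implementation of `python_random_split`; the helper name `random_split`, the seed, and the toy frame are hypothetical.

```python
import pandas as pd


def random_split(df, ratio, seed=42):
    """Shuffle the rows of `df` and cut them into train/test parts by `ratio`."""
    shuffled = df.sample(frac=1.0, random_state=seed)
    n_train = int(round(ratio * len(shuffled)))
    return shuffled.iloc[:n_train], shuffled.iloc[n_train:]


# Toy ratings, for illustration only.
toy = pd.DataFrame({
    "UserId": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "MovieId": [10, 20, 10, 30, 20, 40, 30, 40, 10, 50],
    "Rating": [5.0, 3.0, 4.0, 2.0, 1.0, 5.0, 4.0, 3.0, 2.0, 5.0],
})
toy_train, toy_test = random_split(toy, ratio=0.7)
print(len(toy_train), len(toy_test))
```

Because the split is purely random, nothing guarantees that every user or item in the test part also appears in the training part.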

## 2 Train model

The CF algorithms available in the repository are compared side by side: Spark ALS, SAR, and Surprise SVD.

In [8]:
cf_algorithms = ["als", "sar", "svd"]

Instead of running a time-consuming hyperparameter search, each model is trained with fixed hyperparameters, chosen either from values reported in the literature or empirically.

In [91]:
cf_params = {
    "als": {
        "rank": 100,
        "regParam": 150
    },
    "sar": {
        "time_decay_coefficient": 30,
        "similarity_type": "jaccard"
    },
    "svd": {
        "n_factors": 100,
        "n_epochs": 30,
        "lr_all": 0.005,
        "reg_all": 0.02
    }
}
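
To give a sense of why an exhaustive search is avoided, the sketch below counts the combinations in a small grid for SVD. The candidate values are purely illustrative assumptions, not values taken from this benchmark or from the literature.

```python
import itertools

# Hypothetical candidate values for each SVD hyperparameter.
svd_grid = {
    "n_factors": [50, 100, 150],
    "n_epochs": [10, 20, 30],
    "lr_all": [0.002, 0.005, 0.01],
    "reg_all": [0.01, 0.02, 0.05],
}

# Enumerate every combination as a dict that could be passed as **params.
names = list(svd_grid)
combinations = [
    dict(zip(names, values))
    for values in itertools.product(*(svd_grid[name] for name in names))
]
print(len(combinations))  # 3 * 3 * 3 * 3 = 81
```

Each combination requires a full training run per algorithm and per dataset, so the cost grows multiplicatively with every added parameter.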

In [92]:
df_results = pd.DataFrame()

for idx, algo in enumerate(cf_algorithms):
    params = cf_params[algo]
    
    if algo == "als":  
        als = ALS(
            maxIter=15,
            implicitPrefs=True,
            coldStartStrategy='drop',
            userCol="UserId",
            itemCol="MovieId",
            ratingCol="Rating",
            nonnegative=False,
            **params
        )
        
        dfs_train = spark.createDataFrame(df_train)
        dfs_test = spark.createDataFrame(df_test)
        
        time_start = time.time()
        model = als.fit(dfs_train)
        time_train = time.time() - time_start

        time_start = time.time()
        dfs_rec = model.recommendForUserSubset(dfs_test, TOP_K)
        dfs_pred = dfs_rec.select('UserId', F.explode('recommendations').alias('r')) \
          .select('UserId', 'r.*')
        time_test = time.time() - time_start
        
        df_pred = dfs_pred.withColumnRenamed("rating", "prediction").toPandas()
        
        print(df_pred.columns)
    elif algo == "sar":
        model = SARSingleNodeReference(
            remove_seen=False, 
            time_now=None, 
            timedecay_formula=True, 
            col_rating="Rating",
            **headers,
            **params
        )
        
        time_start = time.time()
        unique_users = data["UserId"].unique()
        unique_items = data["MovieId"].unique()
        
        enumerate_items_1, enumerate_items_2 = itertools.tee(enumerate(unique_items))
        enumerate_users_1, enumerate_users_2 = itertools.tee(enumerate(unique_users))
        item_map_dict = {x: i for i, x in enumerate_items_1}
        user_map_dict = {x: i for i, x in enumerate_users_1}

        index2user = dict(enumerate_users_2)
        index2item = dict(enumerate_items_2)
        
        model.set_index(unique_users, unique_items, user_map_dict, item_map_dict, index2user, index2item)
        
        df_train_sar = df_train
        model.fit(df_train_sar)
        time_train = time.time() - time_start
        
        time_start = time.time()
        df_test_sar = df_test
        top_k = model.recommend_k_items(df_test_sar)
        
        top_k['UserId'] = pd.to_numeric(top_k['UserId'])
        top_k['MovieId'] = pd.to_numeric(top_k['MovieId'])
        time_test = time.time() - time_start
        
        df_pred = top_k
    elif algo == "svd":
        df_train_svd = df_train[["UserId", "MovieId", "Rating"]]
        train = surprise.Dataset.load_from_df(df_train_svd, reader=surprise.Reader('ml-100k')).build_full_trainset()
        
        svd = surprise.SVD(
            random_state=0, 
            verbose=False,
            **params
        )
        
        time_start = time.time()
        svd.fit(train)
        time_train = time.time() - time_start
        
        time_start = time.time()
        predictions = [svd.predict(row.UserId, row.MovieId, row.Rating)
               for (_, row) in df_test.iterrows()]
        predictions = pd.DataFrame(predictions)
        predictions = predictions.rename(index=str, columns={'uid': 'UserId', 'iid': 'MovieId',
                                                             'est': 'prediction'})
        df_pred = predictions.drop(['details', 'r_ui'], axis='columns')
        df_pred = df_pred.astype({"UserId": int, "MovieId": int, "prediction": float})
        time_test = time.time() - time_start
    else:
        raise ValueError("No algorithm {} found".format(algo))

    eval_map = map_at_k(df_test, df_pred, col_user="UserId", col_item="MovieId", 
                    col_rating="Rating", col_prediction="prediction", 
                    relevancy_method="top_k", k=TOP_K)
    
    eval_ndcg = ndcg_at_k(df_test, df_pred, col_user="UserId", col_item="MovieId", 
                      col_rating="Rating", col_prediction="prediction", 
                      relevancy_method="top_k", k=TOP_K)
    
    eval_precision = precision_at_k(df_test, df_pred, col_user="UserId", col_item="MovieId", 
                                col_rating="Rating", col_prediction="prediction", 
                                relevancy_method="top_k", k=TOP_K)
    
    eval_recall = recall_at_k(df_test, df_pred, col_user="UserId", col_item="MovieId", 
                          col_rating="Rating", col_prediction="prediction", 
                          relevancy_method="top_k", k=TOP_K)

    df_result = pd.DataFrame(
        {
            "Algo": algo,
            "K": TOP_K,
            "MAP": eval_map,
            "nDCG@k": eval_ndcg,
            "Precision@k": eval_precision,
            "Recall@k": eval_recall,
            "Train time": time_train,
            "Test time": time_test
        }, 
        index=[0]
    )

    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    df_results = pd.concat([df_results, df_result], ignore_index=True)
        
df_results

Index(['UserId', 'MovieId', 'prediction'], dtype='object')


Collecting user affinity matrix...
Calculating time-decayed affinities...
Creating index columns...
Building user affinity sparse matrix...
Calculating item cooccurrence...
Calculating item similarity...
Calculating jaccard...
Calculating recommendation scores...
done training
Converting to dense matrix...
Getting top K...
Select users from the test set
Creating output dataframe...
Formatting output


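| | Algo | K | MAP | nDCG@k | Precision@k | Recall@k | Train time | Test time |
|---|---|---|---|---|---|---|---|---|
| 0 | als | 10 | 0.004845 | 0.050212 | 0.057264 | 0.020964 | 12.190346 | 0.135208 |
| 1 | sar | 10 | 0.002616 | 0.029502 | 0.035313 | 0.015359 | 0.681882 | 0.118496 |
| 2 | svd | 10 | 0.559017 | 1.0 | 0.922481 | 0.559017 | 6.764988 | 3.926352 |

Times are wall-clock seconds measured with `time.time()`. The SVD row is not directly comparable with the ALS and SAR rows: the SVD branch predicts ratings only for the (user, item) pairs present in `df_test`, so every item it ranks is a held-out item, which inflates the ranking metrics.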
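
The SAR branch of the training loop above maps raw user and item ids to contiguous matrix indices and back before fitting. A minimal sketch of that forward and inverse mapping follows, using a toy id list (hypothetical values) and plain dict comprehensions instead of `itertools.tee`; the variable names are illustrative and not part of the SAR API.

```python
# Toy raw item ids, for illustration only.
raw_item_ids = [242, 302, 377, 51]

# Forward map: raw id -> contiguous index usable as a matrix row/column.
item_to_index = {item: idx for idx, item in enumerate(raw_item_ids)}

# Inverse map: contiguous index -> raw id, to translate results back.
index_to_item = {idx: item for item, idx in item_to_index.items()}

# A round trip through both maps returns the original id.
round_trip_ok = all(index_to_item[item_to_index[i]] == i for i in raw_item_ids)
print(item_to_index[377], index_to_item[0], round_trip_ok)
```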


In [81]:
df_pred.dtypes

UserId          int64
MovieId         int64
prediction    float64
dtype: object

In [82]:
df_test.dtypes

UserId           int64
MovieId          int64
Rating         float64
Timestamp        int64
hashedUsers      int64
dtype: object

In [86]:
df_test = df_test[["UserId", "MovieId", "Rating"]]

eval_map = map_at_k(df_test, df_pred, col_user="UserId", col_item="MovieId", 
                col_rating="Rating", col_prediction="prediction", 
                relevancy_method="top_k", k=TOP_K)

eval_ndcg = ndcg_at_k(df_test, df_pred, col_user="UserId", col_item="MovieId", 
                  col_rating="Rating", col_prediction="prediction", 
                  relevancy_method="top_k", k=TOP_K)

eval_precision = precision_at_k(df_test, df_pred, col_user="UserId", col_item="MovieId", 
                            col_rating="Rating", col_prediction="prediction", 
                            relevancy_method="top_k", k=TOP_K)

eval_recall = recall_at_k(df_test, df_pred, col_user="UserId", col_item="MovieId", 
                      col_rating="Rating", col_prediction="prediction", 
                      relevancy_method="top_k", k=TOP_K)

df_result = pd.DataFrame(
    {
        "Algo": algo,
        "K": TOP_K,
        "MAP": eval_map,
        "nDCG@k": eval_ndcg,
        "Precision@k": eval_precision,
        "Recall@k": eval_recall,
        "Train time": time_train,
        "Test time": time_test
    }, 
    index=[0]
)

df_result

| | Algo | K | MAP | nDCG@k | Precision@k | Recall@k | Train time | Test time |
|---|---|---|---|---|---|---|---|---|
| 0 | svd | 10 | 0.559017 | 1.0 | 0.922481 | 0.559017 | 7.044995 | 3.72973 |


In [88]:
df_test[df_test["UserId"]==1].sort_values("Rating", ascending=False)

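| | UserId | MovieId | Rating |
|---|---|---|---|
| 59972 | 1 | 168 | 5.0 |
| 87967 | 1 | 59 | 5.0 |
| 32236 | 1 | 1 | 5.0 |
| 17297 | 1 | 190 | 5.0 |
| 45796 | 1 | 50 | 5.0 |
| 10207 | 1 | 183 | 5.0 |
| 17672 | 1 | 100 | 5.0 |
| 88021 | 1 | 15 | 5.0 |
| 84793 | 1 | 207 | 5.0 |
| 6028 | 1 | 96 | 5.0 |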


In [85]:
df_pred[df_pred["UserId"]==1].sort_values("prediction", ascending=False).iloc[0:10,:]

| | UserId | MovieId | prediction |
|---|---|---|---|
| 24379 | 1 | 96 | 4.964904 |
| 770 | 1 | 127 | 4.881519 |
| 20053 | 1 | 50 | 4.756767 |
| 13336 | 1 | 169 | 4.644068 |
| 3208 | 1 | 136 | 4.629751 |
| 20546 | 1 | 183 | 4.604875 |
| 7165 | 1 | 56 | 4.583232 |
| 15721 | 1 | 190 | 4.345905 |
| 29707 | 1 | 213 | 4.342732 |
| 6613 | 1 | 132 | 4.337187 |
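
The two listings above for user 1 can be compared directly. The sketch below hard-codes the item ids shown in them (only the ten rows displayed in each listing, so the test listing is presumably incomplete) and counts how many of the top-10 predicted items also appear among the displayed test items.

```python
# Item ids copied from the two listings for user 1 shown above.
displayed_test_items = [168, 59, 1, 190, 50, 183, 100, 15, 207, 96]
top10_predicted = [96, 127, 50, 169, 136, 183, 56, 190, 213, 132]

# Keep the predicted items that also occur in the displayed test rows.
test_set = set(displayed_test_items)
overlap = [item for item in top10_predicted if item in test_set]
print(overlap)       # [96, 50, 183, 190]
print(len(overlap))  # 4
```

Four of the ten predicted items occur in the displayed test rows. Since the SVD predictions are generated from rows of `df_test`, the remaining six are presumably test items that fall outside the rows displayed here.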
