# Running SAR on MovieLens (sarplus)

This notebook shows how to train and evaluate SAR using sarplus, an implementation built on PySpark (SQL + C++).

The biggest advantage of sarplus is the scalability of the prediction step: its only limitation is that the similarity matrix must fit in memory on each worker.
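To get a feel for that memory constraint, the following back-of-the-envelope sketch estimates the size of an item-item similarity matrix. It assumes dense storage with 4-byte floats, which is an assumption of this example rather than a statement about how sarplus stores the matrix internally; a sparse layout would need less. The helper name `dense_similarity_bytes` is hypothetical.

```python
def dense_similarity_bytes(n_items, bytes_per_value=4):
    """Bytes needed for a dense n_items x n_items matrix (assumed float32)."""
    return n_items * n_items * bytes_per_value


# MovieLens 100k contains 1,682 movies; the larger catalogue sizes are
# illustrative only.
for n_items in (1682, 100000, 1000000):
    gib = dense_similarity_bytes(n_items) / 2 ** 30
    print("{:>9,} items -> {:,.2f} GiB per worker".format(n_items, gib))
```

The cost grows quadratically with the number of items, so the item catalogue, not the number of users or ratings, is what determines whether the matrix fits on a worker.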

In [14]:
# set the environment path to find Recommenders
import sys
sys.path.append("../../")

from reco_utils.dataset.url_utils import maybe_download
from reco_utils.dataset.python_splitters import python_random_split
import reco_utils.evaluation.python_evaluation as evaluation

import itertools
import os
import pandas as pd
from pyspark.sql import SparkSession
from pysarplus import SARPlus

print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))

System version: 3.6.2 |Anaconda, Inc.| (default, Sep 21 2017, 18:29:43) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
Pandas version: 0.20.3


### 1. Download the MovieLens dataset

In [9]:
filepath = maybe_download("http://files.grouplens.org/datasets/movielens/ml-100k/u.data", "ml-100k.data")
data = pd.read_csv(filepath, sep="\t", names=["UserId", "MovieId", "Rating", "Timestamp"])
data.head()

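|   | UserId | MovieId | Rating | Timestamp |
|---|--------|---------|--------|-----------|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |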
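For readers without pandas at hand, here is a minimal standard-library sketch of the same parsing step. It uses only the five rows displayed above, laid out as tab-separated text in the way `u.data` is, and converts the `Timestamp` column (Unix seconds) to a UTC date. The names `SAMPLE` and `parse_ratings` are hypothetical and exist only for this illustration.

```python
import csv
import io
from datetime import datetime, timezone

# The five rows shown in the output above, in tab-separated form.
SAMPLE = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
    "244\t51\t2\t880606923\n"
    "166\t346\t1\t886397596\n"
)
COLUMNS = ["UserId", "MovieId", "Rating", "Timestamp"]


def parse_ratings(text):
    """Parse tab-separated rating rows into a list of dicts of ints."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(COLUMNS, map(int, row))) for row in reader]


rows = parse_ratings(SAMPLE)
for row in rows:
    # Timestamps are seconds since the Unix epoch.
    when = datetime.fromtimestamp(row["Timestamp"], tz=timezone.utc)
    print(row["UserId"], row["MovieId"], row["Rating"], when.date())
```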


### 2. Split the data using the Python random splitter provided in the utilities

In [6]:
train, test = python_random_split(data)
header = {
    "col_user": "UserId",
    "col_item": "MovieId",
    "col_rating": "Rating",
    "col_timestamp": "Timestamp",
}
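Conceptually, a random split shuffles the rating rows and cuts them at a fixed fraction. The sketch below illustrates the idea with the standard library only; it is not the implementation of `python_random_split`, and the 75/25 ratio, the seed, and the function name `random_split` are assumptions chosen for this example.

```python
import random


def random_split(rows, ratio=0.75, seed=42):
    """Shuffle rows reproducibly and cut them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * ratio))
    return shuffled[:cut], shuffled[cut:]


# Toy interactions: 10 users x 10 items, one (user, item) pair each.
ratings = [(user, item) for user in range(10) for item in range(10)]
train_rows, test_rows = random_split(ratings)
print(len(train_rows), len(test_rows))
```

Because the split ignores users and timestamps, a user's test ratings can predate their training ratings. That is acceptable for a quick demonstration, but a chronological split may be a better fit when time decay matters.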

### 3. Start Spark and load sar+

In [4]:
SUBMIT_ARGS = "--packages eisber:sarplus:0.2.2 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS

spark = (
    SparkSession.builder.appName("sample")
    .master("local[*]")
    .config("spark.driver.memory", "4G")  # "memory" alone is not a Spark property
    .config("spark.sql.shuffle.partitions", "10")
    .config("spark.sql.crossJoin.enabled", True)
    .config("spark.ui.enabled", False)
    .getOrCreate()
)

### 4. Train the SAR model on the training data and get the top-k recommendations for the test data

In [7]:
df_train = spark.createDataFrame(train)
df_test = spark.createDataFrame(test)

model = SARPlus(spark, **header)
model.fit(df_train, similarity_type='jaccard', 
          time_decay_coefficient=30, time_now=None, timedecay_formula=True)

top_k = model.recommend_k_items(df_test, 'sarplus_cache', top_k=10, remove_seen=True)\
    .toPandas()

top_k[top_k.UserId == 796].head(10)

INFO:sarplus:sarplus.fit 1/2: compute item cooccurences...
INFO:sarplus:sarplus.fit 2/2: compute similiarity metric jaccard...
INFO:sarplus:sarplus.recommend_k_items 1/3: create item index
INFO:sarplus:sarplus.recommend_k_items 2/3: prepare similarity matrix
INFO:sarplus:sarplus.recommend_k_items 3/3: compute recommendations


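|      | UserId | MovieId | score |
|------|--------|---------|-------|
| 7960 | 796 | 191 | 141.250412 |
| 7961 | 796 | 208 | 144.016556 |
| 7962 | 796 | 99 | 145.766876 |
| 7963 | 796 | 28 | 147.205063 |
| 7964 | 796 | 265 | 148.355103 |
| 7965 | 796 | 79 | 152.82489 |
| 7966 | 796 | 566 | 153.102249 |
| 7967 | 796 | 174 | 154.093216 |
| 7968 | 796 | 216 | 154.755188 |
| 7969 | 796 | 234 | 155.419418 |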
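The `fit` and `recommend_k_items` calls hide three ideas: Jaccard item similarity from co-occurrence, time-decayed user affinity, and scoring unseen items by combining the two. The pure-Python sketch below walks through them on toy data. It follows the usual description of SAR rather than the sarplus source: the half-life reading of `time_decay_coefficient` (in days), the use of the latest timestamp when no reference time is given, and all function names are assumptions of this example.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86400
HALF_LIFE_DAYS = 30  # plays the role of time_decay_coefficient (assumed)

# Toy (user, item, rating, timestamp-in-seconds) events, not MovieLens data.
events = [
    ("u1", "A", 5.0, 0),
    ("u1", "B", 3.0, 30 * SECONDS_PER_DAY),
    ("u2", "A", 4.0, 30 * SECONDS_PER_DAY),
    ("u2", "B", 2.0, 30 * SECONDS_PER_DAY),
    ("u2", "C", 5.0, 30 * SECONDS_PER_DAY),
    ("u3", "B", 4.0, 30 * SECONDS_PER_DAY),
    ("u3", "C", 1.0, 30 * SECONDS_PER_DAY),
]


def jaccard_similarity(events):
    """sim[i, j] = |users(i) & users(j)| / |users(i) | users(j)|."""
    users_by_item = defaultdict(set)
    for user, item, _rating, _ts in events:
        users_by_item[item].add(user)
    sim = {}
    for i, users_i in users_by_item.items():
        for j, users_j in users_by_item.items():
            sim[i, j] = len(users_i & users_j) / len(users_i | users_j)
    return sim


def decayed_affinity(events, half_life_days, now=None):
    """affinity[user][item] = sum of rating * 0.5 ** (age_days / half_life)."""
    if now is None:
        now = max(event[3] for event in events)  # assumed reference time
    affinity = defaultdict(dict)
    for user, item, rating, ts in events:
        age_days = (now - ts) / SECONDS_PER_DAY
        weight = rating * 0.5 ** (age_days / half_life_days)
        affinity[user][item] = affinity[user].get(item, 0.0) + weight
    return affinity


def recommend(user, affinity, similarity, top_k=10):
    """Score unseen items as sum over seen items of affinity * similarity."""
    seen = affinity[user]
    candidates = {item for item, _ in similarity} - set(seen)
    scores = {
        cand: sum(w * similarity[item, cand] for item, w in seen.items())
        for cand in candidates
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


similarity = jaccard_similarity(events)
affinity = decayed_affinity(events, HALF_LIFE_DAYS)
print(recommend("u1", affinity, similarity))
```

In this toy data, u1's rating of A is exactly one half-life old, so it counts for half its value, and the only unseen item for u1 is C. Excluding already-rated items from the candidates mirrors the effect of `remove_seen=True`.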


### 5. Evaluate how well SAR performs 

In [19]:
header_eval = {
    "col_user":"UserId", 
    "col_item":"MovieId", 
    "col_rating":"Rating", 
    "col_prediction":"score"
}

pd.DataFrame({
    'metrics': ['Top K', 'MAP', 'NDCG', 'Precision@K', 'Recall@K'],
    'value': [
        10,
        evaluation.map_at_k(test, top_k, **header_eval),
        evaluation.ndcg_at_k(test, top_k, **header_eval),
        evaluation.precision_at_k(test, top_k, **header_eval),
        evaluation.recall_at_k(test, top_k, **header_eval)
    ]
})

|   | metrics | value |
|---|---------|-------|
| 0 | Top K | 10.0 |
| 1 | MAP | 0.105815 |
| 2 | NDCG | 0.373197 |
| 3 | Precision@K | 0.326617 |
| 4 | Recall@K | 0.175957 |
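To make the Precision@K and Recall@K rows concrete, here is a minimal sketch of how the two metrics are commonly defined: per user, count the recommended items that also appear in the test set, divide by K for precision and by the number of test items for recall, then average over users. The function `precision_recall_at_k` is hypothetical, and the routines in `reco_utils` may differ in details such as which users are included in the average.

```python
def precision_recall_at_k(relevant, recommended, k=10):
    """Mean precision@k and recall@k over users present in both inputs.

    relevant:    dict user -> set of items in the test set
    recommended: dict user -> list of items ranked best first
    """
    users = [user for user in relevant if user in recommended]
    precisions, recalls = [], []
    for user in users:
        hits = len(set(recommended[user][:k]) & relevant[user])
        precisions.append(hits / k)
        recalls.append(hits / len(relevant[user]))
    return sum(precisions) / len(users), sum(recalls) / len(users)


# Toy example with k=4.
relevant = {"u1": {"A", "B", "C", "D"}, "u2": {"E"}}
recommended = {"u1": ["A", "X", "B", "Y"], "u2": ["Z", "E", "W", "V"]}
precision, recall = precision_recall_at_k(relevant, recommended, k=4)
print(precision, recall)
```

Here u1 has two hits out of four recommendations and four relevant items, while u2 has one hit and a single relevant item, which shows how a user with few test ratings can reach high recall with low precision.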
