# Running SAR on MovieLens (Python)

SAR is a fast, scalable, adaptive algorithm for personalized recommendations based on user transaction history and item descriptions. It produces recommendations that are easy to explain and interpret, and it handles "cold item" and "semi-cold user" scenarios.

This notebook shows how to run and evaluate SAR in Python on a CPU.

In [6]:
# set the environment path to find Recommenders
import sys
sys.path.append("../../")

from reco_utils.recommender.sar.sar_singlenode import SARSingleNodeReference
from reco_utils.dataset.url_utils import maybe_download
from reco_utils.dataset.python_splitters import python_random_split
from reco_utils.evaluation.python_evaluation import PythonRatingEvaluation, PythonRankingEvaluation

import itertools
import pandas as pd

print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))

System version: 3.6.0 | packaged by conda-forge | (default, Feb 10 2017, 07:08:35) 
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)]
Pandas version: 0.23.4


### 1. Download the MovieLens dataset

In [3]:
filepath = maybe_download("http://files.grouplens.org/datasets/movielens/ml-100k/u.data", "ml-100k.data")

In [4]:
data = pd.read_csv(filepath, sep="\t", names=["UserId", "MovieId", "Rating", "Timestamp"])
data.head()

| | UserId | MovieId | Rating | Timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |


### 2. Split the data using the Python random splitter provided in reco_utils

In [7]:
train, test = python_random_split(data)
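As an illustration of what a random split does, here is a minimal sketch in plain pandas. It is not the `reco_utils` implementation: the helper name `random_split_sketch`, the 0.75 ratio, and the seed are all assumptions chosen for this example.

```python
import pandas as pd


def random_split_sketch(df, ratio=0.75, seed=42):
    """Shuffle the rows, then cut them into train/test parts by `ratio`.

    Hypothetical helper for illustration; ratio and seed are assumed values.
    """
    shuffled = df.sample(frac=1.0, random_state=seed)
    cut = int(round(ratio * len(shuffled)))
    return shuffled.iloc[:cut], shuffled.iloc[cut:]


# Toy ratings table with the same column names as the MovieLens frame.
toy = pd.DataFrame({
    "UserId": [1, 1, 2, 2, 3, 3, 4, 4],
    "MovieId": [10, 20, 10, 30, 20, 30, 10, 40],
    "Rating": [5, 3, 4, 2, 1, 5, 3, 4],
})

toy_train, toy_test = random_split_sketch(toy)
print(len(toy_train), len(toy_test))
```

Because the split is over rows rather than users, a user can end up with ratings in both parts, which is what the ranking evaluation later relies on.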

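Next, map the dataset's column names to the model's arguments and instantiate SAR with Jaccard similarity and a time decay coefficient of 30.

In [8]: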
header = {
    "col_user": "UserId",
    "col_item": "MovieId",
    "col_rating": "Rating",
    "col_timestamp": "Timestamp",
}

model = SARSingleNodeReference(
    remove_seen=True,
    similarity_type="jaccard",
    time_decay_coefficient=30,
    time_now=None,
    timedecay_formula=True,
    **header
)
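To give a feel for the two settings above, the following sketch shows one common way to compute a time-decayed rating and an item-item Jaccard similarity with NumPy. It is not the library's code. In particular, treating `time_decay_coefficient` as a half-life in days is an assumption, and the function names and toy data are hypothetical.

```python
import numpy as np

SECONDS_PER_DAY = 24 * 60 * 60


def time_decayed_rating(rating, timestamp, time_now, half_life_days=30):
    """Halve a rating's weight for every `half_life_days` of age.

    Assumed formula for illustration: rating * 0.5 ** (age / half_life).
    """
    age_days = (time_now - timestamp) / SECONDS_PER_DAY
    return rating * 0.5 ** (age_days / half_life_days)


def jaccard_similarity(interactions):
    """Item-item Jaccard similarity from a binary user x item matrix."""
    # cooccurrence[i, j] = number of users who interacted with both items.
    cooccurrence = interactions.T @ interactions
    item_counts = np.diag(cooccurrence)
    # Union size = count(i) + count(j) - intersection size.
    union = item_counts[:, None] + item_counts[None, :] - cooccurrence
    return cooccurrence / union


# A rating of 4 that is exactly one half-life (30 days) old.
now = 881250949
decayed = time_decayed_rating(4, now - 30 * SECONDS_PER_DAY, now)
print(decayed)

# Rows are users, columns are items; every item has at least one user,
# so the union is never zero in this toy example.
toy_interactions = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [1, 0, 0],
])
jaccard = jaccard_similarity(toy_interactions)
print(np.round(jaccard, 3))
```

Jaccard normalizes raw co-occurrence counts by how popular each item is, so very popular items do not dominate the similarity matrix.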

### 3. Hash users and items for SAR

In [9]:
unique_users = data["UserId"].unique()
unique_items = data["MovieId"].unique()

We hash users and items to a smaller, contiguous index space: each unique ID is assigned an integer position, in order of first appearance.
This keeps the matrices held in memory as small as possible.

In [10]:
enumerate_items_1, enumerate_items_2 = itertools.tee(enumerate(unique_items))
enumerate_users_1, enumerate_users_2 = itertools.tee(enumerate(unique_users))
item_map_dict = {x: i for i, x in enumerate_items_1}
user_map_dict = {x: i for i, x in enumerate_users_1}

The reverse of the dictionaries above maps each array index back to the original ID.


In [11]:
index2user = dict(enumerate_users_2)
index2item = dict(enumerate_items_2)
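The forward and reverse mappings can be checked on a small example. This is a sketch with made-up IDs and hypothetical variable names (`id_to_index`, `index_to_id`); it only illustrates the round trip, not anything specific to SAR.

```python
import pandas as pd

# A few raw user IDs with repeats, in arbitrary order.
raw_ids = [196, 186, 22, 196, 244, 22]

# Series.unique() keeps first-seen order, so indices are stable.
unique_ids = pd.Series(raw_ids).unique()

id_to_index = {raw: idx for idx, raw in enumerate(unique_ids)}
index_to_id = dict(enumerate(unique_ids))

# Encode raw IDs as contiguous indices, then decode them back.
encoded = [id_to_index[raw] for raw in raw_ids]
decoded = [int(index_to_id[idx]) for idx in encoded]

print(encoded)
print(decoded == raw_ids)
```

Four distinct IDs map to the indices 0 through 3, so a matrix indexed this way needs only four rows regardless of how large the raw IDs are.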

We need to index the train and test sets for SAR's matrix operations to work, so we pass the mappings to the model:

In [12]:
model.set_index(unique_users, unique_items, user_map_dict, item_map_dict, index2user, index2item)

### 4. Train the SAR model on the training data and get the top-k recommendations for the test data

In [13]:
model.fit(train)
top_k = model.recommend_k_items(test)

INFO:reco_utils.recommender.sar.sar_singlenode:Collecting user affinity matrix...
INFO:reco_utils.recommender.sar.sar_singlenode:Calculating time-decayed affinities...
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df["exponential"] = expo_fun(df[self.col_timestamp].values)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df["rating_exponential"] = df[self.col_rating] * df["exponential"]
INFO:reco_utils.recommender.sar.sar_singlenode:Creating index columns...
  self.index = df.as_matrix([self._col_hashed_users, self._col_hashed_items])
INFO:reco_utils.recommender.sar.sar_singlenode:Building us
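
Conceptually, SAR scores items for each user by multiplying a user-item affinity matrix by an item-item similarity matrix, optionally masking items the user has already seen, and keeping the k highest scores. The sketch below illustrates that idea with NumPy on toy matrices. The function name `recommend_top_k_sketch` and all values are hypothetical, and the library's internals may differ.

```python
import numpy as np

# Toy affinity matrix (users x items) and similarity matrix (items x items).
affinity = np.array([
    [5.0, 0.0, 0.0, 3.0],
    [0.0, 4.0, 0.0, 0.0],
])
similarity = np.array([
    [1.0, 0.5, 0.2, 0.0],
    [0.5, 1.0, 0.0, 0.1],
    [0.2, 0.0, 1.0, 0.4],
    [0.0, 0.1, 0.4, 1.0],
])


def recommend_top_k_sketch(affinity, similarity, k=2, remove_seen=True):
    """Return (item indices, scores) of the k best items per user."""
    scores = affinity @ similarity
    if remove_seen:
        # Items with non-zero affinity were already seen; exclude them.
        scores = np.where(affinity > 0, -np.inf, scores)
    # Sort descending by score and keep the first k columns.
    top_items = np.argsort(-scores, axis=1)[:, :k]
    top_scores = np.take_along_axis(scores, top_items, axis=1)
    return top_items, top_scores


top_items, top_scores = recommend_top_k_sketch(affinity, similarity)
print(top_items)
print(top_scores)
```

With `remove_seen=True`, the first toy user is never recommended items 0 or 3, which mirrors the `remove_seen=True` argument passed to the model earlier.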

In [14]:
# TODO: remove this call when the model returns same type as input
top_k['UserId'] = pd.to_numeric(top_k['UserId'])
top_k['MovieId'] = pd.to_numeric(top_k['MovieId'])

In [15]:
display(top_k.head())

| | UserId | MovieId | prediction |
|---|---|---|---|
| 5787 | 796 | 234 | 155.419422 |
| 5786 | 796 | 216 | 154.755182 |
| 5789 | 796 | 174 | 154.09321 |
| 5788 | 796 | 566 | 153.102244 |
| 5785 | 796 | 79 | 152.824891 |


### 5. Evaluate how well SAR performs 

In [16]:
test.head()

| | UserId | MovieId | Rating | Timestamp | hashedUsers |
|---|---|---|---|---|---|
| 42083 | 600 | 651 | 4 | 888451492 | 598 |
| 71825 | 607 | 494 | 5 | 883879556 | 604 |
| 99535 | 875 | 1103 | 5 | 876465144 | 869 |
| 47879 | 648 | 238 | 3 | 882213535 | 644 |
| 36734 | 113 | 273 | 4 | 875935609 | 146 |


In [17]:
rank_eval = PythonRankingEvaluation(test, top_k, col_user="UserId", col_item="MovieId", 
                                    col_rating="Rating", col_prediction="prediction", 
                                    relevancy_method="top_k")

In [18]:
print("Model:\t" + model.model_str,
      "Top K:\t%d" % rank_eval.top_k,
      "MAP:\t%f" % rank_eval.map_at_k(),
      "NDCG:\t%f" % rank_eval.ndcg_at_k(),
      "Precision@K:\t%f" % rank_eval.precision_at_k(),
      "Recall@K:\t%f" % rank_eval.recall_at_k(), sep='\n')

Model:	sar_ref
Top K:	10
MAP:	0.105815
NDCG:	0.373197
Precision@K:	0.326617
Recall@K:	0.175957
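
For intuition about the last two metrics, here is a minimal single-user sketch of precision@k and recall@k. The helper `precision_recall_at_k` and the item IDs are hypothetical. The evaluator used above aggregates over all users, and its exact definitions may differ from this simplified version.

```python
def precision_recall_at_k(recommended, relevant, k=10):
    """Precision@k and recall@k for one user.

    recommended: item IDs ordered from best to worst prediction.
    relevant: set of item IDs the user actually interacted with in the test set.
    """
    top_k_items = recommended[:k]
    hits = len(set(top_k_items) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example: 2 of the 5 recommended items appear among 3 relevant items.
recommended_items = [234, 216, 174, 566, 79]
relevant_items = {216, 79, 651}

precision, recall = precision_recall_at_k(recommended_items, relevant_items, k=5)
print("Precision@5:\t%f" % precision)
print("Recall@5:\t%f" % recall)
```

Precision asks how many of the recommended items were relevant, while recall asks how many of the relevant items were recommended, so a fixed k tends to trade one against the other.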
