# SAR Single Node

In this example, we will walk through each step of our Augmented SAR algorithm using a Python single-node implementation.

Microsoft: "SAR is a fast, scalable, adaptive algorithm for personalized recommendations based on user transaction history. It is powered by understanding the similarity between items, and recommending similar items to those a user has an existing affinity for."

(Figures, diagrams, and descriptions are taken from Microsoft/Recommenders [1].)

## 1 SAR algorithm

The following figure presents a high-level architecture of SAR. 

At a very high level, two intermediate matrices are created and used to generate a set of recommendation scores:

- An item similarity matrix $S$ estimates item-item relationships.
- An affinity matrix $A$ estimates user-item relationships.

Recommendation scores are then created by computing the matrix multiplication $A\times S$ and then removing seen items.

<img src="https://recodatasets.blob.core.windows.net/images/sar_schema.svg?sanitize=true">
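
The scoring step can be illustrated with a small NumPy sketch. The matrices below are made-up toy values, not output from our model, and the masking with `-np.inf` is only one possible way to remove seen items; the actual implementation in `sar_singlenode.py` may differ.

```python
import numpy as np

# Toy affinity matrix A (3 users x 4 items); 0 means the user has not rated the item.
A = np.array([
    [5.0, 0.0, 3.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 2.0],
])

# Toy symmetric item similarity matrix S (4 items x 4 items).
S = np.array([
    [1.0, 0.2, 0.5, 0.0],
    [0.2, 1.0, 0.1, 0.3],
    [0.5, 0.1, 1.0, 0.4],
    [0.0, 0.3, 0.4, 1.0],
])

# Recommendation scores: one row per user, one column per item.
scores = A @ S

# Remove seen items by masking every item the user already has an affinity for.
scores[A > 0] = -np.inf

# Indices of the top-2 unseen items per user, best first.
top_2 = np.argsort(-scores, axis=1)[:, :2]
print(top_2)  # rows: [1 3], [3 0], [2 1]
```

For the first user, items 0 and 2 are already seen, so only items 1 and 3 remain as candidates, ranked by their summed similarity to the items that user rated.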

## 2 SAR single-node implementation

The SAR implementation illustrated in this notebook is written in Python, primarily using `numpy`, `pandas`, and `scipy`, which are commonly used in data analytics and machine learning tasks. Details of the implementation can be found in [meal_recommender/reco_utils_2/recommender/sar/sar_singlenode.py](../reco_utils_2/recommender/sar/sar_singlenode.py).

## 3 SAR single-node based recipe recommender

In [3]:
# set the environment path to find Recommenders
import sys
sys.path.append("../")

import itertools

import logging
import os

import numpy as np
import pandas as pd
import papermill as pm

#from reco_utils.dataset import movielens
from reco_utils.dataset.python_splitters import python_stratified_split
from reco_utils.evaluation.python_evaluation import map_at_k, ndcg_at_k, precision_at_k, recall_at_k

print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))

System version: 3.7.4 (v3.7.4:e09359112e, Jul  8 2019, 14:54:52) 
[Clang 6.0 (clang-600.0.57)]
Pandas version: 0.25.1


In [25]:
TOP_K = 10 # top k items to recommend

### 3.1 Load Data

Our Augmented SAR algorithm is intended to be used on interactions with the following schema:
`<username>, <recipe_id>, <rating>`. 

Each row represents a single interaction between a user and an item, which for our purposes is just a rating of a recipe. 

We use our synthetic users dataset, as described in the final paper. The user generation code can be found in [meal_recommender/data_processing/user_script.ipynb](../data_processing/user_script.ipynb).

The review datasets were generated by sampling recipes from each cuisine in our large dataset. The recipes DB generation code can be found in [meal_recommender/data_processing/synthesize_cuisine.ipynb](../data_processing/synthesize_cuisine.ipynb).

In [26]:
data_dir = '../data'
data_type = '/synthetic' #original (scraped) or synthetic (generated)
data_path = data_dir + data_type
review_ratio = .5 # NOTE: not used below; the review filename hardcodes a review ratio of 30
n_users = 10
n_recipes_per_cuisine = 30

review_fn = '/reviews/{}_users_30_reviewratio.csv'.format(n_users) # DB with n_users who rate 30% of all recipes
features_fn = '/recipes/cuisine_size_{}.csv'.format(n_recipes_per_cuisine) # DB of recipes with n_recipes_per_cuisine per cuisine

In [6]:
data = pd.read_csv(data_path + review_fn)
features = pd.read_csv(data_path + features_fn)

# only keep recipes with reviews, one row per recipe
features = pd.merge(features, data, on='recipe_id', how="inner")
features = features[["recipe_id", "cuisine", "clean_ingredients"]]
features = features.drop_duplicates().reset_index(drop=True)

# Convert ingredients column to list
features["clean_ingredients"] = features["clean_ingredients"].apply(lambda a : a.split("+"))

# Convert the ratings to 32-bit floats in order to reduce memory consumption
# (plain column assignment, so that the new dtype actually replaces the old one)
data['rating'] = data['rating'].astype(np.float32)


In [7]:
features.head()

|  | recipe_id | cuisine | clean_ingredients |
|---|---|---|---|
| 0 | `Pork-Schnitzels-778938` | German | [crumbs, chops, pork, bread, panko, flour, egg... |
| 1 | `Runzas-_Bierocks_-1601298` | German | [onion, dinner, still, rolls, cabbage, thawed,... |
| 2 | `Sauerbraten-Beef-in-Gingersnap-Gravy-1540117` | German | [onion, cookies, cider, spaetzle, meat, bay, w... |
| 3 | `Sauerbraten-1340986` | German | [onion, apple, bottom, cider, bay, cloves, pep... |
| 4 | `Sauerkraut-Chickpea-Flour-Ravioli-_-Spiced-App...` | German | [onion, husk, powder, unsweetened, psyllium, s... |


In [8]:
data.head()

|  | Unnamed: 0 | username | recipe_id | rating |
|---|---|---|---|---|
| 0 | 6 | german_luver_439 | `Sauerbraten-1500604` | 5.0 |
| 1 | 1 | german_luver_439 | `Runzas-_Bierocks_-1601298` | 5.0 |
| 2 | 3 | german_luver_439 | `Sauerbraten-Beef-in-Gingersnap-Gravy-1540117` | 5.0 |
| 3 | 17 | german_luver_439 | `Chili-shrimp-and-asparagus-stir-fry-352470` | 5.0 |
| 4 | 11 | german_luver_439 | `Asian-Style-Scallops-I-Adore-Food-55293` | 4.0 |


In [11]:
data.describe()

|  | Unnamed: 0 | rating |
|---|---|---|
| count | 630.0 | 630.0 |
| mean | 104.622222 | 2.996825 |
| std | 60.708196 | 1.480125 |
| min | 0.0 | 1.0 |
| 25% | 53.0 | 2.0 |
| 50% | 104.0 | 3.0 |
| 75% | 157.0 | 4.0 |
| max | 209.0 | 5.0 |


In [12]:
features.describe()

|  | recipe_id | cuisine | clean_ingredients |
|---|---|---|---|
| count | 203 | 203 | 203 |
| unique | 203 | 21 | 199 |
| top | `Easy-french-ratatouille-308073` | Moroccan | [miso, black, white, fillets, sake, paste, sug... |
| freq | 1 | 10 | 2 |


### 3.2 Split the data using the Python stratified splitter provided in utilities

We split the full dataset into `train` and `test` sets to evaluate the performance of the algorithm against held-out data not seen during training. Because SAR generates recommendations based on user preferences, every user in the test set must also exist in the training set. For this case, we can use the `python_stratified_split` function, which holds out a percentage (in this case 20%) of items from each user while ensuring that all users appear in both the `train` and `test` datasets.
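
The idea of a per-user split can be sketched in a few lines of pandas. This is an illustrative approximation, not the implementation of `python_stratified_split`; the helper name `split_per_user` and the toy data are hypothetical, and `DataFrameGroupBy.sample` requires pandas 1.1 or later (newer than the version used in this notebook).

```python
import pandas as pd

def split_per_user(df, ratio=0.8, col_user="username", seed=42):
    """Hold out roughly (1 - ratio) of each user's rows for testing."""
    # Sample the training rows independently within each user's group.
    train = df.groupby(col_user, group_keys=False).sample(frac=ratio, random_state=seed)
    # Every row that was not sampled goes to the test set.
    test = df.drop(train.index)
    return train, test

# Made-up interactions: 2 users with 5 ratings each.
toy = pd.DataFrame({
    "username": ["u1"] * 5 + ["u2"] * 5,
    "recipe_id": [f"r{i}" for i in range(10)],
    "rating": [5, 4, 3, 2, 1, 1, 2, 3, 4, 5],
})

toy_train, toy_test = split_per_user(toy)
print(toy_train["username"].value_counts().to_dict())
print(toy_test["username"].value_counts().to_dict())
```

Because the sampling happens inside each user's group, both toy users end up with four training rows and one test row, so no user is missing from either set.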

For training our Augmented SAR, we also needed every recipe to be rated at least once, so we created a `dummy` user who gives every recipe in the set a rating of 3.

In [18]:
header = {
    "col_user": "username",
    "col_item": "recipe_id",
    "col_rating": "rating",
    "col_timestamp": "date",
    "col_prediction": "Prediction",
}

In [19]:
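train, test = python_stratified_split(
    data, ratio=0.80, col_user=header["col_user"], col_item=header["col_item"], seed=42
)

# Add a "dummy" user who rates every recipe 3, so that every recipe appears in train.
# Built in one step with pd.concat, because DataFrame.append was removed in pandas 2.0.
dummy = pd.DataFrame({"username": "dummy", "recipe_id": features["recipe_id"], "rating": 3})
train = pd.concat([train, dummy], ignore_index=True, sort=False)
print(train.shape)
print(test.shape)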

(703, 4)
(130, 4)


In [21]:
from reco_utils_2.recommender.sar.sar_singlenode import SARSingleNode

logging.basicConfig(level=logging.DEBUG, 
                    format='%(asctime)s %(levelname)-8s %(message)s')

model = SARSingleNode(
    similarity_type="custom", 
    time_decay_coefficient=30, 
    time_now=None, 
    timedecay_formula=False,
    **header
)

jaccard = lambda a,b: len(set(a).intersection(set(b)))/len(set(a).union(set(b)))

model.fit(train, features, "recipe_id", {"ratings" : 0.5, "clean_ingredients" : (1.0, jaccard)})

2019-12-11 18:38:32,623 INFO     Collecting user affinity matrix
2019-12-11 18:38:32,625 INFO     De-duplicating the user-item counts
2019-12-11 18:38:32,629 INFO     Creating index columns
2019-12-11 18:38:32,635 INFO     Building user affinity sparse matrix
2019-12-11 18:38:32,637 INFO     Calculating item co-occurrence
2019-12-11 18:38:32,641 INFO     Calculating item similarity
2019-12-11 18:38:32,643 INFO     Done training
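
The `jaccard` function passed to `fit` above measures how much two ingredient lists overlap. The sketch below restates it as a named function with a worked example; the ingredient lists are made up, and the handling of two empty lists (returning 0.0) is an assumption that the one-line lambda does not make.

```python
def jaccard(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B| over the unique elements of a and b."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    # Assumption: two empty lists are treated as having similarity 0.0;
    # the one-line lambda would raise ZeroDivisionError instead.
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

# Made-up ingredient lists for illustration.
schnitzel = ["pork", "bread", "flour", "egg"]
meatballs = ["pork", "bread", "onion", "egg", "cream"]

print(jaccard(schnitzel, meatballs))  # 3 shared / 6 distinct = 0.5
```

A value of 1.0 means the two recipes use exactly the same ingredients, and 0.0 means they share none.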


In [22]:
from reco_utils_2.evaluation.custom_evaluation import accuracy_metric

for i in [1, 3, 5, 10]:
    absolute, relative = accuracy_metric(model, test, i)
    print(i, "absolute:", absolute, "\trelative:", relative)

2019-12-11 18:38:42,155 INFO     Calculating recommendation scores
2019-12-11 18:38:42,157 INFO     Removing seen items
2019-12-11 18:38:42,225 INFO     Calculating recommendation scores
2019-12-11 18:38:42,226 INFO     Removing seen items
2019-12-11 18:38:42,298 INFO     Calculating recommendation scores
2019-12-11 18:38:42,299 INFO     Removing seen items
2019-12-11 18:38:42,370 INFO     Calculating recommendation scores
2019-12-11 18:38:42,371 INFO     Removing seen items


1 absolute: 0.1 	relative: 1.3
3 absolute: 0.36666666666666664 	relative: 1.5888888888888888
5 absolute: 0.4 	relative: 1.04
10 absolute: 0.84 	relative: 1.0919999999999999


In [23]:
top_k = model.recommend_k_items(train, remove_seen=True)
top_k.head()

2019-12-11 18:38:48,351 INFO     Calculating recommendation scores
2019-12-11 18:38:48,352 INFO     Removing seen items


|  | username | recipe_id | Prediction |
|---|---|---|---|
| 0 | american_connoisseur_059 | `Cuban-Empanadas-1299337` | 27.223047 |
| 1 | american_connoisseur_059 | `Cuban-Shredded-Beef-894519` | 26.728396 |
| 2 | american_connoisseur_059 | `Chicken-Meatballs-with-Tomato-Balsamic-Glaze-O...` | 26.541204 |
| 3 | american_connoisseur_059 | `Beef-Masala-Curry-1181655` | 26.275679 |
| 4 | american_connoisseur_059 | `Moroccan-Spiced-Fish-1994862` | 26.192095 |


The `recommend_k_items` method returns a recommendation score for each user-item pair. The highest-scoring recommendations, sorted by user and score, are shown below.

In [24]:
top_k_with_titles = (top_k.join(data[['recipe_id']].drop_duplicates().set_index('recipe_id'), 
                                on='recipe_id', 
                                how='inner').sort_values(by=['username', 'Prediction'], ascending=False))
display(top_k_with_titles.head(10))

|  | username | recipe_id | Prediction |
|---|---|---|---|
| 90 | swedish_finesser_653 | `Beef-Masala-Curry-1181655` | 31.621847 |
| 91 | swedish_finesser_653 | `Cuban-Potato-Balls-_Papas-Rellenas_-481776` | 31.402047 |
| 92 | swedish_finesser_653 | `Slow-Cooker-Beef-_-Sweet-Potato-Tagine-985719` | 30.869323 |
| 93 | swedish_finesser_653 | `Swedish-Meatballs-1588508` | 30.80535 |
| 94 | swedish_finesser_653 | `Caprese-Meatballs-_Low-Carb-_-Gluten-Free_-483036` | 30.676917 |
| 95 | swedish_finesser_653 | `Hungarian-Goulash-1975672` | 30.424678 |
| 96 | swedish_finesser_653 | `Merlot-Pot-Roast-With-Horseradish-Smashed-Pota...` | 30.405387 |
| 97 | swedish_finesser_653 | `Cuban-Empanadas-1299337` | 30.045672 |
| 98 | swedish_finesser_653 | `Broiled-Spanish-Mackerel-AllRecipes-36799` | 29.913448 |
| 99 | swedish_finesser_653 | `English-Roast-Beef-Allrecipes` | 29.625631 |


### 3.3 Evaluate the results

Our evaluation functions can be found in [meal_recommender/experiment](../experiment).

## References 

1. https://github.com/microsoft/recommenders/blob/master/notebooks/00_quick_start/sar_movielens.ipynb