# SAR Deep Dive with Spark and sarplus

In this example, we will walk through each step of the SAR algorithm with an implementation using Spark and SQL.

Smart Adaptive Recommendations (SAR) is a fast, scalable, adaptive algorithm for personalized recommendations based on user transaction history and item descriptions. It works by measuring the **similarity** between items and recommending items similar to those for which a user already has an **affinity**.

# 0 Global Variables and Imports

In [6]:
import pandas as pd
import numpy as np
import heapq
import os
import pyspark.sql.functions as F
import sys
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.types import StructType, StructField, StringType, Row, ArrayType, IntegerType, FloatType

from pysarplus import SARPlus

print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))
print("PySpark version: {}".format(pyspark.__version__))

System version: 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0]
Pandas version: 0.23.4
PySpark version: 2.3.1


Set the default parameters.

In [3]:
# specify parameters
TOP_K=2
RECOMMEND_SEEN=True
# options are 'jaccard', 'lift' or '' to skip and use item cooccurrence directly
SIMILARITY='jaccard'

# 1 Load Data

We'll work with a small dataset here containing customer IDs, item IDs, and the customer's rating for the item. SAR requires inputs to be of the following schema: `<User ID>, <Item ID>, <Time>, [<Event Type>], [<Event Weight>]` (we will not use time or event type in the example below, and `rating` will be used as the `Event Weight`). 

In [4]:
# There are two versions of the dataframes - the numeric version and the alphanumeric one:
# they both have similar test data for top-2 recommendations and illustrate the indexing approaches to matrix multiplication on SQL
d_train = {
'customerID': [1,1,1,2,2,3,3],
'itemID':     [1,2,3,4,5,6,1],
'rating':     [5,5,5,1,1,3,5]
}
pdf_train = pd.DataFrame(d_train)
d_test = {
'customerID': [1,1,2,2,3,3],
'itemID':     [4,5,1,5,6,1],
'rating':     [1,1,5,5,5,5]
}
pdf_test = pd.DataFrame(d_test)

In [5]:
a_train = np.array([[5,5,5,0,0,0],
                    [0,0,0,1,1,0],
                    [5,0,0,0,0,3]])
print(a_train)
print(a_train.shape)

[[5 5 5 0 0 0]
 [0 0 0 1 1 0]
 [5 0 0 0 0 3]]
(3, 6)


In [6]:
d_alnum_train = {
'customerID': ['ua','ua','ua','ub','ub','uc','uc'],
'itemID':     ['ia','ib','ic','id','ie','if','ia'],
'rating':     [5,5,5,1,1,3,5]
}
#pdf_train = pd.DataFrame(d_alnum_train)
pdf_train = pd.DataFrame(d_train)
d_alnum_test = {
'customerID': ['ua','ua','ub','ub','uc','uc'],
'itemID':     ['id','ie','ia','ie','if','ia'],
'rating':     [1,1,5,5,5,5]
}
#pdf_test = pd.DataFrame(d_alnum_test)
pdf_test = pd.DataFrame(d_test)
pdf_test.head(10)

Unnamed: 0,customerID,itemID,rating
0,1,4,1
1,1,5,1
2,2,1,5
3,2,5,5
4,3,6,5
5,3,1,5


# 2 Set up Spark context

The following settings work well for debugging locally on a VM; change them when running on a cluster. We set up a single large executor with many threads and specify a memory cap.

In [7]:
SUBMIT_ARGS = "--packages eisber:sarplus:0.2.2 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS

spark = SparkSession \
    .builder \
    .appName("SAR pySpark") \
    .master("local[*]") \
    .config("spark.driver.memory", "4G") \
    .config("spark.sql.shuffle.partitions", "2") \
    .config("spark.sql.crossJoin.enabled", True) \
    .config("spark.ui.enabled", False) \
    .getOrCreate()

In [8]:
df = spark.createDataFrame(pdf_train).withColumn("type", F.lit(1))
df_test = spark.createDataFrame(pdf_test).withColumn("type", F.lit(0))
df.toPandas()

Unnamed: 0,customerID,itemID,rating,type
0,1,1,5,1
1,1,2,5,1
2,1,3,5,1
3,2,4,1,1
4,2,5,1,1
5,3,6,3,1
6,3,1,5,1


# 3 Compute Item Co-occurrence and Item Similarity

Central to how SAR defines similarity is an item-to-item ***co-occurrence matrix***. Co-occurrence counts the number of times two items appear together in the same user's history. We can represent the co-occurrence of all items as an $m \times m$ matrix $C$, where $m$ is the number of items and $c_{ij}$ is the number of times item $i$ occurred with item $j$.

The co-occurrence matrix $C$ has the following properties:
- It is symmetric, so $c_{ij} = c_{ji}$.
- It is nonnegative: $c_{ij} \geq 0$.
- Each item's occurrence count is at least as large as any of its co-occurrence counts. That is, the largest element in each row (and column) lies on the main diagonal: $\forall (i,j):\ c_{ii}, c_{jj} \geq c_{ij}$.
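
As a quick check of these properties, the sketch below builds $C$ for the toy training matrix with NumPy. It assumes that an item counts as occurring for a user whenever the rating is nonzero; the variable names are illustrative and are not part of sarplus.

```python
import numpy as np

# Toy user-item ratings (rows: customers 1-3, columns: items 1-6), same values as a_train.
ratings = np.array([[5, 5, 5, 0, 0, 0],
                    [0, 0, 0, 1, 1, 0],
                    [5, 0, 0, 0, 0, 3]])

# Assumption: any nonzero rating counts as one occurrence of the item for that user.
seen = (ratings > 0).astype(int)

# C[i, j] = number of users who have both item i and item j;
# the diagonal C[i, i] is the number of users who have item i.
C = seen.T @ seen
print(C)

diag = np.diag(C)
is_symmetric = bool((C == C.T).all())
is_nonnegative = bool((C >= 0).all())
# Every entry is bounded by the diagonal entries of its own row and column.
diagonal_dominates = bool((C <= np.minimum(diag[:, None], diag[None, :])).all())
print(is_symmetric, is_nonnegative, diagonal_dominates)
```

Item 1 appears for two customers, so its diagonal entry is 2, while every pair of distinct items is shared by at most one customer.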

Once we have a co-occurrence matrix, an ***item similarity matrix*** $S$ can be obtained by rescaling the co-occurrences according to a given metric. Options for the metric include Jaccard, lift, and counts (meaning no rescaling).

The rescaling formula for Jaccard is $s_{ij}=c_{ij} / (c_{ii}+c_{jj}-c_{ij})$

and that for lift is $s_{ij}=c_{ij}/(c_{ii} \times c_{jj})$

where $c_{ii}$ and $c_{jj}$ are the $i$th and $j$th diagonal elements of $C$. In general, using counts as a similarity metric favours predictability, meaning that the most popular items will be recommended most of the time. Lift by contrast favours discoverability/serendipity: an item that is less popular overall but highly favoured by a small subset of users is more likely to be recommended. Jaccard is a compromise between the two.
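
The two rescaling formulas can be tried out directly on the co-occurrence counts of the toy training data. This is a minimal NumPy sketch with illustrative variable names, not the sarplus implementation; it assumes every item occurs at least once, so the denominators are never zero.

```python
import numpy as np

# Co-occurrence counts for the toy training data (items 1-6),
# i.e. seen.T @ seen with seen = (a_train > 0).
C = np.array([[2, 1, 1, 0, 0, 1],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 1, 0],
              [1, 0, 0, 0, 0, 1]], dtype=float)

diag = np.diag(C)  # c_ii: number of occurrences of each item

# Jaccard: s_ij = c_ij / (c_ii + c_jj - c_ij)
jaccard = C / (diag[:, None] + diag[None, :] - C)

# Lift: s_ij = c_ij / (c_ii * c_jj)
lift = C / (diag[:, None] * diag[None, :])

print(jaccard[0])  # similarities of item 1 to items 1-6
print(lift[0])
```

Under Jaccard, item 1 has similarity 1.0 with itself and 0.5 with items 2, 3 and 6. Under lift, the popular item 1 is scaled down even against itself (0.5), which illustrates how lift penalizes popularity.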

In [10]:
model = SARPlus(spark, col_user='customerID', col_item='itemID', col_rating='rating')
model.fit(df, similarity_type=SIMILARITY)

model.item_similarity.toPandas()

INFO:sarplus:sarplus.fit 1/2: compute item cooccurences...
INFO:sarplus:sarplus.fit 2/2: compute similiarity metric jaccard...


Unnamed: 0,i1,i2,value
0,1,2,0.5
1,1,6,0.5
2,1,1,1.0
3,1,3,0.5
4,3,3,1.0
5,3,2,1.0
6,3,1,0.5
7,6,6,1.0
8,6,1,0.5
9,2,2,1.0


# 4.1 Compute User Affinity Scores

The affinity matrix in SAR captures the strength of the relationship between each individual user and each item. The event types and weights are used in computing this matrix: different event types (such as “rate” vs “view”) should be allowed to have an impact on a user’s affinity for an item. Similarly, the time of a transaction should have an impact; an event that takes place in the distant past can be thought of as being less important in determining the affinity.

Combining these effects gives us an expression for user-item affinity:
$a_{ij}=\sum_k w_k \exp\left[-\ln(2)\,\frac{t_0-t_k}{T}\right]=\sum_k w_k \left(\frac{1}{2}\right)^{\frac{t_0-t_k}{T}}$

where the affinity for user $i$ and item $j$ is the weighted sum over all events $k$ involving user $i$ and item $j$, $w_k$ is the weight of event $k$, and $t_k$ is the time at which it occurred. The $\ln 2$ factor means that the parameter $T$ in the exponential decay term can be treated as a half-life: events that occur a time $T$ before the reference date $t_0$ are given half the weight of those taking place at $t_0$.

Repeating this computation for all $n$ users and $m$ items results in an $n \times m$ matrix $A$.
Simplifications of the above expression can be obtained by setting all the weights equal to 1 (effectively ignoring event types), or by setting the half-life parameter $T$ to infinity (ignoring transaction times).
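
To make the half-life behaviour concrete, here is a small pure-Python sketch of the time-decayed affinity. The function name `user_item_affinity`, the event tuple layout, and the sample timestamps are hypothetical and chosen only for illustration; sarplus computes affinities in its own way.

```python
from collections import defaultdict


def user_item_affinity(events, t0, half_life):
    """Sum time-decayed event weights per (user, item) pair.

    events: iterable of (user, item, weight, timestamp) tuples (hypothetical layout).
    t0: reference time; half_life: the parameter T, in the same units as the timestamps.
    """
    affinity = defaultdict(float)
    for user, item, weight, timestamp in events:
        # Each elapsed half-life halves the contribution of the event.
        decay = 2.0 ** (-(t0 - timestamp) / half_life)
        affinity[(user, item)] += weight * decay
    return dict(affinity)


events = [
    ("ua", "ia", 5.0, 100.0),  # at the reference time: full weight
    ("ua", "ib", 5.0, 70.0),   # one half-life earlier: half weight
    ("ua", "ia", 1.0, 40.0),   # two half-lives earlier: quarter weight
]

decayed = user_item_affinity(events, t0=100.0, half_life=30.0)
undecayed = user_item_affinity(events, t0=100.0, half_life=float("inf"))
print(decayed)
print(undecayed)
```

With a half-life of 30 time units the affinities are 5.25 for `("ua", "ia")` and 2.5 for `("ua", "ib")`. Passing an infinite half-life ignores transaction times, so the affinities reduce to the plain sums of the weights.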


# 4.2 Remove Seen Items

Optionally, we remove items that the user has already seen in the training set, so that items the user previously bought are not recommended again.
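
One way to express this step, assuming the recommendation scores and the training affinities are available as dense NumPy arrays, is to mask the seen entries before ranking. This is a sketch of the idea rather than the sarplus implementation; the score values are illustrative and the choice to drop zero-score items is an assumption.

```python
import numpy as np

# Training affinities (rows: customers 1-3, columns: items 1-6).
affinity = np.array([[5, 5, 5, 0, 0, 0],
                     [0, 0, 0, 1, 1, 0],
                     [5, 0, 0, 0, 0, 3]], dtype=float)

# Illustrative recommendation scores with the same shape as the affinity matrix.
scores = np.array([[10.0, 12.5, 12.5, 0.0, 0.0, 2.5],
                   [0.0, 0.0, 0.0, 2.0, 2.0, 0.0],
                   [6.5, 2.5, 2.5, 0.0, 0.0, 5.5]])

# Seen items get a score of -inf so they can never reach the top k.
masked = np.where(affinity > 0, -np.inf, scores)

top_k = 2
recommendations = []
for u in range(masked.shape[0]):
    # Stable sort on the negated scores keeps ties in item order.
    best = np.argsort(-masked[u], kind="stable")[:top_k]
    for i in best:
        # Assumption: items with a zero score carry no signal and are dropped.
        if masked[u, i] > 0:
            recommendations.append((u + 1, int(i) + 1, float(masked[u, i])))

print(recommendations)
```

For these values, customer 1 is left with item 6, customer 3 with items 2 and 3, and customer 2 with no recommendation, because every unseen item has a score of zero.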

# 4.3 Top-K Item Calculation

The personalized recommendations for a set of users can then be obtained by multiplying the affinity matrix by the similarity matrix. The result is a recommendation score matrix, returned here in long form with one row per user/item pair; higher scores correspond to more strongly recommended items.
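
The matrix product can be sketched with NumPy on the toy data. The example assumes the training ratings are used directly as affinities (no time decay) and uses the Jaccard similarities of the toy items; variable names are illustrative and this is not how sarplus is implemented internally.

```python
import numpy as np

# Affinity matrix A (rows: customers 1-3, columns: items 1-6).
A = np.array([[5, 5, 5, 0, 0, 0],
              [0, 0, 0, 1, 1, 0],
              [5, 0, 0, 0, 0, 3]], dtype=float)

# Jaccard item similarity matrix S for the toy training data (items 1-6).
S = np.array([[1.0, 0.5, 0.5, 0.0, 0.0, 0.5],
              [0.5, 1.0, 1.0, 0.0, 0.0, 0.0],
              [0.5, 1.0, 1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
              [0.5, 0.0, 0.0, 0.0, 0.0, 1.0]])

# scores[u, j] = sum_i A[u, i] * S[i, j]
scores = A @ S
print(scores)

# Top-k item IDs per user (1-based), highest score first; ties keep item order.
top_k = 2
top_items = np.argsort(-scores, axis=1, kind="stable")[:, :top_k] + 1
print(top_items)
```

Without any filtering, the highest-scoring items for each customer are items they have already interacted with, which is why removing seen items is usually applied before taking the top k.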

In [15]:
model.recommend_k_items(df_test, cache_path='sar_deep_dive_cache', top_k=TOP_K)\
    .toPandas()

INFO:sarplus:sarplus.recommend_k_items 1/3: create item index
INFO:sarplus:sarplus.recommend_k_items 2/3: prepare similarity matrix
INFO:sarplus:sarplus.recommend_k_items 3/3: compute recommendations


Unnamed: 0,customerID,itemID,score
0,1,6,2.5
1,3,2,2.5
2,3,3,2.5
