# Running SAR on MovieLens (pySpark)

SAR is a fast, scalable, adaptive algorithm for personalized recommendations based on user transaction history and item descriptions. It produces recommendations that are easy to explain and interpret, and it handles "cold item" and "semi-cold user" scenarios.

This notebook shows how to use and evaluate SAR's pySpark implementation, which is meant for large-scale distributed datasets. We use a small dataset in this example so that SAR runs efficiently on a single Data Science Virtual Machine.

In [1]:
# set the environment path to find Recommenders
import sys
sys.path.append("../../")

from utilities.recommender.sar.sar_pyspark import SARpySparkReference
from utilities.dataset.url_utils import maybe_download
from utilities.dataset.spark_splitters import spark_random_split
from utilities.evaluation.spark_evaluation import SparkRatingEvaluation, SparkRankingEvaluation

import numpy as np

import pyspark
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import StringType, FloatType, IntegerType

print("System version: {}".format(sys.version))
print("Spark version: {}".format(pyspark.__version__))


System version: 3.5.5 |Anaconda custom (64-bit)| (default, May 13 2018, 21:12:35) 
[GCC 7.2.0]
Spark version: 2.3.1


### 0. Set up Spark context

In [2]:
# the following settings work well for debugging locally on a VM - change them when running on a cluster
# set up a giant single executor with many threads and specify memory cap
spark = SparkSession \
    .builder \
    .appName("SAR pySpark") \
    .master("local[*]") \
    .config("spark.driver.memory", "2g")\
    .config("spark.executor.cores", "32")\
    .config("spark.executor.memory", "8g")\
    .config("spark.yarn.executor.memoryOverhead", "3g")\
    .config("spark.memory.fraction", "0.9")\
    .config("spark.memory.storageFraction", "0.3")\
    .config("spark.executor.instances", 1)\
    .config("spark.executor.heartbeatInterval", "36000s")\
    .config("spark.network.timeout", "10000000s")\
    .config("spark.driver.maxResultSize", "50g")\
    .getOrCreate()


### 1. Download the MovieLens dataset

In [3]:
filepath = maybe_download("http://files.grouplens.org/datasets/movielens/ml-100k/u.data", "ml-100k.data")

In [4]:
schema = StructType((StructField("UserId", StringType()),
                       StructField("MovieId", StringType()),
                       StructField("Rating", FloatType()),
                       StructField("Timestamp", IntegerType())))
data = spark.read.csv(filepath, schema = schema, sep="\t", header=False)
data.show()

+------+-------+------+---------+
|UserId|MovieId|Rating|Timestamp|
+------+-------+------+---------+
|   196|    242|   3.0|881250949|
|   186|    302|   3.0|891717742|
|    22|    377|   1.0|878887116|
|   244|     51|   2.0|880606923|
|   166|    346|   1.0|886397596|
|   298|    474|   4.0|884182806|
|   115|    265|   2.0|881171488|
|   253|    465|   5.0|891628467|
|   305|    451|   3.0|886324817|
|     6|     86|   3.0|883603013|
|    62|    257|   2.0|879372434|
|   286|   1014|   5.0|879781125|
|   200|    222|   5.0|876042340|
|   210|     40|   3.0|891035994|
|   224|     29|   3.0|888104457|
|   303|    785|   3.0|879485318|
|   122|    387|   5.0|879270459|
|   194|    274|   2.0|879539794|
|   291|   1042|   4.0|874834944|
|   234|   1184|   2.0|892079237|
+------+-------+------+---------+
only showing top 20 rows



### 2. Split the data using the Spark random splitter provided in utilities

In [5]:
train, test = spark_random_split(data)
print("N train", train.count())
print("N test", test.count())

N train 75193
N test 24807


In [6]:
header = {
        "col_user": "UserId",
        "col_item": "MovieId",
        "col_rating": "Rating",
        "col_timestamp": "Timestamp",
    }

model = SARpySparkReference(
    spark=spark, remove_seen=True, similarity_type="jaccard",
    time_decay_coefficient=30, time_now=None, timedecay_formula=True, **header
)
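
The `time_decay_coefficient=30` argument is easiest to read as a half-life in days. The training query logged in step 4 weights each rating by `exp(-log(2) * (t_now - Timestamp) / (30 * 3600 * 24))`, so a rating loses half its weight for every 30 days of age. The following is a minimal pure-Python sketch of that weighting, not the library's code; the function name `time_decayed_affinity` is hypothetical and only illustrates the formula.

```python
import math

SECONDS_PER_DAY = 3600 * 24

def time_decayed_affinity(ratings, timestamps, t_now, half_life_days=30.0):
    """Sum the ratings, halving each one's weight per `half_life_days` of age."""
    total = 0.0
    for rating, ts in zip(ratings, timestamps):
        age_seconds = t_now - ts
        weight = math.exp(-math.log(2) * age_seconds / (half_life_days * SECONDS_PER_DAY))
        total += rating * weight
    return total

t_now = 893286638  # reference time that appears in the training log
fresh = time_decayed_affinity([4.0], [t_now], t_now)
month_old = time_decayed_affinity([4.0], [t_now - 30 * SECONDS_PER_DAY], t_now)
print(fresh, month_old)
```

In this sketch a rating of 4.0 given at the reference time keeps its full weight, while the same rating given 30 days earlier contributes about half as much.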

### 3. In order to use SAR, we need to hash users and items and make sure there are no cold users

In [7]:
# explicitly make sure we don't have cold users: keep only test users that also appear in train
train_set_users = {x[0] for x in train.select(header["col_user"]).distinct().collect()}
test_set_users = {x[0] for x in test.select(header["col_user"]).distinct().collect()}
both_sets = train_set_users.intersection(test_set_users)
test = test.filter(F.col(header["col_user"]).isin(both_sets))
print("N train", train.count())
print("N test", test.count())

N train 75193
N test 24807


#### Build uniform index

In [8]:
# we need to index item IDs which we want to score later, i.e. we need to consider all items
train = train.withColumn('type', F.lit(1))
test = test.withColumn('type', F.lit(0))
df_all = train.union(test)
df_all.createOrReplaceTempView("df_all")

# create new integer indices for the users (row_id) and the items (col_id)
query = "select " + header["col_user"] + ", " +\
    "dense_rank() over(partition by 1 order by " + header["col_user"] + ") as row_id, " +\
                    header["col_item"] + ", " +\
    "dense_rank() over(partition by 1 order by " + header["col_item"] + ") as col_id, " +\
        header["col_rating"] + ", " + header["col_timestamp"] + ", type from df_all"
print("Running query -- " + query)
df_all = spark.sql(query)
df_all.createOrReplaceTempView("df_all")

Running query -- select UserId, dense_rank() over(partition by 1 order by UserId) as row_id, MovieId, dense_rank() over(partition by 1 order by MovieId) as col_id, Rating, Timestamp, type from df_all
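
`dense_rank()` assigns consecutive integers, starting at 1, to the distinct values in sort order. Because `UserId` and `MovieId` were read as `StringType`, that order is lexicographic, so `"10"` sorts before `"2"`. This does not matter for SAR as long as the index is applied consistently, but it is worth knowing when inspecting `row_id` and `col_id` by hand. A pure-Python sketch of the same ranking follows; the helper `dense_rank` and the sample IDs are illustrative only.

```python
def dense_rank(values):
    """Map each value to its 1-based rank among the distinct sorted values."""
    rank_of = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
    return [rank_of[v] for v in values]

# String IDs, as in the schema above; duplicates share a rank.
user_ids = ["196", "22", "6", "196", "10"]
row_ids = dense_rank(user_ids)
print(list(zip(user_ids, row_ids)))
```

Here `"10"` receives rank 1 and `"6"` receives rank 4, which is the string ordering rather than the numeric one.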


#### Recover the original data, now with the index built in

In [9]:
print("Obtain the indexed dataframes")
query = "select row_id, col_id, " + header["col_rating"] + ", " + header["col_timestamp"] + " from df_all where type=1"
print("Running query -- " + query)
train_indexed = spark.sql(query)

query = "select row_id, col_id, " + header["col_rating"] + ", " + header["col_timestamp"] + " from df_all where type=0"
print("Running query -- " + query)
test_indexed = spark.sql(query)


Obtain the indexed dataframes
Running query -- select row_id, col_id, Rating, Timestamp from df_all where type=1
Running query -- select row_id, col_id, Rating, Timestamp from df_all where type=0


Build index mappings: IDs to index and index to IDs.

In [10]:
print("Obtaining all users and items ")
# Obtain all the users and items from both training and test data
unique_users =\
    np.array([x[header["col_user"]] for x in df_all.select(header["col_user"]).distinct().toLocalIterator()])
unique_items =\
    np.array([x[header["col_item"]] for x in df_all.select(header["col_item"]).distinct().toLocalIterator()])

print("Indexing users and items")
# index all rows and columns, then split again into train and test
# We perform the reduction on Spark across keys before calling .collect so this is scalable
index2user = \
    dict(df_all.select(["row_id", header["col_user"]]).rdd.reduceByKey(lambda _, v: v).collect())
index2item = \
    dict(df_all.select(["col_id", header["col_item"]]).rdd.reduceByKey(lambda _, v: v).collect())

# reverse the dictionaries: actual IDs to inner index
user_map_dict = {v: k for k, v in index2user.items()}
item_map_dict = {v: k for k, v in index2item.items()}

Obtaining all users and items 
Indexing users and items


Store the index values in the model object.

In [11]:
model.set_index(unique_users, unique_items, user_map_dict, item_map_dict, index2user, index2item)

### 4. Train the SAR model on our training data, and get the top-k recommendations for our testing data

In [12]:
model.fit(train_indexed)
top_k = model.recommend_k_items(test_indexed)

INFO:utilities.recommender.sar.sar_pyspark:Collecting user affinity matrix...
INFO:utilities.recommender.sar.sar_pyspark:Calculating time-decayed affinities...
INFO:utilities.recommender.sar.sar_pyspark:Running query -- select
            row_id, col_id, sum(Rating * exp(-log(2) * (893286638.000000 - Timestamp) / (30.000000 * 3600 * 24))) as Affinity
            from df_train group
            by
            row_id, col_id
INFO:utilities.recommender.sar.sar_pyspark:Calculating item cooccurrence...
INFO:utilities.recommender.sar.sar_pyspark:Calculating item similarity...
INFO:utilities.recommender.sar.sar_pyspark:Running query -- select A.row_item_id, A.col_item_id, (A.value/(B.d+C.d-A.value)) as value from item_cooccurrence as A, diagonal as B, diagonal as C where A.row_item_id = B.i and A.col_item_id=C.i
INFO:utilities.recommender.sar.sar_pyspark:Calculating recommendation scores...
INFO:utilities.recommender.sar.sar_pyspark:done training
INFO:utilities.recommender.sar.sar_pyspark:Rem
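
The item similarity query in the log above computes `A.value / (B.d + C.d - A.value)`, which is the Jaccard similarity expressed through the co-occurrence matrix: the number of users who touched both items, divided by the number who touched either. The sketch below reproduces that arithmetic on a toy interaction matrix in plain Python. It is an illustration of the formula under the assumption that the diagonal holds per-item user counts, not the library's implementation, and `jaccard_from_cooccurrence` is a hypothetical name.

```python
def jaccard_from_cooccurrence(cooc):
    """cooc[i][j] = number of users who interacted with both item i and item j."""
    n = len(cooc)
    diag = [cooc[i][i] for i in range(n)]  # users per item
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            union = diag[i] + diag[j] - cooc[i][j]
            sim[i][j] = cooc[i][j] / union if union else 0.0
    return sim

# Toy user-item interactions (rows: users, columns: items).
interactions = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
]
n_items = len(interactions[0])
cooc = [[sum(u[i] * u[j] for u in interactions) for j in range(n_items)]
        for i in range(n_items)]
similarity = jaccard_from_cooccurrence(cooc)
print(similarity)
```

Items 0 and 1 share two of the four users who touched either of them, so their similarity in this toy example is 0.5.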

In [13]:
top_k.show()

+------+-------+------------------+
|UserId|MovieId|        prediction|
+------+-------+------------------+
|   796|    204|165.31601890070047|
|   796|    186| 155.3215673077529|
|   796|    423|154.52656954379694|
|   551|    161|154.49870950687432|
|   796|     97|153.46520937913883|
|   551|    385|149.90269568537045|
|   551|    568|149.16227159140976|
|   551|     22| 148.6425848821096|
|   796|    655| 148.2438372962911|
|   551|    655|146.64785742078595|
|   796|    403|146.32392308408168|
|   796|    550|145.39668926104616|
|   796|    176| 145.2127404685524|
|   551|     64| 145.0752306738755|
|   796|     64| 145.0373327965255|
|   796|    195|145.01188955356974|
|   551|    173|144.29283951474824|
|   416|    204| 144.1174055165611|
|   551|    195| 143.7550428908026|
|   551|    176|141.62147020334285|
+------+-------+------------------+
only showing top 20 rows



### 5. Evaluate how well SAR performs 

In [14]:
test.show()

+------+-------+------+---------+----+
|UserId|MovieId|Rating|Timestamp|type|
+------+-------+------+---------+----+
|     1|     10|   3.0|875693118|   0|
|     1|    100|   5.0|878543541|   0|
|     1|    101|   2.0|878542845|   0|
|     1|    106|   4.0|875241390|   0|
|     1|    108|   5.0|875240920|   0|
|     1|    113|   5.0|878542738|   0|
|     1|    120|   1.0|875241637|   0|
|     1|    123|   4.0|875071541|   0|
|     1|    125|   3.0|878542960|   0|
|     1|    128|   4.0|875072573|   0|
|     1|    137|   5.0|875071541|   0|
|     1|    141|   3.0|878542608|   0|
|     1|    142|   2.0|878543238|   0|
|     1|    145|   2.0|875073067|   0|
|     1|    151|   4.0|875072865|   0|
|     1|    154|   5.0|878543541|   0|
|     1|    157|   4.0|876892918|   0|
|     1|    158|   3.0|878542699|   0|
|     1|    162|   4.0|878542420|   0|
|     1|    169|   5.0|878543541|   0|
+------+-------+------+---------+----+
only showing top 20 rows



In [15]:
rank_eval = SparkRankingEvaluation(test, top_k, col_user="UserId", col_item="MovieId", 
                                    col_rating="Rating", col_prediction="prediction", 
                                    relevancy_method="top_k")

In [16]:
print("Model:\t" + model.model_str,
      "Top K:\t%d" % rank_eval.k,
      "MAP:\t%f" % rank_eval.map_at_k(),
      "NDCG:\t%f" % rank_eval.ndcg_at_k(),
      "Precision@K:\t%f" % rank_eval.precision_at_k(),
      "Recall@K:\t%f" % rank_eval.recall_at_k(), sep='\n')

Model:	sar_pyspark
Top K:	10
MAP:	0.110564
NDCG:	0.378465
Precision@K:	0.330786
Recall@K:	0.185000
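
To make the last two numbers concrete, here is a minimal per-user sketch of precision@k and recall@k in plain Python. It uses the textbook definitions on made-up item IDs; `SparkRankingEvaluation` may define relevance differently (for example through `relevancy_method="top_k"`) and averages over users, so this sketch is not expected to reproduce the values above. The function name `precision_recall_at_k` is hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k=10):
    """Per-user precision@k and recall@k from a ranked list and a relevant set."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

recommended = ["a", "b", "c", "d", "e"]  # toy ranked list, best first
relevant = {"b", "e", "x", "y"}          # toy held-out items for this user
precision, recall = precision_recall_at_k(recommended, relevant, k=5)
print(precision, recall)
```

Two of the five recommendations are relevant, and two of the four relevant items were recommended, so precision is 0.4 and recall is 0.5 for this toy user.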
