# Running ALS on MovieLens (pySpark)

ALS (Alternating Least Squares) is a well-known collaborative filtering algorithm.

This notebook shows how to use and evaluate the ALS implementation in pySpark ML, which is designed for large-scale distributed datasets. We use a small dataset in this example so that ALS runs efficiently on a Data Science Virtual Machine.
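For intuition about what "alternating least squares" means, here is a minimal NumPy sketch on a tiny dense matrix. It is not Spark's distributed implementation; the function name `als_sketch`, the toy matrix, and the hyperparameters are all made up for illustration. The idea is to hold the item factors fixed and solve a small ridge regression per user, then hold the user factors fixed and do the same per item, and repeat.

```python
import numpy as np

def als_sketch(R, mask, rank=2, reg=0.1, n_iters=20, seed=0):
    """Toy ALS: approximate R by U @ V.T using only entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)
    for _ in range(n_iters):
        # Fix V: each user's factors are the solution of a small ridge regression.
        for u in range(n_users):
            Vo = V[mask[u]]
            U[u] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ R[u, mask[u]])
        # Fix U: solve the symmetric problem for each item.
        for i in range(n_items):
            Uo = U[mask[:, i]]
            V[i] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ R[mask[:, i], i])
    return U, V

# Hypothetical 5-user x 4-item rating matrix; 0 marks a missing rating.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)
mask = R > 0

U, V = als_sketch(R, mask)
rmse = np.sqrt(np.mean((R[mask] - (U @ V.T)[mask]) ** 2))
print("Training RMSE on observed entries: {:.3f}".format(rmse))
```

Each inner solve is independent of the others in the same half-step, which is what makes the method easy to parallelize across users and items.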

In [1]:
# set the environment path to find Recommenders
import sys
sys.path.append("../../")

from reco_utils.dataset.url_utils import maybe_download
from reco_utils.dataset.spark_splitters import spark_random_split
from reco_utils.evaluation.spark_evaluation import SparkRatingEvaluation, SparkRankingEvaluation

from pyspark.ml.recommendation import ALS

import numpy as np

import pyspark
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import StringType, FloatType, IntegerType

print("System version: {}".format(sys.version))
print("Spark version: {}".format(pyspark.__version__))


System version: 3.6.0 | packaged by conda-forge | (default, Feb  9 2017, 14:36:55) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
Spark version: 2.3.1


Set the default parameters.

In [2]:
# top k items to recommend
TOP_K = 10


### 0. Set up Spark context

The following settings work well for debugging locally on a VM; change them when running on a cluster. We set up a single large executor with many threads and cap its memory.

In [3]:
# the following settings work well for debugging locally on a VM; change them when running on a cluster
# set up a single large executor with many threads and cap its memory
spark = SparkSession \
    .builder \
    .appName("ALS pySpark") \
    .master("local[*]") \
    .config("spark.driver.memory", "2g")\
    .config("spark.executor.cores", "32")\
    .config("spark.executor.memory", "8g")\
    .config("spark.yarn.executor.memoryOverhead", "3g")\
    .config("spark.memory.fraction", "0.9")\
    .config("spark.memory.storageFraction", "0.3")\
    .config("spark.executor.instances", 1)\
    .config("spark.executor.heartbeatInterval", "36000s")\
    .config("spark.network.timeout", "10000000s")\
    .config("spark.driver.maxResultSize", "50g")\
    .getOrCreate()


### 1. Download the MovieLens dataset

In [4]:
filepath = maybe_download("http://files.grouplens.org/datasets/movielens/ml-100k/u.data", "ml-100k.data")

In [9]:
schema = StructType(
    (
        StructField("UserId", IntegerType()),
        StructField("MovieId", IntegerType()),
        StructField("Rating", FloatType()),
        StructField("Timestamp", IntegerType()),
    )
)
data = spark.read.csv(filepath, schema=schema, sep="\t", header=False)
data.show()

+------+-------+------+---------+
|UserId|MovieId|Rating|Timestamp|
+------+-------+------+---------+
|   196|    242|   3.0|881250949|
|   186|    302|   3.0|891717742|
|    22|    377|   1.0|878887116|
|   244|     51|   2.0|880606923|
|   166|    346|   1.0|886397596|
|   298|    474|   4.0|884182806|
|   115|    265|   2.0|881171488|
|   253|    465|   5.0|891628467|
|   305|    451|   3.0|886324817|
|     6|     86|   3.0|883603013|
|    62|    257|   2.0|879372434|
|   286|   1014|   5.0|879781125|
|   200|    222|   5.0|876042340|
|   210|     40|   3.0|891035994|
|   224|     29|   3.0|888104457|
|   303|    785|   3.0|879485318|
|   122|    387|   5.0|879270459|
|   194|    274|   2.0|879539794|
|   291|   1042|   4.0|874834944|
|   234|   1184|   2.0|892079237|
+------+-------+------+---------+
only showing top 20 rows



### 2. Split the data using the Spark random splitter provided in utilities

In [10]:
train, test = spark_random_split(data, ratio=0.75, seed=123)
print("N train", train.count())
print("N test", test.count())

N train 75193
N test 24807
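The training set above has 75193 rows rather than exactly 75000. That is what one would expect if the splitter assigns each row independently with probability `ratio`, which is how Spark's `DataFrame.randomSplit` behaves; whether `spark_random_split` wraps it is an assumption here. The pure-Python sketch below illustrates the effect with a hypothetical helper, `bernoulli_split`, on made-up row IDs.

```python
import random

def bernoulli_split(rows, ratio=0.75, seed=123):
    """Assign each row to train with probability `ratio`, otherwise to test."""
    rng = random.Random(seed)
    train_rows, test_rows = [], []
    for row in rows:
        if rng.random() < ratio:
            train_rows.append(row)
        else:
            test_rows.append(row)
    return train_rows, test_rows

rows = list(range(100000))  # stand-in for 100k rating rows
train_rows, test_rows = bernoulli_split(rows)
# Sizes are close to 75% / 25% but not exact, and are reproducible for a fixed seed.
print("N train", len(train_rows))
print("N test", len(test_rows))
```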


In [43]:
header = {
    "userCol": "UserId",
    "itemCol": "MovieId",
    "ratingCol": "Rating",
}

# Although we use explicit ratings, implicitPrefs=True produces higher accuracy
als = ALS(
    rank=40,
    maxIter=15,
    implicitPrefs=True,
    alpha=0.1,
    coldStartStrategy='drop',
    **header
)
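With `implicitPrefs=True`, the model no longer tries to reproduce the rating value itself. In the implicit-feedback formulation that Spark's ALS is based on, each rating is turned into a binary preference plus a confidence weight that grows with `alpha`, so predicted scores tend to fall roughly between 0 and 1 instead of on the 1 to 5 scale. The sketch below shows that transformation for a single rating; `implicit_targets` is a hypothetical helper for illustration, not part of pySpark.

```python
def implicit_targets(rating, alpha=0.1):
    """Map a raw rating to (preference, confidence) as in implicit-feedback ALS."""
    # Preference: did the user interact positively with the item at all?
    preference = 1.0 if rating > 0 else 0.0
    # Confidence: how strongly we trust that preference; grows linearly with the rating.
    confidence = 1.0 + alpha * abs(rating)
    return preference, confidence

for rating in (1.0, 3.0, 5.0):
    print(rating, implicit_targets(rating))
```

A 5-star rating and a 1-star rating therefore both count as a positive preference here; they differ only in how heavily the squared error on that entry is weighted.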

### 3. Train the ALS model on the training data, and get the top-k recommendations for our testing data

In [44]:
model = als.fit(train)
recommendations = model.recommendForUserSubset(test, TOP_K)

In [45]:
recommendations.show()

+------+--------------------+
|UserId|     recommendations|
+------+--------------------+
|   471|[[418, 0.4986122]...|
|   463|[[285, 0.70806473...|
|   833|[[23, 0.7805631],...|
|   496|[[419, 0.5622276]...|
|   148|[[169, 0.5968351]...|
|   540|[[1, 0.6456575], ...|
|   392|[[286, 0.6480617]...|
|   243|[[275, 0.54525304...|
|   623|[[50, 0.5494961],...|
|   737|[[127, 0.35151374...|
|   897|[[210, 0.7934636]...|
|   858|[[286, 0.5446349]...|
|    31|[[303, 0.53303367...|
|   516|[[286, 0.2644445]...|
|   580|[[405, 0.6236924]...|
|   251|[[181, 0.8814149]...|
|   451|[[326, 0.80617386...|
|    85|[[197, 0.7751863]...|
|   137|[[121, 0.68923014...|
|   808|[[288, 0.66081864...|
+------+--------------------+
only showing top 20 rows



In [46]:
# Convert to the format expected by the reco_utils ranking evaluator
top_k = recommendations.select('UserId', F.explode('recommendations').alias('r')) \
    .select('UserId', 'r.*')
top_k.show()

+------+-------+----------+
|UserId|MovieId|    rating|
+------+-------+----------+
|   471|    418| 0.4986122|
|   471|    501|0.46774796|
|   471|    946|0.41865447|
|   471|    596|0.41041267|
|   471|    420| 0.3920477|
|   471|     99|0.38924447|
|   471|    404|0.38697183|
|   471|    477|0.38583365|
|   471|     71|0.38505054|
|   471|    588|0.36934996|
|   463|    285|0.70806473|
|   463|     13| 0.6797578|
|   463|    124| 0.6626799|
|   463|    116|0.63455147|
|   463|    475| 0.6337175|
|   463|     14|0.61140335|
|   463|    286| 0.6070779|
|   463|    302|0.60614985|
|   463|    111|0.59022146|
|   463|    258| 0.5770138|
+------+-------+----------+
only showing top 20 rows



### 4. Evaluate how well ALS performs

In [47]:
test.show()

+------+-------+------+---------+
|UserId|MovieId|Rating|Timestamp|
+------+-------+------+---------+
|     1|      2|   3.0|876893171|
|     1|      3|   4.0|878542960|
|     1|      4|   3.0|876893119|
|     1|      9|   5.0|878543541|
|     1|     11|   2.0|875072262|
|     1|     17|   3.0|875073198|
|     1|     25|   4.0|875071805|
|     1|     28|   4.0|875072173|
|     1|     30|   3.0|878542515|
|     1|     33|   4.0|878542699|
|     1|     43|   4.0|878542869|
|     1|     48|   5.0|875072520|
|     1|     49|   3.0|878542478|
|     1|     52|   4.0|875072205|
|     1|     59|   5.0|876892817|
|     1|     62|   3.0|878542282|
|     1|     65|   4.0|875072125|
|     1|     66|   4.0|878543030|
|     1|     71|   3.0|876892425|
|     1|     78|   1.0|878543176|
+------+-------+------+---------+
only showing top 20 rows



In [48]:
rank_eval = SparkRankingEvaluation(
    test, top_k, k=TOP_K,
    col_user="UserId", col_item="MovieId",
    col_rating="Rating", col_prediction="rating",
    relevancy_method="top_k",
)

In [49]:
print("Model:\tALS",
      "Top K:\t%d" % rank_eval.k,
      "MAP:\t%f" % rank_eval.map_at_k(),
      "NDCG:\t%f" % rank_eval.ndcg_at_k(),
      "Precision@K:\t%f" % rank_eval.precision_at_k(),
      "Recall@K:\t%f" % rank_eval.recall_at_k(), sep='\n')

Model:	ALS
Top K:	10
MAP:	0.027311
NDCG:	0.110232
Precision@K:	0.105308
Recall@K:	0.082848
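To make the last two numbers concrete, here is a sketch of the standard per-user definitions of Precision@K and Recall@K. It is not the `reco_utils` implementation, which works on Spark DataFrames and aggregates over all users; the helper `precision_recall_at_k` and the item IDs are made up for illustration.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user.

    recommended: item IDs ordered by predicted score, best first.
    relevant:    set of item IDs the user actually interacted with in the test set.
    """
    top = recommended[:k]
    hits = sum(1 for item in top if item in relevant)
    precision = hits / k                                  # share of the k slots that were hits
    recall = hits / len(relevant) if relevant else 0.0    # share of relevant items retrieved
    return precision, recall

# Hypothetical user: 5 recommendations, 3 relevant test items, 2 of which were recommended.
recommended = [11, 42, 7, 93, 58]
relevant = {42, 58, 100}
precision, recall = precision_recall_at_k(recommended, relevant, k=5)
print("Precision@5: {:.3f}".format(precision))
print("Recall@5: {:.3f}".format(recall))
```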
