# Running ALS on MovieLens (pySpark)

[ALS](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/recommendation.html#ALS) (Alternating Least Squares) is a well-known collaborative filtering algorithm.

This notebook shows how to use and evaluate the ALS implementation in pySpark ML (the DataFrame-based API), which is designed for large-scale distributed datasets. We use a small dataset here so that ALS runs efficiently on a single Data Science Virtual Machine.
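
To make the name "alternating least squares" concrete, here is a minimal pure-Python sketch of the rank-1 case, where each user and each item has a single scalar factor and every update is a closed-form ridge regression. It is an illustration only: the function name `als_rank1`, the toy ratings, and the plain L2 penalty are assumptions made for this sketch, not a description of how Spark implements ALS.

```python
import random


def als_rank1(ratings, n_users, n_items, reg=0.1, n_iter=10, seed=0):
    """Rank-1 ALS on (user, item, rating) triples.

    Returns the user factors, item factors, and the regularized loss
    recorded before training and after every iteration.
    """
    rng = random.Random(seed)
    u = [rng.uniform(0.5, 1.5) for _ in range(n_users)]
    v = [rng.uniform(0.5, 1.5) for _ in range(n_items)]

    # Index the observed ratings by user and by item.
    by_user, by_item = {}, {}
    for i, j, r in ratings:
        by_user.setdefault(i, []).append((j, r))
        by_item.setdefault(j, []).append((i, r))

    def loss():
        err = sum((r - u[i] * v[j]) ** 2 for i, j, r in ratings)
        return err + reg * (sum(x * x for x in u) + sum(x * x for x in v))

    history = [loss()]
    for _ in range(n_iter):
        # Hold item factors fixed and solve each user's 1-D ridge regression.
        for i, obs in by_user.items():
            u[i] = sum(r * v[j] for j, r in obs) / (reg + sum(v[j] ** 2 for j, _ in obs))
        # Hold user factors fixed and solve each item's 1-D ridge regression.
        for j, obs in by_item.items():
            v[j] = sum(r * u[i] for i, r in obs) / (reg + sum(u[i] ** 2 for i, _ in obs))
        history.append(loss())
    return u, v, history


# Hypothetical toy data: (user index, item index, rating)
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
u, v, history = als_rank1(toy_ratings, n_users=3, n_items=3)
print("loss before: %.3f, after: %.3f" % (history[0], history[-1]))
```

Because each half-step exactly minimizes the regularized squared error with the other side held fixed, the recorded loss never increases from one iteration to the next.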

In [16]:
# set the environment path to find Recommenders
import sys
sys.path.append("../../")
import os
import numpy as np
import pyspark
from pyspark.ml.recommendation import ALS
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import StringType, FloatType, IntegerType, LongType

from reco_utils.dataset import movielens
from reco_utils.common.notebook_utils import is_jupyter
from reco_utils.dataset.url_utils import maybe_download
from reco_utils.dataset.spark_splitters import spark_random_split
from reco_utils.evaluation.spark_evaluation import SparkRatingEvaluation, SparkRankingEvaluation


print("System version: {}".format(sys.version))
print("Spark version: {}".format(pyspark.__version__))


System version: 3.6.0 | packaged by conda-forge | (default, Feb  9 2017, 14:36:55) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
Spark version: 2.3.1


Set the default parameters.

In [17]:
# top k items to recommend
TOP_K = 10

# Select MovieLens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '100k'

### 0. Set up Spark context

The following settings work well for debugging locally on a VM; change them when running on a cluster. We set up a single large executor with many threads and cap its memory.

In [18]:
# the following settings work well for debugging locally on VM - change when running on a cluster
# set up a giant single executor with many threads and specify memory cap
spark = SparkSession \
    .builder \
    .appName("ALS pySpark") \
    .master("local[*]") \
    .config("spark.driver.memory", "8g")\
    .config("spark.executor.cores", "32")\
    .config("spark.executor.memory", "8g")\
    .config("spark.memory.fraction", "0.9")\
    .config("spark.memory.storageFraction", "0.3")\
    .config("spark.executor.instances", 1)\
    .config("spark.executor.heartbeatInterval", "36000s")\
    .config("spark.network.timeout", "10000000s")\
    .config("spark.driver.maxResultSize", "50g")\
    .getOrCreate()


### 1. Download the MovieLens dataset

In [19]:
# Note: The DataFrame-based API for ALS currently only supports integers for user and item ids.
schema = StructType(
    (
        StructField("UserId", IntegerType()),
        StructField("MovieId", IntegerType()),
        StructField("Rating", FloatType()),
        StructField("Timestamp", LongType()),
    )
)

data = spark.createDataFrame(
    movielens.load_data(size=MOVIELENS_DATA_SIZE),
    schema
)
data.show()

+------+-------+------+---------+
|UserId|MovieId|Rating|Timestamp|
+------+-------+------+---------+
|   196|    242|   3.0|881250949|
|   186|    302|   3.0|891717742|
|    22|    377|   1.0|878887116|
|   244|     51|   2.0|880606923|
|   166|    346|   1.0|886397596|
|   298|    474|   4.0|884182806|
|   115|    265|   2.0|881171488|
|   253|    465|   5.0|891628467|
|   305|    451|   3.0|886324817|
|     6|     86|   3.0|883603013|
|    62|    257|   2.0|879372434|
|   286|   1014|   5.0|879781125|
|   200|    222|   5.0|876042340|
|   210|     40|   3.0|891035994|
|   224|     29|   3.0|888104457|
|   303|    785|   3.0|879485318|
|   122|    387|   5.0|879270459|
|   194|    274|   2.0|879539794|
|   291|   1042|   4.0|874834944|
|   234|   1184|   2.0|892079237|
+------+-------+------+---------+
only showing top 20 rows



### 2. Split the data using the Spark random splitter provided in utilities

In [20]:
train, test = spark_random_split(data, ratio=0.75, seed=123)
print("N train", train.cache().count())
print("N test", test.cache().count())

N train 74850
N test 25150
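
The counts above are close to, but not exactly, 75,000 and 25,000. That is what one would expect if each row is assigned to a split independently at random rather than cut at an exact index. The pure-Python sketch below illustrates that behaviour; `bernoulli_split` is a hypothetical helper written for this note, not the implementation of `spark_random_split`.

```python
import random


def bernoulli_split(rows, ratio=0.75, seed=123):
    """Send each row to train with probability `ratio`, otherwise to test."""
    rng = random.Random(seed)
    train_rows, test_rows = [], []
    for row in rows:
        if rng.random() < ratio:
            train_rows.append(row)
        else:
            test_rows.append(row)
    return train_rows, test_rows


# Stand-in for 100,000 rating rows
rows = list(range(100000))
train_rows, test_rows = bernoulli_split(rows)
print("N train", len(train_rows))
print("N test", len(test_rows))
```

With this scheme the split sizes fluctuate around the requested ratio from seed to seed, while every row still lands in exactly one of the two sets.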


### 3. Train the ALS model on the training data, and get the top-k recommendations for our testing data

To predict movie ratings, we treat the ratings in the training set as explicit feedback from users.

When the goal is instead to recommend the top k movies a user is likely to watch, we treat the ratings as implicit feedback.
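
As a rough illustration of what "implicit" means here, the sketch below assumes the formulation from Hu, Koren and Volinsky's implicit-feedback paper, which the Spark documentation cites for `implicitPrefs=True`: each rating is turned into a binary preference plus a confidence weight that grows with `alpha`. The helper name `to_preference_confidence` is hypothetical, and Spark's internals may differ in detail.

```python
def to_preference_confidence(rating, alpha=0.1):
    """Map a raw rating to (preference, confidence) under the assumed formulation.

    preference: 1.0 if the user interacted with the item (rating > 0), else 0.0
    confidence: 1 + alpha * rating, so stronger signals weigh more in the loss
    """
    preference = 1.0 if rating > 0 else 0.0
    confidence = 1.0 + alpha * rating
    return preference, confidence


for rating in (0.0, 1.0, 5.0):
    print(rating, to_preference_confidence(rating))
```

Under this view the model is fitted to the 0/1 preference rather than to the 1 to 5 rating, which is consistent with the recommendation scores produced in this section being on a roughly 0 to 1 scale.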

In [21]:
header = {
    "userCol": "UserId",
    "itemCol": "MovieId",
    "ratingCol": "Rating",
}


# implicitPrefs=True for recommendation, False for rating prediction
als = ALS(
    rank=40,
    maxIter=15,
    implicitPrefs=True,
    alpha=0.1,
    regParam=0.01,
    coldStartStrategy='drop',
    nonnegative=True,
    **header
)

In [22]:
model = als.fit(train)


In [23]:
recommendations = model.recommendForUserSubset(test, TOP_K)
recommendations.show()
    

+------+--------------------+
|UserId|     recommendations|
+------+--------------------+
|   471|[[432, 0.9008726]...|
|   463|[[285, 0.8611088]...|
|   833|[[180, 1.1050073]...|
|   496|[[98, 0.7869402],...|
|   148|[[169, 0.7729108]...|
|   540|[[258, 0.96398807...|
|   392|[[258, 1.0923457]...|
|   243|[[275, 0.77682847...|
|   623|[[50, 1.0055649],...|
|   737|[[100, 0.69587535...|
|   897|[[423, 0.9980144]...|
|   858|[[127, 0.9046049]...|
|    31|[[268, 0.7689046]...|
|   516|[[127, 0.54003024...|
|   580|[[117, 0.9601716]...|
|   251|[[121, 1.0640283]...|
|   451|[[259, 1.2615407]...|
|    85|[[238, 1.0562271]...|
|   137|[[181, 0.86645734...|
|   808|[[294, 0.7781324]...|
+------+--------------------+
only showing top 20 rows



In [24]:
# Convert to the format expected by the reco_utils ranking evaluator
top_k = recommendations.select('UserId', F.explode('recommendations').alias('r')) \
    .select('UserId', 'r.*')
top_k.show()

+------+-------+----------+
|UserId|MovieId|    rating|
+------+-------+----------+
|   471|    432| 0.9008726|
|   471|    588| 0.8720478|
|   471|     95| 0.7204822|
|   471|    501|0.71627504|
|   471|    418|0.69428414|
|   471|    404| 0.6541548|
|   471|     99| 0.6457667|
|   471|    419| 0.5785657|
|   471|    420| 0.5444006|
|   471|    465|0.54001725|
|   463|    285| 0.8611088|
|   463|    100|0.85740536|
|   463|    237| 0.8491992|
|   463|     14|0.84572196|
|   463|    275| 0.8215583|
|   463|    124|  0.819225|
|   463|    137|0.79924214|
|   463|      1| 0.7852974|
|   463|    116| 0.7741509|
|   463|    286| 0.7676758|
+------+-------+----------+
only showing top 20 rows
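
For readers less familiar with `F.explode`, the reshaping above is equivalent to flattening a per-user list of `(MovieId, rating)` pairs into one row per pair. A pure-Python analogue, using the first few values from the output above (the helper name `explode_recommendations` is made up for this sketch):

```python
def explode_recommendations(nested):
    """Flatten {user: [(item, score), ...]} into (user, item, score) rows."""
    rows = []
    for user_id, recs in nested.items():
        for movie_id, score in recs:
            rows.append((user_id, movie_id, score))
    return rows


nested = {
    471: [(432, 0.9008726), (588, 0.8720478)],
    463: [(285, 0.8611088), (100, 0.85740536)],
}
flat_rows = explode_recommendations(nested)
for row in flat_rows:
    print(row)
```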



### 4. Evaluate how well ALS performs

In [25]:
test.show()

+------+-------+------+---------+
|UserId|MovieId|Rating|Timestamp|
+------+-------+------+---------+
|     1|     10|   3.0|875693118|
|     1|     12|   5.0|878542960|
|     1|     14|   5.0|874965706|
|     1|     31|   3.0|875072144|
|     1|     39|   4.0|875072173|
|     1|     54|   3.0|878543308|
|     1|     76|   4.0|878543176|
|     1|     81|   5.0|875072865|
|     1|     84|   4.0|875072923|
|     1|     90|   4.0|878542300|
|     1|    112|   1.0|878542441|
|     1|    120|   1.0|875241637|
|     1|    121|   4.0|875071823|
|     1|    129|   5.0|887431908|
|     1|    148|   2.0|875240799|
|     1|    155|   2.0|878542201|
|     1|    160|   4.0|875072547|
|     1|    163|   4.0|875072442|
|     1|    177|   5.0|876892701|
|     1|    193|   4.0|876892654|
+------+-------+------+---------+
only showing top 20 rows



In [26]:
rank_eval = SparkRankingEvaluation(test, top_k, k=TOP_K, col_user="UserId", col_item="MovieId",
                                   col_rating="Rating", col_prediction="rating",
                                   relevancy_method="top_k")

In [27]:
print("Model:\tALS",
      "Top K:\t%d" % rank_eval.k,
      "MAP:\t%f" % rank_eval.map_at_k(),
      "NDCG:\t%f" % rank_eval.ndcg_at_k(),
      "Precision@K:\t%f" % rank_eval.precision_at_k(),
      "Recall@K:\t%f" % rank_eval.recall_at_k(), sep='\n')

Model:	ALS
Top K:	10
MAP:	0.023443
NDCG:	0.096148
Precision@K:	0.095758
Recall@K:	0.073473
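
To help read these numbers, here is a minimal sketch of the textbook per-user definitions of precision@k and recall@k. It is not the `SparkRankingEvaluation` code, which may differ in details such as how relevant items are chosen and how users are averaged; the function name and the example item ids are illustrative only.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Per-user precision@k and recall@k.

    recommended: item ids ordered from best to worst score
    relevant: set of item ids the user actually interacted with in the test set
    """
    top = recommended[:k]
    hits = sum(1 for item in top if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


recommended = [432, 588, 95, 501, 418]
relevant = {588, 418, 7}
precision, recall = precision_recall_at_k(recommended, relevant, k=5)
print("precision@5 = %.3f, recall@5 = %.3f" % (precision, recall))
```

In this toy case two of the five recommended items are relevant, so precision@5 is 2/5, and two of the three relevant items were retrieved, so recall@5 is 2/3.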


### 5. Evaluate rating prediction

In [28]:
als_prediction = ALS(
    rank=40,
    maxIter=15,
    implicitPrefs=False,
    regParam=0.01,
    coldStartStrategy='drop',
    nonnegative=True,
    **header
)

model_prediction = als_prediction.fit(train)

prediction = model_prediction.transform(test)
prediction.show()


+------+-------+------+---------+----------+
|UserId|MovieId|Rating|Timestamp|prediction|
+------+-------+------+---------+----------+
|   332|    148|   5.0|887938486| 4.0304995|
|   606|    148|   3.0|878150506| 3.3472812|
|   916|    148|   2.0|880843892| 2.1215756|
|   236|    148|   4.0|890117028| 2.1268826|
|   602|    148|   4.0|888638517| 4.4665756|
|   222|    148|   2.0|881061164| 3.4551847|
|   372|    148|   5.0|876869915| 4.9526243|
|   935|    148|   4.0|884472892|  4.767795|
|     1|    148|   2.0|875240799| 3.3428721|
|   178|    148|   4.0|882824325|  4.353584|
|   328|    148|   3.0|885048638| 2.5750365|
|    20|    148|   5.0|879668713|  3.617363|
|   164|    148|   5.0|889402203| 4.8500266|
|    54|    148|   3.0|880937490| 3.4620621|
|   423|    148|   3.0|891395417| 2.2861683|
|   880|    148|   2.0|880167030| 3.7357397|
|   870|    148|   2.0|879377064| 2.2010548|
|    59|    148|   3.0|888203175|  3.286203|
|   757|    148|   4.0|888444948| 2.9821608|
|   434|  

In [29]:
rating_eval = SparkRatingEvaluation(test, prediction, col_user="UserId", col_item="MovieId", 
                                    col_rating="Rating", col_prediction="prediction")

print("Model:\tALS rating prediction",
      "RMSE:\t%.2f" % rating_eval.rmse(),
      "MAE:\t%f" % rating_eval.mae(),
      "Explained variance:\t%f" % rating_eval.exp_var(),
      "R squared:\t%f" % rating_eval.rsquared(), sep='\n')

Model:	ALS rating prediction
RMSE:	1.12
MAE:	0.874208
Explained variance:	-0.004006
R squared:	-0.006055
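
The four metrics can be sketched in plain Python as follows. These are the common textbook definitions, assumed here for illustration; `SparkRatingEvaluation` may compute some of them differently, and the function name and toy values are not from the notebook.

```python
import math


def rating_metrics(y_true, y_pred):
    """RMSE, MAE, explained variance and R squared for paired ratings."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    mean_res = sum(residuals) / n
    ss_res = sum(e * e for e in residuals)                 # squared error of the model
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)     # squared error of the mean
    var_res = sum((e - mean_res) ** 2 for e in residuals) / n
    var_true = ss_tot / n
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in residuals) / n,
        "exp_var": 1.0 - var_res / var_true,
        "rsquared": 1.0 - ss_res / ss_tot,
    }


metrics = rating_metrics([1.0, 2.0, 3.0, 4.0, 5.0], [1.5, 2.0, 2.5, 4.5, 4.5])
for name, value in metrics.items():
    print("%s:\t%f" % (name, value))
```

Under these definitions, a negative R squared such as the one reported above means the model's squared error on the test set is larger than that of always predicting the mean test rating.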


In [30]:
if is_jupyter():
    # Record results with papermill for tests
    import papermill as pm
    pm.record("map", rank_eval.map_at_k())
    pm.record("ndcg", rank_eval.ndcg_at_k())
    pm.record("precision", rank_eval.precision_at_k())
    pm.record("recall", rank_eval.recall_at_k())
    pm.record("rmse", rating_eval.rmse())
    pm.record("mae", rating_eval.mae())
    pm.record("exp_var", rating_eval.exp_var())
    pm.record("rsquared", rating_eval.rsquared())
