# Running ALS on MovieLens (pySpark)

[ALS](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/recommendation.html#ALS) (Alternating Least Squares) is a well-known collaborative filtering algorithm.

This notebook provides an example of how to utilize and evaluate ALS pySpark ML (DataFrame-based API) implementation, meant for large-scale distributed datasets. We use a smaller dataset in this example to run ALS efficiently on Data Science Virtual Machine.

In [2]:
# set the environment path to find Recommenders
import sys
sys.path.append("../../")
import pyspark
from pyspark.ml.recommendation import ALS
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import StringType, FloatType, IntegerType, LongType

from reco_utils.dataset import movielens
from reco_utils.common.notebook_utils import is_jupyter
from reco_utils.dataset.spark_splitters import spark_random_split
from reco_utils.evaluation.spark_evaluation import SparkRatingEvaluation, SparkRankingEvaluation


print("System version: {}".format(sys.version))
print("Spark version: {}".format(pyspark.__version__))


System version: 3.6.0 | packaged by conda-forge | (default, Feb  9 2017, 14:36:55) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
Spark version: 2.3.1


Set the default parameters.

In [4]:
# top k items to recommend
TOP_K = 10

# Select Movielens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '100k'

### 0. Set up Spark context

The following settings work well for debugging locally on VM - change when running on a cluster. We set up a giant single executor with many threads and specify memory cap. 

In [5]:
# the following settings work well for debugging locally on VM - change when running on a cluster
# set up a giant single executor with many threads and specify memory cap
spark = SparkSession \
    .builder \
    .appName("ALS pySpark") \
    .master("local[*]") \
    .config("spark.driver.memory", "8g")\
    .config("spark.executor.cores", "32")\
    .config("spark.executor.memory", "8g")\
    .config("spark.memory.fraction", "0.9")\
    .config("spark.memory.stageFraction", "0.3")\
    .config("spark.executor.instances", 1)\
    .config("spark.executor.heartbeatInterval", "36000s")\
    .config("spark.network.timeout", "10000000s")\
    .config("spark.driver.maxResultSize", "50g")\
    .getOrCreate()


### 1. Download the MovieLens dataset

In [6]:
# Note: The DataFrame-based API for ALS currently only supports integers for user and item ids.
schema = StructType(
    (
        StructField("UserId", IntegerType()),
        StructField("MovieId", IntegerType()),
        StructField("Rating", FloatType()),
        StructField("Timestamp", LongType()),
    )
)

data = movielens.load_spark_df(spark, size=MOVIELENS_DATA_SIZE, schema=schema)
data.show()

+------+-------+------+---------+
|UserId|MovieId|Rating|Timestamp|
+------+-------+------+---------+
|   196|    242|   3.0|881250949|
|   186|    302|   3.0|891717742|
|    22|    377|   1.0|878887116|
|   244|     51|   2.0|880606923|
|   166|    346|   1.0|886397596|
|   298|    474|   4.0|884182806|
|   115|    265|   2.0|881171488|
|   253|    465|   5.0|891628467|
|   305|    451|   3.0|886324817|
|     6|     86|   3.0|883603013|
|    62|    257|   2.0|879372434|
|   286|   1014|   5.0|879781125|
|   200|    222|   5.0|876042340|
|   210|     40|   3.0|891035994|
|   224|     29|   3.0|888104457|
|   303|    785|   3.0|879485318|
|   122|    387|   5.0|879270459|
|   194|    274|   2.0|879539794|
|   291|   1042|   4.0|874834944|
|   234|   1184|   2.0|892079237|
+------+-------+------+---------+
only showing top 20 rows



### 2. Split the data using the Spark random splitter provided in utilities

In [7]:
train, test = spark_random_split(data, ratio=0.75, seed=123)
print ("N train", train.cache().count())
print ("N test", test.cache().count())

N train 75193
N test 24807


### 3. Train the ALS model on the training data, and get the top-k recommendations for our testing data

To predict movie ratings, we use the rating data in the training set as users' explicit feedbacks.

When our goal is to recommend top k movies a user is likely to watch, on the other hand, we utilize the ratings as implicit feedbacks.

In [8]:
header = {
    "userCol": "UserId",
    "itemCol": "MovieId",
    "ratingCol": "Rating",
}


# implicitPrefs=True for recommendation, False for rating prediction
als = ALS(
    rank=40,
    maxIter=15,
    implicitPrefs=True,
    alpha=0.1,
    regParam=0.01,
    coldStartStrategy='drop',
    nonnegative=True,
    **header
)

In [9]:
model = als.fit(train)


In the movie recommendation use case, recommending movies that have been rated by the users do not make sense. Therefore, the rated movies are removed from the recommended items.

In order to achieve this, we recommend all movies to all users, and then remove the user-movie pairs that exist in the training datatset.

In [10]:
# Get the count of all the items in the training data, and recommend them to each user.
item_count = len(train.select('MovieId').distinct().collect())

dfs_rec = model.recommendForAllUsers(item_count)
dfs_rec.show()

+------+--------------------+
|UserId|     recommendations|
+------+--------------------+
|   471|[[418, 0.81081796...|
|   463|[[285, 1.1613864]...|
|   833|[[234, 1.1637557]...|
|   496|[[91, 0.64642763]...|
|   148|[[50, 0.9399338],...|
|   540|[[25, 0.8960909],...|
|   392|[[286, 0.98481745...|
|   243|[[275, 0.8062736]...|
|   623|[[50, 1.0966091],...|
|   737|[[50, 0.7834272],...|
|   897|[[210, 1.0125527]...|
|   858|[[286, 0.59100807...|
|    31|[[302, 0.7587307]...|
|   516|[[286, 0.37176636...|
|   580|[[121, 0.97008115...|
|   251|[[257, 1.2410074]...|
|   451|[[286, 1.3096821]...|
|    85|[[170, 1.1391337]...|
|   137|[[121, 0.9710442]...|
|   808|[[313, 0.76340157...|
+------+--------------------+
only showing top 20 rows



In [11]:
# Explode the recommendations to comply with reco utils evaluator format.
dfs_pred = dfs_rec.select('UserId', F.explode('recommendations').alias('r')) \
  .select('UserId', 'r.*')
dfs_pred.show()

+------+-------+----------+
|UserId|MovieId|    rating|
+------+-------+----------+
|   471|    418|0.81081796|
|   471|    588|0.80154705|
|   471|     99|0.75493133|
|   471|    432|  0.735334|
|   471|    501|0.70974267|
|   471|     71| 0.6583156|
|   471|    419|0.63108385|
|   471|    143|   0.62445|
|   471|    404|0.61170727|
|   471|    420| 0.5778315|
|   471|    625|0.54469806|
|   471|    465| 0.5161731|
|   471|    946| 0.5004436|
|   471|     50|0.47752252|
|   471|    151| 0.4772636|
|   471|     95| 0.4528677|
|   471|    416| 0.4515421|
|   471|     91|0.44780803|
|   471|    596| 0.4455471|
|   471|    969|0.44410107|
+------+-------+----------+
only showing top 20 rows



In [12]:
# Remove seen items.
dfs_pred_exclude_train = dfs_pred.alias("pred").join(
    train.alias("train"),
    (dfs_pred['UserId'] == train['UserId']) & (dfs_pred['MovieId'] == train['MovieId']),
    how='outer'
)

top_all = dfs_pred_exclude_train.filter(dfs_pred_exclude_train["pred.rating"].isNotNull()) \
    .select('pred.' + 'UserId', 'pred.' + 'MovieId', 'pred.' + "rating")

top_all.show()

+------+-------+----------+
|UserId|MovieId|    rating|
+------+-------+----------+
|   471|    418|0.81081796|
|   471|    588|0.80154705|
|   471|     99|0.75493133|
|   471|    432|  0.735334|
|   471|    501|0.70974267|
|   471|     71| 0.6583156|
|   471|    419|0.63108385|
|   471|    143|   0.62445|
|   471|    404|0.61170727|
|   471|    420| 0.5778315|
|   471|    625|0.54469806|
|   471|    465| 0.5161731|
|   471|    946| 0.5004436|
|   471|     50|0.47752252|
|   471|    151| 0.4772636|
|   471|     95| 0.4528677|
|   471|    416| 0.4515421|
|   471|     91|0.44780803|
|   471|    596| 0.4455471|
|   471|    969|0.44410107|
+------+-------+----------+
only showing top 20 rows



### 4. Evaluate how well ALS performs

In [13]:
test.show()

+------+-------+------+---------+
|UserId|MovieId|Rating|Timestamp|
+------+-------+------+---------+
|     1|      2|   3.0|876893171|
|     1|      3|   4.0|878542960|
|     1|      4|   3.0|876893119|
|     1|      9|   5.0|878543541|
|     1|     11|   2.0|875072262|
|     1|     17|   3.0|875073198|
|     1|     25|   4.0|875071805|
|     1|     28|   4.0|875072173|
|     1|     30|   3.0|878542515|
|     1|     33|   4.0|878542699|
|     1|     43|   4.0|878542869|
|     1|     48|   5.0|875072520|
|     1|     49|   3.0|878542478|
|     1|     52|   4.0|875072205|
|     1|     59|   5.0|876892817|
|     1|     62|   3.0|878542282|
|     1|     65|   4.0|875072125|
|     1|     66|   4.0|878543030|
|     1|     71|   3.0|876892425|
|     1|     78|   1.0|878543176|
+------+-------+------+---------+
only showing top 20 rows



In [14]:
rank_eval = SparkRankingEvaluation(test, top_all, k = TOP_K, col_user="UserId", col_item="MovieId", 
                                    col_rating="Rating", col_prediction="rating", 
                                    relevancy_method="top_k")

In [15]:
print("Model:\tALS",
      "Top K:\t%d" % rank_eval.k,
      "MAP:\t%f" % rank_eval.map_at_k(),
      "NDCG:\t%f" % rank_eval.ndcg_at_k(),
      "Precision@K:\t%f" % rank_eval.precision_at_k(),
      "Recall@K:\t%f" % rank_eval.recall_at_k(), sep='\n')

Model:	ALS
Top K:	10
MAP:	0.026606
NDCG:	0.102588
Precision@K:	0.098620
Recall@K:	0.081998


### 5. Evaluate rating prediction

In [13]:
als_prediction = ALS(
    rank=40,
    maxIter=15,
    implicitPrefs=False,
    regParam=0.01,
    coldStartStrategy='drop',
    nonnegative=True,
    **header
)

model_prediction = als_prediction.fit(train)

prediction = model_prediction.transform(test)
prediction.show()


+------+-------+------+---------+----------+
|UserId|MovieId|Rating|Timestamp|prediction|
+------+-------+------+---------+----------+
|   406|    148|   3.0|879540276| 1.7514404|
|    27|    148|   3.0|891543129| 2.8414743|
|   606|    148|   3.0|878150506| 3.4356766|
|   916|    148|   2.0|880843892| 2.0533013|
|   236|    148|   4.0|890117028| 2.5193589|
|   602|    148|   4.0|888638517|  3.852337|
|   663|    148|   4.0|889492989|  3.369333|
|   372|    148|   5.0|876869915| 4.2370667|
|   190|    148|   4.0|891033742| 3.5539067|
|     1|    148|   2.0|875240799| 3.1896725|
|   297|    148|   3.0|875239619| 3.5016007|
|   178|    148|   4.0|882824325| 3.8494987|
|   308|    148|   3.0|887740788| 2.9070387|
|   923|    148|   4.0|880387474|   4.19521|
|    54|    148|   3.0|880937490| 2.9716566|
|   430|    148|   2.0|877226047| 3.4732695|
|    92|    148|   2.0|877383934| 1.9645505|
|   447|    148|   4.0|878854729| 3.9878867|
|   374|    148|   4.0|880392992|  2.867302|
|   891|  

In [14]:
rating_eval = SparkRatingEvaluation(test, prediction, col_user="UserId", col_item="MovieId", 
                                    col_rating="Rating", col_prediction="prediction")

print("Model:\tALS rating prediction",
      "RMSE:\t%.2f" % rating_eval.rmse(),
      "MAE:\t%f" % rating_eval.mae(),
      "Explained variance:\t%f" % rating_eval.exp_var(),
      "R squared:\t%f" % rating_eval.rsquared(), sep='\n')

Model:	ALS rating prediction
RMSE:	1.12
MAE:	0.869879
Explained variance:	0.007969
R squared:	0.003702


In [15]:
if is_jupyter():
    # Record results with papermill for tests
    import papermill as pm
    pm.record("map", rank_eval.map_at_k())
    pm.record("ndcg", rank_eval.ndcg_at_k())
    pm.record("precision", rank_eval.precision_at_k())
    pm.record("recall", rank_eval.recall_at_k())
    pm.record("rmse", rating_eval.rmse())
    pm.record("mae", rating_eval.mae())
    pm.record("exp_var", rating_eval.exp_var())
    pm.record("rsquared", rating_eval.rsquared())
