# Introduction to Machine Learning in Spark


## MLlib
MLlib stands for Machine Learning Library. Its original RDD-based API is now in maintenance mode: it still receives bug fixes but no new features.

Starting with Spark 2.0, ML is the main machine learning library. It operates on DataFrames instead of RDDs, as the RDD-based MLlib API does. According to the Spark 2.x documentation, the RDD-based API would be deprecated after reaching feature parity (roughly estimated for Spark 2.3) and was expected to be removed in Spark 3.0.

Understanding the ML packages in Spark:
* spark.mllib contains the original API built on top of RDDs.
* spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.


## ML

# Machine learning algorithms

## Linear regression

### Traditional approach to linear regression

In [1]:
from pyspark import SparkContext
sc=SparkContext(master='local[2]',
               appName='my-spark')
from pyspark.sql import SparkSession
spark=SparkSession(sparkContext=sc)

In [2]:
ad=spark.read.csv('Advertising.csv',header=True,inferSchema=True)
ad.show(5)

+-----+-----+---------+-----+
|   TV|Radio|Newspaper|Sales|
+-----+-----+---------+-----+
|230.1| 37.8|     69.2| 22.1|
| 44.5| 39.3|     45.1| 10.4|
| 17.2| 45.9|     69.3|  9.3|
|151.5| 41.3|     58.5| 18.5|
|180.8| 10.8|     58.4| 12.9|
+-----+-----+---------+-----+
only showing top 5 rows



In [4]:
# Features are the 'TV', 'Radio', 'Newspaper' columns; 'Sales' is the label
from pyspark.ml.linalg import Vectors
ad_df=ad.rdd.map(lambda x: [Vectors.dense(x[0:3]),x[-1]]).toDF(['features','label'])
ad_df.show()

+-----------------+-----+
|         features|label|
+-----------------+-----+
|[230.1,37.8,69.2]| 22.1|
| [44.5,39.3,45.1]| 10.4|
| [17.2,45.9,69.3]|  9.3|
|[151.5,41.3,58.5]| 18.5|
|[180.8,10.8,58.4]| 12.9|
|  [8.7,48.9,75.0]|  7.2|
| [57.5,32.8,23.5]| 11.8|
|[120.2,19.6,11.6]| 13.2|
|    [8.6,2.1,1.0]|  4.8|
| [199.8,2.6,21.2]| 10.6|
|  [66.1,5.8,24.2]|  8.6|
| [214.7,24.0,4.0]| 17.4|
| [23.8,35.1,65.9]|  9.2|
|   [97.5,7.6,7.2]|  9.7|
|[204.1,32.9,46.0]| 19.0|
|[195.4,47.7,52.9]| 22.4|
|[67.8,36.6,114.0]| 12.5|
|[281.4,39.6,55.8]| 24.4|
| [69.2,20.5,18.3]| 11.3|
|[147.3,23.9,19.1]| 14.6|
+-----------------+-----+
only showing top 20 rows



In [5]:
from pyspark.ml.regression import LinearRegression
lr=LinearRegression(featuresCol='features',labelCol='label')

In [6]:
lr_model=lr.fit(ad_df)

In [7]:
pred=lr_model.transform(ad_df)
pred.show(5)

+-----------------+-----+------------------+
|         features|label|        prediction|
+-----------------+-----+------------------+
|[230.1,37.8,69.2]| 22.1| 20.52397440971517|
| [44.5,39.3,45.1]| 10.4|12.337854820894362|
| [17.2,45.9,69.3]|  9.3|12.307670779994238|
|[151.5,41.3,58.5]| 18.5| 17.59782951168913|
|[180.8,10.8,58.4]| 12.9|13.188671856831299|
+-----------------+-----+------------------+
only showing top 5 rows



In [9]:
from pyspark.ml.evaluation import RegressionEvaluator
evaluator=RegressionEvaluator(predictionCol='prediction',labelCol='label')
evaluator.setMetricName('r2').evaluate(pred)

0.897210638178952
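
As a rough illustration of what the `r2` metric reports, the sketch below computes the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$, in plain Python. It uses the five label/prediction pairs printed above, with predictions rounded to three decimals. The helper name `r_squared` is ours and is not part of PySpark, and the result differs from the value above because only five rows are used.

```python
def r_squared(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    # Residual sum of squares: error left after using the model's predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    # Total sum of squares: error of always predicting the mean label.
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

# First five rows of the prediction table (predictions rounded).
labels = [22.1, 10.4, 9.3, 18.5, 12.9]
predictions = [20.524, 12.338, 12.308, 17.598, 13.189]
print(r_squared(labels, predictions))
```

A value of 1 means the predictions match the labels exactly, and a value of 0 means the model does no better than always predicting the mean label.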


### Linear regression with cross-validation

In [10]:
training, test=ad_df.randomSplit([.8, .2],seed=123)

In [11]:
training.show()

+----------------+-----+
|        features|label|
+----------------+-----+
|  [0.7,39.6,8.7]|  1.6|
| [7.8,38.9,50.6]|  6.6|
|  [8.4,27.2,2.1]|  5.7|
|   [8.6,2.1,1.0]|  4.8|
| [8.7,48.9,75.0]|  7.2|
| [13.1,0.4,25.6]|  5.3|
|[16.9,43.7,89.4]|  8.7|
| [17.2,4.1,31.6]|  5.9|
|[17.2,45.9,69.3]|  9.3|
|[17.9,37.6,21.6]|  8.0|
|[18.7,12.1,23.4]|  6.7|
|[19.4,16.0,22.3]|  6.6|
|[19.6,20.1,17.0]|  7.6|
|[23.8,35.1,65.9]|  9.2|
|[25.0,11.0,29.7]|  7.2|
|[25.1,25.7,43.3]|  8.5|
| [25.6,39.0,9.3]|  9.5|
|[26.8,33.0,19.3]|  8.8|
| [28.6,1.5,33.0]|  7.3|
| [31.5,24.6,2.2]|  9.5|
+----------------+-----+
only showing top 20 rows



Cross Validation

In [12]:
lr=LinearRegression(featuresCol='features',labelCol='label')
# parameter grid
from pyspark.ml.tuning import ParamGridBuilder
param_grid=ParamGridBuilder().\
            addGrid(lr.regParam, [0.0, .5, 1.0]).\
            addGrid(lr.elasticNetParam, [0.2,.5,.8]).\
        build()

In [14]:
evaluator=RegressionEvaluator(predictionCol='prediction',labelCol='label',metricName='r2')


In [16]:
from pyspark.ml.tuning import CrossValidator
cv=CrossValidator(estimator=lr, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)

In [17]:
cv_model=cv.fit(training)
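
To make the mechanics of the grid search more concrete, here is a minimal pure-Python sketch of the bookkeeping behind it: every combination of the listed parameter values is a candidate, and each candidate is fitted once per fold before the best one is refitted on the whole training set. The helpers `build_param_grid` and `k_fold_indices` are hypothetical names, and the strided fold assignment is a simplifying assumption (Spark assigns rows to folds randomly).

```python
from itertools import product

def build_param_grid(grid):
    """Return every combination of the listed parameter values as a dict."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]

def k_fold_indices(n_rows, num_folds):
    """Yield (train_indices, validation_indices) for each fold.

    Assumption: rows are assigned to folds by striding, not at random.
    """
    folds = [list(range(k, n_rows, num_folds)) for k in range(num_folds)]
    for k in range(num_folds):
        validation = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, validation

# Same parameter values as the grid in the cell above.
param_grid = build_param_grid({
    "regParam": [0.0, 0.5, 1.0],
    "elasticNetParam": [0.2, 0.5, 0.8],
})
splits = list(k_fold_indices(n_rows=10, num_folds=5))

print(len(param_grid))                # number of candidate settings
print(len(param_grid) * len(splits))  # model fits before the final refit
```

Each extra value in the grid multiplies the number of fits, so the cost of a search grows quickly with the number of tuned parameters.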

In [18]:
pred_training=cv_model.transform(training)

In [19]:
pred_test_cv=cv_model.transform(test)

In [20]:
evaluator.setMetricName('r2').evaluate(pred_training)

0.8982486958337326

In [21]:
evaluator.setMetricName('r2').evaluate(pred_test_cv)

0.8896562076565583

In [22]:
print(cv_model.bestModel.coefficients)

[0.046507497430939085,0.18085452246520456,-0.0010752054907401342]


In [23]:
print(cv_model.bestModel.intercept)

3.075068686285647


In [25]:
cv_model.bestModel._java_obj.getRegParam()

0.0

In [26]:
cv_model.bestModel._java_obj.getElasticNetParam()

0.2

## Logistic Regression

#### Processing categorical columns
Using a pipeline, we will perform the following transformations on the categorical data:

* StringIndexer encodes a string column as a column of numeric category indices, ordered by frequency (the most frequent category gets index 0), so the categorical meaning is kept.
* OneHotEncoder turns each category index into a binary vector suitable for modelling. By default the last category is dropped and is represented by an all-zero vector.
* VectorAssembler combines all feature columns into a single vector column.
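
The three transformations above can be sketched in plain Python on a toy column. This is only an illustration of the idea, not Spark's implementation: the function names are hypothetical, the sample values are made up, and breaking frequency ties alphabetically is an assumption.

```python
from collections import Counter

def fit_string_index(values):
    """Map each category to an index, most frequent category first."""
    counts = Counter(values)
    # Assumption: ties in frequency are broken alphabetically.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {category: index for index, category in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Return a dense 0/1 list; with drop_last the last category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector

def assemble(*vectors):
    """Concatenate several vectors into one feature vector."""
    return [x for vector in vectors for x in vector]

# Toy data (made up for illustration).
education = ["low", "high", "high", "low", "high"]
wants_more = ["yes", "no", "no", "no", "yes"]

edu_index = fit_string_index(education)    # 'high' is most frequent
more_index = fit_string_index(wants_more)  # 'no' is most frequent

features = [
    assemble(one_hot(edu_index[e], len(edu_index)),
             one_hot(more_index[w], len(more_index)))
    for e, w in zip(education, wants_more)
]
print(edu_index, more_index)
print(features)
```

Dropping the last category avoids a redundant column: a row with all zeros for a variable already identifies the dropped category.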

#### Build cross-validation model

In [27]:
cuse=spark.read.csv('cuse_binary.csv',header=True,inferSchema=True)
cuse.show(5)

+---+---------+---------+---+
|age|education|wantsMore|  y|
+---+---------+---------+---+
|<25|      low|      yes|  0|
|<25|      low|      yes|  0|
|<25|      low|      yes|  0|
|<25|      low|      yes|  0|
|<25|      low|      yes|  0|
+---+---------+---------+---+
only showing top 5 rows



In [38]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline

# categorical columns
categorical_columns = cuse.columns[0:3]
categorical_columns

['age', 'education', 'wantsMore']

In [39]:
# Build StringIndexer stages
stringindexer_stages = [StringIndexer(inputCol=c, outputCol='strindexed_' + c) for c in categorical_columns]
# encode label column and add it to stringindexer_stages
stringindexer_stages += [StringIndexer(inputCol='y', outputCol='label')]

In [40]:
# Build OneHotEncoder stages
onehotencoder_stages = [OneHotEncoder(inputCol='strindexed_' + c, outputCol='onehot_' + c) for c in categorical_columns]

In [41]:
# Build VectorAssembler stage
feature_columns = ['onehot_' + c for c in categorical_columns]
vectorassembler_stage = VectorAssembler(inputCols=feature_columns, outputCol='features') 

In [42]:
# Combine all stages into a pipeline
all_stages = stringindexer_stages + onehotencoder_stages + [vectorassembler_stage]
pipeline = Pipeline(stages=all_stages)

In [43]:
# Fit pipeline model
pipeline_model = pipeline.fit(cuse)

In [44]:
# Transform data
final_columns = feature_columns + ['features', 'label']
cuse_df = pipeline_model.transform(cuse).\
            select(final_columns)
            
cuse_df.show(5)

+-------------+----------------+----------------+-------------------+-----+
|   onehot_age|onehot_education|onehot_wantsMore|           features|label|
+-------------+----------------+----------------+-------------------+-----+
|(3,[2],[1.0])|       (1,[],[])|   (1,[0],[1.0])|(5,[2,4],[1.0,1.0])|  0.0|
|(3,[2],[1.0])|       (1,[],[])|   (1,[0],[1.0])|(5,[2,4],[1.0,1.0])|  0.0|
|(3,[2],[1.0])|       (1,[],[])|   (1,[0],[1.0])|(5,[2,4],[1.0,1.0])|  0.0|
|(3,[2],[1.0])|       (1,[],[])|   (1,[0],[1.0])|(5,[2,4],[1.0,1.0])|  0.0|
|(3,[2],[1.0])|       (1,[],[])|   (1,[0],[1.0])|(5,[2,4],[1.0,1.0])|  0.0|
+-------------+----------------+----------------+-------------------+-----+
only showing top 5 rows



In [45]:
# Split data into training and test datasets
training, test = cuse_df.randomSplit([0.8, 0.2], seed=123)

In [46]:
# Estimator
from pyspark.ml.classification import LogisticRegression
logr = LogisticRegression(featuresCol='features', labelCol='label')

In [47]:
# Parameter grid to tune model
from pyspark.ml.tuning import ParamGridBuilder
param_grid = ParamGridBuilder().\
    addGrid(logr.regParam, [0.0, 0.5, 1.0, 2.0]).\
    addGrid(logr.elasticNetParam, [0.0, 0.5, 1.0]).\
    build()

In [48]:
# Evaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")

In [49]:
# Cross-validation model
from pyspark.ml.tuning import CrossValidator
cv = CrossValidator(estimator=logr, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)

In [50]:
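# Fit cross-validation model
# Note: this fits on the full data set (cuse_df), which includes the test rows;
# fitting on `training` instead would keep the test evaluation unbiased
cv_model = cv.fit(cuse_df)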

In [51]:
show_columns = ['label', 'prediction', 'rawPrediction', 'probability']

In [52]:
# Prediction on training data
pred_training_cv = cv_model.transform(training)
pred_training_cv.select(show_columns).show(20, truncate=False)

+-----+----------+------------------------------------------+---------------------------------------+
|label|prediction|rawPrediction                             |probability                            |
+-----+----------+------------------------------------------+---------------------------------------+
|0.0  |1.0       |[-0.05602431718564116,0.05602431718564116]|[0.4859975829890087,0.5140024170109914]|
|0.0  |1.0       |[-0.05602431718564116,0.05602431718564116]|[0.4859975829890087,0.5140024170109914]|
|0.0  |1.0       |[-0.05602431718564116,0.05602431718564116]|[0.4859975829890087,0.5140024170109914]|
|0.0  |1.0       |[-0.05602431718564116,0.05602431718564116]|[0.4859975829890087,0.5140024170109914]|
|0.0  |1.0       |[-0.05602431718564116,0.05602431718564116]|[0.4859975829890087,0.5140024170109914]|
|0.0  |1.0       |[-0.05602431718564116,0.05602431718564116]|[0.4859975829890087,0.5140024170109914]|
|0.0  |1.0       |[-0.05602431718564116,0.05602431718564116]|[0.4859975829890087,0

In [53]:
# Prediction on test data
pred_test_cv = cv_model.transform(test)
pred_test_cv.select(show_columns).show(20, truncate=False)

+-----+----------+------------------------------------------+---------------------------------------+
|label|prediction|rawPrediction                             |probability                            |
+-----+----------+------------------------------------------+---------------------------------------+
|0.0  |1.0       |[-0.05602431718564116,0.05602431718564116]|[0.4859975829890087,0.5140024170109914]|
|0.0  |1.0       |[-0.05602431718564116,0.05602431718564116]|[0.4859975829890087,0.5140024170109914]|
|0.0  |1.0       |[-0.05602431718564116,0.05602431718564116]|[0.4859975829890087,0.5140024170109914]|
|0.0  |1.0       |[-0.05602431718564116,0.05602431718564116]|[0.4859975829890087,0.5140024170109914]|
|0.0  |1.0       |[-0.05602431718564116,0.05602431718564116]|[0.4859975829890087,0.5140024170109914]|
|0.0  |1.0       |[-0.05602431718564116,0.05602431718564116]|[0.4859975829890087,0.5140024170109914]|
|0.0  |1.0       |[-0.05602431718564116,0.05602431718564116]|[0.4859975829890087,0

## Classification with Decision trees


#### Processing categorical columns

#### Building Decision tree with cross-validation and parameter tuning

In [55]:
from pyspark.ml.classification import DecisionTreeClassifier
dt=DecisionTreeClassifier(featuresCol='features',labelCol='label')
# Parameter grid to tune model
from pyspark.ml.tuning import ParamGridBuilder
param_grid = ParamGridBuilder().\
    addGrid(dt.maxDepth, [2, 5, 6, 8]).\
    build()
# Evaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",metricName='areaUnderROC')
# Cross-validation model
from pyspark.ml.tuning import CrossValidator
cv = CrossValidator(estimator=dt, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)
cv_model=cv.fit(training)


In [56]:
# Prediction on training data
pred_training_cv = cv_model.transform(training)
pred_training_cv.select(show_columns).show(20, truncate=False)

+-----+----------+-------------+---------------------------------------+
|label|prediction|rawPrediction|probability                            |
+-----+----------+-------------+---------------------------------------+
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484,0.4817351598173516]|
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484,0.4817351598173516]|
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484,0.4817351598173516]|
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484,0.4817351598173516]|
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484,0.4817351598173516]|
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484,0.4817351598173516]|
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484,0.4817351598173516]|
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484,0.4817351598173516]|
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484,0.4817351598173516]|
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484,0.4817351598173516]|
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484

In [57]:
# Prediction on test data
pred_test_cv = cv_model.transform(test)
pred_test_cv.select(show_columns).show(20, truncate=False)

+-----+----------+-------------+---------------------------------------+
|label|prediction|rawPrediction|probability                            |
+-----+----------+-------------+---------------------------------------+
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484,0.4817351598173516]|
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484,0.4817351598173516]|
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484,0.4817351598173516]|
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484,0.4817351598173516]|
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484,0.4817351598173516]|
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484,0.4817351598173516]|
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484,0.4817351598173516]|
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484,0.4817351598173516]|
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484,0.4817351598173516]|
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484,0.4817351598173516]|
|0.0  |0.0       |[227.0,211.0]|[0.5182648401826484

In [58]:
label_and_pred=cv_model.transform(test).select('label','prediction')
label_and_pred.rdd.zipWithIndex().countByKey()

defaultdict(int,
            {Row(label=0.0, prediction=0.0): 216,
             Row(label=1.0, prediction=0.0): 101})

#### Confusion matrix
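
The counts returned by `countByKey()` above are the cells of a confusion matrix. A minimal sketch in plain Python, assuming that label 1.0 is the positive class, rebuilds the matrix from those counts and derives accuracy and recall from it. The function name `confusion_counts` is hypothetical.

```python
from collections import Counter

def confusion_counts(pairs):
    """Count how often each (label, prediction) pair occurs."""
    return Counter(pairs)

# Rebuild the (label, prediction) pairs from the counts printed above.
pairs = [(0.0, 0.0)] * 216 + [(1.0, 0.0)] * 101
counts = confusion_counts(pairs)

# Assumption: 1.0 is the positive class.
tn = counts[(0.0, 0.0)]  # true negatives
fp = counts[(0.0, 1.0)]  # false positives
fn = counts[(1.0, 0.0)]  # false negatives
tp = counts[(1.0, 1.0)]  # true positives

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn) if (tp + fn) else 0.0

print([[tn, fp], [fn, tp]])  # rows: actual 0, actual 1
print(accuracy, recall)
```

Here the tree predicts class 0 for every test row, so the accuracy only reflects the share of class 0 in the test set and the recall for class 1 is zero.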

## Classification with Gradient Boosting

#### Defining classifier with cross-validation and parameter tuning

In [60]:
from pyspark.ml.classification import GBTClassifier
gbt=GBTClassifier(featuresCol='features',labelCol='label')
# Parameter grid to tune model
from pyspark.ml.tuning import ParamGridBuilder
param_grid = ParamGridBuilder().\
    addGrid(gbt.maxDepth, [2, 5, 6, 8]).\
    addGrid(gbt.stepSize, [0.01,.1,.2]).\
    build()
# Evaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",metricName='areaUnderROC')
# Cross-validation model
from pyspark.ml.tuning import CrossValidator
cv = CrossValidator(estimator=gbt, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)
cv_model=cv.fit(training)


In [61]:
# Prediction on training data
pred_training_cv = cv_model.transform(training)
pred_training_cv.select(show_columns).show(20, truncate=False)

+-----+----------+------------------------------------------+---------------------------------------+
|label|prediction|rawPrediction                             |probability                            |
+-----+----------+------------------------------------------+---------------------------------------+
|0.0  |1.0       |[-0.04903348521826213,0.04903348521826213]|[0.4755028869062719,0.5244971130937282]|
|0.0  |1.0       |[-0.04903348521826213,0.04903348521826213]|[0.4755028869062719,0.5244971130937282]|
|0.0  |1.0       |[-0.04903348521826213,0.04903348521826213]|[0.4755028869062719,0.5244971130937282]|
|0.0  |1.0       |[-0.04903348521826213,0.04903348521826213]|[0.4755028869062719,0.5244971130937282]|
|0.0  |1.0       |[-0.04903348521826213,0.04903348521826213]|[0.4755028869062719,0.5244971130937282]|
|0.0  |1.0       |[-0.04903348521826213,0.04903348521826213]|[0.4755028869062719,0.5244971130937282]|
|0.0  |1.0       |[-0.04903348521826213,0.04903348521826213]|[0.4755028869062719,0

In [62]:
# Prediction on test data
pred_test_cv = cv_model.transform(test)
pred_test_cv.select(show_columns).show(20, truncate=False)

+-----+----------+------------------------------------------+---------------------------------------+
|label|prediction|rawPrediction                             |probability                            |
+-----+----------+------------------------------------------+---------------------------------------+
|0.0  |1.0       |[-0.04903348521826213,0.04903348521826213]|[0.4755028869062719,0.5244971130937282]|
|0.0  |1.0       |[-0.04903348521826213,0.04903348521826213]|[0.4755028869062719,0.5244971130937282]|
|0.0  |1.0       |[-0.04903348521826213,0.04903348521826213]|[0.4755028869062719,0.5244971130937282]|
|0.0  |1.0       |[-0.04903348521826213,0.04903348521826213]|[0.4755028869062719,0.5244971130937282]|
|0.0  |1.0       |[-0.04903348521826213,0.04903348521826213]|[0.4755028869062719,0.5244971130937282]|
|0.0  |1.0       |[-0.04903348521826213,0.04903348521826213]|[0.4755028869062719,0.5244971130937282]|
|0.0  |1.0       |[-0.04903348521826213,0.04903348521826213]|[0.4755028869062719,0

In [63]:
evaluator.setMetricName('areaUnderROC').evaluate(pred_training_cv)

0.6798976885183781

In [64]:
evaluator.setMetricName('areaUnderROC').evaluate(pred_test_cv)

0.687935460212688

In [65]:
cv_model.bestModel._java_obj.getMaxDepth()

2

In [67]:
cv_model.bestModel._java_obj.getStepSize()

0.2

#### Generating prediction

#### Prediction performance

# Recommendation system with Spark

### Definition of a problem

We want to develop a movie recommender system with Spark on data from MovieLens. MovieLens is a project developed by GroupLens, a research laboratory at the University of Minnesota. MovieLens provides an online movie recommender application that uses anonymously-collected data to improve recommender algorithms. 

To help people develop the best recommendation algorithms, MovieLens also released several data sets:

* A full data set that consists of more than 24 million ratings across more than 40,000 movies by more than 250,000 users.

* A small data set that is a subset of the full data set.

In our example we start with the small data set, so that interacting with the data, exploring it, and fixing errors is faster.



### Importing data

In [73]:
ratings=spark.read.csv('ml-latest-small/ratings.csv',header=True,inferSchema=True).repartition(4).cache()
ratings.show(4)

+------+-------+------+----------+
|userId|movieId|rating| timestamp|
+------+-------+------+----------+
|   450| 148372|   5.0|1475737053|
|   270|  76093|   3.0|1469306154|
|   625| 110102|   4.0|1452853116|
|   195|    112|   1.0| 977724281|
+------+-------+------+----------+
only showing top 4 rows



In [74]:
movies=spark.read.csv('ml-latest-small/movies.csv',header=True,inferSchema=True).repartition(4).cache()
movies.show(4,truncate=False)

+-------+-----------------------------------------+------------------------------------+
|movieId|title                                    |genres                              |
+-------+-----------------------------------------+------------------------------------+
|8754   |Prime of Miss Jean Brodie, The (1969)    |Drama                               |
|111486 |Lesson of the Evil (Aku no kyôten) (2012)|Thriller                            |
|1033   |Fox and the Hound, The (1981)            |Animation|Children|Drama            |
|6536   |Sinbad: Legend of the Seven Seas (2003)  |Adventure|Animation|Children|Fantasy|
+-------+-----------------------------------------+------------------------------------+
only showing top 4 rows



In [72]:
print(sc.defaultParallelism)

2


### Spark settings

### Data exploration

In [75]:
ratings.describe().show()

+-------+-----------------+------------------+------------------+--------------------+
|summary|           userId|           movieId|            rating|           timestamp|
+-------+-----------------+------------------+------------------+--------------------+
|  count|           100004|            100004|            100004|              100004|
|   mean|347.0113095476181|12548.664363425463| 3.543608255669773|1.1296390869392424E9|
| stddev|195.1638379781956| 26369.19896881519|1.0580641091070395|1.9168582602710992E8|
|    min|                1|                 1|               0.5|           789652009|
|    max|              671|            163949|               5.0|          1476640644|
+-------+-----------------+------------------+------------------+--------------------+



In [76]:
ratings.select('userId').distinct().count()

671

In [78]:
ratings.filter('rating==1').select('movieId').distinct().count()

1959

In [79]:
ratings.createOrReplaceTempView('ratings')
spark.sql("SELECT COUNT(DISTINCT(movieId)) AS losers FROM ratings WHERE rating==1").show()

+------+
|losers|
+------+
|  1959|
+------+



### Recommendation system

Recommender systems are a family of methods that filter a large information space in order to recommend items that the user has not yet observed. Here the information space is the set of all available items that the user could choose, and the observation space is what the user has experienced or observed so far.


There are different methods for building a recommender system, such as user-based, content-based, or collaborative filtering.



#### Collaborative filtering


#### Advantages over content based methods

#### Disadvantages over content based methods






#### Alternating Least Squares Algorithm

We start from the user-by-item rating matrix $Q$ with entries
$$q_{ui} = \begin{cases}
r & \text{if user } u \text{ rated item } i \\
0 & \text{if user } u \text{ did not rate item } i
\end{cases}$$

where $r$ is the rating value. If we have $m$ users and $n$ items, we want to learn a matrix of factors that represents the movies: each movie gets a factor vector, which is how we represent the movie in the feature space. Note that we do not have any knowledge of the category of the movie at this point. We also want to learn a factor vector for each user in the same way. This gives a factor matrix for users, $X \in \mathbb{R}^{m \times f}$ (each user is a row vector $x_u$), and a factor matrix for movies, $Y \in \mathbb{R}^{f \times n}$ (each movie is a column vector $y_i$), where $f$ is the number of latent factors.

Both $X$ and $Y$ are unknown, so we adopt an alternating least squares approach with regularization: we first estimate $Y$ using $X$, then estimate $X$ using $Y$. After enough iterations we aim to reach a convergence point where the matrices $X$ and $Y$ no longer change, or change very little.

There is one problem with the data: we have complete ratings neither for every user nor for every item, which is also why we are trying to build the recommendation engine in the first place. We therefore give the unrated movies zero weight in the update rule. That way we rely only on the movies that users have rated and make no assumption about the movies that are not rated. Let us define the weights $w_{ui}$ as:
$$w_{ui} = \begin{cases}
0 & \text{if } q_{ui} = 0 \\
1 & \text{otherwise}
\end{cases}$$
The cost functions that we are trying to minimize are then:
$$J(x_u) = (q_u - x_u Y) W_u (q_u - x_u Y)^T + \lambda x_u x_u^T$$
$$J(y_i) = (q_i - X y_i)^T W_i (q_i - X y_i) + \lambda y_i^T y_i$$

where $q_u \in \mathbb{R}^{1 \times n}$ is row $u$ of $Q$, $q_i \in \mathbb{R}^{m \times 1}$ is column $i$ of $Q$, and $W_u \in \mathbb{R}^{n \times n}$ and $W_i \in \mathbb{R}^{m \times m}$ are diagonal matrices that hold the weights $w_{ui}$ for user $u$ and for item $i$.

Note that we need the regularization terms to avoid overfitting the data. Ideally, the regularization parameter $\lambda$ is tuned using cross-validation so that the algorithm generalizes better. In this post, I will use the whole dataset. Setting the gradients to zero gives the solutions for the factor vectors:
$$x_u^T = (Y W_u Y^T + \lambda I)^{-1} Y W_u q_u^T$$
$$y_i = (X^T W_i X + \lambda I)^{-1} X^T W_i q_i$$

That is essentially the whole algorithm. For a more restrictive regularization, we could also incorporate both factor matrices in each update rule, which may generalize better.
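
To see the alternating updates in action, here is a minimal pure-Python sketch for the special case of a single latent factor ($f = 1$). With one factor the matrix inverses in the update rules reduce to scalar divisions. This is a simplification for illustration, not Spark's implementation; the function names, the toy rating matrix, and the values of `num_iters` and `reg` are all assumptions.

```python
import math

def als_rank1(ratings, num_iters=50, reg=0.01):
    """Rank-1 alternating least squares on a dense list-of-lists matrix.

    A rating of 0 means "not rated" and gets weight 0, as in the text.
    """
    m, n = len(ratings), len(ratings[0])
    weights = [[1.0 if r != 0 else 0.0 for r in row] for row in ratings]
    x = [1.0] * m  # one factor per user
    y = [1.0] * n  # one factor per item
    for _ in range(num_iters):
        # Fix y and solve the weighted, regularized problem for each user.
        for u in range(m):
            num = sum(weights[u][i] * y[i] * ratings[u][i] for i in range(n))
            den = sum(weights[u][i] * y[i] ** 2 for i in range(n)) + reg
            x[u] = num / den
        # Fix x and solve the same kind of problem for each item.
        for i in range(n):
            num = sum(weights[u][i] * x[u] * ratings[u][i] for u in range(m))
            den = sum(weights[u][i] * x[u] ** 2 for u in range(m)) + reg
            y[i] = num / den
    return x, y

def observed_rmse(ratings, x, y):
    """RMSE over the rated cells only."""
    errors = [(r - x[u] * y[i]) ** 2
              for u, row in enumerate(ratings)
              for i, r in enumerate(row) if r != 0]
    return math.sqrt(sum(errors) / len(errors))

# Toy matrix: 3 users, 2 items; user 2 has not rated item 1 (stored as 0).
toy_ratings = [[1.0, 2.0],
               [2.0, 4.0],
               [3.0, 0.0]]
x, y = als_rank1(toy_ratings)
print(observed_rmse(toy_ratings, x, y))
print(x[2] * y[1])  # predicted rating for the missing cell
```

The observed cells follow a rank-1 pattern (each row is a multiple of `[1, 2]`), so the model can reconstruct them closely and extrapolate the missing cell from the pattern in the other rows.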

#### Building the recommender system

In our example, the ALS algorithm will create a matrix of all users versus all movies. Most cells in the matrix will be empty. An empty cell means the user hasn't reviewed the movie yet. The ALS algorithm will fill in the probable (predicted) ratings, based on similarities between user ratings. The algorithm uses the least squares computation to minimize the estimation errors, and alternates between solving for movie factors and solving for user factors.

In [80]:
from pyspark.ml.recommendation import ALS
model=ALS(userCol="userId",itemCol="movieId",ratingCol="rating").fit(ratings)

In [83]:
predictions=model.transform(ratings)
predictions.show()

+------+-------+------+----------+----------+
|userId|movieId|rating| timestamp|prediction|
+------+-------+------+----------+----------+
|   575|    148|   4.0|1012605106|  3.925006|
|   232|    463|   4.0| 955089443|  3.840804|
|   452|    463|   2.0| 976424451| 2.5061953|
|   380|    463|   3.0| 968949106| 2.9723182|
|   534|    463|   4.0| 973377486|  3.884037|
|   242|    463|   4.0| 956685706| 3.8402894|
|    30|    463|   4.0| 945277405| 3.5451906|
|   311|    463|   3.0| 898008246| 2.9873717|
|    85|    471|   3.0| 837512312|  2.872746|
|   588|    471|   3.0| 842298526| 3.6434338|
|   126|    471|   5.0| 833287141| 3.9697504|
|   460|    471|   5.0|1072836030|  4.082004|
|   350|    471|   3.0|1011714986| 3.5613809|
|   548|    471|   4.0| 857407799| 3.4402337|
|   602|    471|   3.0| 842357922| 4.1293125|
|   285|    471|   5.0| 965092130|  3.846668|
|   274|    471|   5.0|1074104142| 3.8721604|
|   440|    471|   3.0| 835337519| 3.3404179|
|    86|    471|   4.0| 848161161|

#### Algorithm Evaluation on training data
We need to evaluate the performance of the model by comparing the predicted values with the original values. For numeric predictions such as ratings, RMSE is a good measure of fit. Use RegressionEvaluator with the rmse metric to compare the continuous values. The root mean squared error (RMSE) is the square root of the average squared error between the predictions and the existing data. The lower the RMSE value, the more accurate the model.
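
As a small worked example of the metric, the sketch below computes RMSE in plain Python for the first five rating/prediction pairs of the table above. The helper name `rmse` is ours and is not part of PySpark, and five rows will not reproduce the value computed over the whole data set.

```python
import math

def rmse(labels, predictions):
    """Square root of the mean of the squared errors."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# First five rows of the prediction table above.
ratings_sample = [4.0, 4.0, 2.0, 3.0, 4.0]
predictions_sample = [3.925006, 3.840804, 2.5061953, 2.9723182, 3.884037]
print(rmse(ratings_sample, predictions_sample))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.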


In [84]:
from pyspark.ml.evaluation import RegressionEvaluator
evaluator=RegressionEvaluator(metricName="rmse",labelCol="rating",predictionCol="prediction")
evaluator.evaluate(predictions)

0.6121762502409693

In [86]:
training, test=ratings.randomSplit([80.0,20.0])

In [87]:
model=ALS(userCol="userId",itemCol="movieId",ratingCol="rating").fit(training)
predictions=model.transform(test)
predictions.show()

+------+-------+------+----------+----------+
|userId|movieId|rating| timestamp|prediction|
+------+-------+------+----------+----------+
|   380|    463|   3.0| 968949106|  2.810031|
|   534|    463|   4.0| 973377486| 3.5423799|
|    30|    463|   4.0| 945277405| 3.1847305|
|    85|    471|   3.0| 837512312|  3.182696|
|   274|    471|   5.0|1074104142| 3.3121417|
|   440|    471|   3.0| 835337519| 3.6286314|
|   491|    471|   3.0| 940797129| 4.0265474|
|   452|    471|   3.0| 976422396| 3.4104712|
|    92|    471|   4.0| 848526594|  4.034794|
|   607|    471|   4.0|1118247731| 3.7384584|
|   358|    471|   5.0| 957479605| 3.9340591|
|   502|    471|   4.0| 861322541|  4.155453|
|   514|    471|   4.0| 853893788| 4.2163954|
|   195|    471|   3.0| 976289176| 3.5753496|
|    30|    471|   4.0| 945112993| 3.5556245|
|   399|    471|   5.0| 841562601| 3.4645963|
|   509|    496|   3.0| 940013481|  2.124097|
|   133|   1088|   1.5|1416166508| 2.3406138|
|   111|   1088|   3.5|1097431651|

In [88]:
evaluator=RegressionEvaluator(metricName="rmse",labelCol="rating",predictionCol="prediction")
evaluator.evaluate(predictions)

nan

#### Algorithm Evaluation on test data
To find out how our algorithm works on unseen cases we need to split the ratings data set into a training set and a test set. Then we rerun the steps: train the model on the training set, run it on the test set, and evaluate the performance.

#### Handling NaN results
A NaN result is related to the SPARK-14489 issue: the model cannot predict values for users or items for which it has no data.

This typically occurs in two scenarios:

* In production, for new users or items that have no rating history and on which the model has not been trained (this is the “cold start problem”).
* During cross-validation, the data is split between training and evaluation sets. When using simple random splits as in Spark’s CrossValidator or TrainValidationSplit, it is actually very common to encounter users and/or items in the evaluation set that are not in the training set.

By default, Spark assigns NaN predictions during ALSModel.transform when a user and/or item factor is not present in the model. This can be useful in a production system, since it indicates a new user or item, and so the system can make a decision on some fallback to use as the prediction.

During cross-validation this is not acceptable, since any NaN predicted value results in a NaN evaluation metric and makes model selection impossible.

Spark allows users to set the coldStartStrategy parameter to “drop” in order to drop any rows in the DataFrame of predictions that contain NaN values. The evaluation metric is then computed over the non-NaN data.
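
The effect of a single NaN prediction on the metric, and of dropping it, can be reproduced in a few lines of plain Python. This is only a sketch of the idea behind the “drop” strategy: the helper names are hypothetical and the three (rating, prediction) pairs are made up.

```python
import math

def rmse(pairs):
    """RMSE over (label, prediction) pairs."""
    return math.sqrt(sum((label - pred) ** 2 for label, pred in pairs) / len(pairs))

def drop_nan(pairs):
    """Keep only rows whose prediction is a real number."""
    return [(label, pred) for label, pred in pairs if not math.isnan(pred)]

# Toy pairs; the NaN stands for a user or movie that was not in the training set.
pairs = [(3.0, 2.8), (4.0, 3.5), (5.0, float("nan"))]

print(math.isnan(rmse(pairs)))  # one NaN row makes the whole metric NaN
print(rmse(drop_nan(pairs)))    # metric over the remaining rows
```

Note that the metric computed after dropping rows says nothing about the dropped users or items, so a production system still needs a fallback for them.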


In [89]:
model=ALS(userCol="userId",itemCol="movieId",ratingCol="rating",coldStartStrategy="drop").fit(training)
predictions=model.transform(test)
predictions.show()

evaluator=RegressionEvaluator(metricName="rmse",labelCol="rating",predictionCol="prediction")
evaluator.evaluate(predictions)

+------+-------+------+----------+----------+
|userId|movieId|rating| timestamp|prediction|
+------+-------+------+----------+----------+
|   380|    463|   3.0| 968949106|  2.810031|
|   534|    463|   4.0| 973377486| 3.5423799|
|    30|    463|   4.0| 945277405| 3.1847305|
|    85|    471|   3.0| 837512312|  3.182696|
|   274|    471|   5.0|1074104142| 3.3121417|
|   440|    471|   3.0| 835337519| 3.6286314|
|   491|    471|   3.0| 940797129| 4.0265474|
|   452|    471|   3.0| 976422396| 3.4104712|
|    92|    471|   4.0| 848526594|  4.034794|
|   607|    471|   4.0|1118247731| 3.7384584|
|   358|    471|   5.0| 957479605| 3.9340591|
|   502|    471|   4.0| 861322541|  4.155453|
|   514|    471|   4.0| 853893788| 4.2163954|
|   195|    471|   3.0| 976289176| 3.5753496|
|    30|    471|   4.0| 945112993| 3.5556245|
|   399|    471|   5.0| 841562601| 3.4645963|
|   509|    496|   3.0| 940013481|  2.124097|
|   133|   1088|   1.5|1416166508| 2.3406138|
|   111|   1088|   3.5|1097431651|

0.918654025287006

#### Tuning model 

What are the hyperparameters we can use to tune our ALS model?
There are three such hyperparameters in our case:
* rank = the number of latent factors in the model
* maxIter = the maximum number of iterations
* regParam = the regularization parameter

To find the best values we can test several values for those hyperparameters and choose the best configuration.
As before, we define a grid of parameter combinations and use Spark's CrossValidator to run a grid search over it, evaluating the resulting models and comparing their performance. The grid can hold multiple values for rank and regParam; the cell below varies only rank.

In [91]:
model=ALS(userCol="userId",itemCol="movieId",ratingCol="rating",coldStartStrategy="drop")
param_grid = ParamGridBuilder().\
    addGrid(model.rank, [8,10,12]).\
    build()

evaluator=RegressionEvaluator(metricName="rmse",labelCol="rating",predictionCol="prediction")
crossval=CrossValidator(estimator=model,estimatorParamMaps=param_grid,evaluator=evaluator,numFolds=2)
cv_model=crossval.fit(training)


predictions_train=cv_model.transform(training)
predictions_test=cv_model.transform(test)



In [92]:
evaluator.setMetricName('rmse').evaluate(predictions_train)

0.5601740747315376

In [93]:
evaluator.setMetricName('rmse').evaluate(predictions_test)

0.9180136869757596

#### Deployment to recommend movies