### grp

# Spark: The Definitive Guide

## PART 6: Advanced Analytics and Machine Learning 

## dataPaths

In [1]:
recs = '/Users/grp/sparkTheDefinitiveGuide/data/sample_movielens_ratings.txt'

## _Chapter #28 - Recommendation_

-  study of people's preferences (ratings and observed behaviors)
-  explicit feedback:
    -  numerical rating that the model aims to predict (ex: 4 out of 5 stars)
-  implicit feedback:
    -  ratings represent the strength of interactions observed between user and item, i.e. the level of confidence in the user's preference for that item (ex: # of visits/clicks on a particular webpage)

### Use Cases:
-  recommendations (ex: movies, courses, products)

### MLlib Recommendation Models:
-  ALS (Alternating Least Squares)
-  FPM (Frequent Pattern Mining)

### Evaluators:
-  aim to minimize the total difference between predicted ratings and the true ratings given by users
-  good to set coldStartStrategy to "drop" instead of the default "nan" while evaluating (NaN predictions would otherwise make the metric NaN), then switch back to "nan" when making predictions in production
-  regression evaluator (RegressionEvaluator) expects a "predicted value" column and a "true value" column
-  regression metrics (RegressionMetrics) supported:
    -  RMSE (root mean squared error)
    -  MSE (mean squared error)
    -  R2 (r squared)
    -  MAE (mean absolute error)
    -  EXPLAINED VARIANCE
-  ranking metrics (RankingMetrics):
    -  compares recommendations with the actual set of ratings from the user
    -  does not focus on the rating value but on whether the algorithm recommends items the user has already ranked
    -  MEAN AVERAGE PRECISION
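
A minimal pure-Python sketch of how the regression metrics listed above are computed from (prediction, label) pairs. The function name `regression_metrics` and the sample values are made up for illustration; in Spark these numbers would come from RegressionEvaluator or RegressionMetrics. Explained variance is left out to keep the sketch short.

```python
import math

def regression_metrics(pairs):
    """Compute MSE, RMSE, MAE and R2 from a list of (prediction, label) pairs."""
    n = len(pairs)
    errors = [pred - label for pred, label in pairs]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_label = sum(label for _, label in pairs) / n
    # total sum of squares; assumes the labels are not all identical
    ss_tot = sum((label - mean_label) ** 2 for _, label in pairs)
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": mae,
        "r2": 1.0 - ss_res / ss_tot,
    }

# Made-up (prediction, label) pairs
sample = [(3.5, 4.0), (2.0, 1.0), (4.5, 5.0), (3.0, 3.0)]
metrics = regression_metrics(sample)
print(metrics)
```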

### Model Configuration:
-  Model Hyperparameters (structure of how model can be initialized)
-  Training Parameters (structure of how model can be trained)
-  Prediction Parameters (structure of how model makes predictions)
-  Model Summary (provides information about final trained model)

### _Chapter #28 Exercises (Rec)_

In [2]:
data = spark.read.text(recs)
data.printSchema()
data.show(3)

root
 |-- value: string (nullable = true)

+-------------------+
|              value|
+-------------------+
|0::2::3::1424380312|
|0::3::1::1424380312|
|0::5::2::1424380312|
+-------------------+
only showing top 3 rows
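
Each raw line packs four fields separated by `::`. A plain-Python sketch of the parsing, assuming the field order userId, movieId, rating, timestamp that the notebook uses when it casts the columns; the helper name `parse_rating` is hypothetical.

```python
def parse_rating(line):
    """Split 'userId::movieId::rating::timestamp' into typed fields."""
    user_id, movie_id, rating, timestamp = line.split("::")
    return {
        "userId": int(user_id),
        "movieId": int(movie_id),
        "rating": float(rating),
        "timestamp": int(timestamp),
    }

# First line of the sample file shown above
row = parse_rating("0::2::3::1424380312")
print(row)
```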



## ALS:
-  finds a K-dimensional feature vector for each user and each item; the dot product of a user's and an item's vectors approximates that user's rating for the item
    -  input dataset schema:
        -  user ID column
        -  item ID column
        -  rating column
    -  performs collaborative filtering:
        -  makes recommendations based only on which items users interacted with in the past
    -  "cold start problem":
        -  occurs when users or items appear that were not in the training set, so the algorithm has nothing to base a recommendation on
-  Model Hyperparameters:
    -  rank:
        -  determines dimension of feature vectors learned for users and items
        -  must be tuned accordingly
        -  too high a rank may cause overfitting
        -  too low a rank may underfit and produce poor predictions
    -  alpha:
        -  sets baseline confidence for preference from implicit feedback
    -  regParam:
        -  controls regularization to prevent overfitting
    -  implicitPrefs:
        -  TRUE (sets implicit behavior) or FALSE (sets explicit behavior; set as default)
    -  nonnegative:
        -  TRUE (places non-negative constraints on the least-squares problem and only returns non-negative feature vectors) or FALSE (set as default)
-  Training Parameters:
    -  groups data into blocks that are distributed across cluster
    -  amount of data in each block can have significant impact on training time and performance
    -  rule of thumb is to aim for approximately 1 to 5 million ratings per block
    -  numUserBlocks:
        -  determines how many blocks to split the users into (default is 10)
    -  numItemBlocks:
        -  determines how many blocks to split the items into (default is 10)
    -  maxIter:
        -  total # of iterations over the data before stopping
        -  should be increased if the objective history has not flattened out or still swings between iterations
    -  checkpointInterval:
        -  saves model state during training so it can recover quickly from node failures
    -  seed:
        -  random seed for reproducing the same results
-  Prediction Parameters:
    -  coldStartStrategy:
        -  determines what the model should predict for users or items that did not appear in training set
        -  by default, assigns NaN as the prediction when a user or item is not present in the model
        -  can set parameter to "drop" to drop any rows in DF of predictions that contain NaN values
        -  good to set coldStartStrategy to "drop" instead of "nan" while evaluating, then switch back to "nan" when making predictions in production
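
A small sketch of the prediction step described above: a rating estimate is the dot product of a user's and an item's feature vectors, and a user or item without a vector yields NaN (the cold start case). The factor values, IDs, and the `predict` helper are hypothetical; a trained ALS model learns the real vectors.

```python
import math

# Hypothetical rank-2 feature vectors; a real model learns these during training.
user_factors = {0: [1.0, 0.5], 1: [0.2, 1.5]}
item_factors = {10: [2.0, 1.0], 11: [0.5, 2.0]}

def predict(user_id, item_id):
    """Dot product of user and item factors, or NaN if either one is unseen."""
    u = user_factors.get(user_id)
    v = item_factors.get(item_id)
    if u is None or v is None:
        return float("nan")  # mirrors coldStartStrategy = "nan"
    return sum(a * b for a, b in zip(u, v))

pairs = [(0, 10), (1, 11), (2, 10)]  # user 2 never appeared in training
scored = [(u, i, predict(u, i)) for u, i in pairs]

# Equivalent of coldStartStrategy = "drop": discard rows whose prediction is NaN.
kept = [row for row in scored if not math.isnan(row[2])]
print(scored)
print(kept)
```

Dropping the NaN rows before evaluation keeps a metric such as RMSE from becoming NaN itself.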

In [3]:
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row

In [4]:
ratings = spark.read.text(recs)\
.selectExpr("split(value , '::') as col")\
.selectExpr(
    "cast(col[0] as int) as userId",
    "cast(col[1] as int) as movieId",
    "cast(col[2] as float) as rating",
    "cast(col[3] as long) as timestamp")

training, test = ratings.randomSplit([0.8, 0.2])
als = ALS()\
.setMaxIter(5)\
.setRegParam(0.01)\
.setUserCol("userId")\
.setItemCol("movieId")\
.setRatingCol("rating")

print(als.explainParams())

alsModel = als.fit(training)
predictions = alsModel.transform(test)

alpha: alpha for implicit preference (default: 1.0)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
coldStartStrategy: strategy for dealing with unknown or new users/items at prediction time. This may be useful in cross-validation or production scenarios, for handling user/item ids the model has not seen in the training data. Supported values: 'nan', 'drop'. (default: nan)
finalStorageLevel: StorageLevel for ALS model factors. (default: MEMORY_AND_DISK)
implicitPrefs: whether to use implicit preference (default: False)
intermediateStorageLevel: StorageLevel for intermediate datasets. Cannot be 'NONE'. (default: MEMORY_AND_DISK)
itemCol: column name for item ids. Ids must be within the integer value range. (default: item, current: movieId)
maxIter: max number of iterations (>= 0

In [5]:
ratings.printSchema()
ratings.show(3)

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: float (nullable = true)
 |-- timestamp: long (nullable = true)

+------+-------+------+----------+
|userId|movieId|rating| timestamp|
+------+-------+------+----------+
|     0|      2|   3.0|1424380312|
|     0|      3|   1.0|1424380312|
|     0|      5|   2.0|1424380312|
+------+-------+------+----------+
only showing top 3 rows



In [6]:
# shows top K recommendations for each user or movie
alsModel.recommendForAllUsers(10).selectExpr("userId", "explode(recommendations)").show(3)
alsModel.recommendForAllItems(10).selectExpr("movieId", "explode(recommendations)").show(3)

+------+---------------+
|userId|            col|
+------+---------------+
|    28|[92, 4.9692082]|
|    28| [12, 4.601353]|
|    28|[89, 4.2311087]|
+------+---------------+
only showing top 3 rows

+-------+---------------+
|movieId|            col|
+-------+---------------+
|     31|[12, 3.8597145]|
|     31| [6, 3.5119514]|
|     31|[10, 3.3996582]|
+-------+---------------+
only showing top 3 rows
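
Conceptually, a top-K recommendation keeps the K items with the highest predicted score for a user. A plain-Python sketch of that selection; the `top_k` helper and the scores are made up for illustration and say nothing about how Spark computes this internally.

```python
import heapq

def top_k(scores, k):
    """Return the k (item, score) pairs with the highest predicted score."""
    return heapq.nlargest(k, scores.items(), key=lambda pair: pair[1])

# Made-up predicted ratings for one user, keyed by movieId
user_scores = {1: 4.9, 2: 3.1, 3: 4.2, 4: 1.5, 5: 3.8}
best = top_k(user_scores, 3)
print(best)
```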



### _Evaluation Metrics (Regression & Ranking) Example_

In [7]:
from pyspark.ml.evaluation import RegressionEvaluator

In [8]:
evaluator = RegressionEvaluator()\
.setMetricName("rmse")\
.setLabelCol("rating")\
.setPredictionCol("prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = %f" % rmse)

Root-mean-square error = 1.682634


In [9]:
from pyspark.mllib.evaluation import RegressionMetrics

In [10]:
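# RegressionMetrics expects (prediction, observation) pairs; Rows are indexed with [], not ()
regComparison = predictions.select("prediction", "rating")\
.rdd.map(lambda x: (float(x[0]), float(x[1])))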
metrics = RegressionMetrics(regComparison)

In [11]:
from pyspark.mllib.evaluation import RankingMetrics
from pyspark.sql.functions import col, expr

In [12]:
perUserActual = predictions\
.where("rating > 2.5")\
.groupBy("userId")\
.agg(expr("collect_set(movieId) as movies"))

perUserPredictions = predictions\
.orderBy(col("userId"), expr("prediction DESC"))\
.groupBy("userId")\
.agg(expr("collect_list(movieId) as movies"))

perUserActualvPred = perUserActual.join(perUserPredictions, ["userId"]).rdd.map(lambda row: (row[1], row[2][:15]))
ranks = RankingMetrics(perUserActualvPred)

print(ranks.meanAveragePrecision)
print(ranks.precisionAt(5))

0.2548396048396048
0.49999999999999994
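
To make the two ranking numbers more concrete, here is a pure-Python sketch of precision at k and average precision for a single user (mean average precision is the mean of the latter over all users). The helpers take the predicted ranking first and the ground-truth set second. Both function names and the sample lists are hypothetical, and the normalization choices (dividing by k, and by the size of the actual set) are assumptions of this sketch.

```python
def precision_at_k(predicted, actual, k):
    """Fraction of the first k predicted items that appear in the actual set."""
    hits = sum(1 for item in predicted[:k] if item in actual)
    return hits / k

def average_precision(predicted, actual):
    """Sum of precision at each rank holding a relevant item, divided by len(actual)."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(predicted, start=1):
        if item in actual:
            hits += 1
            total += hits / rank
    return total / len(actual) if actual else 0.0

predicted = [5, 2, 9, 7]   # made-up ranked recommendations for one user
actual = {2, 7, 8}         # made-up items the user actually liked
p_at_2 = precision_at_k(predicted, actual, 2)
ap = average_precision(predicted, actual)
print(p_at_2, ap)
```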


## Frequent Pattern Mining:
-  aka Market Basket Analysis
-  finds association rules from raw data
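
A brute-force sketch of what an association rule measures, using made-up baskets. It counts single items and item pairs directly, which is not how a scalable frequent pattern miner works; the `confidence` helper and the support/confidence definitions used here are the standard textbook ones, stated as assumptions rather than taken from the notes above.

```python
from collections import Counter
from itertools import combinations

# Made-up baskets; each set is one transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# How many baskets contain each item, and each (alphabetically sorted) item pair
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

def confidence(antecedent, consequent):
    """Share of baskets containing the antecedent that also contain the consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]

# Support: share of all baskets containing both items
support_bread_milk = pair_counts[("bread", "milk")] / len(baskets)
conf_bread_to_milk = confidence("bread", "milk")
print(support_bread_milk, conf_bread_to_milk)
```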
