### grp

# Spark: The Definitive Guide

## PART 6: Advanced Analytics and Machine Learning 

## dataPaths

In [25]:
reg = '/Users/grp/sparkTheDefinitiveGuide/data/regression/'

## _Chapter #27 - Regression_

-  predicts a real number (continuous variable) from a set of features (in numerical form)
-  infinite # of possible output values
-  evaluated with an error metric (distance between predicted and true values) rather than the accuracy metric used for classification

### Use Cases:
-  entertainment views [predict # of views for a particular service]
-  company revenue [predict how much revenue a company will make in the future]
-  resource yield [predict total resource yield for a particular region]

### MLlib Regression Models:
-  Linear Regression
-  Generalized Linear Regression
-  Isotonic Regression
-  Decision Trees
-  Random Forest
-  Gradient-Boosted Trees
-  Survival Regression

### Evaluators:
-  regression evaluator (RegressionEvaluator) expects a "predicted value" and a "true value"
-  metrics (RegressionMetrics) supported:
    -  RMSE (root mean squared error)
    -  MSE (mean squared error)
    -  R2 (r squared)
    -  MAE (mean absolute error)
    -  EXPLAINED VARIANCE 
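
As a rough illustration of what these metrics measure, the sketch below computes MSE, RMSE, MAE and R2 by hand in plain Python. It is not the `RegressionEvaluator` or `RegressionMetrics` implementation; the function name and the toy values are made up for this example, and explained variance is left out.

```python
import math

def regression_metrics(predictions, labels):
    """Return MSE, RMSE, MAE and R2 for paired predictions and labels."""
    n = len(labels)
    errors = [y - p for p, y in zip(predictions, labels)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}

# Made-up values for illustration only (not taken from the notebook's dataset)
toy_predictions = [2.5, 0.0, 2.0, 8.0]
toy_labels = [3.0, -0.5, 2.0, 7.0]
toy_metrics = regression_metrics(toy_predictions, toy_labels)
print(toy_metrics)
```

RMSE, MSE and MAE are errors (lower is better), while R2 is a measure of fit (higher is better).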

### Model Configuration:
-  Model Hyperparameters (structure of how model can be initialized)
-  Training Parameters (structure of how model can be trained)
-  Prediction Parameters (structure of how model makes predictions)
-  Model Summary (provides information about final trained model)

### Training Summary:
- these metrics describe how well the fitted line matches the training data
    -  Residuals:
        -  one value per training row: the label minus the value the model predicts for that row
        -  "**difference between label and the predicted value**"
    -  Objective History:
        -  shows the value of the training objective (loss) at each iteration
    -  Root Mean Squared Error:
        -  measures how well line is fitting the data (by looking at the distance between each predicted value and actual value in dataset)
    -  R-Squared:
        -  measures the proportion of variance of the predicted variable captured by the model
        -  "**coefficient of determination; measure of fit**"

### _Chapter #27 Exercises (Reg)_

In [26]:
df = spark.read.load(reg)
df.printSchema()
df.show(3)

root
 |-- features: vector (nullable = true)
 |-- label: double (nullable = true)

+--------------+-----+
|      features|label|
+--------------+-----+
|[3.0,10.1,3.0]|  2.0|
| [2.0,1.1,1.0]|  1.0|
|[1.0,0.1,-1.0]|  0.0|
+--------------+-----+
only showing top 3 rows



## Linear Regression:
-  assumes the output is a linear combination of the input features (sum of each feature multiplied by a weight) plus Gaussian error
-  implements ElasticNet regularization with mix of L1 and L2 regularization
    -  Model Hyperparameters:
        -  family:
            -  error distribution is always Gaussian; LinearRegression itself exposes no family parameter (none appears in the explainParams output), unlike GeneralizedLinearRegression
        -  elasticNetParam:
            -  float value from 0 to 1
            -  mix of L1 and L2 regularization
                -  L1 (value of 1):
                    -  creates sparsity in model because certain feature weights will be driven to exactly zero (no significance to output)
                -  L2 (value of 0):
                    -  does not create sparsity because feature weights are shrunk toward zero but generally never reach exactly zero
        -  fitIntercept:
            -  either TRUE or FALSE
            -  determines whether to fit intercept to linear combination of inputs and weights of model
            -  recommended to fit intercept if the training data has not been normalized
        -  regParam:
            -  determines how much weight to give to regularization
            -  recommended to use wide range of values (0, 0.01, 0.1, 1)
        -  standardization:
            -  either TRUE or FALSE
            -  decides whether to standardize the inputs before passing into model
    -  Training Parameters:
        -  maxIter:
            -  maximum number of iterations over the data before stopping (default value is usually best)
        -  tol:
            -  convergence tolerance; training stops before maxIter once the weights are considered optimized (default value is usually best)
        -  weightCol:
            -  name of a column of per-row weights, used to weigh certain rows more than others
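
To make the roles of `regParam` and `elasticNetParam` concrete, here is a minimal sketch of the elastic net penalty term in the form described in Spark's linear methods documentation: `regParam * (elasticNetParam * L1 + (1 - elasticNetParam) / 2 * L2^2)`. The function name and the weight vector are hypothetical; only the values 0.3 and 0.8 match the cell that follows.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the loss: a reg_param-scaled mix of L1 and L2 terms."""
    l1 = sum(abs(w) for w in weights)        # L1 norm of the weights
    l2_squared = sum(w * w for w in weights)  # squared L2 norm of the weights
    mix = elastic_net_param * l1 + (1.0 - elastic_net_param) * 0.5 * l2_squared
    return reg_param * mix

toy_weights = [0.5, -2.0, 0.0]  # hypothetical weights, for illustration only

penalty_mixed = elastic_net_penalty(toy_weights, reg_param=0.3, elastic_net_param=0.8)
penalty_l1 = elastic_net_penalty(toy_weights, reg_param=0.3, elastic_net_param=1.0)
penalty_l2 = elastic_net_penalty(toy_weights, reg_param=0.3, elastic_net_param=0.0)
print(penalty_mixed, penalty_l1, penalty_l2)
```

Setting `elasticNetParam` to 1 leaves only the L1 term and setting it to 0 leaves only the L2 term, while `regParam` scales the whole penalty relative to the data-fitting loss.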

In [27]:
from pyspark.ml.regression import LinearRegression

In [28]:
lr = LinearRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
print(lr.explainParams())
lrModel = lr.fit(df)

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0, current: 0.8)
epsilon: The shape parameter to control the amount of robustness. Must be > 1.0. Only valid when loss is huber (default: 1.35)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label)
loss: The loss function to be optimized. Supported options: squaredError, huber. (default: squaredError)
maxIter: max number of iterations (>= 0). (default: 100, current: 10)
predictionCol: prediction column name. (default: prediction)
regParam: regularization parameter (>= 0). (default: 0.0, current: 0.3)
solver: The solver algorithm for optimization. Supported options: auto, normal, l-bfgs. (default: auto)
standardization: whether to standardize the tr

In [29]:
summary = lrModel.summary
summary.residuals.show()
print(summary.totalIterations)
print(summary.objectiveHistory)
print(summary.rootMeanSquaredError)
print(summary.r2)

+--------------------+
|           residuals|
+--------------------+
| 0.12805046585610147|
| -0.1446826926157201|
|-0.41903832622420606|
|-0.41903832622420606|
|  0.8547088792080306|
+--------------------+

6
[0.5000000000000001, 0.43152958103627864, 0.313233593388102, 0.312256926665541, 0.30915060819830303, 0.30915058933480266]
0.47308424392175985
0.720239122691221
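
The relationship between the residuals and the RMSE printed above can be checked by hand. The sketch below assumes the five residuals shown are the complete training set (an assumption based on the output, which is not truncated) and recomputes the RMSE as the square root of the mean squared residual.

```python
import math

# Residuals copied from the summary.residuals.show() output above
residuals = [
    0.12805046585610147,
    -0.1446826926157201,
    -0.41903832622420606,
    -0.41903832622420606,
    0.8547088792080306,
]

# RMSE is the square root of the mean of the squared residuals
rmse_from_residuals = math.sqrt(sum(r * r for r in residuals) / len(residuals))
print(rmse_from_residuals)
```

The result agrees with `summary.rootMeanSquaredError` to within floating-point rounding.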


## Generalized Linear Regression:
-  does not scale well to large numbers of features
-  provides more customization to different noise distribution families
    -  Model Hyperparameters:
        -  family:
            -  error distribution to be used in the model
            -  Gaussian, Binomial, Poisson, Gamma, Tweedie
        -  link:
            -  function that provides relationship between the linear predictor and the mean of the distribution function
            -  cloglog, probit, logit, inverse, sqrt, identity, log
        -  solver:
            -  used for optimization
        -  variancePower:
            -  power in the variance function of the Tweedie distribution, which characterizes the relationship between the variance and mean of the distribution (only applicable to the Tweedie family)
        -  linkPower:
            -  index in the power link function for the Tweedie family
    -  Training Parameters:
        -  maxIter:
            -  maximum number of iterations over the data before stopping (default value is usually best)
        -  tol:
            -  convergence tolerance; training stops before maxIter once the weights are considered optimized (default value is usually best)
        -  weightCol:
            -  name of a column of per-row weights, used to weigh certain rows more than others
    -  Prediction Parameters:
        -  linkPredictionCol:
            -  name of the output column that holds the linear predictor (output of the link function) for each prediction
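
The relationship between the linear predictor, the link function and the predicted mean can be sketched in plain Python. This is an illustration under assumptions, not Spark's implementation: the weights and intercept are invented, only three of the supported links are shown, and the feature row is the first row of the dataset displayed earlier.

```python
import math

# Each entry maps a link name to (link, inverse_link)
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
}

def predict(features, weights, intercept, link_name):
    """Return (linear predictor, predicted mean) for one row."""
    eta = intercept + sum(w * x for w, x in zip(weights, features))
    _, inverse_link = LINKS[link_name]
    return eta, inverse_link(eta)

first_row = [3.0, 10.1, 3.0]   # first row of df.show(3)
toy_weights = [0.1, 0.0, 0.2]  # hypothetical coefficients
toy_intercept = 0.5            # hypothetical intercept

eta_identity, mean_identity = predict(first_row, toy_weights, toy_intercept, "identity")
eta_log, mean_log = predict(first_row, toy_weights, toy_intercept, "log")
print(eta_identity, mean_identity, mean_log)
```

With the identity link the two values coincide, which is why `linkPredictionCol` and `predictionCol` would hold the same numbers for the Gaussian/identity configuration used in the next cell.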

In [30]:
from pyspark.ml.regression import GeneralizedLinearRegression

In [31]:
glr = GeneralizedLinearRegression()\
.setFamily("gaussian")\
.setLink("identity")\
.setMaxIter(10)\
.setRegParam(0.3)\
.setLinkPredictionCol("linkOut")
print(glr.explainParams())
glrModel = glr.fit(df)

family: The name of family which is a description of the error distribution to be used in the model. Supported options: gaussian (default), binomial, poisson, gamma and tweedie. (default: gaussian, current: gaussian)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label)
link: The name of link function which provides the relationship between the linear predictor and the mean of the distribution function. Supported options: identity, log, inverse, logit, probit, cloglog and sqrt. (current: identity)
linkPower: The index in the power link function. Only applicable to the Tweedie family. (undefined)
linkPredictionCol: link prediction (linear predictor) column name (current: linkOut)
maxIter: max number of iterations (>= 0). (default: 25, current: 10)
offsetCol: The offset column name. If this is not set or empty, we treat all instance offsets as 0.0 (undefined)
predictionCol: pred

## Decision Trees:
-  produces a single number per leaf node instead of a label (as a classification decision tree does)
-  creates a tree to predict numerical outputs instead of training coefficients to model a function
    -  Model Hyperparameters:
        -  maxDepth:
            -  caps the depth of the tree, which helps avoid overfitting the dataset
        -  maxBins:
            -  determines the # of bins that continuous features are discretized into (converting them into categorical features)
            -  more bins give higher level of granularity
        -  impurity:
            -  used to build the tree
            -  metric that determines whether the model should split a particular leaf node on a particular value or keep the node as is
            -  only supports "variance"
        -  minInfoGain:
            -  minimum information gain a split must provide in order to be used
            -  higher value can help prevent overfitting
            -  this parameter can require a lot of testing variations
        -  minInstancesPerNode:
            -  determines minimum # of training instances that each child node must contain after a split
            -  tree will be "pruned" (removes sections of the tree that provide little predictive power) until requirements are met
    -  Training Parameters:
        -  checkpointInterval:
            -  saves the model's work during training every N iterations
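
How the "variance" impurity, `minInfoGain` and `minInstancesPerNode` interact when choosing a split can be sketched on a single feature. This is a simplified illustration, not Spark's algorithm: it tests every midpoint between sorted feature values instead of using `maxBins` bins, and the function names and toy data are invented.

```python
def variance(values):
    """Population variance, used here as the impurity of a node."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(xs, ys, min_instances_per_node=1, min_info_gain=0.0):
    """Return (threshold, gain) of the best split on one feature, or None."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    parent_impurity = variance([y for _, y in pairs])
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        if min(len(left), len(right)) < min_instances_per_node:
            continue  # a child would hold too few training instances
        weighted = (len(left) * variance(left) + len(right) * variance(right)) / n
        gain = parent_impurity - weighted  # reduction in variance
        if gain >= min_info_gain and (best is None or gain > best[1]):
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2.0, gain)
    return best

# Made-up data with two obvious groups of labels
toy_xs = [1, 2, 3, 10, 11, 12]
toy_ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
split = best_split(toy_xs, toy_ys)
print(split)
```

Raising either `min_instances_per_node` or `min_info_gain` far enough makes the function return `None`, meaning the node stays a leaf and predicts the mean of its labels.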

In [32]:
from pyspark.ml.regression import DecisionTreeRegressor

In [33]:
dtr = DecisionTreeRegressor()
print(dtr.explainParams())
dtrModel = dtr.fit(df)

cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
featuresCol: features column name. (default: features)
impurity: Criterion used for information gain calculation (case-insensitive). Supported options: variance (default: variance)
labelCol: label column name. (default: label)
maxBins: Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature. (default: 32)
maxDepth: Maximum depth of the tree. 

## Random Forests & Gradient-Boosted Trees:
-  many trees are trained to perform a regression
-  RF [many de-correlated trees are trained and averaged]
-  GBT [each tree makes a weighted prediction]
    -  Model Hyperparameters:
        -  maxDepth:
            -  caps the depth of each tree, which helps avoid overfitting the dataset
        -  maxBins:
            -  determines the # of bins that continuous features are discretized into (converting them into categorical features)
            -  more bins give higher level of granularity
        -  impurity:
            -  used to build the trees
            -  metric that determines whether the model should split a particular leaf node on a particular value or keep the node as is
            -  only supports "variance"
        -  minInfoGain:
            -  minimum information gain a split must provide in order to be used
            -  higher value can help prevent overfitting
            -  this parameter can require a lot of testing variations
        -  minInstancesPerNode:
            -  determines minimum # of training instances that each child node must contain after a split
            -  tree will be "pruned" (removes sections of the tree that provide little predictive power) until requirements are met
    -  Training Parameters:
        -  checkpointInterval:
            -  saves the model's work during training every N iterations
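
The difference between the two ensembles at prediction time can be sketched as follows. The "trees" are hypothetical stand-ins (constant functions) and the tree weights are invented, so this only illustrates how the outputs are combined, not how the trees are trained.

```python
def random_forest_predict(trees, x):
    """Random forest: average the predictions of all trees."""
    return sum(tree(x) for tree in trees) / len(trees)

def gbt_predict(trees, tree_weights, x):
    """Gradient-boosted trees: weighted sum of the tree predictions."""
    return sum(w * tree(x) for tree, w in zip(trees, tree_weights))

# Hypothetical stand-ins for trained trees: each ignores x and returns a constant
toy_trees = [lambda x: 2.0, lambda x: 3.0, lambda x: 1.0]
toy_tree_weights = [1.0, 0.1, 0.1]  # made-up weights for illustration

rf_out = random_forest_predict(toy_trees, x=None)
gbt_out = gbt_predict(toy_trees, toy_tree_weights, x=None)
print(rf_out, gbt_out)
```

In a real gradient-boosted model each later tree is fit to the errors left by the earlier ones, so its weighted output acts as a correction rather than an independent vote.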

In [34]:
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.regression import GBTRegressor

In [35]:
rf =  RandomForestRegressor()
print(rf.explainParams())
rfModel = rf.fit(df)
print("\n")
gbt = GBTRegressor()
print(gbt.explainParams())
gbtModel = gbt.fit(df)

cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
featureSubsetStrategy: The number of features to consider for splits at each tree node. Supported options: auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n]. (default: auto)
featuresCol: features column name. (default: features)
impurity: Criterion used for information gain calculation (case-insensitive). Supported options: variance (default: variance)
labelCol: label column name. (default: label)
maxBins: Max 

## Survival Regression (Accelerated Failure Time):
-  used to understand the survival rate of individuals
-  models the log of the survival time via accelerated failure time model
-  tune coefficients according to feature values

## Isotonic Regression:
-  fits a piecewise linear function that is always monotonically increasing (it never decreases as the input grows)
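
A monotone fit of this kind can be computed with the pool-adjacent-violators algorithm, a standard method for isotonic regression. The sketch below is a plain-Python version with unit weights and inputs already sorted by feature value; the function name and data are invented, and it is not Spark's `IsotonicRegression` code.

```python
def pool_adjacent_violators(ys):
    """Least-squares non-decreasing fit to ys (unit weights, sorted inputs)."""
    blocks = []  # each block is [sum of labels, number of labels]
    for y in ys:
        blocks.append([y, 1])
        # Merge backwards while the previous block's mean exceeds the last one's
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)  # every member gets the block mean
    return fitted

toy_labels = [1.0, 3.0, 2.0, 4.0]  # made-up labels, ordered by feature value
fitted_values = pool_adjacent_violators(toy_labels)
print(fitted_values)
```

The out-of-order pair (3.0, 2.0) is pooled into its mean, so the fitted sequence never decreases.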

### _Evaluation Metrics & CV Example_

In [36]:
'''
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.GeneralizedLinearRegression
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
val glr = new GeneralizedLinearRegression()
  .setFamily("gaussian")
  .setLink("identity")
val pipeline = new Pipeline().setStages(Array(glr))
val params = new ParamGridBuilder().addGrid(glr.regParam, Array(0, 0.5, 1))
  .build()
val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setPredictionCol("prediction")
  .setLabelCol("label")
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(params)
  .setNumFolds(2) // should always be 3 or more but this dataset is small
val model = cv.fit(df)

import org.apache.spark.mllib.evaluation.RegressionMetrics
val out = model.transform(df)
  .select("prediction", "label")
  .rdd.map(x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double]))
val metrics = new RegressionMetrics(out)
println(s"MSE = ${metrics.meanSquaredError}")
println(s"RMSE = ${metrics.rootMeanSquaredError}")
println(s"R-squared = ${metrics.r2}")
println(s"MAE = ${metrics.meanAbsoluteError}")
println(s"Explained variance = ${metrics.explainedVariance}")
'''
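
The Scala cell above tunes `regParam` with k-fold cross-validation and picks the value with the lowest RMSE. The same idea is sketched below in plain Python so the mechanics are visible. Everything here is an assumption made for illustration: the estimator is a hypothetical one-weight ridge model rather than `GeneralizedLinearRegression`, the folds are taken by stride instead of at random, and the data is invented.

```python
import math

def fit_ridge_1d(xs, ys, reg_param):
    """Hypothetical stand-in estimator: one weight, no intercept, L2 penalty."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + reg_param)

def rmse(weight, xs, ys):
    return math.sqrt(sum((y - weight * x) ** 2 for x, y in zip(xs, ys)) / len(ys))

def cross_validate(xs, ys, reg_params, num_folds):
    """Return (best reg_param, mean held-out RMSE per reg_param)."""
    n = len(xs)
    scores = {}
    for reg_param in reg_params:
        fold_scores = []
        for fold in range(num_folds):
            held_out = list(range(fold, n, num_folds))  # every num_folds-th row
            train = [i for i in range(n) if i not in held_out]
            weight = fit_ridge_1d([xs[i] for i in train],
                                  [ys[i] for i in train], reg_param)
            fold_scores.append(rmse(weight,
                                    [xs[i] for i in held_out],
                                    [ys[i] for i in held_out]))
        scores[reg_param] = sum(fold_scores) / num_folds
    best = min(scores, key=scores.get)  # lowest mean RMSE wins
    return best, scores

# Made-up data that follows y = 2x exactly, so no regularization should win
toy_xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
toy_ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
best_reg_param, cv_scores = cross_validate(toy_xs, toy_ys, [0.0, 0.5, 1.0], num_folds=3)
print(best_reg_param, cv_scores)
```

Each candidate is scored only on rows it was not trained on, which is what makes the comparison between `regParam` values meaningful.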

