### grp

# Spark: The Definitive Guide

## PART 6: Advanced Analytics and Machine Learning 

## dataPaths

In [1]:
binaryClass = '/Users/grp/sparkTheDefinitiveGuide/data/binary-classification/'

## _Chapter #26 - Classification_

-  predicts a label, category, class, or discrete variable via input features
-  output label has a finite set of possible values

### Use Cases:
-  credit risk [predict whether a loan should be issued or not]
-  news classification [predict the topic category]
-  human activity classification [predict a person's status (walking, sleeping, running, standing)]

### Types:
-  binary [only 2 labels you can predict]
    -  fraud analytics
    -  email spam
-  multiclass [more than 2 possible labels, but always a finite set of classes]:
    -  predicting weather (rainy, sunny, cloudy)
-  multilabel [an input can produce multiple labels, and the number of labels per input is not fixed]
    -  predict the # of objects appearing in an image

### MLlib Classification Models:
-  Logistic Regression
-  Decision Trees
-  Random Forests
-  Gradient-Boosted Trees

### Evaluators:
-  provide a metric for how well a model performs
-  work best when combined with automated grid search, hyperparameter tuning, and the ML Pipeline API
-  classification evaluators expect a "predicted label" and a "true label"
-  Binary (BinaryClassificationEvaluator):
    -  areaUnderROC
    -  areaUnderPR
-  MultiClass (MulticlassClassificationEvaluator):
    -  f1
    -  weightedPrecision
    -  weightedRecall
    -  accuracy
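
To make the multiclass metric names above concrete, here is a minimal plain-Python sketch (not MLlib's implementation) that computes them from a list of true labels and a list of predicted labels. It assumes the weighted metrics are per-class values averaged by class frequency, which is how I understand `MulticlassClassificationEvaluator` to define them; the function name and the sample labels are made up for illustration.

```python
from collections import Counter

def multiclass_metrics(y_true, y_pred):
    """Return accuracy plus precision, recall, and F1 weighted by class frequency."""
    n = len(y_true)
    support = Counter(y_true)  # number of true instances per class
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    w_precision = w_recall = w_f1 = 0.0
    for cls, count in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        precision = tp / predicted if predicted else 0.0
        recall = tp / count
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = count / n  # share of rows that truly belong to this class
        w_precision += weight * precision
        w_recall += weight * recall
        w_f1 += weight * f1
    return {
        "accuracy": accuracy,
        "weightedPrecision": w_precision,
        "weightedRecall": w_recall,
        "f1": w_f1,
    }

# Hypothetical labels for a 3-class problem.
y_true = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
y_pred = [0.0, 1.0, 1.0, 1.0, 2.0, 0.0]
metrics = multiclass_metrics(y_true, y_pred)
print(metrics)
```

Note that weighted recall always equals accuracy under this definition, so the two only look different when one of them is computed another way.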

### One-vs-Rest Classifier:
-  some models don't support multiclass classification
-  wrap such a model in the one-vs-rest estimator (OneVsRest), which turns the problem into 1 binary classification per class
-  each binary classifier isolates 1 class as the target and groups all the other classes into a single "rest" class
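
The relabeling idea behind one-vs-rest can be sketched in a few lines of plain Python. This only illustrates the reduction itself, not MLlib's `OneVsRest` estimator; the helper names and the per-class confidence values are hypothetical.

```python
def one_vs_rest_labels(labels, target):
    """Binary labels for one sub-problem: 1.0 for the target class, 0.0 for the rest."""
    return [1.0 if y == target else 0.0 for y in labels]

def predict_one_vs_rest(scores_by_class):
    """Pick the class whose binary model is the most confident."""
    return max(scores_by_class, key=scores_by_class.get)

labels = ["rainy", "sunny", "cloudy", "sunny"]

# One binary training problem per distinct class.
binary_problems = {cls: one_vs_rest_labels(labels, cls) for cls in sorted(set(labels))}
print(binary_problems)

# Hypothetical confidences from the three binary models for one new row.
winner = predict_one_vs_rest({"rainy": 0.2, "sunny": 0.7, "cloudy": 0.4})
print(winner)
```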

### Model Configuration:
-  Model Hyperparameters (structure of how model can be initialized)
-  Training Parameters (structure of how model can be trained)
-  Prediction Parameters (structure of how model determines making predictions)
-  Model Summary (provides information about final trained model)

### _Chapter #26 Exercises (Class)_

In [2]:
bInput = spark.read.format("parquet").load(binaryClass).selectExpr("features", "cast(label as double) as label")
bInput.printSchema()
bInput.show(3)

root
 |-- features: vector (nullable = true)
 |-- label: double (nullable = true)

+--------------+-----+
|      features|label|
+--------------+-----+
|[3.0,10.1,3.0]|  1.0|
|[1.0,0.1,-1.0]|  0.0|
|[1.0,0.1,-1.0]|  0.0|
+--------------+-----+
only showing top 3 rows



## Logistic Regression:
-  combines each of the individual inputs (features) with specific weights generated during the training process
-  the weighted inputs are combined to get the probability of belonging to a particular class
-  weights represent feature importance (assuming the features have been normalized)
-  highly weighted features have a significant effect on the outcome
-  low weighted features are less important
    -  Model Hyperparameters:
        -  family:
            -  "multinomial" aka multiclass (2 or more possible distinct labels)
            -  "binomial" aka binary (only 2 possible labels)
        -  elasticNetParam:
            -  float value from 0 to 1
            -  mix of L1 and L2 regularization
                -  L1 (value of 1):
                    -  creates sparsity in model because certain feature weights will be zero (no significance to output)
                -  L2 (value of 0):
                    -  does not create sparsity because feature weights will never completely be zero
        -  fitIntercept:
            -  either TRUE or FALSE
            -  determines whether to fit intercept to linear combination of inputs and weights of model
            -  recommended to fit intercept if the training data has not been normalized
        -  regParam:
            -  determines how much weight to give to regularization
            -  recommended to use wide range of values (0, 0.01, 0.1, 1)
        -  standardization:
            -  either TRUE or FALSE
            -  decides whether to standardize the inputs before passing into model
    -  Training Parameters:
        -  maxIter:
            -  total number of iterations over the data before stopping (default value is usually best)
        -  tol:
            -  threshold that stops before maxIter when weights are considered optimized (default value is usually best)
        -  weightCol:
            -  weighs certain rows more than others
    -  Prediction Parameters:
        -  threshold:
            -  Double in range of 0 to 1
            -  probability threshold for predicting a given class
        -  thresholds:
            -  specify an array of threshold values, one per class, for multiclass classification
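
As a rough illustration of the bullets above, the following plain-Python sketch (not MLlib's implementation) shows how weights and an intercept become a probability, how `threshold` turns that probability into a label, and how `regParam` and `elasticNetParam` mix the L1 and L2 penalties. The coefficient and intercept values are made up, and the penalty formula is the elastic net form as I understand it from the MLlib documentation.

```python
import math

def predict_probability(features, coefficients, intercept):
    """Weighted sum of the features passed through the logistic (sigmoid) function."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(probability, threshold=0.5):
    """Predict class 1.0 when the probability exceeds the threshold, else 0.0."""
    return 1.0 if probability > threshold else 0.0

def elastic_net_penalty(coefficients, reg_param, elastic_net_param):
    """reg_param * (alpha * L1 + (1 - alpha) * L2 / 2), with alpha = elastic_net_param."""
    l1 = sum(abs(w) for w in coefficients)
    l2 = sum(w * w for w in coefficients)
    return reg_param * (elastic_net_param * l1 + (1.0 - elastic_net_param) * 0.5 * l2)

# Hypothetical weights for a 3-feature model (not the values fitted in this notebook).
coefficients = [0.5, -0.25, 1.0]
intercept = -1.0

p = predict_probability([3.0, 10.1, 3.0], coefficients, intercept)
print(p, predict_label(p), predict_label(p, threshold=0.8))
print(elastic_net_penalty(coefficients, reg_param=0.1, elastic_net_param=1.0))  # pure L1
print(elastic_net_penalty(coefficients, reg_param=0.1, elastic_net_param=0.0))  # pure L2
```

Raising the threshold makes the model more reluctant to predict the positive class, which is why the same probability can map to different labels.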

In [3]:
from pyspark.ml.classification import LogisticRegression

In [4]:
lr = LogisticRegression()
print(lr.explainParams()) # see all parameters
lrModel = lr.fit(bInput)
print("\n")
print(lrModel.coefficients) # individual feature weights [3 features in this example]
print(lrModel.intercept) # intercept value
print("\n")

'''
# multiclass coefficients and intercept:

lrModel.coefficientMatrix
lrModel.interceptVector

# will return values for each of the classes
'''

# Model Summary:
summary = lrModel.summary
print(summary.areaUnderROC)
summary.roc.show()
summary.pr.show()
print(summary.objectiveHistory) # performance over training iterations

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The bounds vector size must beequal wi

## Decision Trees:
-  creates one big tree of decisions
-  tends to overfit the data, so it often performs poorly on new data outside the training set
    -  Model Hyperparameters:
        -  maxDepth:
            -  helps avoid overfitting dataset
        -  maxBins:
            -  determines the # of bins used to discretize continuous features (converting them into categorical features)
            -  more bins give higher level of granularity
        -  impurity:
            -  used to build the tree
            -  metric to determine where model should split at specific leaf node
            -  either "entropy" or "gini"
        -  minInfoGain:
            -  minimum information gain a candidate split must provide to be used
            -  higher value can help prevent overfitting
            -  this parameter can require a lot of testing variations
        -  minInstancesPerNode:
            -  determines minimum # of training instances that need to occur in each leaf node
            -  tree will be "pruned" (removes sections of the tree that provide little power to classify instances) until requirements are met
    -  Training Parameters:
        -  checkpointInterval:
            -  set to save model's work over training every N iterations
    -  Prediction Parameters:
        -  thresholds:
            -  specify an array of threshold values for each class
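
The split-related hyperparameters above can be illustrated with a small plain-Python sketch. It is not MLlib's tree builder: it only computes the two impurity measures, the information gain of one candidate split, and a check that mirrors the roles of `minInfoGain` and `minInstancesPerNode`. The function names and the sample labels are hypothetical.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    children = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - children

def split_allowed(parent, left, right, min_info_gain=0.0,
                  min_instances_per_node=1, impurity=gini):
    """Accept a split only if both children are big enough and the gain is sufficient."""
    big_enough = min(len(left), len(right)) >= min_instances_per_node
    return big_enough and information_gain(parent, left, right, impurity) >= min_info_gain

parent = [0, 0, 0, 1, 1, 1]
left, right = [0, 0, 0], [1, 1, 1]  # a perfect split of the parent node

print(gini(parent), entropy(parent))
print(information_gain(parent, left, right))
print(split_allowed(parent, left, right, min_info_gain=0.6))
print(split_allowed(parent, [0, 0, 0, 1, 1], [1], min_instances_per_node=2))
```

Raising either limit rejects more candidate splits, which keeps the tree smaller and is one way to counter overfitting.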

In [5]:
from pyspark.ml.classification import DecisionTreeClassifier

In [6]:
dt = DecisionTreeClassifier()
print(dt.explainParams()) # see all parameters
dtModel = dt.fit(bInput)

## Random Forest & Gradient-Boosted Trees:
-  both train multiple trees on various subsets of the data
-  having multiple trees helps reduce overfitting because each tree becomes an "expert" on its own subset of the data
-  RF [trains many trees and averages their responses to make a prediction]
-  GBT [each tree makes a weighted prediction, so some trees carry more predictive power than others]
    -  Model Hyperparameters:
        -  RF:
            -  numTrees:
                -  total # of trees to train
            -  featureSubsetStrategy:
                -  determines how many features should be considered for splits
        -  GBT:
            -  lossType:
                -  loss function
            -  maxIter:
                -  total # of iterations over data before stopping
                -  the default is typically best
            -  stepSize:
                - learning rate
    -  Training Parameters:
        -  checkpointInterval:
            -  set to save model's work over training every N iterations
    -  Prediction Parameters:
        -  thresholds:
            -  specify an array of threshold values for each class
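
The difference between how the two ensembles combine their trees can be sketched in plain Python. This is a simplified illustration under my own assumptions, not MLlib's code: the random forest part averages per-class probabilities, and the GBT part sums raw tree outputs scaled by a single learning rate (`step_size`, standing in for `stepSize`). All tree outputs below are made-up numbers.

```python
def random_forest_predict(tree_probabilities):
    """Average the per-class probabilities from independently trained trees."""
    n_trees = len(tree_probabilities)
    n_classes = len(tree_probabilities[0])
    averaged = [sum(p[c] for p in tree_probabilities) / n_trees for c in range(n_classes)]
    return averaged.index(max(averaged)), averaged

def gbt_predict(tree_outputs, step_size=0.1):
    """Sum the raw outputs of sequentially trained trees, each scaled by the learning rate."""
    margin = sum(step_size * out for out in tree_outputs)
    return (1.0 if margin > 0 else 0.0), margin

# Hypothetical outputs of three trees for a single row.
predicted_class, averaged = random_forest_predict([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]])
gbt_label, margin = gbt_predict([1.0, -0.5, 0.8], step_size=0.1)

print(predicted_class, averaged)
print(gbt_label, margin)
```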

In [7]:
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import GBTClassifier

In [8]:
rfClassifier = RandomForestClassifier()
print(rfClassifier.explainParams())
trainedModel = rfClassifier.fit(bInput)
print("\n")
gbtClassifier = GBTClassifier()
print(gbtClassifier.explainParams())
trainedModel = gbtClassifier.fit(bInput)

cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
featureSubsetStrategy: The number of features to consider for splits at each tree node. Supported options: auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n]. (default: auto)
featuresCol: features column name. (default: features)
impurity: Criterion used for information gain calculation (case-insensitive). Supported options: entropy, gini (default: gini)
labelCol: label column name. (default: label)
maxBins: Max

## Naive Bayes:
-  collection of classifiers based on Bayes' theorem
-  assumes all features in dataset are independent of one another
-  typically used in text classification
-  features must be non-negative
-  Model Types:
    -  multivariate Bernoulli [indicator variables represent existence of a term in a document]
    -  multinomial [uses total counts of terms]
        -  Model Hyperparameters:
            -  modelType:
                -  "bernoulli" or "multinomial"
            -  weightCol:
                -  weight differences for different data points
        -  Training Parameters:
            -  smoothing:
                -  determines amount of regularization
                -  helps smooth categorical data to prevent overfitting on training dataset
        -  Prediction Parameters:
            -  thresholds:
                -  specify an array of threshold values for each class
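
To show where `smoothing` and the non-negativity requirement come from, here is a small plain-Python sketch of a multinomial Naive Bayes classifier. It is a simplification under my own assumptions (for example, the class priors are not smoothed) and not MLlib's implementation; the function names and the term-count rows are hypothetical.

```python
import math
from collections import defaultdict

def train_multinomial_nb(rows, labels, smoothing=1.0):
    """Return {label: (log prior, per-feature log likelihoods)} from count features."""
    n_features = len(rows[0])
    class_counts = defaultdict(int)
    feature_totals = defaultdict(lambda: [0.0] * n_features)
    for row, label in zip(rows, labels):
        if any(v < 0 for v in row):
            raise ValueError("multinomial Naive Bayes needs non-negative features")
        class_counts[label] += 1
        for i, v in enumerate(row):
            feature_totals[label][i] += v
    model = {}
    for label, totals in feature_totals.items():
        # Additive smoothing keeps unseen terms from getting zero probability.
        denom = sum(totals) + smoothing * n_features
        log_prior = math.log(class_counts[label] / len(rows))
        log_likelihood = [math.log((t + smoothing) / denom) for t in totals]
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, row):
    """Pick the label with the highest log prior plus count-weighted log likelihood."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(v * ll for v, ll in zip(row, log_likelihood))
    return max(model, key=score)

# Hypothetical term counts for two terms across four documents.
rows = [[3.0, 0.0], [2.0, 1.0], [0.0, 4.0], [1.0, 3.0]]
labels = [1.0, 1.0, 0.0, 0.0]
nb_model = train_multinomial_nb(rows, labels, smoothing=1.0)

print(predict_nb(nb_model, [4.0, 0.0]))
print(predict_nb(nb_model, [0.0, 5.0]))
```

Because the features act as counts that multiply log probabilities, a negative value has no sensible interpretation, which is why the sketch rejects it.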

In [9]:
from pyspark.ml.classification import NaiveBayes

In [10]:
nb = NaiveBayes()
print(nb.explainParams())
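# Naive Bayes requires non-negative features; the label-0 rows shown earlier
# contain negative values (e.g. [1.0,0.1,-1.0]), so they are filtered out by label
trainedModel = nb.fit(bInput.where("label != 0"))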

featuresCol: features column name. (default: features)
labelCol: label column name. (default: label)
modelType: The model type which is a string (case-sensitive). Supported options: multinomial (default) and bernoulli. (default: multinomial)
predictionCol: prediction column name. (default: prediction)
probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities. (default: probability)
rawPredictionCol: raw prediction (a.k.a. confidence) column name. (default: rawPrediction)
smoothing: The smoothing parameter, should be >= 0, default is 1.0 (default: 1.0)
thresholds: Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0, excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is 

### _Evaluation Metrics Example_

In [11]:
'''
// Scala (spark.mllib) version, kept inside a string so the Python kernel does not execute it.
// `model` is a placeholder for a fitted binary classification model.
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
val out = model.transform(bInput)
  .select("prediction", "label")
  .rdd.map(x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double]))
val metrics = new BinaryClassificationMetrics(out)

metrics.areaUnderPR
metrics.areaUnderROC
println("Receiver Operating Characteristic")
metrics.roc.toDF().show()
'''
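
For intuition about what `areaUnderROC` measures, the following plain-Python sketch computes it as the probability that a randomly chosen positive row is scored higher than a randomly chosen negative row, with ties counted as half. This is a pairwise definition chosen for readability, not the curve-based computation used by `BinaryClassificationMetrics`; the function name, scores, and labels are hypothetical.

```python
def area_under_roc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1.0]
    negatives = [s for s, y in zip(scores, labels) if y == 0.0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted probabilities and true labels.
scores = [0.9, 0.8, 0.35, 0.3, 0.1]
labels = [1.0, 1.0, 0.0, 1.0, 0.0]
auc = area_under_roc(scores, labels)
print(auc)
```

A value of 1.0 means every positive row outranks every negative row, while 0.5 is what a random ranking would give on average.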
