### grp

# Spark: The Definitive Guide

## PART 6: Advanced Analytics and Machine Learning 

## dataPaths

In [1]:
simpleML = '/Users/grp/sparkTheDefinitiveGuide/data/simple-ml/'
libsvm = '/Users/grp/sparkTheDefinitiveGuide/data/sample_libsvm_data.txt'

## _Chapter #24 - Advanced Analytics and Machine Learning Overview_

-  Data Processing:
    -  cleaning data
    -  ETL
    -  Feature Engineering (**converting inputs to the required types: Double for labels and Vector[Double] for features**)   
    <br>
-  Supervised Learning:
    -  Process:
        1. gather historical data with labels (dependent variable)
        2. train a model to predict values of labels based on various features (independent variables) of the data points
        3. test the model on data it was not trained on
        4. make predictions on new unlabeled data
    -  Classification:
        -  train an algorithm to predict a dependent variable that is **CATEGORICAL**
        -  Binary Classification [2 categories]
        -  Multiclass Classification [more than 2 categories]
    -  Regression:
        -  train an algorithm to predict a dependent variable that is **CONTINUOUS**
        -  predicts a value on a number line
    - Use Cases:
        -  predicting customer churn
        -  predicting disease
        -  predicting sales
        -  predicting height   
        <br>
-  Recommendation Learning:
    -  Process:
        -  suggest products to users based on their behavior
        -  train an algorithm to recommend items based on user preferences, using similarities between users or between items
    -  Use Cases:
        -  movie recommendations
        -  product recommendations   
        <br>
-  Unsupervised Learning:
    -  Process:
        -  identify patterns to discover underlying structure in dataset
        -  no dependent variable (label) to predict
        -  can be difficult to determine if model is accurate or not
    -  Use Cases:
        -  anomaly detection
        -  topic modeling
        -  user segmentation   
        <br>
-  Graph Analytics:
    -  Process:
        -  study of structures to identify relationships within data
        -  vertices (objects) and edges (relationships between objects)
    -  Use Cases:
        -  fraud networks
        -  social networks
        -  pagerank   
        <br>
-  Deep Learning:
    -  neural networks

### Modeling Process:
1.  gather and collect relevant data for task (**DATA COLLECTION**)   
<br>
2.  clean and inspect data to better understand the data (**DATA CLEANSING**):
    -  EDA:
        -  interactive queries
        -  visualization methods
        -  statistical inference like distributions, correlations, summaries (mean, standard deviation, median, mode, quartiles) 
    -  Handling NULLs / missing values   
    <br>
3.  perform feature engineering to put the data in the form the algorithm requires [vectors] (**FEATURE ENGINEERING**):
    -  converting features to numeric representation (vectors or doubles)
    -  normalizing data
    -  adding variables
    -  converting categorical variables to proper format for ML Model input   
    <br>
4.  set aside a training set for the algorithm to learn from (**TRAINING MODELS**):
    -  _the output of the training process is called a **MODEL**_
    -  provide model inputs to produce outputs (predictions) via mathematical manipulation of inputs   
    <br>
5.  set aside a testing set to measure model performance (**MODEL TUNING AND EVALUATION**):
    -  tests how well the model generalizes to data it has not seen before
    -  Sets:
        -  train (dataset used to train model)
        -  validation (dataset used to test different variations of hyperparameters) **tune hyperparameters on the validation set and NOT the test set to prevent overfitting the model to the test set**  
        -  test "holdout" (dataset used for final evaluation to find best performing model)
    -  _be on the lookout for OVERFITTING (training a model that memorizes the training set instead of generalizing well to new data)_   
    <br>
6.  use steps 4 and 5 to optimize a model to run on unseen data for predictions (**APPLY MODEL FOR INSIGHTS**):
    -  export best performing model and send to production to make predictions on new incoming unseen data
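
The train / validation / test sets from step 5 can be sketched in plain Python. This is only an illustration of the idea, not the Spark API; the function name, the 60/20/20 weights, and the seed are assumptions chosen for the example.

```python
import random

def three_way_split(rows, weights=(0.6, 0.2, 0.2), seed=42):
    """Randomly assign each row to the train, validation, or test set."""
    rng = random.Random(seed)               # fixed seed keeps the split repeatable
    train, validation, test = [], [], []
    cut_train = weights[0]
    cut_validation = weights[0] + weights[1]
    for row in rows:
        draw = rng.random()                 # uniform draw in [0, 1)
        if draw < cut_train:
            train.append(row)
        elif draw < cut_validation:
            validation.append(row)
        else:
            test.append(row)
    return train, validation, test

rows = list(range(1000))                    # stand-in for real records
train, validation, test = three_way_split(rows)
print(len(train), len(validation), len(test))
```

Each row lands in exactly one set, so hyperparameters tuned against `validation` never see the rows held back in `test`.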

### MLLib:
-  Spark provides an interface for building ML pipelines
-  provides distributed ETL, feature engineering (FE), and ML model training
-  "High-Level" Structured Types:
    -  Transformers:
        -  functions that convert raw data in some way
        -  primarily used for preprocessing and feature engineering data
        -  ex: create new column, convert data type, convert categorical variables into numerical values
    -  Estimators:
        -  is "fit" on a DataFrame to produce a transformer or a trained model
        -  ex: learning algorithm that trains on DF and produces a Model
    -  Evaluators:
        -  shows model performance
        -  ex: ROC curve
    -  Pipelines:
        -  wrapped up sequence of stages [transformers and estimators] that makes an ML Workflow; evaluators score the output of the workflow
-  "Low-Level" Data Types:
    -  Vectors:
        -  Sparse:
            -  most elements are zero, so only the non-zero indices and values are stored (compressed representation)
        -  Dense:
            -  most elements are non-zero, so every value is stored
-  Hyperparameters:
    -  configuration parameters of a learning algorithm that are set prior to model training
    -  used to compare different variations of models to one another to find best performance combination
    -  ex: regularization (parameter that pushes models against overfitting data)
-  TrainValidationSplit:
    -  performs a single random split of the data into 2 groups (training and validation)
-  CrossValidator:
    -  performs Kfold cross validation that splits dataset into "k" non-overlapping randomly partitioned folds
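
As a rough illustration of what "k non-overlapping randomly partitioned folds" means, here is a plain-Python sketch. It is not how CrossValidator is implemented; the function name, the seed, and the round-robin fold assignment are assumptions made for the example.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Yield (training, held_out) index lists, one pair per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)          # random partitioning
    folds = [indices[i::k] for i in range(k)]     # k non-overlapping folds
    for i in range(k):
        held_out = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, held_out

splits = list(k_fold_indices(n_rows=10, k=3))
for training, held_out in splits:
    print(sorted(held_out), "held out;", len(training), "rows used for training")
```

Every row is held out exactly once, so each candidate set of hyperparameters is scored k times on data it was not trained on.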

### Model Deployment Options:
-  train model offline and then supply it with offline OLAP data (solid method)
-  train model offline and then put results into a database [Hive, HBase, Cassandra] (solid method)
-  train model offline, persist to disk, and serve to REST API (custom method)
-  train model offline and manually convert distributed model to single machine model (complex method)
-  train model online and use it online via streaming framework (complex method)
-  **productionizing ML can be very difficult; this area is under heavy development and future innovations are currently being worked on**

### _Chapter #24 Exercises (ML)_

### _Vector (dense & sparse) Example_

In [2]:
from pyspark.ml.linalg import Vectors

In [3]:
denseVec = Vectors.dense(1.0, 2.0, 3.0)
size = 3
idx = [1, 2] # locations of non-zero elements in vector
values = [2.0, 3.0]

sparseVec = Vectors.sparse(size, idx, values)

In [4]:
print(denseVec)
print(sparseVec)

[1.0,2.0,3.0]
(3,[1,2],[2.0,3.0])
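
To make the `(size, indices, values)` notation concrete, here is a small plain-Python sketch of converting between the two representations. The helper names are hypothetical and are not part of `pyspark.ml.linalg`.

```python
def sparse_to_dense(size, indices, values):
    """Expand (size, indices, values) into a full list, filling gaps with 0.0."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

def dense_to_sparse(dense):
    """Keep only the non-zero entries as (size, indices, values)."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

print(sparse_to_dense(3, [1, 2], [2.0, 3.0]))   # [0.0, 2.0, 3.0]
print(dense_to_sparse([1.0, 2.0, 3.0]))         # (3, [0, 1, 2], [1.0, 2.0, 3.0])
```

Note that the sparse vector printed above has a zero at index 0, so it is not the same vector as `denseVec`.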


### _Read LIBSVM Example_

In [5]:
libsvmDF = spark.read.format("libsvm").load(libsvm)
libsvmDF.show(3)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[124,125,126...|
+-----+--------------------+
only showing top 3 rows
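
A LIBSVM text line has the form `label index:value index:value ...`, with one-based indices in the file that become zero-based positions in the loaded sparse vector. The parser below is a minimal plain-Python sketch of that mapping; the function name and the sample line are hypothetical, and the feature count 692 is taken from the output above.

```python
def parse_libsvm_line(line, num_features):
    """Parse one LIBSVM line into (label, (size, indices, values))."""
    parts = line.split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        idx, val = item.split(":")
        indices.append(int(idx) - 1)   # file is one-based; vector is zero-based
        values.append(float(val))
    return label, (num_features, indices, values)

sample = "0 128:51 129:159 130:253"    # hypothetical line for illustration
print(parse_libsvm_line(sample, 692))
# (0.0, (692, [127, 128, 129], [51.0, 159.0, 253.0]))
```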



### _Categorical & Continuous Variable Example_

In [6]:
df = spark.read.json(simpleML)
df.orderBy("value2").show(3)

+-----+----+------+------------------+
|color| lab|value1|            value2|
+-----+----+------+------------------+
|green|good|    12|14.386294994851129|
|  red| bad|     2|14.386294994851129|
|green| bad|    16|14.386294994851129|
+-----+----+------+------------------+
only showing top 3 rows



In [7]:
df.printSchema()

root
 |-- color: string (nullable = true)
 |-- lab: string (nullable = true)
 |-- value1: long (nullable = true)
 |-- value2: double (nullable = true)



### _RFormula Example_:
-  formula method used to transform data into features for ML model input
-  "~" separates target and terms
-  "+" concatenates terms
-  "-" removes a term
-  ":" interaction (multiplies numeric values; binarizes categorical values)
-  "." includes all columns except the target (dependent variable)

In [8]:
from pyspark.ml.feature import RFormula

In [9]:
supervised = RFormula(formula = "lab ~ . + color:value1 + color:value2")

In [10]:
fittedRF = supervised.fit(df) # outputs trained transformer object [RFormulaModel] to transform data via custom Rformula
preparedDF = fittedRF.transform(df)
preparedDF.show(5, False)

# assigns numerical value to each possible color category
# creates additional features for the interaction variables between colors and value1/value2

+-----+----+------+------------------+----------------------------------------------------------------------+-----+
|color|lab |value1|value2            |features                                                              |label|
+-----+----+------+------------------+----------------------------------------------------------------------+-----+
|green|good|1     |14.386294994851129|(10,[1,2,3,5,8],[1.0,1.0,14.386294994851129,1.0,14.386294994851129])  |1.0  |
|blue |bad |8     |14.386294994851129|(10,[2,3,6,9],[8.0,14.386294994851129,8.0,14.386294994851129])        |0.0  |
|blue |bad |12    |14.386294994851129|(10,[2,3,6,9],[12.0,14.386294994851129,12.0,14.386294994851129])      |0.0  |
|green|good|15    |38.97187133755819 |(10,[1,2,3,5,8],[1.0,15.0,38.97187133755819,15.0,38.97187133755819])  |1.0  |
|green|good|12    |14.386294994851129|(10,[1,2,3,5,8],[1.0,12.0,14.386294994851129,12.0,14.386294994851129])|1.0  |
+-----+----+------+------------------+----------------------------------
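
The feature vectors above are consistent with the layout sketched below in plain Python. The category order (red, green, blue), the dropped last category, and the function name are assumptions inferred from the sample output; this is not a description of RFormula's internals.

```python
CATEGORIES = ["red", "green", "blue"]   # assumed category order

def rformula_like_features(color, value1, value2):
    """Build a 10-element vector: color dummies, value1, value2, then interactions."""
    one_hot = [1.0 if color == c else 0.0 for c in CATEGORIES]
    main_effect = one_hot[:-1]                              # last category is the reference level
    color_x_value1 = [flag * value1 for flag in one_hot]    # color:value1
    color_x_value2 = [flag * value2 for flag in one_hot]    # color:value2
    return main_effect + [float(value1), float(value2)] + color_x_value1 + color_x_value2

features = rformula_like_features("green", 12, 14.386294994851129)
nonzero = [i for i, v in enumerate(features) if v != 0.0]
print(len(features), nonzero)   # 10 [1, 2, 3, 5, 8]
```

Under the same assumptions a blue row gives non-zero indices `[2, 3, 6, 9]`, which also matches the output shown.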

### _Test Set Example_

In [11]:
train, test = preparedDF.randomSplit([0.7, 0.3])

### _Fit Model (Estimator) Example_

In [12]:
from pyspark.ml.classification import LogisticRegression

In [13]:
lr = LogisticRegression(labelCol="label",featuresCol="features")

In [14]:
# parameters
print(lr.explainParams())

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features, current: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label, current: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The

In [15]:
# fit to return a LogisticRegressionModel
fittedLR = lr.fit(train)

### _Make Predictions (Transform) Example_

In [16]:
fittedLR.transform(train).select("label", "prediction").show(3)

+-----+----------+
|label|prediction|
+-----+----------+
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
+-----+----------+
only showing top 3 rows
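
Label / prediction pairs like these are what an evaluator summarizes into a single number. As a rough sketch of the `areaUnderROC` metric used in the pipeline example, the plain-Python function below computes it as the fraction of (positive, negative) pairs that are ranked correctly, with ties counted as half. The function name and the sample scores are hypothetical, and this is not Spark's implementation.

```python
def area_under_roc(labels, scores):
    """AUC as the share of positive/negative pairs where the positive scores higher."""
    positives = [s for y, s in zip(labels, scores) if y == 1.0]
    negatives = [s for y, s in zip(labels, scores) if y == 0.0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

labels = [0.0, 0.0, 1.0, 1.0]
scores = [0.1, 0.4, 0.35, 0.8]          # hypothetical model scores
print(area_under_roc(labels, scores))   # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.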



### _Pipeline Example_

In [17]:
# splits
train, test = df.randomSplit([0.7, 0.3])

# estimators
rForm = RFormula()
lr = LogisticRegression().setLabelCol("label").setFeaturesCol("features")

# pipeline
from pyspark.ml import Pipeline
stages = [rForm, lr]
pipeline = Pipeline().setStages(stages)

# prepare parameter grid to train multiple models with different parameter combinations [2X3X2 = 12 versions being trained]
from pyspark.ml.tuning import ParamGridBuilder
params = ParamGridBuilder()\
  .addGrid(rForm.formula, [
    "lab ~ . + color:value1",
    "lab ~ . + color:value1 + color:value2"])\
  .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
  .addGrid(lr.regParam, [0.1, 2.0])\
  .build()

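# evaluator measuring model performance via areaUnderROC (area under the receiver operating characteristic curve)
# note: rawPredictionCol is pointed at the hard 0/1 "prediction" column here; the evaluator's default is "rawPrediction"
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()\
  .setMetricName("areaUnderROC")\
  .setRawPredictionCol("prediction")\
  .setLabelCol("label")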

# set up a validation set so that hyperparameters are not tuned on the test set (prevents overfitting to it)
from pyspark.ml.tuning import TrainValidationSplit
tvs = TrainValidationSplit()\
  .setTrainRatio(0.75)\
  .setEstimatorParamMaps(params)\
  .setEstimator(pipeline)\
  .setEvaluator(evaluator)

# will output a model type TrainValidationSplitModel
tvsFitted = tvs.fit(train)

# evaluate on test set
print(evaluator.evaluate(tvsFitted.transform(test)))

# objective history: loss of the best model after each training iteration
from pyspark.ml import PipelineModel
from pyspark.ml.classification import LogisticRegressionModel
trainedPipeline = tvsFitted.bestModel
trainedLR = trainedPipeline.stages[1]
summaryLR = trainedLR.summary
print(summaryLR.objectiveHistory)

# persist to disk to use for predictions on new data
model = tvsFitted.bestModel
model.write().overwrite().save("/Users/grp/sparkTheDefinitiveGuide/tmp/model")

# load model to make predictions on new data
# must import specific package based on persisted model type
# from pyspark.ml.tuning import TrainValidationSplitModel
from pyspark.ml import PipelineModel
print(model)
applyModel = PipelineModel.load("/Users/grp/sparkTheDefinitiveGuide/tmp/model")
testModel = applyModel.transform(test)
testModel.select("label", "prediction").show(3)

0.9166666666666667
[0.6918966592050804, 0.5993039195122997, 0.5352368612654572, 0.4597944834802591, 0.4496977652371098, 0.4411504923996298, 0.4368051124453638, 0.43178494030257675, 0.4281782327571738, 0.426192838196295, 0.4257479921914462, 0.4257312079788327, 0.425729301169335, 0.4257286665852039, 0.42572859744959407, 0.42572854981719105, 0.42572853357806606, 0.42572853213299605, 0.4257285320834505, 0.42572853208197436]
PipelineModel_4620b985159bfe5c9cdb
+-----+----------+
|label|prediction|
+-----+----------+
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
+-----+----------+
only showing top 3 rows
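
The "2X3X2 = 12" count in the parameter grid comment comes from taking every combination of the listed values. The plain-Python sketch below enumerates them with `itertools.product`; the dictionary layout is an assumption for illustration and is not how `ParamGridBuilder` stores its grid.

```python
from itertools import product

grid = {
    "formula": [
        "lab ~ . + color:value1",
        "lab ~ . + color:value1 + color:value2",
    ],
    "elasticNetParam": [0.0, 0.5, 1.0],
    "regParam": [0.1, 2.0],
}

names = list(grid)
# one dict per candidate model: every formula x every elasticNetParam x every regParam
combos = [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

print(len(combos))   # 12
print(combos[0])
```

Each extra value added to any one parameter multiplies the number of models to train, so grids grow quickly.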


