## Spark MLlib (org.apache.spark.ml package) offers:
#### Encoding/Transformation
#### Tuning/Evaluation
#### Hashing and Feature Reduction
#### Classification (trees, linear, probabilistic, ensemble)
#### Clustering
#### Regression
##### see : https://spark.apache.org/docs/latest/ml-guide.html

## Binary Classification
#### - bot or not
#### - churn
#### - fraud
#### - medical diagnosis

## Linear Classifiers
### See : https://spark.apache.org/docs/2.2.0/mllib-linear-methods.html

<table style="width:100%">
  <tr>
   <td><img src="linear_separator_1.png" width="300" height="300"></td>
   <td><img src="linear_separator_2.png" width="300" height="300"></td>
  </tr>
</table>

#### Uses a linear combination of weighted features to establish a 'decision boundary':
#### y = w·x + b
#### y : the score whose sign determines the predicted label
#### b : intercept, w_0
#### w : coefficients, (w_1, w_2, ..., w_n) where n = number of features
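
As a minimal sketch of the formula above, the plain-Scala snippet below computes the score and assigns a class by which side of the boundary a point falls on. The weights, the intercept, and the object name `LinearBoundarySketch` are illustrative assumptions, not values from a trained model.

```scala
object LinearBoundarySketch {
  // Hypothetical weights and intercept, chosen only for illustration
  val w: Array[Double] = Array(1.0, -1.0)
  val b: Double = 0.5

  // score = w_1*x_1 + ... + w_n*x_n + b
  def score(x: Array[Double]): Double = {
    require(x.length == w.length, "feature count must match weight count")
    w.zip(x).map { case (wi, xi) => wi * xi }.sum + b
  }

  // Points with a positive score fall on the "1" side of the boundary
  def classify(x: Array[Double]): Int = if (score(x) > 0.0) 1 else 0
}
```

For example, `LinearBoundarySketch.classify(Array(2.0, 1.0))` evaluates the score 2.0 - 1.0 + 0.5 = 1.5 and returns 1.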

### Load training data

In [1]:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val trainingDS = sqlContext.read.format("csv").option("header", "true").//
option("inferSchema", "true").load("session_training_data.csv")

trainingDS.show(10)

+--------------+-----------------+----------+-----------------+------------+-----------+
|cart_add_count|cart_remove_count|prod_views|prod_views_unique|search_count|is_purchase|
+--------------+-----------------+----------+-----------------+------------+-----------+
|             4|                7|         0|                0|           3|          0|
|             5|                6|        12|               11|           9|          1|
|             9|                2|         8|                8|           8|          1|
|             6|                7|         4|                3|           0|          0|
|             2|                9|         5|                5|           1|          0|
|             1|                5|         0|                0|           0|          0|
|             9|                6|        14|               14|           7|          1|
|             9|                3|         2|                2|           1|          0|
|             9|     

### Setup for Logistic Regression classifier

In [2]:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler 
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val LABEL_COL = "is_purchase"
val FEATURE_COLS = Array("cart_add_count", "cart_remove_count", "prod_views", "prod_views_unique", "search_count")

// VectorAssembler combines the feature columns of a Dataset into a single vector column
// (stored as dense or sparse, whichever is more compact), which most ML algos require

val assembler = new VectorAssembler().setInputCols(FEATURE_COLS).setOutputCol("features")

val trainingDSTransformed = assembler.transform(trainingDS)

trainingDSTransformed.show(10)

+--------------+-----------------+----------+-----------------+------------+-----------+--------------------+
|cart_add_count|cart_remove_count|prod_views|prod_views_unique|search_count|is_purchase|            features|
+--------------+-----------------+----------+-----------------+------------+-----------+--------------------+
|             4|                7|         0|                0|           3|          0|[4.0,7.0,0.0,0.0,...|
|             5|                6|        12|               11|           9|          1|[5.0,6.0,12.0,11....|
|             9|                2|         8|                8|           8|          1|[9.0,2.0,8.0,8.0,...|
|             6|                7|         4|                3|           0|          0|[6.0,7.0,4.0,3.0,...|
|             2|                9|         5|                5|           1|          0|[2.0,9.0,5.0,5.0,...|
|             1|                5|         0|                0|           0|          0| (5,[0,1],[1.0,5.0])|
|         
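
The `features` column above mixes two notations: dense rows such as `[4.0,7.0,0.0,0.0,...` and a sparse row `(5,[0,1],[1.0,5.0])`, which reads as (size, indices of non-zero entries, values at those indices). The plain-Scala sketch below expands that sparse triple into its dense form. The helper name `toDense` is hypothetical and is not part of the Spark API.

```scala
object SparseVectorSketch {
  // Expand a sparse (size, indices, values) triple into a dense array
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    require(indices.length == values.length, "indices and values must align")
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }

  // The sparse row shown in the output above: (5,[0,1],[1.0,5.0])
  val row: Array[Double] = toDense(5, Array(0, 1), Array(1.0, 5.0))
}
```

`SparseVectorSketch.row` holds 1.0, 5.0, 0.0, 0.0, 0.0, which matches the source row with cart_add_count = 1, cart_remove_count = 5 and zeros elsewhere.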

### Train a Logistic Regression model from the training data

In [3]:
val lr = new LogisticRegression().setLabelCol(LABEL_COL).setFeaturesCol("features").setRegParam(0.2)

// Fit the model
val lrModel = lr.fit(trainingDSTransformed)

// Print the coefficients and intercept for logistic regression
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

Coefficients: [0.07068837177379167,-0.12516099365830843,0.11706925017163039,0.11853810891596289,0.18701273424465098] Intercept: -3.114507068635333
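
To make the printed numbers concrete, the sketch below applies them by hand: it forms the linear score from the coefficients and intercept, then passes it through the logistic (sigmoid) function to get a purchase probability. The coefficient values are copied from the output above. The 0.5 cut-off is assumed to be the default threshold, and the object and method names are illustrative.

```scala
object LogisticScoreSketch {
  // Coefficients and intercept copied from the printed output above
  val coefficients: Array[Double] = Array(
    0.07068837177379167, -0.12516099365830843, 0.11706925017163039,
    0.11853810891596289, 0.18701273424465098)
  val intercept: Double = -3.114507068635333

  // Logistic (sigmoid) function maps the linear score to a value in (0, 1)
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Estimated probability that a session ends in a purchase
  def purchaseProbability(features: Array[Double]): Double = {
    val z = coefficients.zip(features).map { case (w, x) => w * x }.sum + intercept
    sigmoid(z)
  }

  // Predict 1 when the probability exceeds 0.5 (assumed default threshold)
  def predict(features: Array[Double]): Int =
    if (purchaseProbability(features) > 0.5) 1 else 0
}
```

Feeding in the first two training rows, `Array(4.0, 7.0, 0.0, 0.0, 3.0)` and `Array(5.0, 6.0, 12.0, 11.0, 9.0)`, gives a negative and a positive linear score respectively, so `predict` returns 0 and 1, in line with their `is_purchase` values.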


### Use trained model on test data and see how accurate our predictions are

In [7]:
val testDS = sqlContext.read.format("csv").option("header", "true").//
option("inferSchema", "true").load("session_test_data.csv")
val testDSTransformed = assembler.transform(testDS)

val predictionsDS = lrModel.transform(testDSTransformed)

// BinaryClassificationEvaluator only gives AUC/AUPR, 
// using MulticlassClassificationEvaluator for accuracy
val evaluator = new MulticlassClassificationEvaluator().//
setLabelCol("is_purchase").setPredictionCol("prediction").setMetricName("accuracy")

val acc = evaluator.evaluate(predictionsDS)
println(s"accuracy : ${acc}")

accuracy : 0.9


In [20]:
predictionsDS.select("is_purchase", "prediction").show(10)

+-----------+----------+
|is_purchase|prediction|
+-----------+----------+
|          0|       0.0|
|          0|       0.0|
|          0|       0.0|
|          1|       0.0|
|          0|       0.0|
|          1|       1.0|
|          0|       0.0|
|          0|       0.0|
|          1|       0.0|
|          1|       1.0|
+-----------+----------+
only showing top 10 rows
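
The accuracy metric is simply the fraction of rows where the prediction equals the label. The plain-Scala sketch below recomputes it for the 10 rows shown above and also counts the missed purchases. The object and method names are illustrative.

```scala
object AccuracySketch {
  // (is_purchase, prediction) pairs copied from the 10 rows shown above
  val rows: Seq[(Int, Double)] = Seq(
    (0, 0.0), (0, 0.0), (0, 0.0), (1, 0.0), (0, 0.0),
    (1, 1.0), (0, 0.0), (0, 0.0), (1, 0.0), (1, 1.0))

  // Accuracy = correct predictions / all predictions
  def accuracy(pairs: Seq[(Int, Double)]): Double =
    pairs.count { case (label, pred) => label.toDouble == pred }.toDouble / pairs.size

  // Purchases the model missed (label 1, prediction 0)
  def falseNegatives(pairs: Seq[(Int, Double)]): Int =
    pairs.count { case (label, pred) => label == 1 && pred == 0.0 }
}
```

On this 10-row sample the accuracy is 0.8, and both errors are missed purchases. The 0.9 reported earlier is computed over the whole test set.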



## ML Using Spark Pipeline
### See : https://spark.apache.org/docs/2.2.0/ml-pipeline.html

### Here we use the Pipeline construct to string multiple transformations together

In [22]:
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.StandardScaler

val assembler = new VectorAssembler().setInputCols(FEATURE_COLS).setOutputCol("features")

// Scales each feature to unit standard deviation; withMean = false skips mean-centering
val scaler = new StandardScaler().setInputCol("features").//
setOutputCol("scaledFeatures").setWithStd(true).setWithMean(false)

val logisticRegression = new LogisticRegression().setLabelCol("is_purchase").//
setFeaturesCol("scaledFeatures").setRegParam(0.2)

// Output of each pipeline stage is input to next
// No need to explicitly call transform for each component
val pipeline = new Pipeline().setStages(Array(assembler, scaler, logisticRegression))

val pipelineModel = pipeline.fit(trainingDS)

// Run the fitted pipeline on unseen test data
val pipelinePredictions = pipelineModel.transform(testDS)

val acc = evaluator.evaluate(pipelinePredictions)
println(s"accuracy : ${acc}")

accuracy : 0.9
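
The scaling stage in the pipeline, with `setWithStd(true)` and `setWithMean(false)`, divides each feature by its standard deviation without shifting it. The sketch below shows that arithmetic for a single feature column in plain Scala. It assumes the corrected (n - 1) sample standard deviation and maps a zero-variance column to 0.0. The column values are made up for illustration.

```scala
object StdScalingSketch {
  // Corrected (n - 1) sample standard deviation of one feature column
  def sampleStd(xs: Seq[Double]): Double = {
    require(xs.size > 1, "need at least two values")
    val mean = xs.sum / xs.size
    math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / (xs.size - 1))
  }

  // withStd = true, withMean = false: divide by the std, do not subtract the mean
  def scaleByStd(xs: Seq[Double]): Seq[Double] = {
    val std = sampleStd(xs)
    if (std == 0.0) xs.map(_ => 0.0) else xs.map(_ / std)
  }
}
```

With the hypothetical column `Seq(2.0, 4.0, 6.0)` the standard deviation is 2.0, so `scaleByStd` yields 1.0, 2.0, 3.0.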


## Decision Trees
### See : https://spark.apache.org/docs/2.2.0/mllib-decision-tree.html

<table style="width:100%">
  <tr>
   <td><img src="decision_tree_1.jpg" width="420" height="350"></td>
   <td><img src="decision_tree_2.png" width="400" height="300"></td>
  </tr>
</table>

In [4]:
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.sql.functions.udf

val intToDble = udf[Double, Integer]( _.toDouble)
val trainingDSTransformedForDT = trainingDSTransformed.withColumn("is_purchase_double",//
                                                                  intToDble(trainingDSTransformed(LABEL_COL)))

// DecisionTreeClassifier throws an error when the label column is not of type Double
val dt = new DecisionTreeClassifier().setLabelCol("is_purchase_double").setFeaturesCol("features")

// Fit the model
val dtModel = dt.fit(trainingDSTransformedForDT)

In [9]:
// Apply model to test data
val testDSTransformedForDT = testDSTransformed.withColumn("is_purchase_double",//
                                                                  intToDble(testDSTransformed(LABEL_COL)))
val dtPredictions = dtModel.transform(testDSTransformedForDT)

val acc = evaluator.evaluate(dtPredictions)
acc

0.9333333333333333

## Cross Validation with a Random Forest classifier
### See : https://spark.apache.org/docs/2.2.0/ml-tuning.html
### See : https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#random-forest-classifier

In [10]:
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val rf = new RandomForestClassifier().setLabelCol("is_purchase_double").setFeaturesCol("features")

// ParamGrid controls the param combos; the number of combos multiplies REAL fast
// Adding one more maxDepth value takes the # of combos from 48 (3 x 4 x 4) to 64 (4 x 4 x 4)
val rfParamGrid = new ParamGridBuilder().//
addGrid(rf.maxDepth, Array(2, 3, 4)).//
addGrid(rf.maxBins, Array(2, 4, 8, 12)).//
addGrid(rf.numTrees, Array(2, 4, 8, 12)).build()

// CrossValidator itself is an Estimator...
val rfCV = new CrossValidator().setEstimator(rf).//
setEvaluator(new MulticlassClassificationEvaluator().setLabelCol("is_purchase_double").//
             setPredictionCol("prediction").setMetricName("accuracy")).//
setEstimatorParamMaps(rfParamGrid).setNumFolds(3)

// ... and thus has a fit method. It fits every param combo once per fold (K = 3 here)
val rfCVModel = rfCV.fit(trainingDSTransformedForDT)
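
To see how quickly the grid grows, the plain-Scala sketch below enumerates the same candidate values and counts the model fits that 3-fold cross-validation implies. The values are copied from `rfParamGrid` above. The object name and the explicit enumeration are illustrative and do not use the Spark API.

```scala
object ParamGridSketch {
  // Candidate values copied from rfParamGrid above
  val maxDepth = Seq(2, 3, 4)
  val maxBins  = Seq(2, 4, 8, 12)
  val numTrees = Seq(2, 4, 8, 12)
  val numFolds = 3

  // Every (maxDepth, maxBins, numTrees) combination the grid expands to
  val combos: Seq[(Int, Int, Int)] =
    for (d <- maxDepth; b <- maxBins; t <- numTrees) yield (d, b, t)

  // Each combination is trained once per fold during cross-validation
  val cvFits: Int = combos.size * numFolds
}
```

Here `combos.size` is 48 and `cvFits` is 144, before the best combination is refit on the full training set.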

### See what params produced the highest accuracy

In [11]:
println(rfCVModel.bestModel.extractParamMap())
println("numTrees : " + rfCVModel.bestModel.asInstanceOf[RandomForestClassificationModel].getNumTrees)

{
	rfc_fbaadfa046ba-cacheNodeIds: false,
	rfc_fbaadfa046ba-checkpointInterval: 10,
	rfc_fbaadfa046ba-featureSubsetStrategy: auto,
	rfc_fbaadfa046ba-featuresCol: features,
	rfc_fbaadfa046ba-impurity: gini,
	rfc_fbaadfa046ba-labelCol: is_purchase_double,
	rfc_fbaadfa046ba-maxBins: 8,
	rfc_fbaadfa046ba-maxDepth: 3,
	rfc_fbaadfa046ba-maxMemoryInMB: 256,
	rfc_fbaadfa046ba-minInfoGain: 0.0,
	rfc_fbaadfa046ba-minInstancesPerNode: 1,
	rfc_fbaadfa046ba-predictionCol: prediction,
	rfc_fbaadfa046ba-probabilityCol: probability,
	rfc_fbaadfa046ba-rawPredictionCol: rawPrediction,
	rfc_fbaadfa046ba-seed: 207336481,
	rfc_fbaadfa046ba-subsamplingRate: 1.0
}
numTrees : 4
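
The `impurity: gini` entry in the parameter map above names the measure the trees use to judge candidate splits: a node holding a single class has impurity 0, and a 50/50 binary node has impurity 0.5. The sketch below is a plain-Scala illustration of that formula. The function name and the class counts are hypothetical.

```scala
object GiniSketch {
  // Gini impurity of a node given the number of rows in each class:
  // 1 - sum over classes of (class fraction)^2
  def gini(classCounts: Seq[Int]): Double = {
    val total = classCounts.sum.toDouble
    require(total > 0, "node must contain at least one row")
    1.0 - classCounts.map(c => (c / total) * (c / total)).sum
  }
}
```

For instance, `GiniSketch.gini(Seq(5, 5))` is 0.5 and `GiniSketch.gini(Seq(10, 0))` is 0.0.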


### See how the best model does on test data

In [12]:
val rfPredictions = rfCVModel.bestModel.transform(testDSTransformed)

val acc = evaluator.evaluate(rfPredictions)
acc

0.9666666666666667