Data Processing
--------------------

In [1]:
// Preparing the Dataset for training and testing of models

import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "libsvm.data")

// Split data into training (70%) and test (30%).
val Array(training, test) = data.randomSplit(Array(0.7, 0.3),seed=1)

Naive Bayes Algorithm
-------------------------------

In [11]:

val startTime_dt=System.currentTimeMillis

val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")

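// NOTE: test.map is a lazy transformation, so the predictions are only computed
// later, when the metrics are evaluated; the end timestamp below therefore
// mostly reflects training time.
val predictionAndLabels = test.map(p => (model.predict(p.features), p.label))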

val endTime_dt=System.currentTimeMillis

// Get evaluation metrics.
val bMetrics_dt = new BinaryClassificationMetrics(predictionAndLabels)
val metrics_dt=new MulticlassMetrics(predictionAndLabels)


//Model Evaluation
println("Confusion Matrix of Naive Bayes is")
println(metrics_dt.confusionMatrix)
println("\nThe Accuracy of Naive Bayes is")
println(metrics_dt.accuracy)
println("\nThe ROC for Naive Bayes is")
bMetrics_dt.roc.collect().foreach(print)
println("\n")
println("\nThe Area under ROC for Naive Bayes is")
println(bMetrics_dt.areaUnderROC)
println("\nThe Precision of Naive Bayes is")
println(metrics_dt.precision)
println("\nThe Precision by Threshold for Naive Bayes is")
bMetrics_dt.precisionByThreshold.collect().foreach(print)
print("\n")
println("\nExecution Time for Training and Prediction of Naive Bayes is")
println(endTime_dt-startTime_dt + "ms")

Confusion Matrix of Naive Bayes is
2873.0  4579.0  
15.0    316.0   

The Accuracy of Naive Bayes is
0.409739175125273

The ROC for Naive Bayes is
(0.0,0.0)(0.6144659151905528,0.9546827794561934)(1.0,1.0)(1.0,1.0)()
The Area under ROC for Naive Bayes is
0.6701084321328203

The Precision of Naive Bayes is
0.409739175125273

The Precision by Threshold for Naive Bayes is
(1.0,0.06455566905005107)(0.0,0.04252858794809199)
Execution Time for Training and Prediction of Naive Bayes is
438ms


### Analysis

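We started with a simple baseline, Naive Bayes, which applies Bayes' theorem under the assumption that the features are independent given the class and estimates class probabilities from the frequencies seen in the training data. It was among the fastest models in this study (438 ms) but performed poorly on accuracy (41.0%). Note that the overall 'Precision' printed by MulticlassMetrics is identical to the accuracy; the precision for the positive class is the value at threshold 1.0 in 'Precision by Threshold'. The confusion matrix shows that the model did not ignore the minority class: it predicted 316 of the 331 positive records correctly, but it also labelled 4579 negative records as positive, so the precision for the positive class is only about 6.5%. In that sense it reacts to the imbalanced data instead of collapsing to the majority class. To achieve a better accuracy rate, we went on to explore other RDD-based ML models.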
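
The per-class figures can be recomputed directly from the printed confusion matrix. The following is a minimal sketch in plain Scala (no Spark needed). The object and field names are our own, and it assumes the MulticlassMetrics layout of rows as actual labels and columns as predicted labels, with class 1 treated as the positive class.

```scala
// Derive accuracy, precision and recall from a 2x2 confusion matrix.
// Assumed layout: rows = actual label, columns = predicted label.
object ConfusionSummary {
  final case class Counts(tn: Double, fp: Double, fn: Double, tp: Double) {
    def total: Double = tn + fp + fn + tp
    def accuracy: Double = (tn + tp) / total
    // Share of predicted positives that are truly positive.
    def precision: Double = if (tp + fp == 0) 0.0 else tp / (tp + fp)
    // Share of actual positives that were found.
    def recall: Double = if (tp + fn == 0) 0.0 else tp / (tp + fn)
  }

  // Counts copied from the Naive Bayes output above.
  val naiveBayes: Counts = Counts(tn = 2873, fp = 4579, fn = 15, tp = 316)
}
```

With these counts, `ConfusionSummary.naiveBayes.recall` is about 0.955 while `ConfusionSummary.naiveBayes.precision` is about 0.065, which matches the ROC point and the 'Precision by Threshold' value printed above.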

Linear Support Vector Machines (SVM):
-------------------------------------------

In [12]:
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.optimization.L1Updater
import org.apache.spark.mllib.evaluation.MulticlassMetrics

val svmAlg = new SVMWithSGD()
svmAlg.optimizer.setNumIterations(200).setRegParam(0.1).setUpdater(new L1Updater)

val startTime_dt=System.currentTimeMillis

val modelL1 = svmAlg.run(training)

val scoreAndLabels = test.map { point =>
  val score = modelL1.predict(point.features)
  (score, point.label)
}

val endTime_dt=System.currentTimeMillis

// Get evaluation metrics.
val bMetrics_dt = new BinaryClassificationMetrics(scoreAndLabels)
val metrics_dt=new MulticlassMetrics(scoreAndLabels)


//Model Evaluation
println("Confusion Matrix of Linear SVM L1 regularization is")
println(metrics_dt.confusionMatrix)
println("\nThe Accuracy of Linear SVM L1 regularization is")
println(metrics_dt.accuracy)
println("\nThe ROC for Linear SVM L1 regularization is")
println(bMetrics_dt.roc.collect().foreach(print))
print("\n")
println("\nThe Area under ROC for Linear SVM L1 regularization is")
println(bMetrics_dt.areaUnderROC)
println("\nThe Precision of Linear SVM L1 regularization is")
println(metrics_dt.precision)
println("\nThe Precision by Threshold for Linear SVM L1 regularization is")
println(bMetrics_dt.precisionByThreshold.collect().foreach(print))
println("\nExecution Time for Training and Prediction of Linear SVM L1 regularization is")
println(endTime_dt-startTime_dt + "ms")

Confusion Matrix of Linear SVM L1 regularization is
6163.0  1289.0  
235.0   96.0    

The Accuracy of Linear SVM L1 regularization is
0.8041886162148272

The ROC for Linear SVM L1 regularization is
(0.0,0.0)(0.17297369833601717,0.29003021148036257)(1.0,1.0)(1.0,1.0)()

The Area under ROC for Linear SVM L1 regularization is
0.5585282565721728

The Precision of Linear SVM L1 regularization is
0.8041886162148272

The Precision by Threshold for Linear SVM L1 regularization is
(1.0,0.06931407942238267)(0.0,0.04252858794809199)()

Execution Time for Training and Prediction of Linear SVM L1 regularization is
2735ms


### Analysis

The method above applies L1 regularization with the regularization parameter set to 0.1. L1 regularization adds the absolute value of each coefficient to the loss function as a penalty term, which shrinks the coefficients of less important features to zero and so removes some features altogether. Like Naive Bayes, the SVM did not collapse to the majority class: it predicted 96 of the 331 positive records correctly, although the precision for the positive class stays low at about 6.9% (threshold 1.0 in 'Precision by Threshold'). On repeated executions we observed many irregularities in the performance of Linear SVM with L1 regularization, but its accuracy (80.4%) is good compared to the Naive Bayes model (41.0%).
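
To illustrate how an L1 penalty drives small coefficients to exactly zero, here is a minimal sketch of soft-thresholding in plain Scala. The object and function names are hypothetical, and the code mirrors the idea behind an L1 update step rather than reproducing the library's implementation.

```scala
// Illustrative L1 helpers; not MLlib code.
object L1Shrinkage {
  // Soft-thresholding: move each weight toward zero by `shrinkage`
  // and clamp it to zero if it would otherwise cross zero.
  def softThreshold(weights: Array[Double], shrinkage: Double): Array[Double] =
    weights.map(w => math.signum(w) * math.max(0.0, math.abs(w) - shrinkage))

  // The L1 penalty term: regParam times the sum of absolute weights.
  def l1Penalty(weights: Array[Double], regParam: Double): Double =
    regParam * weights.map(w => math.abs(w)).sum
}
```

For example, `L1Shrinkage.softThreshold(Array(0.5, -0.05, 0.1, -0.3), 0.1)` keeps the two larger weights (reduced in magnitude by 0.1) and sets the two smaller ones to zero, which is the feature-removal effect described above.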

Optimizing the Linear SVM Model:
-----------------------------------

In [13]:
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Run training algorithm to build the model
val numIterations = 100

val startTime_dt=System.currentTimeMillis

val model = SVMWithSGD.train(training, numIterations)


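// Compute predictions on the test set. The model keeps its default threshold,
// so predict returns class labels (0.0 or 1.0); calling model.clearThreshold()
// first would make it return raw scores instead.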
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
val endTime_dt=System.currentTimeMillis

// Get evaluation metrics.
val bMetrics_dt = new BinaryClassificationMetrics(scoreAndLabels)
val metrics_dt=new MulticlassMetrics(scoreAndLabels)


//Model Evaluation
println("Confusion Matrix of Linear SVM L2 regularization is")
println(metrics_dt.confusionMatrix)
println("\nThe Accuracy of Linear SVM L2 regularization is")
println(metrics_dt.accuracy)
println("\nThe ROC for Linear SVM L2 regularization is")
println(bMetrics_dt.roc.collect().foreach(print))
print("\n")
println("\nThe Area under ROC for Linear SVM L2 regularization is")
println(bMetrics_dt.areaUnderROC)
println("\nThe Precision of Linear SVM L2 regularization is")
println(metrics_dt.precision)
println("\nThe Precision by Threshold for Linear SVM L2 regularization is")
println(bMetrics_dt.precisionByThreshold.collect().foreach(print))
println("\nExecution Time for Training and Prediction of Linear SVM L2 regularization is")
println(endTime_dt-startTime_dt + "ms")

Confusion Matrix of Linear SVM L2 regularization is
7452.0  0.0  
331.0   0.0  

The Accuracy of Linear SVM L2 regularization is
0.957471412051908

The ROC for Linear SVM L2 regularization is
(0.0,0.0)(1.0,1.0)(1.0,1.0)()

The Area under ROC for Linear SVM L2 regularization is
0.5

The Precision of Linear SVM L2 regularization is
0.957471412051908

The Precision by Threshold for Linear SVM L2 regularization is
(0.0,0.04252858794809199)()

Execution Time for Training and Prediction of Linear SVM L2 regularization is
1831ms


### Analysis

To optimize the Linear SVM with L1 regularization, we trained a Linear Support Vector Machine (SVM) using Stochastic Gradient Descent with the library defaults, which the MLlib guide describes as L2 regularization with the regularization parameter set to 1.0. L2 regularization adds the squared magnitude of each coefficient to the loss function as a penalty term. Accuracy rose from about 80% to about 96%, but we cannot treat this as a real improvement: the confusion matrix shows that the model predicted class 0 for every test record, so the accuracy is simply the share of the majority class (7452 of 7783 records), the area under ROC is 0.5, and 'Precision by Threshold' has no entry for threshold 1.0. This points to the imbalanced data rather than to a better model. We also observed a reduction in execution time compared to L1 regularization (1831 ms versus 2735 ms).
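
A quick way to sanity-check an accuracy figure on imbalanced data is to compare it with a model that always predicts the majority class. The sketch below is plain Scala with hypothetical names. It also computes the area under a ROC curve that has a single operating point, which is the situation here because the metrics are built from 0/1 predictions.

```scala
// Baseline figures for an "always predict class 0" model; illustrative only.
object MajorityBaseline {
  // Accuracy obtained by labelling every record as negative.
  def accuracy(negatives: Long, positives: Long): Double =
    negatives.toDouble / (negatives + positives)

  // Trapezoid-rule area under the curve (0,0) -> (fpr,tpr) -> (1,1).
  def singlePointAuc(fpr: Double, tpr: Double): Double =
    (fpr * tpr) / 2 + (1 - fpr) * (tpr + 1) / 2
}
```

`MajorityBaseline.accuracy(7452, 331)` reproduces the 0.957 accuracy reported above, and `MajorityBaseline.singlePointAuc(0.0, 0.0)` gives 0.5, so both printed figures are exactly what a model with no positive predictions produces.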

Logistic Regression
-------------------------

In [8]:
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS,LogisticRegressionWithSGD}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

val startTime_dt=System.currentTimeMillis

// Run training algorithm to build the model with the default settings
// (this cell sets neither an L1Updater nor a regularization parameter).
val model = new LogisticRegressionWithSGD().run(training)

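// Compute predictions on the test set. With the default threshold in place,
// predict returns class labels (0.0 or 1.0), not raw scores.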
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}

val endTime_dt=System.currentTimeMillis

// Get evaluation metrics.
val bMetrics_dt = new BinaryClassificationMetrics(predictionAndLabels)
val metrics_dt=new MulticlassMetrics(predictionAndLabels)


//Model Evaluation
println("Confusion Matrix of Logistic regression L1 regularization is")
println(metrics_dt.confusionMatrix)
println("\nThe Accuracy of Logistic regression L1 regularization is")
println(metrics_dt.accuracy)
println("\nThe ROC for Logistic regression L1 regularization is")
bMetrics_dt.roc.collect().foreach(print)
print("\n")
println("\nThe Area under ROC for Logistic regression L1 regularization is")
println(bMetrics_dt.areaUnderROC)
println("\nThe Precision of Logistic regression L1 regularization is")
println(metrics_dt.precision)
println("\nThe Precision by Threshold for Logistic regression L1 regularization is")
println(bMetrics_dt.precisionByThreshold.collect().foreach(print))
println("\nExecution Time for Training and Prediction of Logistic regression L1 regularization is")
println(endTime_dt-startTime_dt + "ms")

Confusion Matrix of Logistic regression L1 regularization is
7452.0  0.0  
331.0   0.0  

The Accuracy of Logistic regression L1 regularization is
0.957471412051908

The ROC for Logistic regression L1 regularization is
(0.0,0.0)(1.0,1.0)(1.0,1.0)
The Area under ROC for Logistic regression L1 regularization is
0.5

The Precision of Logistic regression L1 regularization is
0.957471412051908

The Precision by Threshold for Logistic regression L1 regularization is
(0.0,0.04252858794809199)()

Execution Time for Training and Prediction of Logistic regression L1 regularization is
1795ms


### Analysis

The code above runs LogisticRegressionWithSGD with its default settings. Unlike the SVM cell, it does not set an L1Updater or a regularization parameter, so the 'L1 regularization' wording in the printed labels does not correspond to an explicit setting in this cell. We observed many irregularities in the performance of this model, especially in accuracy, ROC and Precision by Threshold. The confusion matrix shows the same pattern as the L2-regularized SVM: every test record is predicted as class 0, so the accuracy of 95.7% is the majority-class share and the area under ROC is 0.5. Repeated executions led us to conclude that the model is not handling the imbalanced data. The data-frame-based Logistic Regression model has a weight parameter to handle imbalanced data, but no such option is available for RDDs.
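
As an illustration of what such a weight parameter is typically fed, the sketch below computes "balanced" class weights, total / (numClasses * classCount), in plain Scala. The names are hypothetical, and the counts used in the example are the test-set counts from the confusion matrix, taken only for illustration; in practice the weights would be derived from the training data.

```scala
// Balanced class weights; a sketch, not part of the original experiment.
object ClassWeights {
  // Each class receives total / (numClasses * classCount), so the
  // weighted record totals of all classes become equal.
  def balanced(counts: Map[Double, Long]): Map[Double, Double] = {
    val total = counts.values.sum.toDouble
    val numClasses = counts.size
    counts.map { case (label, n) => label -> (total / (numClasses * n)) }
  }
}
```

`ClassWeights.balanced(Map(0.0 -> 7452L, 1.0 -> 331L))` yields a weight of about 0.52 for class 0 and about 11.76 for class 1, so each positive record counts roughly as much as 22 negative ones.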

Optimization of Logistic regression:
-------------------------------------

In [9]:

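val startTime_dt=System.currentTimeMillis
// NOTE: setNumClasses(10) trains a 10-class multinomial model,
// although the labels in this dataset are binary.
val model = new LogisticRegressionWithLBFGS().setNumClasses(10).run(training)

// NOTE: this second assignment restarts the timer after training,
// so the time reported below does not include model training.
val startTime_dt=System.currentTimeMillis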

val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}

val endTime_dt=System.currentTimeMillis

// Get evaluation metrics.
val bMetrics_dt = new BinaryClassificationMetrics(predictionAndLabels)
val metrics_dt=new MulticlassMetrics(predictionAndLabels)


//Model Evaluation
println("Confusion Matrix of Logistic regression L2 regularization is")
println(metrics_dt.confusionMatrix)
println("\nThe Accuracy of Logistic regression L2 regularization is")
println(metrics_dt.accuracy)
println("\nThe ROC for Logistic regression L2 regularization is")
println(bMetrics_dt.roc.collect().foreach(print))
print("\n")
println("\nThe Area under ROC for Logistic regression L2 regularization is")
println(bMetrics_dt.areaUnderROC)
println("\nThe Precision of Logistic regression L2 regularization is")
println(metrics_dt.precision)
println("\nThe Precision by Threshold for Logistic regression L2 regularization is")
println(bMetrics_dt.precisionByThreshold.collect().foreach(print))
println("\nThe Model Execution Time for Logistic regression L2 regularization is")
println(endTime_dt-startTime_dt + "ms")

Confusion Matrix of Logistic regression L2 regularization is
7382.0  70.0   
135.0   196.0  

The Accuracy of Logistic regression L2 regularization is
0.9736605422073751

The ROC for Logistic regression L2 regularization is
(0.0,0.0)(0.00939345142243693,0.5921450151057401)(1.0,1.0)(1.0,1.0)()

The Area under ROC for Logistic regression L2 regularization is
0.7913757818416516

The Precision of Logistic regression L2 regularization is
0.9736605422073751

The Precision by Threshold for Logistic regression L2 regularization is
(1.0,0.7368421052631579)(0.0,0.04252858794809199)()

The Model Execution Time for Logistic regression L2 regularization is
247ms


### Analysis

For Logistic Regression, the L-BFGS version is strongly recommended because it converges faster and more accurately than SGD; it is a quasi-Newton method that approximates the inverse Hessian matrix. The cell uses LogisticRegressionWithLBFGS with its default settings and does not set a regularization parameter explicitly. Compared with the SGD version, the improvement is evident in the confusion matrix, accuracy and ROC: the model predicted 196 of the 331 positive records correctly, with a precision of about 73.7% for the positive class. It therefore handled the imbalanced data reasonably well. The reported execution time is also very low (247 ms), but in this cell the timer is restarted after training, so that figure does not include model training.
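
Execution times in these cells are taken around `test.map`, which in Spark is a lazy transformation: the work only runs when an action forces it. The sketch below uses a plain Scala `Iterator` as a stand-in for an RDD to show the effect, together with a small timing helper. All names are our own, and it is meant as an illustration of the measurement pitfall, not as the timing code used in this study.

```scala
// Timing helper plus a demonstration of lazy evaluation; illustrative only.
object Timing {
  // Run `block` and return its result with the elapsed milliseconds.
  def timed[A](block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    (result, (System.nanoTime() - start) / 1000000L)
  }

  // Returns how many elements were processed before and after forcing
  // a lazily mapped iterator.
  def lazyDemo(): (Int, Int) = {
    var processed = 0
    val mapped = Iterator(1, 2, 3).map { x => processed += 1; x * 2 }
    val before = processed   // nothing has run yet
    mapped.foreach(_ => ())  // consuming the iterator runs the function
    (before, processed)
  }
}
```

`Timing.lazyDemo()` returns `(0, 3)`: no element is processed until the iterator is consumed. For an RDD, wrapping both training and an action such as `count()` on the predictions inside `Timing.timed { ... }` would be one way to include prediction time in the measurement.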

Decision Tree    
------------------------

In [10]:
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils

// Train a DecisionTree model.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 10
val maxBins = 32

val startTime_dt=System.currentTimeMillis

val model = DecisionTree.trainClassifier(training, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)

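// Predict on the test instances.
// NOTE: the metrics classes below expect (prediction, label) pairs; this cell
// builds (label, prediction), so the rows of the printed confusion matrix are
// predictions, and precision and recall are swapped.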
val labelAndPreds = test.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

val endTime_dt=System.currentTimeMillis

//Model Evaluation
val metrics_dt = new MulticlassMetrics(labelAndPreds)
val bMetrics_dt = new BinaryClassificationMetrics(labelAndPreds)
println("Confusion Matrix of Decision Tree is")
println(metrics_dt.confusionMatrix)
println("\nThe Accuracy of Decision Tree is")
println(metrics_dt.accuracy)
println("\nThe ROC for Decision Tree is")
println(bMetrics_dt.roc.collect().foreach(print))
print("\n")
println("\nThe Area under ROC for Decision Tree is")
println(bMetrics_dt.areaUnderROC)
println("\nThe Precision of Decision Tree is")
println(metrics_dt.precision)
println("\nThe Precision by Threshold for Decision Tree is")
println(bMetrics_dt.precisionByThreshold.collect().foreach(print))
println("\nExecution Time for Training and Prediction of Decision Tree is")
println(endTime_dt-startTime_dt + "ms")

Confusion Matrix of Decision Tree is
7423.0  29.0   
29.0    302.0  

The Accuracy of Decision Tree is
0.9925478607220866

The ROC for Decision Tree is
(0.0,0.0)(0.0038915727321524422,0.9123867069486404)(1.0,1.0)(1.0,1.0)()

The Area under ROC for Decision Tree is
0.954247567108244

The Precision of Decision Tree is
0.9925478607220866

The Precision by Threshold for Decision Tree is
(1.0,0.9123867069486404)(0.0,0.04252858794809199)()

Execution Time for Training and Prediction of Decision Tree is
1481ms


### Analysis

Decision Trees are known for over-fitting the data, and the accuracy rate of 99.25% above made us suspect it here, although test-set figures alone cannot confirm over-fitting; that would need a comparison with the accuracy on the training data. A single Decision Tree does not incorporate bootstrapping or any other ensemble technique. At each node it greedily picks the feature and threshold that most reduce the impurity (Gini in this case), so one tree can follow the training data very closely. This is one reason for over-fitting. The behaviour was observed to be very similar to the data-frame-based model. In the data-frame-based model we handled imbalanced data by setting a threshold; there is no such option in the RDD Decision Tree. Even so, the confusion matrix shows 302 of the 331 positive records predicted correctly. The execution time for the RDD-based Decision Tree is considerably less than that of its data frame version.
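
The cell above sets `impurity = "gini"`. As a rough illustration of how a tree scores candidate splits with that measure, the sketch below computes the Gini impurity of a node and of a binary split, then picks the candidate with the lowest weighted impurity. It is plain Scala with hypothetical names and toy inputs, and it leaves out everything else a real tree builder does (binning, depth limits, stopping rules).

```scala
// Gini-based split scoring; a simplified sketch, not MLlib's implementation.
object GiniSplit {
  // Gini impurity of a node from its per-class record counts: 1 - sum(p_i^2).
  def gini(counts: Seq[Long]): Double = {
    val total = counts.sum.toDouble
    if (total == 0) 0.0 else 1.0 - counts.map(c => math.pow(c / total, 2)).sum
  }

  // Impurity of a binary split: child impurities weighted by child size.
  def weightedGini(left: Seq[Long], right: Seq[Long]): Double = {
    val nLeft = left.sum.toDouble
    val nRight = right.sum.toDouble
    val n = nLeft + nRight
    if (n == 0) 0.0 else (nLeft / n) * gini(left) + (nRight / n) * gini(right)
  }

  // Choose the candidate split with the lowest weighted impurity.
  def bestSplit(candidates: Map[String, (Seq[Long], Seq[Long])]): String =
    candidates.minBy { case (_, (l, r)) => weightedGini(l, r) }._1
}
```

With the toy candidates `"featureA" -> (Seq(40L, 10L), Seq(10L, 40L))` and `"featureB" -> (Seq(25L, 25L), Seq(25L, 25L))`, `GiniSplit.bestSplit` selects `featureA`, because its children are purer (weighted impurity 0.32 against 0.5).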

Random Forest
------------------------

In [11]:
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

// Train a RandomForest model.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 500
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 15
val maxBins = 32

val startTime_dt=System.currentTimeMillis

val model = RandomForest.trainClassifier(training, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

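// Predict on the test instances.
// NOTE: the metrics classes below expect (prediction, label) pairs; this cell
// builds (label, prediction), so the rows of the printed confusion matrix are
// predictions, and precision and recall are swapped.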
val labelAndPreds = test.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

val endTime_dt=System.currentTimeMillis

//Model Evaluation
val metrics_dt = new MulticlassMetrics(labelAndPreds)
val bMetrics_dt = new BinaryClassificationMetrics(labelAndPreds)
println("Confusion Matrix of Random Forest is")
println(metrics_dt.confusionMatrix)
println("\nThe Accuracy of Random Forest is")
println(metrics_dt.accuracy)
println("\nThe ROC for Random Forest is")
println(bMetrics_dt.roc.collect().foreach(print))
print("\n")
println("\nThe Area under ROC for Random Forest is")
println(bMetrics_dt.areaUnderROC)
println("\nThe Precision of Random Forest is")
println(metrics_dt.precision)
println("\nThe Precision by Threshold for Random Forest is")
println(bMetrics_dt.precisionByThreshold.collect().foreach(print))
println("\nExecution Time for Training and Prediction of Random Forest is")
println(endTime_dt-startTime_dt + "ms")


Confusion Matrix of Random Forest is
7447.0  27.0                                                                    
5.0     304.0  

The Accuracy of Random Forest is
0.9958884748811512                                                              

The ROC for Random Forest is
(0.0,0.0)(0.0036125234145036127,0.9838187702265372)(1.0,1.0)(1.0,1.0)()

The Area under ROC for Random Forest is
0.9901031234060168

The Precision of Random Forest is
0.9958884748811512

The Precision by Threshold for Random Forest is
(1.0,0.918429003021148)(0.0,0.03970191442888346)()

Execution Time for Training and Prediction of Random Forest is
70520ms


### Analysis

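Unlike Decision Trees, Random Forest has bootstrap aggregation and other techniques incorporated into the model itself for better data analysis. Just like its data frame sibling, the RDD version has a very high execution time in comparison to the other models; at 70520 ms it is the highest of all models considered in this study. It is true that we set 500 trees while defining the model, but that was for better averaging and accuracy. It gives a high accuracy rate (99.6%), and we suspect that this is because of over-fitting; the 'Precision by Threshold' figures strengthened our doubts. Note that this cell passes (label, prediction) pairs to the metrics classes, which expect (prediction, label), so the rows of the confusion matrix above are predictions: 304 of the 331 positive test records were predicted correctly, and the printed precision of 0.918 is in fact the recall for the positive class.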
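
Bootstrap aggregation rests on two steps: each tree is trained on a sample drawn with replacement, and the trees' predictions are combined by voting. The following plain Scala sketch shows just those two steps with hypothetical helper names; it does not train any trees and is not how MLlib implements Random Forest internally.

```scala
import scala.util.Random

// The two building blocks of bagging; illustrative only.
object Bagging {
  // Draw a bootstrap sample: same size as the input, sampled with replacement.
  def bootstrap[A](data: IndexedSeq[A], rng: Random): IndexedSeq[A] =
    IndexedSeq.fill(data.length)(data(rng.nextInt(data.length)))

  // Combine per-tree predictions by majority vote
  // (the winner among tied labels is unspecified).
  def majorityVote(predictions: Seq[Double]): Double =
    predictions.groupBy(identity).maxBy(_._2.size)._1
}
```

For instance, `Bagging.majorityVote(Seq(1.0, 0.0, 1.0))` returns `1.0`. Because every tree sees a different resample, averaging over many trees (500 in the cell above) reduces the variance of a single tree, at the cost of the longer training time noted in the analysis.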

Gradient-Boosted Tree
----------------------------


In [5]:
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils

// Train a GradientBoostedTrees model.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 20
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 15
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()
boostingStrategy.learningRate=0.1

val startTime_dt=System.currentTimeMillis

val model = GradientBoostedTrees.train(training, boostingStrategy)

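// Predict on the test instances.
// NOTE: the metrics classes below expect (prediction, label) pairs; this cell
// builds (label, prediction), so the rows of the printed confusion matrix are
// predictions, and precision and recall are swapped.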
val labelAndPreds = test.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

val endTime_dt=System.currentTimeMillis

//Model Evaluation
val metrics_dt = new MulticlassMetrics(labelAndPreds)
val bMetrics_dt = new BinaryClassificationMetrics(labelAndPreds)
println("Confusion Matrix of Gradient Boosted Tree is")
println(metrics_dt.confusionMatrix)
println("\nThe Accuracy of Gradient Boosted Tree is")
println(metrics_dt.accuracy)
println("\nThe ROC for Gradient Boosted Tree is")
println(bMetrics_dt.roc.collect().foreach(print))
print("\n")
println("\nThe Area under ROC for Gradient Boosted Tree is")
println(bMetrics_dt.areaUnderROC)
println("\nThe Precision of Gradient Boosted Tree is")
println(metrics_dt.precision)
println("\nThe Precision by Threshold for Gradient Boosted Tree is")
println(bMetrics_dt.precisionByThreshold.collect().foreach(print))
println("\nExecution Time for Training and Prediction of Gradient Boosted Tree is")
println(endTime_dt-startTime_dt + "ms")


Confusion Matrix of Gradient Boosted Tree is
7404.0  37.0   
48.0    294.0  

The Accuracy of Gradient Boosted Tree is
0.989078761403058

The ROC for Gradient Boosted Tree is
(0.0,0.0)(0.004972449939524257,0.8596491228070176)(1.0,1.0)(1.0,1.0)()

The Area under ROC for Gradient Boosted Tree is
0.9273383364337466

The Precision of Gradient Boosted Tree is
0.989078761403058

The Precision by Threshold for Gradient Boosted Tree is
(1.0,0.8882175226586103)(0.0,0.04394192470769626)()

Execution Time for Training and Prediction of Gradient Boosted Tree is
29491ms


### Analysis

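The Spark documentation explicitly mentions that the RDD-based Gradient-Boosted Tree gives inconsistent performance for imbalanced data. However, setting the learning rate helps to reduce irregularities in performance and over-fitting, so we did just that for this analysis (learningRate = 0.1). Evaluations showed that, although it is known as a high-performing model, the Spark RDD version is limited in handling imbalanced data. As in the Random Forest cell, the pairs passed to the metrics classes are in (label, prediction) order, so the rows of the confusion matrix are predictions: 294 of the 331 positive test records were predicted correctly.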
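
The effect of the learning rate can be seen in a toy boosting loop. In the sketch below, which is plain Scala with hypothetical names, each "weak learner" predicts the current residual exactly and only a fraction of it, the learning rate, is added to the ensemble. This is a deliberately simplified model of shrinkage for a single target value, not the GradientBoostedTrees algorithm itself.

```scala
// Toy illustration of shrinkage in boosting; not MLlib code.
object Shrinkage {
  // Remaining error after `numIterations` rounds when every round adds
  // learningRate * (current residual) to the prediction.
  def residualAfter(target: Double, learningRate: Double, numIterations: Int): Double = {
    var prediction = 0.0
    for (_ <- 1 to numIterations) {
      val residual = target - prediction
      prediction += learningRate * residual
    }
    target - prediction
  }
}
```

With the settings used above, `Shrinkage.residualAfter(1.0, 0.1, 20)` leaves about 12% of the target unexplained, while a learning rate of 1.0 would remove the whole residual in a single round. Smaller steps make each tree less decisive, which is why a low learning rate is commonly used to dampen over-fitting.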

## CASE STUDY 3 CONCLUSION (FINAL CONCLUSION)

We know that the RDD-based API is nearing the end of its road map and that data frames are the future. Our analysis indicated that models like Naive Bayes and SVM react to imbalanced data without extra tuning but, as far as we could tell, these were not yet ported to data frames (note that newer spark.ml releases do include NaiveBayes and LinearSVC). Also, the RDD-based MLlib API is in maintenance mode, with no new feature development. In short, the machine learning capabilities of Spark have many limitations compared to Python or R, which is one reason Spark is less popular in the market than Python or R. The inability to handle imbalanced data in many scenarios is a key drawback that we identified in this study. Our research also shows that development is in progress on Spark data frames to overcome many of these drawbacks; the JIRA link that we provided earlier in CASE STUDY 2 is strong proof of this. We hope to see these features in Spark soon, making it more popular.