## Anomaly Detection in Credit Card Fraud with Different Clustering Algorithms
 
Machine learning algorithms fall into two main groups: supervised learning and unsupervised learning. A supervised algorithm predicts unknown values for new data, so it needs the target value for many previously seen examples. An unsupervised algorithm, in contrast, explores data that has no target attribute in order to find intrinsic structure in it.

Clustering is a technique for finding groups of similar instances, called clusters, in data. It is usually treated as an unsupervised learning task because no class values denoting an a priori grouping of the data instances are given.
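As a rough illustration of what "finding similar groups" means in the centroid-based algorithms used below, the following sketch assigns each point to its nearest centroid. The object name `NearestCentroid` and the toy coordinates are assumptions made for this example and are not part of the notebook.

```scala
object NearestCentroid {
  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "points must have the same dimension")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  // Index of the centroid that is closest to the given point.
  def assign(point: Array[Double], centroids: Seq[Array[Double]]): Int = {
    require(centroids.nonEmpty, "at least one centroid is required")
    centroids.indices.minBy(i => squaredDistance(point, centroids(i)))
  }

  def main(args: Array[String]): Unit = {
    // Toy data, chosen only for illustration.
    val centroids = Seq(Array(0.0, 0.0), Array(10.0, 10.0))
    val points = Seq(Array(1.0, 2.0), Array(9.0, 8.5), Array(0.5, -1.0))
    points.foreach { p =>
      val text = p.mkString("(", ", ", ")")
      val cluster = assign(p, centroids)
      println(s"$text -> cluster $cluster")
    }
  }
}
```

K-means alternates this assignment step with recomputing each centroid as the mean of its assigned points.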

### The implementation uses the DataFrame-based API of Spark MLlib.
 ### Algorithms:
   - K-means
   - Gaussian Mixture Model (GMM)
   - Bisecting K-Means

### 1)-Importing Packages

In [31]:
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.ml.clustering._
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.ml.evaluation.{MulticlassClassificationEvaluator, BinaryClassificationEvaluator}
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler, StringIndexer, MinMaxScaler}
import org.apache.spark.ml.Pipeline
// The DataFrame-based API works with vectors from org.apache.spark.ml.linalg.
import org.apache.spark.ml.linalg.{Vector, Vectors}
import java.io.{File, PrintWriter}
import java.text.SimpleDateFormat
import java.util.Calendar

### 2)-Load Dataset

In [32]:
// Read the CSV with a header row and drop malformed lines; every column is loaded as a string.
val raw = spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("datasets/creditcard.csv")

raw: org.apache.spark.sql.DataFrame = [Time: string, V1: string ... 29 more fields]


In [33]:
// Cast every column to Double.
val df = raw.select(
  ((1 to 28).map(i => "V" + i) ++ Array("Time", "Amount", "Class"))
    .map(s => col(s).cast("Double")): _*)

df: org.apache.spark.sql.DataFrame = [V1: double, V2: double ... 29 more fields]


In [34]:
// select a few columns to show.
df.select("V1", "V2", "Time", "Amount", "Class").show()
println(" Class statistics: 1 represents fraud and 0 represents normal")
df.groupBy("Class").count().show()

+------------------+-------------------+----+------+-----+
|                V1|                 V2|Time|Amount|Class|
+------------------+-------------------+----+------+-----+
|  -1.3598071336738|-0.0727811733098497| 0.0|149.62|  0.0|
|  1.19185711131486|   0.26615071205963| 0.0|  2.69|  0.0|
| -1.35835406159823|  -1.34016307473609| 1.0|378.66|  0.0|
|-0.966271711572087| -0.185226008082898| 1.0| 123.5|  0.0|
| -1.15823309349523|  0.877736754848451| 2.0| 69.99|  0.0|
|-0.425965884412454|  0.960523044882985| 2.0|  3.67|  0.0|
|  1.22965763450793|  0.141003507049326| 4.0|  4.99|  0.0|
|-0.644269442348146|   1.41796354547385| 7.0|  40.8|  0.0|
| -0.89428608220282|  0.286157196276544| 7.0|  93.2|  0.0|
| -0.33826175242575|   1.11959337641566| 9.0|  3.68|  0.0|
|  1.44904378114715|  -1.17633882535966|10.0|   7.8|  0.0|
|  0.38497821518095|  0.616109459176472|10.0|  9.99|  0.0|
|    1.249998742053|  -1.22163680921816|10.0| 121.5|  0.0|
|   1.0693735878819|  0.287722129331455|11.0|  27.5|  0.

### 3)-Data Preparation for Training

In [35]:
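// Index the Class column as the label; the most frequent value (0, normal) becomes 0.0 and fraud becomes 1.0.
val labelConverter = new StringIndexer().setInputCol("Class").setOutputCol("label")
// Assemble V1..V28 and Amount into one vector; Time is left out of the features and used only for the split.
val assembler = new VectorAssembler().setInputCols((1 to 28).map(i => "V" + i).toArray ++ Array("Amount")).setOutputCol("assembled")
// Rescale every feature to the [0, 1] range.
val scaler = new MinMaxScaler().setInputCol("assembled").setOutputCol("features")
//val scaler = new StandardScaler().setInputCol("assembled").setOutputCol("features")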
val pipeline = new Pipeline().setStages(Array(assembler, scaler, labelConverter))
val pipelineModel = pipeline.fit(df)
val data = pipelineModel.transform(df)
println("Generate feature from raw data:")
data.select("features","label").show()

labelConverter: org.apache.spark.ml.feature.StringIndexer = strIdx_4eb457338627
assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_1c69964497d2
scaler: org.apache.spark.ml.feature.MinMaxScaler = minMaxScal_f286939561f0
pipeline: org.apache.spark.ml.Pipeline = pipeline_db96c3ee1eff
pipelineModel: org.apache.spark.ml.PipelineModel = pipeline_db96c3ee1eff
data: org.apache.spark.sql.DataFrame = [V1: double, V2: double ... 32 more fields]
Generate feature from raw data:
+--------------------+-----+
|            features|label|
+--------------------+-----+
|[0.93519233743373...|  0.0|
|[0.97854195497169...|  0.0|
|[0.93521702332994...|  0.0|
|[0.94187801720890...|  0.0|
|[0.93861683090479...|  0.0|
|[0.95105714452038...|  0.0|
|[0.97918413907815...|  0.0|
|[0.94734843724736...|  0.0|
|[0.94310096396120...|  0.0|
|[0.95254712917866...|  0.0|
|[0.98291123819341...|  0.0|
|[0.96483408113384...|  0.0|
|[0.97952970932083...|  0.0|
|[0.97646111149632...|  0.0|
|[0.91086362048888
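To make the rescaling step concrete, here is a minimal sketch of the min-max formula applied to one feature column, written in plain Scala without Spark. The object name `MinMaxSketch` and the sample values are assumptions for illustration; the handling of a constant column (mapping it to the midpoint of the range) follows the behavior documented for Spark's `MinMaxScaler`.

```scala
object MinMaxSketch {
  // Rescale one feature column to the range [lo, hi].
  // A constant column carries no spread, so it is mapped to the midpoint of the range.
  def rescale(column: Seq[Double], lo: Double = 0.0, hi: Double = 1.0): Seq[Double] = {
    require(column.nonEmpty, "column must not be empty")
    val eMin = column.min
    val eMax = column.max
    if (eMax == eMin) column.map(_ => 0.5 * (lo + hi))
    else column.map(x => (x - eMin) / (eMax - eMin) * (hi - lo) + lo)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical amounts, not taken from the dataset.
    val amounts = Seq(2.0, 4.0, 6.0)
    println(rescale(amounts).mkString(", "))
  }
}
```

Because the scaler is fitted on the data it sees, a single extreme value in a column compresses all other values of that column into a narrow band, which may explain why many of the scaled features above sit close together.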

In [36]:
// Split chronologically: the 0.7 quantile of Time separates training (earlier) rows from validation (later) rows.
val splitTime = data.stat.approxQuantile("Time", Array(0.7), 0.001).head
val trainingData = data.filter(s"Time<$splitTime").cache()
val validationData = data.filter(s"Time>=$splitTime").cache()
println("Split data into Training and Validation: ")
println("training records count: " + trainingData.count())
println("validation records count: " + validationData.count())

splitTime: Double = 132913.0
trainingData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [V1: double, V2: double ... 32 more fields]
validationData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [V1: double, V2: double ... 32 more fields]
Split data into Training and Validation: 
training records count: 199337
validation records count: 85470
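The chronological split above can be sketched without Spark as follows. This is only an illustration: the `TimeSplit` object, the `Txn` case class and the nearest-rank quantile are assumptions made for the example, whereas the notebook relies on Spark's approximate `approxQuantile`.

```scala
object TimeSplit {
  // Hypothetical, simplified transaction record.
  final case class Txn(time: Double, amount: Double)

  // Exact nearest-rank quantile; Spark's approxQuantile returns an approximation instead.
  def quantile(values: Seq[Double], q: Double): Double = {
    require(values.nonEmpty && q >= 0.0 && q <= 1.0, "need data and q in [0, 1]")
    val sorted = values.sorted
    val rank = math.ceil(q * sorted.length).toInt
    sorted(math.max(rank, 1) - 1)
  }

  // Rows strictly before the cut-off go to training, the rest to validation.
  def split(txns: Seq[Txn], q: Double): (Seq[Txn], Seq[Txn]) = {
    val cut = quantile(txns.map(_.time), q)
    txns.partition(_.time < cut)
  }

  def main(args: Array[String]): Unit = {
    val txns = (1 to 10).map(i => Txn(i.toDouble, 1.0))
    val (train, valid) = split(txns, 0.7)
    println(s"training rows: ${train.size}, validation rows: ${valid.size}")
  }
}
```

Splitting by time rather than at random keeps every validation transaction later than every training transaction, which is closer to how a fraud model would be used in practice.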


### 4)-Training the models

In [79]:
// Configure the clustering estimators with k = 2; the models are trained in the following cells.
val kmeans = new KMeans().setK(2).setFeaturesCol("features").setPredictionCol("clusters").setSeed(1L).setMaxIter(100).setInitMode("random")
val bisectingkmeans = new BisectingKMeans().setK(2).setFeaturesCol("features").setPredictionCol("clusters").setSeed(1L).setMaxIter(100)
val gaussianMixture = new GaussianMixture().setK(2).setFeaturesCol("features").setPredictionCol("clusters").setSeed(1L).setMaxIter(100)
// LDA is configured here but is not trained or evaluated in the cells below.
val latentDirichletAllocation = new LDA().setK(2).setFeaturesCol("features").setTopicDistributionCol("clusters").setSeed(1L).setMaxIter(100)

kmeans: org.apache.spark.ml.clustering.KMeans = kmeans_c563b314953f
bisectingkmeans: org.apache.spark.ml.clustering.BisectingKMeans = bisecting-kmeans_a91520f2600b
gaussianMixture: org.apache.spark.ml.clustering.GaussianMixture = GaussianMixture_662b7810399c
latentDirichletAllocation: org.apache.spark.ml.clustering.LDA = lda_a2ee372d8ff2


In [80]:
// Fit the K-means model on the training data and time the training.
val t = System.nanoTime
val modelKmeans = kmeans.fit(trainingData)
val durationkmeans = (System.nanoTime - t) / 1e9d
println("\ninitial model training finished.")
println(s"Training process takes $durationkmeans secs")

t: Long = 386419915606814
modelKmeans: org.apache.spark.ml.clustering.KMeansModel = kmeans_c563b314953f
durationkmeans: Double = 3.794129698

initial model training finished.
Training process takes 3.794129698 secs


In [81]:
// Fit the Bisecting K-means model on the training data and time the training.
val t = System.nanoTime
val modelBisectingkmeans = bisectingkmeans.fit(trainingData)
val durationbisectingkmeans = (System.nanoTime - t) / 1e9d
println("\ninitial model training finished.")
println(s"Training process takes $durationbisectingkmeans secs")

t: Long = 386425215362673
modelBisectingkmeans: org.apache.spark.ml.clustering.BisectingKMeansModel = bisecting-kmeans_a91520f2600b
durationbisectingkmeans: Double = 29.120097831

initial model training finished.
Training process takes 29.120097831 secs


In [82]:
// Fit the Gaussian Mixture Model on the training data and time the training.
val t = System.nanoTime
val modelGMM = gaussianMixture.fit(trainingData)
val durationGMM = (System.nanoTime - t) / 1e9d
println("\ninitial model training finished.")
println(s"Training process takes $durationGMM secs")

t: Long = 386455071015215
modelGMM: org.apache.spark.ml.clustering.GaussianMixtureModel = GaussianMixture_662b7810399c
durationGMM: Double = 12.003813243

initial model training finished.
Training process takes 12.003813243 secs
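The three training cells repeat the same timing pattern. One way to factor it out is a small helper that runs a block once and returns both its result and the elapsed time. The `Timing` object and its `timed` method are hypothetical names introduced for this sketch.

```scala
object Timing {
  // Runs the block exactly once and returns its result together with the elapsed seconds.
  def timed[A](block: => A): (A, Double) = {
    val start = System.nanoTime
    val result = block
    val seconds = (System.nanoTime - start) / 1e9
    (result, seconds)
  }

  def main(args: Array[String]): Unit = {
    val (total, seconds) = timed { (1 to 1000000).map(_.toLong).sum }
    println(s"sum = $total, computed in $seconds secs")
  }
}
```

In the notebook this might be used as `val (modelKmeans, secs) = Timing.timed { kmeans.fit(trainingData) }`. For lazily evaluated calls such as `transform`, the block would need to end in an action (for example `count()`) for the measured time to cover the actual computation.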


In [42]:
// Print the K-means cluster centers.
modelKmeans.clusterCenters.foreach(centroid => println(centroid.toString))

[0.9564501581438364,0.7670070472196541,0.8431864112015065,0.25356960956854985,0.7648245082516371,0.26343767867739337,0.26512053486516435,0.7856784425752985,0.45766099456095577,0.5094514371143524,0.28733182490808207,0.7111406415035201,0.43788899637813633,0.6435538582576761,0.34836762361569834,0.44932254732063615,0.7306151152351633,0.6492562691026942,0.5632157453869857,0.5805278670538507,0.5613378874843741,0.5080302041835568,0.6651849071593796,0.3832449345220203,0.5817309198103178,0.4267484385877942,0.41653436472823563,0.31316365940424834,0.0035642678050799305]
[0.9569450533394839,0.7698304008986779,0.8411419787351364,0.2701368635408002,0.7661454116190318,0.2638665402078969,0.2644086090727458,0.7846255836395395,0.5102360694912171,0.5001272139467199,0.34025897638354463,0.6181875908156352,0.5670861576072683,0.6843401355604942,0.29266938684380595,0.44778699232920227,0.7431149234118652,0.6643562149579135,0.554054459550227,0.5800332473325585,0.5604107487864989,0.5057333056144397,0.66536281405

In [43]:
// Print the Bisecting K-means cluster centers.
modelBisectingkmeans.clusterCenters.foreach(centroid => println(centroid.toString))

[0.9575854867599131,0.7675035459183824,0.8410462056376878,0.2519040093373025,0.7657193887745851,0.264456429329836,0.26504214702915024,0.7854869040242018,0.46604827916786856,0.5078811912413467,0.28687226352170453,0.6950910916485239,0.49801599798339413,0.6498568699659022,0.3361887429652741,0.44963331778691334,0.7324193822298208,0.6548286253978278,0.5643598002217111,0.5804740905369229,0.5611216342202008,0.5072761657105155,0.664989822634188,0.3066485035597651,0.5815909027872532,0.4288656648921024,0.4166154961329739,0.31312455627368496,0.0033574687413616892]
[0.9559192470121537,0.7671880689574241,0.8440110567352116,0.2570894798649599,0.7645535724869306,0.2629588707731522,0.2650497667659248,0.7856143666382256,0.46149253730778955,0.5088163270246019,0.29596743930889985,0.705021666113527,0.4260954755797099,0.6466355435024234,0.34607607979483657,0.44891236537502766,0.7316279523846115,0.6486588244479086,0.5611496449949587,0.5804783326357015,0.5613070030666977,0.508070852044658,0.6653178109298407,

In [44]:
// Print the weight, mean and covariance matrix of each Gaussian component.
for (i <- 0 until modelGMM.getK) {
  println(s"Gaussian $i:\nweight=${modelGMM.weights(i)}\n" +
      s"mu=${modelGMM.gaussians(i).mean}\nsigma=\n${modelGMM.gaussians(i).cov}\n")
}

Gaussian 0:
weight=0.7276277748979864
mu=[0.9613884460140616,0.7689213570761666,0.8466957146572097,0.2659721068087794,0.7649434140421982,0.26122096797075534,0.2658291403386824,0.7859144452487762,0.46782749938133134,0.5043834886088266,0.29462632167402686,0.7035277382525145,0.448237678315879,0.6504139516087736,0.34119711037354933,0.45342401432066615,0.7307110905731787,0.6398216035111878,0.567644848294621,0.5804300820847669,0.5614538180518653,0.511193645311341,0.6649753987009894,0.37566130819332716,0.5809384481053492,0.4272110295677354,0.41666740821876175,0.3133929744095242,0.002440721106950736]
sigma=
5.862911849182776E-4    -2.4963495957948885E-5  ... (29 total)
-2.4963495957948885E-5  1.283150221808733E-4    ...
-1.475308292525393E-4   2.0540162927171236E-5   ...
8.374748635325824E-5    4.539507606832936E-5    ...
-1.2326492837605438E-5  9.167962450530916E-6    ...
-1.0729461023994617E-5  -4.610594163520005E-6   ...
-1.1172386657141029E-5  8.761299631826291E-6    ...
-1.446925307162797

### 5)-Prediction

In [84]:
// Make predictions with K-means.
// transform() is lazy, so the time measured here covers building the query plan rather than scoring the rows;
// the messages printed below are carried over from the training cells.
val t = System.nanoTime
var predictionsk = modelKmeans.transform(validationData)
val durationkmeans = (System.nanoTime - t) / 1e9d
println("\ninitial model training finished.")
println(s"Training process takes $durationkmeans secs")
// Keep only the label, the cluster id (cast to Double for the evaluators) and the features.
predictionsk = predictionsk.select(col("label"),col("clusters").cast("Double"),col("features"))
predictionsk.groupBy("clusters").count().show()
predictionsk.show(10)

t: Long = 386545412230882
predictionsk: org.apache.spark.sql.DataFrame = [V1: double, V2: double ... 33 more fields]
durationkmeans: Double = 0.254539203

initial model training finished.
Training process takes 0.254539203 secs
predictionsk: org.apache.spark.sql.DataFrame = [label: double, clusters: double ... 1 more field]
+--------+-----+
|clusters|count|
+--------+-----+
|     0.0|85099|
|     1.0|  371|
+--------+-----+

+-----+--------+--------------------+
|label|clusters|            features|
+-----+--------+--------------------+
|  0.0|     0.0|[0.93393055100879...|
|  0.0|     0.0|[0.99373239366538...|
|  0.0|     0.0|[0.99412599523130...|
|  0.0|     0.0|[0.99168700342760...|
|  0.0|     0.0|[0.99303996117885...|
|  0.0|     0.0|[0.99358866187046...|
|  0.0|     0.0|[0.99399681197689...|
|  0.0|     0.0|[0.99387845478437...|
|  0.0|     0.0|[0.94168288462400...|
|  0.0|     0.0|[0.90792673663724...|
+-----+--------+--------------------+
only showing top 10 rows



In [85]:
predictionsk.select("clusters", "label").groupBy("clusters", "label").count().orderBy("label", "clusters").show()

+--------+-----+-----+
|clusters|label|count|
+--------+-----+-----+
|     0.0|  0.0|85010|
|     1.0|  0.0|  352|
|     0.0|  1.0|   89|
|     1.0|  1.0|   19|
+--------+-----+-----+
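The cross-tabulation above can be read as a confusion matrix if cluster 1 is taken to be the "fraud" cluster. The sketch below derives fraud-class precision and recall from the four K-means counts shown in the table; the `FraudMetrics` object and the `Confusion` case class are names assumed for this example.

```scala
object FraudMetrics {
  // tp: fraud rows in cluster 1, fp: normal rows in cluster 1,
  // fn: fraud rows in cluster 0, tn: normal rows in cluster 0.
  final case class Confusion(tp: Long, fp: Long, fn: Long, tn: Long) {
    def precision: Double = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
    def recall: Double = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
    def accuracy: Double = (tp + tn).toDouble / (tp + fp + fn + tn)
  }

  def main(args: Array[String]): Unit = {
    // Counts copied from the K-means table above.
    val kmeans = Confusion(tp = 19, fp = 352, fn = 89, tn = 85010)
    println(s"fraud precision = ${kmeans.precision}")
    println(s"fraud recall    = ${kmeans.recall}")
    println(s"accuracy        = ${kmeans.accuracy}")
  }
}
```

With these counts only 19 of the 108 fraud rows fall into cluster 1, so a high overall accuracy can coexist with low recall on the fraud class.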



In [86]:
// Make predictions with Bisecting K-means (the durationkmeans variable name is reused for the timer).
val t = System.nanoTime
var predictionsbk = modelBisectingkmeans.transform(validationData)
val durationkmeans = (System.nanoTime - t) / 1e9d
println("\ninitial model training finished.")
println(s"Training process takes $durationkmeans secs")
predictionsbk = predictionsbk.select(col("label"),col("clusters").cast("Double"),col("features"))
predictionsbk.groupBy("clusters").count().show()
predictionsbk.groupBy("label").count().show()

t: Long = 386548399977406
predictionsbk: org.apache.spark.sql.DataFrame = [V1: double, V2: double ... 33 more fields]
durationkmeans: Double = 0.257309219

initial model training finished.
Training process takes 0.257309219 secs
predictionsbk: org.apache.spark.sql.DataFrame = [label: double, clusters: double ... 1 more field]
+--------+-----+
|clusters|count|
+--------+-----+
|     0.0|32278|
|     1.0|53192|
+--------+-----+

+-----+-----+
|label|count|
+-----+-----+
|  0.0|85362|
|  1.0|  108|
+-----+-----+



In [87]:
predictionsbk.select("clusters", "label").groupBy("clusters", "label").count().orderBy("label", "clusters").show()

+--------+-----+-----+
|clusters|label|count|
+--------+-----+-----+
|     0.0|  0.0|32242|
|     1.0|  0.0|53120|
|     0.0|  1.0|   36|
|     1.0|  1.0|   72|
+--------+-----+-----+



In [88]:
// Make predictions with the Gaussian Mixture Model (the durationkmeans variable name is reused for the timer).
val t = System.nanoTime
var predictionsgmm = modelGMM.transform(validationData)
val durationkmeans = (System.nanoTime - t) / 1e9d
println("\ninitial model training finished.")
println(s"Training process takes $durationkmeans secs")
predictionsgmm = predictionsgmm.select(col("label"),col("clusters").cast("Double"),col("features"))
predictionsgmm.groupBy("clusters").count().show()
predictionsgmm.groupBy("label").count().show()

t: Long = 386551384307298
predictionsgmm: org.apache.spark.sql.DataFrame = [V1: double, V2: double ... 34 more fields]
durationkmeans: Double = 0.278146472

initial model training finished.
Training process takes 0.278146472 secs
predictionsgmm: org.apache.spark.sql.DataFrame = [label: double, clusters: double ... 1 more field]
+--------+-----+
|clusters|count|
+--------+-----+
|     0.0|61288|
|     1.0|24182|
+--------+-----+

+-----+-----+
|label|count|
+-----+-----+
|  0.0|85362|
|  1.0|  108|
+-----+-----+



In [89]:
predictionsgmm.select("clusters", "label").groupBy("clusters", "label").count().orderBy("label", "clusters").show()

+--------+-----+-----+
|clusters|label|count|
+--------+-----+-----+
|     0.0|  0.0|61276|
|     1.0|  0.0|24086|
|     0.0|  1.0|   12|
|     1.0|  1.0|   96|
+--------+-----+-----+
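Cluster ids are arbitrary, so before comparing clusters with labels it helps to check which cluster is actually enriched in fraud. The sketch below picks the cluster with the highest fraud rate from a (cluster, label) count table like the ones above. The `ClusterLabeling` object is a hypothetical helper, and because it uses the true labels it is meant for evaluation only, not for training.

```scala
object ClusterLabeling {
  // counts maps (cluster, label) to a row count, with label 0 = normal and 1 = fraud.
  def fraudRate(counts: Map[(Int, Int), Long], cluster: Int): Double = {
    val fraud = counts.getOrElse((cluster, 1), 0L)
    val normal = counts.getOrElse((cluster, 0), 0L)
    if (fraud + normal == 0) 0.0 else fraud.toDouble / (fraud + normal)
  }

  // Cluster whose members have the highest share of fraud rows.
  def fraudCluster(counts: Map[(Int, Int), Long]): Int = {
    val clusters = counts.keys.map(_._1).toSeq.distinct
    require(clusters.nonEmpty, "count table must not be empty")
    clusters.maxBy(c => fraudRate(counts, c))
  }

  def main(args: Array[String]): Unit = {
    // Counts copied from the Gaussian Mixture table above.
    val gmm = Map((0, 0) -> 61276L, (1, 0) -> 24086L, (0, 1) -> 12L, (1, 1) -> 96L)
    val chosen = fraudCluster(gmm)
    val rate = fraudRate(gmm, chosen)
    println(s"fraud-enriched cluster = $chosen, fraud rate = $rate")
  }
}
```

For the three tables in this section the fraud-enriched cluster is cluster 1 in each case, which is what the evaluation below implicitly relies on when it compares cluster ids with labels directly.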



### 6)-Evaluation

 ##### Calculation of Silhouette Score

In [53]:
// Evaluate clustering by computing Silhouette score
val evaluatorsi = new ClusteringEvaluator().setPredictionCol("clusters").setMetricName("silhouette")

evaluatorsi: org.apache.spark.ml.evaluation.ClusteringEvaluator = cluEval_4eed8dfb263c


In [54]:
val ScoreKmeans = evaluatorsi.evaluate(predictionsk)
println(s"Silhouette of Kmeans predictions with squared euclidean distance = $ScoreKmeans")

ScoreKmeans: Double = 0.6394809448118279
Silhouette of Kmeans predictions with squared euclidean distance = 0.6394809448118279


In [55]:
val ScoreBisectingKmeans  = evaluatorsi.evaluate(predictionsbk)
println(s"Silhouette of Bisecting Kmeans predictions with squared euclidean distance = $ScoreBisectingKmeans")

ScoreBisectingKmeans: Double = 0.17617461529914036
Silhouette of Bisecting Kmeans predictions with squared euclidean distance = 0.17617461529914036


In [56]:
val ScoreGMM  = evaluatorsi.evaluate(predictionsgmm)
println(s"Silhouette of Gaussian Mixture predictions with squared euclidean distance = $ScoreGMM")

ScoreGMM: Double = 0.12012752377918906
Silhouette of Gaussian Mixture predictions with squared euclidean distance = 0.12012752377918906
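For readers unfamiliar with the metric, the following is a brute-force sketch of the mean silhouette with squared Euclidean distance on a tiny in-memory dataset. It is quadratic in the number of points and is not how Spark computes the score at scale; the `SilhouetteSketch` object and the toy points are assumptions for illustration.

```scala
object SilhouetteSketch {
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Mean silhouette over all points; values near 1 indicate compact, well separated clusters.
  def silhouette(points: IndexedSeq[Array[Double]], clusters: IndexedSeq[Int]): Double = {
    require(points.length == clusters.length, "one cluster id per point")
    require(clusters.distinct.length >= 2, "at least two clusters are required")
    val byCluster = points.indices.groupBy(i => clusters(i))
    val scores = points.indices.map { i =>
      val own = byCluster(clusters(i)).filter(_ != i)
      if (own.isEmpty) 0.0 // a singleton cluster gets a silhouette of 0 by convention
      else {
        // a: mean distance to the other members of the point's own cluster
        val a = own.map(j => sqDist(points(i), points(j))).sum / own.length
        // b: smallest mean distance to the members of any other cluster
        val b = byCluster.collect {
          case (c, members) if c != clusters(i) =>
            members.map(j => sqDist(points(i), points(j))).sum / members.length
        }.min
        val denom = math.max(a, b)
        if (denom == 0.0) 0.0 else (b - a) / denom
      }
    }
    scores.sum / scores.length
  }

  def main(args: Array[String]): Unit = {
    val points = IndexedSeq(Array(0.0), Array(1.0), Array(10.0), Array(11.0))
    println(silhouette(points, IndexedSeq(0, 0, 1, 1))) // well separated grouping
    println(silhouette(points, IndexedSeq(0, 1, 0, 1))) // poor grouping
  }
}
```

Note that the silhouette measures only geometric separation; it says nothing about whether the clusters line up with the fraud label, which is why K-means can have the highest silhouette here while isolating few fraud rows.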


In [57]:
// Evaluate clustering with the Within Set Sum of Squared Errors (WSSSE) on the training data.
// computeCost exists for K-means and Bisecting K-means only; the Gaussian Mixture Model has no such method.
val costKmeans = modelKmeans.computeCost(trainingData)
println(s"Kmeans: Within Set Sum of Squared Errors = $costKmeans")
val costBisectingKmeans = modelBisectingkmeans.computeCost(trainingData)
println(s"Bisecting Kmeans: Within Set Sum of Squared Errors = $costBisectingKmeans")

costKmeans: Double = 9379.74165072235
Kmeans: Within Set Sum of Squared Errors = 9379.74165072235
costBisectingKmeans: Double = 9180.032693362762
Bisecting Kmeans: Within Set Sum of Squared Errors = 9180.032693362762
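The quantity reported above is the sum, over all points, of the squared distance to the nearest cluster center. A minimal plain-Scala sketch of that definition follows; the `Wssse` object and the toy points are assumptions for illustration and do not reproduce the notebook's figures.

```scala
object Wssse {
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Sum over all points of the squared distance to the closest center.
  def cost(points: Seq[Array[Double]], centers: Seq[Array[Double]]): Double = {
    require(centers.nonEmpty, "at least one center is required")
    points.map(p => centers.map(c => sqDist(p, c)).min).sum
  }

  def main(args: Array[String]): Unit = {
    val points = Seq(Array(0.0, 0.0), Array(2.0, 0.0), Array(10.0, 0.0))
    val centers = Seq(Array(1.0, 0.0), Array(10.0, 0.0))
    println(cost(points, centers))
  }
}
```

A lower value means tighter clusters for a fixed k, but the measure always decreases as k grows, so it is mainly useful for comparing models with the same number of clusters, as is done here.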


In [58]:
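// Treat the cluster id as the predicted class (cluster 1 = fraud) and score it against the true label.
// The precision, recall and F1 printed below are support-weighted averages over both classes, so the majority class dominates them.
val evaluatork1 = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("clusters").setMetricName("accuracy")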
val evaluatork2 = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("clusters").setMetricName("weightedPrecision")
val evaluatork3 = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("clusters").setMetricName("weightedRecall")
val evaluatork4 = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("clusters").setMetricName("f1")
val areaUnderROCk = new BinaryClassificationEvaluator().setRawPredictionCol("clusters").setLabelCol("label").setMetricName("areaUnderROC")
val accuracyk = evaluatork1.evaluate(predictionsk)
println("Area Under ROC Curve = " + areaUnderROCk.evaluate(predictionsk))
println("Accuracy = " + evaluatork1.evaluate(predictionsk))
println("Precision = " + evaluatork2.evaluate(predictionsk))
println("Recall = " + evaluatork3.evaluate(predictionsk))
println("F1 = " + evaluatork4.evaluate(predictionsk))
println("Test Error = " + (1.0 - accuracyk))

evaluatork1: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = mcEval_08126be7cd4d
evaluatork2: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = mcEval_f720d9e02002
evaluatork3: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = mcEval_a8f70fee349e
evaluatork4: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = mcEval_9e718753f1c9
areaUnderROCk: org.apache.spark.ml.evaluation.BinaryClassificationEvaluator = binEval_49486d50c3e2
accuracyk: Double = 0.9948402948402948
Area Under ROC Curve = 0.5859011556013736
Accuracy = 0.9948402948402948
Precision = 0.9977565922685557
Recall = 0.9948402948402948
F1 = 0.9962528097083098
Test Error = 0.005159705159705208
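The "Precision" and "Recall" values above come from the `weightedPrecision` and `weightedRecall` metrics, which average the per-class scores weighted by class frequency. The sketch below recomputes them from the K-means confusion counts shown in the prediction section. The `WeightedMetrics` object is a hypothetical helper written for this example and is not Spark's implementation.

```scala
object WeightedMetrics {
  // confusion(i)(j) = number of rows with true label i that were assigned to cluster j.

  // Per-class precision, weighted by the share of rows that truly belong to each class.
  def weightedPrecision(confusion: Array[Array[Long]]): Double = {
    val total = confusion.map(_.sum).sum.toDouble
    confusion.indices.map { k =>
      val support = confusion(k).sum          // rows whose true label is k
      val predicted = confusion.map(_(k)).sum // rows predicted as k
      val precision = if (predicted == 0) 0.0 else confusion(k)(k).toDouble / predicted
      support / total * precision
    }.sum
  }

  // Support-weighted recall reduces to the overall accuracy.
  def weightedRecall(confusion: Array[Array[Long]]): Double = {
    val total = confusion.map(_.sum).sum.toDouble
    confusion.indices.map(k => confusion(k)(k)).sum / total
  }

  def main(args: Array[String]): Unit = {
    // K-means counts: true normal -> (85010, 352), true fraud -> (89, 19).
    val kmeans = Array(Array(85010L, 352L), Array(89L, 19L))
    println(s"weighted precision = ${weightedPrecision(kmeans)}")
    println(s"weighted recall    = ${weightedRecall(kmeans)}")
  }
}
```

Since fraud makes up 108 of the 85,470 validation rows, its weight is about 0.13%, so these weighted scores say little about how well fraud is isolated.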


In [59]:
// Same evaluation for Bisecting K-means, with the cluster id used as the predicted class.
val evaluatorbk1 = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("clusters").setMetricName("accuracy")
val evaluatorbk2 = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("clusters").setMetricName("weightedPrecision")
val evaluatorbk3 = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("clusters").setMetricName("weightedRecall")
val evaluatorbk4 = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("clusters").setMetricName("f1")
val areaUnderROCbk = new BinaryClassificationEvaluator().setRawPredictionCol("clusters").setLabelCol("label").setMetricName("areaUnderROC")
val accuracybk = evaluatorbk1.evaluate(predictionsbk)
println("Area Under ROC Curve = " + areaUnderROCbk.evaluate(predictionsbk))
println("Accuracy = " + evaluatorbk1.evaluate(predictionsbk))
println("Precision = " + evaluatorbk2.evaluate(predictionsbk))
println("Recall = " + evaluatorbk3.evaluate(predictionsbk))
println("F1 = " + evaluatorbk4.evaluate(predictionsbk))
println("Test Error = " + (1.0 - accuracybk))

evaluatorbk1: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = mcEval_c017e2f0673e
evaluatorbk2: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = mcEval_e5f35a808b6c
evaluatorbk3: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = mcEval_b7a1e9d0c97f
evaluatorbk4: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = mcEval_c2cb095e7c45
areaUnderROCbk: org.apache.spark.ml.evaluation.BinaryClassificationEvaluator = binEval_a64525af43ae
accuracybk: Double = 0.3780741780741781
Area Under ROC Curve = 0.5221878587661957
Accuracy = 0.3780741780741781
Precision = 0.9976242077007433
Recall = 0.3780741780741781
F1 = 0.5474576635690576
Test Error = 0.6219258219258219


In [60]:
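// Same evaluation for the Gaussian Mixture Model, with the cluster id used as the predicted class.
val evaluatorgmm1 = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("clusters").setMetricName("accuracy")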
val evaluatorgmm2 = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("clusters").setMetricName("weightedPrecision")
val evaluatorgmm3 = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("clusters").setMetricName("weightedRecall")
val evaluatorgmm4 = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("clusters").setMetricName("f1")
val areaUnderROCgmm = new BinaryClassificationEvaluator().setRawPredictionCol("clusters").setLabelCol("label").setMetricName("areaUnderROC")
val accuracygmm = evaluatorgmm1.evaluate(predictionsgmm)
println("Area Under ROC Curve = " + areaUnderROCgmm.evaluate(predictionsgmm))
println("Accuracy = " + evaluatorgmm1.evaluate(predictionsgmm))
println("Precision = " + evaluatorgmm2.evaluate(predictionsgmm))
println("Recall = " + evaluatorgmm3.evaluate(predictionsgmm))
println("F1 = " + evaluatorgmm4.evaluate(predictionsgmm))
println("Test Error = " + (1.0 - accuracygmm))

evaluatorgmm1: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = mcEval_0799d5ec3644
evaluatorgmm2: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = mcEval_1706c2a24e8d
evaluatorgmm3: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = mcEval_1514125131f6
evaluatorgmm4: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = mcEval_792a4b9b0f07
areaUnderROCgmm: org.apache.spark.ml.evaluation.BinaryClassificationEvaluator = binEval_1582e0506993
accuracygmm: Double = 0.718053118053118
Area Under ROC Curve = 0.8033629327647744
Accuracy = 0.718053118053118
Precision = 0.9985458656165365
Recall = 0.718053118053118
F1 = 0.8346308073357539
Test Error = 0.281946881946882
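Because the "raw prediction" passed to the ROC evaluator is a hard 0/1 cluster id, the ROC curve has a single interior point, and the area under it equals the mean of the true positive rate and the true negative rate. The sketch below applies that formula to the confusion counts from the prediction section; the `HardLabelAuc` object is a hypothetical helper, and the results should land close to the areas reported above.

```scala
object HardLabelAuc {
  // Area under the ROC polyline (0,0) -> (FPR, TPR) -> (1,1), which equals (TPR + TNR) / 2.
  def auc(tp: Long, fp: Long, fn: Long, tn: Long): Double = {
    require(tp + fn > 0 && tn + fp > 0, "both classes must be present")
    val tpr = tp.toDouble / (tp + fn)
    val tnr = tn.toDouble / (tn + fp)
    (tpr + tnr) / 2.0
  }

  def main(args: Array[String]): Unit = {
    // Counts copied from the (cluster, label) tables, with cluster 1 read as "fraud".
    val kmeans = auc(tp = 19, fp = 352, fn = 89, tn = 85010)
    val bisecting = auc(tp = 72, fp = 53120, fn = 36, tn = 32242)
    val gmm = auc(tp = 96, fp = 24086, fn = 12, tn = 61276)
    println(s"K-means AUC           = $kmeans")
    println(s"Bisecting K-means AUC = $bisecting")
    println(s"Gaussian Mixture AUC  = $gmm")
  }
}
```

Read this way, the Gaussian Mixture Model separates fraud best of the three (96 of 108 fraud rows in cluster 1), even though it has the lowest accuracy, because accuracy is driven almost entirely by the majority class.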
