# Chapter 9: Spark MLlib and ML

In this notebook, we will see the main capabilities of Spark MLlib and ML.

## Working with MLlib

In this section, we will focus on MLlib, the RDD-based API.

In [1]:
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector}
import org.apache.spark.mllib.feature.{HashingTF, Word2Vec, IDF, StandardScaler, ChiSqSelector}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionModel}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD
import java.util.Random

### MLlib Feature Encoding and Data Preparation

#### Working with Spark Vectors

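We can create dense vectors, sparse vectors, and labeled points. A sparse vector stores its size plus the indices and values of its non-zero entries; a labeled point pairs a label with a feature vector.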

In [2]:
val denseVector = new DenseVector(Array(1,2,3))

denseVector = [1.0,2.0,3.0]


[1.0,2.0,3.0]

In [3]:
val sparseVector = new SparseVector(4, Array(0, 2), Array(1.5, 3.0))

sparseVector = (4,[0,2],[1.5,3.0])


(4,[0,2],[1.5,3.0])

In [4]:
val labeledPoint = new LabeledPoint(1, denseVector)

labeledPoint = (1.0,[1.0,2.0,3.0])


(1.0,[1.0,2.0,3.0])
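
To make the sparse layout concrete, the following plain-Scala sketch expands a (size, indices, values) triple into a dense array. It does not use Spark, and `sparseToDense` is a hypothetical helper name introduced only for illustration.

```scala
// Hypothetical helper (not part of Spark): expand a sparse representation
// (size, indices of the non-zero entries, their values) into a dense array.
def sparseToDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
  require(indices.length == values.length, "indices and values must have the same length")
  val dense = Array.fill(size)(0.0)
  indices.zip(values).foreach { case (i, v) => dense(i) = v }
  dense
}

// Same numbers as the sparse vector created above
val expanded = sparseToDense(4, Array(0, 2), Array(1.5, 3.0))
println(expanded.mkString("[", ",", "]"))  // prints [1.5,0.0,3.0,0.0]
```

The sparse form pays off when most entries are zero, as with the hashed text vectors used later in this notebook.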

#### Preparing Textual Data

We can also prepare text data using the built-in data transformations included in MLlib. We first load some text data about spam and non-spam emails.

In [5]:
val iniData = spark.read.option("header", "true").csv("../data/spam.csv")
val iniDataRdd = iniData.select("label", "text").rdd.filter(row => row(0).isInstanceOf[String] && row(1).isInstanceOf[String])

iniData = [label: string, text: string ... 3 more fields]
iniDataRdd = MapPartitionsRDD[13] at filter at <console>:35


MapPartitionsRDD[13] at filter at <console>:35

In [6]:
iniDataRdd.count()

5573

In [7]:
iniDataRdd.take(1)

0,1
ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."


In [8]:
val textRdd = iniDataRdd.map(_(1))

textRdd = MapPartitionsRDD[14] at map at <console>:37


MapPartitionsRDD[14] at map at <console>:37

Now we use the `HashingTF` transformer.

In [9]:
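/**
 * Transforms an input RDD of text using the HashingTF transformer.
 *
 * @param textRdd input RDD with one document per element
 * @return RDD of term-frequency vectors
 */
def hashingTF(textRdd: RDD[String]) = {
    // HashingTF maps each token to a vector index by hashing;
    // the default vector size is 2^20 = 1048576 features
    val hashingTf = new HashingTF()
    // Split each document on single spaces to obtain its tokens
    val textTokenized = textRdd.map(_.split(" ").toSeq)
    hashingTf.transform(textTokenized)
}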

hashingTF: (textRdd: org.apache.spark.rdd.RDD[String])org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]


In [10]:
textRdd

MapPartitionsRDD[14] at map at <console>:37

In [11]:
val hashText = hashingTF(textRdd.map(_.toString))

hashText = MapPartitionsRDD[17] at map at HashingTF.scala:120


MapPartitionsRDD[17] at map at HashingTF.scala:120

In [12]:
hashText.take(1)

[(1048576,[17222,138356,181635,201474,293607,318062,362887,416458,443870,527456,550330,589798,665328,704823,708469,755959,790513,846161,907199,1008885],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])]
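
The output above is a sparse vector of size 1048576 in which each token of the message has been hashed to an index. The idea can be sketched in plain Scala as follows. This is only an illustration under stated assumptions: it uses `String.hashCode`, whereas Spark may use a different hash function, so the indices will not match Spark's; `nonNegativeMod` and `hashedTermFrequencies` are hypothetical names.

```scala
// Map any Int (possibly negative) into the range [0, mod)
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

// Hashing trick: the index of a token is its hash modulo the vector size,
// and the value stored at that index is the number of tokens that landed there.
def hashedTermFrequencies(tokens: Seq[String], numFeatures: Int): Map[Int, Double] =
  tokens
    .groupBy(token => nonNegativeMod(token.hashCode, numFeatures))
    .map { case (index, hits) => index -> hits.size.toDouble }

// Toy document, deliberately tiny vector size (16) so collisions are possible
val counts = hashedTermFrequencies("free entry free prize".split(" ").toSeq, 16)
println(counts)
```

No vocabulary has to be built or stored, at the price of possible collisions: two different tokens can share an index, which becomes more likely as `numFeatures` shrinks.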

In [13]:
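/**
 * Transforms an input RDD of text using the HashingTF transformer,
 * keeping the original text alongside each vector.
 *
 * @param textRdd input RDD with one document per element
 * @return RDD of (original text, term-frequency vector) pairs
 */
def hashingTFWithText(textRdd: RDD[String]) = {
    val hashingTf = new HashingTF()
    // Hash each document on its own so the text and its vector stay paired
    textRdd.map(text => (text, hashingTf.transform(text.split(" ").toSeq)))
}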

hashingTFWithText: (textRdd: org.apache.spark.rdd.RDD[String])org.apache.spark.rdd.RDD[(String, org.apache.spark.mllib.linalg.Vector)]


In [14]:
val hashTextPreserving = hashingTFWithText(textRdd.map(_.toString))

hashTextPreserving = MapPartitionsRDD[19] at map at <console>:51


MapPartitionsRDD[19] at map at <console>:51

In [15]:
hashTextPreserving.take(1)

[(Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...,(1048576,[17222,138356,181635,201474,293607,318062,362887,416458,443870,527456,550330,589798,665328,704823,708469,755959,790513,846161,907199,1008885],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))]

We can now use the `Word2Vec` transformer.

In [16]:
val textTokenized = textRdd.map(_.toString.split(" ").toSeq)
val word2vec_trfomer = new Word2Vec().fit(textTokenized)

textTokenized = MapPartitionsRDD[20] at map at <console>:39
word2vec_trfomer = org.apache.spark.mllib.feature.Word2VecModel@4237d19f


org.apache.spark.mllib.feature.Word2VecModel@4237d19f

In [17]:
word2vec_trfomer.transform("great")

[-0.02146405540406704,-0.003827216336503625,0.0032874231692403555,0.08191434293985367,-0.05732027068734169,0.009234832599759102,-0.016608454287052155,-0.01469072699546814,0.05036984384059906,-0.027246715500950813,0.09713512659072876,-0.08891306817531586,-0.020493406802415848,0.020492125302553177,-0.01188842672854662,-0.006000001449137926,0.011214228346943855,0.039925310760736465,0.02338774874806404,0.014434332959353924,-0.008040934801101685,0.042061470448970795,-0.03339044377207756,0.024230990558862686,0.013198750093579292,0.057594530284404755,0.09731076657772064,0.010434312745928764,0.019490517675876617,-0.03057178296148777,-0.03533543646335602,0.0036919452250003815,-0.007387078367173672,-0.02520918846130371,0.005467329639941454,0.006146532483398914,0.005227089859545231,0.028425566852092743,0.036847710609436035,-0.04392845556139946,0.03440820053219795,-0.10053680092096329,0.017500048503279686,-0.018390337005257607,-0.015096963383257389,-0.027896001935005188,-0.03566542640328407,0.0068

In [18]:
word2vec_trfomer.transform("Free")

[-0.018583469092845917,-0.02790551632642746,-0.01638362556695938,0.024623297154903412,0.01820003241300583,0.022796304896473885,-0.004462585784494877,0.0018990199314430356,-0.002102164551615715,-0.01185684185475111,0.034062523394823074,-0.017565157264471054,0.013573108240962029,0.008274449966847897,0.02155362069606781,0.0100114606320858,0.029035363346338272,0.01414216123521328,0.021385928615927696,-0.032893091440200806,0.0012255565961822867,0.014664345420897007,-0.010014363564550877,-4.0607567643746734E-4,-0.0035053715109825134,0.017249858006834984,0.018836403265595436,0.01989627629518509,-0.001062321476638317,-0.013529905118048191,-0.021062549203634262,-0.007213293574750423,0.007991934195160866,-0.002972015179693699,-0.01547208707779646,0.011107261292636395,-0.003473537275567651,0.036090198904275894,0.005059196148067713,-0.02550704963505268,5.28823584318161E-4,-0.033462341874837875,0.009227710776031017,0.02203906700015068,0.0155676594004035,-0.00961245410144329,-0.00411571841686964,0.0
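
Word vectors such as the two above are usually compared with cosine similarity. A minimal plain-Scala sketch follows; `cosineSimilarity` is a hypothetical helper, and the short vectors are made-up toy values, not model output.

```scala
// Cosine similarity: dot product divided by the product of the Euclidean norms
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}

// Toy vectors (assumed values for illustration only)
val sameDirection = cosineSimilarity(Array(1.0, 2.0), Array(2.0, 4.0))  // close to 1.0
val orthogonal = cosineSimilarity(Array(1.0, 0.0), Array(0.0, 3.0))     // 0.0
```

Values near 1 indicate words used in similar contexts; values near 0 indicate unrelated words.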

#### Preparing Data for Supervised Learning

In [19]:
val tf = new HashingTF(100)
val tfVectors = tf.transform(textRdd.map(_.toString.split(" ").toSeq))
val idf = new IDF()
val idfModel = idf.fit(tfVectors)

tf = org.apache.spark.mllib.feature.HashingTF@40c798d6
tfVectors = MapPartitionsRDD[34] at map at HashingTF.scala:120
idf = org.apache.spark.mllib.feature.IDF@382f0e89
idfModel = org.apache.spark.mllib.feature.IDFModel@eb5f024


org.apache.spark.mllib.feature.IDFModel@eb5f024
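
The fitted `IDFModel` holds one weight per feature. MLlib documents the weight of a term t as log((m + 1) / (d(t) + 1)), where m is the number of documents and d(t) is the number of documents containing t. The plain-Scala sketch below applies that formula to raw tokens rather than hashed indices; `inverseDocumentFrequencies` and the toy documents are assumptions made for illustration.

```scala
// IDF per term: log((m + 1) / (d(t) + 1)),
// with m = number of documents and d(t) = number of documents containing t
def inverseDocumentFrequencies(docs: Seq[Seq[String]]): Map[String, Double] = {
  val m = docs.size
  val docFreq = docs
    .flatMap(_.distinct)          // count each term at most once per document
    .groupBy(identity)
    .map { case (term, occurrences) => term -> occurrences.size }
  docFreq.map { case (term, df) => term -> math.log((m + 1.0) / (df + 1.0)) }
}

// Toy corpus (assumed): "free" appears in every document, the other terms in one
val exampleIdf = inverseDocumentFrequencies(Seq(Seq("free", "prize"), Seq("free", "lunch")))
println(exampleIdf)
```

A term that appears in every document gets weight 0, so it contributes nothing after the TF-IDF product, while rarer terms are weighted up.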

In [20]:
val spamText = iniDataRdd.filter(_(0) == "spam").map(_(1).toString.split(" ").toSeq)
val genText = iniDataRdd.filter(_(0) != "spam").map(_(1).toString.split(" ").toSeq)

spamText = MapPartitionsRDD[37] at map at <console>:37
genText = MapPartitionsRDD[39] at map at <console>:38


MapPartitionsRDD[39] at map at <console>:38

In [21]:
val spamPoints = idfModel.transform(tf.transform(spamText)).map(x => LabeledPoint(1, x))
val genPoints = idfModel.transform(tf.transform(genText)).map(x => LabeledPoint(0, x))

spamPoints = MapPartitionsRDD[42] at map at <console>:47
genPoints = MapPartitionsRDD[45] at map at <console>:48


MapPartitionsRDD[45] at map at <console>:48

In [22]:
val mlDataIni = spamPoints.union(genPoints)

mlDataIni = UnionRDD[46] at union at <console>:50


UnionRDD[46] at union at <console>:50

In [23]:
val mlData = mlDataIni.map(row => (new Random().nextInt(100), row)).sortByKey().map(_._2)

mlData = MapPartitionsRDD[51] at map at <console>:52


MapPartitionsRDD[51] at map at <console>:52

In [24]:
mlData.take(1)

[(1.0,(100,[1,3,18,24,29,34,54,56,61,63,70,73,77,79,86,87,88,89,96],[2.0452290707569443,2.328758888107958,2.067670405229625,2.3342990684835736,1.7012558119933334,1.8036708174214031,2.343601461145887,1.5078520035765604,4.428099880663993,1.6200791887883912,1.1827898336933784,1.7778629334655305,1.6904977620267831,2.1961487300027565,2.379761442560331,2.0719348040160823,2.094422714894262,1.9202291131818907,3.953767316034235]))]

In [25]:
val mlDataSplit = mlData.randomSplit(Array(0.8, 0.2))

mlDataSplit = Array(MapPartitionsRDD[52] at randomSplit at <console>:54, MapPartitionsRDD[53] at randomSplit at <console>:54)


[MapPartitionsRDD[52] at randomSplit at <console>:54, MapPartitionsRDD[53] at randomSplit at <console>:54]

In [26]:
val mlDataTrain = mlDataSplit(0)
val mlDataTest = mlDataSplit(1)

mlDataTrain = MapPartitionsRDD[52] at randomSplit at <console>:54
mlDataTest = MapPartitionsRDD[53] at randomSplit at <console>:54


MapPartitionsRDD[53] at randomSplit at <console>:54

In [27]:
mlDataTrain.cache()
mlDataTest.cache()

MapPartitionsRDD[53] at randomSplit at <console>:54

In [28]:
spamPoints.filter(_.label == 1.0).count()

747

#### Feature Scaling and Selection

Some ML algorithms benefit from features that are on a comparable scale, so it is often useful to scale the data before training.

`StandardScaler` --> to scale numerical data

In [29]:
import org.apache.spark.mllib.feature.StandardScaler
val stdScaler = new StandardScaler()
val stdScalerModel = stdScaler.fit(mlData.map(lpoint => lpoint.features))

stdScaler = org.apache.spark.mllib.feature.StandardScaler@398cc281
stdScalerModel = org.apache.spark.mllib.feature.StandardScalerModel@1480ccf1


org.apache.spark.mllib.feature.StandardScalerModel@1480ccf1

In [30]:
val trainLabel = mlDataTrain.map(_.label)
val testLabel = mlDataTest.map(_.label)

trainLabel = MapPartitionsRDD[57] at map at <console>:60
testLabel = MapPartitionsRDD[58] at map at <console>:61


MapPartitionsRDD[58] at map at <console>:61

In [31]:
val mlDataTrainScl = trainLabel.zip(stdScalerModel.transform(mlDataTrain.map(_.features))).map(x => LabeledPoint(x._1, x._2))
val mlDataTestScl = testLabel.zip(stdScalerModel.transform(mlDataTest.map(_.features))).map(x => LabeledPoint(x._1, x._2))

mlDataTrainScl = MapPartitionsRDD[62] at map at <console>:66
mlDataTestScl = MapPartitionsRDD[66] at map at <console>:67


MapPartitionsRDD[66] at map at <console>:67

In [32]:
mlDataTrainScl.take(1)

[(1.0,(100,[1,3,18,24,29,34,54,56,61,63,70,73,77,79,86,87,88,89,96],[2.6092498023853943,2.829417369543754,2.536769897065245,2.9425131453745976,1.8375300213157324,1.895573517843014,2.9834200478287576,1.2772359910016184,5.464175460093429,1.943072888001566,1.3637078242000387,2.0594698576945327,1.8444424730496776,2.732407611455942,3.0927621293082224,2.422815085250892,2.4971212684692707,2.1033897608184167,4.60902425209452]))]

In [33]:
mlDataTestScl.take(1)

[(1.0,(100,[7,10,12,14,27,28,32,45,46,47,48,50,70,73,81,82,88,89,90,94],[1.816142313980476,1.4540000671947833,2.9576417198594047,3.2518786697281294,6.389233955248648,2.649568246784961,2.9649557147580787,4.1772288304916545,2.302293243443268,2.9624935495232645,2.665005310919519,2.1102205546562716,2.7274156484000773,4.1189397153890654,1.7994691052982292,2.322198347033588,2.4971212684692707,2.1033897608184167,3.3365463958671873,2.427410364985236]))]
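
With its default constructor, MLlib's `StandardScaler` divides each feature by its standard deviation and does not subtract the mean (`withMean = false`, `withStd = true`), which is why the vectors above stay sparse. The plain-Scala sketch below shows that computation on dense rows. It assumes the sample (n - 1) standard deviation and maps constant columns to 0.0; `columnStd` and `scaleToUnitStd` are hypothetical names.

```scala
// Per-column sample standard deviation (divides by n - 1)
def columnStd(rows: Seq[Array[Double]]): Array[Double] = {
  val n = rows.size
  val dims = rows.head.length
  (0 until dims).map { j =>
    val column = rows.map(_(j))
    val mean = column.sum / n
    math.sqrt(column.map(x => (x - mean) * (x - mean)).sum / (n - 1))
  }.toArray
}

// Scale every column to unit standard deviation, without centering;
// a column with zero standard deviation is mapped to 0.0
def scaleToUnitStd(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
  val std = columnStd(rows)
  rows.map(_.zip(std).map { case (x, s) => if (s == 0.0) 0.0 else x / s })
}

// Toy data (assumed): the second column is constant
val scaledRows = scaleToUnitStd(Seq(Array(1.0, 5.0), Array(3.0, 5.0)))
```

Because zeros stay zeros when nothing is subtracted, this kind of scaling keeps sparse text features sparse.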

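`ChiSqSelector` --> to select the most relevant features. Note that the vectors here have only 100 features (from `HashingTF(100)`), so `ChiSqSelector(100)` keeps all of them; a smaller value is needed to actually reduce the dimensionality.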

In [34]:
val selector = new ChiSqSelector(100)
val selectorModel = selector.fit(mlData)

selector = org.apache.spark.mllib.feature.ChiSqSelector@7dccac1f
selectorModel = org.apache.spark.mllib.feature.ChiSqSelectorModel@225555c0


org.apache.spark.mllib.feature.ChiSqSelectorModel@225555c0

In [35]:
val mlDataTrainSel = trainLabel.zip(selectorModel.transform(mlDataTrain.map(_.features))).map(x => LabeledPoint(x._1, x._2))
val mlDataTestSel = testLabel.zip(selectorModel.transform(mlDataTest.map(_.features))).map(x => LabeledPoint(x._1, x._2))

mlDataTrainSel = MapPartitionsRDD[74] at map at <console>:66
mlDataTestSel = MapPartitionsRDD[78] at map at <console>:67


MapPartitionsRDD[78] at map at <console>:67

In [36]:
mlDataTrainSel.take(1)

[(1.0,(100,[1,3,18,24,29,34,54,56,61,63,70,73,77,79,86,87,88,89,96],[2.0452290707569443,2.328758888107958,2.067670405229625,2.3342990684835736,1.7012558119933334,1.8036708174214031,2.343601461145887,1.5078520035765604,4.428099880663993,1.6200791887883912,1.1827898336933784,1.7778629334655305,1.6904977620267831,2.1961487300027565,2.379761442560331,2.0719348040160823,2.094422714894262,1.9202291131818907,3.953767316034235]))]

In [37]:
mlDataTestSel.take(1)

[(1.0,(100,[7,10,12,14,27,28,32,45,46,47,48,50,70,73,81,82,88,89,90,94],[1.7081625982065882,1.3335310318680162,2.2943663581482023,2.5099760825588597,4.830536262034482,2.0341944760332353,2.315949929815377,3.384890364622357,1.8654535169584656,2.2872741298387105,2.207503272105682,1.834646745315708,2.365579667386757,3.555725866931061,1.5870846666533522,1.9742963344521665,2.094422714894262,1.9202291131818907,2.5255492560218293,2.452082104139957]))]
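
`ChiSqSelector` ranks features with a chi-squared test of independence between each feature and the label. The statistic itself can be sketched in plain Scala on a contingency table of observed counts; `chiSquared` is a hypothetical helper and the tables are made-up numbers.

```scala
// Chi-squared statistic for a contingency table of observed counts:
// sum over cells of (observed - expected)^2 / expected,
// where expected = rowSum * colSum / total under independence.
def chiSquared(observed: Array[Array[Double]]): Double = {
  val rowSums = observed.map(_.sum)
  val colSums = observed.transpose.map(_.sum)
  val total = rowSums.sum
  val terms = for {
    i <- observed.indices
    j <- observed(i).indices
  } yield {
    val expected = rowSums(i) * colSums(j) / total
    val diff = observed(i)(j) - expected
    diff * diff / expected
  }
  terms.sum
}

// Rows: feature absent / present; columns: ham / spam (toy counts)
val independent = chiSquared(Array(Array(10.0, 20.0), Array(20.0, 40.0)))  // 0.0
val dependent = chiSquared(Array(Array(10.0, 20.0), Array(30.0, 40.0)))
```

A larger statistic means the feature and the label are less likely to be independent, so the feature ranks higher.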

### MLlib Model Training

Once we have prepared our data, we can train some models.

In [38]:
val lr = new LogisticRegressionWithLBFGS()
val lrModelRaw = lr.run(mlDataTrain)
val lrModelScl = lr.run(mlDataTrainScl)

lr = org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS@722ab15b
lrModelRaw = org.apache.spark.mllib.classification.LogisticRegressionModel: intercept = 0.0, numFeatures = 100, numClasses = 2, threshold = 0.5
lrModelScl = org.apache.spark.mllib.classification.LogisticRegressionModel: intercept = 0.0, numFeatures = 100, numClasses = 2, threshold = 0.5


org.apache.spark.mllib.classification.LogisticRegressionModel: intercept = 0.0, numFeatures = 100, numClasses = 2, threshold = 0.5

In [39]:
lrModelRaw.weights

[-0.6524993007948156,0.1175990548428511,-0.0488703785947976,-0.1651541220730223,0.02470329252762825,0.004758313816139249,-0.11783593130735898,-0.34457246840918576,-0.28884598600448225,0.08772671168468467,-0.2887938984421034,0.27356439163820995,-0.09704143642101522,-0.2128740980976528,-0.18330099326038438,0.010570348889996459,-0.4353260899048793,0.33076365329166335,-0.2294716121761005,0.1505773697031762,-0.3160735289385896,0.256080759478671,0.03432091773977774,0.16213303455638062,0.022532573059196967,-0.5417264370889596,-0.3889054235945329,-0.17958489503355157,-0.1587373980407736,-0.39288769049444033,-0.05977237238948764,0.1658118715574863,-0.1873701022231263,-0.07108401663371154,-0.25079926082453435,0.1751939976957695,0.15202210115474463,0.01806048166182609,-0.07799288687007119,0.15719136501327144,0.003723668349210831,-0.06071000114325054,-0.2211974995580636,0.08621116733481554,-0.3425680372636201,-0.40725925276322333,0.19162082567035862,0.1438401844274893,0.09150893770274812,-0.248211

In [40]:
lrModelScl.weights

[-0.591629123336054,0.09217860453156451,-0.03688180704739261,-0.13593050421799519,0.020624901401583958,0.0037757674113313377,-0.09371466075235703,-0.32408572741102576,-0.22953519488042023,0.06912053040416696,-0.2648663050818764,0.20484554018184728,-0.07527910009375365,-0.17246497417580461,-0.14148163437823705,0.00825876414198023,-0.3821359397858671,0.2727954534686777,-0.18703772142903155,0.12444576318386746,-0.2720054360159467,0.21274831379570136,0.028211518470854662,0.13128022756486366,0.01787511616907382,-0.5010312463415187,-0.38642165047390525,-0.13577392120077264,-0.12186994565105402,-0.36375050157587924,-0.0487498787716791,0.14194302201198117,-0.14635624165757652,-0.06357667476898972,-0.2386398119207753,0.13598647308860967,0.13904043446199219,0.014901904016705641,-0.06617629073856457,0.124831060069052,0.0030179996657308515,-0.04715631048494058,-0.19579456470477005,0.07475067094100905,-0.28239450202221533,-0.3300101518305582,0.15526247326975148,0.11105574650962002,0.075799578551713

### Predict

Once the model is trained, we can perform predictions.

In [41]:
val rawPreds = lrModelRaw.predict(mlDataTest.map(_.features))
val sclPreds = lrModelScl.predict(mlDataTestScl.map(_.features))

rawPreds = MapPartitionsRDD[169] at mapPartitions at GeneralizedLinearAlgorithm.scala:70
sclPreds = MapPartitionsRDD[171] at mapPartitions at GeneralizedLinearAlgorithm.scala:70


MapPartitionsRDD[171] at mapPartitions at GeneralizedLinearAlgorithm.scala:70

In [42]:
rawPreds.take(1)

[0.0]

In [43]:
sclPreds.take(1)

[0.0]
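
A binary logistic regression prediction takes the dot product of weights and features, adds the intercept, passes the result through the logistic function, and compares it with the threshold (0.5 in the model summaries above). A plain-Scala sketch on dense arrays, with hypothetical names and toy weights:

```scala
// Logistic (sigmoid) function
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// Predicted class: 1.0 if the estimated probability exceeds the threshold, else 0.0
def predictLabel(weights: Array[Double], intercept: Double,
                 features: Array[Double], threshold: Double = 0.5): Double = {
  val margin = weights.zip(features).map { case (w, x) => w * x }.sum + intercept
  if (sigmoid(margin) > threshold) 1.0 else 0.0
}

// Toy weights (assumed), not the weights printed above
val toyWeights = Array(1.0, -2.0)
val positiveCase = predictLabel(toyWeights, 0.0, Array(3.0, 1.0))  // margin  1.0 -> 1.0
val negativeCase = predictLabel(toyWeights, 0.0, Array(0.0, 1.0))  // margin -2.0 -> 0.0
```

Raising the threshold trades recall for precision: fewer messages are flagged as spam, but those that are flagged are more likely to be spam.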

### Serving and Persistence

Often, once a model is trained, we save it and then load it in other programs to make predictions. We first try Spark's internal format, which lets us save a model and load it back.

In [44]:
import sys.process._

In [45]:
"rm -rf ../data/lrModelRaw".!
lrModelRaw.save(sc, "../data/lrModelRaw")

In [46]:
val lrModelRawLoaded = LogisticRegressionModel.load(sc, "../data/lrModelRaw")

lrModelRawLoaded = org.apache.spark.mllib.classification.LogisticRegressionModel: intercept = 0.0, numFeatures = 100, numClasses = 2, threshold = 0.5


org.apache.spark.mllib.classification.LogisticRegressionModel: intercept = 0.0, numFeatures = 100, numClasses = 2, threshold = 0.5

In [47]:
val rawPredsLoaded = lrModelRawLoaded.predict(mlDataTest.map(_.features))

rawPredsLoaded = MapPartitionsRDD[188] at mapPartitions at GeneralizedLinearAlgorithm.scala:70


MapPartitionsRDD[188] at mapPartitions at GeneralizedLinearAlgorithm.scala:70

In [48]:
rawPredsLoaded.take(1)

[0.0]

### Model Evaluation

MLlib can compute several evaluation metrics for trained models automatically. Here we evaluate the LR model from the spam classification section using `BinaryClassificationMetrics`.

In [49]:
mlDataTrain.take(1)

[(1.0,(100,[1,3,18,24,29,34,54,56,61,63,70,73,77,79,86,87,88,89,96],[2.0452290707569443,2.328758888107958,2.067670405229625,2.3342990684835736,1.7012558119933334,1.8036708174214031,2.343601461145887,1.5078520035765604,4.428099880663993,1.6200791887883912,1.1827898336933784,1.7778629334655305,1.6904977620267831,2.1961487300027565,2.379761442560331,2.0719348040160823,2.094422714894262,1.9202291131818907,3.953767316034235]))]

In [50]:
val lrModelEval = lrModelRaw
val predLabelLr = mlDataTest.map{case LabeledPoint(label, features) => (lrModelEval.predict(features), label)}
val metricsLr = new BinaryClassificationMetrics(predLabelLr)

lrModelEval = org.apache.spark.mllib.classification.LogisticRegressionModel: intercept = 0.0, numFeatures = 100, numClasses = 2, threshold = 0.5
predLabelLr = MapPartitionsRDD[189] at map at <console>:67
metricsLr = org.apache.spark.mllib.evaluation.BinaryClassificationMetrics@1b8098d3


org.apache.spark.mllib.evaluation.BinaryClassificationMetrics@1b8098d3

In [51]:
predLabelLr.count()

1137

In [52]:
mlDataTest.filter(_.label == 1.0).count()

140

In [53]:
predLabelLr.filter(x => x._1 == 0.0 && x._2 == 0.0).count()

907

In [55]:
println("LR model")
println("Area Under PR: " + metricsLr.areaUnderPR)
println("Area Under ROC: " + metricsLr.areaUnderROC)

LR model
Area Under PR: 0.35890776165347404
Area Under ROC: 0.704864593781344
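
In the evaluation above, `predict` returns hard 0/1 labels because the model's threshold is 0.5, so the curves are built from only two distinct score values. Area under ROC is more informative when computed from continuous scores; MLlib's `LogisticRegressionModel` offers `clearThreshold()` so that `predict` returns raw scores instead. The effect can be seen in a plain-Scala sketch using the pairwise definition of AUC (the probability that a random positive outscores a random negative, ties counting one half). This is not Spark's implementation, and the scores are made-up toy values.

```scala
// Pairwise AUC over (score, label) pairs; labels are 1.0 (positive) or 0.0 (negative)
def pairwiseAuc(scoreAndLabels: Seq[(Double, Double)]): Double = {
  val positives = scoreAndLabels.collect { case (score, label) if label == 1.0 => score }
  val negatives = scoreAndLabels.collect { case (score, label) if label == 0.0 => score }
  val wins = for (p <- positives; n <- negatives) yield {
    if (p > n) 1.0 else if (p == n) 0.5 else 0.0
  }
  wins.sum / (positives.size * negatives.size)
}

// Toy scores (assumed): every positive outscores every negative
val scored = Seq((0.9, 1.0), (0.6, 1.0), (0.45, 1.0), (0.4, 0.0), (0.2, 0.0))
// The same examples after thresholding the score at 0.5
val thresholded = scored.map { case (score, label) => (if (score > 0.5) 1.0 else 0.0, label) }

val aucFromScores = pairwiseAuc(scored)       // 1.0
val aucFromLabels = pairwiseAuc(thresholded)  // 5/6, ranking information was lost
```

The ranking is perfect in both cases, but thresholding hides that: the positive scored 0.45 becomes indistinguishable from the negatives.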


## Working with Spark ML

Now we look at some of the capabilities of the Spark ML package, which works with DataFrames, whereas MLlib works with RDDs. In particular, we solve the spam classification problem again using two Pipelines: one for data preparation and one for the ML model.

### Data Preparation: Data Encoding & Data Cleaning

In [56]:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer, IDF, SQLTransformer, StringIndexer, VectorAssembler, StandardScaler}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}

In [57]:
iniData.show()

+-----+--------------------+----+----+----+
|label|                text| _c2| _c3| _c4|
+-----+--------------------+----+----+----+
|  ham|Go until jurong p...|null|null|null|
|  ham|Ok lar... Joking ...|null|null|null|
| spam|Free entry in 2 a...|null|null|null|
|  ham|U dun say so earl...|null|null|null|
|  ham|Nah I don't think...|null|null|null|
| spam|FreeMsg Hey there...|null|null|null|
|  ham|Even my brother i...|null|null|null|
|  ham|As per your reque...|null|null|null|
| spam|WINNER!! As a val...|null|null|null|
| spam|Had your mobile 1...|null|null|null|
|  ham|I'm gonna be home...|null|null|null|
| spam|SIX chances to wi...|null|null|null|
| spam|URGENT! You have ...|null|null|null|
|  ham|I've been searchi...|null|null|null|
|  ham|I HAVE A DATE ON ...|null|null|null|
| spam|XXXMobileMovieClu...|null|null|null|
|  ham|Oh k...i'm watchi...|null|null|null|
|  ham|Eh u remember how...|null|null|null|
|  ham| Fine if that's t...|null|null|null|
| spam|England v Macedon...|null

In [58]:
val sqlSelect = new SQLTransformer().setStatement("SELECT label, text FROM __THIS__")

sqlSelect = sql_065623af654d


sql_065623af654d

In [59]:
val sqlFilter = new SQLTransformer().setStatement("SELECT * from __THIS__ WHERE text is not null AND label is not null")

sqlFilter = sql_d3f2652b3ad0


sql_d3f2652b3ad0

In [60]:
val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("label_num")

labelIndexer = strIdx_2317a3b760c7


strIdx_2317a3b760c7
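
By default, `StringIndexer` assigns indices in order of descending label frequency, so the most frequent label gets 0.0. In this dataset that makes `ham` 0.0 and `spam` 1.0. A plain-Scala sketch of that mapping follows; `fitStringIndex` is a hypothetical name, and breaking ties alphabetically is an assumption of this sketch.

```scala
// Map each distinct label to an index, most frequent label first.
// Ties are broken alphabetically here (an assumption of this sketch).
def fitStringIndex(labels: Seq[String]): Map[String, Double] =
  labels
    .groupBy(identity)
    .toSeq
    .sortBy { case (label, occurrences) => (-occurrences.size, label) }
    .zipWithIndex
    .map { case ((label, _), index) => label -> index.toDouble }
    .toMap

// Toy labels (assumed): ham is more frequent than spam
val labelIndex = fitStringIndex(Seq("ham", "spam", "ham", "ham"))
println(labelIndex)
```

Knowing this order matters when reading predictions later: a prediction of 1.0 means the less frequent class.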

In [61]:
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("text_token")

tokenizer = tok_60ef9752fd72


tok_60ef9752fd72

In [62]:
val countText = new SQLTransformer().setStatement("SELECT *, size(text_token) as count FROM __THIS__")

countText = sql_757a1d37c6b5


sql_757a1d37c6b5

In [63]:
val tf = new HashingTF().setNumFeatures(1000).setInputCol("text_token").setOutputCol("text_tf")

tf = hashingTF_5aeba64632e4


hashingTF_5aeba64632e4

In [64]:
val idf = new IDF().setInputCol("text_tf").setOutputCol("text_features")

idf = idf_ea3a95c144a8


idf_ea3a95c144a8

In [65]:
val assembler = new VectorAssembler().setInputCols(Array("text_features", "count")).setOutputCol("features_raw")

assembler = vecAssembler_282aebb1e6a9


vecAssembler_282aebb1e6a9

In [66]:
val scaler = new StandardScaler().setInputCol("features_raw").setOutputCol("features")

scaler = stdScal_5a99e5855034


stdScal_5a99e5855034

In [67]:
val etlPipelineModel = new Pipeline().setStages(Array(sqlSelect, sqlFilter, labelIndexer, 
                                      tokenizer, countText, tf, idf, assembler, scaler)).fit(iniData)

etlPipelineModel = pipeline_a8ca84d42fbd


pipeline_a8ca84d42fbd
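
Fitting a pipeline walks through the stages in order: stages that need to learn something from the data are fitted on the output of the preceding stages, and every stage then transforms the data for the next one. The plain-Scala sketch below imitates that flow on a list of numbers. `MiniTransformer`, `MiniEstimator`, and the helper functions are hypothetical, heavily simplified stand-ins, not Spark classes.

```scala
// Hypothetical miniature abstractions, loosely mirroring Transformer and Estimator
trait MiniTransformer { def transform(data: Seq[Double]): Seq[Double] }
trait MiniEstimator { def fit(data: Seq[Double]): MiniTransformer }

// A stateless stage: nothing to learn, so it is already a transformer
object Doubler extends MiniTransformer {
  def transform(data: Seq[Double]): Seq[Double] = data.map(_ * 2)
}

// A stage that must be fitted: it learns the mean, then subtracts it
object MeanCenterer extends MiniEstimator {
  def fit(data: Seq[Double]): MiniTransformer = {
    val mean = data.sum / data.size
    new MiniTransformer {
      def transform(d: Seq[Double]): Seq[Double] = d.map(_ - mean)
    }
  }
}

// Fit the stages in order, feeding each one the output of the previous stages
def fitStages(stages: List[Either[MiniTransformer, MiniEstimator]],
              data: Seq[Double]): List[MiniTransformer] = {
  var current = data
  stages.map { stage =>
    val fitted = stage match {
      case Left(transformer) => transformer
      case Right(estimator)  => estimator.fit(current)
    }
    current = fitted.transform(current)
    fitted
  }
}

// Apply already-fitted stages to (possibly new) data
def applyStages(fitted: List[MiniTransformer], data: Seq[Double]): Seq[Double] =
  fitted.foldLeft(data)((d, stage) => stage.transform(d))

val fittedStages = fitStages(List(Left(Doubler), Right(MeanCenterer)), Seq(1.0, 2.0, 3.0))
val transformed = applyStages(fittedStages, Seq(1.0, 2.0, 3.0))  // List(-2.0, 0.0, 2.0)
```

The fitted stages remember what they learned (here, a mean of 4.0), so they can be applied to new data without refitting.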

In [68]:
val mlDataPipe = etlPipelineModel.transform(iniData)

mlDataPipe = [label: string, text: string ... 7 more fields]


[label: string, text: string ... 7 more fields]

In [69]:
mlDataPipe.select("label", "features").show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  ham|(1001,[7,77,150,1...|
|  ham|(1001,[20,316,484...|
| spam|(1001,[30,35,73,1...|
|  ham|(1001,[57,368,372...|
|  ham|(1001,[135,163,32...|
| spam|(1001,[25,36,91,9...|
|  ham|(1001,[18,47,48,5...|
|  ham|(1001,[36,71,92,2...|
| spam|(1001,[39,43,61,7...|
| spam|(1001,[36,73,82,1...|
|  ham|(1001,[26,41,106,...|
| spam|(1001,[15,35,36,4...|
| spam|(1001,[68,73,122,...|
|  ham|(1001,[19,36,39,1...|
|  ham|(1001,[44,82,170,...|
| spam|(1001,[41,43,49,6...|
|  ham|(1001,[275,426,44...|
|  ham|(1001,[80,147,236...|
|  ham|(1001,[159,170,29...|
| spam|(1001,[9,19,45,71...|
+-----+--------------------+
only showing top 20 rows



In [70]:
val mlDataPipeSplits = mlDataPipe.randomSplit(Array(0.8, 0.2))

mlDataPipeSplits = Array([label: string, text: string ... 7 more fields], [label: string, text: string ... 7 more fields])


[[label: string, text: string ... 7 more fields], [label: string, text: string ... 7 more fields]]

In [71]:
val mlDataPipeTrain = mlDataPipeSplits(0)
val mlDataPipeTest = mlDataPipeSplits(1)

mlDataPipeTrain = [label: string, text: string ... 7 more fields]
mlDataPipeTest = [label: string, text: string ... 7 more fields]


[label: string, text: string ... 7 more fields]

In [72]:
mlDataPipeTrain.count()

4391

In [73]:
mlDataPipeTest.count()

1182

#### Spark ML Models

Once our data is cleaned and encoded, we can train ML models.

In [74]:
val lr = new LogisticRegression().setFeaturesCol("features").setLabelCol("label_num").setRegParam(0.1)

lr = logreg_1f358ec290e4


logreg_1f358ec290e4

In [75]:
val lrPipeline = new Pipeline().setStages(Array(lr))

lrPipeline = pipeline_bf6597265858


pipeline_bf6597265858

In [76]:
val lrPipelineModel = lrPipeline.fit(mlDataPipeTrain)

lrPipelineModel = pipeline_bf6597265858


pipeline_bf6597265858

As with any other pipeline, we can access each of its stages and its metadata.

In [77]:
lrPipelineModel.stages(0).extractParamMap

{
	logreg_1f358ec290e4-aggregationDepth: 2,
	logreg_1f358ec290e4-elasticNetParam: 0.0,
	logreg_1f358ec290e4-family: auto,
	logreg_1f358ec290e4-featuresCol: features,
	logreg_1f358ec290e4-fitIntercept: true,
	logreg_1f358ec290e4-labelCol: label_num,
	logreg_1f358ec290e4-maxIter: 100,
	logreg_1f358ec290e4-predictionCol: prediction,
	logreg_1f358ec290e4-probabilityCol: probability,
	logreg_1f358ec290e4-rawPredictionCol: rawPrediction,
	logreg_1f358ec290e4-regParam: 0.1,
	logreg_1f358ec290e4-standardization: true,
	logreg_1f358ec290e4-threshold: 0.5,
	logreg_1f358ec290e4-tol: 1.0E-6
}

#### Data Persistence and Spark ML

We can save and load our pipelines (including both data transformers and ML algorithms).

In [78]:
// write.overwrite() replaces any model already saved at the target path
etlPipelineModel.write.overwrite().save("../data/etlPipelineModel")
lrPipelineModel.write.overwrite().save("../data/lrPipelineModel")

In [79]:
val etlPipelineLoad = PipelineModel.load("../data/etlPipelineModel")
val lrPipelineLoad = PipelineModel.load("../data/lrPipelineModel")

etlPipelineLoad = pipeline_a8ca84d42fbd
lrPipelineLoad = pipeline_5f86381e1e35


pipeline_5f86381e1e35

In [80]:
val predictions = lrPipelineLoad.transform(etlPipelineLoad.transform(iniData))

predictions = [label: string, text: string ... 10 more fields]


[label: string, text: string ... 10 more fields]

In [81]:
predictions.select("label_num", "prediction").show()

+---------+----------+
|label_num|prediction|
+---------+----------+
|      0.0|       0.0|
|      0.0|       0.0|
|      1.0|       1.0|
|      0.0|       0.0|
|      0.0|       0.0|
|      1.0|       1.0|
|      0.0|       0.0|
|      0.0|       0.0|
|      1.0|       1.0|
|      1.0|       1.0|
|      0.0|       0.0|
|      1.0|       1.0|
|      1.0|       1.0|
|      0.0|       0.0|
|      0.0|       0.0|
|      1.0|       1.0|
|      0.0|       0.0|
|      0.0|       0.0|
|      0.0|       0.0|
|      1.0|       1.0|
+---------+----------+
only showing top 20 rows



#### Automated Model Selection: Parameter Search

Spark ML offers functionality for hyperparameter tuning of ML models. Let's revisit our previous problem and test different regularization parameters.

In [82]:
val lr = new LogisticRegression().setFeaturesCol("features").setLabelCol("label_num").setRegParam(0.1)

lr = logreg_c1ba64163252


logreg_c1ba64163252

In [83]:
val estimatorPipeline = new Pipeline().setStages(Array(lr))

estimatorPipeline = pipeline_58193ef5e226


pipeline_58193ef5e226

In [84]:
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01, 0.05)).build()

paramGrid = 


Array({
	logreg_c1ba64163252-regParam: 0.1
}, {
	logreg_c1ba64163252-regParam: 0.01
}, {
	logreg_c1ba64163252-regParam: 0.05
})


[{
	logreg_c1ba64163252-regParam: 0.1
}, {
	logreg_c1ba64163252-regParam: 0.01
}, {
	logreg_c1ba64163252-regParam: 0.05
}]
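
With a single parameter the grid is just the list of its values; with several parameters, the grid is the Cartesian product of their value lists. A plain-Scala sketch of that expansion follows. `buildGrid` is a hypothetical helper that represents parameters by name, and the `elasticNetParam` values are assumptions added only to show the product.

```scala
// Expand named value lists into every combination (Cartesian product)
def buildGrid(grid: Seq[(String, Seq[Double])]): Seq[Map[String, Double]] =
  grid.foldLeft(Seq(Map.empty[String, Double])) { case (partials, (name, values)) =>
    for (partial <- partials; value <- values) yield partial + (name -> value)
  }

// One parameter: three candidates, as in the grid above
val singleParamGrid = buildGrid(Seq("regParam" -> Seq(0.1, 0.01, 0.05)))

// Two parameters: 3 x 2 = 6 candidates
val twoParamGrid = buildGrid(Seq(
  "regParam" -> Seq(0.1, 0.01, 0.05),
  "elasticNetParam" -> Seq(0.0, 0.5)
))
```

Since every combination is trained for every fold, the cost of the search grows multiplicatively with each added parameter.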

In [85]:
val evaluator = new BinaryClassificationEvaluator().setLabelCol("label_num").setRawPredictionCol("prediction")

evaluator = binEval_6b19b7eaa0cf


binEval_6b19b7eaa0cf

In [86]:
val crossVal = new CrossValidator().setEstimator(estimatorPipeline).setEstimatorParamMaps(paramGrid)
.setEvaluator(evaluator).setNumFolds(3)

crossVal = cv_cb1f28a7c907


cv_cb1f28a7c907
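
Cross-validation with k folds trains each candidate k times, each time holding out a different fold for validation, and keeps the candidate with the best average metric. The plain-Scala sketch below shows that loop. It assigns rows to folds deterministically by index, which is an assumption of this sketch (Spark splits the data randomly), and `kFoldIndices` and `selectBest` are hypothetical names.

```scala
// Split row indices 0 until n into k (training, validation) pairs
def kFoldIndices(n: Int, k: Int): Seq[(Seq[Int], Seq[Int])] =
  (0 until k).map { fold =>
    val (validation, training) = (0 until n).partition(_ % k == fold)
    (training, validation)
  }

// Pick the candidate with the highest metric averaged over the k folds.
// `score` trains on the training indices and returns the validation metric.
def selectBest[P](candidates: Seq[P], n: Int, k: Int)
                 (score: (P, Seq[Int], Seq[Int]) => Double): P = {
  val folds = kFoldIndices(n, k)
  candidates.maxBy { candidate =>
    folds.map { case (training, validation) => score(candidate, training, validation) }.sum / k
  }
}

// Toy scoring function (assumed): pretends that 0.05 is the ideal value
val bestRegParam = selectBest(Seq(0.1, 0.01, 0.05), n = 6, k = 3) {
  (regParam, _, _) => -math.abs(regParam - 0.05)
}
```

With 3 candidates and 3 folds, nine models are trained before the final refit, which is worth keeping in mind when the grid grows.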

In [87]:
val cvModel = crossVal.fit(mlDataPipe)

cvModel = cv_cb1f28a7c907


cv_cb1f28a7c907

In [88]:
cvModel.transform(mlDataPipe).select("label_num", "prediction").show()

+---------+----------+
|label_num|prediction|
+---------+----------+
|      0.0|       0.0|
|      0.0|       0.0|
|      1.0|       1.0|
|      0.0|       0.0|
|      0.0|       0.0|
|      1.0|       1.0|
|      0.0|       0.0|
|      0.0|       0.0|
|      1.0|       1.0|
|      1.0|       1.0|
|      0.0|       0.0|
|      1.0|       1.0|
|      1.0|       1.0|
|      0.0|       0.0|
|      0.0|       0.0|
|      1.0|       1.0|
|      0.0|       0.0|
|      0.0|       0.0|
|      0.0|       0.0|
|      1.0|       1.0|
+---------+----------+
only showing top 20 rows

