# Machine Learning with MLlib

In this Notebook, we will review the RDD-Based Machine Learning library MLlib.

## Data Types

First, we have to understand the different data structures used by MLlib. In particular, they are:

    * Vectors
    * Labeled Points
    * Rating
    * Model Classes
    
We will see `Vectors` and `Labeled Points` in more detail.

In [1]:
import org.apache.spark.mllib.linalg.Vectors

`Vector()` --> holds the feature values. It can be `dense` or `sparse`.

In [2]:
val vectorDense = Vectors.dense(Array(1.0,1.0,2.0,2.0))

vectorDense = [1.0,1.0,2.0,2.0]


[1.0,1.0,2.0,2.0]

In [3]:
val vectorSparse = Vectors.sparse(4, Array(0, 2), Array(1.0, 2.0))

vectorSparse = (4,[0,2],[1.0,2.0])


(4,[0,2],[1.0,2.0])
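
To make the sparse notation concrete, the sketch below expands a `(size, indices, values)` triple into the equivalent dense array in plain Scala. The helper name `sparseToDense` is hypothetical and not part of MLlib; it only illustrates what the three arguments of `Vectors.sparse` mean.

```scala
// Hypothetical helper (not an MLlib API): expand a sparse triple into a dense array.
def sparseToDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
  require(indices.length == values.length, "indices and values must have the same length")
  val dense = Array.fill(size)(0.0) // every position defaults to 0.0
  indices.zip(values).foreach { case (i, v) => dense(i) = v }
  dense
}

// Same arguments as the vectorSparse cell above
val expanded = sparseToDense(4, Array(0, 2), Array(1.0, 2.0))
println(expanded.mkString("[", ",", "]")) // prints [1.0,0.0,2.0,0.0]
```

In MLlib itself, `vectorSparse.toArray` performs the same expansion.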

`LabeledPoint()` --> holds both the feature values and the label value

In [4]:
import org.apache.spark.mllib.regression.LabeledPoint

In [5]:
val labelPoint = LabeledPoint(1, vectorDense)

labelPoint = (1.0,[1.0,1.0,2.0,2.0])


(1.0,[1.0,1.0,2.0,2.0])

## Algorithms

In this section, we will review the different algorithms associated with Machine Learning problems. Among others, we can highlight the following families of algorithms:

    * Feature Extraction
    * Statistics
    * Classification and Regression
    * Collaborative Filtering and Recommendation
    * Dimensionality Reduction
    * Model Evaluation

### Feature Extraction

ML algorithms only accept numerical values as inputs. Here, we discuss some algorithms that help us translate other inputs (such as text or non-scaled numerical vectors) into numerical values that ML algorithms can work with. In particular, we will discuss the following:

    * TF-IDF
    * Scaling
    * Normalization
    * Word2Vec

#### TF-IDF

`TF-IDF` --> Term Frequency - Inverse Document Frequency, useful to convert text inputs into numerical inputs. In MLlib it is computed in two steps, with `HashingTF` and `IDF`.

In [6]:
import org.apache.spark.mllib.feature.{HashingTF, IDF}

In [7]:
val sentences = sc.parallelize(Array("hello", "hello how are you", "good bye", "bye"))
val words = sentences.map(_.split(" ").toSeq)
val tf = new HashingTF(100)
val tfVectors = tf.transform(words)

sentences = ParallelCollectionRDD[0] at parallelize at <console>:30
words = MapPartitionsRDD[1] at map at <console>:31
tf = org.apache.spark.mllib.feature.HashingTF@26f75ac0
tfVectors = MapPartitionsRDD[2] at map at HashingTF.scala:120


MapPartitionsRDD[2] at map at HashingTF.scala:120

In [8]:
tfVectors.collect()

[(100,[48],[1.0]), (100,[25,37,38,48],[1.0,1.0,1.0,1.0]), (100,[5,68],[1.0,1.0]), (100,[5],[1.0])]
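
The term-frequency vectors above come from the hashing trick: each word is hashed to a bucket index in `[0, numFeatures)` and the bucket counts the hits. The following is a minimal sketch of that idea in plain Scala. It assumes the JVM `hashCode` as the hash function, whereas `HashingTF` may use a different one, so the indices are not expected to match the output above. The names `nonNegativeMod` and `hashedTermFrequencies` are illustrative only.

```scala
// Illustrative hashing trick; MLlib's HashingTF may use a different hash function,
// so the indices produced here are not expected to match the notebook output.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

// Map each term to a bucket and count how many times each bucket is hit.
def hashedTermFrequencies(terms: Seq[String], numFeatures: Int): Map[Int, Double] =
  terms
    .map(term => nonNegativeMod(term.hashCode, numFeatures))
    .groupBy(identity)
    .map { case (index, hits) => index -> hits.size.toDouble }

val tfSketch = hashedTermFrequencies("hello how are you".split(" ").toSeq, 100)
```

Two different words can land in the same bucket, which is why a larger `numFeatures` reduces collisions at the cost of wider vectors.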

In [9]:
val idf = new IDF()
val idfModel = idf.fit(tfVectors)
val tfIdfVectors = idfModel.transform(tfVectors)

idf = org.apache.spark.mllib.feature.IDF@79a788d1
idfModel = org.apache.spark.mllib.feature.IDFModel@40615a99
tfIdfVectors = MapPartitionsRDD[7] at mapPartitions at IDF.scala:178


MapPartitionsRDD[7] at mapPartitions at IDF.scala:178

In [10]:
tfIdfVectors.collect()

[(100,[48],[0.5108256237659907]), (100,[25,37,38,48],[0.9162907318741551,0.9162907318741551,0.9162907318741551,0.5108256237659907]), (100,[5,68],[0.5108256237659907,0.9162907318741551]), (100,[5],[0.5108256237659907])]
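
The weights in this output are consistent with the smoothed inverse document frequency `log((m + 1) / (d + 1))`, where `m` is the number of documents and `d` the number of documents containing the term. The short sketch below reproduces the two distinct values for our four sentences. The function name `smoothedIdf` is an assumption made for illustration.

```scala
// Smoothed IDF: log((numDocs + 1) / (docFreq + 1)), using the natural logarithm
def smoothedIdf(numDocs: Long, docFreq: Long): Double =
  math.log((numDocs + 1.0) / (docFreq + 1.0))

// "hello" and "bye" each appear in 2 of the 4 sentences
val idfTwoDocs = smoothedIdf(4, 2)
// "how", "are", "you" and "good" each appear in 1 of the 4 sentences
val idfOneDoc = smoothedIdf(4, 1)
```

Since every term frequency in this example is 1.0, the TF-IDF value of each term is simply its IDF.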

#### Word2Vec

`Word2Vec` --> also useful to transform text into numerical data

In [11]:
import org.apache.spark.mllib.feature.Word2Vec

In [12]:
val word2vec = new Word2Vec().setMinCount(0)
val word2vecModel = word2vec.fit(words)

word2vec = org.apache.spark.mllib.feature.Word2Vec@5ffb76c3
word2vecModel = org.apache.spark.mllib.feature.Word2VecModel@4ae04e59


org.apache.spark.mllib.feature.Word2VecModel@4ae04e59

In [13]:
val word2vecVectors = word2vecModel.transform("hello")

word2vecVectors = [-0.002391571644693613,-0.004730306100100279,0.004567709285765886,-0.0021375345531851053,0.003772377735003829,-0.004304440226405859,-0.0035075515042990446,-0.002512869192287326,0.0043669031001627445,-0.002442498691380024,-0.002165005775168538,-0.0010312151862308383,0.0036732545122504234,-0.001366215292364359,-0.0011274006683379412,0.0032704095356166363,-4.419920442160219E-4,0.004512975923717022,0.003956434316933155,-0.0010905592935159802,-0.0027423431165516376,0.001025308622047305,0.002350220223888755,-0.003991275560110807,6.259826477617025E-4,0.0032516797073185444,0.003080913331359625,0.0022275270894169807,0.0045756129547953606,-0.0024304573889821768,-1.6684022557456046E-4,0.0036196813452988863,0.0018787361914291978,0.004775937646...



#### Scaling

Even when our input data is already numeric, it is often useful to scale it before passing it to the ML algorithms.

`StandardScaler()` --> to scale numerical data

In [14]:
import org.apache.spark.mllib.feature.StandardScaler

In [15]:
val vectors = Array(Vectors.dense(Array(-2.0, 5.0, 1.0, 4.0)),
                    Vectors.dense(Array(2.0, 0.0, 1.0, 7.2)),
                    Vectors.dense(Array(4.0, 2.0, 0.5, 0.8)))

val vectorsRdd = sc.parallelize(vectors)
val scaler = new StandardScaler(withMean=true, withStd=true)
val model = scaler.fit(vectorsRdd)
val scaledData = model.transform(vectorsRdd)

vectors = Array([-2.0,5.0,1.0,4.0], [2.0,0.0,1.0,7.2], [4.0,2.0,0.5,0.8])
vectorsRdd = ParallelCollectionRDD[20] at parallelize at <console>:36
scaler = org.apache.spark.mllib.feature.StandardScaler@7fd0b1ab
model = org.apache.spark.mllib.feature.StandardScalerModel@16f42c8c
scaledData = MapPartitionsRDD[25] at map at VectorTransformer.scala:52


MapPartitionsRDD[25] at map at VectorTransformer.scala:52

In [16]:
scaledData.collect()

[[-1.0910894511799618,1.0596258856520353,0.5773502691896257,0.0], [0.2182178902359923,-0.9271726499455306,0.5773502691896257,1.0], [0.8728715609439694,-0.13245323570650427,-1.1547005383792517,-1.0]]
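
As a sanity check, the scaled values can be reproduced by hand for one column. The sketch below standardizes a column in plain Scala, assuming the unbiased sample standard deviation (dividing by `n - 1`), which agrees with the output above. The helper name `standardize` is hypothetical.

```scala
// Standardize one column: subtract the mean, divide by the sample standard deviation.
def standardize(column: Seq[Double]): Seq[Double] = {
  val n = column.size
  val mean = column.sum / n
  // Unbiased sample variance (divides by n - 1)
  val variance = column.map(x => math.pow(x - mean, 2)).sum / (n - 1)
  val std = math.sqrt(variance)
  column.map(x => (x - mean) / std)
}

// Fourth column of the three vectors: mean 4.0, standard deviation 3.2
val fourthColumn = standardize(Seq(4.0, 7.2, 0.8)) // approximately 0.0, 1.0, -1.0
```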

#### Normalization

As with scaling, it is sometimes very useful to normalize our data.

`Normalizer()` --> to rescale each vector to unit norm (the L2 norm by default)

In [17]:
import org.apache.spark.mllib.feature.Normalizer

In [18]:
val norm = new Normalizer()
val normData = norm.transform(vectorsRdd)

norm = org.apache.spark.mllib.feature.Normalizer@1ed117d5
normData = MapPartitionsRDD[26] at map at VectorTransformer.scala:52


MapPartitionsRDD[26] at map at VectorTransformer.scala:52

In [19]:
normData.collect()

[[-0.29488391230979427,0.7372097807744856,0.14744195615489714,0.5897678246195885], [0.2652790545386455,0.0,0.13263952726932274,0.9550045963391238], [0.8751666735874727,0.43758333679373634,0.10939583419843409,0.17503333471749455]]
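
Unlike scaling, normalization works row by row: each vector is divided by its own norm. A minimal sketch with the L2 norm, applied to the first vector of the example, is shown below. The helper name `l2Normalize` is an assumption for illustration.

```scala
// Divide every component by the Euclidean (L2) norm of the vector.
def l2Normalize(vector: Seq[Double]): Seq[Double] = {
  val norm = math.sqrt(vector.map(x => x * x).sum)
  if (norm == 0.0) vector else vector.map(_ / norm) // leave an all-zero vector untouched
}

// First vector of vectorsRdd; its norm is sqrt(46)
val unitVector = l2Normalize(Seq(-2.0, 5.0, 1.0, 4.0))
```

The resulting vector has length 1, and its first component matches the first value in the output above.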

### Statistics

MLlib includes useful functionality to compute summary statistics over numeric RDDs.

In [20]:
import org.apache.spark.mllib.stat.Statistics

#### colStats()

`colStats()` --> to calculate statistics over an RDD of numerical values

In [21]:
val colStats = Statistics.colStats(vectorsRdd)

colStats = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@69eed7f4


org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@69eed7f4

In [22]:
val colStatsMap = Map("count" -> colStats.count, 
                      "max" -> colStats.max,
                      "mean" -> colStats.mean,
                      "min" -> colStats.min,
                      "normL1" -> colStats.normL1,
                      "normL2" -> colStats.normL2,
                      "numNonzeros" -> colStats.numNonzeros,
                      "variance" -> colStats.variance)

colStatsMap = Map(count -> 3, variance -> [9.333333333333334,6.333333333333333,0.08333333333333333,10.240000000000002], mean -> [1.3333333333333335,2.333333333333333,0.8333333333333334,4.0], numNonzeros -> [3.0,2.0,3.0,3.0], min -> [-2.0,0.0,0.5,0.8], normL1 -> [8.0,7.0,2.5,12.0], normL2 -> [4.898979485566356,5.385164807134504,1.5,8.27526434623088], max -> [4.0,5.0,1.0,7.2])



In [23]:
colStatsMap.foreach{case(key, value) => println(key + ": " + value)}

count: 3
variance: [9.333333333333334,6.333333333333333,0.08333333333333333,10.240000000000002]
mean: [1.3333333333333335,2.333333333333333,0.8333333333333334,4.0]
numNonzeros: [3.0,2.0,3.0,3.0]
min: [-2.0,0.0,0.5,0.8]
normL1: [8.0,7.0,2.5,12.0]
normL2: [4.898979485566356,5.385164807134504,1.5,8.27526434623088]
max: [4.0,5.0,1.0,7.2]
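
Each statistic is computed per column, that is, per feature across all vectors. As an illustration, the sketch below recomputes three of them in plain Scala for the second column (5.0, 0.0, 2.0). The variable names are chosen only for this example.

```scala
// Second feature across the three vectors of vectorsRdd
val secondColumn = Seq(5.0, 0.0, 2.0)

val normL1 = secondColumn.map(x => math.abs(x)).sum      // sum of absolute values
val normL2 = math.sqrt(secondColumn.map(x => x * x).sum) // square root of the sum of squares
val numNonzeros = secondColumn.count(_ != 0.0)           // entries different from zero
```

These agree with the second entries of `normL1`, `normL2` and `numNonzeros` printed above.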


#### corr()

`corr()` --> to calculate the correlation matrix between the columns of one RDD of vectors, or the correlation coefficient between two RDDs of doubles

In [24]:
Statistics.corr(vectorsRdd)

1.0                  -0.7370434740955019   -0.755928946018455   -0.3273268353539885   
-0.7370434740955019  1.0                   0.11470786693528112  -0.39735970711951274  
-0.755928946018455   0.11470786693528112   1.0                  0.8660254037844397    
-0.3273268353539885  -0.39735970711951274  0.8660254037844397   1.0                   

In [25]:
import org.apache.spark.rdd.RDD

In [26]:
val data1: RDD[Double] = sc.parallelize(Array(1, 2, 3, 4, 5))
val data2: RDD[Double] = sc.parallelize(Array(10, 19, 32, 41, 56))

data1 = ParallelCollectionRDD[39] at parallelize at <console>:35
data2 = ParallelCollectionRDD[40] at parallelize at <console>:36


ParallelCollectionRDD[40] at parallelize at <console>:36

In [27]:
Statistics.corr(data1, data2)

0.996326893005933
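
The value above is the Pearson correlation coefficient. A small plain-Scala sketch of the formula, applied to the same two series, is given below. The function name `pearson` is hypothetical.

```scala
// Pearson correlation: covariance divided by the product of the spreads.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.size == ys.size && xs.nonEmpty, "series must be non-empty and of equal length")
  val meanX = xs.sum / xs.size
  val meanY = ys.sum / ys.size
  val covariance = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  val spreadX = math.sqrt(xs.map(x => math.pow(x - meanX, 2)).sum)
  val spreadY = math.sqrt(ys.map(y => math.pow(y - meanY, 2)).sum)
  covariance / (spreadX * spreadY)
}

// Same values as data1 and data2
val r = pearson(Seq(1.0, 2.0, 3.0, 4.0, 5.0), Seq(10.0, 19.0, 32.0, 41.0, 56.0))
```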

#### chiSqTest()

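`chiSqTest()` --> to compute Pearson's chi-squared independence test between each feature and the label. In the example below every point gets the same label (0), so each test reports 0 degrees of freedom and a p-value of 1.0.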

In [28]:
val labelPointRdd = vectorsRdd.map(x => LabeledPoint(0, x))

labelPointRdd = MapPartitionsRDD[51] at map at <console>:38


MapPartitionsRDD[51] at map at <console>:38

In [29]:
val chiSqTest = Statistics.chiSqTest(labelPointRdd)



[Chi squared test summary:
method: pearson
degrees of freedom = 0 
statistic = 0.0 
pValue = 1.0 
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 0 
statistic = 0.0 
pValue = 1.0 
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 0 
statistic = 0.0 
pValue = 1.0 
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 0 
statistic = 0.0 
pValue = 1.0 
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent..]

In [30]:
chiSqTest.foreach(x => println("Test value: " + x.pValue))

Test value: 1.0
Test value: 1.0
Test value: 1.0
Test value: 1.0
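
The zero degrees of freedom follow from the shape of the contingency table built for each feature: (distinct feature values - 1) x (distinct labels - 1). The sketch below makes that arithmetic explicit. The helper `degreesOfFreedom` is hypothetical and not an MLlib function.

```scala
// Degrees of freedom of a contingency table with one row per feature value
// and one column per label.
def degreesOfFreedom(numFeatureValues: Int, numLabels: Int): Int =
  (numFeatureValues - 1) * (numLabels - 1)

// Third feature of vectorsRdd holds 1.0, 1.0 and 0.5: two distinct values.
// Every point was labelled 0, so there is a single distinct label.
val dfSingleLabel = degreesOfFreedom(numFeatureValues = 2, numLabels = 1)

// With two distinct labels the same feature would give one degree of freedom.
val dfTwoLabels = degreesOfFreedom(numFeatureValues = 2, numLabels = 2)
```

A more informative test therefore needs data with at least two different labels.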


### Machine Learning: Regression

In this section, we will explore the conventional Linear Regression model.

In [31]:
import java.util.Random
val randGenerator = new Random()
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

randGenerator = java.util.Random@5e249b2d


java.util.Random@5e249b2d

First, we will create training data according to a Linear Regression model with the following weights:

    * Weights: [2.5, 1.25, 0.5, 1]

In [32]:
// 500 random feature rows, each with 4 integer values in [0, 20)
val regFeatures = for (_ <- 1 to 500) yield { for (_ <- 1 to 4) yield randGenerator.nextInt(20) }
val regFeaturesRdd = sc.parallelize(regFeatures).map(x => Vectors.dense(x.toArray.map(_.toDouble)))
// Default StandardScaler: scales to unit standard deviation without centering
val scaler = new StandardScaler()
val regFeaturesScale = scaler.fit(regFeaturesRdd).transform(regFeaturesRdd)
// Label = weighted sum of the scaled features plus uniform noise in [0, 1)
val regData = regFeaturesScale.map(x => LabeledPoint({
    val noise = new Random().nextDouble
    2.5*x(0) + 1.25*x(1) + 0.5*x(2) + x(3) + noise
}, x))
regData.take(2)

regFeatures = Vector(Vector(16, 11, 8, 16), Vector(3, 19, 3, 8), Vector(2, 1, 17, 15), Vector(5, 4, 15, 0), Vector(17, 6, 8, 11), Vector(6, 15, 10, 8), Vector(3, 15, 13, 9), Vector(8, 15, 2, 5), Vector(17, 12, 18, 17), Vector(10, 10, 17, 5), Vector(7, 0, 14, 8), Vector(15, 10, 17, 6), Vector(10, 9, 0, 18), Vector(8, 18, 3, 4), Vector(16, 5, 14, 13), Vector(18, 9, 1, 6), Vector(6, 5, 17, 0), Vector(3, 7, 18, 13), Vector(7, 5, 5, 0), Vector(7, 10, 15, 11), Vector(15, 10, 3, 10), Vector(1, 2, 1, 12), Vector(4, 3, 12, 6), Vector(13, 4, 19, 4), Vector(17, 4, 17, 1), Vector(1, 6, 11, 3), Vector(3, 5, 6, 7), Vector(2, 4, 10, 19), Vector(6, 18, 17, 1), Vector(17, 0, 1, 2), Vector(18, 8, 6, 12), Vector(5, 14, 12, ...



Once the data has been created, we can train our model:

In [33]:
val numIterations = 10000
val stepSize = 0.1
val miniBatchFraction = 1.0
val lrModel = LinearRegressionWithSGD.train(regData, numIterations = numIterations, 
                                            stepSize = stepSize, miniBatchFraction = miniBatchFraction)

numIterations = 10000
stepSize = 0.1
miniBatchFraction = 1.0
lrModel = org.apache.spark.mllib.regression.LinearRegressionModel: intercept = 0.0, numFeatures = 4




org.apache.spark.mllib.regression.LinearRegressionModel: intercept = 0.0, numFeatures = 4

We can now compare the original and computed weights (the fitted intercept is reported as 0.0 above):

In [34]:
println("Computed weights: " + lrModel.weights)
println("Original weights: [2.5, 1.25, 0.5, 1]")

Computed weights: [2.3416762582321367,1.3269221781302136,0.7296018935053467,1.1208114768699862]
Original weights: [2.5, 1.25, 0.5, 1]
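
To give an intuition of what the training step does, here is a minimal sketch of plain full-batch gradient descent for least squares without an intercept. It is not MLlib's implementation: it uses a constant step size and no mini-batches, and the toy data, `fitLinear` and `dot` are assumptions made for illustration. The labels are noise-free, so the true weights should be recovered closely.

```scala
def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// Full-batch gradient descent on the mean of 0.5 * (w.x - y)^2, constant step size.
def fitLinear(data: Seq[(Double, Array[Double])], stepSize: Double, numIterations: Int): Array[Double] = {
  val numFeatures = data.head._2.length
  var weights = Array.fill(numFeatures)(0.0)
  for (_ <- 1 to numIterations) {
    val gradient = Array.fill(numFeatures)(0.0)
    for ((label, features) <- data) {
      val error = dot(weights, features) - label
      for (j <- 0 until numFeatures) gradient(j) += error * features(j) / data.size
    }
    // Move against the gradient
    weights = weights.zip(gradient).map { case (w, g) => w - stepSize * g }
  }
  weights
}

// Toy, noise-free data generated from hypothetical weights [2.5, 1.25]
val trueWeights = Array(2.5, 1.25)
val points = Seq(Array(1.0, 0.0), Array(0.0, 1.0), Array(1.0, 1.0), Array(-1.0, 1.0))
val toyData = points.map(x => (dot(trueWeights, x), x))
val fitted = fitLinear(toyData, stepSize = 0.1, numIterations = 1000)
```

In the notebook's data the labels include uniform noise in [0, 1) and the model has no intercept, which is one plausible reason why the computed weights differ from the original ones.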


### Machine Learning: Classification

In this section, we will explore different classification models:

    * Logistic Regression
    * Support Vector Machines (SVMs)
    * Naive Bayes
    * Decision Trees
    * Random Forests
    
In every case, we will try to solve the same problem: building a model that classifies messages into two groups, legitimate and spam. To do so, we first have to preprocess some text data using some of the functionality covered in previous sections of this Notebook.

In [35]:
import org.apache.spark.mllib.classification.{LogisticRegressionWithSGD, SVMWithSGD, NaiveBayes}
import org.apache.spark.mllib.tree.{DecisionTree, RandomForest}

#### Data Preparation

Read the data:

In [36]:
val iniData = spark.read.option("header", "true").csv("../data/spam.csv")

iniData = [label: string, text: string ... 3 more fields]


[label: string, text: string ... 3 more fields]

In [37]:
iniData.show()

+-----+--------------------+----+----+----+
|label|                text| _c2| _c3| _c4|
+-----+--------------------+----+----+----+
|  ham|Go until jurong p...|null|null|null|
|  ham|Ok lar... Joking ...|null|null|null|
| spam|Free entry in 2 a...|null|null|null|
|  ham|U dun say so earl...|null|null|null|
|  ham|Nah I don't think...|null|null|null|
| spam|FreeMsg Hey there...|null|null|null|
|  ham|Even my brother i...|null|null|null|
|  ham|As per your reque...|null|null|null|
| spam|WINNER!! As a val...|null|null|null|
| spam|Had your mobile 1...|null|null|null|
|  ham|I'm gonna be home...|null|null|null|
| spam|SIX chances to wi...|null|null|null|
| spam|URGENT! You have ...|null|null|null|
|  ham|I've been searchi...|null|null|null|
|  ham|I HAVE A DATE ON ...|null|null|null|
| spam|XXXMobileMovieClu...|null|null|null|
|  ham|Oh k...i'm watchi...|null|null|null|
|  ham|Eh u remember how...|null|null|null|
|  ham| Fine if that's t...|null|null|null|
| spam|England v Macedon...|null

Filter the data:

In [38]:
val iniDataRdd = iniData.select("label", "text").rdd

iniDataRdd = MapPartitionsRDD[497] at rdd at <console>:41


MapPartitionsRDD[497] at rdd at <console>:41

In [39]:
iniDataRdd.take(1)

0,1
ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."


In [40]:
iniDataRdd.take(1)(0)(0)

ham

In [41]:
iniDataRdd.count()

5574


In [43]:
val iniDataRddFilter = iniDataRdd.filter(row => (row(0), row(1)) match {
    case (key: String, value: String) => true
    case _ => false
})

iniDataRddFilter = MapPartitionsRDD[498] at filter at <console>:43


MapPartitionsRDD[498] at filter at <console>:43

In [44]:
iniDataRddFilter.count()

5573

Vectorize data:

In [45]:
val textRdd = iniDataRddFilter.map(row => row(1))

textRdd = MapPartitionsRDD[499] at map at <console>:45


MapPartitionsRDD[499] at map at <console>:45

In [46]:
val tf = new HashingTF(1000)
val tfVectors = textRdd.map(x => tf.transform(x.toString.split(" ")))
val idf = new IDF()
val idfModel = idf.fit(tfVectors)

tf = org.apache.spark.mllib.feature.HashingTF@67f1b487
tfVectors = MapPartitionsRDD[500] at map at <console>:57
idf = org.apache.spark.mllib.feature.IDF@5ae73b77
idfModel = org.apache.spark.mllib.feature.IDFModel@3e967f1b


org.apache.spark.mllib.feature.IDFModel@3e967f1b

In [47]:
val spamText = iniDataRddFilter.filter(_(0) == "spam").map(_(1))

spamText = MapPartitionsRDD[503] at map at <console>:45


MapPartitionsRDD[503] at map at <console>:45

In [48]:
spamText.count()

747

In [49]:
spamText.take(3)

[Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's, FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv, WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.]

In [50]:
val genText = iniDataRddFilter.filter(_(0) == "ham").map(_(1))

genText = MapPartitionsRDD[505] at map at <console>:45


MapPartitionsRDD[505] at map at <console>:45

In [51]:
genText.count()

4825

In [52]:
genText.take(3)

[Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..., Ok lar... Joking wif u oni..., U dun say so early hor... U c already then say...]

In [53]:
val tf2 = tf
val tfVectors = textRdd.map(x => tf2.transform(x.toString.split(" ")))

tf2 = org.apache.spark.mllib.feature.HashingTF@67f1b487
tfVectors = MapPartitionsRDD[506] at map at <console>:50


MapPartitionsRDD[506] at map at <console>:50

In [54]:
val tfSpam = tf
val idfModelSpam = idfModel
val spamVectors = spamText.map(x => tfSpam.transform(x.toString.split(" ")))
val spamIdf = spamVectors.map(x => idfModelSpam.transform(x))

tfSpam = org.apache.spark.mllib.feature.HashingTF@67f1b487
idfModelSpam = org.apache.spark.mllib.feature.IDFModel@3e967f1b
spamVectors = MapPartitionsRDD[507] at map at <console>:56
spamIdf = MapPartitionsRDD[508] at map at <console>:57


MapPartitionsRDD[508] at map at <console>:57

In [55]:
spamIdf.take(1)

[(1000,[30,33,35,72,128,140,166,170,388,409,445,468,508,634,667,670,685,692,716,755,784,880,887,989],[5.3300313420375645,5.917818006939683,4.498733822996802,4.137231838309754,2.8732955692162605,4.713845202613747,4.406360502865787,1.4846230856914024,3.5413519540206178,4.531523645819793,2.0151721633241344,4.2951348677555625,4.600516517306744,5.191881003556747,5.160132305242167,3.3894262452119444,4.269159381352302,4.498733822996802,4.421175588650928,7.3400823008812655,5.3300313420375645,4.451480938146257,10.14104029310496,2.514400868539215])]

In [56]:
val tfGen = tf
val idfModelGen = idfModel
val genVectors = genText.map(x => tfGen.transform(x.toString.split(" ")))
val genIdf = genVectors.map(x => idfModelGen.transform(x))

tfGen = org.apache.spark.mllib.feature.HashingTF@67f1b487
idfModelGen = org.apache.spark.mllib.feature.IDFModel@3e967f1b
genVectors = MapPartitionsRDD[509] at map at <console>:56
genIdf = MapPartitionsRDD[510] at map at <console>:57


MapPartitionsRDD[510] at map at <console>:57

In [57]:
genIdf.take(1)

[(1000,[7,42,150,165,258,260,360,362,445,647,655,687,744,745,785,831,854,878,899,966],[3.2413731452528047,5.129360646575413,4.962306561912247,5.447814377693948,4.7972268115527985,4.020698022053802,5.099507683425732,4.531523645819793,2.0151721633241344,5.5348257546835775,3.8055866424368565,5.25857237805542,3.846744714930364,4.841678574123632,5.681429228875453,3.4001215343286924,5.917818006939683,3.698614522884689,3.4001215343286924,2.8177257180614497])]

In [58]:
val spamPoints = spamIdf.map(x => LabeledPoint(1, x))
val genPoints = genIdf.map(x => LabeledPoint(0, x))

spamPoints = MapPartitionsRDD[511] at map at <console>:66
genPoints = MapPartitionsRDD[512] at map at <console>:67


MapPartitionsRDD[512] at map at <console>:67

In [59]:
spamPoints.take(1)

[(1.0,(1000,[30,33,35,72,128,140,166,170,388,409,445,468,508,634,667,670,685,692,716,755,784,880,887,989],[5.3300313420375645,5.917818006939683,4.498733822996802,4.137231838309754,2.8732955692162605,4.713845202613747,4.406360502865787,1.4846230856914024,3.5413519540206178,4.531523645819793,2.0151721633241344,4.2951348677555625,4.600516517306744,5.191881003556747,5.160132305242167,3.3894262452119444,4.269159381352302,4.498733822996802,4.421175588650928,7.3400823008812655,5.3300313420375645,4.451480938146257,10.14104029310496,2.514400868539215]))]

In [60]:
genPoints.take(1)

[(0.0,(1000,[7,42,150,165,258,260,360,362,445,647,655,687,744,745,785,831,854,878,899,966],[3.2413731452528047,5.129360646575413,4.962306561912247,5.447814377693948,4.7972268115527985,4.020698022053802,5.099507683425732,4.531523645819793,2.0151721633241344,5.5348257546835775,3.8055866424368565,5.25857237805542,3.846744714930364,4.841678574123632,5.681429228875453,3.4001215343286924,5.917818006939683,3.698614522884689,3.4001215343286924,2.8177257180614497]))]

In [61]:
val mlDataIni = spamPoints.union(genPoints)

mlDataIni = UnionRDD[513] at union at <console>:69


UnionRDD[513] at union at <console>:69

In [62]:
val randGenerator = new Random()

// Shuffle: attach a random integer key to every row, sort by that key, then drop the key
val mlData = mlDataIni.map(row => (randGenerator.nextInt(100), row)).sortByKey().map(_._2)

randGenerator = java.util.Random@52f20a41
mlData = MapPartitionsRDD[518] at map at <console>:76


MapPartitionsRDD[518] at map at <console>:76

In [63]:
mlData.map(_.label).take(10)

[1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

In [64]:
val mlDataTrainTest = mlData.randomSplit(weights = Array(0.8, 0.2))

mlDataTrainTest = Array(MapPartitionsRDD[520] at randomSplit at <console>:74, MapPartitionsRDD[521] at randomSplit at <console>:74)


[MapPartitionsRDD[520] at randomSplit at <console>:74, MapPartitionsRDD[521] at randomSplit at <console>:74]

In [65]:
val mlDataTrain = mlDataTrainTest(0)
val mlDataTest = mlDataTrainTest(1)

mlDataTrain = MapPartitionsRDD[520] at randomSplit at <console>:74
mlDataTest = MapPartitionsRDD[521] at randomSplit at <console>:74


MapPartitionsRDD[521] at randomSplit at <console>:74

In [66]:
mlDataTrain.cache()
mlDataTest.cache()

MapPartitionsRDD[521] at randomSplit at <console>:74

In [67]:
mlDataTrain.count()

4469

In [68]:
mlDataTest.count()

1103

In [69]:
mlDataTest.take(1)(0).features

(1000,[12,184,410,437],[5.3300313420375645,4.498733822996802,4.962306561912247,3.9163380067295592])

#### Logistic Regression

In [70]:
val lrModel = new LogisticRegressionWithSGD().run(mlDataTrain)

lrModel = org.apache.spark.mllib.classification.LogisticRegressionModel: intercept = 0.0, numFeatures = 1000, numClasses = 2, threshold = 0.5




org.apache.spark.mllib.classification.LogisticRegressionModel: intercept = 0.0, numFeatures = 1000, numClasses = 2, threshold = 0.5

In [71]:
for(data <- mlDataTest.take(10)){
    val pred = lrModel.predict(data.features)
    println("Actual label: " + data.label + "; Prediction: " + pred)
}

Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 1.0; Prediction: 1.0
Actual label: 1.0; Prediction: 1.0
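
The model summary above reports `threshold = 0.5`: a binary logistic regression model turns the linear margin `w.x + intercept` into a score with the sigmoid function and predicts 1.0 when the score exceeds the threshold. The sketch below illustrates that rule in plain Scala. The weights and feature values are hypothetical, and `predictLabel` is not an MLlib function.

```scala
def sigmoid(margin: Double): Double = 1.0 / (1.0 + math.exp(-margin))

// Predict 1.0 when the sigmoid of the margin exceeds the threshold, 0.0 otherwise.
def predictLabel(weights: Array[Double], intercept: Double,
                 features: Array[Double], threshold: Double = 0.5): Double = {
  val margin = weights.zip(features).map { case (w, x) => w * x }.sum + intercept
  if (sigmoid(margin) > threshold) 1.0 else 0.0
}

// Hypothetical weights and feature values, chosen only for illustration
val toyWeights = Array(1.5, -2.0)
val spamLike = predictLabel(toyWeights, intercept = 0.0, features = Array(2.0, 0.5)) // margin = 2.0
val hamLike = predictLabel(toyWeights, intercept = 0.0, features = Array(0.5, 2.0))  // margin = -3.25
```

Raising the threshold makes the model more conservative about predicting the positive (spam) class.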


#### Support Vector Machines

In [72]:
val svmModel = new SVMWithSGD().run(mlDataTrain)

svmModel = org.apache.spark.mllib.classification.SVMModel: intercept = 0.0, numFeatures = 1000, numClasses = 2, threshold = 0.0


org.apache.spark.mllib.classification.SVMModel: intercept = 0.0, numFeatures = 1000, numClasses = 2, threshold = 0.0

In [73]:
for(data <- mlDataTest.take(10)){
    val pred = svmModel.predict(data.features)
    println("Actual label: " + data.label + "; Prediction: " + pred)
}

Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 1.0; Prediction: 1.0
Actual label: 1.0; Prediction: 1.0


#### Naive Bayes

In [74]:
val nbModel = new NaiveBayes().run(mlDataTrain)

nbModel = org.apache.spark.mllib.classification.NaiveBayesModel@5750a5c8


org.apache.spark.mllib.classification.NaiveBayesModel@5750a5c8

In [75]:
for(data <- mlDataTest.take(10)){
    val pred = nbModel.predict(data.features)
    println("Actual label: " + data.label + "; Prediction: " + pred)
}

Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 1.0; Prediction: 1.0
Actual label: 1.0; Prediction: 1.0


#### Decision Trees

In [76]:
val numClasses = 2
val categoricalFeaturesInfo=Map[Int, Int]()
val impurity="gini"
val maxDepth=15
val maxBins=64


val treeModel = DecisionTree.trainClassifier(input = mlDataTrain, numClasses = numClasses, 
                                             categoricalFeaturesInfo = categoricalFeaturesInfo,
                                             impurity = impurity, maxDepth = maxDepth,
                                             maxBins = maxBins)

numClasses = 2
categoricalFeaturesInfo = Map()
impurity = gini
maxDepth = 15
maxBins = 64
treeModel = DecisionTreeModel classifier of depth 15 with 285 nodes


DecisionTreeModel classifier of depth 15 with 285 nodes

In [77]:
for(data <- mlDataTest.take(10)){
    val pred = treeModel.predict(data.features)
    println("Actual label: " + data.label + "; Prediction: " + pred)
}

Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 1.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 1.0; Prediction: 1.0
Actual label: 1.0; Prediction: 1.0
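
Printing ten predictions gives a first impression, but a single number is easier to compare across models. The following sketch computes the accuracy (fraction of correct predictions) over the ten pairs printed above. The helper `accuracy` is hypothetical, and ten points are far too few for a reliable estimate.

```scala
// Fraction of (actual, predicted) pairs where both values agree.
def accuracy(labelsAndPredictions: Seq[(Double, Double)]): Double =
  labelsAndPredictions.count { case (label, pred) => label == pred }.toDouble /
    labelsAndPredictions.size

// The ten (actual, predicted) pairs printed by the decision tree cell above
val treeSample = Seq(
  (0.0, 0.0), (0.0, 0.0), (0.0, 1.0), (0.0, 0.0), (0.0, 0.0),
  (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0))

val treeSampleAccuracy = accuracy(treeSample) // 9 of the 10 pairs agree
```

The same idea applies to the whole test set by pairing each label in `mlDataTest` with the corresponding prediction.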


#### Random Forest

In [78]:
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 15
val maxBins = 64
val featureSubsetStrategy = "auto"
val numTrees = 10

val forestModel = RandomForest.trainClassifier(input = mlDataTrain, numClasses = numClasses, 
                                             categoricalFeaturesInfo = categoricalFeaturesInfo, 
                                             impurity = impurity, maxDepth = maxDepth, 
                                             maxBins = maxBins, numTrees = numTrees,
                                             featureSubsetStrategy = featureSubsetStrategy)

numClasses = 2
categoricalFeaturesInfo = Map()
impurity = gini
maxDepth = 15
maxBins = 64
featureSubsetStrategy = auto
numTrees = 10
forestModel = 


TreeEnsembleModel classifier with 10 trees


In [79]:
for(data <- mlDataTest.take(10)){
    val pred = forestModel.predict(data.features)
    println("Actual label: " + data.label + "; Prediction: " + pred)
}

Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 1.0; Prediction: 1.0
Actual label: 1.0; Prediction: 1.0


### Machine Learning: Clustering

In this section, we will explore the `K-means` algorithm, which is the main clustering algorithm included in MLlib.

Here, we will study the previous spam classification problem. We will cluster our messages into two groups, and then we will count the number of points that fall into each group.

In [80]:
import org.apache.spark.mllib.clustering.KMeans

In [81]:
val clusterData = mlData.map(_.features)
clusterData.cache()

clusterData = MapPartitionsRDD[1051] at map at <console>:75


MapPartitionsRDD[1051] at map at <console>:75

In [82]:
val clusters = KMeans.train(clusterData, 2, maxIterations=1700, initializationMode="random")

clusters = org.apache.spark.mllib.clustering.KMeansModel@214ab2b9


org.apache.spark.mllib.clustering.KMeansModel@214ab2b9

In [83]:
val clustersModel = clusters
val predictions = clusterData.map(x => clustersModel.predict(x))

clustersModel = org.apache.spark.mllib.clustering.KMeansModel@214ab2b9
predictions = MapPartitionsRDD[1062] at map at <console>:80


MapPartitionsRDD[1062] at map at <console>:80

In [84]:
predictions.countByValue()

Map(0 -> 5571, 1 -> 1)
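
The counts above put all but one message into the same cluster, so this particular run does not separate the messages into two meaningful groups. To make the `predict` and `countByValue` steps concrete, here is a minimal plain-Scala sketch of the assignment step: each point goes to the closest centre, and the points per cluster are then counted. The centres and points are made-up 2-D values, and `sqDist` and `nearest` are hypothetical helper names, not MLlib functions.

```scala
// Squared Euclidean distance between two points of equal dimension.
def sqDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Index of the closest centre: the assignment a K-means model makes
// when asked to predict the cluster of a point.
def nearest(centres: Seq[Array[Double]], p: Array[Double]): Int =
  centres.indices.minBy(i => sqDist(centres(i), p))

// Hypothetical 2-D data: three points near the origin, one far away.
val centres = Seq(Array(0.0, 0.0), Array(10.0, 10.0))
val points = Seq(Array(0.1, 0.2), Array(-0.3, 0.1), Array(0.0, 0.4), Array(9.5, 10.2))

// Same idea as predictions.countByValue(): cluster index -> number of points.
val counts: Map[Int, Int] =
  points.map(p => nearest(centres, p)).groupBy(identity).map { case (k, v) => k -> v.size }

println(counts)
```

A single far-away point is enough to claim a whole cluster in this toy data, which is one possible (unverified) explanation for a split like the one above.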

### Collaborative Filtering and Recommendation: Alternating Least Squares

Now, we will explore the `Alternating Least Squares` (ALS) algorithm, which is widely used for collaborative filtering problems.

In [85]:
import org.apache.spark.mllib.recommendation.{ALS, Rating}

Load and prepare the data

In [86]:
val dataAls = sc.textFile("../data/als/test.data")
val ratings = dataAls.map(_.split(',')).map(l => Rating(l(0).toInt, l(1).toInt, l(2).toDouble))

dataAls = ../data/als/test.data MapPartitionsRDD[1067] at textFile at <console>:41
ratings = MapPartitionsRDD[1069] at map at <console>:42


MapPartitionsRDD[1069] at map at <console>:42

In [87]:
ratings.take(1)

[Rating(1,1,5.0)]

Build a recommendation model using ALS:

In [88]:
val rank = 10
val numIterations = 10
val alsModel = ALS.train(ratings, rank, numIterations)

rank = 10
numIterations = 10
alsModel = org.apache.spark.mllib.recommendation.MatrixFactorizationModel@e5f8e32


org.apache.spark.mllib.recommendation.MatrixFactorizationModel@e5f8e32

Now we can perform some predictions:

In [89]:
val testData = ratings.map(p => (p.user, p.product))

testData = MapPartitionsRDD[1277] at map at <console>:44


MapPartitionsRDD[1277] at map at <console>:44

In [90]:
testData.take(2)

[(1,1), (1,2)]

In [91]:
val alsPredictions = alsModel.predict(testData)

alsPredictions = MapPartitionsRDD[1286] at map at MatrixFactorizationModel.scala:140


MapPartitionsRDD[1286] at map at MatrixFactorizationModel.scala:140

In [92]:
alsPredictions.take(2)

[Rating(1,1,4.995327538757314), Rating(1,2,1.0020486223725058)]
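
ALS factorizes the rating matrix into one latent vector per user and one per product, and a predicted rating is the dot product of the two. The sketch below illustrates that step in plain Scala. The factor values are made up and use rank 3 for brevity; `dot` is a hypothetical helper, not an MLlib function.

```scala
// Hypothetical rank-3 latent factors for one user and one product.
// Real factors are learned by ALS; these numbers are made up.
val userFactors: Array[Double] = Array(1.0, 0.5, 2.0)
val productFactors: Array[Double] = Array(2.0, 1.0, 1.0)

// Predicted rating = dot product of the two factor vectors.
def dot(u: Array[Double], p: Array[Double]): Double =
  u.zip(p).map { case (a, b) => a * b }.sum

val predicted = dot(userFactors, productFactors)
println(s"Predicted rating: $predicted")
```

In the model trained above the rank is 10, so each vector would have ten entries. The learned factors are exposed on the model as `alsModel.userFeatures` and `alsModel.productFeatures`.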

### Dimensionality Reduction

In this section, we will look at the two main dimensionality reduction techniques included in MLlib:

    * Principal Component Analysis
    * Singular Value Decomposition
    
    
We will reuse the data from the Clustering section and also train a K-means model on the "reduced" data.

#### Principal Component Analysis

In [93]:
clusterData.take(2)

[(1000,[36,146,184,387,450,508,511,534,581,620,661,691,743,769,813,852,948,999],[2.238988888679249,2.6520585961726324,4.498733822996802,4.482733481650361,3.4610822341183796,4.600516517306744,3.855183583576229,2.7481324262622544,5.191881003556747,4.988282048315508,5.25857237805542,4.2820627861882095,5.447814377693948,5.224670826379739,4.406360502865787,3.7895863010904156,5.25857237805542,5.406992383173693]), (1000,[36,73,78,146,167,170,231,263,343,388,425,431,447,517,525,526,596,660,704,803,831,903,951],[2.238988888679249,7.076543745619019,4.2438415733680115,5.304117192345265,4.207027600245295,1.4846230856914024,4.2951348677555625,2.220639750011052,2.3232492322969884,2.360901302680412,2.9995618694855373,4.391761703444634,3.6913942749112016,4.207027600245295,3.5632731750149267,2.117099071070212,3.6286559342777784,4.391761703444634,3.6422615863335572,4.051157229538511,3.4001215343286924,3.3376011773473584,5.099507683425732])]

In [94]:
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix

In [95]:
val mat = new RowMatrix(clusterData)

mat = org.apache.spark.mllib.linalg.distributed.RowMatrix@4a9ea0b8


org.apache.spark.mllib.linalg.distributed.RowMatrix@4a9ea0b8

In [96]:
val pc = mat.computePrincipalComponents(2)

pc = 


-0.004459599011320296   0.0041608...


-0.006639428117164253   0.0332054310436396      
-0.0013645425248541843  0.0035456557549498058   
-2.5710328882322386E-4  0.007015531669825936    
-0.004463687500827013   0.0013951563429472902   
-0.0023527218330895924  0.024390934315625543    
-0.001222962075320043   0.007885626529188996    
-0.004632324946425941   0.007985071258782759    
-0.006097919717778844   0.013917509155318248    
-0.0015643671662767245  0.009363235968379456    
-4.8057066882834816E-4  0.004764736689663678    
-6.90804677118604E-4    0.003604015306000942    
-0.012675350915813977   0.012652788299459542    
5.762153759525019E-4    6.107019012387356E-4    
-0.004924684573814147   0.02397821779717647     
8.751817252586666E-4    -0.0030496075057231545  
-0.03903173161860575    0.011655881469136273    
-0.004459599011320296   0.004160893626195047    
-0.0018887112732935733  0.0062206603715747      
-0.02907981016955361    0.05474416629343554     
-0.016776749654633785   0.0185579592253671      
4.008956560157287E-4

In [97]:
val projectedPca = mat.multiply(pc).rows

projectedPca = MapPartitionsRDD[1289] at mapPartitions at RowMatrix.scala:443


MapPartitionsRDD[1289] at mapPartitions at RowMatrix.scala:443
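
`mat.multiply(pc)` right-multiplies every 1000-feature row by the 1000 x 2 matrix of principal components, so each message ends up with two coordinates. A minimal plain-Scala sketch of that projection for a single row follows. The 3-feature row and the 3 x 2 matrix are made-up values, and `project` is a hypothetical helper name.

```scala
// Multiply a 1 x n row vector by an n x k matrix stored as n rows of length k.
def project(row: Array[Double], pc: Array[Array[Double]]): Array[Double] = {
  val k = pc.head.length
  Array.tabulate(k) { j =>
    row.indices.map(i => row(i) * pc(i)(j)).sum
  }
}

// Hypothetical 3-feature row and a 3 x 2 matrix of components.
val row = Array(1.0, 2.0, 3.0)
val components = Array(
  Array(1.0, 0.0),
  Array(0.0, 1.0),
  Array(0.0, 0.0)
)

// The row is reduced from 3 features to 2.
val reduced = project(row, components)
println(reduced.mkString("[", ", ", "]"))
```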

In [98]:
val kmeansModelPca = KMeans.train(projectedPca, 2, maxIterations=1700, initializationMode="random")
val predictionsPca = projectedPca.map(x => kmeansModelPca.predict(x))

kmeansModelPca = org.apache.spark.mllib.clustering.KMeansModel@54e75e68
predictionsPca = MapPartitionsRDD[1333] at map at <console>:87


MapPartitionsRDD[1333] at map at <console>:87

In [99]:
predictionsPca.countByValue()

Map(0 -> 4555, 1 -> 1017)

#### Singular Value Decomposition

In [100]:
val svd = mat.computeSVD(20)

svd = 


-0.03175426599745287    0.01404...


SingularValueDecomposition(null,[230.92399413088708,168.19026124879895,109.84815231736111,108.07534735546739,99.39483397804844,96.92313709383784,92.3007084190019,89.53922163782167,87.39542396024576,84.8202049439851,83.08234143420717,82.7496467971585,81.64185008669489,80.12470389477421,78.32716692786273,76.77514947395973,75.993850263491,74.77200312260355,74.00221055778967,73.58996929973304],-0.02667562165609747    0.010690451459367756    ... (20 total)
-0.009125903178416477   0.002387006735496973    ...
-0.010300577087381604   0.004861417423187044    ...
-0.010356430131513642   -8.15577265844504E-4    ...
-0.03175426599745287    0.014044243389408843    ...
-0.0077633835913056715  0.0033761615810848442   ...
-0.011680558903263657   9.362731433118169E-4    ...
-0.0403065638389083     0.010128847514276111    ...
-0.011525735778459164   0.0047542141156844756   ...
-0.009823169977147276   0.004221567982109323    ...
-0.00731780883433451    0.002576731880106117    ...
-0.015213531341521818   

In [101]:
val projectedSVD = mat.multiply(svd.V).rows

projectedSVD = MapPartitionsRDD[1432] at mapPartitions at RowMatrix.scala:443


MapPartitionsRDD[1432] at mapPartitions at RowMatrix.scala:443
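
To see why multiplying by `svd.V` gives a reduced representation, write $A$ for the $m \times 1000$ row matrix `mat` and $V_k$ for the $1000 \times 20$ matrix `svd.V`. This is a sketch of the standard identity for a truncated decomposition with $k = 20$, not something printed by the notebook:

$$
A \approx U_k \Sigma_k V_k^{\top}, \qquad V_k^{\top} V_k = I_k \quad\Longrightarrow\quad A V_k = U_k \Sigma_k
$$

Each row of $A V_k$ therefore holds the 20 coordinates of a message along the top right singular vectors, scaled by the singular values. The `null` in the `SingularValueDecomposition` output above is the $U$ factor, which `computeSVD` leaves uncomputed unless `computeU = true` is passed.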

In [102]:
val kmeansModelSVD = KMeans.train(projectedSVD, 2, maxIterations=1700, initializationMode="random")
val predictionsSVD = projectedSVD.map(x => kmeansModelSVD.predict(x))

kmeansModelSVD = org.apache.spark.mllib.clustering.KMeansModel@504d9f9d
predictionsSVD = MapPartitionsRDD[1470] at map at <console>:87


MapPartitionsRDD[1470] at map at <console>:87

In [103]:
predictionsSVD.countByValue()

Map(0 -> 1577, 1 -> 3995)

### Model Evaluation

MLlib can compute several evaluation metrics for trained models automatically. Among the available evaluators, here we use `BinaryClassificationMetrics` to evaluate the logistic regression (LR) model from the spam classification section.

In [104]:
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

In [105]:
mlDataTrain.take(1)

[(1.0,(1000,[36,146,184,387,450,508,511,534,581,620,661,691,743,769,813,852,948,999],[2.238988888679249,2.6520585961726324,4.498733822996802,4.482733481650361,3.4610822341183796,4.600516517306744,3.855183583576229,2.7481324262622544,5.191881003556747,4.988282048315508,5.25857237805542,4.2820627861882095,5.447814377693948,5.224670826379739,4.406360502865787,3.7895863010904156,5.25857237805542,5.406992383173693]))]

In [106]:
val lrModelEval = lrModel
val predLabelLr = mlDataTest.map(lpoint => (lrModelEval.predict(lpoint.features), lpoint.label))
val metricsLr = new BinaryClassificationMetrics(predLabelLr)

lrModelEval = org.apache.spark.mllib.classification.LogisticRegressionModel: intercept = 0.0, numFeatures = 1000, numClasses = 2, threshold = 0.5
predLabelLr = MapPartitionsRDD[1474] at map at <console>:87
metricsLr = org.apache.spark.mllib.evaluation.BinaryClassificationMetrics@4044debc


org.apache.spark.mllib.evaluation.BinaryClassificationMetrics@4044debc

In [107]:
println("LR model")
println("Area Under PR: " + metricsLr.areaUnderPR)
println("Area Under ROC: " + metricsLr.areaUnderROC)

LR model
Area Under PR: 0.70969305879444
Area Under ROC: 0.8678751748889835
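
Both areas summarize curves obtained by sweeping a decision threshold over (score, label) pairs. The plain-Scala sketch below computes one point of the precision-recall curve at a given threshold. The scores and labels are made up, and `precisionRecall` is a hypothetical helper, not part of MLlib.

```scala
// Hypothetical (score, label) pairs: score is the model's estimated
// probability of the positive class, label is the ground truth.
val scoreAndLabel: Seq[(Double, Double)] = Seq(
  (0.9, 1.0), (0.8, 1.0), (0.7, 0.0), (0.4, 1.0), (0.2, 0.0), (0.1, 0.0)
)

// Precision and recall when every score >= threshold is called positive.
def precisionRecall(data: Seq[(Double, Double)], threshold: Double): (Double, Double) = {
  val tp = data.count { case (s, l) => s >= threshold && l == 1.0 }
  val fp = data.count { case (s, l) => s >= threshold && l == 0.0 }
  val fn = data.count { case (s, l) => s < threshold && l == 1.0 }
  val precision = if (tp + fp == 0) 1.0 else tp.toDouble / (tp + fp)
  val recall = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
  (precision, recall)
}

val (precision, recall) = precisionRecall(scoreAndLabel, 0.5)
println(s"precision = $precision, recall = $recall")
```

Note that the model printed above has `threshold = 0.5`, so `predict` returns hard 0.0/1.0 labels rather than scores. MLlib's `LogisticRegressionModel` offers `clearThreshold()` to make `predict` return raw scores, which would likely give finer-grained curves than the ones summarized here.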


## Pipeline API

ML Pipelines organize all the tasks of an ML problem (data preparation and model training) into a single workflow. In this section, we will solve the spam classification problem with an ML pipeline, which is made up of a sequence of Transformers and Estimators.
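
The difference between the two kinds of stage can be sketched in plain Scala: a Transformer maps data to data, while an Estimator is first fitted on data and returns a Transformer. The traits and objects below are hypothetical, heavily simplified stand-ins that work on a list of strings. They are not Spark's classes and only illustrate the order in which a pipeline fits its stages.

```scala
// Hypothetical, simplified stand-ins for the Pipeline concepts.
// These are NOT Spark's classes; data is just a list of strings.
type Data = Seq[String]

trait Transformer { def transform(d: Data): Data }
trait Estimator { def fit(d: Data): Transformer }

// A Transformer needs no training: it maps data to data.
object LowerCase extends Transformer {
  def transform(d: Data): Data = d.map(_.toLowerCase)
}

// An Estimator learns something from the data and returns a Transformer.
// Here it learns the most frequent value and then filters it out.
object DropMostFrequent extends Estimator {
  def fit(d: Data): Transformer = {
    val top = d.groupBy(identity).maxBy(_._2.size)._1
    new Transformer { def transform(x: Data): Data = x.filterNot(_ == top) }
  }
}

// Fitting a pipeline: walk the stages in order, fit each Estimator on the
// data produced so far, and pass the transformed data to the next stage.
def fitPipeline(stages: Seq[Either[Transformer, Estimator]], d: Data): Seq[Transformer] = {
  val (fitted, _) = stages.foldLeft((Vector.empty[Transformer], d)) {
    case ((acc, cur), stage) =>
      val t = stage match {
        case Left(tr)   => tr
        case Right(est) => est.fit(cur)
      }
      (acc :+ t, t.transform(cur))
  }
  fitted
}

val raw: Data = Seq("Spam", "ham", "SPAM", "eggs")
val model = fitPipeline(Seq(Left(LowerCase), Right(DropMostFrequent)), raw)

// Applying the fitted pipeline: run every fitted stage in order.
val out = model.foldLeft(raw)((cur, t) => t.transform(cur))
println(out)
```

In the Spark pipeline built in the following cells, `SQLTransformer`, `Tokenizer` and `HashingTF` are Transformers, while `StringIndexer`, `IDF` and `LogisticRegression` are Estimators that are fitted when `fit` is called on the pipeline.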

In [108]:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature
import org.apache.spark.ml.Pipeline

In [109]:
iniData.show()

+-----+--------------------+----+----+----+
|label|                text| _c2| _c3| _c4|
+-----+--------------------+----+----+----+
|  ham|Go until jurong p...|null|null|null|
|  ham|Ok lar... Joking ...|null|null|null|
| spam|Free entry in 2 a...|null|null|null|
|  ham|U dun say so earl...|null|null|null|
|  ham|Nah I don't think...|null|null|null|
| spam|FreeMsg Hey there...|null|null|null|
|  ham|Even my brother i...|null|null|null|
|  ham|As per your reque...|null|null|null|
| spam|WINNER!! As a val...|null|null|null|
| spam|Had your mobile 1...|null|null|null|
|  ham|I'm gonna be home...|null|null|null|
| spam|SIX chances to wi...|null|null|null|
| spam|URGENT! You have ...|null|null|null|
|  ham|I've been searchi...|null|null|null|
|  ham|I HAVE A DATE ON ...|null|null|null|
| spam|XXXMobileMovieClu...|null|null|null|
|  ham|Oh k...i'm watchi...|null|null|null|
|  ham|Eh u remember how...|null|null|null|
|  ham| Fine if that's t...|null|null|null|
| spam|England v Macedon...|null

In [110]:
val sqlSelect = new feature.SQLTransformer().setStatement("SELECT label, text FROM __THIS__")

sqlSelect = sql_477e8cbfde55


sql_477e8cbfde55

In [111]:
val sqlFilter = new feature.SQLTransformer().setStatement("SELECT * from __THIS__ WHERE text is not null AND label is not null")

sqlFilter = sql_7072efbdafb2


sql_7072efbdafb2

In [112]:
val labelIndexer = new feature.StringIndexer().setInputCol("label").setOutputCol("label_num")

labelIndexer = strIdx_38a8df58a677


strIdx_38a8df58a677

In [113]:
val tokenizer = new feature.Tokenizer().setInputCol("text").setOutputCol("text_token")

tokenizer = tok_08996ed607cb


tok_08996ed607cb

In [114]:
val tf = new feature.HashingTF().setNumFeatures(1000).setInputCol("text_token").setOutputCol("text_tf")

tf = hashingTF_9956eff1dc22


hashingTF_9956eff1dc22

In [115]:
val idf = new feature.IDF().setInputCol("text_tf").setOutputCol("features")

idf = idf_8362dd91ab12


idf_8362dd91ab12

In [116]:
val lr = new LogisticRegression().setFeaturesCol("features").setLabelCol("label_num")

lr = logreg_00f550b98a1c


logreg_00f550b98a1c

In [117]:
val mlPipeline = new Pipeline().setStages(Array(sqlSelect, sqlFilter, labelIndexer, tokenizer, tf, idf, lr))

mlPipeline = pipeline_9a454f6620a8


pipeline_9a454f6620a8

In [118]:
val mlPipelineModel = mlPipeline.fit(iniData)

mlPipelineModel = pipeline_9a454f6620a8


pipeline_9a454f6620a8

In [119]:
mlPipelineModel.transform(iniData).show(5)

+-----+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|label|                text|label_num|          text_token|             text_tf|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|  ham|Go until jurong p...|      0.0|[go, until, juron...|(1000,[7,77,150,1...|(1000,[7,77,150,1...|[46.1925496142281...|[1.0,9.5478469531...|       0.0|
|  ham|Ok lar... Joking ...|      0.0|[ok, lar..., joki...|(1000,[20,316,484...|(1000,[20,316,484...|[22.7239392220272...|[0.99999999999972...|       0.0|
| spam|Free entry in 2 a...|      1.0|[free, entry, in,...|(1000,[30,35,73,1...|(1000,[30,35,73,1...|[-49.707099745267...|[5.78975520257700...|       1.0|
|  ham|U dun say so earl...|      0.0|[u, dun, say, so,...|(1000,[57,3