# Machine Learning with MLlib

In this notebook, we review MLlib, Spark's RDD-based machine learning library.

## Data Types

First, we need to understand the main data structures used by MLlib:

    * Vectors
    * Labeled Points
    * Rating
    * Model Classes
    
We will see `Vectors` and `Labeled Points` in more detail.

In [1]:
import org.apache.spark.mllib.linalg.Vectors

`Vector` --> holds the feature values. It can be `dense` or `sparse`.

In [2]:
val vectorDense = Vectors.dense(Array(1.0,1.0,2.0,2.0))

vectorDense = [1.0,1.0,2.0,2.0]

In [3]:
val vectorSparse = Vectors.sparse(4, Array(0, 2), Array(1.0, 2.0))

vectorSparse = (4,[0,2],[1.0,2.0])
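
To make the sparse encoding concrete, the following plain-Scala sketch expands the triple `(size, indices, values)` into the equivalent dense array. `sparseToDense` is a hypothetical helper written for this illustration; it is not part of MLlib.

```scala
// Hypothetical helper (not an MLlib API): expand a sparse triple into a dense array.
def sparseToDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
  require(indices.length == values.length, "indices and values must have the same length")
  val dense = Array.fill(size)(0.0)                            // positions not listed stay at 0.0
  indices.zip(values).foreach { case (i, v) => dense(i) = v }  // write the stored entries
  dense
}

// Same arguments as the Vectors.sparse call above.
val expanded = sparseToDense(4, Array(0, 2), Array(1.0, 2.0))
println(expanded.mkString("[", ",", "]"))  // prints [1.0,0.0,2.0,0.0]
```

Only positions 0 and 2 are stored, which is why the sparse form pays off when most entries are zero.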

`LabeledPoint()` --> holds both the feature values and the label value

In [4]:
import org.apache.spark.mllib.regression.LabeledPoint

In [5]:
val labelPoint = LabeledPoint(1, vectorDense)

labelPoint = (1.0,[1.0,1.0,2.0,2.0])

## Algorithms

In this section, we will review the different algorithms associated with Machine Learning problems. Among others, we can highlight the following families of algorithms:

    * Feature Extraction
    * Statistics
    * Classification and Regression
    * Collaborative Filtering and Recommendation
    * Dimensionality Reduction
    * Model Evaluation

### Feature Extraction

ML algorithms accept only numerical values as input. Here, we discuss some algorithms that help us translate other inputs (such as text or non-scaled numerical vectors) into numerical values that ML algorithms can work with. In particular, we will discuss the following:

    * TF-IDF
    * Scaling
    * Normalization
    * Word2Vec

#### TF-IDF

`TF-IDF` --> Term Frequency - Inverse Document Frequency, useful for converting text input into numerical input

In [6]:
import org.apache.spark.mllib.feature.{HashingTF, IDF}

In [7]:
val sentences = sc.parallelize(Array("hello", "hello how are you", "good bye", "bye"))
val words = sentences.map(_.split(" ").toSeq)
val tf = new HashingTF(100)
val tfVectors = tf.transform(words)

sentences = ParallelCollectionRDD[0] at parallelize at <console>:30
words = MapPartitionsRDD[1] at map at <console>:31
tf = org.apache.spark.mllib.feature.HashingTF@285f22e7
tfVectors = MapPartitionsRDD[2] at map at HashingTF.scala:120

In [8]:
tfVectors.collect()

[(100,[48],[1.0]), (100,[25,37,38,48],[1.0,1.0,1.0,1.0]), (100,[5,68],[1.0,1.0]), (100,[5],[1.0])]
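
Each vector above has size 100 because `HashingTF(100)` maps every term to one of 100 buckets and counts how often each bucket is hit. A minimal plain-Scala sketch of that idea follows. `hashingTfSketch` is a hypothetical helper, and using `String.hashCode` is an assumption made for simplicity: the hash function MLlib actually uses may differ, so the bucket indices are not expected to match the output above.

```scala
// Hypothetical stand-in for the hashing trick; not the MLlib implementation.
def hashingTfSketch(words: Seq[String], numFeatures: Int): Map[Int, Double] = {
  // Map a word to a bucket index in [0, numFeatures).
  def bucket(word: String): Int = {
    val raw = word.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }
  // Count how many words fall into each bucket.
  words.groupBy(bucket).map { case (idx, ws) => idx -> ws.size.toDouble }
}

// "hello" appears twice, so its bucket gets a count of at least 2.0.
val counts = hashingTfSketch(Seq("hello", "how", "are", "you", "hello"), 100)
counts.toSeq.sortBy(_._1).foreach { case (idx, tf) => println(s"bucket $idx -> $tf") }
```

Two different words can land in the same bucket; a larger number of features makes such collisions less likely.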

In [9]:
val idf = new IDF()
val idfModel = idf.fit(tfVectors)
val tfIdfVectors = idfModel.transform(tfVectors)

idf = org.apache.spark.mllib.feature.IDF@78575d5
idfModel = org.apache.spark.mllib.feature.IDFModel@71b66609
tfIdfVectors = MapPartitionsRDD[7] at mapPartitions at IDF.scala:178

In [10]:
tfIdfVectors.collect()

[(100,[48],[0.5108256237659907]), (100,[25,37,38,48],[0.9162907318741551,0.9162907318741551,0.9162907318741551,0.5108256237659907]), (100,[5,68],[0.5108256237659907,0.9162907318741551]), (100,[5],[0.5108256237659907])]
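
The values above are consistent with the smoothed IDF formula given in the MLlib documentation, log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. Every term occurs at most once per sentence here, so each TF-IDF value equals the IDF itself. A small check in plain Scala (`smoothedIdf` is a hypothetical helper name):

```scala
// Smoothed inverse document frequency: log((m + 1) / (df + 1)).
// numDocs = total number of documents, docFreq = documents containing the term.
def smoothedIdf(numDocs: Long, docFreq: Long): Double =
  math.log((numDocs + 1.0) / (docFreq + 1.0))

val idfHello = smoothedIdf(4, 2)  // "hello" occurs in 2 of the 4 sentences
val idfHow   = smoothedIdf(4, 1)  // "how" occurs in 1 of the 4 sentences

println(s"hello: $idfHello, how: $idfHow")  // roughly 0.5108 and 0.9163
```

Terms that appear in more documents get a lower weight, which is why bucket 48 ("hello") scores lower than the buckets for "how", "are" and "you".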

#### Word2Vec

`Word2Vec` --> also useful for transforming text into numerical data

In [11]:
import org.apache.spark.mllib.feature.Word2Vec

In [12]:
val word2vec = new Word2Vec().setMinCount(0)
val word2vecModel = word2vec.fit(words)

word2vec = org.apache.spark.mllib.feature.Word2Vec@3e172326
word2vecModel = org.apache.spark.mllib.feature.Word2VecModel@4f59a871

In [13]:
val word2vecVectors = word2vecModel.transform("hello")

word2vecVectors = [-3.0199125831131823E-5,0.0016926409443840384,-0.004056421108543873,-0.004137029871344566,0.0011226466158404946,0.00416228175163269,-0.004522450268268585,-0.0034207457210868597,6.502352771349251E-4,0.0044151670299470425,0.003331251908093691,0.004456773865967989,-3.881255106534809E-4,-0.001828477019444108,0.0032321717590093613,-9.604980587027967E-4,-0.001065896824002266,3.4843764296965674E-5,-0.0032923424150794744,1.6158685320988297E-4,-0.0041165947914123535,-0.004293373785912991,8.68087459821254E-4,-7.831022376194596E-4,0.0038289502263069153,0.004241717979311943,0.0014030217425897717,0.004157128278166056,-7.13129120413214E-4,0.004436533898115158,-0.00357951526530087,-0.0024177739396691322,0.0020359789486974478,0.0014305984368547797...



#### Scaling

Even when our input data is already numeric, it is sometimes useful to scale it before feeding it to an ML algorithm.

`StandardScaler()` --> to scale numerical data

In [14]:
import org.apache.spark.mllib.feature.StandardScaler

In [15]:
val vectors = Array(Vectors.dense(Array(-2.0, 5.0, 1.0, 4.0)),
                    Vectors.dense(Array(2.0, 0.0, 1.0, 7.2)),
                    Vectors.dense(Array(4.0, 2.0, 0.5, 0.8)))

val vectorsRdd = sc.parallelize(vectors)
val scaler = new StandardScaler(withMean=true, withStd=true)
val model = scaler.fit(vectorsRdd)
val scaledData = model.transform(vectorsRdd)

vectors = Array([-2.0,5.0,1.0,4.0], [2.0,0.0,1.0,7.2], [4.0,2.0,0.5,0.8])
vectorsRdd = ParallelCollectionRDD[20] at parallelize at <console>:36
scaler = org.apache.spark.mllib.feature.StandardScaler@675c2d7c
model = org.apache.spark.mllib.feature.StandardScalerModel@1e06fc10
scaledData = MapPartitionsRDD[25] at map at VectorTransformer.scala:52

In [16]:
scaledData.collect()

[[-1.0910894511799618,1.0596258856520353,0.5773502691896257,0.0], [0.2182178902359924,-0.9271726499455306,0.5773502691896257,1.0], [0.8728715609439696,-0.13245323570650427,-1.1547005383792517,-1.0]]
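
The scaled values above are consistent with subtracting the column mean and dividing by the sample standard deviation (with n - 1 in the denominator). The sketch below reproduces the first and fourth columns in plain Scala; `standardize` is a hypothetical helper, not an MLlib function.

```scala
// Hypothetical helper: standardize one column to zero mean and unit sample standard deviation.
def standardize(column: Seq[Double]): Seq[Double] = {
  val mean = column.sum / column.size
  // Sample variance: squared deviations divided by (n - 1).
  val variance = column.map(x => math.pow(x - mean, 2)).sum / (column.size - 1)
  val std = math.sqrt(variance)
  column.map(x => (x - mean) / std)
}

// First and fourth columns of the three vectors defined above.
val firstColumn  = standardize(Seq(-2.0, 2.0, 4.0))
val fourthColumn = standardize(Seq(4.0, 7.2, 0.8))

println(firstColumn.mkString(", "))   // roughly -1.0911, 0.2182, 0.8729
println(fourthColumn.mkString(", "))  // roughly 0.0, 1.0, -1.0
```

Note that scaling works column by column (per feature), across all vectors in the RDD.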

#### Normalization

As with scaling, it is sometimes very useful to normalize our data. By default, `Normalizer` rescales each vector to unit L2 norm.

In [17]:
import org.apache.spark.mllib.feature.Normalizer

In [18]:
val norm = new Normalizer()
val normData = norm.transform(vectorsRdd)

norm = org.apache.spark.mllib.feature.Normalizer@7a257871
normData = MapPartitionsRDD[26] at map at VectorTransformer.scala:52

In [19]:
normData.collect()

[[-0.29488391230979427,0.7372097807744856,0.14744195615489714,0.5897678246195885], [0.2652790545386455,0.0,0.13263952726932274,0.9550045963391238], [0.8751666735874727,0.43758333679373634,0.10939583419843409,0.17503333471749455]]
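
Unlike scaling, normalization works row by row: each vector is divided by its own norm. The output above is consistent with the L2 norm, as this plain-Scala sketch for the first vector shows (`l2Normalize` is a hypothetical helper, not an MLlib function).

```scala
// Hypothetical helper: divide a vector by its Euclidean (L2) norm.
def l2Normalize(row: Seq[Double]): Seq[Double] = {
  val norm = math.sqrt(row.map(x => x * x).sum)
  if (norm == 0.0) row else row.map(_ / norm)  // leave an all-zero vector unchanged
}

// First vector from the example above; its norm is sqrt(4 + 25 + 1 + 16) = sqrt(46).
val firstRow = l2Normalize(Seq(-2.0, 5.0, 1.0, 4.0))
println(firstRow.mkString(", "))  // roughly -0.2949, 0.7372, 0.1474, 0.5898
```

After normalization every vector has length 1, so only its direction is kept.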

### Statistics

MLlib includes useful functions for calculating summary statistics over numeric RDDs.

In [20]:
import org.apache.spark.mllib.stat.Statistics

#### colStats()

`colStats()` --> calculates column-wise summary statistics over an RDD of vectors

In [21]:
val colStats = Statistics.colStats(vectorsRdd)

colStats = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@57c10720

In [22]:
val colStatsMap = Map("count" -> colStats.count, 
                      "max" -> colStats.max,
                      "mean" -> colStats.mean,
                      "min" -> colStats.min,
                      "normL1" -> colStats.normL1,
                      "normL2" -> colStats.normL2,
                      "numNonzeros" -> colStats.numNonzeros,
                      "variance" -> colStats.variance)

colStatsMap = Map(count -> 3, variance -> [9.333333333333334,6.333333333333333,0.08333333333333333,10.240000000000002], mean -> [1.3333333333333333,2.333333333333333,0.8333333333333334,4.0], numNonzeros -> [3.0,2.0,3.0,3.0], min -> [-2.0,0.0,0.5,0.8], normL1 -> [8.0,7.0,2.5,12.0], normL2 -> [4.898979485566356,5.385164807134504,1.5,8.27526434623088], max -> [4.0,5.0,1.0,7.2])



In [23]:
colStatsMap.foreach{case(key, value) => println(key + ": " + value)}

count: 3
variance: [9.333333333333334,6.333333333333333,0.08333333333333333,10.240000000000002]
mean: [1.3333333333333333,2.333333333333333,0.8333333333333334,4.0]
numNonzeros: [3.0,2.0,3.0,3.0]
min: [-2.0,0.0,0.5,0.8]
normL1: [8.0,7.0,2.5,12.0]
normL2: [4.898979485566356,5.385164807134504,1.5,8.27526434623088]
max: [4.0,5.0,1.0,7.2]
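
To see what each statistic means, we can recompute a few of them by hand for the third column (values 1.0, 1.0 and 0.5). This is a plain-Scala sketch with hypothetical variable names; the variance uses n - 1 in the denominator, which is what the output above is consistent with.

```scala
// Third column of the three example vectors.
val thirdColumn = Seq(1.0, 1.0, 0.5)

val colMean        = thirdColumn.sum / thirdColumn.size
val colNormL1      = thirdColumn.map(x => math.abs(x)).sum       // sum of absolute values
val colNormL2      = math.sqrt(thirdColumn.map(x => x * x).sum)  // Euclidean norm of the column
val colNumNonzeros = thirdColumn.count(_ != 0.0)
// Sample variance: squared deviations from the mean divided by (n - 1).
val colVariance    = thirdColumn.map(x => math.pow(x - colMean, 2)).sum / (thirdColumn.size - 1)

println(s"mean=$colMean normL1=$colNormL1 normL2=$colNormL2 nonzeros=$colNumNonzeros variance=$colVariance")
```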


#### corr()

`corr()` --> calculates the correlation matrix between the columns of an RDD of vectors, or the correlation coefficient between two RDDs of doubles (Pearson by default)

In [24]:
Statistics.corr(vectorsRdd)

1.0                   -0.7370434740955019   -0.7559289460184548  -0.32732683535398843  
-0.7370434740955019   1.0                   0.11470786693528112  -0.39735970711951274  
-0.7559289460184548   0.11470786693528112   1.0                  0.8660254037844397    
-0.32732683535398843  -0.39735970711951274  0.8660254037844397   1.0                   

In [25]:
import org.apache.spark.rdd.RDD

In [26]:
val data1: RDD[Double] = sc.parallelize(Array(1, 2, 3, 4, 5))
val data2: RDD[Double] = sc.parallelize(Array(10, 19, 32, 41, 56))

data1 = ParallelCollectionRDD[39] at parallelize at <console>:35
data2 = ParallelCollectionRDD[40] at parallelize at <console>:36

In [27]:
Statistics.corr(data1, data2)

0.996326893005933
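
The coefficient above can be checked by hand with the usual Pearson formula: the sum of the products of the deviations, divided by the product of the square roots of the summed squared deviations. A plain-Scala sketch follows (`pearsonSketch` is a hypothetical helper, not an MLlib function).

```scala
// Hypothetical helper: Pearson correlation coefficient of two equally long sequences.
def pearsonSketch(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.nonEmpty && xs.size == ys.size, "inputs must be non-empty and equally long")
  val meanX = xs.sum / xs.size
  val meanY = ys.sum / ys.size
  // Sum of products of deviations (unnormalized covariance).
  val coDeviation = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  val spreadX = math.sqrt(xs.map(x => math.pow(x - meanX, 2)).sum)
  val spreadY = math.sqrt(ys.map(y => math.pow(y - meanY, 2)).sum)
  coDeviation / (spreadX * spreadY)
}

// Same values as data1 and data2 above.
val manualCorr = pearsonSketch(Seq(1.0, 2.0, 3.0, 4.0, 5.0), Seq(10.0, 19.0, 32.0, 41.0, 56.0))
println(manualCorr)  // roughly 0.9963
```

A value this close to 1 means the two series grow almost perfectly linearly together.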

#### chiSqTest()

`chiSqTest()` --> runs Pearson's chi-squared independence test for each feature against the label

In [28]:
val labelPointRdd = vectorsRdd.map(x => LabeledPoint(0, x))

labelPointRdd = MapPartitionsRDD[51] at map at <console>:38

In [29]:
val chiSqTest = Statistics.chiSqTest(labelPointRdd)

chiSqTest =
[Chi squared test summary:
method: pearson
degrees of freedom = 0 
statistic = 0.0 
pValue = 1.0 
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 0 
statistic = 0.0 
pValue = 1.0 
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 0 
statistic = 0.0 
pValue = 1.0 
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 0 
statistic = 0.0 
pValue = 1.0 
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent..]

In [30]:
chiSqTest.foreach(x => println("Test value: " + x.pValue))

Test value: 1.0
Test value: 1.0
Test value: 1.0
Test value: 1.0
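
Every test above reports 0 degrees of freedom and a p-value of 1.0 because all points were given the same label (0), so there is nothing for the features to depend on. The plain-Scala sketch below computes the Pearson chi-squared statistic for a contingency table and shows both this degenerate case and a table with two labels. `chiSquaredSketch` is a hypothetical helper and the 2x2 counts are made up for illustration.

```scala
// Hypothetical helper: Pearson chi-squared statistic and degrees of freedom
// for a contingency table (rows = feature values, columns = labels).
def chiSquaredSketch(observed: Array[Array[Double]]): (Double, Int) = {
  val rowSums = observed.map(_.sum)
  val colSums = observed.transpose.map(_.sum)
  val total = rowSums.sum
  val statistic = (for {
    i <- observed.indices
    j <- observed(i).indices
    expected = rowSums(i) * colSums(j) / total  // count expected under independence
    if expected > 0
  } yield math.pow(observed(i)(j) - expected, 2) / expected).sum
  val degreesOfFreedom = (rowSums.length - 1) * (colSums.length - 1)
  (statistic, degreesOfFreedom)
}

// Third feature above: value 1.0 twice and 0.5 once, all with label 0 (a single column).
val (singleLabelStat, singleLabelDof) = chiSquaredSketch(Array(Array(2.0), Array(1.0)))

// Made-up 2x2 table with two labels, for contrast.
val (twoLabelStat, twoLabelDof) = chiSquaredSketch(Array(Array(10.0, 20.0), Array(20.0, 10.0)))

println(s"single label: statistic=$singleLabelStat, dof=$singleLabelDof")
println(s"two labels:   statistic=$twoLabelStat, dof=$twoLabelDof")
```

To get an informative test from `chiSqTest()`, the labeled points need at least two distinct labels.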
