# Machine Learning with MLlib

In this notebook, we will review MLlib, Spark's RDD-based machine learning library.

## Data Types

First, we need to understand the main data structures used by MLlib:

    * Vectors
    * Labeled Points
    * Rating
    * Model Classes
    
We will see `Vectors` and `Labeled Points` in more detail.

In [1]:
import org.apache.spark.mllib.linalg.Vectors

`Vector()` --> holds the feature values. It can be either `dense` or `sparse`.

In [2]:
val vectorDense = Vectors.dense(Array(1.0,1.0,2.0,2.0))

vectorDense = [1.0,1.0,2.0,2.0]



In [3]:
val vectorSparse = Vectors.sparse(4, Array(0, 2), Array(1.0, 2.0))

vectorSparse = (4,[0,2],[1.0,2.0])
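
The sparse form stores only the vector size plus the indices and values of the non-zero entries, so `(4,[0,2],[1.0,2.0])` stands for `[1.0,0.0,2.0,0.0]`. The following plain-Scala sketch expands such a triple by hand; the helper name `toDenseArray` is made up for illustration and is not part of MLlib.

```scala
// Hypothetical helper (not an MLlib API): expand a (size, indices, values)
// sparse triple into a dense array, filling the missing positions with 0.0.
def toDenseArray(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
  require(indices.length == values.length, "indices and values must have the same length")
  val dense = Array.fill(size)(0.0)
  indices.zip(values).foreach { case (i, v) => dense(i) = v }
  dense
}

val expanded = toDenseArray(4, Array(0, 2), Array(1.0, 2.0))
println(expanded.mkString("[", ",", "]"))  // prints [1.0,0.0,2.0,0.0]
```

Sparse vectors are mainly worth using when most entries are zero, as is typical for the text features discussed later.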



`LabeledPoint()` --> holds both the feature values and the label value

In [4]:
import org.apache.spark.mllib.regression.LabeledPoint

In [5]:
val labelPoint = LabeledPoint(1, vectorDense)

labelPoint = (1.0,[1.0,1.0,2.0,2.0])



## Algorithms

In this section, we will review the different algorithms associated with Machine Learning problems. Among others, we can highlight the following families of algorithms:

    * Feature Extraction
    * Statistics
    * Classification and Regression
    * Collaborative Filtering and Recommendation
    * Dimensionality Reduction
    * Model Evaluation

### Feature Extraction

ML algorithms only accept numerical values as inputs. Here, we discuss some algorithms that help us translate other kinds of input (text, non-scaled numerical vectors, etc.) into numerical values that ML algorithms can work with. In particular, we will discuss the following algorithms:

    * TF-IDF
    * Scaling
    * Normalization
    * Word2Vec

#### TF-IDF

`TF-IDF` --> Term Frequency - Inverse Document Frequency, useful for converting text input into numerical input

In [6]:
import org.apache.spark.mllib.feature.{HashingTF, IDF}

In [7]:
val sentences = sc.parallelize(Array("hello", "hello how are you", "good bye", "bye"))
val words = sentences.map(_.split(" ").toSeq)
val tf = new HashingTF(100)
val tfVectors = tf.transform(words)

sentences = ParallelCollectionRDD[0] at parallelize at <console>:30
words = MapPartitionsRDD[1] at map at <console>:31
tf = org.apache.spark.mllib.feature.HashingTF@71a5849e
tfVectors = MapPartitionsRDD[2] at map at HashingTF.scala:120



In [8]:
tfVectors.collect()

[(100,[48],[1.0]), (100,[25,37,38,48],[1.0,1.0,1.0,1.0]), (100,[5,68],[1.0,1.0]), (100,[5],[1.0])]
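
Each vector above has the form `(size, indices, counts)`: `HashingTF(100)` maps every term to one of 100 buckets and counts the occurrences per bucket. As a rough illustration of this hashing trick, here is a plain-Scala sketch. It uses `String.hashCode` as a stand-in hash function, which is an assumption made for simplicity, so its bucket indices will not necessarily match the ones MLlib produced above; `hashingTf` is a hypothetical name.

```scala
// Illustrative sketch only: MLlib's HashingTF uses its own hash function,
// so the bucket indices computed here may differ from the notebook output.
def hashingTf(doc: Seq[String], numFeatures: Int): Map[Int, Double] = {
  // Map a term to a non-negative bucket index in [0, numFeatures).
  def bucket(term: String): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }
  // Count how many terms of the document fall into each bucket.
  doc.groupBy(bucket).map { case (idx, terms) => idx -> terms.size.toDouble }
}

val counts = hashingTf(Seq("hello", "how", "are", "you"), 100)
println(counts)
```

Because different terms can land in the same bucket, a small `numFeatures` increases the chance of collisions; larger values trade memory for fewer collisions.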

In [9]:
val idf = new IDF()
val idfModel = idf.fit(tfVectors)
val tfIdfVectors = idfModel.transform(tfVectors)

idf = org.apache.spark.mllib.feature.IDF@21d1b236
idfModel = org.apache.spark.mllib.feature.IDFModel@4d8bc766
tfIdfVectors = MapPartitionsRDD[7] at mapPartitions at IDF.scala:178



In [10]:
tfIdfVectors.collect()

[(100,[48],[0.5108256237659907]), (100,[25,37,38,48],[0.9162907318741551,0.9162907318741551,0.9162907318741551,0.5108256237659907]), (100,[5,68],[0.5108256237659907,0.9162907318741551]), (100,[5],[0.5108256237659907])]
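
These weights are consistent with the smoothed formula documented for MLlib's `IDF`, idf(t) = ln((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t. Since every term frequency above is 1.0, the TF-IDF value equals the IDF value. A small plain-Scala check follows; the function name `smoothedIdf` is ours, not an MLlib API.

```scala
// Smoothed inverse document frequency: ln((m + 1) / (df + 1)).
def smoothedIdf(numDocs: Long, docFreq: Long): Double =
  math.log((numDocs + 1.0) / (docFreq + 1.0))

// Bucket 48 ("hello") appears in 2 of the 4 sentences.
val idfTwoDocs = smoothedIdf(4, 2)
// Buckets 25, 37, 38 and 68 each appear in 1 of the 4 sentences.
val idfOneDoc = smoothedIdf(4, 1)

println(idfTwoDocs)  // approximately 0.5108
println(idfOneDoc)   // approximately 0.9163
```

Terms that occur in more documents receive a lower weight, which is why `hello` and `bye` score lower than the words that appear only once.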

#### Word2Vec

`Word2Vec` --> also useful to transform text into numerical data

In [11]:
import org.apache.spark.mllib.feature.Word2Vec

In [12]:
val word2vec = new Word2Vec().setMinCount(0)
val word2vecModel = word2vec.fit(words)

word2vec = org.apache.spark.mllib.feature.Word2Vec@576b7754
word2vecModel = org.apache.spark.mllib.feature.Word2VecModel@1e0cdbd1



In [13]:
val word2vecVectors = word2vecModel.transform("hello")

word2vecVectors = [0.0024051424115896225,0.0024265004321932793,0.003939271904528141,0.004567727446556091,-0.0017529428005218506,4.160736862104386E-4,0.0031681915279477835,8.893151534721255E-4,-0.002022825414314866,0.004433310125023127,-0.0030568328220397234,0.003593616420403123,0.0017325482331216335,-0.004822498187422752,-0.002658026060089469,-5.373888416215777E-4,-0.004821146838366985,-0.001790562178939581,-0.00481686694547534,-0.004933829419314861,0.0021309254225343466,-0.0010357925202697515,0.0012177051976323128,7.862550555728376E-4,-0.0033831512555480003,0.0017680389573797584,0.00233650510199368,-0.00357298762537539,-0.0011263885535299778,-4.385605570860207E-4,0.0020018289797008038,0.0033215349540114403,0.004542876500636339,0.0016289004124701023...



#### Scaling

Even when our input data is already numeric, ML algorithms often benefit from scaling it.

`StandardScaler()` --> scales numerical data column by column; `withMean` centers each column and `withStd` divides it by its standard deviation

In [14]:
import org.apache.spark.mllib.feature.StandardScaler

In [15]:
val vectors = Array(Vectors.dense(Array(-2.0, 5.0, 1.0, 4.0)),
                    Vectors.dense(Array(2.0, 0.0, 1.0, 7.2)),
                    Vectors.dense(Array(4.0, 2.0, 0.5, 0.8)))

val vectorsRdd = sc.parallelize(vectors)
val scaler = new StandardScaler(withMean=true, withStd=true)
val model = scaler.fit(vectorsRdd)
val scaledData = model.transform(vectorsRdd)

vectors = Array([-2.0,5.0,1.0,4.0], [2.0,0.0,1.0,7.2], [4.0,2.0,0.5,0.8])
vectorsRdd = ParallelCollectionRDD[20] at parallelize at <console>:36
scaler = org.apache.spark.mllib.feature.StandardScaler@4172ba3a
model = org.apache.spark.mllib.feature.StandardScalerModel@1ec24b33
scaledData = MapPartitionsRDD[25] at map at VectorTransformer.scala:52



In [16]:
scaledData.collect()

[[-1.0910894511799618,1.0596258856520353,0.5773502691896257,0.0], [0.2182178902359923,-0.9271726499455306,0.5773502691896257,1.0], [0.8728715609439694,-0.13245323570650427,-1.1547005383792517,-1.0]]
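
As a check on these numbers, the values above are consistent with standardizing each column $j$ using its mean $\mu_j$ and its sample standard deviation $\sigma_j$ (with $n - 1$ in the denominator):

$$
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},
\qquad
\sigma_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ij} - \mu_j\right)^2}
$$

For the fourth column, with values $4.0$, $7.2$ and $0.8$:

$$
\mu_4 = 4.0,
\qquad
\sigma_4 = \sqrt{\frac{0^2 + 3.2^2 + (-3.2)^2}{2}} = \sqrt{10.24} = 3.2,
\qquad
z_{\cdot 4} = (0,\ 1,\ -1)
$$

which matches the last component of each scaled vector.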

#### Normalization

As with scaling, it is sometimes very useful to normalize our data. `Normalizer()` rescales each vector to unit norm (the L2 norm by default).

In [17]:
import org.apache.spark.mllib.feature.Normalizer

In [18]:
val norm = new Normalizer()
val normData = norm.transform(vectorsRdd)

norm = org.apache.spark.mllib.feature.Normalizer@7f735a13
normData = MapPartitionsRDD[26] at map at VectorTransformer.scala:52



In [19]:
normData.collect()

[[-0.29488391230979427,0.7372097807744856,0.14744195615489714,0.5897678246195885], [0.2652790545386455,0.0,0.13263952726932274,0.9550045963391238], [0.8751666735874727,0.43758333679373634,0.10939583419843409,0.17503333471749455]]
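
Unlike scaling, which works column by column, normalization works row by row: each vector is divided by its own norm. The plain-Scala sketch below reproduces the first output vector, assuming the default L2 norm; `l2Normalize` is a hypothetical helper, not an MLlib function.

```scala
// Divide a vector by its L2 (Euclidean) norm; a zero vector is returned unchanged.
def l2Normalize(v: Array[Double]): Array[Double] = {
  val norm = math.sqrt(v.map(x => x * x).sum)
  if (norm == 0.0) v else v.map(_ / norm)
}

val unit = l2Normalize(Array(-2.0, 5.0, 1.0, 4.0))
println(unit.mkString("[", ",", "]"))

// The squared components of the normalized vector sum to 1 (up to rounding).
val squaredLength = unit.map(x => x * x).sum
println(squaredLength)
```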

### Statistics

MLlib includes utilities to compute the main summary statistics over numeric RDDs.

In [20]:
import org.apache.spark.mllib.stat.Statistics

#### colStats()

`colStats()` --> computes column-wise summary statistics over an RDD of vectors

In [21]:
val colStats = Statistics.colStats(vectorsRdd)

colStats = org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@56711efb



In [22]:
val colStatsMap = Map("count" -> colStats.count, 
                      "max" -> colStats.max,
                      "mean" -> colStats.mean,
                      "min" -> colStats.min,
                      "normL1" -> colStats.normL1,
                      "normL2" -> colStats.normL2,
                      "numNonzeros" -> colStats.numNonzeros,
                      "variance" -> colStats.variance)

colStatsMap = Map(count -> 3, variance -> [9.333333333333334,6.333333333333333,0.08333333333333333,10.240000000000002], mean -> [1.3333333333333333,2.333333333333333,0.8333333333333334,4.0], numNonzeros -> [3.0,2.0,3.0,3.0], min -> [-2.0,0.0,0.5,0.8], normL1 -> [8.0,7.0,2.5,12.0], normL2 -> [4.898979485566356,5.385164807134504,1.5,8.27526434623088], max -> [4.0,5.0,1.0,7.2])



In [23]:
colStatsMap.foreach{case(key, value) => println(key + ": " + value)}

count: 3
variance: [9.333333333333334,6.333333333333333,0.08333333333333333,10.240000000000002]
mean: [1.3333333333333333,2.333333333333333,0.8333333333333334,4.0]
numNonzeros: [3.0,2.0,3.0,3.0]
min: [-2.0,0.0,0.5,0.8]
normL1: [8.0,7.0,2.5,12.0]
normL2: [4.898979485566356,5.385164807134504,1.5,8.27526434623088]
max: [4.0,5.0,1.0,7.2]
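
Each entry in these vectors describes one column of the input. As a sanity check, the plain-Scala sketch below recomputes the statistics for the first column (`-2.0`, `2.0`, `4.0`) by hand, assuming the variance is the unbiased sample variance (division by n - 1), which is what the numbers above suggest.

```scala
// First column of the three input vectors.
val column = Seq(-2.0, 2.0, 4.0)
val n = column.length

val mean = column.sum / n
// Unbiased sample variance: divide by (n - 1).
val variance = column.map(x => (x - mean) * (x - mean)).sum / (n - 1)
// L1 norm: sum of absolute values. L2 norm: square root of the sum of squares.
val normL1 = column.map(x => math.abs(x)).sum
val normL2 = math.sqrt(column.map(x => x * x).sum)

println(s"mean=$mean variance=$variance normL1=$normL1 normL2=$normL2")
```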


#### corr()

`corr()` --> computes the correlation matrix between the columns of an RDD of vectors, or the correlation between two RDDs of doubles

In [24]:
Statistics.corr(vectorsRdd)

1.0                  -0.7370434740955019   -0.755928946018455   -0.3273268353539885   
-0.7370434740955019  1.0                   0.11470786693528112  -0.39735970711951274  
-0.755928946018455   0.11470786693528112   1.0                  0.8660254037844397    
-0.3273268353539885  -0.39735970711951274  0.8660254037844397   1.0                   

In [25]:
import org.apache.spark.rdd.RDD

In [26]:
val data1: RDD[Double] = sc.parallelize(Array(1, 2, 3, 4, 5))
val data2: RDD[Double] = sc.parallelize(Array(10, 19, 32, 41, 56))

data1 = ParallelCollectionRDD[39] at parallelize at <console>:35
data2 = ParallelCollectionRDD[40] at parallelize at <console>:36



In [27]:
Statistics.corr(data1, data2)

0.996326893005933
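
By default `corr()` uses the Pearson correlation coefficient. The plain-Scala sketch below recomputes it for the two series above without Spark; the function name `pearson` is chosen here for illustration.

```scala
// Pearson correlation: covariance divided by the product of the standard deviations.
// The 1/n (or 1/(n-1)) factors cancel, so plain sums are enough.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.length == ys.length && xs.nonEmpty, "inputs must be non-empty and of equal length")
  val meanX = xs.sum / xs.length
  val meanY = ys.sum / ys.length
  val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  val devX = math.sqrt(xs.map(x => (x - meanX) * (x - meanX)).sum)
  val devY = math.sqrt(ys.map(y => (y - meanY) * (y - meanY)).sum)
  cov / (devX * devY)
}

val r = pearson(Seq(1.0, 2.0, 3.0, 4.0, 5.0), Seq(10.0, 19.0, 32.0, 41.0, 56.0))
println(r)  // approximately 0.9963
```

A value this close to 1 indicates an almost perfectly linear, increasing relationship between the two series.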

#### chiSqTest()

`chiSqTest()` --> runs Pearson's chi-squared test of independence between each feature and the label

In [28]:
val labelPointRdd = vectorsRdd.map(x => LabeledPoint(0, x))

labelPointRdd = MapPartitionsRDD[51] at map at <console>:38



In [29]:
val chiSqTest = Statistics.chiSqTest(labelPointRdd)



[Chi squared test summary:
method: pearson
degrees of freedom = 0 
statistic = 0.0 
pValue = 1.0 
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 0 
statistic = 0.0 
pValue = 1.0 
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 0 
statistic = 0.0 
pValue = 1.0 
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 0 
statistic = 0.0 
pValue = 1.0 
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent..]

In [30]:
chiSqTest.foreach(x => println("Test value: " + x.pValue))

Test value: 1.0
Test value: 1.0
Test value: 1.0
Test value: 1.0
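
The degenerate result (zero degrees of freedom, statistic 0, p-value 1) is expected here, because every point was given the same label `0`. For a contingency table with $r$ distinct feature values and $c$ distinct label values, the degrees of freedom of Pearson's test are:

$$
\mathrm{df} = (r - 1)(c - 1)
$$

With a single label value, $c = 1$ and therefore $\mathrm{df} = 0$ regardless of $r$, so there is nothing for the test to compare. A more informative run would need at least two distinct labels.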


### Machine Learning: Regression

In this section, we will explore the conventional Linear Regression model.

In [31]:
import java.util.Random
val randGenerator = new Random()
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

randGenerator = java.util.Random@ff982dd



First, we will create training data according to a Linear Regression model with the following weights:

    * Weights: [2.5, 1.25, 0.5, 1]
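
Written out, the labels generated in the next cell follow this model, where $x_1, \dots, x_4$ are the scaled features and the noise term comes from `nextDouble`, that is, it is uniform on $[0, 1)$:

$$
y = 2.5\,x_1 + 1.25\,x_2 + 0.5\,x_3 + x_4 + \varepsilon,
\qquad
\varepsilon \sim \mathcal{U}[0, 1)
$$

Note that this noise is not zero-mean: $\mathbb{E}[\varepsilon] = 0.5$.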

In [32]:
val regFeatures = for(_ <- 1 to 500) yield {for (_ <- 1 to 4) yield randGenerator.nextInt(20)}
val regFeaturesRdd = sc.parallelize(regFeatures).map(x => Vectors.dense(x.toArray.map(_.toDouble)))
val scaler = new StandardScaler()
val regFeaturesScale = scaler.fit(regFeaturesRdd).transform(regFeaturesRdd)
val regData = regFeaturesScale.map(x => LabeledPoint({
    // Label = linear combination of the scaled features plus uniform noise in [0, 1)
    val noise = new Random().nextDouble
    2.5*x(0) + 1.25*x(1) + 0.5*x(2) + x(3) + noise
},x))
regData.take(2)

regFeatures = Vector(Vector(4, 3, 11, 17), Vector(18, 17, 0, 12), Vector(7, 0, 18, 11), Vector(19, 19, 1, 16), Vector(4, 8, 19, 0), Vector(13, 12, 13, 10), Vector(12, 4, 6, 12), Vector(19, 1, 2, 2), Vector(9, 8, 19, 7), Vector(4, 3, 5, 6), Vector(10, 4, 17, 2), Vector(14, 16, 17, 11), Vector(17, 16, 2, 17), Vector(3, 16, 18, 1), Vector(16, 12, 18, 11), Vector(0, 15, 5, 1), Vector(3, 16, 11, 1), Vector(1, 5, 12, 5), Vector(10, 8, 3, 14), Vector(18, 17, 7, 16), Vector(1, 9, 15, 9), Vector(17, 12, 0, 9), Vector(5, 6, 8, 8), Vector(15, 4, 0, 12), Vector(10, 7, 13, 7), Vector(18, 18, 13, 9), Vector(12, 0, 2, 14), Vector(1, 2, 13, 18), Vector(12, 11, 0, 11), Vector(1, 2, 9, 17), Vector(10, 5, 16, 17), Vector(4,...



Once the data has been created, we can train our model:

In [33]:
val numIterations = 10000
val stepSize = 0.1
val miniBatchFraction = 1.0
val lrModel = LinearRegressionWithSGD.train(regData, numIterations = numIterations, 
                                            stepSize = stepSize, miniBatchFraction = miniBatchFraction)

numIterations = 10000
stepSize = 0.1
miniBatchFraction = 1.0
lrModel = org.apache.spark.mllib.regression.LinearRegressionModel: intercept = 0.0, numFeatures = 4





We can now compare the original and computed weights:

In [34]:
println("Computed weights: " + lrModel.weights)
println("Original weights: [2.5, 1.25, 0.5, 1]")

Computed weights: [2.378815686036176,1.286597790305558,0.709914794818028,1.1564210592452264]
Original weights: [2.5, 1.25, 0.5, 1]
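
To quantify the gap, the plain-Scala sketch below computes the absolute error per weight from the two vectors printed above (the values are copied from that output).

```scala
val originalWeights = Array(2.5, 1.25, 0.5, 1.0)
// Values copied from the "Computed weights" line of the notebook output.
val computedWeights = Array(2.378815686036176, 1.286597790305558, 0.709914794818028, 1.1564210592452264)

// Absolute difference between each original weight and its estimate.
val absErrors = originalWeights.zip(computedWeights).map { case (o, c) => math.abs(o - c) }

absErrors.zipWithIndex.foreach { case (e, i) =>
  println(f"weight $i: absolute error = $e%.4f")
}
```

The estimates are close but not exact. One plausible contributor, offered here as an interpretation rather than a verified diagnosis: the trained model reports `intercept = 0.0`, while the noise added to the labels has mean 0.5 and `StandardScaler()` with its default settings does not center the features, so the weights may have to absorb that offset.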
