# Machine Learning with MLlib

In this Notebook, we will review MLlib, Spark's RDD-based Machine Learning library.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MLlib").master("local[*]").getOrCreate()
sc = spark.sparkContext

## Data Types

First, we have to understand the different data structures used by MLlib. In particular, they are:

    * Vectors
    * Labeled Points
    * Rating
    * Model Classes

We will see `Vectors` and `Labeled Points` in more detail.

In [2]:
from pyspark.mllib.linalg import Vectors

`Vectors` --> holds the feature values. A vector can be `dense` or `sparse`.

In [3]:
vector_dense = Vectors.dense([1.0,1.0,2.0,2.0])

In [4]:
vector_sparse_1 = Vectors.sparse(4, {0: 1.0, 2: 2.0})

In [5]:
vector_sparse_2 = Vectors.sparse(4, [0, 2], [1.0, 2.0])
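The dictionary form and the parallel-list form above describe the same vector. As a minimal plain-Python sketch (the helper `sparse_to_dense` is hypothetical and not part of MLlib), both can be expanded into the dense list they stand for:

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) description into a dense list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense

# Dictionary form, mirroring Vectors.sparse(4, {0: 1.0, 2: 2.0})
entries = {0: 1.0, 2: 2.0}
dense_from_dict = sparse_to_dense(4, list(entries.keys()), list(entries.values()))

# Parallel-list form, mirroring Vectors.sparse(4, [0, 2], [1.0, 2.0])
dense_from_lists = sparse_to_dense(4, [0, 2], [1.0, 2.0])

print(dense_from_dict)   # [1.0, 0.0, 2.0, 0.0]
print(dense_from_lists)  # [1.0, 0.0, 2.0, 0.0]
```

Only the non-zero positions are stored in the sparse form, which is why it pays off for vectors that are mostly zeros.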

`LabeledPoint()` --> holds both the feature values and the label value

In [6]:
from pyspark.mllib.regression import LabeledPoint

In [7]:
label_point = LabeledPoint(1, vector_dense)

## Algorithms

In this section, we will review the different algorithms associated with Machine Learning problems. Among others, we could highlight the following families of algorithms:

    * Feature Extraction
    * Statistics
    * Classification and Regression
    * Collaborative Filtering and Recommendation
    * Dimensionality Reduction
    * Model Evaluation

### Feature Extraction

ML algorithms only accept numerical values as inputs. Here, we discuss some algorithms that help us translate other inputs (such as text or non-scaled numerical vectors) into numerical values that ML algorithms can work with. In particular, we will discuss the following algorithms:

    * TF-IDF
    * Scaling
    * Normalization
    * Word2Vec

#### TF-IDF

`TF-IDF` --> Term Frequency - Inverse Document Frequency, useful to convert text inputs into numerical inputs. In MLlib it is built from `HashingTF` and `IDF`.

In [8]:
from pyspark.mllib.feature import HashingTF, IDF

In [9]:
sentences = sc.parallelize(["hello", "hello how are you", "good bye", "bye"])
words = sentences.map(lambda word: word.split(" "))
tf = HashingTF(100)
tf_vectors = tf.transform(words)

In [10]:
tf_vectors.collect()

[SparseVector(100, {45: 1.0}),
 SparseVector(100, {1: 1.0, 21: 1.0, 24: 1.0, 45: 1.0}),
 SparseVector(100, {64: 1.0, 88: 1.0}),
 SparseVector(100, {88: 1.0})]

In [11]:
idf = IDF()
idf_model = idf.fit(tf_vectors)
tf_idf_vectors = idf_model.transform(tf_vectors)

In [12]:
tf_idf_vectors.collect()

[SparseVector(100, {45: 0.5108}),
 SparseVector(100, {1: 0.9163, 21: 0.9163, 24: 0.9163, 45: 0.5108}),
 SparseVector(100, {64: 0.9163, 88: 0.5108}),
 SparseVector(100, {88: 0.5108})]
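The weights printed above are consistent with the smoothed formula `idf = log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents containing the term. The sketch below recomputes them in plain Python. It is a simplification: it keys the weights by word instead of by hash bucket, so it assumes no hash collisions, and the variable names are illustrative only.

```python
import math
from collections import Counter

sentences = ["hello", "hello how are you", "good bye", "bye"]
docs = [sentence.split(" ") for sentence in sentences]
m = len(docs)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency
idf_weights = {term: math.log((m + 1) / (count + 1)) for term, count in doc_freq.items()}

# TF-IDF weight = raw term count in the document * idf weight of the term
tf_idf = [{term: count * idf_weights[term] for term, count in Counter(doc).items()}
          for doc in docs]

print(round(idf_weights["hello"], 4))  # 0.5108
print(round(idf_weights["good"], 4))   # 0.9163
```

"hello" and "bye" appear in two of the four documents and therefore get the lower weight; every other word appears once and gets the higher one.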

#### Word2Vec

`Word2Vec` --> also useful to transform text into numerical data

In [13]:
from pyspark.mllib.feature import Word2Vec

In [14]:
word2vec = Word2Vec().setMinCount(0)
word2vec_model = word2vec.fit(words)

In [15]:
word2vec_vectors = word2vec_model.transform("hello")

In [16]:
word2vec_vectors

DenseVector([-0.0041, -0.0042, 0.0003, -0.0047, -0.0049, -0.0031, 0.0016, -0.0043, 0.0004, 0.0001, 0.001, 0.0001, 0.0044, -0.0018, -0.0035, -0.0047, 0.0037, 0.0006, 0.0029, -0.0016, 0.0003, -0.0047, 0.0039, 0.0041, 0.0025, -0.0047, -0.0018, 0.0021, -0.0003, -0.0013, 0.0025, -0.0012, -0.0009, 0.0006, 0.0034, -0.004, 0.0018, -0.0032, 0.0034, -0.0001, -0.0031, -0.0005, 0.0025, 0.0022, 0.0029, -0.0013, 0.0004, -0.0038, 0.0005, -0.0012, -0.0008, 0.0035, -0.0029, -0.0005, 0.0013, -0.0045, 0.003, 0.0015, -0.0047, -0.0023, -0.0031, -0.0036, 0.0048, -0.0038, -0.0002, 0.0024, -0.0026, 0.005, -0.0019, 0.001, -0.004, -0.0021, 0.0025, -0.0015, -0.0026, 0.0046, -0.0029, 0.0026, -0.0004, -0.0025, 0.0008, -0.0031, 0.0041, 0.0039, -0.0019, -0.0028, -0.0044, -0.004, 0.0034, -0.0014, 0.0048, 0.0044, -0.0022, 0.0049, -0.0015, 0.0021, 0.0046, -0.0019, -0.0036, -0.0034])

#### Scaling

Even when our input data is already numeric, scaling it is often useful for the ML algorithms.

`StandardScaler()` --> to scale numerical data

In [17]:
from pyspark.mllib.feature import StandardScaler

In [18]:
vectors = [Vectors.dense([-2.0, 5.0, 1.0, 4.0]),
           Vectors.dense([2.0, 0.0, 1.0, 7.2]),
           Vectors.dense([4.0, 2.0, 0.5, 0.8])]

vectors_rdd = sc.parallelize(vectors)
scaler = StandardScaler(withMean=True, withStd=True)
model = scaler.fit(vectors_rdd)
scaled_data = model.transform(vectors_rdd)

In [19]:
scaled_data.collect()

[DenseVector([-1.0911, 1.0596, 0.5774, 0.0]),
 DenseVector([0.2182, -0.9272, 0.5774, 1.0]),
 DenseVector([0.8729, -0.1325, -1.1547, -1.0])]
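The scaled values above can be reproduced by hand: each column has its mean subtracted and is then divided by its standard deviation. The plain-Python sketch below assumes the sample standard deviation (dividing by n - 1), which is consistent with the output shown; the variable names are illustrative.

```python
from statistics import mean, stdev

rows = [[-2.0, 5.0, 1.0, 4.0],
        [2.0, 0.0, 1.0, 7.2],
        [4.0, 2.0, 0.5, 0.8]]

# Per-column statistics
columns = list(zip(*rows))
col_means = [mean(col) for col in columns]
col_stds = [stdev(col) for col in columns]  # sample standard deviation (n - 1)

# Center each value on its column mean, then divide by the column std
scaled_rows = [[(x - mu) / sd for x, mu, sd in zip(row, col_means, col_stds)]
               for row in rows]

for row in scaled_rows:
    print([round(x, 4) for x in row])
```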

#### Normalization

As with scaling, it is sometimes very useful to normalize our data.

In [20]:
from pyspark.mllib.feature import Normalizer

In [21]:
norm = Normalizer()
norm_data = norm.transform(vectors_rdd)

In [22]:
norm_data.collect()

[DenseVector([-0.2949, 0.7372, 0.1474, 0.5898]),
 DenseVector([0.2653, 0.0, 0.1326, 0.955]),
 DenseVector([0.8752, 0.4376, 0.1094, 0.175])]
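With its default setting, `Normalizer` rescales each vector to unit L2 norm, so every row is divided by the square root of the sum of its squared entries. A minimal plain-Python sketch of the same computation (the helper `l2_normalize` is hypothetical):

```python
import math

rows = [[-2.0, 5.0, 1.0, 4.0],
        [2.0, 0.0, 1.0, 7.2],
        [4.0, 2.0, 0.5, 0.8]]

def l2_normalize(row):
    """Divide a row by its Euclidean length; an all-zero row is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm > 0 else list(row)

normalized_rows = [l2_normalize(row) for row in rows]

for row in normalized_rows:
    print([round(x, 4) for x in row])
```

Unlike scaling, which works column by column, normalization works row by row, so it needs no fitting step.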

### Statistics

MLlib includes utilities to calculate the main summary statistics over numeric RDDs.

In [23]:
from pyspark.mllib.stat import Statistics

#### colStats()

`colStats()` --> to calculate statistics over an RDD of numerical values

In [24]:
col_stats = Statistics.colStats(vectors_rdd)

In [25]:
col_stats_dict = {
    "count": col_stats.count(),
    "max": col_stats.max(),
    "mean": col_stats.mean(),
    "min": col_stats.min(),
    "normL1": col_stats.normL1(),
    "normL2": col_stats.normL2(),
    "numNonzeros": col_stats.numNonzeros(),
    "variance": col_stats.variance()
}

In [26]:
for key, value in col_stats_dict.items():
    print("{0}: {1}".format(key, value))

count: 3
max: [ 4.   5.   1.   7.2]
mean: [ 1.33333333  2.33333333  0.83333333  4.        ]
min: [-2.   0.   0.5  0.8]
normL1: [  8.    7.    2.5  12. ]
normL2: [ 4.89897949  5.38516481  1.5         8.27526435]
numNonzeros: [ 3.  2.  3.  3.]
variance: [  9.33333333   6.33333333   0.08333333  10.24      ]


#### corr()

`corr()` --> to calculate the correlation matrix between the columns of one RDD or between two RDDs

In [27]:
Statistics.corr(vectors_rdd)

array([[ 1.        , -0.73704347, -0.75592895, -0.32732684],
       [-0.73704347,  1.        ,  0.11470787, -0.39735971],
       [-0.75592895,  0.11470787,  1.        ,  0.8660254 ],
       [-0.32732684, -0.39735971,  0.8660254 ,  1.        ]])

In [28]:
data1 = sc.parallelize([1, 2, 3, 4, 5])
data2 = sc.parallelize([10, 19, 32, 41, 56])

In [29]:
Statistics.corr(data1, data2)

0.996326893005933
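`Statistics.corr` uses the Pearson method by default. As a cross-check, the coefficient for the two series above can be computed directly in plain Python (the helper `pearson` is hypothetical):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

r = pearson([1, 2, 3, 4, 5], [10, 19, 32, 41, 56])
print(round(r, 6))  # 0.996327
```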

#### chiSqTest()

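`chiSqTest()` --> to run Pearson's chi-squared independence test of each feature against the label. In the example below every point receives the same label (0), so the test has no label variation to work with and each p-value comes out as 1.0.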

In [30]:
label_point_rdd = vectors_rdd.map(lambda x: LabeledPoint(0, x))

In [31]:
chi_sq_test = Statistics.chiSqTest(label_point_rdd)

In [32]:
for test in chi_sq_test:
    print("p-value: {0}".format(test.pValue))

p-value: 1.0
p-value: 1.0
p-value: 1.0
p-value: 1.0


### Machine Learning: Regression

In this section, we will explore the conventional Linear Regression model.

In [33]:
from random import randint, random
from pyspark.mllib.regression import LinearRegressionWithSGD

First, we will create training data according to a Linear Regression model with the following weights and intercept:

    * Weights: [5, 3, 8, 1]
    * Intercept: 20

In [34]:
def linear_reg(x):
    """
    Given an input vector x, returns the following value:
    5*x[0] + 3*x[1] + 8*x[2] + x[3] + 20 + random()
    
    :param x: input vector
    :return: computed value
    """
    
    return 5*x[0] + 3*x[1] + 8*x[2] + x[3] + 20 + random()

In [35]:
reg_features = [[randint(0,20) for _ in range(4)] for _ in range(100)]
reg_features_rdd = sc.parallelize(reg_features)
scaler = StandardScaler()
reg_features_scale = scaler.fit(reg_features_rdd).transform(reg_features_rdd)
reg_data = reg_features_scale.map(lambda x: LabeledPoint(linear_reg(x), Vectors.dense(x)))

In [36]:
reg_data.take(2)

[LabeledPoint(55.09369153419037, [2.24426297982,0.0,2.61640695905,2.00223719375]),
 LabeledPoint(65.00138140972491, [2.40456747838,3.20630734343,2.4528815241,3.00335579063])]

Once the data has been created, we can train our model:

In [37]:
lr_model = LinearRegressionWithSGD.train(data = reg_data, intercept=True)

We can now compare the values of the original and computed weights and intercept:

In [38]:
print("Computed --> Weights: {0}; Intercept: {1}".format(lr_model.weights, lr_model.intercept))
print("Original --> Weights: {0}; Intercept: {1}".format([5, 3, 8, 1], 20))

Computed --> Weights: [-92.7816357191,-115.27305982,-93.9699605439,-98.8850015221]; Intercept: -41.21507547619649
Original --> Weights: [5, 3, 8, 1]; Intercept: 20
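The computed weights above are far from the originals, which may indicate that the optimizer did not converge with its default step size and number of iterations on this data. For comparison, here is a minimal plain-Python sketch of batch gradient descent on a toy problem with the same target weights. This is not MLlib's implementation: the noise-free data, the feature range, the step size and the iteration count are all assumptions chosen so that the toy problem converges.

```python
import random

random.seed(0)
true_weights, true_intercept = [5.0, 3.0, 8.0, 1.0], 20.0

# Hypothetical noise-free training set with features drawn from [-1, 1]
features = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(100)]
labels = [sum(w * x for w, x in zip(true_weights, row)) + true_intercept
          for row in features]

weights, intercept = [0.0] * 4, 0.0
step, iterations = 0.5, 500  # assumed values, chosen for this toy data

for _ in range(iterations):
    # Residuals of the current model on every sample
    errors = [sum(w * x for w, x in zip(weights, row)) + intercept - y
              for row, y in zip(features, labels)]
    # Gradient of the mean squared error (up to a constant factor of 2)
    grad_w = [sum(e * row[j] for e, row in zip(errors, features)) / len(features)
              for j in range(4)]
    grad_b = sum(errors) / len(features)
    # Move against the gradient
    weights = [w - step * g for w, g in zip(weights, grad_w)]
    intercept -= step * grad_b

print([round(w, 3) for w in weights], round(intercept, 3))
```

On the notebook's data, passing smaller `step` and larger `iterations` values to `LinearRegressionWithSGD.train` would be the analogous knobs to experiment with.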


### Machine Learning: Classification

In this section, we will explore different classification models:

    * Logistic Regression
    * Support Vector Machines (SVMs)
    * Naive Bayes
    * Decision Trees
    * Random Forests
    
In every case, we will try to solve the same problem: building a model that classifies messages into two groups, legitimate and spam. To do so, we first have to preprocess some text data using some of the functionality studied in previous sections of this Notebook.

In [39]:
from pyspark.mllib.classification import LogisticRegressionWithSGD, SVMWithSGD, NaiveBayes
from pyspark.mllib.tree import DecisionTree, RandomForest

#### Data Preparation

Read the data:

In [40]:
ini_data = spark.read.csv("../data/spam.csv", header=True)

In [41]:
ini_data.show()

+-----+--------------------+----+----+----+
|label|                text| _c2| _c3| _c4|
+-----+--------------------+----+----+----+
|  ham|Go until jurong p...|null|null|null|
|  ham|Ok lar... Joking ...|null|null|null|
| spam|Free entry in 2 a...|null|null|null|
|  ham|U dun say so earl...|null|null|null|
|  ham|Nah I don't think...|null|null|null|
| spam|FreeMsg Hey there...|null|null|null|
|  ham|Even my brother i...|null|null|null|
|  ham|As per your reque...|null|null|null|
| spam|WINNER!! As a val...|null|null|null|
| spam|Had your mobile 1...|null|null|null|
|  ham|I'm gonna be home...|null|null|null|
| spam|SIX chances to wi...|null|null|null|
| spam|URGENT! You have ...|null|null|null|
|  ham|I've been searchi...|null|null|null|
|  ham|I HAVE A DATE ON ...|null|null|null|
| spam|XXXMobileMovieClu...|null|null|null|
|  ham|Oh k...i'm watchi...|null|null|null|
|  ham|Eh u remember how...|null|null|null|
|  ham| Fine if that's t...|null|null|null|
| spam|England v Macedon...|null

Filter the data:

In [42]:
ini_data_rdd = ini_data.select(["label", "text"]).rdd

In [43]:
ini_data_rdd.take(1)

[Row(label='ham', text='Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...')]

In [44]:
ini_data_rdd.take(1)[0].text

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [45]:
ini_data_rdd.count()

5574

In [46]:
ini_data_rdd_filter = ini_data_rdd.filter(lambda row: (isinstance(row.label, str) and isinstance(row.text, str)))

In [47]:
ini_data_rdd_filter.count()

5573

Vectorize data:

In [48]:
text_rdd = ini_data_rdd_filter.map(lambda row: row.text.split(" "))

In [49]:
tf = HashingTF(1000)
tf_vectors = tf.transform(text_rdd)
idf = IDF()
idf_model = idf.fit(tf_vectors)
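`HashingTF(1000)` above relies on the hashing trick: each token is hashed, the hash is reduced modulo the number of features, and the resulting bucket counts how often tokens landed in it. The plain-Python sketch below illustrates the idea. The helper `hashing_tf` is hypothetical, and `zlib.crc32` is a stand-in hash chosen for reproducibility, not the hash function MLlib uses, so the bucket indices will differ from the notebook's.

```python
import zlib
from collections import Counter

def hashing_tf(tokens, num_features):
    """Map tokens to a sparse {bucket: count} dict using the hashing trick."""
    buckets = Counter()
    for token in tokens:
        # crc32 is an assumed stand-in hash, not MLlib's hash function
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        buckets[bucket] += 1
    return dict(buckets)

tokens = "Free entry in 2 a wkly comp to win FA Cup final".split(" ")
vector = hashing_tf(tokens, 1000)
print(vector)
```

A small number of features makes collisions (different words sharing a bucket) more likely, which is the trade-off for never having to build a vocabulary.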

In [50]:
spam_text = ini_data_rdd_filter.filter(lambda row: row.label == "spam").map(lambda row: row.text.split(" "))

In [51]:
spam_text.count()

747

In [52]:
spam_text.take(3)

[['Free',
  'entry',
  'in',
  '2',
  'a',
  'wkly',
  'comp',
  'to',
  'win',
  'FA',
  'Cup',
  'final',
  'tkts',
  '21st',
  'May',
  '2005.',
  'Text',
  'FA',
  'to',
  '87121',
  'to',
  'receive',
  'entry',
  'question(std',
  'txt',
  "rate)T&C's",
  'apply',
  "08452810075over18's"],
 ['FreeMsg',
  'Hey',
  'there',
  'darling',
  "it's",
  'been',
  '3',
  "week's",
  'now',
  'and',
  'no',
  'word',
  'back!',
  "I'd",
  'like',
  'some',
  'fun',
  'you',
  'up',
  'for',
  'it',
  'still?',
  'Tb',
  'ok!',
  'XxX',
  'std',
  'chgs',
  'to',
  'send,',
  '£1.50',
  'to',
  'rcv'],
 ['WINNER!!',
  'As',
  'a',
  'valued',
  'network',
  'customer',
  'you',
  'have',
  'been',
  'selected',
  'to',
  'receivea',
  '£900',
  'prize',
  'reward!',
  'To',
  'claim',
  'call',
  '09061701461.',
  'Claim',
  'code',
  'KL341.',
  'Valid',
  '12',
  'hours',
  'only.']]

In [53]:
gen_text = ini_data_rdd_filter.filter(lambda row: row.label != "spam").map(lambda row: row.text.split(" "))

In [54]:
gen_text.count()

4826

In [55]:
gen_text.take(3)

[['Go',
  'until',
  'jurong',
  'point,',
  'crazy..',
  'Available',
  'only',
  'in',
  'bugis',
  'n',
  'great',
  'world',
  'la',
  'e',
  'buffet...',
  'Cine',
  'there',
  'got',
  'amore',
  'wat...'],
 ['Ok', 'lar...', 'Joking', 'wif', 'u', 'oni...'],
 ['U',
  'dun',
  'say',
  'so',
  'early',
  'hor...',
  'U',
  'c',
  'already',
  'then',
  'say...']]

In [56]:
spam_vectors = tf.transform(spam_text)
spam_idf = idf_model.transform(spam_vectors)

In [57]:
spam_idf.take(1)

[SparseVector(1000, {4: 4.2564, 52: 1.9576, 162: 3.663, 261: 5.407, 289: 1.5941, 309: 9.9766, 359: 5.2937, 365: 3.6159, 368: 4.8647, 389: 4.2314, 408: 4.16, 505: 5.0423, 524: 9.8246, 542: 4.8882, 547: 2.8359, 569: 3.8809, 571: 5.2586, 588: 4.937, 627: 2.9288, 633: 4.467, 648: 4.4212, 655: 3.602, 665: 4.2951, 783: 2.6097})]

In [58]:
gen_vectors = tf.transform(gen_text)
gen_idf = idf_model.transform(gen_vectors)

In [59]:
gen_idf.take(1)

[SparseVector(1000, {14: 4.8417, 17: 4.1041, 41: 4.8417, 52: 1.9576, 66: 4.9123, 84: 6.8547, 97: 5.0423, 125: 5.015, 501: 3.7976, 604: 2.5278, 606: 4.467, 657: 4.3218, 668: 3.3376, 683: 3.4329, 708: 5.1919, 802: 4.5828, 914: 5.0995, 932: 4.5828, 993: 4.072})]

In [60]:
spam_points = spam_idf.map(lambda x: LabeledPoint(1, x))
gen_points = gen_idf.map(lambda x: LabeledPoint(0, x))

In [61]:
spam_points.take(1)

[LabeledPoint(1.0, (1000,[4,52,162,261,289,309,359,365,368,389,408,505,524,542,547,569,571,588,627,633,648,655,665,783],[4.25642035557,1.95763995962,3.66302357778,5.40699238317,1.59412694928,9.97656409663,5.29366369787,3.61586790789,4.86466809235,4.23141905337,4.15996008939,5.04234926959,9.82459228268,4.88819858976,2.83590803714,3.88093607968,5.25857237806,4.93698875393,2.92877472154,4.46698512468,4.42117558865,3.6019876872,4.29513486776,2.60971104834]))]

In [62]:
gen_points.take(1)

[LabeledPoint(0.0, (1000,[14,17,41,52,66,84,97,125,501,604,606,657,668,683,708,802,914,932,993],[4.84167857412,4.10407963099,4.84167857412,1.95763995962,4.91229614134,6.85474235355,5.04234926959,5.0149502954,3.79755447074,2.52779392588,4.46698512468,4.32180311484,3.33760117735,3.43291135715,5.19188100356,4.58281694021,5.09950768343,4.58281694021,4.07199131644]))]

In [63]:
ml_data_ini = spam_points.union(gen_points)

In [64]:
randint(0,20)

20

In [65]:
ml_data = ml_data_ini.map(lambda row: (randint(0,100), row)).sortByKey().map(lambda row: row[1])

In [66]:
ml_data.map(lambda x: x.label).take(10)

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]

In [67]:
ml_data_train, ml_data_test = ml_data.randomSplit(weights = [0.8, 0.2])

In [68]:
ml_data_train.cache()
ml_data_test.cache()

PythonRDD[651] at RDD at PythonRDD.scala:49

In [69]:
ml_data_train.count()

4422

In [70]:
ml_data_test.count()

1151

In [71]:
ml_data_test.take(1)[0].features

SparseVector(1000, {67: 4.1041, 69: 2.9288, 72: 4.4362, 80: 2.2979, 101: 2.0411, 111: 3.6914, 119: 2.7124, 175: 3.4784, 186: 3.2277, 289: 1.5941, 299: 4.1041, 300: 2.4605, 339: 5.407, 343: 3.663, 437: 3.8467, 462: 3.5507, 500: 4.1951, 575: 4.7547, 581: 2.9491, 657: 8.6436, 676: 4.5828, 818: 4.5828, 833: 5.7355, 870: 2.1097, 886: 4.7547, 925: 2.7342, 935: 3.6219})

#### Logistic Regression

In [72]:
lr_model = LogisticRegressionWithSGD.train(data=ml_data_train)

In [73]:
for data in ml_data_test.take(10):
    print("Actual label: {0}; Prediction: {1}".format(data.label, lr_model.predict(data.features)))

Actual label: 1.0; Prediction: 1
Actual label: 0.0; Prediction: 0
Actual label: 0.0; Prediction: 0
Actual label: 0.0; Prediction: 0
Actual label: 0.0; Prediction: 0
Actual label: 0.0; Prediction: 0
Actual label: 0.0; Prediction: 0
Actual label: 0.0; Prediction: 0
Actual label: 0.0; Prediction: 0
Actual label: 0.0; Prediction: 0


#### Support Vector Machines

In [74]:
svm_model = SVMWithSGD.train(data=ml_data_train)

In [75]:
for data in ml_data_test.take(10):
    print("Actual label: {0}; Prediction: {1}".format(data.label, svm_model.predict(data.features)))

Actual label: 1.0; Prediction: 1
Actual label: 0.0; Prediction: 0
Actual label: 0.0; Prediction: 0
Actual label: 0.0; Prediction: 0
Actual label: 0.0; Prediction: 0
Actual label: 0.0; Prediction: 0
Actual label: 0.0; Prediction: 0
Actual label: 0.0; Prediction: 0
Actual label: 0.0; Prediction: 0
Actual label: 0.0; Prediction: 0


#### Naive Bayes

In [76]:
nb_model = NaiveBayes.train(data=ml_data_train)

In [77]:
for data in ml_data_test.take(10):
    print("Actual label: {0}; Prediction: {1}".format(data.label, nb_model.predict(data.features)))

Actual label: 1.0; Prediction: 1.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 1.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0


#### Decision Trees

In [78]:
tree_model = DecisionTree.trainClassifier(data=ml_data_train, numClasses = 2, categoricalFeaturesInfo={},
                                          maxDepth=15, maxBins=64)

In [79]:
for data in ml_data_test.take(10):
    print("Actual label: {0}; Prediction: {1}".format(data.label, tree_model.predict(data.features)))

Actual label: 1.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0


#### Random Forest

In [80]:
forest_model = RandomForest.trainClassifier(data=ml_data_train, numClasses=2, categoricalFeaturesInfo={},
                                            maxDepth=15, maxBins=64, numTrees=10)

In [81]:
for data in ml_data_test.take(10):
    print("Actual label: {0}; Prediction: {1}".format(data.label, forest_model.predict(data.features)))

Actual label: 1.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0
Actual label: 0.0; Prediction: 0.0


### Machine Learning: Clustering

In this section, we will explore the `K-means` algorithm, which is the main clustering algorithm included in MLlib.

Here, we will revisit the previous spam classification problem. We will cluster our messages into two groups and then count the number of points that fall into each group.

In [82]:
from pyspark.mllib.clustering import KMeans

In [83]:
cluster_data = ml_data.map(lambda lpoint: lpoint.features)
cluster_data.cache()

PythonRDD[1201] at RDD at PythonRDD.scala:49

In [84]:
clusters = KMeans.train(cluster_data, 2, maxIterations=1700, initializationMode="random")

In [85]:
predictions = cluster_data.map(lambda x: clusters.predict(x))

In [86]:
predictions.countByValue()

defaultdict(int, {1: 4596, 0: 977})
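K-means alternates between two steps: assign every point to its closest center, then move each center to the mean of the points assigned to it. The plain-Python sketch below shows that loop (Lloyd's algorithm) on a made-up two-dimensional data set. It is an illustration only, not MLlib's distributed implementation, and all function names are hypothetical.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def closest(point, centers):
    """Index of the center nearest to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

def kmeans(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initialization from the data
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest center
        assignments = [closest(p, centers) for p in points]
        # Update step: move each center to the mean of its members
        new_centers = []
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: the centers stopped moving
        centers = new_centers
    return centers

# Made-up data: two well-separated groups of three points each
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers = kmeans(points, 2)
labels = [closest(p, centers) for p in points]
print(labels)
```

Which group receives index 0 depends on the random initialization, which is also why the cluster ids in the notebook output carry no meaning by themselves.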

### Collaborative Filtering and Recommendation: Alternating Least Squares

Now, we will explore the `Alternating Least Squares` algorithm, which is widely used for collaborative filtering problems.

In [87]:
from pyspark.mllib.recommendation import ALS, Rating

Load and prepare the data:

In [88]:
data_als = sc.textFile("../data/als/test.data")
ratings = data_als.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

In [89]:
ratings.take(1)

[Rating(user=1, product=1, rating=5.0)]

Build a recommendation model using ALS:

In [90]:
rank = 10
numIterations = 10
als_model = ALS.train(ratings, rank, numIterations)

Now we can perform some predictions:

In [91]:
test_data = ratings.map(lambda p: (p[0], p[1]))

In [92]:
test_data.take(2)

[(1, 1), (1, 2)]

In [93]:
als_predictions = als_model.predictAll(test_data)

In [94]:
als_predictions.take(2)

[Rating(user=1, product=1, rating=4.997434130426677),
 Rating(user=1, product=2, rating=0.9999267659226103)]

### Dimensionality Reduction

In this section, we will see two main functionalities included in MLlib relative to dimensionality reduction:

    * Principal Component Analysis
    * Singular Value Decomposition
    
    
We will use the data from the Clustering section and also train a KMeans model on the "reduced" data.

#### Principal Component Analysis

In [95]:
cluster_data.take(2)

[SparseVector(1000, {49: 5.6814, 63: 3.7059, 65: 4.7972, 111: 3.6914, 243: 4.6369, 320: 3.3376, 339: 5.407, 365: 2.4106, 421: 3.7817, 540: 2.9982, 564: 4.6185, 635: 3.8137, 661: 5.1919, 668: 3.3376, 686: 5.1294, 725: 5.4904, 740: 3.4668, 803: 4.0207, 813: 5.2247, 824: 3.1791, 870: 2.1097, 880: 4.2564, 948: 4.072}),
 SparseVector(1000, {51: 4.4515, 98: 3.9815, 119: 2.7124, 174: 3.4441, 278: 4.3084, 287: 3.953, 289: 1.5941, 300: 2.4605, 403: 2.6674, 477: 5.4904, 483: 3.9911, 495: 1.8168, 561: 5.5348, 581: 2.9491, 670: 4.7757, 783: 2.6097, 809: 12.3782, 853: 4.8882, 870: 2.1097, 872: 3.0846, 895: 4.8647, 976: 3.9719})]

In [96]:
from pyspark.mllib.linalg import Matrix
from pyspark.mllib.linalg.distributed import RowMatrix

In [97]:
mat = RowMatrix(cluster_data)

In [98]:
pc = mat.computePrincipalComponents(2)

In [99]:
projected_pca = mat.multiply(pc).rows

In [100]:
kmeans_model_pca = KMeans.train(projected_pca, 2, maxIterations=1700, initializationMode="random")

In [101]:
predictions_pca = projected_pca.map(lambda x: kmeans_model_pca.predict(x))

In [102]:
predictions_pca.countByValue()

defaultdict(int, {0: 5515, 1: 58})
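To make the projection step more concrete, the sketch below estimates the first principal component of a tiny made-up data set and projects the rows onto it, in the same spirit as `mat.multiply(pc)`. It uses power iteration on the covariance matrix, a technique swapped in here for brevity rather than the method MLlib uses, and it assumes the starting vector is not orthogonal to the leading component. The function name is hypothetical.

```python
import math

def top_principal_component(rows, iterations=100):
    """Estimate the first principal component by power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d)
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    vector = [1.0] * d  # assumed starting direction
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    return vector

# Made-up data lying exactly on the line y = 2x
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
component = top_principal_component(rows)

# Project every row onto the component: 2 dimensions reduced to 1
projected = [sum(x * c for x, c in zip(row, component)) for row in rows]
print([round(c, 4) for c in component])  # [0.4472, 0.8944]
```

Because the points lie on a single line, one component captures all of their variation and the one-dimensional projection loses no information.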

#### Singular Value Decomposition

In [103]:
svd = mat.computeSVD(20, computeU=True)

In [104]:
projected_svd = mat.multiply(svd.V).rows

In [105]:
kmeans_model_svd = KMeans.train(projected_svd, 2, maxIterations=1700, initializationMode="random")

In [106]:
predictions_svd = projected_svd.map(lambda x: kmeans_model_svd.predict(x))

In [107]:
predictions_svd.countByValue()

defaultdict(int, {1: 4016, 0: 1557})

### Model Evaluation

MLlib includes functionality to automatically calculate metrics for trained ML models. Several evaluators are available; here we will evaluate the LR model from the spam classification section using `BinaryClassificationMetrics`.

In [108]:
from pyspark.mllib.evaluation import BinaryClassificationMetrics

In [109]:
ml_data_train.take(1)

[LabeledPoint(1.0, (1000,[49,63,65,111,243,320,339,365,421,540,564,635,661,668,686,725,740,803,813,824,870,880,948],[5.68142922888,3.70588728221,4.79722681155,3.69139427491,4.63688416148,3.33760117735,5.40699238317,2.41057860526,3.78168112158,2.99824709435,4.61853502281,3.81368385267,5.19188100356,3.33760117735,5.12936064658,5.49037399211,3.46681290883,4.02069802205,5.22467082638,3.17913083638,2.109675132,4.25642035557,4.07199131644]))]

In [110]:
pred_label_lr = ml_data_test.map(lambda lpoint: (float(lr_model.predict(lpoint.features)), lpoint.label))
metrics_lr = BinaryClassificationMetrics(pred_label_lr)

In [111]:
print("LR model")
print("Area Under PR: {0}".format(metrics_lr.areaUnderPR))
print("Area Under ROC: {0}".format(metrics_lr.areaUnderROC))

LR model
Area Under PR: 0.7012104519813399
Area Under ROC: 0.8513007541963054
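The area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as one half. The plain-Python sketch below computes it that way on made-up (score, label) pairs. The helper `area_under_roc` is hypothetical and is not how MLlib computes the metric internally.

```python
def area_under_roc(score_and_labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    positives = [score for score, label in score_and_labels if label == 1.0]
    negatives = [score for score, label in score_and_labels if label == 0.0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up pairs with continuous scores
scored_pairs = [(0.9, 1.0), (0.8, 0.0), (0.3, 1.0), (0.1, 0.0)]
print(area_under_roc(scored_pairs))  # 0.75

# Made-up pairs with hard 0/1 predictions instead of scores
hard_pairs = [(1.0, 1.0), (0.0, 0.0), (0.0, 1.0), (0.0, 0.0)]
print(area_under_roc(hard_pairs))    # 0.75
```

The cell above feeds hard class predictions into the metrics, so the curve has a single operating point. Calling `lr_model.clearThreshold()` before predicting should make the model return raw scores instead, which gives the metric a full ranking to work with.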


## Pipeline API

ML pipelines organize all the tasks of an ML problem (data preparation + model training) into a single Pipeline. In this section, we will solve the spam classification problem using ML pipelines, which are made up of a series of Transformers and Estimators.
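The difference between the two kinds of stage is that a Transformer only has `transform()`, while an Estimator has `fit()`, which learns something from the data and returns a fitted Transformer. The plain-Python sketch below mimics that contract on lists of strings. Every class in it is hypothetical and written for illustration; none of them is part of Spark.

```python
class Lowercase:
    """Transformer: needs no fitting, only transform()."""
    def transform(self, rows):
        return [row.lower() for row in rows]

class VocabularyIndexer:
    """Estimator: fit() learns a vocabulary and returns a fitted transformer."""
    def fit(self, rows):
        vocabulary = sorted({word for row in rows for word in row.split(" ")})
        return VocabularyModel({word: i for i, word in enumerate(vocabulary)})

class VocabularyModel:
    """Fitted transformer produced by VocabularyIndexer.fit()."""
    def __init__(self, index):
        self.index = index
    def transform(self, rows):
        # Unknown words are skipped
        return [[self.index[w] for w in row.split(" ") if w in self.index]
                for row in rows]

class SimplePipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data produced by the previous stages
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return SimplePipelineModel(fitted)

class SimplePipelineModel:
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

pipeline = SimplePipeline([Lowercase(), VocabularyIndexer()])
pipeline_model = pipeline.fit(["Good bye", "Hello how are you"])
print(pipeline_model.transform(["HELLO you", "bye"]))  # [[3, 5], [1]]
```

Fitting the pipeline once and reusing the fitted model guarantees that new data goes through exactly the same preparation steps as the training data.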

In [112]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer, IDF, SQLTransformer, StringIndexer
from pyspark.ml.pipeline import Pipeline

In [113]:
ini_data.show()

+-----+--------------------+----+----+----+
|label|                text| _c2| _c3| _c4|
+-----+--------------------+----+----+----+
|  ham|Go until jurong p...|null|null|null|
|  ham|Ok lar... Joking ...|null|null|null|
| spam|Free entry in 2 a...|null|null|null|
|  ham|U dun say so earl...|null|null|null|
|  ham|Nah I don't think...|null|null|null|
| spam|FreeMsg Hey there...|null|null|null|
|  ham|Even my brother i...|null|null|null|
|  ham|As per your reque...|null|null|null|
| spam|WINNER!! As a val...|null|null|null|
| spam|Had your mobile 1...|null|null|null|
|  ham|I'm gonna be home...|null|null|null|
| spam|SIX chances to wi...|null|null|null|
| spam|URGENT! You have ...|null|null|null|
|  ham|I've been searchi...|null|null|null|
|  ham|I HAVE A DATE ON ...|null|null|null|
| spam|XXXMobileMovieClu...|null|null|null|
|  ham|Oh k...i'm watchi...|null|null|null|
|  ham|Eh u remember how...|null|null|null|
|  ham| Fine if that's t...|null|null|null|
| spam|England v Macedon...|null

In [114]:
sql_select = SQLTransformer(statement = "SELECT label, text FROM __THIS__")

In [115]:
sql_filter = SQLTransformer(statement = "SELECT * from __THIS__ WHERE text is not null AND label is not null")

In [116]:
label_indexer = StringIndexer(inputCol="label", outputCol="label_num")

In [117]:
tokenizer = Tokenizer(inputCol = "text", outputCol = "text_token")

In [118]:
tf = HashingTF(numFeatures = 1000, inputCol = "text_token", outputCol = "text_tf")

In [119]:
idf = IDF(inputCol="text_tf", outputCol="features")

In [120]:
lr = LogisticRegression(featuresCol="features", labelCol="label_num")

In [121]:
ml_pipeline = Pipeline(stages=[sql_select, sql_filter, label_indexer, tokenizer, tf, idf, lr])

In [122]:
ml_pipeline_model = ml_pipeline.fit(ini_data)

In [123]:
ml_pipeline_model.transform(ini_data).show(5)

+-----+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|label|                text|label_num|          text_token|             text_tf|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|  ham|Go until jurong p...|      0.0|[go, until, juron...|(1000,[7,77,150,1...|(1000,[7,77,150,1...|[46.1925496142281...|[1.0,9.5478469531...|       0.0|
|  ham|Ok lar... Joking ...|      0.0|[ok, lar..., joki...|(1000,[20,316,484...|(1000,[20,316,484...|[22.7239392220272...|[0.99999999999972...|       0.0|
| spam|Free entry in 2 a...|      1.0|[free, entry, in,...|(1000,[30,35,73,1...|(1000,[30,35,73,1...|[-49.707099745267...|[5.78975520257700...|       1.0|
|  ham|U dun say so earl...|      0.0|[u, dun, say, so,...|(1000,[57,3