# Machine Learning with MLlib

In this notebook, we review MLlib, Spark's RDD-based machine learning library.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MLlib").master("local[*]").getOrCreate()
sc = spark.sparkContext

## Data Types

First, we need to understand the main data types used by MLlib:

    * Vectors
    * Labeled Points
    * Rating
    * Model Classes

We will look at `Vectors` and `LabeledPoint` in more detail.

In [2]:
from pyspark.mllib.linalg import Vectors

`Vectors` --> holds the feature values. A vector can be `dense` or `sparse`.

In [3]:
vector_dense = Vectors.dense([1.0,1.0,2.0,2.0])

In [4]:
vector_sparse_1 = Vectors.sparse(4, {0: 1.0, 2: 2.0})

In [5]:
vector_sparse_2 = Vectors.sparse(4, [0, 2], [1.0, 2.0])
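Both sparse constructors describe the same vector: a size plus the positions and values of its non-zero entries. The sketch below uses a hypothetical helper, `sparse_to_dense`, written in plain Python (it is not part of MLlib) to show the dense list that each call corresponds to.

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) description into a dense list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense

# Dictionary form, as in Vectors.sparse(4, {0: 1.0, 2: 2.0})
entries = {0: 1.0, 2: 2.0}
from_dict = sparse_to_dense(4, list(entries.keys()), list(entries.values()))

# Two-list form, as in Vectors.sparse(4, [0, 2], [1.0, 2.0])
from_lists = sparse_to_dense(4, [0, 2], [1.0, 2.0])

print(from_dict)   # [1.0, 0.0, 2.0, 0.0]
print(from_lists)  # [1.0, 0.0, 2.0, 0.0]
```

Note that neither sparse vector equals `vector_dense`, which holds `[1.0, 1.0, 2.0, 2.0]`.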

`LabeledPoint()` --> holds both the feature values and the label value

In [6]:
from pyspark.mllib.regression import LabeledPoint

In [7]:
label_point = LabeledPoint(1, vector_dense)

## Algorithms

In this section, we will review the different algorithms associated with Machine Learning problems. Among others, we can highlight the following families of algorithms:

    * Feature Extraction
    * Statistics
    * Classification and Regression
    * Collaborative Filtering and Recommendation
    * Dimensionality Reduction
    * Model Evaluation

### Feature Extraction

ML algorithms only accept numerical values as inputs. Here, we discuss some algorithms that help us translate inputs (such as text or unscaled numerical vectors) into numerical values that ML algorithms can work with. In particular, we will discuss the following algorithms:

    * TF-IDF
    * Scaling
    * Normalization
    * Word2Vec

#### TF-IDF

`TF-IDF` --> Term Frequency - Inverse Document Frequency, useful to convert text input into numerical input. In MLlib it is computed in two steps, with `HashingTF` and `IDF`.

In [8]:
from pyspark.mllib.feature import HashingTF, IDF

In [9]:
sentences = sc.parallelize(["hello", "hello how are you", "good bye", "bye"])
words = sentences.map(lambda sentence: sentence.split(" "))
tf = HashingTF(100)
tf_vectors = tf.transform(words)

In [10]:
tf_vectors.collect()

[SparseVector(100, {45: 1.0}),
 SparseVector(100, {1: 1.0, 21: 1.0, 24: 1.0, 45: 1.0}),
 SparseVector(100, {64: 1.0, 88: 1.0}),
 SparseVector(100, {88: 1.0})]
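The term-frequency step relies on the hashing trick: each term is hashed to an index below the number of features, and the count at that index is incremented. The following is a minimal plain-Python sketch of that idea. The helper `hashing_tf` is hypothetical, and `zlib.crc32` is only a stand-in hash, so the indices it produces will not match the ones in the output above.

```python
import zlib

def hashing_tf(terms, num_features):
    """Map a list of terms to a sparse {index: count} dictionary."""
    counts = {}
    for term in terms:
        # Stand-in hash (assumption); MLlib's HashingTF uses its own hash function.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0.0) + 1.0
    return counts

documents = [["hello"], ["hello", "how", "are", "you"], ["good", "bye"], ["bye"]]
tf_sketch = [hashing_tf(doc, 100) for doc in documents]

for doc, vector in zip(documents, tf_sketch):
    print(doc, vector)
```

Because the index depends only on the term, "hello" lands on the same index in the first and second documents, which is what index 45 shows in the output above. Two different terms can collide on one index; a larger number of features makes that less likely.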

In [11]:
idf = IDF()
idf_model = idf.fit(tf_vectors)
tf_idf_vectors = idf_model.transform(tf_vectors)

In [12]:
tf_idf_vectors.collect()

[SparseVector(100, {45: 0.5108}),
 SparseVector(100, {1: 0.9163, 21: 0.9163, 24: 0.9163, 45: 0.5108}),
 SparseVector(100, {64: 0.9163, 88: 0.5108}),
 SparseVector(100, {88: 0.5108})]
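The weights above can be checked by hand. MLlib's `IDF` uses the smoothed formula `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents containing the term. The sketch below recomputes the weights in plain Python, keyed by term instead of by hashed index; the variable names are illustrative only.

```python
import math

documents = [["hello"], ["hello", "how", "are", "you"], ["good", "bye"], ["bye"]]
num_docs = len(documents)

# Document frequency: number of documents that contain each term.
doc_freq = {}
for doc in documents:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1

# Smoothed inverse document frequency: log((m + 1) / (df + 1)).
idf_weights = {
    term: math.log((num_docs + 1) / (df + 1)) for term, df in doc_freq.items()
}

# TF-IDF per document: term count times the term's IDF weight.
tf_idf_sketch = [
    {term: doc.count(term) * idf_weights[term] for term in set(doc)}
    for doc in documents
]

print(round(idf_weights["hello"], 4))  # 0.5108
print(round(idf_weights["how"], 4))    # 0.9163
```

"hello" and "bye" each appear in two of the four documents, so they get the lower weight `log(5/3)`; the remaining terms appear in one document and get `log(5/2)`.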

#### Word2Vec

`Word2Vec` --> also useful to transform text into numerical data

In [13]:
from pyspark.mllib.feature import Word2Vec

In [14]:
word2vec = Word2Vec().setMinCount(0)
word2vec_model = word2vec.fit(words)

In [15]:
word2vec_vectors = word2vec_model.transform("hello")

In [16]:
word2vec_vectors

DenseVector([0.0005, 0.0002, -0.0012, -0.0001, 0.0049, -0.001, 0.0043, 0.0039, 0.0007, 0.0022, -0.0008, -0.0003, -0.001, 0.0046, 0.0, -0.0046, 0.0025, -0.003, -0.0031, -0.0001, 0.0038, 0.003, 0.0026, -0.0048, -0.002, 0.0037, 0.0007, -0.0013, 0.002, 0.0035, -0.0032, 0.0014, 0.0045, 0.0006, -0.0033, -0.0008, -0.0031, -0.0026, 0.0044, -0.0013, 0.0004, 0.0024, 0.0021, -0.0048, -0.0008, 0.0035, -0.0015, 0.0013, -0.0017, 0.0027, -0.0006, 0.0001, -0.0009, -0.0026, -0.001, 0.0008, -0.0007, 0.0038, -0.0032, -0.0025, -0.0032, 0.0039, -0.0, 0.0037, -0.0027, 0.0021, 0.0002, -0.0027, 0.0041, -0.0006, -0.001, 0.0043, 0.0011, -0.0022, 0.003, -0.0046, -0.0021, -0.005, 0.0022, 0.0014, 0.0033, -0.0009, 0.0008, 0.0018, 0.0012, -0.0047, 0.0009, -0.0037, 0.0029, 0.0002, 0.0039, 0.0037, 0.0032, -0.0011, -0.002, 0.0, -0.0006, -0.0004, -0.0012, 0.0004])

#### Scaling

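Even when the input data is already numeric, it is often useful to scale it before passing it to an ML algorithm.

`StandardScaler()` --> standardizes each feature (column): `withMean=True` subtracts the column mean and `withStd=True` divides by the column standard deviation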

In [17]:
from pyspark.mllib.feature import StandardScaler

In [18]:
vectors = [Vectors.dense([-2.0, 5.0, 1.0, 4.0]),
           Vectors.dense([2.0, 0.0, 1.0, 7.2]),
           Vectors.dense([4.0, 2.0, 0.5, 0.8])]

vectors_rdd = sc.parallelize(vectors)
scaler = StandardScaler(withMean=True, withStd=True)
model = scaler.fit(vectors_rdd)
scaled_data = model.transform(vectors_rdd)

In [19]:
scaled_data.collect()

[DenseVector([-1.0911, 1.0596, 0.5774, 0.0]),
 DenseVector([0.2182, -0.9272, 0.5774, 1.0]),
 DenseVector([0.8729, -0.1325, -1.1547, -1.0])]
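The scaled values above can be reproduced without Spark. The sketch below standardizes each column in plain Python with the sample standard deviation (divisor `n - 1`), which is the convention that matches the output above; the variable names are illustrative.

```python
from statistics import mean, stdev

rows = [[-2.0, 5.0, 1.0, 4.0],
        [2.0, 0.0, 1.0, 7.2],
        [4.0, 2.0, 0.5, 0.8]]

# Work column by column: each column is one feature.
columns = list(zip(*rows))
col_means = [mean(col) for col in columns]
col_stds = [stdev(col) for col in columns]  # sample standard deviation (n - 1)

standardized = [
    [(value - m) / s for value, m, s in zip(row, col_means, col_stds)]
    for row in rows
]

for row in standardized:
    print([round(value, 4) for value in row])
```

After this transformation every column has mean 0 and sample standard deviation 1.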

#### Normalization

As with scaling, it is sometimes very useful to normalize our data. `Normalizer()` rescales each vector (row) to unit norm; by default it uses the L2 norm.

In [20]:
from pyspark.mllib.feature import Normalizer

In [21]:
norm = Normalizer()
norm_data = norm.transform(vectors_rdd)

In [22]:
norm_data.collect()

[DenseVector([-0.2949, 0.7372, 0.1474, 0.5898]),
 DenseVector([0.2653, 0.0, 0.1326, 0.955]),
 DenseVector([0.8752, 0.4376, 0.1094, 0.175])]
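Unlike scaling, normalization works row by row: each vector is divided by its own norm. A minimal plain-Python sketch of L2 normalization follows; the helper `l2_normalize` is hypothetical and is not an MLlib function.

```python
import math

rows = [[-2.0, 5.0, 1.0, 4.0],
        [2.0, 0.0, 1.0, 7.2],
        [4.0, 2.0, 0.5, 0.8]]

def l2_normalize(vector):
    """Divide a vector by its Euclidean (L2) norm; leave a zero vector unchanged."""
    norm = math.sqrt(sum(value * value for value in vector))
    if norm == 0:
        return list(vector)
    return [value / norm for value in vector]

normalized = [l2_normalize(row) for row in rows]

for row in normalized:
    print([round(value, 4) for value in row])
```

For the first row the norm is `sqrt(4 + 25 + 1 + 16) = sqrt(46)`, so the first entry becomes `-2 / sqrt(46)`, about `-0.2949`, as in the output above.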

### Statistics

MLlib includes utilities to compute common summary statistics over numeric RDDs.

In [23]:
from pyspark.mllib.stat import Statistics

#### colStats()

`colStats()` --> to calculate column-wise summary statistics over an RDD of vectors

In [24]:
col_stats = Statistics.colStats(vectors_rdd)

In [25]:
col_stats_dict = {
    "count": col_stats.count(),
    "max": col_stats.max(),
    "mean": col_stats.mean(),
    "min": col_stats.min(),
    "normL1": col_stats.normL1(),
    "normL2": col_stats.normL2(),
    "numNonzeros": col_stats.numNonzeros(),
    "variance": col_stats.variance()
}

In [26]:
for key, value in col_stats_dict.items():
    print("{0}: {1}".format(key, value))

count: 3
max: [ 4.   5.   1.   7.2]
mean: [ 1.33333333  2.33333333  0.83333333  4.        ]
min: [-2.   0.   0.5  0.8]
normL1: [  8.    7.    2.5  12. ]
normL2: [ 4.89897949  5.38516481  1.5         8.27526435]
numNonzeros: [ 3.  2.  3.  3.]
variance: [  9.33333333   6.33333333   0.08333333  10.24      ]
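Every statistic is computed per column, that is, per feature across all vectors. As a cross-check, the same quantities can be derived in plain Python. This is a sketch with illustrative names; the variance is the sample variance (divisor `n - 1`), which matches the output above.

```python
import math
from statistics import mean, variance

rows = [[-2.0, 5.0, 1.0, 4.0],
        [2.0, 0.0, 1.0, 7.2],
        [4.0, 2.0, 0.5, 0.8]]

columns = list(zip(*rows))

summary = {
    "count": len(rows),
    "max": [max(col) for col in columns],
    "mean": [mean(col) for col in columns],
    "min": [min(col) for col in columns],
    "normL1": [sum(abs(v) for v in col) for col in columns],
    "normL2": [math.sqrt(sum(v * v for v in col)) for col in columns],
    "numNonzeros": [sum(1 for v in col if v != 0) for col in columns],
    "variance": [variance(col) for col in columns],  # sample variance (n - 1)
}

for key, value in summary.items():
    print("{0}: {1}".format(key, value))
```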


#### corr()

`corr()` --> to calculate the correlation matrix between the columns of one RDD of vectors, or the correlation between two RDDs of numbers (Pearson by default)

In [27]:
Statistics.corr(vectors_rdd)

array([[ 1.        , -0.73704347, -0.75592895, -0.32732684],
       [-0.73704347,  1.        ,  0.11470787, -0.39735971],
       [-0.75592895,  0.11470787,  1.        ,  0.8660254 ],
       [-0.32732684, -0.39735971,  0.8660254 ,  1.        ]])

In [28]:
data1 = sc.parallelize([1, 2, 3, 4, 5])
data2 = sc.parallelize([10, 19, 32, 41, 56])

In [29]:
Statistics.corr(data1, data2)

0.996326893005933
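The value above is the Pearson correlation coefficient: the covariance of the two series divided by the product of their standard deviations. A plain-Python sketch of the formula follows; the helper `pearson` is hypothetical and is not part of MLlib.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n (or 1/(n-1)) factors cancel between numerator and denominator.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

r = pearson([1, 2, 3, 4, 5], [10, 19, 32, 41, 56])
print(round(r, 6))  # 0.996327
```

A value this close to 1 reflects that the second series grows almost linearly with the first.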

#### chiSqTest()

`chiSqTest()` --> to run Pearson's chi-squared independence test of each feature against the label; each result exposes the test's p-value as `pValue`

In [30]:
label_point_rdd = vectors_rdd.map(lambda x: LabeledPoint(0, x))

In [31]:
chi_sq_test = Statistics.chiSqTest(label_point_rdd)

In [32]:
for test in chi_sq_test:
    print("Test value: {0}: ".format(test.pValue))

Test value: 1.0: 
Test value: 1.0: 
Test value: 1.0: 
Test value: 1.0: 
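Every p-value is 1.0 because all labels were set to 0: with a single label value, the contingency table of feature versus label has one column, so the chi-squared statistic is 0 with 0 degrees of freedom and the test cannot detect any dependence. The sketch below computes the statistic and the degrees of freedom in plain Python. The helper `chi_squared_statistic` is hypothetical, and the second data set is invented purely for contrast.

```python
from collections import Counter

def chi_squared_statistic(feature_values, labels):
    """Return (statistic, degrees_of_freedom) for a feature/label contingency table."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)

    statistic = 0.0
    for feature, feature_total in feature_totals.items():
        for label, label_total in label_totals.items():
            # Expected count if feature and label were independent.
            expected = feature_total * label_total / n
            statistic += (observed[(feature, label)] - expected) ** 2 / expected

    dof = (len(feature_totals) - 1) * (len(label_totals) - 1)
    return statistic, dof

# First feature column of vectors_rdd, with every label set to 0 as above.
stat_single, dof_single = chi_squared_statistic([-2.0, 2.0, 4.0], [0, 0, 0])
print(stat_single, dof_single)  # 0.0 0

# Hypothetical data in which the feature fully determines the label.
stat_dep, dof_dep = chi_squared_statistic([0, 0, 1, 1], [0, 0, 1, 1])
print(stat_dep, dof_dep)  # 4.0 1
```

To get an informative test in the notebook, the `LabeledPoint` RDD would need at least two distinct label values.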


### Machine Learning: Regression

In this section, we will explore the conventional Linear Regression model.

In [33]:
from random import randint, random
from pyspark.mllib.regression import LinearRegressionWithSGD

First, we will create training data according to a Linear Regression model with the following weights and intercept:

    * Weights: [5, 3, 8, 1]
    * Intercept: 20

In [34]:
def linear_reg(x):
    """
    Given an input vector x, returns the following value:
    5*x[0] + 3*x[1] + 8*x[2] + x[3] + 20 + random()
    where random() adds uniform noise in [0, 1).
    
    :param x: input vector
    :return: computed value
    """
    
    return 5*x[0] + 3*x[1] + 8*x[2] + x[3] + 20 + random()

In [35]:
reg_features = [[randint(0,20) for _ in range(4)] for _ in range(100)]
reg_features_rdd = sc.parallelize(reg_features)
scaler = StandardScaler()
reg_features_scale = scaler.fit(reg_features_rdd).transform(reg_features_rdd)
reg_data = reg_features_scale.map(lambda x: LabeledPoint(linear_reg(x), Vectors.dense(x)))

In [36]:
reg_data.take(2)

[LabeledPoint(31.44320327503508, [0.325623877528,1.31040867515,0.518832866202,1.03965430882]),
 LabeledPoint(60.720532751402224, [3.25623877528,1.96561301272,2.07533146481,1.21293002696])]

Once the data has been created, we can train our model:

In [37]:
lr_model = LinearRegressionWithSGD.train(data=reg_data, intercept=True)

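We can now compare the original weights and intercept with the computed ones. Note that the noise term `random()` has a mean of 0.5, so the intercept the model is effectively estimating is 20.5 rather than 20: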

In [38]:
print("Computed --> Weights: {0}; Intercept: {1}".format(lr_model.weights, lr_model.intercept))
print("Original --> Weights: {0}; Intercept: {1}".format([5, 3, 8, 1], 20))

Computed --> Weights: [5.89847605314,3.62444173624,8.39588463624,1.62514271703]; Intercept: 16.034763324762647
Original --> Weights: [5, 3, 8, 1]; Intercept: 20
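The computed values are close to the originals but not identical, which is typical of an iterative optimizer stopped after a limited number of steps. To illustrate the idea behind the training step, here is a plain-Python sketch of full-batch gradient descent on the squared error. It is a simplification under stated assumptions, not MLlib's optimizer: the training set is a hypothetical noise-free grid, and the step size and iteration count are arbitrary choices that happen to converge for this data.

```python
from itertools import product

true_weights = [5.0, 3.0, 8.0, 1.0]
true_intercept = 20.0

# Hypothetical noise-free training set: every 4-feature combination of 0, 1, 2.
features = [list(point) for point in product([0.0, 1.0, 2.0], repeat=4)]
labels = [
    sum(w * x for w, x in zip(true_weights, row)) + true_intercept
    for row in features
]

weights = [0.0] * 4
intercept = 0.0
step_size = 0.1      # assumption: chosen small enough to be stable here
n = len(features)

for _ in range(3000):
    grad_w = [0.0] * 4
    grad_b = 0.0
    for row, label in zip(features, labels):
        # Prediction error for the current parameters.
        error = sum(w * x for w, x in zip(weights, row)) + intercept - label
        for j, x in enumerate(row):
            grad_w[j] += error * x / n
        grad_b += error / n
    # Move the parameters against the gradient of the mean squared error.
    weights = [w - step_size * g for w, g in zip(weights, grad_w)]
    intercept -= step_size * grad_b

print([round(w, 3) for w in weights], round(intercept, 3))
```

With noise-free data and enough iterations the estimates approach the true parameters. With fewer iterations, or with noisy labels as in the notebook, they stop short of them, and the intercept is usually the slowest parameter to converge.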
