<center>
<a href="http://www.insa-toulouse.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/logo-insa.jpg" style="float:left; max-width: 120px; display: inline" alt="INSA"/></a> 
<a href="http://wikistat.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/wikistat.jpg" style="max-width: 250px; display: inline"  alt="Wikistat"/></a>
<a href="http://www.math.univ-toulouse.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/logo_imt.jpg" style="float:right; max-width: 200px; display: inline" alt="IMT"/> </a>
</center>

# IA Framework.
## Lab 1  - Introduction to Pyspark.
#### Part 4: Introduction to the [SparkML](https://spark.apache.org/docs/latest/ml-guide.html) library (or *MLlib DataFrame-based API*)   <a href="http://spark.apache.org/"><img src="http://spark.apache.org/images/spark-logo-trademark.png" style="max-width: 100px; display: inline" alt="Spark"/> </a> 

## Introduction

Since Spark 2.0, the `MLlib` library, which relies only on RDDs, is no longer actively developed. It can still be used, but no new functionality will be added to it.

The main *machine learning* library in Spark is now `SparkML`, which uses only *DataFrames*.

`SparkML` does not yet have as much functionality as `MLlib`, but it is expected to reach full feature parity in Spark 3.0.

## Context

In [14]:
from pyspark import SparkContext
from pyspark.sql import SparkSession 
sc = SparkContext.getOrCreate()
spark = SparkSession.builder \
     .master("local") \
     .appName("cal4 pyspark") \
     .getOrCreate()

##  Elementary statistics 

Most of the functions used in the *MLlib* notebook do not exist in *SparkML*: only correlation and hypothesis testing are available so far.

### Vectors object

The *SparkML* library uses `Vectors` objects to manipulate arrays (similar to NumPy arrays).

In [2]:
from numpy import array
np_vectors=array([1.0,0.0,2.0,4.0,0.0])
np_vectors

array([1., 0., 2., 4., 0.])

In [4]:
from pyspark.ml.linalg import Vectors
denseVec2=Vectors.dense([1.0,0.0,2.0,4.0,0.0])
denseVec2

DenseVector([1.0, 0.0, 2.0, 4.0, 0.0])

The code above builds a *DenseVector*. A *SparseVector* can be used for sparse data: only the vector size and the non-zero entries are stored.

In [5]:
sparseVec1 = Vectors.sparse(10, {0: 1.0, 2: 2.0, 6: 4.0})
sparseVec1

SparseVector(10, {0: 1.0, 2: 2.0, 6: 4.0})

An alternative syntax gives the indices and the values as two separate lists:

In [6]:
sparseVec2 = Vectors.sparse(10, [0, 2, 6], [1.0, 2.0, 4.0])
sparseVec2

SparseVector(10, {0: 1.0, 2: 2.0, 6: 4.0})
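
To make the sparse encoding concrete, here is a minimal pure-Python sketch. It does not use Spark, and `sparse_to_dense` is a hypothetical helper written for this illustration, not a `pyspark` function. It expands both syntaxes shown above into the same dense list.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse description into a dense list."""
    dense = [0.0] * size  # every entry that is not listed is an implicit zero
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


# Dictionary syntax: {index: value}
entries = {0: 1.0, 2: 2.0, 6: 4.0}
dense_from_dict = sparse_to_dense(10, entries.keys(), entries.values())

# Two-list syntax: indices and values are given separately
dense_from_lists = sparse_to_dense(10, [0, 2, 6], [1.0, 2.0, 4.0])

print(dense_from_dict)
print(dense_from_dict == dense_from_lists)
```

Both descriptions store three numbers instead of ten, which is the point of the sparse format when most entries are zero.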

### Correlation

The `pyspark.ml.stat.Correlation` class computes the correlation matrix (*Pearson* or *Spearman*) of a *DataFrame* column of vectors, i.e. the correlations between the components of these vectors.

In [13]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation

data = [(Vectors.sparse(4, [(0, 1.0), (3, -2.0)]),),
        (Vectors.dense([4.0, 5.0, 0.0, 3.0]),),
        (Vectors.dense([6.0, 7.0, 0.0, 8.0]),),
        (Vectors.sparse(4, [(0, 9.0), (3, 1.0)]),)]


df = spark.createDataFrame(data, ["features"])

r1 = Correlation.corr(df, "features").head()
print("Pearson correlation matrix:\n" + str(r1[0]))

r2 = Correlation.corr(df, "features", "spearman").head()
print("Spearman correlation matrix:\n" + str(r2[0]))

Pearson correlation matrix:
DenseMatrix([[1.        , 0.05564149,        nan, 0.40047142],
             [0.05564149, 1.        ,        nan, 0.91359586],
             [       nan,        nan, 1.        ,        nan],
             [0.40047142, 0.91359586,        nan, 1.        ]])
Spearman correlation matrix:
DenseMatrix([[1.        , 0.10540926,        nan, 0.4       ],
             [0.10540926, 1.        ,        nan, 0.9486833 ],
             [       nan,        nan, 1.        ,        nan],
             [0.4       , 0.9486833 ,        nan, 1.        ]])
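
The `nan` entries come from the third component, which is zero in every row: its variance is zero, so its correlation with any other component is undefined. The sketch below recomputes a few coefficients in pure Python to check this. It is only an illustration of the formulas, not Spark's implementation, and `pearson`, `average_ranks` and `spearman` are hypothetical helper names.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns nan when one of the variables is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        return float("nan")  # zero variance: the coefficient is undefined
    return cov / math.sqrt(ss_x * ss_y)


def average_ranks(values):
    """1-based ranks, tied values receive the mean of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Same four vectors as in the DataFrame above, written densely
rows = [[1.0, 0.0, 0.0, -2.0],
        [4.0, 5.0, 0.0, 3.0],
        [6.0, 7.0, 0.0, 8.0],
        [9.0, 0.0, 0.0, 1.0]]
cols = list(zip(*rows))  # one tuple per vector component

pearson_01 = pearson(cols[0], cols[1])
pearson_02 = pearson(cols[0], cols[2])
spearman_01 = spearman(cols[0], cols[1])

print(round(pearson_01, 8), pearson_02, round(spearman_01, 8))
```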


### Summary Statistics

In [15]:
r2

Row(spearman(features)=DenseMatrix(4, 4, [1.0, 0.1054, nan, 0.4, 0.1054, 1.0, nan, 0.9487, nan, nan, 1.0, nan, 0.4, 0.9487, nan, 1.0], False))

###  Hypothesis testing

In [16]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest

data = [(0.0, Vectors.dense(0.5, 10.0)),
        (0.0, Vectors.dense(1.5, 20.0)),
        (1.0, Vectors.dense(1.5, 30.0)),
        (0.0, Vectors.dense(3.5, 30.0)),
        (0.0, Vectors.dense(3.5, 40.0)),
        (1.0, Vectors.dense(3.5, 40.0))]
df = spark.createDataFrame(data, ["label", "features"])

r = ChiSquareTest.test(df, "features", "label").head()
print("pValues: " + str(r.pValues))
print("degreesOfFreedom: " + str(r.degreesOfFreedom))
print("statistics: " + str(r.statistics))

pValues: [0.6872892787909721,0.6822703303362126]
degreesOfFreedom: [2, 3]
statistics: [0.75,1.5]
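
`ChiSquareTest` treats every feature value as a category and tests its independence from the label. As a sanity check, the statistics and degrees of freedom above can be recomputed with a small pure-Python sketch of Pearson's chi-square formula. This is an illustration only, not Spark's code; `chi_square` is a hypothetical helper and p-values are left out.

```python
from collections import Counter


def chi_square(feature, label):
    """Pearson chi-square statistic and degrees of freedom of a contingency table."""
    n = len(feature)
    observed = Counter(zip(feature, label))  # counts per (feature value, label)
    feature_totals = Counter(feature)
    label_totals = Counter(label)
    statistic = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            expected = f_count * l_count / n  # count expected under independence
            statistic += (observed[(f, l)] - expected) ** 2 / expected
    dof = (len(feature_totals) - 1) * (len(label_totals) - 1)
    return statistic, dof


# Same six observations as in the DataFrame above
labels = [0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
feature_0 = [0.5, 1.5, 1.5, 3.5, 3.5, 3.5]
feature_1 = [10.0, 20.0, 30.0, 30.0, 40.0, 40.0]

stat_0, dof_0 = chi_square(feature_0, labels)
stat_1, dof_1 = chi_square(feature_1, labels)
print(stat_0, dof_0)
print(stat_1, dof_1)
```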


## ML Pipeline

*SparkML* is based on the notion of **ML Pipeline**. 

An **ML Pipeline** combines the different steps of the ML process, from data cleaning and processing to learning, into a single object called a *pipeline* or *workflow*. 

### Estimator, Transformer, and Param

An **ML Pipeline** is built from three types of objects:


 * **Transformer**: a function which converts a *DataFrame* into another *DataFrame*. In most cases, the new *DataFrame* is the old one with one or more new columns. **Transformer** examples: 
     * A fitted ML model is a **Transformer**: its `transform` method takes a *DataFrame* as input and returns a new *DataFrame* with a prediction column. 
     * A fitted *StringIndexer* takes a *DataFrame* with a column of strings and returns a new *DataFrame* where these strings are converted to numerical indices.
 
* **Estimator**: an algorithm which is applied to a *DataFrame* in order to build a **Transformer**. **Estimator** example:
    * A learning algorithm is an **Estimator**. Once it has been fitted on a *DataFrame*, it produces a **Transformer** able to perform predictions.

* **Parameter**: an API shared by **Transformers** and **Estimators** to specify parameters.
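
The relationship between the first two objects can be sketched in a few lines of pure Python. This is a toy illustration under strong assumptions: a "DataFrame" is just a list of dictionaries, and `MeanCenterer` and `MeanCenterModel` are hypothetical classes, not part of `pyspark`.

```python
class MeanCenterModel:
    """Transformer: adds a centered copy of a column, using a mean learned earlier."""

    def __init__(self, input_col, output_col, mean):
        self.input_col = input_col
        self.output_col = output_col
        self.mean = mean

    def transform(self, rows):
        # Return new rows: the input rows are left untouched, one column is added
        return [dict(row, **{self.output_col: row[self.input_col] - self.mean})
                for row in rows]


class MeanCenterer:
    """Estimator: learns the mean of a column and returns a fitted Transformer."""

    def __init__(self, input_col, output_col):
        # Column names play the role of parameters shared by both objects
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        mean = sum(row[self.input_col] for row in rows) / len(rows)
        return MeanCenterModel(self.input_col, self.output_col, mean)


toy_rows = [{"x": 1.0}, {"x": 2.0}, {"x": 6.0}]
toy_estimator = MeanCenterer(input_col="x", output_col="x_centered")
toy_model = toy_estimator.fit(toy_rows)     # Estimator -> Transformer
toy_result = toy_model.transform(toy_rows)  # DataFrame -> DataFrame with a new column
print(toy_result)
```

The same two-step pattern, `fit` then `transform`, is the one used with `LogisticRegression` in the example that follows.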

#### Example : Logistic Regression

In [17]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

training = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])


test = spark.createDataFrame([
    (1.0, Vectors.dense([-1.0, 1.5, 1.3])),
    (0.0, Vectors.dense([3.0, 2.0, -0.1])),
    (1.0, Vectors.dense([0.0, 2.2, -1.5]))], ["label", "features"])

We create a  `LogisticRegression` **Estimator**: `lr`.


In [18]:
lr = LogisticRegression(maxIter=10, regParam=0.01, featuresCol='features', labelCol='label', predictionCol='prediction', probabilityCol='probability')
lr.explainParams()

"aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)\nelasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)\nfamily: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)\nfeaturesCol: features column name. (default: features, current: features)\nfitIntercept: whether to fit an intercept term. (default: True)\nlabelCol: label column name. (default: label, current: label)\nlowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)\nlowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimizat

We fit the **Estimator** on the training *DataFrame*. This produces a **Transformer**.

In [19]:
model = lr.fit(training)

We then apply the **Transformer** to the test *DataFrame*.

In [20]:
prediction = model.transform(test)

The result is a new *DataFrame*, `prediction`, which is the test *DataFrame* with three new columns: `rawPrediction`, `probability` and `prediction`.

In [21]:
prediction

DataFrame[label: double, features: vector, rawPrediction: vector, probability: vector, prediction: double]

The `probability` and `prediction` column names are the ones specified when defining the `lr` **Estimator** used to build the `model` **Transformer**; `rawPrediction` is a default name.

Results:

In [22]:
result = prediction.select("features", "label", "probability", "prediction").collect()
for row in result:
    print("features=%s, label=%s -> prob=%s, prediction=%s"
          % (row.features, row.label, row.probability, row.prediction))

features=[-1.0,1.5,1.3], label=1.0 -> prob=[0.0013759947069214283,0.9986240052930786], prediction=1.0
features=[3.0,2.0,-0.1], label=0.0 -> prob=[0.9816604009374171,0.018339599062582975], prediction=0.0
features=[0.0,2.2,-1.5], label=1.0 -> prob=[0.0016981475578358419,0.9983018524421641], prediction=1.0
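
Each `probability` vector holds the estimated probabilities of class 0 and class 1. Assuming the default decision threshold, the predicted label is simply the most probable class. The short sketch below, in plain Python with the values copied from the output above, reproduces the `prediction` column and compares it to the labels.

```python
# Probability vectors and labels copied from the output above
probabilities = [
    [0.0013759947069214283, 0.9986240052930786],
    [0.9816604009374171, 0.018339599062582975],
    [0.0016981475578358419, 0.9983018524421641],
]
labels = [1.0, 0.0, 1.0]

# Predicted class = index of the largest probability
predictions = [float(max(range(len(p)), key=p.__getitem__)) for p in probabilities]

# Fraction of test rows where the prediction matches the label
accuracy = sum(p == l for p, l in zip(predictions, labels)) / len(labels)

print(predictions)
print(accuracy)
```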


### Pipeline

A **Pipeline** chains different **Transformers** and **Estimators** in order to specify a complete ML workflow. 

To perform text classification, we will apply the following successive steps:

 * Convert the text into a list of words (Tokenizer)
 * Convert the list of words into numerical values (Hashing TF)
 * Training
 * Prediction

All these steps will be gathered in a single **Pipeline** object.

**NB** Tokenizer and Hashing TF are used here without explanation. They will be studied in detail in the third lab of the AIF module.

#### Example : Tokenize, Hash and logistic regression.

In [23]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Training DataFrame
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])


# Test DataFrame
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])


*Tokenizer* is a **Transformer**: it needs no fitting step and can be applied directly to a *DataFrame* with `transform`.

In [24]:
tokenizer = Tokenizer(inputCol="text", outputCol="words")

In [25]:
df_tokenized = tokenizer.transform(training)
df_tokenized.select("words").take(4)

[Row(words=['a', 'b', 'c', 'd', 'e', 'spark']),
 Row(words=['b', 'd']),
 Row(words=['spark', 'f', 'g', 'h']),
 Row(words=['hadoop', 'mapreduce'])]

*Hashing TF* is also a **Transformer**: it maps each list of words to a sparse vector of term frequencies.

In [26]:
hashingTF = HashingTF(inputCol="words", outputCol="features")

In [27]:
df_hash= hashingTF.transform(df_tokenized)
df_hash.select("features").take(4)

[Row(features=SparseVector(262144, {17222: 1.0, 27526: 1.0, 28698: 1.0, 30913: 1.0, 227410: 1.0, 234657: 1.0})),
 Row(features=SparseVector(262144, {27526: 1.0, 30913: 1.0})),
 Row(features=SparseVector(262144, {15554: 1.0, 24152: 1.0, 51505: 1.0, 234657: 1.0})),
 Row(features=SparseVector(262144, {42633: 1.0, 155117: 1.0}))]
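
The idea behind these vectors (the "hashing trick") can be sketched in pure Python: each word is hashed to an index between 0 and 262143, and the vector stores how many times each index is hit. This is an illustration only. It assumes CRC32 as a stand-in hash, whereas Spark relies on a different hash function, so the indices below will not match the output above. `term_index` and `hashing_tf` are hypothetical helper names.

```python
import zlib
from collections import Counter

NUM_FEATURES = 262144  # 2**18, the vector size visible in the output above


def term_index(term, num_features=NUM_FEATURES):
    """Map a word to a column index (CRC32 is an assumption, not Spark's hash)."""
    return zlib.crc32(term.encode("utf-8")) % num_features


def hashing_tf(words, num_features=NUM_FEATURES):
    """Return the sparse term-frequency vector as an {index: count} dictionary."""
    counts = Counter(term_index(word, num_features) for word in words)
    return {index: float(count) for index, count in sorted(counts.items())}


tf = hashing_tf(["spark", "hadoop", "spark"])
print(tf)
```

No vocabulary has to be learned beforehand, which is why this step needs no `fit`. The price is that two different words may collide on the same index.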

The steps above can be combined in a **Pipeline** along with the training step.

In [28]:
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)

In [29]:
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

Apply all steps on the training dataframe.

In [30]:
model = pipeline.fit(training)
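
Roughly speaking, fitting a pipeline runs the stages in order: a **Transformer** is applied as is, while an **Estimator** is first fitted and then replaced by the **Transformer** it produces. The pure-Python sketch below illustrates this logic on lists of dictionaries. It is a simplification, not Spark's implementation: all class names are hypothetical, and a naive keyword rule stands in for logistic regression.

```python
class SplitWords:
    """Transformer: adds a 'words' column by splitting the 'text' column."""

    def transform(self, rows):
        return [dict(row, words=row["text"].split()) for row in rows]


class KeywordModel:
    """Transformer: predicts 1.0 when a row contains one of the learned keywords."""

    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [dict(row, prediction=1.0 if self.keywords & set(row["words"]) else 0.0)
                for row in rows]


class KeywordLearner:
    """Estimator: keeps the words seen only in positive examples (toy classifier)."""

    def fit(self, rows):
        positive = {w for row in rows if row["label"] == 1.0 for w in row["words"]}
        negative = {w for row in rows if row["label"] == 0.0 for w in row["words"]}
        return KeywordModel(positive - negative)


class SketchPipelineModel:
    """Fitted pipeline: a plain chain of Transformers."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class SketchPipeline:
    """Unfitted pipeline: a list of Transformers and Estimators."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)   # Estimator -> Transformer
            rows = stage.transform(rows)  # feed the next stage
            fitted.append(stage)
        return SketchPipelineModel(fitted)


toy_train = [{"text": "a b c d e spark", "label": 1.0},
             {"text": "b d", "label": 0.0},
             {"text": "spark f g h", "label": 1.0},
             {"text": "hadoop mapreduce", "label": 0.0}]
toy_test = [{"text": "spark i j k"}, {"text": "l m n"},
            {"text": "spark hadoop spark"}, {"text": "apache hadoop"}]

toy_pipeline_model = SketchPipeline([SplitWords(), KeywordLearner()]).fit(toy_train)
toy_predictions = [row["prediction"] for row in toy_pipeline_model.transform(toy_test)]
print(toy_predictions)
```

The fitted object only contains Transformers, so it can be applied to new data that has no `label` column.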

Prediction on the test *DataFrame*:

In [31]:
prediction = model.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    # Use `pred` so the loop does not overwrite the `prediction` DataFrame
    rid, text, prob, pred = row
    print("(%d, %s) --> prob=%s, prediction=%f" % (rid, text, str(prob), pred))

(4, spark i j k) --> prob=[0.15964077387874742,0.8403592261212527], prediction=1.000000
(5, l m n) --> prob=[0.8378325685476743,0.16216743145232568], prediction=0.000000
(6, spark hadoop spark) --> prob=[0.06926633132976034,0.9307336686702395], prediction=1.000000
(7, apache hadoop) --> prob=[0.9821575333444218,0.01784246665557808], prediction=0.000000


**Exercise** As in the second notebook, use the SparkML library to build an ML model to predict attacks on the [KDD Cup 1999](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) dataset.


Try to use various **Transformers** within the pipeline you build (see https://spark.apache.org/docs/latest/ml-features.html): a one-hot encoder to use string variables, PCA to reduce the number of features, different methods to scale the data, etc.