**Steps:**
* One-Hot-Encode the label (dim=3)
* Split the dataset into training and validation sets (80%/20%)
* Scale the features to take values between 0 and 1
    * fit the scaler on the training set only, then apply it to the validation set as well
* Build the NN:
    * 3 hidden layers with 50, 20 and 10 nodes activated by *ReLU*
    * Output layer with 3 nodes and *Softmax* activation
    * Use *categorical crossentropy* as a loss
    * Ask Maurizio for the optimizer, weight initialization, regularization, dropout
    * For now we can use *Adam* and leave everything else as default
* Create the trainer
    * AEASGD currently gives the best performance
* Train the model!
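
The scaling step above (fit on the training set, then apply the same scaler to the validation set) can be illustrated without Spark. The following is a minimal pure-Python sketch, not the code used in this notebook; `fit_min_max`, `apply_min_max` and the toy rows are hypothetical names and data chosen for illustration. Mapping a constant feature to 0.5 is meant to mirror how Spark's `MinMaxScaler` treats the case where the minimum equals the maximum with the default [0, 1] range.

```python
def fit_min_max(rows):
    """Return per-feature (min, max) pairs computed from the given rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, stats):
    """Rescale each feature with previously fitted (min, max) statistics."""
    scaled = []
    for row in rows:
        scaled_row = []
        for value, (lo, hi) in zip(row, stats):
            if hi == lo:
                # Constant feature: use the middle of the [0, 1] range
                scaled_row.append(0.5)
            else:
                scaled_row.append((value - lo) / (hi - lo))
        scaled.append(scaled_row)
    return scaled


# Toy data (hypothetical): two features per row
train_rows = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
validation_rows = [[2.5, 40.0]]

# Statistics come from the training rows only
stats = fit_min_max(train_rows)
train_scaled = apply_min_max(train_rows, stats)
validation_scaled = apply_min_max(validation_rows, stats)

print(train_scaled)
print(validation_scaled)
```

Because the statistics come from the training rows only, a validation value outside the training range (40.0 here) is scaled to a value outside [0, 1]. That is expected and avoids leaking validation information into the preprocessing.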

**Optional**
* ... Example of cross-validation using Spark?

## Create a Spark Session

In [1]:
import findspark
findspark.init('/afs/cern.ch/work/m/migliori/public/spark2.3.1')

In [2]:
application_name = 'dist-keras-notebook'
master = "local[*]"

In [4]:
from pyspark.sql import SparkSession
import os

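## Make the spark-root package available; this must be set before the session is created
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages org.diana-hep:spark-root_2.11:0.1.14 pyspark-shell"

## Use the application name and master defined above
spark = SparkSession.builder\
        .appName(application_name)\
        .master(master)\
        .config("spark.driver.memory", "15G")\
        .getOrCreate()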

## Load the dataframe

In [5]:
HLF_dataset = spark.read.format("parquet").load("HLF_dataset.parquet")
HLF_dataset.count()

316712

In [6]:
HLF_dataset.show(5)

+--------------------+-----+
|           hfeatures|label|
+--------------------+-----+
|[405.151287078857...|    1|
|[53.2725028991699...|    1|
|[323.423492431640...|    1|
|[228.913959503173...|    1|
|[149.133995056152...|    1|
+--------------------+-----+
only showing top 5 rows



## Prepare the features

In [7]:
## Convert hfeatures to a dense vector
## The column holds plain lists, but the ML stages below need a DenseVector
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

vector_dense_udf = udf(lambda r: Vectors.dense(r), VectorUDT())
HLF_dataset = HLF_dataset.withColumn('dense_features', vector_dense_udf('hfeatures'))

In [8]:
## Create train and test dataframes (80%/20%), with a fixed seed for reproducibility
train, test = HLF_dataset.randomSplit([0.8, 0.2], seed=42)

In [9]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator
from pyspark.ml.feature import MinMaxScaler

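## One-Hot-Encode the label
## Note: dropLast defaults to True, so the 3 classes are encoded as a vector of size 2;
## pass dropLast=False to get a vector of size 3
encoder = OneHotEncoderEstimator(inputCols=["label"], outputCols=["encoded_label"])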

## Scale each feature to the [0, 1] range
scaler = MinMaxScaler(inputCol="dense_features", outputCol="features")

pipeline = Pipeline(stages=[encoder, scaler])

## Fit on the training set only
fitted_pipeline = pipeline.fit(train)

In [10]:
## Transform train and test with the pipeline fitted on train
train = fitted_pipeline.transform(train)
test = fitted_pipeline.transform(test)

In [11]:
train = train.selectExpr('features', 'encoded_label as label')
test = test.selectExpr('features', 'encoded_label as label')

train.show(5)

+--------------------+-------------+
|            features|        label|
+--------------------+-------------+
|[0.0,0.0155206299...|(2,[1],[1.0])|
|[0.0,0.0190699355...|(2,[1],[1.0])|
|[0.0,0.0190699355...|(2,[1],[1.0])|
|[0.0,0.0200420945...|(2,[1],[1.0])|
|[0.0,0.0200420945...|(2,[1],[1.0])|
+--------------------+-------------+
only showing top 5 rows
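
The `label` entries above are sparse vectors: `(2,[1],[1.0])` means size 2 with the value 1.0 at index 1, i.e. the dense vector `[0.0, 1.0]`. The size is 2 rather than 3 because `OneHotEncoderEstimator` drops the last category by default (`dropLast=True`), so that category is encoded as all zeros. The sketch below imitates this behaviour in plain Python to show the effect on the vector size. `one_hot` is a hypothetical helper written for illustration, not part of Spark.

```python
def one_hot(label, num_classes, drop_last=True):
    """Return a dense one-hot list for an integer class label.

    With drop_last=True the vector has num_classes - 1 entries and the
    last class is represented by all zeros.
    """
    size = num_classes - 1 if drop_last else num_classes
    vector = [0.0] * size
    if label < size:
        vector[label] = 1.0
    return vector


# Three classes, last category dropped (vectors of size 2)
dropped = [one_hot(k, 3) for k in range(3)]
# Three classes, all categories kept (vectors of size 3)
full = [one_hot(k, 3, drop_last=False) for k in range(3)]

print(dropped)
print(full)
```

A softmax output layer with 3 nodes expects targets of size 3, so the encoding that keeps all categories is the one that matches the "dim=3" step listed at the top.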



## Build the Keras model