# Tuning Machine Learning models in Spark

<a href = "http://yogen.io"><img src="http://yogen.io/assets/logo.svg" alt="yogen" style="width: 200px; float: right;"/></a>

## ML Pipelines in Spark

Training and tuning an ML model often means running the same steps over and over, usually with small variations, in order to evaluate different combinations of parameters.

To make this use case a lot easier, Spark provides the [Pipeline](https://spark.apache.org/docs/2.2.0/ml-pipeline.html) abstraction.

A Pipeline represents a series of steps in the processing of a dataset. Each step is a Transformer or an Estimator. The whole Pipeline is itself an Estimator, so we can .fit the whole pipeline in one step. When we do that, the steps run in order: each Estimator is fit and the resulting model transforms the data for the next step, while each Transformer simply has its .transform method called.

![pipelineestimator](https://spark.apache.org/docs/2.3.0/img/ml-Pipeline.png)

![PipelineModel](https://spark.apache.org/docs/2.3.0/img/ml-PipelineModel.png)

## Example: predicting flight delays

We'll be using the same [Transtats](https://www.transtats.bts.gov/) on-time performance (OTP) data from way back when. Remember it?

### Load the data

In [3]:
from pyspark.sql import SparkSession, types, functions
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

path_to_file = '/home/dsc/Data/us_dot/otp/On_Time_On_Time_Performance_2015_8.zip'
csv_filename = 'On_Time_On_Time_Performance_2015_8.csv'

!unzip -o {path_to_file} {csv_filename} -d .

columns_of_interest = ['FlightDate', 'Year', 'Month', 'DayofMonth', 'DayOfWeek',
                       'Carrier', 'TailNum', 'FlightNum', 'Origin', 'OriginCityName',
                       'OriginStateName', 'Dest', 'DestCityName', 'DestStateName',
                       'DepTime', 'DepDelay', 'Distance']

Archive:  /home/dsc/Data/us_dot/otp/On_Time_On_Time_Performance_2015_8.zip
  inflating: ./On_Time_On_Time_Performance_2015_8.csv  


In [12]:
session = SparkSession.builder.getOrCreate()
flights = session.read.csv(csv_filename, header=True, inferSchema=True)
flights = flights.select(columns_of_interest)
flights.show(5)

+-------------------+----+-----+----------+---------+-------+-------+---------+------+--------------+---------------+----+---------------+-------------+-------+--------+--------+
|         FlightDate|Year|Month|DayofMonth|DayOfWeek|Carrier|TailNum|FlightNum|Origin|OriginCityName|OriginStateName|Dest|   DestCityName|DestStateName|DepTime|DepDelay|Distance|
+-------------------+----+-----+----------+---------+-------+-------+---------+------+--------------+---------------+----+---------------+-------------+-------+--------+--------+
|2015-08-02 00:00:00|2015|    8|         2|        7|     AA| N790AA|        1|   JFK|  New York, NY|       New York| LAX|Los Angeles, CA|   California|    854|    -6.0|  2475.0|
|2015-08-03 00:00:00|2015|    8|         3|        1|     AA| N784AA|        1|   JFK|  New York, NY|       New York| LAX|Los Angeles, CA|   California|    858|    -2.0|  2475.0|
|2015-08-04 00:00:00|2015|    8|         4|        2|     AA| N793AA|        1|   JFK|  New York, NY|    

In [None]:
# We are going to build an ML model: we will predict whether a flight will be delayed or not

### Drop NAs

In [13]:
flights = flights.na.drop()

### Feature extraction and generation of target variable

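The departure hour is the most important factor in delays, so we need to calculate it. `DepTime` is stored as an hhmm number (854 means 08:54), so dividing by 100 and truncating gives the hour.

We'll also generate a binary target variable: a flight counts as delayed when it departs more than 15 minutes late.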

In [14]:
flights.select('DepTime').show(5)

+-------+
|DepTime|
+-------+
|    854|
|    858|
|    902|
|    857|
|    857|
+-------+
only showing top 5 rows



In [17]:
flights = flights.withColumn('DepHour', (flights['DepTime']/100).cast(types.IntegerType()))

In [18]:
flights.show(5)

+-------------------+----+-----+----------+---------+-------+-------+---------+------+--------------+---------------+----+---------------+-------------+-------+--------+--------+-------+
|         FlightDate|Year|Month|DayofMonth|DayOfWeek|Carrier|TailNum|FlightNum|Origin|OriginCityName|OriginStateName|Dest|   DestCityName|DestStateName|DepTime|DepDelay|Distance|DepHour|
+-------------------+----+-----+----------+---------+-------+-------+---------+------+--------------+---------------+----+---------------+-------------+-------+--------+--------+-------+
|2015-08-02 00:00:00|2015|    8|         2|        7|     AA| N790AA|        1|   JFK|  New York, NY|       New York| LAX|Los Angeles, CA|   California|    854|    -6.0|  2475.0|      8|
|2015-08-03 00:00:00|2015|    8|         3|        1|     AA| N784AA|        1|   JFK|  New York, NY|       New York| LAX|Los Angeles, CA|   California|    858|    -2.0|  2475.0|      8|
|2015-08-04 00:00:00|2015|    8|         4|        2|     AA| N79

In [21]:
flights = flights.withColumn('Delayed', (flights['DepDelay'] > 15).cast(types.IntegerType()))

In [22]:
flights.show(5)

+-------------------+----+-----+----------+---------+-------+-------+---------+------+--------------+---------------+----+---------------+-------------+-------+--------+--------+-------+-------+
|         FlightDate|Year|Month|DayofMonth|DayOfWeek|Carrier|TailNum|FlightNum|Origin|OriginCityName|OriginStateName|Dest|   DestCityName|DestStateName|DepTime|DepDelay|Distance|DepHour|Delayed|
+-------------------+----+-----+----------+---------+-------+-------+---------+------+--------------+---------------+----+---------------+-------------+-------+--------+--------+-------+-------+
|2015-08-02 00:00:00|2015|    8|         2|        7|     AA| N790AA|        1|   JFK|  New York, NY|       New York| LAX|Los Angeles, CA|   California|    854|    -6.0|  2475.0|      8|      0|
|2015-08-03 00:00:00|2015|    8|         3|        1|     AA| N784AA|        1|   JFK|  New York, NY|       New York| LAX|Los Angeles, CA|   California|    858|    -2.0|  2475.0|      8|      0|
|2015-08-04 00:00:00|2015

In [23]:
# Split the data into two DataFrames with 10% and 90% of the rows; the first one is used for training.
flights_sample, rest = flights.randomSplit([.1,.9])

In [27]:
flights_sample.schema.fields[0]

StructField(FlightDate,TimestampType,true)

### Handle different fields in different ways

In [25]:
categorical_fields = ['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'Carrier',
                      'Origin', 'OriginCityName', 'OriginStateName',
                      'Dest', 'DestCityName', 'DestStateName']

string_fields = [field.name for field in flights_sample.schema.fields if field.dataType == types.StringType()]

continuous_fields = ['Distance', 'DepHour']

target_field = 'Delayed'

### Handling categorical fields


#### StringIndexer, OneHotEncoderEstimator



In [39]:
# help(StringIndexer)
# The handleInvalid value is 'keep', 'skip' or ..
carrier_indexer = StringIndexer(inputCol='Carrier', outputCol='CarrierIndex', handleInvalid='keep')

In [40]:
carriers = flights_sample.select('Carrier')

In [41]:
carrier_indexer_transformer = carrier_indexer.fit(carriers)

In [42]:
carrier_indexer_transformer.transform(carriers).show(5)

+-------+------------+
|Carrier|CarrierIndex|
+-------+------------+
|     AA|         2.0|
|     AA|         2.0|
|     AA|         2.0|
|     AA|         2.0|
|     AA|         2.0|
+-------+------------+
only showing top 5 rows



In [44]:
onehot_encoder = OneHotEncoder(inputCol='CarrierIndex', outputCol='CarrierOneHot')

#### SparseVectors

## Let's build our first Pipeline!

Our pipeline consists of a number of StringIndexers, followed by one OneHotEncoder (named OneHotEncoderEstimator in Spark 2.3 and 2.4), followed by a VectorAssembler, with a RandomForestClassifier at the end.

### StringIndexer steps

### OneHotEncoder

A single OneHotEncoder is enough to process all the categorical columns, because it accepts a list of input columns and a list of output columns.

### VectorAssembler

### RandomForestClassifier

### Pipeline!

## Evaluating and tuning our Pipeline

### Params and Evaluators

### Let's have a look

### Further Reading

https://spark.apache.org/docs/2.2.0/ml-tuning.html

https://stackoverflow.com/questions/28569788/how-to-open-stream-zip-files-through-spark