#### One Hot Encoding
    - Converts the category index into values that are mathematically meaningful to a model.
    - Steps:
        1. Create a column for each of the levels.
        2. These new columns are called dummy variables.
        3. Assign binary values: 1 indicates the presence of the level, 0 indicates its absence.
    - Storing every value is space-consuming and redundant because most entries are zero; use a sparse representation instead.
    - The sparse representation simply records the column numbers and values of the non-zero entries.
    - The process of creating the dummy variables is called "One-Hot Encoding".
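The steps above can be sketched in plain Python, without Spark. This is only an illustration: the level names and the helper `one_hot_sparse` are hypothetical, and dropping the last level (so that it becomes an all-zero reference level) is an assumption that matches the reference-level behaviour described later in these notes.

```python
def one_hot_sparse(index, num_levels, drop_last=True):
    """Return (size, indices, values) for the dummy vector of one category index."""
    size = num_levels - 1 if drop_last else num_levels
    if index < size:
        return (size, [index], [1.0])
    # The dropped (reference) level is represented by an all-zero vector
    return (size, [], [])


# Hypothetical category levels, already ordered by their index
levels = ["Midsize", "Small", "Compact", "Sporty", "Large", "Van"]
index_of = {level: i for i, level in enumerate(levels)}

encoded = {level: one_hot_sparse(index_of[level], len(levels)) for level in levels}
for level, vector in encoded.items():
    print(level, vector)
```

Each level is stored as a vector size plus the position of its single non-zero entry, which is why the sparse form stays small even with many levels.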


In [1]:
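# Note: OneHotEncoderEstimator is the Spark 2.3/2.4 name; in Spark 3.0+ the class is called OneHotEncoder
from pyspark.ml.feature import OneHotEncoderEstimator
from pyspark import SparkContext
sc = SparkContext()
# Create the encoder: it reads the indexed column and writes the dummy-variable column
onehot = OneHotEncoderEstimator(inputCols=['type_idx'], outputCols=['type_dummy'])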


In [None]:
# Fit the encoder to the data
onehot = onehot.fit(cars)

# How many category levels?
onehot.categorySizes

# Apply the encoder to the data
cars = onehot.transform(cars)
cars.select('type', 'type_idx', 'type_dummy').distinct().sort('type_idx').show()

#### Dense vs Sparse format
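    1. A dense vector stores every element, including the zeros.
    2. A sparse vector stores only the length, the indices of the non-zero elements and their values.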

In [4]:
from pyspark.mllib.linalg import DenseVector, SparseVector

# Store the vector
DenseVector([1,0,0,0,0,7,0,0])



DenseVector([1.0, 0.0, 0.0, 0.0, 0.0, 7.0, 0.0, 0.0])

In [5]:
# Sparse representation of storing vectors
SparseVector(8, [0,5], [1,7])

SparseVector(8, {0: 1.0, 5: 7.0})
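The relationship between the two formats can be sketched in plain Python. The helpers `to_sparse` and `to_dense` are hypothetical names used only for illustration; they are not part of the Spark API.

```python
def to_sparse(dense):
    """Convert a dense list into (size, indices, values) of its non-zero entries."""
    indices = [i for i, value in enumerate(dense) if value != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the dense list from its sparse description."""
    dense = [0.0] * size
    for i, value in zip(indices, values):
        dense[i] = value
    return dense


size, indices, values = to_sparse([1.0, 0.0, 0.0, 0.0, 0.0, 7.0, 0.0, 0.0])
print(size, indices, values)
print(to_dense(size, indices, values))
```

The sparse form of the example vector needs only the length 8, the two positions 0 and 5, and the two values 1.0 and 7.0.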

In [None]:
# Import the one hot encoder class
from pyspark.ml.feature import OneHotEncoderEstimator

# Create an instance of the one hot encoder
onehot = OneHotEncoderEstimator(inputCols=['org_idx'], outputCols=['org_dummy'])

# Apply the one hot encoder to the flights data
onehot = onehot.fit(flights)
flights_onehot = onehot.transform(flights)

# Check the results
flights_onehot.select('org', 'org_idx', 'org_dummy').distinct().sort('org_idx').show()

In [6]:
shirt_size = {'S': 8, 'M': 18, 'L': 20, 'XL': 7}
# XL is the least frequent level (7), so it presumably gets the last index and no dummy column

#### Regression 
    - How do we build regression models to predict numerical values?
    - The model should describe the average value of the target for any given values of the predictors.
        1. Find the residuals (the distance between each observed value and its corresponding predicted value).
        2. Loss function = MSE: $\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
        3. Take the predictor columns and assemble them into a single features column.
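As a small worked example of the residuals and the loss function, the following plain-Python sketch uses made-up observed and predicted values (they are not taken from the cars or flights data).

```python
import math

# Hypothetical observed values and model predictions
observed = [10.0, 12.0, 9.0, 15.0]
predicted = [11.0, 11.0, 9.0, 13.0]

# Residual: observed value minus predicted value
residuals = [y - y_hat for y, y_hat in zip(observed, predicted)]

# Mean squared error: the average of the squared residuals
mse = sum(r ** 2 for r in residuals) / len(residuals)

# Root mean squared error is in the same units as the target
rmse = math.sqrt(mse)

print(residuals, mse, rmse)
```

Squaring the residuals stops positive and negative errors from cancelling and penalises large errors more heavily than small ones.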

In [7]:
from pyspark.ml.regression import LinearRegression
regression  = LinearRegression(labelCol = 'consumption')

# train the model using fit
regression = regression.fit(cars_train)

# make predictions on the testing data using transform
predictions  = regression.transform(cars_test)

# Calculate the Root Mean Squared Error (RMSE)
from pyspark.ml.evaluation import RegressionEvaluator

# RMSE is the evaluator's default metric
RegressionEvaluator(labelCol='consumption').evaluate(predictions)

        

#### Regression Evaluator.
    RegressionEvaluator can also calculate the following metrics:
    - mae (Mean Absolute Error)
    - r2 (R², the coefficient of determination)
    - mse (Mean Squared Error)
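In Spark the metric is selected through the evaluator's `metricName` parameter. To show what the metrics themselves measure, here is a plain-Python sketch of MAE and R² on made-up values; the function names `mae` and `r2` are hypothetical helpers, not Spark calls.

```python
def mae(observed, predicted):
    """Mean absolute error: the average size of the residuals."""
    return sum(abs(y - p) for y, p in zip(observed, predicted)) / len(observed)


def r2(observed, predicted):
    """R squared: 1 minus (residual sum of squares / total sum of squares)."""
    mean = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


# Hypothetical observed values and model predictions
observed = [10.0, 12.0, 9.0, 15.0]
predicted = [11.0, 11.0, 9.0, 13.0]

print(mae(observed, predicted))
print(r2(observed, predicted))
```

MAE is less sensitive to outliers than MSE, while R² compares the model against simply predicting the mean of the target.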


In [None]:
# Examine the intercept and coefficients
regression.intercept
regression.coefficients


In [None]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Create a regression object and train on training data
regression = LinearRegression(labelCol='duration').fit(flights_train)

# Create predictions for the testing data and take a look at the predictions
predictions = regression.transform(flights_test)
predictions.select('duration', 'prediction').show(5, False)

# Calculate the RMSE
RegressionEvaluator(labelCol='duration').evaluate(predictions)

In [None]:
# Intercept (average minutes on ground)
inter = regression.intercept
print(inter)

# Coefficients
coefs = regression.coefficients
print(coefs)

# Average minutes per km
minutes_per_km = regression.coefficients[0]
print(minutes_per_km)

# Average speed in km per hour
avg_speed = 60 / minutes_per_km
print(avg_speed)

#### Interpreting coefficients

Remember that origin airport, org, has eight possible values (ORD, SFO, JFK, LGA, SMF, SJC, TUS and OGG) which have been one-hot encoded to seven dummy variables in org_dummy.

The values for km and org_dummy have been assembled into features, which has eight columns with sparse representation. Column indices in features are as follows:

- 0 — km
- 1 — ORD
- 2 — SFO
- 3 — JFK
- 4 — LGA
- 5 — SMF
- 6 — SJC and
- 7 — TUS.
Note that OGG does not appear in this list because it is the reference level for the origin airport category.

In this exercise you'll be using the intercept and coefficients attributes to interpret the model.
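The interpretation can be sketched with made-up numbers before looking at the real model. The intercept and coefficient values below are hypothetical and chosen only to make the arithmetic easy to follow; only the index layout comes from the list above.

```python
# Hypothetical fitted values (NOT the real model output)
intercept = 15.0
coefficients = [0.075, 28.0, 20.0, 52.0, 46.0, 18.0, 15.0, 17.0]

# Index 0 is km; indices 1 to 7 are the airport dummy variables
airports = ["ORD", "SFO", "JFK", "LGA", "SMF", "SJC", "TUS"]

# The reference level (OGG) has no dummy, so its ground time is the intercept alone
ground_minutes = {"OGG": intercept}
for index, airport in enumerate(airports, start=1):
    ground_minutes[airport] = intercept + coefficients[index]

# The km coefficient is minutes per km, so 60 / coefficient is km per hour
avg_speed = 60 / coefficients[0]

print(ground_minutes)
print(avg_speed)
```

Each airport coefficient is therefore a difference relative to OGG, not an absolute time on the ground.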



In [None]:
# Average speed in km per hour
avg_speed_hour = 60/regression.coefficients[0]
print(avg_speed_hour)

# Average minutes on ground at OGG
inter = regression.intercept
print(inter)

# Average minutes on ground at JFK
avg_ground_jfk = inter + regression.coefficients[3]
print(avg_ground_jfk)

# Average minutes on ground at LGA
avg_ground_lga = inter + regression.coefficients[4]
print(avg_ground_lga)

#### Bucketing and Engineering
     - It is often convenient to convert a continuous variable (such as age or time) into discrete values, which can be done using buckets or bins.
     - Example: bucketing heights
         - Plot height against age (age vs height).
         - As age varies, height varies too, and observations with similar heights fall into the same bucket. Grouping values into such ranges is called bucketing.
         - Heights can be segregated into short, average and tall.
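The height example can be sketched in plain Python using the standard `bisect` module. The split points (in cm) and the helper `bucket_index` are hypothetical; the boundary rule (each bucket includes its lower edge, and the last bucket also includes its upper edge) is an assumption intended to mirror how Spark's Bucketizer treats splits.

```python
from bisect import bisect_right


def bucket_index(value, splits):
    """Return the index of the bucket [splits[i], splits[i+1]) containing value."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the range covered by the splits")
    if value == splits[-1]:
        # The last bucket also includes its upper boundary
        return len(splits) - 2
    return bisect_right(splits, value) - 1


# Hypothetical split points in cm and the label for each bucket
height_splits = [0, 160, 180, 250]
labels = ["short", "average", "tall"]

heights = [150, 160, 179.5, 185, 250]
categories = [labels[bucket_index(h, height_splits)] for h in heights]
print(categories)
```

With N split points there are N - 1 buckets, so the three labels here need four splits.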

In [11]:
# Import the Bucketizer class from the pyspark.ml.feature module
from pyspark.ml.feature import Bucketizer

# Create buckets using splits
bucketizer = Bucketizer(splits=[3500, 4500, 6000, 6500], inputCol='rpm', outputCol='rpm_bin')


In [None]:
# Apply buckets to the data
cars = bucketizer.transform(cars)

# Inspect the result
cars.select('rpm', 'rpm_bin').show(5)

# Count the records in each bucket
cars.groupby('rpm_bin').count().show()

# One-hot encoded RPM buckets

# Intercept and coefficient of the model
regression.coefficients
regression.intercept

#### More Feature Engineering
    Operations on a single column:
        log()
        sqrt()
        pow()
    Operations on two columns:
        product
        ratio
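These operations can be illustrated on a single record in plain Python. The record and the feature names are hypothetical; in Spark the same arithmetic would be applied column-wise.

```python
import math

# Hypothetical record with two numeric columns
car = {"mass": 1600.0, "length": 4.0}

engineered = {
    # Operations on a single column
    "log_mass": math.log(car["mass"]),
    "sqrt_mass": math.sqrt(car["mass"]),
    "length_squared": math.pow(car["length"], 2),
    # Operations on two columns
    "mass_times_length": car["mass"] * car["length"],  # product
    "density_line": car["mass"] / car["length"],       # ratio
}

print(engineered)
```

A ratio such as mass per unit length can carry information that neither column expresses on its own.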

In [None]:
# Engineering density
cars = cars.withColumn('density_line', cars.mass / cars.length)     # linear density
cars = cars.withColumn('density_quad', cars.mass / cars.length**2)  # area density
cars = cars.withColumn('density_cube', cars.mass / cars.length**3)  # volume density

# Which engineered feature is useful is not obvious in advance, so try several candidates.
# In this case only length is used: basic maths (squaring and cubing it) stands in for height and width.
# To decide which candidate works best, compare the resulting models on test data, e.g. by RMSE
# (a confusion matrix applies to classification, not to regression).


#### Bucketing departure time

Time of day data are a challenge with regression models. They are also a great candidate for bucketing.

In this lesson you will convert the flight departure times from numeric values between 0 (corresponding to 00:00) and 24 (corresponding to 24:00) to binned values. You'll then take those binned values and one-hot encode them.
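The two steps (binning, then one-hot encoding) can be sketched in plain Python. The helpers `depart_bucket` and `dummy_vector` are hypothetical, and dropping the last bucket as the reference level is an assumption consistent with the one-hot encoding behaviour described in these notes.

```python
def depart_bucket(hour, width=3):
    """Map a departure time in hours (0 to 24) to a 3-hour bucket index."""
    if not 0 <= hour <= 24:
        raise ValueError("hour must be between 0 and 24")
    last_bucket = 24 // width - 1
    # 24:00 falls into the last bucket rather than starting a new one
    return min(int(hour // width), last_bucket)


def dummy_vector(bucket, num_levels=8):
    """One-hot encode a bucket index, dropping the last level as the reference."""
    vector = [0.0] * (num_levels - 1)
    if bucket < num_levels - 1:
        vector[bucket] = 1.0
    return vector


for hour in [0.0, 9.5, 22.5, 24.0]:
    bucket = depart_bucket(hour)
    print(hour, bucket, dummy_vector(bucket))
```

Eight buckets produce seven dummy variables, and flights departing in the last bucket (21:00 to 24:00) are encoded as all zeros.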

In [None]:
from pyspark.ml.feature import Bucketizer, OneHotEncoderEstimator

# Create buckets at 3 hour intervals through the day
buckets = Bucketizer(splits=[0, 3, 6, 9, 12, 15, 18, 21, 24], inputCol='depart', outputCol='depart_bucket')

# Bucket the departure times
bucketed = buckets.transform(flights)
bucketed.select('depart', 'depart_bucket').show(5)

# Create a one-hot encoder
onehot = OneHotEncoderEstimator(inputCols=['depart_bucket'], outputCols=['depart_dummy'])

# One-hot encode the bucketed departure times
flights_onehot = onehot.fit(bucketed).transform(bucketed)
flights_onehot.select('depart', 'depart_bucket', 'depart_dummy').show(5)

#### Flight duration model: Adding departure time

In the previous exercise the departure time was bucketed and converted to dummy variables. Now you're going to include those dummy variables in a regression model for flight duration.

The data are in flights. The km, org_dummy and depart_dummy columns have been assembled into features, where km is index 0, org_dummy runs from index 1 to 7 and depart_dummy from index 8 to 14.

The data have been split into training and testing sets and a linear regression model, regression, has been built on the training data. Predictions have been made on the testing data and are available as predictions.

In [None]:
# Find the RMSE on testing data
from pyspark.ml.evaluation import RegressionEvaluator
RegressionEvaluator(labelCol='duration').evaluate(predictions)

# Average minutes on ground at OGG for flights departing between 21:00 and 24:00
avg_eve_ogg = regression.intercept
print(avg_eve_ogg)

# Average minutes on ground at OGG for flights departing between 00:00 and 03:00
avg_night_ogg = regression.intercept + regression.coefficients[8]
print(avg_night_ogg)

# Average minutes on ground at JFK for flights departing between 00:00 and 03:00
avg_night_jfk = regression.intercept + regression.coefficients[3] + regression.coefficients[8]
print(avg_night_jfk)

#### Regularization
#### Features: only a few
 - Linear regression fits one coefficient per feature.
 - Many rows and few columns: well suited to linear regression.
 - Many columns and few rows: much more challenging.
 - Parsimonious model: one that has just the minimum required number of predictors.
 - To get there, select only the best set of columns.
 - The standard loss is $\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$.
 - Regularization adds a penalty term to the MSE, so the model is punished for having too many coefficients.
 
- Lasso regression: the penalty is proportional to the sum of the absolute values of the coefficients.
- Ridge regression: the penalty is proportional to the sum of the squares of the coefficients.
- The strength of the regularization is denoted by lambda:
- lambda = 0: no regularization (standard regression)
- lambda = infinity: complete regularization (all coefficients zero)
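The two penalty terms can be compared with a short plain-Python sketch. The function `penalty` and the coefficient values are hypothetical, and the formula is a simplified mix of the Lasso and Ridge terms; library implementations may scale the terms differently.

```python
def penalty(coefficients, lam, alpha):
    """Simplified regularization penalty.

    alpha = 1 gives the Lasso (L1) penalty, alpha = 0 gives the Ridge (L2) penalty.
    lam is the regularization strength (lambda).
    """
    l1 = sum(abs(beta) for beta in coefficients)
    l2 = sum(beta ** 2 for beta in coefficients)
    return lam * (alpha * l1 + (1 - alpha) * l2)


# Hypothetical model coefficients
coefficients = [3.0, -4.0, 0.0]

print(penalty(coefficients, lam=0.1, alpha=1))  # Lasso: 0.1 * (3 + 4 + 0)
print(penalty(coefficients, lam=0.1, alpha=0))  # Ridge: 0.1 * (9 + 16 + 0)
print(penalty(coefficients, lam=0.0, alpha=1))  # no regularization
```

The penalty is added to the MSE, so a larger lambda pushes the fitted coefficients towards zero.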


In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Use a vector assembler to combine the predictors into a single features column
assembler = VectorAssembler(inputCols=['mass','cyl','type_dummy','density_line','density_quad','density_cube'], outputCol='features')
cars = assembler.transform(cars)

# Linear regression (cars_train and cars_test are assumed to be a random split of cars)
regression = LinearRegression(labelCol='consumption').fit(cars_train)

# RMSE on testing data
RegressionEvaluator(labelCol='consumption').evaluate(regression.transform(cars_test))

# Examine the coefficients
regression.coefficients

# Ridge regression: set elasticNetParam = 0
ridge = LinearRegression(labelCol='consumption', elasticNetParam=0, regParam=0.1)
ridge = ridge.fit(cars_train)

# Lasso regression: set elasticNetParam = 1
lasso = LinearRegression(labelCol='consumption', elasticNetParam=1, regParam=0.1)
lasso = lasso.fit(cars_train)



#### Flight duration model: More features!

Let's add more features to our model. This will not necessarily result in a better model. Adding some features might improve the model. Adding other features might make it worse.

More features will always make the model more complicated and difficult to interpret.

These are the features you'll include in the next model:

- km
- org (origin airport, one-hot encoded, 8 levels)
- depart (departure time, binned in 3 hour intervals, one-hot encoded, 8 levels)
- dow (departure day of week, one-hot encoded, 7 levels) and
- mon (departure month, one-hot encoded, 12 levels).

These have been assembled into the features column, which is a sparse representation of 32 columns (remember one-hot encoding produces a number of columns which is one fewer than the number of levels).
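The count of 32 columns can be checked with a short plain-Python sketch that lays out the index range of each feature. The helper `feature_layout` is a hypothetical name; the level counts come from the list above.

```python
def feature_layout(blocks):
    """Return the (first, last) column index of each block and the total width."""
    layout = {}
    start = 0
    for name, width in blocks:
        layout[name] = (start, start + width - 1)
        start += width
    return layout, start


# One column for km; each one-hot encoded feature has (levels - 1) columns
blocks = [
    ("km", 1),
    ("org", 8 - 1),
    ("depart", 8 - 1),
    ("dow", 7 - 1),
    ("mon", 12 - 1),
]

layout, total = feature_layout(blocks)
print(layout)
print(total)
```

The same layout tells you which coefficient index belongs to which level when interpreting the fitted model.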

The data are available as flights, randomly split into flights_train and flights_test. The object predictions is also available.

In [None]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Fit linear regression model to training data
regression = LinearRegression(labelCol='duration').fit(flights_train)

# Make predictions on testing data
predictions = regression.transform(flights_test)

# Calculate the RMSE on testing data
rmse = RegressionEvaluator(labelCol='duration').evaluate(predictions)
print("The test RMSE is", rmse)

# Look at the model coefficients
coeffs = regression.coefficients
print(coeffs)

#### Flight duration model: Regularisation!

In the previous exercise you added more predictors to the flight duration model. The model performed well on testing data, but with so many coefficients it was difficult to interpret.

In this exercise you'll use Lasso regression (regularized with an L1 penalty) to create a more parsimonious model. Many of the coefficients in the resulting model will be set to zero. This means that only a subset of the predictors actually contribute to the model. Despite the simpler model, it still produces a good RMSE on the testing data.

You'll use a specific value for the regularization strength. Later you'll learn how to find the best value using cross validation.

The data (same as previous exercise) are available as flights, randomly split into flights_train and flights_test.

In [None]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Fit Lasso model (α = 1) to training data
regression = LinearRegression(labelCol='duration', regParam=1, elasticNetParam=1).fit(flights_train)

# Calculate the RMSE on testing data
rmse = RegressionEvaluator(labelCol='duration').evaluate(regression.transform(flights_test))
print("The test RMSE is", rmse)

# Look at the model coefficients
coeffs = regression.coefficients
print(coeffs)

# Number of zero coefficients
zero_coeff = sum([beta == 0 for beta in regression.coefficients])
print("Number of coefficients equal to 0:", zero_coeff)