## Team 14: Multi-Task, Logistic Regression, and Linear Regression Algorithms from Scratch in PySpark
---

#### Summary 

This notebook includes all the code used to run the Linear, Logistic, and Multi-task algorithms from scratch using RDDs in PySpark. For the full toy example implementation, we train on 2.5% of the data from 2015-2018 and then test on the full 2019 flights dataset. Although our implementation includes both Lasso and Ridge regularization terms, for the purposes of illustration we focus on the basic implementation without any regularization, comparing the Logistic and Linear Regression models trained as independent tasks against the Multi-task model.

_Note: we verified that the from-scratch algorithms work by comparing their results after 500 iterations to the MLlib Logistic and Linear Regression implementations. We have documented the results in the final report; see sections 5 and 6 for the code that we ran._

#### 1. Load Dependencies

In [0]:
# Import libraries, then create and connect to the Spark session
from pyspark.sql import SparkSession,SQLContext
sql_jar="/path/to/sql_jar_file/sqljdbc42.jar"
spark_snow_jar="/usr/.../snowflake/spark-snowflake_2.11-2.5.5-spark_2.3.jar"
snow_jdbc_jar="/usr/.../snowflake/snowflake-jdbc-3.10.3.jar"
oracle_jar="/usr/path/to/oracle_jar_file//v12/jdbc/lib/oracle6.jar"
spark=(SparkSession
.builder
.master('yarn')
.appName('Spark job new_job')
.config('spark.driver.memory','10g')
.config('spark.submit.deployMode','client')
.config('spark.executor.memory','15g')
.config('spark.executor.cores',4)
.config('spark.yarn.queue','short')
.config('spark.jars','{},{},{},{}'.format(sql_jar,spark_snow_jar,snow_jdbc_jar,oracle_jar))
.enableHiveSupport()
.getOrCreate())
# SparkContext handle used later by the sc.broadcast(...) calls
sc = spark.sparkContext

In [0]:
from pyspark.sql.functions import *
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType,BooleanType,DateType,DoubleType,FloatType
from pyspark.mllib.evaluation import MulticlassMetrics

from pyspark.sql.types import *
from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.rdd import portable_hash
from pyspark.sql.window import Window
from pyspark.sql.functions import to_timestamp, to_date
from pyspark.sql.functions import substring

import pandas as pd
import numpy as np
import math as math
import time
import datetime
import matplotlib.pyplot as plt
from pylab import rcParams
import matplotlib.ticker as mtick
import seaborn as sns
from graphframes import *
import geopandas as gpd
import plotly as plotly
!pip install heatmapz
from heatmap import heatmap, corrplot

pd.set_option("display.max_rows", 999)
pd.set_option("display.max_columns", 200)


from pyspark.ml import *
from pyspark.ml.linalg import *
from pyspark.ml.stat import *
from pyspark.ml.feature import *
from pyspark.sql.window import *

#Blob credentials
blob_container = "cemgr14c" # The name of your container created in https://portal.azure.com
storage_account = "cemgr14" #The name of your Storage account created in https://portal.azure.com
secret_scope = "w261gr14" # The name of the scope created in your local computer using the Databricks CLI
secret_key = "keygr14" # The name of the secret key created in your local computer using the Databricks CLI
blob_url = f"wasbs://{blob_container}@{storage_account}.blob.core.windows.net"
mount_path = "/mnt/mids-w261"

spark.conf.set(
  f"fs.azure.sas.{blob_container}.{storage_account}.blob.core.windows.net",
  dbutils.secrets.get(scope = secret_scope, key = secret_key)
)

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

#### 2. Logistic Regression

Logistic Regression is used to solve classification problems; in our case it addresses the binary classification problem of delay (1) or no delay (0). The model output is the probability that a flight is delayed (the positive class). Below are the loss function and gradient descent update formulas. Unlike linear regression, we cannot use the Mean Squared Error (MSE) as the cost function, because composing MSE with the non-linear sigmoid function yields a non-convex function. To obtain a convex objective, we use the log loss in place of MSE.

#### Logistic (Sigmoid) Function

$$g(z)=\frac{1}{1+e^{(-z)}}$$

#### Logistic Regression Loss Function (Log Loss Function)

$$J\left(\theta\right)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{\left(i\right)}\log{h_\theta\left(x^{\left(i\right)}\right)}+\left(1-y^{\left(i\right)}\right)\log{\left(1-h_\theta\left(x^{\left(i\right)}\right)\right)}\right]$$

where ___m___ is the number of examples

With this loss function, if y=1 and the predicted probability is 1, the cost is 0. However, if y=1 and the predicted probability approaches 0, the cost grows without bound, so confident wrong predictions are penalized heavily.
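To make the penalty concrete, the sketch below evaluates the per-example log loss for a positive example (y = 1) at a few predicted probabilities. It uses plain Python rather than the RDD code in this notebook, and `example_log_loss` is a hypothetical helper introduced only for illustration.

```python
import math

def example_log_loss(y, p):
    """Log loss of one example with label y (0 or 1) and predicted probability p in (0, 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A positive example (y = 1) scored with increasingly wrong probabilities:
# the loss is near 0 for p close to 1 and grows quickly as p approaches 0.
for p in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"y=1, p={p}: loss={example_log_loss(1, p):.4f}")
```

The probability must stay strictly between 0 and 1, since `math.log(0)` is undefined.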

#### Gradient Descent Formula
$$
\theta_j=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^{m}{\left(h_\theta\left(x^{\left(i\right)}\right)-y^{\left(i\right)}\right)}x_j^{\left(i\right)}
$$

where $$h_\theta\left(x\right)=\frac{1}{1+e^{-\theta^Tx}}$$

The factor \\(\frac{1}{m}\\) averages the gradient over the examples, matching the `.mean()` call in the implementation.

Notice that this formula has the same form as the gradient descent update for linear regression; the only difference is that the hypothesis \\(h_\theta(x)\\) is the logistic function rather than the linear predictor \\(\theta^Tx\\).

This section includes the Logistic Regression model implementation from scratch, based on the mathematical formulas above.
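Before the RDD version, the update rule can be tried on a handful of rows in plain Python. This is a minimal local sketch, not the notebook's implementation: the toy rows, `predict`, `mean_log_loss`, and `logistic_gd_step` are hypothetical names, and the gradient is averaged over the examples in the same way the `GDUpdate` function below uses `.mean()`.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    """Predicted probability h_theta(x); x already carries the bias 1.0 at index 0."""
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))

def mean_log_loss(rows, theta):
    """Average log loss over rows of (features, label)."""
    total = sum(y * math.log(predict(theta, x)) + (1 - y) * math.log(1 - predict(theta, x))
                for x, y in rows)
    return -total / len(rows)

def logistic_gd_step(rows, theta, learning_rate=0.1):
    """One batch gradient descent step with the gradient averaged over the rows."""
    m = len(rows)
    grad = [0.0] * len(theta)
    for x, y in rows:
        error = predict(theta, x) - y          # h_theta(x) - y
        for j, xj in enumerate(x):
            grad[j] += error * xj / m
    return [t - learning_rate * g for t, g in zip(theta, grad)]

# Toy data: one feature, label 1 when the feature is positive.
toy_rows = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, 1.0], 1), ([1.0, 2.0], 1)]
theta = [0.5, 0.0]                             # bias 0.5, zero weights
initial_loss = mean_log_loss(toy_rows, theta)
for _ in range(100):
    theta = logistic_gd_step(toy_rows, theta)
final_loss = mean_log_loss(toy_rows, theta)
print(initial_loss, final_loss, theta)
```

The loss falls from its starting value and the feature weight becomes positive, which is the behaviour the verbose training log below should also show on the real data.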

##### - Load Data

In [0]:
def undersampling_Adj(data, column, val1, val2):

    '''
    Input: data = (dataframe with data), column = (column to check ratio for), val1 = majority value for column, val2 = minority value for column
    Output: dataframe with balanced count for minority and majority count by undersampling majority count.
    '''

    major_df = data.filter(col(column) == val1)
    minor_df = data.filter(col(column) == val2)
    n1 = major_df.count()
    n2 = minor_df.count()
    ratio = n2/n1

    sampled_majority_df = major_df.sample(False, ratio, 123)
    combined_df = sampled_majority_df.unionAll(minor_df)

    return combined_df, ratio

In [0]:
# Helper Function to Parse RDD
def parse(line):
    """
    Parse RDD into a tuple
    Args:
        line (line) : line from RDD
    Returns:
        tuple in format (features, (delayLog, delayLN))
    """  
    features, delayLog, delayLN = line[:-2], line[-2], line[-1]
    return (np.array(features, dtype = 'double'), (np.array(delayLog, dtype = 'double'), np.array(delayLN, dtype = 'double')))
  
# Helper function to extract predictions from dataframe from ML model output
def extract(row):
    """
    Get predictions from Spark DataFrame
    Args:
        row
    Returns:
            (LogLabel, probabilities, label, prediction)
    
    """
    return (row.DEP_DEL15,) + tuple(row.probability.toArray().tolist()) +  (row.label,) + (row.prediction,)

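We sample 2.5% of the data, keep the records from 2018 and earlier for training, and use all of 2019 as the test dataset. We remove records with missing values, select only the relevant numeric features for the model, and undersample the majority class (no delay) in the training data to balance the two classes.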
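The `undersampling_Adj` helper above balances the classes by sampling the majority class at the rate `n_minority / n_majority`. The sketch below mirrors that idea on a plain Python list of labels; `undersample_majority` and the toy label counts are assumptions made for illustration. Like `DataFrame.sample` without replacement, it keeps each majority row independently with that probability, so the resulting class counts are only approximately equal.

```python
import random

def undersample_majority(labels, majority=0, minority=1, seed=123):
    """Return (kept_indices, ratio) after sampling majority rows at rate n_minority / n_majority."""
    n_major = sum(1 for y in labels if y == majority)
    n_minor = sum(1 for y in labels if y == minority)
    ratio = n_minor / n_major
    rng = random.Random(seed)
    # Minority rows are always kept; each majority row is kept with probability `ratio`.
    kept = [i for i, y in enumerate(labels)
            if y == minority or (y == majority and rng.random() < ratio)]
    return kept, ratio

labels = [0] * 800 + [1] * 200                 # toy data: 80% on time, 20% delayed
kept, ratio = undersample_majority(labels)
n_major_kept = sum(1 for i in kept if labels[i] == 0)
n_minor_kept = sum(1 for i in kept if labels[i] == 1)
print(ratio, n_major_kept, n_minor_kept)
```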

In [0]:
# USE THIS PATH TO LOAD THE FULL DATA
# df_airlines = spark.read.parquet(f"{blob_url}/airplanes_weather_final_5yr_EZ/*").cache()

# READ DATA INCLUDING AIRPORT RANK.
airlines = spark.read.parquet(f"{blob_url}/df_airlines_rank_graphs").cache()

# Reduce the amount of data (to run on DBCE)

# CONTINUE WITH A PORTION OF DATA FOR TESTING
proportion = 0.025   # 0.025
(airline_test1, airline_rest) = airlines.randomSplit([proportion, 1- proportion], seed=123)

#CONTINUE WITH ALL DATA
# airline_test = df_airlines

# Select only the columns needed
subset_df = airline_test1.select('YEAR','DAY_OF_MONTH','DAY_OF_WEEK', 'DISTANCE','wind_speed_mps_orig','ceiling_ht_dim_orig','visibility_meters_orig','temp_cels_orig',
                                'dew_pt_orig','atmos_press_orig','precip_milimeters_orig', 'wind_speed_mps_dest','ceiling_ht_dim_dest','visibility_meters_dest',
                                'temp_cels_dest', 'dew_pt_dest', 'atmos_press_dest','precip_milimeters_dest', 'rolling_ninety_day_average','Air_Page_Rank_traffic',
                                'OD_delay_pair', 'time_of_day_int', 'DEP_DEL15', 'DEP_DELAY').cache()                
subset_df2 =   airlines.select('YEAR','DAY_OF_MONTH','DAY_OF_WEEK', 'DISTANCE','wind_speed_mps_orig','ceiling_ht_dim_orig','visibility_meters_orig','temp_cels_orig',
                                'dew_pt_orig','atmos_press_orig','precip_milimeters_orig', 'wind_speed_mps_dest','ceiling_ht_dim_dest','visibility_meters_dest',
                                'temp_cels_dest', 'dew_pt_dest', 'atmos_press_dest','precip_milimeters_dest', 'rolling_ninety_day_average','Air_Page_Rank_traffic',
                                'OD_delay_pair', 'time_of_day_int', 'DEP_DEL15', 'DEP_DELAY').cache()  

subset_df2 = subset_df2.na.drop()

# OUTCOME VARIABLES: DEP_DEL15 (logistic label), DEP_DELAY (linear target)
subset_df = subset_df.na.drop()

# Manage unbalanced data
subset_df, rs = undersampling_Adj(subset_df, "DEP_DEL15", 0, 1)

# SPLIT DATA FOR TRAINING AND TESTING
year_train_val = 2018
train = subset_df.filter(subset_df.YEAR <= year_train_val).cache()
test = subset_df2.filter(subset_df2.YEAR > year_train_val).cache()

train = train.drop("YEAR")
test = test.drop("YEAR")

train_rdd = train.rdd
test_rdd = test.rdd

# Parse data as tuple
train_rdd = train_rdd.map(tuple) \
                 .map(parse).cache()

test_rdd = test_rdd.map(tuple) \
                 .map(parse).cache()

# train_rdd.collect()
# test_rdd.collect()

##### - Defining the LogLoss, Sigmoid Functions and Gradient Update

In [0]:
# Based on https://towardsdatascience.com/logistic-regression-from-scratch-69db4f587e17
# helper function: sigmoid function
def sigmoid(z):
    """
    Apply sigmoid function to a value
    Args:
        z - (numeric) numeric value
    Returns:
        value compressed with sigmoid function
    """
    return 1.0/(1+np.exp(-z))

# Log Loss Function
def LogLoss(dataRDD, W):
    """
    Compute the log loss.
    Args:
        dataRDD - each record is a tuple of (features_array, y)
        W       - (array) model coefficients with bias at index 0
    """
    augmentedData = dataRDD.map(lambda x: (np.append([1.0], x[0]), x[1][0])).cache()
    #  Log Loss =                                    prediction 1                       +            prediction  0
    # Log Loss =                            y *    log (sigmoid (dot(X,weights)))      + (1-y) * log (1- sigmoid(dot(X,weights)))   . mean
    loss =  augmentedData.map(lambda X_y: ((X_y[1] * np.log(sigmoid(np.dot(X_y[0], W))))+ ((1-X_y[1])*np.log(1-sigmoid(np.dot(X_y[0], W)))))).mean()
    loss = -loss
    return loss

In [0]:
#Function to perform a single GD step
def GDUpdate(dataRDD, W, regType = None, learningRate = 0.1, regParam = 0.1):
    """
    Perform one logistic regression gradient descent step/update.
    Args:
        dataRDD - records are tuples of (features_array, y)
        W       - (array) model coefficients with bias at index 0
    Returns:
        new_model - (array) updated coefficients, bias at index 0
    """
    # add a bias 'feature' of 1 at index 0
    augmentedData = dataRDD.map(lambda x: (np.append([1.0], x[0]), x[1][0])).cache()
    w_broadcast = sc.broadcast(W)
    w = np.append([0.],W[1:])

    if regType =='ridge':
        reg = regParam * 2 * w
    elif regType == 'lasso':
        #reg = 1 * np.sign(w)
        reg = W * 1
#         reg = (reg>0).astype(int) * 2- 1
        reg = (reg>0) * 2- 1
        reg[0] = 0
        reg = reg * regParam
    else:
        reg = 0.0  # np.float was removed in NumPy 1.24; a plain float behaves the same
    
    #                               X               .        (sigmoid(X   .    Theta )     - y)          / m 
    grad = augmentedData.map(lambda x: np.dot(x[0], (sigmoid(np.dot(x[0],w_broadcast.value))-x[1]))).mean()
    #
    grad = grad + reg
    new_model = w_broadcast.value - learningRate * grad
  
    #return grad
    return new_model

##### - Defining Function to normalize the data

In [0]:
def normalize(dataRDD, refRDD):
    """
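    Scale and center data around the mean of each feature.
    Args:
        dataRDD - records are tuples of (features_array, y)
        refRDD  - RDD whose feature means and standard deviations are used for scaling (the training set)
    Returns:
        normedRDD - records are tuples of (features_array, y)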
    """
    featureMeans = refRDD.map(lambda x: x[0]).mean()
    featureStdev = np.sqrt(refRDD.map(lambda x: x[0]).variance())
    
    normedRDD = dataRDD.map(lambda x: ((x[0]-featureMeans)/featureStdev,x[1])).cache()
    
    return normedRDD

In [0]:
def normalizeLR(dataRDD, refRDD):
    """
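    Scale and center data around the mean of each feature, and also standardize the linear regression target.
    Args:
        dataRDD - records are tuples of (features_array, (ylog, ylr))
        refRDD  - RDD whose means and standard deviations are used for scaling (the training set)
    Returns:
        normedRDD - records are tuples of (features_array, (ylog, normalized ylr))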
    """
    featureMeans = refRDD.map(lambda x: x[0]).mean()
    featureStdev = np.sqrt(refRDD.map(lambda x: x[0]).variance())
    
    yMean = refRDD.map(lambda x: x[1][1]).mean()
    yStdev = np.sqrt(refRDD.map(lambda x: x[1][1]).variance())
    
    normedRDD = dataRDD.map(lambda x: ( (x[0]-featureMeans) / featureStdev, (x[1][0], (x[1][1]-yMean)/yStdev))  ).cache()
    
    return normedRDD
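Both normalization functions take a reference RDD so that the test set is scaled with the training set's statistics rather than its own. The following plain Python sketch shows that pattern on a few rows; `column_stats`, `standardize`, and the toy rows are hypothetical. It uses the population standard deviation, which is what `RDD.variance()` computes.

```python
import math

def column_stats(rows):
    """Per-feature mean and population standard deviation of a list of feature rows."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - mu) ** 2 for v in col) / n)
            for col, mu in zip(columns, means)]
    return means, stds

def standardize(rows, means, stds):
    """Center and scale every row with the supplied reference statistics."""
    return [[(v - mu) / sd for v, mu, sd in zip(row, means, stds)] for row in rows]

train = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
test = [[4.0, 30.0]]

means, stds = column_stats(train)              # statistics come from the training rows only
train_z = standardize(train, means, stds)
test_z = standardize(test, means, stds)        # test rows reuse the training statistics
print(train_z, test_z)
```

A feature with zero variance in the reference data would cause a division by zero here, just as it would in the RDD version, so constant columns need to be dropped beforehand.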

##### - Functions to Evaluate Models

In [0]:
def modprediction(data, model):
    """
        Return model predictions for Logistic Regression
        Args:
            data - RDD with records in the following format (features_array, (ylog, ylr))
        Returns:
            predictions - RDD with records that are tuples of ((prediction, label), probability)
    """
    augmentedData = data.map(lambda x: (np.append([1.0], x[0]), x[1][0])).cache()
    
    predictions = augmentedData.map(lambda x: (float(x[1]),sigmoid(np.dot(x[0],model)))) \
                    .map(lambda x: (( float(np.where(x[1]>.5,1,0))   , float(x[0])), x[1]))
    
    return predictions

In [0]:
def rmse(data, model):
    """
        Compute Root Mean Squared Error (RMSE)
        Args:
            data     - each record is a tuple of (features_array, y)
            model  - (array) model coefficients with bias at index 0
    """
    
    augmentedData = data.map(lambda x: (np.append([1.0], x[0]), x[1][1])).cache()
    
    rmse_result = augmentedData.map(lambda x: (np.dot(x[0],model) - x[1] )**2).mean()
    rmse_result = np.sqrt(rmse_result)
    
    return rmse_result

In [0]:
def logrmetrics(metrics):
    """
        Return evaluation metrics for Logistic Regression model predictions
        Args:
            metrics - MulticlassMetrics object built from an RDD of (prediction, label) pairs
        Returns:
            dfmetric - dict of accuracy, recall, precision, f1_score, f05_score, f2_score
    
    """
    dfmetric = {}
    recall = metrics.recall(1.0)
    precision = metrics.precision(1.0)
    dfmetric["accuracy"] = metrics.accuracy
    dfmetric["recall"] = recall
    dfmetric["precision"] = precision
    if (recall + precision) != 0:
        dfmetric["f1_score"] = 2*(recall * precision) / (recall + precision)
    else:
        dfmetric["f1_score"] = "na"
    beta = 0.5
    if ((beta**2 * precision) + recall) != 0:
        dfmetric["f05_score"] = (1+beta**2)*(recall * precision) / ((beta**2 * precision) + recall)
    else:
        dfmetric["f05_score"] = "na"
    beta = 2
    if ((beta**2 *precision) + recall) !=0:
        dfmetric["f2_score"] = (1+beta**2)*(recall * precision) / ((beta**2 *precision) + recall) 
    else:
        dfmetric["f2_score"] = "na"
    
    return dfmetric

In [0]:
def LogRegression(data, nSteps, verbose, model, regType, learningRate, regParam):
    """
         Train a Logistic Regression model with gradient descent
         Args:
              data - RDD with records in the following format (features_array, (ylog, ylr))
              nSteps - (numeric) value for number of steps in gradient descent
              verbose -(boolean) True will print Loss and Models with each step
              model - (numpy array) baseline model to initialize weights
              regType - (str) Regularization type 'lasso' or 'ridge'; any other value (e.g. 'none') disables regularization
              learningRate - (numeric) Learning Rate
              regParam - (numeric) value for amount of regularization         
         Returns:
              model - returns model intercept and weights
    """  
    if verbose: print(f"BASELINE:  Loss = {LogLoss(data,model)}")
    for idx in range(nSteps):
        model = GDUpdate(data, model, regType, learningRate, regParam)
        loss = LogLoss(data, model) 
        if verbose:
            print("----------")
            print("STEP: {}".format(str(idx+1)))
            print("Loss: {}".format(str(loss)))
            print("Model:{}".format([w for w in model]))
        else:
            print("STEP: {}".format(str(idx+1)))
    
    return model

##### - Running Logistic Regression Model

In [0]:
normalizedrdd = normalize(train_rdd, train_rdd).cache()
normalizedtest = normalize(test_rdd, train_rdd).cache()
baseline = np.append(.5,np.zeros(len(train_rdd.take(1)[0][0]))).tolist()
baseline = np.array(baseline)

In [0]:
# Using Model without "Ridge" or "Lasso" regularization
mlogR = LogRegression(normalizedrdd, 500, True, baseline, "none", 0.1, 0.1)
print(mlogR)

In [0]:
# TESTING LOGISTIC REGRESSION
test2 = modprediction(normalizedtest, mlogR)

metrics_logR = MulticlassMetrics(test2.map(lambda x: x[0]))
metrics = logrmetrics(metrics_logR)
dm1 = pd.DataFrame({"Log Reg": list(metrics.values())}, index = list(metrics.keys()))
dm1

| Metric | Log Reg |
|---|---|
| accuracy | 0.587091 |
| recall | 0.65073 |
| precision | 0.259132 |
| f1_score | 0.37066 |
| f05_score | 0.294587 |
| f2_score | 0.499701 |
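The F-scores in the results above follow directly from the reported precision and recall through the F-beta formula used in `logrmetrics`. The short check below recomputes them in plain Python; `f_beta` is a hypothetical helper, and the two input values are copied from the results above.

```python
def f_beta(precision, recall, beta):
    """F-beta score; returns None when the denominator is zero."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return None
    return (1 + beta ** 2) * precision * recall / denom

precision, recall = 0.259132, 0.65073          # reported test-set values
for name, beta in (("f1_score", 1), ("f05_score", 0.5), ("f2_score", 2)):
    print(name, round(f_beta(precision, recall, beta), 6))
```

Because recall is much higher than precision here, the recall-weighted F2 score is the largest of the three and the precision-weighted F0.5 score is the smallest.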


#### 3. Linear Regression

To predict the delay in minutes we use Linear Regression:

$$
\text{Delay in minutes} = \beta_{0} + \sum_{i=1}^{n}(\beta_{i} * \text{Flight}_{i} )+ \sum_{j=1}^{m}(\beta_{j} * \text{Weather}_{j} ) + \sum_{k=1}^{l}(\beta_{k} * \text{Others}_{k} )
$$


**The formula for the gradient for linear regression is:**

$$
{\triangledown}f = \frac{2}{n}\sum_{i=1}^{n} (W_i*X^i - y^i) X^i
$$

>In the formula, the error \\(W_i*X^i - y^i\\) of each observation is the multiplier (weight) for \\(X^i\\). Because we sum and then divide over the n observations, the gradient is twice the average of the observations \\(X^i\\), each weighted by its error. The error is written as prediction minus target so that subtracting the gradient in the update moves the weights toward a lower loss.

>Then the updated weights are calculated as follows:
$$
W_{i+1} = W_i - \alpha * {\triangledown}f
$$
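As a quick sanity check of the update rule, the sketch below runs batch gradient descent in plain Python on points that lie exactly on the line y = 1 + 2x. It is an illustration under stated assumptions, not the notebook's RDD code: `ols_gd_step` and `mse` are hypothetical names, and the residual is taken as prediction minus target, which is the form the `GDUpdateLR` function below uses.

```python
def ols_gd_step(rows, w, learning_rate=0.1):
    """One batch gradient descent step for mean squared error; bias feature at index 0."""
    n = len(rows)
    grad = [0.0] * len(w)
    for x, y in rows:
        residual = sum(wj * xj for wj, xj in zip(w, x)) - y   # prediction minus target
        for j, xj in enumerate(x):
            grad[j] += 2.0 * residual * xj / n
    return [wj - learning_rate * g for wj, g in zip(w, grad)]

def mse(rows, w):
    """Mean squared error of the linear model w on rows of (features, target)."""
    return sum((sum(wj * xj for wj, xj in zip(w, x)) - y) ** 2 for x, y in rows) / len(rows)

# Points on y = 1 + 2x, with the bias feature 1.0 prepended to each row.
rows = [([1.0, -1.0], -1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)]
w = [0.5, 0.0]
start_mse = mse(rows, w)
for _ in range(200):
    w = ols_gd_step(rows, w)
print(start_mse, mse(rows, w), w)
```

With the sign reversed in either the gradient or the update, the same loop would move away from the solution instead of toward it.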

##### - Loss Function and Gradient Update Function

**The loss function for Ridge Regression is:**

$$
f({\theta}) = \frac{1}{n}\sum_{i=1}^{n}[{\theta}\cdot{x\text{'}_i} - y_i]^2 + \lambda \sum_{j} {\theta}_j^2
$$

>The gradient for ridge regression is:

$$
{\triangledown}f = \frac{2}{n}\sum_{i=1}^{n} (W_i*X^i - y^i) X^i + 2{\lambda}{W}
$$

>Then the updated weights are calculated as follows:
$$
W_{i+1} = W_i - \alpha * {\triangledown}f
$$

**The loss function for Lasso Regression is:**
$$
f({\theta}) = \frac{1}{n}\sum_{i=1}^{n}[{\theta}\cdot{x\text{'}_i} - y_i]^2 + \lambda \sum_{j}|{\theta}_j|
$$

>The gradient for lasso regression is:

$$
{\triangledown}f = \frac{2}{n}\sum_{i=1}^{n} (W_i*X^i - y^i) X^i + {\lambda}  (I_{w>0}(w)*2-1)
$$

>Then the updated weights are calculated as follows:
$$
W_{i+1} = W_i - \alpha * {\triangledown}f
$$
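The only difference between the three variants is the penalty term added to the gradient. The sketch below isolates that term in plain Python, following the same conventions as the formulas above: the bias at index 0 is not penalized, ridge adds 2λw, and lasso adds λ times the indicator expression (+1 for positive weights, -1 otherwise). The name `reg_gradient` and the example weights are hypothetical.

```python
def reg_gradient(w, reg_type, reg_param):
    """Penalty gradient for weights w, with the bias at index 0 left unpenalized."""
    if reg_type == "ridge":
        return [0.0] + [2.0 * reg_param * wj for wj in w[1:]]
    if reg_type == "lasso":
        # indicator(w > 0) * 2 - 1: +1 for positive weights, -1 for zero or negative ones
        return [0.0] + [reg_param * (1.0 if wj > 0 else -1.0) for wj in w[1:]]
    return [0.0] * len(w)                      # any other value: no regularization

w = [9.6, -0.28, 1.02, 0.0]
ridge_term = reg_gradient(w, "ridge", 0.1)
lasso_term = reg_gradient(w, "lasso", 0.1)
no_term = reg_gradient(w, "none", 0.1)
print(ridge_term, lasso_term, no_term)
```

Note that with this indicator form a weight that is exactly zero receives -λ rather than 0, so the lasso term never switches off for a coefficient that has reached zero.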

In [0]:
def OLSLoss(dataRDD, W):
    """
    Compute mean squared error.
    Args:
        dataRDD - each record is a tuple of (features_array, y)
        W       - (array) model coefficients with bias at index 0
    """
    #augmentedData = dataRDD.map(lambda x: (np.append([1.0], x[0]), x[1]))
    augmentedData = dataRDD.map(lambda x: (np.append([1.0], x[0]), x[1][1])).cache()
    loss = None
    
    loss = augmentedData.map(lambda x:(np.dot(x[0],W) - x[1])**2).mean()
    
    return loss

In [0]:

def GDUpdateLR(dataRDD, W, regType = None, learningRate = 0.1, regParam = 0.1):
    """
    Perform one OLS gradient descent step/update.
    Args:
        dataRDD - records are tuples of (features_array, y)
        W       - (array) model coefficients with bias at index 0
    Returns:
        new_model - (array) updated coefficients, bias at index 0
    """
    # add a bias 'feature' of 1 at index 0
    augmentedData = dataRDD.map(lambda x: (np.append([1.0], x[0]), x[1][1])).cache()

    w_broadcast = sc.broadcast(W)
    w = np.append([0.],W[1:])

    if regType =='ridge':
        reg = regParam * 2 * w
    elif regType == 'lasso':
        reg = W * 1
        reg = (reg>0) * 2- 1
        reg[0] = 0
        reg = reg * regParam
    else:
        reg = 0.0  # np.float was removed in NumPy 1.24; a plain float behaves the same
    
    grad = augmentedData.map(lambda d: np.dot(d[0], ( np.dot(d[0],w_broadcast.value) - d[1] )  )).mean()*2
    
    grad = grad + reg
    new_model = W - (learningRate * grad)
   
    return new_model

In [0]:
def OLSRegression(dataLR, nSteps, verbose, modelLR, regType, learningRate, regParam):
    """
         Train a Linear Regression model with gradient descent
         Args:
              dataLR - RDD with records in the following format (features_array, (ylog, ylr))
              nSteps - (numeric) value for number of steps in gradient descent
              verbose -(boolean) True will print Loss and Models with each step
              modelLR - (numpy array) baseline model to initialize weights
              regType - (str) Regularization type 'lasso' or 'ridge'; any other value (e.g. 'none') disables regularization
              learningRate - (numeric) Learning Rate
              regParam - (numeric) value for amount of regularization         
         Returns:
              modelLR - returns model intercept and weights
    """  
    if verbose: print(f"BASELINE:  Loss = {OLSLoss(dataLR,modelLR)}")
    for idx in range(nSteps):
        modelLR = GDUpdateLR(dataLR, modelLR, regType, learningRate, regParam)
        lossLR = OLSLoss(dataLR, modelLR) 
        if verbose:
            print("----------")
            print("STEP: {}".format(str(idx+1)))
            print("Loss: {}".format(str(lossLR)))
            print("Model:{}".format([w for w in modelLR]))
        else:
            print("STEP: {}".format(str(idx+1)))
    
    return modelLR

##### - Running Linear Regression - Simple, Lasso and Ridge Regularized

In [0]:
%%time
modelLR_none = OLSRegression(normalizedrdd, 500, True, baseline, "none", 0.1, 0.1)
print(modelLR_none)

On 2.5 % of Training and Test Data

modelLR_none = [9.59953387514031, -0.2792319760809585, -0.13610874019388292, 0.09113025424615373, 1.02147516098602, -1.4332248187979513, -0.20103124634254874, 0.14475215078420597, 0.1399253081719669, -0.6109345440124277, 2.0489768014831813, 1.2293126151409495, -1.0410810747412278, -0.2223936331208716, -0.3714203250148143, 0.39819825448972956, -0.44132472698144004, 1.1145417744369044, 2.1629760289168947, 0.35206343690682024, 4.480645297739111, 3.064779101029841]

rmse = 48.805181731859925 (test)  
rmse = 41.14886101296988  (train)

In [0]:
print(rmse(normalizedtest, modelLR_none))
print(rmse(normalizedrdd, modelLR_none))

In [0]:
# %%time
# modelLR_lasso = OLSRegression(normalizedrdd, 500, True, baseline, "lasso", 0.1, 0.1)
# print(modelLR_lasso)

On 2.5 % of Training and Test Data

modelLR_lasso = [9.599533875140297, -0.225435266963635, -0.08737547449007968, 0.04836256142201045, 0.9811356848301295, -1.395399157016233, -0.15550883022806483, 0.014510635372070231, 0.15314739867503557, -0.6260577724725453, 2.0103878563781974, 1.197397568363566, -1.0720329129008053, -0.1849426257561813, -0.03321992667839169, 0.17275204054095572, -0.3662288819505662, 1.0847986207757947, 2.12150330147695, 0.3117688343773923, 4.441178495612246, 2.9954111222542754]

rmse = 50.56335989236697  (test)  
rmse = 42.887176843410465  (train)

In [0]:
# print(rmse(normalizedtest, modelLR_lasso))
# print(rmse(normalizedrdd, modelLR_lasso))

In [0]:
# %%time
# modelLR_ridge = OLSRegression(normalizedrdd, 500, True, baseline, "ridge", 0.1, 0.1)
# print(modelLR_ridge)

On 2.5 % of Training and Test Data

modelLR_ridge = [9.599533875140304, -0.2510552936007499, -0.12436994459760073, 0.11460389884298267, 1.0163926328627515, -1.3448453020266176, -0.19537960470796906, 0.1581762091277311, 0.15287614125538387, -0.582538237206741, 1.88966624415537, 1.1769673813410553, -1.0131006810944299, -0.21359370720778978, -0.17683145029249256, 0.2638164979500447, -0.4166130286252367, 1.036594425156843, 1.9670923033243415, 0.33128262742382614, 4.094652409003303, 2.773497418605601]

rmse = 50.54581324445492 (test)  
rmse = 42.87196429524765 (train)

In [0]:
# print(rmse(normalizedtest, modelLR_ridge))
# print(rmse(normalizedrdd, modelLR_ridge))

#### 4. Multi-Task Model - Logistic Regression and Linear Regression

The Multi-Task algorithm combines a binary classification task (logistic regression) and a prediction task (linear regression) by adding their respective weighted loss functions together in a Multi-Task loss function. Below are the formulas for the Multi-Task algorithm. See the final report for additional details and a more comprehensive mathematical explanation.


___Simplified Multi-Task Loss Function___
$$ \text{Multi-Task Loss}=\left(\beta\right)LogLoss+\left(1-\beta\right)MSE $$

___Multi-Task Loss Function___

$$ \text{Multi-Task Loss} =\beta\left(-\frac{1}{m}\sum_{i=1}^{m}\left[y^{\left(i\right)}\log{h_\theta\left(x^{\left(i\right)}\right)}+\left(1-y^{\left(i\right)}\right)\log{\left(1-h_\theta\left(x^{\left(i\right)}\right)\right)}\right]\right)+\left(1-\beta\right)\frac{1}{m}\sum_{i=1}^{m}\left(y^{\left(i\right)}-{\hat{y}}^{\left(i\right)}\right)^2$$
_where __β__ is the weight on the log loss and 1 − __β__ is the weight on the MSE. Note that `MTLoss` and `GDMTUpdateLR` below apply `beta` to the linear regression (MSE) term and `1 - beta` to the logistic regression term, so `beta = 0.4` in the run below weights the log loss by 0.6._

___Multi-Task Gradient___

$$Gradient = \beta\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta\left(x^{\left(i\right)}\right)-y^{\left(i\right)}\right)x_j^{\left(i\right)}+\left(1-\beta\right)\frac{2}{m}\sum_{i=1}^{m}\left(\theta^T\cdot x_i^\prime-y_i\right)x_i^\prime$$
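The weighted combination can be illustrated on two toy flights in plain Python. This is a sketch under stated assumptions, not the notebook's implementation: `multi_task_loss`, the toy rows, and the weights are hypothetical. It follows the weighting used in the `MTLoss` code below, where `beta` multiplies the squared-error term and `1 - beta` multiplies the log loss.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def multi_task_loss(rows, w_lin, w_log, beta):
    """Return (combined, mse, log_loss) for rows of (x, y_class, y_minutes); bias at index 0."""
    m = len(rows)
    mse = 0.0
    log_loss = 0.0
    for x, y_cls, y_min in rows:
        p = sigmoid(dot(w_log, x))
        mse += (dot(w_lin, x) - y_min) ** 2 / m
        log_loss += -(y_cls * math.log(p) + (1 - y_cls) * math.log(1 - p)) / m
    # Same weighting as MTLoss: beta on the MSE, (1 - beta) on the log loss.
    return beta * mse + (1 - beta) * log_loss, mse, log_loss

rows = [([1.0, 0.5], 1, 20.0), ([1.0, -0.5], 0, -5.0)]
w_lin = [0.5, 0.0]
w_log = [0.5, 0.0]
combined, mse, log_loss = multi_task_loss(rows, w_lin, w_log, beta=0.4)
print(combined, mse, log_loss)
```

On this toy data the squared error is orders of magnitude larger than the log loss, so it dominates the combined value unless the delay target is rescaled. This is one reason to standardize the delay target, as `normalizeLR` does, before combining the two losses.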

##### - Loss and Gradient Update Functions

In [0]:
def MTLoss(data, modelLR, modellogr, beta):
    """
    Compute multi-task loss function
    Args:
        data     - each record is a tuple of (features_array, (ylog, ylr))
        modelLR  - (array) Linear Regression model coefficients with bias at index 0
        modellogr - (array) Logistic Regression model coefficients with bias at index 0
        beta - (numeric) weight on the Linear Regression loss (MSE); (1 - beta) weights the log loss
    """
    augmentedDF = data.map(lambda x: (np.append([1.0], x[0]), x[1]))
    lossLR = None
    
    multi_loss = augmentedDF.map(lambda x:( (np.dot(x[0],modelLR) - x[1][1])**2, ((x[1][0] * np.log(sigmoid(np.dot(x[0], modellogr))))+ ((1-x[1][0])*np.log(1-sigmoid(np.dot(x[0], modellogr)))))         ))


    
    lossLogR = multi_loss.map(lambda x: x[1]).mean()
    lossLogR = -lossLogR
    lossLR = multi_loss.map(lambda x: x[0]).mean()
    
    # Multi-task loss
    MTLossVal = lossLR*beta + (1-beta)*lossLogR
       
    return MTLossVal

In [0]:
def GDMTUpdateLR(data, modelLR, modellogr, beta, LRregType, LRregParam,logregType, logregParam, learningRate = 0.1):
    """
    Perform one multi-task gradient descent step/update on both models.
    Args:
        data - records are tuples of (features_array, (ylog, ylr))
        modelLR  - (array) Linear Regression model coefficients with bias at index 0
        modellogr - (array) Logistic Regression model coefficients with bias at index 0
        beta - (numeric) scales the Linear Regression update; (1 - beta) scales the Logistic Regression update
        LRregType - (str) Linear Regression regularization type "lasso" or "ridge" 
        LRregParam - (float) Linear Regression regularization value 
        logregType - (str) Logistic Regression regularization type "lasso" or "ridge"
        logregParam - (float) Logistic Regression regularization value 
        learningRate - (float) Learning rate value
                
    Returns:
        (new_model_LR, new_model_logR) - (tuple) new models with updated coefficients, bias at index 0
    """

    #helper function 
    def get_reg( W, regType, regParam):
        w = np.append([0.],W[1:])
        if regType =='ridge':
            reg = regParam * 2 * w
        elif regType == 'lasso':
            reg = W * 1
            reg = (reg>0) * 2- 1
            reg[0] = 0
            reg = reg * regParam
        else:
            reg = 0.0  # np.float was removed in NumPy 1.24; a plain float behaves the same
        return reg
    
    
    modellogr_broadcast = sc.broadcast(modellogr)
    modelLR_broadcast = sc.broadcast(modelLR)

    #UPDATE MODEL FOR LINEAR REGRESSION
    # add a bias 'feature' of 1 at index 0
    augmentedDF =data.map(lambda x: (np.append([1.0], x[0]), x[1])).cache()
    
    #gradLR = augmentedLR.map(lambda d: np.dot(d[0], ( np.dot(d[0], modelLR) - d[1] )  )).mean()*2
    grads = augmentedDF.map(lambda d: (np.dot(d[0], ( np.dot(d[0], modelLR_broadcast.value) - d[1][1] )  )  ,np.dot(d[0], (sigmoid(np.dot(d[0],modellogr_broadcast.value))-d[1][0])))).cache()
    
    # Get regularization for each model
    lrReg = get_reg(modelLR_broadcast.value, LRregType, LRregParam)
    logReg = get_reg(modellogr_broadcast.value, logregType, logregParam)
    
    # add regularization to the gradients
    gradLR = grads.map(lambda x: x[0]).mean()*2 + lrReg
    grad_LogR = grads.map(lambda x: x[1]).mean() + logReg
    
    new_model_LR = modelLR - (learningRate * gradLR * beta)
        
    new_model_logR = modellogr_broadcast.value - (learningRate * grad_LogR * (1- beta))
    
    return new_model_LR, new_model_logR

In [0]:
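def MTRegression(data, nSteps, verbose, modelLR, modellogr, beta, LRregType, LRregParam,logregType, logregParam, learningRate):
    """
    Train the Linear and Logistic Regression models together with multi-task gradient descent.
    Args:
        data - RDD with records in the following format (features_array, (ylog, ylr))
        nSteps - (numeric) number of gradient descent steps
        verbose - (boolean) True will print the loss and both models at each step
        modelLR, modellogr - (numpy arrays) initial weights, bias at index 0
        beta - (numeric) weight on the Linear Regression loss; (1 - beta) weights the log loss
        remaining arguments are passed through to GDMTUpdateLR
    Returns:
        (modelLR, modellogr) - (tuple) trained models
    """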

    if verbose: print(f"BASELINE:  Loss = {MTLoss(data, modelLR, modellogr, beta)}")
    for idx in range(nSteps):
        modelLR, modellogr = GDMTUpdateLR(data, modelLR, modellogr, beta, LRregType, LRregParam,logregType, logregParam, learningRate)
        lossMT = MTLoss(data, modelLR, modellogr, beta) 
        if verbose:
            print("----------")
            print("STEP: {}".format(str(idx+1)))
            print("Loss: {}".format(str(lossMT)))
            print("Model LR:{}".format([w for w in modelLR]))
            print("Model LogR:{}".format([w for w in modellogr]))
        else:
            print("STEP: {}".format(str(idx+1)))
    
    return modelLR, modellogr

##### - Running Multi-Task Linear Regression/Logistic Regression

This runs the from-scratch Multi-Task Linear/Logistic regression. There is no MLlib implementation to compare results to. _Note: Prior to this we validated the separate implementations of the Linear and Logistic Regression models._

In [0]:
normalizedrddLR = normalizeLR(train_rdd, train_rdd).cache()
baselineLR = np.array(np.append(.5,np.zeros(len(train_rdd.take(1)[0][0]))).tolist())
baselineLogR = np.array(np.append(.5,np.zeros(len(train_rdd.take(1)[0][0]))).tolist())
beta = 0.4
LRregType, LRregParam = 'none', 0.1
logregType = "none"
logregParam, learningRate = 0.1, 0.1

modelLR_r, modellogr_r = MTRegression(normalizedrddLR, 500, True, baselineLR, baselineLogR, beta, LRregType, LRregParam, logregType, logregParam, learningRate)

print("************ Linear Regression Model ***********************")
print(modelLR_r)
print("************ Logistic Regression Model ***********************")
print(modellogr_r)

Ordinary Regression: 500 cycles  
Linear Regression Model: [2.59385309251664e-15, -0.0066755325398335195, -0.0032517617727512646, 0.0021781046593111236, 0.024434626665988918, -0.034305982515761925, -0.004809744626119722, 0.003750208219042718, 0.0031371375881927292, -0.014515112316497419, 0.04898028505827534, 0.02937948825565854, -0.024759849917462013, -0.005303913056616505, -0.009381267934277364, 0.009928395977872477, -0.010653979506296344, 0.02661917472825337, 0.05169649103626083, 0.008398731761794863, 0.107083932115518, 0.07329387887352118]  

Logistic Regression Model: [-1.5608584798386809, -0.008930345650416394, -0.00754982060762126, 0.02085226944940337, 0.0809303103085285, -0.12435144920036682, -0.016436544861239383, 0.04196927065542048, -0.016811550220550007, -0.042282945659215414, 0.08411007917968513, 0.10176418844922179, -0.08626204030609436, -0.013895899882409211, 0.027870389931953198, -0.01814050653089008, -0.016875934418088476, 0.06572880574862172, 0.1664458825464685, 0.032862342003531844, 0.1706525313333855, 0.29933201837669987]  

RMSE
50.56727869008469
42.89631028863097

In [0]:
test2 = modprediction(normalizedtest, modellogr_r)
metrics_multi_lr = MulticlassMetrics(test2.map(lambda x: x[0]))
metricsmulti_lr = logrmetrics(metrics_multi_lr)
dm3 = pd.DataFrame({"Log Reg": list(metricsmulti_lr.values())}, index = list(metricsmulti_lr.keys()))
dm3

| Metric | Log Reg |
|---|---|
| accuracy | 0.586973 |
| recall | 0.650773 |
| precision | 0.259075 |
| f1_score | 0.370609 |
| f05_score | 0.29453 |
| f2_score | 0.499679 |

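The three F-scores reported for the multi-task logistic model are consistent with the standard F-beta formula applied to the reported precision and recall. The sketch below recomputes them in plain Python. `f_beta` is a hypothetical helper written for illustration; it is not the notebook's `logrmetrics` function, whose internals are not shown here.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily; beta > 1 weights recall more heavily.
    """
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom > 0 else 0.0

# Precision and recall as reported for the multi-task logistic model
precision, recall = 0.259075, 0.650773
for name, beta in [("f1_score", 1.0), ("f05_score", 0.5), ("f2_score", 2.0)]:
    print(name, round(f_beta(precision, recall, beta), 4))
```

Because recall is much higher than precision for this model, the recall-weighted F2 score is the highest of the three and the precision-weighted F0.5 score is the lowest.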

In [0]:
print(rmse(normalizedtest, modelLR_r))
print(rmse(normalizedrdd, modelLR_r))

In [0]:
# LRregType, LRregParam = 'lasso', 0.1
# logregType = "lasso"
# logregParam, learningRate = 0.1, 0.1
# modelLR_lasso, modellogr_lasso = MTRegression(normalizedrddLR, 500, True, baselineLR, baselineLogR, beta, LRregType, LRregParam, logregType, logregParam, learningRate)

# print("************ Linear Regression Model ***********************")
# print(modelLR_lasso)
# print("************ Logistic Regression Model ***********************")
# print(modellogr_lasso)

Linear Regression: [7.119291095579074e-16, 0.0013753228985011832, 0.0022305590267884157, 0.0005694047593501144, 5.765944750487419e-05, -0.0013846221944906624, -0.003084031779083989, 0.004782094226221991, 0.00241322417068159, -0.0024658133625012507, 0.007667745972616723, 0.005142395505455799, -0.0005382926375558979, 0.0025802628094925436, 0.005291135986926982, 0.003717816609837388, -0.005008949005457102, 0.006522689009120985, 0.004012990915436847, 0.0026599904055640616, 0.0632531478201713, 0.03473985640767136]

Model LogR:[-1.4926861948541212, 0.0006254816794506101, -0.005202931334545911, -0.0035799266960007195, 0.007445485164502436, 0.0014782303117756024, -0.004181763690276592, -0.0023012786190521398, -0.0046050555703766385, -0.001674735264985538, 0.0018463649283606007, 0.001759927400029589, -1.96714956669378e-05, 0.004453148554301257, 0.000767082804270831, -0.000510357555673145, -0.004146290873988328, -0.0021717090992834642, -0.0009323704292673266, -0.0021064152148614563, -0.0022065918924323736, 0.0019786728487526588]

In [0]:
# test2 = modprediction(normalizedtest, modellogr_lasso)
# metrics_lasso_lr = MulticlassMetrics(test2.map(lambda x: x[0]))
# metricsRlr = logrmetrics(metrics_lasso_lr)
# dm3 = pd.DataFrame({"Log Reg Lasso": list(metricsRlr.values())}, index = list(metricsRlr.keys()))
# dm3

In [0]:
# print(rmse(normalizedtest, modelLR_lasso))
# print(rmse(normalizedrdd, modelLR_lasso))

In [0]:
# LRregType, LRregParam = 'ridge', 0.1
# logregType = "ridge"
# logregParam, learningRate = 0.1, 0.1
# modelLR_ridge, modellogr_ridge = MTRegression(normalizedrddLR, 500, True, baselineLR, baselineLogR, beta, LRregType, LRregParam, logregType, logregParam, learningRate)

# print("************ Linear Regression Model ***********************")
# print(modelLR_ridge)
# print("************ Logistic Regression Model ***********************")
# print(modellogr_ridge)

Model LR:[2.453879181130427e-15, -0.006000400429794389, -0.0029723452591739635, 0.002739054915927632, 0.024293551327300024, -0.03214561113872439, -0.0046699601681354065, 0.0038016331638882524, 0.0036388218720580087, -0.013916284572585737, 0.045163890668018944, 0.028129628280935563, -0.024203710134539123, -0.005104031912128296, -0.004265254197828877, 0.00633715000682583, -0.009965201099429192, 0.024773273070143258, 0.04701374794626212, 0.00791651776996051, 0.09786228720350557, 0.06629068734657798]  

Model LogR:[-1.513568483004315, -0.003212346310817316, -0.003030040825448543, 0.012419592484305167, 0.05821470759191354, -0.05762322691805871, -0.007304049235834673, 0.024032999488324342, 0.004031735320076424, -0.028047678477025172, 0.04719851208767767, 0.0612613612514279, -0.04151958875570505, -0.006092271690781037, 0.019565837028024106, 0.0003644437098173261, -0.019738607401050032, 0.031150061785722598, 0.07352034799055975, 0.019072716715317236, 0.08962246375277483, 0.1366317412321985]

In [0]:
# test2 = modprediction(normalizedtest, modellogr_ridge)
# metrics_ridge_lr = MulticlassMetrics(test2.map(lambda x: x[0]))
# metricsRlr = logrmetrics(metrics_ridge_lr)
# dm3 = pd.DataFrame({"Log Reg Ridge": list(metricsRlr.values())}, index = list(metricsRlr.keys()))
# dm3

In [0]:
# print(rmse(normalizedtest, modelLR_ridge))
# print(rmse(normalizedrdd, modelLR_ridge))

#### 5. PySpark Logistic Regression to Test Results 

Since we have already tested the Lasso and Ridge regressions, those cells are commented out. See the final report for a comparison of results.

##### - Import Dependencies

In [0]:
# ML related libraries
from pyspark.ml import Pipeline
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.ml.feature import StringIndexer, VectorAssembler, OneHotEncoder
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.feature import StandardScaler, Imputer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.regression import LinearRegression

from pyspark.mllib.evaluation import BinaryClassificationMetrics
from pyspark.ml.classification import RandomForestClassifier

from pyspark.ml.linalg import Vectors
from itertools import combinations

from pyspark.sql.types import IntegerType

##### - Select Data for Logistic Regression

In [0]:
myY = "DEP_DEL15"

#categoricals = ['YEAR', 'QUARTER', 'MONTH', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'OP_CARRIER_AIRLINE_ID', 'ORIGIN_AIRPORT_ID', 'DEST_AIRPORT_ID', 'SEASON', 'WKDAY', 'DEPARTURE_Hour_CRS']

numerics = ['YEAR','DAY_OF_MONTH','DAY_OF_WEEK', 'DISTANCE','wind_speed_mps_orig','ceiling_ht_dim_orig','visibility_meters_orig','temp_cels_orig',
                                'dew_pt_orig','atmos_press_orig','precip_milimeters_orig', 'wind_speed_mps_dest','ceiling_ht_dim_dest','visibility_meters_dest',
                                'temp_cels_dest', 'dew_pt_dest', 'atmos_press_dest','precip_milimeters_dest', 'rolling_ninety_day_average','Air_Page_Rank_traffic',
                                'OD_delay_pair', 'time_of_day_int']

# 'CRS_DEP_TIME', 'DEP_DELAY', 'DEP_DELAY_NEW', 'DEP_DEL15', 'DEP_DELAY_GROUP'

myX = numerics
print(len(numerics))

# Convert Column to Integer
subset_df = subset_df.withColumn("time_of_day_int", subset_df["time_of_day_int"].cast(IntegerType()))

#CREATE A COPY OF myY with name label to be used in the Grid Search 
# subset_df = subset_df.withColumn("label", subset_df[myY, "DEP_DELAY"])

# SELECT THE COLUMNS WITH THE VARIABLES
dflr = subset_df.select(myX + [myY])

# SELECT THE COLUMNS WITH THE VARIABLES
# dflr = subset_df.select(myX + [myY])


# SPLIT DATA FOR TRAINING AND TESTING
year_train_val = 2018
trainLog = dflr.filter(dflr.YEAR <= year_train_val).cache()
testLog = dflr.filter(dflr.YEAR > year_train_val).cache()

trainLog = trainLog.drop("YEAR")
testLog = testLog.drop("YEAR")

numerics = ['DAY_OF_MONTH','DAY_OF_WEEK', 'DISTANCE','wind_speed_mps_orig','ceiling_ht_dim_orig','visibility_meters_orig','temp_cels_orig',
                                'dew_pt_orig','atmos_press_orig','precip_milimeters_orig', 'wind_speed_mps_dest','ceiling_ht_dim_dest','visibility_meters_dest',
                                'temp_cels_dest', 'dew_pt_dest', 'atmos_press_dest','precip_milimeters_dest', 'rolling_ninety_day_average','Air_Page_Rank_traffic',
                                'OD_delay_pair', 'time_of_day_int']

myX = numerics

##### - Define Imputers and Vector Assembler

In [0]:
# Establish preprocessing stages for the Logistic Regression pipeline
# indexers = map(lambda c: StringIndexer(inputCol=c, outputCol=c+"_idx", handleInvalid = 'keep'), categoricals)
imputers = Imputer(inputCols = numerics, outputCols = numerics)
featureCols = numerics

# Define vector assemblers
# model_matrix_stages = [imputers] + \
#                      [VectorAssembler(inputCols=featureCols, outputCol="features"), StringIndexer(inputCol="DEP_DEL15", outputCol="label")]

model_matrix_stages = [imputers] + \
                     [VectorAssembler(inputCols=featureCols, outputCol="features")]


# Apply StandardScaler to create scaledFeatures
scaler = StandardScaler(inputCol="features",
                        outputCol="scaledFeatures",
                        withStd=True,
                        withMean=True)

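To make the two preprocessing stages concrete, here is a minimal NumPy sketch of what mean imputation followed by standardization does to a small array. It assumes the `Imputer` default strategy (column mean) and the `withMean=True`, `withStd=True` settings used above, with the sample standard deviation (`ddof=1`). `impute_mean` and `standardize` are hypothetical helper names; the sketch also fits and applies the statistics on the same toy array, whereas the pipeline fits them on the training data and reuses them on the test data.

```python
import numpy as np

def impute_mean(X):
    """Replace NaNs in each column with that column's mean over observed values."""
    X = np.array(X, dtype=float)
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def standardize(X):
    """Center each column and scale it to unit sample standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # sketch-only guard: constant columns become all zeros
    return (X - mean) / std

# Toy data: one missing value in the first column
toy = [[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]]
scaled = standardize(impute_mean(toy))
print(scaled)
```

After this transformation each column has mean 0 and unit sample standard deviation, which puts the fitted coefficients on a comparable scale across features.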
In [0]:
# Logistic Regression
def logR_PS(data, maxIter, regParam, elasticNetParam, myY, model_matrix_stages, scaler):
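    """
         Fit an MLlib Logistic Regression pipeline
         Args:
              data - Spark DataFrame with the feature columns and the outcome column
              maxIter - (int) maximum number of iterations
              regParam - (numeric) regularization strength (0 = no regularization)
              elasticNetParam - (numeric) L1/L2 mix: 0 = Ridge, 1 = Lasso
              myY - (str) outcome variable column name
              model_matrix_stages - (list) pre-processing stages (imputer, vector assembler)
              scaler - StandardScaler stage that produces the "scaledFeatures" column
         Returns:
              pipLogR - fitted pipeline model; the Logistic Regression model is its last stage
    """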
    lr = LogisticRegression(maxIter=maxIter, featuresCol = "scaledFeatures", regParam=regParam, elasticNetParam=elasticNetParam, fitIntercept=True, labelCol = myY)

    logrpipeline = Pipeline(stages=model_matrix_stages+[scaler]+[lr])

    pipLogR = logrpipeline.fit(data)
    
    return pipLogR

##### - Ordinary Logistic Regression

In [0]:
# Logistic Regression
# Regular: regParam = 0
# Lasso: regParam >0, elasticNetParam = 1
# Ridge: regParam >0, elasticNetParam = 0

maxIter, regParam, elasticNetParam = 100, 0, 0
pipLogR = logR_PS(trainLog, maxIter, regParam, elasticNetParam, myY, model_matrix_stages, scaler)

mod = pipLogR.stages[-1]

print("Coefficients Logistic Regression: ", mod.coefficientMatrix)
print("Intercept Logistic Regression: ", mod.interceptVector)

logR_pred = pipLogR.transform(testLog)

# preddf = logR_pred.rdd.map(extract).toDF(["DEP_DEL15", "p0", "p1", "label", "prediction"])
preddf = logR_pred.select("prediction","DEP_DEL15").rdd
# Get performance metrics
metricsLR = MulticlassMetrics(preddf)
mLR = logrmetrics(metricsLR)
dmPS1 = pd.DataFrame({"LogRegPS": list(mLR.values())}, index = list(mLR.keys()))
dmPS1

| Metric | LogRegPS |
|---|---|
| accuracy | 0.611427 |
| recall | 0.651446 |
| precision | 0.608865 |
| f1_score | 0.629436 |
| f05_score | 0.61693 |
| f2_score | 0.64246 |

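The code comments in these cells describe how `regParam` and `elasticNetParam` select between no regularization, Lasso, and Ridge. The sketch below spells out the elastic-net penalty that these two parameters control, following the form given in the Spark MLlib documentation: `regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2)`. `elastic_net_penalty` is a hypothetical helper for illustration only; it does not reproduce how MLlib standardizes features or optimizes the objective.

```python
import numpy as np

def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the loss for a given coefficient vector (intercept excluded)."""
    w = np.asarray(weights, dtype=float)
    l1_term = np.sum(np.abs(w))     # Lasso component
    l2_term = 0.5 * np.sum(w ** 2)  # Ridge component
    return reg_param * (elastic_net_param * l1_term
                        + (1.0 - elastic_net_param) * l2_term)

w = [0.5, -1.0, 2.0]  # illustrative coefficients, not notebook results
print("none :", elastic_net_penalty(w, 0.0, 0.0))  # regParam = 0
print("lasso:", elastic_net_penalty(w, 0.1, 1.0))  # regParam > 0, elasticNetParam = 1
print("ridge:", elastic_net_penalty(w, 0.1, 0.0))  # regParam > 0, elasticNetParam = 0
```

With `regParam = 0` the penalty vanishes regardless of `elasticNetParam`, which is why the ordinary run above passes `0, 0`.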

##### - Ridge Logistic Regression

In [0]:
# Logistic Regression
# Regular: regParam = 0
# Lasso: regParam >0, elasticNetParam = 1
# Ridge: regParam >0, elasticNetParam = 0

# maxIter, regParam, elasticNetParam = 100, 0.1, 0
# pipLogR = logR_PS(trainLog, maxIter, regParam, elasticNetParam, myY, model_matrix_stages, scaler)

# mod = pipLogR.stages[-1]

# print("Coefficients Logistic Regression: ", mod.coefficientMatrix)
# print("Intercept Logistic Regression: ", mod.interceptVector)

# logR_pred = pipLogR.transform(testLog)

# # preddf = logR_pred.rdd.map(extract).toDF(["DEP_DEL15", "p0", "p1", "label", "prediction"])
# preddf = logR_pred.select("prediction","DEP_DEL15").rdd
# # Get performance metrics
# metricsLR = MulticlassMetrics(preddf)
# mLR = logrmetrics(metricsLR)
# dmPS1 = pd.DataFrame({"LogReg-Ridge": list(mLR.values())}, index = list(mLR.keys()))
# dmPS1

##### - Lasso Logistic Regression

In [0]:
# Logistic Regression
# Regular: regParam = 0
# Lasso: regParam >0, elasticNetParam = 1
# Ridge: regParam >0, elasticNetParam = 0

# maxIter, regParam, elasticNetParam = 100, 0.1, 1
# pipLogR = logR_PS(trainLog, maxIter, regParam, elasticNetParam, myY, model_matrix_stages, scaler)

# mod = pipLogR.stages[-1]

# print("Coefficients Logistic Regression: ", mod.coefficientMatrix)
# print("Intercept Logistic Regression: ", mod.interceptVector)

# logR_pred = pipLogR.transform(testLog)

# # preddf = logR_pred.rdd.map(extract).toDF(["DEP_DEL15", "p0", "p1", "label", "prediction"])
# preddf = logR_pred.select("prediction","DEP_DEL15").rdd
# # Get performance metrics
# metricsLR = MulticlassMetrics(preddf)
# mLR = logrmetrics(metricsLR)
# dmPS1 = pd.DataFrame({"LogRegPS_Lasso": list(mLR.values())}, index = list(mLR.keys()))
# dmPS1

#### 6. PySpark Linear Regression

Since we have already tested the Lasso and Ridge regressions, those cells are commented out. See the final report for a comparison of results.

Reference: https://towardsdatascience.com/building-a-linear-regression-with-pyspark-and-mllib-d065c3ba246a

In [0]:
def rmsePS(data):
    """
    Compute Root Mean Squared Error (RMSE) for MLlib predictions
    Args:
        data - Spark DataFrame with "prediction" and "DEP_DELAY" columns
    Returns:
        rmse_result - (float) RMSE between predicted and observed delays
    """
        
    pred_lab = data.select("prediction", "DEP_DELAY").rdd.cache()

    rmse_result = pred_lab.map(lambda x: (x[0] - x[1] )**2).mean()
    rmse_result = np.sqrt(rmse_result)

    return rmse_result

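For reference, the same RMSE computation can be checked on a handful of values without Spark. This is a minimal NumPy sketch; `rmse_np` is a hypothetical helper name and the numbers are illustrative, not notebook results.

```python
import numpy as np

def rmse_np(predictions, labels):
    """Square root of the mean squared difference between predictions and labels."""
    predictions = np.asarray(predictions, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.sqrt(np.mean((predictions - labels) ** 2)))

# Errors are -2, 0, and -4 minutes, so the mean squared error is 20/3
print(rmse_np([10.0, 0.0, -5.0], [12.0, 0.0, -1.0]))
```

Because the errors are squared before averaging, a few very long delays can dominate the RMSE.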
##### - Prepare Data for Linear Regression

In [0]:
myYLR = "DEP_DELAY"

numerics = ['YEAR','DAY_OF_MONTH','DAY_OF_WEEK', 'DISTANCE','wind_speed_mps_orig','ceiling_ht_dim_orig','visibility_meters_orig','temp_cels_orig',
                                'dew_pt_orig','atmos_press_orig','precip_milimeters_orig', 'wind_speed_mps_dest','ceiling_ht_dim_dest','visibility_meters_dest',
                                'temp_cels_dest', 'dew_pt_dest', 'atmos_press_dest','precip_milimeters_dest', 'rolling_ninety_day_average','Air_Page_Rank_traffic',
                                'OD_delay_pair', 'time_of_day_int']

# SELECT THE COLUMNS WITH THE VARIABLES
dfLN = subset_df.select(numerics + [myYLR])

# SELECT THE COLUMNS WITH THE VARIABLES
# dflr = subset_df.select(myX + [myY])


# SPLIT DATA FOR TRAINING AND TESTING
year_train_val = 2018
trainLN = dfLN.filter(dfLN.YEAR <= year_train_val).cache()
testLN = dfLN.filter(dfLN.YEAR > year_train_val).cache()

trainLN = trainLN.drop("YEAR")
testLN = testLN.drop("YEAR")

numerics = ['DAY_OF_MONTH','DAY_OF_WEEK', 'DISTANCE','wind_speed_mps_orig','ceiling_ht_dim_orig','visibility_meters_orig','temp_cels_orig',
                                'dew_pt_orig','atmos_press_orig','precip_milimeters_orig', 'wind_speed_mps_dest','ceiling_ht_dim_dest','visibility_meters_dest',
                                'temp_cels_dest', 'dew_pt_dest', 'atmos_press_dest','precip_milimeters_dest', 'rolling_ninety_day_average','Air_Page_Rank_traffic',
                                'OD_delay_pair', 'time_of_day_int']

myX = numerics

In [0]:
trainLN.select(mean("DEP_DELAY")).collect()

##### - Imputers and Vector Assembler

These stages feed the PySpark MLlib implementation that we used to verify that our from-scratch implementation works.

In [0]:
# indexers = map(lambda c: StringIndexer(inputCol=c, outputCol=c+"_idx", handleInvalid = 'keep'), categoricals)
imputers = Imputer(inputCols = numerics, outputCols = numerics)
featureCols = numerics

# Define vector assemblers
# model_matrix_stagesLR = [imputers] + \
#                      [VectorAssembler(inputCols=featureCols, outputCol="features"), StringIndexer(inputCol=myYLR, outputCol="label")]

model_matrix_stagesLR = [imputers] + \
                     [VectorAssembler(inputCols=featureCols, outputCol="features")]


# Apply StandardScaler to create scaledFeatures
scaler = StandardScaler(inputCol="features",
                        outputCol="scaledFeatures",
                        withStd=True,
                        withMean=True)

##### - Ordinary Linear Regression

Below is the MLlib Linear Regression implementation with no regularization applied.

In [0]:
# LINEAR REGRESSION
# Regular: regParam = 0
# Lasso: regParam >0, elasticNetParam = 1
# Ridge: regParam >0, elasticNetParam = 0
regParam = 0
elasticNetParam = 0

lnR = LinearRegression(maxIter=200, featuresCol = "scaledFeatures", labelCol = myYLR, regParam=regParam, elasticNetParam = elasticNetParam, fitIntercept=True)

lrpipeline = Pipeline(stages=model_matrix_stagesLR+[scaler]+[lnR])

pipLR = lrpipeline.fit(trainLN)

modLR = pipLR.stages[-1]

print("Coefficients Linear Regression: ", modLR.coefficients)
print("Intercept Linear Regression: ", modLR.intercept)

LR_pred = pipLR.transform(testLN)

print("RMSE: ",rmsePS(LR_pred))

##### - Lasso Linear Regression

In [0]:
# # LINEAR REGRESSION
# # Regular: regParam = 0
# # Lasso: regParam >0, elasticNetParam = 1
# # Ridge: regParam >0, elasticNetParam = 0
# regParam = 0.1
# elasticNetParam = 1

# lnR = LinearRegression(maxIter=200, featuresCol = "scaledFeatures", labelCol = myYLR, regParam=regParam, elasticNetParam = elasticNetParam, fitIntercept=True)

# lrpipeline = Pipeline(stages=model_matrix_stagesLR+[scaler]+[lnR])

# pipLR = lrpipeline.fit(trainLN)

# modLR = pipLR.stages[-1]

# print("Coefficients Linear Regression: ", modLR.coefficients)
# print("Intercept Linear Regression: ", modLR.intercept)

# LR_pred = pipLR.transform(testLN)

# print("RMSE: ",rmsePS(LR_pred))

##### - Ridge Linear Regression

In [0]:
# # LINEAR REGRESSION
# # Regular: regParam = 0
# # Lasso: regParam >0, elasticNetParam = 1
# # Ridge: regParam >0, elasticNetParam = 0
# regParam = 0.1
# elasticNetParam = 0

# lnR = LinearRegression(maxIter=200, featuresCol = "scaledFeatures", labelCol = myYLR, regParam=regParam, elasticNetParam = elasticNetParam, fitIntercept=True)

# lrpipeline = Pipeline(stages=model_matrix_stagesLR+[scaler]+[lnR])

# pipLR = lrpipeline.fit(trainLN)

# modLR = pipLR.stages[-1]

# print("Coefficients Linear Regression: ", modLR.coefficients)
# print("Intercept Linear Regression: ", modLR.intercept)

# LR_pred = pipLR.transform(testLN)

# print("RMSE: ",rmsePS(LR_pred))