## Flight Delay Prediction - Logistic Regression

#### Importing packages

In [3]:
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType, NullType, ShortType, DateType, BooleanType, BinaryType
from pyspark.sql import SQLContext
import pyspark.ml.feature as ftr
import pyspark.ml as ml
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.sql.window import Window
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, OneHotEncoder
from pyspark.ml.feature import StandardScaler, Imputer
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.pipeline import PipelineModel
import numpy as np
import seaborn as sns
import re
import time
import sklearn.metrics as metrics
import matplotlib.pyplot as plt
import pandas as pd

sqlContext = SQLContext(sc)

#### Setting up the Data, Team and Model paths

In [5]:
# Data directory
DATA_PATH = "dbfs:/mnt/mids-w261/data/datasets_final_project/"
display(dbutils.fs.ls(DATA_PATH))

path,name,size
dbfs:/mnt/mids-w261/data/datasets_final_project/airlines_data/,airlines_data/,0
dbfs:/mnt/mids-w261/data/datasets_final_project/allstate-claims-severity.zip,allstate-claims-severity.zip,51204863
dbfs:/mnt/mids-w261/data/datasets_final_project/dac.tar.gz,dac.tar.gz,4576820670
dbfs:/mnt/mids-w261/data/datasets_final_project/kdd-cup-2014-predicting-excitement-at-donors-choose.zip,kdd-cup-2014-predicting-excitement-at-donors-choose.zip,971133938
dbfs:/mnt/mids-w261/data/datasets_final_project/parquet_airlines_data/,parquet_airlines_data/,0
dbfs:/mnt/mids-w261/data/datasets_final_project/parquet_airlines_data_3m/,parquet_airlines_data_3m/,0
dbfs:/mnt/mids-w261/data/datasets_final_project/parquet_airlines_data_6m/,parquet_airlines_data_6m/,0
dbfs:/mnt/mids-w261/data/datasets_final_project/porto-seguro-safe-driver-prediction.zip,porto-seguro-safe-driver-prediction.zip,80247571
dbfs:/mnt/mids-w261/data/datasets_final_project/walmart-recruiting-trip-type-classification.zip,walmart-recruiting-trip-type-classification.zip,11510035
dbfs:/mnt/mids-w261/data/datasets_final_project/weather_data/,weather_data/,0


In [6]:
# create team folder
# dbutils.fs.mkdirs('dbfs:/mnt/w261/team22')
TEAM_PATH = 'dbfs:/mnt/w261/team22/'

In [7]:
# Path to save models
MODEL_LR = 'dbfs:/mnt/w261/team22/model/lr'

In [8]:
# Read from parquet
trainRDD = spark.read.option("header", "true").parquet(TEAM_PATH+"trainRDD.parquet")
validationRDD = spark.read.option("header", "true").parquet(TEAM_PATH+"validationRDD.parquet")
testRDD = spark.read.option("header", "true").parquet(TEAM_PATH+"testRDD.parquet")


# Checking the number of records for each dataset
print(f"... train dataset has {trainRDD.count()} records for evaluation")
print(f"... validation dataset has {validationRDD.count()} records for evaluation")
print(f"... test dataset has {testRDD.count()} records for evaluation")

In [9]:
trainRDD.printSchema()

### Logistic Regression

#### Logistic regression
**- Expectation:** We face a binary classification task: Delay versus On-time. Logistic regression takes in a set of features and calculates the outcome (dependent variable) from the probability of each class, which makes it intuitive and easy to interpret. It also works well with categorical variables once they are encoded. For the flight delay task, we expect logistic regression to take in all numeric and categorical variables (timeline, airline, airport and weather-related conditions) and then tell us whether the flight will be on-time (0) or delayed (1).

**- Trade-Off:** Logistic regression can take in many categorical features, each of which needs to be encoded into a vector. However, numerous categorical variables, especially ones with many distinct categories, increase the training time significantly.

**- The algorithm:** Logistic regression is a classification algorithm that assigns observations to a discrete set of classes using probability. The algorithm takes in real-valued input and maps it to a value between 0 and 1 using the sigmoid function, which is an 'S' shaped curve.

The logistic regression hypothesis is bounded between 0 and 1: $$0 \leq h_{\theta}(x) \leq 1$$

$$h_{\theta}(x) = \frac{1}{(1 + e^{-(\beta_0 +\beta_1X)})}$$

In the above function, $X$ is the input and $e$ is the base of the natural logarithm. With a chosen threshold of 0.5, any prediction with a value greater than 0.5 is classified as a "Delay". The coefficients $\beta_0$ and $\beta_1$ are fit by Maximum Likelihood Estimation, which chooses them to maximise the likelihood of the observed labels given the inputs.
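
The hypothesis and the 0.5 threshold can be sketched in a few lines of plain Python. This is only an illustration: the coefficients `beta_0` and `beta_1` are hypothetical values and are not taken from the fitted model.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, beta_0, beta_1, threshold=0.5):
    """Return (probability of delay, predicted class) for one numeric input x."""
    p = sigmoid(beta_0 + beta_1 * x)
    return p, int(p > threshold)

# Hypothetical coefficients, chosen only for illustration
beta_0, beta_1 = -1.0, 0.5

for x in (0.0, 2.0, 6.0):
    p, label = predict(x, beta_0, beta_1)
    print(f"x={x:>4}: P(delay)={p:.3f} -> class {label}")
```

Because the comparison is strict, an input whose probability is exactly 0.5 is classified as on-time (0).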

#### Dropping Categorical Variables With Multiple Distinct Categories:

Initially it was decided to keep **TAIL_NUM** and **OP_CARRIER_FL_NUM** so that the model could identify systemic delays of a particular plane or carrier. However, the running time increased significantly, so those features were dropped from the dataset.

In [13]:
# Dropping categorical features with many distinct values
trainRDD_LR = trainRDD.drop("TAIL_NUM", "OP_CARRIER_FL_NUM").cache()
validationRDD_LR = validationRDD.drop("TAIL_NUM", "OP_CARRIER_FL_NUM")
testRDD_LR = testRDD.drop("TAIL_NUM", "OP_CARRIER_FL_NUM")

#### Pipeline

**- String Indexer**:
Encodes a column of string labels/categories as a column of indices. Indices are assigned by label frequency, so the most frequent label gets index 0, and the range is [0, numOfLabels).

**- One Hot Encoder:**
The one hot encoder maps each label index to a binary vector with at most a single one-value. This is generally used when we need categorical features but the algorithm expects continuous ones. The Spark one hot encoder takes the indexed label/category from the string indexer and encodes it into a sparse vector.
When such a vector is displayed, the first component, a 0, indicates that it is a sparse vector. The second component is the size of the vector. The third component lists the indices at which the vector is populated, and the fourth lists the values at those indices. Storing only the non-zero entries keeps the representation compact, which is efficient for very large vectors.

**- Vector Assembler:**
Vector assembler’s job is to combine the raw features and features generated from various transforms into a single feature vector. It accepts boolean, numerical and vector type inputs.

**- Logistic Regression:**
The data will be fit with the Logistic Regression algorithm, which works well for binary classification tasks. The following parameters are passed into Logistic Regression:
  - maxIter: Maximum number of iterations allowed for convergence, set to 10
  
  - weightCol: Column of per-row weights, used in place of the default equal weighting to account for the unbalanced data
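
To make the first two stages concrete, here is a plain-Python imitation of frequency-based indexing followed by a sparse one-hot encoding. It is a sketch and not Spark's implementation: the helper names, the carrier codes and the alphabetical tie-breaking are assumptions made for illustration, and unlike Spark's encoder (which drops the last category by default via `dropLast=True`) it keeps one slot per category.

```python
from collections import Counter

def fit_string_indexer(values):
    """Map each distinct string to an index, most frequent label first.
    Ties are broken alphabetically here (an assumption made for determinism)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}

def one_hot_sparse(index, size):
    """Return (size, indices, values): only the single non-zero entry is stored."""
    return (size, [index], [1.0])

# Hypothetical carrier codes, for illustration only
carriers = ["AA", "DL", "AA", "UA", "AA", "DL"]
index_of = fit_string_indexer(carriers)
encoded = [one_hot_sparse(index_of[c], len(index_of)) for c in carriers]

print(index_of)     # mapping from label to index
print(encoded[0])   # sparse form of the first row
```

Each encoded row stores one index and one value regardless of how many categories exist, which is why the sparse form scales well to features with many distinct values.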

In [15]:
# Extracting the numeric and categorical features
numerics = [feature for (feature, dataType) in trainRDD_LR.dtypes if ((dataType == "double") | (dataType == "int")) & ((feature != "DEP_DEL15"))]
categoricals = [feature for (feature, dataType) in trainRDD_LR.dtypes if (dataType == "string") & (feature != "DEP_DEL15")]

# Defining variable names for ML pipeline input 
stages = []
featureCols = []

# Creating StringIndexer and OneHotEncoder for categorical features
for c in categoricals:
  stringIndexers = StringIndexer(inputCol=c, outputCol=c+"Index", handleInvalid = 'keep')
  encoder = OneHotEncoder(inputCol=c+"Index", outputCol=c+"OHE")
  stages += [stringIndexers, encoder]
  featureCols += [c+"OHE"]


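# Creating StringIndexer for label
# Note: handleInvalid='keep' reserves one extra index for unseen labels, so the
# label column carries three indices (0, 1 and the extra bucket)
label_stringIndexer = StringIndexer(inputCol="DEP_DEL15", outputCol="label", handleInvalid = 'keep')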

# Adding imputers for numeric columns
#imputers = ftr.Imputer(inputCols = numerics, outputCols = numerics)

# feature inputs for assembler
featureCols += numerics

# Creating a vector assembler so that the input is in a single vector
VecAssembler = VectorAssembler(inputCols=featureCols, outputCol="features")

# Scaling to normalize features
scaler = StandardScaler(inputCol="features",
                        outputCol="scaledFeatures",
                        withStd=True,
                        withMean=True)



# Handling class imbalance by adding a weight to each label
dataset_size = float(trainRDD_LR.select("DEP_DEL15").count())
numPositives = trainRDD_LR.select("DEP_DEL15").where('DEP_DEL15 == 1').count()
per_ones = (float(numPositives) / float(dataset_size)) * 100  # share of delayed flights, in percent
numNegatives = float(dataset_size - numPositives)
# Share of on-time flights; used as the weight of the rarer "delay" class
BalancingRatio = numNegatives / dataset_size
# Delayed rows get BalancingRatio, on-time rows get 1 - BalancingRatio
trainRDD_LR = trainRDD_LR.withColumn("classWeights", f.when(trainRDD_LR["DEP_DEL15"] == "1.0", BalancingRatio).otherwise(1 - BalancingRatio))


#Logistic Regression Classifier
lr = LogisticRegression(maxIter=10, elasticNetParam=0.5, featuresCol = "scaledFeatures", labelCol="label", weightCol="classWeights")

# Setting stage variable
stages += [label_stringIndexer, VecAssembler, scaler, lr]

# setting up the pipeline
pipeline = Pipeline(stages=stages)
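
The effect of the class weighting in the cell above can be checked with plain arithmetic. The sketch below uses hypothetical counts (not the flight data) and a hypothetical helper `class_weights`; it shows that weighting each class by the share of the other class gives both classes the same total weight.

```python
def class_weights(num_positives, num_negatives):
    """Weight each class by the share of the other class in the data."""
    total = num_positives + num_negatives
    balancing_ratio = num_negatives / total
    return {"delay": balancing_ratio, "on_time": 1 - balancing_ratio}

# Hypothetical counts (not taken from the flight data): 20 delayed, 80 on-time
weights = class_weights(num_positives=20, num_negatives=80)
weighted_delay = 20 * weights["delay"]        # total weight of the delayed rows
weighted_on_time = 80 * weights["on_time"]    # total weight of the on-time rows

print(weights)
print(weighted_delay, weighted_on_time)
```

Up to floating-point rounding, the two totals match, so the minority class is no longer outvoted during training.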

In [16]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Build the parameter grid for model tuning
paramGrid = ParamGridBuilder() \
              .addGrid(lr.regParam, [0.1, 0.01]) \
              .build()

# Execute CrossValidator for model tuning
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=3)

# Train the tuned model and establish our best model
cvModel = crossval.fit(trainRDD_LR)
glm_model = cvModel.bestModel
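
As a rough picture of the work this search does, the sketch below enumerates the (regParam, fold) combinations for a grid of two values and three folds. It is plain Python on a toy row count, not Spark code: the helper `k_fold_indices` is hypothetical and assigns folds round-robin, whereas Spark assigns rows to folds randomly. After the search, `CrossValidator` refits the best parameter map on the full training set, which is the model returned as `bestModel`.

```python
from itertools import product

def k_fold_indices(n_rows, num_folds):
    """Assign row i to fold i % num_folds (Spark assigns folds randomly instead)."""
    folds = [[] for _ in range(num_folds)]
    for row in range(n_rows):
        folds[row % num_folds].append(row)
    return folds

reg_params = [0.1, 0.01]   # same grid as above
num_folds = 3              # same as numFolds above
folds = k_fold_indices(n_rows=9, num_folds=num_folds)  # 9 rows is a toy size

fits = []
for reg_param, held_out in product(reg_params, range(num_folds)):
    # Train on every fold except the held-out one, evaluate on the held-out fold
    train_rows = [r for i, fold in enumerate(folds) if i != held_out for r in fold]
    fits.append((reg_param, held_out, len(train_rows), len(folds[held_out])))

for fit in fits:
    print(fit)
print("models fitted during the search:", len(fits))
```

Every extra grid value adds `numFolds` more pipeline fits, which is worth keeping in mind given the training times discussed earlier.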

In [17]:
dbutils.fs.mkdirs('dbfs:/mnt/w261/team22/model/lr2')
MODEL_LR2 = 'dbfs:/mnt/w261/team22/model/lr2'

In [18]:
# Saving the model
glm_model.write().overwrite().save(MODEL_LR2)

In [19]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="prediction", metricName='areaUnderROC')

# Make predictions
predictionAndTarget = cvModel.transform(validationRDD_LR).select("label", "prediction")

auc = evaluator.evaluate(predictionAndTarget)
auc

In [20]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
#evaluator = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="prediction", metricName='areaUnderROC')

# Make predictions
predictionAndTarget_test = glm_model.transform(testRDD_LR).select("label", "prediction")

# Evaluate on the test predictions (not the validation predictions)
auc_test = evaluator.evaluate(predictionAndTarget_test)
auc_test

In [21]:
# Metrics on Validation Data

# Metrics - part 1

predictions = glm_model.transform(validationRDD_LR)
evaluator = BinaryClassificationEvaluator()

# Metrics - part 2
tp = predictions[(predictions.DEP_DEL15 == 1) & (predictions.prediction == 1)].count()
tn = predictions[(predictions.DEP_DEL15 == 0) & (predictions.prediction == 0)].count()
fp = predictions[(predictions.DEP_DEL15 == 0) & (predictions.prediction == 1)].count()
fn = predictions[(predictions.DEP_DEL15 == 1) & (predictions.prediction == 0)].count()

total = predictions.count()
recall = float(tp)/(tp + fn)

# Metrics - part 3
data = {'Actual: delay': [tp, fn], 'Actual: on-time': [fp, tn]}
confusion_matrix = pd.DataFrame.from_dict(data, orient="index",
                                         columns=['Prediction: delay', "Prediction: on-time"])

print("Validation Area Under ROC: ", "{:.2f}".format(evaluator.evaluate(predictions, {evaluator.metricName: 'areaUnderROC'})))
print("Validation Area Under Precision-Recall Curve: ", "{:.2f}".format(evaluator.evaluate(predictions, {evaluator.metricName: 'areaUnderPR'})))

print("True positive rate: {:.2%}".format(tp/(tp + fn)))
print("True negative rate: {:.2%}".format(tn/(tn + fp)))
print("False positive rate: {:.2%}".format(fp/(tn + fp)))
print("False negative rate: {:.2%}".format(fn/(tp + fn)))


# Metrics - part 4
precision = tp/(tp + fp)
print("Precision: {:.2%}".format(precision))
recall = tp/(tp + fn)
print("Recall: {:.2%}".format(recall))

f1_score = (2 * precision * recall)/(precision + recall)
print("F1 Score: {:.2%}".format(f1_score))


print("########### Confusion Matrix ###########")
print(confusion_matrix)

In [22]:
# Metrics on test Data

# Metrics - part 1

predictions = glm_model.transform(testRDD_LR)
evaluator = BinaryClassificationEvaluator()

# Metrics - part 2
tp = predictions[(predictions.DEP_DEL15 == 1) & (predictions.prediction == 1)].count()
tn = predictions[(predictions.DEP_DEL15 == 0) & (predictions.prediction == 0)].count()
fp = predictions[(predictions.DEP_DEL15 == 0) & (predictions.prediction == 1)].count()
fn = predictions[(predictions.DEP_DEL15 == 1) & (predictions.prediction == 0)].count()

total = predictions.count()
recall = float(tp)/(tp + fn)

# Metrics - part 3
data = {'Actual: delay': [tp, fn], 'Actual: on-time': [fp, tn]}
confusion_matrix = pd.DataFrame.from_dict(data, orient="index",
                                         columns=['Prediction: delay', "Prediction: on-time"])

print("Test Area Under ROC: ", "{:.2f}".format(evaluator.evaluate(predictions, {evaluator.metricName: 'areaUnderROC'})))
print("Test Area Under Precision-Recall Curve: ", "{:.2f}".format(evaluator.evaluate(predictions, {evaluator.metricName: 'areaUnderPR'})))

print("True positive rate: {:.2%}".format(tp/(tp + fn))) # Positive classes classified accurately
print("True negative rate: {:.2%}".format(tn/(tn + fp))) # Proportion of the negative class got correctly classified
print("False positive rate: {:.2%}".format(fp/(tn + fp))) # Proportion of the negative class got incorrectly classified
print("False negative rate: {:.2%}".format(fn/(tp + fn))) # Proportion of the positive class got incorrectly classified


# Metrics - part 4
precision = tp/(tp + fp)
print("Precision: {:.2%}".format(precision))
recall = tp/(tp + fn) # Same as the true positive rate
print("Recall: {:.2%}".format(recall))

f1_score = (2 * precision * recall)/(precision + recall)
print("F1 Score: {:.2%}".format(f1_score))


print("########### Confusion Matrix ###########")
print(confusion_matrix)

##### Interpretation:

Weights were applied to each label during training to counteract the imbalanced classes in the dataset. With that, the area under the ROC curve for test data, at 79%, is pretty decent. However, the false positive rate of 11.7% is too high for the model to be viable as a business solution. One reason might be the linear separation performed by Logistic Regression.
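
To see why a false positive rate of this order matters when on-time flights dominate, the rates can be recomputed from confusion-matrix counts. The sketch below uses hypothetical counts, not the notebook's results, and a hypothetical helper `rates`.

```python
def rates(tp, tn, fp, fn):
    """Derive the reported rates from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # same as the true positive rate
    return {
        "true_positive_rate": recall,
        "false_positive_rate": fp / (fp + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts, not the notebook's results
example = rates(tp=60, tn=800, fp=100, fn=40)
for name, value in example.items():
    print(f"{name}: {value:.2%}")
```

In this made-up example roughly one in nine on-time flights is flagged, yet because on-time flights are so numerous the false alarms outnumber the correctly flagged delays and precision falls below 40%.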

In [24]:
# Checking the folder
#display(dbutils.fs.ls('dbfs:/mnt/w261/team22/model/lr2/stages/'))

path,name,size
dbfs:/mnt/w261/team22/model/lr/stages/00_StringIndexer_4358b82be985/,00_StringIndexer_4358b82be985/,0
dbfs:/mnt/w261/team22/model/lr/stages/01_OneHotEncoder_a1ef51365cbc/,01_OneHotEncoder_a1ef51365cbc/,0
dbfs:/mnt/w261/team22/model/lr/stages/02_StringIndexer_9674c55eaf33/,02_StringIndexer_9674c55eaf33/,0
dbfs:/mnt/w261/team22/model/lr/stages/03_OneHotEncoder_5e811dfa62ab/,03_OneHotEncoder_5e811dfa62ab/,0
dbfs:/mnt/w261/team22/model/lr/stages/04_StringIndexer_df18f760a518/,04_StringIndexer_df18f760a518/,0
dbfs:/mnt/w261/team22/model/lr/stages/05_OneHotEncoder_7224f2baee82/,05_OneHotEncoder_7224f2baee82/,0
dbfs:/mnt/w261/team22/model/lr/stages/06_StringIndexer_2a434eebe99d/,06_StringIndexer_2a434eebe99d/,0
dbfs:/mnt/w261/team22/model/lr/stages/07_OneHotEncoder_83fa09956f2a/,07_OneHotEncoder_83fa09956f2a/,0
dbfs:/mnt/w261/team22/model/lr/stages/08_StringIndexer_5bea141bd535/,08_StringIndexer_5bea141bd535/,0
dbfs:/mnt/w261/team22/model/lr/stages/09_OneHotEncoder_77f4d0ed4e79/,09_OneHotEncoder_77f4d0ed4e79/,0


In [25]:
# Loading the model
loaded_model_lr = PipelineModel.load(MODEL_LR)
param_dict = loaded_model_lr.stages[-1].extractParamMap()

In [26]:
# Loading the model
loaded_model_lr2 = PipelineModel.load(MODEL_LR2)
param_dict2 = loaded_model_lr2.stages[-1].extractParamMap()

In [27]:
lr_summary2 = glm_model.stages[-1].summary
#display(lr_summary2.roc)

In [28]:
display(predictions.select('label', 'rawPrediction', 'prediction', 'probability'))

label,rawPrediction,prediction,probability
0.0,"List(1, 3, List(), List(1.4648690303953376, 2.7534169494836505, -3.6170217994259914))",1.0,"List(1, 3, List(), List(0.21580916778192635, 0.7828510545178153, 0.001339777700258354))"
0.0,"List(1, 3, List(), List(2.0551598891089298, 1.438156100076067, -3.6170217994259914))",0.0,"List(1, 3, List(), List(0.6480885595946951, 0.34968178808192074, 0.002229652323384048))"
0.0,"List(1, 3, List(), List(2.058877618134735, 1.602130366016745, -3.6170217994259914))",0.0,"List(1, 3, List(), List(0.6109601435242358, 0.38694573879910815, 0.0020941176766561074))"
0.0,"List(1, 3, List(), List(2.0177046589063563, 1.7108920902966325, -3.6170217994259914))",0.0,"List(1, 3, List(), List(0.5749240597953589, 0.42302251033168553, 0.002053429872955566))"
0.0,"List(1, 3, List(), List(2.2019616596467677, 1.2704956074644314, -3.6170217994259914))",0.0,"List(1, 3, List(), List(0.7158471194008209, 0.2820263684164519, 0.002126512182727233))"
0.0,"List(1, 3, List(), List(2.12862442097952, 1.3497760278416642, -3.6170217994259914))",0.0,"List(1, 3, List(), List(0.683933298222563, 0.3138803939986929, 0.0021863077787440633))"
0.0,"List(1, 3, List(), List(2.2670514603637426, 1.2365956784088528, -3.6170217994259914))",0.0,"List(1, 3, List(), List(0.7354954569018017, 0.26245734699302276, 0.002047196105175442))"
0.0,"List(1, 3, List(), List(1.9805491801282955, 1.6770202268025274, -3.6170217994259914))",0.0,"List(1, 3, List(), List(0.5740807022227576, 0.42379126281048896, 0.002128034966753479))"
0.0,"List(1, 3, List(), List(1.9781349240770847, 1.6782027383148075, -3.6170217994259914))",0.0,"List(1, 3, List(), List(0.5732024603153604, 0.42466762427598165, 0.0021299154086580323))"
0.0,"List(1, 3, List(), List(2.0515830378486557, 1.3987900181050774, -3.6170217994259914))",0.0,"List(1, 3, List(), List(0.6561497213781479, 0.34158480428005494, 0.002265474341797243))"


## Running Model without Class Weights

Since the tree-based algorithms performed much better, we wanted to check whether the class weights passed into the model (not present in the tree models) were influencing the predictions. However, the false positive rate did not improve. This led us to believe that the poor performance of logistic regression might be due to the algorithm's ability to separate data only linearly.
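
The linear-separation limitation can be illustrated on toy data. A logistic model predicts the positive class when the sigmoid output exceeds 0.5, which happens exactly when the linear score `w1*x1 + w2*x2 + b` is positive, so its decision boundary is a straight line. The sketch below (toy points and a hypothetical helper, unrelated to the flight data) searches a small grid of lines and finds one for the AND pattern but none for the XOR pattern.

```python
from itertools import product

def linearly_separable(points, grid):
    """Search a small grid of (w1, w2, b) for a line w1*x1 + w2*x2 + b = 0
    that puts every label-1 point on the positive side and every label-0
    point on the non-positive side."""
    for w1, w2, b in product(grid, repeat=3):
        if all((w1 * x1 + w2 * x2 + b > 0) == bool(label)
               for (x1, x2), label in points):
            return True
    return False

grid = [i / 2 for i in range(-4, 5)]   # -2.0, -1.5, ..., 2.0

and_points = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
xor_points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print("AND separable:", linearly_separable(and_points, grid))
print("XOR separable:", linearly_separable(xor_points, grid))
```

A tree can handle the XOR pattern with two nested splits, which is one plausible reason tree-based models fare better when delays depend on interactions between features.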

In [30]:
# Read from parquet
trainRDD = spark.read.option("header", "true").parquet(TEAM_PATH+"trainRDD.parquet")
validationRDD = spark.read.option("header", "true").parquet(TEAM_PATH+"validationRDD.parquet")
testRDD = spark.read.option("header", "true").parquet(TEAM_PATH+"testRDD.parquet")


# Checking the number of records for each dataset
print(f"... train dataset has {trainRDD.count()} records for evaluation")
print(f"... validation dataset has {validationRDD.count()} records for evaluation")
print(f"... test dataset has {testRDD.count()} records for evaluation")

#### Dropping Categorical Variables With Multiple Distinct Categories:

As in the weighted run, **TAIL_NUM** and **OP_CARRIER_FL_NUM** are dropped from the dataset to keep the running time manageable.

In [32]:
# Dropping categorical features with many distinct values
trainRDD_LR = trainRDD.drop("TAIL_NUM", "OP_CARRIER_FL_NUM").cache()
validationRDD_LR = validationRDD.drop("TAIL_NUM", "OP_CARRIER_FL_NUM")
testRDD_LR = testRDD.drop("TAIL_NUM", "OP_CARRIER_FL_NUM")

#### Pipeline

The String Indexer, One Hot Encoder and Vector Assembler stages are the same as in the weighted run above.

**- Logistic Regression:**
The data is fit with the Logistic Regression algorithm as before, with one change to the parameters:
  - maxIter: Maximum number of iterations allowed for convergence, set to 10
  
  - weightCol: Not passed in this run, so every row carries the default equal weight

In [34]:
# Extracting the numeric and categorical features
numerics = [feature for (feature, dataType) in trainRDD_LR.dtypes if ((dataType == "double") | (dataType == "int")) & ((feature != "DEP_DEL15"))]
categoricals = [feature for (feature, dataType) in trainRDD_LR.dtypes if (dataType == "string") & (feature != "DEP_DEL15")]

# Defining variable names for ML pipeline input 
stages = []
featureCols = []

# Creating StringIndexer and OneHotEncoder for categorical features
for c in categoricals:
  stringIndexers = StringIndexer(inputCol=c, outputCol=c+"Index", handleInvalid = 'keep')
  encoder = OneHotEncoder(inputCol=c+"Index", outputCol=c+"OHE")
  stages += [stringIndexers, encoder]
  featureCols += [c+"OHE"]


# Creating StringIndexer for label
label_stringIndexer = StringIndexer(inputCol="DEP_DEL15", outputCol="label", handleInvalid = 'keep')

# Adding imputers for numeric columns
#imputers = ftr.Imputer(inputCols = numerics, outputCols = numerics)

# feature inputs for assembler
featureCols += numerics

# Creating a vector assembler so that the input is in a single vector
VecAssembler = VectorAssembler(inputCols=featureCols, outputCol="features")

# Scaling to normalize features
scaler = StandardScaler(inputCol="features",
                        outputCol="scaledFeatures",
                        withStd=True,
                        withMean=True)



# Class weights are not applied in this run; the weighting code is kept below for reference
#dataset_size=float(trainRDD_LR.select("DEP_DEL15").count())
#numPositives=trainRDD_LR.select("DEP_DEL15").where('DEP_DEL15 == 1').count()
#per_ones=(float(numPositives)/float(dataset_size))*100
#numNegatives=float(dataset_size-numPositives)
#BalancingRatio= numNegatives/dataset_size
#trainRDD_LR = trainRDD_LR.withColumn("classWeights", f.when(trainRDD_LR["DEP_DEL15"] == "1.0",BalancingRatio).otherwise(1-BalancingRatio))


#Logistic Regression Classifier
lr = LogisticRegression(maxIter=10, elasticNetParam=0.5, featuresCol = "scaledFeatures", labelCol="label")

# Setting stage variable
stages += [label_stringIndexer, VecAssembler, scaler, lr]

# setting up the pipeline
pipeline = Pipeline(stages=stages)

In [35]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Build the parameter grid for model tuning
paramGrid = ParamGridBuilder() \
              .addGrid(lr.regParam, [0.1, 0.01]) \
              .build()

# Execute CrossValidator for model tuning
crossval3 = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator().setRawPredictionCol('prediction').setLabelCol('label'),
                          numFolds=3,
                          parallelism=3)

# Train the tuned model and establish our best model
cvModel3 = crossval3.fit(trainRDD_LR)
glm_model3 = cvModel3.bestModel

In [36]:
dbutils.fs.mkdirs('dbfs:/mnt/w261/team22/model/lr3')
MODEL_LR3 = 'dbfs:/mnt/w261/team22/model/lr3'

In [37]:
# Saving the model
glm_model3.write().overwrite().save(MODEL_LR3)

In [38]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="prediction", metricName='areaUnderROC')

# Make predictions
predictionAndTarget = cvModel3.transform(validationRDD_LR).select("label", "prediction")

auc = evaluator.evaluate(predictionAndTarget)
auc

In [39]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
#evaluator = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="prediction", metricName='areaUnderROC')

# Make predictions
predictionAndTarget_test = glm_model3.transform(testRDD_LR).select("label", "prediction")

# Evaluate on the test predictions (not the validation predictions)
auc_test = evaluator.evaluate(predictionAndTarget_test)
auc_test

In [40]:
# Metrics on Validation Data

# Metrics - part 1

predictions = glm_model3.transform(validationRDD_LR)
evaluator = BinaryClassificationEvaluator()

# Metrics - part 2
tp = predictions[(predictions.DEP_DEL15 == 1) & (predictions.prediction == 1)].count()
tn = predictions[(predictions.DEP_DEL15 == 0) & (predictions.prediction == 0)].count()
fp = predictions[(predictions.DEP_DEL15 == 0) & (predictions.prediction == 1)].count()
fn = predictions[(predictions.DEP_DEL15 == 1) & (predictions.prediction == 0)].count()

total = predictions.count()
recall = float(tp)/(tp + fn)

# Metrics - part 3
data = {'Actual: delay': [tp, fn], 'Actual: on-time': [fp, tn]}
confusion_matrix = pd.DataFrame.from_dict(data, orient="index",
                                         columns=['Prediction: delay', "Prediction: on-time"])

print("Validation Area Under ROC: ", "{:.2f}".format(evaluator.evaluate(predictions, {evaluator.metricName: 'areaUnderROC'})))
print("Validation Area Under Precision-Recall Curve: ", "{:.2f}".format(evaluator.evaluate(predictions, {evaluator.metricName: 'areaUnderPR'})))

print("True positive rate: {:.2%}".format(tp/(tp + fn)))
print("True negative rate: {:.2%}".format(tn/(tn + fp)))
print("False positive rate: {:.2%}".format(fp/(tn + fp)))
print("False negative rate: {:.2%}".format(fn/(tp + fn)))


# Metrics - part 4
precision = tp/(tp + fp)
print("Precision: {:.2%}".format(precision))
recall = tp/(tp + fn)
print("Recall: {:.2%}".format(recall))

f1_score = (2 * precision * recall)/(precision + recall)
print("F1 Score: {:.2%}".format(f1_score))


print("########### Confusion Matrix ###########")
print(confusion_matrix)

In [41]:
# Metrics on test Data

# Metrics - part 1

predictions = glm_model3.transform(testRDD_LR)
evaluator = BinaryClassificationEvaluator()

# Metrics - part 2
tp = predictions[(predictions.DEP_DEL15 == 1) & (predictions.prediction == 1)].count()
tn = predictions[(predictions.DEP_DEL15 == 0) & (predictions.prediction == 0)].count()
fp = predictions[(predictions.DEP_DEL15 == 0) & (predictions.prediction == 1)].count()
fn = predictions[(predictions.DEP_DEL15 == 1) & (predictions.prediction == 0)].count()

total = predictions.count()
recall = float(tp)/(tp + fn)

# Metrics - part 3
data = {'Actual: delay': [tp, fn], 'Actual: on-time': [fp, tn]}
confusion_matrix = pd.DataFrame.from_dict(data, orient="index",
                                         columns=['Prediction: delay', "Prediction: on-time"])

print("Test Area Under ROC: ", "{:.2f}".format(evaluator.evaluate(predictions, {evaluator.metricName: 'areaUnderROC'})))
print("Test Area Under Precision-Recall Curve: ", "{:.2f}".format(evaluator.evaluate(predictions, {evaluator.metricName: 'areaUnderPR'})))

print("True positive rate: {:.2%}".format(tp/(tp + fn))) # Positive classes classified accurately
print("True negative rate: {:.2%}".format(tn/(tn + fp))) # Proportion of the negative class got correctly classified
print("False positive rate: {:.2%}".format(fp/(tn + fp))) # Proportion of the negative class got incorrectly classified
print("False negative rate: {:.2%}".format(fn/(tp + fn))) # Proportion of the positive class got incorrectly classified


# Metrics - part 4
precision = tp/(tp + fp)
print("Precision: {:.2%}".format(precision))
recall = tp/(tp + fn) # Same as the true positive rate
print("Recall: {:.2%}".format(recall))

f1_score = (2 * precision * recall)/(precision + recall)
print("F1 Score: {:.2%}".format(f1_score))


print("########### Confusion Matrix ###########")
print(confusion_matrix)