# Group 04 | Model Building | Gradient Boosted Trees

-------------------------
Amber Curran (akc6be)

Manpreet Dhindsa (mkd8bb)

Quinton Mays (rub9ez)

---------------------------

## Load Data

To begin, we create our Spark session and load the data from the Parquet file.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .appName("group04GBT") \
        .getOrCreate()

In [2]:
import pandas as pd
from pyspark.ml.feature import *
from pyspark.ml.linalg import Vectors
from pyspark.ml import Pipeline, PipelineModel  
from pyspark.ml.classification import GBTClassifier
from pyspark.sql.functions import *

In [3]:
df = spark.read.parquet("/project/ds5559/fa21-group04/data/processed_df.parquet")

We then split the data, reserving 80% for training and validation and 20% for testing.

In [4]:
trainVal, test = df.randomSplit([0.8, 0.2], seed=304)

## Pipeline

Next, a feature vector is constructed using the selected features from the `Group04FeatureSelection.ipynb` notebook. The columns selected as inputs are features chosen by a `UnivariateFeatureSelector`.

In [5]:
final_feature_vectorizer =  VectorAssembler(inputCols=['FinalCatFeatures',
                                                       'selectedContFeatures'],
                                            outputCol='features',
                                            handleInvalid='skip')

We then define the gradient boosted tree classifier, with `hospital_death` as the label column, and combine it with the vector assembler in a single pipeline.

In [6]:
gbt = GBTClassifier(featuresCol = 'features', labelCol = 'hospital_death')

In [7]:
gbt_pipeline = Pipeline(stages=[final_feature_vectorizer,
                               gbt])

## Model Training and Tuning

To train and tune our model, we create a paramGrid of candidate hyperparameter values and pass it to a `CrossValidator`, which determines the best-performing combination. For the Gradient Boosted Tree model we chose `maxBins`, the maximum number of bins for discretizing continuous features, and `maxDepth`, the maximum depth of each tree, as the tunable hyperparameters.

In [8]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

In [9]:
gbt_paramGrid = ParamGridBuilder() \
    .addGrid(gbt.maxDepth, [5, 10]) \
    .addGrid(gbt.maxBins, [10, 20]) \
    .build()

We chose `k=5` as the number of folds for the `CrossValidator` and selected AUC as the metric to determine the chosen model.

In [10]:
crossval = CrossValidator(estimator=gbt_pipeline,
                          estimatorParamMaps=gbt_paramGrid,
                          evaluator=BinaryClassificationEvaluator().setLabelCol(gbt.getLabelCol()),
                          numFolds=5,
                          seed=304,
                          parallelism=4)
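
The cost of this search grows multiplicatively, which is worth keeping in mind before widening the grid. The sketch below is plain Python rather than Spark; it copies the grid values from the cell above and assumes that `CrossValidator` fits one pipeline per fold per parameter combination and then refits the best combination once on the full `trainVal` data. The variable names are illustrative.

```python
from itertools import product

# Grid values copied from the ParamGridBuilder cell above.
grid = {"maxDepth": [5, 10], "maxBins": [10, 20]}
num_folds = 5

# Every combination of hyperparameter values that will be tried.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

# One fit per (combination, fold), plus one final refit of the winner.
cv_fits = len(combinations) * num_folds
total_fits = cv_fits + 1

print(len(combinations), cv_fits, total_fits)
```

With `parallelism=4`, up to four of these fits can run concurrently, so adding a third value to either parameter raises the count from 20 to 30 cross-validation fits.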

In [11]:
#gbt_model = crossval.fit(trainVal)
gbt_model = CrossValidatorModel.load('/project/ds5559/fa21-group04/models/gbtc')

The model is trained using the trainVal data. For expediency in notebook editing, the trained model may be loaded from the location shown above.

## Model Evaluation

To evaluate the model performance, the test set is predicted using the chosen model from the `CrossValidator`.

In [12]:
preds = gbt_model.transform(test)

The predictions are then passed to two objects:

* `BinaryClassificationEvaluator` - for AUC metric
* `MulticlassClassificationEvaluator` - for confusion matrix and all other model metrics

In [13]:
testEvaluator = BinaryClassificationEvaluator(rawPredictionCol='probability',
                                              labelCol='hospital_death',
                                              metricName='areaUnderROC')

In [14]:
testEvaluator2 = MulticlassClassificationEvaluator(labelCol='hospital_death',
                                                   predictionCol="prediction",
                                                   probabilityCol='probability',
                                                   metricLabel=1)

### Accuracy

In [15]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "accuracy"})

0.92372634643377

### Precision

In [16]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "precisionByLabel"})

0.6817288801571709

### Recall

In [17]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "recallByLabel"})

0.28142741281427414

### F1

In [18]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "fMeasureByLabel"})

0.39839265212399544

### False Negative Rate

In [19]:
1 - testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "truePositiveRateByLabel"})

0.7185725871857258

### False Positive Rate

In [20]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "falsePositiveRateByLabel"})

0.012952746461981291

### AUC

In [21]:
testEvaluator.evaluate(preds)

0.8715263491374267

### Confusion Matrix

In [22]:
from pyspark.mllib.evaluation import MulticlassMetrics
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
preds_and_labels = (preds.select(['prediction', 'hospital_death'])
                         .withColumn('label', F.col('hospital_death').cast(FloatType()))
                         .orderBy('prediction'))
preds_and_labels = preds_and_labels.select(['prediction','label'])
conf_matrix = MulticlassMetrics(preds_and_labels.rdd.map(tuple))
conf_matrix.confusionMatrix().toArray()

array([[12345.,   162.],
       [  886.,   347.]])
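
The scalar metrics reported above can be recomputed directly from these four counts, which is a useful cross-check on the evaluator settings. The sketch below is plain Python and assumes the usual `MulticlassMetrics` layout, in which rows are actual labels and columns are predicted labels, both in ascending order. The variable names are illustrative.

```python
# Counts read off the confusion matrix above (rows = actual, columns = predicted).
tn, fp = 12345, 162
fn, tp = 886, 347

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
false_negative_rate = fn / (fn + tp)
false_positive_rate = fp / (fp + tn)

metrics = {
    "accuracy": accuracy,
    "precision": precision,
    "recall": recall,
    "f1": f1,
    "false_negative_rate": false_negative_rate,
    "false_positive_rate": false_positive_rate,
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Each value agrees with the corresponding evaluator output for label 1, which confirms that `metricLabel=1` is reporting on the `hospital_death = 1` class.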

In [23]:
preds.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|       0.0|13231|
|       1.0|  509|
+----------+-----+



The high AUC alongside the low recall and F1 values indicates that the model ranks patients well, but that Spark's default behavior of reporting the class with the higher probability (an effective cutoff of 0.5) suits this imbalanced dataset poorly. Therefore, a probability cutoff value must be selected to reduce the effects of the imbalance.
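
One way to make the effect of the imbalance concrete is to compare the model against a trivial classifier that always predicts survival. The sketch below is plain Python and uses the test-set class counts implied by the rows of the confusion matrix. The baseline is an illustrative assumption, not something this notebook trains.

```python
# Actual class counts in the test set (row sums of the confusion matrix).
survived = 12345 + 162
died = 886 + 347

# A classifier that always predicts the majority class (hospital_death = 0).
baseline_accuracy = survived / (survived + died)
positive_rate = died / (survived + died)

model_accuracy = 0.92372634643377  # value reported by the evaluator above

print(f"positive class share:     {positive_rate:.4f}")
print(f"baseline accuracy:        {baseline_accuracy:.4f}")
print(f"model gain over baseline: {model_accuracy - baseline_accuracy:.4f}")
```

The baseline lands near 0.91, so an accuracy of about 0.92 says little on its own; recall and F1 are the more informative metrics here.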

### Selected Model Hyperparameters

The hyperparameters selected by the `CrossValidator` are shown below:

In [24]:
import numpy as np

In [25]:
gbt_model.getEstimatorParamMaps()[ np.argmax(gbt_model.avgMetrics)]

{Param(parent='GBTClassifier_e512e9cafe8d', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 20,
 Param(parent='GBTClassifier_e512e9cafe8d', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 5}

The chosen value for `maxBins` is 20 and for `maxDepth` is 5. Because `maxBins` landed on the largest value in the paramGrid and `maxDepth` on the smallest, it may be prudent to explore a larger parameter space in future work.
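
A quick way to decide where to extend the search is to check where each selected value sits within its candidate list. The sketch below is plain Python; `edge_position` is a hypothetical helper, and the grid and selected values are copied from the cells above.

```python
# Grid and selected values copied from the cells above.
grid = {"maxDepth": [5, 10], "maxBins": [10, 20]}
selected = {"maxDepth": 5, "maxBins": 20}


def edge_position(value, candidates):
    """Return 'min', 'max', or 'interior' for a selected grid value."""
    if value == min(candidates):
        return "min"
    if value == max(candidates):
        return "max"
    return "interior"


positions = {name: edge_position(selected[name], grid[name]) for name in grid}
print(positions)
```

With only two candidates per parameter, every selection necessarily falls on an edge, so neither result rules out a better value outside the grid.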

In [26]:
preds.groupBy('hospital_death').count().show()

+--------------+-----+
|hospital_death|count|
+--------------+-----+
|             1| 1233|
|             0|12507|
+--------------+-----+



In [27]:
#gbt_model.write().overwrite().save("/project/ds5559/fa21-group04/models/gbtc")

### Probability Threshold Determination

Next, we set a probability cutoff value for determining `hospital_death`. To do this, we extract the probability of `hospital_death = 1` from the `probability` column and then compare the model accuracy, precision, recall, and F1 metrics across different probability cutoff values.

In [28]:
from pyspark.sql.types import DoubleType, FloatType

getprob = udf(lambda v:float(v[1]),FloatType())
## Select out the necessary columns
output = preds.select(col("rawPrediction"),
                              col("hospital_death").cast(DoubleType()),
                              getprob(col("probability")).alias("probability"),
                              col("prediction"))
output

DataFrame[rawPrediction: vector, hospital_death: double, probability: float, prediction: double]

In [29]:
from pyspark.sql.types import DoubleType
from pyspark.mllib.evaluation import BinaryClassificationMetrics

performance_df = spark.createDataFrame([(0,0,0,0,0,0,0,0,0)], ['cutoff', 'accuracy', 'recall', 'precision', 'F1', 'TP', 'FP', 'FN', 'TN'])
for cutoff in range(5, 95, 5):
    cutoff = (cutoff * 0.01)
  
    print('Testing cutoff = ', str(format(cutoff, '.2f')))
    lrpredictions_prob_temp = output.withColumn('prediction', when(col('probability') >= cutoff, 1).otherwise(0).cast(DoubleType()))
    tp = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 1) & (lrpredictions_prob_temp.prediction == 1)].count()
    tn = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 0) & (lrpredictions_prob_temp.prediction == 0)].count()
    fp = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 0) & (lrpredictions_prob_temp.prediction == 1)].count()
    fn = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 1) & (lrpredictions_prob_temp.prediction == 0)].count()
    a = ((tp + tn)/lrpredictions_prob_temp.count())
    # Guard each ratio separately so an empty denominator yields 0.0 instead of an error
    r = float(tp) / (tp + fn) if (tp + fn) > 0 else 0.0
    p = float(tp) / (tp + fp) if (tp + fp) > 0 else 0.0
    
    if(p + r == 0):
        f1 = 0
    else:
        f1 = 2 * ((p * r)/(p + r))
    print("Accuracy:", format(a, '.4f'), "Recall:", format(r, '.4f'), "Precision: ", format(p, '.4f'), "F1 score:", format(f1, '.4f'), "TP", tp, "FP", fp, "FN", fn, "TN", tn)
    performance_df_row = spark.createDataFrame([(format(cutoff, '.2f'),format(a, '.4f'), format(r, '.4f'), format(p, '.4f'), format(f1, '.4f'), tp, fp, fn, tn)], 
                                               ['cutoff', 'accuracy', 'recall', 'precision', 'F1', 'TP', 'FP', 'FN', 'TN'])
    performance_df = performance_df.union(performance_df_row)
display(performance_df)

Testing cutoff =  0.05
Accuracy: 0.3516 Recall: 0.9862 Precision:  0.1203 F1 score: 0.2144 TP 1216 FP 8892 FN 17 TN 3615
Testing cutoff =  0.10
Accuracy: 0.7985 Recall: 0.7802 Precision:  0.2780 F1 score: 0.4100 TP 962 FP 2498 FN 271 TN 10009
Testing cutoff =  0.15
Accuracy: 0.8687 Recall: 0.6504 Precision:  0.3687 F1 score: 0.4707 TP 802 FP 1373 FN 431 TN 11134
Testing cutoff =  0.20
Accuracy: 0.8941 Recall: 0.5523 Precision:  0.4299 F1 score: 0.4835 TP 681 FP 903 FN 552 TN 11604
Testing cutoff =  0.25
Accuracy: 0.9066 Recall: 0.5036 Precision:  0.4807 F1 score: 0.4919 TP 621 FP 671 FN 612 TN 11836
Testing cutoff =  0.30
Accuracy: 0.9134 Recall: 0.4647 Precision:  0.5195 F1 score: 0.4906 TP 573 FP 530 FN 660 TN 11977
Testing cutoff =  0.35
Accuracy: 0.9189 Recall: 0.4169 Precision:  0.5655 F1 score: 0.4799 TP 514 FP 395 FN 719 TN 12112
Testing cutoff =  0.40
Accuracy: 0.9234 Recall: 0.3796 Precision:  0.6190 F1 score: 0.4706 TP 468 FP 288 FN 765 TN 12219
Testing cutoff =  0.45
Accurac

DataFrame[cutoff: string, accuracy: string, recall: string, precision: string, F1: string, TP: bigint, FP: bigint, FN: bigint, TN: bigint]

In [30]:
performance_df.show()

+------+--------+------+---------+------+----+----+----+-----+
|cutoff|accuracy|recall|precision|    F1|  TP|  FP|  FN|   TN|
+------+--------+------+---------+------+----+----+----+-----+
|     0|       0|     0|        0|     0|   0|   0|   0|    0|
|  0.05|  0.3516|0.9862|   0.1203|0.2144|1216|8892|  17| 3615|
|  0.10|  0.7985|0.7802|   0.2780|0.4100| 962|2498| 271|10009|
|  0.15|  0.8687|0.6504|   0.3687|0.4707| 802|1373| 431|11134|
|  0.20|  0.8941|0.5523|   0.4299|0.4835| 681| 903| 552|11604|
|  0.25|  0.9066|0.5036|   0.4807|0.4919| 621| 671| 612|11836|
|  0.30|  0.9134|0.4647|   0.5195|0.4906| 573| 530| 660|11977|
|  0.35|  0.9189|0.4169|   0.5655|0.4799| 514| 395| 719|12112|
|  0.40|  0.9234|0.3796|   0.6190|0.4706| 468| 288| 765|12219|
|  0.45|  0.9241|0.3252|   0.6552|0.4347| 401| 211| 832|12296|
|  0.50|  0.9237|0.2814|   0.6817|0.3984| 347| 162| 886|12345|
|  0.55|  0.9226|0.2482|   0.6923|0.3654| 306| 136| 927|12371|
|  0.60|  0.9216|0.2182|   0.7042|0.3331| 269| 113| 964

Based on the output shown above, when evaluating models by F1 score, the optimal cutoff value is 0.25, which gives an F1 of 0.4919 with recall 0.5036 and precision 0.4807.
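
The selection can be reproduced from the counts alone. The sketch below is plain Python; it copies the complete rows of the table above (the truncated 0.60 row and anything after it are left out), and `f1_score` is a hypothetical helper rather than part of the notebook.

```python
# (cutoff, TP, FP, FN, TN) copied from the complete rows of the table above.
rows = [
    (0.05, 1216, 8892, 17, 3615),
    (0.10, 962, 2498, 271, 10009),
    (0.15, 802, 1373, 431, 11134),
    (0.20, 681, 903, 552, 11604),
    (0.25, 621, 671, 612, 11836),
    (0.30, 573, 530, 660, 11977),
    (0.35, 514, 395, 719, 12112),
    (0.40, 468, 288, 765, 12219),
    (0.45, 401, 211, 832, 12296),
    (0.50, 347, 162, 886, 12345),
    (0.55, 306, 136, 927, 12371),
]


def f1_score(tp, fp, fn):
    """F1 from counts; defined as 0.0 when the denominator is empty."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Pick the cutoff whose counts give the highest F1.
best_cutoff, best_f1 = max(
    ((cutoff, f1_score(tp, fp, fn)) for cutoff, tp, fp, fn, tn in rows),
    key=lambda pair: pair[1],
)
print(best_cutoff, round(best_f1, 4))
```

F1 weights precision and recall equally; if missed deaths are considered costlier than false alarms, a different criterion would favor a lower cutoff.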