# Group 04 | Model Building | Random Forest

-------------------------
Amber Curran (akc6be)

Manpreet Dhindsa (mkd8bb)

Quinton Mays (rub9ez)

---------------------------

## Load Data

To begin, we create the Spark session, import the required libraries, and load the processed data from the Parquet file.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .appName("group04RF") \
        .getOrCreate()

In [2]:
import pandas as pd
from pyspark.ml.feature import *
from pyspark.ml.linalg import Vectors
from pyspark.ml import Pipeline, PipelineModel  
from pyspark.ml.classification import RandomForestClassifier
from pyspark.sql.functions import *

In [3]:
df = spark.read.parquet("/project/ds5559/fa21-group04/data/processed_df.parquet")

We then split the data, reserving 80% for training and validation and 20% for testing.

In [4]:
trainVal, test = df.randomSplit([0.8, 0.2], seed=304)
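The split sizes are only approximately 80/20. As far as we understand, `randomSplit` assigns each row to a split by an independent seeded random draw and does not cut at exact row counts. The pure-Python sketch below mimics that idea on row indices. The helper `approx_random_split` is hypothetical and is not Spark's implementation.

```python
import random

def approx_random_split(n_rows, weights, seed):
    """Assign each row index to a split by an independent seeded random draw.

    Hypothetical helper for illustration only; not Spark's actual algorithm.
    """
    total = sum(weights)
    # Cumulative, normalised boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in range(n_rows):
        draw = rng.random()
        for i, b in enumerate(bounds):
            if draw < b:
                splits[i].append(row)
                break
    return splits

train_val_rows, test_rows = approx_random_split(10_000, [0.8, 0.2], seed=304)
print(len(train_val_rows), len(test_rows))  # close to an 80/20 split of 10,000 rows
```

Fixing the seed makes the assignment reproducible, which is why the same `seed=304` is reused throughout the notebook.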

## Pipeline

Next, a feature vector is constructed using the selected features from the `Group04FeatureSelection.ipynb` notebook. The columns selected as inputs are features chosen by a `UnivariateFeatureSelector`.

In [5]:
final_feature_vectorizer =  VectorAssembler(inputCols=['FinalCatFeatures',
                                                       'selectedContFeatures'],
                                            outputCol='features',
                                            handleInvalid='skip')

Next, a `RandomForestClassifier` is constructed to predict `hospital_death` based on the selected features.

In [6]:
rf = RandomForestClassifier(featuresCol = 'features', labelCol = 'hospital_death')

A pipeline is then created to feed the selected features to the classifier.

In [7]:
rf_pipeline = Pipeline(stages=[final_feature_vectorizer,
                               rf])

## Model Training and Tuning

To begin model training, the necessary packages are imported.

In [8]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

Next, we specify our parameter grid for tuning our Random Forest model. For this model, our tunable hyperparameters were:

* numTrees = number of trees in the forest
* maxBins = max number of bins used to discretize continuous features when searching for splits
* maxDepth = max number of levels in each decision tree

In [9]:
rf_paramGrid = ParamGridBuilder() \
    .addGrid(rf.maxDepth, [2, 5, 10]) \
    .addGrid(rf.maxBins, [5, 10, 20]) \
    .addGrid(rf.numTrees, [5, 20, 50]) \
    .build()

In [10]:
crossval = CrossValidator(estimator=rf_pipeline,
                          estimatorParamMaps=rf_paramGrid,
                          evaluator=BinaryClassificationEvaluator().setLabelCol(rf.getLabelCol()),
                          numFolds=5,
                          seed=304,
                          parallelism=4)
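The grid and fold count above determine how many models must be trained, which explains the long training time. The sketch below simply enumerates the combinations in plain Python. It does not call Spark, and the variable names are ours.

```python
from itertools import product

# Values copied from the parameter grid above
grid = {
    "maxDepth": [2, 5, 10],
    "maxBins": [5, 10, 20],
    "numTrees": [5, 20, 50],
}
num_folds = 5

# Every combination of one value per hyperparameter
combos = [dict(zip(grid, values)) for values in product(*grid.values())]

n_combos = len(combos)            # 3 * 3 * 3
n_cv_fits = n_combos * num_folds  # one fit per combination per fold
n_total_fits = n_cv_fits + 1      # plus a final refit of the best combination

print(n_combos, n_cv_fits, n_total_fits)  # 27 135 136
```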

The `CrossValidator` pipeline is then fit using the training data. This model may also be loaded from the storage location for faster execution.

In [11]:
# rf_model = crossval.fit(trainVal)
rf_model = CrossValidatorModel.load('/project/ds5559/fa21-group04/models/rfmodel') # avoids long training time once run

## Model Evaluation

To evaluate the model performance, the test set is predicted using the chosen model from the `CrossValidator`.

In [12]:
preds = rf_model.transform(test)

The predictions are then passed to two objects:

* `BinaryClassificationEvaluator` - for AUC metric
* `MulticlassClassificationEvaluator` - for confusion matrix and all other model metrics

In [13]:
testEvaluator = BinaryClassificationEvaluator(rawPredictionCol='probability',
                                              labelCol='hospital_death',
                                              metricName='areaUnderROC')

In [14]:
testEvaluator2 = MulticlassClassificationEvaluator(labelCol='hospital_death',
                                                   predictionCol="prediction",
                                                   probabilityCol='probability',
                                                   metricLabel=1) # remember to set metricLabel to 1 for correct metrics!!

### Accuracy

In [15]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "accuracy"})

0.9348617176128093

### Precision

In [16]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "precisionByLabel"})

0.9225

### Recall

In [17]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "recallByLabel"})

0.29927007299270075

### F1

In [18]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "fMeasureByLabel"})

0.45192896509491737

### True Positive Rate

In [19]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "truePositiveRateByLabel"})

0.29927007299270075

### False Positive Rate

In [20]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "falsePositiveRateByLabel"})

0.0024786119772927163

### Area Under ROC Curve

In [21]:
testEvaluator.evaluate(preds)

0.9211786087544431
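AUC depends only on how the scores rank positive cases against negative cases, not on any particular cutoff. The sketch below computes AUC as the fraction of correctly ordered (positive, negative) pairs on made-up scores. The data and the helper `pairwise_auc` are hypothetical. It shows how a model can rank well while no positive case reaches a 0.5 cutoff.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: positives are ranked well but never reach 0.5
labels = [0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.40, 0.30, 0.45]

auc = pairwise_auc(labels, scores)
predicted_pos = [s >= 0.5 for s in scores]
recall_at_half = sum(p and y == 1 for p, y in zip(predicted_pos, labels)) / labels.count(1)

print(auc, recall_at_half)  # 0.875 0.0
```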

### Confusion Matrix

In [22]:
from pyspark.mllib.evaluation import MulticlassMetrics
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType

# MulticlassMetrics expects an RDD of (prediction, label) pairs
preds_and_labels = (preds
                    .select('prediction', F.col('hospital_death').cast(FloatType()).alias('label'))
                    .orderBy('prediction'))
# Rows are actual labels, columns are predicted labels
conf_matrix = MulticlassMetrics(preds_and_labels.rdd.map(tuple))
conf_matrix.confusionMatrix().toArray()

array([[12476.,    31.],
       [  864.,   369.]])
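As a sanity check, the metrics reported earlier can be recomputed by hand from these four counts. The sketch below uses plain Python. It assumes the matrix is laid out with actual labels as rows and predicted labels as columns, and the majority-class baseline is added by us for comparison.

```python
# Counts read from the confusion matrix above (rows = actual, columns = predicted)
tn, fp = 12476, 31
fn, tp = 864, 369

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # identical to the true positive rate
f1 = 2 * tp / (2 * tp + fp + fn)
fpr = fp / (fp + tn)
# Accuracy of always predicting the majority class (hospital_death = 0)
majority_baseline = (tn + fp) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"f1={f1:.4f} fpr={fpr:.4f} baseline={majority_baseline:.4f}")
```

The values agree with the evaluator outputs above. The accuracy of about 0.935 is only modestly above the roughly 0.910 obtained by always predicting survival, so accuracy alone says little on this imbalanced data.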

In [23]:
preds.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|       0.0|13340|
|       1.0|  400|
+----------+-----+



The low recall and F1 score, despite the high AUC, show that the imbalanced dataset, combined with Spark's default behavior of reporting the class with the higher probability (an effective cutoff of 0.5), is hurting the metrics. A probability cutoff must therefore be selected to reduce the effect of the class imbalance.
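To make the role of the cutoff concrete, the sketch below applies two different cutoffs to the same positive-class probabilities. The probabilities and the helper `predict_with_cutoff` are hypothetical and serve only to illustrate the idea.

```python
def predict_with_cutoff(prob_positive, cutoff=0.5):
    """Return 1.0 when the positive-class probability reaches the cutoff, else 0.0."""
    return 1.0 if prob_positive >= cutoff else 0.0

# Hypothetical positive-class probabilities for five patients
probs = [0.08, 0.27, 0.33, 0.48, 0.71]

default_preds = [predict_with_cutoff(p) for p in probs]
lowered_preds = [predict_with_cutoff(p, cutoff=0.25) for p in probs]

print(default_preds)  # [0.0, 0.0, 0.0, 0.0, 1.0]
print(lowered_preds)  # [0.0, 1.0, 1.0, 1.0, 1.0]
```

Lowering the cutoff flags more patients as positive, which trades precision for recall.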

## Selected Model Hyperparameters

The hyperparameters selected by the `CrossValidator` are shown below:

In [24]:
import numpy as np

In [25]:
rf_model.getEstimatorParamMaps()[ np.argmax(rf_model.avgMetrics) ]

{Param(parent='RandomForestClassifier_e6bc472efbab', name='numTrees', doc='Number of trees to train (>= 1).'): 50,
 Param(parent='RandomForestClassifier_e6bc472efbab', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 10,
 Param(parent='RandomForestClassifier_e6bc472efbab', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 20}

In [26]:
# rf_model.write().overwrite().save("/project/ds5559/fa21-group04/models/rfmodel")

The chosen value for `numTrees` is 50, `maxBins` is 20, and `maxDepth` is 10. Because each of these is the largest value in the parameter grid, it may be prudent to explore a larger parameter space in future work.
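A check like this can be automated. The sketch below flags hyperparameters whose selected value is the largest one searched and proposes one larger candidate for each. Both helpers are hypothetical, and the doubling factor is an arbitrary assumption, not a recommendation.

```python
# Grid and selected values copied from the cells above
grid = {"maxDepth": [2, 5, 10], "maxBins": [5, 10, 20], "numTrees": [5, 20, 50]}
best = {"numTrees": 50, "maxDepth": 10, "maxBins": 20}

def params_at_upper_edge(grid, best):
    """Names of hyperparameters whose selected value is the largest one searched."""
    return sorted(name for name, values in grid.items() if best[name] == max(values))

def widened_grid(grid, edge_params, factor=2):
    """Hypothetical helper: append one larger candidate for each edge parameter."""
    return {
        name: values + [max(values) * factor] if name in edge_params else list(values)
        for name, values in grid.items()
    }

edge = params_at_upper_edge(grid, best)
print(edge)                      # ['maxBins', 'maxDepth', 'numTrees']
print(widened_grid(grid, edge))
```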

### Probability Threshold Determination

In [27]:
from pyspark.sql.types import DoubleType, FloatType

getprob = udf(lambda v:float(v[1]),FloatType())
## Select out the necessary columns
output = preds.select(col("rawPrediction"),
                              col("hospital_death").cast(DoubleType()),
                              getprob(col("probability")).alias("probability"),
                              col("prediction"))
output

DataFrame[rawPrediction: vector, hospital_death: double, probability: float, prediction: double]

Next, we choose a probability cutoff for predicting `hospital_death`. To do this, we extract the probability of `hospital_death = 1` from the `probability` column, as in the cell above, and then compare accuracy, precision, recall, and F1 across different cutoff values.

In [28]:
from pyspark.sql.types import DoubleType
from pyspark.mllib.evaluation import BinaryClassificationMetrics

performance_df = spark.createDataFrame([(0,0,0,0,0,0,0,0,0)], ['cutoff', 'accuracy', 'recall', 'precision', 'F1', 'TP', 'FP', 'FN', 'TN'])
for cutoff in range(5, 95, 5):
    cutoff = (cutoff * 0.01)
  
    print('Testing cutoff = ', str(format(cutoff, '.2f')))
    lrpredictions_prob_temp = output.withColumn('prediction', when(col('probability') >= cutoff, 1).otherwise(0).cast(DoubleType()))
    tp = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 1) & (lrpredictions_prob_temp.prediction == 1)].count()
    tn = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 0) & (lrpredictions_prob_temp.prediction == 0)].count()
    fp = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 0) & (lrpredictions_prob_temp.prediction == 1)].count()
    fn = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 1) & (lrpredictions_prob_temp.prediction == 0)].count()
    a = ((tp + tn)/lrpredictions_prob_temp.count())
    # Guard each ratio on its own so an empty denominator cannot raise ZeroDivisionError
    r = float(tp) / (tp + fn) if (tp + fn) > 0 else 0.0
    p = float(tp) / (tp + fp) if (tp + fp) > 0 else 0.0
    
    if(p + r == 0):
        f1 = 0
    else:
        f1 = 2 * ((p * r)/(p + r))
    print("Accuracy:", format(a, '.4f'), "Recall:", format(r, '.4f'), "Precision: ", format(p, '.4f'), "F1 score:", format(f1, '.4f'), "TP", tp, "FP", fp, "FN", fn, "TN", tn)
    performance_df_row = spark.createDataFrame([(format(cutoff, '.2f'),format(a, '.4f'), format(r, '.4f'), format(p, '.4f'), format(f1, '.4f'), tp, fp, fn, tn)], 
                                               ['cutoff', 'accuracy', 'recall', 'precision', 'F1', 'TP', 'FP', 'FN', 'TN'])
    performance_df = performance_df.union(performance_df_row)
display(performance_df)

Testing cutoff =  0.05
Accuracy: 0.6841 Recall: 0.9457 Precision:  0.2143 F1 score: 0.3495 TP 1166 FP 4274 FN 67 TN 8233
Testing cutoff =  0.10
Accuracy: 0.8513 Recall: 0.8313 Precision:  0.3584 F1 score: 0.5009 TP 1025 FP 1835 FN 208 TN 10672
Testing cutoff =  0.15
Accuracy: 0.8990 Recall: 0.7299 Precision:  0.4604 F1 score: 0.5646 TP 900 FP 1055 FN 333 TN 11452
Testing cutoff =  0.20
Accuracy: 0.9184 Recall: 0.6586 Precision:  0.5370 F1 score: 0.5916 TP 812 FP 700 FN 421 TN 11807
Testing cutoff =  0.25
Accuracy: 0.9290 Recall: 0.5888 Precision:  0.6080 F1 score: 0.5983 TP 726 FP 468 FN 507 TN 12039
Testing cutoff =  0.30
Accuracy: 0.9357 Recall: 0.5320 Precision:  0.6819 F1 score: 0.5977 TP 656 FP 306 FN 577 TN 12201
Testing cutoff =  0.35
Accuracy: 0.9385 Recall: 0.4615 Precision:  0.7587 F1 score: 0.5739 TP 569 FP 181 FN 664 TN 12326
Testing cutoff =  0.40
Accuracy: 0.9395 Recall: 0.4152 Precision:  0.8232 F1 score: 0.5520 TP 512 FP 110 FN 721 TN 12397
Testing cutoff =  0.45
Accura

DataFrame[cutoff: string, accuracy: string, recall: string, precision: string, F1: string, TP: bigint, FP: bigint, FN: bigint, TN: bigint]
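One way to read these results is to pick the cutoff that maximises F1. The sketch below does so in plain Python using only the cutoffs whose counts are fully visible in the printed log above (0.05 through 0.40). Maximising F1 is one possible selection criterion, assumed here for illustration.

```python
# (cutoff, TP, FP, FN, TN) copied from the printed output above
rows = [
    (0.05, 1166, 4274, 67, 8233),
    (0.10, 1025, 1835, 208, 10672),
    (0.15, 900, 1055, 333, 11452),
    (0.20, 812, 700, 421, 11807),
    (0.25, 726, 468, 507, 12039),
    (0.30, 656, 306, 577, 12201),
    (0.35, 569, 181, 664, 12326),
    (0.40, 512, 110, 721, 12397),
]

def f1_score(tp, fp, fn):
    """F1 from raw counts, returning 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

scored = [(cutoff, f1_score(tp, fp, fn)) for cutoff, tp, fp, fn, _tn in rows]
best_cutoff, best_f1 = max(scored, key=lambda pair: pair[1])
print(best_cutoff, round(best_f1, 4))  # 0.25 0.5983
```

If missing a death is costlier than a false alarm, a criterion that weights recall more heavily would favour a lower cutoff.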

In [29]:
performance_df.show()

+------+--------+------+---------+------+----+----+----+-----+
|cutoff|accuracy|recall|precision|    F1|  TP|  FP|  FN|   TN|
+------+--------+------+---------+------+----+----+----+-----+
|     0|       0|     0|        0|     0|   0|   0|   0|    0|
|  0.05|  0.6841|0.9457|   0.2143|0.3495|1166|4274|  67| 8233|
|  0.10|  0.8513|0.8313|   0.3584|0.5009|1025|1835| 208|10672|
|  0.15|  0.8990|0.7299|   0.4604|0.5646| 900|1055| 333|11452|
|  0.20|  0.9184|0.6586|   0.5370|0.5916| 812| 700| 421|11807|
|  0.25|  0.9290|0.5888|   0.6080|0.5983| 726| 468| 507|12039|
|  0.30|  0.9357|0.5320|   0.6819|0.5977| 656| 306| 577|12201|
|  0.35|  0.9385|0.4615|   0.7587|0.5739| 569| 181| 664|12326|
|  0.40|  0.9395|0.4152|   0.8232|0.5520| 512| 110| 721|12397|
|  0.45|  0.9379|0.3536|   0.8862|0.5055| 436|  56| 797|12451|
|  0.50|  0.9349|0.2993|   0.9225|0.4519| 369|  31| 864|12476|
|  0.55|  0.9311|0.2449|   0.9527|0.3897| 302|  15| 931|12492|
|  0.60|  0.9272|0.1955|   0.9679|0.3252| 241|   8| 992