# Group 04 | Model Building | Logistic Regression with Downsampling

-------------------------
Amber Curran (akc6be)

Manpreet Dhindsa (mkd8bb)

Quinton Mays (rub9ez)

---------------------------

## Load Data

To begin, we create our Spark session and load the processed data from a Parquet file.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .appName("group04Logistic") \
        .getOrCreate()

In [2]:
import pandas as pd
from pyspark.ml.feature import *
from pyspark.ml.linalg import Vectors
from pyspark.ml import Pipeline, PipelineModel  
from pyspark.ml.classification import LogisticRegression
from pyspark.sql.functions import *

In [3]:
df = spark.read.parquet("/project/ds5559/fa21-group04/data/processed_df.parquet")

We then split the data, reserving 80% for training and validation and 20% for testing.

In [4]:
trainVal, test = df.randomSplit([0.8, 0.2], seed=304)

Next, we investigate whether downsampling the data to remove the imbalance in `hospital_death` improves the performance of our model. To do this, we construct two dataframes by filtering the training/validation data on the value of `hospital_death`.

In [5]:
death_df = trainVal.filter(col("hospital_death") == 1)
life_df = trainVal.filter(col("hospital_death") == 0)

Next, the downsampling ratio is computed as the number of `hospital_death = 0` rows divided by the number of `hospital_death = 1` rows, truncated to an integer.

In [6]:
ratio = int(life_df.count() / death_df.count())

In [7]:
ratio

10

This ratio is approximately 10:1, which indicates that there is moderate imbalance present in our response variable. We account for this imbalance by sampling `life_df` without replacement at a fraction equal to the inverse of this ratio and unioning the result with `death_df`.

In [8]:
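# Keep roughly 1/ratio of the majority class (sampling without replacement)
sampled_life_df = life_df.sample(withReplacement=False, fraction=1/ratio, seed=304)
# union() replaces the deprecated unionAll(); both keep duplicate rows
trainValDownsampled = sampled_life_df.union(death_df)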
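
The downsampling step above can be sketched in plain Python to show what sampling at `1/ratio` does to the class counts. This is only an illustration under stated assumptions: the toy lists, the helper name `downsample_majority`, and the 10,000/1,000 class sizes are hypothetical and are not taken from the notebook's data. As with Spark's `sample`, each majority row is kept independently with the given probability, so the resulting count is approximate rather than exact.

```python
import random


def downsample_majority(majority_rows, minority_rows, seed=304):
    """Keep each majority row with probability 1/ratio, then append the minority rows."""
    # Same integer ratio as the notebook: majority count over minority count, truncated
    ratio = int(len(majority_rows) / len(minority_rows))
    fraction = 1 / ratio
    rng = random.Random(seed)
    # Bernoulli sampling without replacement: the kept count is only approximately
    # fraction * len(majority_rows)
    sampled = [row for row in majority_rows if rng.random() < fraction]
    return sampled + list(minority_rows), ratio


# Hypothetical toy data: label 0 = survived, label 1 = died
toy_majority = [0] * 10_000
toy_minority = [1] * 1_000

balanced, toy_ratio = downsample_majority(toy_majority, toy_minority)
kept_majority = balanced.count(0)
kept_minority = balanced.count(1)
print("ratio:", toy_ratio, "majority kept:", kept_majority, "minority kept:", kept_minority)
```

After sampling, the two classes are of similar size, which is the property the downsampled training set relies on.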

## Pipeline

Next, a feature vector is constructed using the selected features from the `Group04FeatureSelection.ipynb` notebook. The columns selected as inputs are features chosen by a `UnivariateFeatureSelector`.

In [9]:
final_feature_vectorizer =  VectorAssembler(inputCols=['FinalCatFeatures',
                                                       'selectedContFeatures'],
                                            outputCol='features',
                                            handleInvalid='skip')

Next, we define the logistic regression estimator, using `hospital_death` as the label column and the assembled `features` vector as input.

In [10]:
lr = LogisticRegression(labelCol='hospital_death', featuresCol='features', maxIter=10, regParam=0.01)

In [11]:
lr_pipeline = Pipeline(stages=[final_feature_vectorizer,
                               lr])

## Model Training and Tuning

In [12]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

In [13]:
lr_paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

In [14]:
crossval = CrossValidator(estimator=lr_pipeline,
                          estimatorParamMaps=lr_paramGrid,
                          evaluator=BinaryClassificationEvaluator().setLabelCol(lr.getLabelCol()),
                          numFolds=5,
                          seed=304,
                          parallelism=4)

In [15]:
# log_reg_model = crossval.fit(trainValDownsampled)
log_reg_model = CrossValidatorModel.load('/project/ds5559/fa21-group04/models/logregmodeldown')

## Model Evaluation

To evaluate model performance, we generate predictions on the test set using the best model selected by the `CrossValidator`.

In [16]:
preds = log_reg_model.transform(test)

The predictions are then passed to two objects:

* `BinaryClassificationEvaluator` - for AUC metric
* `MulticlassClassificationEvaluator` - for confusion matrix and all other model metrics

In [17]:
testEvaluator = BinaryClassificationEvaluator(rawPredictionCol='probability',
                                              labelCol='hospital_death',
                                              metricName='areaUnderROC')

In [18]:
testEvaluator2 = MulticlassClassificationEvaluator(labelCol='hospital_death',
                                                   predictionCol="prediction",
                                                   probabilityCol='probability',
                                                   metricLabel=1)

### Accuracy

In [19]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "accuracy"})

0.7941048034934498

### Precision

In [20]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "precisionByLabel"})

0.262782401902497

### Recall

In [21]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "recallByLabel"})

0.7169505271695052

### F1

In [22]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "fMeasureByLabel"})

0.38459865129432236

### True Positive Rate

In [23]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "truePositiveRateByLabel"})

0.7169505271695052

### False Positive Rate

In [24]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "falsePositiveRateByLabel"})

0.1982889581834173

### AUC

In [25]:
testEvaluator.evaluate(preds)

0.8387229834180133

### Confusion Matrix

In [26]:
from pyspark.mllib.evaluation import MulticlassMetrics
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
preds_and_labels = preds.select(['prediction','hospital_death']).withColumn('label', F.col('hospital_death').cast(FloatType())).orderBy('prediction')
preds_and_labels = preds_and_labels.select(['prediction','label'])
conf_matrix = MulticlassMetrics(preds_and_labels.rdd.map(tuple))
conf_matrix.confusionMatrix().toArray()

array([[10027.,  2480.],
       [  349.,   884.]])

In [27]:
preds.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|       0.0|10376|
|       1.0| 3364|
+----------+-----+
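
As a cross-check, the metrics reported above can be recomputed by hand from the four confusion-matrix counts. The sketch below assumes the `MulticlassMetrics` layout of rows as actual labels and columns as predicted labels, both in ascending order, so the matrix reads `[[TN, FP], [FN, TP]]`. The variable names are introduced here for illustration only.

```python
# Counts copied from the confusion matrix above (rows = actual, columns = predicted)
tn, fp = 10027, 2480
fn, tp = 349, 884

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)          # share of predicted deaths that were real
recall = tp / (tp + fn)             # share of real deaths that were caught (TPR)
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)

# Column totals should match the prediction counts from groupBy('prediction')
predicted_negative = tn + fn
predicted_positive = fp + tp

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"f1={f1:.4f} fpr={false_positive_rate:.4f}")
print(f"predicted 0: {predicted_negative}, predicted 1: {predicted_positive}")
```

The recomputed values agree with the evaluator outputs, and the column totals (10376 and 3364) agree with the `groupBy('prediction')` counts.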



As with the other models, the low precision and F1 score alongside the high AUC make it clear that the imbalanced test set, coupled with Spark's default behavior of reporting the class with the higher probability (a 0.5 cutoff), is causing the metrics to suffer. Therefore, a probability cutoff value must be selected to reduce the effects of the imbalance.
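
To make the effect of the cutoff concrete, here is a minimal sketch in plain Python. The labels and probabilities are hypothetical toy values, not rows from the test set, and the helper names `predict_with_cutoff` and `precision_recall` are introduced only for this illustration.

```python
def predict_with_cutoff(probabilities, cutoff):
    """Label a case positive when its class-1 probability meets the cutoff."""
    return [1 if p >= cutoff else 0 for p in probabilities]


def precision_recall(labels, predictions):
    """Return (precision, recall) for the positive class, guarding empty denominators."""
    tp = sum(1 for y, y_hat in zip(labels, predictions) if y == 1 and y_hat == 1)
    fp = sum(1 for y, y_hat in zip(labels, predictions) if y == 0 and y_hat == 1)
    fn = sum(1 for y, y_hat in zip(labels, predictions) if y == 1 and y_hat == 0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical toy scores: six negatives and two positives
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1]
toy_probs = [0.10, 0.30, 0.55, 0.60, 0.65, 0.20, 0.72, 0.90]

default_pr = precision_recall(toy_labels, predict_with_cutoff(toy_probs, 0.5))
raised_pr = precision_recall(toy_labels, predict_with_cutoff(toy_probs, 0.7))

print("cutoff 0.5 -> precision, recall:", default_pr)
print("cutoff 0.7 -> precision, recall:", raised_pr)
```

In this toy case, raising the cutoff removes the false positives without losing any true positives. On real data the trade-off is less clean, which is why a range of cutoffs is scanned.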

In [28]:
# log_reg_model.write().overwrite().save("/project/ds5559/fa21-group04/models/logregmodeldown")

### Selected Model Hyperparameters

In [29]:
import numpy as np

In [30]:
log_reg_model.getEstimatorParamMaps()[ np.argmax(log_reg_model.avgMetrics)]

{Param(parent='LogisticRegression_468318d65ca1', name='regParam', doc='regularization parameter (>= 0).'): 0.01}

The chosen value for `regParam` is 0.01, the smaller of the two values in the grid, indicating that the logistic regression model performs better with a light penalty term than with a heavier one.

### Probability Threshold Determination

In [31]:
from pyspark.sql.types import DoubleType, FloatType

getprob = udf(lambda v:float(v[1]),FloatType())
## Select out the necessary columns
output = preds.select(col("rawPrediction"),
                              col("hospital_death").cast(DoubleType()),
                              getprob(col("probability")).alias("probability"),
                              col("prediction"))

In [32]:
from pyspark.sql.types import DoubleType
from pyspark.mllib.evaluation import BinaryClassificationMetrics

performance_df = spark.createDataFrame([(0,0,0,0,0,0,0,0,0)], ['cutoff', 'accuracy', 'recall', 'precision', 'F1', 'TP', 'FP', 'FN', 'TN'])
for cutoff in range(5, 95, 5):
    cutoff = (cutoff * 0.01)
  
    print('Testing cutoff = ', str(format(cutoff, '.2f')))
    lrpredictions_prob_temp = output.withColumn('prediction', when(col('probability') >= cutoff, 1).otherwise(0).cast(DoubleType()))
    tp = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 1) & (lrpredictions_prob_temp.prediction == 1)].count()
    tn = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 0) & (lrpredictions_prob_temp.prediction == 0)].count()
    fp = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 0) & (lrpredictions_prob_temp.prediction == 1)].count()
    fn = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 1) & (lrpredictions_prob_temp.prediction == 0)].count()
    a = ((tp + tn)/lrpredictions_prob_temp.count())
    # Guard each ratio separately so an empty denominator cannot raise ZeroDivisionError
    r = float(tp) / (tp + fn) if (tp + fn) > 0 else 0.0
    p = float(tp) / (tp + fp) if (tp + fp) > 0 else 0.0
    
    if(p + r == 0):
        f1 = 0
    else:
        f1 = 2 * ((p * r)/(p + r))
    print("Accuracy:", format(a, '.4f'), "Recall:", format(r, '.4f'), "Precision: ", format(p, '.4f'), "F1 score:", format(f1, '.4f'), "TP", tp, "FP", fp, "FN", fn, "TN", tn)
    performance_df_row = spark.createDataFrame([(format(cutoff, '.2f'),format(a, '.4f'), format(r, '.4f'), format(p, '.4f'), format(f1, '.4f'), tp, fp, fn, tn)], 
                                               ['cutoff', 'accuracy', 'recall', 'precision', 'F1', 'TP', 'FP', 'FN', 'TN'])
    performance_df = performance_df.union(performance_df_row)
display(performance_df)

Testing cutoff =  0.05
Accuracy: 0.1068 Recall: 0.9935 Precision:  0.0908 F1 score: 0.1664 TP 1225 FP 12264 FN 8 TN 243
Testing cutoff =  0.10
Accuracy: 0.1630 Recall: 0.9903 Precision:  0.0961 F1 score: 0.1752 TP 1221 FP 11488 FN 12 TN 1019
Testing cutoff =  0.15
Accuracy: 0.2751 Recall: 0.9781 Precision:  0.1083 F1 score: 0.1950 TP 1206 FP 9933 FN 27 TN 2574
Testing cutoff =  0.20
Accuracy: 0.3951 Recall: 0.9570 Precision:  0.1250 F1 score: 0.2212 TP 1180 FP 8258 FN 53 TN 4249
Testing cutoff =  0.25
Accuracy: 0.5031 Recall: 0.9286 Precision:  0.1452 F1 score: 0.2512 TP 1145 FP 6739 FN 88 TN 5768
Testing cutoff =  0.30
Accuracy: 0.5857 Recall: 0.8929 Precision:  0.1653 F1 score: 0.2789 TP 1101 FP 5561 FN 132 TN 6946
Testing cutoff =  0.35
Accuracy: 0.6531 Recall: 0.8500 Precision:  0.1862 F1 score: 0.3055 TP 1048 FP 4581 FN 185 TN 7926
Testing cutoff =  0.40
Accuracy: 0.7056 Recall: 0.8102 Precision:  0.2077 F1 score: 0.3306 TP 999 FP 3811 FN 234 TN 8696
Testing cutoff =  0.45
Accurac

DataFrame[cutoff: string, accuracy: string, recall: string, precision: string, F1: string, TP: bigint, FP: bigint, FN: bigint, TN: bigint]

In [33]:
performance_df.show()

+------+--------+------+---------+------+----+-----+---+-----+
|cutoff|accuracy|recall|precision|    F1|  TP|   FP| FN|   TN|
+------+--------+------+---------+------+----+-----+---+-----+
|     0|       0|     0|        0|     0|   0|    0|  0|    0|
|  0.05|  0.1068|0.9935|   0.0908|0.1664|1225|12264|  8|  243|
|  0.10|  0.1630|0.9903|   0.0961|0.1752|1221|11488| 12| 1019|
|  0.15|  0.2751|0.9781|   0.1083|0.1950|1206| 9933| 27| 2574|
|  0.20|  0.3951|0.9570|   0.1250|0.2212|1180| 8258| 53| 4249|
|  0.25|  0.5031|0.9286|   0.1452|0.2512|1145| 6739| 88| 5768|
|  0.30|  0.5857|0.8929|   0.1653|0.2789|1101| 5561|132| 6946|
|  0.35|  0.6531|0.8500|   0.1862|0.3055|1048| 4581|185| 7926|
|  0.40|  0.7056|0.8102|   0.2077|0.3306| 999| 3811|234| 8696|
|  0.45|  0.7531|0.7762|   0.2349|0.3607| 957| 3117|276| 9390|
|  0.50|  0.7941|0.7170|   0.2628|0.3846| 884| 2480|349|10027|
|  0.55|  0.8277|0.6659|   0.2956|0.4095| 821| 1956|412|10551|
|  0.60|  0.8535|0.6115|   0.3295|0.4283| 754| 1534|479

The optimal value for the probability cutoff is 0.7 based on the F1 score. This value is much higher than for the other models because this model was trained on downsampled data, which inflates the predicted probability of `hospital_death = 1` relative to the original class balance.
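
One way to see why the cutoff moves upward is the standard prior-shift adjustment for a model trained on downsampled negatives. The sketch below is not something the notebook computes. It assumes that downsampling changes only the class prior, that the model is reasonably calibrated on the downsampled data, and that the negative class was kept at a fraction of 1/10. The helper name `correct_for_downsampling` is hypothetical.

```python
def correct_for_downsampling(p_sampled, keep_fraction):
    """Map a probability from a model trained on downsampled negatives
    back to the original class balance.

    keep_fraction is the share of negative rows retained (1 / ratio).
    Downsampling divides the negative count by 1/keep_fraction, so the
    odds under the original balance are keep_fraction * sampled odds.
    """
    return keep_fraction * p_sampled / (keep_fraction * p_sampled + 1 - p_sampled)


keep_fraction = 1 / 10  # inverse of the approximately 10:1 ratio found earlier

for cutoff in (0.5, 0.7):
    original_scale = correct_for_downsampling(cutoff, keep_fraction)
    print(f"cutoff {cutoff:.2f} on downsampled scale ~ {original_scale:.4f} on original scale")
```

Under these assumptions, a 0.7 cutoff on the downsampled model corresponds to a probability of roughly 0.19 at the original class balance, so the cutoff is less extreme than it looks.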