# Group 04 | Model Building | Logistic Regression

-------------------------
Amber Curran (akc6be)

Manpreet Dhindsa (mkd8bb)

Quinton Mays (rub9ez)

---------------------------

## Load Data

To begin, we create our Spark session and load the data from the Parquet file.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .appName("group04LogisticNDS") \
        .getOrCreate()

In [2]:
import pandas as pd
from pyspark.ml.feature import *
from pyspark.ml.linalg import Vectors
from pyspark.ml import Pipeline, PipelineModel  
from pyspark.ml.classification import LogisticRegression
from pyspark.sql.functions import *

In [3]:
df = spark.read.parquet("/project/ds5559/fa21-group04/data/processed_df.parquet")

We then split the data, reserving 80% for training and validation and 20% for testing.

In [4]:
trainVal, test = df.randomSplit([0.8, 0.2], seed=304)

## Pipeline

Next, a feature vector is constructed using the selected features from the `Group04FeatureSelection.ipynb` notebook. The columns selected as inputs are features chosen by a `UnivariateFeatureSelector`.

In [5]:
final_feature_vectorizer =  VectorAssembler(inputCols=['FinalCatFeatures',
                                                       'selectedContFeatures'],
                                            outputCol='features',
                                            handleInvalid='skip')

Next, a `LogisticRegression` classifier object is constructed to predict `hospital_death` based on the selected features.

In [6]:
lr = LogisticRegression(labelCol='hospital_death', featuresCol='features', maxIter=10, regParam=0.01)

In [7]:
lr_pipeline = Pipeline(stages=[final_feature_vectorizer,
                               lr])

## Model Training and Tuning

In [8]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

In [9]:
lr_paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

In [10]:
crossval = CrossValidator(estimator=lr_pipeline,
                          estimatorParamMaps=lr_paramGrid,
                          evaluator=BinaryClassificationEvaluator().setLabelCol(lr.getLabelCol()),
                          numFolds=5,
                          seed=304,
                          parallelism=4)

In [11]:
# log_reg_model = crossval.fit(trainVal)
log_reg_model = CrossValidatorModel.load('/project/ds5559/fa21-group04/models/logregmodel')

## Model Evaluation

To evaluate the model performance, the test set is predicted using the chosen model from the `CrossValidator`.

In [12]:
preds = log_reg_model.transform(test)

The predictions are then passed to two objects:

* `BinaryClassificationEvaluator` - for AUC metric
* `MulticlassClassificationEvaluator` - for confusion matrix and all other model metrics

In [13]:
testEvaluator = BinaryClassificationEvaluator(rawPredictionCol='probability',
                                              labelCol='hospital_death',
                                              metricName='areaUnderROC')

In [14]:
testEvaluator2 = MulticlassClassificationEvaluator(labelCol='hospital_death',
                                                   predictionCol="prediction",
                                                   probabilityCol='probability',
                                                   metricLabel=1)

### Accuracy

In [15]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "accuracy"})

0.9178311499272198

### Precision

In [16]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "precisionByLabel"})

0.7549019607843137

### Recall

In [17]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "recallByLabel"})

0.12489862124898621

### F1

In [18]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "fMeasureByLabel"})

0.21433542101600556

### True Positive Rate

In [19]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "truePositiveRateByLabel"})

0.12489862124898621

### False Positive Rate

In [20]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "falsePositiveRateByLabel"})

0.003997761253697929

### Area Under ROC Curve

In [21]:
testEvaluator.evaluate(preds)

0.8384881108914761

### Confusion Matrix

In [22]:
from pyspark.mllib.evaluation import MulticlassMetrics
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
preds_and_labels = preds.select(['prediction','hospital_death']).withColumn('label', F.col('hospital_death').cast(FloatType())).orderBy('prediction')
preds_and_labels = preds_and_labels.select(['prediction','label'])
conf_matrix = MulticlassMetrics(preds_and_labels.rdd.map(tuple))
conf_matrix.confusionMatrix().toArray()

array([[12457.,    50.],
       [ 1079.,   154.]])
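
As a cross-check, the evaluator metrics reported above can be recomputed by hand from the four confusion-matrix counts. The sketch below is plain Python rather than Spark. It assumes the usual `MulticlassMetrics` layout, in which rows are actual labels and columns are predicted labels, and the function name `metrics_from_counts` is introduced here only for illustration.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive the standard binary metrics for the positive class (label 1)."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,  # identical to the true positive rate
        "f1": f1,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Counts read from the confusion matrix above, laid out as [[TN, FP], [FN, TP]]
m = metrics_from_counts(tn=12457, fp=50, fn=1079, tp=154)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

The recomputed values agree with the evaluator outputs above. Accuracy is high mainly because 12,507 of the 13,740 test rows are negatives, while recall shows that only 154 of the 1,233 actual deaths are flagged.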

In [23]:
preds.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|       0.0|13536|
|       1.0|  204|
+----------+-----+



The combination of a high AUC with low recall and F1 indicates that the class imbalance, together with Spark's default behavior of predicting whichever class has the higher probability (an effective cutoff of 0.5), is depressing the threshold-dependent metrics. A probability cutoff must therefore be selected to reduce the effects of the imbalanced dataset.
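
To illustrate how a high AUC and a low recall can coexist, here is a toy example in plain Python. The labels and scores are made up for illustration and are not drawn from our data, and `auc` and `recall_at` are hypothetical helper names. AUC depends only on how the scores rank positives relative to negatives, whereas recall depends on where the cutoff sits.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall_at(labels, scores, cutoff):
    """Share of actual positives whose score reaches the cutoff."""
    pos_scores = [s for y, s in zip(labels, scores) if y == 1]
    return sum(s >= cutoff for s in pos_scores) / len(pos_scores)


# Made-up scores: the positives rank highest, yet neither reaches 0.5
labels = [0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40]

print(auc(labels, scores))              # ranking quality, independent of any cutoff
print(recall_at(labels, scores, 0.50))  # recall at the default cutoff
print(recall_at(labels, scores, 0.25))  # recall at a lower cutoff
```

With a rare positive class, predicted probabilities tend to sit well below 0.5 even for true positives, so lowering the cutoff can recover recall without changing the model itself.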

In [24]:
# log_reg_model.write().overwrite().save("/project/ds5559/fa21-group04/models/logregmodel")

### Selected Model Hyperparameters

The hyperparameters selected by the `CrossValidator` are shown below:

In [25]:
import numpy as np

In [26]:
log_reg_model.getEstimatorParamMaps()[ np.argmax(log_reg_model.avgMetrics)]

{Param(parent='LogisticRegression_0671f424ae0e', name='regParam', doc='regularization parameter (>= 0).'): 0.1}

The chosen value for `regParam` is 0.1, the larger of the two values in the grid, indicating that there is some benefit in applying a penalty term to the logistic regression model.
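
The lookup in the cell above pairs each entry of `getEstimatorParamMaps()` with the entry of `avgMetrics` at the same index. A minimal plain-Python sketch of that pairing follows. The metric values are placeholders invented for illustration, not the scores from our run, and `best_param_map` is a hypothetical helper. Taking the maximum is appropriate here because area under the ROC curve is a larger-is-better metric.

```python
def best_param_map(param_maps, avg_metrics):
    """Return the parameter map whose cross-validated average metric is largest."""
    if len(param_maps) != len(avg_metrics):
        raise ValueError("param_maps and avg_metrics must have the same length")
    best_index = max(range(len(avg_metrics)), key=lambda i: avg_metrics[i])
    return param_maps[best_index]


# Same grid values as lr_paramGrid; the metrics are placeholders, NOT our results
param_maps = [{"regParam": 0.1}, {"regParam": 0.01}]
avg_metrics = [0.75, 0.70]

best = best_param_map(param_maps, avg_metrics)
print(best)
```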

### Probability Threshold Determination

Next, we choose a probability cutoff for predicting `hospital_death`. To do this, we extract the probability of `hospital_death = 1` from the `probability` column and then compare accuracy, precision, recall, and F1 across a range of cutoff values.

In [27]:
from pyspark.sql.types import DoubleType, FloatType

getprob = udf(lambda v:float(v[1]),FloatType())
## Select out the necessary columns
output = preds.select(col("rawPrediction"),
                              col("hospital_death").cast(DoubleType()),
                              getprob(col("probability")).alias("probability"),
                              col("prediction"))
output

DataFrame[rawPrediction: vector, hospital_death: double, probability: float, prediction: double]

In [28]:
from pyspark.sql.types import DoubleType
from pyspark.mllib.evaluation import BinaryClassificationMetrics

performance_df = spark.createDataFrame([(0,0,0,0,0,0,0,0,0)], ['cutoff', 'accuracy', 'recall', 'precision', 'F1', 'TP', 'FP', 'FN', 'TN'])
for cutoff in range(5, 95, 5):
    cutoff = (cutoff * 0.01)
  
    print('Testing cutoff = ', str(format(cutoff, '.2f')))
    lrpredictions_prob_temp = output.withColumn('prediction', when(col('probability') >= cutoff, 1).otherwise(0).cast(DoubleType()))
    tp = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 1) & (lrpredictions_prob_temp.prediction == 1)].count()
    tn = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 0) & (lrpredictions_prob_temp.prediction == 0)].count()
    fp = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 0) & (lrpredictions_prob_temp.prediction == 1)].count()
    fn = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 1) & (lrpredictions_prob_temp.prediction == 0)].count()
    a = ((tp + tn)/lrpredictions_prob_temp.count())
    # Guard each ratio on its own so an empty denominator never raises ZeroDivisionError
    r = float(tp) / (tp + fn) if (tp + fn) > 0 else 0.0
    p = float(tp) / (tp + fp) if (tp + fp) > 0 else 0.0
    
    if(p + r == 0):
        f1 = 0
    else:
        f1 = 2 * ((p * r)/(p + r))
    print("Accuracy:", format(a, '.4f'), "Recall:", format(r, '.4f'), "Precision: ", format(p, '.4f'), "F1 score:", format(f1, '.4f'), "TP", tp, "FP", fp, "FN", fn, "TN", tn)
    performance_df_row = spark.createDataFrame([(format(cutoff, '.2f'),format(a, '.4f'), format(r, '.4f'), format(p, '.4f'), format(f1, '.4f'), tp, fp, fn, tn)], 
                                               ['cutoff', 'accuracy', 'recall', 'precision', 'F1', 'TP', 'FP', 'FN', 'TN'])
    performance_df = performance_df.union(performance_df_row)
display(performance_df)

Testing cutoff =  0.05
Accuracy: 0.5684 Recall: 0.8978 Precision:  0.1602 F1 score: 0.2719 TP 1107 FP 5804 FN 126 TN 6703
Testing cutoff =  0.10
Accuracy: 0.8147 Recall: 0.6951 Precision:  0.2831 F1 score: 0.4023 TP 857 FP 2170 FN 376 TN 10337
Testing cutoff =  0.15
Accuracy: 0.8825 Recall: 0.5483 Precision:  0.3899 F1 score: 0.4557 TP 676 FP 1058 FN 557 TN 11449
Testing cutoff =  0.20
Accuracy: 0.9072 Recall: 0.4347 Precision:  0.4811 F1 score: 0.4568 TP 536 FP 578 FN 697 TN 11929
Testing cutoff =  0.25
Accuracy: 0.9162 Recall: 0.3569 Precision:  0.5507 F1 score: 0.4331 TP 440 FP 359 FN 793 TN 12148
Testing cutoff =  0.30
Accuracy: 0.9189 Recall: 0.2968 Precision:  0.5961 F1 score: 0.3963 TP 366 FP 248 FN 867 TN 12259
Testing cutoff =  0.35
Accuracy: 0.9205 Recall: 0.2393 Precision:  0.6570 F1 score: 0.3508 TP 295 FP 154 FN 938 TN 12353
Testing cutoff =  0.40
Accuracy: 0.9197 Recall: 0.1955 Precision:  0.6847 F1 score: 0.3041 TP 241 FP 111 FN 992 TN 12396
Testing cutoff =  0.45
Accura

DataFrame[cutoff: string, accuracy: string, recall: string, precision: string, F1: string, TP: bigint, FP: bigint, FN: bigint, TN: bigint]

In [29]:
performance_df.show()

+------+--------+------+---------+------+----+----+----+-----+
|cutoff|accuracy|recall|precision|    F1|  TP|  FP|  FN|   TN|
+------+--------+------+---------+------+----+----+----+-----+
|     0|       0|     0|        0|     0|   0|   0|   0|    0|
|  0.05|  0.5684|0.8978|   0.1602|0.2719|1107|5804| 126| 6703|
|  0.10|  0.8147|0.6951|   0.2831|0.4023| 857|2170| 376|10337|
|  0.15|  0.8825|0.5483|   0.3899|0.4557| 676|1058| 557|11449|
|  0.20|  0.9072|0.4347|   0.4811|0.4568| 536| 578| 697|11929|
|  0.25|  0.9162|0.3569|   0.5507|0.4331| 440| 359| 793|12148|
|  0.30|  0.9189|0.2968|   0.5961|0.3963| 366| 248| 867|12259|
|  0.35|  0.9205|0.2393|   0.6570|0.3508| 295| 154| 938|12353|
|  0.40|  0.9197|0.1955|   0.6847|0.3041| 241| 111| 992|12396|
|  0.45|  0.9195|0.1590|   0.7396|0.2617| 196|  69|1037|12438|
|  0.50|  0.9178|0.1249|   0.7549|0.2143| 154|  50|1079|12457|
|  0.55|  0.9167|0.0973|   0.7947|0.1734| 120|  31|1113|12476|
|  0.60|  0.9148|0.0673|   0.7981|0.1242|  83|  21|1150

For the logistic regression model, the optimal cutoff value based on F1 score is 0.2 (F1 = 0.4568). This is much lower than the optimal cutoff for the downsampled logistic regression model (0.7 vs. 0.2), a difference that can be explained by the large imbalance in the response variable in our dataset.
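
The choice of 0.2 can be checked directly from the counts in the table above. The plain-Python sketch below recomputes F1 from TP, FP, and FN for the rows that are shown in full (cutoffs 0.05 through 0.55) and picks the cutoff with the highest score. The name `f1_from_counts` is introduced here for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 expressed directly in counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# (cutoff, TP, FP, FN) copied from the complete rows of the table above
rows = [
    (0.05, 1107, 5804, 126),
    (0.10, 857, 2170, 376),
    (0.15, 676, 1058, 557),
    (0.20, 536, 578, 697),
    (0.25, 440, 359, 793),
    (0.30, 366, 248, 867),
    (0.35, 295, 154, 938),
    (0.40, 241, 111, 992),
    (0.45, 196, 69, 1037),
    (0.50, 154, 50, 1079),
    (0.55, 120, 31, 1113),
]

scored = [(cutoff, f1_from_counts(tp, fp, fn)) for cutoff, tp, fp, fn in rows]
best_cutoff, best_f1 = max(scored, key=lambda pair: pair[1])
print(f"best cutoff = {best_cutoff:.2f}, F1 = {best_f1:.4f}")
```

The margin over the neighboring cutoff of 0.15 is small (0.4568 versus 0.4557), so the two are close to interchangeable on F1 alone and differ mainly in how they trade recall for precision.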