# Group 04 | Model Building | APACHE Variable Model - Logistic Regression

-------------------------
Amber Curran (akc6be)

Manpreet Dhindsa (mkd8bb)

Quinton Mays (rub9ez)

---------------------------

## Load Data

To begin, we create our Spark session and load the data from the Parquet file.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .appName("group04Base") \
        .getOrCreate()

In [2]:
import pandas as pd
from pyspark.ml.feature import *
from pyspark.ml.linalg import Vectors
from pyspark.ml import Pipeline, PipelineModel  
from pyspark.ml.classification import LogisticRegression
from pyspark.sql.functions import *

In [3]:
df = spark.read.parquet("/project/ds5559/fa21-group04/data/processed_df.parquet")

For the "APACHE model", we included only the two variables containing the APACHE hospital and ICU death probabilities. The data dictionary describes these variables as probabilistic predictions of patient mortality derived from the APACHE III score and other covariates, including diagnosis. We explored this logistic regression model to determine whether `hospital_death` could be predicted using only the two APACHE variables in this dataset.
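To make the model's form concrete, the sketch below shows how a logistic regression on two inputs turns the APACHE probabilities into a single predicted probability of death. The coefficient values and the helper names (`sigmoid`, `predict_death_prob`) are hypothetical and chosen only for illustration; they are not the coefficients fitted in this notebook.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_death_prob(hospital_prob, icu_prob, intercept, w_hospital, w_icu):
    """Linear combination of the two APACHE inputs passed through the sigmoid."""
    return sigmoid(intercept + w_hospital * hospital_prob + w_icu * icu_prob)


# Hypothetical coefficients, NOT the values learned by the notebook's model.
INTERCEPT, W_HOSPITAL, W_ICU = -3.0, 4.0, 2.0

low_risk = predict_death_prob(0.10, 0.05, INTERCEPT, W_HOSPITAL, W_ICU)
high_risk = predict_death_prob(0.60, 0.40, INTERCEPT, W_HOSPITAL, W_ICU)
print(f"low-risk patient:  {low_risk:.3f}")
print(f"high-risk patient: {high_risk:.3f}")
```

With positive weights, larger APACHE probabilities push the predicted probability upward, which is the behavior we would expect the fitted model to show as well.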

In [4]:
vars_to_keep = ["apache_4a_hospital_death_prob_imputed",
                "apache_4a_icu_death_prob_imputed",
                "hospital_death"]
apache_df = df[vars_to_keep]
apache_df.head()

Row(apache_4a_hospital_death_prob_imputed=0.10000000149011612, apache_4a_icu_death_prob_imputed=0.05000000074505806, hospital_death=0)

We then split the data, reserving 80% for training and validation and 20% for testing.

In [5]:
trainVal, test = apache_df.randomSplit([0.8, 0.2], seed=304)
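`randomSplit` assigns rows to splits at random, so the 80/20 proportions are approximate rather than exact. The plain-Python sketch below illustrates this idea with an independent draw per row; it is a conceptual illustration under that assumption, not Spark's implementation, and `bernoulli_split` is a hypothetical helper.

```python
import random


def bernoulli_split(rows, train_weight=0.8, seed=304):
    """Send each row to train or test using an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in the training split with probability train_weight.
        (train if rng.random() < train_weight else test).append(row)
    return train, test


rows = list(range(10_000))  # stand-in row ids, not the real dataset
train_rows, test_rows = bernoulli_split(rows)
train_fraction = len(train_rows) / len(rows)
print(f"train fraction: {train_fraction:.3f}")
```

Fixing the seed makes the assignment repeatable, which is why the notebook passes `seed=304` to the split.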

## Pipeline

In [6]:
final_vectorizer =  VectorAssembler(inputCols=["apache_4a_hospital_death_prob_imputed", 
                                    "apache_4a_icu_death_prob_imputed"],
                                    outputCol='features',
                                    handleInvalid='skip')

In [7]:
apache_lr = LogisticRegression(labelCol='hospital_death', featuresCol='features', maxIter=10, regParam=0.01)

In [8]:
apache_lr_pipeline = Pipeline(stages=[final_vectorizer,
                               apache_lr])

## Model Training and Tuning

In [9]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

In [10]:
apache_lr_paramGrid = ParamGridBuilder() \
    .addGrid(apache_lr.regParam, [0.1, 0.01]) \
    .build()

In [11]:
crossval = CrossValidator(estimator=apache_lr_pipeline,
                          estimatorParamMaps=apache_lr_paramGrid,
                          evaluator=BinaryClassificationEvaluator().setLabelCol(apache_lr.getLabelCol()),
                          numFolds=5,
                          seed=304,
                          parallelism=4)

In [12]:
apache_model = crossval.fit(trainVal)

## Model Evaluation

In [13]:
preds = apache_model.transform(test)

In [14]:
testEvaluator = BinaryClassificationEvaluator(rawPredictionCol='probability',
                                              labelCol='hospital_death',
                                              metricName='areaUnderROC')

In [15]:
testEvaluator2 = MulticlassClassificationEvaluator(labelCol='hospital_death',
                                                   predictionCol="prediction",
                                                   probabilityCol='probability',
                                                   metricLabel=1)

### Accuracy

In [16]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "accuracy"})

0.9235112936344969

### Precision

In [17]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "precisionByLabel"})

0.6867088607594937

### Recall

In [18]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "recallByLabel"})

0.18690783807062877

### F1

In [19]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "fMeasureByLabel"})

0.29383886255924174

### True Positive Rate

In [20]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "truePositiveRateByLabel"})

0.18690783807062877

### False Positive Rate

In [21]:
testEvaluator2.evaluate(preds, {testEvaluator2.metricName: "falsePositiveRateByLabel"})

0.007935871743486974

### Area Under Curve

In [22]:
testEvaluator.evaluate(preds)

0.835780984881046

### Confusion Matrix

In [23]:
from pyspark.mllib.evaluation import MulticlassMetrics
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
preds_and_labels = preds.select(['prediction','hospital_death']).withColumn('label', F.col('hospital_death').cast(FloatType())).orderBy('prediction')
preds_and_labels = preds_and_labels.select(['prediction','label'])
conf_matrix = MulticlassMetrics(preds_and_labels.rdd.map(tuple))
conf_matrix.confusionMatrix().toArray()

array([[12376.,    99.],
       [  944.,   217.]])

In [24]:
preds.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|       0.0|13320|
|       1.0|  316|
+----------+-----+
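As a cross-check, the metrics reported above can be recomputed by hand from the confusion matrix. The sketch below assumes the matrix layout is rows = actual label and columns = predicted label, ordered 0 then 1, which is consistent with the prediction counts shown above; the variable names are our own.

```python
# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]].
conf = [[12376, 99],
        [944, 217]]
(tn, fp), (fn, tp) = conf

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)            # of predicted deaths, how many were real
recall = tp / (tp + fn)               # of real deaths, how many were caught (TPR)
false_positive_rate = fp / (fp + tn)

# Column sums should match the groupBy('prediction') counts.
predicted_negative = tn + fn
predicted_positive = fp + tp

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} FPR={false_positive_rate:.4f}")
```

The low recall alongside high accuracy reflects the class imbalance: only 1,161 of the 13,636 test rows are deaths, so predicting "survived" for nearly everyone still scores well on accuracy.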



## Selected Model Hyperparameters

In [25]:
import numpy as np

In [26]:
apache_model.getEstimatorParamMaps()[ np.argmax(apache_model.avgMetrics) ]

{Param(parent='LogisticRegression_d16c95465cea', name='regParam', doc='regularization parameter (>= 0).'): 0.01}

### Cutoff Value Selection

By default, the cutoff value is 0.5. However, we wanted to find the cutoff that yields the highest F1 score. F1 is the harmonic mean of precision and recall, both of which we believe are important in this scenario. We chose F1 rather than accuracy as the selection criterion because accuracy can be misleading when the class distribution is uneven, as it is here.
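A short sketch of why F1 is a stricter summary than a simple average: because it is a harmonic mean, it is pulled toward the smaller of precision and recall. The inputs are the precision and recall reported above at the default 0.5 cutoff; the function names are our own.

```python
def harmonic_mean_f1(precision, recall):
    """F1 = 2PR / (P + R), defined as 0 when both inputs are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def arithmetic_mean(precision, recall):
    """Plain average, shown only for comparison."""
    return (precision + recall) / 2


# Precision and recall at the default cutoff, copied from the evaluation above.
p_default, r_default = 0.6867088607594937, 0.18690783807062877

f1_default = harmonic_mean_f1(p_default, r_default)
avg_default = arithmetic_mean(p_default, r_default)
print(f"F1 = {f1_default:.4f}, arithmetic mean = {avg_default:.4f}")
```

The F1 value sits much closer to the weak recall than the plain average does, so a cutoff cannot score well on F1 by excelling at precision alone.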

In [27]:
from pyspark.sql.types import DoubleType, FloatType

# Extract the positive-class probability (index 1) from the probability vector
getprob = udf(lambda v: float(v[1]), FloatType())
# Select the columns needed for the cutoff search
output = preds.select(col("rawPrediction"),
                              col("hospital_death").cast(DoubleType()),
                              getprob(col("probability")).alias("probability"),
                              col("prediction"))
output

DataFrame[rawPrediction: vector, hospital_death: double, probability: float, prediction: double]

In [28]:
from pyspark.sql.types import DoubleType
from pyspark.mllib.evaluation import BinaryClassificationMetrics

performance_df = spark.createDataFrame([(0,0,0,0,0,0,0,0,0)], ['cutoff', 'accuracy', 'recall', 'precision', 'F1', 'TP', 'FP', 'FN', 'TN'])
for cutoff in range(5, 95, 5):
    cutoff = (cutoff * 0.01)
  
    print('Testing cutoff = ', str(format(cutoff, '.2f')))
    lrpredictions_prob_temp = output.withColumn('prediction', when(col('probability') >= cutoff, 1).otherwise(0).cast(DoubleType()))
    tp = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 1) & (lrpredictions_prob_temp.prediction == 1)].count()
    tn = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 0) & (lrpredictions_prob_temp.prediction == 0)].count()
    fp = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 0) & (lrpredictions_prob_temp.prediction == 1)].count()
    fn = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 1) & (lrpredictions_prob_temp.prediction == 0)].count()
    a = ((tp + tn)/lrpredictions_prob_temp.count())
    # Guard each ratio separately so an empty denominator cannot raise ZeroDivisionError
    r = float(tp) / (tp + fn) if (tp + fn) > 0 else 0.0
    p = float(tp) / (tp + fp) if (tp + fp) > 0 else 0.0
    
    if(p + r == 0):
        f1 = 0
    else:
        f1 = 2 * ((p * r)/(p + r))
    print("Accuracy:", format(a, '.4f'), "Recall:", format(r, '.4f'), "Precision: ", format(p, '.4f'), "F1 score:", format(f1, '.4f'), "TP", tp, "FP", fp, "FN", fn, "TN", tn)
    performance_df_row = spark.createDataFrame([(format(cutoff, '.2f'),format(a, '.4f'), format(r, '.4f'), format(p, '.4f'), format(f1, '.4f'), tp, fp, fn, tn)], 
                                               ['cutoff', 'accuracy', 'recall', 'precision', 'F1', 'TP', 'FP', 'FN', 'TN'])
    performance_df = performance_df.union(performance_df_row)
display(performance_df)

Testing cutoff =  0.05
Accuracy: 0.4492 Recall: 0.9388 Precision:  0.1278 F1 score: 0.2250 TP 1090 FP 7440 FN 71 TN 5035
Testing cutoff =  0.10
Accuracy: 0.8719 Recall: 0.5814 Precision:  0.3487 F1 score: 0.4359 TP 675 FP 1261 FN 486 TN 11214
Testing cutoff =  0.15
Accuracy: 0.9031 Recall: 0.4625 Precision:  0.4348 F1 score: 0.4482 TP 537 FP 698 FN 624 TN 11777
Testing cutoff =  0.20
Accuracy: 0.9134 Recall: 0.3902 Precision:  0.4892 F1 score: 0.4341 TP 453 FP 473 FN 708 TN 12002
Testing cutoff =  0.25
Accuracy: 0.9190 Recall: 0.3471 Precision:  0.5373 F1 score: 0.4218 TP 403 FP 347 FN 758 TN 12128
Testing cutoff =  0.30
Accuracy: 0.9231 Recall: 0.3127 Precision:  0.5922 F1 score: 0.4092 TP 363 FP 250 FN 798 TN 12225
Testing cutoff =  0.35
Accuracy: 0.9245 Recall: 0.2782 Precision:  0.6272 F1 score: 0.3854 TP 323 FP 192 FN 838 TN 12283
Testing cutoff =  0.40
Accuracy: 0.9242 Recall: 0.2403 Precision:  0.6473 F1 score: 0.3505 TP 279 FP 152 FN 882 TN 12323
Testing cutoff =  0.45
Accuracy

DataFrame[cutoff: string, accuracy: string, recall: string, precision: string, F1: string, TP: bigint, FP: bigint, FN: bigint, TN: bigint]
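Given the logged counts, picking the F1-maximizing cutoff is a simple arg-max. The sketch below uses only the eight cutoffs whose lines are complete in the log above and recomputes F1 directly from the counts; `logged_counts` and `f1_from_counts` are illustrative names, not part of the notebook.

```python
# (cutoff, TP, FP, FN) copied from the complete log lines above.
logged_counts = [
    (0.05, 1090, 7440, 71),
    (0.10, 675, 1261, 486),
    (0.15, 537, 698, 624),
    (0.20, 453, 473, 708),
    (0.25, 403, 347, 758),
    (0.30, 363, 250, 798),
    (0.35, 323, 192, 838),
    (0.40, 279, 152, 882),
]


def f1_from_counts(tp, fp, fn):
    """F1 written in terms of counts: 2TP / (2TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


f1_by_cutoff = {cutoff: f1_from_counts(tp, fp, fn)
                for cutoff, tp, fp, fn in logged_counts}
best_cutoff = max(f1_by_cutoff, key=f1_by_cutoff.get)
print(f"best cutoff: {best_cutoff:.2f} (F1 = {f1_by_cutoff[best_cutoff]:.4f})")
```

Among these logged cutoffs, F1 peaks at a threshold well below the default of 0.5, trading some precision for a large gain in recall.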

In [29]:
performance_df.show()

+------+--------+------+---------+------+----+----+----+-----+
|cutoff|accuracy|recall|precision|    F1|  TP|  FP|  FN|   TN|
+------+--------+------+---------+------+----+----+----+-----+
|     0|       0|     0|        0|     0|   0|   0|   0|    0|
|  0.05|  0.4492|0.9388|   0.1278|0.2250|1090|7440|  71| 5035|
|  0.10|  0.8719|0.5814|   0.3487|0.4359| 675|1261| 486|11214|
|  0.15|  0.9031|0.4625|   0.4348|0.4482| 537| 698| 624|11777|
|  0.20|  0.9134|0.3902|   0.4892|0.4341| 453| 473| 708|12002|
|  0.25|  0.9190|0.3471|   0.5373|0.4218| 403| 347| 758|12128|
|  0.30|  0.9231|0.3127|   0.5922|0.4092| 363| 250| 798|12225|
|  0.35|  0.9245|0.2782|   0.6272|0.3854| 323| 192| 838|12283|
|  0.40|  0.9242|0.2403|   0.6473|0.3505| 279| 152| 882|12323|
|  0.45|  0.9240|0.2093|   0.6713|0.3191| 243| 119| 918|12356|
|  0.50|  0.9235|0.1869|   0.6867|0.2938| 217|  99| 944|12376|
|  0.55|  0.9229|0.1585|   0.7104|0.2592| 184|  75| 977|12400|
|  0.60|  0.9226|0.1395|   0.7397|0.2348| 162|  57| 999