# Group 04 | Model Building | Ridge Regression

-------------------------
Amber Curran (akc6be)

Manpreet Dhindsa (mkd8bb)

Quinton Mays (rub9ez)

---------------------------

## Load Data

To begin, we create our Spark session and load the data from the Parquet file.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .appName("group04RF") \
        .getOrCreate()

In [2]:
import pandas as pd
from pyspark.ml.feature import *
from pyspark.ml.linalg import Vectors
from pyspark.ml import Pipeline, PipelineModel  
from pyspark.ml.classification import LogisticRegression
from pyspark.sql.functions import *

In [3]:
df = spark.read.parquet("/project/ds5559/fa21-group04/data/processed_df.parquet")

We then split the data, reserving roughly 80% for training and validation and 20% for testing. `randomSplit` samples rows at random, so the resulting proportions are approximate rather than exact.

In [4]:
trainValridge, testridge = df.randomSplit([0.8, 0.2], seed=304)

## Pipeline

Next, a feature vector is constructed using the selected features from the `Group04FeatureSelection.ipynb` notebook. The columns selected as inputs are features chosen by a `UnivariateFeatureSelector`.

In [5]:
final_feature_vectorizer =  VectorAssembler(inputCols=['FinalCatFeatures',
                                                       'selectedContFeatures'],
                                            outputCol='features',
                                            handleInvalid='skip')

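Next, a ridge (L2-penalized) logistic regression classifier is constructed to predict `hospital_death` from the selected features. This is done with Spark's `LogisticRegression` by setting `elasticNetParam=0.0`, which makes the penalty purely L2.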

In [6]:
ridge = LogisticRegression(featuresCol = 'features', labelCol = 'hospital_death', elasticNetParam=0.0, maxIter=10, regParam=0.01)
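
For reference, the objective being minimized can be sketched as below. This follows the elastic-net formulation described in the Spark MLlib documentation; the symbols are notation introduced here, with $\lambda$ standing for `regParam`, $\alpha$ for `elasticNetParam`, $w$ for the coefficient vector, and labels recoded as $y_i \in \{-1, +1\}$.

```latex
\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i \, w^{\top} x_i}\right)
  + \lambda \left( \alpha \lVert w \rVert_1 + \frac{1 - \alpha}{2} \lVert w \rVert_2^2 \right)
```

With $\alpha = 0$ the L1 term vanishes and only the ridge penalty $\frac{\lambda}{2} \lVert w \rVert_2^2$ remains.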

A pipeline is then created to feed the selected features to the classifier.

In [7]:
ridge_pipeline = Pipeline(stages=[final_feature_vectorizer,
                                  ridge])

## Model Training and Tuning

To begin model training, the necessary packages are imported.

In [8]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

Next, we specify the parameter grid for tuning the ridge model. The only hyperparameter tuned is:

* `regParam` - the regularization strength (lambda)

In [9]:
ridge_paramGrid = ParamGridBuilder() \
    .addGrid(ridge.regParam, [0.1, 0.01]) \
    .build()

In [10]:
crossvalridge = CrossValidator(estimator=ridge_pipeline,
                          estimatorParamMaps=ridge_paramGrid,
                          evaluator=BinaryClassificationEvaluator().setLabelCol(ridge.getLabelCol()),
                          numFolds=5,
                          seed=304,
                          parallelism=4)
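
As a rough cost estimate, the number of pipeline fits this configuration triggers can be worked out as below. The sketch assumes `CrossValidator` fits every parameter map once per fold and then refits the best map on the full training set; the variable names are introduced here for illustration only.

```python
num_folds = 5   # numFolds passed to CrossValidator
grid_size = 2   # regParam candidates in the grid: 0.1 and 0.01

cv_fits = num_folds * grid_size   # one fit per (fold, parameter map) pair
total_fits = cv_fits + 1          # plus the final refit of the best parameter map

print(cv_fits, total_fits)
```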

In [11]:
ridge_model = crossvalridge.fit(trainValridge)

## Model Evaluation

To evaluate model performance, the best model chosen by the `CrossValidator` is used to generate predictions on the test set.

In [12]:
predsridge = ridge_model.transform(testridge)

The predictions are then passed to two objects:

* `BinaryClassificationEvaluator` - for AUC metric
* `MulticlassClassificationEvaluator` - for confusion matrix and all other model metrics

In [13]:
ridge_evaluator = BinaryClassificationEvaluator(rawPredictionCol='probability',
                                              labelCol='hospital_death',
                                              metricName='areaUnderROC')

In [14]:
testEvaluator_ridge2 = MulticlassClassificationEvaluator(labelCol='hospital_death',
                                                   predictionCol="prediction",
                                                   probabilityCol='probability',
                                                   metricLabel=1)

### Accuracy

In [15]:
testEvaluator_ridge2.evaluate(predsridge, {testEvaluator_ridge2.metricName: "accuracy"})

0.9202112056321502

### Precision

In [16]:
testEvaluator_ridge2.evaluate(predsridge, {testEvaluator_ridge2.metricName: "precisionByLabel"})

0.7746478873239436

### Recall

In [17]:
testEvaluator_ridge2.evaluate(predsridge, {testEvaluator_ridge2.metricName: "recallByLabel"})

0.13692946058091288

### F1

In [18]:
testEvaluator_ridge2.evaluate(predsridge, {testEvaluator_ridge2.metricName: "fMeasureByLabel"})

0.23272214386459802

### True Positive Rate

In [19]:
testEvaluator_ridge2.evaluate(predsridge, {testEvaluator_ridge2.metricName: "truePositiveRateByLabel"})

0.13692946058091288

### False Positive Rate

In [20]:
testEvaluator_ridge2.evaluate(predsridge, {testEvaluator_ridge2.metricName: "falsePositiveRateByLabel"})

0.003861314455795994

### Area Under ROC Curve

In [21]:
ridge_evaluator.evaluate(predsridge)

0.840405211038798

### Confusion Matrix

In [22]:
from pyspark.mllib.evaluation import MulticlassMetrics
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
preds_and_labels_ridge = predsridge.select(['prediction','hospital_death']).withColumn('label', F.col('hospital_death').cast(FloatType())).orderBy('prediction')
preds_and_labels_ridge = preds_and_labels_ridge.select(['prediction','label'])
conf_matrix_ridge = MulticlassMetrics(preds_and_labels_ridge.rdd.map(tuple))
conf_matrix_ridge.confusionMatrix().toArray()

array([[12383.,    48.],
       [ 1040.,   165.]])
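
As a cross-check, the metrics reported in the sections above can be recomputed from these four counts. This is a minimal sketch that assumes the matrix is laid out with actual labels as rows and predicted labels as columns, both in ascending label order, so that it reads `[[TN, FP], [FN, TP]]`.

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted).
tn, fp = 12383, 48
fn, tp = 1040, 165

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)            # of predicted deaths, the share that were correct
recall = tp / (tp + fn)               # of actual deaths, the share that were detected
f1 = 2 * precision * recall / (precision + recall)
fpr = fp / (fp + tn)                  # false positive rate

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} fpr={fpr:.4f}")
```

Under that layout the recomputed values agree with the evaluator outputs above, and the second column sums to the 213 positive predictions counted in the next cell.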

In [23]:
predsridge.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|       0.0|13423|
|       1.0|  213|
+----------+-----+



## Selected Model Hyperparameters

In [24]:
import numpy as np

In [25]:
ridge_model.getEstimatorParamMaps()[ np.argmax(ridge_model.avgMetrics) ]

{Param(parent='LogisticRegression_fd98b9740d5e', name='regParam', doc='regularization parameter (>= 0).'): 0.1}
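
To see how every grid point performed rather than only the winner, the parameter maps can be paired with their mean cross-validation metrics. The helper below is a hypothetical sketch (`rank_param_maps` is not part of PySpark), and the example inputs are made-up stand-ins, not results from this notebook.

```python
def rank_param_maps(param_maps, avg_metrics, larger_is_better=True):
    """Pair each parameter map with its mean cross-validation metric, best first."""
    pairs = list(zip(param_maps, avg_metrics))
    return sorted(pairs, key=lambda pair: pair[1], reverse=larger_is_better)


# Hypothetical stand-ins for ridge_model.getEstimatorParamMaps() and
# ridge_model.avgMetrics; the values are illustrative only.
example_maps = ["map_a", "map_b", "map_c"]
example_metrics = [0.2, 0.9, 0.5]

ranked = rank_param_maps(example_maps, example_metrics)
for param_map, metric in ranked:
    print(param_map, metric)
```

In the notebook this would presumably be called as `rank_param_maps(ridge_model.getEstimatorParamMaps(), ridge_model.avgMetrics, crossvalridge.getEvaluator().isLargerBetter())`. Passing the evaluator's direction avoids relying on `np.argmax` being correct, which only holds for metrics such as area under ROC where larger is better.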

### Cutoff Value Selection

By default, the cutoff value is 0.5. However, we wanted to find the cutoff that yields the highest F1 score. F1 is the harmonic mean of precision and recall, both of which matter in this scenario. We chose F1 rather than accuracy as the selection criterion because accuracy can be misleading when the classes are imbalanced: in the test set only 1,205 of 13,636 patients (about 9%) died, so a model that predicts survival almost every time still reaches about 92% accuracy.

In [26]:
from pyspark.sql.types import DoubleType, FloatType

getprob = udf(lambda v:float(v[1]),FloatType())

output = predsridge.select(col("rawPrediction"),
                              col("hospital_death").cast(DoubleType()),
                              getprob(col("probability")).alias("probability"),
                              col("prediction"))
output

DataFrame[rawPrediction: vector, hospital_death: double, probability: float, prediction: double]

The following loop computes each metric for cutoff values from 0.05 to 0.90 in steps of 0.05.

In [28]:
from pyspark.sql.types import DoubleType
from pyspark.mllib.evaluation import BinaryClassificationMetrics

performance_df = spark.createDataFrame([(0,0,0,0,0,0,0,0,0)], ['cutoff','accuracy', 'recall', 'precision', 'F1', 'TP', 'FP', 'FN', 'TN'])
for cutoff in range(5, 95, 5):
    cutoff = (cutoff * 0.01)
  
    print('Testing cutoff = ', str(format(cutoff, '.2f')))
    lrpredictions_prob_temp = output.withColumn('prediction', when(col('probability') >= cutoff, 1).otherwise(0).cast(DoubleType()))
    tp = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 1) & (lrpredictions_prob_temp.prediction == 1)].count()
    tn = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 0) & (lrpredictions_prob_temp.prediction == 0)].count()
    fp = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 0) & (lrpredictions_prob_temp.prediction == 1)].count()
    fn = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 1) & (lrpredictions_prob_temp.prediction == 0)].count()
    a = ((tp + tn)/lrpredictions_prob_temp.count())
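    # Guard each ratio on its own denominator so that neither can raise ZeroDivisionError
    # (the original branches could still divide by zero when both denominators were 0).
    r = float(tp) / (tp + fn) if (tp + fn) > 0 else 0.0
    p = float(tp) / (tp + fp) if (tp + fp) > 0 else 0.0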
    
    if(p + r == 0):
        f1 = 0
    else:
        f1 = 2 * ((p * r)/(p + r))
    print("Accuracy:", format(a, '.4f'), "Recall:", format(r, '.4f'), "Precision: ", format(p, '.4f'), "F1 score:", format(f1, '.4f'), "TP", tp, "FP", fp, "FN", fn, "TN", tn)
    performance_df_row = spark.createDataFrame([(format(cutoff, '.2f'), format(a, '.4f'), format(r, '.4f'), format(p, '.4f'), format(f1, '.4f'), tp, fp, fn, tn)], ['cutoff', 'accuracy', 'recall', 'precision', 'F1', 'TP', 'FP', 'FN', 'TN'])
    performance_df = performance_df.union(performance_df_row)
display(performance_df)

Testing cutoff =  0.05
Accuracy: 0.5612 Recall: 0.9071 Precision:  0.1569 F1 score: 0.2676 TP 1093 FP 5872 FN 112 TN 6559
Testing cutoff =  0.10
Accuracy: 0.8165 Recall: 0.6971 Precision:  0.2822 F1 score: 0.4017 TP 840 FP 2137 FN 365 TN 10294
Testing cutoff =  0.15
Accuracy: 0.8843 Recall: 0.5519 Precision:  0.3905 F1 score: 0.4574 TP 665 FP 1038 FN 540 TN 11393
Testing cutoff =  0.20
Accuracy: 0.9070 Recall: 0.4307 Precision:  0.4714 F1 score: 0.4501 TP 519 FP 582 FN 686 TN 11849
Testing cutoff =  0.25
Accuracy: 0.9160 Recall: 0.3527 Precision:  0.5373 F1 score: 0.4259 TP 425 FP 366 FN 780 TN 12065
Testing cutoff =  0.30
Accuracy: 0.9198 Recall: 0.2888 Precision:  0.5959 F1 score: 0.3890 TP 348 FP 236 FN 857 TN 12195
Testing cutoff =  0.35
Accuracy: 0.9212 Recall: 0.2373 Precision:  0.6485 F1 score: 0.3475 TP 286 FP 155 FN 919 TN 12276
Testing cutoff =  0.40
Accuracy: 0.9216 Recall: 0.2008 Precision:  0.6954 F1 score: 0.3117 TP 242 FP 106 FN 963 TN 12325
Testing cutoff =  0.45
Accura

DataFrame[cutoff: string, accuracy: string, recall: string, precision: string, F1: string, TP: bigint, FP: bigint, FN: bigint, TN: bigint]
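
The sweep logic above can also be sketched in plain Python, which makes it easy to test on a small sample before running it on Spark. This is an illustration under stated assumptions, not the notebook's implementation: `sweep_cutoffs` is a hypothetical helper, and the toy scores and labels are invented for the example.

```python
def sweep_cutoffs(probabilities, labels, cutoffs):
    """Return one dict of confusion counts and metrics per cutoff."""
    rows = []
    for cutoff in cutoffs:
        tp = fp = fn = tn = 0
        for prob, label in zip(probabilities, labels):
            predicted = 1 if prob >= cutoff else 0
            if predicted == 1 and label == 1:
                tp += 1
            elif predicted == 1 and label == 0:
                fp += 1
            elif predicted == 0 and label == 1:
                fn += 1
            else:
                tn += 1
        # Each ratio falls back to 0.0 when its denominator is empty.
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum > 0 else 0.0
        rows.append({"cutoff": cutoff, "TP": tp, "FP": fp, "FN": fn, "TN": tn,
                     "precision": precision, "recall": recall, "F1": f1})
    return rows


# Hypothetical toy scores and labels, for illustration only (not data from this notebook).
toy_probs = [0.05, 0.10, 0.20, 0.40, 0.60, 0.90]
toy_labels = [0, 0, 1, 0, 1, 1]
cutoffs = [c / 100 for c in range(5, 95, 5)]  # 0.05, 0.10, ..., 0.90

results = sweep_cutoffs(toy_probs, toy_labels, cutoffs)
best = max(results, key=lambda row: row["F1"])  # first cutoff with the highest F1
print(best["cutoff"], round(best["F1"], 4))
```

Keeping the metrics as numbers rather than formatted strings lets the best cutoff be selected directly with `max`.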

In [29]:
performance_df.show()

+------+--------+------+---------+------+----+----+----+-----+
|cutoff|accuracy|recall|precision|    F1|  TP|  FP|  FN|   TN|
+------+--------+------+---------+------+----+----+----+-----+
|     0|       0|     0|        0|     0|   0|   0|   0|    0|
|  0.05|  0.5612|0.9071|   0.1569|0.2676|1093|5872| 112| 6559|
|  0.10|  0.8165|0.6971|   0.2822|0.4017| 840|2137| 365|10294|
|  0.15|  0.8843|0.5519|   0.3905|0.4574| 665|1038| 540|11393|
|  0.20|  0.9070|0.4307|   0.4714|0.4501| 519| 582| 686|11849|
|  0.25|  0.9160|0.3527|   0.5373|0.4259| 425| 366| 780|12065|
|  0.30|  0.9198|0.2888|   0.5959|0.3890| 348| 236| 857|12195|
|  0.35|  0.9212|0.2373|   0.6485|0.3475| 286| 155| 919|12276|
|  0.40|  0.9216|0.2008|   0.6954|0.3117| 242| 106| 963|12325|
|  0.45|  0.9219|0.1718|   0.7555|0.2799| 207|  67| 998|12364|
|  0.50|  0.9202|0.1369|   0.7746|0.2327| 165|  48|1040|12383|
|  0.55|  0.9185|0.1046|   0.7975|0.1849| 126|  32|1079|12399|
|  0.60|  0.9174|0.0797|   0.8421|0.1456|  96|  18|1109