# Group 04 | Model Building | Lasso Regression

-------------------------
Amber Curran (akc6be)

Manpreet Dhindsa (mkd8bb)

Quinton Mays (rub9ez)

---------------------------

## Load Data

To begin, we create our Spark session and load the data from the Parquet file.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .appName("group04RF") \
        .getOrCreate()

In [2]:
import pandas as pd
from pyspark.ml.feature import *
from pyspark.ml.linalg import Vectors
from pyspark.ml import Pipeline, PipelineModel  
from pyspark.ml.classification import LogisticRegression
from pyspark.sql.functions import *

In [3]:
df = spark.read.parquet("/project/ds5559/fa21-group04/data/processed_df.parquet")

We then split the data, reserving 80% for training and validation (via cross-validation) and 20% for testing.

In [4]:
trainVallasso, testlasso = df.randomSplit([0.8, 0.2], seed=304)

## Pipeline

Next, a feature vector is constructed using the selected features from the `Group04FeatureSelection.ipynb` notebook. The columns selected as inputs are features chosen by a `UnivariateFeatureSelector`.

In [5]:
final_feature_vectorizer =  VectorAssembler(inputCols=['FinalCatFeatures',
                                                       'selectedContFeatures'],
                                            outputCol='features',
                                            handleInvalid='skip')

Next, a `LogisticRegression` classifier with an L1 (lasso) penalty is constructed to predict `hospital_death` based on the selected features. Setting `elasticNetParam=1.0` makes the regularization purely L1.

In [6]:
lasso = LogisticRegression(featuresCol = 'features', labelCol = 'hospital_death', elasticNetParam=1.0, maxIter=10, regParam=0.01)
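
For reference, the objective that an elastic-net-regularized logistic regression minimizes can be written as below. This is a sketch based on the general elastic-net formulation, not taken from this notebook: the symbols $\lambda$ (for `regParam`), $\alpha$ (for `elasticNetParam`), $w$, $b$, and $p_i$ are notation introduced here, and details such as feature standardization are left out.

```latex
\min_{w,\,b}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Bigr]
  \;+\; \lambda\Bigl( \alpha\,\lVert w\rVert_1 + \frac{1-\alpha}{2}\,\lVert w\rVert_2^2 \Bigr),
\qquad
p_i = \frac{1}{1 + e^{-(w^\top x_i + b)}}
```

With $\alpha = 1$, as set above, the squared-norm term vanishes and only the L1 penalty $\lambda\lVert w\rVert_1$ remains. That penalty is what can drive some coefficients exactly to zero.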


A pipeline is then created to feed the selected features to the classifier.

In [7]:
lasso_pipeline = Pipeline(stages=[final_feature_vectorizer,
                                  lasso])

## Model Training and Tuning

To begin model training, the necessary packages are imported.

In [8]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

Next, we specify the parameter grid for tuning our lasso model. For this model, the tunable hyperparameter is:

* `regParam` (the regularization strength, lambda)

In [9]:
lasso_paramGrid = ParamGridBuilder() \
    .addGrid(lasso.regParam, [0.1, 0.01]) \
    .build()

In [10]:
crossval_lasso = CrossValidator(estimator=lasso_pipeline,
                          estimatorParamMaps=lasso_paramGrid,
                          evaluator=BinaryClassificationEvaluator().setLabelCol(lasso.getLabelCol()),
                          numFolds=5,
                          seed=304,
                          parallelism=4)
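
To make the cost of this search concrete, the following pure-Python sketch enumerates the fits implied by a two-value grid and five folds. It does not call Spark. The names `reg_params`, `num_folds`, and `cv_fits` are introduced here for illustration, and the extra refit on the full training set reflects the usual `CrossValidator` behavior rather than anything shown in this notebook.

```python
from itertools import product

# Grid and fold count copied from the cells above
reg_params = [0.1, 0.01]
num_folds = 5

# Each (regParam, fold) pair is one model fit on k-1 folds,
# scored on the held-out fold with the evaluator (AUC by default)
cv_fits = list(product(reg_params, range(num_folds)))

total_cv_fits = len(cv_fits)
# Assumption: the best setting is refit once on all of the training data
total_fits = total_cv_fits + 1

print("cross-validation fits:", total_cv_fits)
print("fits including final refit:", total_fits)
```

With `parallelism=4`, up to four of these fits can run concurrently.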

In [11]:
lasso_model = crossval_lasso.fit(trainVallasso)

## Model Evaluation

To evaluate model performance, the best model chosen by the `CrossValidator` is used to predict the test set.


In [12]:
predslasso = lasso_model.transform(testlasso)

The predictions are then passed to two evaluators:

* `BinaryClassificationEvaluator` - for AUC metric
* `MulticlassClassificationEvaluator` - for confusion matrix and all other model metrics

In [13]:
lasso_evaluator = BinaryClassificationEvaluator(rawPredictionCol='probability',
                                              labelCol='hospital_death',
                                              metricName='areaUnderROC')

In [14]:
testEvaluator_lasso2 = MulticlassClassificationEvaluator(labelCol='hospital_death',
                                                   predictionCol="prediction",
                                                   probabilityCol='probability',
                                                   metricLabel=1)

### Accuracy

In [15]:
testEvaluator_lasso2.evaluate(predslasso, {testEvaluator_lasso2.metricName: "accuracy"})

0.9196245233206218

### Precision

In [16]:
testEvaluator_lasso2.evaluate(predslasso, {testEvaluator_lasso2.metricName: "precisionByLabel"})

0.6967509025270758

### Recall

In [17]:
testEvaluator_lasso2.evaluate(predslasso, {testEvaluator_lasso2.metricName: "recallByLabel"})

0.16016597510373445

### F1

In [18]:
testEvaluator_lasso2.evaluate(predslasso, {testEvaluator_lasso2.metricName: "fMeasureByLabel"})

0.26045883940620784

### True Positive Rate

In [19]:
testEvaluator_lasso2.evaluate(predslasso, {testEvaluator_lasso2.metricName: "truePositiveRateByLabel"})

0.16016597510373445

### False Positive Rate

In [20]:
testEvaluator_lasso2.evaluate(predslasso, {testEvaluator_lasso2.metricName: "falsePositiveRateByLabel"})

0.006757300297642989

### Area Under ROC Curve

In [21]:
lasso_evaluator.evaluate(predslasso)

0.8434704965601
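
One way to read this number: AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below illustrates that interpretation in plain Python on a tiny made-up example. `toy_labels` and `toy_scores` are hypothetical values, not data from this project, and Spark derives its value from the ROC curve rather than by counting pairs.

```python
def auc_by_pairs(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy data for illustration only
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.05, 0.20, 0.15, 0.40, 0.60, 0.90]

toy_auc = auc_by_pairs(toy_labels, toy_scores)
print(round(toy_auc, 4))
```

Because AUC depends only on how the scores rank the cases, it is unaffected by the choice of cutoff, unlike the other metrics above.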

### Confusion Matrix

In [22]:
from pyspark.mllib.evaluation import MulticlassMetrics
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType

# MulticlassMetrics expects an RDD of (prediction, label) pairs
preds_and_labels_lasso = predslasso.select(
    'prediction',
    F.col('hospital_death').cast(FloatType()).alias('label'))
conf_matrix_lasso = MulticlassMetrics(preds_and_labels_lasso.rdd.map(tuple))
# Rows are actual labels, columns are predicted labels
conf_matrix_lasso.confusionMatrix().toArray()

array([[12347.,    84.],
       [ 1012.,   193.]])
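
As a cross-check, the metrics reported in the sections above can be recomputed by hand from this confusion matrix. The sketch below uses plain Python and assumes the layout of rows as actual labels and columns as predicted labels, so the four cells are TN, FP, FN, and TP. The variable names are introduced here.

```python
# Cells copied from the confusion matrix output above
# (assumed layout: rows = actual label, columns = predicted label)
tn, fp = 12347, 84
fn, tp = 1012, 193

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)      # of predicted deaths, how many were real
recall = tp / (tp + fn)         # of real deaths, how many were caught (TPR)
f1 = 2 * precision * recall / (precision + recall)
fpr = fp / (fp + tn)            # survivors wrongly flagged as deaths

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("F1", f1), ("FPR", fpr)]:
    print(f"{name}: {value:.4f}")
```

The high accuracy comes mostly from the 12,347 true negatives, while recall shows that only 193 of the 1,205 actual deaths are caught at the default cutoff.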

In [23]:
predslasso.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|       0.0|13359|
|       1.0|  277|
+----------+-----+



## Selected Model Hyperparameters

In [24]:
import numpy as np

In [25]:
lasso_model.getEstimatorParamMaps()[ np.argmax(lasso_model.avgMetrics) ]

{Param(parent='LogisticRegression_647efb63004c', name='regParam', doc='regularization parameter (>= 0).'): 0.01}

### Cutoff Value Selection

By default, the cutoff value is 0.5. However, we wanted to find the cutoff that yields the highest F1 score. F1 is the harmonic mean of precision and recall, both of which we believe are important in this scenario. We chose F1 rather than accuracy as the selection criterion because accuracy is dominated by the majority class when the class distribution is uneven, as it is here.

In [26]:
from pyspark.sql.types import DoubleType, FloatType

getprob = udf(lambda v:float(v[1]),FloatType())

output = predslasso.select(col("rawPrediction"),
                              col("hospital_death").cast(DoubleType()),
                              getprob(col("probability")).alias("probability"),
                              col("prediction"))
output

DataFrame[rawPrediction: vector, hospital_death: double, probability: float, prediction: double]

The following loop computes each metric for cutoff values from 0.05 to 0.90 in steps of 0.05.

In [27]:
from pyspark.sql.types import DoubleType
from pyspark.mllib.evaluation import BinaryClassificationMetrics

performance_df = spark.createDataFrame([(0,0,0,0,0,0,0,0,0)], ['cutoff','accuracy', 'recall', 'precision', 'F1', 'TP', 'FP', 'FN', 'TN'])
for cutoff in range(5, 95, 5):
    cutoff = (cutoff * 0.01)
  
    print('Testing cutoff = ', str(format(cutoff, '.2f')))
    lrpredictions_prob_temp = output.withColumn('prediction', when(col('probability') >= cutoff, 1).otherwise(0).cast(DoubleType()))
    tp = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 1) & (lrpredictions_prob_temp.prediction == 1)].count()
    tn = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 0) & (lrpredictions_prob_temp.prediction == 0)].count()
    fp = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 0) & (lrpredictions_prob_temp.prediction == 1)].count()
    fn = lrpredictions_prob_temp[(lrpredictions_prob_temp.hospital_death == 1) & (lrpredictions_prob_temp.prediction == 0)].count()
    a = ((tp + tn)/lrpredictions_prob_temp.count())
    # Guard each ratio separately so neither division can hit a zero denominator
    r = float(tp) / (tp + fn) if (tp + fn) > 0 else 0.0
    p = float(tp) / (tp + fp) if (tp + fp) > 0 else 0.0
    
    if(p + r == 0):
        f1 = 0
    else:
        f1 = 2 * ((p * r)/(p + r))
    print("Accuracy:", format(a, '.4f'), "Recall:", format(r, '.4f'), "Precision: ", format(p, '.4f'), "F1 score:", format(f1, '.4f'), "TP", tp, "FP", fp, "FN", fn, "TN", tn)
    performance_df_row = spark.createDataFrame([(format(cutoff, '.2f'), format(a, '.4f'), format(r, '.4f'), format(p, '.4f'), format(f1, '.4f'), tp, fp, fn, tn)], ['cutoff', 'accuracy', 'recall', 'precision', 'F1', 'TP', 'FP', 'FN', 'TN'])
    performance_df = performance_df.union(performance_df_row)
display(performance_df)

Testing cutoff =  0.05
Accuracy: 0.5344 Recall: 0.9245 Precision:  0.1511 F1 score: 0.2598 TP 1114 FP 6258 FN 91 TN 6173
Testing cutoff =  0.10
Accuracy: 0.8326 Recall: 0.6730 Precision:  0.3004 F1 score: 0.4154 TP 811 FP 1889 FN 394 TN 10542
Testing cutoff =  0.15
Accuracy: 0.8890 Recall: 0.5386 Precision:  0.4039 F1 score: 0.4616 TP 649 FP 958 FN 556 TN 11473
Testing cutoff =  0.20
Accuracy: 0.9090 Recall: 0.4407 Precision:  0.4836 F1 score: 0.4611 TP 531 FP 567 FN 674 TN 11864
Testing cutoff =  0.25
Accuracy: 0.9153 Recall: 0.3660 Precision:  0.5300 F1 score: 0.4330 TP 441 FP 391 FN 764 TN 12040
Testing cutoff =  0.30
Accuracy: 0.9171 Recall: 0.3054 Precision:  0.5559 F1 score: 0.3942 TP 368 FP 294 FN 837 TN 12137
Testing cutoff =  0.35
Accuracy: 0.9193 Recall: 0.2664 Precision:  0.5967 F1 score: 0.3683 TP 321 FP 217 FN 884 TN 12214
Testing cutoff =  0.40
Accuracy: 0.9201 Recall: 0.2299 Precision:  0.6324 F1 score: 0.3372 TP 277 FP 161 FN 928 TN 12270
Testing cutoff =  0.45
Accuracy

DataFrame[cutoff: string, accuracy: string, recall: string, precision: string, F1: string, TP: bigint, FP: bigint, FN: bigint, TN: bigint]
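
Selecting the cutoff then amounts to taking the row with the largest F1. The sketch below does this in plain Python using the TP, FP, and FN counts that appear in this notebook's output for cutoffs 0.05 through 0.60. Cutoffs whose counts are not visible are left out, and the names `counts`, `f1_from_counts`, and `best_cutoff` are introduced here.

```python
# (cutoff, TP, FP, FN) copied from the sweep output in this notebook
counts = [
    (0.05, 1114, 6258, 91),
    (0.10, 811, 1889, 394),
    (0.15, 649, 958, 556),
    (0.20, 531, 567, 674),
    (0.25, 441, 391, 764),
    (0.30, 368, 294, 837),
    (0.35, 321, 217, 884),
    (0.40, 277, 161, 928),
    (0.45, 232, 107, 973),
    (0.50, 193, 84, 1012),
    (0.55, 162, 61, 1043),
    (0.60, 127, 36, 1078),
]

def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

f1_by_cutoff = {c: f1_from_counts(tp, fp, fn) for c, tp, fp, fn in counts}
best_cutoff = max(f1_by_cutoff, key=f1_by_cutoff.get)
best_f1 = f1_by_cutoff[best_cutoff]

print("best cutoff:", best_cutoff, "F1:", round(best_f1, 4))
```

Among the cutoffs listed here, F1 peaks at 0.15 and is nearly identical at 0.20, so the choice between those two rests on whether recall or precision matters more.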

In [28]:
performance_df.show()

+------+--------+------+---------+------+----+----+----+-----+
|cutoff|accuracy|recall|precision|    F1|  TP|  FP|  FN|   TN|
+------+--------+------+---------+------+----+----+----+-----+
|     0|       0|     0|        0|     0|   0|   0|   0|    0|
|  0.05|  0.5344|0.9245|   0.1511|0.2598|1114|6258|  91| 6173|
|  0.10|  0.8326|0.6730|   0.3004|0.4154| 811|1889| 394|10542|
|  0.15|  0.8890|0.5386|   0.4039|0.4616| 649| 958| 556|11473|
|  0.20|  0.9090|0.4407|   0.4836|0.4611| 531| 567| 674|11864|
|  0.25|  0.9153|0.3660|   0.5300|0.4330| 441| 391| 764|12040|
|  0.30|  0.9171|0.3054|   0.5559|0.3942| 368| 294| 837|12137|
|  0.35|  0.9193|0.2664|   0.5967|0.3683| 321| 217| 884|12214|
|  0.40|  0.9201|0.2299|   0.6324|0.3372| 277| 161| 928|12270|
|  0.45|  0.9208|0.1925|   0.6844|0.3005| 232| 107| 973|12324|
|  0.50|  0.9196|0.1602|   0.6968|0.2605| 193|  84|1012|12347|
|  0.55|  0.9190|0.1344|   0.7265|0.2269| 162|  61|1043|12370|
|  0.60|  0.9183|0.1054|   0.7791|0.1857| 127|  36|1078