## Group 04 Model Building

-------------------------
Amber Curran (akc6be)

Manpreet Dhindsa (mkd8bb)

Quinton Mays (rub9ez)

---------------------------

To begin, we load the dataset as we did in the EDA.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .appName("group04model") \
        .getOrCreate()

In [2]:
import pandas as pd

In [3]:
schemadf = pd.read_csv('/project/ds5559/fa21-group04/data/WiDS_Datathon_2020_Dictionary.csv')

In [4]:
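# Import only the types used below rather than the whole module namespace.
from pyspark.sql.types import (StructType, StructField,
                               IntegerType, FloatType, StringType)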
data_types = {
    'integer': IntegerType(),
    'binary': IntegerType(),
    'numeric': FloatType(),
    'string': StringType()
}

In [5]:
schema = StructType(
    [
        StructField(row['Variable Name'] , 
                    data_types[row['Data Type']], 
                    True) for index, row in schemadf.iterrows()
    ]
)
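
The schema construction above relies on every Data Type value in the dictionary having an entry in data_types; a value outside that mapping would raise a KeyError. As a minimal sketch, one could check for unmapped types before building the schema. The rows below are made up, and find_unmapped_types is a hypothetical helper, not part of the project code.

```python
# Type names handled by the data_types mapping in the notebook.
KNOWN_TYPES = {"integer", "binary", "numeric", "string"}


def find_unmapped_types(rows, known=KNOWN_TYPES):
    """Return the sorted 'Data Type' values that have no entry in the mapping."""
    return sorted({row["Data Type"] for row in rows} - set(known))


# Made-up rows standing in for schemadf.iterrows(); not the real dictionary.
example_rows = [
    {"Variable Name": "age", "Data Type": "numeric"},
    {"Variable Name": "gender", "Data Type": "string"},
    {"Variable Name": "admit_time", "Data Type": "timestamp"},
]

unmapped = find_unmapped_types(example_rows)
print(unmapped)
```

An empty result would mean the list comprehension that builds the StructType can look up every row safely.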

In [6]:
df = spark.read.format('csv') \
    .schema(schema) \
    .option('header', True) \
    .load('/project/ds5559/fa21-group04/data/training_v2.csv')

In [7]:
df.persist()

DataFrame[encounter_id: int, patient_id: int, hospital_id: int, hospital_death: int, age: float, bmi: string, elective_surgery: int, ethnicity: string, gender: string, height: float, hospital_admit_source: string, icu_admit_source: string, icu_admit_type: string, icu_id: int, icu_stay_type: string, icu_type: string, pre_icu_los_days: float, readmission_status: int, weight: float, albumin_apache: float, apache_2_diagnosis: string, apache_3j_diagnosis: string, apache_post_operative: int, arf_apache: int, bilirubin_apache: float, bun_apache: float, creatinine_apache: float, fio2_apache: float, gcs_eyes_apache: int, gcs_motor_apache: int, gcs_unable_apache: int, gcs_verbal_apache: int, glucose_apache: float, heart_rate_apache: float, hematocrit_apache: float, intubated_apache: int, map_apache: float, paco2_apache: float, paco2_for_ph_apache: float, pao2_apache: float, ph_apache: float, resprate_apache: float, sodium_apache: float, temp_apache: float, urineoutput_apache: float, ventilated_a

Note that we did not drop columns as we did in the EDA, because that step was coercing every column to string type. As a temporary workaround, we removed it.

In [47]:
# Import only the feature transformers used in this notebook.
from pyspark.ml.feature import (Bucketizer, StringIndexer, OneHotEncoder,
                                MaxAbsScaler, VectorAssembler)

First, we set up a bucketizer for one variable, BMI.

In [9]:
bmi_splits = [-float("inf"), 18.5, 24.9, 29.9, float("inf")]

In [10]:
bmi_bucketizer = Bucketizer(splits=bmi_splits,
                            inputCol='bmi',
                            outputCol='bucket_bmi')
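
To make the effect of these splits concrete, here is a plain-Python sketch of how values would be assigned to buckets. It assumes the usual Bucketizer convention that each bucket covers its lower bound inclusive and its upper bound exclusive, with the last bucket also including its upper bound. The bucket_index helper and the sample values are hypothetical, and NaN handling is not covered.

```python
from bisect import bisect_right

bmi_splits = [-float("inf"), 18.5, 24.9, 29.9, float("inf")]


def bucket_index(value, splits):
    """Assign a bucket using [lower, upper) intervals; the last bucket is closed."""
    idx = bisect_right(splits, value) - 1
    # The final bucket also includes its upper bound.
    return min(idx, len(splits) - 2)


# Made-up BMI values for illustration only.
for bmi in [17.0, 18.5, 24.9, 27.3, 29.9, 41.0]:
    print(bmi, "->", bucket_index(bmi, bmi_splits))
```

With these splits a value exactly on a boundary, such as 24.9, lands in the higher bucket. The schema printed earlier lists bmi as string; as far as we can tell, Bucketizer expects a numeric input column, so a cast may be needed before this stage is used.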

Next, we build a lookup from variable name to data type using the data dictionary.

In [15]:
var_types = dict(zip(schemadf['Variable Name'], schemadf['Data Type']))

We then set up one-hot encoding for every string variable, except the columns listed as exceptions in do_not_ohe.

In [80]:
do_not_ohe = ['apache_2_bodysystem']

In [81]:
# String-typed columns, excluding those we do not want to one-hot encode.
string_vars = [col for col in df.columns
               if var_types[col] == 'string' and col not in do_not_ohe]


In [82]:
string_index_vars = ['{}_i'.format(var) for var in string_vars]

In [83]:
ohe_vars = ['{}_ohe'.format(var) for var in string_vars]

In [84]:
string_indexer = StringIndexer(inputCols=string_vars,
                              outputCols=string_index_vars)

In [85]:
oneHotEncoder = OneHotEncoder(inputCols=string_index_vars,
                             outputCols=ohe_vars)
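
As a rough illustration of what these two stages produce, the plain-Python sketch below indexes labels by descending frequency and then one-hot encodes the indices with the last category dropped. We believe these match the StringIndexer and OneHotEncoder defaults, but treat that as an assumption. The helper names and sample values are hypothetical, and Spark would return sparse vectors rather than lists.

```python
from collections import Counter


def fit_string_index(values):
    """Map labels to indices by descending frequency, ties broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, n_categories, drop_last=True):
    """Dense one-hot list; with drop_last the final category becomes all zeros."""
    size = n_categories - 1 if drop_last else n_categories
    return [1.0 if i == index else 0.0 for i in range(size)]


# Made-up values for illustration only.
admit_types = ["Floor", "Emergency", "Floor", "Other", "Floor", "Emergency"]
index_map = fit_string_index(admit_types)
encoded = {label: one_hot(i, len(index_map)) for label, i in index_map.items()}
print(index_map)
print(encoded)
```

Dropping the last category keeps the encoded columns from summing to one, which matters for a linear model such as logistic regression.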

We then set up scaling for the float variables in the model. As the MaxAbsScaler object takes only one input column, we create a list of these objects, one per variable, to pass to the pipeline.

In [86]:
float_vars = [col for col in df.columns if var_types[col] == 'numeric']

In [87]:
max_ab_scalers = []
do_not_scale = ['bmi', 'apache_4a_hospital_death_prob',
               "apache_4a_icu_death_prob"]
for var in float_vars:
    if var not in do_not_scale:
        max_ab_scalers.append(MaxAbsScaler(inputCol=var,
                                           outputCol='{}_maxAbSc'.format(var)))
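
For reference, max-abs scaling divides each value of a feature by the largest absolute value seen for that feature, so results fall in [-1, 1] and zeros stay zero. A minimal plain-Python sketch of the arithmetic follows; max_abs_scale and the sample values are hypothetical.

```python
def max_abs_scale(values):
    """Divide each value by the largest absolute value; signs and zeros are kept."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)  # all zeros: nothing to scale
    return [v / max_abs for v in values]


# Made-up values for illustration only.
scaled_example = max_abs_scale([-4.0, 0.0, 2.0, 8.0])
print(scaled_example)
```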

Next, the feature columns are selected and passed to the VectorAssembler.

In [88]:
assembler = VectorAssembler(
    inputCols=["icu_admit_type_ohe", "age",
               "elective_surgery", "readmission_status",
               "apache_4a_hospital_death_prob",
               "apache_4a_icu_death_prob",
               "pre_icu_los_days"],
    outputCol="features")

The pipeline is then built for a simple logistic regression model.

In [89]:
from pyspark.ml.classification import LogisticRegression

In [90]:
lr = LogisticRegression(labelCol='hospital_death', featuresCol='features', maxIter=10, regParam=0.01)

In [91]:
from pyspark.ml import Pipeline  

In [92]:
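# The assembler must run before the classifier so that the 'features' column exists.
lr_pipeline = Pipeline(stages=[string_indexer, oneHotEncoder] + max_ab_scalers + [assembler, lr])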

In [93]:
model = lr_pipeline.fit(df)

IllegalArgumentException: requirement failed: Column age must be of type class org.apache.spark.ml.linalg.VectorUDT:struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually class org.apache.spark.sql.types.FloatType$:float.