# Group 04 | Feature Selection

-------------------------
Amber Curran (akc6be)

Manpreet Dhindsa (mkd8bb)

Quinton Mays (rub9ez)

---------------------------

## Load Data

To begin feature selection, a spark session is created and the imported dataframe is read from the parquet file:

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .appName("group04VarSel2") \
        .getOrCreate()

In [2]:
df = spark.read.parquet("/project/ds5559/fa21-group04/data/df.parquet")

Next, we import the modules necessary for creating pyspark pipelines and for importing the data dictionary.

In [3]:
import pandas as pd
from pyspark.ml.feature import *
from pyspark.ml.linalg import Vectors
from pyspark.ml import Pipeline  
from pyspark.ml.classification import LogisticRegression

We specify that the model features are all columns that are not `hospital_death`, which is our response variable.

In [4]:
model_features = [col for col in df.columns if col not in ['hospital_death']] 

The data dictionary is read into a `pandas` dataframe and then zipped to a dictionary containing the feature names and data types.

In [5]:
schemadf = pd.read_csv('/project/ds5559/fa21-group04/data/WiDS_Datathon_2020_Dictionary.csv')
schemadf = schemadf[schemadf['Variable Name'] != 'icu_admit_type']
var_types = dict(zip(schemadf['Variable Name'], schemadf['Data Type']))

We then split the features by type into four lists: 

<b>Categorical Feature Types </b>
1. Integer
2. Binary
3. String

<b>Continuous Feature Types </b>

4. Float

The engineered features (average features created in DATA EXPLORATION AND CLEANING notebook), are excluded from the list comprehension as they are not included in the data dictionary. They are then concatenated onto float list, as they are all floating point values.

In [6]:
special_vars = ['bmi'] # any variables that we want to give special treatment (bucketize, etc)
engineered_vars = [col for col in model_features if col not in list(var_types.keys())]
exclude_vars = special_vars + engineered_vars
orig_vars = [col for col in model_features if col in list(var_types.keys())]
model_int_vars = [col for col in orig_vars if var_types[col] == 'integer' and col not in exclude_vars]
model_float_vars = [col for col in orig_vars if var_types[col] == 'numeric' and col not in exclude_vars] + engineered_vars + ['bmi']
model_binary_vars = [col for col in orig_vars if var_types[col] == 'binary' and col not in exclude_vars]
model_string_vars = [col for col in orig_vars if var_types[col] == 'string' and col not in exclude_vars]

The possible model features are shown below:

In [7]:
print('Possible Model Integer Features : {}'.format(model_int_vars))

Possible Model Integer Features : ['hospital_id', 'icu_id']


In [8]:
print('Possible Model Float Features : {}'.format(model_float_vars))

Possible Model Float Features : ['age', 'height', 'pre_icu_los_days', 'weight', 'apache_4a_hospital_death_prob', 'apache_4a_icu_death_prob', 'avgmaxtod1_diasbp_min', 'avgmaxtod1_heartrate_min', 'avgmaxtod1_mbp_min', 'avgmaxtod1_resprate_min', 'avgmaxtod1_spo2_min', 'avgmaxtod1_sysbp_min', 'avgmaxtod1_temp_min', 'avgmaxtoh1_diasbp_min', 'avgmaxtoh1_heartrate_min', 'avgmaxtoh1_mbp_min', 'avgmaxtoh1_resprate_min', 'avgmaxtoh1_spo2_min', 'avgmaxtoh1_sysbp_min', 'avgmaxtoh1_temp_min', 'avgmaxtod1_bun_min', 'avgmaxtod1_calcium_min', 'avgmaxtod1_creatinine_min', 'avgmaxtod1_glucose_min', 'avgmaxtod1_hco3_min', 'avgmaxtod1_hemaglobin_min', 'avgmaxtod1_hematocrit_min', 'avgmaxtod1_platelets_min', 'avgmaxtod1_potassium_min', 'avgmaxtod1_sodium_min', 'avgmaxtod1_wbc_min', 'bmi']


In [9]:
print('Possible Model Binary Features : {}'.format(model_binary_vars))

Possible Model Binary Features : ['elective_surgery', 'readmission_status', 'aids', 'cirrhosis', 'diabetes_mellitus', 'hepatic_failure', 'immunosuppression', 'leukemia', 'lymphoma', 'solid_tumor_with_metastasis']


In [10]:
print('Possible Model String Features : {}'.format(model_string_vars))

Possible Model String Features : ['ethnicity', 'gender', 'hospital_admit_source', 'icu_admit_source', 'icu_stay_type', 'icu_type', 'apache_3j_bodysystem', 'apache_2_bodysystem']


## Feature Selection

To determine the features to use in our model, two `UnivariateFeatureSelector` pipelines are constructed; one for categorical features and one for continuous features.

### Categorical Feature Selection

Next, as all of the integer variables in our model are categorical variables, they are vectorized.

In [11]:
int_var_vectorizer = VectorAssembler(inputCols=model_int_vars, outputCol='intFeatures', handleInvalid='skip')

Likewise, binary variables are also vectorized.

In [12]:
binary_var_vectorizer = VectorAssembler(inputCols=model_binary_vars, 
                                        outputCol='binaryFeatures',
                                        handleInvalid='skip')

String variables are passed to a `StringIndexer` object to convert them to integers and then to a `VectorAssembler` to process them into a feature vector for possible selection by the `UnivariateFeatureSelector`.

In [13]:
indexed_string_vars = ['{}_indexed'.format(var) for var in model_string_vars]
string_var_indexer = StringIndexer(inputCols=model_string_vars,
                                   outputCols=indexed_string_vars,
                                   handleInvalid='skip')

In [14]:
string_var_vectorizer = VectorAssembler(inputCols=string_var_indexer.getOutputCols(),
                                        outputCol='StringFeatures',
                                        handleInvalid='skip')

Next, categorical features are combined into one `VectorAssembler` and passed to a `UnivariateFeatureSelector` object in the pipeline to choose which variables are most useful for determining the outcome, `hospital_death`. As these features are categorical, and the response is categorical, the `UnivariateFeatureSelector` uses a Chi-Squared test for determining which categorical variables to include in the model.

In [15]:
cat_feature_vectorizer = VectorAssembler(inputCols=[string_var_vectorizer.getOutputCol(),
                                                    int_var_vectorizer.getOutputCol(),
                                                    string_var_vectorizer.getOutputCol()],
                                        outputCol='catFeatures',
                                        handleInvalid='skip')

In [16]:
cat_selector = UnivariateFeatureSelector(featuresCol=cat_feature_vectorizer.getOutputCol(),
                                         labelCol='hospital_death',
                                         outputCol='selectedCatFeatures').setFeatureType('categorical').setLabelType('categorical')

A piepline is then constructed and fit to the data using the created objects.

In [17]:
cat_pipeline = Pipeline(stages=[
    int_var_vectorizer,
    binary_var_vectorizer,
    string_var_indexer,
    string_var_vectorizer,
    cat_feature_vectorizer,
    cat_selector
])

In [18]:
cat_model = cat_pipeline.fit(df)

In [19]:
df = cat_model.transform(df)

Next, the features that the `UnivariateFeatureSelector` chose are shown.

In [20]:
catfeat = cat_model.stages[0].getInputCols() + cat_model.stages[1].getInputCols() + cat_model.stages[3].getInputCols() 

In [21]:
selected_cat_feats = [catfeat[i] for i in cat_model.stages[5].selectedFeatures]

In [22]:
selected_cat_feats

['elective_surgery',
 'readmission_status',
 'cirrhosis',
 'diabetes_mellitus',
 'hepatic_failure',
 'immunosuppression',
 'leukemia',
 'ethnicity_indexed',
 'gender_indexed',
 'icu_admit_source_indexed',
 'icu_stay_type_indexed',
 'icu_type_indexed',
 'aids',
 'hospital_admit_source_indexed',
 'icu_id',
 'solid_tumor_with_metastasis',
 'hospital_id',
 'lymphoma']

The string features, which must be one hot encoded for any model, are separated from the other categorical features for further processing by a `OneHotEncoder`.

In [23]:
selected_string_feats = [feat for feat in indexed_string_vars if feat in selected_cat_feats]

In [24]:
selected_string_feats

['ethnicity_indexed',
 'gender_indexed',
 'hospital_admit_source_indexed',
 'icu_admit_source_indexed',
 'icu_stay_type_indexed',
 'icu_type_indexed']

The `OneHotEncoder` object is then created for the final pipeline. It one hot encodes the features that were selected in the previous pipeline for use in future models.

In [25]:
ohe_string_vars = ['{}_ohe'.format(var) for var in selected_string_feats]
string_var_ohe = OneHotEncoder(inputCols=selected_string_feats,outputCols=ohe_string_vars)

string_var_vectorizer = VectorAssembler(inputCols=string_var_ohe.getOutputCols(),
                                        outputCol='SelectedStringFeatures',
                                        handleInvalid='skip')

The oter variables are passed to their own `VectorAssembler` and then all features are combined into a final feature vector.

In [26]:
other_var_vectorizer = VectorAssembler(inputCols=[col for col in selected_cat_feats if col not in selected_string_feats],
                                       outputCol='oheCatFeatures',
                                       handleInvalid='skip')

In [27]:
final_cat_vectorizer = VectorAssembler(inputCols=[string_var_vectorizer.getOutputCol(),
                                                  other_var_vectorizer.getOutputCol()],
                                       outputCol='FinalCatFeatures',
                                       handleInvalid='skip')

A pipeline is then constructed that is identical to the first pipeline with the exception of the inclusion of the `OneHotEncoder` for the 

In [28]:
final_cat_pipeline = Pipeline(stages=[
    string_var_ohe,
    string_var_vectorizer,
    other_var_vectorizer,
    final_cat_vectorizer
])

In [29]:
final_cat_model = final_cat_pipeline.fit(df)

In [30]:
df = final_cat_model.transform(df)

## Continuous Feature Selection

Determining the continuous features to use in the model was also accomplished using a `UnivariateFeatureSelector` object. In addition, these features are also imputed to the feature mean to remove any `Nan` from the dataset. In the case of a continuous predictor and a categorical response, the `UnivariateFeatureSelector` uses an ANOVA F-Test to determine which variables to include in the model.

In [31]:
imputed_float_vars = ['{}_imputed'.format(var) for var in model_float_vars]
float_imputer = Imputer(inputCols=model_float_vars, outputCols=imputed_float_vars)

float_var_vectorizer = VectorAssembler(inputCols=float_imputer.getOutputCols(), outputCol='floatFeatures', handleInvalid='skip')

float_MaxAbs_Scaler = MaxAbsScaler(inputCol=float_var_vectorizer.getOutputCol(),
                                   outputCol='scaledFloatFeatures')

In [32]:
cont_selector = UnivariateFeatureSelector(featuresCol=float_MaxAbs_Scaler.getOutputCol(),
                                          labelCol='hospital_death',
                                          outputCol='selectedContFeatures').setFeatureType('continuous').setLabelType('categorical')

In [33]:
cont_pipeline = Pipeline(stages=[
    float_imputer,
    float_var_vectorizer,
    float_MaxAbs_Scaler,
    cont_selector
])

The constructured piepline is then fit to the data and used to transform the data. 

In [34]:
cont_model = cont_pipeline.fit(df)

In [35]:
df = cont_model.transform(df)

The chosen continuous features are shown below:

In [36]:
[cont_model.stages[1].getInputCols()[i] for i in cont_model.stages[3].selectedFeatures]

['age_imputed',
 'pre_icu_los_days_imputed',
 'weight_imputed',
 'apache_4a_hospital_death_prob_imputed',
 'apache_4a_icu_death_prob_imputed',
 'avgmaxtod1_diasbp_min_imputed',
 'avgmaxtod1_heartrate_min_imputed',
 'avgmaxtod1_mbp_min_imputed',
 'avgmaxtod1_resprate_min_imputed',
 'avgmaxtod1_spo2_min_imputed',
 'avgmaxtod1_sysbp_min_imputed',
 'avgmaxtod1_temp_min_imputed',
 'avgmaxtoh1_diasbp_min_imputed',
 'avgmaxtoh1_heartrate_min_imputed',
 'avgmaxtoh1_mbp_min_imputed',
 'avgmaxtoh1_resprate_min_imputed',
 'avgmaxtoh1_spo2_min_imputed',
 'avgmaxtoh1_sysbp_min_imputed',
 'avgmaxtoh1_temp_min_imputed',
 'avgmaxtod1_bun_min_imputed',
 'avgmaxtod1_calcium_min_imputed',
 'avgmaxtod1_creatinine_min_imputed',
 'avgmaxtod1_glucose_min_imputed',
 'avgmaxtod1_hco3_min_imputed',
 'avgmaxtod1_hemaglobin_min_imputed',
 'avgmaxtod1_hematocrit_min_imputed',
 'avgmaxtod1_platelets_min_imputed',
 'avgmaxtod1_potassium_min_imputed',
 'avgmaxtod1_wbc_min_imputed',
 'bmi_imputed',
 'height_imputed',


The transformed dataframe, inlcuding the columns `CatFeatures` and `selectedContFeatures` is exported for use in the various models.

In [37]:
df.write.mode("overwrite").parquet("/project/ds5559/fa21-group04/data/processed_df.parquet")