## This notebook determines which features warrant retention in our models, based on the predictive value they provide

### Outline of notebook:

#### 2.1 Formatting notebook environment and importing dataset
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Function: <u>CSV_Reader()</u><br>
#### 2.2 Assessing feature correlations with patient readmissions
#### 2.3 Assessing the feature importance via Random Forest
#### 2.4 Assessing feature importance with Akaike/Bayesian Information Criterion (AIC/BIC)
#### 2.5 Assessing feature variability with Principal Component Analysis (PCA)
#### 2.6 Splitting and exporting engineered datasets to S3 for machine learning
&nbsp;&nbsp;&nbsp;&nbsp;Train/validation and test data subsets saved in csv<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Function: <u>Export_Processed_CSV()</u><br>

#### Purpose of Feature Selection
There is an inherent trade-off between the value that additional features bring to a predictive model and the added computational effort required to process the resulting increase in "dimensionality". A large amount of irrelevant data will not help with predictions, and the time spent processing it carries an opportunity cost: that time might otherwise be spent identifying a different, elusive, better-optimized model iteration.

As such, this notebook conducts a series of tests to identify features that may not provide enough inherent predictive information to justify their continued inclusion. Such variables will be removed, and the resulting filtered dataset will be exported for machine learning.

Techniques that assess both a feature's relationship with patient readmission and the degree of inherent variability contained within a feature will be employed. A small number of the worst performers from each test will be removed, provided that there is no strong subjective/contextual reason for their retention.

Feature selection involves both technical and subjective components, and as such it is not an exact science. Therefore it will not necessarily improve predictions, but offers the potential to do so. As prior testing indicated that there is some degree of overfitting in our deployed model (i.e. the model does not predict as well on unseen data as it did on its validation data), examining whether dimensionality reduction will improve model performance represents proper due diligence.

Several variations of the dataset were built, both full and partial. The combination of retained features below yielded the best eventual models.

## 2.1 Formatting notebook environment and importing dataset

#### We begin by installing any missing dependencies (uncomment if needed).

In [None]:
#pip install numexpr==2.8.0

In [None]:
#pip install --upgrade s3fs

#### Importing the necessary Libraries

In [1]:
#Data manipulation libraries
import pandas as pd
import numpy as np
import random
from sklearn.utils import shuffle

#Sagemaker and other AWS libraries
import sagemaker
import sagemaker.amazon.common as smac
import boto3
from sagemaker import get_execution_role
import os

#importing some basic machine learning libraries to test feature importance
from sklearn.ensemble import RandomForestClassifier as rfc
import statsmodels.api as sm
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

#Defining some variables for Sagemaker session
region = boto3.Session().region_name
smclient = boto3.Session().client('sagemaker')
role = get_execution_role()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


#### As this is a fairly large dataset, we will reformat the default display dimensions to reflect this.

In [2]:
#Up to 200 columns and 500 rows will be displayed
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 500)

#### The below function will read in our dataset.

In [3]:
def CSV_Reader(bucket, subfolder, source_file_name):
    
    '''The CSV_Reader() function takes in the names of the bucket, subfolder, and source file.
    It first constructs the S3 filepath, and then imports the CSV file residing at this filepath.
    
    Arguments 
    --------- 
    bucket: Head S3 repository bucket
    subfolder: Subfolder containing the source data
    source_file_name: Name of source CSV data file 
    
    Return
    ---------
    Returns the source data in a pandas dataframe '''
    
    data_location = 's3://{}/{}/{}'.format(bucket, subfolder, source_file_name)  
    dataset = pd.read_csv(data_location, low_memory=False)
    return dataset

#### Defining the S3 bucket, folder, and file names

In [4]:
bucket = 'diabetes-directory'
cleaned_diabetes = 'diabetes_processed_data'
input_file_name = 'prepared_diabetes.csv'
reduced_dimensions_diabetes_train = 'reduced_dimensions_diabetes_train.csv'
reduced_dimensions_diabetes_test = 'reduced_dimensions_diabetes_test.csv'

#### Reading our data and confirming it looks as expected

In [5]:
diabetes = CSV_Reader(bucket, cleaned_diabetes, input_file_name)

In [6]:
print(diabetes.shape)

diabetes.sample(5)

(100114, 98)


Unnamed: 0,readmitted,time_in_hospital,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,number_diagnoses,medication_metformin,medication_glipizide,medication_glyburide,medication_insulin,change,any_medication,race_AfricanAmerican,race_Asian,race_Hispanic,race_Other,age_1,age_2,age_3,age_4,age_5,age_6,age_7,age_9,age_10,gender_Male,discharge_disposition_hhealth,discharge_disposition_hospice,discharge_disposition_hospital,discharge_disposition_leftAMA,discharge_disposition_nursing,discharge_disposition_outpatient,discharge_disposition_psych,discharge_disposition_unknown,diag_1_Nothing,diag_1_blooddis,diag_1_digestive,diag_1_infection,diag_1_injury,diag_1_mentaldis,diag_1_metabolic,diag_1_musculoskeletal,diag_1_neoplasm,diag_1_nervous,diag_1_other,diag_1_pregnancy,diag_1_respiratory,diag_1_skin,diag_1_urogenital,diag_2_Nothing,diag_2_blooddis,diag_2_digestive,diag_2_infection,diag_2_injury,diag_2_mentaldis,diag_2_metabolic,diag_2_musculoskeletal,diag_2_neoplasm,diag_2_nervous,diag_2_other,diag_2_pregnancy,diag_2_respiratory,diag_2_skin,diag_2_urogenital,diag_3_Nothing,diag_3_blooddis,diag_3_digestive,diag_3_infection,diag_3_injury,diag_3_mentaldis,diag_3_metabolic,diag_3_musculoskeletal,diag_3_neoplasm,diag_3_nervous,diag_3_other,diag_3_pregnancy,diag_3_respiratory,diag_3_skin,diag_3_urogenital,max_glu_serum_>200,max_glu_serum_>300,A1Cresult_>7,A1Cresult_Norm,admission_type_id_2,admission_type_id_3,admission_type_id_4,admission_type_id_5,admission_type_id_7,admission_source_id_1,admission_source_id_2,admission_source_id_3,admission_source_id_4,admission_source_id_5,admission_source_id_6,admission_source_id_8
88590,0,4,45,2,21,0,0,0,8,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
45941,1,1,44,0,7,0,0,0,5,0,1,0,1,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
71495,0,4,49,1,17,0,0,0,6,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0
59151,0,4,38,2,1,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
18804,0,6,59,0,19,0,0,0,9,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### We separate the features (X) from the outcome column (y)

In [7]:
X = diabetes.drop(columns=['readmitted'])  # Remove the outcome column
y = diabetes['readmitted']

## 2.2 Assessing feature correlations with patient readmissions

Below we compute the Pearson (linear) correlation between each individual feature and patient readmission. We take the absolute value, as we are interested in the magnitude, not direction, of these correlations.

In [8]:
correlations = X.corrwith(y).abs().sort_values(ascending=False).reset_index()

#We rename the columns of our correlation-value-dataset for interpretability.
correlations.columns = ['feature', 'correlation with outcome']

In [9]:
#Features ranked by absolute correlation with the outcome
correlations

Unnamed: 0,feature,correlation with outcome
0,number_inpatient,0.167589
1,discharge_disposition_outpatient,0.072407
2,number_emergency,0.060787
3,number_diagnoses,0.052343
4,time_in_hospital,0.04553
5,discharge_disposition_nursing,0.042177
6,num_medications,0.040725
7,medication_insulin,0.034057
8,discharge_disposition_psych,0.029802
9,any_medication,0.02591


The bottom 20 or so variables on this list do not raise any contextual red flags (i.e. I cannot think of any reason why they would be of particular predictive or other value). As such, they will be removed.

To filter our ~100 variables down to those that correlate with patient readmission (and thereby reduce our dataset's dimensionality by focusing computation on relevant data), we will remove the features representing roughly the bottom 20% of our feature-correlation values.

In [10]:
correlations_keep = correlations['feature'].head(int(len(diabetes.columns) * 80 / 100))
print(len(correlations_keep))

78
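The ranking and cutoff steps above can be sketched on toy data. This is a minimal illustration, not the notebook's actual pipeline: the column names (`signal`, `noise_a`, ...) are hypothetical, and the cutoff here is computed from the number of ranked features, which is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Toy data: 'signal' drives the outcome, the noise columns do not (hypothetical names)
rng = np.random.default_rng(0)
n = 1000
toy_X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
    "noise_c": rng.normal(size=n),
    "noise_d": rng.normal(size=n),
})
toy_y = (toy_X["signal"] + 0.5 * rng.normal(size=n) > 0).astype(int)

# Rank features by absolute Pearson correlation with the outcome
toy_corr = toy_X.corrwith(toy_y).abs().sort_values(ascending=False).reset_index()
toy_corr.columns = ["feature", "correlation with outcome"]

# Keep the top 80% of ranked features (int() truncates any fractional count)
n_keep = int(len(toy_corr) * 80 / 100)
toy_keep = toy_corr["feature"].head(n_keep)

print(toy_corr)
print(list(toy_keep))
```

Because `int()` truncates, the share of features actually removed can be slightly more or less than 20%, depending on which count the percentage is applied to.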


In [11]:
corr_dataset = diabetes[correlations_keep]

print(corr_dataset.shape)
corr_dataset.sample(3)

(100114, 78)


Unnamed: 0,number_inpatient,discharge_disposition_outpatient,number_emergency,number_diagnoses,time_in_hospital,discharge_disposition_nursing,num_medications,medication_insulin,discharge_disposition_psych,any_medication,medication_metformin,num_lab_procedures,age_6,discharge_disposition_hospital,diag_2_neoplasm,diag_3_urogenital,number_outpatient,change,diag_1_respiratory,diag_1_injury,diag_1_Nothing,discharge_disposition_hhealth,discharge_disposition_hospice,age_9,diag_3_Nothing,diag_3_neoplasm,race_Other,diag_2_pregnancy,age_2,diag_1_pregnancy,diag_3_pregnancy,admission_source_id_1,admission_type_id_3,diag_2_skin,diag_1_musculoskeletal,max_glu_serum_>300,diag_3_skin,age_3,age_1,A1Cresult_Norm,num_procedures,diag_3_respiratory,diag_2_Nothing,admission_source_id_6,diag_3_metabolic,diag_2_blooddis,admission_source_id_4,discharge_disposition_leftAMA,A1Cresult_>7,diag_1_skin,diag_1_blooddis,age_5,medication_glyburide,diag_3_other,diag_2_musculoskeletal,admission_source_id_3,diag_2_urogenital,max_glu_serum_>200,admission_source_id_8,diag_1_other,admission_type_id_5,diag_1_digestive,diag_3_injury,diag_3_digestive,diag_1_neoplasm,diag_1_metabolic,admission_type_id_7,diag_2_respiratory,diag_1_infection,diag_2_infection,diag_1_mentaldis,race_Hispanic,admission_source_id_2,admission_source_id_5,discharge_disposition_unknown,race_Asian,gender_Male,diag_1_urogenital
80870,1,0,0,9,5,0,21,0,0,1,0,69,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
86104,0,0,0,8,3,0,28,0,0,1,1,32,1,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0
57752,0,0,0,5,6,1,18,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0


## 2.3 Assessing the feature importance via Random Forest

Similar to our linear correlation values above, running a basic random forest on the remaining variables lets us determine the relative degree to which they influence predictions.

In [12]:
#We are able to call this model in from sklearn, and do not need to call a model container.
#A separate variable name is used so that the imported rfc class is not overwritten (which would break re-running this cell).
rf_model = rfc(n_estimators=200, min_samples_split=5, min_samples_leaf=1, max_features='sqrt', max_depth=60)

#Fits a random forest on the dataset for the purpose of feature importance:
rf_model.fit(corr_dataset, y)

#### Now we can list out the importance of each feature:

In [13]:
FIAllDummyAllFeatures = pd.DataFrame({'feature':pd.Series(corr_dataset.columns), 'importance':rf_model.feature_importances_})

FIAllDummyAllFeatures = FIAllDummyAllFeatures.sort_values('importance', ascending=False)
FIAllDummyAllFeatures

Unnamed: 0,feature,importance
11,num_lab_procedures,0.11586
6,num_medications,0.098327
4,time_in_hospital,0.066531
0,number_inpatient,0.061482
40,num_procedures,0.046642
3,number_diagnoses,0.041968
16,number_outpatient,0.02531
2,number_emergency,0.022625
76,gender_Male,0.02053
31,admission_source_id_1,0.016047
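To illustrate how the importance table above is produced, here is a minimal sketch on toy data. The feature names and the small forest settings are assumptions chosen for speed, not the notebook's configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy data: only 'informative' determines the outcome (hypothetical names)
rng = np.random.default_rng(0)
n = 2000
toy_X = pd.DataFrame({
    "informative": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
toy_y = (toy_X["informative"] > 0).astype(int)

# Fit a small forest; random_state is fixed so the sketch is repeatable
toy_forest = RandomForestClassifier(n_estimators=50, random_state=0)
toy_forest.fit(toy_X, toy_y)

# Impurity-based importances are normalized to sum to 1 across features
toy_importance = pd.DataFrame({
    "feature": toy_X.columns,
    "importance": toy_forest.feature_importances_,
}).sort_values("importance", ascending=False)

print(toy_importance)
```

One caveat worth keeping in mind when reading such tables: impurity-based importances tend to favor features with many distinct values, so continuous counts can rank above binary indicators partly for that reason.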


#### As none of the variables within the bottom 10-15 of this feature importance list stand out as contextually important, we will remove those among the lowest 15% of feature importance values.

In [14]:
FI_keep = FIAllDummyAllFeatures['feature'].head(int(len(FIAllDummyAllFeatures) * 85 / 100))
print(len(FI_keep))

66


In [15]:
FI_dataset = corr_dataset[FI_keep]

print(FI_dataset.shape)
FI_dataset.sample(3)

(100114, 66)


Unnamed: 0,num_lab_procedures,num_medications,time_in_hospital,number_inpatient,num_procedures,number_diagnoses,number_outpatient,number_emergency,gender_Male,admission_source_id_1,change,diag_3_Nothing,age_9,diag_2_Nothing,medication_insulin,diag_2_respiratory,admission_type_id_3,diag_2_urogenital,age_6,medication_glyburide,diag_3_respiratory,diag_3_metabolic,medication_metformin,discharge_disposition_hhealth,diag_1_digestive,diag_1_respiratory,diag_1_injury,age_5,discharge_disposition_outpatient,diag_3_urogenital,discharge_disposition_nursing,admission_type_id_5,diag_3_injury,diag_1_Nothing,any_medication,diag_1_urogenital,A1Cresult_Norm,diag_3_digestive,discharge_disposition_unknown,A1Cresult_>7,diag_2_skin,admission_source_id_8,diag_3_other,discharge_disposition_hospital,race_Other,diag_1_metabolic,diag_1_other,diag_3_skin,diag_1_infection,diag_1_musculoskeletal,diag_2_neoplasm,diag_1_neoplasm,admission_source_id_4,discharge_disposition_hospice,diag_2_blooddis,diag_1_mentaldis,diag_2_infection,diag_1_skin,race_Hispanic,diag_3_neoplasm,diag_2_musculoskeletal,diag_1_blooddis,admission_source_id_6,age_3,max_glu_serum_>300,max_glu_serum_>200
59249,1,11,2,0,3,9,0,0,1,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
78837,54,14,2,0,0,4,0,0,1,0,1,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
53312,28,11,3,0,1,9,0,0,1,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## 2.4 Assessing feature importance with Akaike/Bayesian Information Criterion (AIC/BIC)

The code below implements a series of stepwise actions: at each step it fits an ordinary least squares (OLS) model, removes the feature with the highest p-value (i.e. the least significant feature), and reports the AIC of the refitted model. A decrease in the AIC value after a removal indicates that the feature in question had relatively low impact on predictions, and therefore might be a good choice for removal.

In [16]:
selected_features = list(FI_dataset.columns)
max_iterations = 20  # Set a maximum number of iterations

for iteration in range(1, max_iterations + 1):
    # Add a constant term to the features (required for statsmodels)
    X = sm.add_constant(FI_dataset[selected_features])
    
    # Fit the ordinary least squares (OLS) model
    model = sm.OLS(y, X).fit()
    
    # Get AIC value
    current_aic = model.aic
    
    # Identify the least significant feature
    least_significant_feature = model.pvalues[1:].idxmax()  # Exclude 'const' term
    
    # Remove the least significant feature
    selected_features.remove(least_significant_feature)
    
    # Check if only 'const' term is left
    if len(selected_features) == 0:
        break
    
    # Fit the model again after removing the feature
    model = sm.OLS(y, sm.add_constant(FI_dataset[selected_features])).fit()
    
    # Calculate the new AIC
    new_aic = model.aic
    
    # Print progress
    print(f"Iteration {iteration}: AIC = {new_aic}, Removed Feature: {least_significant_feature}")
    
    # Check for convergence based on a change in AIC
    if abs(new_aic - current_aic) < 1e-4:
        print("Converged.")
        break

# Display the selected features
print("\nSelected Features:", selected_features)


Iteration 1: AIC = 50121.45041946409, Removed Feature: age_9
Iteration 2: AIC = 50119.45197114616, Removed Feature: change
Iteration 3: AIC = 50117.453896325605, Removed Feature: race_Hispanic
Iteration 4: AIC = 50115.45949997238, Removed Feature: diag_1_blooddis
Iteration 5: AIC = 50113.46554896209, Removed Feature: admission_source_id_6
Iteration 6: AIC = 50111.532109878346, Removed Feature: medication_insulin
Iteration 7: AIC = 50109.76871432981, Removed Feature: time_in_hospital
Iteration 8: AIC = 50108.06526256996, Removed Feature: diag_1_metabolic
Iteration 9: AIC = 50106.38562072988, Removed Feature: admission_type_id_5
Iteration 10: AIC = 50104.89057352749, Removed Feature: diag_2_musculoskeletal
Iteration 11: AIC = 50103.449893821584, Removed Feature: diag_2_urogenital
Iteration 12: AIC = 50102.12515987831, Removed Feature: diag_3_skin
Iteration 13: AIC = 50100.94775154319, Removed Feature: diag_2_infection
Iteration 14: AIC = 50099.78553381152, Removed Feature: medication_gly
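For comparison, a variant of this procedure lets the AIC itself drive each removal, stopping once no removal lowers it. The sketch below is not the notebook's method (which removes by p-value). It uses NumPy only, with hypothetical helper names `ols_aic` and `backward_eliminate`, and assumes Gaussian errors for the log-likelihood.

```python
import numpy as np

def ols_aic(X, y):
    """AIC of a Gaussian OLS fit with an intercept: 2k - 2*loglik."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    k = design.shape[1]  # number of coefficients, including the intercept
    return 2 * k - 2 * loglik

def backward_eliminate(X, y, names):
    """Greedily drop the feature whose removal lowers AIC most; stop when none does."""
    kept = list(range(X.shape[1]))
    best = ols_aic(X[:, kept], y)
    while len(kept) > 1:
        trials = [(ols_aic(X[:, [j for j in kept if j != i]], y), i) for i in kept]
        trial_aic, drop = min(trials)
        if trial_aic >= best:
            break  # no single removal improves the AIC
        best = trial_aic
        kept.remove(drop)
    return [names[i] for i in kept], best

# Toy data: the outcome depends on the first two columns only
rng = np.random.default_rng(0)
n = 500
X_toy = rng.normal(size=(n, 4))
y_toy = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(size=n)

kept_names, final_aic = backward_eliminate(X_toy, y_toy, ["x0", "x1", "noise_a", "noise_b"])
print(kept_names, round(final_aic, 2))
```

The trade-off is cost: this variant refits one model per candidate feature at every step, whereas removing by p-value needs only a single fit per step.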

In [17]:
import statsmodels.api as sm

selected_features = list(FI_dataset.columns)
max_iterations = 20  # Set a maximum number of iterations

for iteration in range(1, max_iterations + 1):
    # Add a constant term to the features (required for statsmodels)
    X = sm.add_constant(FI_dataset[selected_features])
    
    # Fit the ordinary least squares (OLS) model
    model = sm.OLS(y, X).fit()
    
    # Get BIC value
    current_bic = model.bic
    
    # Identify the least significant feature
    least_significant_feature = model.pvalues[1:].idxmax()  # Exclude 'const' term
    
    # Remove the least significant feature
    selected_features.remove(least_significant_feature)
    
    # Check if only 'const' term is left
    if len(selected_features) == 0:
        break
    
    # Fit the model again after removing the feature
    model = sm.OLS(y, sm.add_constant(FI_dataset[selected_features])).fit()
    
    # Calculate the new BIC
    new_bic = model.bic
    
    # Print progress
    print(f"Iteration {iteration}: BIC = {new_bic}, Removed Feature: {least_significant_feature}")
    
    # Check for convergence based on a change in BIC
    if abs(new_bic - current_bic) < 1e-4:
        print("Converged.")
        break

# Display the selected features
print("\nSelected Features:", selected_features)


Iteration 1: BIC = 50749.37869729789, Removed Feature: age_9
Iteration 2: BIC = 50737.8661841643, Removed Feature: change
Iteration 3: BIC = 50726.35404452808, Removed Feature: race_Hispanic
Iteration 4: BIC = 50714.84558335919, Removed Feature: diag_1_blooddis
Iteration 5: BIC = 50703.33756753324, Removed Feature: admission_source_id_6
Iteration 6: BIC = 50691.89006363383, Removed Feature: medication_insulin
Iteration 7: BIC = 50680.61260326963, Removed Feature: time_in_hospital
Iteration 8: BIC = 50669.395086694116, Removed Feature: diag_1_metabolic
Iteration 9: BIC = 50658.20138003837, Removed Feature: admission_type_id_5
Iteration 10: BIC = 50647.19226802032, Removed Feature: diag_2_musculoskeletal
Iteration 11: BIC = 50636.23752349875, Removed Feature: diag_2_urogenital
Iteration 12: BIC = 50625.39872473981, Removed Feature: diag_3_skin
Iteration 13: BIC = 50614.70725158903, Removed Feature: diag_2_infection
Iteration 14: BIC = 50604.03096904169, Removed Feature: medication_glybur

Because both loops remove the feature with the highest p-value at each step, the AIC and BIC runs recommend dropping the same variables in the same order. However, some of these variables have relatively high correlation/feature importance relative to others on the list (e.g. change in medication, time in hospital).

Therefore, rather than dropping the top values, we will selectively choose 5 variables from this list that are near the top, but do not demonstrate a standout relationship with the outcome variable.
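The steady per-iteration decreases in the AIC and BIC logs above follow from how the two criteria are defined. As a brief worked check, let $k$ be the number of estimated coefficients (including the constant), $n$ the number of observations, and $\hat{L}$ the maximized likelihood:

$$
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}
$$

Removing one feature that barely changes the likelihood lowers $k$ by one, so

$$
\Delta\mathrm{AIC} \approx -2, \qquad \Delta\mathrm{BIC} \approx -\ln n = -\ln(100114) \approx -11.51
$$

This matches the printed values (e.g. AIC 50121.45 to 50119.45, and BIC 50749.38 to 50737.87, between iterations 1 and 2), which suggests that the earliest-removed features contributed almost nothing to the fit.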

In [18]:
AIC_BIC_reduced = FI_dataset.drop(['diag_1_blooddis', 'admission_source_id_6', 'diag_1_metabolic', 'admission_type_id_5', 'diag_2_musculoskeletal'], axis=1)

print(AIC_BIC_reduced.shape)
AIC_BIC_reduced.sample(3)

(100114, 61)


Unnamed: 0,num_lab_procedures,num_medications,time_in_hospital,number_inpatient,num_procedures,number_diagnoses,number_outpatient,number_emergency,gender_Male,admission_source_id_1,change,diag_3_Nothing,age_9,diag_2_Nothing,medication_insulin,diag_2_respiratory,admission_type_id_3,diag_2_urogenital,age_6,medication_glyburide,diag_3_respiratory,diag_3_metabolic,medication_metformin,discharge_disposition_hhealth,diag_1_digestive,diag_1_respiratory,diag_1_injury,age_5,discharge_disposition_outpatient,diag_3_urogenital,discharge_disposition_nursing,diag_3_injury,diag_1_Nothing,any_medication,diag_1_urogenital,A1Cresult_Norm,diag_3_digestive,discharge_disposition_unknown,A1Cresult_>7,diag_2_skin,admission_source_id_8,diag_3_other,discharge_disposition_hospital,race_Other,diag_1_other,diag_3_skin,diag_1_infection,diag_1_musculoskeletal,diag_2_neoplasm,diag_1_neoplasm,admission_source_id_4,discharge_disposition_hospice,diag_2_blooddis,diag_1_mentaldis,diag_2_infection,diag_1_skin,race_Hispanic,diag_3_neoplasm,age_3,max_glu_serum_>300,max_glu_serum_>200
76711,2,8,1,9,1,6,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
54799,36,3,10,0,2,9,0,0,1,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
23696,66,15,6,0,0,9,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## 2.5 Assessing feature variability with Principal Component Analysis (PCA)

PCA is a technique to reduce the dimensionality of a dataset while maintaining as much of the data's variance as possible. We will fit a PCA model, and then assess a variance score for each feature. Features with low variance scores represent candidates for removal from the model.

#### Standardizing our features 

In [19]:
scaler = StandardScaler()
diabetes_scaled = scaler.fit_transform(AIC_BIC_reduced)

#### Fitting the PCA to our data

In [20]:
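# Fit PCA on the feature matrix (PCA is unsupervised, so the target variable is not used)
# Note: this fits on the unscaled AIC_BIC_reduced; pass diabetes_scaled to use the standardized features
pca = PCA(n_components=AIC_BIC_reduced.shape[1])  # Use all features initially
pca.fit_transform(AIC_BIC_reduced)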

array([[-3.21357661e+01,  1.33532645e+00, -1.02267207e+00, ...,
        -3.25866086e-02,  3.63990402e-02, -2.66300792e-03],
       [ 2.68320249e+01,  9.69509192e+00,  5.62068164e+00, ...,
        -2.58094460e-02,  1.95709802e-03, -8.24111830e-03],
       [ 4.41927789e+00,  1.29621911e+00,  4.39065164e+00, ...,
         2.36685169e-02,  2.57625691e-02, -1.73481457e-02],
       ...,
       [-9.50791414e+00,  3.65319681e+00,  8.42307553e-01, ...,
         5.52127406e-02, -1.93517293e-02, -1.47793568e-02],
       [ 3.01584872e+00,  5.64943971e+00,  4.81942717e+00, ...,
        -4.26342759e-02, -2.79282670e-03, -2.00809684e-02],
       [-3.12112938e+01, -8.05084181e+00,  4.94891013e+00, ...,
         5.20392978e-02, -6.50812152e-03, -3.85866908e-03]])

#### Compiling variance scores from PCA model

In [21]:
# Take the absolute loading of each feature on each principal component
pca_feature_variances = abs(pca.components_)

# Sum the absolute loadings across all components to get one score per feature
pca_cumulative_variances = pca_feature_variances.sum(axis=0)

#### Creating a dataset detailing feature names and variance scores

In [22]:
#Compiling elements into dataset
pca_features = list(zip(AIC_BIC_reduced.columns, pca_cumulative_variances))
#Assigning column names
pca_features_df = pd.DataFrame(pca_features, columns = ["feature", "importance_value"]) 
#Sorting by variance
pca_sorted_df = pca_features_df.sort_values(by='importance_value', ascending=False)

print(pca_sorted_df.shape)

(61, 2)


In [23]:
pca_sorted_df

Unnamed: 0,feature,importance_value
32,diag_1_Nothing,4.859158
26,diag_1_injury,4.614685
24,diag_1_digestive,4.556351
47,diag_1_musculoskeletal,4.538388
31,diag_3_injury,4.433842
49,diag_1_neoplasm,4.353279
15,diag_2_respiratory,4.327196
30,discharge_disposition_nursing,4.268294
39,diag_2_skin,4.261847
29,diag_3_urogenital,4.22972


As we can see, some of the lowest-variance features (which we might consider dropping, as low-variance features tend not to be of much predictive value) are also the variables that correlate most highly with our outcome variable (e.g. num_lab_procedures, num_medications). Therefore, we will opt to keep all of our variables at this point. However, it is good to know that there are few, if any, irrelevant and low-variance features remaining in our dataset.
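PCA is sensitive to feature scale, which is why standardization matters when reading loadings. The sketch below, on toy data with hypothetical column names, compares a PCA fit on raw values with one fit on standardized values.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy data: one wide-range count column and two 0/1 indicators (hypothetical names)
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "count_like": rng.poisson(lam=40, size=n).astype(float),
    "flag_a": rng.integers(0, 2, size=n).astype(float),
    "flag_b": rng.integers(0, 2, size=n).astype(float),
})

# Fit once on the raw values and once on standardized values
raw_pca = PCA(n_components=toy.shape[1]).fit(toy)
scaled_pca = PCA(n_components=toy.shape[1]).fit(StandardScaler().fit_transform(toy))

# Share of total variance captured by each component
print(raw_pca.explained_variance_ratio_.round(3))
print(scaled_pca.explained_variance_ratio_.round(3))

# Absolute loading of each feature on the first raw component
raw_loadings = pd.Series(np.abs(raw_pca.components_[0]), index=toy.columns)
print(raw_loadings.round(3))
```

On raw values the wide-range column dominates the first component, while after standardization the variance is spread far more evenly. Loading-based scores are therefore best compared across features only when the inputs share a common scale.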

## 2.6 Splitting and exporting engineered datasets to S3 for machine learning

Here we filter our original dataset by the columns we have selected to retain (i.e., those remaining after steps 2.2 through 2.4).

In [24]:
reduced_dataset = pd.concat([diabetes['readmitted'], AIC_BIC_reduced], axis=1)
print(reduced_dataset.shape)
reduced_dataset.sample()

(100114, 62)


Unnamed: 0,readmitted,num_lab_procedures,num_medications,time_in_hospital,number_inpatient,num_procedures,number_diagnoses,number_outpatient,number_emergency,gender_Male,admission_source_id_1,change,diag_3_Nothing,age_9,diag_2_Nothing,medication_insulin,diag_2_respiratory,admission_type_id_3,diag_2_urogenital,age_6,medication_glyburide,diag_3_respiratory,diag_3_metabolic,medication_metformin,discharge_disposition_hhealth,diag_1_digestive,diag_1_respiratory,diag_1_injury,age_5,discharge_disposition_outpatient,diag_3_urogenital,discharge_disposition_nursing,diag_3_injury,diag_1_Nothing,any_medication,diag_1_urogenital,A1Cresult_Norm,diag_3_digestive,discharge_disposition_unknown,A1Cresult_>7,diag_2_skin,admission_source_id_8,diag_3_other,discharge_disposition_hospital,race_Other,diag_1_other,diag_3_skin,diag_1_infection,diag_1_musculoskeletal,diag_2_neoplasm,diag_1_neoplasm,admission_source_id_4,discharge_disposition_hospice,diag_2_blooddis,diag_1_mentaldis,diag_2_infection,diag_1_skin,race_Hispanic,diag_3_neoplasm,age_3,max_glu_serum_>300,max_glu_serum_>200
86792,0,47,7,3,0,0,9,0,0,1,1,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### Now, we will split this data into an 85% train/validation dataset, and a residual 15% holdout test dataset for our final evaluation.

In [25]:
# Set a seed for reproducibility
np.random.seed(42)

# Create an array of random indices
random_indices = np.random.permutation(reduced_dataset.index)

# Determine the size of the training set (e.g., 85%)
train_size = int(0.85 * len(reduced_dataset))

# Assign train/test labels based on the random indices
reduced_dataset['split'] = 'test'
reduced_dataset.loc[random_indices[:train_size], 'split'] = 'train'

print(reduced_dataset.shape)
reduced_dataset.sample(3)

(100114, 63)


Unnamed: 0,readmitted,num_lab_procedures,num_medications,time_in_hospital,number_inpatient,num_procedures,number_diagnoses,number_outpatient,number_emergency,gender_Male,admission_source_id_1,change,diag_3_Nothing,age_9,diag_2_Nothing,medication_insulin,diag_2_respiratory,admission_type_id_3,diag_2_urogenital,age_6,medication_glyburide,diag_3_respiratory,diag_3_metabolic,medication_metformin,discharge_disposition_hhealth,diag_1_digestive,diag_1_respiratory,diag_1_injury,age_5,discharge_disposition_outpatient,diag_3_urogenital,discharge_disposition_nursing,diag_3_injury,diag_1_Nothing,any_medication,diag_1_urogenital,A1Cresult_Norm,diag_3_digestive,discharge_disposition_unknown,A1Cresult_>7,diag_2_skin,admission_source_id_8,diag_3_other,discharge_disposition_hospital,race_Other,diag_1_other,diag_3_skin,diag_1_infection,diag_1_musculoskeletal,diag_2_neoplasm,diag_1_neoplasm,admission_source_id_4,discharge_disposition_hospice,diag_2_blooddis,diag_1_mentaldis,diag_2_infection,diag_1_skin,race_Hispanic,diag_3_neoplasm,age_3,max_glu_serum_>300,max_glu_serum_>200,split
74186,0,32,19,1,1,6,5,0,0,0,1,1,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,train
55145,0,2,13,7,1,1,9,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,train
87162,0,72,29,6,0,3,9,0,0,1,0,1,0,0,0,1,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,train


#### Splitting the datasets and removing the "split" column.

In [26]:
reduced_dataset_train = reduced_dataset.loc[reduced_dataset['split'] == "train",].drop(["split"], axis=1)
reduced_dataset_test = reduced_dataset.loc[reduced_dataset['split'] == "test",].drop(["split"], axis=1)

In [27]:
print(reduced_dataset_test.shape)
print(reduced_dataset_train.shape)

(15018, 62)
(85096, 62)
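The shapes above follow from the truncation in `int(0.85 * 100114)`, which gives 85096 training rows and leaves 15018 for testing. The same permutation-based split can be sketched on a toy dataframe (hypothetical columns) to confirm that the two subsets are disjoint and together cover every row.

```python
import numpy as np
import pandas as pd

# Toy dataset with 20 rows (hypothetical columns)
toy = pd.DataFrame({"readmitted": [0, 1] * 10, "feature": range(20)})

# Shuffle the row labels reproducibly, then mark the first 85% as train
np.random.seed(42)
shuffled_index = np.random.permutation(toy.index)
train_size = int(0.85 * len(toy))

toy["split"] = "test"
toy.loc[shuffled_index[:train_size], "split"] = "train"

# Separate the two subsets and drop the helper column
toy_train = toy.loc[toy["split"] == "train"].drop(columns=["split"])
toy_test = toy.loc[toy["split"] == "test"].drop(columns=["split"])

print(toy_train.shape, toy_test.shape)
```

A purely random split like this does not guarantee that the readmission rate is identical in both subsets. Whether that matters here is a judgment call, given the size of the dataset.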


In [28]:
reduced_dataset_train.sample()

Unnamed: 0,readmitted,num_lab_procedures,num_medications,time_in_hospital,number_inpatient,num_procedures,number_diagnoses,number_outpatient,number_emergency,gender_Male,admission_source_id_1,change,diag_3_Nothing,age_9,diag_2_Nothing,medication_insulin,diag_2_respiratory,admission_type_id_3,diag_2_urogenital,age_6,medication_glyburide,diag_3_respiratory,diag_3_metabolic,medication_metformin,discharge_disposition_hhealth,diag_1_digestive,diag_1_respiratory,diag_1_injury,age_5,discharge_disposition_outpatient,diag_3_urogenital,discharge_disposition_nursing,diag_3_injury,diag_1_Nothing,any_medication,diag_1_urogenital,A1Cresult_Norm,diag_3_digestive,discharge_disposition_unknown,A1Cresult_>7,diag_2_skin,admission_source_id_8,diag_3_other,discharge_disposition_hospital,race_Other,diag_1_other,diag_3_skin,diag_1_infection,diag_1_musculoskeletal,diag_2_neoplasm,diag_1_neoplasm,admission_source_id_4,discharge_disposition_hospice,diag_2_blooddis,diag_1_mentaldis,diag_2_infection,diag_1_skin,race_Hispanic,diag_3_neoplasm,age_3,max_glu_serum_>300,max_glu_serum_>200
40790,0,62,18,2,0,0,5,0,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [29]:
reduced_dataset_test.sample()

Unnamed: 0,readmitted,num_lab_procedures,num_medications,time_in_hospital,number_inpatient,num_procedures,number_diagnoses,number_outpatient,number_emergency,gender_Male,admission_source_id_1,change,diag_3_Nothing,age_9,diag_2_Nothing,medication_insulin,diag_2_respiratory,admission_type_id_3,diag_2_urogenital,age_6,medication_glyburide,diag_3_respiratory,diag_3_metabolic,medication_metformin,discharge_disposition_hhealth,diag_1_digestive,diag_1_respiratory,diag_1_injury,age_5,discharge_disposition_outpatient,diag_3_urogenital,discharge_disposition_nursing,diag_3_injury,diag_1_Nothing,any_medication,diag_1_urogenital,A1Cresult_Norm,diag_3_digestive,discharge_disposition_unknown,A1Cresult_>7,diag_2_skin,admission_source_id_8,diag_3_other,discharge_disposition_hospital,race_Other,diag_1_other,diag_3_skin,diag_1_infection,diag_1_musculoskeletal,diag_2_neoplasm,diag_1_neoplasm,admission_source_id_4,discharge_disposition_hospice,diag_2_blooddis,diag_1_mentaldis,diag_2_infection,diag_1_skin,race_Hispanic,diag_3_neoplasm,age_3,max_glu_serum_>300,max_glu_serum_>200
44472,0,43,17,4,0,0,5,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Everything looks good, so now we can export the train/validation and test versions of our dataset for model development, each with a header row.

In [30]:
def Export_Processed_CSV(bucket, processed_data_folder, local_file_name, S3_file_name, header_presence):
        
    '''Exports a dataframe into CSV format, and sends to a specified S3 bucket location
    
    Arguments 
    --------- 
    bucket: Head S3 repository bucket
    processed_data_folder: The relevant subfolder within the main bucket
    local_file_name: The dataframe within the notebook to be exported
    S3_file_name: The name of the file upon export (with .csv extension included)
    header_presence: Whether or not a header will be present within the exported csv
    
    Return
    ---------
    Exports a csv file to a specified S3 location'''
    
    local_file_name.to_csv(S3_file_name, index=False, header=header_presence)
    boto3.Session().resource('s3').Bucket(bucket).Object('{}/{}'.format(processed_data_folder, S3_file_name)).upload_file(S3_file_name)
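The effect of the `header_presence` argument can be previewed without touching S3. This is a small sketch that writes a toy dataframe to in-memory buffers (the dataframe and its values are made up for illustration).

```python
import io
import pandas as pd

# Toy dataframe standing in for the exported dataset
toy = pd.DataFrame({"readmitted": [0, 1], "num_medications": [7, 12]})

# header=True writes the column names as the first line
with_header = io.StringIO()
toy.to_csv(with_header, index=False, header=True)

# header=False writes data rows only
without_header = io.StringIO()
toy.to_csv(without_header, index=False, header=False)

print(with_header.getvalue())
print(without_header.getvalue())
```

Whether a header row is wanted depends on what consumes the file downstream, so it is worth checking the expected input format before exporting.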

#### Exporting CSVs

In [31]:
Export_Processed_CSV(bucket, cleaned_diabetes, reduced_dataset_train, reduced_dimensions_diabetes_train, True)
Export_Processed_CSV(bucket, cleaned_diabetes, reduced_dataset_test, reduced_dimensions_diabetes_test, True)

Please continue to the third script in this repository: 3. Model_Tuning_Deployment_Predictions