## Part III: Feature Engineering and Data Preparation

#### Setup Environment

In [2]:
%run environment-setup.ipynb

Stored 's3_datalake_path_csv' (str)
Stored 'local_data_path_csv' (str)
Stored 's3_datalake_path_parquet' (str)


In [49]:
# import additional libs needed
from sklearn.preprocessing import StandardScaler

In [3]:
# load the cleaned datset from Athena/S3
sepsis_dataset = load_clean_dataset()

2024-11-14 23:01:40,560	INFO worker.py:1786 -- Started a local Ray instance.


### Data Transformation

The dataset is cleaned and complete, however additional work is still required to prepare for modeling.  In this section, the following steps will be taken:

-  Encode categorical features
-  Transform the time series data into patient time series sequences
-  Split dataset: the dataset will be split in to train/val/test sets
-  Normalize dataset: the dataset will be normalized using a standard scaler

In [4]:
# one hot encode the sex feature (M/F)
one_hot = pd.get_dummies(sepsis_dataset['gender'], prefix='gender', dtype='int')

# Join the encoded df
sepsis_dataset_encoded = sepsis_dataset.drop('gender',axis = 1)
sepsis_dataset_encoded = sepsis_dataset.join(one_hot)
sepsis_dataset_encoded

Unnamed: 0,patient_id,hour,sepsislabel,hr,o2sat,temp,sbp,map,dbp,resp,...,creatinine_lag,glucose_lag,lactate_lag,hct_lag,bun_lag,potassium_lag,magnesium_lag,calcium_lag,gender_0,gender_1
0,17072,0,0,65.0,100.0,35.78,129.0,72.0,69.0,16.5,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,1,0
1,17072,1,0,65.0,100.0,35.78,129.0,72.0,69.0,16.5,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,1,0
2,17072,2,0,78.0,100.0,35.78,129.0,42.5,69.0,16.5,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,1,0
3,17072,7,0,68.0,100.0,35.78,142.0,93.5,78.0,16.0,...,3.0,3.0,2.0,3.0,3.0,3.0,3.0,3.0,1,0
4,17072,8,0,71.0,100.0,35.78,121.0,74.0,91.0,14.0,...,4.0,4.0,3.0,4.0,4.0,4.0,4.0,4.0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1181709,104763,33,0,81.0,98.0,36.80,122.0,71.0,53.0,18.0,...,5.0,5.0,132.0,5.0,5.0,5.0,5.0,5.0,1,0
1181710,104763,34,0,80.0,98.0,36.80,119.0,66.0,47.0,17.0,...,6.0,6.0,133.0,6.0,6.0,6.0,6.0,6.0,1,0
1181711,104763,35,0,80.0,100.0,36.70,113.0,67.0,52.0,12.0,...,7.0,0.0,134.0,7.0,7.0,7.0,7.0,7.0,1,0
1181712,104763,36,0,80.0,100.0,36.70,111.0,68.0,54.0,16.0,...,8.0,1.0,135.0,8.0,8.0,8.0,8.0,8.0,1,0


#### Transform Dataset into Patient-Level Time-Series

Currently, our data is formatted in a row per time step - so the time series for a given paitent would have [x] row entries corresponding to the duration of their time series, [x].  For modeling, this needs to be converted to sequences for each patient.  The sequence will be a single row, with one column per time step (there will be total time steps of LOOKBACK_WINDOW + PREDICTION_HORIZON).  In each column will be a vector of the variables for that patient at that time step.

In [5]:
# set target sequence length for each patient   
target_sequence_length = LOOKBACK_WINDOW + PREDICTION_HORIZON

In [75]:
# helper to filter patient time series to most recent (LOOKBACK_WINDOW + PREDICTION_HORIZON) samples
def truncate_patient_time_series(grouped_df):
  # don't include the positive sepsis time steps - we want to preict 6 hours before
  grouped_df_filtered = grouped_df[grouped_df['sepsislabel'] == 0]

  # filter to get the most recent 
  grouped_df_filtered = grouped_df[grouped_df['hour'] > (max(grouped_df['hour']) - target_sequence_length)]
  grouped_df_filtered = grouped_df_filtered[grouped_df_filtered['hour'] <= (max(grouped_df_filtered['hour']) - PREDICTION_HORIZON)]
  grouped_df_filtered['hour'] = grouped_df_filtered['hour'] - min(grouped_df_filtered['hour'])
  return grouped_df_filtered

# Execute grouping and sequence truncation
ts_limited_sepsis_data = sepsis_dataset_encoded.groupby('patient_id').apply(truncate_patient_time_series).reset_index(drop=True)
ts_limited_sepsis_data

  ts_limited_sepsis_data = sepsis_dataset_encoded.groupby('patient_id').apply(truncate_patient_time_series).reset_index(drop=True)


Unnamed: 0,patient_id,hour,sepsislabel,hr,o2sat,temp,sbp,map,dbp,resp,...,creatinine_lag,glucose_lag,lactate_lag,hct_lag,bun_lag,potassium_lag,magnesium_lag,calcium_lag,gender_0,gender_1
0,1,0,0,108.0,87.0,36.67,149.0,89.67,63.995019,30.0,...,8.0,8.0,123.0,8.0,8.0,8.0,8.0,8.0,1,0
1,1,1,0,107.0,90.0,36.67,156.0,96.67,63.995019,26.0,...,9.0,9.0,124.0,9.0,9.0,9.0,9.0,9.0,1,0
2,1,2,0,104.0,91.0,36.67,168.0,141.33,63.995019,29.0,...,10.0,10.0,125.0,10.0,10.0,10.0,10.0,10.0,1,0
3,1,3,0,102.0,88.0,36.50,146.0,90.67,63.995019,27.0,...,11.0,11.0,126.0,11.0,11.0,11.0,11.0,11.0,1,0
4,1,4,0,106.0,91.0,36.50,137.0,75.67,63.995019,25.0,...,12.0,12.0,127.0,12.0,12.0,12.0,12.0,12.0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
585715,120000,19,0,74.0,97.0,36.70,113.0,83.00,63.000000,18.0,...,5.0,0.0,123.0,5.0,5.0,5.0,5.0,5.0,1,0
585716,120000,20,0,72.0,98.0,36.60,116.0,88.00,68.000000,16.0,...,6.0,1.0,124.0,6.0,6.0,6.0,6.0,6.0,1,0
585717,120000,21,0,74.0,98.0,36.60,118.0,88.00,72.000000,18.0,...,7.0,2.0,125.0,7.0,7.0,7.0,7.0,7.0,1,0
585718,120000,22,0,82.0,97.0,36.60,120.0,82.00,66.000000,16.0,...,8.0,3.0,126.0,8.0,8.0,8.0,8.0,8.0,1,0


In [76]:
# narrow down our columns to just the variables
feature_cols = ts_limited_sepsis_data.columns.to_list()
ignore_cols = ['patient_id', 'hour', 'sepsislabel', 'gender']
feature_cols = [x for x in feature_cols if x not in ignore_cols]

In [86]:
feature_cols

['hr',
 'o2sat',
 'temp',
 'sbp',
 'map',
 'dbp',
 'resp',
 'wbc',
 'platelets',
 'creatinine',
 'glucose',
 'lactate',
 'hct',
 'bun',
 'potassium',
 'magnesium',
 'calcium',
 'age',
 'hospadmtime',
 'iculos',
 'wbc_lag',
 'platelets_lag',
 'creatinine_lag',
 'glucose_lag',
 'lactate_lag',
 'hct_lag',
 'bun_lag',
 'potassium_lag',
 'magnesium_lag',
 'calcium_lag',
 'gender_0',
 'gender_1']

In [33]:
# Helper function to perform vectorization of features at each time step
def get_patient_feature_vector(row):
  vector = []
  for col in feature_cols:
    vector.append(row[col])
  return vector

# test on a few samples
ts_limited_sepsis_data.head().apply(get_patient_feature_vector, axis=1)

0    [[108.0, 87.0, 36.67, 149.0, 89.67, 63.9950186...
1    [[107.0, 90.0, 36.67, 156.0, 96.67, 63.9950186...
2    [[104.0, 91.0, 36.67, 168.0, 141.33, 63.995018...
3    [[102.0, 88.0, 36.5, 146.0, 90.67, 63.99501869...
4    [[106.0, 91.0, 36.5, 137.0, 75.67, 63.99501869...
dtype: object

In [34]:
# Apply to the whole dataset
ts_limited_sepsis_data["feature_vector"] = ts_limited_sepsis_data.apply(get_patient_feature_vector, axis=1)

In [35]:
# Drop everything except the patient ID, date, and selected features
drop_columns = [col for col in ts_limited_sepsis_data.columns if col not in ['patient_id', 'hour', 'sepsislabel', 'feature_vector']]
ts_limited_sepsis_data.drop(columns=drop_columns, inplace=True)
ts_limited_sepsis_data.head()

Unnamed: 0,patient_id,hour,sepsislabel,feature_vector
0,1,0,0,"[[108.0, 87.0, 36.67, 149.0, 89.67, 63.9950186..."
1,1,1,0,"[[107.0, 90.0, 36.67, 156.0, 96.67, 63.9950186..."
2,1,2,0,"[[104.0, 91.0, 36.67, 168.0, 141.33, 63.995018..."
3,1,3,0,"[[102.0, 88.0, 36.5, 146.0, 90.67, 63.99501869..."
4,1,4,0,"[[106.0, 91.0, 36.5, 137.0, 75.67, 63.99501869..."


In [22]:
# Transform the dataset to have time step as columns, features in each col
ts_limited_sepsis_sequence = ts_limited_sepsis_data.pivot(index="patient_id", columns="hour", values="feature_vector")
ts_limited_sepsis_sequence


hour,0,1,2,3,4,5,6,7,8,9,...,14,15,16,17,18,19,20,21,22,23
patient_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,"[108.0, 87.0, 36.67, 149.0, 89.67, 63.99501869...","[107.0, 90.0, 36.67, 156.0, 96.67, 63.99501869...","[104.0, 91.0, 36.67, 168.0, 141.33, 63.9950186...","[102.0, 88.0, 36.5, 146.0, 90.67, 63.995018699...","[106.0, 91.0, 36.5, 137.0, 75.67, 63.995018699...","[112.0, 89.0, 36.5, 157.0, 123.67, 63.99501869...","[112.0, 89.0, 36.5, 157.0, 123.67, 63.99501869...","[107.0, 91.0, 37.44, 141.0, 97.0, 63.995018699...","[111.0, 91.0, 37.44, 138.0, 126.0, 63.99501869...","[104.0, 90.0, 37.44, 126.0, 117.33, 63.9950186...",...,"[108.0, 89.0, 37.11, 139.0, 102.33, 63.9950186...","[117.0, 89.0, 36.78, 126.0, 104.67, 63.9950186...","[107.0, 93.0, 36.78, 126.0, 104.67, 63.9950186...","[117.0, 93.0, 36.33, 126.0, 84.0, 63.995018699...","[117.0, 93.0, 36.33, 126.0, 84.0, 63.995018699...","[114.5, 89.5, 36.33, 157.5, 121.17, 63.9950186...","[96.0, 95.0, 36.33, 119.5, 87.5, 63.9950186995...","[84.0, 95.0, 36.33, 111.5, 67.83, 63.995018699...","[86.0, 97.0, 36.33, 127.0, 76.33, 63.995018699...","[99.5, 96.0, 36.33, 143.5, 96.17, 63.995018699..."
3,"[72.5, 94.5, 37.06, 140.0, 85.5, 56.0, 29.0, 8...","[72.0, 96.0, 37.06, 147.0, 84.0, 54.0, 20.0, 8...","[75.0, 97.0, 37.06, 150.0, 86.0, 55.0, 22.0, 8...","[81.5, 95.5, 37.06, 155.0, 93.5, 62.5, 20.5, 8...","[80.0, 95.0, 37.5, 151.0, 86.0, 57.0, 26.0, 8....","[81.0, 94.0, 37.5, 152.0, 86.5, 57.5, 25.5, 8....","[82.5, 96.0, 37.78, 151.0, 87.5, 58.5, 24.5, 8...","[84.5, 94.5, 38.06, 147.5, 87.5, 58.5, 25.0, 8...","[80.0, 96.0, 38.06, 146.0, 84.0, 56.0, 22.0, 8...","[71.0, 96.0, 38.06, 136.0, 78.0, 53.0, 22.0, 8...",...,"[76.0, 91.0, 36.89, 141.0, 87.0, 60.0, 25.0, 8...","[75.0, 94.0, 36.89, 137.0, 81.0, 53.0, 25.0, 8...","[83.0, 94.0, 37.06, 142.0, 86.0, 55.0, 25.0, 8...","[84.0, 94.0, 37.06, 146.0, 86.0, 56.0, 25.0, 8...","[85.0, 95.0, 37.06, 145.0, 88.0, 58.0, 17.0, 8...","[85.0, 95.0, 37.06, 145.0, 88.0, 58.0, 17.0, 8...","[82.0, 95.0, 37.67, 141.0, 81.0, 53.0, 26.0, 8...","[82.0, 95.0, 37.67, 141.0, 81.0, 53.0, 26.0, 8...","[74.0, 94.0, 37.67, 129.0, 74.0, 50.0, 24.0, 8...","[72.0, 93.0, 37.67, 144.0, 90.0, 62.0, 30.0, 8..."
7,"[122.0, 94.5, 37.39, 116.0, 79.0, 62.0, 21.0, ...","[121.0, 94.0, 37.28, 97.0, 65.0, 52.0, 22.0, 9...","[122.0, 95.0, 37.28, 108.0, 69.0, 53.0, 26.0, ...","[125.0, 95.0, 38.0, 101.0, 66.0, 52.0, 27.0, 9...","[122.0, 94.0, 38.0, 91.0, 59.0, 45.0, 19.0, 9....","[121.0, 95.0, 37.94, 94.0, 62.0, 49.0, 22.5, 9...","[121.0, 95.0, 37.94, 95.0, 64.0, 50.0, 26.0, 9...","[128.0, 94.0, 38.22, 97.0, 65.0, 51.0, 28.0, 9...","[125.0, 95.0, 38.22, 94.0, 64.0, 50.0, 27.0, 9...","[121.0, 96.0, 38.06, 96.0, 68.0, 55.0, 23.0, 9...",...,"[123.0, 95.0, 38.06, 117.0, 81.0, 65.0, 20.0, ...","[117.0, 95.0, 38.06, 108.0, 77.0, 63.0, 16.0, ...","[113.0, 96.0, 38.06, 117.0, 75.0, 61.0, 21.0, ...","[112.0, 96.0, 37.5, 115.0, 78.0, 61.0, 14.0, 8...","[111.0, 95.0, 37.5, 112.0, 76.0, 59.0, 12.0, 8...","[109.0, 96.0, 37.5, 108.0, 75.0, 60.0, 13.0, 8...","[110.0, 96.0, 37.5, 109.0, 75.0, 59.0, 14.0, 8...","[111.0, 96.0, 38.33, 115.0, 77.0, 60.0, 13.5, ...","[111.0, 95.0, 38.33, 109.0, 77.0, 63.0, 15.0, ...","[110.0, 95.0, 38.33, 109.0, 77.0, 63.0, 14.0, ..."
8,"[78.0, 100.0, 36.67, 105.0, 70.0, 50.0, 20.0, ...","[80.0, 100.0, 36.89, 103.0, 61.0, 49.0, 18.0, ...","[86.0, 100.0, 36.89, 97.5, 65.0, 49.0, 19.0, 1...","[82.0, 98.0, 36.89, 113.0, 65.0, 42.0, 15.0, 1...","[87.0, 100.0, 36.89, 98.0, 61.0, 42.0, 16.0, 1...","[88.0, 99.0, 36.89, 97.0, 63.0, 44.0, 16.0, 11...","[88.0, 100.0, 36.67, 117.0, 71.0, 50.0, 18.0, ...","[86.0, 100.0, 36.67, 113.0, 73.0, 51.0, 16.0, ...","[85.0, 100.0, 36.56, 111.0, 69.0, 48.0, 18.0, ...","[72.0, 100.0, 36.56, 99.0, 58.0, 40.0, 15.0, 9...",...,"[72.0, 98.0, 36.22, 114.0, 64.0, 45.0, 15.0, 9...","[71.0, 99.0, 36.22, 115.0, 67.0, 48.0, 16.0, 9...","[71.0, 100.0, 36.22, 109.0, 61.0, 43.0, 13.0, ...","[81.0, 97.0, 35.67, 116.0, 72.0, 51.0, 18.0, 9...","[77.0, 89.0, 35.67, 122.0, 75.0, 52.0, 17.0, 9...","[72.0, 97.0, 35.67, 114.0, 63.0, 44.0, 16.0, 9...","[71.0, 97.0, 35.67, 109.0, 66.0, 46.0, 16.0, 9...","[65.0, 98.0, 36.0, 103.0, 58.0, 41.0, 13.0, 9....","[65.0, 96.0, 36.0, 108.0, 60.0, 42.0, 15.0, 9....","[68.0, 100.0, 36.0, 108.0, 62.0, 44.0, 17.0, 9..."
9,"[135.0, 97.0, 38.44, 130.0, 93.0, 73.0, 27.5, ...","[131.0, 98.0, 38.44, 105.0, 75.0, 59.0, 24.0, ...","[129.0, 100.0, 37.72, 133.5, 99.5, 80.0, 26.5,...","[128.0, 99.5, 37.83, 137.5, 104.0, 84.0, 26.5,...","[122.0, 99.0, 37.67, 152.0, 113.0, 90.0, 28.5,...","[117.0, 96.0, 36.78, 146.0, 112.0, 91.0, 27.0,...","[119.0, 97.0, 36.78, 151.0, 117.0, 95.0, 32.0,...","[125.0, 95.0, 36.78, 153.0, 115.0, 93.0, 30.0,...","[128.0, 96.0, 36.28, 149.0, 113.0, 92.0, 28.5,...","[129.0, 94.0, 36.28, 136.0, 102.0, 83.0, 28.0,...",...,"[135.0, 97.0, 38.67, 136.0, 101.0, 82.0, 31.0,...","[129.0, 98.0, 38.67, 139.0, 105.0, 85.0, 29.0,...","[122.0, 99.0, 38.06, 136.0, 105.0, 86.0, 28.0,...","[121.0, 98.0, 38.06, 140.0, 109.0, 89.0, 28.0,...","[115.0, 98.0, 37.72, 139.0, 105.0, 85.0, 27.0,...","[113.0, 97.0, 37.72, 134.0, 101.0, 81.0, 26.0,...","[119.0, 100.0, 37.94, 140.0, 106.0, 85.0, 26.5...","[118.0, 96.0, 37.94, 138.0, 108.0, 88.0, 26.0,...","[111.0, 97.0, 37.39, 136.0, 106.0, 86.0, 26.0,...","[116.0, 96.0, 37.72, 143.0, 109.0, 88.0, 30.0,..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119994,"[79.0, 97.5, 37.65, 116.0, 75.0, 59.0, 14.5, 7...","[74.0, 99.0, 37.4, 116.0, 72.0, 56.0, 15.0, 7....","[71.0, 98.0, 37.4, 117.0, 73.0, 56.0, 16.5, 7....","[71.0, 98.0, 37.35, 122.0, 74.0, 55.0, 16.5, 7...","[70.0, 97.5, 37.35, 117.0, 73.0, 54.0, 14.0, 7...","[70.0, 98.0, 37.5, 102.0, 70.0, 54.0, 16.0, 7....","[70.0, 96.0, 37.6, 122.5, 77.0, 57.5, 19.0, 7....","[76.0, 95.0, 37.6, 114.0, 74.0, 54.0, 12.0, 7....","[68.0, 95.0, 37.6, 116.0, 68.0, 54.0, 20.0, 7....","[68.0, 97.0, 37.6, 122.0, 76.0, 56.0, 18.0, 7....",...,"[74.5, 96.0, 37.8, 128.0, 76.0, 56.0, 18.0, 7....","[82.0, 95.0, 37.7, 138.0, 80.0, 56.0, 18.0, 7....","[76.0, 95.0, 37.7, 122.0, 70.0, 48.0, 23.0, 7....","[72.0, 96.0, 37.7, 114.0, 72.0, 54.0, 17.0, 7....","[72.0, 93.0, 37.8, 112.0, 74.0, 56.0, 20.0, 7....","[72.0, 94.0, 37.6, 116.0, 74.0, 56.0, 19.0, 7....","[72.0, 94.0, 37.4, 126.0, 78.0, 56.0, 25.0, 11...","[70.0, 95.0, 37.3, 134.0, 80.0, 58.0, 20.0, 11...","[70.0, 95.0, 37.2, 128.0, 78.0, 56.0, 20.0, 11...","[70.0, 96.0, 37.2, 132.0, 78.0, 56.0, 19.0, 11..."
119995,"[65.0, 96.0, 36.3, 138.0, 96.0, 70.0, 20.0, 7....","[67.0, 93.0, 36.3, 152.0, 113.0, 86.0, 20.0, 7...","[62.0, 93.0, 36.3, 143.0, 107.0, 80.0, 17.0, 7...","[71.0, 95.0, 36.3, 170.0, 124.0, 97.0, 28.0, 7...","[66.0, 96.0, 35.9, 175.0, 116.0, 82.0, 28.0, 7...","[58.0, 97.0, 35.9, 172.0, 120.0, 84.0, 14.0, 7...","[59.0, 95.0, 35.9, 154.0, 110.0, 82.0, 14.0, 7...","[57.0, 94.0, 35.9, 175.0, 119.0, 83.0, 13.0, 7...","[56.0, 93.0, 35.4, 151.0, 109.0, 81.0, 13.0, 7...","[54.0, 97.0, 35.4, 157.0, 110.0, 80.0, 20.0, 7...",...,"[59.0, 94.0, 36.1, 142.0, 102.0, 75.0, 22.0, 7...","[62.0, 94.0, 36.1, 135.0, 101.0, 79.0, 21.0, 7...","[65.0, 93.0, 36.2, 130.0, 96.0, 72.0, 18.0, 7....","[61.0, 93.0, 36.2, 138.0, 99.0, 71.0, 15.0, 7....","[66.0, 94.0, 36.2, 149.0, 107.0, 80.0, 19.0, 7...","[64.0, 94.0, 36.2, 128.0, 93.0, 68.0, 16.0, 7....","[61.0, 94.0, 36.0, 128.0, 95.0, 72.0, 19.0, 7....","[65.0, 94.0, 36.0, 134.0, 88.0, 66.0, 22.0, 7....","[63.0, 94.0, 36.0, 129.0, 95.0, 71.0, 20.0, 7....","[60.0, 94.0, 36.0, 129.0, 96.0, 74.0, 16.0, 7...."
119996,"[90.0, 99.0, 37.0, 130.0, 68.0, 50.0, 20.0, 12...","[87.0, 99.0, 37.0, 109.0, 71.0, 67.0, 19.0, 12...","[78.0, 99.0, 37.0, 108.0, 66.0, 52.0, 19.0, 12...","[79.0, 99.0, 36.5, 136.0, 86.0, 74.0, 20.0, 12...","[81.0, 99.0, 36.5, 126.0, 77.0, 65.0, 20.0, 12...","[83.0, 99.0, 36.5, 126.0, 82.0, 68.0, 19.0, 12...","[81.0, 99.0, 36.5, 122.0, 71.0, 58.0, 20.0, 12...","[86.0, 99.5, 36.4, 130.0, 77.0, 63.0, 20.0, 12...","[81.0, 100.0, 36.4, 126.0, 84.0, 73.0, 18.0, 1...","[96.0, 95.0, 36.4, 131.0, 85.0, 72.0, 18.0, 12...",...,"[98.0, 98.0, 35.7, 109.0, 88.0, 83.0, 18.0, 12...","[97.0, 98.0, 35.7, 120.0, 83.0, 72.0, 18.0, 12...","[113.0, 98.0, 35.7, 120.0, 83.0, 72.0, 16.0, 1...","[124.0, 98.0, 35.7, 120.0, 83.0, 76.0, 18.0, 1...","[114.0, 98.0, 36.0, 120.0, 83.0, 76.0, 18.0, 1...","[105.0, 98.0, 36.0, 81.0, 70.0, 67.0, 18.0, 12...","[103.0, 98.0, 36.0, 106.0, 71.0, 61.0, 16.0, 1...","[103.0, 98.0, 36.0, 128.0, 83.0, 70.0, 16.0, 1...","[84.0, 99.0, 36.4, 132.0, 82.0, 66.0, 17.0, 12...","[89.0, 98.0, 36.4, 129.0, 72.0, 60.0, 19.0, 12..."
119998,"[97.0, 100.0, 36.8, 146.0, 109.0, 87.0, 20.0, ...","[76.0, 94.0, 36.8, 151.0, 101.0, 70.0, 17.0, 1...","[66.0, 93.0, 36.8, 119.0, 93.0, 75.0, 18.0, 12...","[66.0, 93.0, 36.8, 119.0, 93.0, 75.0, 18.0, 12...","[70.0, 92.0, 36.8, 138.0, 102.0, 78.0, 18.0, 1...","[70.0, 92.0, 36.8, 138.0, 102.0, 78.0, 18.0, 1...","[83.0, 100.0, 36.8, 174.0, 133.0, 107.0, 18.0,...","[84.0, 100.0, 36.8, 146.0, 114.0, 93.0, 18.0, ...","[82.0, 99.0, 36.6, 137.0, 98.0, 70.0, 18.0, 12...","[85.0, 100.0, 36.6, 137.0, 98.0, 70.0, 18.0, 1...",...,"[84.0, 100.0, 36.6, 158.0, 118.0, 87.0, 22.0, ...","[78.0, 93.0, 36.6, 180.5, 137.0, 105.0, 23.0, ...","[77.0, 100.0, 36.0, 173.0, 129.0, 97.0, 23.0, ...","[81.0, 94.0, 36.0, 170.0, 124.0, 92.0, 22.0, 1...","[65.0, 100.0, 36.0, 180.0, 126.0, 88.0, 24.0, ...","[66.0, 98.0, 36.0, 160.0, 115.0, 83.0, 21.0, 1...","[67.0, 93.0, 36.0, 171.0, 132.0, 106.0, 21.0, ...","[73.0, 95.0, 36.0, 205.5, 158.5, 127.5, 20.5, ...","[84.0, 88.5, 36.0, 185.0, 129.5, 94.5, 26.0, 1...","[84.0, 100.0, 36.0, 196.0, 136.0, 95.0, 23.0, ..."


In [36]:
# helper to re-map target value to patient in ts dataset
def remap_sepsis_outcome_to_patient_ts(patient_ts_row):
    p_id = patient_ts_row['patient_id']
    patient_ts_row['sepsislabel'] = sepsis_dataset[sepsis_dataset['patient_id'] == p_id]['sepsislabel'].max()
    return patient_ts_row

ts_limited_sepsis_sequence_label = ts_limited_sepsis_sequence.reset_index().apply(remap_sepsis_outcome_to_patient_ts, axis=1)

In [37]:
# grab just the patient label column for use later as our target var
patient_sequence_sepsis_label = ts_limited_sepsis_sequence_label['sepsislabel']
patient_sequence_sepsis_label = np.array(patient_sequence_sepsis_label.to_list())
patient_sequence_sepsis_label.shape

(24405,)

In [38]:
# Convert the data to an array
patient_sepsis_sequences = np.array(ts_limited_sepsis_sequence.values.tolist())
patient_sepsis_sequences.shape

(24405, 24, 33)

### Split Dataset

In [39]:
# shuffle the dataset
indices = np.arange(patient_sepsis_sequences.shape[0])
np.random.seed(23)
np.random.shuffle(indices, )

X = patient_sepsis_sequences[indices]
y = patient_sequence_sepsis_label[indices]

In [78]:
# Split the data into test/train/val sets with a 80/10/10 split
n = X.shape[0]
X_train = X[:int(n*0.8), :, :]
y_train = y[:int(n*0.8)]

X_test = X[int(n*0.8):int(n*0.9), :, :]
y_test = y[int(n*0.8):int(n*0.9)]

X_val = X[int(n*0.9):, :, :]
y_val = y[int(n*0.9):]

print(f"Train data shape: X: {X_train.shape}, y: {y_train.shape}")
print(f"Test data shape: X: {X_test.shape}, y: {y_test.shape}")
print(f"Validation data shape: X: {X_val.shape}, y: {y_val.shape}")

Train data shape: X: (19524, 24, 33), y: (19524,)
Test data shape: X: (2440, 24, 33), y: (2440,)
Validation data shape: X: (2441, 24, 33), y: (2441,)


#### Scale/Normalize Continuous Features

In [80]:
# Will apply standard scaling to the continuous features
scaler = StandardScaler()

# setup index to apply only to cont features
num_continuous_features = len(feature_cols) - 2
num_continuous_features

30

In [81]:
# We need to temporarily flatten our datasets as scaler supports only two dims
X_train_2d = X_train.reshape(-1, X_train.shape[2])
X_test_2d = X_test.reshape(-1, X_test.shape[2])
X_val_2d = X_val.reshape(-1, X_val.shape[2])

print(f"Train data flattened shape: X: {X_train_2d.shape}")
print(f"Test data flattened shape: X: {X_test_2d.shape}")
print(f"Validation data flattened shape: X: {X_val_2d.shape}")

Train data flattened shape: X: (468576, 33)
Test data flattened shape: X: (58560, 33)
Validation data flattened shape: X: (58584, 33)


In [82]:
# apply scaling to continuous features only
X_train_2d[:, :num_continuous_features] = scaler.fit_transform(X_train_2d[:, :num_continuous_features])
X_test_2d[:, :num_continuous_features] = scaler.transform(X_test_2d[:, :num_continuous_features])
X_val_2d[:, :num_continuous_features] = scaler.transform(X_val_2d[:, :num_continuous_features])

In [83]:
# reshape back to original
X_train_norm = X_train_2d.reshape(X_train.shape)
X_test_norm = X_test_2d.reshape(X_test.shape)
X_val_norm = X_val_2d.reshape(X_val.shape)

print(f"Train data un-flattened shape: X: {X_train_norm.shape}")
print(f"Test data un-flattened shape: X: {X_test_norm.shape}")
print(f"Validation data un-flattened shape: X: {X_val_norm.shape}")

Train data un-flattened shape: X: (19524, 24, 33)
Test data un-flattened shape: X: (2440, 24, 33)
Validation data un-flattened shape: X: (2441, 24, 33)


In [84]:
# Save mean and standard deviation arrays to S3
scaler_mean = scaler.mean_
scaler_stddev = scaler.scale_

print(f"Scaler mean: {scaler_mean} and std dev: {scaler_stddev}")

np.save(f"{local_data_path_csv}/scaler_mean.npy", scaler_mean)
np.save(f"{local_data_path_csv}/scaler_stddev.npy", scaler_stddev)

Scaler mean: [-9.24544706e-16 -2.52492052e-15 -3.65936405e-15  1.04327381e-15
  8.06245409e-15 -8.97611771e-16  1.49384724e-15  1.86993927e-15
  2.38019983e-16  2.66207751e-15  7.92287768e-16 -2.80021304e-14
 -4.51880124e-16 -1.21297419e-14 -7.29564352e-15 -5.83829330e-15
  1.09505721e-14 -6.06583861e-16  4.80127104e-16  6.72979285e-15
  4.47461525e-15  1.16903723e-15 -1.00950931e-15 -1.10161349e-14
  8.21862345e-15  2.45370908e-15  3.22421436e-15 -5.50034932e-16
  4.44718185e-16  9.86166425e-16] and std dev: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]
