# 06a: Data Preparation for Transformer Modeling

## Overview
This notebook focuses on the comprehensive data preparation steps required for transformer modeling using patient static and dynamic data. It covers loading and checking the data for missing values, followed by detailed preprocessing of both static and dynamic features. This includes handling missingness, creating derived features, binning continuous data, structuring dynamic data into time windows, merging static and dynamic information, normalizing continuous features, and finally saving the cleaned and combined data for subsequent modeling.


## 1. Import Necessary Libraries

This section imports the Python libraries required for data manipulation, numerical operations, and preprocessing.

In [None]:
import sys
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

## 2. Mount Google Drive

This section mounts Google Drive so that files can be accessed directly from within the Colab environment.

In [None]:
from google.colab import drive

# Mount Google drive
drive.mount('/content/drive')

# Base file path
basePath = 'drive/MyDrive/Colab Notebooks/AAI-590-01_02/AAI590_CapstoneProject'

Mounted at /content/drive


## 3. Load Static & Dynamic Data

This section focuses on loading the patient static and dynamic data from CSV files into pandas DataFrames, and displays the head of both dataframes after loading.

In [None]:
# Note: use the path below when running on a local machine
# static_data_path = r'../data/processed/patient_static_data_df.csv'

# Note: use the path below when running in Google Colab
static_data_path = os.path.join(basePath, 'data/processed/patient_static_data_df.csv')

# Load static data
static_data_df = pd.read_csv(static_data_path)
display(static_data_df.head())

Unnamed: 0,RecordID,Age,Gender,Height,ICUType,In-hospital_death,Length_of_stay,SAPS-I,SOFA,Survival,Weight
0,140101,39.0,0.0,170.2,3.0,0,10,10,7,-1,253.0
1,140102,70.0,0.0,-1.0,3.0,0,39,11,6,393,123.5
2,140104,61.0,1.0,188.0,2.0,0,5,18,7,-1,80.0
3,140106,64.0,1.0,162.6,2.0,0,22,22,14,-1,80.0
4,140107,45.0,1.0,-1.0,3.0,0,19,15,7,-1,105.5


In [None]:
# Note: use the path below when running on a local machine
# dynamic_data_path = r'../data/processed/patient_dynamic_tensors_df.csv'

# Note: use the path below when running in Google Colab
dynamic_data_path = os.path.join(basePath, 'data/processed/patient_dynamic_tensors_df.csv')

# Load dynamic data
dynamic_data_df = pd.read_csv(dynamic_data_path)
display(dynamic_data_df.head())

Unnamed: 0,RecordID,Minutes,ALP,ALT,AST,Albumin,BUN,Bilirubin,Cholesterol,Creatinine,...,Platelets,RespRate,SaO2,SysABP,Temp,TroponinI,TroponinT,Urine,WBC,pH
0,140101,4,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
1,140101,34,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,38.1,-1.0,-1.0,90.0,-1.0,-1.0
2,140101,64,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
3,140101,124,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,80.0,-1.0,-1.0
4,140101,184,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,100.0,-1.0,-1.0


## 4. Check for Missing Values

This section checks for standard missing values (NaN) and also counts occurrences of the sentinel value -1, which is used to represent missing data in this specific dataset. This helps identify the extent of missingness in each column.
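
As a minimal sketch of why both checks are needed, the toy DataFrame below (made-up values, not taken from the dataset) reports zero NaNs while most of one column is actually missing. The names `toy_df` and `missing_summary` are illustrative only.

```python
import pandas as pd

# Toy frame (illustrative values only): -1 marks a missing measurement
toy_df = pd.DataFrame({
    "RecordID": [1, 1, 2, 2],
    "HR": [-1.0, 88.0, 92.0, -1.0],
    "Temp": [-1.0, -1.0, -1.0, 37.2],
})

# Summarize both kinds of missingness side by side, one row per column
missing_summary = pd.DataFrame({
    "nan_count": toy_df.isnull().sum(),         # standard NaN check
    "sentinel_count": (toy_df == -1).sum(),     # sentinel check
    "sentinel_ratio": (toy_df == -1).mean(),    # fraction of rows holding -1
})
print(missing_summary)
```

Here `isnull()` finds nothing, while the sentinel ratio shows that `Temp` is missing in three of four rows.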

In [None]:
# Check shape
print("Shape of static dataframe: ", static_data_df.shape)
print("Shape of dynamic dataframe: ", dynamic_data_df.shape)

Shape of static dataframe:  (4000, 11)
Shape of dynamic dataframe:  (295354, 38)


In [None]:
# Check for missing values
print("Missing values in static_data_df:")
print(static_data_df.isnull().sum())
print("\nMissing values in dynamic_data_df:")
print(dynamic_data_df.isnull().sum())
print("\nChecking for -1 values in static_data_df:")
print((static_data_df == -1).sum())
print("\nChecking for -1 values in dynamic_data_df:")
print((dynamic_data_df == -1).sum())

Missing values in static_data_df:
RecordID             0
Age                  0
Gender               0
Height               0
ICUType              0
In-hospital_death    0
Length_of_stay       0
SAPS-I               0
SOFA                 0
Survival             0
Weight               0
dtype: int64

Missing values in dynamic_data_df:
RecordID       0
Minutes        0
ALP            0
ALT            0
AST            0
Albumin        0
BUN            0
Bilirubin      0
Cholesterol    0
Creatinine     0
DiasABP        0
FiO2           0
GCS            0
Glucose        0
HCO3           0
HCT            0
HR             0
K              0
Lactate        0
MAP            0
MechVent       0
Mg             0
NIDiasABP      0
NIMAP          0
NISysABP       0
Na             0
PaCO2          0
PaO2           0
Platelets      0
RespRate       0
SaO2           0
SysABP         0
Temp           0
TroponinI      0
TroponinT      0
Urine          0
WBC            0
pH             0
dtype: int64

Chec

## 5. Data Preparation

This section covers all preprocessing of the static and dynamic patient data for transformer modeling: handling missing values, creating new features, binning continuous data into categories, and structuring the dynamic data into time windows. The preprocessed static and dynamic data are then merged into a single DataFrame, continuous features are normalized, and the cleaned, combined DataFrame is saved as a CSV file.

### 5.1 Static Features

This section focuses on preparing the static patient data for modeling. This involves several sub-steps: dropping outcome-related features to prevent data leakage, standardizing categorical features like Gender and ICUType, adding derived numeric features such as BMI, and finally, binning numeric features like Age and BMI into clinically relevant categories, including creating and encoding Age-BMI interaction tokens.

#### 5.1.1 Outcome Features: Drop

In this section outcome-related features like `Length_of_stay`, `SAPS-I`, `SOFA`, and `Survival` are dropped from the static data to prevent data leakage, as these variables are not available at the time of ICU admission and could provide information about the target variable (`In-hospital_death`), leading to misleading model performance.


In [None]:
# Static DataFrame features list (includes outcomes)
static_cols = static_data_df.columns.tolist()

# Remove outcome-related predictors
excluded_features = ['Length_of_stay', 'SAPS-I', 'SOFA', 'Survival']
filtered_static_cols = [col for col in static_cols if col not in excluded_features]

# Drop the excluded outcome columns from the static DataFrame
static_data_df.drop(columns=excluded_features, inplace=True)

# Verify remaining columns
filtered_static_cols = static_data_df.columns.tolist()
print("Remaining columns:", filtered_static_cols)

Remaining columns: ['RecordID', 'Age', 'Gender', 'Height', 'ICUType', 'In-hospital_death', 'Weight']


#### 5.1.2 Categorical Features: Standardize

This section focuses on preparing categorical features in the static data for modeling. It specifically handles the Gender feature by replacing the sentinel value -1 with 2 to represent 'Unknown' and converting it to an integer type. It also ensures the ICUType is an integer type. This standardization is done for better interpretability and compatibility with modeling techniques.

In [None]:
# Handle missing values for categorical variables
print('Unique "Gender" feature values: ', static_data_df['Gender'].unique())

# Map Gender: 1 = Male, 0 = Female, -1 = Unknown
# Standardized for interpretability and modeling - Now 2 stands for 'Unknown'
static_data_df['Gender'] = static_data_df['Gender'].replace(-1, 2).astype(int)
print('Unique "Gender" feature values after standardization: ', static_data_df['Gender'].unique())

# Convert ICUType to integer format for embedding; no missing values detected so safe to use directly
static_data_df['ICUType'] = static_data_df['ICUType'].astype(int)
print('Unique "ICUType" feature values: ', static_data_df['ICUType'].unique())

Unique "Gender" feature values:  [ 0.  1. -1.]
Unique "Gender" feature values after standardization:  [0 1 2]
Unique "ICUType" feature values:  [3 2 1 4]


#### 5.1.3 Numeric Features: Add Derived Features

This section focuses on creating new, potentially more informative features from existing numeric ones in the static data. Specifically, it calculates Body Mass Index (BMI) from height and weight. This is done to combine two potentially skewed features into a single variable that captures body composition, which can be important for modeling.
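
A small worked example of the BMI formula, BMI = weight (kg) / height (m)², may help here. It is only a sketch: the `toy` DataFrame is hypothetical, with values modeled on the rows displayed earlier, and it assumes the -1 sentinel has already been replaced by NaN.

```python
import numpy as np
import pandas as pd

# Toy heights (cm) and weights (kg); NaN stands for a missing measurement
toy = pd.DataFrame({
    "Height": [188.0, 170.2, np.nan],
    "Weight": [80.0, 253.0, 105.5],
})

# BMI = weight (kg) / height (m)^2, with height converted from cm to m
toy["BMI"] = toy["Weight"] / (toy["Height"] / 100) ** 2
print(toy.round(2))
```

A missing height propagates to a missing BMI, which is why the sentinel has to be converted to NaN before the division; dividing by a height of -1 cm would silently produce a numeric but meaningless value.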

In [None]:
# Filter valid height and weight data
height_filtered = static_data_df['Height'][static_data_df['Height'] != -1]
weight_filtered = static_data_df['Weight'][static_data_df['Weight'] != -1]

# Verify height and weight range to understand the data distribution
print("Height:", height_filtered.min(), "cm to ", height_filtered.max(), "cm")
print("Weight:", weight_filtered.min(), "kg to ", weight_filtered.max(), "kg")

Height: 1.8 cm to  431.8 cm
Weight: 21.7 kg to  300.0 kg


In [None]:
# Replace -1 sentinel values with NaN directly in place for easier handling of missing data
static_data_df['Height'] = static_data_df['Height'].replace(-1, np.nan)
static_data_df['Weight'] = static_data_df['Weight'].replace(-1, np.nan)

# Add binary missingness masks to indicate whether original height/weight values were missing
static_data_df['Height_missing'] = static_data_df['Height'].isna().astype(int)
static_data_df['Weight_missing'] = static_data_df['Weight'].isna().astype(int)

# Compute BMI (kg/m²) using the formula: Weight (kg) / (Height (m))^2
# Note: Height is converted from cm to meters by dividing by 100
static_data_df['BMI'] = static_data_df['Weight'] / ((static_data_df['Height'] / 100) ** 2)

# Handle invalid BMI entries that might result from missing or zero height/weight
static_data_df['BMI'] = static_data_df['BMI'].replace([np.inf, -np.inf], np.nan)

# Find records with extreme BMI values to identify potential data entry errors or outliers
low_bmi = static_data_df.loc[static_data_df['BMI'] == static_data_df['BMI'].min(), ['Height', 'Weight', 'BMI']]
high_bmi = static_data_df.loc[static_data_df['BMI'] == static_data_df['BMI'].max(), ['Height', 'Weight', 'BMI']]

print("Low BMI record:\n", low_bmi)
print("High BMI record:\n", high_bmi)

Low BMI record:
       Height  Weight       BMI
3951   406.4    57.0  3.451179
High BMI record:
       Height  Weight            BMI
2956     1.8    68.7  212037.037037


#### 5.1.4 Numeric Features: Binning

This section covers the discretization of continuous numeric features like BMI and Age into clinically meaningful categories. It includes binning BMI into categories such as 'Underweight', 'Normal Weight', and 'Obese', and Age into groups like 'Adolescent', 'Young Adult', and 'Geriatric'. This section also creates interaction tokens by combining the Age and BMI bins to capture joint demographic information. Ordinal encoding is applied to these binned features and interaction tokens to preserve the clinical severity order.
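
The sketch below illustrates, on a shortened and hypothetical label list, why the category order is passed explicitly to `OrdinalEncoder`: without it, scikit-learn sorts the categories, so the codes follow alphabetical order rather than the intended one.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Shortened label list for illustration; 'Unknown' is placed last on purpose
labels = ["Underweight", "Normal Weight", "Overweight", "Unknown"]
toy = pd.DataFrame(
    {"BMI_bin": ["Overweight", "Unknown", "Underweight", "Normal Weight"]}
)

# Explicit category list: codes follow the listed order
ordered = OrdinalEncoder(categories=[labels]).fit_transform(toy[["BMI_bin"]]).ravel()

# Default behaviour: categories are sorted, so codes follow alphabetical order
default = OrdinalEncoder().fit_transform(toy[["BMI_bin"]]).ravel()

print("explicit order:", ordered.tolist())
print("default order: ", default.tolist())
```

Note that 'Unknown' receives the highest code simply because it is listed last; that code marks a placeholder position and should probably not be read as the most severe category.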

---

**BMI Binning:** The calculated BMI values are binned into clinically meaningful categories such as 'Underweight', 'Normal Weight', 'Overweight', and 'Obese'. This discretization limits the influence of outliers and makes the feature more interpretable for transformer models. Missing BMI values (NaN) are assigned to an 'Unknown' category, and ordinal encoding is then applied so that the bins keep their clinical severity order.


| **Bin Name**            | **BMI Range**   | **Clinical Interpretation**                             |
|-------------------------|-----------------|----------------------------------------------------------|
| Severely Underweight    | < 16             | High malnutrition risk, possible frailty                 |
| Underweight             | 16 – 18.5        | Mild malnutrition, may indicate chronic illness          |
| Normal Weight           | 18.5 – 24.9      | Healthy range                                            |
| Overweight              | 25 – 29.9        | Mild elevation, sometimes protective in ICU              |
| Obese Class I           | 30 – 34.9        | Moderate cardiometabolic risk                           |
| Obese Class II          | 35 – 39.9        | Elevated ICU complication risk                           |
| Obese Class III         | ≥ 40             | Severe obesity, often linked to adverse outcomes         |
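
To make the boundary behavior concrete, the following sketch runs `pd.cut` with `right=False` on values chosen to sit on or near the bin edges. The `edge_values` series is made up for illustration; the bins and labels mirror the table above.

```python
import numpy as np
import pandas as pd

bmi_bins = [0, 16, 18.5, 25, 30, 35, 40, np.inf]
bmi_labels = [
    "Severely Underweight", "Underweight", "Normal Weight", "Overweight",
    "Obese Class I", "Obese Class II", "Obese Class III",
]

# Values on or near bin edges, plus one missing value
edge_values = pd.Series([15.9, 16.0, 18.5, 24.9, 25.0, 40.0, np.nan])

# right=False makes each bin include its left edge and exclude its right edge
binned = pd.cut(edge_values, bins=bmi_bins, labels=bmi_labels, right=False)
print(pd.DataFrame({"BMI": edge_values, "bin": binned}))
```

With left-closed bins, a BMI of exactly 18.5 is 'Normal Weight' and exactly 25.0 is 'Overweight', so a value such as 24.95 still counts as 'Normal Weight' even though the table lists that range as ending at 24.9. The missing value stays NaN after `pd.cut` and needs a separate fill step.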

In [None]:
# BMI Binning with Missing Handling

# Define bins and labels according to clinical guidelines
bmi_bins = [0, 16, 18.5, 25, 30, 35, 40, np.inf]
bmi_labels = [
    'Severely Underweight',
    'Underweight',
    'Normal Weight',
    'Overweight',
    'Obese Class I',
    'Obese Class II',
    'Obese Class III',
    'Unknown' # Placeholder for missing values
]

# Bin BMI values using the defined bins and labels
static_data_df['BMI_bin'] = pd.cut(
    static_data_df['BMI'],
    bins=bmi_bins,
    labels=bmi_labels[:-1], # Exclude 'Unknown' during initial binning
    right=False # Bins are inclusive of the left edge, exclusive of the right
)

# Handle NaNs resulting from pd.cut (for original NaNs in BMI) by filling with 'Unknown'
static_data_df['BMI_bin'] = static_data_df['BMI_bin'].astype(object).fillna('Unknown')

# Apply ordinal encoding to BMI bins to preserve clinical severity order
# The order of categories is explicitly defined to ensure correct encoding
bmi_encoder = OrdinalEncoder(categories=[bmi_labels])
static_data_df['BMI_bin_encoded'] = bmi_encoder.fit_transform(
    static_data_df[['BMI_bin']]
)

# Display frequency distribution of encoded BMI labels to verify binning results
print("Frequency Distribution of BMI Labels:")
print(static_data_df['BMI_bin_encoded'].value_counts().sort_index())

Frequency Distribution of BMI Labels:
BMI_bin_encoded
0.0      17
1.0      53
2.0     619
3.0     724
4.0     397
5.0     148
6.0     147
7.0    1895
Name: count, dtype: int64


---
**Age Binning:** The 'Age' feature is binned into clinically relevant categories: 'Adolescent', 'Young Adult', 'Middle-Aged', 'Senior', and 'Geriatric'. Discretizing the continuous age data yields groups that reflect different physiological and risk profiles, which can be beneficial for transformer models. Missing age values (if any) are assigned to an 'Unknown' category, and ordinal encoding is applied to maintain the intended order of age groups.

| **Bin Name**     | **Age Range** | **Clinical Interpretation**                |
|------------------|---------------|--------------------------------------------|
| Adolescent       | 15–17         | Transitional physiology                    |
| Young Adult      | 18–40         | Lower baseline risk                        |
| Middle-Aged      | 41–65         | Increased complexity                       |
| Senior           | 66–80         | Chronic conditions                         |
| Geriatric        | 81–90         | High sensitivity to critical outcomes      |

In [None]:
# Age Binning with Missing Handling

print("Age:", static_data_df['Age'].min(), "to", static_data_df['Age'].max())

# Define bins and labels for age categories
age_bins = [15, 18, 41, 66, 81, 91]  # Upper bound 91 to include age=90
age_labels = [
    'Adolescent',
    'Young Adult',
    'Middle-Aged',
    'Senior',
    'Geriatric',
    'Unknown' # Placeholder for missing values
]

# Bin age values into the defined categories
static_data_df['Age_bin'] = pd.cut(
    static_data_df['Age'],
    bins=age_bins,
    labels=age_labels[:-1], # Exclude 'Unknown' during initial binning
    right=False # Bins are inclusive of the left edge, exclusive of the right
)

# Handle missing ages (if any) resulting from pd.cut by filling with 'Unknown'
static_data_df['Age_bin'] = static_data_df['Age_bin'].astype(object).fillna('Unknown')

# Apply ordinal encoding to Age bins to preserve clinical severity order
age_encoder = OrdinalEncoder(categories=[age_labels])
static_data_df['Age_bin_encoded'] = age_encoder.fit_transform(
    static_data_df[['Age_bin']]
)

# Display frequency distribution of age labels to verify binning results
print("Frequency Distribution of Age Labels:")
print(static_data_df['Age_bin_encoded'].value_counts().sort_index())

Age: 15.0 to 90.0
Frequency Distribution of Age Labels:
Age_bin_encoded
0.0       6
1.0     429
2.0    1459
3.0    1315
4.0     791
Name: count, dtype: int64


---
**Age–BMI Interaction Tokens:** The Age and BMI bins are combined into interaction tokens. This creates a new categorical feature that captures the joint demographic state of each patient, which can enrich the embeddings in Transformer models and help in analyzing age- and BMI-specific risk patterns. The interaction tokens are then ordinally encoded, ordered first by age bin and then by BMI bin.
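
A minimal sketch of the token construction on a hypothetical four-row frame is shown below. The column names follow the notebook, but the rows and the helper names `triples` and `token_map` are assumptions made for illustration.

```python
import pandas as pd

# Toy rows with bin labels and their ordinal codes (illustrative values)
toy = pd.DataFrame({
    "Age_bin": ["Senior", "Young Adult", "Senior", "Young Adult"],
    "Age_bin_encoded": [3, 1, 3, 1],
    "BMI_bin": ["Unknown", "Obese Class III", "Normal Weight", "Normal Weight"],
    "BMI_bin_encoded": [7, 6, 2, 2],
})

# Concatenate the two labels into one token per patient
toy["AgeBMI_token"] = toy["Age_bin"] + "_" + toy["BMI_bin"]

# Unique (age code, BMI code, token) triples, sorted by age first, BMI second
triples = (
    toy[["Age_bin_encoded", "BMI_bin_encoded", "AgeBMI_token"]]
    .drop_duplicates()
    .sort_values(["Age_bin_encoded", "BMI_bin_encoded"])
)

# Dense integer ID per token, following the sorted order
token_map = {token: idx for idx, token in enumerate(triples["AgeBMI_token"])}
toy["AgeBMI_token_encoded"] = toy["AgeBMI_token"].map(token_map)
print(token_map)
```

Because IDs are assigned only to combinations that occur in the data, the mapping depends on the dataset it was built from; reusing the same `token_map` for any later split would keep the IDs consistent.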

In [None]:
# Create and Encode Age–BMI Interaction Tokens

# Create Age–BMI combination token by concatenating the age and BMI bin labels
# (both bin columns already use 'Unknown' for missing values, so no extra fill is needed)
static_data_df['AgeBMI_token'] = static_data_df['Age_bin'].astype(str) + '_' + static_data_df['BMI_bin'].astype(str)

# Sort tokens by clinical severity order based on their encoded age and BMI values
# This ensures that the ordinal encoding maintains a meaningful order
interaction_order = sorted(static_data_df[['Age_bin_encoded', 'BMI_bin_encoded', 'AgeBMI_token']].drop_duplicates().values.tolist())

# Create a mapping from interaction tokens to ordinal integers
interaction_token_map = {token[2]: idx for idx, token in enumerate(interaction_order)}
static_data_df['AgeBMI_token_encoded'] = static_data_df['AgeBMI_token'].map(interaction_token_map)

# Display interaction token mapping to show the assigned integer for each token
print("Interaction Token Mapping:")
print(interaction_token_map)

# Display frequency distribution of interaction tokens to verify encoding results
print("\nFrequency Distribution of Interaction Tokens:")
print(static_data_df['AgeBMI_token_encoded'].value_counts().sort_index())

Interaction Token Mapping:
{'Adolescent_Normal Weight': 0, 'Adolescent_Unknown': 1, 'Young Adult_Severely Underweight': 2, 'Young Adult_Underweight': 3, 'Young Adult_Normal Weight': 4, 'Young Adult_Overweight': 5, 'Young Adult_Obese Class I': 6, 'Young Adult_Obese Class II': 7, 'Young Adult_Obese Class III': 8, 'Young Adult_Unknown': 9, 'Middle-Aged_Severely Underweight': 10, 'Middle-Aged_Underweight': 11, 'Middle-Aged_Normal Weight': 12, 'Middle-Aged_Overweight': 13, 'Middle-Aged_Obese Class I': 14, 'Middle-Aged_Obese Class II': 15, 'Middle-Aged_Obese Class III': 16, 'Middle-Aged_Unknown': 17, 'Senior_Severely Underweight': 18, 'Senior_Underweight': 19, 'Senior_Normal Weight': 20, 'Senior_Overweight': 21, 'Senior_Obese Class I': 22, 'Senior_Obese Class II': 23, 'Senior_Obese Class III': 24, 'Senior_Unknown': 25, 'Geriatric_Severely Underweight': 26, 'Geriatric_Underweight': 27, 'Geriatric_Normal Weight': 28, 'Geriatric_Overweight': 29, 'Geriatric_Obese Class I': 30, 'Geriatric_Obese C

#### 5.1.5 Clean Data

This section focuses on dropping unnecessary columns from the static data after the necessary features have been created and encoded. Columns like original 'Age', 'Height', 'Weight', 'BMI', and the intermediate binning columns and interaction token string are removed to keep only the relevant features for the next steps. This helps streamline the dataset and prepare it for further processing.

In [None]:
# Drop unnecessary columns that are no longer needed after feature engineering
static_columns_to_drop = ['Age', 'Height', 'Weight', 'BMI', 'BMI_bin', 'BMI_bin_encoded', 'Age_bin', 'Age_bin_encoded', 'AgeBMI_token']
static_data_cleaned_df = static_data_df.drop(columns=static_columns_to_drop)

# Display the first few rows of the cleaned static DataFrame
static_data_cleaned_df.head()

Unnamed: 0,RecordID,Gender,ICUType,In-hospital_death,Height_missing,Weight_missing,AgeBMI_token_encoded
0,140101,0,3,0,0,0,8
1,140102,0,3,0,1,0,25
2,140104,1,2,0,0,0,12
3,140106,1,2,0,0,0,14
4,140107,1,3,0,1,0,17


### 5.2 Dynamic Features

This section focuses on preparing the dynamic patient data for modeling. This involves several key steps: filtering out sparse features with a high proportion of missing values, handling temporal missingness through bi-directional filling and hybrid imputation, adding derived physiological features, binning continuous features into clinically relevant categories, cleaning the data by dropping unnecessary columns, and finally, windowing the data into 60-minute intervals for time-series analysis.

#### 5.2.1 Sparse Feature Filtering

This section addresses the issue of features with a high proportion of missing values (represented by the sentinel value -1). It removes columns where more than 95% of the data is missing, as these features are unlikely to be informative for modeling. This helps to reduce the dimensionality of the dynamic data and improve computational efficiency.

In [None]:
# Function to exclude features with extreme sparsity
def remove_sparse_features(df, threshold=0.95, sentinel=-1):
    df_clean = df.copy()

    # Treat sentinel value as missing
    df_missing_mask = (df_clean == sentinel)

    # Also include standard NaNs as missing entries (if any present)
    if df_clean.isnull().values.any():
        df_missing_mask |= df_clean.isnull()

    # Calculate proportion of missing values for each column
    missing_ratio = df_missing_mask.mean()

    # Identify columns exceeding the missingness threshold
    cols_to_drop = missing_ratio[missing_ratio > threshold].index.tolist()
    print(f"Dropping {len(cols_to_drop)} columns: {cols_to_drop}")

    # Remove the identified sparse columns from the DataFrame
    df_clean.drop(columns=cols_to_drop, inplace=True)

    return df_clean

# Drop columns whose missingness exceeds the 95% threshold
dynamic_data_filtered = remove_sparse_features(dynamic_data_df, threshold=0.95)

feature_cols = [col for col in dynamic_data_filtered.columns if col not in ['RecordID', 'Minutes']]
print('Features retained after filtering: ', feature_cols)

dynamic_data_filtered.head()

Dropping 19 columns: ['ALP', 'ALT', 'AST', 'Albumin', 'BUN', 'Bilirubin', 'Cholesterol', 'Creatinine', 'Glucose', 'HCO3', 'K', 'Lactate', 'Mg', 'Na', 'Platelets', 'SaO2', 'TroponinI', 'TroponinT', 'WBC']
Features retained after filtering:  ['DiasABP', 'FiO2', 'GCS', 'HCT', 'HR', 'MAP', 'MechVent', 'NIDiasABP', 'NIMAP', 'NISysABP', 'PaCO2', 'PaO2', 'RespRate', 'SysABP', 'Temp', 'Urine', 'pH']


Unnamed: 0,RecordID,Minutes,DiasABP,FiO2,GCS,HCT,HR,MAP,MechVent,NIDiasABP,NIMAP,NISysABP,PaCO2,PaO2,RespRate,SysABP,Temp,Urine,pH
0,140101,4,-1.0,0.8,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
1,140101,34,-1.0,-1.0,-1.0,-1.0,122.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,38.1,90.0,-1.0
2,140101,64,-1.0,-1.0,10.0,-1.0,114.0,-1.0,-1.0,54.0,70.33,103.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
3,140101,124,-1.0,0.8,-1.0,-1.0,100.0,-1.0,1.0,47.0,65.33,102.0,-1.0,-1.0,-1.0,-1.0,-1.0,80.0,-1.0
4,140101,184,-1.0,-1.0,-1.0,-1.0,102.0,-1.0,-1.0,40.0,59.67,99.0,-1.0,-1.0,-1.0,-1.0,-1.0,100.0,-1.0


#### 5.2.2 Handle Temporal Missingness

This section addresses missing values in the dynamic data that occur over time. It first generates binary indicator masks to show where values are missing. Then, it applies a bi-directional fill to propagate known values forwards and backwards within each patient's time series data. Finally, it uses a hybrid imputation strategy, filling remaining NaNs with the patient-level mean where possible, and falling back to the population median for any remaining missing values.
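
The three steps can be traced on a toy example. This is a sketch under stated assumptions: the data are made up, and it uses the vectorized `groupby(...).ffill()` / `bfill()` calls rather than the per-group function used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy series for two patients; -1 marks a missing reading (illustrative values)
toy = pd.DataFrame({
    "RecordID": [1, 1, 1, 2, 2],
    "Minutes":  [0, 60, 120, 0, 60],
    "HR":       [-1.0, 80.0, -1.0, -1.0, -1.0],
    "Temp":     [37.0, -1.0, 38.0, 36.5, -1.0],
})
features = ["HR", "Temp"]

# 1) Sentinel -> NaN, then record which values were originally observed
toy[features] = toy[features].replace(-1, np.nan)
for col in features:
    toy[f"{col}_mask"] = toy[col].notna().astype(int)

# 2) Forward then backward fill within each patient, in time order
toy = toy.sort_values(["RecordID", "Minutes"]).reset_index(drop=True)
toy[features] = toy.groupby("RecordID")[features].ffill()
toy[features] = toy.groupby("RecordID")[features].bfill()

# 3) Patient-level mean, then population median for patients with no readings
toy[features] = toy[features].fillna(
    toy.groupby("RecordID")[features].transform("mean")
)
toy[features] = toy[features].fillna(toy[features].median())
print(toy)
```

In this toy data the patient-level mean changes nothing: after filling in both directions, a patient with at least one reading has no gaps left, and a patient with no readings has no mean to fall back on. Patient 2's `HR` is therefore filled by the population median, while the `_mask` columns still record that none of those values were observed.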

In [None]:
# Function that generates binary indicators for missing values across selected features
def generate_missingness_masks(df, feature_cols, sentinel=-1):
    df_masked = df.copy()

    # Convert sentinel values to standard NaN to be recognized as missing by pandas
    df_masked.replace(sentinel, np.nan, inplace=True)

    # Construct binary mask columns for given features: 1 for valid, 0 for missing
    for col in feature_cols:
        mask_col = f"{col}_mask"
        df_masked[mask_col] = df_masked[col].notnull().astype(int)

    return df_masked

# Generate DataFrame with added _mask columns indicating original data presence
dynamic_data_masked = generate_missingness_masks(dynamic_data_filtered, feature_cols)
dynamic_data_masked.head()

Unnamed: 0,RecordID,Minutes,DiasABP,FiO2,GCS,HCT,HR,MAP,MechVent,NIDiasABP,...,NIDiasABP_mask,NIMAP_mask,NISysABP_mask,PaCO2_mask,PaO2_mask,RespRate_mask,SysABP_mask,Temp_mask,Urine_mask,pH_mask
0,140101,4,,0.8,,,,,1.0,,...,0,0,0,0,0,0,0,0,0,0
1,140101,34,,,,,122.0,,,,...,0,0,0,0,0,0,0,1,1,0
2,140101,64,,,10.0,,114.0,,,54.0,...,1,1,1,0,0,0,0,0,0,0
3,140101,124,,0.8,,,100.0,,1.0,47.0,...,1,1,1,0,0,0,0,0,1,0
4,140101,184,,,,,102.0,,,40.0,...,1,1,1,0,0,0,0,0,1,0


In [None]:
# Function that performs bi-directional filling of missing values within each patient's time series
def bi_directional_fill(group, feature_cols, sort_by_col):
    group = group.sort_values(sort_by_col).copy() # Sorts observations chronologically by 'Minutes' to ensure correct temporal filling
    group[feature_cols] = group[feature_cols].ffill()  # Applies forward fill to propagate known values into future time steps
    group[feature_cols] = group[feature_cols].bfill()  # Applies backward fill to fill any remaining gaps by propagating values backward

    return group

# Apply bi-directional filling to dynamic data, grouping by RecordID
dynamic_data_filled = dynamic_data_masked.groupby('RecordID').apply(
    lambda group: bi_directional_fill(group, feature_cols, 'Minutes')
).reset_index(drop=True)

# Check for remaining missing values after bi-directional fill
print(dynamic_data_filled[feature_cols].isnull().sum())

DiasABP       66726
FiO2          74560
GCS             509
HCT            3485
HR              508
MAP           67269
MechVent      84335
NIDiasABP     40133
NIMAP         40317
NISysABP      39155
PaCO2         53338
PaO2          53342
RespRate     229866
SysABP        66726
Temp            509
Urine          4030
pH            52241
dtype: int64


In [None]:
# Function to apply a hybrid imputation strategy
def hybrid_impute(df, cols):
    # Apply patient-level mean imputation
    df_imputed = df.copy()
    df_imputed[cols] = df_imputed.groupby('RecordID')[cols].transform(lambda g: g.fillna(g.mean()))

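    # Then fall back to the population median for residual NaNs
    # (assign the result back; fillna(inplace=True) on a column selection may not update df_imputed)
    for col in cols:
        df_imputed[col] = df_imputed[col].fillna(df_imputed[col].median())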

    return df_imputed

# Apply hybrid imputation to fill any remaining missing values
dynamic_data_filled = hybrid_impute(dynamic_data_filled, feature_cols)

# Final check for missing values to ensure all are imputed
print(dynamic_data_filled[feature_cols].isnull().sum())

DiasABP      0
FiO2         0
GCS          0
HCT          0
HR           0
MAP          0
MechVent     0
NIDiasABP    0
NIMAP        0
NISysABP     0
PaCO2        0
PaO2         0
RespRate     0
SysABP       0
Temp         0
Urine        0
pH           0
dtype: int64


In [None]:
dynamic_data_filled.head()

Unnamed: 0,RecordID,Minutes,DiasABP,FiO2,GCS,HCT,HR,MAP,MechVent,NIDiasABP,...,NIDiasABP_mask,NIMAP_mask,NISysABP_mask,PaCO2_mask,PaO2_mask,RespRate_mask,SysABP_mask,Temp_mask,Urine_mask,pH_mask
0,132539,7,58.0,0.5,15.0,33.7,73.0,78.0,1.0,65.0,...,1,1,1,0,0,1,0,1,1,0
1,132539,37,58.0,0.5,15.0,33.7,77.0,78.0,1.0,58.0,...,1,1,1,0,0,1,0,1,1,0
2,132539,97,58.0,0.5,15.0,33.7,60.0,78.0,1.0,62.0,...,1,1,1,0,0,1,0,0,1,0
3,132539,157,58.0,0.5,15.0,33.7,62.0,78.0,1.0,52.0,...,1,1,1,0,0,1,0,0,1,0
4,132539,188,58.0,0.5,15.0,33.7,62.0,78.0,1.0,52.0,...,0,0,0,0,0,0,0,0,0,0


#### 5.2.3 Add Derived Features

This section focuses on creating new, potentially more informative features from existing numeric ones in the dynamic data. This includes calculating metrics like Shock Index, Pulse Pressure, and PaO2/FiO2 ratio, among others. These derived features are designed to capture clinically relevant physiological states and relationships, which can be important for modeling patient conditions over time.
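
As a worked example of three of these ratios, the sketch below uses two hypothetical rows, one of which has a zero systolic pressure to show how a zero denominator is handled. It assumes FiO2 is stored as a fraction (for example 0.5), as in the rows displayed earlier.

```python
import numpy as np
import pandas as pd

# Two toy observations (illustrative values); the second has SysABP == 0
toy = pd.DataFrame({
    "HR":      [120.0, 90.0],
    "SysABP":  [80.0, 0.0],
    "DiasABP": [50.0, 0.0],
    "PaO2":    [90.0, 100.0],
    "FiO2":    [0.5, 0.4],
})

# Zero denominators become NaN so the ratio is NaN rather than inf
toy["ShockIndex"] = toy["HR"] / toy["SysABP"].replace(0, np.nan)
toy["PulsePressure"] = toy["SysABP"] - toy["DiasABP"]
toy["PaO2_FiO2"] = toy["PaO2"] / toy["FiO2"].replace(0, np.nan)
print(toy[["ShockIndex", "PulsePressure", "PaO2_FiO2"]])
```

The first row gives a Shock Index of 1.5 and a PaO2/FiO2 ratio of 180. The second row's Shock Index is NaN, which is why the derived columns need their own imputation pass afterwards.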

In [None]:
def create_derived_features(df):
    df = df.copy()

    # Replace zero denominators with NaN so the ratios yield NaN instead of inf
    df['ShockIndex'] = df['HR'] / df['SysABP'].replace(0, np.nan) # Early shock detection
    df['PulsePressure'] = df['SysABP'] - df['DiasABP'] # Cardiac output and vascular tone
    df['MeanSysRatio'] = df['MAP'] / df['SysABP'].replace(0, np.nan) # Reflects relative perfusion pressure
    df['PaO2_FiO2'] = df['PaO2'] / df['FiO2'].replace(0, np.nan) # Oxygenation efficiency (ARDS marker)
    df['RespQuotient'] = df['PaCO2'] / df['PaO2'].replace(0, np.nan) # Gas exchange efficiency; imbalance → failure
    df['Temp_HR'] = df['HR'] / df['Temp'].replace(0, np.nan) # HR relative to temperature; detects sepsis or compensatory hyperthermia
    df['GCS_MAP'] = df['GCS'] / df['MAP'].replace(0, np.nan) # Cerebral perfusion proxy

    return df

dynamic_data_derived = create_derived_features(dynamic_data_filled)

# Apply bi-directional fill only for derived features to handle NaNs created during calculation, grouping by RecordID
derived_cols = ['ShockIndex', 'PulsePressure', 'MeanSysRatio', 'PaO2_FiO2', 'RespQuotient', 'Temp_HR', 'GCS_MAP']
dynamic_data_derived = dynamic_data_derived.groupby('RecordID').apply(
    lambda group: bi_directional_fill(group, derived_cols, 'Minutes')
).reset_index(drop=True)

# Apply hybrid imputation to fill any remaining missing values
dynamic_data_derived = hybrid_impute(dynamic_data_derived, derived_cols)

# Check for remaining missing values in all columns
print(dynamic_data_derived.isnull().sum())

RecordID          0
Minutes           0
DiasABP           0
FiO2              0
GCS               0
HCT               0
HR                0
MAP               0
MechVent          0
NIDiasABP         0
NIMAP             0
NISysABP          0
PaCO2             0
PaO2              0
RespRate          0
SysABP            0
Temp              0
Urine             0
pH                0
DiasABP_mask      0
FiO2_mask         0
GCS_mask          0
HCT_mask          0
HR_mask           0
MAP_mask          0
MechVent_mask     0
NIDiasABP_mask    0
NIMAP_mask        0
NISysABP_mask     0
PaCO2_mask        0
PaO2_mask         0
RespRate_mask     0
SysABP_mask       0
Temp_mask         0
Urine_mask        0
pH_mask           0
ShockIndex        0
PulsePressure     0
MeanSysRatio      0
PaO2_FiO2         0
RespQuotient      0
Temp_HR           0
GCS_MAP           0
dtype: int64


#### 5.2.4 Binning

This section discretizes several continuous dynamic features, specifically Heart Rate (HR), Mean Arterial Pressure (MAP), and Glasgow Coma Scale (GCS), into clinically relevant bins (e.g., 'low', 'normal', 'high' for HR). This process helps to handle outliers and simplifies the features for modeling. Ordinal encoding is then applied to these binned features to maintain a meaningful order based on clinical severity.
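
The sketch below shows where edge values land when `pd.cut` is called with its default `right=True`, using the HR and GCS bin edges from this section. The `values` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Values chosen on or next to the bin edges (illustrative only)
values = pd.DataFrame({
    "HR":  [0.0, 60.0, 61.0, 100.0, 141.0],
    "GCS": [3.0, 8.0, 9.0, 13.0, 14.0],
})

# Default right=True: each bin excludes its left edge and includes its right edge
hr_bin = pd.cut(values["HR"], bins=[-1, 60, 100, 140, np.inf],
                labels=["low", "normal", "high", "critical"])
gcs_bin = pd.cut(values["GCS"], bins=[0, 8, 13, 15],
                 labels=["severe", "moderate", "mild"])

print(pd.DataFrame({"HR": values["HR"], "HR_bin": hr_bin,
                    "GCS": values["GCS"], "GCS_bin": gcs_bin}))
```

With right-closed bins, an HR of exactly 60 is 'low' and a GCS of 13 is 'moderate'. The lower edge of -1 for HR is what keeps a reading of 0 inside the first bin. If the intended convention is that a GCS of 13 to 15 counts as 'mild', the middle edge would need to be 12 instead of 13.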

In [None]:
print('HR: ', dynamic_data_derived['HR'].min(), ' - ', dynamic_data_derived['HR'].max())
print('MAP: ', dynamic_data_derived['MAP'].min(), ' - ', dynamic_data_derived['MAP'].max())
print('GCS: ', dynamic_data_derived['GCS'].min(), ' - ', dynamic_data_derived['GCS'].max())

HR:  0.0  -  300.0
MAP:  0.0  -  300.0
GCS:  3.0  -  15.0


In [None]:
def bin_features(df):
    df = df.copy()
    # Bin HR: low (<= 60), normal (60-100], high (100-140], critical (> 140); the -1 edge keeps HR == 0 in 'low'
    df['HR_bin'] = pd.cut(df['HR'], bins=[-1, 60, 100, 140, np.inf], labels=['low', 'normal', 'high', 'critical'])
    # Bin MAP: shock (<= 60), borderline (60-70], normal (70-90], elevated (> 90)
    df['MAP_bin'] = pd.cut(df['MAP'], bins=[-1, 60, 70, 90, np.inf], labels=['shock', 'borderline', 'normal', 'elevated'])
    # Bin GCS: severe (<= 8), moderate (8-13], mild (13-15]
    df['GCS_bin'] = pd.cut(df['GCS'], bins=[0, 8, 13, 15], labels=['severe', 'moderate', 'mild'])
    return df

dynamic_data_binned = bin_features(dynamic_data_derived)

# Define the category order for ordinal encoding (the position in each list becomes the code)
ordered_categories = [
    ['low', 'normal', 'high', 'critical'],          # HR_bin
    ['shock', 'borderline', 'normal', 'elevated'],  # MAP_bin
    ['mild', 'moderate', 'severe']                  # GCS_bin
]

# Apply ordinal encoding to the binned features
encoder = OrdinalEncoder(categories=ordered_categories)
dynamic_data_binned[['HR_bin_enc', 'MAP_bin_enc', 'GCS_bin_enc']] = encoder.fit_transform(
    dynamic_data_binned[['HR_bin', 'MAP_bin', 'GCS_bin']]
)

# Print the mapping of original bin labels to encoded integers
for col, categories in zip(['HR_bin', 'MAP_bin', 'GCS_bin'], encoder.categories_):
    print(f"{col} mapping:")
    for i, label in enumerate(categories):
        print(f"  {label} → {i}")

HR_bin mapping:
  low → 0
  normal → 1
  high → 2
  critical → 3
MAP_bin mapping:
  shock → 0
  borderline → 1
  normal → 2
  elevated → 3
GCS_bin mapping:
  mild → 0
  moderate → 1
  severe → 2
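
Because `pd.cut` uses right-closed intervals by default, values that sit exactly on a bin edge fall into the lower bin. The sketch below reuses the edges and labels from `bin_features` on a few made-up values (the `toy` frame is illustrative, not taken from the dataset) to show where boundary readings land.

```python
import numpy as np
import pandas as pd

# Toy values chosen to sit on or next to the bin edges (illustrative only)
toy = pd.DataFrame({
    'HR': [0.0, 60.0, 60.5, 100.0, 141.0],
    'GCS': [3.0, 8.0, 9.0, 13.0, 14.0],
})

# Same edges and labels as in bin_features
toy['HR_bin'] = pd.cut(toy['HR'], bins=[-1, 60, 100, 140, np.inf],
                       labels=['low', 'normal', 'high', 'critical'])
toy['GCS_bin'] = pd.cut(toy['GCS'], bins=[0, 8, 13, 15],
                        labels=['severe', 'moderate', 'mild'])

print(toy)
```

With these edges, an HR of exactly 60 is labelled 'low' and a GCS of exactly 13 is labelled 'moderate'.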


#### 5.2.5 Clean Data

This section drops columns that are no longer needed once the derived and encoded features exist. The original columns 'SysABP', 'DiasABP', 'MAP', 'PaO2', 'PaCO2', 'FiO2', 'HR', 'Temp', and 'GCS', as well as the intermediate binned versions ('HR_bin', 'MAP_bin', 'GCS_bin'), are removed, while their mask columns, the derived features, and the encoded bins are kept. This streamlines the dataset for the steps that follow.

In [None]:
# Define the list of columns to drop from the dynamic data after binning and feature engineering
col_drop = ['SysABP', 'DiasABP', 'MAP', 'PaO2', 'PaCO2', 'FiO2', 'HR', 'Temp', 'GCS', 'HR_bin', 'MAP_bin', 'GCS_bin']

# Drop the specified columns from the dynamic data DataFrame
dynamic_data_binned.drop(columns=col_drop, inplace=True)

# Display the first few rows of the cleaned dynamic DataFrame
dynamic_data_binned.head()

Unnamed: 0,RecordID,Minutes,HCT,MechVent,NIDiasABP,NIMAP,NISysABP,RespRate,Urine,pH,...,ShockIndex,PulsePressure,MeanSysRatio,PaO2_FiO2,RespQuotient,Temp_HR,GCS_MAP,HR_bin_enc,MAP_bin_enc,GCS_bin_enc
0,132539,7,33.7,1.0,65.0,92.33,147.0,19.0,900.0,7.39,...,0.62931,58.0,0.672414,234.0,0.333333,2.079772,0.192308,1.0,2.0,0.0
1,132539,37,33.7,1.0,58.0,91.0,157.0,19.0,60.0,7.39,...,0.663793,58.0,0.672414,234.0,0.333333,2.162921,0.192308,1.0,2.0,0.0
2,132539,97,33.7,1.0,62.0,87.0,137.0,18.0,30.0,7.39,...,0.517241,58.0,0.672414,234.0,0.333333,1.685393,0.192308,0.0,2.0,0.0
3,132539,157,33.7,1.0,52.0,75.67,123.0,19.0,170.0,7.39,...,0.534483,58.0,0.672414,234.0,0.333333,1.741573,0.192308,1.0,2.0,0.0
4,132539,188,33.7,1.0,52.0,75.67,123.0,19.0,170.0,7.39,...,0.534483,58.0,0.672414,234.0,0.333333,1.741573,0.192308,1.0,2.0,0.0


In [None]:
dynamic_data_binned.columns

Index(['RecordID', 'Minutes', 'HCT', 'MechVent', 'NIDiasABP', 'NIMAP',
       'NISysABP', 'RespRate', 'Urine', 'pH', 'DiasABP_mask', 'FiO2_mask',
       'GCS_mask', 'HCT_mask', 'HR_mask', 'MAP_mask', 'MechVent_mask',
       'NIDiasABP_mask', 'NIMAP_mask', 'NISysABP_mask', 'PaCO2_mask',
       'PaO2_mask', 'RespRate_mask', 'SysABP_mask', 'Temp_mask', 'Urine_mask',
       'pH_mask', 'ShockIndex', 'PulsePressure', 'MeanSysRatio', 'PaO2_FiO2',
       'RespQuotient', 'Temp_HR', 'GCS_MAP', 'HR_bin_enc', 'MAP_bin_enc',
       'GCS_bin_enc'],
      dtype='object')
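
One practical note on the drop cell: an in-place `drop` raises a `KeyError` if the cell is run a second time, because the columns are already gone. A minimal sketch of a re-runnable variant, assuming it is acceptable to silently skip missing columns, uses `errors='ignore'` and returns a new frame (the `toy` frame and its values are made up for illustration).

```python
import pandas as pd

# Toy frame standing in for dynamic_data_binned (illustrative values)
toy = pd.DataFrame({
    'RecordID': [1],
    'HR': [80.0],
    'HR_bin': ['normal'],
    'HR_bin_enc': [1.0],
    'HR_mask': [1],
})

# 'GCS' is absent from the toy frame on purpose
col_drop = ['HR', 'HR_bin', 'GCS']

# errors='ignore' skips labels that are not present instead of raising KeyError
cleaned = toy.drop(columns=col_drop, errors='ignore')
print(cleaned.columns.tolist())
```

The trade-off is that a misspelled column name would also be skipped silently, so this variant is best paired with a check of the remaining columns.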

#### 5.2.6 Windowing

This section prepares the dynamic data for sequence modeling by aggregating it into 48 consecutive 60-minute windows covering the first 48 hours (0 to 2880 minutes). Each record is assigned to a time bin, and the features within each bin are aggregated with a function suited to the feature type (e.g., mean for numerical features, sum for urine output, max for masks). Bins with no observations are then filled by applying the bi-directional fill again after aggregation, so that every time window for every patient has a complete set of values. The result is a fixed-length sequence of "patches" or "windows" per patient, which suits transformer models that typically expect fixed-length input sequences.

In [None]:
# Define 60-minute buckets over 48 hours

# Create bin edges from 0 to 2880 minutes at 60-minute intervals (49 edges, 48 bins)
bin_edges = np.arange(0, 2881, 60)
bin_labels = [f"{i}-{i+60}" for i in bin_edges[:-1]]

# Assign each data point to a time bin based on the 'Minutes' column
# (rows with Minutes above 2880 fall outside the edges and get NaN)
dynamic_data_binned['TimeBin'] = pd.cut(
    dynamic_data_binned['Minutes'],
    bins=bin_edges,
    labels=bin_labels,
    include_lowest=True,  # Minute 0 belongs to the first bin
    right=True            # Each bin includes its right edge, e.g. minute 60 is in '0-60'
)

dynamic_data_binned.head()

Unnamed: 0,RecordID,Minutes,HCT,MechVent,NIDiasABP,NIMAP,NISysABP,RespRate,Urine,pH,...,PulsePressure,MeanSysRatio,PaO2_FiO2,RespQuotient,Temp_HR,GCS_MAP,HR_bin_enc,MAP_bin_enc,GCS_bin_enc,TimeBin
0,132539,7,33.7,1.0,65.0,92.33,147.0,19.0,900.0,7.39,...,58.0,0.672414,234.0,0.333333,2.079772,0.192308,1.0,2.0,0.0,0-60
1,132539,37,33.7,1.0,58.0,91.0,157.0,19.0,60.0,7.39,...,58.0,0.672414,234.0,0.333333,2.162921,0.192308,1.0,2.0,0.0,0-60
2,132539,97,33.7,1.0,62.0,87.0,137.0,18.0,30.0,7.39,...,58.0,0.672414,234.0,0.333333,1.685393,0.192308,0.0,2.0,0.0,60-120
3,132539,157,33.7,1.0,52.0,75.67,123.0,19.0,170.0,7.39,...,58.0,0.672414,234.0,0.333333,1.741573,0.192308,1.0,2.0,0.0,120-180
4,132539,188,33.7,1.0,52.0,75.67,123.0,19.0,170.0,7.39,...,58.0,0.672414,234.0,0.333333,1.741573,0.192308,1.0,2.0,0.0,180-240
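
To make the edge handling concrete, the following sketch applies the same `bin_edges` and `bin_labels` to a handful of hand-picked minute values (the `minutes` series is illustrative, not from the dataset).

```python
import numpy as np
import pandas as pd

# Same edges and labels as in the windowing cell
bin_edges = np.arange(0, 2881, 60)
bin_labels = [f"{i}-{i+60}" for i in bin_edges[:-1]]

# Hand-picked minute values on and around the edges (illustrative only)
minutes = pd.Series([0, 60, 61, 2880, 2881])
time_bin = pd.cut(minutes, bins=bin_edges, labels=bin_labels,
                  include_lowest=True, right=True)

print(pd.DataFrame({'Minutes': minutes, 'TimeBin': time_bin}))
```

Minutes 0 and 60 both land in '0-60', minute 61 starts '60-120', minute 2880 closes the last bin, and minute 2881 is out of range and becomes NaN.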


In [None]:
# Aggregation Logic per Feature Type

# Custom aggregation to preserve NaN for missing time bins
def sum_with_nan(x):
    return np.nan if len(x) == 0 or x.isna().all() else x.sum()

# Define aggregation functions for different types of features
agg_funcs = {
    # Numeric vital signs and labs: mean over the bin (urine is summed to give total output)
    'RespRate': 'mean', 'Urine': sum_with_nan,
    'pH': 'mean', 'HCT': 'mean',
    'NIDiasABP': 'mean', 'NIMAP': 'mean', 'NISysABP': 'mean',

    # Mask flags: 1 if any value observed in bin
    'HR_mask': 'max', 'MAP_mask': 'max', 'GCS_mask': 'max', 'Urine_mask': 'max',
    'pH_mask': 'max', 'HCT_mask': 'max', 'FiO2_mask': 'max', 'PaO2_mask': 'max',
    'PaCO2_mask': 'max', 'RespRate_mask': 'max', 'SysABP_mask': 'max', 'Temp_mask': 'max',
    'NIDiasABP_mask': 'max', 'NIMAP_mask': 'max', 'NISysABP_mask': 'max', 'DiasABP_mask': 'max',

    # Derived features: average within bin
    'ShockIndex': 'mean', 'PulsePressure': 'mean', 'MeanSysRatio': 'mean',
    'PaO2_FiO2': 'mean', 'RespQuotient': 'mean', 'Temp_HR': 'mean', 'GCS_MAP': 'mean',

    # Binary signals: max (1 if ventilated at any point in the bin)
    'MechVent': 'max', 'MechVent_mask': 'max',

    # Binned features: 'last' keeps the most recent HR category in the bin.
    # 'min' keeps the lowest code: for MAP_bin_enc that is the lowest-pressure category (shock = 0),
    # but for GCS_bin_enc it is the least severe category (mild = 0), not the worst one;
    # 'max' would be needed to keep the most severe GCS category.
    'HR_bin_enc': 'last', 'MAP_bin_enc': 'min', 'GCS_bin_enc': 'min'
}

# Group by RecordID and TimeBin and aggregate. TimeBin is categorical, so observed=False
# (the default in older pandas versions) keeps empty bins as NaN rows to be filled in the next step
patch_level_df = dynamic_data_binned.groupby(['RecordID', 'TimeBin'], observed=False).agg(agg_funcs).reset_index()

patch_level_df.head()

Unnamed: 0,RecordID,TimeBin,RespRate,Urine,pH,HCT,NIDiasABP,NIMAP,NISysABP,HR_mask,...,MeanSysRatio,PaO2_FiO2,RespQuotient,Temp_HR,GCS_MAP,MechVent,MechVent_mask,HR_bin_enc,MAP_bin_enc,GCS_bin_enc
0,132539,0-60,19.0,960.0,7.39,33.7,61.5,91.665,152.0,1.0,...,0.672414,234.0,0.333333,2.121347,0.192308,1.0,0.0,1.0,2.0,0.0
1,132539,60-120,18.0,30.0,7.39,33.7,62.0,87.0,137.0,1.0,...,0.672414,234.0,0.333333,1.685393,0.192308,1.0,0.0,0.0,2.0,0.0
2,132539,120-180,19.0,170.0,7.39,33.7,52.0,75.67,123.0,1.0,...,0.672414,234.0,0.333333,1.741573,0.192308,1.0,0.0,1.0,2.0,0.0
3,132539,180-240,19.5,230.0,7.39,33.7,52.0,74.17,118.5,1.0,...,0.672414,234.0,0.333333,1.928988,0.192308,1.0,0.0,1.0,2.0,0.0
4,132539,240-300,20.0,60.0,7.39,33.7,52.0,72.67,114.0,1.0,...,0.672414,234.0,0.333333,1.957672,0.192308,1.0,0.0,1.0,2.0,0.0
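
The reason empty windows show up at all is that `TimeBin` is a categorical column: when unobserved categories are kept in the groupby, every patient gets a row for every bin, and bins without measurements come out as NaN. The sketch below illustrates this on a small made-up frame (the `toy` data, the three-bin edges, and the reduced aggregation dictionary are assumptions for illustration), reusing `sum_with_nan` as defined above.

```python
import numpy as np
import pandas as pd

# Same helper as in the aggregation cell
def sum_with_nan(x):
    return np.nan if len(x) == 0 or x.isna().all() else x.sum()

# Two patients with measurements in only some of three hourly bins (illustrative)
toy = pd.DataFrame({
    'RecordID': [1, 1, 1, 2],
    'Minutes': [10, 40, 130, 70],
    'Urine': [100.0, 50.0, 30.0, 20.0],
    'RespRate': [18.0, 20.0, 16.0, 22.0],
})
toy['TimeBin'] = pd.cut(toy['Minutes'], bins=[0, 60, 120, 180],
                        labels=['0-60', '60-120', '120-180'],
                        include_lowest=True)

# observed=False keeps bins that have no rows for a given patient
agg = (toy.groupby(['RecordID', 'TimeBin'], observed=False)
          .agg({'Urine': sum_with_nan, 'RespRate': 'mean'})
          .reset_index())
print(agg)
```

Each patient ends up with three rows, one per bin; the bins without measurements carry NaN in both columns and are the "sparse bins" that the next step fills.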


In [None]:
# Extract only feature columns (exclude group identifiers)
patch_level_feature_cols = patch_level_df.columns.difference(['RecordID', 'TimeBin'])

# Handle sparse bins by applying bidirectional fill, grouping by RecordID
patch_level_df = patch_level_df.groupby('RecordID').apply(
    lambda group: bi_directional_fill(group, patch_level_feature_cols, 'TimeBin')
).reset_index(drop=True)

# Check for any remaining missing values after filling sparse bins
print(patch_level_df.isnull().sum())

RecordID          0
TimeBin           0
RespRate          0
Urine             0
pH                0
HCT               0
NIDiasABP         0
NIMAP             0
NISysABP          0
HR_mask           0
MAP_mask          0
GCS_mask          0
Urine_mask        0
pH_mask           0
HCT_mask          0
FiO2_mask         0
PaO2_mask         0
PaCO2_mask        0
RespRate_mask     0
SysABP_mask       0
Temp_mask         0
NIDiasABP_mask    0
NIMAP_mask        0
NISysABP_mask     0
DiasABP_mask      0
ShockIndex        0
PulsePressure     0
MeanSysRatio      0
PaO2_FiO2         0
RespQuotient      0
Temp_HR           0
GCS_MAP           0
MechVent          0
MechVent_mask     0
HR_bin_enc        0
MAP_bin_enc       0
GCS_bin_enc       0
dtype: int64
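
The `bi_directional_fill` helper used above is defined earlier in the notebook. As a rough illustration of the idea only, the sketch below uses a hypothetical stand-in, `fill_both_ways`, that forward-fills and then backward-fills within each patient; it assumes rows are already in time order and is not the notebook's actual implementation.

```python
import numpy as np
import pandas as pd

# Toy patch-level frame with empty windows (illustrative values)
toy = pd.DataFrame({
    'RecordID': [1, 1, 1, 2, 2, 2],
    'TimeBin': ['0-60', '60-120', '120-180'] * 2,
    'RespRate': [np.nan, 18.0, np.nan, 22.0, np.nan, np.nan],
})

# Hypothetical stand-in for the notebook's bi_directional_fill helper
def fill_both_ways(df, cols, group_col='RecordID'):
    out = df.copy()
    # Forward fill carries the last seen value ahead; backward fill covers
    # leading gaps. Both stay within one patient's rows.
    out[cols] = out.groupby(group_col)[cols].transform(lambda s: s.ffill().bfill())
    return out

filled = fill_both_ways(toy, ['RespRate'])
print(filled)
```

Filling within each `RecordID` group keeps one patient's values from leaking into another patient's windows.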


### 5.3 Final Dataframe

This section focuses on bringing together the preprocessed static and dynamic data into a single DataFrame ready for modeling. This involves merging the static data with the dynamic data based on patient ID, and then normalizing the continuous numerical features using StandardScaler. Finally, the cleaned and combined DataFrame is saved as a CSV file for future use.

#### 5.3.1 Merge Static and Dynamic DataFrames

This section merges the static patient data into each record of the dynamic data based on the common 'RecordID'. This combines the time-invariant patient information with their time-series physiological measurements into a single DataFrame for subsequent modeling.

In [None]:
# Merge static into each dynamic record using a left merge on 'RecordID'
final_combined_df = pd.merge(patch_level_df, static_data_cleaned_df, on='RecordID', how='left')

final_combined_df.head()

Unnamed: 0,RecordID,TimeBin,RespRate,Urine,pH,HCT,NIDiasABP,NIMAP,NISysABP,HR_mask,...,MechVent_mask,HR_bin_enc,MAP_bin_enc,GCS_bin_enc,Gender,ICUType,In-hospital_death,Height_missing,Weight_missing,AgeBMI_token_encoded
0,132539,0-60,19.0,960.0,7.39,33.7,61.5,91.665,152.0,1.0,...,0.0,1.0,2.0,0.0,0,4,0,1,1,17
1,132539,60-120,18.0,30.0,7.39,33.7,62.0,87.0,137.0,1.0,...,0.0,0.0,2.0,0.0,0,4,0,1,1,17
2,132539,120-180,19.0,170.0,7.39,33.7,52.0,75.67,123.0,1.0,...,0.0,1.0,2.0,0.0,0,4,0,1,1,17
3,132539,180-240,19.5,230.0,7.39,33.7,52.0,74.17,118.5,1.0,...,0.0,1.0,2.0,0.0,0,4,0,1,1,17
4,132539,240-300,20.0,60.0,7.39,33.7,52.0,72.67,114.0,1.0,...,0.0,1.0,2.0,0.0,0,4,0,1,1,17


In [None]:
final_combined_df.columns

Index(['RecordID', 'TimeBin', 'RespRate', 'Urine', 'pH', 'HCT', 'NIDiasABP',
       'NIMAP', 'NISysABP', 'HR_mask', 'MAP_mask', 'GCS_mask', 'Urine_mask',
       'pH_mask', 'HCT_mask', 'FiO2_mask', 'PaO2_mask', 'PaCO2_mask',
       'RespRate_mask', 'SysABP_mask', 'Temp_mask', 'NIDiasABP_mask',
       'NIMAP_mask', 'NISysABP_mask', 'DiasABP_mask', 'ShockIndex',
       'PulsePressure', 'MeanSysRatio', 'PaO2_FiO2', 'RespQuotient', 'Temp_HR',
       'GCS_MAP', 'MechVent', 'MechVent_mask', 'HR_bin_enc', 'MAP_bin_enc',
       'GCS_bin_enc', 'Gender', 'ICUType', 'In-hospital_death',
       'Height_missing', 'Weight_missing', 'AgeBMI_token_encoded'],
      dtype='object')
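
A left merge silently produces NaN static columns for any `RecordID` missing from the static table, and it duplicates rows if the static table has repeated IDs. A minimal sketch of how both cases could be checked, using the `validate` and `indicator` options of `pd.merge` on made-up frames (`dynamic` and `static` are illustrative stand-ins for `patch_level_df` and `static_data_cleaned_df`):

```python
import pandas as pd

# Illustrative stand-ins for the patch-level and static frames
dynamic = pd.DataFrame({
    'RecordID': [1, 1, 2, 3],
    'TimeBin': ['0-60', '60-120', '0-60', '0-60'],
    'RespRate': [19.0, 18.0, 22.0, 16.0],
})
static = pd.DataFrame({'RecordID': [1, 2], 'Gender': [0, 1]})

# validate raises MergeError if RecordID is not unique in the static frame;
# indicator adds a '_merge' column showing where each row found a match
merged = pd.merge(dynamic, static, on='RecordID', how='left',
                  validate='many_to_one', indicator=True)

unmatched = merged.loc[merged['_merge'] == 'left_only', 'RecordID'].unique()
print("Rows before/after merge:", len(dynamic), len(merged))
print("RecordIDs without static data:", list(unmatched))
```

In this toy case the row count is unchanged and patient 3 is reported as having no static record.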

#### 5.3.2 Normalization

This section standardizes a defined list of continuous numerical features in the combined DataFrame using StandardScaler, which rescales each listed column to a mean of 0 and a standard deviation of 1. Putting features measured in different units on a comparable scale generally helps gradient-based models, including transformers, train more stably. Binary masks, encoded categorical features, and identifiers are left unscaled. Note that the scaler here is fit on all rows of the combined DataFrame.

In [None]:
# Define the list of columns to be scaled
scale_cols = [
    'RespRate', 'Urine', 'pH', 'HCT',
    'NIDiasABP', 'NIMAP', 'NISysABP',
    'ShockIndex', 'PulsePressure', 'MeanSysRatio',
    'PaO2_FiO2', 'RespQuotient', 'Temp_HR', 'GCS_MAP'
]

# Initialize the StandardScaler
scaler = StandardScaler()

# Apply StandardScaler to the specified columns
final_combined_df[scale_cols] = scaler.fit_transform(final_combined_df[scale_cols])

In [None]:
final_combined_df.head()

Unnamed: 0,RecordID,TimeBin,RespRate,Urine,pH,HCT,NIDiasABP,NIMAP,NISysABP,HR_mask,...,MechVent_mask,HR_bin_enc,MAP_bin_enc,GCS_bin_enc,Gender,ICUType,In-hospital_death,Height_missing,Weight_missing,AgeBMI_token_encoded
0,132539,0-60,-0.062473,1.31487,-0.01841,0.415878,0.330632,1.115615,1.523818,1.0,...,0.0,1.0,2.0,0.0,0,4,0,1,1,17
1,132539,60-120,-0.405697,-0.3844,-0.01841,0.415878,0.366022,0.789919,0.887245,1.0,...,0.0,0.0,2.0,0.0,0,4,0,1,1,17
2,132539,120-180,-0.062473,-0.128596,-0.01841,0.415878,-0.341788,-0.001107,0.293111,1.0,...,0.0,1.0,2.0,0.0,0,4,0,1,1,17
3,132539,180-240,0.109139,-0.018966,-0.01841,0.415878,-0.341788,-0.105833,0.10214,1.0,...,0.0,1.0,2.0,0.0,0,4,0,1,1,17
4,132539,240-300,0.280751,-0.329585,-0.01841,0.415878,-0.341788,-0.210558,-0.088832,1.0,...,0.0,1.0,2.0,0.0,0,4,0,1,1,17
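
For reference, the transformation StandardScaler applies to each column is the z-score, $z = (x - \mu) / \sigma$, where $\sigma$ is the population standard deviation (`ddof=0`). The sketch below checks this against a manual computation on made-up values (the `toy` frame is illustrative only).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy continuous features on very different scales (illustrative values)
toy = pd.DataFrame({
    'RespRate': [12.0, 18.0, 24.0, 30.0],
    'Urine': [0.0, 60.0, 120.0, 900.0],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Manual z-score with the population standard deviation (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

print("Matches manual z-score:", np.allclose(scaled, manual.to_numpy()))
print("Column means:", scaled.mean(axis=0).round(6))
print("Column stds: ", scaled.std(axis=0).round(6))
```

The fitted `scaler` stores the per-column means and scales (`scaler.mean_`, `scaler.scale_`), so the same transformation can be reapplied to new data with `scaler.transform`.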


#### 5.3.3 Save Cleaned DataFrame

This section saves the final combined and preprocessed DataFrame as a CSV file. This allows the cleaned data to be easily loaded and used for subsequent modeling steps without needing to rerun the entire preprocessing pipeline.

In [None]:
# Save final DataFrame as CSV file

# Note: use below code if running in the local machine
# features_dir = r'../data/features'

# Note: use below code if running in the Google colab
features_dir = os.path.join(basePath, 'data', 'features')

# Full path of the output file
final_combined_file = os.path.join(features_dir, 'transformer_features_data_df.csv')

# Save the final combined DataFrame to the specified path
final_combined_df.to_csv(final_combined_file, index=False)
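
`to_csv` does not create missing directories, and a CSV round trip does not preserve pandas dtypes such as the categorical `TimeBin`. The sketch below shows both points on a made-up frame written to a temporary directory (the `toy` frame, the temporary path, and the file name `toy_features.csv` are illustrative assumptions).

```python
import os
import tempfile
import pandas as pd

# Toy frame with a categorical TimeBin, as produced by pd.cut (illustrative)
toy = pd.DataFrame({
    'RecordID': [1, 1],
    'TimeBin': pd.Categorical(['0-60', '60-120']),
    'RespRate': [0.1, -0.2],
})

with tempfile.TemporaryDirectory() as tmp:
    features_dir = os.path.join(tmp, 'data', 'features')
    # Create the target directory first; to_csv fails if it does not exist
    os.makedirs(features_dir, exist_ok=True)
    out_file = os.path.join(features_dir, 'toy_features.csv')
    toy.to_csv(out_file, index=False)
    reloaded = pd.read_csv(out_file)

print(reloaded.dtypes)
```

After reloading, `TimeBin` comes back as plain text rather than a categorical, so any code that relies on the category order needs to restore it explicitly.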

## References

Cleveland Clinic. (2022, September 5). *Body Mass Index (BMI)*. Cleveland Clinic. https://my.clevelandclinic.org/health/articles/9464-body-mass-index-bmi

Geifman, N., Cohen, R., & Rubin, E. (2013). Redefining meaningful age groups in the context of disease. *AGE, 35*(6), 2357–2366. https://doi.org/10.1007/s11357-013-9510-6

Staff, E. (2025, April 25). *Shock index formula: A practical tool for evaluating signs of shock*. EMS1. https://www.ems1.com/clinical/shock-index-formula-a-practical-tool-for-early-detection-of-shock

Kouz, K., Scheeren, T. W. L., de Backer, D., & Saugel, B. (2020). Pulse Wave Analysis to Estimate Cardiac Output. *Anesthesiology, 134*(1), 119–126. https://doi.org/10.1097/aln.0000000000003553

Bienvenu, N. E. L., Bibiche, K. K., Prince, K. D., Médard, B. I., Jerry, M., Meyer, C., & others. (2024). Assessment of ARDS severity: PaO₂/FiO₂ versus PaO₂/lactate. *International Journal of Clinical Anesthesiology, 12*(2), 1133. https://www.jscimedcentral.com/jounal-article-info/International-Journal-of-Clinical-Anesthesiology/Assessment-of-ARDS-Severity:-PaO2-FiO2-versus-PaO2-Lactate-12059

Sharma, S., Hashmi, M. F., & Burns, B. (2019, September 13). *Alveolar Gas Equation*. Nih.gov; StatPearls Publishing. https://www.ncbi.nlm.nih.gov/books/NBK482268/

Heydari, F., Azizkhani, R., Ahmadi, O., Majidinejad, S., Nasr-Esfahani, M., & Ahmadi, A. (2021). Physiologic Scoring Systems versus Glasgow Coma Scale in Predicting In-Hospital Mortality of Trauma Patients; a Diagnostic Accuracy Study. *PubMed, 9*(1), e64–e64. https://doi.org/10.22037/aaem.v9i1.1376