# 04b: Data Preparation for CNN-LSTM

This notebook prepares the cleaned ICU time series and static data for input into a CNN-LSTM model. 

## Imports and Configuration

Import the libraries used for data preparation (pandas, NumPy, scikit-learn, imbalanced-learn, and TensorFlow), set random seeds for reproducibility, and configure display and plotting defaults.

## Workflow Overview

The workflow consists of the following steps:

1. **Data Loading and Audit:** Load cleaned data, preview, and check types.
2. **Feature Selection:** Select EDA-driven dynamic and static features.
3. **Log Transformation:** Apply log1p to skewed static features.
4. **Sequence Construction:** Group by patient, extract sequences and static features.
5. **Padding and Scaling:** Pad sequences and scale features.
6. **Train/Val/Test Split:** Patient-index-based splitting for alignment.
7. **Class Imbalance Handling:** Apply SMOTE to training set.
8. **Save Processed Data:** Store arrays for modeling.



In [65]:
# Imports and Configuration
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)
tf.random.set_seed(42)
pd.set_option('display.max_columns', 100)
sns.set(style='whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## 1. Data Loading and Initial Audit

Load the cleaned ICU dataset and missingness mask. Preview the first few rows and check data types to confirm correct loading. Summarize the shape and schema of the dataset for transparency.

In [66]:
# Load cleaned data
data = pd.read_csv('../data/processed/timeseries_cleaned_all_features.csv')
mask = pd.read_csv('../data/processed/timeseries_missingness_mask.csv')
print('Data shape:', data.shape)

# Preview
display(data.head())
print(data.dtypes)

Data shape: (295354, 48)


Unnamed: 0,RecordID,Minutes,ALP,ALT,AST,Albumin,BUN,Bilirubin,Cholesterol,Creatinine,DiasABP,FiO2,GCS,Glucose,HCO3,HCT,HR,K,Lactate,MAP,MechVent,Mg,NIDiasABP,NIMAP,NISysABP,Na,PaCO2,PaO2,Platelets,RespRate,SaO2,SysABP,Temp,TroponinI,TroponinT,Urine,WBC,pH,Age,Gender,Height,ICUType,In-hospital_death,Length_of_stay,SAPS-I,SOFA,Survival,Weight
0,132539,7,0.0,0.0,0.0,0.0,13.0,0.0,0.0,0.8,0.0,0.0,15.0,205.0,26.0,33.7,73.0,4.4,0.0,0.0,0.0,1.5,65.0,92.33,147.0,137.0,0.0,0.0,221.0,19.0,0.0,0.0,35.1,0.0,0.0,900.0,11.2,0.0,54.0,0.0,170.2,4.0,0.0,5.0,6.0,1.0,70.0,78.6
1,132539,37,0.0,0.0,0.0,0.0,13.0,0.0,0.0,0.8,0.0,0.0,15.0,205.0,26.0,33.7,77.0,4.4,0.0,0.0,0.0,1.5,58.0,91.0,157.0,137.0,0.0,0.0,221.0,19.0,0.0,0.0,35.6,0.0,0.0,60.0,11.2,0.0,54.0,0.0,170.2,4.0,0.0,5.0,6.0,1.0,70.0,78.6
2,132539,97,0.0,0.0,0.0,0.0,13.0,0.0,0.0,0.8,0.0,0.0,15.0,205.0,26.0,33.7,60.0,4.4,0.0,0.0,0.0,1.5,62.0,87.0,137.0,137.0,0.0,0.0,221.0,18.0,0.0,0.0,35.6,0.0,0.0,30.0,11.2,0.0,54.0,0.0,170.2,4.0,0.0,5.0,6.0,1.0,70.0,78.6
3,132539,157,0.0,0.0,0.0,0.0,13.0,0.0,0.0,0.8,0.0,0.0,15.0,205.0,26.0,33.7,62.0,4.4,0.0,0.0,0.0,1.5,52.0,75.67,123.0,137.0,0.0,0.0,221.0,19.0,0.0,0.0,35.6,0.0,0.0,170.0,11.2,0.0,54.0,0.0,170.2,4.0,0.0,5.0,6.0,1.0,70.0,78.6
4,132539,188,0.0,0.0,0.0,0.0,13.0,0.0,0.0,0.8,0.0,0.0,15.0,205.0,26.0,33.7,62.0,4.4,0.0,0.0,0.0,1.5,52.0,75.67,123.0,137.0,0.0,0.0,221.0,19.0,0.0,0.0,35.6,0.0,0.0,170.0,11.2,0.0,54.0,0.0,170.2,4.0,0.0,5.0,6.0,1.0,70.0,78.6


RecordID               int64
Minutes                int64
ALP                  float64
ALT                  float64
AST                  float64
Albumin              float64
BUN                  float64
Bilirubin            float64
Cholesterol          float64
Creatinine           float64
DiasABP              float64
FiO2                 float64
GCS                  float64
Glucose              float64
HCO3                 float64
HCT                  float64
HR                   float64
K                    float64
Lactate              float64
MAP                  float64
MechVent             float64
Mg                   float64
NIDiasABP            float64
NIMAP                float64
NISysABP             float64
Na                   float64
PaCO2                float64
PaO2                 float64
Platelets            float64
RespRate             float64
SaO2                 float64
SysABP               float64
Temp                 float64
TroponinI            float64
TroponinT     
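
Beyond the head and dtypes, it can be useful to check how many rows each patient contributes and how imbalanced the target is. The sketch below runs on a small made-up frame; `audit_patients` is a hypothetical helper, and only the `RecordID` and `In-hospital_death` column names are taken from the dataset above.

```python
import pandas as pd

def audit_patients(df, id_col="RecordID", target_col="In-hospital_death"):
    """Summarize sequence lengths and class balance per patient (illustrative helper)."""
    rows_per_patient = df.groupby(id_col).size()
    # One label per patient: take the value from the patient's first row
    labels = df.groupby(id_col)[target_col].first()
    return {
        "n_patients": int(rows_per_patient.shape[0]),
        "min_len": int(rows_per_patient.min()),
        "max_len": int(rows_per_patient.max()),
        "positive_rate": float(labels.mean()),
    }

# Toy data standing in for the cleaned table (values are made up)
toy = pd.DataFrame({
    "RecordID": [1, 1, 1, 2, 2, 3],
    "Minutes": [7, 37, 97, 5, 65, 10],
    "HR": [73.0, 77.0, 60.0, 88.0, 90.0, 101.0],
    "In-hospital_death": [0, 0, 0, 1, 1, 0],
})
summary = audit_patients(toy)
print(summary)
```

In the notebook, the same call on `data` would give the range of sequence lengths that padding later has to cover.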

## 2. Feature Selection

Select features based on EDA findings and clinical relevance. Avoid duplicate or already-encoded columns. Document the rationale for each feature group.

- **Dynamic (time series) features:**
    - Cardiovascular: HR, SysABP, DiasABP, MAP, NISysABP, NIDiasABP, NIMAP
    - Respiratory: MechVent, RespRate, SaO2, FiO2, PaO2, PaCO2
    - Renal: Creatinine, BUN, Urine
    - Metabolic/Electrolytes: Na, K, Glucose, Lactate, HCO3, pH
    - Neurological: GCS
    - Other: Temp
- **Static features:**
    - Age, Gender, Height, Weight, ICUType, SAPS-I, SOFA, Length_of_stay, Survival
- **Target:**
    - In-hospital_death (binary)
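
A quick consistency check on the feature groups above can catch typos in column names, or a column that ends up in two groups. This is a minimal sketch: `check_feature_lists` is a hypothetical helper, and the column list is a toy stand-in for `data.columns`.

```python
def check_feature_lists(columns, dynamic, static, target):
    """Return missing columns, group overlaps, and whether the target leaks into the features."""
    columns = set(columns)
    selected = list(dynamic) + list(static) + [target]
    missing = sorted(c for c in selected if c not in columns)
    overlap = sorted(set(dynamic) & set(static))
    target_in_features = target in dynamic or target in static
    return missing, overlap, target_in_features

# Toy column set; in the notebook this would be data.columns
cols = ["RecordID", "Minutes", "HR", "GCS", "Age", "SOFA", "In-hospital_death"]
missing, overlap, target_in_features = check_feature_lists(
    cols,
    dynamic=["HR", "GCS", "Temp"],
    static=["Age", "SOFA"],
    target="In-hospital_death",
)
print(missing, overlap, target_in_features)
```

Here `Temp` is reported as missing because the toy column list does not contain it.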

In [67]:
# Dynamic (time series) features
time_series_features = [
    'HR', 'SysABP', 'DiasABP', 'MAP', 'NISysABP', 'NIDiasABP', 'NIMAP', 'MechVent',
    'RespRate', 'SaO2', 'FiO2', 'PaO2', 'PaCO2',
    'Creatinine', 'BUN', 'Urine',
    'Na', 'K', 'Glucose', 'Lactate', 'HCO3', 'pH',
    'GCS', 'Temp'
]
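# Static features (ICUType and Gender are already numerically encoded, so no further encoding is applied)
# NOTE: Length_of_stay and Survival are only known once the stay is over and are closely
# tied to In-hospital_death, so using them as inputs risks target leakage.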
static_features = [
    'Age', 'Gender', 'Height', 'Weight', 'ICUType', 'SAPS-I', 'SOFA',
    'Length_of_stay', 'Survival'
]
target_col = 'In-hospital_death'

## 4. Log Transformation of Skewed Features

Apply log1p transformation to highly skewed static features as recommended by EDA: `Weight`, `Length_of_stay`, and `Survival`. This helps reduce the impact of outliers and long-tailed distributions.

In [68]:
# Apply log1p to the skewed static features (log1p is only defined for values > -1)
for col in ['Weight', 'Length_of_stay', 'Survival']:
    if col in data.columns:
        data[col] = np.log1p(data[col])
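
To illustrate the effect of the transform, the sketch below applies `np.log1p` to a few made-up long-tailed values and inverts it with `np.expm1`. The numbers are illustrative only and are not taken from the dataset.

```python
import numpy as np

values = np.array([0.0, 1.0, 9.0, 99.0, 999.0])  # made-up, long-tailed values
transformed = np.log1p(values)    # log(1 + x), compresses the long tail
restored = np.expm1(transformed)  # inverse transform, exp(y) - 1

print(transformed.round(3))
print(np.allclose(restored, values))
```

Because the notebook cell overwrites the columns in place, re-running it without reloading `data` would apply the transform a second time.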

## 5. Sequence Construction and Target Extraction

Group by `RecordID` to create time series sequences and extract static features and target labels for each patient. This ensures each patient is represented as a sequence for the CNN-LSTM model.

In [69]:
# Group by RecordID and create sequences (rows are sorted by time within each patient first)
grouped = data.sort_values(['RecordID', 'Minutes']).groupby('RecordID')
X_seq = [group[time_series_features].values for _, group in grouped]
X_static = grouped[static_features].first().values
y = grouped[target_col].first().values
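
The alignment between sequences, static features, and labels relies on `groupby` iterating patients in sorted key order. The toy example below (made-up values, only two dynamic features) shows the resulting shapes and ordering under that assumption.

```python
import numpy as np
import pandas as pd

# Made-up rows for two patients, deliberately listed with the larger ID first
toy = pd.DataFrame({
    "RecordID": [20, 20, 10, 10, 10],
    "Minutes": [5, 65, 7, 37, 97],
    "HR": [88.0, 90.0, 73.0, 77.0, 60.0],
    "Temp": [36.9, 37.0, 35.1, 35.6, 35.6],
    "Age": [61.0, 61.0, 54.0, 54.0, 54.0],
    "In-hospital_death": [1, 1, 0, 0, 0],
})

grouped = toy.groupby("RecordID")  # default sort=True: groups come out in sorted key order
seqs = [g[["HR", "Temp"]].values for _, g in grouped]
static = grouped[["Age"]].first().values
label_series = grouped["In-hospital_death"].first()
labels = label_series.values
record_ids = label_series.index.to_numpy()

for rid, seq, lab in zip(record_ids, seqs, labels):
    print(rid, seq.shape, lab)
```

Keeping `record_ids` alongside the arrays makes it possible to trace any row of the prepared data back to a patient.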

## 6. Sequence Padding and Feature Scaling

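Standardize the time series features with `StandardScaler`, then pad all sequences to the same length. Scaling before padding keeps the padded time steps at exactly 0 and keeps them out of the scaler statistics, while still giving the network a uniform input shape. Static features are not scaled in this step.

In [70]:
# Scale features on the real (unpadded) time steps
# NOTE: the scaler is fit on all patients here; to avoid leaking validation/test
# statistics, fit it on training patients only once the split is defined.
n_features = len(time_series_features)
scaler = StandardScaler()
scaler.fit(np.vstack(X_seq))
X_seq_std = [scaler.transform(seq) for seq in X_seq]

# Pad sequences after scaling so padded time steps remain exactly 0.0
max_seq_len = max(seq.shape[0] for seq in X_seq)
X_seq_scaled = pad_sequences(X_seq_std, maxlen=max_seq_len, dtype='float32', padding='post', value=0.0)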
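
Because padding adds artificial zero rows, the order of padding and scaling matters. The toy comparison below uses a single made-up feature and a hypothetical NumPy helper, `pad_post`, in place of Keras `pad_sequences`, to show what happens to the padded positions under each order.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def pad_post(seqs, max_len, value=0.0):
    """Post-pad a list of (T_i, F) arrays to shape (N, max_len, F). Illustrative helper."""
    n_feat = seqs[0].shape[1]
    out = np.full((len(seqs), max_len, n_feat), value, dtype="float32")
    for i, s in enumerate(seqs):
        out[i, : s.shape[0]] = s
    return out

seqs = [np.array([[70.0], [80.0], [90.0]]), np.array([[100.0]])]  # made-up HR values
max_len = 3

# Option 1: pad, then scale. Padded zeros enter the mean/std and are shifted away from 0.
padded = pad_post(seqs, max_len)
pad_then_scale = StandardScaler().fit_transform(padded.reshape(-1, 1)).reshape(2, max_len, 1)

# Option 2: scale the real time steps, then pad. Padded steps stay exactly 0.
scaler = StandardScaler().fit(np.vstack(seqs))
scale_then_pad = pad_post([scaler.transform(s) for s in seqs], max_len)

print(pad_then_scale[1, 1:, 0])  # padded positions of the short sequence, no longer 0
print(scale_then_pad[1, 1:, 0])  # padded positions of the short sequence, still 0
```

Keeping padded steps at a fixed value matters if the downstream model is expected to recognize and ignore them.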

## 7. Train/Validation/Test Split and Class Imbalance Handling

Split patients into train, validation, and test sets (roughly 64%/16%/20%, stratified on the target), then apply SMOTE to the training set only so that the validation and test sets keep the original class distribution. Oversampling is used because the target variable (`In-hospital_death`) is imbalanced.
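
The two-stage stratified split can be sanity-checked on toy labels. The sketch below uses 100 made-up labels with a 15% positive rate and the same `test_size` and `random_state` values as the notebook cell, then prints the size and positive rate of each subset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 85 + [1] * 15)  # made-up labels, 15% positive
indices = np.arange(len(y))

# First hold out 20% for testing, then 20% of the remainder for validation
train_idx, test_idx = train_test_split(indices, test_size=0.2, stratify=y, random_state=42)
train_idx, val_idx = train_test_split(
    train_idx, test_size=0.2, stratify=y[train_idx], random_state=42
)

for name, idx in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    print(name, len(idx), round(float(y[idx].mean()), 3))
```

Splitting on patient indices rather than on rows keeps all time steps of one patient in a single subset.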

In [71]:
# Patient-index-based splitting for aligned arrays
# Get patient indices
n_patients = len(y)
indices = np.arange(n_patients)

# Split indices for train, val, test
train_idx, test_idx = train_test_split(indices, test_size=0.2, stratify=y, random_state=42)
train_idx, val_idx = train_test_split(train_idx, test_size=0.2, stratify=y[train_idx], random_state=42)

# Use indices to split all arrays
X_train, X_val, X_test = X_seq_scaled[train_idx], X_seq_scaled[val_idx], X_seq_scaled[test_idx]
static_train, static_val, static_test = X_static[train_idx], X_static[val_idx], X_static[test_idx]
y_train, y_val, y_test = y[train_idx], y[val_idx], y[test_idx]

# SMOTE on flattened time series  
sm = SMOTE(random_state=42)
X_train_flat = X_train.reshape(X_train.shape[0], -1)
X_train_res, y_train_res = sm.fit_resample(X_train_flat, y_train)
X_train_res = X_train_res.reshape(-1, X_train.shape[1], X_train.shape[2])

# For static features: assign static features for synthetic samples by random sampling from minority class
n_orig = static_train.shape[0]
n_total = X_train_res.shape[0]
n_synth = n_total - n_orig
static_train_res = static_train.copy()
if n_synth > 0:
    # Find indices of minority class in original static_train
    minority_class = 1 if np.sum(y_train == 1) < np.sum(y_train == 0) else 0
    minority_indices = np.where(y_train == minority_class)[0]
    synth_static = static_train[np.random.choice(minority_indices, size=n_synth, replace=True)]
    static_train_res = np.concatenate([static_train, synth_static], axis=0)



## 8. Save Prepared Data

Save the processed arrays for model training and evaluation. This ensures reproducibility and easy loading for downstream modeling notebooks.

In [72]:
# Save processed data
np.savez('../data/processed/cnn_lstm_data.npz',
         X_train=X_train_res, y_train=y_train_res,
         X_val=X_val, y_val=y_val,
         X_test=X_test, y_test=y_test,
         static_train=static_train_res, static_val=static_val, static_test=static_test)
print("Prepared data saved.")

Prepared data saved.
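
A downstream notebook can read the arrays back with `np.load`. The sketch below round-trips a few made-up arrays through a temporary file to show the pattern; the file name and array shapes are hypothetical, and only the key names mirror those used in the save cell.

```python
import os
import tempfile
import numpy as np

# Made-up arrays using the same key names as the saved file
arrays = {
    "X_train": np.zeros((4, 5, 3), dtype="float32"),
    "y_train": np.array([0, 1, 0, 1]),
    "static_train": np.ones((4, 2)),
}

path = os.path.join(tempfile.mkdtemp(), "toy_cnn_lstm_data.npz")  # hypothetical file name
np.savez(path, **arrays)

with np.load(path) as loaded:
    X_train = loaded["X_train"]
    y_train = loaded["y_train"]
    static_train = loaded["static_train"]

# The three arrays must agree on the number of samples
assert X_train.shape[0] == y_train.shape[0] == static_train.shape[0]
print(X_train.shape, y_train.shape, static_train.shape)
```

Checking the sample counts right after loading guards against misaligned sequence, static, and label arrays.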
