# 04a: Data Preparation for XGBoost Modeling

## Goal for this notebook

The goal of this notebook is to take the cleaned, aggregated dataset and prepare it for training our XGBoost model. This involves separating the features from the target variable, encoding categorical features appropriately, splitting the data into training and testing sets, and addressing the significant class imbalance using an over-sampling technique (SMOTE). The final output will be data splits that are ready for the modeling phase.

## 1. Setup and Data Loading

We'll start by importing the libraries we need (pandas, scikit-learn, and imbalanced-learn), and then load the `aggregated_cleaned.csv` file created in the previous data cleaning notebook.

In [1]:
# Run this cell only if you hit an import error later, then restart the kernel
!pip uninstall -y scikit-learn imbalanced-learn
!pip install scikit-learn
!pip install imbalanced-learn

Found existing installation: scikit-learn 1.3.2
Uninstalling scikit-learn-1.3.2:
  Successfully uninstalled scikit-learn-1.3.2
Found existing installation: imbalanced-learn 0.12.4
Uninstalling imbalanced-learn-0.12.4:
  Successfully uninstalled imbalanced-learn-0.12.4
Collecting scikit-learn
  Using cached scikit_learn-1.3.2-cp38-cp38-macosx_10_9_x86_64.whl.metadata (11 kB)
Using cached scikit_learn-1.3.2-cp38-cp38-macosx_10_9_x86_64.whl (10.1 MB)
Installing collected packages: scikit-learn
Successfully installed scikit-learn-1.3.2
Collecting imbalanced-learn
  Using cached imbalanced_learn-0.12.4-py3-none-any.whl.metadata (8.3 kB)
Using cached imbalanced_learn-0.12.4-py3-none-any.whl (258 kB)
Installing collected packages: imbalanced-learn
Successfully installed imbalanced-learn-0.12.4


In [1]:
import os
import pickle

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Load the cleaned data from the previous step
df = pd.read_csv('../data/processed/aggregated_cleaned.csv')
print("Successfully loaded 'aggregated_cleaned.csv'.")
print(f"Dataset shape: {df.shape}")


Successfully loaded 'aggregated_cleaned.csv'.
Dataset shape: (4000, 175)


## 2. Feature and Target Separation

First, we separate the dataset into the feature matrix (`X`), which contains all the predictor variables, and the target vector (`y`), which is the `In-hospital_death` column we want to predict. We also drop `RecordID`, since it is an identifier rather than a predictor.



In [2]:
# Define the target variable
target = 'In-hospital_death'

# Separate features (X) and target (y)
X = df.drop(columns=[target, 'RecordID']) # Drop target and identifier
y = df[target]

print(f"Shape of feature matrix (X): {X.shape}")
print(f"Shape of target vector (y): {y.shape}")

Shape of feature matrix (X): (4000, 173)
Shape of target vector (y): (4000,)
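
The separation step can be sanity-checked on a toy frame. The sketch below is illustrative only: the three rows are made up, and only the column names `RecordID` and `In-hospital_death` come from this notebook.

```python
import pandas as pd

# Made-up rows for illustration; only the column names mirror the notebook.
toy = pd.DataFrame({
    "RecordID": [101, 102, 103],
    "Age": [52.0, 65.0, 47.0],
    "In-hospital_death": [0, 1, 0],
})

target = "In-hospital_death"
X_toy = toy.drop(columns=[target, "RecordID"])  # features only
y_toy = toy[target]                             # target only

# Neither the target nor the identifier should remain among the features.
assert target not in X_toy.columns
assert "RecordID" not in X_toy.columns
assert len(X_toy) == len(y_toy)

print(list(X_toy.columns))
```

Running the same three assertions against the real `X` and `y` would be a cheap guard against accidentally leaving the target among the features.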


## 3. Categorical Feature Encoding

Our EDA identified `ICUType` and `Gender` as categorical features. To prevent the model from assuming a false numerical order, we convert them with one-hot encoding, which creates a binary indicator column per category. Because we pass `drop_first=True`, the first category of each feature is dropped and serves as the implicit baseline.

In [3]:
categorical_features = ['ICUType', 'Gender']

print(f"Original shape of X: {X.shape}")

# Apply one-hot encoding
X_encoded = pd.get_dummies(X, columns=categorical_features, drop_first=True)

print(f"Shape of X after one-hot encoding: {X_encoded.shape}")
X_encoded.head()

Original shape of X: (4000, 173)
Shape of X after one-hot encoding: (4000, 176)


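|   | Age | Height | SAPS-I | SOFA | Weight | ALP_mean | ALP_min | ALP_max | ALP_count | ALT_mean | ... | pH_mean | pH_std | pH_min | pH_max | pH_count | ICUType_2.0 | ICUType_3.0 | ICUType_4.0 | Gender_0.0 | Gender_1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52.0 | 185.4 | 9.0 | 2.0 | 90.0 | 229.0 | 229.0 | 229.0 | 1.0 | 5.0 | ... | 7.394333 | 0.040302 | 7.374 | 7.408 | 0.0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 65.0 | -1.0 | 11.0 | 3.0 | -1.0 | 128.55 | 125.6 | 131.4 | 0.0 | 40.25 | ... | 7.420983 | 0.021805 | 7.396 | 7.442 | 0.0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 47.0 | -1.0 | 4.0 | 1.0 | 86.6 | 55.0 | 55.0 | 55.0 | 1.0 | 68.0 | ... | 7.418 | 0.039983 | 7.386 | 7.462 | 0.0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 35.0 | 154.9 | -1.0 | 7.0 | 67.0 | 70.6 | 67.0 | 74.2 | 0.0 | 29.2 | ... | 7.374706 | 0.052929 | 7.27 | 7.44 | 17.0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 64.0 | -1.0 | -1.0 | -1.0 | 74.3 | 258.0 | 216.0 | 300.0 | 2.0 | 240.333333 | ... | 7.43 | 0.039158 | 7.39 | 7.48 | 4.0 | 0 | 1 | 0 | 1 | 0 |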
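
To see what `drop_first=True` does, here is a minimal sketch on a made-up frame (the values are hypothetical; only the column names match this notebook). Note that in the real output above `Gender` still has two indicator columns after the first level is dropped, so the data must contain three distinct `Gender` values. One of them may be a `-1` missing-value code, but that is an assumption and is not confirmed here.

```python
import pandas as pd

# Hypothetical values; ICUType has four levels, Gender has two in this toy frame.
toy = pd.DataFrame({
    "Age": [52.0, 65.0, 47.0, 35.0],
    "ICUType": [1.0, 2.0, 3.0, 4.0],
    "Gender": [1.0, 1.0, 0.0, 0.0],
})

# drop_first=True removes the first (lowest-sorted) level of each encoded column,
# so ICUType_1.0 and Gender_0.0 become the implicit baselines.
toy_encoded = pd.get_dummies(toy, columns=["ICUType", "Gender"], drop_first=True)

print(list(toy_encoded.columns))
```

A feature with k levels therefore contributes k - 1 columns, which matches the jump from 173 to 176 columns above (three for `ICUType`, two for `Gender`, minus the two original columns).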


## 4. Data Splitting

We now split the data into training and testing sets. The model is trained on the training set, and its performance is evaluated on the unseen test set. We stratify on the target so that the proportion of the minority class is nearly identical in both splits, which is crucial for our imbalanced dataset.
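
As a rough illustration of what stratification does, the sketch below splits synthetic labels with a 14% positive rate (chosen only to resemble this dataset; the arrays are made up) and compares the positive rate in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 860 negatives and 140 positives (illustrative only).
y_toy = np.array([0] * 860 + [1] * 140)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify=y_toy, both splits keep close to the overall positive rate.
print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
```

Without `stratify`, the two rates can drift apart by chance, which matters most when the minority class is small.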

In [4]:
# Using the encoded X and original y before any oversampling
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y # Stratify is crucial for imbalanced datasets
)

print("Data split into training and testing sets.")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"Original y_train distribution:\n{y_train.value_counts(normalize=True)}")
print(f"\ny_test distribution:\n{y_test.value_counts(normalize=True)}")


Data split into training and testing sets.
X_train shape: (3200, 176)
X_test shape: (800, 176)
Original y_train distribution:
0    0.861563
1    0.138437
Name: In-hospital_death, dtype: float64

y_test distribution:
0    0.86125
1    0.13875
Name: In-hospital_death, dtype: float64


## 5. Handle Class Imbalance with SMOTE

To address the significant class imbalance, we apply the Synthetic Minority Over-sampling Technique (SMOTE). SMOTE generates synthetic data points for the minority class (in-hospital deaths) by interpolating between existing minority samples and their nearest neighbors, giving the model a balanced dataset to learn from. Crucially, it is applied only to the training data: this prevents data leakage and keeps the test set representative of the real-world class distribution.
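
To make the interpolation idea concrete, here is a simplified NumPy sketch. It is not imbalanced-learn's implementation; the `synthesize` helper, the five made-up points, and the choice of `k=3` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Tiny made-up minority class: 5 distinct points in 2 dimensions.
minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5], [1.8, 1.6]])

def synthesize(points, n_new, k=3, rng=rng):
    """Create n_new synthetic rows by interpolating toward a random near neighbor."""
    new_rows = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        # Euclidean distance from the chosen point to every minority point.
        dists = np.linalg.norm(points - points[i], axis=1)
        # Indices of the k nearest neighbors, skipping the point itself (distance 0).
        neighbors = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbors)
        gap = rng.random()  # uniform in [0, 1)
        # The new point lies on the segment between point i and neighbor j.
        new_rows.append(points[i] + gap * (points[j] - points[i]))
    return np.array(new_rows)

synthetic = synthesize(minority, n_new=4)
print(synthetic.shape)
```

Because every synthetic row is a blend of two real rows, binary indicator columns such as `ICUType_2.0` can end up with fractional values. imbalanced-learn also provides `SMOTENC` for mixed numerical and categorical data, which may be worth comparing, though this notebook does not evaluate it.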

In [5]:
print(f"Original training set shape: {X_train.shape}")
print(f"Original training set distribution:\n{y_train.value_counts()}")

# Initialize SMOTE with a fixed seed for reproducibility
smote = SMOTE(random_state=42)

# Apply SMOTE to the training data only; the test set is left untouched
print("\nApplying SMOTE to the training set...")
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print(f"Resampled training set shape: {X_train_smote.shape}")
print(f"Resampled training set distribution:\n{y_train_smote.value_counts()}")


Original training set shape: (3200, 176)
Original training set distribution:
0    2757
1     443
Name: In-hospital_death, dtype: int64

Applying SMOTE to the training set...
Resampled training set shape: (5514, 176)
Resampled training set distribution:
0    2757
1    2757
Name: In-hospital_death, dtype: int64


## 6. Save Prepared Data

Finally, we save our prepared data splits. We save both the original splits and the SMOTE-resampled training split, so that we can compare modeling approaches in the next notebook. Pickling preserves the DataFrame structure and data types, which a round-trip through CSV would not guarantee.


In [6]:
# Define the output path
output_dir = '../data/features/'

# Create the directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

# Save the data splits using pickle
with open(os.path.join(output_dir, 'aggregated_data_splits.pkl'), 'wb') as f:
    pickle.dump({
        'X_train': X_train,
        'X_test': X_test,
        'y_train': y_train,
        'y_test': y_test,
        'X_train_smote': X_train_smote,
        'y_train_smote': y_train_smote
    }, f)

print("\nPrepared data splits saved to 'aggregated_data_splits.pkl'")


Prepared data splits saved to 'aggregated_data_splits.pkl'
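
For reference, reading the splits back could look like the sketch below. It round-trips a small stand-in dictionary through a temporary directory so that it runs on its own; the stand-in values are made up, and only the file name and dictionary keys come from this notebook.

```python
import os
import pickle
import tempfile

import pandas as pd

# Stand-in splits; in the next notebook these would come from the saved file.
splits_out = {
    "X_train": pd.DataFrame({"Age": [52.0, 65.0], "SOFA": [2.0, 3.0]}),
    "y_train": pd.Series([0, 1], name="In-hospital_death"),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "aggregated_data_splits.pkl")
    with open(path, "wb") as f:
        pickle.dump(splits_out, f)
    # Only unpickle files you created yourself or otherwise trust.
    with open(path, "rb") as f:
        splits_in = pickle.load(f)

# Structure, column dtypes, and the Series name all survive the round trip.
print(sorted(splits_in))
print(splits_in["X_train"].dtypes.to_dict())
```

In the modeling notebook, the same `pickle.load` call pointed at `../data/features/aggregated_data_splits.pkl` should return the dictionary saved above.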
