# 03a: Data Cleaning (Aggregated Data)

## Goal for this notebook

The goal of this notebook is to perform the necessary data cleaning on the aggregated feature set. This involves loading the data, handling data leakage by removing outcome-related variables, addressing columns with a high percentage of missing data, and imputing the remaining missing values. The final output will be a clean, complete dataset ready for the model preparation phase.

## 1. Setup and Data Loading
We'll start by importing the necessary libraries and loading the aggregated feature dataset created in the ingestion phase.

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
import os

# Load the processed data from the previous step
df = pd.read_csv('../data/processed/patient_aggregated_features_df.csv')
print("Successfully loaded 'patient_aggregated_features_df.csv'.")
print(f"Original dataset shape: {df.shape}")


Successfully loaded 'patient_aggregated_features_df.csv'.
Original dataset shape: (4000, 191)


## 2. Handle Data Leakage

**Finding**: Our dataset contains features such as `Survival` and `Length_of_stay`, which are only known at the end of a patient's stay. A realistic model that makes predictions early in an ICU admission cannot have access to this information from the "future", so these columns must be removed to prevent data leakage.

In [2]:
leaky_features = ['Survival', 'Length_of_stay']

# Check which leaky features are in the dataframe before dropping
features_to_drop = [feat for feat in leaky_features if feat in df.columns]

if features_to_drop:
    df.drop(columns=features_to_drop, inplace=True)
    print(f"Removed leaky features: {features_to_drop}")
    print(f"Shape after removing leaky features: {df.shape}")
else:
    print("No leaky features found to remove.")

Removed leaky features: ['Survival', 'Length_of_stay']
Shape after removing leaky features: (4000, 189)
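The same step can also be written without mutating `df` in place. The sketch below is a minimal illustration on a toy DataFrame, not the notebook's actual cell: the helper name `drop_leaky_features`, the `HR_mean` column, and all values are hypothetical. Only the names `Survival`, `Length_of_stay`, `RecordID`, and `In-hospital_death` come from this notebook.

```python
import pandas as pd

# Outcome-derived columns named in this notebook
LEAKY_FEATURES = ["Survival", "Length_of_stay"]


def drop_leaky_features(frame: pd.DataFrame, leaky=LEAKY_FEATURES) -> pd.DataFrame:
    """Return a copy of `frame` without outcome-derived columns (hypothetical helper)."""
    # errors="ignore" skips names that are absent, so the call is safe to re-run
    return frame.drop(columns=list(leaky), errors="ignore")


# Toy data for illustration only; values are made up
toy = pd.DataFrame({
    "RecordID": [1, 2, 3],
    "HR_mean": [80.0, 95.0, 72.0],
    "Survival": [10, -1, 300],
    "Length_of_stay": [5, 12, 3],
    "In-hospital_death": [1, 0, 0],
})

toy_clean = drop_leaky_features(toy)
remaining_leaks = [col for col in LEAKY_FEATURES if col in toy_clean.columns]

print(f"Columns kept: {list(toy_clean.columns)}")
print(f"Leaky columns remaining: {remaining_leaks}")
```

Returning a new DataFrame leaves the input untouched, which makes the cell idempotent when the notebook is re-run out of order.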


## 3. Handle High-Missingness Features

**Finding**: The EDA revealed that several features, mostly related to specialized lab tests, are missing for more than 80% of patients. With so few observed values, imputation would fill most of each column with estimates rather than measurements, which is unreliable. We therefore remove these columns to avoid introducing noise into the model.

In [3]:
# Calculate the fraction of missing values per column (0.0 to 1.0, not a percentage)
missing_fraction = df.isnull().mean()

# Identify columns to drop (more than 80% missing, i.e. fraction strictly above 0.8)
cols_to_drop_missing = missing_fraction[missing_fraction > 0.8].index

if not cols_to_drop_missing.empty:
    df.drop(columns=cols_to_drop_missing, inplace=True)
    print(f"Removed {len(cols_to_drop_missing)} columns with >80% missing values.")
    print(f"Shape after removing high-missingness columns: {df.shape}")
else:
    print("No columns with >80% missing values to remove.")

Removed 14 columns with >80% missing values.
Shape after removing high-missingness columns: (4000, 175)
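To make the thresholding rule concrete, here is a minimal sketch on a toy DataFrame. The helper `drop_high_missing` and the column names `common_vital` and `rare_lab` are hypothetical and exist only for illustration. The logic mirrors the cell above: compute the per-column missing fraction with `isnull().mean()` and drop anything strictly above the threshold.

```python
import numpy as np
import pandas as pd


def drop_high_missing(frame: pd.DataFrame, threshold: float = 0.8):
    """Drop columns whose missing fraction is strictly above `threshold` (hypothetical helper)."""
    missing_fraction = frame.isnull().mean()  # per-column fraction in [0, 1]
    to_drop = missing_fraction[missing_fraction > threshold].index.tolist()
    return frame.drop(columns=to_drop), to_drop


# Toy data for illustration only: 10 rows, two columns
toy = pd.DataFrame({
    "common_vital": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],  # 10% missing
    "rare_lab": [np.nan] * 9 + [3.2],  # 90% missing
})

toy_reduced, dropped = drop_high_missing(toy, threshold=0.8)

print(f"Dropped: {dropped}")
print(f"Remaining: {list(toy_reduced.columns)}")
```

Returning the list of dropped names alongside the reduced frame makes it easy to log which columns were removed and to apply the same list to other data later.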


## 4. Impute Remaining Missing Values

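For the remaining features with missing data, we will use K-Nearest Neighbors (KNN) imputation. For each missing value, `KNNImputer` finds the `k` most similar patients (here `k = 5`) that have the feature observed and fills in the mean of their values. Because it borrows from similar patients rather than from the whole cohort, this can give more realistic imputations than a single global mean or median. Similarity is a Euclidean distance over the features both patients have observed, so features on larger numeric scales weigh more heavily in the neighbor search.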

In [4]:
# Separate identifiers and the target variable, which should not be imputed
ids_and_target = df[['RecordID', 'In-hospital_death']]
features_to_impute = df.drop(columns=['RecordID', 'In-hospital_death'])

# Initialize the KNNImputer
imputer = KNNImputer(n_neighbors=5)

# Fit and transform the feature data
print("Imputing remaining missing values with KNNImputer...")
imputed_features = imputer.fit_transform(features_to_impute)

# Convert the imputed data back to a DataFrame
df_imputed_features = pd.DataFrame(imputed_features, columns=features_to_impute.columns)

# Combine the imputed features with the identifiers and target
df_cleaned = pd.concat([ids_and_target.reset_index(drop=True), df_imputed_features.reset_index(drop=True)], axis=1)

print("Imputation complete.")
print(f"Final cleaned shape: {df_cleaned.shape}")
print(f"Missing values remaining: {df_cleaned.isnull().sum().sum()}")

Imputing remaining missing values with KNNImputer...
Imputation complete.
Final cleaned shape: (4000, 175)
Missing values remaining: 0
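Because the neighbor search is distance-based, one common variant is to standardize the features before imputing and map the result back to the original units afterwards. The sketch below illustrates that variant on toy data. It is not what the cell above does, and whether it improves results on this dataset is untested. The helper `knn_impute_scaled` and the toy column names and values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler


def knn_impute_scaled(features: pd.DataFrame, n_neighbors: int = 5) -> pd.DataFrame:
    """KNN-impute after standardizing, then return values in original units (hypothetical helper).

    Assumes every column is numeric and has at least one observed value.
    """
    scaler = StandardScaler()
    # StandardScaler ignores NaNs when fitting and keeps them as NaN when transforming
    scaled = scaler.fit_transform(features)
    imputed_scaled = KNNImputer(n_neighbors=n_neighbors).fit_transform(scaled)
    # Undo the scaling so downstream steps see the original units
    imputed = scaler.inverse_transform(imputed_scaled)
    return pd.DataFrame(imputed, columns=features.columns, index=features.index)


# Toy data for illustration only; each row has at most one missing value
toy = pd.DataFrame({
    "heart_rate_mean": [80.0, 95.0, np.nan, 110.0, 72.0],
    "creatinine_max": [0.9, np.nan, 1.4, 2.1, 0.8],
    "age": [54.0, 67.0, 71.0, 80.0, 45.0],
})

toy_imputed = knn_impute_scaled(toy, n_neighbors=2)

print(toy_imputed)
print(f"Missing values remaining: {toy_imputed.isnull().sum().sum()}")
```

Without scaling, a feature measured in the hundreds can dominate the distance over one measured in fractions, so the "nearest" patients may be chosen almost entirely on the larger-scale feature.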


## 5. Save Cleaned Data

Finally, we save the fully cleaned and imputed DataFrame to a new CSV file. This file will be the input for the next stage of our pipeline: model preparation.

In [5]:
# Define the output path
output_dir = '../data/processed/'
output_file = os.path.join(output_dir, 'aggregated_cleaned.csv')

# Create the directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

# Save the cleaned dataframe
df_cleaned.to_csv(output_file, index=False)

print(f"\nCleaned data saved to: {output_file}")


Cleaned data saved to: ../data/processed/aggregated_cleaned.csv
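A quick round-trip check can confirm that a saved file reloads with the same shape and no missing values. The sketch below demonstrates the idea on a toy DataFrame written to a temporary directory, so it does not touch `../data/processed/`. The toy values and the `HR_mean` column are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df_cleaned; values are made up
toy_cleaned = pd.DataFrame({
    "RecordID": [1, 2, 3],
    "In-hospital_death": [0, 1, 0],
    "HR_mean": [80.0, 95.5, 72.25],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "aggregated_cleaned.csv")
    toy_cleaned.to_csv(path, index=False)  # same call pattern as the cell above
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy_cleaned.shape
no_missing = int(reloaded.isnull().sum().sum()) == 0

print(f"Shape preserved: {same_shape}")
print(f"No missing values after reload: {no_missing}")
```

Writing with `index=False` matters here: without it, the reloaded file would gain an extra unnamed index column and the shape comparison would fail.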
