# 01: Raw Data Ingestion and Structuring

## Overview
This notebook obtains the raw data and prepares it for analysis. It imports the required libraries, mounts Google Drive to access project files, imports custom utility functions, downloads the raw dataset from Kaggle, splits each patient record into static and dynamic features, saves the structured data, and finally aggregates the static and dynamic features into a single DataFrame suitable for downstream tasks.

## 1. Import Necessary Libraries

This section imports essential Python libraries such as `kagglehub` for dataset interaction, `os` and `sys` for system operations, and `pandas` and `numpy` for data manipulation and numerical computing.

In [None]:
# Reference: https://github.com/Kaggle/kagglehub/blob/main/README.md#installation
%pip install kagglehub



In [None]:
import kagglehub
import sys
import os
import pandas as pd
import numpy as np

## 2. Mount Google Drive

This section mounts Google Drive so that project files can be accessed directly from the Colab environment.

In [None]:
from google.colab import drive

# Mount Google drive
drive.mount('/content/drive')

# Base path of the project folder on Drive (relative to /content, Colab's default working directory)
basePath = 'drive/MyDrive/Colab Notebooks/AAI-590-01_02/AAI590_CapstoneProject'

Mounted at /content/drive


## 3. Import Custom Modules

This section adds the source directory to the system path so that the notebook can import the custom utility functions from `utils`.

In [None]:
# Reference: https://coderivers.org/blog/sys-path-append-python/

# Note: use the line below when running on a local machine
# sys.path.append(os.path.abspath(os.path.join('..', 'src')))

# Note: use the line below when running in Google Colab
sys.path.append(os.path.join(basePath, 'src'))

from utils import copy_entire_directory, convert_time_minutes
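
The `utils` module itself is not shown in this notebook. As a rough illustration only, the two imported helpers might look something like the sketch below. The function bodies are assumptions inferred from how the helpers are called later (`convert_time_minutes` is applied to `'HH:MM'` strings, and `copy_entire_directory` receives a source and a target path); the `_sketch` suffix marks them as hypothetical stand-ins rather than the project's actual code.

```python
import shutil


def convert_time_minutes_sketch(time_str: str) -> int:
    """Hypothetical stand-in: convert an 'HH:MM' string to total minutes."""
    hours, minutes = time_str.split(":")
    return int(hours) * 60 + int(minutes)


def copy_entire_directory_sketch(src: str, dst: str) -> None:
    """Hypothetical stand-in: recursively copy src into dst."""
    # dirs_exist_ok=True (Python 3.8+) lets the copy succeed when dst already exists
    shutil.copytree(src, dst, dirs_exist_ok=True)


print(convert_time_minutes_sketch("00:07"))  # 7
print(convert_time_minutes_sketch("47:37"))  # 2857
```

Hours are not capped at 23 in this sketch, which suits timestamps that count elapsed time rather than clock time.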

## 4. Download Kaggle Dataset

This section uses the `kagglehub` library to authenticate with Kaggle, download the dataset into the local `kagglehub` cache, and copy it into the project's raw data directory.

In [None]:
# Target directory for the raw dataset (the files are copied here from the kagglehub cache)

# Note: use the line below when running on a local machine
# target_dir = r"../data/raw"

# Note: use the line below when running in Google Colab
target_dir = os.path.join(basePath, 'data', 'raw')

### Note: Skip this section if you have already downloaded the dataset from Kaggle
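
One way to make the skip decision explicit is to check whether the target directory already contains files before logging in. The helper below is a hypothetical sketch and is not part of the notebook or of `kagglehub`; it is demonstrated on a temporary directory so that it runs without any credentials.

```python
import os
import tempfile


def dataset_already_present(target_dir: str) -> bool:
    """Hypothetical helper: True if target_dir exists and holds at least one entry."""
    return os.path.isdir(target_dir) and len(os.listdir(target_dir)) > 0


# Demonstration on a throwaway directory
with tempfile.TemporaryDirectory() as demo_dir:
    empty_result = dataset_already_present(demo_dir)   # directory exists but is empty
    open(os.path.join(demo_dir, "Outcomes-a.txt"), "w").close()
    filled_result = dataset_already_present(demo_dir)  # now contains one file

print(empty_result, filled_result)  # False True
```

In the notebook, a check like `if not dataset_already_present(target_dir):` could guard the login, download, and copy cells. A partially copied directory would also pass this check, so it is only a convenience, not a validation of the dataset.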

In [None]:
# References:
# https://github.com/Kaggle/kagglehub/blob/main/README.md#option-1-calling-kagglehublogin
# https://github.com/Kaggle/kaggle-api/tree/main/docs#api-credentials
kagglehub.login()


Kaggle credentials set.
Kaggle credentials successfully validated.


In [None]:
# References:
# https://www.kaggle.com/datasets/msafi04/predict-mortality-of-icu-patients-physionet
# https://github.com/Kaggle/kagglehub/blob/main/README.md#download-dataset

# Download latest version
path = kagglehub.dataset_download("msafi04/predict-mortality-of-icu-patients-physionet")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/msafi04/predict-mortality-of-icu-patients-physionet?dataset_version_number=1...


100%|██████████| 7.64M/7.64M [00:00<00:00, 106MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/msafi04/predict-mortality-of-icu-patients-physionet/versions/1


In [None]:
copy_entire_directory(path, target_dir)

## 5. Save Structured Data

This section builds the structured data and saves it. The static data is collected by iterating through the patient files, extracting each patient's static features, merging them with the outcome data, and assembling the result into a pandas DataFrame. The dynamic tensors are created by taking each patient's time-varying measurements, converting timestamps to minutes, pivoting the data into a time-by-feature layout, and converting it into NumPy arrays (tensors). Missing values are filled with -1 in both. Saving the structured data lets later steps load it directly without reprocessing the raw files.
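
The pivoting step described above can be illustrated on a toy record. This is a minimal sketch with made-up values; it assumes the `Time`/`Parameter`/`Value` column layout that the cells below read from each patient file, and it inlines the `'HH:MM'` conversion instead of calling the project's `convert_time_minutes`.

```python
import pandas as pd

# Toy long-format records in the Time/Parameter/Value layout (values are made up)
records = pd.DataFrame({
    "Time": ["00:00", "00:30", "00:30", "01:00", "01:00"],
    "Parameter": ["HR", "HR", "Temp", "HR", "HR"],
    "Value": [80.0, 82.0, 37.1, 85.0, 90.0],
})

# 'HH:MM' -> minutes (inline stand-in for convert_time_minutes)
hhmm = records["Time"].str.split(":", expand=True).astype(int)
records["Minutes"] = hhmm[0] * 60 + hhmm[1]

# One row per time point, one column per parameter; the last reading wins on duplicates
pivot = records.pivot_table(index="Minutes", columns="Parameter", values="Value", aggfunc="last")

# Force a fixed column order and mark unobserved cells with -1
feature_order = ["HR", "Temp", "pH"]
toy_tensor = pivot.reindex(columns=feature_order).sort_index().fillna(-1).to_numpy()

print(toy_tensor.shape)  # (3, 3)
```

The two `HR` readings at `01:00` collapse to the later one (90.0), and `pH`, which never appears in the toy record, becomes a column of -1. Reindexing against a shared feature list is what gives every patient tensor the same number of columns even though the number of rows varies.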

In [None]:
# Note: use the lines below when running on a local machine
# data_dir = r'../data/raw/set-a/set-a'
# outcomes_path = r'../data/raw/Outcomes-a.txt'

# Note: use the lines below when running in Google Colab
data_dir = os.path.join(target_dir, 'set-a', 'set-a')
outcomes_path = os.path.join(target_dir, 'Outcomes-a.txt')

# To collect unique dynamic and static features
all_dynamic_features = set()
all_static_features = set()

# Load outcome data; RecordID becomes the index, so the remaining columns are the outcome features
outcomes_df = pd.read_csv(outcomes_path)
outcomes_df.set_index('RecordID', inplace=True)
outcome_static_features = set(outcomes_df.columns)

# Add static feature names from patient and outcome data into static set
static_params = ['RecordID', 'Age', 'Gender', 'Height', 'ICUType', 'Weight']
all_static_features.update(static_params)
all_static_features.update(outcome_static_features)

# Iterate through all patient files and collect all feature names
for file in os.listdir(data_dir):
    if file.endswith('.txt'):
        df = pd.read_csv(os.path.join(data_dir, file))
        all_params = df['Parameter']
        all_dynamic_features.update(all_params)

# Derive dynamic feature names by excluding all known static feature names
dynamic_params = all_dynamic_features - all_static_features

# Ordered dynamic feature list: 'RecordID' and 'Minutes' first, then the parameters sorted by name
all_dynamic_features = ['RecordID', 'Minutes'] + sorted(dynamic_params)

# Ordered static feature list: 'RecordID' first, then the remaining features sorted by name
all_static_features = ['RecordID'] + sorted(all_static_features - {'RecordID'})

# View static and dynamic feature names
print("Dynamic features:", all_dynamic_features)
print("Static features:", all_static_features)

Dynamic features: ['RecordID', 'Minutes', 'ALP', 'ALT', 'AST', 'Albumin', 'BUN', 'Bilirubin', 'Cholesterol', 'Creatinine', 'DiasABP', 'FiO2', 'GCS', 'Glucose', 'HCO3', 'HCT', 'HR', 'K', 'Lactate', 'MAP', 'MechVent', 'Mg', 'NIDiasABP', 'NIMAP', 'NISysABP', 'Na', 'PaCO2', 'PaO2', 'Platelets', 'RespRate', 'SaO2', 'SysABP', 'Temp', 'TroponinI', 'TroponinT', 'Urine', 'WBC', 'pH']
Static features: ['RecordID', 'Age', 'Gender', 'Height', 'ICUType', 'In-hospital_death', 'Length_of_stay', 'SAPS-I', 'SOFA', 'Survival', 'Weight']


In [None]:
# Build the per-patient static records and dynamic tensors
patient_static_data = []
patient_dynamic_tensors = []

# Iterate through all patient files
for file in os.listdir(data_dir):
    if file.endswith('.txt'):
        # Use a separate name so the Kaggle download `path` from Section 4 is not overwritten
        file_path = os.path.join(data_dir, file)
        df = pd.read_csv(file_path)

        # Get RecordID
        rid = int(df[df['Parameter'] == 'RecordID']['Value'].values[0])

        # Static Features: Filter by static feature set
        df['Value'] = pd.to_numeric(df['Value'], errors='coerce')
        # Use the reordered all_static_features list for filtering
        static_subset = df[df['Parameter'].isin(all_static_features)]
        static_dict = static_subset.drop_duplicates('Parameter').set_index('Parameter')['Value'].to_dict()
        static_dict['RecordID'] = rid # update with integer value

        # Inject static data from outcomes
        if rid in outcomes_df.index:
            static_dict.update(outcomes_df.loc[rid].to_dict())

        # Reorder and fill missing static features with -1 using the reordered list
        ordered_static = {key: static_dict.get(key, -1) for key in all_static_features}
        patient_static_data.append(ordered_static)

        # Dynamic Features: Filter by dynamic feature set (excluding RecordID and Minutes for pivoting)
        dynamic_cols_for_pivot = [f for f in all_dynamic_features if f not in ['RecordID', 'Minutes']]
        dynamic_subset = df[df['Parameter'].isin(dynamic_cols_for_pivot)].copy()

        # Converts time string in 'HH:MM' format into total minutes since ICU admission
        dynamic_subset['Minutes'] = dynamic_subset['Time'].apply(convert_time_minutes)

        # Pivot the dynamic data
        pivot = dynamic_subset.pivot_table(index='Minutes', columns='Parameter', values='Value', aggfunc='last')

        # Add RecordID and Minutes columns after pivoting
        pivot['RecordID'] = rid
        pivot['Minutes'] = pivot.index

        # Reindex the columns to match the reordered all_dynamic_features, sort by index and fill NaN with -1
        pivot = pivot.reindex(columns=all_dynamic_features).sort_index().fillna(-1)

        # Convert the pivoted DataFrame to a NumPy array (tensor)
        tensor = pivot.to_numpy()
        # Append the dynamic tensor to the list of patient dynamic tensors
        patient_dynamic_tensors.append(tensor)

# Display data for the first patient
print("Static keys:", patient_static_data[0].keys())
print("Dynamic tensor shape (time × features):", patient_dynamic_tensors[0].shape)

Static keys: dict_keys(['RecordID', 'Age', 'Gender', 'Height', 'ICUType', 'In-hospital_death', 'Length_of_stay', 'SAPS-I', 'SOFA', 'Survival', 'Weight'])
Dynamic tensor shape (time × features): (59, 38)


In [None]:
# Convert the list of per-patient static records to a DataFrame and display the first few rows
df_static = pd.DataFrame(patient_static_data)
df_static.head()

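|   | RecordID | Age | Gender | Height | ICUType | In-hospital_death | Length_of_stay | SAPS-I | SOFA | Survival | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 140101 | 39.0 | 0.0 | 170.2 | 3.0 | 0 | 10 | 10 | 7 | -1 | 253.0 |
| 1 | 140102 | 70.0 | 0.0 | -1.0 | 3.0 | 0 | 39 | 11 | 6 | 393 | 123.5 |
| 2 | 140104 | 61.0 | 1.0 | 188.0 | 2.0 | 0 | 5 | 18 | 7 | -1 | 80.0 |
| 3 | 140106 | 64.0 | 1.0 | 162.6 | 2.0 | 0 | 22 | 22 | 14 | -1 | 80.0 |
| 4 | 140107 | 45.0 | 1.0 | -1.0 | 3.0 | 0 | 19 | 15 | 7 | -1 | 105.5 |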


In [None]:
# Display dynamic tensor shape for each patient
[t.shape for t in patient_dynamic_tensors]

[(59, 38),
 (53, 38),
 (90, 38),
 (104, 38),
 (71, 38),
 (73, 38),
 (64, 38),
 (77, 38),
 (3, 38),
 (85, 38),
 (56, 38),
 (81, 38),
 (81, 38),
 (59, 38),
 (105, 38),
 (109, 38),
 (64, 38),
 (58, 38),
 (93, 38),
 (73, 38),
 (98, 38),
 (93, 38),
 (89, 38),
 (51, 38),
 (84, 38),
 (69, 38),
 (54, 38),
 (73, 38),
 (54, 38),
 (85, 38),
 (55, 38),
 (45, 38),
 (101, 38),
 (56, 38),
 (61, 38),
 (135, 38),
 (47, 38),
 (53, 38),
 (68, 38),
 (72, 38),
 (73, 38),
 (116, 38),
 (130, 38),
 (58, 38),
 (121, 38),
 (78, 38),
 (110, 38),
 (53, 38),
 (70, 38),
 (95, 38),
 (56, 38),
 (98, 38),
 (80, 38),
 (55, 38),
 (75, 38),
 (51, 38),
 (67, 38),
 (51, 38),
 (10, 38),
 (93, 38),
 (93, 38),
 (62, 38),
 (53, 38),
 (106, 38),
 (80, 38),
 (123, 38),
 (61, 38),
 (88, 38),
 (80, 38),
 (101, 38),
 (93, 38),
 (83, 38),
 (54, 38),
 (64, 38),
 (65, 38),
 (57, 38),
 (78, 38),
 (76, 38),
 (52, 38),
 (12, 38),
 (69, 38),
 (64, 38),
 (73, 38),
 (73, 38),
 (82, 38),
 (26, 38),
 (92, 38),
 (56, 38),
 (91, 38),
 (63, 38),

In [None]:
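# Create one DataFrame per patient, skipping patients with no dynamic rows:
# they contribute nothing to the stacked table, and recent pandas versions may warn about concatenating empty frames
patient_dataframes = [
    pd.DataFrame(patient_tensor, columns=all_dynamic_features)
    for patient_tensor in patient_dynamic_tensors
    if patient_tensor.shape[0] > 0
]

# Concatenate all patient DataFrames into a single DataFrame
df_dynamic = pd.concat(patient_dataframes, ignore_index=True)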

# Convert 'RecordID' and 'Minutes' columns to integer type
df_dynamic['RecordID'] = df_dynamic['RecordID'].astype(int)
df_dynamic['Minutes'] = df_dynamic['Minutes'].astype(int)

# Display first few rows
df_dynamic.head()



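|   | RecordID | Minutes | ALP | ALT | AST | Albumin | BUN | Bilirubin | Cholesterol | Creatinine | ... | Platelets | RespRate | SaO2 | SysABP | Temp | TroponinI | TroponinT | Urine | WBC | pH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 140101 | 4 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 1 | 140101 | 34 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | 38.1 | -1.0 | -1.0 | 90.0 | -1.0 | -1.0 |
| 2 | 140101 | 64 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 3 | 140101 | 124 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 80.0 | -1.0 | -1.0 |
| 4 | 140101 | 184 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 100.0 | -1.0 | -1.0 |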


In [None]:
# Review DataFrames before saving
print("Unique RecordIDs in dynamic df:", df_dynamic['RecordID'].nunique())
print("Unique RecordIDs in static df:", df_static['RecordID'].nunique())

# Find RecordIDs in df_static that are not in df_dynamic
missing_in_dynamic = pd.Index(df_static['RecordID'].unique()).difference(pd.Index(df_dynamic['RecordID'].unique()))
print("RecordIDs in df_static missing from df_dynamic:", missing_in_dynamic)

# Find indices of empty or -1 tensors in dynamic data
empty_or_minus_one_tensors_indices = []
for i, tensor in enumerate(patient_dynamic_tensors):
    # A tensor counts as missing if it has no rows or holds only the -1 fill value
    if tensor.shape[0] == 0 or np.all(tensor == -1):
        empty_or_minus_one_tensors_indices.append(i)

print("Indices of empty or -1 tensors:", empty_or_minus_one_tensors_indices)

# Get the RecordIDs for the identified indices
missing_record_ids_from_tensors = [patient_static_data[i]['RecordID'] for i in empty_or_minus_one_tensors_indices]
print("RecordIDs corresponding to empty or -1 tensors:", missing_record_ids_from_tensors)

Unique RecordIDs in dynamic df: 3997
Unique RecordIDs in static df: 4000
RecordIDs in df_static missing from df_dynamic: Index([140501, 140936, 141264], dtype='int64')
Indices of empty or -1 tensors: [147, 307, 437]
RecordIDs corresponding to empty or -1 tensors: [140501, 140936, 141264]


In [None]:
# Save static and dynamic DataFrames as CSV files

processed_dir = os.path.join(basePath, 'data', 'processed')

# Create the output directory if it does not exist yet; to_csv does not create missing folders
os.makedirs(processed_dir, exist_ok=True)

# Note: use the lines below when running on a local machine
# static_df_file = r'../data/processed/patient_static_data_df.csv'
# dynamic_df_file = r'../data/processed/patient_dynamic_tensors_df.csv'

# Note: use the lines below when running in Google Colab
static_df_file = os.path.join(processed_dir, 'patient_static_data_df.csv')
dynamic_df_file = os.path.join(processed_dir, 'patient_dynamic_tensors_df.csv')

# Save the static DataFrame
df_static.to_csv(static_df_file, index=False)

# Save the dynamic DataFrame
df_dynamic.to_csv(dynamic_df_file, index=False)
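
A later notebook can reload these files with `pd.read_csv`. The round trip below is a minimal sketch on a made-up frame written to a temporary directory; the file name `toy_static.csv` and the values are placeholders, not project data.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for df_static (values are made up)
toy_static = pd.DataFrame({"RecordID": [1, 2], "Age": [39.0, 70.0], "Height": [170.2, -1.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy_static.csv")
    # index=False keeps the row index out of the file,
    # so no extra 'Unnamed: 0' column appears on reload
    toy_static.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(list(reloaded.columns))  # ['RecordID', 'Age', 'Height']
```

Because the -1 fill value is written as an ordinary number, it survives the round trip and still has to be treated as "missing" by whatever code reads the file.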

## 6. Aggregate Patient Features

This section aggregates the processed static data and dynamic tensors for each patient into a single pandas DataFrame. It combines the static features directly and calculates summary statistics (mean, std, min, max, count) for each dynamic feature across time, creating a flattened representation of the data suitable for modeling.
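
The same per-patient summary statistics can also be sketched with a vectorized `groupby` on a stacked table shaped like `df_dynamic`. The example below uses made-up values and is an alternative formulation under that assumption, not the approach taken in the cells that follow.

```python
import numpy as np
import pandas as pd

# Toy stacked dynamic table in the layout of df_dynamic (values are made up)
toy_dynamic = pd.DataFrame({
    "RecordID": [1, 1, 1, 2, 2],
    "Minutes": [0, 30, 60, 0, 45],
    "HR": [80.0, -1.0, 90.0, 70.0, 75.0],
    "Temp": [-1.0, 37.0, 38.0, -1.0, -1.0],
})

# Turn the -1 fill value back into NaN so pandas skips it in every statistic
# (a genuine reading of exactly -1 would be dropped as well)
values = toy_dynamic.drop(columns="Minutes").replace(-1, np.nan)

# One row per patient, five statistics per feature
stats = values.groupby("RecordID").agg(["mean", "std", "min", "max", "count"])

# Flatten the (feature, statistic) MultiIndex into names such as 'HR_mean'
stats.columns = [f"{feature}_{stat}" for feature, stat in stats.columns]
toy_aggregated = stats.reset_index()

print(toy_aggregated[["RecordID", "HR_mean", "Temp_count"]])
```

Patient 2 has no valid `Temp` reading, so `Temp_mean` is NaN and `Temp_count` is 0, matching the convention used in the loop below. Patients with no dynamic rows at all would be absent from a stacked table, so this variant would need a left merge onto the static table to keep them.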

In [None]:
# Pair each patient's static record with its dynamic tensor (one dictionary per patient)
patient_combined_data = [
    {'static': static_record, 'dynamic': dynamic_tensor}
    for static_record, dynamic_tensor in zip(patient_static_data, patient_dynamic_tensors)
]

In [None]:
# Process each patient's combined data
patient_features_list = []

for patient_data_dict in patient_combined_data:
    features = {}

    # Extract static features, including RecordID
    static_data = patient_data_dict['static']
    features.update(static_data)

    # Calculate summary statistics for dynamic features
    dynamic_tensor = patient_data_dict['dynamic']
    dynamic_df = pd.DataFrame(dynamic_tensor, columns=all_dynamic_features)

    # Exclude 'RecordID' and 'Minutes' from aggregation
    dynamic_cols_for_aggregation = [col for col in dynamic_df.columns if col not in ['RecordID', 'Minutes']]

    for col in dynamic_cols_for_aggregation:
        # Exclude the -1 fill value from calculations
        valid_data = dynamic_df[col][dynamic_df[col] != -1]

        if not valid_data.empty:
            features[f'{col}_mean'] = valid_data.mean()
            features[f'{col}_std'] = valid_data.std()
            features[f'{col}_min'] = valid_data.min()
            features[f'{col}_max'] = valid_data.max()
            features[f'{col}_count'] = valid_data.count()
        else:
            # cases where all values are -1
            features[f'{col}_mean'] = np.nan
            features[f'{col}_std'] = np.nan
            features[f'{col}_min'] = np.nan
            features[f'{col}_max'] = np.nan
            features[f'{col}_count'] = 0

    patient_features_list.append(features)

# Create the aggregated dataframe
df_aggregated = pd.DataFrame(patient_features_list)

In [None]:
# Display the first few rows of the aggregated dataframe
df_aggregated.head()

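|   | RecordID | Age | Gender | Height | ICUType | In-hospital_death | Length_of_stay | SAPS-I | SOFA | Survival | ... | WBC_mean | WBC_std | WBC_min | WBC_max | WBC_count | pH_mean | pH_std | pH_min | pH_max | pH_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 140101 | 39.0 | 0.0 | 170.2 | 3.0 | 0 | 10 | 10 | 7 | -1 | ... | 13.25 | 3.606245 | 10.7 | 15.8 | 2 | 7.376667 | 0.047258 | 7.34 | 7.43 | 3 |
| 1 | 140102 | 70.0 | 0.0 | -1.0 | 3.0 | 0 | 39 | 11 | 6 | 393 | ... | 12.7 | 1.272792 | 11.8 | 13.6 | 2 | 7.41 | 0.028284 | 7.39 | 7.43 | 2 |
| 2 | 140104 | 61.0 | 1.0 | 188.0 | 2.0 | 0 | 5 | 18 | 7 | -1 | ... | 14.866667 | 4.042689 | 10.2 | 17.3 | 3 | 7.349091 | 0.037001 | 7.29 | 7.41 | 11 |
| 3 | 140106 | 64.0 | 1.0 | 162.6 | 2.0 | 0 | 22 | 22 | 14 | -1 | ... | 7.28 | 1.719593 | 5.0 | 8.9 | 5 | 7.394375 | 0.0575 | 7.27 | 7.47 | 16 |
| 4 | 140107 | 45.0 | 1.0 | -1.0 | 3.0 | 0 | 19 | 15 | 7 | -1 | ... | 11.933333 | 1.059874 | 10.8 | 12.9 | 3 | 7.461667 | 0.037639 | 7.42 | 7.53 | 6 |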


In [None]:
# Note: use the line below when running on a local machine
# df_aggregated_file = r'../data/processed/patient_aggregated_features_df.csv'

# Note: use the line below when running in Google Colab
df_aggregated_file = os.path.join(processed_dir, 'patient_aggregated_features_df.csv')

# Save the aggregated dataframe
df_aggregated.to_csv(df_aggregated_file, index=False)