# Raw Data Ingestion and Structuring

This notebook covers the first stage of the pipeline: obtaining the raw data and structuring it for analysis. It imports the required libraries, mounts Google Drive to access files, imports custom utility functions, downloads the raw dataset from Kaggle, splits each patient record into static and dynamic features, saves the structured data, and aggregates the static and dynamic features into a single DataFrame suitable for downstream tasks.

## Import Necessary Libraries

This section installs `kagglehub` and imports the libraries used throughout the notebook: `kagglehub` for downloading the dataset, `os` and `sys` for file-system and path operations, `pickle` for object serialization, and `pandas` and `numpy` for data manipulation and numerical computing.

In [None]:
# Reference: https://github.com/Kaggle/kagglehub/blob/main/README.md#installation
%pip install kagglehub



In [None]:
import kagglehub
import sys
import os
import pickle
import pandas as pd
import numpy as np

## Mount Google Drive

This section mounts Google Drive so that project files can be read and written directly from the Colab environment.

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Base file path (relative to /content, the default Colab working directory)
basePath = 'drive/MyDrive/Colab Notebooks/AAI-590-01_02/AAI590_CapstoneProject'

Mounted at /content/drive
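
Several later cells switch between local and Colab paths by commenting lines in and out. As a minimal sketch of an alternative, the helper below picks the base path automatically. The names `running_in_colab`, `resolve_base_path`, `COLAB_BASE`, and `LOCAL_BASE` are hypothetical and not part of this project; the local base of `..` is an assumption taken from the relative paths in the commented-out local code.

```python
import os
from typing import Optional

# Assumed locations: the Colab path is the notebook's basePath, the local
# path mirrors the '../data/...' layout used in the commented-out cells.
COLAB_BASE = "drive/MyDrive/Colab Notebooks/AAI-590-01_02/AAI590_CapstoneProject"
LOCAL_BASE = ".."


def running_in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401
    except ImportError:
        return False
    return True


def resolve_base_path(in_colab: Optional[bool] = None) -> str:
    """Pick the project base path; detect the environment when not told."""
    if in_colab is None:
        in_colab = running_in_colab()
    return COLAB_BASE if in_colab else LOCAL_BASE


base = resolve_base_path()
raw_dir = os.path.join(base, "data", "raw")
print("Raw data directory:", raw_dir)
```

With a helper like this, the paired local and Colab path assignments could collapse into a single `os.path.join(base, ...)` call each.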


## Import Custom Modules

This section adds the source directory to the system path, allowing the notebook to import custom utility functions.

In [None]:
# Reference: https://coderivers.org/blog/sys-path-append-python/

# Note: use the line below when running on a local machine
# sys.path.append(os.path.abspath(os.path.join('..', 'src')))

# Note: use the line below when running in Google Colab
sys.path.append(os.path.join(basePath, 'src'))

from utils import copy_entire_directory, convert_time_minutes
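
The source of `utils` is not shown in this notebook. The sketch below is only a guess at what the two imported helpers might look like, inferred from how they are called later: `convert_time_minutes` receives an `'HH:MM'` string and `copy_entire_directory` receives a source and a target directory. The bodies are assumptions, not the project's actual implementation.

```python
import shutil


def convert_time_minutes(time_str: str) -> int:
    """Convert an 'HH:MM' string into total minutes (hypothetical version)."""
    hours, minutes = time_str.split(":")
    return int(hours) * 60 + int(minutes)


def copy_entire_directory(src: str, dst: str) -> None:
    """Recursively copy src into dst (hypothetical version).

    dirs_exist_ok=True (Python 3.8+) lets the copy succeed when dst
    already exists, so the cell can be re-run safely.
    """
    shutil.copytree(src, dst, dirs_exist_ok=True)


print(convert_time_minutes("00:07"))
print(convert_time_minutes("47:37"))
```

The hours field is not limited to 0-23 in this sketch, so timestamps that run past the first day still convert correctly.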

## Download Kaggle Dataset

This section uses the `kagglehub` library to log in to Kaggle and download the specified dataset.

In [None]:
# References:
# https://github.com/Kaggle/kagglehub/blob/main/README.md#option-1-calling-kagglehublogin
# https://github.com/Kaggle/kaggle-api/tree/main/docs#api-credentials
kagglehub.login()

Kaggle credentials set.
Kaggle credentials successfully validated.


In [None]:
# References:
# https://www.kaggle.com/datasets/msafi04/predict-mortality-of-icu-patients-physionet
# https://github.com/Kaggle/kagglehub/blob/main/README.md#download-dataset

# Download latest version
path = kagglehub.dataset_download("msafi04/predict-mortality-of-icu-patients-physionet")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/msafi04/predict-mortality-of-icu-patients-physionet?dataset_version_number=1...


100%|██████████| 7.64M/7.64M [00:00<00:00, 130MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/msafi04/predict-mortality-of-icu-patients-physionet/versions/1


In [None]:
# Note: use the code below when running on a local machine

# Copy dataset from cache to target directory
# target_dir = "../data/raw"
# copy_entire_directory(path, target_dir)

In [None]:
# Note: use the code below when running in Google Colab

# Copy dataset from cache to target directory
target_dir = os.path.join(basePath, 'data', 'raw')
copy_entire_directory(path, target_dir)
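
After the copy, a quick check that the expected files arrived can catch a failed download early. The sketch below assumes the layout used in the following section (`set-a/set-a` and `Outcomes-a.txt` under the raw data directory); `missing_paths` is a hypothetical helper, and the temporary directory only stands in for `target_dir` so the example runs on its own.

```python
import os
import tempfile
from typing import Iterable, List

# Paths the next section reads, relative to the raw data directory
EXPECTED = [os.path.join("set-a", "set-a"), "Outcomes-a.txt"]


def missing_paths(root: str, relative_paths: Iterable[str]) -> List[str]:
    """Return the relative paths that do not exist under root."""
    return [rel for rel in relative_paths
            if not os.path.exists(os.path.join(root, rel))]


# Demonstration on a throwaway directory that lacks the outcomes file
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "set-a", "set-a"))
    result = missing_paths(root, EXPECTED)

print("Missing:", result)
```

In the notebook itself the call would be `missing_paths(target_dir, EXPECTED)`, with an empty list meaning the copy is complete.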

## Save Structured Data

This section builds and saves the structured data. The static data is collected by iterating through the patient files, extracting the static features, merging them with the outcome data, and storing the result in a pandas DataFrame. The dynamic tensors are created by taking each patient's time-varying measurements, converting timestamps to minutes, pivoting the data to a time-by-feature layout, and converting the result into NumPy arrays (tensors). Saving both lets subsequent steps load the cleaned and structured data without reprocessing the raw files.

In [None]:
# Note: use the paths below when running on a local machine
# data_dir = r'../data/raw/set-a/set-a'
# outcomes_path = r'../data/raw/Outcomes-a.txt'

# Note: use the paths below when running in Google Colab
data_dir = os.path.join(target_dir, 'set-a', 'set-a')
outcomes_path = os.path.join(target_dir, 'Outcomes-a.txt')

# To collect unique dynamic and static features
all_dynamic_features = set()
all_static_features = set()

# Load outcome data and extract static feature names (excluding RecordID)
outcomes_df = pd.read_csv(outcomes_path)
outcomes_df.set_index('RecordID', inplace=True)
outcome_static_features = set(outcomes_df.columns) - {'RecordID'}

# Add static feature names from patient and outcome data into static set
static_params = ['RecordID', 'Age', 'Gender', 'Height', 'ICUType', 'Weight']
all_static_features.update(static_params)
all_static_features.update(outcome_static_features)

# Iterate through all patient files and collect all feature names
for file in os.listdir(data_dir):
    if file.endswith('.txt'):
        df = pd.read_csv(os.path.join(data_dir, file))
        all_params = df['Parameter']
        all_dynamic_features.update(all_params)

# Derive dynamic feature names by excluding all known static feature names
all_dynamic_features = all_dynamic_features - all_static_features

# Convert to sorted lists for consistent tensor construction
all_dynamic_features = sorted(all_dynamic_features)
all_static_features = sorted(all_static_features)

# View static and dynamic feature names
print("Dynamic features:", all_dynamic_features)
print("Static features:", all_static_features)

Dynamic features: ['ALP', 'ALT', 'AST', 'Albumin', 'BUN', 'Bilirubin', 'Cholesterol', 'Creatinine', 'DiasABP', 'FiO2', 'GCS', 'Glucose', 'HCO3', 'HCT', 'HR', 'K', 'Lactate', 'MAP', 'MechVent', 'Mg', 'NIDiasABP', 'NIMAP', 'NISysABP', 'Na', 'PaCO2', 'PaO2', 'Platelets', 'RespRate', 'SaO2', 'SysABP', 'Temp', 'TroponinI', 'TroponinT', 'Urine', 'WBC', 'pH']
Static features: ['Age', 'Gender', 'Height', 'ICUType', 'In-hospital_death', 'Length_of_stay', 'RecordID', 'SAPS-I', 'SOFA', 'Survival', 'Weight']


In [None]:
# Build one static record and one dynamic tensor per patient
patient_static_data = []
patient_dynamic_tensors = []

# Iterate through all patient files
for file in os.listdir(data_dir):
    if file.endswith('.txt'):
        # Use a new name so the dataset download location stored in `path` is not overwritten
        file_path = os.path.join(data_dir, file)
        df = pd.read_csv(file_path)

        # Get RecordID
        rid = int(df[df['Parameter'] == 'RecordID']['Value'].values[0])

        # Static Features: Filter by static feature set
        df['Value'] = pd.to_numeric(df['Value'], errors='coerce')
        static_subset = df[df['Parameter'].isin(all_static_features)]
        static_dict = static_subset.drop_duplicates('Parameter').set_index('Parameter')['Value'].to_dict()
        static_dict['RecordID'] = rid # update with integer value

        # Inject static data from outcomes
        if rid in outcomes_df.index:
            static_dict.update(outcomes_df.loc[rid].to_dict())

        # Reorder and fill missing static features with -1
        ordered_static = {key: static_dict.get(key, -1) for key in all_static_features}
        patient_static_data.append(ordered_static)

        # Dynamic Features: Filter by dynamic feature set
        dynamic_subset = df[df['Parameter'].isin(all_dynamic_features)].copy()

        # Converts time string in 'HH:MM' format into total minutes since ICU admission
        dynamic_subset['Minutes'] = dynamic_subset['Time'].apply(convert_time_minutes)

        # Pivot the dynamic data to have 'Minutes' as index, 'Parameter' as columns, and 'Value' as values
        # Use 'last' as the aggregation function in case of multiple values at the same timestamp
        pivot = dynamic_subset.pivot_table(index='Minutes', columns='Parameter', values='Value', aggfunc='last')
        # Reindex the columns to match the order of all_dynamic_features, fill missing values with -1, and sort by index
        pivot = pivot.reindex(columns=all_dynamic_features).fillna(-1).sort_index()

        # Convert the pivoted DataFrame to a NumPy array (tensor)
        tensor = pivot.to_numpy()
        # Append the dynamic tensor to the list of patient dynamic tensors
        patient_dynamic_tensors.append(tensor)

# Display data for the first patient
print("Static keys:", patient_static_data[0].keys())
print("Dynamic tensor shape (time × features):", patient_dynamic_tensors[0].shape)

Static keys: dict_keys(['Age', 'Gender', 'Height', 'ICUType', 'In-hospital_death', 'Length_of_stay', 'RecordID', 'SAPS-I', 'SOFA', 'Survival', 'Weight'])
Dynamic tensor shape (time × features): (82, 36)
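
The pivot step in the cell above can be illustrated on a toy record. The values below are made up for the example; the calls mirror the ones used in the loop (`pivot_table` with `aggfunc='last'`, then `reindex`, `fillna(-1)`, and `sort_index`).

```python
import pandas as pd

# Fixed, sorted feature order; 'pH' never appears in the toy data
features = ["GCS", "HR", "Temp", "pH"]

# Long format: one row per (time, parameter) measurement
long_df = pd.DataFrame({
    "Minutes": [7, 7, 37, 37, 37],
    "Parameter": ["GCS", "HR", "HR", "HR", "Temp"],
    "Value": [15.0, 73.0, 77.0, 80.0, 35.1],
})

# Wide format: one row per timestamp, one column per parameter.
# HR is recorded twice at minute 37, so 'last' keeps 80.0.
pivot = long_df.pivot_table(index="Minutes", columns="Parameter",
                            values="Value", aggfunc="last")

# Force the shared column order, then mark unobserved cells with -1
pivot = pivot.reindex(columns=features).fillna(-1).sort_index()

tensor = pivot.to_numpy()
print(pivot)
print(tensor.shape)
```

Reindexing against the full feature list is what gives every patient tensor the same 36 columns, even when a patient never had a given measurement taken.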


In [None]:
# Convert the list of per-patient static records to a DataFrame and display the first few rows
df_static = pd.DataFrame(patient_static_data)
df_static.head()

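|   | Age | Gender | Height | ICUType | In-hospital_death | Length_of_stay | RecordID | SAPS-I | SOFA | Survival | Weight |
|---|-----|--------|--------|---------|-------------------|----------------|----------|--------|------|----------|--------|
| 0 | 50.0 | 1.0 | 175.3 | 3.0 | 0 | 34 | 137671 | 17 | 10 | 441 | 71.7 |
| 1 | 90.0 | 0.0 | -1.0 | 3.0 | 0 | 21 | 135981 | 15 | 4 | -1 | 68.0 |
| 2 | 38.0 | 0.0 | 167.6 | 3.0 | 0 | 16 | 139976 | 17 | 13 | -1 | 42.4 |
| 3 | 68.0 | 1.0 | 167.6 | 1.0 | 0 | 6 | 140654 | 13 | 11 | -1 | 78.0 |
| 4 | 70.0 | 0.0 | 162.6 | 4.0 | 0 | 18 | 135885 | 9 | 4 | -1 | 79.0 |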


In [None]:
# Display dynamic tensor shape for each patient
[t.shape for t in patient_dynamic_tensors]

[(82, 36),
 (61, 36),
 (83, 36),
 (66, 36),
 (75, 36),
 (57, 36),
 (42, 36),
 (57, 36),
 (85, 36),
 (77, 36),
 (62, 36),
 (70, 36),
 (102, 36),
 (104, 36),
 (59, 36),
 (19, 36),
 (80, 36),
 (47, 36),
 (125, 36),
 (53, 36),
 (60, 36),
 (65, 36),
 (87, 36),
 (70, 36),
 (82, 36),
 (73, 36),
 (84, 36),
 (73, 36),
 (77, 36),
 (101, 36),
 (88, 36),
 (77, 36),
 (79, 36),
 (87, 36),
 (50, 36),
 (46, 36),
 (66, 36),
 (68, 36),
 (11, 36),
 (106, 36),
 (112, 36),
 (60, 36),
 (55, 36),
 (53, 36),
 (79, 36),
 (33, 36),
 (57, 36),
 (76, 36),
 (51, 36),
 (51, 36),
 (59, 36),
 (71, 36),
 (60, 36),
 (60, 36),
 (91, 36),
 (81, 36),
 (76, 36),
 (75, 36),
 (80, 36),
 (65, 36),
 (85, 36),
 (60, 36),
 (40, 36),
 (91, 36),
 (71, 36),
 (78, 36),
 (81, 36),
 (92, 36),
 (63, 36),
 (44, 36),
 (81, 36),
 (86, 36),
 (96, 36),
 (81, 36),
 (58, 36),
 (107, 36),
 (92, 36),
 (80, 36),
 (82, 36),
 (70, 36),
 (63, 36),
 (56, 36),
 (115, 36),
 (58, 36),
 (83, 36),
 (69, 36),
 (6, 36),
 (52, 36),
 (55, 36),
 (77, 36),
 ...

In [None]:
# Reference: https://docs.python.org/3/library/pickle.html#module-interface

processed_dir = os.path.join(basePath, 'data', 'processed')

# Create the output directory if it does not already exist
os.makedirs(processed_dir, exist_ok=True)

# Note: use the paths below when running on a local machine
# static_df_file = r'../data/processed/patient_static_data_df.csv'
# dynamic_data_file = r'../data/processed/patient_dynamic_tensors.pkl'

# Note: use the paths below when running in Google Colab
static_df_file = os.path.join(processed_dir, 'patient_static_data_df.csv')
dynamic_data_file = os.path.join(processed_dir, 'patient_dynamic_tensors.pkl')

# Save the static dataframe
df_static.to_csv(static_df_file, index=False)

# Save the list of dynamic tensors by pickling it
with open(dynamic_data_file, 'wb') as f:
    pickle.dump(patient_dynamic_tensors, f)

In [None]:
# Display first patient dynamic tensor data
patient_dynamic_tensors[0]

array([[58. , 15. , 30. , ..., 22. ,  6.1, -1. ],
       [-1. , -1. , -1. , ..., -1. , -1. , -1. ],
       [-1. , -1. , -1. , ..., -1. , -1. , -1. ],
       ...,
       [-1. , -1. , -1. , ..., -1. , -1. , -1. ],
       [-1. , -1. , -1. , ..., -1. , -1. , -1. ],
       [-1. , -1. , -1. , ..., 90. , -1. , -1. ]])

In [None]:
# Reference: https://docs.python.org/3/library/pickle.html#module-interface

# Load (unpickle) the object
with open(dynamic_data_file, 'rb') as f:
    loaded_object = pickle.load(f)

# Display first patient dynamic tensor data from the loaded object to compare
loaded_object[0]

array([[58. , 15. , 30. , ..., 22. ,  6.1, -1. ],
       [-1. , -1. , -1. , ..., -1. , -1. , -1. ],
       [-1. , -1. , -1. , ..., -1. , -1. , -1. ],
       ...,
       [-1. , -1. , -1. , ..., -1. , -1. , -1. ],
       [-1. , -1. , -1. , ..., -1. , -1. , -1. ],
       [-1. , -1. , -1. , ..., 90. , -1. , -1. ]])
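
The two printed arrays are truncated, so a visual comparison only covers a handful of entries. As a minimal sketch, the round trip can be checked programmatically over every tensor. `tensors_equal` is a hypothetical helper, and the small arrays below stand in for `patient_dynamic_tensors` so the example runs on its own.

```python
import pickle

import numpy as np


def tensors_equal(a, b) -> bool:
    """True when both lists hold the same number of element-wise equal arrays."""
    return len(a) == len(b) and all(np.array_equal(x, y) for x, y in zip(a, b))


# Stand-in for patient_dynamic_tensors: arrays with different row counts
original = [
    np.array([[58.0, -1.0], [-1.0, 90.0]]),
    np.full((3, 2), -1.0),
]

# Serialize to bytes and back, the in-memory equivalent of dump/load
restored = pickle.loads(pickle.dumps(original))

round_trip_ok = tensors_equal(original, restored)
print("Round trip preserved all tensors:", round_trip_ok)
```

In the notebook the equivalent call would be `tensors_equal(patient_dynamic_tensors, loaded_object)`.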

## Aggregate Patient Features

This section aggregates the processed static data and dynamic tensors for each patient into a single pandas DataFrame. It copies the static features directly and calculates summary statistics (mean, std, min, max, count) for each dynamic feature across time, ignoring the -1 fill value, which produces a flattened, one-row-per-patient representation of the data suitable for modeling.

In [None]:
# Pair each patient's static record with their dynamic tensor (one dictionary per patient)
patient_combined_data = [
    {'static': static, 'dynamic': dynamic}
    for static, dynamic in zip(patient_static_data, patient_dynamic_tensors)
]

In [None]:
# Process each patient's combined data
patient_features_list = []

for patient_data_dict in patient_combined_data:
    features = {}

    # Extract static features
    static_data = patient_data_dict['static']
    features.update(static_data)

    # Calculate summary statistics for dynamic features
    dynamic_tensor = patient_data_dict['dynamic']
    dynamic_df = pd.DataFrame(dynamic_tensor, columns=all_dynamic_features)

    for col in dynamic_df.columns:
        # Exclude the -1 fill value from calculations
        valid_data = dynamic_df[col][dynamic_df[col] != -1]

        if not valid_data.empty:
            features[f'{col}_mean'] = valid_data.mean()
            features[f'{col}_std'] = valid_data.std()
            features[f'{col}_min'] = valid_data.min()
            features[f'{col}_max'] = valid_data.max()
            features[f'{col}_count'] = valid_data.count()
        else:
            # No valid observations: every value in this column is the -1 fill value
            features[f'{col}_mean'] = np.nan
            features[f'{col}_std'] = np.nan
            features[f'{col}_min'] = np.nan
            features[f'{col}_max'] = np.nan
            features[f'{col}_count'] = 0

    patient_features_list.append(features)

# Create the aggregated dataframe
df_aggregated = pd.DataFrame(patient_features_list)

In [None]:
# Display the first few rows of the aggregated dataframe
df_aggregated.head()

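|   | Age | Gender | Height | ICUType | In-hospital_death | Length_of_stay | RecordID | SAPS-I | SOFA | Survival | ... | WBC_mean | WBC_std | WBC_min | WBC_max | WBC_count | pH_mean | pH_std | pH_min | pH_max | pH_count |
|---|-----|--------|--------|---------|-------------------|----------------|----------|--------|------|----------|-----|----------|---------|---------|---------|-----------|---------|--------|--------|--------|----------|
| 0 | 50.0 | 1.0 | 175.3 | 3.0 | 0 | 34 | 137671 | 17 | 10 | 441 | ... | 7.2 | 1.493318 | 6.1 | 8.9 | 3 | 7.417143 | 0.057652 | 7.33 | 7.5 | 7 |
| 1 | 90.0 | 0.0 | -1.0 | 3.0 | 0 | 21 | 135981 | 15 | 4 | -1 | ... | 15.7 | 0.565685 | 15.3 | 16.1 | 2 | NaN | NaN | NaN | NaN | 0 |
| 2 | 38.0 | 0.0 | 167.6 | 3.0 | 0 | 16 | 139976 | 17 | 13 | -1 | ... | 19.2 | 13.467244 | 3.0 | 31.4 | 4 | 7.323636 | 0.112006 | 7.13 | 7.5 | 11 |
| 3 | 68.0 | 1.0 | 167.6 | 1.0 | 0 | 6 | 140654 | 13 | 11 | -1 | ... | 11.0 | 1.414214 | 10.0 | 12.0 | 2 | 7.346 | 0.0251 | 7.33 | 7.39 | 5 |
| 4 | 70.0 | 0.0 | 162.6 | 4.0 | 0 | 18 | 135885 | 9 | 4 | -1 | ... | 14.8 | 1.555635 | 13.7 | 15.9 | 2 | 7.272222 | 0.0284 | 7.2 | 7.3 | 18 |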
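
The per-column loop used for the aggregation can also be written in vectorized form. The sketch below is an alternative under stated assumptions, not the notebook's method: it replaces the -1 fill value with NaN and lets `DataFrame.agg` skip missing entries. `summarize_tensor` is a hypothetical name, and the toy tensor is made up for the example.

```python
import numpy as np
import pandas as pd


def summarize_tensor(tensor, feature_names, fill_value=-1):
    """Return {feature_stat: value} for one patient's time-by-feature array."""
    # Turn the fill value into NaN so pandas ignores it in every statistic
    df = pd.DataFrame(tensor, columns=feature_names).replace(fill_value, np.nan)
    stats = df.agg(["mean", "std", "min", "max", "count"])
    return {
        f"{col}_{stat}": stats.loc[stat, col]
        for col in feature_names
        for stat in stats.index
    }


# Toy patient: HR observed twice, pH never observed
toy_tensor = np.array([[58.0, -1.0], [60.0, -1.0], [-1.0, -1.0]])
summary = summarize_tensor(toy_tensor, ["HR", "pH"])
print(summary)
```

One difference from the loop version: the counts come back as floats because they share a DataFrame with the other statistics, so a cast to integer may be wanted before saving.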


In [None]:
# Note: use the path below when running on a local machine
# df_aggregated_file = r'../data/processed/patient_aggregated_features_df.csv'

# Note: use the path below when running in Google Colab
df_aggregated_file = os.path.join(processed_dir, 'patient_aggregated_features_df.csv')

# Save the aggregated dataframe
df_aggregated.to_csv(df_aggregated_file, index=False)