UIUC: CS598:DLH Example/Use Case PR Submission From:

Swaroop Potdar netid (spotd), Jake Cumberland netid (jakepc3)

# Designing Pilot/Proof of Concept Study With PyHealth and MIMIC-IV

# Objective:
Given privacy concerns and the scarcity of available data, the goal of this workflow is to match the MIMIC-IV Waveform Database (WaveDB) with MIMIC-IV clinical ground-truth data to enable useful analysis. Specifically, by linking patient diagnosis information with waveform data, we can create datasets for machine learning tasks. These datasets are well suited to a pilot or proof-of-concept study, where we can assess the validity of a hypothesis before collecting more, and more diverse, data on a larger scale.

For example, we can identify which patients are positive for a condition such as atrial fibrillation, match them with their corresponding waveform records, and choose control subjects who are negative for the target condition but still have waveform records.


**The workflow proceeds through the following steps:**

1.  Load MIMIC-IV clinical data (DIAGNOSES_ICD table) as ground truth.

2.  Load MIMIC-IV waveform database records (RECORDS file).

3.  Extract diagnosis names from the ICD-9 and ICD-10 datasets.

4.  Merge clinical diagnosis data with waveform records to filter patients with specific diagnoses.

5.  Perform the final filtering by diagnosis name or ICD code to generate the final dataset for machine learning or other analysis.

6.  Download only the necessary waveform files.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Clone PyHealth and Install dependencies

In [None]:
# This is done in Colab; if executing locally, adjust the paths to match your local setup.
%cd /content/drive/MyDrive/
!git clone https://github.com/sunlabuiuc/PyHealth.git
%cd /content/drive/MyDrive/PyHealth/
!pip install -r requirements.txt

/content/drive/MyDrive


### Load MIMIC-IV Clinical Data

The MIMIC-IV clinical dataset is loaded, specifically focusing on the DIAGNOSES_ICD table, which contains the clinical diagnoses and their corresponding ICD codes.

MIMIC-IV Dataset: The MIMIC4Dataset object is initialized, pointing to the root directory of the MIMIC-IV Clinical dataset.

Data Collection: The dataset is loaded, specifically the diagnoses data, which includes patient IDs and their diagnoses as ICD-9 or ICD-10 codes.

In [None]:
from pyhealth.datasets import MIMIC4Dataset

mimic_dataset = MIMIC4Dataset(
    ehr_root="/content/drive/MyDrive/MIMICIV/physionet.org/files/mimiciv/3.1/", # Adapt and point this to where you have MIMICIV downloaded.
    ehr_tables=["DIAGNOSES_ICD"],
    ehr_config_path="/content/drive/MyDrive/PyHealth/pyhealth/datasets/configs/mimic4_ehr.yaml",
    dev=False
)

mimic4 = mimic_dataset.global_event_df.collect()

Memory usage Starting MIMIC4Dataset init: 4159.5 MB

Initializing MIMIC4EHRDataset with tables: ['DIAGNOSES_ICD'] (dev mode: False)

Memory usage Before initializing mimic4_ehr: 4159.5 MB

Initializing mimic4_ehr dataset from /content/drive/MyDrive/MIMICIV/physionet.org/files/mimiciv/3.1/ (dev mode: False)

Scanning table: diagnoses_icd from /content/drive/MyDrive/MIMICIV/physionet.org/files/mimiciv/3.1/hosp/diagnoses_icd.csv.gz

Joining with table: /content/drive/MyDrive/MIMICIV/physionet.org/files/mimiciv/3.1/hosp/admissions.csv.gz

Scanning table: patients from /content/drive/MyDrive/MIMICIV/physionet.org/files/mimiciv/3.1/hosp/patients.csv.gz

Scanning table: admissions from /content/drive/MyDrive/MIMICIV/physionet.org/files/mimiciv/3.1/hosp/admissions.csv.gz

Scanning table: icustays from /content/drive/MyDrive/MIMICIV/physionet.org/files/mimiciv/3.1/icu/icustays.csv.gz

Memory usage After initializing mimic4_ehr: 4160.3 MB

Memory usage After EHR dataset initialization: 4160.3 MB

Memory usage Before combining data: 4160.3 MB

Combining data from ehr dataset

Creating combined dataframe

Memory usage After combining data: 4160.3 MB

Memory usage Completed MIMIC4Dataset init: 4160.3 MB


### Filter and Clean the Clinical Data

Next, we filter the dataset to remove null diagnoses and columns with all null values.

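Filtering: The data is filtered to exclude rows where the diagnosis ICD code is null. As the log above shows, PyHealth also scans the patients, admissions, and icustays tables, and rows from those tables carry no diagnosis code.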

Column Cleanup: We also drop columns where all values are null.

In [None]:
import polars as pl

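# Keep only rows that carry a diagnosis code (test for real nulls, not the string "null")
mimic4_filtered = mimic4.filter(pl.col("diagnoses_icd/icd_code").is_not_null())
# Drop columns in which every value is null
mimic4_filtered = mimic4_filtered.select(
    [s.name for s in mimic4_filtered if s.null_count() < mimic4_filtered.height]
)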

### Load MIMIC-IV Waveform Database Records
Now, we load the MIMIC-IV waveform database (WaveDB) records. These records represent the paths to the waveform files.

Get WaveDB Records: We fetch the record paths for waveform files hosted on PhysioNet.

Clean HTML: Using regex, we clean up the HTML tags from the downloaded content.

Extract Record IDs: The record IDs (which are part of the path) are extracted using a regex pattern. We then build a DataFrame with the record_id and corresponding download_url.

In [None]:
import re

import pandas as pd
import requests

base_url = "https://physionet.org/files/mimic4wdb/0.1.0/"

url = "https://physionet.org/content/mimic4wdb/0.1.0/RECORDS"

response = requests.get(url)
response.raise_for_status()
cleaned_content = re.sub(r'<pre.*?>|</pre>|<code.*?>|</code>', '', response.text)

lines = cleaned_content.strip().splitlines()

records = [
    {
        "record_id": re.search(r'/p(\d{8})/', line).group(1),
        "download_url": base_url + line.strip('/')
    }
    for line in lines
    if re.search(r'/p\d{8}/', line)
]

wave_records = pd.DataFrame(records)
record_ids = wave_records['record_id'].tolist()
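The extraction logic can be checked offline before requesting anything from PhysioNet. The sketch below applies the same regex to a few hand-written lines. The sample paths are made up for illustration (they only mimic the assumed `waves/pXXX/pXXXXXXXX/...` layout), and `parse_record_lines` is a hypothetical helper, not part of PyHealth or WFDB.

```python
import re

# Same pattern as in the cell above: a subject directory is "p" followed by 8 digits.
SUBJECT_PATTERN = re.compile(r"/p(\d{8})/")

def parse_record_lines(lines, base_url):
    """Return one dict per line that contains a subject directory (hypothetical helper)."""
    records = []
    for line in lines:
        match = SUBJECT_PATTERN.search(line)
        if match is None:
            continue  # skip blank lines, leftover HTML, and anything else without a subject ID
        records.append({
            "record_id": match.group(1),
            # Strip whitespace first, then slashes, so the URL has no stray characters
            "download_url": base_url + line.strip().strip("/"),
        })
    return records

# Made-up sample lines, for illustration only.
sample_lines = [
    "waves/p123/p12345678/80000001/",
    "<div>not a record line</div>",
    "waves/p123/p12399999/80000002/",
]
parsed = parse_record_lines(sample_lines, "https://physionet.org/files/mimic4wdb/0.1.0/")
print([r["record_id"] for r in parsed])
```

Note that the short `/p123/` directory is ignored because the pattern requires exactly eight digits between the slashes.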

### Merge Clinical Data with Waveform Data
Now, we keep only the rows of the MIMIC-IV clinical dataset (mimic4_filtered) whose patient_id appears in the waveform records (wave_records). This ensures that we work only with patients who have corresponding waveform data available.

In [None]:
mimic4_match = mimic4_filtered.filter(pl.col("patient_id").is_in(record_ids))
result = mimic4_match
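One thing worth verifying at this step is that both sides of the match use the same type. The regex yields record IDs as strings, so if your `patient_id` values were integers the comparison would find no matches. The sketch below shows the pitfall in plain Python with made-up IDs; `normalize_id` is a hypothetical helper, and whether `patient_id` is already a string in your PyHealth version is an assumption to check.

```python
def normalize_id(value):
    """Convert an ID to a plain string so 12345678 and '12345678' compare equal."""
    return str(value).strip()

# Strings, as produced by the regex extraction (made-up values)
waveform_ids = {"12345678", "12399999"}
# Mixed types, to illustrate the mismatch (made-up values)
clinical_ids = [12345678, "12399999", 11111111]

# Naive membership test: the integer 12345678 is silently missed
naive_matches = [pid for pid in clinical_ids if pid in waveform_ids]
# Normalizing first recovers it
safe_matches = [pid for pid in clinical_ids if normalize_id(pid) in waveform_ids]

print(naive_matches, safe_matches)
```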

### Load ICD-9 and ICD-10 Diagnosis Labels
To get more human-readable names for the diagnoses, we load and clean the ICD-9 and ICD-10 code mappings from bioontology.org.


In [None]:
import pandas as pd

ics9 = pd.read_csv("https://data.bioontology.org/ontologies/ICD9CM/download?apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb&download_format=csv",compression='gzip',header=0,sep=',',quotechar='"')
raw_data = ics9
raw_data['flat_code'] = raw_data['Class ID'].apply(lambda x: x.split('/')[-1].replace('.', ''))
ics9_data = raw_data[['flat_code','Preferred Label']]

ics10 = pd.read_csv("https://data.bioontology.org/ontologies/ICD10CM/download?apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb&download_format=csv", compression='gzip', header=0, sep=',', quotechar='"')
raw_data = ics10
raw_data['flat_code'] = raw_data['Class ID'].apply(lambda x: x.split('/')[-1].replace('.', ''))
ics10_data = raw_data[['flat_code','Preferred Label']]

all_ics = pd.concat([ics9_data,ics10_data])
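The `flat_code` step takes the last path segment of each class identifier and removes the dot, which matches the dotless codes used in MIMIC-IV. A minimal sketch of that transformation follows. The identifier shapes are an assumption about the download's `Class ID` column, and `flatten_class_id` is a hypothetical name. Because ICD-9 and ICD-10 codes end up in one column after the concat, the sketch also keys labels by (version, code), which may help avoid ambiguous matches if you carry the ICD version through from the clinical table.

```python
def flatten_class_id(class_id):
    """Keep the last path segment of a class identifier and drop the dots."""
    return class_id.split("/")[-1].replace(".", "")

# Assumed identifier shapes, for illustration only
examples = [
    "http://purl.bioontology.org/ontology/ICD9CM/427.31",
    "http://purl.bioontology.org/ontology/ICD10CM/I48.91",
]
flat_codes = [flatten_class_id(c) for c in examples]

# Optional: key labels by (ICD version, flat code) instead of by code alone
label_lookup = {
    (9, "42731"): "Atrial fibrillation",
    (10, "I4891"): "Unspecified atrial fibrillation",
}
print(flat_codes, label_lookup[(9, flat_codes[0])])
```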

### Merge Diagnosis Data with Waveform Data
Now, we merge the filtered MIMIC-IV clinical data (mimic4_match) with the diagnosis labels (all_ics), allowing us to add the human-readable diagnosis names to our dataset.

Merging: We perform a left join between the filtered MIMIC-IV data and the ICD codes to add the diagnosis labels.

Missing Values: We fill any missing values with "NA" for clarity.

In [None]:
mimic_merged = pd.merge(result.to_pandas(), all_ics, left_on='diagnoses_icd/icd_code', right_on='flat_code', how='left')
mimic_merged.fillna("NA",inplace=True)
mimic_merged

### Search for Specific Diagnoses (e.g., Atrial Fibrillation)

Next, we filter the merged dataset to find patients diagnosed with atrial fibrillation, or any other condition you choose. These patients form our positive class.

In [None]:
# Case-insensitive substring match; "atrial" also matches other atrial conditions.
target_diagnosis = "atrial" # Adapt this to search for your specific target condition
positive_patients = mimic_merged[mimic_merged['Preferred Label'].str.contains(target_diagnosis,case=False)]
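Because the search is a substring match on the label, the breadth of the search term decides who lands in the positive class. The sketch below illustrates this on a made-up list of labels in plain Python; `patients_matching` is a hypothetical helper and the patient IDs are placeholders.

```python
# Made-up rows, for illustration only
labels = [
    {"patient_id": "A", "label": "Atrial fibrillation"},
    {"patient_id": "B", "label": "Atrial flutter"},
    {"patient_id": "C", "label": "Unspecified atrial fibrillation"},
    {"patient_id": "D", "label": "NA"},
]

def patients_matching(rows, term):
    """Return the sorted unique patient IDs whose label contains the term (case-insensitive)."""
    term = term.lower()
    return sorted({r["patient_id"] for r in rows if term in r["label"].lower()})

broad = patients_matching(labels, "atrial")                # also picks up atrial flutter
narrow = patients_matching(labels, "atrial fibrillation")  # fibrillation only
print(broad, narrow)
```

If the study targets one condition specifically, a narrower term or a filter on the ICD code itself is likely the safer choice.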


### Select Control Group (Negative Class)

Next, we select the control group: patients who are negative for the target condition and whose patient IDs do not overlap with those of the positive patients.

In [None]:
positive_patient_ids = positive_patients['patient_id'].tolist()

negative_patients = mimic_merged[~mimic_merged['patient_id'].isin(positive_patient_ids)]
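The exclusion has to happen at the patient level rather than the row level, because each patient usually has several diagnosis rows. The sketch below shows the difference on made-up data using plain Python sets; all IDs and labels are placeholders.

```python
# (patient_id, diagnosis label) pairs, made up for illustration
rows = [
    ("P1", "Atrial fibrillation"),
    ("P1", "Essential hypertension"),
    ("P2", "Essential hypertension"),
    ("P3", "Atrial flutter"),
]

def is_target(label):
    return "atrial fibrillation" in label.lower()

# Patient-level: a patient is positive if ANY of their rows matches the target
positive_ids = {pid for pid, label in rows if is_target(label)}
control_ids = {pid for pid, _ in rows} - positive_ids

# Row-level filtering would wrongly keep P1 because of the hypertension row
row_level_controls = {pid for pid, label in rows if not is_target(label)}

print(sorted(positive_ids), sorted(control_ids), sorted(row_level_controls))
```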

### Download Files for Positive and Negative Controls

Finally, we download only the files required for the positive and control sets. These can then be read with WFDB and normalized, tokenized, or otherwise structured to suit your machine learning models.

In [None]:
import random
import os
import subprocess

positive_patient_ids = list(set(positive_patients['patient_id'].tolist()))  # Unique positive patient IDs
negative_patient_ids = list(set(negative_patients['patient_id'].tolist()))  # Unique negative patient IDs

target_num = 1  # Patients to sample per class; set to 1 for this demo, adapt to your required dataset size.

# Select target_num random positive patients (those with the target condition)
selected_positive_ids = random.sample(positive_patient_ids, target_num)

# Select target_num random negative patients (those without the target condition)
selected_negative_ids = random.sample(negative_patient_ids, target_num)

selected_patient_ids = selected_positive_ids + selected_negative_ids

selected_wave_records = wave_records[wave_records['record_id'].isin(selected_patient_ids)]

# Create a directory to store the downloaded waveform data
download_dir = "/content/drive/MyDrive/waveform_data/"
os.makedirs(download_dir, exist_ok=True)

# Download the waveform data for the selected patients
for index, row in selected_wave_records.iterrows():
    record_id = row['record_id']
    download_url = row['download_url']

    # Determine if the patient is positive or negative
    if record_id in selected_positive_ids:
        target_subdir = 'positive'
    else:
        target_subdir = 'negative'

    # Prepare the output directory for saving the files (separate for positive and negative)
    output_dir = os.path.join(download_dir, target_subdir, f"p{record_id}")
    os.makedirs(output_dir, exist_ok=True)  # Create sub-directory for each patient

    # Dynamically build the patient-specific download URL
    patient_directory_url = download_url.rstrip('/') + "/"

    print(f"Attempting to download files from: {patient_directory_url}")

    # Use wget to recursively download all files from the patient's subdirectory
    completed = subprocess.run([
        "wget",
        "-r",         # Recursive download
        "-N",         # Only download newer files
        "-c",         # Continue downloading where it left off
        "-np",        # No parent (don't go to parent directories)
        "-nH",        # Disable generation of host directories
        "--cut-dirs=3", # Remove unnecessary parts of the URL structure
        "-P", output_dir,  # Set the output directory for downloaded files
        patient_directory_url  # The URL to the patient's specific subdirectory
    ])

    # Report success only if wget exited cleanly
    if completed.returncode == 0:
        print(f"Downloaded all files to: {output_dir}")
    else:
        print(f"wget exited with code {completed.returncode} for: {patient_directory_url}")

print("Download complete for selected patients.")
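For a pilot study it can help to make the patient selection repeatable, so that a rerun downloads the same files. The sketch below is one way to do that, under the assumption that a fixed seed is acceptable for your design. `sample_ids` is a hypothetical helper and the IDs are made up. It sorts the unique IDs before sampling because set ordering of strings can differ between Python runs, and it caps the sample size so that a small pool does not raise an error.

```python
import random

def sample_ids(ids, n, seed=0):
    """Draw up to n unique IDs reproducibly (hypothetical helper)."""
    unique_ids = sorted(set(ids))          # sorted: stable input order across runs
    rng = random.Random(seed)              # private generator, does not touch global state
    return rng.sample(unique_ids, min(n, len(unique_ids)))

# Made-up pool with a duplicate, for illustration only
pool = ["12345678", "12399999", "12345678"]

picked = sample_ids(pool, n=5, seed=42)    # asks for more than available: capped at 2
picked_again = sample_ids(pool, n=5, seed=42)
print(picked == picked_again, len(picked))
```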

### Optional (Loading Data with WFDB)

In the next example, we will demonstrate how to use WFDB to read these files, select the leads of interest, and assign appropriate ground-truth labels.