# MIMIC-IV Data Exploration

This notebook explores the MIMIC-IV dataset for integration into our COVID-19 detection pipeline. We'll examine the structure of the dataset, identify relevant tables, and prepare them for use in our NER and classification tasks.

In [None]:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from datetime import datetime
import gzip

# Configure display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_colwidth', 100)

# Add project root to path
sys.path.append('..')

## MIMIC-IV Dataset Overview

MIMIC-IV is a large clinical database of de-identified records from patients admitted to Beth Israel Deaconess Medical Center, available at no cost to credentialed PhysioNet users. The dataset includes:

1. **Hospital Data (hosp)**: Contains administrative data, diagnoses, procedures, medications, laboratory tests
2. **ICU Data (icu)**: Contains detailed ICU information including vitals, clinical measurements, and interventions

For our COVID-19 detection project, we're interested in clinical notes, symptoms, diagnoses, and outcomes.

In [None]:
# Define paths to MIMIC-IV data (set the MIMIC_PATH environment variable to override the default)
MIMIC_PATH = os.environ.get('MIMIC_PATH', '/Users/Apexr/physionet.org/files/mimiciv/3.1')
HOSP_PATH = os.path.join(MIMIC_PATH, 'hosp')
ICU_PATH = os.path.join(MIMIC_PATH, 'icu')

# Define project data directory
PROJECT_DATA_DIR = '../data'
PROJECT_MIMIC_DIR = os.path.join(PROJECT_DATA_DIR, 'external/mimic')

# Create directory for organized MIMIC data in our project
os.makedirs(PROJECT_MIMIC_DIR, exist_ok=True)

## Explore Data Dictionary

Let's first examine the patients, admissions, and diagnoses tables to understand their structure.
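Beyond displaying the first few rows, a compact per-column summary can make a table's structure easier to compare across files. The following is a minimal sketch; the helper name `summarize_table` and the stand-in rows are illustrative assumptions, not part of the notebook or of MIMIC-IV.

```python
import pandas as pd


def summarize_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, non-null count, and distinct-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "n_unique": df.nunique(),  # NaN/None values are not counted as a distinct value
    })


# Tiny stand-in for a sampled table (values are made up)
demo = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "gender": ["F", "M", None],
    "anchor_age": [52, 67, 45],
})

summary = summarize_table(demo)
print(summary)
```

The same helper could be applied to any of the sampled DataFrames loaded in the cells that follow.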

In [None]:
def read_gzip_csv(filepath, nrows=None, sample_frac=None):
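    """Read a gzipped CSV, optionally limited to the first `nrows` rows or a random `sample_frac`.

    `sample_frac` is applied only when `nrows` is None. Returns None if the file cannot be read.
    """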
    try:
        df = pd.read_csv(filepath, nrows=nrows, compression='gzip')
        
        if sample_frac is not None and nrows is None:
            df = df.sample(frac=sample_frac, random_state=42)
            
        return df
    except Exception as e:
        print(f"Error reading {filepath}: {e}")
        return None

In [None]:
# Read patients data (sample)
patients_path = os.path.join(HOSP_PATH, 'patients.csv.gz')
patients_df = read_gzip_csv(patients_path, nrows=5)
print("Patients table structure:")
display(patients_df)

In [None]:
# Read admissions data (sample)
admissions_path = os.path.join(HOSP_PATH, 'admissions.csv.gz')
admissions_df = read_gzip_csv(admissions_path, nrows=5)
print("Admissions table structure:")
display(admissions_df)

In [None]:
# Read diagnoses dictionary (codes as strings so ICD-9 codes keep their leading zeros)
d_icd_diagnoses_path = os.path.join(HOSP_PATH, 'd_icd_diagnoses.csv.gz')
d_icd_diagnoses_df = pd.read_csv(d_icd_diagnoses_path, compression='gzip', dtype={'icd_code': str})

# Look for COVID-19 related diagnoses
covid_diagnoses = d_icd_diagnoses_df[d_icd_diagnoses_df['long_title'].str.contains('COVID|coronavirus|SARS', case=False, na=False)]
print("COVID-19 related diagnoses:")
display(covid_diagnoses)

In [None]:
# Read diagnoses data (sample)
diagnoses_path = os.path.join(HOSP_PATH, 'diagnoses_icd.csv.gz')
diagnoses_df = read_gzip_csv(diagnoses_path, nrows=100)
print("Diagnoses table structure:")
display(diagnoses_df)

## Finding Clinical Notes

For our NER component, we need clinical notes. Let's check whether the MIMIC-IV download includes any note files.

In [None]:
# List the files in the hospital directory and keep those whose names mention notes
hospital_files = os.listdir(HOSP_PATH)
note_files = [f for f in hospital_files if 'note' in f.lower()]
print(f"Files containing 'note' in hosp directory: {note_files}")

### Note: Missing Note Events

The core MIMIC-IV 3.1 release (the `hosp` and `icu` modules) does not contain free-text clinical notes. Discharge summaries and radiology reports are published separately as MIMIC-IV-Note, a distinct PhysioNet project that needs its own access request.

For our project, we can:
1. Use the OMR (Online Medical Record) table, which stores outpatient measurements such as blood pressure, weight, and BMI as short text values
2. Use clinical notes from MIMIC-III, if we can obtain access to that dataset
3. Continue with our approach of using synthetic data as well as other real data sources like clinical trials
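To make option 1 concrete, the sketch below turns OMR-style rows into one short pseudo-note per patient and date. It assumes the table is in long format with `subject_id`, `chartdate`, `result_name`, and `result_value` columns; the rows and the helper name `omr_to_text` are made up for illustration, and the real file should be inspected before relying on this layout.

```python
import pandas as pd

# Made-up rows in the long format the OMR table is assumed to use
omr_demo = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "chartdate": ["2180-01-01", "2180-01-01", "2181-05-03"],
    "result_name": ["Blood Pressure", "Weight (Lbs)", "BMI (kg/m2)"],
    "result_value": ["120/80", "150", "27.4"],
})


def omr_to_text(df: pd.DataFrame) -> pd.DataFrame:
    """Join every 'name: value' pair recorded for a patient on one date into one string."""
    lines = df["result_name"].astype(str) + ": " + df["result_value"].astype(str)
    return (
        df.assign(line=lines)
          .groupby(["subject_id", "chartdate"], sort=True)["line"]
          .agg("; ".join)
          .reset_index(name="note_text")
    )


notes = omr_to_text(omr_demo)
print(notes)
```

Text built this way is a list of measurements rather than narrative prose, so it is likely to exercise only a small part of a clinical NER model.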

In [None]:
# Check the OMR (Online Medical Record) table for text content
omr_path = os.path.join(HOSP_PATH, 'omr.csv.gz')
if os.path.exists(omr_path):
    omr_df = read_gzip_csv(omr_path, nrows=5)
    print("OMR table structure:")
    display(omr_df)
else:
    print(f"OMR file not found: {omr_path}")

## Extract Data for our COVID-19 Detection Project

Let's extract the most relevant data for our project:

1. Patient demographics
2. COVID-19 diagnoses
3. Any available text data that could be used for NER
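For the COVID-19 diagnoses step, one detail worth keeping in mind is that the diagnoses table mixes ICD-9 and ICD-10 codes in a single `icd_code` column, distinguished by `icd_version`. The sketch below matches on both columns so that a code is never compared across versions. The rows are hand-written stand-ins rather than real records, and the codes are written without dots on the assumption that the files store them that way.

```python
import pandas as pd

# Stand-in for the diagnosis dictionary (illustrative rows only)
d_icd = pd.DataFrame({
    "icd_code": ["U071", "486", "J189"],
    "icd_version": [10, 9, 10],
    "long_title": [
        "COVID-19",
        "Pneumonia, organism unspecified",
        "Pneumonia, unspecified organism",
    ],
})

# Stand-in for the per-admission diagnoses table
diagnoses = pd.DataFrame({
    "subject_id": [1, 2, 3, 4],
    "hadm_id": [11, 22, 33, 44],
    "icd_code": ["U071", "486", "4019", "J189"],
    "icd_version": [10, 9, 9, 10],
})

# Select dictionary entries by title, then join on code AND version
wanted = d_icd[d_icd["long_title"].str.contains("COVID|pneumonia", case=False, na=False)]
matched = diagnoses.merge(wanted, on=["icd_code", "icd_version"], how="inner")
print(matched[["subject_id", "icd_code", "icd_version", "long_title"]])
```

Joining on the pair also attaches the diagnosis title in the same step, which avoids a second lookup later.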

In [None]:
# 1. Extract patient demographic data
patients_sample = read_gzip_csv(patients_path, sample_frac=0.1)
if patients_sample is not None:
    # Save to project directory
    patients_output_path = os.path.join(PROJECT_MIMIC_DIR, 'patients_sample.csv')
    patients_sample.to_csv(patients_output_path, index=False)
    print(f"Saved patient sample ({len(patients_sample)} records) to {patients_output_path}")

In [None]:
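# 2. Extract admissions data
# Keep every admission of the sampled patients. Sampling admissions independently
# would leave very little overlap with patients_sample when the two are joined later.
admissions_full = read_gzip_csv(admissions_path)
if admissions_full is not None and patients_sample is not None:
    admissions_sample = admissions_full[admissions_full['subject_id'].isin(patients_sample['subject_id'])]
    # Save to project directory
    admissions_output_path = os.path.join(PROJECT_MIMIC_DIR, 'admissions_sample.csv')
    admissions_sample.to_csv(admissions_output_path, index=False)
    print(f"Saved admissions sample ({len(admissions_sample)} records) to {admissions_output_path}")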

In [None]:
# 3. Extract diagnoses and search for possible COVID-19 cases
# We need both the diagnoses and the dictionary
d_icd_output_path = os.path.join(PROJECT_MIMIC_DIR, 'd_icd_diagnoses.csv')
d_icd_diagnoses_df.to_csv(d_icd_output_path, index=False)
print(f"Saved diagnosis dictionary to {d_icd_output_path}")

# Find relevant ICD codes for COVID-19 and respiratory conditions
covid_icd_codes = covid_diagnoses['icd_code'].tolist()
print(f"Found {len(covid_icd_codes)} COVID-19 related ICD codes")

# Also include other respiratory conditions that might be COVID-related
resp_diagnoses = d_icd_diagnoses_df[d_icd_diagnoses_df['long_title'].str.contains('pneumonia|respiratory|breathing', case=False, na=False)]
resp_icd_codes = resp_diagnoses['icd_code'].tolist()
print(f"Found {len(resp_icd_codes)} respiratory condition ICD codes")

# Combine all relevant codes, dropping duplicates (some SARS titles match both searches)
relevant_icd_codes = list(dict.fromkeys(covid_icd_codes + resp_icd_codes))
print(f"Total of {len(relevant_icd_codes)} unique relevant ICD codes")

In [None]:
# Read diagnoses data in batches and extract relevant cases
def extract_relevant_diagnoses(diagnoses_path, icd_codes, output_path, chunk_size=100000):
    """Extract diagnoses matching the specified ICD codes"""
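    relevant_diagnoses = []
    # Compare codes as strings, matching how icd_code is read below
    icd_codes = {str(code) for code in icd_codes}
    
    try:
        # Process file in chunks to handle large file size.
        # Read icd_code as text so numeric-looking ICD-9 codes keep their leading zeros.
        for chunk in pd.read_csv(diagnoses_path, chunksize=chunk_size, compression='gzip', dtype={'icd_code': str}):
            # Filter to relevant diagnoses
            mask = chunk['icd_code'].isin(icd_codes)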
            if mask.any():
                relevant_chunk = chunk[mask]
                relevant_diagnoses.append(relevant_chunk)
                print(f"Found {len(relevant_chunk)} relevant diagnoses in this chunk")
                
        if relevant_diagnoses:
            # Combine all relevant diagnoses
            result_df = pd.concat(relevant_diagnoses, ignore_index=True)
            # Save to file
            result_df.to_csv(output_path, index=False)
            print(f"Saved {len(result_df)} relevant diagnoses to {output_path}")
            return result_df
        else:
            print("No relevant diagnoses found")
            return pd.DataFrame()
            
    except Exception as e:
        print(f"Error processing diagnoses: {e}")
        return pd.DataFrame()

# Extract relevant diagnoses
diagnoses_output_path = os.path.join(PROJECT_MIMIC_DIR, 'relevant_diagnoses.csv')
relevant_diagnoses_df = extract_relevant_diagnoses(diagnoses_path, relevant_icd_codes, diagnoses_output_path)

In [None]:
# 4. Extract OMR data for text analysis
omr_sample_path = os.path.join(PROJECT_MIMIC_DIR, 'omr_sample.csv')

try:
    # Process file in chunks to handle large file size
    chunk_size = 10000
    sample_size = 1000
    
    # Initialize an empty list to store sampled chunks
    sampled_chunks = []
    total_rows = 0
    
    for chunk in pd.read_csv(omr_path, chunksize=chunk_size, compression='gzip'):
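        # Sample each chunk. With these settings the first chunk already yields
        # sample_size rows, so the loop stops there and only the start of the file is sampled.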
        sample_fraction = min(sample_size / chunk.shape[0], 1.0)
        sampled_chunk = chunk.sample(frac=sample_fraction, random_state=42)
        sampled_chunks.append(sampled_chunk)
        
        total_rows += len(sampled_chunk)
        if total_rows >= sample_size:
            break
    
    # Combine all sampled chunks
    omr_sample = pd.concat(sampled_chunks, ignore_index=True)
    # Limit to sample_size if we got more
    if len(omr_sample) > sample_size:
        omr_sample = omr_sample.head(sample_size)
        
    # Save to file
    omr_sample.to_csv(omr_sample_path, index=False)
    print(f"Saved {len(omr_sample)} OMR records to {omr_sample_path}")
    
except Exception as e:
    print(f"Error processing OMR data: {e}")

## Check for Lab Results Related to COVID-19

Let's extract lab test results that might be relevant to COVID-19 patients.
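The lab events file is large, so filtering it is usually done as a stream of chunks rather than in one read. A minimal sketch of a reusable chunked filter follows; the helper name `filter_csv_in_chunks`, the in-memory gzipped CSV, and the item IDs in it are illustrative assumptions standing in for the real file.

```python
import gzip
import io

import pandas as pd


def filter_csv_in_chunks(source, column, wanted, chunksize=100_000, **read_kwargs):
    """Stream a gzipped CSV and keep only rows whose `column` value is in `wanted`."""
    wanted = set(wanted)  # set membership keeps each chunk's lookup fast
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunksize, compression="gzip", **read_kwargs):
        hit = chunk[chunk[column].isin(wanted)]
        if not hit.empty:
            kept.append(hit)
    if not kept:
        return pd.DataFrame()
    return pd.concat(kept, ignore_index=True)


# Build a tiny gzipped CSV in memory to stand in for a lab events file
raw = (
    "subject_id,itemid,valuenum\n"
    "1,50001,1.0\n"
    "2,50002,2.0\n"
    "1,50003,3.0\n"
    "3,50001,4.0\n"
)
buffer = io.BytesIO(gzip.compress(raw.encode("utf-8")))

labs = filter_csv_in_chunks(buffer, "subject_id", [1, 3], chunksize=2)
print(labs)
```

Passing `usecols=[...]` through `read_kwargs` would be one way to cut memory further when only a few columns are needed.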

In [None]:
# Check the lab items dictionary for COVID-19 tests
labitems_path = os.path.join(HOSP_PATH, 'd_labitems.csv.gz')
labitems_df = read_gzip_csv(labitems_path)

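# Look for COVID-19 related tests. The 'antibody' term also matches unrelated
# antibody assays, so the hits need a manual review.
covid_lab_items = labitems_df[labitems_df['label'].str.contains('COVID|coronavirus|SARS|antibody', case=False, na=False)]
print("Candidate COVID-19 related lab tests:")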
display(covid_lab_items)

In [None]:
# Save lab items dictionary
labitems_output_path = os.path.join(PROJECT_MIMIC_DIR, 'd_labitems.csv')
labitems_df.to_csv(labitems_output_path, index=False)
print(f"Saved lab items dictionary to {labitems_output_path}")

In [None]:
# Extract lab events for the patients with relevant diagnoses if we have them
if not relevant_diagnoses_df.empty:
    # Get unique subject_ids from relevant diagnoses
    relevant_subject_ids = relevant_diagnoses_df['subject_id'].unique().tolist()
    print(f"Found {len(relevant_subject_ids)} unique patients with relevant diagnoses")
    
    # Extract lab events for these patients
    labevents_path = os.path.join(HOSP_PATH, 'labevents.csv.gz')
    labevents_output_path = os.path.join(PROJECT_MIMIC_DIR, 'relevant_labevents.csv')
    
    try:
        # Process file in chunks to handle large file size
        relevant_labevents = []
        total_relevant = 0
        
        for chunk in pd.read_csv(labevents_path, chunksize=100000, compression='gzip'):
            # Filter to relevant subject_ids
            mask = chunk['subject_id'].isin(relevant_subject_ids)
            if mask.any():
                relevant_chunk = chunk[mask]
                relevant_labevents.append(relevant_chunk)
                total_relevant += len(relevant_chunk)
                print(f"Found {len(relevant_chunk)} relevant lab events in this chunk. Total so far: {total_relevant}")
                
        if relevant_labevents:
            # Combine all relevant lab events
            result_df = pd.concat(relevant_labevents, ignore_index=True)
            # Save to file
            result_df.to_csv(labevents_output_path, index=False)
            print(f"Saved {len(result_df)} relevant lab events to {labevents_output_path}")
        else:
            print("No relevant lab events found")
            
    except Exception as e:
        print(f"Error processing lab events: {e}")
else:
    print("No relevant diagnoses to match with lab events. Taking a sample instead.")
    # Take a random sample of lab events
    labevents_path = os.path.join(HOSP_PATH, 'labevents.csv.gz')
    labevents_sample_path = os.path.join(PROJECT_MIMIC_DIR, 'labevents_sample.csv')
    
    try:
        # Process file in chunks to handle large file size
        chunk_size = 100000
        sample_size = 1000
        
        # Initialize an empty list to store sampled chunks
        sampled_chunks = []
        total_rows = 0
        
        for chunk in pd.read_csv(labevents_path, chunksize=chunk_size, compression='gzip'):
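            # Sample each chunk. With these settings the first chunk already yields
            # sample_size rows, so the loop stops there and only the start of the file is sampled.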
            sample_fraction = min(sample_size / chunk.shape[0], 1.0)
            sampled_chunk = chunk.sample(frac=sample_fraction, random_state=42)
            sampled_chunks.append(sampled_chunk)
            
            total_rows += len(sampled_chunk)
            if total_rows >= sample_size:
                break
        
        # Combine all sampled chunks
        labevents_sample = pd.concat(sampled_chunks, ignore_index=True)
        # Limit to sample_size if we got more
        if len(labevents_sample) > sample_size:
            labevents_sample = labevents_sample.head(sample_size)
            
        # Save to file
        labevents_sample.to_csv(labevents_sample_path, index=False)
        print(f"Saved {len(labevents_sample)} lab events to {labevents_sample_path}")
    except Exception as e:
        print(f"Error processing lab events sample: {e}")

## ICU Data Extraction

Let's also extract some chart events from the ICU data.
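When a sample should be drawn from the whole file rather than from its first rows, one option is to give every row a random key and keep the rows with the smallest keys across all chunks, which yields a uniform sample without loading the full table. The sketch below illustrates that idea on a synthetic frame; the helper name `sample_across_chunks` and the data are assumptions for illustration, and every chunk of the file still has to be read once.

```python
import numpy as np
import pandas as pd


def sample_across_chunks(chunks, k, seed=42):
    """Uniformly sample k rows from an iterable of DataFrame chunks."""
    rng = np.random.default_rng(seed)
    best = None
    for chunk in chunks:
        # Attach an independent random key to every row of this chunk
        keyed = chunk.assign(_key=rng.random(len(chunk)))
        pool = keyed if best is None else pd.concat([best, keyed], ignore_index=True)
        # Only the k rows with the smallest keys seen so far can end up in the sample
        best = pool.nsmallest(k, "_key")
    if best is None:
        return pd.DataFrame()
    return best.drop(columns="_key").reset_index(drop=True)


# Synthetic stand-in: 1000 rows delivered in chunks of 100
frame = pd.DataFrame({"row": range(1000)})
chunks = (frame.iloc[i:i + 100] for i in range(0, 1000, 100))

sample = sample_across_chunks(chunks, k=50)
print(len(sample), sample["row"].min(), sample["row"].max())
```

With a real file, `chunks` could be the iterator returned by `pd.read_csv(..., chunksize=...)`.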

In [None]:
# Extract a sample of chart events
chartevents_path = os.path.join(ICU_PATH, 'chartevents.csv.gz')
chartevents_sample_path = os.path.join(PROJECT_MIMIC_DIR, 'chartevents_sample.csv')

try:
    # Process file in chunks to handle large file size
    chunk_size = 100000
    sample_size = 1000
    
    # Initialize an empty list to store sampled chunks
    sampled_chunks = []
    total_rows = 0
    
    for chunk in pd.read_csv(chartevents_path, chunksize=chunk_size, compression='gzip'):
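        # Sample each chunk. With these settings the first chunk already yields
        # sample_size rows, so the loop stops there and only the start of the file is sampled.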
        sample_fraction = min(sample_size / chunk.shape[0], 1.0)
        sampled_chunk = chunk.sample(frac=sample_fraction, random_state=42)
        sampled_chunks.append(sampled_chunk)
        
        total_rows += len(sampled_chunk)
        if total_rows >= sample_size:
            break
    
    # Combine all sampled chunks
    chartevents_sample = pd.concat(sampled_chunks, ignore_index=True)
    # Limit to sample_size if we got more
    if len(chartevents_sample) > sample_size:
        chartevents_sample = chartevents_sample.head(sample_size)
        
    # Save to file
    chartevents_sample.to_csv(chartevents_sample_path, index=False)
    print(f"Saved {len(chartevents_sample)} chart events to {chartevents_sample_path}")
except Exception as e:
    print(f"Error processing chart events: {e}")

## Integrating with Our Pipeline

Now let's generate a Python module, `src/mimic_integration.py`, with functions that load the extracted files into our COVID-19 detection pipeline.
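One step in this integration is reshaping long-format lab results into one feature column per test before joining them onto the patient table. A minimal sketch of that reshaping is shown here on made-up values; the item IDs, labels, and numbers are illustrative assumptions.

```python
import pandas as pd

# Long-format lab results: one row per measurement (made-up values)
labs = pd.DataFrame({
    "subject_id": [1, 1, 2, 2],
    "itemid": [9001, 9001, 9001, 9002],
    "label": ["Test A", "Test A", "Test A", "Test B"],
    "valuenum": [1.0, 3.0, 5.0, 7.0],
})
cohort = pd.DataFrame({"subject_id": [1, 2, 3]})

# One row per patient, one column per test, averaging repeated measurements
features = labs.pivot_table(
    index="subject_id", columns="label", values="valuenum", aggfunc="mean"
)
features.columns = [f"test_{name}" for name in features.columns]
features = features.reset_index()

# A left join keeps patients without any lab result; their features are NaN
wide = cohort.merge(features, on="subject_id", how="left")
print(wide)
```

Missing values produced by the left join still need an explicit imputation or masking decision before model training.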

In [None]:
def create_mimic_integration_module():
    """Create a Python module to integrate MIMIC data with our pipeline"""
    module_path = '../src/mimic_integration.py'
    
    # Raw, single-quoted literal so the module's own docstrings and "\n" escapes are written verbatim
    code = r'''"""MIMIC-IV Integration Module

This module provides functions to integrate MIMIC-IV data into our COVID-19 detection pipeline.
It handles loading and preprocessing the data for both NER and classification tasks.
"""

import os
import pandas as pd
import numpy as np
import logging

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Path to the MIMIC data in our project
MIMIC_DIR = os.path.join('data', 'external', 'mimic')

def load_mimic_demographics():
    """Load patient demographic data from MIMIC-IV"""
    patients_path = os.path.join(MIMIC_DIR, 'patients_sample.csv')
    admissions_path = os.path.join(MIMIC_DIR, 'admissions_sample.csv')
    
    try:
        patients_df = pd.read_csv(patients_path)
        logger.info(f"Loaded {len(patients_df)} patient records")
        
        admissions_df = pd.read_csv(admissions_path)
        logger.info(f"Loaded {len(admissions_df)} admission records")
        
        # Merge to get complete demographic information
        demographics = pd.merge(patients_df, admissions_df, on='subject_id', how='inner')
        logger.info(f"Combined demographics dataset contains {len(demographics)} records")
        
        return demographics
    except Exception as e:
        logger.error(f"Error loading demographic data: {e}")
        return pd.DataFrame()

def load_mimic_diagnoses():
    """Load diagnoses data from MIMIC-IV"""
    diagnoses_path = os.path.join(MIMIC_DIR, 'relevant_diagnoses.csv')
    d_icd_path = os.path.join(MIMIC_DIR, 'd_icd_diagnoses.csv')
    
    try:
        # Check if we have the relevant diagnoses file
        if os.path.exists(diagnoses_path):
            # Read codes as strings so ICD-9 codes keep their leading zeros
            diagnoses_df = pd.read_csv(diagnoses_path, dtype={'icd_code': str})
        else:
            logger.warning(f"Relevant diagnoses file not found: {diagnoses_path}")
            return pd.DataFrame()
        
        # Load the ICD codes dictionary
        d_icd_df = pd.read_csv(d_icd_path, dtype={'icd_code': str})
        
        # Merge on code and version, since ICD-9 and ICD-10 codes share the icd_code column
        merged_df = pd.merge(diagnoses_df, d_icd_df, on=['icd_code', 'icd_version'], how='left')
        logger.info(f"Loaded and merged {len(merged_df)} diagnoses with descriptions")
        
        return merged_df
    except Exception as e:
        logger.error(f"Error loading diagnoses data: {e}")
        return pd.DataFrame()

def load_mimic_lab_results():
    """Load lab results data from MIMIC-IV"""
    lab_results_path = os.path.join(MIMIC_DIR, 'relevant_labevents.csv')
    lab_sample_path = os.path.join(MIMIC_DIR, 'labevents_sample.csv')
    d_labitems_path = os.path.join(MIMIC_DIR, 'd_labitems.csv')
    
    try:
        # Check if we have the relevant lab events file
        if os.path.exists(lab_results_path):
            lab_df = pd.read_csv(lab_results_path)
            logger.info(f"Loaded {len(lab_df)} relevant lab events")
        elif os.path.exists(lab_sample_path):
            lab_df = pd.read_csv(lab_sample_path)
            logger.info(f"Loaded {len(lab_df)} sample lab events")
        else:
            logger.warning("No lab events file found")
            return pd.DataFrame()
        
        # Load the lab items dictionary
        if os.path.exists(d_labitems_path):
            d_labitems_df = pd.read_csv(d_labitems_path)
            
            # Merge to get lab test descriptions
            merged_df = pd.merge(lab_df, d_labitems_df, on='itemid', how='left')
            logger.info(f"Merged {len(merged_df)} lab events with descriptions")
            
            return merged_df
        else:
            logger.warning(f"Lab items dictionary not found: {d_labitems_path}")
            return lab_df
    except Exception as e:
        logger.error(f"Error loading lab results data: {e}")
        return pd.DataFrame()

def load_mimic_text_data():
    """Load text data from MIMIC-IV for NER"""
    omr_path = os.path.join(MIMIC_DIR, 'omr_sample.csv')
    
    try:
        if os.path.exists(omr_path):
            omr_df = pd.read_csv(omr_path)
            logger.info(f"Loaded {len(omr_df)} OMR records for text analysis")
            
            # Keep the identifier columns that are present (OMR rows may lack hadm_id)
            id_columns = [col for col in ('subject_id', 'hadm_id') if col in omr_df.columns]
            # Extract text columns for NER
            text_columns = [col for col in omr_df.columns
                            if omr_df[col].dtype == 'object' and col not in id_columns]
            text_df = omr_df[id_columns + text_columns]
            
            return text_df
        else:
            logger.warning(f"OMR data not found: {omr_path}")
            return pd.DataFrame()
    except Exception as e:
        logger.error(f"Error loading text data: {e}")
        return pd.DataFrame()

def get_sample_clinical_notes():
    """Get sample clinical notes for NER from MIMIC-IV"""
    text_df = load_mimic_text_data()
    
    if not text_df.empty:
        # Extract text columns and combine them
        text_columns = [col for col in text_df.columns if col not in ['subject_id', 'hadm_id']]
        
        # Combine all text columns into a single note for each row
        notes = []
        for _, row in text_df.iterrows():
            note_parts = []
            for col in text_columns:
                if isinstance(row[col], str) and len(row[col].strip()) > 0:
                    note_parts.append(f"{col}: {row[col]}")
            
            if note_parts:
                note = "\n".join(note_parts)
                notes.append({
                    'subject_id': row['subject_id'],
                    # None when the source table has no hadm_id column
                    'hadm_id': row.get('hadm_id'),
                    'note_text': note
                })
        
        if notes:
            notes_df = pd.DataFrame(notes)
            logger.info(f"Created {len(notes_df)} clinical notes for NER")
            return notes_df
    
    logger.warning("No suitable clinical notes found in MIMIC-IV data")
    return pd.DataFrame()

def prepare_mimic_for_classification():
    """Prepare MIMIC-IV data for the classification pipeline"""
    # Load all the necessary components
    demographics = load_mimic_demographics()
    diagnoses = load_mimic_diagnoses()
    lab_results = load_mimic_lab_results()
    
    if demographics.empty:
        logger.warning("Cannot prepare classification data without demographics")
        return pd.DataFrame()
    
    try:
        # Start with demographics as our base
        classification_data = demographics.copy()
        
        # Add a COVID flag if we have diagnoses
        if not diagnoses.empty:
            # Find COVID-specific ICD codes
            covid_diagnoses = diagnoses[diagnoses['long_title'].str.contains('COVID|coronavirus|SARS-CoV', case=False, na=False)]
            covid_patients = covid_diagnoses['subject_id'].unique().tolist()
            
            # Add COVID flag to classification data
            classification_data['covid_diagnosis'] = classification_data['subject_id'].isin(covid_patients)
            
            logger.info(f"Identified {len(covid_patients)} patients with COVID-19 diagnoses")
        
        # Add lab result features if available
        if not lab_results.empty:
            # Find COVID-specific lab tests
            covid_tests = lab_results[lab_results['label'].str.contains('COVID|coronavirus|SARS', case=False, na=False)]
            
            if not covid_tests.empty:
                # Create a pivot table of COVID test results
                covid_test_pivot = covid_tests.pivot_table(
                    index='subject_id',
                    columns='itemid',
                    values='valuenum',
                    aggfunc='mean'
                )
                
                # Rename columns to test names
                itemid_to_label = dict(zip(covid_tests['itemid'], covid_tests['label']))
                covid_test_pivot.columns = [f"test_{itemid_to_label.get(col, col)}" for col in covid_test_pivot.columns]
                
                # Reset index to make subject_id a column again
                covid_test_pivot.reset_index(inplace=True)
                
                # Merge with classification data
                classification_data = pd.merge(classification_data, covid_test_pivot, on='subject_id', how='left')
                
                logger.info(f"Added {len(covid_test_pivot.columns)-1} COVID test features to classification data")
        
        # Save the prepared data
        output_path = os.path.join(MIMIC_DIR, 'classification_data.csv')
        classification_data.to_csv(output_path, index=False)
        logger.info(f"Saved classification data with {len(classification_data)} rows and {len(classification_data.columns)} features to {output_path}")
        
        return classification_data
    except Exception as e:
        logger.error(f"Error preparing classification data: {e}")
        return pd.DataFrame()

if __name__ == "__main__":
    # Test loading functions
    demographics = load_mimic_demographics()
    diagnoses = load_mimic_diagnoses()
    lab_results = load_mimic_lab_results()
    text_data = load_mimic_text_data()
    
    # Prepare data for our pipeline
    classification_data = prepare_mimic_for_classification()
    notes = get_sample_clinical_notes()
    
    print("\nData Summary:")
    print(f"Demographics: {len(demographics)} records")
    print(f"Diagnoses: {len(diagnoses)} records")
    print(f"Lab Results: {len(lab_results)} records")
    print(f"Text Data: {len(text_data)} records")
    print(f"Classification Data: {len(classification_data)} records with {len(classification_data.columns)} features")
    print(f"Clinical Notes: {len(notes)} notes")
'''
    
    with open(module_path, 'w') as f:
        f.write(code)
    
    print(f"Created MIMIC integration module at {module_path}")

# Create the integration module
create_mimic_integration_module()

## Summary

We've accomplished the following:

1. Explored the structure of the MIMIC-IV dataset
2. Identified relevant tables for our COVID-19 detection project
3. Extracted and sampled key data including:
   - Patient demographics
   - Diagnoses (especially respiratory and possible COVID-19 cases)
   - Lab results
   - Some text data from the OMR table
4. Created a Python module to integrate this data with our pipeline

### Notes on the Data

1. MIMIC-IV does not include full clinical notes in the standard release. These are part of a separate module (MIMIC-IV-Note) that may require additional access.
2. For NER tasks, we'll need to rely on the limited text data available in tables like OMR, or continue using our synthetic data approach.
3. For the classification tasks, we have good structured data including demographics, diagnoses, and lab results.

### Next Steps

1. Use the integrated MIMIC data in our NER pipeline where possible
2. Build and train our classification model using the structured data
3. Consider applying for access to MIMIC-IV-Note for more comprehensive clinical notes in the future