# MIMIC-IV Clinical Notes Exploration

This notebook explores the structure and content of MIMIC-IV clinical notes to inform model selection for text processing. We analyze:
- **Discharge summaries** and **Radiology reports**
- Text length distributions, to decide whether standard BERT (512 tokens) is sufficient or a long-context model such as Longformer (4,096 tokens) or ModernBERT (8,192 tokens) is needed
- Linkage between text files and metadata (_detail) files

**Context**: Preparing unstructured clinical text for an RL agent (P-CAFE).

## Section 1: Setup & Imports

In [1]:
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style for better readability
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 10

print("Libraries imported successfully!")

Libraries imported successfully!


## Section 2: Define File Paths

Update these paths to match your local data location. The files come from the MIMIC-IV-Note dataset (its `note` directory).

In [5]:
# Define file paths - Update these to match your data location
# Standard MIMIC-IV-Note layout: physionet.org/files/mimic-iv-note/2.2/note/

# Adjust this base path to your MIMIC-IV data directory
base_path = 'C:\\Users\\Eli\\Data\\physionet.org\\files\\mimic-iv-note\\2.2\\Unzipped\\'

# Clinical text files
discharge_file = base_path + 'discharge.csv'
radiology_file = base_path + 'radiology.csv'

# Metadata files
discharge_detail_file = base_path + 'discharge_detail.csv'
radiology_detail_file = base_path + 'radiology_detail.csv'

print("File paths defined.")
print(f"Discharge: {discharge_file}")
print(f"Discharge Detail: {discharge_detail_file}")
print(f"Radiology: {radiology_file}")
print(f"Radiology Detail: {radiology_detail_file}")

File paths defined.
Discharge: C:\Users\Eli\Data\physionet.org\files\mimic-iv-note\2.2\Unzipped\discharge.csv
Discharge Detail: C:\Users\Eli\Data\physionet.org\files\mimic-iv-note\2.2\Unzipped\discharge_detail.csv
Radiology: C:\Users\Eli\Data\physionet.org\files\mimic-iv-note\2.2\Unzipped\radiology.csv
Radiology Detail: C:\Users\Eli\Data\physionet.org\files\mimic-iv-note\2.2\Unzipped\radiology_detail.csv
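
Concatenating strings with escaped backslashes works on Windows but is easy to break on other platforms. The following is a minimal sketch of the same path setup using `pathlib`; the `note_paths` helper and the placeholder directory are hypothetical and not part of the original notebook.

```python
from pathlib import Path

def note_paths(base_dir, suffix=".csv"):
    """Return a dict mapping table name -> Path under base_dir (hypothetical helper)."""
    base = Path(base_dir)
    tables = ["discharge", "discharge_detail", "radiology", "radiology_detail"]
    return {name: base / f"{name}{suffix}" for name in tables}

# Placeholder directory: replace with your own MIMIC-IV-Note location.
paths = note_paths("data/mimic-iv-note", suffix=".csv.gz")
for name, path in paths.items():
    status = "found" if path.exists() else "missing"
    print(f"{name}: {path} ({status})")
```

Checking `path.exists()` up front surfaces a wrong base directory before any long-running analysis starts.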


## Part 1: Full Dataset Statistics & Analysis

This section analyzes the complete dataset using chunking to handle large files efficiently.
We calculate statistics for:
- Total number of records
- Number of unique `subject_id` and `hadm_id`
- Word count distributions for all notes

**Note**: We use `pd.read_csv(..., chunksize=10000)` to iterate over each file in pieces of 10,000 rows, so the full file is never held in memory at once.
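
To illustrate the chunking pattern in isolation, here is a small sketch that reads a synthetic in-memory CSV two rows at a time and accumulates the same kinds of statistics. The data is made up for illustration and the column names simply mirror those used in this notebook.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for a notes file (synthetic data).
csv_text = """note_id,subject_id,hadm_id,text
n1,1,10,first note here
n2,1,10,second
n3,2,20,a third short note
n4,3,,note without admission
n5,3,30,last one
"""

total_rows = 0
subjects = set()
word_counts = []

# With chunksize=2, read_csv yields DataFrames of at most 2 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)
    subjects.update(chunk["subject_id"].dropna().unique())
    # Whitespace-separated word count per note
    word_counts.extend(chunk["text"].fillna("").str.split().str.len().tolist())

print(total_rows, len(subjects), word_counts)
```

Only the running totals and the integer word counts persist between iterations; each chunk's text is released once the loop moves on.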

In [None]:
# Re-import so this part can run independently of Section 1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 10

print("Starting full dataset analysis with chunking...\n")

In [None]:
def analyze_file_with_chunks(filepath, file_label, chunksize=10000):
    """
    Analyze a large CSV file using chunking.
    
    Args:
        filepath: Path to the CSV file (can be .csv or .csv.gz)
        file_label: Label for printing (e.g., 'Discharge' or 'Radiology')
        chunksize: Number of rows to read per chunk
    
    Returns:
        dict with statistics: total_records, unique_subjects, unique_hadms, word_counts
    """
    print(f"\n{'='*60}")
    print(f"Analyzing {file_label}: {filepath}")
    print(f"{'='*60}")
    
    total_records = 0
    chunk_num = 0  # Remains 0 if the file yields no chunks
    unique_subjects = set()
    unique_hadms = set()
    word_counts = []  # Store only counts (integers) to save memory
    
    try:
        # Read file in chunks; pandas infers gzip compression from a .gz extension
        chunks = pd.read_csv(filepath, chunksize=chunksize)
        
        for chunk_num, chunk in enumerate(chunks, 1):
            # Update total records
            total_records += len(chunk)
            
            # Update unique IDs
            if 'subject_id' in chunk.columns:
                unique_subjects.update(chunk['subject_id'].dropna().unique())
            if 'hadm_id' in chunk.columns:
                unique_hadms.update(chunk['hadm_id'].dropna().unique())
            
            # Calculate word counts for text column
            if 'text' in chunk.columns:
                # Calculate word count for each note and store only the count
                chunk_word_counts = chunk['text'].fillna('').apply(lambda x: len(str(x).split())).tolist()
                word_counts.extend(chunk_word_counts)
            
            # Progress update every 10 chunks
            if chunk_num % 10 == 0:
                print(f"  Processed {chunk_num} chunks ({total_records:,} records)...")
        
        print(f"\nCompleted! Processed {chunk_num} chunks total.")
        
    except FileNotFoundError:
        print(f"ERROR: File not found: {filepath}")
        print("Please update the base_path variable to point to your MIMIC-IV data directory.")
        return None
    except Exception as e:
        print(f"ERROR analyzing file: {e}")
        return None
    
    # Calculate statistics
    stats = {
        'total_records': total_records,
        'unique_subjects': len(unique_subjects),
        'unique_hadms': len(unique_hadms),
        'word_counts': word_counts
    }
    
    # Print results
    print(f"\n{file_label} Statistics:")
    print(f"  Total records: {stats['total_records']:,}")
    print(f"  Unique subject_id: {stats['unique_subjects']:,}")
    print(f"  Unique hadm_id: {stats['unique_hadms']:,}")
    
    if word_counts:
        word_counts_array = np.array(word_counts)
        print(f"\n  Word Count Statistics:")
        print(f"    Mean: {word_counts_array.mean():.2f}")
        print(f"    Median: {np.median(word_counts_array):.2f}")
        print(f"    Max: {word_counts_array.max():,}")
        print(f"    95th Percentile: {np.percentile(word_counts_array, 95):.2f}")
    
    return stats

print("Function defined: analyze_file_with_chunks()")

In [None]:
# Update file paths to use .csv.gz format
# Adjust the base_path to match your MIMIC-IV data location
base_path = 'C:\\Users\\Eli\\Data\\physionet.org\\files\\mimic-iv-note\\2.2\\Unzipped\\'

# Clinical text files (now using .csv.gz format)
discharge_file = base_path + 'discharge.csv.gz'
radiology_file = base_path + 'radiology.csv.gz'

# Metadata files
discharge_detail_file = base_path + 'discharge_detail.csv.gz'
radiology_detail_file = base_path + 'radiology_detail.csv.gz'

print("File paths configured for .csv.gz format:")
print(f"  Discharge: {discharge_file}")
print(f"  Discharge Detail: {discharge_detail_file}")
print(f"  Radiology: {radiology_file}")
print(f"  Radiology Detail: {radiology_detail_file}")

In [None]:
# Analyze discharge.csv.gz
discharge_stats = analyze_file_with_chunks(discharge_file, 'Discharge Notes', chunksize=10000)

In [None]:
# Analyze discharge_detail.csv.gz
discharge_detail_stats = analyze_file_with_chunks(discharge_detail_file, 'Discharge Detail', chunksize=10000)

In [None]:
# Analyze radiology.csv.gz
radiology_stats = analyze_file_with_chunks(radiology_file, 'Radiology Notes', chunksize=10000)

In [None]:
# Analyze radiology_detail.csv.gz
radiology_detail_stats = analyze_file_with_chunks(radiology_detail_file, 'Radiology Detail', chunksize=10000)
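
The counts above describe each file separately; they do not show how the text and `_detail` files link together. Below is a hedged sketch of one way to check that linkage by joining on `note_id`. The frames are synthetic, and the detail columns (`field_name`, `field_value`) are an assumption about the schema that should be verified against the actual files.

```python
import pandas as pd

# Synthetic stand-ins for a notes table and its detail table.
notes = pd.DataFrame({
    "note_id": ["a-DS-1", "b-DS-2", "c-DS-3"],
    "subject_id": [1, 2, 3],
    "text": ["note one", "note two", "note three"],
})
detail = pd.DataFrame({
    "note_id": ["a-DS-1", "a-DS-1", "c-DS-3", "z-DS-9"],
    "field_name": ["author", "exam_code", "author", "author"],  # assumed column
    "field_value": ["x", "y", "z", "w"],                        # assumed column
})

# Outer join with indicator=True labels each note_id as
# 'both', 'left_only' (note without detail) or 'right_only' (detail without note).
linked = notes[["note_id"]].merge(
    detail[["note_id"]].drop_duplicates(),
    on="note_id",
    how="outer",
    indicator=True,
)
coverage = linked["_merge"].value_counts().to_dict()
print(coverage)
```

On the real data, a large `left_only` or `right_only` count would indicate that the two files cannot be assumed to align one-to-one.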

In [None]:
# Visualization: Histogram of Word Count Distribution (Discharge vs Radiology)
print("\nCreating word count distribution histogram...")

if discharge_stats and radiology_stats and discharge_stats['word_counts'] and radiology_stats['word_counts']:
    fig, ax = plt.subplots(1, 1, figsize=(14, 6))
    
    # Plot histograms
    ax.hist(discharge_stats['word_counts'], bins=100, alpha=0.6, label='Discharge', color='steelblue', edgecolor='black')
    ax.hist(radiology_stats['word_counts'], bins=100, alpha=0.6, label='Radiology', color='coral', edgecolor='black')
    
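    # Add reference line at 512. BERT's limit is 512 tokens, not words,
    # so word counts understate the tokenized length of a note.
    ax.axvline(x=512, color='red', linestyle='--', linewidth=2, label='512 words (BERT limit is 512 tokens)')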
    
    # Labels and formatting
    ax.set_xlabel('Word Count', fontsize=12)
    ax.set_ylabel('Frequency', fontsize=12)
    ax.set_title('Word Count Distribution: Discharge vs Radiology Notes (Full Dataset)', fontsize=14, fontweight='bold')
    ax.legend(fontsize=10)
    ax.grid(True, alpha=0.3)
    
    # Use log scale if distribution is highly skewed
    # Uncomment the next line if needed:
    # ax.set_yscale('log')
    
    plt.tight_layout()
    plt.show()
    
    print("\nHistogram created successfully!")
else:
    print("\nSkipping histogram: Data not available or file analysis failed.")
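
The histogram counts whitespace-separated words, whereas model limits are expressed in tokens, and subword tokenizers generally produce more tokens than words. The sketch below estimates the share of notes that would exceed a given context window under an assumed tokens-per-word ratio. The ratio of 1.3 and the `share_exceeding` helper are illustrative assumptions, not values measured on MIMIC-IV.

```python
import numpy as np

def share_exceeding(word_counts, token_limit, tokens_per_word=1.3):
    """Fraction of notes whose estimated token count exceeds token_limit.

    tokens_per_word is an assumed ratio for illustration; measure it with the
    actual tokenizer on a sample before relying on the result.
    """
    counts = np.asarray(word_counts, dtype=float)
    if counts.size == 0:
        return 0.0
    return float(np.mean(counts * tokens_per_word > token_limit))

# Synthetic word counts, not real MIMIC-IV statistics.
example_counts = [100, 400, 500, 2000, 7000]
for limit in (512, 8192):
    print(limit, share_exceeding(example_counts, limit))
```

With the real data, the same call could be made on `discharge_stats['word_counts']` and `radiology_stats['word_counts']` once the analysis cells have run.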

## Part 2: Sample Embeddings with Clinical-ModernBERT

This section generates embeddings for a small sample of notes using the Clinical-ModernBERT model.

**Key points:**
- Load **all records** from discharge.csv.gz and radiology.csv.gz
- Generate embeddings for only the **first 50 records** from each dataset
- Use **mean pooling** over `last_hidden_state` (not `pooler_output`)
- Save results as parquet files

In [None]:
# Import required libraries for embeddings
import torch
from transformers import AutoTokenizer, AutoModel
import numpy as np

# Setup device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Load the Clinical ModernBERT model and tokenizer
model_name = "Simonlee711/Clinical_ModernBERT"
print(f"\nLoading model: {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Move model to device and set to evaluation mode
model = model.to(device)
model.eval()

print("Model loaded successfully!")

In [None]:
def mean_pooling(model_output, attention_mask):
    """
    Perform mean pooling on the last_hidden_state.
    
    Logic: sum(token_embeddings * attention_mask) / sum(attention_mask)
    
    Args:
        model_output: Output from the model (contains last_hidden_state)
        attention_mask: Attention mask tensor
    
    Returns:
        Mean-pooled embeddings
    """
    # Extract last_hidden_state from model output
    token_embeddings = model_output.last_hidden_state  # Shape: (batch_size, seq_len, hidden_size)
    
    # Expand attention_mask to match token_embeddings dimensions
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    
    # Sum token embeddings weighted by attention mask
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, dim=1)
    
    # Sum attention mask
    sum_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)
    
    # Calculate mean
    mean_embeddings = sum_embeddings / sum_mask
    
    return mean_embeddings

def get_embeddings_with_mean_pooling(texts, batch_size=8):
    """
    Generate embeddings for a list of texts using mean pooling.
    
    Args:
        texts: List of text strings
        batch_size: Batch size for processing
    
    Returns:
        numpy array of embeddings
    """
    all_embeddings = []
    
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i + batch_size]
        
        # Tokenize
        encoded_input = tokenizer(batch_texts, padding=True, truncation=True, 
                                 max_length=8192, return_tensors='pt')
        encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
        
        # Generate embeddings (no gradient calculation needed)
        with torch.no_grad():
            model_output = model(**encoded_input)
        
        # Apply mean pooling
        embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
        
        # Move to CPU and convert to numpy
        all_embeddings.append(embeddings.cpu().numpy())
    
    # Concatenate all batches
    return np.vstack(all_embeddings)

print("Embedding functions defined:")
print("  - mean_pooling()")
print("  - get_embeddings_with_mean_pooling()")
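
To see why the attention mask matters in mean pooling, the following toy example works through the same arithmetic, `sum(token_embeddings * attention_mask) / sum(attention_mask)`, with NumPy standing in for torch. The values are invented for illustration.

```python
import numpy as np

# One sequence of 3 positions with hidden size 2; the last position is padding.
token_embeddings = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])  # (1, 3, 2)
attention_mask = np.array([[1, 1, 0]])                                    # (1, 3)

mask = attention_mask[..., None].astype(float)   # (1, 3, 1), broadcasts over hidden dim
summed = (token_embeddings * mask).sum(axis=1)   # padding contributes zero
denom = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid division by zero
masked_mean = summed / denom

naive_mean = token_embeddings.mean(axis=1)       # wrongly includes the padding position
print(masked_mean, naive_mean)
```

The masked mean averages only the two real tokens, while the naive mean is pulled toward the padding vector. This is the distortion that the mask-weighted sum in `mean_pooling()` avoids when batches contain notes of different lengths.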

In [None]:
# Load all records from discharge.csv.gz
print("Loading all records from discharge.csv.gz...")
discharge_sample = pd.read_csv(discharge_file, compression='gzip')
print(f"Loaded {len(discharge_sample)} discharge notes")
print(f"Columns: {list(discharge_sample.columns)}")

# Load all records from radiology.csv.gz
print("\nLoading all records from radiology.csv.gz...")
radiology_sample = pd.read_csv(radiology_file, compression='gzip')
print(f"Loaded {len(radiology_sample)} radiology notes")
print(f"Columns: {list(radiology_sample.columns)}")

In [None]:
# Generate embeddings for first 50 discharge records
print("\nGenerating embeddings for first 50 discharge records...")
discharge_texts = discharge_sample['text'].head(50).fillna('').tolist()
discharge_embeddings = get_embeddings_with_mean_pooling(discharge_texts, batch_size=8)
print(f"Generated {len(discharge_embeddings)} embeddings with shape: {discharge_embeddings.shape}")

# Create DataFrame with required columns
discharge_sample_emb = pd.DataFrame({
    'note_id': discharge_sample['note_id'].head(50).values,
    'subject_id': discharge_sample['subject_id'].head(50).values,
    'hadm_id': discharge_sample['hadm_id'].head(50).values,
    'note_embedding': list(discharge_embeddings)
})

print(f"\nCreated discharge_sample_emb DataFrame:")
print(f"  Shape: {discharge_sample_emb.shape}")
print(f"  Columns: {list(discharge_sample_emb.columns)}")
print(f"  Embedding dimension: {len(discharge_sample_emb['note_embedding'].iloc[0])}")

# Save to parquet
output_file = 'discharge_50_embeddings.parquet'
discharge_sample_emb.to_parquet(output_file, index=False)
print(f"\nSaved to: {output_file}")

In [None]:
# Generate embeddings for first 50 radiology records
print("\nGenerating embeddings for first 50 radiology records...")
radiology_texts = radiology_sample['text'].head(50).fillna('').tolist()
radiology_embeddings = get_embeddings_with_mean_pooling(radiology_texts, batch_size=8)
print(f"Generated {len(radiology_embeddings)} embeddings with shape: {radiology_embeddings.shape}")

# Create DataFrame with required columns
radiology_sample_emb = pd.DataFrame({
    'note_id': radiology_sample['note_id'].head(50).values,
    'subject_id': radiology_sample['subject_id'].head(50).values,
    'hadm_id': radiology_sample['hadm_id'].head(50).values,
    'note_embedding': list(radiology_embeddings)
})

print(f"\nCreated radiology_sample_emb DataFrame:")
print(f"  Shape: {radiology_sample_emb.shape}")
print(f"  Columns: {list(radiology_sample_emb.columns)}")
print(f"  Embedding dimension: {len(radiology_sample_emb['note_embedding'].iloc[0])}")

# Save to parquet
output_file = 'radiology_50_embeddings.parquet'
radiology_sample_emb.to_parquet(output_file, index=False)
print(f"\nSaved to: {output_file}")

## Summary

This notebook has completed:

### Part 1: Full Dataset Analysis
- Analyzed complete datasets using chunking (10,000 rows per chunk)
- Calculated total records, unique IDs, and word count statistics
- Generated histogram comparing discharge and radiology note lengths

### Part 2: Sample Embeddings
- Loaded all records from discharge and radiology datasets
- Generated embeddings for 50 discharge notes and 50 radiology notes
- Used Clinical-ModernBERT with **mean pooling** strategy
- Saved embeddings to parquet files:
  - `discharge_50_embeddings.parquet`
  - `radiology_50_embeddings.parquet`

### Next Steps
- Review word count statistics to determine optimal model architecture
- Use the sample embeddings for downstream tasks (e.g., similarity analysis, clustering)
- Scale up embedding generation if needed using the same chunking approach
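
As a starting point for the similarity analysis mentioned above, here is a minimal sketch of pairwise cosine similarity over an embedding matrix. The `cosine_similarity_matrix` helper is hypothetical and the vectors are toy values rather than real note embeddings.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity for an (n, d) array of embeddings."""
    emb = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

# Toy vectors standing in for note embeddings.
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sim = cosine_similarity_matrix(toy)
print(np.round(sim, 3))
```

For the saved samples, the `note_embedding` column could be stacked with `np.vstack(...)` into an `(n, d)` array and passed to the same function.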