# Chart Events Feature Extraction for MIMIC-IV

This notebook builds a feature list from the ICU chartevents table, the largest table in MIMIC-IV. For each unique item (feature), it records one sample row (value, valuenum, valueuom) so that we can later decide how to embed that item: normalization for numeric items or BioBERT for text items.

Since chartevents is too large to load into RAM at once, we use a chunk processing strategy.

## Setup: Import Libraries

In [1]:
import pandas as pd
import os

## Step 1: Define File Paths

Set the paths to the input and output files. The input paths below point to a local Windows copy of MIMIC-IV 3.1; update them to match your data location.

In [9]:
# Define file paths - update these to match your data location
chartevents_file_path = 'C:\\Users\\Eli\\Data\\physionet.org\\files\\mimiciv\\3.1\\icu\\chartevents.csv.gz'  # Large file containing ICU chart events
d_items_file_path = 'C:\\Users\\Eli\\Data\\physionet.org\\files\\mimiciv\\3.1\\icu\\d_items.csv.gz'  # Dictionary file with item metadata
output_file_path = '../data/features/features_chartevents.csv'  # Output file for features
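A possible way to avoid hard-coded absolute paths is sketched below. It assumes the standard `icu/` subfolder seen in the paths above; the `build_paths` helper and the `MIMIC_IV_DIR` environment variable are hypothetical names introduced only for illustration.

```python
import os
from pathlib import Path


def build_paths(base_dir):
    """Return the two input paths used in this notebook, rooted at base_dir."""
    icu_dir = Path(base_dir) / "icu"
    return {
        "chartevents": icu_dir / "chartevents.csv.gz",
        "d_items": icu_dir / "d_items.csv.gz",
    }


# MIMIC_IV_DIR is a hypothetical environment variable; "." is only a fallback
paths = build_paths(os.environ.get("MIMIC_IV_DIR", "."))

# Report missing inputs early instead of failing midway through the chunk loop
missing = [name for name, path in paths.items() if not path.exists()]
print("Missing input files:", missing)
```

Checking for the files up front is cheap compared with discovering a bad path after the long chunked read has started.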

## Step 2: Extract Unique Items with Sample Data (Chunk Processing)

We read chartevents in chunks of 1,000,000 rows and, within each chunk, keep the first row seen for each itemid as its sample. Because the sample is simply the first occurrence, its value can be missing (NaN).

In [10]:
print("Reading chartevents file in chunks...")

# Initialize an empty list to hold unique features from each chunk
unique_features_list = []

# Read chartevents in chunks of 1,000,000 rows at a time
chunk_number = 0
for chunk in pd.read_csv(
    chartevents_file_path,
    usecols=['itemid', 'value', 'valuenum', 'valueuom', 'warning'],
    chunksize=1000000
):
    chunk_number += 1
    print(f"Processing chunk {chunk_number}...")
    
    # Drop duplicate itemids within this chunk, keeping first occurrence as a sample
    chunk_unique = chunk.drop_duplicates(subset=['itemid'], keep='first')
    
    # Append to our list
    unique_features_list.append(chunk_unique)
    
    print(f"  Chunk {chunk_number}: {len(chunk_unique)} unique items extracted")

print(f"\nTotal chunks processed: {chunk_number}")

Reading chartevents file in chunks...
Processing chunk 1...
  Chunk 1: 1448 unique items extracted
Processing chunk 2...
  Chunk 2: 1512 unique items extracted
Processing chunk 3...
  Chunk 3: 1517 unique items extracted
Processing chunk 4...
  Chunk 4: 1544 unique items extracted
Processing chunk 5...
  Chunk 5: 1414 unique items extracted
Processing chunk 6...
  Chunk 6: 1514 unique items extracted
Processing chunk 7...
  Chunk 7: 1396 unique items extracted
Processing chunk 8...
  Chunk 8: 1439 unique items extracted
Processing chunk 9...
  Chunk 9: 1509 unique items extracted
Processing chunk 10...
  Chunk 10: 1534 unique items extracted
Processing chunk 11...
  Chunk 11: 1448 unique items extracted
Processing chunk 12...
  Chunk 12: 1490 unique items extracted
Processing chunk 13...
  Chunk 13: 1438 unique items extracted
Processing chunk 14...
  Chunk 14: 1540 unique items extracted
Processing chunk 15...
  Chunk 15: 1490 unique items extracted
Processing chunk 16...
  Chunk 16: 
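Keeping the first occurrence per itemid means a sample can be a row with no value at all. The sketch below shows one possible variant, not the method used in this notebook: it keeps a single running table with one row per itemid and prefers the first row whose `value` is non-null. The toy CSV rows and the `first_samples` helper are made up for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for chartevents.csv.gz (made-up rows, not real MIMIC data)
toy_csv = io.StringIO(
    "itemid,value,valuenum,valueuom\n"
    "1,,,\n"
    "1,72,72.0,bpm\n"
    "2,Oral,,\n"
    "1,80,80.0,bpm\n"
    "3,,,\n"
)


def first_samples(source, chunksize):
    """Keep one row per itemid, preferring the first row with a non-null value."""
    samples = None
    # Reading value/valueuom as strings keeps dtypes consistent across chunks
    reader = pd.read_csv(
        source, chunksize=chunksize, dtype={"value": str, "valueuom": str}
    )
    for chunk in reader:
        if samples is None:
            combined = chunk
        else:
            combined = pd.concat([samples, chunk], ignore_index=True)
        # First row per itemid that has a value
        non_null = combined[combined["value"].notna()].drop_duplicates("itemid")
        # First row per itemid regardless of value (fallback)
        any_row = combined.drop_duplicates("itemid")
        # non_null comes first, so it wins whenever an itemid has a valued row
        samples = pd.concat([non_null, any_row]).drop_duplicates("itemid", keep="first")
    return samples.reset_index(drop=True)


samples = first_samples(toy_csv, chunksize=2)
print(samples)
```

In this variant the running table never holds more than one row per itemid, so no separate consolidation pass is needed afterwards. Items that never have a value, such as itemid 3 in the toy data, still appear with a missing sample.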

## Step 3: Consolidate and Remove Duplicates

Concatenate the per-chunk results and deduplicate once more, since the same itemid appears in many chunks. This leaves one row per unique itemid.

In [11]:
# Concatenate all chunks into a single DataFrame
print("\nConsolidating chunks...")
consolidated_features = pd.concat(unique_features_list, ignore_index=True)
print(f"Total rows before final deduplication: {len(consolidated_features)}")

# Perform final deduplication to ensure one row per unique itemid
unique_features = consolidated_features.drop_duplicates(subset=['itemid'], keep='first')
print(f"Total unique items after final deduplication: {len(unique_features)}")


Consolidating chunks...
Total rows before final deduplication: 639234
Total unique items after final deduplication: 2311


## Step 4: Merge with Item Metadata

Load the d_items dictionary and merge it with our extracted features to add descriptive columns like label, category, and param_type. The merge is an inner join on itemid, so any item without a d_items entry would be dropped.

In [12]:
# Read the d_items dictionary
print("\nReading d_items dictionary file...")
d_items_df = pd.read_csv(d_items_file_path)
print(f"Total items in d_items dictionary: {len(d_items_df)}")

# Perform inner join on itemid to add metadata
print("\nMerging with d_items metadata...")
features_with_metadata = pd.merge(
    unique_features,
    d_items_df,
    on='itemid',
    how='inner'
)

print(f"Items after merge: {len(features_with_metadata)}")


Reading d_items dictionary file...
Total items in d_items dictionary: 4095

Merging with d_items metadata...
Items after merge: 2311
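In this run the count is 2311 both before and after the merge, so the inner join dropped nothing. If the counts ever differ, a left join with pandas' `indicator=True` option is one way to list the itemids that lack metadata. The sketch below uses made-up toy frames in place of `unique_features` and `d_items_df`.

```python
import pandas as pd

# Toy stand-ins (made-up rows) for unique_features and d_items_df
toy_features = pd.DataFrame(
    {"itemid": [1, 2, 99], "value": ["72", "Oral", "5"]}
)
toy_d_items = pd.DataFrame(
    {"itemid": [1, 2, 3], "label": ["Heart Rate", "Temp Site", "Unused"]}
)

# how="left" keeps every feature row; indicator=True adds a "_merge" column
# whose value is "both" or "left_only" for each row
checked = toy_features.merge(toy_d_items, on="itemid", how="left", indicator=True)

# Rows marked "left_only" have no matching entry in the dictionary
unmatched_ids = checked.loc[checked["_merge"] == "left_only", "itemid"].tolist()
print("itemids without metadata:", unmatched_ids)
```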


## Step 5: Save Output

Save the final DataFrame to features_chartevents.csv.

In [13]:
# Create the output directory if it doesn't exist
# (skip when the path has no directory part, since os.makedirs('') raises)
output_dir = os.path.dirname(output_file_path)
if output_dir:
    os.makedirs(output_dir, exist_ok=True)

# Save the features to CSV
print(f"\nSaving features to {output_file_path}...")
features_with_metadata.to_csv(output_file_path, index=False)
print("Features saved successfully!")


Saving features to ../data/features/features_chartevents.csv...
Features saved successfully!


## Step 6: Verification

Display the first 10 rows to verify we have both metadata (label) and sample data (value/valueuom).

In [14]:
# Display first 10 rows
print("\n" + "="*80)
print("VERIFICATION: First 10 rows of the feature list:")
print("="*80)
print(features_with_metadata.head(10))

# Display total count
print("\n" + "="*80)
print(f"TOTAL NUMBER OF CHART EVENT FEATURES: {len(features_with_metadata)}")
print("="*80)

# Display column names to show we have both metadata and sample data
print("\n" + "="*80)
print("Available columns:")
print("="*80)
print(list(features_with_metadata.columns))


VERIFICATION: First 10 rows of the feature list:
0  226512               39.4      39.4       kg      0.0   
1  226707                 60      60.0     Inch      0.0   
2  226730                152     152.0       cm      0.0   
3  220048  SR (Sinus Rhythm)       NaN      NaN      0.0   
4  224642               Oral       NaN      NaN      0.0   
5  224650                NaN       NaN      NaN      0.0   
6  223761               98.7      98.7       °F      0.0   
7  220179                 84      84.0     mmHg      0.0   
8  220180                 48      48.0     mmHg      0.0   
9  220181                 56      56.0     mmHg      0.0   

                                   label           abbreviation      linksto  \
0                  Admission Weight (Kg)  Admission Weight (Kg)  chartevents   
1                                 Height                 Height  chartevents   
2                            Height (cm)            Height (cm)  chartevents   
3                           H

## Summary

This notebook has successfully:
1. ✅ Imported pandas and defined file paths for chartevents.csv.gz and d_items.csv.gz
2. ✅ Implemented a chunk processing strategy to handle the large chartevents file (chunks of 1,000,000 rows)
3. ✅ Extracted unique itemids with sample values (value, valuenum, valueuom, warning) from each chunk
4. ✅ Consolidated all chunks and performed final deduplication to ensure one row per itemid
5. ✅ Merged with d_items metadata to add descriptive columns (label, category, param_type)
6. ✅ Saved the result to features_chartevents.csv
7. ✅ Verified output shows both metadata (label) and sample data (value/valueuom)

The resulting feature list pairs each itemid's metadata with one sample value and unit, which supports choosing an embedding strategy per item (normalization for numeric values, BioBERT for text values).
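As a rough first pass at that decision, each item could be tagged from its sample row. This is a sketch under stated assumptions: the toy rows copy three samples from the verification output, while the `suggest_strategy` helper and the `strategy` labels are hypothetical. A single sample is weak evidence, so cross-checking against the `param_type` column from d_items would be a sensible follow-up.

```python
import pandas as pd

# Toy rows shaped like the verification output (three of the sampled items)
toy = pd.DataFrame(
    {
        "itemid": [226512, 220048, 224650],
        "value": ["39.4", "SR (Sinus Rhythm)", None],
        "valuenum": [39.4, None, None],
    }
)


def suggest_strategy(row):
    """Suggest an embedding route from a single sample row (heuristic only)."""
    if pd.notna(row["valuenum"]):
        return "normalize"  # numeric sample: candidate for normalization
    if pd.notna(row["value"]):
        return "text"  # text-only sample: candidate for BioBERT
    return "unknown"  # sample row had no value; needs another look


toy["strategy"] = toy.apply(suggest_strategy, axis=1)
print(toy[["itemid", "strategy"]])
```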