# Laboratory Feature Extraction for MIMIC-IV MDP Environment

This notebook extracts valid laboratory features for an RL agent by filtering the dictionary of all possible lab tests to include only those present in actual patient data.

## Setup: Import Libraries

In [1]:
import pandas as pd
import os


## Step 1: Define File Paths

Set the paths to the input data files and the output location. Update them to match where your data is stored.

In [None]:
# Define file paths - update these to match your data location
#base_path = 'C:\\Users\\Eli\\Data\\physionet.org\\files\\mimiciv\\3.1\\hosp\\'
base_path = '~/Data/physionet.org/files/mimiciv/3.1/hosp/'

labevents_file_path = base_path + 'labevents.csv.gz'  # Large file containing actual lab events
d_labevents_file_path = base_path + 'd_labitems.csv.gz'  # Dictionary file with all possible lab tests

output_path = '../data/features/'  # Directory to save filtered features
output_file_path = os.path.join(output_path, 'features_labevents.csv')  # Output file for filtered features
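
Before loading anything, it can help to confirm that the input paths resolve to real files. The following is a minimal sketch and not part of the original notebook: `find_missing_files` is a hypothetical helper, and it assumes paths may start with `~`, as `base_path` does above.

```python
import os

def find_missing_files(paths):
    """Return the paths that do not point to an existing file.

    Hypothetical helper: a leading '~' is expanded before the check.
    """
    missing = []
    for path in paths:
        expanded = os.path.expanduser(path)
        if not os.path.isfile(expanded):
            missing.append(path)
    return missing
```

Calling `find_missing_files([labevents_file_path, d_labevents_file_path])` would return an empty list when both input files are present.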

## Step 2: Extract Unique Item IDs from Lab Events

Read only the `itemid` column from the large labevents file to limit memory use and load time, then extract its unique values.

In [3]:
# Read only the itemid column from labevents for efficiency
print("Reading labevents file (only itemid column)...")
labevents_df = pd.read_csv(labevents_file_path, usecols=['itemid'])

# Extract unique itemids
unique_itemids = labevents_df['itemid'].unique()

# Convert to set for faster lookup
unique_itemids_set = set(unique_itemids)

# Print the count of unique items
print(f"Number of unique lab test items found in labevents: {len(unique_itemids_set)}")

Reading labevents file (only itemid column)...
Number of unique lab test items found in labevents: 976
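
If even a single column of labevents is too large to hold in memory, the same set of IDs can be collected by streaming the file in chunks. This is a sketch under assumptions, not the notebook's method: `collect_unique_itemids` is a hypothetical helper, the chunk size is arbitrary, and the sample rows are made up rather than taken from MIMIC-IV.

```python
import io
import pandas as pd

def collect_unique_itemids(source, chunksize=1_000_000):
    """Stream a labevents-style CSV and return the set of distinct itemid values."""
    unique_ids = set()
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows
    for chunk in pd.read_csv(source, usecols=['itemid'], chunksize=chunksize):
        unique_ids.update(chunk['itemid'].unique().tolist())
    return unique_ids

# Tiny in-memory stand-in for labevents.csv.gz (made-up rows)
sample_csv = io.StringIO(
    "labevent_id,itemid,valuenum\n"
    "1,50801,1.0\n"
    "2,50802,2.0\n"
    "3,50801,3.0\n"
)
sample_ids = collect_unique_itemids(sample_csv, chunksize=2)
print(sorted(sample_ids))
```

The result is the same set that `unique()` produces on the full column; only the peak memory use differs.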


## Step 3: Filter Dictionary to Keep Only Valid Items

Read the complete `d_labitems` dictionary (loaded here as `d_labevents_df`) and keep only the rows whose `itemid` appears in the lab events data.

In [4]:
# Read the complete d_labevents dictionary
print("\nReading d_labevents dictionary file...")
d_labevents_df = pd.read_csv(d_labevents_file_path)

print(f"Total items in d_labevents dictionary: {len(d_labevents_df)}")

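# Filter to keep only items that exist in actual data.
# .copy() makes the result an independent DataFrame, so adding columns to it
# later does not trigger pandas' SettingWithCopyWarning.
filtered_features_df = d_labevents_df[d_labevents_df['itemid'].isin(unique_itemids_set)].copy()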

print(f"Items after filtering (present in actual data): {len(filtered_features_df)}")


Reading d_labevents dictionary file...
Total items in d_labevents dictionary: 1650
Items after filtering (present in actual data): 976
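
In the recorded run, the filtered dictionary has as many rows (976) as there are unique `itemid` values in the data (976), which suggests that every observed `itemid` has a dictionary entry, assuming `itemid` is unique in the dictionary. The reverse direction can also be checked explicitly. The sketch below uses a hypothetical helper, `find_undocumented_itemids`, and made-up toy values.

```python
import pandas as pd

def find_undocumented_itemids(observed_itemids, dictionary_df):
    """Return observed itemids that have no row in the dictionary DataFrame."""
    documented = set(dictionary_df['itemid'].tolist())
    return set(observed_itemids) - documented

# Made-up example values, not MIMIC data
toy_dictionary = pd.DataFrame({'itemid': [1, 2, 3], 'label': ['a', 'b', 'c']})
toy_observed = {2, 3, 4}
print(find_undocumented_itemids(toy_observed, toy_dictionary))
```

In the notebook's terms, the call would be `find_undocumented_itemids(unique_itemids_set, d_labevents_df)`; a non-empty result would mean some lab events reference items missing from the dictionary.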


## Step 4: Categorize Lab Tests into Clinical Groups

Enrich the feature list by grouping individual lab tests into clinically meaningful categories.

In [5]:
def categorize_lab_item(row):
    """
    Categorize a lab item into a clinical group using a hierarchical grouping strategy.
    
    Most fluid/category combinations are kept as broad groups (e.g., "Urine - Chemistry").
    However, "Blood - Chemistry" and "Blood - Hematology" are subdivided into clinical panels.
    Panel terms are matched as case-insensitive substrings of the label, and the first
    matching panel wins.
    
    Args:
        row: A row from the DataFrame containing 'label', 'fluid', and 'category' columns.
    
    Returns:
        str: The clinical group name.
    """
    # Extract and normalize fields
    label = str(row['label']).lower() if pd.notna(row['label']) else ''
    fluid = str(row['fluid']).lower() if pd.notna(row['fluid']) else ''
    category = str(row['category']).lower() if pd.notna(row['category']) else ''
    
    # Create basic group string (using title case for consistency)
    basic_group = f"{fluid.title()} - {category.title()}"
    
    # Case 1: Blood - Chemistry
    if fluid == 'blood' and category == 'chemistry':
        # BMP / Electrolytes
        if any(term in label for term in ['sodium', 'potassium', 'chloride', 'bicarbonate', 
                                           'anion gap', 'urea nitrogen', 'creatinine', 
                                           'glucose', 'calcium']):
            return 'Blood - Chemistry (BMP/Electrolytes)'
        
        # Liver Function (LFT)
        if any(term in label for term in ['albumin', 'bilirubin', 'alkaline phosphatase', 
                                           'alt', 'ast', 'ggt', 'protein, total']):
            return 'Blood - Chemistry (LFT)'
        
        # Cardiac Markers
        if any(term in label for term in ['troponin', 'ck-mb', 'creatine kinase', 
                                           'bnp', 'myoglobin']):
            return 'Blood - Chemistry (Cardiac Markers)'
        
        # Lipid Panel
        if any(term in label for term in ['cholesterol', 'hdl', 'ldl', 'triglyceride']):
            return 'Blood - Chemistry (Lipid Panel)'
        
        # Blood Gas (in Chem)
        if any(term in label for term in ['pco2', 'po2', 'ph', 'base excess', 'lactate']):
            return 'Blood - Chemistry (Blood Gas)'
        
        # Inflammatory
        if any(term in label for term in ['c-reactive protein', 'sedimentation rate']):
            return 'Blood - Chemistry (Inflammatory)'
        
        # Endocrine
        if any(term in label for term in ['tsh', 'thyroxine', 'cortisol', 'insulin', 'a1c']):
            return 'Blood - Chemistry (Endocrine)'
        
        # Default for Blood - Chemistry
        return 'Blood - Chemistry (Other)'
    
    # Case 2: Blood - Hematology
    if fluid == 'blood' and category == 'hematology':
        # CBC
        if any(term in label for term in ['hemoglobin', 'hematocrit', 'wbc', 'platelet', 
                                           'rbc', 'mcv', 'mch', 'rdw', 'neutrophil', 
                                           'lymphocyte', 'monocyte', 'eosinophil', 'basophil']):
            return 'Blood - Hematology (CBC)'
        
        # Coagulation
        if any(term in label for term in ['pt', 'inr', 'ptt', 'fibrinogen', 'd-dimer']):
            return 'Blood - Hematology (Coagulation)'
        
        # Default for Blood - Hematology
        return 'Blood - Hematology (Other)'
    
    # Case 3: All Other Groups - return basic_group directly
    return basic_group

# Apply the categorization function
print("\nCategorizing lab tests into clinical groups...")
filtered_features_df['clinical_group'] = filtered_features_df.apply(categorize_lab_item, axis=1)

# Display the distribution of clinical groups
print("\nDistribution of Clinical Groups:")
print("="*80)
print(filtered_features_df['clinical_group'].value_counts())
print("="*80)
print(f"\nTotal number of distinct groups: {filtered_features_df['clinical_group'].nunique()}")


Categorizing lab tests into clinical groups...

Distribution of Clinical Groups:
clinical_group
Other / Special Chemistry    574
Urinalysis                   137
Blood Gas                    110
BMP / Electrolytes            47
CBC / Hematology              35
Liver Function (LFT)          30
Coagulation                   16
Lipids                        12
Cardiac Markers                7
Thyroid / Endocrine            6
Inflammatory                   2
Name: count, dtype: int64


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_features_df['clinical_group'] = filtered_features_df.apply(categorize_lab_item, axis=1)
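
`categorize_lab_item` matches terms by plain substring, so short abbreviations can match inside unrelated words: `'ph' in 'phosphate'` is true, as is `'alt' in 'cobalt'`. One possible refinement, shown here as a sketch and not as the notebook's method, is to require that a term is not embedded in a longer word. `contains_term` is a hypothetical helper, and whether stricter matching suits the actual MIMIC-IV labels would need to be checked against the data.

```python
import re

def contains_term(label, term):
    """Return True if term occurs in label with no letter or digit touching it on either side."""
    pattern = r'(?<![a-z0-9])' + re.escape(term.lower()) + r'(?![a-z0-9])'
    return re.search(pattern, label.lower()) is not None

print('ph' in 'phosphate')                                      # plain substring match
print(contains_term('phosphate', 'ph'))                         # boundary-aware match
print(contains_term('Alanine Aminotransferase (ALT)', 'alt'))   # abbreviation in parentheses
```

The trade-off is that boundary-aware matching no longer catches plurals or compounds: `contains_term('neutrophils', 'neutrophil')` is false. A mixed approach, with strict matching for short abbreviations and substring matching for longer stems, may be a better fit.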


## Step 5: Save Filtered Feature List with Clinical Groups

Save the enriched DataFrame to a CSV file containing all original dictionary columns plus the new `clinical_group` column.

In [6]:
# Create the output directory if it doesn't exist
output_dir = os.path.dirname(output_file_path)
os.makedirs(output_dir, exist_ok=True)

# Save the filtered features to CSV
print(f"\nSaving filtered features to {output_file_path}...")
filtered_features_df.to_csv(output_file_path, index=False)
print("Features saved successfully!")


Saving filtered features to ..\data\features\features_labevents.csv...
Features saved successfully!


## Step 6: Verification

Display the first 5 rows (including the new `clinical_group` column) and the total number of features available to the RL agent.

In [7]:
# Display first 5 rows
print("\n" + "="*80)
print("VERIFICATION: First 5 rows of the feature list:")
print("="*80)
print(filtered_features_df.head())

# Display total count
print("\n" + "="*80)
print(f"TOTAL NUMBER OF FEATURES AVAILABLE FOR THE AGENT: {len(filtered_features_df)}")
print("="*80)


VERIFICATION: First 5 rows of the feature list:
   itemid                                label  fluid   category  \
0   50801           Alveolar-arterial Gradient  Blood  Blood Gas   
1   50802                          Base Excess  Blood  Blood Gas   
2   50803  Calculated Bicarbonate, Whole Blood  Blood  Blood Gas   
3   50804                 Calculated Total CO2  Blood  Blood Gas   
4   50805                    Carboxyhemoglobin  Blood  Blood Gas   

  clinical_group  
0      Blood Gas  
1      Blood Gas  
2      Blood Gas  
3      Blood Gas  
4      Blood Gas  

TOTAL NUMBER OF FEATURES AVAILABLE FOR THE AGENT: 976
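
Beyond eyeballing the first rows, a few structural checks can be automated. The sketch below is an illustration only: `validate_feature_table` is a hypothetical helper, and the three rules it applies (required columns, unique `itemid`, no empty `clinical_group`) are assumptions about what a valid feature list should satisfy.

```python
import pandas as pd

def validate_feature_table(df):
    """Return a list of human-readable problems found in a feature table (empty if none)."""
    problems = []
    for column in ('itemid', 'label', 'clinical_group'):
        if column not in df.columns:
            problems.append(f"missing column: {column}")
    if 'itemid' in df.columns and df['itemid'].duplicated().any():
        problems.append("duplicate itemid values")
    if 'clinical_group' in df.columns and df['clinical_group'].isna().any():
        problems.append("rows without a clinical_group")
    return problems

# Two rows shaped like the verification output above
toy = pd.DataFrame({
    'itemid': [50801, 50802],
    'label': ['Alveolar-arterial Gradient', 'Base Excess'],
    'clinical_group': ['Blood Gas', 'Blood Gas'],
})
print(validate_feature_table(toy))
```

Applied to the notebook's data, `validate_feature_table(filtered_features_df)` should return an empty list if the table meets these assumed rules.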


## Summary

This notebook has successfully:
1. ✅ Extracted unique lab test item IDs from the actual patient data (labevents)
2. ✅ Filtered the dictionary of all possible lab tests (`d_labitems`) to include only those present in actual data
3. ✅ Categorized lab tests into clinically meaningful groups (Urinalysis, Blood Gas, CBC, BMP, etc.)
4. ✅ Saved the filtered and enriched feature list to `features_labevents.csv`
5. ✅ Verified the results by displaying the first 5 rows, clinical group distribution, and total count

The resulting feature list defines the Action Space for the RL agent, ensuring that the agent can only select lab tests that actually exist in the patient data, now organized into clinically meaningful categories for better interpretability.
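
How the feature list is turned into an action space is not specified here. One possible encoding, assuming a discrete action space with one action per lab test, maps each `itemid` to a stable 0-based index. `build_action_index` is a hypothetical helper, and sorting by `itemid` is an arbitrary choice made so that the mapping is reproducible across runs.

```python
def build_action_index(itemids):
    """Map each distinct itemid to a stable 0-based action index (sorted order)."""
    ordered = sorted(set(itemids))
    itemid_to_action = {itemid: index for index, itemid in enumerate(ordered)}
    action_to_itemid = ordered  # list position is the action index
    return itemid_to_action, action_to_itemid

itemid_to_action, action_to_itemid = build_action_index([50803, 50801, 50802, 50801])
print(itemid_to_action[50802], action_to_itemid[0])
```

With the saved feature list, the input would be the `itemid` column of `features_labevents.csv`, giving one action per row.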