# Microbiology Feature Extraction for MIMIC-IV P-CAFE Environment

This notebook extracts the unique microbiology features available to an RL agent. Unlike lab events, a microbiology event is defined by both the test performed (e.g., "Blood Culture") and the specimen type (e.g., "Blood"). We extract every unique combination of these fields to build a catalog of available microbiology tests.

## Setup: Import Libraries

In [None]:
import pandas as pd

## Step 1: Define File Paths

Define the input path for the microbiology events file and the output path for the extracted feature list. Update both paths to match your data location.

In [None]:
# Define file paths - update these to match your data location
microbiology_file_path = 'microbiologyevents.csv.gz'  # MIMIC-IV microbiology events file
output_file_path = 'features_microbiologyevents.csv'  # Output file for unique features

## Step 2: Load Microbiology Data (Optimized)

Read only the four columns that define a feature: `test_itemid`, `test_name`, `spec_itemid`, and `spec_type_desc`. Skipping the remaining columns reduces load time and memory use.

In [None]:
# Read only the required columns from microbiologyevents for efficiency
print("Reading microbiologyevents file (only required columns)...")
microbiology_df = pd.read_csv(
    microbiology_file_path,
    usecols=['test_itemid', 'test_name', 'spec_itemid', 'spec_type_desc']
)

print(f"Total rows loaded: {len(microbiology_df)}")
print(f"\nColumns loaded: {list(microbiology_df.columns)}")
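
If the full file does not fit comfortably in memory, the same four columns can be read in chunks and deduplicated as they arrive. The following is a minimal sketch rather than part of the original workflow: the helper name `load_unique_features_chunked`, the `chunksize` default, and the demo rows are all invented for illustration and are not real MIMIC-IV values.

```python
import io

import pandas as pd

FEATURE_COLUMNS = ['test_itemid', 'test_name', 'spec_itemid', 'spec_type_desc']


def load_unique_features_chunked(source, chunksize=500_000):
    """Return unique feature rows from a CSV path or buffer, read in chunks.

    Hypothetical helper: the name and the chunksize default are assumptions.
    """
    unique_chunks = []
    for chunk in pd.read_csv(source, usecols=FEATURE_COLUMNS, chunksize=chunksize):
        # Deduplicate inside each chunk so only a small frame is retained
        unique_chunks.append(chunk.drop_duplicates())
    # Deduplicate again, because the same combination can recur across chunks
    combined = pd.concat(unique_chunks, ignore_index=True)
    return combined.drop_duplicates().reset_index(drop=True)


# Tiny in-memory demo with made-up rows (not real MIMIC-IV data)
demo_csv = io.StringIO(
    "microevent_id,test_itemid,test_name,spec_itemid,spec_type_desc\n"
    "1,1,TEST A,10,SPEC X\n"
    "2,1,TEST A,10,SPEC X\n"
    "3,2,TEST B,10,SPEC X\n"
    "4,1,TEST A,10,SPEC X\n"
    "5,1,TEST A,20,SPEC Y\n"
)
demo_features = load_unique_features_chunked(demo_csv, chunksize=2)
print(demo_features)
```

In the notebook, passing `microbiology_file_path` as `source` should produce the same set of rows as the load-then-deduplicate approach, at the cost of a second deduplication pass.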

## Step 3: Extract Unique Features

Create a DataFrame containing only the unique combinations of the four columns, sorted by `test_name` and `spec_type_desc` for readability.

In [None]:
# Extract unique combinations of the feature-defining columns
print("\nExtracting unique microbiology features...")
unique_features_df = microbiology_df.drop_duplicates(
    subset=['test_itemid', 'test_name', 'spec_itemid', 'spec_type_desc']
)

# Sort by test_name and spec_type_desc for readability
unique_features_df = unique_features_df.sort_values(
    by=['test_name', 'spec_type_desc']
).reset_index(drop=True)

print(f"Number of unique microbiology features found: {len(unique_features_df)}")
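
`drop_duplicates()` discards how often each combination occurs, which may matter when deciding which tests are worth exposing to the agent. As a hedged sketch, the counts could be kept with a `groupby` instead. The helper name `count_feature_occurrences`, the `n_events` column, and the demo rows are assumptions made for this example; `dropna=False` keeps combinations with missing values and requires pandas 1.1 or later.

```python
import pandas as pd

FEATURE_COLUMNS = ['test_itemid', 'test_name', 'spec_itemid', 'spec_type_desc']


def count_feature_occurrences(events_df):
    """Count how many event rows fall under each unique feature combination.

    Hypothetical helper: the name and the 'n_events' column are assumptions.
    """
    return (
        events_df.groupby(FEATURE_COLUMNS, dropna=False)
        .size()
        .reset_index(name='n_events')
        .sort_values('n_events', ascending=False)
        .reset_index(drop=True)
    )


# Made-up demo rows (not real MIMIC-IV data)
demo_events = pd.DataFrame({
    'test_itemid': [1, 1, 2, 1],
    'test_name': ['TEST A', 'TEST A', 'TEST B', 'TEST A'],
    'spec_itemid': [10, 10, 10, 10],
    'spec_type_desc': ['SPEC X', 'SPEC X', 'SPEC X', 'SPEC X'],
})
feature_counts = count_feature_occurrences(demo_events)
print(feature_counts)
```

Applied to `microbiology_df`, the result has one row per unique feature, like `unique_features_df`, plus the frequency column.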

## Step 4: Save Output

Save the unique feature list to a CSV file without the DataFrame index.

In [None]:
# Save the unique features to CSV (without index)
print(f"\nSaving unique features to {output_file_path}...")
unique_features_df.to_csv(output_file_path, index=False)
print("Features saved successfully!")
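
A quick round-trip check can confirm that the saved file matches the DataFrame in memory. This is a minimal sketch: `verify_saved_features` is a hypothetical helper, and it compares only the shape and column names, not individual values. The demo writes made-up rows to a temporary directory so that it does not touch the real output file.

```python
import os
import tempfile

import pandas as pd


def verify_saved_features(features_df, path):
    """Re-read a saved feature CSV and compare its shape and column names.

    Hypothetical helper: it does not compare cell values.
    """
    reloaded = pd.read_csv(path)
    same_shape = reloaded.shape == features_df.shape
    same_columns = list(reloaded.columns) == list(features_df.columns)
    return same_shape and same_columns


# Made-up demo rows (not real MIMIC-IV data)
demo_features = pd.DataFrame({
    'test_itemid': [1, 2],
    'test_name': ['TEST A', 'TEST B'],
    'spec_itemid': [10, 10],
    'spec_type_desc': ['SPEC X', 'SPEC X'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'features_demo.csv')
    demo_features.to_csv(demo_path, index=False)
    round_trip_ok = verify_saved_features(demo_features, demo_path)

print(f"Round trip consistent: {round_trip_ok}")
```

In the notebook, the equivalent call would be `verify_saved_features(unique_features_df, output_file_path)`.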

## Step 5: Verification

Display the first 10 rows and the total number of unique microbiology features.

In [None]:
# Display first 10 rows
print("\n" + "="*80)
print("VERIFICATION: First 10 rows of the extracted feature list:")
print("="*80)
print(unique_features_df.head(10))

# Display total count
print("\n" + "="*80)
print(f"TOTAL NUMBER OF UNIQUE MICROBIOLOGY FEATURES: {len(unique_features_df)}")
print("="*80)

## Summary

This notebook has successfully:
1. ✅ Defined the input and output file paths for the microbiology events data
2. ✅ Loaded only the required columns (`test_itemid`, `test_name`, `spec_itemid`, `spec_type_desc`) to reduce load time and memory use
3. ✅ Extracted unique combinations of these columns using `drop_duplicates()`
4. ✅ Sorted results by `test_name` and `spec_type_desc` for readability
5. ✅ Saved the unique feature list to `features_microbiologyevents.csv` (without index)
6. ✅ Verified the results by displaying the first 10 rows and total count

The resulting feature list defines the microbiology test action space for the MIMIC-IV P-CAFE RL environment, where each feature represents a unique combination of test type and specimen type.
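
How the environment consumes this list is not shown in this notebook. As one possible illustration, each row could be assigned an integer action index. The helper `build_action_index`, the `action_id` column, and the demo rows below are assumptions made for this sketch and do not describe the actual P-CAFE interface.

```python
import pandas as pd

FEATURE_COLUMNS = ['test_itemid', 'test_name', 'spec_itemid', 'spec_type_desc']


def build_action_index(features_df):
    """Attach an integer action_id to each feature row and build a lookup dict.

    Hypothetical helper: the real environment may index actions differently.
    """
    indexed_df = features_df.reset_index(drop=True)
    indexed_df = indexed_df.assign(action_id=range(len(indexed_df)))
    # Key each action by the full four-column combination to avoid collisions
    lookup = {
        tuple(row[:4]): row[4]
        for row in indexed_df[FEATURE_COLUMNS + ['action_id']].itertuples(index=False)
    }
    return indexed_df, lookup


# Made-up demo rows (not real MIMIC-IV data)
demo_features = pd.DataFrame({
    'test_itemid': [1, 1, 2],
    'test_name': ['TEST A', 'TEST A', 'TEST B'],
    'spec_itemid': [10, 20, 10],
    'spec_type_desc': ['SPEC X', 'SPEC Y', 'SPEC X'],
})
indexed_features, action_lookup = build_action_index(demo_features)
print(indexed_features)
```

Because the indices follow row order, they are only stable as long as the feature list is sorted the same way each time it is regenerated.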