# ICD Diagnoses Feature Extraction for P-CAFE/MDP Environment

This notebook builds the list of unique diagnosis features for the P-CAFE/MDP environment. It extracts the unique ICD diagnosis codes (both ICD-9 and ICD-10) from the patient data and enriches them with their full names from the dictionary table.

## Setup: Import Libraries

In [1]:
import pandas as pd
import os

## Step 1: Define File Paths

Set the input and output file paths. Update `base_path` to match the location of your MIMIC-IV `hosp` files.

In [None]:
# Define file paths - update these to match your data location
base_path = '~/Data/physionet.org/files/mimiciv/3.1/hosp/'
diagnoses_file_path = base_path + 'diagnoses_icd.csv.gz'  # File containing patient diagnoses
dictionary_file_path = base_path + 'd_icd_diagnoses.csv.gz'  # Dictionary file with ICD code descriptions
output_file_path = '../data/features/icd_diagnoses_features.csv'  # Output file for features
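
Because the input paths start with `~` and are built by string concatenation, a quick existence check before loading can catch a wrong location early. The following is a minimal sketch, assuming the same directory layout as above; `resolve_inputs` is a hypothetical helper and not part of the notebook.

```python
from pathlib import Path


def resolve_inputs(base_path, filenames):
    """Expand '~' in base_path and return (resolved paths, list of missing files)."""
    base = Path(base_path).expanduser()
    paths = {name: base / name for name in filenames}
    missing = [str(p) for p in paths.values() if not p.is_file()]
    return paths, missing


# The base path is the one used in this notebook; adjust it to your setup
paths, missing = resolve_inputs(
    '~/Data/physionet.org/files/mimiciv/3.1/hosp/',
    ['diagnoses_icd.csv.gz', 'd_icd_diagnoses.csv.gz'],
)
if missing:
    print("Missing input files:", missing)
```

If `missing` is non-empty, fix `base_path` before running the remaining steps.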



## Step 2: Extract Unique ICD Codes

Read only the `icd_code` and `icd_version` columns from the diagnoses file to reduce memory use and load time, then drop duplicate rows to get the unique (code, version) pairs.

In [5]:
# Read only the icd_code and icd_version columns for efficiency.
# icd_code is read as a string so codes keep their leading zeros (e.g. '0020').
print("Reading diagnoses_icd file (only icd_code and icd_version columns)...")
diagnoses_df = pd.read_csv(
    diagnoses_file_path,
    usecols=['icd_code', 'icd_version'],
    dtype={'icd_code': str}
)

print(f"Total diagnosis records: {len(diagnoses_df)}")

# Drop duplicate rows to get unique (icd_code, icd_version) pairs
unique_diagnoses_df = diagnoses_df.drop_duplicates()

print(f"Number of unique diagnosis codes: {len(unique_diagnoses_df)}")

Reading diagnoses_icd file (only icd_code and icd_version columns)...
Total diagnosis records: 6364488
Number of unique diagnosis codes: 28583
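
Note that `drop_duplicates()` here works on (`icd_code`, `icd_version`) pairs, so a code string that occurs under both ICD-9 and ICD-10 would be counted twice. A minimal sketch on made-up rows (not MIMIC data) shows the difference between unique pairs and unique code strings:

```python
import pandas as pd

# Made-up rows for illustration; 'V10' is used only as a string that appears under both versions
records = pd.DataFrame({
    'icd_code': ['0020', '0020', 'V10', 'V10'],
    'icd_version': [9, 9, 9, 10],
})

# Unique (code, version) pairs, as in the step above
unique_pairs = records.drop_duplicates()
n_pairs = len(unique_pairs)

# Unique code strings, ignoring the version
n_code_strings = records['icd_code'].nunique()

print(n_pairs)         # 3
print(n_code_strings)  # 2
```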


## Step 3: Enrich with Metadata from Dictionary

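Read the ICD diagnoses dictionary and merge it with the unique codes on both `icd_code` and `icd_version`, since a code is only unambiguous together with its version. The merge is an inner join, so any code missing from the dictionary would be dropped.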

In [6]:
# Read the d_icd_diagnoses dictionary file (icd_code as string to keep leading zeros)
print("\nReading d_icd_diagnoses dictionary file...")
dictionary_df = pd.read_csv(dictionary_file_path, dtype={'icd_code': str})

print(f"Total items in dictionary: {len(dictionary_df)}")

# Perform inner merge on both icd_code and icd_version
enriched_df = pd.merge(
    unique_diagnoses_df,
    dictionary_df,
    on=['icd_code', 'icd_version'],
    how='inner'
)

print(f"Diagnoses after enrichment: {len(enriched_df)}")


Reading d_icd_diagnoses dictionary file...
Total items in dictionary: 112107
Diagnoses after enrichment: 28583
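
An inner merge silently drops any code that has no dictionary entry. In the run above the count stays at 28583, so nothing was dropped, but a left merge with `indicator=True` makes that check explicit. The following is a minimal sketch on made-up frames; `find_unmatched` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def find_unmatched(codes_df, dictionary_df, keys=('icd_code', 'icd_version')):
    """Return the key columns of rows in codes_df with no match in dictionary_df."""
    merged = pd.merge(codes_df, dictionary_df, on=list(keys), how='left', indicator=True)
    unmatched = merged.loc[merged['_merge'] == 'left_only', list(keys)]
    return unmatched.reset_index(drop=True)


# Made-up example data (not from MIMIC-IV); 'XXXX' stands for a code absent from the dictionary
codes = pd.DataFrame({'icd_code': ['0020', '0031', 'XXXX'], 'icd_version': [9, 9, 10]})
dictionary = pd.DataFrame({
    'icd_code': ['0020', '0031'],
    'icd_version': [9, 9],
    'long_title': ['Typhoid fever', 'Salmonella septicemia'],
})

unmatched = find_unmatched(codes, dictionary)
print(unmatched)
```

An empty result means the inner merge loses nothing; otherwise the returned rows are the codes that would be dropped.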


## Step 4: Format Output

Select only the required columns and sort by version and code. The codes are strings, so the order within each version is lexicographic.

In [7]:
# Select only required columns
features_df = enriched_df[['icd_code', 'icd_version', 'long_title']].copy()

# Sort by version and code
features_df = features_df.sort_values(by=['icd_version', 'icd_code']).reset_index(drop=True)

print(f"\nFinal feature count: {len(features_df)}")


Final feature count: 28583


## Step 5: Save to CSV

Save the feature list to a CSV file.

In [8]:
# Create the output directory if it doesn't exist
# (skipped when the output path has no directory component)
output_dir = os.path.dirname(output_file_path)
if output_dir:
    os.makedirs(output_dir, exist_ok=True)

# Save to CSV without index
print(f"\nSaving features to {output_file_path}...")
features_df.to_csv(output_file_path, index=False)
print("Features saved successfully!")


Saving features to ../data/features/icd_diagnoses_features.csv...
Features saved successfully!
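
When this file is read back later, pandas infers column types again, so numeric-looking codes such as `0020` may be converted to integers unless `icd_code` is read as a string. A small sketch, with an in-memory CSV standing in for `icd_diagnoses_features.csv`:

```python
import io
import pandas as pd

# Two rows matching the shape of the saved feature file
features = pd.DataFrame({
    'icd_code': ['0020', '0029'],
    'icd_version': [9, 9],
    'long_title': ['Typhoid fever', 'Paratyphoid fever, unspecified'],
})

# Write to an in-memory buffer instead of disk
buffer = io.StringIO()
features.to_csv(buffer, index=False)

# Read back twice: once with inferred types, once forcing icd_code to string
buffer.seek(0)
reloaded_inferred = pd.read_csv(buffer)
buffer.seek(0)
reloaded_text = pd.read_csv(buffer, dtype={'icd_code': str})

print(reloaded_inferred['icd_code'].tolist())  # [20, 29]
print(reloaded_text['icd_code'].tolist())      # ['0020', '0029']
```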


## Step 6: Verification

Display the first rows and total count of unique diagnosis features.

In [9]:
# Display first rows
print("\n" + "="*80)
print("VERIFICATION: First rows of the feature list:")
print("="*80)
print(features_df.head())

# Display total count
print("\n" + "="*80)
print(f"TOTAL NUMBER OF UNIQUE DIAGNOSIS FEATURES: {len(features_df)}")
print("="*80)


VERIFICATION: First rows of the feature list:
  icd_code  icd_version                      long_title
0     0020            9                   Typhoid fever
1     0029            9  Paratyphoid fever, unspecified
2     0030            9      Salmonella gastroenteritis
3     0031            9           Salmonella septicemia
4    00324            9        Salmonella osteomyelitis

TOTAL NUMBER OF UNIQUE DIAGNOSIS FEATURES: 28583
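
Beyond inspecting the first rows, a few programmatic checks can confirm the properties the earlier steps rely on: no duplicate (code, version) pairs, no missing titles, and only versions 9 and 10. The following is a sketch, assuming a frame with the same three columns as `features_df`; `check_features` is a hypothetical helper and the sample data is made up.

```python
import pandas as pd


def check_features(df):
    """Return a list of human-readable problems found in a features table."""
    problems = []
    if df.duplicated(subset=['icd_code', 'icd_version']).any():
        problems.append('duplicate (icd_code, icd_version) pairs')
    if df['long_title'].isna().any():
        problems.append('missing long_title values')
    if not df['icd_version'].isin([9, 10]).all():
        problems.append('unexpected icd_version values')
    return problems


# A clean table and a deliberately broken one, both made up for illustration
good = pd.DataFrame({
    'icd_code': ['0020', '0029'],
    'icd_version': [9, 9],
    'long_title': ['Typhoid fever', 'Paratyphoid fever, unspecified'],
})
bad = pd.DataFrame({
    'icd_code': ['0020', '0020', '0031'],
    'icd_version': [9, 9, 11],
    'long_title': ['Typhoid fever', None, 'Salmonella septicemia'],
})

print(check_features(good))  # []
print(check_features(bad))
```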


## Summary

This notebook has successfully:
1. ✅ Extracted unique ICD diagnosis codes (both ICD-9 and ICD-10) from patient data
2. ✅ Enriched the codes with full descriptions from the d_icd_diagnoses dictionary
3. ✅ Merged on both icd_code and icd_version to ensure accurate mapping
4. ✅ Formatted and sorted the output by version and code
5. ✅ Saved the feature list to `icd_diagnoses_features.csv`
6. ✅ Verified the results by displaying the first rows and total count

The resulting feature list defines the diagnosis feature space for the P-CAFE/MDP environment.