# Ontology Standardization Pipeline – ICD-10 Matching
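This notebook demonstrates how to standardize diagnosis entries from EMR data against an ICD-10 lookup table: exact matching on the code first, then fuzzy matching on the description for anything left unmapped.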

In [2]:
!pip install fuzzywuzzy[speedup]

Collecting fuzzywuzzy[speedup]
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl.metadata (4.9 kB)
Collecting python-levenshtein>=0.12 (from fuzzywuzzy[speedup])
  Downloading python_levenshtein-0.27.1-py3-none-any.whl.metadata (3.7 kB)
Collecting Levenshtein==0.27.1 (from python-levenshtein>=0.12->fuzzywuzzy[speedup])
  Downloading levenshtein-0.27.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Collecting rapidfuzz<4.0.0,>=3.9.0 (from Levenshtein==0.27.1->python-levenshtein>=0.12->fuzzywuzzy[speedup])
  Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading python_levenshtein-0.27.1-py3-none-any.whl (9.4 kB)
Downloading levenshtein-0.27.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (161 kB)
Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)

In [3]:
import pandas as pd
from fuzzywuzzy import process

In [4]:
# Load raw EMR data
raw_data = pd.DataFrame({
    'Patient_ID': [101, 102, 103, 104],
    'Diagnosis': [' i10', 'E11.9', 'type ii diabetes', 'high bp']
})

In [5]:
# Load ICD-10 reference data
icd10_lookup = pd.DataFrame({
    'ICD10_Code': ['I10', 'E11.9'],
    'Description': ['Essential (primary) hypertension', 'Type 2 diabetes without complications']
})

In [6]:
# Step 1: Clean raw diagnosis data
raw_data['Diagnosis_Clean'] = raw_data['Diagnosis'].str.strip().str.upper()
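
The cleaning step above only trims and uppercases. As a minimal sketch of a slightly stricter variant, the helper below also collapses repeated internal whitespace and flags whether the cleaned value has the rough shape of an ICD-10 code. Both `normalize_diagnosis` and `looks_like_icd10` are hypothetical names, and the regular expression is an approximation of the code layout (letter, digit, alphanumeric, optional decimal part), not a validity check against the real code set.

```python
import re

# Approximate ICD-10 shape: letter, digit, alphanumeric, then an optional
# "." followed by 1-4 alphanumerics. This checks shape only, not validity.
ICD10_SHAPE = re.compile(r'[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?')

def normalize_diagnosis(text: str) -> str:
    """Trim, collapse internal runs of whitespace, and uppercase."""
    return re.sub(r'\s+', ' ', text.strip()).upper()

def looks_like_icd10(text: str) -> bool:
    """True if the whole string has the rough shape of an ICD-10 code."""
    return ICD10_SHAPE.fullmatch(text) is not None

for raw in [' i10', 'E11.9', 'type ii  diabetes', 'high bp']:
    cleaned = normalize_diagnosis(raw)
    print(repr(cleaned), looks_like_icd10(cleaned))
```

A flag like this could be used to route code-shaped values to the direct match and free-text values straight to fuzzy matching, though the notebook itself does not do so.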

In [7]:
# Step 2: Direct match with ICD-10 codes
merged_df = raw_data.merge(icd10_lookup, how='left', left_on='Diagnosis_Clean', right_on='ICD10_Code')
merged_df['Mapping_Status'] = merged_df['Description'].notnull()
merged_df

| | Patient_ID | Diagnosis | Diagnosis_Clean | ICD10_Code | Description | Mapping_Status |
|---|---|---|---|---|---|---|
| 0 | 101 | i10 | I10 | I10 | Essential (primary) hypertension | True |
| 1 | 102 | E11.9 | E11.9 | E11.9 | Type 2 diabetes without complications | True |
| 2 | 103 | type ii diabetes | TYPE II DIABETES | NaN | NaN | False |
| 3 | 104 | high bp | HIGH BP | NaN | NaN | False |
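
Step 2 infers the mapping status from whether `Description` is null, which would misreport a matched code whose description happens to be missing in the lookup. One possible alternative, sketched here under the assumption that the lookup has one row per code, is to let `merge` report the match itself through its `indicator` parameter. The `validate='many_to_one'` argument is an extra safeguard that raises if the lookup contains duplicate codes.

```python
import pandas as pd

raw_data = pd.DataFrame({
    'Patient_ID': [101, 102, 103, 104],
    'Diagnosis_Clean': ['I10', 'E11.9', 'TYPE II DIABETES', 'HIGH BP'],
})
icd10_lookup = pd.DataFrame({
    'ICD10_Code': ['I10', 'E11.9'],
    'Description': ['Essential (primary) hypertension',
                    'Type 2 diabetes without complications'],
})

merged = raw_data.merge(
    icd10_lookup,
    how='left',
    left_on='Diagnosis_Clean',
    right_on='ICD10_Code',
    indicator=True,            # adds a '_merge' column: 'both' or 'left_only'
    validate='many_to_one',    # fail fast if a code appears twice in the lookup
)

# A row is mapped when its key was found on both sides of the merge.
merged['Mapping_Status'] = merged['_merge'] == 'both'
merged = merged.drop(columns='_merge')
print(merged[['Patient_ID', 'Diagnosis_Clean', 'Mapping_Status']])
```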






In [8]:
# Step 3: Fuzzy matching for unmapped cases
unmapped_rows = merged_df[~merged_df['Mapping_Status']].copy()

def fuzzy_match_with_score(diagnosis, descriptions):
    # With a pandas Series as choices, extractOne returns (match, score, index label);
    # with a plain list it returns only (match, score).
    result = process.extractOne(diagnosis, descriptions)
    if result:
        match, score, _ = result
        return pd.Series([match, score])
    return pd.Series([None, 0])

# Score each unmapped diagnosis against the ICD-10 descriptions
unmapped_rows[['Fuzzy_Match', 'Match_Score']] = unmapped_rows['Diagnosis_Clean'].apply(
    lambda x: fuzzy_match_with_score(x, icd10_lookup['Description'])
)
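
To make the scoring step less of a black box, here is a minimal sketch of best-match selection using only the standard library. It swaps in `difflib.SequenceMatcher` for fuzzywuzzy's default scorer (`WRatio`), so its scores will not equal the `Match_Score` values in this notebook. `best_match` is a hypothetical helper, not part of fuzzywuzzy.

```python
from difflib import SequenceMatcher

def best_match(query, choices):
    """Return (choice, score) for the most similar choice; score is 0-100."""
    def similarity(choice):
        # Compare case-insensitively and scale the 0-1 ratio to 0-100.
        ratio = SequenceMatcher(None, query.lower(), choice.lower()).ratio()
        return round(100 * ratio)

    if not choices:
        return (None, 0)
    top = max(choices, key=similarity)
    return (top, similarity(top))

descriptions = [
    'Essential (primary) hypertension',
    'Type 2 diabetes without complications',
]
match, match_score = best_match('TYPE II DIABETES', descriptions)
print(match, match_score)
```

Like `extractOne`, this always returns the closest candidate even when nothing is a good fit, which is why the score has to be checked before the match is trusted.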

In [9]:
# Step 4: Define action based on confidence threshold
threshold = 80
unmapped_rows['Action'] = unmapped_rows['Match_Score'].apply(
    lambda score: 'Auto-Map' if score >= threshold else 'Needs Manual Review'
)
unmapped_rows

| | Patient_ID | Diagnosis | Diagnosis_Clean | ICD10_Code | Description | Mapping_Status | Fuzzy_Match | Match_Score | Action |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 103 | type ii diabetes | TYPE II DIABETES | NaN | NaN | False | Type 2 diabetes without complications | 86 | Auto-Map |
| 3 | 104 | high bp | HIGH BP | NaN | NaN | False | Type 2 diabetes without complications | 49 | Needs Manual Review |
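
The threshold decides how much risk the pipeline accepts, so it helps to see how the two scores above behave at different cut-offs. The sketch below restates the Step 4 rule as a plain function (`decide_action` is a hypothetical name) and applies it at three illustrative thresholds; only 80 comes from this notebook.

```python
def decide_action(score, threshold=80):
    """Auto-map at or above the threshold, otherwise send to manual review."""
    return 'Auto-Map' if score >= threshold else 'Needs Manual Review'

# Scores taken from the Step 4 output table.
scores = {'TYPE II DIABETES': 86, 'HIGH BP': 49}

# 45 and 90 are illustrative values; 80 is the notebook's threshold.
for threshold in (45, 80, 90):
    actions = {dx: decide_action(s, threshold) for dx, s in scores.items()}
    print(threshold, actions)
```

At a threshold of 45, `HIGH BP` would be auto-mapped to the diabetes description, which is the wrong concept; at 90, even the reasonable `TYPE II DIABETES` match would go to manual review.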


In [10]:
# Final merged output
final_output = pd.concat([merged_df[merged_df['Mapping_Status']], unmapped_rows], ignore_index=True)
final_output[['Patient_ID', 'Diagnosis', 'Diagnosis_Clean', 'Fuzzy_Match', 'Match_Score', 'Action']]

| | Patient_ID | Diagnosis | Diagnosis_Clean | Fuzzy_Match | Match_Score | Action |
|---|---|---|---|---|---|---|
| 0 | 101 | i10 | I10 | NaN | NaN | NaN |
| 1 | 102 | E11.9 | E11.9 | NaN | NaN | NaN |
| 2 | 103 | type ii diabetes | TYPE II DIABETES | Type 2 diabetes without complications | 86.0 | Auto-Map |
| 3 | 104 | high bp | HIGH BP | Type 2 diabetes without complications | 49.0 | Needs Manual Review |
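
In the final output, directly matched rows have no `Action`, and auto-mapped rows carry a matched description but still no `ICD10_Code`. A possible follow-up step, sketched here on a hand-built copy of the table above, labels the direct matches and copies the code for rows that cleared the threshold. The `'Direct Match'` label is an assumption introduced for illustration; it does not appear in the notebook.

```python
import pandas as pd

icd10_lookup = pd.DataFrame({
    'ICD10_Code': ['I10', 'E11.9'],
    'Description': ['Essential (primary) hypertension',
                    'Type 2 diabetes without complications'],
})

# Rows shaped like the final output (values copied from the tables above).
final_output = pd.DataFrame({
    'Patient_ID': [101, 102, 103, 104],
    'ICD10_Code': ['I10', 'E11.9', None, None],
    'Fuzzy_Match': [None, None,
                    'Type 2 diabetes without complications',
                    'Type 2 diabetes without complications'],
    'Match_Score': [None, None, 86, 49],
    'Action': [None, None, 'Auto-Map', 'Needs Manual Review'],
})

# 'Direct Match' is a hypothetical label for rows resolved in Step 2.
final_output['Action'] = final_output['Action'].fillna('Direct Match')

# Look up the code behind each fuzzy-matched description, but only
# for rows that cleared the confidence threshold.
description_to_code = dict(zip(icd10_lookup['Description'],
                               icd10_lookup['ICD10_Code']))
auto_mapped = final_output['Action'] == 'Auto-Map'
final_output.loc[auto_mapped, 'ICD10_Code'] = (
    final_output.loc[auto_mapped, 'Fuzzy_Match'].map(description_to_code)
)
print(final_output[['Patient_ID', 'ICD10_Code', 'Action']])
```

Rows marked `Needs Manual Review` keep an empty `ICD10_Code`, so a low-confidence guess never ends up in the standardized column.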
