# Medical Terminology Matching: Drug Name Recognition

## The Challenge

Healthcare applications frequently need to match drug names from various sources:

1. **Clinical Notes**: Doctors write "Tylenol" but the formulary lists "Acetaminophen"
2. **Prescription Processing**: Handwritten Rx might be OCR'd as "Amoxicilin" instead of "Amoxicillin"
3. **Patient Self-Reports**: "I take that blood pressure medicine... starts with L..." 
4. **Insurance Claims**: Brand names vs. generics vs. NDC codes

The stakes are high: incorrect drug matching can lead to adverse interactions, wrong dosages, or insurance denials.

## What You'll Learn

- Why **phonetic matching** (Soundex, Metaphone) excels for drug names
- Building a drug name lookup system with fuzzy search
- Handling brand vs. generic name mappings
- Using multiple algorithms together for high-precision matching
- Safety considerations for medical applications

## Dataset

We'll use the **FDA National Drug Code (NDC) Directory**, which contains:
- Brand and generic drug names
- Active ingredients
- Dosage forms and routes
- Manufacturer information

**Download**: https://open.fda.gov/apis/drug/ndc/download/

Place the downloaded `product.json` or `product.csv` in the same directory as this notebook.

In [None]:
import fuzzyrust as fr
import json
import csv
import os
import re
from collections import defaultdict

print("FuzzyRust loaded for medical terminology matching")

## 1. Why Phonetic Matching for Drug Names?

Drug names pose unique challenges:

| Challenge | Example | Why It's Hard |
|-----------|---------|---------------|
| Similar spellings | Hydroxyzine vs Hydralazine | Only 3 characters differ, completely different drugs |
| Complex phonetics | Acetaminophen pronounced "ah-SEE-tah-MIN-oh-fen" | Users spell phonetically |
| Brand confusion | Celebrex vs Celexa vs Cerebyx | Sound-alike names, the kind "tall man" lettering is meant to distinguish |
| OCR errors | "Amoxicilin" from scanned Rx | Missing letters from poor scan |

**Phonetic algorithms** like Soundex and Metaphone encode words by how they sound, not how they're spelled. This catches many drug name variations that edit distance would miss.
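
To make the encoding idea concrete, here is a minimal pure-Python sketch of classic American Soundex. It is an illustration only: `soundex_sketch` is a hypothetical helper, not FuzzyRust's implementation, and `fr.soundex` may treat edge cases differently.

```python
def soundex_sketch(word: str) -> str:
    """Simplified American Soundex: first letter plus three digits."""
    # Consonant groups that share a digit; vowels and y have no code.
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    previous = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate consonants with the same code
        digit = codes.get(ch, "")
        if digit and digit != previous:
            result += digit
        previous = digit  # a vowel resets this, so a repeated code is kept
    return (result + "000")[:4]


for a, b in [("Amoxicillin", "Amoxicilin"), ("Celebrex", "Celexa")]:
    print(a, soundex_sketch(a), b, soundex_sketch(b))
```

Under this sketch the typo pair shares a code, while Celebrex and Celexa do not, which is one reason to try more than one phonetic algorithm alongside fuzzy scores.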

In [None]:
# Demonstrate why phonetic matching helps
drug_pairs = [
    # Misspellings that sound the same
    ("Acetaminophen", "Acetamenophen"),
    ("Amoxicillin", "Amoxicilin"),
    ("Metformin", "Metformen"),
    ("Lisinopril", "Lysinopril"),
    
    # Dangerous sound-alikes (should NOT match)
    ("Celebrex", "Celexa"),      # Different drug classes!
    ("Hydroxyzine", "Hydralazine"), # Very different uses
]

print(f"{'Drug 1':<20} {'Drug 2':<20} {'Levenshtein':<12} {'Soundex Match':<14} {'Metaphone Match'}")
print("=" * 85)

for drug1, drug2 in drug_pairs:
    lev_sim = fr.levenshtein_similarity(drug1.lower(), drug2.lower())
    
    # Soundex comparison
    soundex1 = fr.soundex(drug1)
    soundex2 = fr.soundex(drug2)
    soundex_match = "YES" if soundex1 == soundex2 else "NO"
    
    # Metaphone comparison
    meta1 = fr.metaphone(drug1)
    meta2 = fr.metaphone(drug2)
    meta_match = "YES" if meta1 == meta2 else "NO"
    
    print(f"{drug1:<20} {drug2:<20} {lev_sim:.2%}        {soundex_match:<14} {meta_match}")

**Key Insight**: Phonetic matching catches misspellings such as "Acetamenophen" → "Acetaminophen" that preserve pronunciation. It may also flag dangerous sound-alikes such as "Celebrex" and "Celexa", which is useful for safety alerts.

For medical applications, we'll use **multiple matching strategies** together.
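
A short sketch shows why edit distance alone is not enough. It assumes the common normalization `1 - distance / max(len)`; the helper names are hypothetical and FuzzyRust's `levenshtein_similarity` may normalize differently.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_sketch(a: str, b: str) -> float:
    """Edit distance scaled to the 0-1 range by the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


for a, b in [("amoxicillin", "amoxicilin"), ("hydroxyzine", "hydralazine")]:
    print(f"{a} vs {b}: distance={levenshtein_distance(a, b)}, "
          f"similarity={similarity_sketch(a, b):.2f}")
```

Hydroxyzine and Hydralazine differ by three substitutions, so their similarity under this normalization is about 0.73. That is not far below the score of a harmless one-letter typo, so a single threshold on edit distance cannot reliably separate the two cases.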

## 2. Loading Drug Data

Let's load the FDA NDC data and prepare it for matching.

In [None]:
def load_ndc_data(filepath="product.json"):
    """
    Load drug data from FDA NDC JSON export.
    
    Expected structure:
    {
        "results": [
            {
                "brand_name": "TYLENOL",
                "generic_name": "ACETAMINOPHEN",
                "active_ingredients": [{"name": "...", "strength": "..."}],
                "dosage_form": "TABLET",
                "route": "ORAL",
                "labeler_name": "Johnson & Johnson"
            }
        ]
    }
    """
    if os.path.exists(filepath):
        with open(filepath, 'r', encoding='utf-8') as f:
            data = json.load(f)
            return data.get('results', data) if isinstance(data, dict) else data
    
    # Check for CSV format
    csv_path = filepath.replace('.json', '.csv')
    if os.path.exists(csv_path):
        drugs = []
        with open(csv_path, 'r', encoding='utf-8') as f:
            reader = csv.DictReader(f)
            for row in reader:
                drugs.append({
                    'brand_name': row.get('PROPRIETARYNAME', ''),
                    'generic_name': row.get('NONPROPRIETARYNAME', ''),
                    'dosage_form': row.get('DOSAGEFORMNAME', ''),
                    'route': row.get('ROUTENAME', ''),
                    'labeler_name': row.get('LABELERNAME', '')
                })
        return drugs
    
    print(f"Dataset not found at {filepath}")
    print("Download from: https://open.fda.gov/apis/drug/ndc/download/")
    print("\nUsing sample drug data for demonstration...")
    return get_sample_drugs()


def get_sample_drugs():
    """
    Sample dataset for demonstration.
    Includes common drugs with brand/generic mappings.
    """
    return [
        # Pain relievers
        {'brand_name': 'Tylenol', 'generic_name': 'Acetaminophen', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Analgesic'},
        {'brand_name': 'Advil', 'generic_name': 'Ibuprofen', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'NSAID'},
        {'brand_name': 'Aleve', 'generic_name': 'Naproxen', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'NSAID'},
        {'brand_name': 'Motrin', 'generic_name': 'Ibuprofen', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'NSAID'},
        {'brand_name': 'Bayer', 'generic_name': 'Aspirin', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'NSAID'},
        
        # Blood pressure medications
        {'brand_name': 'Prinivil', 'generic_name': 'Lisinopril', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'ACE Inhibitor'},
        {'brand_name': 'Zestril', 'generic_name': 'Lisinopril', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'ACE Inhibitor'},
        {'brand_name': 'Norvasc', 'generic_name': 'Amlodipine', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'CCB'},
        {'brand_name': 'Toprol-XL', 'generic_name': 'Metoprolol Succinate', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Beta Blocker'},
        {'brand_name': 'Lopressor', 'generic_name': 'Metoprolol Tartrate', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Beta Blocker'},
        {'brand_name': 'Cozaar', 'generic_name': 'Losartan', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'ARB'},
        {'brand_name': 'Diovan', 'generic_name': 'Valsartan', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'ARB'},
        
        # Diabetes medications
        {'brand_name': 'Glucophage', 'generic_name': 'Metformin', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Biguanide'},
        {'brand_name': 'Januvia', 'generic_name': 'Sitagliptin', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'DPP-4 Inhibitor'},
        {'brand_name': 'Jardiance', 'generic_name': 'Empagliflozin', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'SGLT2 Inhibitor'},
        {'brand_name': 'Trulicity', 'generic_name': 'Dulaglutide', 'dosage_form': 'Solution', 'route': 'Subcutaneous', 'drug_class': 'GLP-1 Agonist'},
        {'brand_name': 'Ozempic', 'generic_name': 'Semaglutide', 'dosage_form': 'Solution', 'route': 'Subcutaneous', 'drug_class': 'GLP-1 Agonist'},
        
        # Cholesterol medications
        {'brand_name': 'Lipitor', 'generic_name': 'Atorvastatin', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Statin'},
        {'brand_name': 'Crestor', 'generic_name': 'Rosuvastatin', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Statin'},
        {'brand_name': 'Zocor', 'generic_name': 'Simvastatin', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Statin'},
        {'brand_name': 'Pravachol', 'generic_name': 'Pravastatin', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Statin'},
        
        # Antibiotics
        {'brand_name': 'Amoxil', 'generic_name': 'Amoxicillin', 'dosage_form': 'Capsule', 'route': 'Oral', 'drug_class': 'Penicillin'},
        {'brand_name': 'Augmentin', 'generic_name': 'Amoxicillin/Clavulanate', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Penicillin'},
        {'brand_name': 'Zithromax', 'generic_name': 'Azithromycin', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Macrolide'},
        {'brand_name': 'Cipro', 'generic_name': 'Ciprofloxacin', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Fluoroquinolone'},
        {'brand_name': 'Levaquin', 'generic_name': 'Levofloxacin', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Fluoroquinolone'},
        {'brand_name': 'Keflex', 'generic_name': 'Cephalexin', 'dosage_form': 'Capsule', 'route': 'Oral', 'drug_class': 'Cephalosporin'},
        
        # Antidepressants
        {'brand_name': 'Prozac', 'generic_name': 'Fluoxetine', 'dosage_form': 'Capsule', 'route': 'Oral', 'drug_class': 'SSRI'},
        {'brand_name': 'Zoloft', 'generic_name': 'Sertraline', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'SSRI'},
        {'brand_name': 'Lexapro', 'generic_name': 'Escitalopram', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'SSRI'},
        {'brand_name': 'Celexa', 'generic_name': 'Citalopram', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'SSRI'},
        {'brand_name': 'Effexor', 'generic_name': 'Venlafaxine', 'dosage_form': 'Capsule', 'route': 'Oral', 'drug_class': 'SNRI'},
        {'brand_name': 'Cymbalta', 'generic_name': 'Duloxetine', 'dosage_form': 'Capsule', 'route': 'Oral', 'drug_class': 'SNRI'},
        {'brand_name': 'Wellbutrin', 'generic_name': 'Bupropion', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'NDRI'},
        
        # Anxiety/Sleep
        {'brand_name': 'Xanax', 'generic_name': 'Alprazolam', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Benzodiazepine'},
        {'brand_name': 'Ativan', 'generic_name': 'Lorazepam', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Benzodiazepine'},
        {'brand_name': 'Klonopin', 'generic_name': 'Clonazepam', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Benzodiazepine'},
        {'brand_name': 'Valium', 'generic_name': 'Diazepam', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Benzodiazepine'},
        {'brand_name': 'Ambien', 'generic_name': 'Zolpidem', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Sedative-Hypnotic'},
        
        # Proton pump inhibitors
        {'brand_name': 'Nexium', 'generic_name': 'Esomeprazole', 'dosage_form': 'Capsule', 'route': 'Oral', 'drug_class': 'PPI'},
        {'brand_name': 'Prilosec', 'generic_name': 'Omeprazole', 'dosage_form': 'Capsule', 'route': 'Oral', 'drug_class': 'PPI'},
        {'brand_name': 'Prevacid', 'generic_name': 'Lansoprazole', 'dosage_form': 'Capsule', 'route': 'Oral', 'drug_class': 'PPI'},
        {'brand_name': 'Protonix', 'generic_name': 'Pantoprazole', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'PPI'},
        
        # Allergies
        {'brand_name': 'Zyrtec', 'generic_name': 'Cetirizine', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Antihistamine'},
        {'brand_name': 'Claritin', 'generic_name': 'Loratadine', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Antihistamine'},
        {'brand_name': 'Allegra', 'generic_name': 'Fexofenadine', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Antihistamine'},
        {'brand_name': 'Benadryl', 'generic_name': 'Diphenhydramine', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Antihistamine'},
        
        # Thyroid
        {'brand_name': 'Synthroid', 'generic_name': 'Levothyroxine', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Thyroid Hormone'},
        {'brand_name': 'Levoxyl', 'generic_name': 'Levothyroxine', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Thyroid Hormone'},
        
        # Dangerous look-alikes (for safety testing)
        {'brand_name': 'Celebrex', 'generic_name': 'Celecoxib', 'dosage_form': 'Capsule', 'route': 'Oral', 'drug_class': 'COX-2 Inhibitor'},
        {'brand_name': 'Cerebyx', 'generic_name': 'Fosphenytoin', 'dosage_form': 'Solution', 'route': 'Intravenous', 'drug_class': 'Anticonvulsant'},
        {'brand_name': 'Hydroxyzine', 'generic_name': 'Hydroxyzine', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Antihistamine'},
        {'brand_name': 'Hydralazine', 'generic_name': 'Hydralazine', 'dosage_form': 'Tablet', 'route': 'Oral', 'drug_class': 'Vasodilator'},
    ]


# Load data
drugs = load_ndc_data()
print(f"Loaded {len(drugs):,} drug entries")

# Show sample
print("\nSample entries:")
for drug in drugs[:5]:
    brand = drug.get('brand_name', 'N/A')
    generic = drug.get('generic_name', 'N/A')
    print(f"  {brand} ({generic})")

## 3. Building a Drug Name Index

We'll create an index that supports multiple search strategies:

1. **Exact match**: For confident lookups
2. **Phonetic match**: Using Metaphone for sound-alike drugs
3. **Fuzzy match**: Using N-gram similarity for typos

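The key idea is to **put every strategy's score on one scale**: exact matches score 1.0, while phonetic and fuzzy similarities are multiplied by a small penalty (0.95 and 0.9) before ranking.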
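
The scoring idea can be sketched without any library. The example below uses a bigram Dice coefficient as a stand-in similarity and the same penalty weights as the matcher in this section; `bigram_similarity` and `score_candidate` are hypothetical names, and the real index may compute N-gram similarity differently.

```python
def bigrams(text: str) -> set:
    """Set of 2-character substrings, padded so word edges count too."""
    padded = f" {text.lower()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient over bigram sets, in the range 0-1."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


# Lower-confidence strategies are scaled down so all scores share one ranking.
STRATEGY_WEIGHT = {"exact": 1.0, "phonetic": 0.95, "fuzzy": 0.90}


def score_candidate(query: str, candidate: str, phonetic_codes_equal: bool):
    """Return (score, strategy) for one query/candidate pair."""
    if query.lower() == candidate.lower():
        return 1.0, "exact"
    strategy = "phonetic" if phonetic_codes_equal else "fuzzy"
    return bigram_similarity(query, candidate) * STRATEGY_WEIGHT[strategy], strategy


print(score_candidate("Tylenol", "tylenol", True))
print(score_candidate("Amoxicilin", "Amoxicillin", True))
print(score_candidate("Amoxicilin", "Amoxicillin", False))
```

Because the weights are all at most 1.0, an exact match can never be outranked by a phonetic or fuzzy candidate.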

In [None]:
def normalize_drug_name(name: str) -> str:
    """
    Normalize a drug name for matching.
    
    - Lowercase
    - Remove dosage information (e.g., "500mg")
    - Remove form indicators (e.g., "tablet", "capsule")
    - Remove common suffixes
    """
    if not name:
        return ""
    
    normalized = name.lower().strip()
    
    # Remove dosage patterns (e.g., "500mg", "10 mg", "0.5mg").
    # The \b after the unit stops "5 glucose" from losing its leading "g".
    normalized = re.sub(r'\d+\.?\d*\s*(?:(?:mg|mcg|g|ml)\b|%)', '', normalized)
    
    # Remove form indicators
    form_words = ['tablet', 'capsule', 'solution', 'suspension', 'injection',
                  'cream', 'ointment', 'gel', 'drops', 'syrup', 'oral', 'extended-release',
                  'er', 'sr', 'xl', 'cr', 'dr']
    for word in form_words:
        normalized = re.sub(rf'\b{word}\b', '', normalized)
    
    # Clean up whitespace and punctuation
    normalized = re.sub(r'[^a-z\s]', '', normalized)
    normalized = ' '.join(normalized.split())
    
    return normalized


class DrugMatcher:
    """
    Multi-strategy drug name matcher.
    
    Combines:
    - Exact matching (highest confidence)
    - Phonetic matching (Metaphone)
    - Fuzzy matching (N-gram similarity)
    """
    
    def __init__(self, drugs: list):
        self.drugs = drugs
        
        # Build lookup structures
        self.by_brand = {}           # normalized brand -> [drugs]
        self.by_generic = {}         # normalized generic -> [drugs]
        self.by_metaphone = defaultdict(list)  # metaphone code -> [drugs]
        
        # N-gram index for fuzzy search
        self.ngram_index = fr.NgramIndex(ngram_size=2)
        
        # All searchable names (for index)
        all_names = set()
        
        print("Building drug index...")
        for drug in drugs:
            brand = drug.get('brand_name', '')
            generic = drug.get('generic_name', '')
            
            # Normalize names
            norm_brand = normalize_drug_name(brand)
            norm_generic = normalize_drug_name(generic)
            
            # Brand name lookup
            if norm_brand:
                if norm_brand not in self.by_brand:
                    self.by_brand[norm_brand] = []
                self.by_brand[norm_brand].append(drug)
                all_names.add(norm_brand)
                
                # Phonetic index
                meta_brand = fr.metaphone(norm_brand)
                self.by_metaphone[meta_brand].append(('brand', drug))
            
            # Generic name lookup
            if norm_generic:
                if norm_generic not in self.by_generic:
                    self.by_generic[norm_generic] = []
                self.by_generic[norm_generic].append(drug)
                all_names.add(norm_generic)
                
                # Phonetic index
                meta_generic = fr.metaphone(norm_generic)
                self.by_metaphone[meta_generic].append(('generic', drug))
        
        # Build N-gram index with all names
        for name in all_names:
            self.ngram_index.add(name)
        
        print(f"Indexed {len(self.by_brand)} brand names")
        print(f"Indexed {len(self.by_generic)} generic names")
        print(f"Indexed {len(self.by_metaphone)} phonetic codes")
        print(f"N-gram index contains {len(self.ngram_index)} entries")
    
    def search(self, query: str, limit: int = 5) -> list:
        """
        Search for drugs matching a query.
        
        Returns list of (drug, score, match_type) tuples.
        """
        norm_query = normalize_drug_name(query)
        if not norm_query:
            return []
        
        results = []
        seen_drugs = set()  # Deduplicate by brand+generic
        
        # Strategy 1: Exact match (highest confidence)
        for drug in self.by_brand.get(norm_query, []):
            key = (drug.get('brand_name', ''), drug.get('generic_name', ''))
            if key not in seen_drugs:
                results.append((drug, 1.0, 'exact_brand'))
                seen_drugs.add(key)
        
        for drug in self.by_generic.get(norm_query, []):
            key = (drug.get('brand_name', ''), drug.get('generic_name', ''))
            if key not in seen_drugs:
                results.append((drug, 1.0, 'exact_generic'))
                seen_drugs.add(key)
        
        # Strategy 2: Phonetic match
        meta_query = fr.metaphone(norm_query)
        for name_type, drug in self.by_metaphone.get(meta_query, []):
            key = (drug.get('brand_name', ''), drug.get('generic_name', ''))
            if key not in seen_drugs:
                # Score based on Jaro-Winkler similarity (phonetic matches may have spelling diffs)
                if name_type == 'brand':
                    score = fr.jaro_winkler_similarity(norm_query, normalize_drug_name(drug.get('brand_name', '')))
                else:
                    score = fr.jaro_winkler_similarity(norm_query, normalize_drug_name(drug.get('generic_name', '')))
                results.append((drug, score * 0.95, f'phonetic_{name_type}'))  # Slight penalty for phonetic
                seen_drugs.add(key)
        
        # Strategy 3: Fuzzy N-gram match
        ngram_results = self.ngram_index.search(norm_query, min_similarity=0.5, limit=limit * 2)
        
        for match in ngram_results:
            matched_name = match.text
            
            # Find drugs with this name
            for drug in self.by_brand.get(matched_name, []) + self.by_generic.get(matched_name, []):
                key = (drug.get('brand_name', ''), drug.get('generic_name', ''))
                if key not in seen_drugs:
                    results.append((drug, match.score * 0.9, 'fuzzy'))  # Penalty for fuzzy
                    seen_drugs.add(key)
        
        # Sort by score descending
        results.sort(key=lambda x: -x[1])
        
        return results[:limit]
    
    def find_look_alikes(self, drug_name: str, threshold: float = 0.7) -> list:
        """
        Find drugs that look or sound similar to a given drug.
        
        Useful for safety checks (detecting potential mix-ups).
        """
        norm_query = normalize_drug_name(drug_name)
        meta_query = fr.metaphone(norm_query)
        
        look_alikes = []
        seen = set()  # Deduplicate by brand+generic across both strategies

        # Check phonetic matches: keep codes whose similarity to the query code is >= 0.6
        for code, entries in self.by_metaphone.items():
            if fr.levenshtein_similarity(meta_query, code) < 0.6:
                continue
            for name_type, drug in entries:
                name = drug.get('brand_name' if name_type == 'brand' else 'generic_name', '')
                norm_name = normalize_drug_name(name)
                if norm_name == norm_query:
                    continue  # Skip the queried name itself
                key = (drug.get('brand_name', ''), drug.get('generic_name', ''))
                jw_sim = fr.jaro_winkler_similarity(norm_query, norm_name)
                if jw_sim >= threshold and key not in seen:
                    look_alikes.append((drug, jw_sim, 'sound_alike'))
                    seen.add(key)

        # Check visual similarity (N-gram overlap)
        all_matches = self.ngram_index.search(norm_query, min_similarity=threshold, limit=20)

        for match in all_matches:
            if match.text != norm_query:
                for drug in self.by_brand.get(match.text, []) + self.by_generic.get(match.text, []):
                    key = (drug.get('brand_name', ''), drug.get('generic_name', ''))
                    if key not in seen:
                        look_alikes.append((drug, match.score, 'look_alike'))
                        seen.add(key)

        # Sort by similarity, highest first
        look_alikes.sort(key=lambda x: -x[1])
        return look_alikes[:10]


# Build the matcher
drug_matcher = DrugMatcher(drugs)

## 4. Testing Drug Name Search

Let's test with various query types:

- **Exact queries**: "Tylenol", "Acetaminophen"
- **Misspellings**: "Amoxicilin", "Metformen"
- **Partial names**: "Lisi" (for Lisinopril)
- **Brand vs Generic confusion**: "What's the generic for Lipitor?"
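
Partial names are the hardest of these cases for similarity scores, because a short prefix shares only a few N-grams with the full name. One complementary option, not used in this notebook and shown only as a sketch, is a prefix lookup over a sorted list of normalized (lowercase) names. `prefix_matches` is a hypothetical helper.

```python
from bisect import bisect_left


def prefix_matches(sorted_names: list, prefix: str, limit: int = 5) -> list:
    """Return up to `limit` names starting with `prefix`.

    `sorted_names` must be lowercase and sorted, so that all names sharing
    a prefix sit next to each other and can be found by binary search.
    """
    prefix = prefix.lower()
    start = bisect_left(sorted_names, prefix)
    matches = []
    for name in sorted_names[start:]:
        if not name.startswith(prefix):
            break  # past the block of names sharing the prefix
        matches.append(name)
        if len(matches) == limit:
            break
    return matches


names = sorted(["lisinopril", "losartan", "metformin",
                "metoprolol succinate", "metoprolol tartrate"])
print(prefix_matches(names, "Lisi"))
print(prefix_matches(names, "Metopro"))
```

A prefix with several completions, such as "Metopro", returns every candidate, which suits a confirmation prompt better than a silent best guess.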

In [None]:
# Test queries
test_queries = [
    # Exact matches
    "Tylenol",
    "Acetaminophen",
    "Lisinopril",
    
    # Common misspellings
    "Amoxicilin",      # Missing 'l'
    "Metformen",       # Wrong vowel
    "Atorvastaten",    # Wrong ending
    "Lysinopril",      # Wrong first vowel
    
    # Phonetic variants
    "Acetamenophen",   # Phonetically identical
    "Semaglutyde",     # Different ending pronunciation
    
    # Partial / truncated
    "Metopro",         # Partial name
]

print(f"{'Query':<20} {'Top Match':<25} {'Type':<15} {'Score':<8}")
print("=" * 75)

for query in test_queries:
    results = drug_matcher.search(query, limit=1)
    
    if results:
        drug, score, match_type = results[0]
        brand = drug.get('brand_name', 'N/A')
        generic = drug.get('generic_name', '')
        display = f"{brand} ({generic[:15]}...)" if len(generic) > 15 else f"{brand} ({generic})"
        print(f"{query:<20} {display:<25} {match_type:<15} {score:.2%}")
    else:
        print(f"{query:<20} {'NO MATCH':<25}")

## 5. Brand to Generic Mapping

A common use case is resolving brand names to their generic equivalents (or vice versa). This is critical for:

- Insurance formulary checks
- Generic substitution
- Drug interaction databases (often use generic names)

In [None]:
class DrugResolver:
    """
    Resolves drug names to canonical forms with brand/generic mappings.
    """
    
    def __init__(self, matcher: DrugMatcher):
        self.matcher = matcher
        
        # Build brand <-> generic mappings
        self.brand_to_generic = {}  # brand -> generic
        self.generic_to_brands = defaultdict(list)  # generic -> [brands]
        
        for drug in matcher.drugs:
            brand = drug.get('brand_name', '').strip()
            generic = drug.get('generic_name', '').strip()
            
            if brand and generic:
                norm_brand = normalize_drug_name(brand)
                norm_generic = normalize_drug_name(generic)
                
                self.brand_to_generic[norm_brand] = generic
                if brand not in self.generic_to_brands[norm_generic]:
                    self.generic_to_brands[norm_generic].append(brand)
    
    def resolve(self, query: str) -> dict:
        """
        Resolve a drug query to its canonical form.
        
        Returns None if nothing matches, otherwise:
            {
                'query': original query,
                'matched_brand': brand name of the matched entry,
                'generic_name': canonical generic name,
                'brand_names': list of known brand names,
                'confidence': match confidence (0-1),
                'match_type': which strategy produced the match,
                'drug_class': therapeutic class (if available)
            }
        """
        results = self.matcher.search(query, limit=1)
        
        if not results:
            return None
        
        drug, score, match_type = results[0]
        
        generic = drug.get('generic_name', '')
        norm_generic = normalize_drug_name(generic)
        
        return {
            'query': query,
            'matched_brand': drug.get('brand_name', ''),
            'generic_name': generic,
            'brand_names': self.generic_to_brands.get(norm_generic, []),
            'confidence': score,
            'match_type': match_type,
            'drug_class': drug.get('drug_class', 'Unknown')
        }
    
    def get_generic(self, brand_name: str) -> str:
        """Get generic name for a brand."""
        norm = normalize_drug_name(brand_name)
        return self.brand_to_generic.get(norm)
    
    def get_brands(self, generic_name: str) -> list:
        """Get brand names for a generic."""
        norm = normalize_drug_name(generic_name)
        return self.generic_to_brands.get(norm, [])


# Create resolver
resolver = DrugResolver(drug_matcher)

# Test resolution
print("Drug Resolution Examples:")
print("=" * 60)

test_drugs = ["Tylenol", "Acetaminophen", "Lipitor", "Metformin", "Amoxicilin"]

for drug in test_drugs:
    result = resolver.resolve(drug)
    if result:
        print(f"\nQuery: {drug}")
        print(f"  Generic: {result['generic_name']}")
        print(f"  Brands: {', '.join(result['brand_names'][:3])}")
        print(f"  Class: {result['drug_class']}")
        print(f"  Confidence: {result['confidence']:.1%}")

## 6. Safety: Detecting Look-Alike Sound-Alike (LASA) Drugs

One of the most important applications of drug name matching is **safety checking**. 

Look-Alike Sound-Alike (LASA) drugs are a major source of medication errors. The FDA and the Institute for Safe Medication Practices (ISMP) publish lists of commonly confused drug names. Examples of such pairs:

| Drug 1 | Drug 2 | Risk |
|--------|--------|------|
| Celebrex | Celexa | Anti-inflammatory vs antidepressant |
| Hydroxyzine | Hydralazine | Antihistamine vs blood pressure |
| Metformin | Metronidazole | Diabetes vs antibiotic |
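
The core of a LASA check can be sketched on its own before it is wired into an index: compare every pair of names and keep the similar ones that belong to different drug classes. The sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity; `flag_lasa_pairs` and the 0.65 threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def flag_lasa_pairs(drugs: list, threshold: float = 0.65) -> list:
    """Return (name_a, name_b, similarity) for similar names in different classes.

    `drugs` is a list of (name, drug_class) tuples.
    """
    flagged = []
    for (name_a, class_a), (name_b, class_b) in combinations(drugs, 2):
        if class_a == class_b:
            continue  # same class: a mix-up is less likely to be dangerous
        similarity = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
        if similarity >= threshold:
            flagged.append((name_a, name_b, round(similarity, 3)))
    return sorted(flagged, key=lambda item: -item[2])


sample = [
    ("Celebrex", "COX-2 Inhibitor"),
    ("Celexa", "SSRI"),
    ("Hydroxyzine", "Antihistamine"),
    ("Hydralazine", "Vasodilator"),
    ("Zoloft", "SSRI"),
]
for name_a, name_b, similarity in flag_lasa_pairs(sample):
    print(f"{name_a} / {name_b}: {similarity}")
```

String similarity will not catch every confusable pair, so a curated list of known pairs is a sensible complement to a detector like this.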

Let's build a LASA detector:

In [None]:
def check_lasa_risk(drug_matcher: DrugMatcher, drug_name: str) -> dict:
    """
    Check if a drug has look-alike/sound-alike risks.
    
    Returns warnings for drugs that could be confused.
    """
    look_alikes = drug_matcher.find_look_alikes(drug_name, threshold=0.65)
    
    # Find the queried drug's class so same-class matches can be filtered out
    results = drug_matcher.search(drug_name, limit=1)
    query_class = results[0][0].get('drug_class', '') if results else ''
    
    warnings = []
    for drug, score, match_type in look_alikes:
        drug_class = drug.get('drug_class', '')
        # Only warn if different class (actual danger)
        if drug_class and drug_class != query_class:
            warnings.append({
                'drug': drug.get('brand_name') or drug.get('generic_name'),
                'generic': drug.get('generic_name', ''),
                'class': drug_class,
                'similarity': score,
                'risk_type': match_type
            })
    
    return {
        'queried_drug': drug_name,
        'queried_class': query_class,
        'lasa_warnings': warnings[:5]  # Top 5 risks
    }


# Test LASA detection
test_drugs = ["Celebrex", "Hydroxyzine", "Metformin", "Celexa"]

print("LASA (Look-Alike Sound-Alike) Risk Analysis")
print("=" * 70)

for drug in test_drugs:
    result = check_lasa_risk(drug_matcher, drug)
    
    print(f"\n{drug} ({result['queried_class']}):")
    
    if result['lasa_warnings']:
        for warn in result['lasa_warnings']:
            print(f"  WARNING: Similar to {warn['drug']} ({warn['class']})")
            print(f"           Similarity: {warn['similarity']:.1%}, Type: {warn['risk_type']}")
    else:
        print("  No significant LASA risks detected")

## 7. Production-Ready Drug Lookup

Let's combine everything into a production-ready drug lookup system with:

- Confidence thresholds for different use cases
- LASA safety warnings
- Brand/generic resolution
- Structured output for integration
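
Confidence thresholds are easiest to audit when they live in one table. The sketch below maps a score to a level with `bisect` and then to a handling policy; the action names (`auto_accept` and so on) are hypothetical labels for illustration, not part of any library.

```python
from bisect import bisect_right

# Lower bounds of LOW, MEDIUM and HIGH; anything below the first is REJECTED.
THRESHOLDS = [0.65, 0.80, 0.95]
LEVELS = ["REJECTED", "LOW", "MEDIUM", "HIGH"]

# Hypothetical handling policy per confidence level.
ACTIONS = {
    "HIGH": "auto_accept",
    "MEDIUM": "suggest_and_confirm",
    "LOW": "manual_review",
    "REJECTED": "discard",
}


def confidence_level(score: float) -> str:
    """Map a 0-1 match score to a confidence level (bounds are inclusive)."""
    return LEVELS[bisect_right(THRESHOLDS, score)]


def route(score: float) -> str:
    """Choose how a match with this score should be handled."""
    return ACTIONS[confidence_level(score)]


for score in (0.99, 0.95, 0.87, 0.70, 0.40):
    print(f"{score:.2f} -> {confidence_level(score)} -> {route(score)}")
```

Keeping the boundaries in a single list means a stricter deployment can change one line instead of a chain of comparisons.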

In [None]:
class DrugLookupService:
    """
    Production drug lookup service.
    
    Confidence levels:
    - HIGH (>= 0.95): Exact or near-exact match, safe for automated processing
    - MEDIUM (0.80 to < 0.95): Good match, suitable for suggestions with confirmation
    - LOW (0.65 to < 0.80): Possible match, requires manual verification
    - REJECTED (< 0.65): Too uncertain for medical use
    """
    
    def __init__(self, drugs: list):
        self.matcher = DrugMatcher(drugs)
        self.resolver = DrugResolver(self.matcher)
    
    def lookup(self, query: str, include_safety: bool = True) -> dict:
        """
        Look up a drug with full analysis.
        """
        # Get top matches
        matches = self.matcher.search(query, limit=3)
        
        if not matches:
            return {
                'status': 'NOT_FOUND',
                'query': query,
                'message': 'No matching drugs found',
                'suggestions': []
            }
        
        top_drug, top_score, match_type = matches[0]
        
        # Determine confidence level
        if top_score >= 0.95:
            confidence = 'HIGH'
            status = 'MATCHED'
        elif top_score >= 0.80:
            confidence = 'MEDIUM'
            status = 'PROBABLE_MATCH'
        elif top_score >= 0.65:
            confidence = 'LOW'
            status = 'POSSIBLE_MATCH'
        else:
            confidence = 'REJECTED'
            status = 'UNCERTAIN'
        
        result = {
            'status': status,
            'query': query,
            'confidence': confidence,
            'score': top_score,
            'match_type': match_type,
            'drug': {
                'brand_name': top_drug.get('brand_name', ''),
                'generic_name': top_drug.get('generic_name', ''),
                'drug_class': top_drug.get('drug_class', 'Unknown'),
                'dosage_form': top_drug.get('dosage_form', ''),
                'route': top_drug.get('route', '')
            },
            'alternatives': [
                {
                    'brand': d.get('brand_name', ''),
                    'generic': d.get('generic_name', ''),
                    'score': s
                }
                for d, s, _ in matches[1:]
            ]
        }
        
        # Add safety warnings if requested
        if include_safety and top_score >= 0.65:
            lasa = check_lasa_risk(self.matcher, top_drug.get('brand_name') or top_drug.get('generic_name'))
            if lasa['lasa_warnings']:
                result['safety_warnings'] = [
                    f"LASA risk: May be confused with {w['drug']} ({w['class']})"
                    for w in lasa['lasa_warnings'][:3]
                ]
        
        return result


# Initialize service
drug_service = DrugLookupService(drugs)

# Test the service
print("Drug Lookup Service Demo")
print("=" * 70)

test_queries = [
    "Tylenol",           # Exact match
    "Amoxicilin",        # Misspelling
    "Celebrex",          # Has LASA risk
    "Metformin 500mg",   # With dosage
    "XYZ123",            # No match
]

for query in test_queries:
    result = drug_service.lookup(query)
    
    print(f"\nQuery: '{query}'")
    print(f"  Status: {result['status']} (Confidence: {result.get('confidence', 'N/A')})")
    
    if 'drug' in result:
        drug = result['drug']
        print(f"  Match: {drug['brand_name']} ({drug['generic_name']})")
        print(f"  Class: {drug['drug_class']}")
        print(f"  Score: {result['score']:.1%}")
    
    if result.get('safety_warnings'):
        for warn in result['safety_warnings']:
            print(f"  SAFETY: {warn}")

## 8. Batch Processing for Clinical NLP

When processing clinical notes at scale, we need efficient batch operations.

Let's process a simulated batch of drug mentions from clinical notes:

In [None]:
def process_clinical_notes(drug_service: DrugLookupService, mentions: list) -> dict:
    """
    Process drug mentions extracted from clinical notes.
    
    Returns statistics and flagged items for review.
    """
    results = {
        'HIGH': [],
        'MEDIUM': [],
        'LOW': [],
        'NOT_FOUND': [],
        'safety_alerts': []
    }
    
    for mention in mentions:
        lookup = drug_service.lookup(mention, include_safety=True)
        
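        confidence = lookup.get('confidence', 'NOT_FOUND')
        # Only HIGH/MEDIUM/LOW results that carry a drug payload are bucketed;
        # anything else (REJECTED, NOT_FOUND, missing keys) counts as unresolved.
        resolved = (
            lookup.get('status') != 'NOT_FOUND'
            and confidence in ('HIGH', 'MEDIUM', 'LOW')
            and 'drug' in lookup
        )
        if not resolved:
            results['NOT_FOUND'].append(mention)
        else:
            results[confidence].append({
                'mention': mention,
                'resolved': lookup['drug']['generic_name'],
                'score': lookup['score']
            })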
        
        # Track safety alerts
        if lookup.get('safety_warnings'):
            results['safety_alerts'].append({
                'mention': mention,
                'warnings': lookup['safety_warnings']
            })
    
    return results


# Simulated drug mentions from clinical notes
clinical_mentions = [
    # Clean mentions
    "Metformin",
    "Lisinopril",
    "Atorvastatin",
    "Omeprazole",
    
    # Brand names
    "Lipitor",
    "Zoloft",
    "Nexium",
    
    # Misspellings (OCR or transcription errors)
    "Amoxicilin",
    "Metoprolal",
    "Acetomenophen",
    
    # With dosage (should still match)
    "Metformin 500mg",
    "Lisinopril 10mg tablet",
    
    # LASA risks
    "Celebrex",
    "Hydroxyzine",
    
    # Unresolvable
    "vitamin supplement",
    "herbal tea",
]

results = process_clinical_notes(drug_service, clinical_mentions)

# Print summary
print("Clinical Notes Processing Summary")
print("=" * 50)
print(f"Total mentions processed: {len(clinical_mentions)}")
print()
print(f"HIGH confidence:    {len(results['HIGH'])}")
print(f"MEDIUM confidence:  {len(results['MEDIUM'])}")
print(f"LOW confidence:     {len(results['LOW'])}")
print(f"Not found:          {len(results['NOT_FOUND'])}")
print()
print(f"Safety alerts:      {len(results['safety_alerts'])}")

if results['safety_alerts']:
    print("\nSafety Alerts:")
    for alert in results['safety_alerts']:
        print(f"  {alert['mention']}: {alert['warnings'][0]}")

if results['NOT_FOUND']:
    print(f"\nUnresolved mentions: {results['NOT_FOUND']}")

## 9. Production Considerations

### Threshold Guidelines for Medical Use

| Use Case | Minimum Score (Level) | Rationale |
|----------|-----------------------|-----------|
| Drug interaction check | 0.95 (HIGH) | Safety-critical, must be certain |
| Formulary lookup | 0.85 (MEDIUM+) | Cost implications, but can verify |
| Clinical note NLP | 0.80 (MEDIUM) | Flagged for human review anyway |
| Auto-complete suggestion | 0.65 (LOW) | User will select correct option |
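
One way to apply these cut-offs in code is a small gate keyed by use case. The sketch below is illustrative only: `USE_CASE_MIN_SCORE` and `is_acceptable` are hypothetical names, and the numbers are copied from the table rather than derived from validation data.

```python
# Hypothetical gate: minimum match score per use case, copied from the table above.
# Tune these values on labelled data before relying on them.
USE_CASE_MIN_SCORE = {
    "interaction_check": 0.95,
    "formulary_lookup": 0.85,
    "clinical_note_nlp": 0.80,
    "autocomplete": 0.65,
}


def is_acceptable(score: float, use_case: str) -> bool:
    """Return True when a match score meets the minimum for the given use case."""
    return score >= USE_CASE_MIN_SCORE[use_case]


# The same score can pass one workflow and fail a stricter one.
print(is_acceptable(0.88, "formulary_lookup"))   # True
print(is_acceptable(0.88, "interaction_check"))  # False
```

Keeping the thresholds in one mapping makes it easier to review and audit them than scattering literals through the lookup code.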

### Algorithm Selection

- **Exact + Phonetic**: Best for patient self-reports ("I take the blood pressure pill that sounds like...")
- **N-gram**: Best for OCR errors and transcription mistakes
- **Jaro-Winkler**: Best for comparing two specific drug names
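
To see why n-grams tolerate OCR-style errors, here is a minimal pure-Python sketch of character-trigram similarity. It uses Jaccard overlap of padded trigram sets, which is an assumption made for illustration; the library's n-gram scorer may use a different formula. The helper names `char_ngrams` and `ngram_similarity` are hypothetical.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Return the set of character n-grams of a lowercased, space-padded string."""
    pad = " " * (n - 1)
    padded = f"{pad}{text.lower()}{pad}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two n-gram sets, in the range [0, 1]."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# A dropped letter removes only a few trigrams, so the score stays high,
# while an unrelated drug name shares almost none.
print(ngram_similarity("Amoxicilin", "Amoxicillin"))
print(ngram_similarity("Amoxicilin", "Metformin"))
```

A single missing character disturbs only the trigrams that overlap it, which is why this family of measures degrades gracefully on scanned prescriptions.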

### Performance

- **N-gram index**: Sub-millisecond lookups for 100K+ drugs
- **Metaphone hashing**: Average-case O(1) phonetic lookups once each drug's code is stored in a hash map
- **Memory**: ~1MB per 10K drug records with full metadata
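
The constant-time phonetic lookup comes from precomputing a code per drug name and using it as a dictionary key. The sketch below swaps in a hand-written American Soundex encoder instead of Metaphone, purely because it fits in a few lines; `soundex`, `phonetic_index`, and `phonetic_lookup` are hypothetical names, not part of any library.

```python
from collections import defaultdict

# Soundex digit groups; vowels, y, h and w carry no digit.
_CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        _CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the 4-character American Soundex code, or '' for empty input."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = []
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (letters[0].upper() + "".join(encoded) + "000")[:4]


# Build the index once; each query is then a single dictionary access.
phonetic_index = defaultdict(list)
for name in ["Amoxicillin", "Metformin", "Metoprolol", "Lisinopril"]:
    phonetic_index[soundex(name)].append(name)


def phonetic_lookup(query: str) -> list:
    """Return every indexed drug name that shares the query's Soundex code."""
    return phonetic_index.get(soundex(query), [])


print(phonetic_lookup("Amoxicilin"))  # ['Amoxicillin']
print(phonetic_lookup("Metoprolal"))  # ['Metformin', 'Metoprolol']
```

The second lookup shows that a phonetic bucket is only a candidate set: Metformin and Metoprolol share the code M316, so candidates still need re-ranking with a finer similarity score before anything is accepted.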

### Compliance Notes

For FDA-regulated applications:
- Log all fuzzy matches with confidence scores
- Require human verification for MEDIUM/LOW confidence
- Implement LASA checks as part of verification workflow
- Consider using the FDA's official NDC database as source of truth
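
The first two points can be combined into a structured audit record. This is a minimal sketch using only the standard library; the logger name, the field names, and the `audit_record` helper are assumptions for illustration, and the example score is made up rather than produced by the matcher.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(message)s")
audit_log = logging.getLogger("drug_match_audit")  # hypothetical logger name


def audit_record(query: str, matched: str, score: float, confidence: str) -> dict:
    """Log one fuzzy match as a JSON line and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "matched": matched,
        "score": round(score, 4),
        "confidence": confidence,
        # Anything below HIGH is routed to a human reviewer.
        "needs_review": confidence != "HIGH",
    }
    audit_log.info(json.dumps(record))
    return record


# Illustrative values only.
entry = audit_record("Amoxicilin", "Amoxicillin", 0.91, "MEDIUM")
print(entry["needs_review"])  # True
```

Writing one JSON object per line keeps the log easy to filter later, for example to pull every match that was flagged for review.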

## Summary

In this guide, we built a comprehensive medical terminology matching system:

1. **Phonetic matching**: Soundex and Metaphone for sound-alike drugs
2. **Multi-strategy search**: Exact → Phonetic → Fuzzy fallback
3. **Brand/Generic resolution**: Map between drug name forms
4. **LASA safety checks**: Detect dangerous look-alike pairs
5. **Production service**: Confidence levels and structured output

### Key Takeaways

- **Phonetic algorithms** excel for medical terms where users spell phonetically
- **Multiple strategies** combined provide robust matching
- **Confidence thresholds** must be higher for safety-critical applications
- **LASA detection** is a critical safety feature, not just fuzzy matching
- **Always normalize** drug names before matching (remove dosages, forms)
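
As a reminder of what that normalization can look like, here is a small regex-based sketch. It is not the normalizer used earlier in this notebook; the unit and dosage-form lists are deliberately short, hypothetical examples that a real system would extend.

```python
import re

# Strength such as "500mg", "0.5 g" or "10 units" (illustrative unit list).
DOSAGE_RE = re.compile(r"\b\d+(?:\.\d+)?\s*(?:mg|mcg|g|ml|iu|units?)\b", re.IGNORECASE)
# A few common dosage-form words (illustrative, not exhaustive).
FORM_RE = re.compile(
    r"\b(?:tablets?|tabs?|capsules?|caps?|injection|solution|cream|oral)\b",
    re.IGNORECASE,
)


def normalize_drug_name(text: str) -> str:
    """Strip strengths and dosage forms, collapse whitespace, and lowercase."""
    text = DOSAGE_RE.sub(" ", text)
    text = FORM_RE.sub(" ", text)
    return " ".join(text.split()).lower()


print(normalize_drug_name("Metformin 500mg"))         # metformin
print(normalize_drug_name("Lisinopril 10mg tablet"))  # lisinopril
```

Normalizing before matching means the similarity score reflects the drug name itself rather than the length of the trailing strength and form text.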

### When to Use This Approach

- Clinical NLP / note processing
- Prescription processing and verification
- Patient self-report intake
- Drug interaction checking
- Insurance claim processing