# 🧬 EHR → CRF Mapping Demo (Advanced Synthetic Pipeline)

This notebook demonstrates a synthetic example of mapping EHR-like data into CRF structures using:

✔ CDASH-style variables  
✔ JSON-based configuration mapping  
✔ Fuzzy matching to handle inconsistent field names  
✔ Unit normalization  
✔ Metadata assignment (attribute, LOINC code)  

⚠ **Note:** Everything here is synthetic and safe; no company or proprietary data is used.


## 🔧 Step 1: Import Libraries

In [None]:
import pandas as pd
import numpy as np
import json
from fuzzywuzzy import fuzz
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

## 📂 Step 2: Load Synthetic EHR Dataset

In [None]:
ehr = pd.read_csv('../data/ehr_synthetic.csv')
ehr.head()
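
The CSV itself is not shown in this notebook. As a rough sketch of what a compatible synthetic file could look like, the snippet below generates a few rows using only the standard library. Only the `hemoglobin` column comes from the notebook (Step 7 plots it); `patient_id`, `height_cm`, `weight_kg`, the value ranges, and the helper names `make_synthetic_ehr` and `rows_to_csv` are assumptions made for illustration.

```python
import csv
import io
import random


def make_synthetic_ehr(n_rows=5, seed=42):
    """Return a list of dict rows with made-up, EHR-like values.

    All column names except 'hemoglobin' are hypothetical.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    rows = []
    for i in range(1, n_rows + 1):
        rows.append({
            "patient_id": f"SYN-{i:04d}",
            "hemoglobin": round(rng.uniform(10.0, 17.0), 1),  # assumed g/dL
            "height_cm": round(rng.uniform(150.0, 195.0), 1),
            "weight_kg": round(rng.uniform(50.0, 110.0), 1),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text, header line first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


synthetic_rows = make_synthetic_ehr()
csv_text = rows_to_csv(synthetic_rows)
```

Writing `csv_text` to `../data/ehr_synthetic.csv` would give `pd.read_csv` something to load; `pd.DataFrame(synthetic_rows)` is an in-memory alternative.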

## 📂 Step 3: Load Mapping Configuration (JSON)

In [None]:
with open('../config/mapping.json') as f:
    mapping = json.load(f)

mapping
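
The contents of `mapping.json` are not shown. Judging only from the keys that `apply_mapping` reads in Step 5 (`cdash_variable`, `keywords`, `unit`, `attribute`, `loinc_code`), an entry might look roughly like the one below. The concrete values and the `validate_mapping` helper are illustrative assumptions, not the notebook's actual configuration.

```python
import json

# Hypothetical config text: the keys mirror those read in Step 5,
# the values are illustrative only.
EXAMPLE_MAPPING_JSON = """
{
  "hemoglobin": {
    "cdash_variable": "LBORRES",
    "keywords": ["hemoglobin", "hgb", "hb"],
    "unit": "g/dL",
    "attribute": "VALUE",
    "loinc_code": "718-7"
  }
}
"""


def validate_mapping(mapping):
    """Return a list of readable problems; an empty list means the config looks usable."""
    problems = []
    for ehr_field, entry in mapping.items():
        # 'cdash_variable' is the only key Step 5 accesses without a default
        if "cdash_variable" not in entry:
            problems.append(f"{ehr_field}: missing 'cdash_variable'")
        if not isinstance(entry.get("keywords", []), list):
            problems.append(f"{ehr_field}: 'keywords' should be a list")
    return problems


example_mapping = json.loads(EXAMPLE_MAPPING_JSON)
problems = validate_mapping(example_mapping)
```

Running a check like this right after `json.load` surfaces a missing `cdash_variable` before the mapping loop raises a `KeyError` on it.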

## 🤖 Step 4: Helper Functions: Fuzzy Matching + Unit Normalization

In [None]:
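def fuzzy_match(field, candidates, threshold=80):
    """Return the candidate most similar to `field`, or None if below `threshold`."""
    best = None
    best_score = 0
    for c in candidates:
        # fuzz.ratio returns an integer similarity score from 0 to 100
        score = fuzz.ratio(field.lower(), c.lower())
        if score > best_score:
            best = c
            best_score = score
    return best if best_score >= threshold else None

def normalize_unit(value, from_unit, to_unit):
    """Convert `value` from `from_unit` to `to_unit` for the known unit pairs."""
    if from_unit == to_unit:
        return value

    # Example conversions
    if from_unit == 'cm' and to_unit == 'm':
        return value / 100

    if from_unit == 'kg' and to_unit == 'g':
        return value * 1000

    # Unknown unit pair: the value is returned unconverted
    return value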
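
`normalize_unit` returns the value unchanged when it does not know a conversion, which can leave values in the wrong unit without any signal. One possible alternative, sketched here under the assumption that failing loudly is preferable, keeps the factors in a lookup table and raises on unknown pairs. `CONVERSION_FACTORS` and `convert_unit` are hypothetical names; the factors are the two from the function above plus their inverses.

```python
# Multiplicative factors keyed by (from_unit, to_unit).
CONVERSION_FACTORS = {
    ("cm", "m"): 0.01,
    ("m", "cm"): 100.0,
    ("kg", "g"): 1000.0,
    ("g", "kg"): 0.001,
}


def convert_unit(value, from_unit, to_unit):
    """Convert value between units, raising if the pair is not in the table."""
    if from_unit == to_unit:
        return value
    try:
        factor = CONVERSION_FACTORS[(from_unit, to_unit)]
    except KeyError:
        raise ValueError(
            f"No conversion defined from {from_unit!r} to {to_unit!r}"
        ) from None
    return value * factor
```

Adding a new unit pair then means adding one table entry rather than another `if` branch.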

## 🔄 Step 5: Mapping Function (Core Logic)

In [None]:
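def apply_mapping(ehr_df, mapping):
    columns = list(ehr_df.columns)

    # Resolve each configured field to an actual column once, up front:
    # exact name first, then a fuzzy match on the configured keywords
    source_cols = {}
    for ehr_field, m in mapping.items():
        col = ehr_field if ehr_field in columns else None
        for keyword in m.get('keywords', []):
            if col is None:
                col = fuzzy_match(keyword, columns)
        source_cols[ehr_field] = col

    mapped_rows = []

    for _, row in ehr_df.iterrows():
        mapped = {}

        for ehr_field, m in mapping.items():
            col = source_cols[ehr_field]
            value = row[col] if col is not None else np.nan

            # Normalize units if applicable. 'source_unit' is an assumed,
            # optional config key; without it no conversion takes place
            to_unit = m.get('unit', None)
            from_unit = m.get('source_unit', to_unit)
            if to_unit and not pd.isna(value):
                value = normalize_unit(value, from_unit, to_unit)

            # Assign mapped value
            mapped[m['cdash_variable']] = value

            # Metadata
            mapped[f"{m['cdash_variable']}_attr"] = m.get('attribute', 'VALUE')
            mapped[f"{m['cdash_variable']}_code"] = m.get('loinc_code', None)

        mapped_rows.append(mapped)

    return pd.DataFrame(mapped_rows)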
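
Fuzzy matching is meant to cope with inconsistent field names. As a small standalone sketch of resolving a configured keyword to an actual column name, the snippet below uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so scores may differ slightly from fuzzywuzzy's. `similarity`, `resolve_column`, and the sample column names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity score on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def resolve_column(columns, keywords, threshold=80):
    """Return the column that best matches any keyword, or None below threshold."""
    best_col, best_score = None, 0
    for col in columns:
        for kw in keywords:
            score = similarity(col, kw)
            if score > best_score:
                best_col, best_score = col, score
    return best_col if best_score >= threshold else None


columns = ["patient_id", "hemoglobin_gdl", "weight_kg"]
match = resolve_column(columns, ["hemoglobin"])
no_match = resolve_column(columns, ["platelets"])
```

With the threshold at 80, `match` resolves to `hemoglobin_gdl`, while `no_match` stays `None` because no column resembles `platelets`.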

## 🧪 Step 6: Apply Mapping

In [None]:
crf = apply_mapping(ehr, mapping)
crf.head()
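
Since every mapped variable is written together with `_attr` and `_code` companion columns, a quick structural check on the output's column names can catch a malformed result early. This is a minimal sketch; `find_missing_companions` is a hypothetical helper and the sample column list is made up.

```python
def find_missing_companions(columns, suffixes=("_attr", "_code")):
    """List the companion columns that are expected but absent.

    A column counts as a base variable if it does not itself end
    with one of the suffixes.
    """
    present = set(columns)
    missing = []
    for col in columns:
        if col.endswith(suffixes):
            continue  # companion column, not a base variable
        for suffix in suffixes:
            if col + suffix not in present:
                missing.append(col + suffix)
    return missing


sample_columns = ["LBORRES", "LBORRES_attr", "LBORRES_code", "VSORRES", "VSORRES_attr"]
missing = find_missing_companions(sample_columns)
```

In the notebook this could be called as `find_missing_companions(list(crf.columns))`.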

## 📊 Step 7: Visualize Before/After Mapping

In [None]:
# Compare a few mapped fields visually
fig, axs = plt.subplots(1, 2, figsize=(12, 5))

sns.histplot(ehr['hemoglobin'], kde=True, ax=axs[0])
axs[0].set_title('Raw Hemoglobin Distribution')

sns.histplot(crf['LBORRES'], kde=True, ax=axs[1])
axs[1].set_title('Mapped LBORRES (Hemoglobin)')

plt.tight_layout()

## 💾 Step 8: Save Output CRF Dataset

In [None]:
# Keep the path in one place so the message always matches the file written
output_path = '../data/output_crf_dataset.csv'
crf.to_csv(output_path, index=False)
print(f'CRF dataset saved to {output_path}')

# 🎉 Completed!

You now have an **advanced synthetic EHR→CRF mapping pipeline** demonstrating:

- JSON-based configuration mapping  
- CDASH-style variable attribution  
- LOINC-code metadata  
- Fuzzy matching  
- Unit normalization  
- CRF dataset generation  
- Before/after visualizations  

Because every input is synthetic, the whole pipeline can be shared publicly, for example in a portfolio for healthcare data science roles, without touching data covered by an NDA.
