<a href="https://colab.research.google.com/github/sppandlkk/healthcare-nlp-llm-pipelines/blob/main/notebooks/01_deid_clinical_notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# De-identification of Clinical Notes using NLP
This project demonstrates de-identification of clinical notes using multiple natural language processing (NLP) techniques. We create synthetic clinical notes containing protected health information (PHI), manually annotate the entities, and evaluate the performance of various models, including Microsoft Presidio and HuggingFace NER models. The evaluation framework is designed to benchmark open-source models as well as proprietary/vendor solutions such as Tonic.ai and Private.ai. Although this notebook uses synthetic data, the same framework can be applied to real-world EHR data to assess entity detection performance and robustness.

In [None]:
# Install Python packages and import dependencies
!pip install transformers[sentencepiece]
!pip install presidio-analyzer presidio-anonymizer
import pandas as pd
from transformers import pipeline
from presidio_analyzer import AnalyzerEngine
import matplotlib.pyplot as plt
import seaborn as sns

## Create Synthetic Notes

We generate a synthetic clinical note that contains multiple PHI entities, such as patient names, provider names, and family members. Each entity is manually annotated with its start and end character indices (end exclusive, as in Python slicing) to serve as ground truth for evaluation.

Note: The synthetic note includes intentional errors such as missing spaces in names (NurseKate) to test the robustness of de-identification models.

In [None]:
# Create the synthetic note. "NurseKate" (missing space) is intentional, to test whether a model still flags the name
note_text = """
Patient Emma Su underwent inpatient surgery for acute exacerbation of asthma and was admitted to the hospital for further management. During her stay, she received treatment with IV steroids, bronchodilators, and oxygen therapy. She was also seen by Dr. Lee, a pulmonologist, who adjusted her medication regimen. The patient's family members, including Rob (brother), Liv, and Dave (partner), visited her regularly and provided emotional support. The dad (Pete) expressed concern about her condition and stated "I'm glad she's getting the care she needs, sweetie". Her mom, Jen K, will be picking her up from the hospital today after discharge. During her stay, the patient underwent various tests, including pulmonary function tests and chest X-rays, which showed significant improvement after treatment. The patient was also educated on proper inhaler use and asthma management by NurseKate. She will follow up with Dr. Smith in 2 weeks to reassess her symptoms and adjust her medication regimen as needed. Her friend mike will be helping her with errands and chores during her recovery. The patient's condition improved significantly during her stay, and she was discharged in stable condition with instructions to rest and continue her medication regimen. Emma's condition will continue to be monitored by her healthcare team, including Dr. Smith Y. and Nurse Kate W. Documented by: Kate Whittier. Signed by: Dr. Smith Yeats. Date: March 15, 2023, 14:30
"""
# manually annotate entity_text
ground_truth = pd.DataFrame(
    {
    "entity_text": ["Emma Su", "Lee", "Rob", "Liv", "Dave", "Pete", "Jen K", "Kate", "Smith", "mike", "Emma", "Smith Y.", "Kate W.", "Kate Whittier", "Smith Yeats"],
    "entity_start_index": [9, 255, 354, 369, 378, 457, 575, 889, 923, 1021, 1261, 1346, 1365, 1388, 1418],
    "entity_end_index": [16, 258, 357, 372, 382, 461, 580, 893, 928, 1025, 1265, 1354, 1372, 1401, 1429]
})
ground_truth.head()
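
Hand-counted character offsets are easy to get wrong by one, so it can help to check that every annotated span actually reproduces its entity text. The sketch below is one possible sanity check, not part of the original notebook: `find_span_mismatches`, `toy_note`, and `toy_annotations` are hypothetical names, and the toy note is deliberately tiny so the offsets can be verified by eye.

```python
def find_span_mismatches(text, annotations):
    """Return annotations whose character span does not reproduce the entity text.

    `annotations` is a list of (entity_text, start, end) tuples, with `end`
    exclusive, matching Python slicing.
    """
    mismatches = []
    for entity_text, start, end in annotations:
        actual = text[start:end]
        if actual != entity_text:
            # Record what the span really covers to make the error easy to fix
            mismatches.append((entity_text, start, end, actual))
    return mismatches


# Hypothetical toy note (not the note used in this notebook)
toy_note = "Patient Ann Lee was seen by Dr. Roy."
toy_annotations = [
    ("Ann Lee", 8, 15),  # correct span
    ("Roy", 31, 34),     # deliberately off by one; the correct span is (32, 35)
]

for entity_text, start, end, actual in find_span_mismatches(toy_note, toy_annotations):
    print(f"{entity_text!r} annotated at [{start}, {end}) but text there is {actual!r}")
```

To run the same check on the notebook's annotations, the rows can be turned into tuples with `ground_truth[["entity_text", "entity_start_index", "entity_end_index"]].itertuples(index=False, name=None)` and passed in together with `note_text`.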

## Microsoft Presidio Model

[Microsoft Presidio](https://github.com/microsoft/presidio) is an open-source library for detecting personally identifiable information (PII) in text. In this section, we apply the Presidio recognizer to our synthetic note to automatically detect names. Later, we will compare the predicted entities with our ground truth annotations.

Presidio outputs start/end indices and entity types, which can be directly compared to ground truth for evaluation.

In [None]:
# initialize presidio_analyzer
analyzer = AnalyzerEngine()
# detect PII
results = analyzer.analyze(text=note_text, entities=["PERSON"], language="en")
df_presidio = pd.DataFrame([
    {
        "model":"presidio",
        "entity_type":ent.entity_type,
        "entity_start_index":ent.start,
        "entity_end_index":ent.end,
        "entity_text":note_text[ent.start:ent.end]
    } for ent in results
])
df_presidio.head()
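
Detected start/end indices are also what a redaction step consumes. As a rough illustration of that step, the sketch below replaces character spans with a placeholder using plain Python. It is not Presidio's implementation (the `presidio-anonymizer` package installed earlier is the library's own tool for this); `mask_spans` and the toy note are hypothetical and chosen only to show the offset handling.

```python
def mask_spans(text, spans, placeholder="<PERSON>"):
    """Replace each (start, end) character span in `text` with `placeholder`.

    Overlapping or touching spans are merged first, then replacements are
    applied right to left so that earlier offsets stay valid.
    """
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Extend the previous span instead of masking the same text twice
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    for start, end in reversed(merged):
        text = text[:start] + placeholder + text[end:]
    return text


# Hypothetical toy note and spans (not output from any model)
toy_note = "Patient Ann Lee was seen by Dr. Roy."
masked = mask_spans(toy_note, [(8, 15), (32, 35), (12, 15)])
print(masked)
```

Working right to left matters because each replacement changes the length of the text, which would shift every offset to its right.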

## Alternative NER Models from HuggingFace
- BERT-Base NER ([dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER)):
A smaller BERT model fine-tuned on the CoNLL-2003 dataset to recognize person names and other standard entities.

- BERT-Large NER ([dbmdz/bert-large-cased-finetuned-conll03-english](https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)):
A larger cased BERT model fine-tuned similarly on the CoNLL-2003 dataset, expected to provide better context understanding due to more parameters.

We use the HuggingFace pipeline API for NER and extract entities including start/end positions to compare with ground truth.

In [None]:
# Use a small NER model for demo
bert_base = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
ents = bert_base(note_text)
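# Keep only person entities (label "PER" in CoNLL-2003) so the comparison matches
# the PERSON-only ground truth and the Presidio setup above
df_bert_base = pd.DataFrame([
    {
        "model":"bert_base",
        "entity_type": ent["entity_group"],
        "entity_start_index": ent["start"],
        "entity_end_index": ent["end"],
        "entity_text": ent["word"]
    } for ent in ents if ent["entity_group"] == "PER"
])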
df_bert_base.head()

In [None]:
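# These classes are not imported in the first cell, so import them here
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)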

# Build NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
# Run NER
ents = ner_pipeline(note_text)
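# Keep only person entities (label "PER") to match the PERSON-only ground truth
df_bert_finetuned = pd.DataFrame([
    {
        "model": "bert_finetuned",
        "entity_type": ent["entity_group"],
        "entity_start_index": ent["start"],
        "entity_end_index": ent["end"],
        "entity_text": ent["word"]
    } for ent in ents if ent["entity_group"] == "PER"
])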
df_bert_finetuned.head()

## Evaluation
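Model performance is evaluated using overlap ratio thresholds. The overlap ratio is the number of characters shared by a predicted span and a ground truth span, divided by the length of the ground truth span. If the ratio meets or exceeds the threshold, the prediction is counted as a true positive (TP); a prediction that matches no ground truth span is a false positive (FP), and a ground truth span that no prediction matches is a false negative (FN).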

Metrics computed include:

- Precision: TP / (TP + FP). The fraction of predicted spans that are true PHI; low precision means non-sensitive text is over-redacted.

- Recall: TP / (TP + FN). The fraction of ground truth PHI entities that are detected. Recall is more critical for de-identification, since a missed PHI entity can leak sensitive information.

Multiple thresholds are applied to assess robustness, and Seaborn line plots are used to visualize precision and recall across thresholds for each model.

Note: Because this project uses synthetic data, the exact ranking or performance differences between models may not fully reflect real-world behavior, but it effectively demonstrates the methodology and evaluation approach.
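
To make the matching rule concrete, here is a small worked example of the overlap ratio and the two metrics. It is a sketch under simplifying assumptions: spans are plain `(start, end)` tuples with an exclusive end, the offsets are hypothetical and refer to the standalone string "NurseKate" (with "Kate" annotated at characters 5 to 9), and the counts passed to `precision_recall` are made up for illustration.

```python
def overlap_ratio(pred_span, gt_span):
    """Shared characters between the two spans, divided by the ground truth length."""
    pred_start, pred_end = pred_span
    gt_start, gt_end = gt_span
    overlap = max(0, min(pred_end, gt_end) - max(pred_start, gt_start))
    return overlap / (gt_end - gt_start)


def precision_recall(tp, fp, fn):
    """Precision and recall from raw counts, returning 0 when a denominator is 0."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    return precision, recall


# Hypothetical ground truth: "Kate" inside the string "NurseKate"
gt = (5, 9)

full = overlap_ratio((0, 9), gt)     # prediction covers all of "NurseKate"
partial = overlap_ratio((0, 7), gt)  # prediction covers "NurseKa"
miss = overlap_ratio((20, 25), gt)   # prediction lies elsewhere in the text

print(full, partial, miss)  # 1.0 0.5 0.0

# Made-up counts: 2 matched predictions, 1 spurious prediction, 1 missed entity
print(precision_recall(tp=2, fp=1, fn=1))
```

Because the denominator is the ground truth length, a prediction that is wider than the annotated entity (such as the whole of "NurseKate") still scores 1.0. The ratio penalizes predictions that cover too little of an entity, not ones that cover too much.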

In [None]:
def evaluate_ner_models(ground_truth, model_dfs, model_names, thresholds=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """
    Evaluate multiple NER models against ground truth using overlap-based matching.

    Parameters
    ----------
    ground_truth : pd.DataFrame
        DataFrame containing ground truth entities with columns:
        ['entity_text', 'entity_start_index', 'entity_end_index'].
    model_dfs : list of pd.DataFrame
        List of predicted entities DataFrames, each with the same columns as ground_truth.
    model_names : list of str
        Names of the models corresponding to model_dfs.
    thresholds : list of float, optional
        Minimum overlap ratio to consider a predicted entity as a true positive.

    Returns
    -------
    pd.DataFrame
        DataFrame containing precision and recall for each model at each threshold.
        Columns: ['model', 'threshold', 'precision', 'recall'].
    """
    results = []

    for model_df, model_name in zip(model_dfs, model_names):
        for t in thresholds:
            tp, fp, fn = 0, 0, 0
            matched_gt_idx = set()

            # Iterate over predicted entities
            for _, m_row in model_df.iterrows():
                m_start, m_end = m_row['entity_start_index'], m_row['entity_end_index']
                match_found = False

                for gt_idx, gt_row in ground_truth.iterrows():
                    gt_start, gt_end = gt_row['entity_start_index'], gt_row['entity_end_index']

                    # Compute overlap
                    overlap = max(0, min(m_end, gt_end) - max(m_start, gt_start))
                    overlap_ratio = overlap / (gt_end - gt_start)

                    if overlap_ratio >= t:
                        tp += 1
                        matched_gt_idx.add(gt_idx)
                        match_found = True
                        break

                if not match_found:
                    fp += 1

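            # Ground truth entities that no prediction matched
            fn = len(ground_truth) - len(matched_gt_idx)
            precision = tp / (tp + fp) if (tp + fp) > 0 else 0
            # Recall counts distinct matched ground truth entities, so several
            # predictions hitting the same entity are not double-counted
            recall = len(matched_gt_idx) / len(ground_truth) if len(ground_truth) > 0 else 0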

            results.append({
                'model': model_name,
                'threshold': t,
                'precision': precision,
                'recall': recall
            })

    results_df = pd.DataFrame(results)
    return results_df

In [None]:
evaluation_result = \
evaluate_ner_models(ground_truth,
                    model_dfs=[df_presidio, df_bert_base, df_bert_finetuned],
                    model_names=["Presidio", "BERT Base", "BERT Finetuned"],
                    thresholds = [0.1, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
evaluation_result

In [None]:
def plot_ner_score(results_df, metric='precision'):
    """
    Plot NER evaluation score for multiple models across thresholds.

    Parameters
    ----------
    results_df : pd.DataFrame
        DataFrame from evaluate_ner_models containing columns:
        ['model', 'threshold', 'precision', 'recall'].
    metric : str, optional
        Metric to plot. Must be either 'precision' or 'recall'.

    Returns
    -------
    None
        Displays a line plot for the selected metric.
    """
    if metric not in ['precision', 'recall']:
        raise ValueError("metric must be 'precision' or 'recall'")

    plt.figure(figsize=(10,6))
    sns.lineplot(data=results_df, x='threshold', y=metric, hue='model', marker='o')
    # Show the full score range so values below 0.5 are not clipped
    plt.ylim(0, 1.05)
    plt.title(f"NER Model Evaluation: {metric.capitalize()} vs Threshold")
    plt.xlabel("Overlap Threshold")
    plt.ylabel(metric.capitalize())
    plt.show()

In [None]:
plot_ner_score(evaluation_result, "recall")

In [None]:
plot_ner_score(evaluation_result, "precision")