<a href="https://colab.research.google.com/github/sppandlkk/healthcare-nlp-llm-pipelines/blob/main/notebooks/01_deid_clinical_notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# De-identification of Clinical Notes using NLP
This project focuses on de-identification of clinical notes with the goal of protecting patient privacy while preserving the medical utility of the text. My main contributions are:

- Experimented with rule-based and pretrained NER models, such as Microsoft Presidio and BERT-based models.
- Integrated LLM-based approaches to test whether large language models can improve recall in entity detection.
  - Applied prompt engineering to improve zero-shot LLM performance.
  - Recomputed character offsets for LLM outputs, whose returned indices did not align with the source text.
- Conducted evaluation and comparison between NER models and LLM-based approaches.

In [None]:
# Install Python packages and import libraries
!pip install transformers[sentencepiece]
!pip install presidio-analyzer presidio-anonymizer
!pip install -q -U google-generativeai

import pandas as pd
from presidio_analyzer import AnalyzerEngine
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
import google.generativeai as genai
from google.colab import userdata
import json
import re

## Create Synthetic Notes

I generate synthetic clinical notes that contain multiple PHI entities, such as patient names, provider names, and family members. Each entity is manually annotated with its start and end character indices to serve as ground truth for evaluation.

Note: The synthetic note includes intentional irregularities, such as missing spaces (NurseKate, PeteSu), a lowercase name (mike hope), and a double space inside a name (Dave  Ledger), to test the robustness of de-identification models.

In [None]:
# Create the synthetic note. "NurseKate" (missing space) is included to test whether a model can still flag it.
note_text = """
Patient Emma M. Su underwent inpatient surgery for acute exacerbation of asthma and was admitted to Happy Valley Hospital for further management. During her stay, she received treatment with IV steroids, bronchodilators, and oxygen therapy. She was also seen by Dr. Lee, a pulmonologist affiliated with Pulmonary Department, who adjusted her medication regimen. The patient's family members, including brother Adam, Liv, and Dave  Ledger (partner), visited her regularly and provided emotional support. The dad, PeteSu, expressed concern about her condition and stated "I'm glad she's getting the care she needs". Her mom, Jen K, will be picking her up from Happy Valley Hospital's Discharge Unit today after discharge.
During her stay, the patient underwent various tests, including pulmonary function tests and chest X-rays, which showed significant improvement after treatment. These tests were conducted by Support Pulmonary Function Laboratory. The patient was also educated on proper inhaler use and asthma management by NurseKate.
She will follow up with Dr. Smith in 2 weeks to reassess her symptoms and adjust her medication regimen as needed. Her friend, mike hope, will be helping her with errands and chores during her recovery.
The patient's condition improved significantly during her stay, and she was discharged in stable condition with instructions to rest and continue her medication regimen. Emma's condition will continue to be monitored by her healthcare team, including Dr. Smith Y. and Nurse Kate W. from Happy Valley Hospital.
Documented by: Kate Whittier. Signed by: Dr. Smith M Yeats, MD. Date: March 15, 2023, 14:30
"""

# manually annotate entity_text
ground_truth = pd.DataFrame(
        {
        "entity_text": ["Emma M. Su", "Lee", "Adam", "Liv", "Dave  Ledger", "PeteSu", "Jen K", "Kate", "Smith", "mike hope", "Smith Y.", "Kate W.", "Kate Whittier", "Smith M Yeats"],
        "entity_start_index": [9, 267, 411, 417, 426, 513, 624, 1033, 1067, 1166, 1497, 1516, 1567, 1597],
        "entity_end_index": [19, 270, 415, 420, 438, 519, 629, 1037, 1072, 1175, 1505, 1523, 1580, 1610]
    }
)
ground_truth
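
Because the offsets are typed in by hand, a quick consistency check is useful before treating them as ground truth. The sketch below is an illustrative addition: the helper name `find_misaligned_annotations` and the toy note are hypothetical and not part of the original notebook. It slices the text at each annotated span and reports any annotation whose slice differs from the annotated string.

```python
def find_misaligned_annotations(text, annotations):
    """Return annotations whose (start, end) slice does not equal the annotated text.

    `annotations` is an iterable of (entity_text, start, end) tuples, with `end` exclusive.
    """
    problems = []
    for entity_text, start, end in annotations:
        actual = text[start:end]
        if actual != entity_text:
            problems.append((entity_text, start, end, actual))
    return problems


# Toy example (hypothetical note, not the one used in this notebook)
toy_note = "\nPatient Ann Lee was seen by Dr. Wu."
toy_annotations = [
    ("Ann Lee", 9, 16),  # correct span
    ("Wu", 32, 35),      # start is off by one, so the slice includes a leading space
]
misaligned = find_misaligned_annotations(toy_note, toy_annotations)
print(misaligned)
```

With the DataFrame above, the same check could be run by passing `ground_truth[["entity_text", "entity_start_index", "entity_end_index"]].itertuples(index=False)` as `annotations`; an empty result means every annotated span matches the note.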

## Microsoft Presidio Model

[Microsoft Presidio](https://github.com/microsoft/presidio) is an open-source library for detecting personally identifiable information (PII) in text. In this section, we apply the Presidio recognizer to our synthetic note to automatically detect names. Later, we will compare the predicted entities with our ground truth annotations.

Presidio outputs start/end indices and entity types, which can be directly compared to ground truth for evaluation.

In [None]:
# initialize presidio_analyzer
analyzer = AnalyzerEngine()
# detect PII
results = analyzer.analyze(text=note_text, entities=["PERSON"], language="en")
df_presidio = pd.DataFrame([
    {
        "model":"presidio",
        "entity_type":ent.entity_type,
        "entity_start_index":ent.start,
        "entity_end_index":ent.end,
        "entity_text":note_text[ent.start:ent.end]
    } for ent in results
])
df_presidio
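
Detection is only half of de-identification: the detected spans still have to be masked. The `presidio-anonymizer` package installed above is meant for that step, but it is not used in this notebook. As a library-independent illustration, the sketch below masks spans given as (start, end) pairs. The function name `redact_spans`, the `[PERSON]` placeholder, and the toy note are assumptions made for this example.

```python
def redact_spans(text, spans, placeholder="[PERSON]"):
    """Replace each (start, end) span in `text` with `placeholder`.

    Overlapping or touching spans are merged first. Replacements are then applied
    from right to left so that earlier offsets stay valid after each substitution.
    """
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    for start, end in reversed(merged):
        text = text[:start] + placeholder + text[end:]
    return text


# Toy example (hypothetical note and spans)
toy_note = "Seen by Dr. Lee; brother Adam visited."
toy_spans = [(25, 29), (12, 15)]  # "Adam" and "Lee", deliberately unsorted
redacted = redact_spans(toy_note, toy_spans)
print(redacted)
```

For a predictions DataFrame such as `df_presidio`, the spans could be built with `zip(df_presidio["entity_start_index"], df_presidio["entity_end_index"])`.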

## Alternative NER Models from HuggingFace
- BERT-Base NER ([dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER)):
A smaller BERT model fine-tuned on the CoNLL-2003 dataset to recognize person names and other standard entities.

- BERT-Large NER ([dbmdz/bert-large-cased-finetuned-conll03-english](https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)):
A larger cased BERT model fine-tuned similarly on the CoNLL-2003 dataset, expected to provide better context understanding due to more parameters.

We use the HuggingFace pipeline API for NER and extract entities including start/end positions to compare with ground truth.

In [None]:
# Use a small NER model for demo
bert_base = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
ents = bert_base(note_text)
df_bert_base = pd.DataFrame([
    {
        "model":"bert_base",
        "entity_type": ent["entity_group"],
        "entity_start_index": ent["start"],
        "entity_end_index": ent["end"],
        "entity_text": ent["word"]
    } for ent in ents if ent["entity_group"] == "PER"
])
df_bert_base

In [None]:
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Build NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
# Run NER
ents = ner_pipeline(note_text)
df_bert_large = pd.DataFrame([
    {
        "model": "bert_large",
        "entity_type": ent["entity_group"],
        "entity_start_index": ent["start"],
        "entity_end_index": ent["end"],
        "entity_text": ent["word"]
    } for ent in ents if ent["entity_group"] == "PER"
])
df_bert_large

## Evaluation
Model performance is evaluated using overlap ratio thresholds. A predicted span counts as a true positive (TP) if it overlaps a ground truth span by at least the threshold; otherwise it is a false positive (FP). A ground truth span that no prediction matches is a false negative (FN).

Metrics computed include:

- Precision: TP / (TP + FP). The share of flagged spans that are actual PHI; low precision means over-censoring text that should have been retained.
- Recall: TP / (TP + FN). More critical for de-identification, since missing a PHI entity could result in sensitive information leakage.

Multiple thresholds are applied to assess robustness, and Seaborn line plots are used to visualize precision and recall across thresholds for each model.
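
To make the matching rule concrete, the sketch below computes the overlap ratio for a single prediction, normalized by the length of the ground truth span as in the evaluation function that follows. The `iou` variant is not used in this notebook; it is included only as a hypothetical stricter alternative that also penalizes predictions much longer than the annotated entity. The two predicted spans are invented for illustration.

```python
def overlap_ratio(pred, gt):
    """Fraction of the ground truth span (start, end) covered by the predicted span."""
    overlap = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    return overlap / (gt[1] - gt[0])


def iou(pred, gt):
    """Intersection over union of two (start, end) spans."""
    overlap = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - overlap
    return overlap / union


gt_span = (426, 438)   # "Dave  Ledger" in the ground truth above
partial = (426, 430)   # hypothetical prediction that only captures "Dave"
too_long = (400, 500)  # hypothetical prediction that swallows the surrounding text

print(overlap_ratio(partial, gt_span))   # 4 / 12: a match at thresholds up to 0.3
print(overlap_ratio(too_long, gt_span))  # 1.0, even though the span is far too wide
print(iou(too_long, gt_span))            # 12 / 100
```

Because the ratio is normalized by the ground truth length only, an overly wide prediction is not penalized; this is acceptable when recall is the priority, but it is worth keeping in mind when reading the precision curves.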

In [None]:
def evaluate_ner_models(ground_truth, model_dfs, model_names, thresholds=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """
    Evaluate multiple NER models against ground truth using overlap-based matching.

    Parameters
    ----------
    ground_truth : pd.DataFrame
        DataFrame containing ground truth entities with columns:
        ["entity_text", "entity_start_index", "entity_end_index"].
    model_dfs : list of pd.DataFrame
        List of predicted entities DataFrames, each with the same columns as ground_truth.
    model_names : list of str
        Names of the models corresponding to model_dfs.
    thresholds : list of float, optional
        Minimum overlap ratio to consider a predicted entity as a true positive.

    Returns
    -------
    pd.DataFrame
        DataFrame containing precision and recall for each model at each threshold.
        Columns: ["model", "threshold", "precision", "recall"].
    """
    results = []

    for model_df, model_name in zip(model_dfs, model_names):
        for t in thresholds:
            tp, fp, fn = 0, 0, 0
            matched_gt_idx = set()

            # Iterate over predicted entities
            for _, m_row in model_df.iterrows():
                m_start, m_end = m_row["entity_start_index"], m_row["entity_end_index"]
                match_found = False

                for gt_idx, gt_row in ground_truth.iterrows():
                    gt_start, gt_end = gt_row["entity_start_index"], gt_row["entity_end_index"]

                    # Compute overlap
                    overlap = max(0, min(m_end, gt_end) - max(m_start, gt_start))
                    overlap_ratio = overlap / (gt_end - gt_start)

                    if overlap_ratio >= t:
                        tp += 1
                        matched_gt_idx.add(gt_idx)
                        match_found = True
                        break

                if not match_found:
                    fp += 1

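            # Count each ground truth entity once for recall, even if several
            # predictions matched it (tp counts predictions, not entities)
            matched = len(matched_gt_idx)
            fn = len(ground_truth) - matched
            precision = tp / (tp + fp) if (tp + fp) > 0 else 0
            recall = matched / (matched + fn) if (matched + fn) > 0 else 0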

            results.append({
                "model": model_name,
                "threshold": t,
                "precision": precision,
                "recall": recall
            })

    results_df = pd.DataFrame(results)
    return results_df

In [None]:
evaluation_result = \
evaluate_ner_models(ground_truth,
                    model_dfs=[df_presidio, df_bert_base, df_bert_large],
                    model_names=["Presidio", "BERT Base", "BERT Large"],
                    thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
evaluation_result

In [None]:
def plot_metric(df, metric="precision"):
    """
    Plot evaluation results for precision or recall across thresholds.

    Args:
        df (pd.DataFrame): DataFrame with columns ["model", "threshold", "precision", "recall"].
        metric (str): One of ["precision", "recall"].
    """
    plt.figure(figsize=(8,6))
    sns.lineplot(
        data=df,
        x="threshold",
        y=metric,
        hue="model",
        style="model",    # <-- this makes each model different line/marker
        markers=True,     # <-- enable markers
        dashes=True,      # <-- enable dashed lines
        palette="tab10",  # <-- better color separation
        alpha=0.9
    )
    plt.title(f"{metric.capitalize()} across thresholds", fontsize=14)
    plt.ylabel(metric.capitalize())
    plt.xlabel("Overlap threshold")
    plt.legend(title="Model", bbox_to_anchor=(1.05, 1), loc="upper left")
    plt.show()

In [None]:
plot_metric(evaluation_result, "recall")

In [None]:
plot_metric(evaluation_result, "precision")

## Using LLMs with Zero-shot Prompts

Beyond traditional NER models, I also experimented with large language models (LLMs) through API calls.

- Designed a zero-shot prompt that instructs the LLM (Gemini 1.5 Flash) to extract all person names from the clinical note and return them in a structured JSON format.
- Refined the prompt instructions so that the model output is more reliable and consistently formatted.

This step highlights how LLMs can adapt to tasks they were not explicitly trained on, using only carefully designed prompts.

In [None]:
# Use your own API key; here it is read from a Colab secret named "gemini_key"
genai.configure(api_key=userdata.get("gemini_key"))

model = genai.GenerativeModel("gemini-1.5-flash-latest")

# provide prompt
prompt1 = f"""
Extract all PERSON names from the following clinical note.
Return ONLY a JSON array with (start index, end index, and name)

Clinical Note:

{note_text}
"""

response1 = model.generate_content(prompt1)
print(response1.text)


In [None]:
prompt2 = f"""
Extract all PERSON names from the following clinical note.
Return ONLY a JSON array with (start index, end index, and name).
If the name has prefix (Dr. or Nurse), remove the prefix and recalculate indices.
If the name has double spaces, keep double spaces in the string.
Do not replace double spaces by single space.

Clinical Note:

{note_text}
"""
response2 = model.generate_content(prompt2)
print(response2.text)


## Handling Index Alignment Issues

While prompt engineering improved the accuracy of entity extraction, I encountered a major limitation: the indices returned by the LLM were often wrong.

This most likely happens because the LLM operates on tokens rather than characters, so it has no direct view of character offsets in the original text and has to estimate them.

To address this, I discarded the LLM-provided indices and recalculated them by searching for each extracted name in the original note. With correct indices, I was able to run a proper evaluation against the other models.


In [None]:
# Extract the JSON array from the LLM response and parse it
match = re.search(r"""\[.*\]""", response2.text, re.DOTALL)
json_result = json.loads(match.group(0))
names = [item["name"] for item in json_result]
names
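
LLM responses are not guaranteed to contain valid JSON, and `re.search` returns `None` when no bracketed array is present, which would make the cell above fail. The sketch below is a more defensive variant. The helper name `extract_names` and the sample response string are made up for illustration, and the `"name"` key is an assumption that has to match whatever key the model actually returned.

```python
import json
import re


def extract_names(response_text, key="name"):
    """Pull the first-to-last bracketed JSON array out of an LLM response.

    Returns the values stored under `key`, or an empty list when no array
    is found or the extracted text is not valid JSON.
    """
    match = re.search(r"\[.*\]", response_text, re.DOTALL)
    if match is None:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    return [item[key] for item in items if isinstance(item, dict) and key in item]


# Hypothetical response text with extra prose around the JSON array
sample_response = (
    "Here are the names:\n"
    '[{"start": 9, "end": 19, "name": "Emma M. Su"}, {"name": "Lee"}]\n'
    "Done."
)
sample_names = extract_names(sample_response)
print(sample_names)
print(extract_names("Sorry, I cannot help with that."))
```

Returning an empty list keeps the downstream alignment and evaluation cells runnable; in practice it may be preferable to log or retry when parsing fails.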

In [None]:
def align_entities_with_indices(note_text, names):
    results = []
    search_start = 0  # where to start searching in the note

    for name in names:
        idx = note_text.find(name, search_start)
        if idx == -1:
            # if not found, skip or raise warning
            print(f"Warning: '{name}' not found in text after position {search_start}")
            continue

        start_idx = idx
        end_idx = idx + len(name)
        results.append({
            "model":"gemini-1.5-flash",
            "entity_type":"PERSON",
            "entity_start_index":start_idx,
            "entity_end_index":end_idx,
            "entity_text":name
        })

        # move the search start forward to avoid re-finding the same text
        search_start = end_idx

    return pd.DataFrame(results)

df_gemini = align_entities_with_indices(note_text, names)
df_gemini
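
The sequential search above assumes the LLM lists names in the order they appear in the note, and it records one position per returned name. As an alternative sketch (not the approach used in this notebook), every occurrence of each distinct name can be located with `re.finditer`, which does not depend on ordering. The helper name `find_all_name_spans` and the toy note are hypothetical.

```python
import re


def find_all_name_spans(text, names):
    """Return sorted, de-duplicated (start, end, name) tuples for every occurrence of each name."""
    spans = set()
    for name in set(names):
        if not name:
            continue  # an empty string would match at every position
        for m in re.finditer(re.escape(name), text):
            spans.add((m.start(), m.end(), name))
    return sorted(spans)


# Toy example (hypothetical note); names are out of order and contain a duplicate
toy_note = "Dr. Smith saw Emma. Emma will follow up with Dr. Smith."
toy_names = ["Smith", "Emma", "Smith"]
all_spans = find_all_name_spans(toy_note, toy_names)
print(all_spans)
```

This matches plain substrings, so a short name can also be found inside a longer one (for example "Kate" inside "Kate Whittier") or inside an unrelated word. Overlapping spans would need to be merged, and possibly filtered with word boundaries, before evaluation or redaction.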

In [None]:
evaluation_result = \
evaluate_ner_models(ground_truth,
                    model_dfs=[df_presidio, df_bert_base, df_bert_large, df_gemini],
                    model_names=["Presidio", "BERT Base", "BERT Large", "Gemini 1.5 Flash"],
                    thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
evaluation_result

In [None]:
plot_metric(evaluation_result, "recall")

In [None]:
plot_metric(evaluation_result, "precision")

## Conclusion: NLP vs. LLM for De-identifying Names
| Feature                         | NLP Models (NER)                                           | LLMs                                                                |
| ------------------------------- | ---------------------------------------------------------- | ------------------------------------------------------------------- |
| **Training process**            | Task-specific, trained for named-entity recognition        | General-purpose, requires prompt engineering                        |
| **Ease of use & adaptability**  | Easy to use within training domain, limited outside domain | Generalizes to unseen contexts, can improve with prompt engineering |
| **Index extraction capability** | Accurate character positions                               | Tokenization can break alignment, manual correction may be needed   |
| **Human oversight**             | Minimal                                                    | Important to ensure correctness                                     |


**Summary**:
Traditional NER models perform robustly on the structured task they were trained for, delivering consistent entity extraction with exact character positions. LLMs, in contrast, can handle unstructured or irregular clinical notes and may identify a broader range of entities, but they require careful human oversight and post-processing to address tokenization and index alignment issues.

**Note**: Because this project uses synthetic data, the exact ranking or performance differences between models may not fully reflect real-world behavior, but it effectively demonstrates the methodology and evaluation approach.