# Clinical Text Summarization and Hallucination-Aware Evaluation

This notebook demonstrates clinical text summarization and
hallucination-aware evaluation, inspired by:

> *A Data-Centric Approach to Generate Faithful and High Quality Clinical Summaries with Large Language Models* (Hegselmann et al., 2024).


It shows:
- Synthetic clinical encounters (no real patient data)
- Summaries produced by a simple baseline
- Hallucination-style checks using keyword heuristics
- Text-based metrics (ROUGE, BLEU) computed without the Hugging Face `evaluate` library

In [None]:
# If running in Colab or a clean environment, uncomment this cell to install dependencies.
# This uses a *minimal* set of packages (no transformers, no evaluate).
#
# !pip install pyhealth sacrebleu rouge-score


## Imports and basic setup


In [1]:
import random
from pprint import pprint

import pyhealth
print("PyHealth version:", pyhealth.__version__)

# Metrics libraries (do not depend on transformers)
import sacrebleu
from rouge_score import rouge_scorer


PyHealth version: 1.1.6


## Create synthetic clinical encounters

We define a small list of synthetic encounters. Each encounter has:

- an `encounter_id`
- a free-text `history` describing the clinical scenario
- a short `reference_summary` which we treat as the gold standard summary
- a list of `key_conditions` we expect to appear in a faithful summary



In [2]:
toy_encounters = [
    {
        "encounter_id": "E001",
        "history": (
            "65-year-old male with history of hypertension and diabetes admitted "
            "for shortness of breath. Chest X-ray shows bilateral infiltrates. "
            "Started on IV antibiotics and oxygen therapy."
        ),
        "reference_summary": (
            "Elderly male with hypertension and diabetes admitted for dyspnea "
            "and bilateral infiltrates, treated with IV antibiotics and oxygen."
        ),
        "key_conditions": ["hypertension", "diabetes", "dyspnea", "bilateral infiltrates"],
    },
    {
        "encounter_id": "E002",
        "history": (
            "54-year-old female with breast cancer in remission presents with new-onset "
            "headache and visual changes. MRI of the brain is pending. Given analgesics "
            "and neurology consulted."
        ),
        "reference_summary": (
            "Middle-aged female with history of breast cancer in remission presents with "
            "new headache and visual symptoms; brain MRI pending and neurology consulted."
        ),
        "key_conditions": ["breast cancer", "headache", "visual changes", "MRI"],
    },
    {
        "encounter_id": "E003",
        "history": (
            "72-year-old with chronic kidney disease and heart failure presents with "
            "leg swelling and fatigue. Labs show elevated creatinine and BNP. "
            "Diuretics initiated and nephrology consulted."
        ),
        "reference_summary": (
            "Older adult with CKD and heart failure presents with edema and fatigue; "
            "found to have elevated creatinine and BNP, started on diuretics and "
            "seen by nephrology."
        ),
        "key_conditions": ["chronic kidney disease", "heart failure", "edema", "fatigue", "creatinine", "BNP"],
    },
]

len(toy_encounters)


3

## Define a simple rule-based summarizer

We use a simple lead-based extractive baseline:

- Take the first few sentences from the history.
- Optionally truncate to a maximum number of whitespace-separated tokens.
- Return this as the "model" summary.


In [3]:
import re

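def simple_summarize(text: str, max_sentences: int = 2, max_tokens: int = 64) -> str:
    """Return the first `max_sentences` sentences of `text`, capped at `max_tokens` whitespace tokens."""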
    if not isinstance(text, str) or not text.strip():
        return ""

    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [s for s in sentences if s.strip()]
    if not sentences:
        return ""
    selected = sentences[:max_sentences]
    summary = " ".join(selected).strip()

    # Truncate by token count
    tokens = summary.split()
    if len(tokens) > max_tokens:
        summary = " ".join(tokens[:max_tokens])

    return summary

def batch_summarize(texts, max_sentences: int = 2, max_tokens: int = 64):
    return [
        simple_summarize(t, max_sentences=max_sentences, max_tokens=max_tokens)
        for t in texts
    ]
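
The sentence splitter inside `simple_summarize` breaks on any whitespace that follows `.`, `!`, or `?`, so abbreviations that are common in clinical notes can be split mid-sentence. The sketch below applies the same regular expression to a made-up note to show the effect. The note text and the `split_sentences` helper are illustrative assumptions, not part of the original notebook.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split on whitespace that follows '.', '!' or '?' (same regex as simple_summarize)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

# Made-up note, for illustration only.
note = "Pt seen by Dr. Lee for chest pain. Troponin 0.04 ng/mL. Discharged home."
sentences = split_sentences(note)
for s in sentences:
    print(repr(s))
# 'Pt seen by Dr.'
# 'Lee for chest pain.'
# 'Troponin 0.04 ng/mL.'
# 'Discharged home.'
```

With `max_sentences=2`, the baseline would return only `Pt seen by Dr. Lee for chest pain.` for this note. Decimal values such as `0.04` are safe because the period is not followed by whitespace, but titles and dosing abbreviations are not, so real notes would likely need a sentence splitter that knows about clinical abbreviations.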


## Generate summaries from the clean histories

Apply the baseline summarizer to each encounter's original (clean) history and
store the result in `model_summary`. The baseline stands in for a neural
sequence-to-sequence model, but it is much cheaper and entirely deterministic.


In [9]:
for enc in toy_encounters:
    enc["model_summary"] = simple_summarize(enc["history"])

pprint([
    {
        "id": e["encounter_id"],
        "history": e["history"],
        "reference_summary": e["reference_summary"],
        "model_summary": e["model_summary"],
    }
    for e in toy_encounters
])


[{'history': '65-year-old male with history of hypertension and diabetes '
             'admitted for shortness of breath. Chest X-ray shows bilateral '
             'infiltrates. Started on IV antibiotics and oxygen therapy.',
  'id': 'E001',
  'model_summary': '65-year-old male with history of hypertension and diabetes '
                   'admitted for shortness of breath. Chest X-ray shows '
                   'bilateral infiltrates.',
  'reference_summary': 'Elderly male with hypertension and diabetes admitted '
                       'for dyspnea and bilateral infiltrates, treated with IV '
                       'antibiotics and oxygen.'},
 {'history': '54-year-old female with breast cancer in remission presents with '
             'new-onset headache and visual changes. MRI of the brain is '
             'pending. Given analgesics and neurology consulted.',
  'id': 'E002',
  'model_summary': '54-year-old female with breast cancer in remission '
                   'presents with

## Add noisy / templated content to simulate more complex inputs

To mimic more realistic clinical notes, we create a noisy version of the history
by appending extra social history and templated text. We then summarize this noisy
history with the same baseline model to see how robust it is.


In [5]:
for enc in toy_encounters:
    enc["noisy_history"] = (
        enc["history"]
        + " Social: lives with family, enjoys gardening, no recent travel. "
          "Multiple duplicated notes and templated phrases are present in the record."
    )

for enc in toy_encounters:
    enc["summary_noisy_input"] = simple_summarize(enc["noisy_history"])

pprint([
    {
        "id": e["encounter_id"],
        "clean_history_summary": e["model_summary"],
        "noisy_history_summary": e["summary_noisy_input"],
    }
    for e in toy_encounters
])


[{'clean_history_summary': '65-year-old male with history of hypertension and '
                           'diabetes admitted for shortness of breath. Chest '
                           'X-ray shows bilateral infiltrates.',
  'id': 'E001',
  'noisy_history_summary': '65-year-old male with history of hypertension and '
                           'diabetes admitted for shortness of breath. Chest '
                           'X-ray shows bilateral infiltrates.'},
 {'clean_history_summary': '54-year-old female with breast cancer in remission '
                           'presents with new-onset headache and visual '
                           'changes. MRI of the brain is pending.',
  'id': 'E002',
  'noisy_history_summary': '54-year-old female with breast cancer in remission '
                           'presents with new-onset headache and visual '
                           'changes. MRI of the brain is pending.'},
 {'clean_history_summary': '72-year-old with chronic kidney disease and 
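
The clean and noisy summaries printed above are identical because the noise is appended after the sentences that the lead baseline keeps. The sketch below, which is an illustrative assumption rather than part of the original experiment, contrasts appended and prepended noise using a compact re-implementation of the lead idea (`lead_summary` is a hypothetical helper that mirrors `simple_summarize` without the token cap).

```python
import re

def lead_summary(text: str, max_sentences: int = 2) -> str:
    """Keep the first `max_sentences` sentences (same idea as simple_summarize)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    return " ".join(sentences[:max_sentences])

history = (
    "72-year-old with chronic kidney disease and heart failure presents with "
    "leg swelling and fatigue. Labs show elevated creatinine and BNP. "
    "Diuretics initiated and nephrology consulted."
)
noise = "Social: lives with family, enjoys gardening, no recent travel."

appended = lead_summary(history + " " + noise)   # noise at the end
prepended = lead_summary(noise + " " + history)  # noise at the start

print(appended == lead_summary(history))  # True: trailing noise is never reached
print(prepended)                          # starts with the social history sentence
```

When the noise comes first, it takes one of the two sentence slots and the lab findings are pushed out of the summary. The appended-noise setup therefore says little about robustness for a lead baseline; position-dependent noise would be a more demanding test.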

## Simple hallucination-style checks

Here we implement heuristics to flag potential hallucinations. First, we maintain a list of condition
keywords that we care about (e.g., cancer, stroke).

A summary is flagged as hallucinated if it mentions a condition that does **not**
appear in the original history. We also check for **missing key facts**, which are
expected conditions in `key_conditions` that do not appear in the model summary.

These heuristics are simple but illustrate the idea of hallucination-aware
evaluation from the paper.


In [6]:
condition_keywords = [
    "cancer", "stroke", "myocardial infarction", "sepsis",
    "pneumonia", "hypertension", "diabetes", "heart failure",
    "kidney disease", "dialysis",
]

def detect_hallucinations(history: str, summary: str):
    """Return a set of condition keywords that appear in the summary but not in the history."""
    history_lower = history.lower()
    summary_lower = summary.lower()
    hallucinated = set()

    for cond in condition_keywords:
        if cond in summary_lower and cond not in history_lower:
            hallucinated.add(cond)

    return hallucinated

def missing_key_facts(key_conditions, summary: str):
    """Return a set of key conditions that are missing from the summary."""

    summary_lower = summary.lower()
    missing = set()

    for cond in key_conditions:
        if cond.lower() not in summary_lower:
            missing.add(cond)

    return missing

for enc in toy_encounters:
    h = enc["history"]
    s = enc["model_summary"]

    enc["hallucinated_conditions"] = detect_hallucinations(h, s)
    enc["missing_key_conditions"] = missing_key_facts(enc["key_conditions"], s)

pprint([
    {
        "id": e["encounter_id"],
        "model_summary": e["model_summary"],
        "hallucinated_conditions": sorted(e["hallucinated_conditions"]),
        "missing_key_conditions": sorted(e["missing_key_conditions"]),
    }
    for e in toy_encounters
])


[{'hallucinated_conditions': [],
  'id': 'E001',
  'missing_key_conditions': ['dyspnea'],
  'model_summary': '65-year-old male with history of hypertension and diabetes '
                   'admitted for shortness of breath. Chest X-ray shows '
                   'bilateral infiltrates.'},
 {'hallucinated_conditions': [],
  'id': 'E002',
  'missing_key_conditions': [],
  'model_summary': '54-year-old female with breast cancer in remission '
                   'presents with new-onset headache and visual changes. MRI '
                   'of the brain is pending.'},
 {'hallucinated_conditions': [],
  'id': 'E003',
  'missing_key_conditions': ['edema'],
  'model_summary': '72-year-old with chronic kidney disease and heart failure '
                   'presents with leg swelling and fatigue. Labs show elevated '
                   'creatinine and BNP.'}]
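
Every `hallucinated_conditions` list above is empty, which is expected: an extractive baseline only copies text from the history, so it cannot introduce a new keyword. To see the check fire, the sketch below runs the same substring logic on a hand-written abstractive summary. The summary text is a made-up example, and the keyword list is a subset of `condition_keywords` repeated here so the block runs on its own.

```python
# Subset of the notebook's condition_keywords, repeated so this block is self-contained.
keywords = ["cancer", "stroke", "sepsis", "pneumonia", "hypertension", "diabetes"]

def detect_hallucinations(history: str, summary: str, keywords=keywords) -> set:
    """Return keywords that appear in the summary but not in the history."""
    history_lower, summary_lower = history.lower(), summary.lower()
    return {k for k in keywords if k in summary_lower and k not in history_lower}

history = (
    "65-year-old male with history of hypertension and diabetes admitted "
    "for shortness of breath. Chest X-ray shows bilateral infiltrates. "
    "Started on IV antibiotics and oxygen therapy."
)
# Hypothetical abstractive summary written by hand for this example.
abstractive = "Elderly male with hypertension and diabetes admitted with pneumonia."

flagged = detect_hallucinations(history, abstractive)
print(flagged)  # {'pneumonia'}
```

The history never states a diagnosis of pneumonia, so the keyword is flagged, while `hypertension` and `diabetes` pass because they appear in both texts. Since this is plain substring matching, it does not handle negation: a summary that says "no pneumonia" would be flagged in the same way.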


## Compute ROUGE and BLEU metrics

We now compute standard text-based metrics directly, using their respective packages:

- **ROUGE-1 / ROUGE-2 / ROUGE-L** with `rouge-score`
- **BLEU** with `sacrebleu`



In [7]:
refs = [e["reference_summary"] for e in toy_encounters]
preds = [e["model_summary"] for e in toy_encounters]

# ROUGE
rouge_types = ["rouge1", "rouge2", "rougeL"]
scorer = rouge_scorer.RougeScorer(rouge_types, use_stemmer=True)

rouge_scores = {r: [] for r in rouge_types}
for ref, pred in zip(refs, preds):
    scores = scorer.score(ref, pred)  # (target, prediction)
    for r in rouge_types:
        rouge_scores[r].append(scores[r].fmeasure)

avg_rouge = {r: sum(vals) / len(vals) for r, vals in rouge_scores.items()}

# BLEU
bleu = sacrebleu.corpus_bleu(preds, [refs])

print("Average ROUGE scores (F1):")
for r, v in avg_rouge.items():
    print(f"  {r}: {v:.4f}")

print(f"Corpus BLEU: {bleu.score:.2f}")


Average ROUGE scores (F1):
  rouge1: 0.5524
  rouge2: 0.3604
  rougeL: 0.5234
Corpus BLEU: 25.31
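
To make the ROUGE numbers easier to interpret, the sketch below computes a simplified ROUGE-1 F1 by hand: clipped unigram overlap, then precision, recall, and their harmonic mean. It is an approximation under stated assumptions (lowercasing, alphanumeric tokens, no stemming), so it will not reproduce the `rouge-score` values above exactly. The helper name `rouge1_f1` and the two short strings are illustrative.

```python
import re
from collections import Counter

def rouge1_f1(reference: str, prediction: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap, no stemming."""
    ref_counts = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    pred_counts = Counter(re.findall(r"[a-z0-9]+", prediction.lower()))
    overlap = sum((ref_counts & pred_counts).values())  # min count per shared token
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "heart failure with edema"
prediction = "heart failure with leg swelling"
f1 = rouge1_f1(reference, prediction)
print(f"{f1:.4f}")  # 0.6667 (precision 3/5, recall 3/4)
```

The example also shows why paraphrases are penalized: "leg swelling" and "edema" describe the same finding but share no tokens, so they contribute nothing to the overlap.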


## Aggregate hallucination-style statistics

Summarize the hallucination-style checks into two aggregate statistics:
- Proportion of summaries with any hallucinated condition keyword
- Average number of missing key conditions per summary



In [8]:
num_encounters = len(toy_encounters)
num_with_hallucinations = sum(1 for e in toy_encounters if e["hallucinated_conditions"])
avg_missing_key_conditions = sum(len(e["missing_key_conditions"]) for e in toy_encounters) / num_encounters

print(f"Number of encounters: {num_encounters}")
print(f"Summaries with any hallucinated condition: {num_with_hallucinations} "
      f"({num_with_hallucinations / num_encounters:.2f})")
print(f"Average number of missing key conditions per summary: {avg_missing_key_conditions:.2f}")


Number of encounters: 3
Summaries with any hallucinated condition: 0 (0.00)
Average number of missing key conditions per summary: 0.67
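
The 0.67 average above comes from two paraphrases rather than real omissions: E001 says "shortness of breath" where `key_conditions` expects "dyspnea", and E003 says "leg swelling" where it expects "edema". One possible refinement, sketched below, is to accept a small set of synonyms per condition. The `SYNONYMS` map and the `missing_key_facts_with_synonyms` function are hypothetical names introduced for this example; a real system might draw synonyms from a clinical terminology instead.

```python
# Hypothetical, hand-written synonym map for illustration only.
SYNONYMS = {
    "dyspnea": ["shortness of breath"],
    "edema": ["leg swelling"],
}

def missing_key_facts_with_synonyms(key_conditions, summary: str, synonyms=SYNONYMS) -> set:
    """Return key conditions for which neither the term nor any synonym is in the summary."""
    summary_lower = summary.lower()
    missing = set()
    for cond in key_conditions:
        variants = [cond] + synonyms.get(cond.lower(), [])
        if not any(v.lower() in summary_lower for v in variants):
            missing.add(cond)
    return missing

summary = (
    "65-year-old male with history of hypertension and diabetes admitted "
    "for shortness of breath. Chest X-ray shows bilateral infiltrates."
)
key_conditions = ["hypertension", "diabetes", "dyspnea", "bilateral infiltrates"]

still_missing = missing_key_facts_with_synonyms(key_conditions, summary)
print(still_missing)  # set()
```

With the synonym map in place, the E001 summary no longer reports "dyspnea" as missing. The trade-off is maintenance: every accepted paraphrase has to be listed explicitly, and overly broad synonyms can hide genuine omissions.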
