# Task Definition and Cross-Dataset Alignment

This notebook defines the unified clinical NLP task used across NBME, Synthea,
and MTSamples. The goal is to ensure that evaluation of open-source large language
models is **fair, comparable, and reproducible** across heterogeneous clinical
text distributions.

## Motivation

Public clinical NLP datasets differ substantially in:
- documentation style
- structure and noise
- availability of annotations

Directly comparing model performance across such datasets without a unified task
definition can lead to misleading conclusions. This notebook explicitly defines
the task, labels, and evaluation signals used in our study.

## Task Definition: Clinical Symptom Identification

We study the task of **clinical symptom identification from free-text clinical notes**.

Given a clinical note, a model is expected to identify mentions of:
- patient-reported symptoms
- subjective clinical complaints
- explicitly negated symptoms (e.g., "denies chest pain")

The task does **not** require diagnosis prediction, coding, or treatment planning.

## Definition of a Symptom

In this study, a symptom is any of the following:
- a subjective complaint (e.g., pain, nausea, palpitations)
- a patient-experienced condition (e.g., fatigue, dizziness, insomnia)
- an explicitly negated symptom (e.g., "denies chest pain"), retained because negation matters for clinical reasoning

We explicitly exclude:
- demographic attributes (age, sex)
- administrative details
- temporal markers without clinical complaints
- rubric-specific or exam-only artifacts
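
To make the definition concrete, the sketch below shows one possible shape for a model's output on a single note. The `SymptomMention` class, its field names, and the example note fragment are hypothetical and introduced only for illustration; the study does not prescribe this schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SymptomMention:
    """One symptom mention extracted from a clinical note (hypothetical schema)."""

    text: str      # surface form as written in the note, e.g. "chest pain"
    negated: bool  # True when the note explicitly denies the symptom


# Hypothetical target output for the made-up note fragment:
#   "Reports nausea and fatigue. Denies chest pain."
expected = [
    SymptomMention(text="nausea", negated=False),
    SymptomMention(text="fatigue", negated=False),
    SymptomMention(text="chest pain", negated=True),
]
```

Keeping negation as a flag on the mention, rather than dropping negated symptoms, lets the same record cover both the "patient-reported" and the "explicitly negated" cases listed above.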

In [1]:
# Load the NBME feature vocabulary
import pandas as pd

nbme_features = pd.read_csv("../data/nbme/features.csv")
nbme_features.head()  # preview the first five rows

Unnamed: 0,feature_num,case_num,feature_text
0,0,0,Family-history-of-MI-OR-Family-history-of-myoc...
1,1,0,Family-history-of-thyroid-disorder
2,2,0,Chest-pressure
3,3,0,Intermittent-symptoms
4,4,0,Lightheaded


## Symptom Mapping for NBME

NBME feature annotations include a mixture of:
- symptoms (e.g., chest pain, nausea)
- demographics (e.g., age, sex)
- rubric-specific concepts (e.g., exam timing)

For this study, we treat **only symptom-like features** as valid targets.

In [2]:
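# Example: symptom-like features
# Case-insensitive substring match on a small keyword list.
# This is a keyword heuristic for illustration; negated features
# such as "No-chest-pain" are matched as well.
symptom_like = nbme_features[
    nbme_features["feature_text"].str.contains(
        "pain|nausea|vomit|fatigue|palpitation|headache|insomnia|dizziness",
        case=False,
        na=False
    )
]

symptom_like.sample(10)  # random rows; the sample differs between runs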

Unnamed: 0,feature_num,case_num,feature_text
130,904,9,Global-headache-OR-diffuse-headache
77,508,5,Associated-nausea
84,515,5,Fatigue-OR-Difficulty-concentrating
131,905,9,Neck-pain
93,606,6,Chest-pain
65,406,4,Insomnia
20,107,1,Right-sided-LQ-abdominal-pain-OR-Right-lower-q...
75,506,5,No-chest-pain
24,111,1,8-to-10-hours-of-acute-pain
134,908,9,Nausea


In [3]:
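# Example: explicitly excluded features
# Case-insensitive substring match on demographic and temporal keywords.
# Substring matching is coarse: short keywords such as "age" also match
# inside longer words, so the result should be reviewed manually.
excluded_features = nbme_features[
    nbme_features["feature_text"].str.contains(
        "year|female|male|LMP|Pap|age|duration",
        case=False,
        na=False
    )
]

excluded_features.sample(10)  # random rows; the sample differs between runs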

Unnamed: 0,feature_num,case_num,feature_text
113,805,8,67-year
34,208,2,Female
139,913,9,Female
39,213,2,Onset-3-years-ago
58,315,3,35-year
66,407,4,Female
27,201,2,Last-Pap-smear-I-year-ago
42,216,2,44-year
70,501,5,Female
127,901,9,20-year
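
The two example filters above can be combined into a single per-feature decision. The sketch below is one way to do that with the standard library only. The function name `classify_feature`, the category labels, the rule that exclusion takes precedence, and the `No-`/`Denies-` prefix check for negation are all assumptions made for illustration, not the study's final mapping.

```python
import re

# Same keyword lists as the two example cells above (heuristics only).
SYMPTOM_PATTERN = re.compile(
    r"pain|nausea|vomit|fatigue|palpitation|headache|insomnia|dizziness",
    re.IGNORECASE,
)
EXCLUDE_PATTERN = re.compile(
    r"year|female|male|LMP|Pap|age|duration",
    re.IGNORECASE,
)
# Assumption: negated rubric features start with "No-" or "Denies-".
NEGATION_PREFIX = re.compile(r"^(no|denies)-", re.IGNORECASE)


def classify_feature(feature_text: str) -> str:
    """Assign a coarse category to one NBME feature string."""
    # Assumption: an exclusion keyword wins over a symptom keyword.
    if EXCLUDE_PATTERN.search(feature_text):
        return "excluded"
    if SYMPTOM_PATTERN.search(feature_text):
        if NEGATION_PREFIX.match(feature_text):
            return "negated_symptom"
        return "symptom"
    # Neither keyword list matched; left for manual review.
    return "unreviewed"


for text in ["Chest-pain", "No-chest-pain", "Female", "Lightheaded"]:
    print(text, "->", classify_feature(text))
```

With the DataFrame loaded earlier, the same function could be applied as `nbme_features["feature_text"].map(classify_feature)`. Features such as `Lightheaded` fall into `unreviewed`, which shows why a keyword list alone is not enough to define the symptom targets.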


## Cross-Dataset Alignment

| Dataset    | Text Source                     | Language Style          | Supervision Type        |
|------------|---------------------------------|-------------------------|-------------------------|
| NBME       | Exam patient notes              | Structured, concise     | Span-level annotations  |
| Synthea    | Synthetic EHR narratives        | Templated, clean        | Programmatic labels     |
| MTSamples  | Clinical transcription samples  | Noisy, narrative        | Weak supervision        |

## Dataset-Specific Usage

### NBME
- Input: `pn_history`
- Reference: annotated symptom spans
- Role: clean, exam-style benchmark

### Synthea
- Input: constructed synthetic clinical notes
- Reference: condition and encounter metadata
- Role: controlled synthetic EHR-style data

### MTSamples
- Input: transcription text
- Reference: keyword- and context-based symptom cues
- Role: real-world noisy clinical language
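
The per-dataset usage above can also be written down as a small configuration object so that later notebooks read the same settings. This is a sketch only: the `DatasetSpec` class, its field names, and the dictionary keys are hypothetical, while the values are copied from the table and lists in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSpec:
    """Describes how one dataset is used in the study (hypothetical structure)."""

    name: str
    input_source: str  # what is fed to the model
    reference: str     # what model outputs are compared against
    supervision: str   # supervision type from the alignment table
    role: str          # purpose of the dataset in the comparison


DATASET_SPECS = {
    "nbme": DatasetSpec(
        name="NBME",
        input_source="pn_history",
        reference="annotated symptom spans",
        supervision="span-level annotations",
        role="clean, exam-style benchmark",
    ),
    "synthea": DatasetSpec(
        name="Synthea",
        input_source="constructed synthetic clinical notes",
        reference="condition and encounter metadata",
        supervision="programmatic labels",
        role="controlled synthetic EHR-style data",
    ),
    "mtsamples": DatasetSpec(
        name="MTSamples",
        input_source="transcription text",
        reference="keyword- and context-based symptom cues",
        supervision="weak supervision",
        role="real-world noisy clinical language",
    ),
}
```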

## Evaluation Strategy

Model outputs are evaluated using:
- mention-level matching against reference symptoms
- semantic similarity to symptom descriptions
- qualitative error categorization (missed symptoms, hallucinated symptoms, and negation errors)

We do not optimize for dataset-specific competition metrics.
Instead, we focus on **robustness and generalization** across datasets.
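
As an illustration of mention-level matching, the sketch below computes precision, recall, and F1 over sets of normalized mention strings and lists the missed and hallucinated mentions. Exact matching after normalization is an assumption chosen for simplicity; the function names and the example mentions are hypothetical, and the study may use a more lenient matching rule.

```python
def normalize(mention: str) -> str:
    """Lowercase, turn hyphens into spaces, and collapse whitespace."""
    return " ".join(mention.lower().replace("-", " ").split())


def mention_prf(predicted: list, reference: list) -> dict:
    """Exact-match precision, recall, and F1 over normalized mention sets."""
    pred = {normalize(m) for m in predicted}
    ref = {normalize(m) for m in reference}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "missed": sorted(ref - pred),        # in the reference, not predicted
        "hallucinated": sorted(pred - ref),  # predicted, not in the reference
    }


# Made-up example: two of three mentions match after normalization.
result = mention_prf(
    predicted=["chest pain", "Nausea", "fever"],
    reference=["Chest-pain", "nausea", "fatigue"],
)
print(result["missed"], result["hallucinated"])  # ['fatigue'] ['fever']
```

The `missed` and `hallucinated` lists feed directly into the qualitative error categories, while negation errors would need the negation flag to be compared separately.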

## Cross-Dataset Robustness

Robustness is assessed by:
- consistency of performance across datasets
- sensitivity to prompt phrasing
- error patterns under distribution shift

This allows systematic analysis of how open-source LLMs behave
when clinical text style and noise characteristics change.
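
One simple way to quantify consistency of performance across datasets is to summarize the spread of a single metric. The sketch below reports the mean, population standard deviation, and range of per-dataset scores. The function name and the choice of summary statistics are assumptions, and the scores in the usage example are toy placeholders, not results from this study.

```python
from statistics import mean, pstdev


def cross_dataset_consistency(scores: dict) -> dict:
    """Summarize how much one metric varies across datasets."""
    values = list(scores.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),                # population standard deviation
        "range": max(values) - min(values),   # best minus worst dataset
    }


# Toy placeholder scores for illustration only (not measured results).
toy_f1 = {"nbme": 0.8, "synthea": 0.9, "mtsamples": 0.6}
summary = cross_dataset_consistency(toy_f1)
print(summary)
```

A smaller range or standard deviation indicates more consistent behavior across datasets. The same summary could be computed per prompt variant to examine sensitivity to prompt phrasing.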

## Summary

This unified task definition enables fair and reproducible evaluation of
open-source LLMs for clinical symptom identification across heterogeneous
public clinical datasets. All subsequent experiments adhere to this setup.