
Add PhysioNet De-Identification dataset, NER task, and TransformerDeID model#981

Merged
joshuasteier merged 1 commit into sunlabuiuc:master from mtmckenna:deid-bert
Apr 19, 2026
Conversation

@mtmckenna
Contributor

Matt McKenna (mtm16@illinois.edu)

Contribution Type

Full Pipeline: Dataset + Task + Model (Option 4)

Paper

Johnson, Alistair E.W., et al. "Deidentification of free-text medical records using pre-trained bidirectional
transformers." Proceedings of the ACM Conference on Health, Inference, and Learning (CHIL), 2020.

https://doi.org/10.1145/3368555.3384455

Description

Implements BERT-based clinical text de-identification as a PyHealth pipeline. Given clinical notes with protected health information (PHI), the model performs token-level NER to detect and classify PHI into 7 categories (NAME, DATE, LOCATION, AGE, CONTACT, ID, PROFESSION) using BIO tagging.

  • Dataset: Parses PhysioNet deidentifiedmedicaltext 1.0 files (id.text + id.res), aligns PHI spans between original and de-identified versions, and produces token-level BIO labels.
  • Task: Converts patient events into NER samples. Supports configurable overlapping windows (paper Section 3.3) to handle notes longer than BERT's 512 token limit.
  • Model: TransformerDeID - pretrained transformer encoder (BERT or RoBERTa).
  • Example: End-to-end training + evaluation script with binary PHI metrics and overlapping window prediction merging.
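The overlapping-window scheme from the task can be sketched roughly like this (illustrative window/stride values; helper names here are hypothetical, not the actual task code):

```python
def make_windows(tokens, labels, window=100, stride=60):
    """Split a long token sequence into overlapping windows.

    Each window starts `stride` tokens after the previous one, so
    consecutive windows overlap by `window - stride` tokens.
    """
    windows = []
    start = 0
    while start < len(tokens):
        windows.append((tokens[start:start + window],
                        labels[start:start + window]))
        if start + window >= len(tokens):
            break
        start += stride
    return windows

# A 250-token note with window=100, stride=60 yields windows starting
# at token 0, 60, 120, and 180; the last covers tokens 180..249.
toks = list(range(250))
labs = ["O"] * 250
wins = make_windows(toks, labs)
print(len(wins))  # → 4
```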

Data Access

The test data in test-resources/core/physionet_deid/ is synthetic (fake).
Real data requires PhysioNet credentialed access:
https://physionet.org/content/deidentifiedmedicaltext/1.0/

Ablation Results

Our results are worse than the original paper's. We hypothesize the gap is because we train only on the PhysioNet data, without adding in the other datasets.

| Config | Precision | Recall | F1 |
| --- | --- | --- | --- |
| BERT, no window | 95.1% | 70.3% | 80.8% |
| BERT, win=100/60 | 86.9% | 75.7% | 80.9% |
| RoBERTa, no window | 98.1% | 64.7% | 78.0% |
| RoBERTa, win=100/60 | 82.6% | 68.6% | 75.0% |

Files to Review

| File | Description |
| --- | --- |
| pyhealth/datasets/physionet_deid.py | Dataset: parsing, PHI classification, BIO tagging |
| pyhealth/datasets/configs/physionet_deid.yaml | Dataset YAML config |
| pyhealth/tasks/deid_ner.py | Task with windowing support |
| pyhealth/models/transformer_deid.py | TransformerDeID model |
| tests/core/test_physionet_deid.py | Dataset + task tests (22 tests) |
| tests/core/test_transformer_deid.py | Model tests (20 tests) |
| examples/physionet_deid_ner_transformer_deid.py | Training + ablation script |
| docs/api/datasets/pyhealth.datasets.PhysioNetDeIDDataset.rst | Dataset docs |
| docs/api/tasks/pyhealth.tasks.DeIDNERTask.rst | Task docs |
| docs/api/models/pyhealth.models.TransformerDeID.rst | Model docs |

@joshuasteier
Collaborator

Hi @mtmckenna,

The dataset, task, and model all follow the BaseDataset, BaseTask, and BaseModel contracts correctly, the __init__.py changes are purely additive, and the tests are substantive (concrete assertions about specific patients and label positions, not just smoke tests). The honest reporting of ablation numbers that are below the paper's results, with a hypothesis about why, is appreciated.

A few suggestions, in rough order of importance. None of these are blockers.

_index_data writes a CSV into root

In PhysioNetDeIDDataset._index_data, you write physionet_deid_metadata.csv into the root directory. PhysioNet data is credentialed and users may have the data directory mounted read-only. Could you write this to the cache directory instead, or to a tmp path, and point the YAML config at the cached location? Your test tearDownClass cleans up the file, which hints that you already noticed this is awkward.
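A minimal sketch of what I mean, assuming a simple two-column index (the column names and helper are hypothetical):

```python
import csv
import tempfile
from pathlib import Path

def index_to_tmpdir(records):
    """Write the index CSV to a fresh per-instance temp directory
    instead of the (possibly read-only) data root."""
    tmp_dir = Path(tempfile.mkdtemp(prefix="pyhealth_deid_"))
    path = tmp_dir / "physionet_deid_metadata.csv"
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["patient_id", "text_path"])
        writer.writerows(records)
    return path

meta = index_to_tmpdir([("p1", "id.text"), ("p2", "id.text")])
```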

phi_spans_in_original alignment is fragile on real data

The anchoring approach (split the de-identified text on PHI tags and use the non-PHI chunks as landmarks in the original) works well when non-PHI chunks are unique. If a short non-PHI chunk happens to appear earlier in the original than where it semantically belongs, orig.find(before, pos) will lock onto the wrong position and shift all subsequent spans. Your ablation F1 of 80.9 percent says it mostly works, but could you add a defensive test case to test_physionet_deid.py where a non-PHI word repeats across PHI boundaries? Even if the current behavior is correct, it would make future regressions visible.
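A defensive test could pin down the cursor behavior directly. This is a minimal sketch of the failure mode with hypothetical strings; the real phi_spans_in_original API will look different:

```python
# Adversarial input for an alignment test: the non-PHI anchor "saw"
# appears twice in the original note, once before where it belongs.
orig = "Dr Lee saw Ann. Nurse saw Bob today."

# A naive search for the second "saw" that does not advance its cursor
# past already-matched text locks onto the first occurrence.
cursor = orig.find("Ann") + len("Ann")  # position after the first PHI span
wrong = orig.find("saw")                # 7: the first "saw"
right = orig.find("saw", cursor)        # 22: the "saw" after "Nurse"
assert wrong != right
```

A regression test can assert that the aligned span for the second tag starts after `cursor`, not at the earlier occurrence.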

classify_phi fallback

The function falls back to "NAME" if nothing matches. For real PhysioNet tags most are covered, but a silent catch-all means mislabeled tags become NAME by default. Consider logging a warning (once per unknown tag string) when the fallback fires, so users of real data can see if the heuristic is missing cases.

Task output schema inconsistency

input_schema uses {"text": "text"} (string value) but output_schema uses {"labels": TextProcessor} (class value). Could you make them consistent with whatever pattern other text-output tasks in pyhealth/tasks/ use? If both forms are supported, a short comment explaining why the task uses the class form for labels would help future readers.

deidentify test coverage

test_deidentify_custom_redact_marker asserts that [REDACTED] is absent when redact="[PHI]", but does not assert that [PHI] is actually present in the output. If the untrained model happens to predict all-O on that short input, both strings would be absent and the test would pass vacuously. Could you either add a second assertion or use a longer input where PHI prediction is likely enough to make the test meaningful?
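One way to make the assertion non-vacuous, sketched with a hypothetical checker (word-level only, which matches how I read the current test):

```python
def check_deid_output(original: str, redacted: str, marker: str = "[PHI]") -> bool:
    """Every output word must be either an original word or the custom
    redact marker; any other token means the wrong marker was used."""
    orig_words = set(original.split())
    return all(word == marker or word in orig_words
               for word in redacted.split())

assert check_deid_output("Dr Smith saw the patient", "Dr [PHI] saw the patient")
assert not check_deid_output("Dr Smith saw the patient",
                             "Dr [REDACTED] saw the patient")
```

With a check like this, an all-O prediction still passes, but a prediction using the wrong marker fails instead of passing silently.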

Small items

  1. The commit history is 16 commits including some WIP messages ("WIP e2e script", "add deidentify method so i can test it out manually"). Not a blocker, but if you squash before merge it will make git log easier to follow.

  2. The windowing test TestDeIDNERTaskWindowing.test_windowing_produces_more_samples asserts > 4, but you have 11 notes total. Should that threshold be > 11?

  3. In the example, compute_metrics uses p[1:].max() as the non-O confidence. Consider the more standard 1 - p[0] instead. The current version implicitly assumes the best single non-O class score represents the confidence of "this is PHI," which is defensible but worth a comment.
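To make item 3 concrete, here is the difference between the two scores on an illustrative per-token probability vector (p[0] is the O class):

```python
p = [0.5, 0.3, 0.2]      # illustrative softmax output for one token

best_non_o = max(p[1:])  # best single non-O class score: 0.3
any_non_o = 1 - p[0]     # total non-O probability mass: 0.5

# The two disagree whenever non-O mass is spread across classes.
assert any_non_o > best_non_o
```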

Once the _index_data write location is resolved, I think this is ready to merge. The contribution is well documented, the tests are real, and the paper reproduction is honest.

@mtmckenna
Contributor Author

Thank you for the thorough and thoughtful review! I have addressed the suggestions; please let me know if you have any further feedback. Also, I can squash the commits before merge, or I'm happy to use squash-and-merge if that's enabled on the repo.

  1. CSV write location: _index_data now writes to a temp directory instead of root, so read-only data directories work.
  2. PHI span alignment tests: Added two test cases for repeated non-PHI text across PHI boundaries (repeated word before/after a tag, and same anchor between two tags).
  3. classify_phi fallback: Now logs a warning (once per unknown tag string) when the NAME fallback fires.
  4. Task schema consistency: Both input_schema and output_schema use TextProcessor now, matching the medical_coding.py pattern.
  5. deidentify test: Added assertion that every output word is either an original word or the custom redact marker, so the test can't pass erroneously.
  6. Windowing test threshold: Changed from > 4 to > 11; thank you for catching that!
  7. compute_metrics confidence: Switched to 1 - p[0] with a comment explaining the choice.

@joshuasteier
Collaborator

Thank you, @mtmckenna. LGTM.
One tiny nit, not a blocker: the tempfile.mkdtemp directory created in PhysioNetDeIDDataset.__init__ isn't explicitly cleaned up anywhere. The test class cleans up its own cache_dir, but a long-running notebook that instantiates the dataset multiple times would leak pyhealth_deid_* directories under /tmp. If you want, you can add a __del__ or a close() method that calls shutil.rmtree(self._tmp_dir, ignore_errors=True). Or leave it and let the OS handle cleanup. Your call.

For the squash question: yes, please squash before merge. It will keep git log --oneline pyhealth/datasets/physionet_deid.py readable for anyone tracing why the dataset works the way it does.

Approving.

@mtmckenna
Contributor Author

Good suggestion on the __del__ method! Implemented. Also squashed down to one commit.

Thank you again for your help!

@joshuasteier joshuasteier merged commit 5f63039 into sunlabuiuc:master Apr 19, 2026
1 check passed
paul-nguyen-1 pushed a commit to paul-nguyen-1/PyHealth that referenced this pull request Apr 20, 2026