
Add PhysioNet De-Identification dataset, NER task, and TransformerDeID model#981

Merged
joshuasteier merged 1 commit into sunlabuiuc:master from mtmckenna:deid-bert
Apr 19, 2026
Conversation

@mtmckenna
Contributor

Matt McKenna (mtm16@illinois.edu)

Contribution Type

Full Pipeline: Dataset + Task + Model (Option 4)

Paper

Johnson, Alistair E.W., et al. "Deidentification of free-text medical records using pre-trained bidirectional
transformers." Proceedings of the ACM Conference on Health, Inference, and Learning (CHIL), 2020.

https://doi.org/10.1145/3368555.3384455

Description

Implements BERT-based clinical text de-identification as a PyHealth pipeline. Given clinical notes with protected health information (PHI), the model performs token-level NER to detect and classify PHI into 7 categories (NAME, DATE, LOCATION, AGE, CONTACT, ID, PROFESSION) using BIO tagging.

  • Dataset: Parses PhysioNet deidentifiedmedicaltext 1.0 files (id.text + id.res), aligns PHI spans between original and de-identified versions, and produces token-level BIO labels.
  • Task: Converts patient events into NER samples. Supports configurable overlapping windows (paper Section 3.3) to handle notes longer than BERT's 512 token limit.
  • Model: TransformerDeID - pretrained transformer encoder (BERT or RoBERTa).
  • Example: End-to-end training + evaluation script with binary PHI metrics and overlapping window prediction merging.
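The overlapping-window scheme from the task can be sketched roughly like this (illustrative window/stride values; helper names here are hypothetical, not the actual task code):

```python
def make_windows(tokens, labels, window=100, stride=60):
    """Split a long token sequence into overlapping windows.

    Each window starts `stride` tokens after the previous one, so
    consecutive windows overlap by `window - stride` tokens.
    """
    windows = []
    start = 0
    while start < len(tokens):
        windows.append((tokens[start:start + window],
                        labels[start:start + window]))
        if start + window >= len(tokens):
            break
        start += stride
    return windows

# A 250-token note with window=100, stride=60 yields windows starting
# at token 0, 60, 120, and 180; the last covers tokens 180..249.
toks = list(range(250))
labs = ["O"] * 250
wins = make_windows(toks, labs)
print(len(wins))  # → 4
```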

Data Access

The test data in test-resources/core/physionet_deid/ is synthetic (fake).
Real data requires PhysioNet credentialed access:
https://physionet.org/content/deidentifiedmedicaltext/1.0/

Ablation Results

Our results are worse than the original paper's. We hypothesize the gap is because we train only on the PhysioNet data, without adding in the other datasets.

| Config | Precision | Recall | F1 |
| --- | --- | --- | --- |
| BERT, no window | 95.1% | 70.3% | 80.8% |
| BERT, win=100/60 | 86.9% | 75.7% | 80.9% |
| RoBERTa, no window | 98.1% | 64.7% | 78.0% |
| RoBERTa, win=100/60 | 82.6% | 68.6% | 75.0% |

Files to Review

| File | Description |
| --- | --- |
| pyhealth/datasets/physionet_deid.py | Dataset: parsing, PHI classification, BIO tagging |
| pyhealth/datasets/configs/physionet_deid.yaml | Dataset YAML config |
| pyhealth/tasks/deid_ner.py | Task with windowing support |
| pyhealth/models/transformer_deid.py | TransformerDeID model |
| tests/core/test_physionet_deid.py | Dataset + task tests (22 tests) |
| tests/core/test_transformer_deid.py | Model tests (20 tests) |
| examples/physionet_deid_ner_transformer_deid.py | Training + ablation script |
| docs/api/datasets/pyhealth.datasets.PhysioNetDeIDDataset.rst | Dataset docs |
| docs/api/tasks/pyhealth.tasks.DeIDNERTask.rst | Task docs |
| docs/api/models/pyhealth.models.TransformerDeID.rst | Model docs |

@joshuasteier
Collaborator

Hi @mtmckenna,

The dataset, task, and model all follow the BaseDataset, BaseTask, and BaseModel contracts correctly, the __init__.py changes are purely additive, and the tests are substantive (concrete assertions about specific patients and label positions, not just smoke tests). The honest reporting of ablation numbers that are below the paper's results, with a hypothesis about why, is appreciated.

A few suggestions, in rough order of importance. None of these are blockers.

_index_data writes a CSV into root

In PhysioNetDeIDDataset._index_data, you write physionet_deid_metadata.csv into the root directory. PhysioNet data is credentialed and users may have the data directory mounted read-only. Could you write this to the cache directory instead, or to a tmp path, and point the YAML config at the cached location? Your test tearDownClass cleans up the file, which hints that you already noticed this is awkward.
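A minimal sketch of what I mean, assuming a simple two-column index (the column names and helper are hypothetical):

```python
import csv
import tempfile
from pathlib import Path

def index_to_tmpdir(records):
    """Write the index CSV to a fresh per-instance temp directory
    instead of the (possibly read-only) data root."""
    tmp_dir = Path(tempfile.mkdtemp(prefix="pyhealth_deid_"))
    path = tmp_dir / "physionet_deid_metadata.csv"
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["patient_id", "text_path"])
        writer.writerows(records)
    return path

meta = index_to_tmpdir([("p1", "id.text"), ("p2", "id.text")])
```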

phi_spans_in_original alignment is fragile on real data

The anchoring approach (split the de-identified text on PHI tags and use the non-PHI chunks as landmarks in the original) works well when non-PHI chunks are unique. If a short non-PHI chunk happens to appear earlier in the original than where it semantically belongs, orig.find(before, pos) will lock onto the wrong position and shift all subsequent spans. Your ablation F1 of 80.9 percent says it mostly works, but could you add a defensive test case to test_physionet_deid.py where a non-PHI word repeats across PHI boundaries? Even if the current behavior is correct, it would make future regressions visible.
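A defensive test could pin down the cursor behavior directly. This is a minimal sketch of the failure mode with hypothetical strings; the real phi_spans_in_original API will look different:

```python
# Adversarial input for an alignment test: the non-PHI anchor "saw"
# appears twice in the original note, once before where it belongs.
orig = "Dr Lee saw Ann. Nurse saw Bob today."

# A naive search for the second "saw" that does not advance its cursor
# past already-matched text locks onto the first occurrence.
cursor = orig.find("Ann") + len("Ann")  # position after the first PHI span
wrong = orig.find("saw")                # 7: the first "saw"
right = orig.find("saw", cursor)        # 22: the "saw" after "Nurse"
assert wrong != right
```

A regression test can assert that the aligned span for the second tag starts after `cursor`, not at the earlier occurrence.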

classify_phi fallback

The function falls back to "NAME" if nothing matches. For real PhysioNet tags most are covered, but a silent catch-all means mislabeled tags become NAME by default. Consider logging a warning (once per unknown tag string) when the fallback fires, so users of real data can see if the heuristic is missing cases.

Task output schema inconsistency

input_schema uses {"text": "text"} (string value) but output_schema uses {"labels": TextProcessor} (class value). Could you make them consistent with whatever pattern other text-output tasks in pyhealth/tasks/ use? If both forms are supported, a short comment explaining why the task uses the class form for labels would help future readers.

deidentify test coverage

test_deidentify_custom_redact_marker asserts that [REDACTED] is absent when redact="[PHI]", but does not assert that [PHI] is actually present in the output. If the untrained model happens to predict all-O on that short input, both strings would be absent and the test would pass vacuously. Could you either add a second assertion or use a longer input where PHI prediction is likely enough to make the test meaningful?
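One way to make the assertion non-vacuous, sketched with a hypothetical checker (word-level only, which matches how I read the current test):

```python
def check_deid_output(original: str, redacted: str, marker: str = "[PHI]") -> bool:
    """Every output word must be either an original word or the custom
    redact marker; any other token means the wrong marker was used."""
    orig_words = set(original.split())
    return all(word == marker or word in orig_words
               for word in redacted.split())

assert check_deid_output("Dr Smith saw the patient", "Dr [PHI] saw the patient")
assert not check_deid_output("Dr Smith saw the patient",
                             "Dr [REDACTED] saw the patient")
```

With a check like this, an all-O prediction still passes, but a prediction using the wrong marker fails instead of passing silently.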

Small items

  1. The commit history is 16 commits including some WIP messages ("WIP e2e script", "add deidentify method so i can test it out manually"). Not a blocker, but if you squash before merge it will make git log easier to follow.

  2. The windowing test TestDeIDNERTaskWindowing.test_windowing_produces_more_samples asserts > 4, but you have 11 notes total. Should that threshold be > 11?

  3. In the example, compute_metrics uses p[1:].max() as the non-O confidence. Consider the more standard 1 - p[0] instead. The current version implicitly assumes the best single non-O class score represents the confidence of "this is PHI," which is defensible but worth a comment.
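To make item 3 concrete, here is the difference between the two scores on an illustrative per-token probability vector (p[0] is the O class):

```python
p = [0.5, 0.3, 0.2]      # illustrative softmax output for one token

best_non_o = max(p[1:])  # best single non-O class score: 0.3
any_non_o = 1 - p[0]     # total non-O probability mass: 0.5

# The two disagree whenever non-O mass is spread across classes.
assert any_non_o > best_non_o
```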

Once the _index_data write location is resolved, I think this is ready to merge. The contribution is well documented, the tests are real, and the paper reproduction is honest.

@mtmckenna
Contributor Author

Thank you for the thorough and thoughtful review! I have addressed the suggestions; please let me know if you have any further feedback. Also, I can squash the commits before merge, or I'm happy to use squash-and-merge if that's enabled on the repo.

  1. CSV write location: _index_data now writes to a temp directory instead of root, so read-only data directories work.
  2. PHI span alignment tests: Added two test cases for repeated non-PHI text across PHI boundaries (repeated word before/after a tag, and same anchor between two tags).
  3. classify_phi fallback: Now logs a warning (once per unknown tag string) when the NAME fallback fires.
  4. Task schema consistency: Both input_schema and output_schema use TextProcessor now, matching the medical_coding.py pattern.
  5. deidentify test: Added assertion that every output word is either an original word or the custom redact marker, so the test can't pass erroneously.
  6. Windowing test threshold: Changed from > 4 to > 11; thank you for catching that!
  7. compute_metrics confidence: Switched to 1 - p[0] with a comment explaining the choice.

@joshuasteier
Collaborator

Thank you, @mtmckenna. LGTM.
One tiny nit, not a blocker: the tempfile.mkdtemp directory created in PhysioNetDeIDDataset.__init__ isn't explicitly cleaned up anywhere. The test class cleans up its own cache_dir, but a long-running notebook that instantiates the dataset multiple times would leak pyhealth_deid_* directories under /tmp. If you want, you can add a __del__ or a close() method that calls shutil.rmtree(self._tmp_dir, ignore_errors=True). Or leave it and let the OS handle cleanup. Your call.

For the squash question: yes, please squash before merge. It will keep git log --oneline pyhealth/datasets/physionet_deid.py readable for anyone tracing why the dataset works the way it does.

Approving.

@mtmckenna
Contributor Author

Good suggestion on the __del__ method! Implemented. Also squashed down to one commit.

Thank you again for your help!

@joshuasteier joshuasteier merged commit 5f63039 into sunlabuiuc:master Apr 19, 2026
1 check passed
paul-nguyen-1 pushed a commit to paul-nguyen-1/PyHealth that referenced this pull request Apr 20, 2026