Add PhysioNet De-Identification dataset, NER task, and TransformerDeID model #981
Conversation
|
Hi @mtmckenna, The dataset, task, and model all follow the A few suggestions, in rough order of importance. None of these are blockers.
In
The anchoring approach (split the de-identified text on PHI tags and use the non-PHI chunks as landmarks in the original) works well when non-PHI chunks are unique. If a short non-PHI chunk happens to appear earlier in the original than where it semantically belongs,
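The anchoring idea described above can be sketched roughly as follows. This is only an illustration, not the code in the PR: the `[**...**]` tag syntax, the `PHI_TAG` pattern, and the `find_phi_spans` helper are all assumptions made for the example.

```python
import re

# Assumption: PHI placeholders look like [**NAME**]; the real tag syntax may differ.
PHI_TAG = re.compile(r"\[\*\*.*?\*\*\]")


def find_phi_spans(original, deidentified):
    """Return (start, end) character spans of `original` that the tags replaced."""
    chunks = PHI_TAG.split(deidentified)
    if not original.startswith(chunks[0]):
        raise ValueError("texts diverge before the first PHI tag")
    cursor = len(chunks[0])
    spans = []
    for i, chunk in enumerate(chunks[1:], start=1):
        if i == len(chunks) - 1 and chunk == "":
            # The text ends with a tag, so the PHI runs to the end.
            pos = len(original)
        else:
            # Take the first occurrence at or after the cursor. This is the
            # weak point: a short chunk can also occur inside the PHI itself.
            # Adjacent tags (empty chunk) cannot be separated by this sketch.
            pos = original.find(chunk, cursor)
            if pos == -1:
                raise ValueError(f"anchor {chunk!r} not found in original")
        spans.append((cursor, pos))
        cursor = pos + len(chunk)
    return spans


original = "Patient John Smith seen on 2020-01-05 by Dr. Lee."
deidentified = "Patient [**NAME**] seen on [**DATE**] by Dr. [**NAME**]."
phi_texts = [original[s:e] for s, e in find_phi_spans(original, deidentified)]

# A short anchor (a single space) matches inside the first name.
ambiguous = "Name: Mary Ann Smith MD"
ambiguous_deid = "Name: [**NAME**] [**NAME**] MD"
ambiguous_texts = [ambiguous[s:e] for s, e in find_phi_spans(ambiguous, ambiguous_deid)]
```

In the first pair every anchor is unique and the spans recover "John Smith", "2020-01-05", and "Lee". In the second pair the anchor between the two tags is a single space, so the sketch splits the name as "Mary" and "Ann Smith" even if the annotator meant "Mary Ann" and "Smith".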
The function falls back to
Task output schema inconsistency
Small items
Once the |
|
Thank you for the thorough and thoughtful review! I addressed the suggestions; please let me know if you have any further feedback. Also, I can squash the commits before merge, or I'm happy to use squash-and-merge if that's enabled on the repo.
|
joshuasteier left a comment
Thank you, @mtmckenna. LGTM.
One tiny nit, not a blocker: the tempfile.mkdtemp directory created in PhysioNetDeIDDataset.__init__ isn't explicitly cleaned up anywhere. The test class cleans up its own cache_dir, but a long-running notebook that instantiates the dataset multiple times would leak pyhealth_deid_* directories under /tmp. If you want, you can add a __del__ or a close() method that calls shutil.rmtree(self._tmp_dir, ignore_errors=True). Or leave it and let the OS handle cleanup. Your call.
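A minimal sketch of the close() option, assuming the directory path is kept in self._tmp_dir as described. `TempDirOwner` is a hypothetical stand-in class for illustration, not PhysioNetDeIDDataset itself, and the context-manager methods are an optional extra.

```python
import os
import shutil
import tempfile


class TempDirOwner:
    """Hypothetical stand-in for a dataset that creates a scratch directory."""

    def __init__(self):
        # Same prefix as the leaked directories mentioned in the review.
        self._tmp_dir = tempfile.mkdtemp(prefix="pyhealth_deid_")

    def close(self):
        # ignore_errors=True makes repeated calls harmless.
        shutil.rmtree(self._tmp_dir, ignore_errors=True)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()


with TempDirOwner() as owner:
    tmp_path = owner._tmp_dir
    existed_inside = os.path.isdir(tmp_path)
exists_after = os.path.exists(tmp_path)
```

An explicit close() is generally more predictable than __del__, since Python does not guarantee when (or whether) __del__ runs at interpreter shutdown.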
For the squash question: yes, please squash before merge. It will keep git log --oneline pyhealth/datasets/physionet_deid.py readable for anyone tracing why the dataset works the way it does.
Approving.
|
Good suggestion on the Thank you again for your help! |
Contributor
Matt McKenna (mtm16@illinois.edu)
Contribution Type
Full Pipeline: Dataset + Task + Model (Option 4)
Paper
Johnson, Alistair E.W., et al. "Deidentification of free-text medical records using pre-trained bidirectional
transformers." Proceedings of the ACM Conference on Health, Inference, and Learning (CHIL), 2020.
https://doi.org/10.1145/3368555.3384455
Description
Implements BERT-based clinical text de-identification as a PyHealth pipeline. Given clinical notes with protected health information (PHI), the model performs token-level NER to detect and classify PHI into 7 categories (NAME, DATE, LOCATION, AGE, CONTACT, ID, PROFESSION) using BIO tagging.
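As a rough illustration of the BIO scheme over the seven categories above, the sketch below builds a label set and tags a toy sentence. The label ordering, the `bio_tags` helper, and the example tokens are assumptions for this example and may not match the PR's implementation.

```python
PHI_CATEGORIES = ["NAME", "DATE", "LOCATION", "AGE", "CONTACT", "ID", "PROFESSION"]

# One "O" label plus a B- and an I- label per category: 1 + 2 * 7 = 15 labels.
LABELS = ["O"] + [f"{prefix}-{cat}" for cat in PHI_CATEGORIES for prefix in ("B", "I")]


def bio_tags(num_tokens, entities):
    """Build BIO tags from (start_token, end_token_exclusive, category) triples."""
    tags = ["O"] * num_tokens
    for start, end, category in entities:
        tags[start] = f"B-{category}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{category}"      # continuation tokens
    return tags


# Synthetic example sentence, not taken from any dataset.
tokens = ["Patient", "John", "Smith", "seen", "on", "2020-01-05"]
tags = bio_tags(len(tokens), [(1, 3, "NAME"), (5, 6, "DATE")])
```

Here "John Smith" becomes B-NAME, I-NAME and the date becomes a single B-DATE token, with everything else tagged O.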
Data Access
The test data in test-resources/core/physionet_deid/ is synthetic (fake). Real data requires PhysioNet credentialed access:
https://physionet.org/content/deidentifiedmedicaltext/1.0/
Ablation Results
Our results are worse than the original paper's. Our hypothesis is that this is because we use only the PhysioNet data and do not add in the other datasets.
Files to Review
pyhealth/datasets/physionet_deid.py
pyhealth/datasets/configs/physionet_deid.yaml
pyhealth/tasks/deid_ner.py
pyhealth/models/transformer_deid.py
tests/core/test_physionet_deid.py
tests/core/test_transformer_deid.py
examples/physionet_deid_ner_transformer_deid.py
docs/api/datasets/pyhealth.datasets.PhysioNetDeIDDataset.rst
docs/api/tasks/pyhealth.tasks.DeIDNERTask.rst
docs/api/models/pyhealth.models.TransformerDeID.rst