synnobra/NorSynthClinical-PHI

The NorSynthClinical PHI Corpus

A corpus of clinical text which can be used as a reference standard for de-identification of Norwegian clinical text.

The reference standard corpus is synthetic and based on the NorSynthClinical corpus. NorSynthClinical is a synthetic corpus describing patients’ family history relating to cases of cardiac disease, presented and described here: https://github.com/ltgoslo/NorSynthClinical.

To create this reference standard, the NorSynthClinical corpus was extended with personal information and then annotated with the following tags:

  • First_Name
  • Last_Name
  • Age
  • Health_Care_Unit
  • Phone_Number
  • Social_Security_Number
  • Date_Full
  • Date_Part
  • Location

The verified version of the reference standard is the "reference_standard_annotated.txt" file.
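
As a rough illustration of how the tag set above might be used, the following Python sketch counts annotated spans per tag and masks them with placeholders. It assumes that annotations are written as inline XML-style spans such as `<Age>54</Age>`; this README does not specify the annotation format, so check "reference_standard_annotated.txt" and adapt the pattern if the format differs. The helper names `count_phi_tags` and `mask_phi` and the sample sentence are made up for this example and are not part of the corpus.

```python
import re
from collections import Counter

# The nine tag names listed in this README.
PHI_TAGS = (
    "First_Name", "Last_Name", "Age", "Health_Care_Unit", "Phone_Number",
    "Social_Security_Number", "Date_Full", "Date_Part", "Location",
)

# ASSUMPTION: annotations are inline XML-style spans, e.g. <Age>54</Age>.
# The real file may use a different format; adjust this pattern if so.
SPAN_PATTERN = re.compile(
    r"<(?P<tag>" + "|".join(PHI_TAGS) + r")>(?P<text>.*?)</(?P=tag)>",
    re.DOTALL,
)


def count_phi_tags(text: str) -> Counter:
    """Count annotated spans per tag in a string of annotated text."""
    return Counter(m.group("tag") for m in SPAN_PATTERN.finditer(text))


def mask_phi(text: str) -> str:
    """Replace each annotated span with a [Tag_Name] placeholder."""
    return SPAN_PATTERN.sub(lambda m: f"[{m.group('tag')}]", text)


# Made-up sample sentence, not taken from the corpus.
sample = "<First_Name>Kari</First_Name> er <Age>54</Age> år."
counts = count_phi_tags(sample)
masked = mask_phi(sample)
```

To run this on the corpus itself, read the file with `open("reference_standard_annotated.txt", encoding="utf-8").read()` and pass the resulting string to either function.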

The reference standard was created by Synnøve Bråten as part of a master's thesis in the Joint Master's Programme in Health Informatics at Stockholm University/Karolinska Institutet. The thesis describing the work is available here: https://daisy.dsv.su.se/fil/visa?id=230054.

See also Bråten, S., Wie, W., & Dalianis, H. (2021). Creating and Evaluating a Synthetic Norwegian Clinical Corpus for De-Identification. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa) (pp. 222-230). https://aclanthology.org/2021.nodalida-main.22.pdf
