A corpus of synthetic clinical text that can be used as a reference standard for de-identification of Norwegian clinical text.
The reference standard corpus is synthetic and based on the NorSynthClinical corpus. NorSynthClinical is a synthetic corpus describing patients’ family history relating to cases of cardiac disease, presented and described here: https://github.com/ltgoslo/NorSynthClinical.
To create this reference standard, the NorSynthClinical corpus was extended with personal information and then annotated using the following tags:
- First_Name
- Last_Name
- Age
- Health_Care_Unit
- Phone_Number
- Social_Security_Number
- Date_Full
- Date_Part
- Location
The verified version of the reference standard is the "reference_standard_annotated.txt" file.
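As a rough illustration of how the tag set above might be used, the following Python sketch collects annotated spans and replaces them with placeholders. The on-disk annotation format is not described in this README, so the sketch assumes inline XML-style spans such as `<First_Name>...</First_Name>`. That format, the helper names `extract_spans`, `count_tags` and `mask`, and the sample sentence are all hypothetical, so check them against the actual file before relying on this.

```python
import re
from collections import Counter

# The nine tag names listed in this README.
TAGS = [
    "First_Name", "Last_Name", "Age", "Health_Care_Unit", "Phone_Number",
    "Social_Security_Number", "Date_Full", "Date_Part", "Location",
]

# ASSUMPTION: annotations are inline XML-style spans, e.g.
# <First_Name>Kari</First_Name>. Adjust the pattern if the file differs.
SPAN_RE = re.compile(r"<(%s)>(.*?)</\1>" % "|".join(TAGS), re.DOTALL)


def extract_spans(text):
    """Return (tag, surface_text) pairs for every annotated span, in order."""
    return [(m.group(1), m.group(2)) for m in SPAN_RE.finditer(text)]


def count_tags(text):
    """Count how many spans carry each tag."""
    return Counter(tag for tag, _ in extract_spans(text))


def mask(text):
    """Replace each annotated span with a placeholder such as [First_Name]."""
    return SPAN_RE.sub(lambda m: "[" + m.group(1) + "]", text)


# Hypothetical sample sentence, not taken from the corpus.
sample = (
    "Pasienten <First_Name>Kari</First_Name> "
    "<Last_Name>Nordmann</Last_Name> er <Age>54</Age> år."
)
spans = extract_spans(sample)
masked = mask(sample)
```

To try this on the corpus itself, read the file with `open("reference_standard_annotated.txt", encoding="utf-8").read()` and pass the resulting string to `count_tags` or `mask`, after confirming that the assumed span format matches.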
The reference standard was created by Synnøve Bråten as part of a master's thesis in the Joint Master's Programme in Health Informatics at Stockholm University/Karolinska Institutet. The master's thesis describing the work is available here: https://daisy.dsv.su.se/fil/visa?id=230054.
See also Bråten, S., Wie, W., & Dalianis, H. (2021). Creating and Evaluating a Synthetic Norwegian Clinical Corpus for De-Identification. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa) (pp. 222-230). https://aclanthology.org/2021.nodalida-main.22.pdf