# Data preprocessing

## Prepare data for event classification
The originally released data consists of a set of `.txt` and `.ann` files: the `.txt` files are the unstructured clinical notes, and the `.ann` files hold the annotations, including medication entries, event classes, and context classes.
For event classification, the first step was to reformat all these files into a single CSV file with three columns: `file`, `text`, and `label`. The `file` column is the combined ID for the clinical note (e.g. `/280-03_T2`), `text` is the chunked text from the note, and `label` is the ground-truth annotation.
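
The three-column layout can be illustrated with a small sketch. The windowing logic of `extract_data_fixed_len.py` is not shown in this notebook, so the code below simply assumes a fixed number of words on each side of the medication mention. The helpers `window_around` and `make_row`, the `n_words` parameter, and the toy sentence are all hypothetical and are not taken from the dataset or the script.

```python
import csv
import io


def window_around(text, start, end, n_words=10):
    """Return the mention text[start:end] with up to n_words words on each side."""
    left = text[:start].split()[-n_words:]
    right = text[end:].split()[:n_words]
    return " ".join(left + [text[start:end]] + right)


def make_row(note_id, ann_id, text, start, end, label, n_words=10):
    """Build one CSV row; the ID format mirrors the `/280-03_T2` style seen in the data."""
    return {
        "file": f"/{note_id}_{ann_id}",
        "text": window_around(text, start, end, n_words),
        "label": label,
    }


# Toy sentence (made up for illustration, not a real clinical note).
note = "The patient takes drug A 500 mg twice daily and aspirin 325 mg once daily."
start = note.index("aspirin")
row = make_row("000-00", "T1", note, start, start + len("aspirin"),
               "NoDisposition", n_words=3)

# Write the row in the same three-column layout as train.csv.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["file", "text", "label"])
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```

A word-based window is only one possible choice; a token-based or character-based window would follow the same pattern.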

In [1]:
!pip install mendelai-brat-parser

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/opt/rh/rh-python36/root/usr/bin/python3 -m pip install --upgrade pip' command.


In [37]:
!rm -r  n2c2_data/*processed
!python n2c2_code/extract_data_fixed_len.py n2c2_data/raw/train n2c2_data/train_processed n2c2_data/train.csv
!python n2c2_code/extract_data_fixed_len.py n2c2_data/raw/dev n2c2_data/dev_processed n2c2_data/dev.csv
!python n2c2_code/extract_data_fixed_len.py n2c2_data/raw/test n2c2_data/test_processed n2c2_data/test.csv

all: 6196
all: 1033
all: 1764


For event classification, the resulting data set looks like this:

In [38]:
import pandas as pd
pd.read_csv('n2c2_data/train.csv', usecols=['file', 'text', 'label'])

Unnamed: 0,file,text,label
0,/280-03_T2,no shortness of breath. She does get occasiona...,NoDisposition
1,/280-03_T3,She does get occasional abdominal pulsations. ...,NoDisposition
2,/280-03_T4,pulsations. She does get epigastric discomfort...,NoDisposition
3,/280-03_T5,discomfort. Current medications include ranola...,NoDisposition
4,/280-03_T6,"mg twice daily, aspirin 325 mg once daily, Pla...",NoDisposition
...,...,...,...
6191,/289-02_T8,mainly ocular. Prior Raynaud's syndrome. Recom...,Disposition
6192,/289-02_T9,days when he had a couple of minor spells. Med...,NoDisposition
6193,/289-02_T10,he had a couple of minor spells. Medications: ...,NoDisposition
6194,/289-02_T11,testing correlating well with a slight increas...,Disposition


## Prepare data for context classification
For context classification, we used the same data splits as for event classification. There are 5 context classification subtasks: `Action`, `Negation`, `Temporality`, `Certainty`, and `Actor`. For each subtask, we created a folder containing `train.csv`, `dev.csv`, and `test.csv`. The data format is the same as for event classification.

In [39]:
!python n2c2_code/transfer_context_data.py n2c2_data

ori: 6196
new: 1191
Done: .//n2c2_data/n2c2_data_Action/train.csv
Done: .//n2c2_data/n2c2_data_Negation/train.csv
Done: .//n2c2_data/n2c2_data_Temporality/train.csv
Done: .//n2c2_data/n2c2_data_Certainty/train.csv
Done: .//n2c2_data/n2c2_data_Actor/train.csv
ori: 1033
new: 221
Done: .//n2c2_data/n2c2_data_Action/dev.csv
Done: .//n2c2_data/n2c2_data_Negation/dev.csv
Done: .//n2c2_data/n2c2_data_Temporality/dev.csv
Done: .//n2c2_data/n2c2_data_Certainty/dev.csv
Done: .//n2c2_data/n2c2_data_Actor/dev.csv
ori: 1764
new: 1764
Done: .//n2c2_data/n2c2_data_Action/test.csv
Done: .//n2c2_data/n2c2_data_Negation/test.csv
Done: .//n2c2_data/n2c2_data_Temporality/test.csv
Done: .//n2c2_data/n2c2_data_Certainty/test.csv
Done: .//n2c2_data/n2c2_data_Actor/test.csv
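
A quick way to confirm that all 15 files were written is to rebuild the expected paths and check which ones exist. This is a minimal sketch that assumes the folder layout shown in the log above (`n2c2_data/n2c2_data_<Subtask>/<split>.csv`). The helper name `context_csv_paths` is hypothetical.

```python
from pathlib import Path

TASKS = ["Action", "Negation", "Temporality", "Certainty", "Actor"]
SPLITS = ["train", "dev", "test"]


def context_csv_paths(root="n2c2_data"):
    """List the CSV path expected for every subtask/split combination."""
    return [
        Path(root) / f"n2c2_data_{task}" / f"{split}.csv"
        for task in TASKS
        for split in SPLITS
    ]


paths = context_csv_paths()
missing = [p for p in paths if not p.exists()]
print(f"{len(paths) - len(missing)} of {len(paths)} expected files found")
for p in missing:
    print("missing:", p)
```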


The context classification data looks like this:

In [40]:
import pandas as pd
pd.read_csv('./n2c2_data/n2c2_data_Action/train.csv', usecols=['file', 'text', 'label'])

Unnamed: 0,file,text,label
0,/280-03_T8,data. Although theoretically she should not ha...,Start
1,/357-01_T1,The patient was found unconscious on the floor...,UniqueDose
2,/371-05_T35,weight loss and congratulated patient on recen...,Start
3,/371-05_T37,w/ improvement. cont topical estradiol f/u in ...,Increase
4,/371-05_T39,"OA, b/l knee OA s/p knee replacements. Recentl...",Start
...,...,...,...
1186,/378-04_T24,she was having at that time are not clear. U/A...,Stop
1187,/289-02_T7,from Singulair. Probable mild occult gastroeso...,Start
1188,/289-02_T8,mainly ocular. Prior Raynaud's syndrome. Recom...,Start
1189,/289-02_T11,testing correlating well with a slight increas...,Start


# Model training
The next step was to train the model on the annotated data. We implemented the classification model with `transformers`, an open-source Python library for transformer models provided by Hugging Face. Note that the following code is for event classification; for context classification, modify the data path in the script `n2c2_code/run_training.sh`.

In [41]:
# run roberta
!sh n2c2_code/run_training.sh n2c2_roberta roberta-base roberta-base
# run bioclinicalbert
!sh n2c2_code/run_training.sh n2c2_biocl emilyalsentzer/Bio_ClinicalBERT emilyalsentzer/Bio_ClinicalBERT

+ dir_name=n2c2_roberta
+ model_name=roberta-base
+ tokenizer_name=roberta-base
+ output_path=output_n2c2_roberta
+ log_path=log_n2c2_roberta
+ '[' '!' -d log_n2c2_roberta ']'
+ batch_size=32
+ grad=1
+ epoch=10
+ max_seq_len=128
+ save_steps=100
+ train_file=train.csv
+ dev_file=dev.csv
+ test_file=test.csv
+ metric=f1_micro
+ learning_rate=2e-5
+ for i in 42 62 82
+ data_path=n2c2_data
+ output_dir=output_n2c2_roberta/model_2e-5_42
+ '[' '!' -d output_n2c2_roberta/model_2e-5_42 ']'
+ for i in 42 62 82
+ data_path=n2c2_data
+ output_dir=output_n2c2_roberta/model_2e-5_62
+ '[' '!' -d output_n2c2_roberta/model_2e-5_62 ']'
+ for i in 42 62 82
+ data_path=n2c2_data
+ output_dir=output_n2c2_roberta/model_2e-5_82
+ '[' '!' -d output_n2c2_roberta/model_2e-5_82 ']'
+ dir_name=n2c2_biocl
+ model_name=emilyalsentzer/Bio_ClinicalBERT
+ tokenizer_name=emilyalsentzer/Bio_ClinicalBERT
+ output_path=output_n2c2_biocl
+ log_path=log_n2c2_biocl
+ '[' '!' -d log_n2c2_biocl ']'
+ batch_size=32
+ grad=1
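
The training log sets `metric=f1_micro`. How `run_classification.py` computes this metric is not shown here, so the following is only a generic sketch of micro-averaged F1. It pools true positives, false positives, and false negatives over all classes before computing precision and recall. For single-label classification this pooled score equals plain accuracy. The function name `micro_f1` and the toy labels are illustrative.

```python
from collections import Counter


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1 once."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[gold] += 1   # the gold class gains a false negative
    tp_sum, fp_sum, fn_sum = sum(tp.values()), sum(fp.values()), sum(fn.values())
    precision = tp_sum / (tp_sum + fp_sum) if tp_sum + fp_sum else 0.0
    recall = tp_sum / (tp_sum + fn_sum) if tp_sum + fn_sum else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels, made up for illustration.
gold = ["Disposition", "NoDisposition", "NoDisposition", "Undetermined"]
pred = ["Disposition", "NoDisposition", "Disposition", "NoDisposition"]
score = micro_f1(gold, pred)
print(score)
```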


# Model testing/inference
For each of `RoBERTa` and `BioClinicalBERT`, we selected the checkpoint that achieved the best performance on the dev set, and then we tested the selected checkpoint on the test data.
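
Checkpoint selection can be sketched as taking the maximum over a mapping from checkpoint name to dev score. The scores below are hypothetical placeholders, not results from our runs. The tie-breaking rule (prefer the earlier step) is an assumption made for this example, and `best_checkpoint` is an illustrative name.

```python
def best_checkpoint(dev_scores):
    """Return the checkpoint with the highest dev score; ties go to the earliest step."""
    def step(name):
        # "checkpoint-1100" -> 1100
        return int(name.rsplit("-", 1)[1])

    return max(dev_scores, key=lambda name: (dev_scores[name], -step(name)))


# Hypothetical dev-set scores keyed by checkpoint directory name.
dev_scores = {
    "checkpoint-1000": 0.80,
    "checkpoint-1100": 0.85,
    "checkpoint-1200": 0.85,
}
selected = best_checkpoint(dev_scores)
print(selected)
```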

In [11]:
# Here we run the roberta checkpoint. For bioclinicalbert, modify the model_path in the script.
!sh n2c2_code/run_test.sh

+ batch_size=48
+ train_file=train.csv
+ dev_file=dev.csv
+ test_file=test.csv
+ data_path=n2c2_data/
+ model_path=output_n2c2_roberta/model_2e-5_42/checkpoint-1100/
+ tokenizer_name=output_n2c2_roberta/model_2e-5_42/checkpoint-1100/
+ output_dir=./pred_rb
+ max_seq_len=128
+ metric=f1_micro
+ '[' '!' -d ./pred_rb ']'
+ python n2c2_code/model/run_classification.py --fp16 --max_seq_len 128 --model_name_or_path output_n2c2_roberta/model_2e-5_42/checkpoint-1100/ --config_name output_n2c2_roberta/model_2e-5_42/checkpoint-1100/ --tokenizer_name output_n2c2_roberta/model_2e-5_42/checkpoint-1100/ --task_name n2c2 --data_dir n2c2_data/ --train_file train.csv --dev_file dev.csv --test_file test.csv --custom_metric f1_micro --output_dir ./pred_rb --per_device_train_batch_size 48 --per_device_eval_batch_size 48 --overwrite_output_dir --overwrite_cache --logging_steps 1 --do_predict
2022-11-29 12:04:57.248717: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dyna

+ date
Tue Nov 29 12:05:46 EST 2022


The output file is named `test_results_n2c2.txt` and contains two tab-separated columns: the row indices and the predictions. Training and testing follow the same procedure for event classification and for each of the 5 context classification subtasks (6 classification tasks in total), so every task writes its output to a file with the same name. We therefore manually renamed each output file after its task, as listed below.
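
The renaming step could also be scripted. The sketch below assumes each test run leaves `test_results_n2c2.txt` in the output directory and that the file is renamed before the next task overwrites it. The helper `rename_prediction` is hypothetical, and the demo runs in a temporary directory with a made-up prediction line.

```python
import tempfile
from pathlib import Path

DEFAULT_NAME = "test_results_n2c2.txt"  # generic filename written by every test run


def rename_prediction(output_dir, task):
    """Rename the generic prediction file in output_dir to <task>.txt."""
    src = Path(output_dir) / DEFAULT_NAME
    dst = Path(output_dir) / f"{task}.txt"
    src.rename(dst)
    return dst


# Demo in a temporary directory so no real predictions are touched.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / DEFAULT_NAME).write_text("0\tNoDisposition\n")
    new_path = rename_prediction(tmp, "Event")
    renamed_name = new_path.name
    renamed_exists = new_path.exists()
    old_exists = (Path(tmp) / DEFAULT_NAME).exists()

print(renamed_name)
```

In practice the call would be `rename_prediction("pred_rb", "Event")` after the event run, then the same with `"Action"`, `"Actor"`, and so on after each context run.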

In [12]:
!ls pred_rb

Action.txt  Actor.txt  Certainty.txt  Event.txt  Negation.txt  Temporality.txt


# Evaluation
We evaluated the model performance by using the evaluation script provided by the shared task organizers. Because the model output file format is different from the required file format for the evaluation script, we need to reformat our output files first.

In [42]:
# reformat our output and store the reformatted output files in `pred_rb_final`
!sh n2c2_code/reformat_results.sh

+ for x in Action Actor Certainty Negation Temporality
+ python n2c2_code/merge_id_pred.py n2c2_data/test.csv pred_rb/Action.txt pred_rb/Action_fileid.txt
+ for x in Action Actor Certainty Negation Temporality
+ python n2c2_code/merge_id_pred.py n2c2_data/test.csv pred_rb/Actor.txt pred_rb/Actor_fileid.txt
+ for x in Action Actor Certainty Negation Temporality
+ python n2c2_code/merge_id_pred.py n2c2_data/test.csv pred_rb/Certainty.txt pred_rb/Certainty_fileid.txt
+ for x in Action Actor Certainty Negation Temporality
+ python n2c2_code/merge_id_pred.py n2c2_data/test.csv pred_rb/Negation.txt pred_rb/Negation_fileid.txt
+ for x in Action Actor Certainty Negation Temporality
+ python n2c2_code/merge_id_pred.py n2c2_data/test.csv pred_rb/Temporality.txt pred_rb/Temporality_fileid.txt
+ python n2c2_code/reformat_results_e2e.py n2c2_data/raw/test n2c2_data/test.csv pred_rb pred_rb_final
Done! output is pred_rb_final
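
The first part of this step pairs each prediction with the `file` ID of the corresponding row in `test.csv`. The internals of `merge_id_pred.py` are not shown here, so the following is only a sketch of the idea. It assumes the prediction file has no header row and that its index column is the zero-based row position in `test.csv`. The function `merge_ids` and the toy rows are illustrative.

```python
import csv
import io


def merge_ids(test_csv, pred_tsv):
    """Pair each (index, prediction) line with the `file` ID at that row of the test CSV."""
    files = [row["file"] for row in csv.DictReader(test_csv)]
    merged = []
    for row in csv.reader(pred_tsv, delimiter="\t"):
        if not row:  # skip blank lines
            continue
        idx, pred = row
        merged.append((files[int(idx)], pred))
    return merged


# Toy inputs, made up for illustration; real inputs would be opened from disk.
test_csv = io.StringIO(
    "file,text,label\n"
    "/000-00_T1,toy text a,NoDisposition\n"
    "/000-00_T2,toy text b,Disposition\n"
)
pred_tsv = io.StringIO("0\tNoDisposition\n1\tDisposition\n")

merged = merge_ids(test_csv, pred_tsv)
for file_id, pred in merged:
    print(file_id, pred, sep="\t")
```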


In [46]:
# run the evaluation script
!python n2c2_code/eval_script.py n2c2_data/raw/test_gold pred_rb_final

Files skipped in /home/yguo262/n2c2_2022_classification/n2c2_data/raw/test_gold:
169-02.ann, 186-04.ann

******************** Evaluation n2c2 2022 Track 1 ********************
************* Contextualized Medication Event Extraction *************

*********************** Medication Extraction ************************
                      ------- strict -------    ------ lenient -------
                      Prec.   Rec.    F(b=1)    Prec.   Rec.    F(b=1)
                Drug  1.0000  1.0000  1.0000    1.0000  1.0000  1.0000


************************ Event Classification ************************
                      ------- strict -------    ------ lenient -------
                      Prec.   Rec.    F(b=1)    Prec.   Rec.    F(b=1)
         Disposition  0.2036  0.2145  0.2089    0.2036  0.2145  0.2089
       Nodisposition  0.7517  0.7511  0.7514    0.7517  0.7511  0.7514
        Undetermined  0.0381  0.0328  0.0352    0.0381  0.0328  0.0352
                      ------------------