The repo contains code for fine-tuning a pretrained language model (BERT, SciBERT, etc.) for the task of multi-class classification of uncertainty cues (a.k.a. hedges).
You can use the code to either:
- train and evaluate your own model (see Train and evaluate); or
- use my fine-tuned model to generate predictions on your data (see Predict).
I use the Simple Transformers and W&B packages to perform the fine-tuning.
The requirements are listed in the environment.yml file. It is recommended to create a virtual environment with conda (you need to have Anaconda or Miniconda installed):
$ conda env create -f environment.yml
$ conda activate hedgehog
HEDGEhog is trained and evaluated on the Szeged Uncertainty Corpus (Szarvas et al. 2012¹). The original sentence-level XML version of this dataset is available here.
The token-level version that is used in the current repo can be downloaded from here in the form of pickled pandas DataFrames. You can download either the split sets (`train.pkl`, 137 MB; `test.pkl`, 17 MB; `dev.pkl`, 17 MB) or the full dataset (`szeged_fixed.pkl`, 172 MB).
Each row in the df contains a token, its features (these are not relevant for HEDGEhog; they were used to train the baseline CRF model, see here), its sentence ID, and its label. The labels refer to different types of semantic uncertainty (Szarvas et al. 2012):
- Epistemic: the proposition is possible, but its truth-value cannot be decided at the moment. Example: She may be already asleep.
- Investigation: the proposition is in the process of having its truth-value determined. Example: She examined the role of NF-kappaB in protein activation.
- Doxatic: the proposition expresses beliefs and hypotheses, which may be known as true or false by others. Example: She believes that the Earth is flat.
- CoNdition: the proposition is true or false based on the truth-value of another proposition. Example: If she gets the job, she will move to Utrecht.
- Certain: the token is not an uncertainty cue.
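To get a feel for the token-level format, here is a minimal sketch with a toy DataFrame mimicking one sentence from the corpus. The column names (`sentence_id`, `token`, `label`) and the single-letter label strings are assumptions for illustration; inspect the real pickles with `df.columns` and `df["label"].unique()` to see the actual schema:

```python
import pandas as pd

# Toy frame mimicking the token-level layout: one row per token.
# Column names and label strings are assumptions, not the guaranteed schema.
df = pd.DataFrame({
    "sentence_id": [0, 0, 0, 0, 0],
    "token": ["She", "may", "be", "already", "asleep"],
    "label": ["C", "E", "C", "C", "C"],  # E = Epistemic, C = Certain
})

# Filter out the 'Certain' rows to see which tokens carry uncertainty cues.
cues = df[df["label"] != "C"]
print(cues[["token", "label"]].to_dict("records"))
# → [{'token': 'may', 'label': 'E'}]
```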
Here is the performance of my downloadable fine-tuned model on the test set:
| class | precision | recall | F1-score | support |
|---|---|---|---|---|
| Epistemic | 0.90 | 0.85 | 0.88 | 624 |
| Doxatic | 0.88 | 0.92 | 0.90 | 142 |
| Investigation | 0.83 | 0.86 | 0.84 | 111 |
| Condition | 0.85 | 0.87 | 0.86 | 86 |
| Certain | 1.00 | 1.00 | 1.00 | 104,751 |
| macro average | 0.89 | 0.90 | 0.89 | 105,714 |
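Per-token metrics of this kind can be computed with scikit-learn once you have flat lists of gold and predicted labels. A small sketch on toy data (the label strings here are illustrative, not the corpus's actual tag set):

```python
from sklearn.metrics import classification_report

# Toy gold and predicted token labels, flattened over all sentences.
y_true = ["C", "E", "C", "C", "D", "C"]
y_pred = ["C", "E", "C", "E", "D", "C"]

# Prints per-class precision/recall/F1 plus macro and weighted averages.
print(classification_report(y_true, y_pred, digits=2))
```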
You can use the data and the code to train your own model, for example with another pretrained language model as the basis or with different hyperparameters. To do this, follow these steps:
- Download the data and place `train.pkl`, `test.pkl` and `dev.pkl` in the `data/` directory.
- Add a dictionary with your new model args to the config.json file. See Simple Transformers for all the possible configuration options.
- Adjust the `--model_args`, `--model_type` and `--model_name` parameters in train_model.py. You can either change the default values in the script or pass your arguments on the command line; for example:
$ python train_model.py --model_args my_new_args --model_type roberta --model_name roberta-base
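The entry in config.json that `--model_args` points to might look like the sketch below. The key `my_new_args` matches the example command above, but the specific options and values are only an illustration; check the Simple Transformers documentation for the options it actually accepts:

```json
{
  "my_new_args": {
    "num_train_epochs": 3,
    "learning_rate": 3e-5,
    "train_batch_size": 16,
    "overwrite_output_dir": true
  }
}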
To evaluate your model, use the evaluate_model.py script. Adjust the `--model_type` and `--model_name` parameters for your trained model, and set the `--output` parameter to the path where you want to save the pickled model predictions. You can adjust the evaluate_model.py script to add additional evaluation metrics; see the docstring in the file and the Simple Transformers documentation for more details.
You can perform a sweep for hyperparameters optimization with the wandb_sweep.py script. See the docstring in the file, Simple Transformers documentation and W&B documentation for more details.
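A W&B sweep is driven by a configuration such as the following minimal sketch. The metric name, search method and parameter ranges are assumptions for illustration; wandb_sweep.py may expect different keys, so check the script's docstring and the W&B sweep documentation:

```yaml
method: bayes
metric:
  name: eval_loss
  goal: minimize
parameters:
  learning_rate:
    min: 0.00001
    max: 0.0001
  num_train_epochs:
    values: [2, 3, 4]
```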
To use my fine-tuned model for generating predictions on your own data, follow these steps:
- Prepare your data as a pickled DataFrame which contains the column 'sentence'. For each row in the df, the text in 'sentence' will be split on spaces and a label will be predicted for each token. A list with the predicted labels will be saved in a new column named 'predictions'.
- Download the `hedgehog` folder from here and place it in the `models/` directory. The folder contains the model `pytorch_model.bin` and info about the tokenizer, the vocabulary and the configuration.
- Run the predict.py script, indicating the path to your pickled data (alternatively, edit the default value in the script):
$ python predict.py --data_pkl ../data/mydata.pkl
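Preparing the input pickle takes only a few lines with pandas. A minimal sketch (the sentences are examples; note that punctuation should already be separated by spaces, since predict.py labels one token per space-separated chunk):

```python
import pandas as pd

# One sentence per row in a 'sentence' column, as predict.py expects.
df = pd.DataFrame({
    "sentence": [
        "She may be already asleep .",
        "If she gets the job , she will move to Utrecht .",
    ]
})
df.to_pickle("mydata.pkl")

# Each sentence will be split on spaces, one predicted label per token.
tokens = df["sentence"].iloc[0].split(" ")
print(tokens)
# → ['She', 'may', 'be', 'already', 'asleep', '.']
```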
¹ Szarvas, G., Vincze, V., Farkas, R., Móra, G., & Gurevych, I. (2012). Cross-genre and cross-domain detection of semantic uncertainty. *Computational Linguistics*, 38(2), 335–367.