Training mBART for term extraction

Reference

Lang, C., Wachowiak, L., Heinisch, B., & Gromann, D. Transforming Term Extraction: Transformer-Based Approaches to Multilingual Term Extraction Across Domains.

Overview

This repository contains the scripts used to finetune mBART for the termextraction task on the ACTER dataset (https://github.com/AylaRT/ACTER). Additionally, the repository contains folders named "epoch_XX", which contain the output on the test set of the models trained to XX epoch. Epoch_40 and Epoch_80 contain folders out and out_best. out is the output of the model at the maximum epoch; out_best the output of the early-stopped model. See exact number of steps per model under "checkpoint_steps" or see logged training progress with

tensorboard --logdir ./tensorboard/

Use the scripts in the root of the repository to recreate these results.

Installation

The experiment relies on several external dependecies. It is recommended to recreate the experiments on Linux.

Create a python 3 virtual environment and install all dependencies using

pip install -r requirements.txt

or if you use anaconda

conda install --file requirements.txt

This also installs the specific Fairseq version that was used for the experiments, including its requirements. Alternatively, you can get the latest version of Fairseq here: https://github.com/pytorch/fairseq

As stated by the original Fairseq documentation, for faster training on Nvidia GPUs, install nvidia apex:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./

Additionally, you should install the CLI Tools of SentencePiece (SPM) for creating the sentence piece tokenized data for Fairseq Install SPM here

For the experiments, Python 3.6.6 and the CUDA 10.1 Toolkit was used. Make sure you acquire the correct pyTorch version for your CUDA Toolkit installation, to avoid version mismatch (for example when building nvidia apex).

Finally, you need to download the mBART Model with its dictionary and SentencePiece model:

# download model
wget https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.cc25.v2.tar.gz
tar -xzvf mbart.CC25.tar.gz

Put the model files under "./models/mBART.cc25/" or change model location in the scripts.

Usage of the scripts

The scripts present in the root of this repository are numbered from 00 to 05. The repository comes with the results of scripts 00 and 01 ("termeval" and "preprocessed").

ACTER_dataprep.py Python script: Tokenizes and concatenates all files in ACTER per domain and language. Saves results to "termeval" dir in root. Change ACTER variable to point to the ACTER dataset (download here https://github.com/AylaRT/ACTER dataset)
train_test_split_termeval_revised.py Python script: Creates Train, Val and Test split, saves files to "preprocessed".
1. Default split: Train = corp and wind domain. Val = equi. Test = htfl.
2. Label separator can be defined in script. Default: semicolon (comma)
3. Language and label separator define dir name: Example en_comma = English data, semicolon separated labels
spm.sh Bash script: Create SentencePiece tokenized and binarized files for processing with fairseq. Saves files to "postprocessed".
CLI argument 1 defines dataset. Example for the English, semicolon separated dataset:
```
./spm.sh en_comma
```
train.sh Bash script: Start training process with Fairseq.\ Usage: ./train.sh dataset_name e.g. ./train.sh en_comma.
gen.sh or interactive.sh Bash scripts: gen.sh generates terms for all examples in binarized "test" file.
Usage: ./gen.sh workdir model_name test_name e.g. ./gen.sh . en_comma fr_comma
interactive.sh can be used interactively to generate terms from raw text input-sentences.
termeval_F1.py python script: Evaluate F1 score on extracted terms.

Results overview

F1 Scores on ACTER

Training	Test	Precision	Recall	F1
EN	EN	45.7	63.5	53.2
FR	EN	50.0	59.3	54.2
NL	EN	48.3	64.3	55.2
ALL	EN	50.2	61.6	55.3

EN	FR	48.8	61.3	54.4
FR	FR	52.7	59.6	55.9
NL	FR	54.3	60.9	57.4
ALL	FR	55.0	60.4	57.6

EN	NL	48.8	63.9	55.4
FR	NL	56.2	63.4	59.6
NL	NL	60.6	70.7	65.2
ALL	NL	60.6	70.0	64.9

F1 Scores on ACL RD-TEC 2.0 (60/20/20 split as described in paper)

Data Type	Precion	Recall	F1
Annotator 1	73.2	77.2	75.2
Annotator 2	79.4	80.7	80.0

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
checkpoint_steps		checkpoint_steps
epoch_25/out		epoch_25/out
epoch_40		epoch_40
epoch_80		epoch_80
models/mbart.cc25		models/mbart.cc25
preprocessed		preprocessed
tensorboard		tensorboard
termeval		termeval
00_ACTER_dataprep.py		00_ACTER_dataprep.py
01_train_test_split_termeval_revised.py		01_train_test_split_termeval_revised.py
02_spm.sh		02_spm.sh
03_train.sh		03_train.sh
04_gen.sh		04_gen.sh
04_interactive.sh		04_interactive.sh
05_termeval_F1.py		05_termeval_F1.py
README.md		README.md
requirements.txt		requirements.txt

Text2TCS/mBART-termextraction

Folders and files

Latest commit

History

Repository files navigation

Training mBART for term extraction

Reference

Overview

Installation

Usage of the scripts

Results overview

F1 Scores on ACTER

F1 Scores on ACL RD-TEC 2.0 (60/20/20 split as described in paper)

About

Resources

Stars

Watchers

Forks

Languages