With this repository, we make the source code of our paper *Composing Structure-Aware Batches for Pairwise Sentence Classification* publicly available.
Please use the following citation:
@InProceedings{waldis-et-al-2022-structure-batches,
author = {Waldis, Andreas and Beck, Tilman and Gurevych, Iryna},
title = {Composing Structure-Aware Batches for Pairwise Sentence Classification},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2022},
month = may,
year = {2022},
address = {Dublin, Ireland},
publisher = {Association for Computational Linguistics},
}
Abstract: Identifying the relation between two sentences requires datasets with pairwise annotations. In many cases, these datasets contain instances that are annotated multiple times as part of different pairs. They constitute a structure that contains additional helpful information about the inter-relatedness of the text instances based on the annotations. This paper investigates how this kind of structural dataset information can be exploited during training. We propose three batch composition strategies to incorporate such information and measure their performance over 14 heterogeneous pairwise sentence classification tasks. Our results show statistically significant improvements (up to 3.9%) - independent of the pre-trained language model - for most tasks compared to baselines that follow a standard training procedure. Further, we see that even this baseline procedure can profit from having such structural information in a low-resource setting.
Contact person: Andreas Waldis, andreas.waldis@live.com
https://www.ukp.tu-darmstadt.de/
Don't hesitate to e-mail us or report an issue if something is broken (and it shouldn't be) or if you have further questions.
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
This project includes the folder `src` with the code, the folder `data` with a toy dataset example, and the folder `experiments` with additional details of the experiments in the paper.
These details include:
- The optimized batch sizes for the three experiments: `batch-sizes_experiment-i.csv`, `batch-sizes_experiment-ii.csv`, `batch-sizes_experiment-iii.csv`
- The results of the instability test using the Brown-Forsythe test: `stability_experiment-i.csv`, `stability_experiment-ii.csv`, `stability_experiment-iii.csv`
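For orientation, here is a rough sketch of the repository layout implied by the description above; the exact placement of individual files such as `run.py` and `docker-compose.yml` is an assumption and may differ:

```
.
├── docker-compose.yml
├── src/                 # code, incl. run.py and the parsing scripts in src/parsing
├── data/                # toy dataset example, e.g. sample_fold_0.csv
└── experiments/         # additional details of the paper's experiments
    ├── batch-sizes_experiment-i.csv
    ├── batch-sizes_experiment-ii.csv
    ├── batch-sizes_experiment-iii.csv
    ├── stability_experiment-i.csv
    ├── stability_experiment-ii.csv
    └── stability_experiment-iii.csv
```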
The basic requirement for using this repository is a running Docker environment.
To set up and start the container, run the following commands:
docker-compose build
docker-compose up lab
After building and starting the Docker container, you can access a JupyterLab instance on port `8080`.
To prepare all the datasets used in our paper, you need to download them manually and put them into the folder `data`. Afterwards, you can find the Python code to parse them and generate the different folds in the folder `src/parsing`.
If you wish to prepare your own dataset, please arrange it as a CSV file with the columns `id`, `sentence1`, `sentence2`, `label`, and `set`. You will find a toy example in the `data` folder with 13 sentence pairs for five topics, arranged in three folds.
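As a minimal sketch of what such a file could look like (the sentences, labels, and set names below are invented for illustration; `label` is shown as an integer, and `set` marks the split a pair belongs to, matching the `--dev_sets`/`--test_sets` parameters described below):

```csv
id,sentence1,sentence2,label,set
0,The service was excellent.,The staff was very friendly.,1,train
1,The service was excellent.,The food arrived cold.,0,dev
2,The room was spotless.,The breakfast was disappointing.,0,test
```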
To start a training run, call the script with `python3 run.py`. For gathering the performance metrics, please create a wandb.ai account and log in with the bash command `wandb login`.
To adjust the training settings, `run.py` provides you with the following parameters:
- `--data_file`, the path to the file with the training instances, default `../data/sample_fold_0.csv`
- `--num_labels`, the number of labels of the training task, default `2`
- `--directed`, a flag indicating that the label describes a directed relation; if not, just leave it out
- `--dev_sets`, the sets used for development in the provided CSV file, default `dev`
- `--test_sets`, the sets used for testing in the provided CSV file, default `dev`
- `--model_name`, the Hugging Face tag of the pre-trained language model you wish to use, default `bert-base-uncased`
- `--strategy`, the batching strategy to apply, either `BI_BASELINE`, `BI_NODE`, `BI_EDGE1`, `BI_EDGE2` for bi-encoders or `CROSS_BASELINE`, `CROSS_NODE`, `CROSS_EDGE1`, `CROSS_EDGE2` for cross-encoders, default `BI_NODE`
- `--seed`, the random seed to use, default `0`
- `--batch_size`, the number of nodes or edges to sample within one batch, default `8`
- `--learning_rate`, the learning rate to use, default `0.00002`
- `--num_epochs`, the number of epochs to train, default `5`
- `--warmup_proportion`, the proportion of epochs to use as the warmup period, default `0.1`
- `--wandb_tag_prefix`, the prefix for the wandb tag, default `8`
- `--wandb_project`, the wandb project to use, default empty
- `--max_tokens`, the maximum number of tokens to process within one step; if a batch has more tokens, the training uses gradient accumulation to process the batch efficiently, default `6000`
- `--strategy_baseline_accumulation_steps`, the number of gradient accumulation steps to apply when training the baseline model; for example, with `2`, a batch of 8 instances is divided into two batches of 4 instances, default `2`
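As an illustration of how these parameters fit together, a training run could look like the sketch below. All flags and values come from the list above (the defaults are shown explicitly), and it assumes the command is run from the folder containing `run.py`, so that the default relative data path resolves:

```bash
wandb login   # log in once so the run's metrics are reported to wandb.ai
python3 run.py \
    --data_file ../data/sample_fold_0.csv \
    --model_name bert-base-uncased \
    --strategy BI_NODE \
    --batch_size 8 \
    --learning_rate 0.00002 \
    --num_epochs 5 \
    --seed 0
```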