
Improve the training process by leveraging structural information within a dataset.


Composing Structure-Aware Batches for Pairwise Sentence Classification

With this repository, we make the source code of our paper Composing Structure-Aware Batches for Pairwise Sentence Classification publicly available.

Please use the following citation:

@InProceedings{waldis-et-al-2022-structure-batches,
  author    = {Waldis, Andreas and Beck, Tilman and Gurevych, Iryna},
  title     = {Composing Structure-Aware Batches for Pairwise Sentence Classification},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2022},
  month     = may,
  year      = {2022},
  address   = {Dublin, Ireland},
  publisher = {Association for Computational Linguistics},
}

Abstract: Identifying the relation between two sentences requires datasets with pairwise annotations. In many cases, these datasets contain instances that are annotated multiple times as part of different pairs. They constitute a structure that contains additional helpful information about the inter-relatedness of the text instances based on the annotations. This paper investigates how this kind of structural dataset information can be exploited during training. We propose three batch composition strategies to incorporate such information and measure their performance over 14 heterogeneous pairwise sentence classification tasks. Our results show statistically significant improvements (up to 3.9%) - independent of the pre-trained language model - for most tasks compared to baselines that follow a standard training procedure. Further, we see that even this baseline procedure can profit from having such structural information in a low-resource setting.

Contact person: Andreas Waldis, andreas.waldis@live.com

https://www.ukp.tu-darmstadt.de/

https://www.tu-darmstadt.de/

Don't hesitate to e-mail us or report an issue if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Project structure

This project includes the folder src with the code, the folder data with a toy dataset example, and the folder experiments with additional details of the experiments in the paper. These details include:

  • The optimized batch sizes for the three experiments: batch-sizes_experiment-i.csv, batch-sizes_experiment-ii.csv, batch-sizes_experiment-iii.csv
  • The results of the instability test using the Brown-Forsythe test: stability_experiment-i.csv, stability_experiment-ii.csv, stability_experiment-iii.csv

Requirements

The basic requirement to use this repository is a running Docker environment.

Installation

To set up the container, run the following commands:

docker-compose build
docker-compose up lab

After building and starting the docker container, you can access a JupyterLab instance on port 8080.

Preparing Data

To prepare all datasets of our paper, you need to manually download them and put them into the folder data. Afterwards, you can find the Python code to parse them and generate the different folds in the folder src/parsing.

If you wish to prepare your own dataset, please arrange it as a CSV file with the columns id, sentence1, sentence2, label, and set. You will find a toy example in the data folder with 13 sentence pairs for five topics arranged in three folds.
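As a rough sketch of the expected layout, such a file could look as follows (the rows below are made-up examples, not part of the toy dataset; the binary labels and the train/dev split values are placeholders):

id,sentence1,sentence2,label,set
0,Nuclear energy is safe.,Reactors can fail catastrophically.,0,train
1,Nuclear energy is safe.,Modern reactors have redundant safety systems.,1,train
2,School uniforms reduce bullying.,Uniforms remove visible status symbols.,1,dev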

Running the experiments

To start a training run, you can call the script python3 run.py. For gathering the performance metrics, please create a wandb.ai account and log in with the following bash command: wandb login.

Parameter description

To adjust the training settings, run.py provides you with the following parameters (an example invocation is shown after the list):

  • --data_file, the path to the specific file with the training instances, default ../data/sample_fold_0.csv
  • --num_labels, the number of labels of the training task, default 2
  • --directed, a flag parameter that indicates that the label describes a directed relation; if not, just leave it out
  • --dev_sets, the sets used for development in the provided CSV file, default dev
  • --test_sets, the sets used for testing in the provided CSV file, default dev
  • --model_name, the Huggingface tag of the pre-trained language model you wish to use, default bert-base-uncased
  • --strategy, the specific batching strategy you want to apply either BI_BASELINE, BI_NODE, BI_EDGE1, BI_EDGE2 for bi-encoders or CROSS_BASELINE, CROSS_NODE, CROSS_EDGE1, CROSS_EDGE2 for cross-encoders, default BI_NODE
  • --seed, the random seed to use, default 0
  • --batch_size, the number of nodes or edges to sample within one batch, default 8
  • --learning_rate, the learning rate to use, default 0.00002
  • --num_epochs, the number of epochs to train on, default 5
  • --warmup_proportion, portion of epochs to use as warmup period, default 0.1
  • --wandb_tag_prefix, the prefix for the wandb tag, default 8
  • --wandb_project, the wandb project to use, default ``
  • --max_tokens, the maximum number of tokens to process within one step; if a batch has more tokens, the training uses gradient accumulation to process the batch efficiently, default 6000
  • --strategy_baseline_accumulation_steps, the number of gradient accumulation steps to apply when training the baseline model; for example, with 2, a batch of 8 instances is divided into two batches of 4 instances, default 2
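As an illustration, a single run on the provided toy fold could be started as follows. This is just one possible combination of the documented parameters (most values simply repeat the defaults, and the wandb-related flags are omitted and depend on your own account):

python3 run.py \
    --data_file ../data/sample_fold_0.csv \
    --num_labels 2 \
    --model_name bert-base-uncased \
    --strategy BI_NODE \
    --batch_size 8 \
    --learning_rate 0.00002 \
    --num_epochs 5 \
    --warmup_proportion 0.1 \
    --seed 0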
