
UKPLab/citebench

CiteBench: A benchmark for Scientific Citation Text Generation

Please use the following citation:

@misc{https://doi.org/10.48550/arxiv.2212.09577,
  doi = {10.48550/ARXIV.2212.09577},
  url = {https://arxiv.org/abs/2212.09577},
  author = {Funkquist, Martin and Kuznetsov, Ilia and Hou, Yufang and Gurevych, Iryna},
  title = {CiteBench: A benchmark for Scientific Citation Text Generation},
  publisher = {arXiv},
  year = {2022},  
  copyright = {Creative Commons Attribution Share Alike 4.0 International}
}

Abstract: Science progresses by incrementally building upon the prior body of knowledge documented in scientific publications. The acceleration of research across many fields makes it hard to stay up-to-date with the recent developments and to summarize the ever-growing body of prior work. To target this issue, the task of citation text generation aims to produce accurate textual summaries given a set of papers-to-cite and the citing paper context. Existing studies in citation text generation are based upon widely diverging task definitions, which makes it hard to study this task systematically. To address this challenge, we propose CiteBench: a benchmark for citation text generation that unifies multiple diverse datasets and enables standardized evaluation of citation text generation models across task designs and domains. Using the new benchmark, we investigate the performance of multiple strong baselines, test their transferability between the datasets, and deliver new insights into the task definition and evaluation to guide future research in citation text generation.

Contact persons: Martin Funkquist (martin.funkquist@liu.se), Ilia Kuznetsov (kuznetsov@ukp.informatik.tu-darmstadt.de)

https://www.ukp.tu-darmstadt.de/

https://www.tu-darmstadt.de/

Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Project structure

The src folder contains the code for this project. The data can be downloaded using the provided scripts (see below).

Requirements

The code has been developed and tested on Ubuntu 20.04 and is not guaranteed to work on other operating systems.

Installation

Prepare the environment

The easiest way to run the code in this repo is to use Anaconda. If you haven't installed it, you can find the installation guidelines here: https://docs.anaconda.com/anaconda/install/

Start by creating a new conda environment:

conda create --name citebench python=3.9

And activate it:

conda activate citebench

Install requirements:

pip install -r requirements.txt

Download and process dataset

Download the raw datasets:

sh get_raw_data.sh

When the download is finished, the processed benchmark dataset can be created by running the following Python script:

PYTHONPATH=src python src/data_processing/related_work_benchmark_construction.py

Alternatively, you can download the processed data directly:

sh get_processed_data.sh

You can also download the raw data from here: https://drive.google.com/file/d/1rvfB1s6GpVxxSwjnhlT1hl-x5eWXXz61/view?usp=sharing or the processed data from here: https://drive.google.com/file/d/1opDbbnQ74DTnwtUo8CCzTuQ9sF_rceYF/view?usp=sharing

Getting the CHEN dataset

The CHEN dataset (see paper for details) is not included in the data that we provide due to incomplete licensing information on the Delve data. Please contact the authors to obtain the data: https://github.com/iriscxy/relatedworkgeneration

Test your own model on the benchmark

If your model is created using Huggingface, you can run the provided test script in this repo:

PYTHONPATH=src python src/rel_work/test.py \
  --model=<PATH_TO_YOUR_MODEL> \
  --output_folder=<PATH_TO_OUTPUT_FOLDER> \
  --evaluation_metrics=rouge,bert-score

Expected results

The test script will produce two output files for each dataset: [DATASET]_predictions.json, which contains a list of dictionaries with two keys (target holds the reference text and prediction holds the model output), and [DATASET].json, which contains the results for the specified evaluation metrics (only ROUGE by default).
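
For illustration, a [DATASET]_predictions.json file has roughly the following shape (the texts are placeholders, not actual benchmark data):

[
  {"target": "reference related-work passage from the dataset ...",
   "prediction": "citation text generated by the model ..."},
  {"target": "...", "prediction": "..."}
]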

Parameter description

  • --model
    • Path to pretrained model or shortcut name
  • --decoding_strategy
    • Strategy to use for decoding, e.g. beam_search, top_k or top_p
  • --top_k
    • K for top-k decoding
  • --top_p
    • P for top-p decoding
  • --use_sep_token
    • If True, the input documents will be separated by the sep token. Default False
  • --datasets
    • Comma separated list of datasets to use e.g. 'lu_et_al,xing_et_al'. Default uses all datasets
  • --batch_size
    • Batch size for running predictions. Default 8
  • --base_data_path
    • Path to the base data folder. This is the folder where the individual dataset folders, e.g. 'lu_et_al', are located. Default 'data/related_work/benchmark/'
  • --output_csv_file
    • Path to the output csv file. If this argument is present, it will store the results in this file in addition to the other files.
  • --ignore_cache
    • If True, will ignore the cache and always re-run the predictions, even for datasets where these have already been calculated.
  • --evaluation_metrics
    • Comma separated list of evaluation metrics to use e.g. "rouge,bert-score". Available evaluation metrics are "rouge" and "bert-score". Default uses only the "rouge" metric.
  • --use_doc_idx
    • If True, will separate the documents with [idx] tokens e.g. [0] for the first document. Default False.
  • --no_tags
    • If True, will remove the special tags e.g. '<abs>' from the inputs. Default False.
  • --manual_seed
    • Manual seed for random number generator. Default 15.
  • --output_folder
    • Path to the folder where the results will be saved. If not provided, will save in a folder named after the model.
  • --use_cpu
    • If True, will use CPU instead of GPU even if GPU is available.
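
For example, a hypothetical run that evaluates a fine-tuned checkpoint on two of the datasets with both metrics could look as follows (the model and output paths are placeholders):

PYTHONPATH=src python src/rel_work/test.py \
  --model=models/led-base/ \
  --datasets=lu_et_al,xing_et_al \
  --batch_size=4 \
  --evaluation_metrics=rouge,bert-score \
  --output_folder=outputs/predictions/led-base/ \
  --output_csv_file=outputs/led-base.csv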

Evaluate model outputs on the benchmark

If your model is not created with Huggingface, then you will have to create an output JSON file for each dataset with the naming convention [DATASET_NAME]_predictions.json. Each file should contain a list of dictionaries with the keys target and prediction; a minimal sketch of how to produce such a file is given after the command below. When you have these files, you can run the evaluation script:

PYTHONPATH=src python src/rel_work/evaluation.py \
  --results_folder=<PATH_TO_THE_OUTPUTS_OF_YOUR_MODEL> \
  --output_file=<PATH_TO_FILE_TO_STORE_RESULTS> \
  --evaluation_metrics=rouge,bert-score
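
The following is a minimal sketch of how a prediction file could be written; the dataset name, the test examples, and the predict function are placeholders for your own inference pipeline:

import json

# Placeholder inference function: replace with your own model's prediction call.
def my_model_predict(input_text):
    return "citation text generated by the model"

# Placeholder test data: replace with the test split of a benchmark dataset.
dataset_name = "lu_et_al"
examples = [
    {"input": "cited paper abstracts and citing context ...", "target": "reference related-work text ..."},
]

records = []
for example in examples:
    records.append({
        "target": example["target"],
        "prediction": my_model_predict(example["input"]),
    })

with open(f"{dataset_name}_predictions.json", "w") as f:
    json.dump(records, f)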

Expected results

The output is a CSV file with the calculated scores; each line contains the scores for one dataset.

Parameter description

  • --results_folder
    • Path to the results folder where the results from the model are stored. The script will match the files that end with _predictions.json; each of these files should contain a list of objects with the keys 'prediction' and 'target'
  • --output_file
    • Path to the output file. This file will contain the results of the evaluation.
  • --evaluation_metrics
    • Comma separated list of evaluation metrics to use e.g. "rouge,bert-score". Available evaluation metrics are "rouge" and "bert-score". Default uses all metrics.
  • --use_stemmer
    • If True, will set the 'use_stemmer' argument to True in the calculation of the ROUGE score. Defaults to False.

Citation intent evaluation

To run the citation intent evaluation, first convert your model outputs to the SciCite format by running the following script:

PYTHONPATH=src python src/data_processing/convert_to_scicite.py \
  --model_predictions_folders=<PATH_TO_THE_OUTPUTS_OF_YOUR_MODEL>

The outputs will be stored in files of the form <PATH_TO_THE_OUTPUTS_OF_YOUR_MODEL>/acl_arc/inputs/[DATASET_NAME].jsonl, one for each dataset.

Then follow the instructions here: https://github.com/allenai/scicite to get the citation intent outputs. Use the ACL-ARC pretrained model.

CORWA

Similar to citation intent, convert the model outputs to the CORWA format:

PYTHONPATH=src python src/data_processing/convert_to_corwa.py \
  --model_predictions_folders=<PATH_TO_THE_OUTPUTS_OF_YOUR_MODEL>

Outputs are stored in <PATH_TO_THE_OUTPUTS_OF_YOUR_MODEL>/corwa/inputs/[DATASET_NAME].jsonl, one for each dataset.

Then follow the instructions here: https://github.com/jacklxc/corwa to get the discourse tagging outputs.

Extending the benchmark

We want the community to be able to extend this benchmark with new datasets and evaluation metrics. We give instructions below on how to do this.

Adding a dataset

To add a new dataset, you either need to provide a processed version matching the structure of the other datasets in the benchmark, or you need to add conversion code to the related_work_benchmark_construction.py script.

Adding new evaluation metrics

To add new evaluation metrics, you need to extend the evaluation.py script with code that calculates the results for these metrics. See how the existing scores, e.g. ROUGE and BERTScore, are implemented as a reference.
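
As a rough illustration (the function name and signature below are assumptions, not the actual interface of evaluation.py), a new metric can be implemented as a function that maps lists of predictions and targets to a score and is then hooked into the script alongside ROUGE and BERTScore:

def average_length_ratio(predictions, targets):
    """Toy metric: mean ratio of prediction length to target length, in whitespace tokens."""
    ratios = []
    for prediction, target in zip(predictions, targets):
        target_length = max(len(target.split()), 1)
        ratios.append(len(prediction.split()) / target_length)
    return sum(ratios) / max(len(ratios), 1)

The new metric would then also need to be exposed via the --evaluation_metrics argument so that it can be selected at evaluation time.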

Running the experiments

There are three extractive baselines for the related work benchmark included in this repo. The following is a guide on how to run them.

Lead

The Lead baseline simply takes the first sentences from the input(s) as predictions. To run Lead with default settings, run the following:

PYTHONPATH=src python src/rel_work/baselines/lead.py \
  --results_path=outputs/predictions/lead/ \
  --csv_results_path=outputs/lead.csv

This will run the Lead baseline on all the datasets in the benchmark and output the results in the results_path folder. A summary of the results can be found in csv_results_path.

This baseline has a few parameters, which you will find in the arguments in the file. Feel free to play around with these.

TextRank

Example of how to run the TextRank baseline with default settings:

PYTHONPATH=src python src/rel_work/baselines/text_rank.py \
  --results_path=outputs/predictions/textrank/ \
  --csv_results_path=outputs/textrank.csv

LexRank

Example of how to run the LexRank baseline with default settings:

PYTHONPATH=src python src/rel_work/baselines/lex_rank.py \
  --results_path=outputs/predictions/lexrank/ \
  --csv_results_path=outputs/lexrank.csv

Train a model on the benchmark dataset

Example of how to start training a LED base model, on all datasets included in the benchmark:

PYTHONPATH=src python src/rel_work/train.py \
  --model=allenai/led-base-16384 \
  --output_dir=models/led-base/

Note: this script has only been tested with Huggingface models.

License

This project is licensed under the Apache License 2.0 (see the LICENSE and LICENSE.txt files).