F1000RD Corpus Repository

This repository hosts F1000RD, the dataset accompanying the article "Revise and Resubmit: An Intertextual Model of Text-based Collaboration in Peer Review" (Computational Linguistics, 2022). It is the first openly licensed, multi-domain corpus of publications, their revisions and their peer reviews from an open reviewing platform.

If you are interested in the intertextual graph data model introduced in the paper, please have a look at the intertext-graph repository (https://github.com/UKPLab/intertext-graph.git).

Abstract: Peer review is a key component of the publishing process in most fields of science. The increasing submission rates put a strain on reviewing quality and efficiency, motivating the development of applications to support the reviewing and editorial work. While existing NLP studies focus on the analysis of individual texts, editorial assistance often requires modeling interactions between pairs of texts -- yet general frameworks and datasets to support this scenario are missing. Relationships between texts are the core object of the intertextuality theory -- a family of approaches in literary studies not yet operationalized in NLP. Inspired by prior theoretical work, we propose the first intertextual model of text-based collaboration, which encompasses three major phenomena that make up a full iteration of the review-revise-and-resubmit cycle: pragmatic tagging, linking and long-document version alignment. While peer review is used across the fields of science and publication formats, existing datasets solely focus on conference-style review in computer science. Addressing this, we instantiate our proposed model in the first annotated multi-domain corpus in journal-style post-publication open peer review, and provide detailed insights into the practical aspects of intertextual annotation. Our resource is a major step towards multi-domain, fine-grained applications of NLP in editorial support for peer review, and our intertextual framework paves the path for general-purpose modeling of text-based collaboration.

What can be found here

The corpus is based on data from the open reviewing platform F1000Research. The data used in the article consists of two parts: the study sample used in the annotation studies, and the full crawl of F1000Research used for reference. This repository contains the study sample and the accompanying analysis code. The full crawl used in this work is available on demand. Our data comes in three formats. JATS XML (full crawl only) is the source format from which we generate Intertextual Graphs (ITG), our novel graph-based data model that is well suited for intertextual analysis and is backed by the separately released intertext_graph library (https://github.com/UKPLab/intertext-graph.git). Because working with ITGs requires this external library, we also provide the data in a simple CSV-based format (study sample only) to facilitate analysis and task-specific applications.

The repository also hosts the annotation guidelines used in the studies and the draft datasheet for F1000RD.

Data structure

analysis/
    analysis_util.py <- utility functions for analysing the data
    analytics.ipynb <- code to reproduce analysis from the article
    exp_linker.py <- simple regex-based explicit linker used in the paper
    exp_patterns.tsv <- auxiliary for the explicit linker
data/
    simple/ <- one file per task / analysis type
    itg/ <- one folder per F1000Research submission
        X-XX/ <- submission folder
          {v1, v2, v3...}.json <- ITGs for submission versions
          diff_... <- automatically produced alignments between v1 and v2 if available
          reviews/ <- ITGs for reviews for the first submission version
          linking/ <- links between reviews and the first submission version
guidelines/ <- annotation guidelines
requirements.txt
datasheet.pdf
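
The layout of data/itg/ shown above can be explored with the standard library alone. The following is a minimal sketch, not part of the repository: the function name list_submission_versions is hypothetical, and it assumes that version files are named exactly v1.json, v2.json and so on, as in the tree above. It only lists file names and does not parse the ITGs; for that, see the intertext_graph library.

```python
import re
from pathlib import Path


def list_submission_versions(itg_root):
    """Map each submission folder name to its version files (v1.json, v2.json, ...).

    Hypothetical helper: assumes the data/itg/ layout described in this README.
    """
    versions = {}
    for sub_dir in sorted(Path(itg_root).iterdir()):
        if not sub_dir.is_dir():
            continue
        # Keep only files named v<number>.json; this skips diff_... files
        # and the reviews/ and linking/ subfolders.
        files = [p for p in sub_dir.iterdir()
                 if p.is_file() and re.fullmatch(r"v\d+\.json", p.name)]
        # Sort numerically so that v10.json comes after v2.json.
        files.sort(key=lambda p: int(p.stem[1:]))
        versions[sub_dir.name] = [p.name for p in files]
    return versions


if __name__ == "__main__":
    # Path relative to the repository root.
    for submission, files in list_submission_versions("data/itg").items():
        print(submission, files)
```

Submissions with more than one version file are the ones for which a diff_... alignment may be available.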

If you want to use the ITG representation of the data in your experiments (e.g. v1.json in the submission directories), have a closer look at our intertext_graph library. It is a general-purpose library that implements a structured data model for representing documents, making it easy to work with document structure, relations and cross-document links.

The function get_mega_itg() in analysis/analysis_util.py builds on the intertext_graph library: given a submission directory, it constructs an intertextual graph object with the complete pragmatics, linking and versioning data.

Implicit Linking Data

The table data/simple/imp_links.csv contains all implicit linking annotations. Each row holds the information for one pair of nodes; in the implicit linking data, these are always sentence pairs. The columns imp_a and imp_b contain the annotations by the main annotators in the main annotation study. The columns imp_a_re and imp_b_re contain the annotations by the main annotators in the re-annotation study. The columns imp_c_e and imp_d_e contain the annotations by the expert annotators. For all annotations, 1 indicates that the annotator marked the sentence pair as linked, and 0 that the annotator marked it as non-linked.
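
As an illustration of how these columns can be used, the sketch below computes raw agreement between two annotation columns. It is not the analysis code from the paper (see analysis/analytics.ipynb for that). The function names are hypothetical, and the sketch assumes that the file is comma-separated with a header row and that a missing annotation is stored as an empty cell.

```python
import csv


def load_rows(path):
    """Read a CSV file with a header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def pairwise_agreement(rows, col_a="imp_a", col_b="imp_b"):
    """Return the fraction of sentence pairs on which two columns agree.

    Rows where either column is empty are skipped (assumption: an
    unannotated pair is stored as an empty cell).
    """
    agree = 0
    total = 0
    for row in rows:
        a = row.get(col_a) or ""
        b = row.get(col_b) or ""
        if a == "" or b == "":
            continue
        total += 1
        # Compare as numbers so that "1" and "1.0" count as the same label.
        agree += int(float(a) == float(b))
    return agree / total if total else float("nan")


if __name__ == "__main__":
    rows = load_rows("data/simple/imp_links.csv")
    print("main study:", pairwise_agreement(rows, "imp_a", "imp_b"))
    print("experts:", pairwise_agreement(rows, "imp_c_e", "imp_d_e"))
```

Raw agreement is used here only because it is the simplest measure; it does not correct for chance agreement.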

Data Splits

The data was split by submission, ensuring that there is no overlap between the train, dev and test sets. Please find the split information in data/simple/splits.csv.
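
The no-overlap property can be checked with a few lines of code. The sketch below is illustrative only: the column names "submission" and "split" are hypothetical placeholders, since the header of splits.csv is not described here, so please adjust them to the actual file.

```python
import csv
from collections import defaultdict


def find_split_overlap(rows, id_col="submission", split_col="split"):
    """Return the submission ids that are assigned to more than one split.

    id_col and split_col are hypothetical column names; an empty result
    means that no submission is shared between splits.
    """
    splits_by_id = defaultdict(set)
    for row in rows:
        splits_by_id[row[id_col]].add(row[split_col])
    return {sid: sorted(splits)
            for sid, splits in splits_by_id.items()
            if len(splits) > 1}


if __name__ == "__main__":
    with open("data/simple/splits.csv", newline="", encoding="utf-8") as f:
        overlap = find_split_overlap(list(csv.DictReader(f)))
    print("overlapping submissions:", overlap)
```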

Analytics run-through

To reproduce the analysis from the paper:

1. Clone this repo.
2. Create a fresh virtual environment (e.g. via conda).
3. Run pip install -r requirements.txt.
4. Run the analytics.ipynb notebook in the analysis folder.

Note that some of the analyses require the full crawl of F1000Research.

Citation

If you use this data in your research, please cite:

@article{10.1162/coli_a_00455,
    author = {Kuznetsov, Ilia and Buchmann, Jan and Eichler, Max and Gurevych, Iryna},
    title = "{Revise and Resubmit: An Intertextual Model of Text-based Collaboration in Peer Review}",
    journal = {Computational Linguistics},
    pages = {1-38},
    year = {2022},
    month = {08},
    issn = {0891-2017},
    doi = {10.1162/coli_a_00455},
    url = {https://doi.org/10.1162/coli\_a\_00455},
    eprint = {https://direct.mit.edu/coli/article-pdf/doi/10.1162/coli\_a\_00455/2038043/coli\_a\_00455.pdf},
}

Contact

Don't hesitate to send us an e-mail or report an issue if something is broken or if you have further questions!

Contacts: Ilia Kuznetsov kuznetsov@ukp.informatik.tu-darmstadt.de, Jan Buchmann buchmann@ukp.informatik.tu-darmstadt.de

https://www.ukp.tu-darmstadt.de/

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
