This repository contains the two datasets introduced in the paper "Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature" accepted in EMNLP 2022.

TGoldsack1/Corpora_for_Lay_Summarisation

Corpora-for-Lay-Summarisation

This repository contains the code and data for the paper "Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature", accepted in EMNLP 2022.

Data

Download links for PLOS and eLife, the two datasets introduced in our paper, are given below:

  • PLOS (~1.3GB uncompressed)
  • eLife (~330MB uncompressed)

Each dataset contains full biomedical research articles paired with expert-written lay summaries. PLOS articles are retrieved from journals published by the Public Library of Science (PLOS). Similarly, eLife articles are obtained from the eLife journal. More details and analysis of the content of each dataset are provided in the paper.

Data format

Each dataset consists of three files, train.json, val.json, and test.json, corresponding to the training, validation, and test splits. Each file is in JSON format and contains a list of JSON objects, each of which holds the data for a single article. The JSON objects are in the following format:

{
  "id": str,                      # unique identifier
  "year": str,                    # year of publication
  "title": str,                   # title
  "sections": List[List[str]],    # main text, divided into sections
  "headings": List[str],          # headings of each section
  "abstract": List[str],          # abstract
  "summary": List[str],           # lay summary
  "keywords": List[str]           # keywords/topic of article
}

Note: For the majority of experiments in the paper, our model input consists of the abstract concatenated with sections.
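
As an illustration of the format and the note above, here is a minimal sketch of reading one split file and building an input string by concatenating the abstract with the sections. The function names `load_split` and `build_model_input` are hypothetical, and joining the strings with a single space is an assumption of this sketch rather than the exact preprocessing used in the paper.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_split(path: str) -> List[Dict[str, Any]]:
    """Read one split file (e.g. train.json) into a list of article records."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def build_model_input(article: Dict[str, Any]) -> str:
    """Concatenate the abstract with the main-text sections, in order.

    The single-space separator is an assumption made for this sketch.
    """
    parts = list(article["abstract"])      # abstract: List[str]
    for section in article["sections"]:    # sections: List[List[str]]
        parts.extend(section)
    return " ".join(parts)


# Tiny in-memory record that follows the documented format (made-up text).
example = {
    "abstract": ["A.", "B."],
    "sections": [["C."], ["D.", "E."]],
}
print(build_model_input(example))  # prints "A. B. C. D. E."
```

With a downloaded dataset, `load_split` would be pointed at the local path of train.json, val.json, or test.json, and `build_model_input` applied to each returned record.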

Huggingface Datasets

The datasets are also available on the Huggingface Datasets library (page link).

They can be retrieved as follows:

from datasets import load_dataset

dataset = load_dataset("tomasg25/scientific_lay_summarisation", "plos") # replace "plos" with "elife" for eLife dataset

Note: Both datasets are provided in a slightly different format (using strings instead of lists) via Huggingface. See the dataset page on Huggingface for details.

Human Evaluation Annotations

The annotations created as part of the expert-based human evaluation described in the paper are available in the human_eval folder.

Citation

When using either dataset, please cite the following:

"Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature"
Tomas Goldsack, Zhihao Zhang, Chenghua Lin, Carolina Scarton
EMNLP 2022
