Multi-XScience

Dataset for the EMNLP 2020 paper, Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles.

Authors: Yao Lu, Yue Dong, Laurent Charlin

Appendix: model implementation and evaluation details.

Dataset Statistics

| train/val/test examples | average document length | average summary length | average number of references |
| --- | --- | --- | --- |
| 30,369/5,066/5,093 | 778.08 | 116.44 | 4.42 |

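We also calculate the percentage of novel n-grams in the target summaries, i.e. n-grams that appear in the summary but not in the source documents, and compare against previous datasets. Three of them (CNN-DailyMail, NY Times, XSum) are single-document summarization datasets. Among existing multi-document summarization datasets, ours has the highest abstractiveness, with more novel n-grams than WikiSum and Multi-News at every n-gram order.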

| Dataset | % of novel unigrams | % of novel bigrams | % of novel trigrams | % of novel 4-grams |
| --- | --- | --- | --- | --- |
| CNN-DailyMail (single) | 17.00 | 53.91 | 71.98 | 80.29 |
| NY Times (single) | 22.64 | 55.59 | 71.93 | 80.16 |
| XSum (single) | 35.76 | 83.45 | 95.50 | 98.49 |
| WikiSum | 18.20 | 51.88 | 69.82 | 78.16 |
| Multi-News | 17.76 | 57.10 | 75.71 | 82.30 |
| Multi-XScience | 42.33 | 81.75 | 94.57 | 97.62 |
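For readers who want to compute a comparable figure on their own data, here is a minimal sketch of one common way to measure novel n-grams. It is an illustration, not the authors' evaluation script: the function names are hypothetical, and the lowercased whitespace tokenization and the counting of unique n-grams are assumptions, so the numbers it produces need not match the table.

```python
from typing import Set, Tuple


def ngrams(tokens: list, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of unique n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_percentage(source: str, summary: str, n: int) -> float:
    """Percentage of unique summary n-grams that never occur in the source.

    Assumption: lowercased whitespace tokenization; a real evaluation
    would likely use a proper tokenizer.
    """
    source_ngrams = ngrams(source.lower().split(), n)
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0  # summary shorter than n tokens
    novel = summary_ngrams - source_ngrams
    return 100.0 * len(novel) / len(summary_ngrams)


# Toy example (made-up sentences, not taken from the dataset).
source_text = "the cat sat on the mat"
summary_text = "the cat lay on the rug"
for n in (1, 2):
    print(n, novel_ngram_percentage(source_text, summary_text, n))
```

For a multi-document example, the source string would be the concatenation of all input documents before the n-grams are collected.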

Dataset Format

| key | description |
| --- | --- |
| aid | arXiv ID (e.g. 2010.14235) |
| mid | Microsoft Academic Graph ID |
| abstract | text of the paper's abstract |
| ref_abstract | meta-information of the reference papers |
| ref_abstract.cite_N | meta-information of reference paper cite_N (special cite symbol) |
| ref_abstract.cite_N.mid | Microsoft Academic Graph ID of reference paper cite_N |
| ref_abstract.cite_N.abstract | text of the abstract of reference paper cite_N |
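As an illustration of the key layout described above, the following sketch walks one record and collects the paper's abstract together with the abstracts of its references. The record is a hand-written placeholder, not a real dataset entry, and `collect_sources` is a hypothetical helper. The sketch also assumes each example is a nested dictionary keyed exactly as in the table; the exact spelling of the cite keys in the released files may differ.

```python
from typing import Dict, List, Tuple

# Hypothetical record following the documented keys.
# All IDs and texts are placeholders.
sample_record: Dict = {
    "aid": "0000.00000",
    "mid": "1000",
    "abstract": "Abstract of the query paper.",
    "ref_abstract": {
        "cite_1": {"mid": "2001", "abstract": "Abstract of the first reference."},
        "cite_2": {"mid": "2002", "abstract": "Abstract of the second reference."},
    },
}


def collect_sources(record: Dict) -> List[Tuple[str, str]]:
    """Return (label, text) pairs: the paper's abstract, then each reference abstract."""
    sources = [("query", record["abstract"])]
    for cite_key, reference in record["ref_abstract"].items():
        sources.append((cite_key, reference["abstract"]))
    return sources


for label, text in collect_sources(sample_record):
    print(f"{label}: {text}")
```

Keeping the cite key next to each abstract makes it possible to trace a citation symbol back to the reference it stands for.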

Extended Usage

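Our dataset is aligned with the Microsoft Academic Graph (MAG): every paper and every reference paper carries its MAG ID (mid), so records can be linked to the graph. Anyone interested in the intersection of graphs and summarization can use our dataset for exploration.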
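One possible starting point for such exploration is to turn the MAG IDs into a small citation graph. The sketch below is an assumption-laden illustration rather than part of the dataset tooling: `build_citation_edges` is a hypothetical helper, the records are placeholders, and it assumes each record exposes `mid` and `ref_abstract.cite_N.mid` as nested dictionary keys.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_citation_edges(records: Iterable[Dict]) -> Dict[str, Set[str]]:
    """Map each paper's MAG ID to the set of MAG IDs of its references."""
    graph = defaultdict(set)
    for record in records:
        for reference in record["ref_abstract"].values():
            graph[record["mid"]].add(reference["mid"])
    return dict(graph)


# Placeholder records (IDs are made up); two papers share reference "2002".
toy_records = [
    {"mid": "1000", "ref_abstract": {"cite_1": {"mid": "2001"}, "cite_2": {"mid": "2002"}}},
    {"mid": "1001", "ref_abstract": {"cite_1": {"mid": "2002"}}},
]

edges = build_citation_edges(toy_records)
for paper_id, reference_ids in edges.items():
    print(paper_id, sorted(reference_ids))
```

The resulting adjacency sets can be passed to a graph library or joined with other MAG metadata through the same IDs.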
