Multi-XScience

Dataset for the EMNLP 2020 paper "Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles".

Authors: Yao Lu, Yue Dong, Laurent Charlin

Appendix: model implementation and evaluation details.

Dataset Statistics

word-level statistics

train/val/test examples	average document length	average summary length	average number of references
30,369/5,066/5,093	778.08	116.44	4.42

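We also compute the percentage of novel n-grams in the target summaries of previous datasets, that is, n-grams in the summary that do not appear in the source. Three of the compared datasets (marked "single") are single-document summarization datasets. Among the multi-document summarization datasets compared (WikiSum, Multi-News, and Multi-XScience), ours has the highest percentage of novel n-grams at every n-gram size, i.e. the highest abstractiveness.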

Dataset	% novel unigrams	% novel bigrams	% novel trigrams	% novel 4-grams
CNN-DailyMail (single)	17.00	53.91	71.98	80.29
NY Times (single)	22.64	55.59	71.93	80.16
XSum (single)	35.76	83.45	95.50	98.49
WikiSum	18.20	51.88	69.82	78.16
Multi-News	17.76	57.10	75.71	82.30
Multi-XScience	42.33	81.75	94.57	97.62
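
As a rough illustration of how a "% novel n-grams" figure like those in the table can be computed, here is a minimal sketch. It assumes lowercased whitespace tokenization and counts every n-gram occurrence in the summary; the function names `ngrams` and `novel_ngram_percentage` are hypothetical, and the exact preprocessing used for the reported numbers may differ.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_percentage(source, summary, n):
    """Percentage of summary n-grams that never occur in the source.

    Assumption: lowercased whitespace tokenization; this is a
    simplification, not necessarily the paper's tokenizer.
    """
    source_ngrams = set(ngrams(source.lower().split(), n))
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0  # summary shorter than n tokens
    novel = sum(1 for gram in summary_ngrams if gram not in source_ngrams)
    return 100.0 * novel / len(summary_ngrams)


# Toy strings for illustration only (not taken from the dataset).
toy_source = "the cat sat on the mat"
toy_summary = "the cat lay on the rug"
for n in (1, 2):
    print(n, novel_ngram_percentage(toy_source, toy_summary, n))
```

For a multi-document example, the source text would be the concatenation of all input documents before calling the function.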

Dataset Format

key	description
aid	arXiv ID (e.g. 2010.14235)
mid	Microsoft Academic Graph ID
abstract	text of the paper's abstract
ref_abstract	meta-information of the reference papers
ref_abstract.cite_N	meta-information of reference paper cite_N (special cite symbol)
ref_abstract.cite_N.mid	Microsoft Academic Graph ID of reference paper cite_N
ref_abstract.cite_N.abstract	text of the abstract of reference paper cite_N
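
The keys above can be read as a nested mapping. The sketch below shows one way to collect the source documents (the paper's abstract plus its reference abstracts) from a single record. The record here is a hand-written placeholder, the literal key spelling `cite_1` stands in for whatever cite symbol the data files use, and the helper name `source_documents` is hypothetical; nothing is assumed about how the files in `data` are serialized.

```python
# Placeholder record that mirrors the documented keys; all values are
# illustrative and do not come from the dataset.
toy_record = {
    "aid": "0000.00000",
    "mid": "1111111111",
    "abstract": "Abstract of the query paper.",
    "ref_abstract": {
        "cite_1": {"mid": "2222222222", "abstract": "Abstract of reference 1."},
        "cite_2": {"mid": "3333333333", "abstract": ""},  # missing abstract
    },
}


def source_documents(record):
    """Return the paper abstract followed by non-empty reference abstracts."""
    docs = [record["abstract"]]
    for cite_key in sorted(record["ref_abstract"]):
        ref = record["ref_abstract"][cite_key]
        if ref.get("abstract"):  # skip references without abstract text
            docs.append(ref["abstract"])
    return docs


for doc in source_documents(toy_record):
    print(doc)
```

Skipping empty reference abstracts is a design choice of this sketch; a model that relies on the position of each cite symbol may prefer to keep an empty placeholder instead.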

Extended Usage

Our dataset is aligned with the Microsoft Academic Graph (MAG): each paper and each of its references carries a MAG ID (the mid field). Anyone interested in the intersection of graphs and summarization can use these IDs to explore the dataset further.
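
One possible starting point for such exploration is to turn the mid fields into a citation adjacency map. The following is a minimal sketch under the record layout described in Dataset Format; the helper name `citation_edges` and the toy records are hypothetical, and the `cite_N` key spellings are placeholders.

```python
from collections import defaultdict


def citation_edges(records):
    """Map each paper's MAG ID to the set of MAG IDs it cites."""
    edges = defaultdict(set)
    for record in records:
        for ref in record["ref_abstract"].values():
            if ref.get("mid"):  # ignore references without a MAG ID
                edges[record["mid"]].add(ref["mid"])
    return dict(edges)


# Toy records for illustration only (IDs are made up).
toy_records = [
    {"aid": "0000.00001", "mid": "100", "abstract": "A",
     "ref_abstract": {"cite_1": {"mid": "200", "abstract": "B"},
                      "cite_2": {"mid": "300", "abstract": "C"}}},
    {"aid": "0000.00002", "mid": "101", "abstract": "D",
     "ref_abstract": {"cite_1": {"mid": "200", "abstract": "B"}}},
]

toy_graph = citation_edges(toy_records)
for paper_mid, cited in sorted(toy_graph.items()):
    print(paper_mid, sorted(cited))
```

The resulting dictionary can be passed to a graph library of choice, or used directly to find references shared by several papers.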
