Dataset for the EMNLP 2020 paper "Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles".
Appendix: model implementation and evaluation details.
| train/val/test examples | average document length | summary length | number of references |
|---|---|---|---|
We also compute the percentage of novel n-grams (n-grams that appear in the target summary but not in the source documents) for previous datasets. Three of them are single-document summarization datasets. Among existing multi-document summarization datasets, ours has the highest abstractiveness.
| Datasets | % of novel unigrams | % of novel bigrams | % of novel trigrams | % of novel 4-grams |
|---|---|---|---|---|
| NY Times (single) | 22.64 | 55.59 | 71.93 | 80.16 |
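
The novel n-gram percentage can be approximated with a few lines of Python. This is a minimal sketch, not the script used for the paper: the function names are hypothetical, tokenization is assumed to be lowercased whitespace splitting, and n-grams are counted as distinct types, so the numbers it produces will not exactly match the table.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_percentage(source: str, summary: str, n: int) -> float:
    """Percentage of distinct summary n-grams that never occur in the source.

    Assumption: lowercased whitespace tokenization; the paper's exact
    preprocessing may differ.
    """
    source_ngrams = ngrams(source.lower().split(), n)
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        # Summary shorter than n tokens: nothing to measure.
        return 0.0
    novel = summary_ngrams - source_ngrams
    return 100.0 * len(novel) / len(summary_ngrams)


# Toy example (made-up text, not taken from the dataset).
source = "the cat sat on the mat"
summary = "the cat lay on a mat"
for n in (1, 2):
    print(n, round(novel_ngram_percentage(source, summary, n), 2))
# prints:
# 1 33.33
# 2 80.0
```

For a multi-document example, the source string would be the concatenation of the paper abstract and all reference abstracts.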
| Field | Description |
|---|---|
| aid | arXiv ID (e.g. 2010.14235) |
| mid | Microsoft Academic Graph ID |
| abstract | text of the paper abstract |
| ref_abstract | meta-information of the reference papers |
| ref_abstract.cite_N | meta-information of reference paper cite_N (special cite symbol) |
| ref_abstract.cite_N.mid | Microsoft Academic Graph ID of reference paper cite_N |
| ref_abstract.cite_N.abstract | abstract text of reference paper cite_N |
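
To make the nesting of these fields concrete, here is a small sketch of how one record might look once loaded into Python, and how the source text for a summarizer could be assembled from it. The record values are placeholders, and the helper `build_source` and its line-per-document layout are assumptions for illustration, not part of the released code.

```python
# Hypothetical record that follows the field table above.
# All IDs and texts are placeholders, not real dataset entries.
record = {
    "aid": "2010.14235",
    "mid": "0000000001",
    "abstract": "Abstract of the query paper.",
    "ref_abstract": {
        "cite_1": {"mid": "0000000002", "abstract": "Abstract of the first reference."},
        "cite_2": {"mid": "0000000003", "abstract": "Abstract of the second reference."},
    },
}


def build_source(record: dict, sep: str = "\n") -> str:
    """Join the paper abstract with its reference abstracts.

    Each reference abstract is prefixed with its cite symbol so the
    text can still be linked back to ref_abstract.cite_N.
    """
    parts = [record["abstract"]]
    for cite_symbol, ref in record["ref_abstract"].items():
        parts.append(f"{cite_symbol}: {ref['abstract']}")
    return sep.join(parts)


print(build_source(record))
```

Keeping the cite symbol in front of each reference abstract is one possible design choice; it preserves the mapping between a reference's text and its entry in ref_abstract.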
The dataset is aligned with the Microsoft Academic Graph through the mid fields, so anyone interested in the intersection of graph-based methods and summarization can use it for exploration.
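
As a starting point for graph-based exploration, the mid fields can be turned into citation edges. The sketch below assumes records shaped like the field table (a top-level mid plus ref_abstract.cite_N.mid); the function name `citation_edges`, the toy IDs, and the handling of empty IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Set, Tuple


def citation_edges(records: Iterable[dict]) -> Iterator[Tuple[str, str]]:
    """Yield (citing_mid, cited_mid) pairs for every reference in every record.

    Assumption: a reference whose mid is missing or empty is skipped.
    """
    for record in records:
        for ref in record["ref_abstract"].values():
            if ref.get("mid"):
                yield record["mid"], ref["mid"]


# Toy records with placeholder IDs, not real Microsoft Academic Graph IDs.
records = [
    {"mid": "A", "ref_abstract": {
        "cite_1": {"mid": "B", "abstract": "placeholder"},
        "cite_2": {"mid": "C", "abstract": "placeholder"},
    }},
    {"mid": "D", "ref_abstract": {
        "cite_1": {"mid": "B", "abstract": "placeholder"},
    }},
]

# Reverse index: for each cited paper, the set of papers that cite it.
cited_by: Dict[str, Set[str]] = defaultdict(set)
for citing, cited in citation_edges(records):
    cited_by[cited].add(citing)

print(sorted(cited_by["B"]))
# prints: ['A', 'D']
```

The resulting edge list can be passed to any graph library, or joined with other Microsoft Academic Graph data through the shared IDs.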