This repository contains the dataset for our ACL 2022 DialDoc Workshop paper MSAMSum: Towards Benchmarking Multi-lingual Dialogue Summarization.
- 1. Abstract
- 2. Dataset Construction
- 3. Multi-lingual Settings
- 4. MSAMSum
- 5. Recommendation
- 6. Citation
- 7. License
Dialogue summarization, which helps users capture salient information from various types of dialogues, has received much attention recently.
However, existing work mainly focuses on English dialogue summarization, leaving other languages less well explored.
Therefore, we present a multi-lingual dialogue summarization dataset, namely MSAMSum, which covers dialogue-summary pairs in six languages.
Specifically, we derive MSAMSum from the standard SAMSum using sophisticated translation techniques and further employ two methods to ensure the overall translation quality and the factual consistency of the summaries.
Given the proposed MSAMSum, we systematically set up five multi-lingual settings for this task, including a novel mix-lingual dialogue summarization setting.
To illustrate the utility of our dataset, we benchmark various experiments with pre-trained models under different settings and report results in both supervised and zero-shot manners.
We also discuss directions for future work on this task to motivate further research.
Illustration of our data construction process.
Illustration of different multi-lingual settings.
Illustration of the mix-lingual dialogue construction process.
To obtain MSAMSum, please send an application email to xiachongfeng1996[at].gmail.com that includes your name, affiliation, and research purpose.
Note that we cannot directly release a share link for MSAMSum because of the CC BY-NC-ND 4.0 license of the original SAMSum dataset.
We also recommend two closely related papers on cross-lingual dialogue summarization:
- ClidSum: A Benchmark Dataset for Cross-Lingual Dialogue Summarization [code && data]
- The Cross-lingual Conversation Summarization Challenge
If you find this work useful or use the data in your work, please consider citing our paper as well as the SAMSum paper:
@inproceedings{feng-etal-2022-msamsum,
title = "{MSAMS}um: Towards Benchmarking Multi-lingual Dialogue Summarization",
author = "Feng, Xiachong and
Feng, Xiaocheng and
Qin, Bing",
booktitle = "Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.dialdoc-1.1",
pages = "1--12"
}
@inproceedings{gliwa2019samsum,
title={SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization},
author={Gliwa, Bogdan and Mochol, Iwona and Biesek, Maciej and Wawer, Aleksander},
booktitle={Proceedings of the 2nd Workshop on New Frontiers in Summarization},
pages={70--79},
year={2019}
}