Summarization Papers

Full List

Organized by Xiachong Feng.

Contributors

Yichong Huang, Haozheng Yang, Jiaan Wang

Summarization Learning Route

Summarization Learning Route (with link)

Presentations & Notes

Dialogue Summarization (2022.1)
Cross-lingual Summarization
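How to Apply DialoGPT to Dialogue Summarization? @ ACL 2021
A Brief Overview of Recent Advances in Dialogue Summarization
Dialogue Summarization (2021.5)
Abstractive Dialogue Summarization with Commonsense Knowledge
Is Meeting Summarization Hard? Try Incorporating Dialogue Discourse Structure
Text Summarization Paper List (Chinese)
Fact-Aware Abstractive Text Summarization (Chinese)
A Brief Overview of Multi-modal Summarization (Chinese)
A Brief Overview of Text Summarization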
Multi-modal Summarization
ACL20 Summarization
A Brief Overview of Text Summarization (Chinese)
ACL19 Summarization
Brief intro to summarization (Chinese)
EMNLP19 Summarization (Chinese)
ACL19-A Simple Theoretical Model of Importance for Summarization
ACL19-Multimodal Abstractive Summarization for How2 Videos

Big Model Era

TempoSum: Evaluating the Temporal Generalization of Abstractive Summarization Chi Seng Cheang, Hou Pong Chan, Derek F. Wong, Xuebo Liu, Zhaocong Li, Yanming Sun, Shudong Liu, Lidia S. Chao [pdf] [data]

[Abs]
Recent pre-trained language models (PLMs) achieve promising results in existing abstractive summarization datasets. However, existing summarization benchmarks overlap in time with the standard pre-training corpora and finetuning datasets. Hence, the strong performance of PLMs may rely on the parametric knowledge that is memorized during pre-training and fine-tuning. Moreover, the knowledge memorized by PLMs may quickly become outdated, which affects the generalization performance of PLMs on future data. In this work, we propose TempoSum, a novel benchmark that contains data samples from 2010 to 2022, to understand the temporal generalization ability of abstractive summarization models. Through extensive human evaluation, we show that parametric knowledge stored in summarization models significantly affects the faithfulness of the generated summaries on future data. Moreover, existing faithfulness enhancement methods cannot reliably improve the faithfulness of summarization models on future data. Finally, we discuss several recommendations to the research community on how to evaluate and improve the temporal generalization capability of text summarization models.
RadAdapt: Radiology Report Summarization via Lightweight Domain Adaptation of Large Language Models Dave Van Veen, Cara Van Uden, Maayane Attias, Anuj Pareek, Christian Bluethgen, Malgorzata Polacin, Wah Chiu, Jean-Benoit Delbrouck, Juan Manuel Zambrano Chaves, Curtis P. Langlotz, Akshay S. Chaudhari, John Pauly [pdf]

[Abs]
We systematically investigate lightweight strategies to adapt large language models (LLMs) for the task of radiology report summarization (RRS). Specifically, we focus on domain adaptation via pretraining (on natural language, biomedical text, and clinical text) and via prompting (zero-shot, in-context learning) or parameter-efficient fine-tuning (prefix tuning, LoRA). Our results on the MIMIC-III dataset consistently demonstrate best performance by maximally adapting to the task via pretraining on clinical text and parameter-efficient fine-tuning on RRS examples. Importantly, this method fine-tunes a mere 0.32% of parameters throughout the model, in contrast to end-to-end fine-tuning (100% of parameters). Additionally, we study the effect of in-context examples and out-of-distribution (OOD) training before concluding with a radiologist reader study and qualitative analysis. Our findings highlight the importance of domain adaptation in RRS and provide valuable insights toward developing effective natural language processing solutions for clinical tasks.
ImpressionGPT: An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT Chong Ma, Zihao Wu, Jiaqi Wang, Shaochen Xu, Yaonai Wei, Zhengliang Liu, Lei Guo, Xiaoyan Cai, Shu Zhang, Tuo Zhang, Dajiang Zhu, Dinggang Shen, Tianming Liu, Xiang Li [pdf]

[Abs]
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians, and it is typically written by radiologists based on the 'Findings' section. However, writing numerous impressions can be laborious and error-prone for radiologists. Although recent studies have achieved promising results in automatic impression generation using large-scale medical text data for pre-training and fine-tuning pre-trained language models, such models often require substantial amounts of medical text data and have poor generalization performance. While large language models (LLMs) like ChatGPT have shown strong generalization capabilities and performance, their performance in specific domains, such as radiology, remains under-investigated and potentially limited. To address this limitation, we propose ImpressionGPT, which leverages the in-context learning capability of LLMs by constructing dynamic contexts using domain-specific, individualized data. This dynamic prompt approach enables the model to learn contextual knowledge from semantically similar examples from existing data. Additionally, we design an iterative optimization algorithm that performs automatic evaluation on the generated impression results and composes the corresponding instruction prompts to further optimize the model. The proposed ImpressionGPT model achieves state-of-the-art performance on both MIMIC-CXR and OpenI datasets without requiring additional training data or fine-tuning the LLMs. This work presents a paradigm for localizing LLMs that can be applied in a wide range of similar application scenarios, bridging the gap between general-purpose LLMs and the specific language processing needs of various domains.
Extractive Summarization via ChatGPT for Faithful Summary Generation Haopeng Zhang, Xiao Liu, Jiawei Zhang [pdf]

[Abs]
Extractive summarization is a crucial task in natural language processing that aims to condense long documents into shorter versions by directly extracting sentences. The recent introduction of ChatGPT has attracted significant interest in the NLP community due to its remarkable performance on a wide range of downstream tasks. However, concerns regarding factuality and faithfulness have hindered its practical applications for summarization systems. This paper first presents a thorough evaluation of ChatGPT's performance on extractive summarization and compares it with traditional fine-tuning methods on various benchmark datasets. Our experimental analysis reveals that ChatGPT's extractive summarization performance is still inferior to existing supervised systems in terms of ROUGE scores. In addition, we explore the effectiveness of in-context learning and chain-of-thought reasoning for enhancing its performance. Furthermore, we find that applying an extract-then-generate pipeline with ChatGPT yields significant performance improvements over abstractive baselines in terms of summary faithfulness. These observations highlight potential directions for enhancing ChatGPT's capabilities for faithful text summarization tasks using two-stage approaches.
Human-like Summarization Evaluation with ChatGPT Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, Xiaojun Wan [pdf]

[Abs]
Evaluating text summarization is a challenging problem, and existing evaluation metrics are far from satisfactory. In this study, we explored ChatGPT's ability to perform human-like summarization evaluation using four human evaluation methods on five datasets. We found that ChatGPT was able to complete annotations relatively smoothly using Likert scale scoring, pairwise comparison, Pyramid, and binary factuality evaluation. Additionally, it outperformed commonly used automatic evaluation metrics on some datasets. Furthermore, we discussed the impact of different prompts, compared its performance with that of human evaluation, and analyzed the generated explanations and invalid responses.
Cross-Lingual Summarization via ChatGPT Jiaan Wang, Yunlong Liang, Fandong Meng, Zhixu Li, Jianfeng Qu, Jie Zhou [pdf]

[Abs]
Given a document in a source language, cross-lingual summarization (CLS) aims to generate a summary in a different target language. Recently, the emergence of ChatGPT has attracted wide attention from the computational linguistics community. However, the performance of ChatGPT on CLS is not yet known. In this report, we empirically use various prompts to guide ChatGPT to perform zero-shot CLS from different paradigms (i.e., end-to-end and pipeline), and provide a preliminary evaluation on its generated summaries. We find that ChatGPT originally prefers to produce lengthy summaries with more detailed information. But with the help of an interactive prompt, ChatGPT can balance between informativeness and conciseness, and significantly improve its CLS performance. Experimental results on three widely-used CLS datasets show that ChatGPT outperforms the advanced GPT 3.5 model (i.e., text-davinci-003). In addition, we provide qualitative case studies to show the superiority of ChatGPT on CLS.
Exploring the Limits of ChatGPT for Query or Aspect-based Text Summarization Xianjun Yang, Yan Li, Xinlu Zhang, Haifeng Chen, Wei Cheng [pdf]

[Abs]
Text summarization has been a crucial problem in natural language processing (NLP) for several decades. It aims to condense lengthy documents into shorter versions while retaining the most critical information. Various methods have been proposed for text summarization, including extractive and abstractive summarization. The emergence of large language models (LLMs) like GPT3 and ChatGPT has recently created significant interest in using these models for text summarization tasks. Recent studies (Goyal et al., 2022; Zhang et al., 2023) have shown that LLM-generated news summaries are already on par with humans. However, the performance of LLMs for more practical applications like aspect or query-based summaries is underexplored. To fill this gap, we conducted an evaluation of ChatGPT's performance on four widely used benchmark datasets, encompassing diverse summaries from Reddit posts, news articles, dialogue meetings, and stories. Our experiments reveal that ChatGPT's performance is comparable to traditional fine-tuning methods in terms of ROUGE scores. Moreover, we highlight some unique differences between ChatGPT-generated summaries and human references, providing valuable insights into the superpower of ChatGPT for diverse text summarization tasks. Our findings call for new directions in this area, and we plan to conduct further research to systematically examine the characteristics of ChatGPT-generated summaries through extensive human evaluation.
News Summarization and Evaluation in the Era of GPT-3 Tanya Goyal, Junyi Jessy Li, Greg Durrett [pdf] [code]

[Abs]
The recent success of zero- and few-shot prompting with models like GPT-3 has led to a paradigm shift in NLP research. In this paper, we study its impact on text summarization, focusing on the classic benchmark domain of news summarization. First, we investigate how zero-shot GPT-3 compares against fine-tuned models trained on large summarization datasets. We show that not only do humans overwhelmingly prefer GPT-3 summaries, but these also do not suffer from common dataset-specific issues such as poor factuality. Next, we study what this means for evaluation, particularly the role of gold standard test sets. Our experiments show that both reference-based and reference-free automatic metrics, e.g. recently proposed QA- or entailment-based factuality approaches, cannot reliably evaluate zero-shot summaries. Finally, we discuss future research challenges beyond generic summarization, specifically, keyword- and aspect-based summarization, showing how dominant fine-tuning approaches compare to zero-shot prompting. To support further research, we release: (a) a corpus of 10K generated summaries from fine-tuned and zero-shot models across 4 standard summarization benchmarks, (b) 1K human preference judgments and rationales comparing different systems for generic- and keyword-based summarization.
Benchmarking Large Language Models for News Summarization Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, Tatsunori B. Hashimoto [pdf]

[Abs]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human written summaries.
Is ChatGPT a General-Purpose Natural Language Processing Task Solver? Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, Diyi Yang [pdf]

[Abs]
Spurred by advancements in scale, large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot -- i.e., without adaptation on downstream data. Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community due to the fact that it can generate high-quality responses to human input and self-correct previous mistakes based on subsequent conversations. However, it is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot. In this work, we empirically analyze the zero-shot learning ability of ChatGPT by evaluating it on 20 popular NLP datasets covering 7 representative task categories. With extensive empirical studies, we demonstrate both the effectiveness and limitations of the current version of ChatGPT. We find that ChatGPT performs well on many tasks favoring reasoning capabilities (e.g., arithmetic reasoning) while it still faces challenges when solving specific tasks such as sequence tagging. We additionally provide in-depth analysis through qualitative case studies.
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, Pascale Fung [pdf]

[Abs]
This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available data sets. We carry out an extensive technical evaluation of ChatGPT using 21 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset. We find that ChatGPT outperforms LLMs with zero-shot learning on most tasks and even outperforms fine-tuned models on some tasks. We find that it is better at understanding non-Latin script languages than generating them. It is able to generate multimodal content from textual prompts, via an intermediate code generation step. Moreover, we find that ChatGPT is 64.33% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning, hence making it an unreliable reasoner. It is, for example, better at deductive than inductive reasoning. ChatGPT suffers from hallucination problems like other LLMs and it generates more extrinsic hallucinations from its parametric memory as it does not have access to an external knowledge base. Finally, the interactive feature of ChatGPT enables human collaboration with the underlying LLM to improve its performance, i.e., 8% ROUGE-1 on summarization and 2% ChrF++ on machine translation, in a multi-turn "prompt engineering" fashion.
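Several abstracts in this section report ROUGE scores. As a rough illustration of what ROUGE-1 measures, here is a minimal sketch of unigram-overlap precision, recall, and F1. It assumes lowercase whitespace tokenization and a single reference; the function name `rouge_1` is hypothetical, and the official toolkits used in these papers apply additional preprocessing (for example stemming) that is not reproduced here.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """Simplified ROUGE-1: clipped unigram overlap between two texts."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection clips each token's count to the smaller side,
    # so a repeated candidate word cannot match more often than it
    # appears in the reference.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_1("the cat sat on the mat", "the cat lay on a mat")
print(scores)
```

In this toy pair, four of the six candidate tokens are matched (the second "the" is clipped), so precision and recall are both 4/6.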

Decomposed

Summarization Programs: Interpretable Abstractive Summarization with Neural Modular Trees Swarnadeep Saha, Shiyue Zhang, Peter Hase, Mohit Bansal ICLR 2023 [pdf] [code]

[Abs]
Current abstractive summarization models either suffer from a lack of clear interpretability or provide incomplete rationales by only highlighting parts of the source document. To this end, we propose the Summarization Program (SP), an interpretable modular framework consisting of an (ordered) list of binary trees, each encoding the step-by-step generative process of an abstractive summary sentence from the source document. A Summarization Program contains one root node per summary sentence, and a distinct tree connects each summary sentence (root node) to the document sentences (leaf nodes) from which it is derived, with the connecting nodes containing intermediate generated sentences. Edges represent different modular operations involved in summarization such as sentence fusion, compression, and paraphrasing. We first propose an efficient best-first search method over neural modules, SP-Search that identifies SPs for human summaries by directly optimizing for ROUGE scores. Next, using these programs as automatic supervision, we propose seq2seq models that generate Summarization Programs, which are then executed to obtain final summaries. We demonstrate that SP-Search effectively represents the generative process behind human summaries using modules that are typically faithful to their intended behavior. We also conduct a simulation study to show that Summarization Programs improve the interpretability of summarization models by allowing humans to better simulate model reasoning. Summarization Programs constitute a promising step toward interpretable and modular abstractive summarization, a complex task previously addressed primarily through blackbox end-to-end neural systems. Supporting code available at this https URL
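The tree structure described in the abstract above can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions, not the authors' implementation: the `Node` class, the `execute` function, and the toy string-based `fuse`, `compress`, and `paraphrase` functions are all hypothetical stand-ins for the neural modules the paper describes.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


# Hypothetical toy modules; a real system would call trained neural models.
def fuse(left: str, right: str) -> str:
    return f"{left.rstrip('.')} and {right[0].lower()}{right[1:]}"


def compress(text: str) -> str:
    # Toy compression: keep only the clause before the first comma.
    return text.split(",")[0].rstrip(".") + "."


def paraphrase(text: str) -> str:
    return text  # placeholder: identity


MODULES = {"fusion": fuse, "compression": compress, "paraphrase": paraphrase}


@dataclass
class Node:
    op: Optional[str] = None           # None marks a leaf (document sentence)
    children: tuple[Node, ...] = ()    # one or two children for internal nodes
    sentence: str = ""                 # text stored at a leaf


def execute(node: Node) -> str:
    """Evaluate a tree bottom-up; the root yields one summary sentence."""
    if node.op is None:
        return node.sentence
    inputs = [execute(child) for child in node.children]
    return MODULES[node.op](*inputs)


# A program is an ordered list of trees, one root per summary sentence.
tree = Node(
    "fusion",
    (
        Node("compression", (Node(sentence="The storm hit the coast, causing heavy floods."),)),
        Node(sentence="Thousands lost power."),
    ),
)
program = [tree]
summary = " ".join(execute(root) for root in program)
print(summary)
```

Each intermediate node holds a generated sentence, so every step from document sentences to the final summary sentence can be inspected.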

Benchmark

Benchmarking Large Language Models for News Summarization Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, Tatsunori B. Hashimoto [pdf]

[Abs]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human written summaries.
MuLD: The Multitask Long Document Benchmark G Thomas Hudson, Noura Al Moubayed [pdf] [data]
EXPLAINABOARD: An Explainable Leaderboard for NLP Pengfei Liu, Jinlan Fu, Yang Xiao, Weizhe Yuan, Shuaichen Chang, Junqi Dai, Yixin Liu, Zihuiwen Ye, Graham Neubig [pdf] [ExplainaBoard]
GLGE: A New General Language Generation Evaluation Benchmark Dayiheng Liu, Yu Yan, Yeyun Gong, Weizhen Qi, Hang Zhang, Jian Jiao, Weizhu Chen, Jie Fu, Linjun Shou, Ming Gong, Pengcheng Wang, Jiusheng Chen, Daxin Jiang, Jiancheng Lv, Ruofei Zhang, Winnie Wu, Ming Zhou, Nan Duan [pdf] [benchmark]

Survey

A Survey on Biomedical Text Summarization with Pre-trained Language Model Qianqian Xie, Zheheng Luo, Benyou Wang, Sophia Ananiadou [pdf] [code]

[Abs]
The exponential growth of biomedical texts such as biomedical literature and electronic health records (EHRs), provides a big challenge for clinicians and researchers to access clinical information efficiently. To address the problem, biomedical text summarization has been proposed to support clinical information retrieval and management, aiming at generating concise summaries that distill key information from single or multiple biomedical documents. In recent years, pre-trained language models (PLMs) have been the de facto standard of various natural language processing tasks in the general domain. Most recently, PLMs have been further investigated in the biomedical field and brought new insights into the biomedical text summarization task. In this paper, we systematically summarize recent advances that explore PLMs for biomedical text summarization, to help understand recent progress, challenges, and future directions. We categorize PLMs-based approaches according to how they utilize PLMs and what PLMs they use. We then review available datasets, recent approaches and evaluation metrics of the task. We finally discuss existing challenges and promising future directions. To facilitate the research community, we line up open resources including available datasets, recent approaches, codes, evaluation metrics, and the leaderboard in a public project: this https URL.
A Survey on Medical Document Summarization Raghav Jain, Anubhav Jangra, Sriparna Saha, Adam Jatowt [pdf]

[Abs]
The internet has had a dramatic effect on the healthcare industry, allowing documents to be saved, shared, and managed digitally. This has made it easier to locate and share important data, improving patient care and providing more opportunities for medical studies. As there is so much data accessible to doctors and patients alike, summarizing it has become increasingly necessary - this has been supported through the introduction of deep learning and transformer-based networks, which have boosted the sector significantly in recent years. This paper gives a comprehensive survey of the current techniques and trends in medical summarization.
Taxonomy of Abstractive Dialogue Summarization: Scenarios, Approaches and Future Directions Qi Jia, Siyu Ren, Yizhu Liu, Kenny Q. Zhu [pdf]

[Abs]
Abstractive dialogue summarization is to generate a concise and fluent summary covering the salient information in a dialogue among two or more interlocutors. It has attracted great attention in recent years based on the massive emergence of social communication platforms and an urgent requirement for efficient dialogue information understanding and digestion. Different from news or articles in traditional document summarization, dialogues bring unique characteristics and additional challenges, including different language styles and formats, scattered information, flexible discourse structures and unclear topic boundaries. This survey provides a comprehensive investigation on existing work for abstractive dialogue summarization from scenarios, approaches to evaluations. It categorizes the task into two broad categories according to the type of input dialogues, i.e., open-domain and task-oriented, and presents a taxonomy of existing techniques in three directions, namely, injecting dialogue features, designing auxiliary training tasks and using additional data. A list of datasets under different scenarios and widely-accepted evaluation metrics are summarized for completeness. After that, the trends of scenarios and techniques are summarized, together with deep insights on correlations between extensively exploited features and different scenarios. Based on these analyses, we recommend future directions including more controlled and complicated scenarios, technical innovations and comparisons, publicly available datasets in special domains, etc.
A Survey of Automatic Text Summarization Using Graph Neural Networks Marco Ferdinand Salchner, Adam Jatowt COLING 2022 [pdf]

[Abs]
Although automatic text summarization (ATS) has been researched for several decades, the application of graph neural networks (GNNs) to this task started relatively recently. In this survey we provide an overview on the rapidly evolving approach of using GNNs for the task of automatic text summarization. In particular we provide detailed information on the functionality of GNNs in the context of ATS, and a comprehensive overview of models utilizing this approach.
A Survey on Cross-Lingual Summarization Jiaan Wang, Fandong Meng, Duo Zheng, Yunlong Liang, Zhixu Li, Jianfeng Qu, Jie Zhou TACL 2022 [pdf]

[Abs]
Cross-lingual summarization is the task of generating a summary in one language (e.g., English) for the given document(s) in a different language (e.g., Chinese). Under the globalization background, this task has attracted increasing attention of the computational linguistics community. Nevertheless, there still remains a lack of comprehensive review for this task. Therefore, we present the first systematic critical review on the datasets, approaches, and challenges in this field. Specifically, we carefully organize existing datasets and approaches according to different construction methods and solution paradigms, respectively. For each type of datasets or approaches, we thoroughly introduce and summarize previous efforts and further compare them with each other to provide deeper analyses. In the end, we also discuss promising directions and offer our thoughts to facilitate future research. This survey is for both beginners and experts in cross-lingual summarization, and we hope it will serve as a starting point as well as a source of new ideas for researchers and engineers interested in this area.
An Empirical Survey on Long Document Summarization: Datasets, Models and Metrics Huan Yee Koh, Jiaxin Ju, Ming Liu, Shirui Pan ACM Computing Surveys [pdf]

[Abs]
Long documents such as academic articles and business reports have been the standard format to detail out important issues and complicated subjects that require extra attention. An automatic summarization system that can effectively condense long documents into short and concise texts to encapsulate the most important information would thus be significant in aiding the reader's comprehension. Recently, with the advent of neural architectures, significant research efforts have been made to advance automatic text summarization systems, and numerous studies on the challenges of extending these systems to the long document domain have emerged. In this survey, we provide a comprehensive overview of the research on long document summarization and a systematic evaluation across the three principal components of its research setting: benchmark datasets, summarization models, and evaluation metrics. For each component, we organize the literature within the context of long document summarization and conduct an empirical analysis to broaden the perspective on current research progress. The empirical analysis includes a study on the intrinsic characteristics of benchmark datasets, a multi-dimensional analysis of summarization models, and a review of the summarization evaluation metrics. Based on the overall findings, we conclude by proposing possible directions for future exploration in this rapidly growing field.
Multi-document Summarization via Deep Learning Techniques: A Survey Congbo Ma, Wei Emma Zhang, Mingyu Guo, Hu Wang, Quan Z. Sheng [pdf]
Embedding Knowledge for Document Summarization: A Survey Yutong Qu, Wei Emma Zhang, Jian Yang, Lingfei Wu, Jia Wu, Xindong Wu [pdf]
A Survey on Dialogue Summarization: Recent Advances and New Frontiers Xiachong Feng, Xiaocheng Feng, Bing Qin IJCAI 2022, Survey Track [pdf]
Automatic Text Summarization Methods: A Comprehensive Review Divakar Yadav, Jalpa Desai, Arun Kumar Yadav [pdf]
Faithfulness in Natural Language Generation: A Systematic Survey of Analysis, Evaluation and Optimization Methods Wei Li, Wenhao Wu, Moye Chen, Jiachen Liu, Xinyan Xiao, Hua Wu [pdf]
Recent Advances in Neural Text Generation: A Task-Agnostic Survey Chen Tang, Frank Guerin, Yucheng Li, Chenghua Lin [pdf]
Survey of Hallucination in Natural Language Generation Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, Pascale Fung [pdf]
A Survey on Retrieval-Augmented Text Generation Huayang Li, Yixuan Su, Deng Cai, Yan Wang, Lemao Liu [pdf]
A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models Hanqing Zhang, Haolin Song, Shaoyu Li, Ming Zhou, Dawei Song [pdf]
A Survey of Pretrained Language Models Based Text Generation Junyi Li, Tianyi Tang, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen [pdf]
A Comprehensive Review on Summarizing Financial News Using Deep Learning Saurabh Kamal, Sahil Sharma [pdf]
A Survey on Multi-modal Summarization Anubhav Jangra, Adam Jatowt, Sriparna Saha, Mohammad Hasanuzzaman [pdf]
Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, Graham Neubig [pdf]
Pretrained Language Models for Text Generation: A Survey Junyi Li, Tianyi Tang, Wayne Xin Zhao, Ji-Rong Wen IJCAI21 [pdf]
A Survey of Recent Abstract Summarization Techniques Diyah Puspitaningrum ICICT21 [pdf]
A Survey of the State-of-the-Art Models in Neural Abstractive Text Summarization Ayesha Ayub Syed, Ford Lumban Gaol, Tokuro Matsuo [pdf]
Automatic summarization of scientific articles: A survey Nouf Ibrahim Altmami, Mohamed El Bachir Menai Journal of King Saud University - Computer and Information Sciences [pdf]
Deep Learning Based Abstractive Text Summarization: Approaches, Datasets, Evaluation Measures, and Challenges Dima Suleiman, Arafat A. Awajan [pdf]
A Survey of Knowledge-Enhanced Text Generation Wenhao Yu, Chenguang Zhu, Zaitang Li, Zhiting Hu, Qingyun Wang, Heng Ji, Meng Jiang [pdf]
From Standard Summarization to New Tasks and Beyond: Summarization with Manifold Information Shen Gao, Xiuying Chen, Zhaochun Ren, Dongyan Zhao, Rui Yan IJCAI20 [pdf]
Neural Abstractive Text Summarization with Sequence-to-Sequence Models Tian Shi, Yaser Keneshloo, Naren Ramakrishnan, Chandan K. Reddy [pdf]
A Survey on Neural Network-Based Summarization Methods Yue Dong [pdf]
Automated text summarisation and evidence-based medicine: A survey of two domains Abeed Sarker, Diego Molla, Cecile Paris [pdf]
Automatic Keyword Extraction for Text Summarization: A Survey Santosh Kumar Bharti, Korra Sathya Babu [pdf]
Text Summarization Techniques: A Brief Survey Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, Krys Kochut [pdf]
Recent automatic text summarization techniques: a survey Mahak Gambhir, Vishal Gupta [pdf]

Toolkit

Summary Workbench: Unifying Application and Evaluation of Text Summarization Models Shahbaz Syed, Dominik Schwabe, Martin Potthast EMNLP 2022 Demo [pdf] [demo]

[Abs]
This paper presents Summary Workbench, a new tool for developing and evaluating text summarization models. New models and evaluation measures can be easily integrated as Docker-based plugins, making it possible to examine the quality of their summaries against any input and to evaluate them using various evaluation measures. Visual analyses combining multiple measures provide insights into the models' strengths and weaknesses. The tool is hosted at this https URL and also supports local deployment for private resources.
iFacetSum: Coreference-based Interactive Faceted Summarization for Multi-Document Exploration Eran Hirsch, Alon Eirew, Ori Shapira, Avi Caciularu, Arie Cattan, Ori Ernst, Ramakanth Pasunuru, Hadar Ronen, Mohit Bansal, Ido Dagan EMNLP 2021 [pdf] [demo]
SummerTime: Text Summarization Toolkit for Non-experts Ansong Ni, Zhangir Azerbayev, Mutethia Mutuma, Troy Feng, Yusen Zhang, Tao Yu, Ahmed Hassan Awadallah, Dragomir Radev EMNLP 2021 Demo Track [pdf] [Demo]
Summary Explorer: Visualizing the State of the Art in Text Summarization Shahbaz Syed, Tariq Yousef, Khalid Al-Khatib, Stefan Jänicke, Martin Potthast [pdf] [web]
fastnlp/fastSum [code]
Graph4NLP [code] [summarization]
CTRLsum: Towards Generic Controllable Text Summarization [pdf] [code] EMNLP 2022

[Abs]
Current summarization systems yield generic summaries that are disconnected from users’ preferences and expectations. To address this limitation, we present CTRLsum, a generic framework to control generated summaries through a set of keywords. During training keywords are extracted automatically without requiring additional human annotations. At test time CTRLsum features a control function to map control signal to keywords; through engineering the control function, the same trained model is able to be applied to control summaries on various dimensions, while neither affecting the model training process nor the pretrained models. We additionally explore the combination of keywords and text prompts for more control tasks. Experiments demonstrate the effectiveness of CTRLsum on three domains of summarization datasets and five control tasks: (1) entity-centric and (2) length-controllable summarization, (3) contribution summarization on scientific papers, (4) invention purpose summarization on patent filings, and (5) question-guided summarization on news articles. Moreover, when used in a standard, unconstrained summarization setting, CTRLsum is comparable or better than strong pretrained systems.
OpenNMT-py: Open-Source Neural Machine Translation [pdf] [code]
Fairseq: Facebook AI Research Sequence-to-Sequence Toolkit written in Python. [code]
LeafNATS: An Open-Source Toolkit and Live Demo System for Neural Abstractive Text Summarization Tian Shi, Ping Wang, Chandan K. Reddy NAACL19 [pdf] [code]
TransformerSum [code]

Analysis

Human Guided Exploitation of Interpretable Attention Patterns in Summarization and Topic Segmentation Raymond Li, Wen Xiao, Linzi Xing, Lanjun Wang, Gabriel Murray, Giuseppe Carenini EMNLP 2022 [pdf] [code]

[Abs]
The multi-head self-attention mechanism of the transformer model has been thoroughly investigated recently. In one vein of study, researchers are interested in understanding why and how transformers work. In another vein, researchers propose new attention augmentation methods to make transformers more accurate, efficient and interpretable. In this paper, we combine these two lines of research in a human-in-the-loop pipeline to first discover important task-specific attention patterns. Then those patterns are injected, not only to smaller models, but also to the original model. The benefits of our pipeline and discovered patterns are demonstrated in two case studies with extractive summarization and topic segmentation. After discovering interpretable patterns in BERT-based models fine-tuned for the two downstream tasks, experiments indicate that when we inject the patterns into attention heads, the models show considerable improvements in accuracy and efficiency.
Analyzing Multi-Task Learning for Abstractive Text Summarization Frederic Kirstein, Jan Philip Wahle, Terry Ruas, Bela Gipp [pdf]

[Abs]
Despite the recent success of multi-task learning and pre-finetuning for natural language understanding, few works have studied the effects of task families on abstractive text summarization. Task families are a form of task grouping during the pre-finetuning stage to learn common skills, such as reading comprehension. To close this gap, we analyze the influence of multi-task learning strategies using task families for the English abstractive text summarization task. We group tasks into one of three strategies, i.e., sequential, simultaneous, and continual multi-task learning, and evaluate trained models through two downstream tasks. We find that certain combinations of task families (e.g., advanced reading comprehension and natural language inference) positively impact downstream performance. Further, we find that choice and combinations of task families influence downstream performance more than the training scheme, supporting the use of task families for abstractive text summarization.
On Decoding Strategies for Neural Text Generators Gian Wiher, Clara Meister, Ryan Cotterell [pdf]
Training Dynamics for Text Summarization Models Tanya Goyal, Jiacheng Xu, Junyi Jessy Li, Greg Durrett [https://arxiv.org/abs/2110.08370]
Does Summary Evaluation Survive Translation to Other Languages? Neslihan Iskender, Oleg Vasilyev, Tim Polzehl, John Bohannon, Sebastian Möller [pdf]
How well do you know your summarization datasets? Priyam Tejaswin, Dhruv Naik, Pengfei Liu Findings of ACL 2021 [pdf] [code]
Dissecting Generation Modes for Abstractive Summarization Models via Ablation and Attribution Jiacheng Xu, Greg Durrett ACL2021 [pdf] [code]
To Point or Not to Point: Understanding How Abstractive Summarizers Paraphrase Text Matt Wilber, William Timkey, Marten Van Schijndel Findings of ACL 2021 [pdf] [code]
What Makes a Good Summary? Reconsidering the Focus of Automatic Summarization Maartje ter Hoeve, Julia Kiseleva, Maarten de Rijke [pdf]
Intrinsic Evaluation of Summarization Datasets Rishi Bommasani, Claire Cardie EMNLP20 [pdf]
Metrics also Disagree in the Low Scoring Range: Revisiting Summarization Evaluation Metrics Manik Bhandari, Pranav Gour, Atabak Ashfaq, Pengfei Liu COLING20 Short [pdf] [code]
At Which Level Should We Extract? An Empirical Analysis on Extractive Document Summarization Qingyu Zhou, Furu Wei, Ming Zhou COLING20 [pdf]
Corpora Evaluation and System Bias detection in Multi Document Summarization Alvin Dey, Tanya Chowdhury, Yash Kumar, Tanmoy Chakraborty Findings of EMNLP [pdf]
Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries Daniel Deutsch, Dan Roth [pdf] [code]
Understanding Neural Abstractive Summarization Models via Uncertainty Jiacheng Xu, Shrey Desai, Greg Durrett EMNLP20 Short [pdf] [code]
Re-evaluating Evaluation in Text Summarization Manik Bhandari, Pranav Gour, Atabak Ashfaq, Pengfei Liu, Graham Neubig EMNLP20 [pdf] [code]
CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural Summarization Systems Yiran Chen, Pengfei Liu, Ming Zhong, Zi-Yi Dou, Danqing Wang, Xipeng Qiu, Xuanjing Huang EMNLP20 [pdf] [code]
What Have We Achieved on Text Summarization? Dandan Huang, Leyang Cui, Sen Yang, Guangsheng Bao, Kun Wang, Jun Xie, Yue Zhang EMNLP20 [pdf]
Conditional Neural Generation using Sub-Aspect Functions for Extractive News Summarization Zhengyuan Liu, Ke Shi, Nancy F. Chen Findings of EMNLP20 [pdf]
Extractive Summarization as Text Matching Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, Xuanjing Huang ACL20 [pdf] [code]
Neural Text Summarization: A Critical Evaluation Wojciech Kryściński, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, Richard Socher EMNLP19 [pdf]
Earlier Isn’t Always Better: Sub-aspect Analysis on Corpus and System Biases in Summarization Taehee Jung, Dongyeop Kang, Lucas Mentch, Eduard Hovy EMNLP19 [pdf] [code]
A Closer Look at Data Bias in Neural Extractive Summarization Models Ming Zhong, Danqing Wang, Pengfei Liu, Xipeng Qiu, Xuanjing Huang EMNLP19 Workshop [pdf]
Countering the Effects of Lead Bias in News Summarization via Multi-Stage Training and Auxiliary Losses Matt Grenander, Yue Dong, Jackie Chi Kit Cheung, Annie Louis EMNLP19 Short [pdf]
Searching for Effective Neural Extractive Summarization: What Works and What's Next Ming Zhong, Pengfei Liu, Danqing Wang, Xipeng Qiu, Xuanjing Huang ACL19 [pdf] [code]
Content Selection in Deep Learning Models of Summarization Chris Kedzie, Kathleen McKeown, Hal Daumé III EMNLP18 [pdf] [code]

Thesis

Principled Approaches to Automatic Text Summarization Maxime Peyrard [pdf]
Neural Text Summarization and Generation Piji Li [pdf]

Theory

Bayesian Active Summarization Alexios Gidiotis, Grigorios Tsoumakas [pdf]
RefSum: Refactoring Neural Summarization Yixin Liu, Zi-Yi Dou, Pengfei Liu NAACL21 [pdf] [code]
Principled Approaches to Automatic Text Summarization Maxime Peyrard [pdf]
KLearn: Background Knowledge Inference from Summarization Data Maxime Peyrard, Robert West Findings of EMNLP20 [pdf] [code]
A Simple Theoretical Model of Importance for Summarization Maxime Peyrard ACL19 [pdf]
BottleSum: Unsupervised and Self-supervised Sentence Summarization using the Information Bottleneck Principle Peter West, Ari Holtzman, Jan Buys, Yejin Choi EMNLP19 [pdf] [code]

Dataset

ID	Name	Description	Paper	Conference
1	CNN-DailyMail	News	Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond	SIGNLL16
2	New York Times	News	The New York Times Annotated Corpus
3	DUC	News	The Effects Of Human Variation In DUC Summarization Evaluation
4	Gigaword	News	A Neural Attention Model For Abstractive Sentence Summarization	EMNLP15
5	Newsroom	News	Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies	NAACL18
6	XSum	News	Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization	EMNLP18
7	Multi-News	Multi-document News	Multi-News: a Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model	ACL19
8	SAMSum	Multi-party conversation	SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization	EMNLP19
9	AMI	Meeting	The AMI Meeting Corpus: A pre-announcement.
10	ICSI	Meeting	The ICSI Meeting Corpus
11	MSMO	Multi-modal	MSMO: Multimodal Summarization with Multimodal Output	EMNLP18
12	How2	Multi-modal	How2: A Large-scale Dataset for Multimodal Language Understanding	NIPS18
13	ScisummNet	Scientific paper	ScisummNet: A Large Annotated Corpus and Content-Impact Models for Scientific Paper Summarization with Citation Networks	AAAI19
14	PubMed, ArXiv	Scientific paper	A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents	NAACL18
15	TALKSUMM	Scientific paper	TALKSUMM: A Dataset and Scalable Annotation Method for Scientific Paper Summarization Based on Conference Talks	ACL19
16	BillSum	Legal	BillSum: A Corpus for Automatic Summarization of US Legislation	EMNLP19
17	LCSTS	Chinese Weibo	LCSTS: A Large Scale Chinese Short Text Summarization Dataset	EMNLP15
18	WikiHow	Online Knowledge Base	WikiHow: A Large Scale Text Summarization Dataset
19	Concept-map-based MDS Corpus	Educational Multi-document	Bringing Structure into Summaries: Crowdsourcing a Benchmark Corpus of Concept Maps	EMNLP17
20	WikiSum	Wikipedia Multi-document	Generating Wikipedia by Summarizing Long Sequences	ICLR18
21	GameWikiSum	Game Multi-document	GameWikiSum: a Novel Large Multi-Document Summarization Dataset	LREC20
22	En2Zh CLS, Zh2En CLS	Cross-Lingual	NCLS: Neural Cross-Lingual Summarization	EMNLP19
23	Timeline Summarization Dataset	Baidu timeline	Learning towards Abstractive Timeline Summarization	IJCAI19
24	Reddit TIFU	online discussion	Abstractive Summarization of Reddit Posts with Multi-level Memory Networks	NAACL19
25	TripAtt	Review	Attribute-aware Sequence Network for Review Summarization	EMNLP19
26	Reader Comments Summarization Corpus	Comments-based Weibo	Abstractive Text Summarization by Incorporating Reader Comments	AAAI19
27	BIGPATENT	Patent	BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization	ACL19
28	Curation Corpus	News	Curation Corpus for Abstractive Text Summarisation
29	MATINF	Multi-task	MATINF: A Jointly Labeled Large-Scale Dataset for Classification, Question Answering and Summarization	ACL20
30	MLSUM	Multi-Lingual Summarization Dataset	MLSUM: The Multilingual Summarization Corpus	EMNLP20
31	Dialogue(Debate)	Argumentative Dialogue Summary Corpus	Using Summarization to Discover Argument Facets in Online Idealogical Dialog	NAACL15
32	WCEP	News Multi-document	A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal	ACL20 Short
33	ArgKP	Argument-to-key Point Mapping	From Arguments to Key Points: Towards Automatic Argument Summarization	ACL20
34	CRD3	Dialogue	Storytelling with Dialogue: A Critical Role Dungeons and Dragons Dataset	ACL20
35	Gazeta	Russian news	Dataset for Automatic Summarization of Russian News
36	MIND	English news recommendation, Summarization, Classification, Entity	MIND: A Large-scale Dataset for News Recommendation	ACL20
37	public_meetings	french meeting(test set)	Align then Summarize: Automatic Alignment Methods for Summarization Corpus Creation	LREC
38	Enron	Email	Building a Dataset for Summarization and Keyword Extraction from Emails	2014
39	Columbia	Email	Summarizing Email Threads	2004
40	BC3	Email	A publicly available annotated corpus for supervised email summarization
41	WikiLingua	Cross-Lingual	WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization	Findings of EMNLP20
42	LcsPIRT	Chinese Dialogue	Global Encoding for Long Chinese Text Summarization	TALLIP
43	CLTS, CLTS-plus	Chinese News	CLTS: A New Chinese Long Text Summarization Dataset; CLTS+: A New Chinese Long Text Summarization Dataset with Abstractive Summaries	NLPCC20
44	VMSMO	Multi-modal	VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles	EMNLP20
45	Multi-XScience	Multi-document	Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles	EMNLP20 short
46	SCITLDR	Scientific Document	TLDR: Extreme Summarization of Scientific Documents	Findings of EMNLP20
47	scisumm-corpus	Scientific Document
48	QBSUM	Query-Based Chinese	QBSUM: a Large-Scale Query-Based Document Summarization Dataset from Real-world Applications	Computer Speech & Language
49	qMDS	Query-Based Multi-Document	AQuaMuSe: Automatically Generating Datasets for Query-Based Multi-Document Summarization
50	Liputan6	Indonesian	Liputan6: A Large-scale Indonesian Dataset for Text Summarization	AACL20
51	SportsSum	Sports Game	Generating Sports News from Live Commentary: A Chinese Dataset for Sports Game Summarization	AACL20
52	WikiAsp	Aspect-based	WikiAsp: A Dataset for Multi-domain Aspect-based Summarization	Transactions of the ACL
53	DebateSum	argument	DebateSum: A large-scale argument mining and summarization dataset	ARGMIN 2020
54	Open4Business	Business	Open4Business (O4B): An Open Access Dataset for Summarizing Business Documents	Workshop on Dataset Curation and Security-NeurIPS 2020
55	OrangeSum	French	BARThez: a Skilled Pretrained French Sequence-to-Sequence Model
56	Medical Conversation	medical conversation	Summarizing Medical Conversations via Identifying Important Utterances	COLING20
57	SumTitles	movie dialogue	SumTitles: a Summarization Dataset with Low Extractiveness	COLING20
58	BANS	bengali news	Bengali Abstractive News Summarization (BANS): A Neural Attention Approach	TCCE-2020
59	e-commerce	E-commerce	On the Faithfulness for E-commerce Product Summarization	COLING20
60	TWEETSUM	Twitter	TWEETSUM: Event-oriented Social Summarization Dataset	COLING20
61	SPACE	Opinion	Extractive Opinion Summarization in Quantized Transformer Spaces	TACL
62	pn-summary	Persian	Leveraging ParsBERT and Pretrained mT5 for Persian Abstractive Text Summarization	csicc2021
63	E-commerce1desensitized	Dialogue	Topic-Oriented Spoken Dialogue Summarization for Customer Service with Saliency-Aware Topic Modeling	AAAI21
64	E-commerce2desensitized	Dialogue	Unsupervised Summarization for Chat Logs with Topic-Oriented Ranking and Context-Aware Auto-Encoders	AAAI21
65	BengaliSummarization	Bengali	Unsupervised Abstractive Summarization of Bengali Text Documents	EACL21
66	MediaSum	Dialogue	MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization	NAACL21
67	Healthline and BreastCancer	multi-document	Nutri-bullets: Summarizing Health Studies by Composing Segments	AAAI21
68	GOVREPORT	Long Government reports	Efficient Attentions for Long Document Summarization	NAACL21
69	SSN	Scientific Paper	Enhancing Scientific Papers Summarization with Citation Graph	AAAI21
70	MTSamples	Medical	Towards objectively evaluating the quality of generated medical summaries
71	QMSum	Meeting, Query	QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization	NAACL21
72	MS2	Medical, Multi-Document	MS2: Multi-Document Summarization of Medical Studies
73	SummScreen	Television Series	SummScreen: A Dataset for Abstractive Screenplay Summarization	ACL 2022
74	SciDuet	Scientific Papers and Slides	D2S: Document-to-Slide Generation Via Query-Based Text Summarization	NAACL21
75	MultiHumES	Multilingual	MultiHumES: Multilingual Humanitarian Dataset for Extractive Summarization	EACL21
76	DialSumm	Dialogue	DialSumm: A Real-Life Scenario Dialogue Summarization Dataset	Findings of ACL21
77	BookSum	Book, Long-form	BookSum: A Collection of Datasets for Long-form Narrative Summarization
78	CLES	Chinese Weibo	A Large-Scale Chinese Long-Text Extractive Summarization Corpus	ICASSP
79	FacetSum	Scientific Paper	Bringing Structure into Summaries: a Faceted Summarization Dataset for Long Scientific Documents	ACL2021 short
80	ConvoSumm	Dialogue	ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive Summarization with Argument Mining	ACL2021
81	AgreeSum	Multi-document with entailment annotations	AgreeSum: Agreement-Oriented Multi-Document Summarization	Findings of ACL2021
82	En2De	Cross-Lingual En2De	Cross-Lingual Abstractive Summarization with Limited Parallel Resources	ACL 2021
83	VT-SSum	Spoken	VT-SSum: A Benchmark Dataset for Video Transcript Segmentation and Summarization
84	AESLC	Email	This Email Could Save Your Life: Introducing the Task of Email Subject Line Generation	ACL 2019
85	XL-Sum	Cross-lingual	XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages	Findings of ACL2021
86	TES 2012-2016	Tweet	TSSuBERT: Tweet Stream Summarization Using BERT
87	PENS	Personalized Headline	PENS: A Dataset and Generic Framework for Personalized News Headline Generation	ACL 2021
88	XSum Hallucination Annotations	Factuality	On Faithfulness and Factuality in Abstractive Summarization	ACL 2020
89	factuality-datasets	Factuality	Annotating and Modeling Fine-grained Factuality in Summarization	NAACL 2021
90	frank	Factuality	Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics	NAACL 2021
91	TRIPOD	Movie	Movie Summarization via Sparse Graph Construction	AAAI 2021
92	AdaptSum	Low-Resource	AdaptSum: Towards Low-Resource Domain Adaptation for Abstractive Summarization	NAACL 2021
93	PTS	Product	Multi-Source Pointer Network for Product Title Summarization	CIKM 2018
94	RAMDS	Reader-Aware	Reader-Aware Multi-Document Summarization: An Enhanced Model and The First Dataset	EMNLP 2017 Workshop
95	court judgment	court judgment	How to Write Summaries with Patterns? Learning towards Abstractive Summarization through Prototype Editing	EMNLP 2019
96	ADEGBTS	gaze behaviors	A Dataset for Exploring Gaze Behaviors in Text Summarization	ACM MMSys'20
97	MeQSum	Medical	On the Summarization of Consumer Health Questions	ACL 2019
98	OpoSum	Opinion	Summarizing Opinions: Aspect Extraction Meets Sentiment Prediction and They Are Both Weakly Supervised	EMNLP 2018
99	MM-AVS	Multi-modal	Multi-modal Summarization for Video-containing Documents	NAACL 2021
100	WikiCatSum	multi-doc	Generating Summaries with Topic Templates and Structured Convolutional Decoders	ACL 2019
101	SDF-TLS	Timeline	Summarize Dates First: A Paradigm Shift in Timeline Summarization	SIGIR 2021
102	RWS-Cit		*Automatic generation of related work through summarizing citations	2017
103	MTLS	Timeline	Multi-TimeLine Summarization (MTLS): Improving Timeline Summarization by Generating Multiple Summaries	ACL 2021
104	EMAILSUM	Email	EmailSum: Abstractive Email Thread Summarization	ACL 2021
105	WikiSum	WikiHow	WikiSum: Coherent Summarization Dataset for Efficient Human-Evaluation	ACL 2021 Short
106	SumPubMed	PubMed Scientific Article	SumPubMed: Summarization Dataset of PubMed Scientific Articles	ACL 2021 Student Research Workshop
107	MLGSum	Multi-lingual	Contrastive Aligned Joint Learning for Multilingual Summarization	ACL 2021 Findings
108	SMARTPHONE,COMPUTER	Product	CUSTOM: Aspect-Oriented Product Summarization for E-Commerce
109	CSDS	Customer Service Dialogue	CSDS: A Fine-grained Chinese Dataset for Customer Service Dialogue Summarization	EMNLP 2021
110	persian-dataset	persian	ARMAN: Pre-training with Semantically Selecting and Reordering of Sentences for Persian Abstractive Summarization
111	StreamHover	spoken livestream	StreamHover: Livestream Transcript Summarization and Annotation	EMNLP 2021
112	CNewSum	News	CNewSum: A Large-scale Chinese News Summarization Dataset with Human-annotated Adequacy and Deducibility Level	NLPCC 2021
113	MiRANews	news, factual	MiRANews: Dataset and Benchmarks for Multi-Resource-Assisted News Summarization	EMNLP 2021 Findings
114	HowSumm	query multi-doc	HowSumm: A Multi-Document Summarization Dataset Derived from WikiHow Articles
115	SportsSum2.0	Sports	SportsSum2.0: Generating High-Quality Sports News from Live Text Commentary
116	CoCoSum	opinion multi-ref	Comparative Opinion Summarization via Collaborative Decoding
117	MReD	Controllable	MReD: A Meta-Review Dataset for Controllable Text Generation
118	MS^2	Multi-Document, Medical	MS^2: Multi-Document Summarization of Medical Studies	EMNLP 2021
119	MassiveSumm		MassiveSumm: a very large-scale, very multilingual, news summarisation dataset	EMNLP 2021
120	XWikis	multilingual	Models and Datasets for Cross-Lingual Summarisation	EMNLP 2021
121	SUBSUME	Intent, subjective	SUBSUME: A Dataset for Subjective Summary Extraction from Wikipedia Documents	EMNLP 2021 newsum
122	TLDR9+		TLDR9+: A Large Scale Resource for Extreme Summarization of Social Media Posts	EMNLP 2021 newsum
123	20 Minuten	German	A New Dataset and Efficient Baselines for Document-level Text Simplification in German	EMNLP 2021 newsum
124	WSD	multi-lingual	A Novel Wikipedia based Dataset for Monolingual and Cross-Lingual Summarization	EMNLP 2021 newsum
125	TEDSummary	Speech	Attention-based Multi-hypothesis Fusion for Speech Summarization
126	SummaC Benchmark	Factual, NLI	SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization
127	ForumSum	Conversation	ForumSum: A Multi-Speaker Conversation Summarization Dataset	EMNLP 2021 Findings
128	K-SportsSum	Sports	Knowledge Enhanced Sports Game Summarization	WSDM 2022
129	Test-Amazon	Opinion, New test for Amazon reviews	Unsupervised Opinion Summarization as Copycat-Review Generation	ACL 2020
130	Test-Amazon-Yelp	Opinion, New test for Amazon(180) and Yelp(300)	Few-Shot Learning for Opinion Summarization	EMNLP 2020
131	AmaSum	Opinion	Learning Opinion Summarizers by Selecting Informative Reviews	EMNLP 2021
132	CrossSum	Cross lingual	CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summarization for 1500+ Language Pairs
133	HCSCL-MSDataset	Multi-modal	Hierarchical Cross-Modality Semantic Correlation Learning Model for Multimodal Summarization	AAAI 2022
134	Klexikon	German	Klexikon: A German Dataset for Joint Summarization and Simplification
135	TODSum	Customer Service	TODSum: Task-Oriented Dialogue Summarization with State Tracking
136	TWEETSUMM	Customer Service	TWEETSUMM - A Dialog Summarization Dataset for Customer Service	Findings of EMNLP 2021
137	PeerSum	Multi-document, Scientific	PeerSum: A Peer Review Dataset for Abstractive Multi-document Summarization
138	Celebrity TS, Event TS, Wiki TS	Timeline, person, event	Follow the Timeline! Generating Abstractive and Extractive Timeline Summary in Chronological Order	TOIS 2022
139	Chart-to-Text	chart	Chart-to-Text: A Large-Scale Benchmark for Chart Summarization
140	GovReport-QS	Long Document	HIBRIDS: Attention with Hierarchical Biases for Structure-aware Long Document Summarization	ACL 2022
141	EntSUM	Entity	EntSUM: A Data Set for Entity-Centric Summarization	ACL 2022
142	ALLSIDES	Framing Bias	NeuS: Neutral Multi-News Summarization for Mitigating Framing Bias	ACL 2022
143	GRAPHELSUMS	graph	Summarization with Graphical Elements
144	Annotated-Wikilarge-Newsela	Factuality	Evaluating Factuality in Text Simplification	ACL 2022
145	WikiMulti	Cross-lingual	WikiMulti: a Corpus for Cross-Lingual Summarization
146	Welsh		Introducing the Welsh Text Summarisation Dataset and Baseline Systems
147	SuMe	Biomedical	SuMe: A Dataset Towards Summarizing Biomedical Mechanisms	LREC 2022
148	CiteSum		CiteSum: Citation Text-guided Scientific Extreme Summarization and Low-resource Domain Adaptation
148	MSAMSum	Dialogue	MSAMSum: Towards Benchmarking Multi-lingual Dialogue Summarization	ACL 2022 DialDoc
149	SQuALITY	Long-Document	SQuALITY: Building a Long-Document Summarization Dataset the Hard Way	EMNLP 2022
150	X-SCITLDR		X-SCITLDR: Cross-Lingual Extreme Summarization of Scholarly Documents	JCDL 2022
151	NEWTS	News	NEWTS: A Corpus for News Topic-Focused Summarization
152	EntSUM	Entity	EntSUM: A Data Set for Entity-Centric Extractive Summarization	ACL 2022
153	ASPECTNEWS		ASPECTNEWS: Aspect-Oriented Summarization of News Documents	ACL 2022
154	RNSum	Commit Logs	RNSum: A Large-Scale Dataset for Automatic Release Note Generation via Commit Logs Summarization	ACL 2022
155	AnswerSumm	query multi-doc	AnswerSumm: A Manually-Curated Dataset and Pipeline for Answer Summarization	NAACL 2022
156	CHQ-Summ		CHQ-Summ: A Dataset for Consumer Healthcare Question Summarization
157	Multi-LexSum	multi-doc	Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities
158	DACSA	Catalan and Spanish	DACSA: A large-scale Dataset for Automatic summarization of Catalan and Spanish newspaper Articles	NAACL 2022
159	BigSurvey	Academic Multi-doc	Generating a Structured Summary of Numerous Academic Papers: Dataset and Method	IJCAI 2022
160	CSL	Chinese, Academic	CSL: A Large-scale Chinese Scientific Literature Dataset	COLING 2022
161	PCC Summaries	German	Extractive Summarisation for German-language Data: A Text-level Approach with Discourse Features	COLING 2022
162	LipKey	abstractive summaries, absent keyphrases, and titles	LipKey: A Large-Scale News Dataset for Absent Keyphrases Generation and Abstractive Summarization	COLING 2022
163	PLOS	Lay summary of biomedical journal articles	Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature	EMNLP 2022
164	eLife	Lay summary of biomedical journal articles	Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature	EMNLP 2022
165	ECTSum	Long Earnings Call Transcripts	ECTSum: A New Benchmark Dataset For Bullet Point Summarization of Long Earnings Call Transcripts	EMNLP 2022
166	EUR-Lex-Sum	Multi- and Cross-lingual Legal	EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form Summarization in the Legal Domain	EMNLP 2022
167	CrisisLTLSum	Timeline	CrisisLTLSum: A Benchmark for Local Crisis Event Timeline Extraction and Summarization
168	LANS(`upon request`)	Arabic	LANS: Large-scale Arabic News Summarization Corpus
169	MACSUM	Controllable News Dialogue	MACSUM: Controllable Summarization with Mixed Attributes
170	NarraSum	Narrative	NarraSum: A Large-Scale Dataset for Abstractive Narrative Summarization	EMNLP Findings 2022
171	LoRaLay	Long Scientific Visual	LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization	EACL 2023
172	HunSum-1	Hungarian	HunSum-1: an Abstractive Summarization Dataset for Hungarian
173	MCLS	Multimodal Cross-Lingual	Assist Non-native Viewers: Multimodal Cross-Lingual Summarization for How2 Videos	EMNLP 2022
174	JDDC 2.1	multimodal	JDDC 2.1: A Multimodal Chinese Dialogue Dataset with Joint Tasks of Query Rewriting, Response Generation, Discourse Parsing, and Summarization	EMNLP 2022
175	CroCoSum	Code-switched Cross-lingual	CroCoSum: A Benchmark Dataset for Cross-Lingual Code-Switched Summarization
176	unarXive	scholarly	unarXive: a large scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata	Scientometrics 2020
177	TempoSum		TempoSum: Evaluating the Temporal Generalization of Abstractive Summarization
178	VCSUM	meeting	VCSUM: A Versatile Chinese Meeting Summarization Dataset	ACL Findings 2023
179	MeetingBank	meeting	MeetingBank: A Benchmark Dataset for Meeting Summarization	ACL 2023

Dialogue

Dataset

MeetingBank: A Benchmark Dataset for Meeting Summarization Yebowen Hu, Timothy Ganter, Hanieh Deilamsalehy, Franck Dernoncourt, Hassan Foroosh, Fei Liu ACL 2023 [pdf] [data]

[Abs]
As the number of recorded meetings increases, it becomes increasingly important to utilize summarization technology to create useful summaries of these recordings. However, there is a crucial lack of annotated meeting corpora for developing this technology, as it can be hard to collect meetings, especially when the topics discussed are confidential. Furthermore, meeting summaries written by experienced writers are scarce, making it hard for abstractive summarizers to produce sensible output without a reliable reference. This lack of annotated corpora has hindered the development of meeting summarization technology. In this paper, we present MeetingBank, a new benchmark dataset of city council meetings over the past decade. MeetingBank is unique among other meeting corpora due to its divide-and-conquer approach, which involves dividing professionally written meeting minutes into shorter passages and aligning them with specific segments of the meeting. This breaks down the process of summarizing a lengthy meeting into smaller, more manageable tasks. The dataset provides a new testbed of various meeting summarization systems and also allows the public to gain insight into how council decisions are made. We make the collection, including meeting video links, transcripts, reference summaries, agenda, and other metadata, publicly available to facilitate the development of better meeting summarization techniques.
ECTSum: A New Benchmark Dataset For Bullet Point Summarization of Long Earnings Call Transcripts Rajdeep Mukherjee, Abhinav Bohra, Akash Banerjee, Soumya Sharma, Manjunath Hegde, Afreen Shaikh, Shivani Shrivastava, Koustuv Dasgupta, Niloy Ganguly, Saptarshi Ghosh, Pawan Goyal EMNLP 2022 [pdf] [data]

[Abs]
Despite tremendous progress in automatic summarization, state-of-the-art methods are predominantly trained to excel in summarizing short newswire articles, or documents with strong layout biases such as scientific articles or government reports. Efficient techniques to summarize financial documents, including facts and figures, have largely been unexplored, majorly due to the unavailability of suitable datasets. In this work, we present ECTSum, a new dataset with transcripts of earnings calls (ECTs), hosted by publicly traded companies, as documents, and short experts-written telegram-style bullet point summaries derived from corresponding Reuters articles. ECTs are long unstructured documents without any prescribed length limit or format. We benchmark our dataset with state-of-the-art summarizers across various metrics evaluating the content quality and factual consistency of the generated summaries. Finally, we present a simple-yet-effective approach, ECT-BPS, to generate a set of bullet points that precisely capture the important facts discussed in the calls.
TODSum: Task-Oriented Dialogue Summarization with State Tracking Lulu Zhao, Fujia Zheng, Keqing He, Weihao Zeng, Yuejie Lei, Huixing Jiang, Wei Wu, Weiran Xu, Jun Guo, Fanyu Meng [pdf]
TWEETSUMM - A Dialog Summarization Dataset for Customer Service Guy Feigenblat, Chulaka Gunasekara, Benjamin Sznajder, Sachindra Joshi, David Konopnicki, Ranit Aharonov Findings of EMNLP 2021 [pdf] [data]
ForumSum: A Multi-Speaker Conversation Summarization Dataset Misha Khalman, Yao Zhao, Mohammad Saleh EMNLP 2021 Findings [pdf] [data]
CSDS: A Fine-grained Chinese Dataset for Customer Service Dialogue Summarization Haitao Lin, Liqun Ma, Junnan Zhu, Lu Xiang, Yu Zhou, Jiajun Zhang, Chengqing Zong EMNLP 2021 [pdf] [data]
EmailSum: Abstractive Email Thread Summarization Shiyue Zhang, Asli Celikyilmaz, Jianfeng Gao, Mohit Bansal ACL 2021 [pdf] [data]
DialSumm: A Real-Life Scenario Dialogue Summarization Dataset Yulong Chen, Yang Liu, Liang Chen, Yue Zhang Findings of ACL21 [pdf] [data]
ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive Summarization with Argument Mining Alexander R. Fabbri, Faiaz Rahman, Imad Rizvi, Borui Wang, Haoran Li, Yashar Mehdad, Dragomir Radev ACL 2021 [pdf] [code]
MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization Chenguang Zhu, Yang Liu, Jie Mei, Michael Zeng NAACL21 [pdf] [code]
QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, Dragomir Radev NAACL21 [pdf] [data]
Storytelling with Dialogue: A Critical Role Dungeons and Dragons Dataset Revanth Rameshkumar, Peter Bailey ACL20 [pdf] [data]
SumTitles: a Summarization Dataset with Low Extractiveness Valentin Malykh, Konstantin Chernis, Ekaterina Artemova, Irina Piontkovskaya COLING20 [pdf] [code]
Summarizing Medical Conversations via Identifying Important Utterances Yan Song, Yuanhe Tian, Nan Wang, Fei Xia COLING20 [pdf] [code]
GupShup: Summarizing Open-Domain Code-Switched Conversations Laiba Mehnaz, Debanjan Mahata, Rakesh Gosangi, Uma Sushmitha Gunturi, Riya Jain, Gauri Gupta, Amardeep Kumar, Isabelle G. Lee, Anish Acharya, Rajiv Ratn Shah EMNLP 2021 [pdf] [code]
SummScreen: A Dataset for Abstractive Screenplay Summarization Mingda Chen, Zewei Chu, Sam Wiseman, Kevin Gimpel ACL 2022 [pdf] [data]

[Abs]
We introduce SummScreen, a summarization dataset comprised of pairs of TV series transcripts and human written recaps. The dataset provides a challenging testbed for abstractive summarization for several reasons. Plot details are often expressed indirectly in character dialogues and may be scattered across the entirety of the transcript. These details must be found and integrated to form the succinct plot descriptions in the recaps. Also, TV scripts contain content that does not directly pertain to the central plot but rather serves to develop characters or provide comic relief. This information is rarely contained in recaps. Since characters are fundamental to TV series, we also propose two entity-centric evaluation metrics. Empirically, we characterize the dataset by evaluating several methods, including neural models and those based on nearest neighbors. An oracle extractive approach outperforms all benchmarked models according to automatic metrics, showing that the neural models are unable to fully exploit the input transcripts. Human evaluation and qualitative analysis reveal that our non-oracle models are competitive with their oracle counterparts in terms of generating faithful plot events and can benefit from better content selectors. Both oracle and non-oracle models generate unfaithful facts, suggesting future research directions.
SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization Bogdan Gliwa, Iwona Mochol, Maciej Biesek, Aleksander Wawer EMNLP19 [pdf] [data]
Dial2Desc: End-to-end Dialogue Description Generation Haojie Pan, Junpei Zhou, Zhou Zhao, Yan Liu, Deng Cai, Min Yang [pdf]
The AMI meeting corpus: A pre-announcement Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, et al. [pdf]
The ICSI meeting corpus Adam Janin, Don Baron, Jane Edwards, Dan Ellis, David Gelbart, Nelson Morgan, Barbara Peskin, Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke, et al. [pdf]

Email Summarization

Focus on the Action: Learning to Highlight and Summarize Jointly for Email To-Do Items Summarization Kexun Zhang, Jiaao Chen, Diyi Yang Findings of ACL 2022 [pdf]
EmailSum: Abstractive Email Thread Summarization Shiyue Zhang, Asli Celikyilmaz, Jianfeng Gao, Mohit Bansal ACL 2021 [pdf] [data]
Smart To-Do: Automatic Generation of To-Do Items from Emails Sudipto Mukherjee, Subhabrata Mukherjee, Marcello Hasegawa, Ahmed Hassan Awadallah, Ryen White ACL 2020 [pdf] [code] [bib]
Identifying Implicit Quotes for Unsupervised Extractive Summarization of Conversations Ryuji Kano, Yasuhide Miura, Tomoki Taniguchi, Tomoko Ohkuma AACL20 [pdf]
This Email Could Save Your Life: Introducing the Task of Email Subject Line Generation Rui Zhang, Joel Tetreault ACL 2019 [pdf] [data] [bib]
Building a Dataset for Summarization and Keyword Extraction from Emails Vanessa Loza, Shibamouli Lahiri, Rada Mihalcea, Po-Hsiang Lai LREC 2014 [pdf]
A Publicly Available Annotated Corpus for Supervised Email Summarization Jan Ulrich, Gabriel Murray, Giuseppe Carenini AAAI 2008 [pdf]
Summarizing Email Conversations with Clue Words Giuseppe Carenini, Raymond T. Ng, Xiaodong Zhou WWW 2007 [pdf]
Task-focused Summarization of Email Simon H. Corston-Oliver, Eric Ringger, Michael Gamon, Richard Campbell ACL 2004 [pdf]
Summarizing email threads Owen Rambow, Lokesh Shrestha, John Chen, Chirsty Lauridsen NAACL 2004 [pdf] [bib]
Facilitating email thread access by extractive summary generation Ani Nenkova Recent advances in natural language processing III: selected papers from RANLP [pdf]
Summarizing Archived Discussions: A Beginning Paula S. Newman, John C. Blitzer Proceedings of the 8th international conference on Intelligent user interfaces [pdf]
Combining linguistic and machine learning techniques for email summarization Smaranda Muresan, Evelyne Tzoukermann, Judith L. Klavans Proceedings of the ACL 2001 Workshop on Computational Natural Language Learning (CoNLL) 2001 [pdf] [bib]

Meeting Summarization

Learning to Rank Utterances for Query-Focused Meeting Summarization Xingxian Liu, Yajing Xu Findings of ACL 2023 [pdf]

[Abs]
Query-focused meeting summarization (QFMS) aims to generate a specific summary for the given query according to the meeting transcripts. Due to the conflict between long meetings and limited input size, previous works mainly adopt extract-then-summarize methods, which use extractors to simulate binary labels or ROUGE scores to extract utterances related to the query and then generate a summary. However, the previous approach fails to fully use the comparison between utterances. To the extractor, comparison orders are more important than specific scores. In this paper, we propose a Ranker-Generator framework. It learns to rank the utterances by comparing them in pairs and learning from the global orders, then uses top utterances as the generator’s input. We show that learning to rank utterances helps to select utterances related to the query effectively, and the summarizer can benefit from it. Experimental results on QMSum show that the proposed model outperforms all existing multi-stage models with fewer parameters.
ExplainMeetSum: A Dataset for Explainable Meeting Summarization Aligned with Human Intent Hyun Kim, Minsoo Cho, Seung-Hoon Na ACL 2023 [pdf] [code]

[Abs]
To enhance the explainability of meeting summarization, we construct a new dataset called “ExplainMeetSum,” an augmented version of QMSum, by newly annotating evidence sentences that faithfully “explain” a summary. Using ExplainMeetSum, we propose a novel multiple extractor guided summarization, namely Multi-DYLE, which extensively generalizes DYLE to enable using a supervised extractor based on human-aligned extractive oracles. We further present an explainability-aware task, named “Explainable Evidence Extraction” (E3), which aims to automatically detect all evidence sentences that support a given summary. Experimental results on the QMSum dataset show that the proposed Multi-DYLE outperforms DYLE with gains of up to 3.13 in the ROUGE-1 score. We further present the initial results on the E3 task, under the settings using separate and joint evaluation metrics.
VCSUM: A Versatile Chinese Meeting Summarization Dataset Han Wu, Mingjie Zhan, Haochen Tan, Zhaohui Hou, Ding Liang, Linqi Song Findings of ACL 2023 [pdf] [data]

[Abs]
Compared to news and chat summarization, the development of meeting summarization is hugely decelerated by the limited data. To this end, we introduce a versatile Chinese meeting summarization dataset, dubbed VCSum, consisting of 239 real-life meetings, with a total duration of over 230 hours. We claim our dataset is versatile because we provide the annotations of topic segmentation, headlines, segmentation summaries, overall meeting summaries, and salient sentences for each meeting transcript. As such, the dataset can adapt to various summarization tasks or methods, including segmentation-based summarization, multi-granularity summarization and retrieval-then-generate summarization. Our analysis confirms the effectiveness and robustness of VCSum. We also provide a set of benchmark models regarding different downstream summarization tasks on VCSum to facilitate further research.
Query-Utterance Attention with Joint modeling for Query-Focused Meeting Summarization Xingxian Liu, Bin Duan, Bo Xiao, Yajing Xu ICASSP 2023 [pdf]

[Abs]
Query-focused meeting summarization (QFMS) aims to generate summaries from meeting transcripts in response to a given query. Previous works typically concatenate the query with meeting transcripts and implicitly model the query relevance only at the token level with attention mechanism. However, due to the dilution of key query-relevant information caused by long meeting transcripts, the original transformer-based model is insufficient to highlight the key parts related to the query. In this paper, we propose a query-aware framework with joint modeling token and utterance based on Query-Utterance Attention. It calculates the utterance-level relevance to the query with a dense retrieval module. Then both token-level query relevance and utterance-level query relevance are combined and incorporated into the generation process with attention mechanism explicitly. We show that the query relevance of different granularities contributes to generating a summary more related to the query. Experimental results on the QMSum dataset show that the proposed model achieves new state-of-the-art performance.
Meeting Decision Tracker: Making Meeting Minutes with De-Contextualized Utterances Shumpei Inoue, Hy Nguyen, Pham Viet Hoang, Tsungwei Liu, Minh-Tien Nguyen AACL-IJCNLP 2022 [pdf] [demo]

[Abs]
Meetings are a universal process to make decisions in business and project collaboration. The capability to automatically itemize the decisions in daily meetings allows for extensive tracking of past discussions. To that end, we developed Meeting Decision Tracker, a prototype system to construct decision items comprising decision utterance detector (DUD) and decision utterance rewriter (DUR). We show that DUR makes a sizable contribution to improving the user experience by dealing with utterance collapse in natural conversation. An introduction video of our system is also available.
ESSumm: Extractive Speech Summarization from Untranscribed Meeting Jun Wang Interspeech 2022 [pdf]

[Abs]
In this paper, we propose a novel architecture for direct extractive speech-to-speech summarization, ESSumm, which is an unsupervised model without dependence on intermediate transcribed text. Unlike previous methods with text presentation, we aim to generate a summary directly from speech without transcription. First, a set of smaller speech segments is extracted based on the speech signal's acoustic features. For each candidate speech segment, a distance-based summarization confidence score is designed for latent speech representation measure. Specifically, we leverage the off-the-shelf self-supervised convolutional neural network to extract the deep speech features from raw audio. Our approach automatically predicts the optimal sequence of speech segments that capture the key information with a target summary length. Extensive results on two well-known meeting datasets (AMI and ICSI corpora) show the effectiveness of our direct speech-based method to improve the summarization quality with untranscribed data. We also observe that our unsupervised speech-based method even performs on par with recent transcript-based summarization approaches, where extra speech recognition is required.
Abstractive Meeting Summarization: A Survey Virgile Rennard, Guokan Shang, Julie Hunter, Michalis Vazirgiannis [pdf]

[Abs]
Recent advances in deep learning, and especially the invention of encoder-decoder architectures, have significantly improved the performance of abstractive summarization systems. While the majority of research has focused on written documents, we have observed an increasing interest in the summarization of dialogues and multi-party conversation over the past few years. A system that could reliably transform the audio or transcript of a human conversation into an abridged version that homes in on the most important points of the discussion would be valuable in a wide variety of real-world contexts, from business meetings to medical consultations to customer service calls. This paper focuses on abstractive summarization for multi-party meetings, providing a survey of the challenges, datasets and systems relevant to this task and a discussion of promising directions for future study.
ALIGNMEET: A Comprehensive Tool for Meeting Annotation, Alignment, and Evaluation Peter Polák, Muskaan Singh, Anna Nedoluzhko, Ondřej Bojar LREC 2022 [pdf] [data]
TANet: Thread-Aware Pretraining for Abstractive Conversational Summarization Ze Yang, Liran Wang, Zhoujin Tian, Wei Wu, Zhoujun Li Findings of NAACL 2022 [pdf]

[Abs]
Although pre-trained language models (PLMs) have achieved great success and become a milestone in NLP, abstractive conversational summarization remains a challenging but less studied task. The difficulty lies in two aspects. One is the lack of large-scale conversational summary data. Another is that applying the existing pre-trained models to this task is tricky because of the structural dependence within the conversation and its informal expression, etc. In this work, we first build a large-scale (11M) pretraining dataset called RCSum, based on the multi-person discussions in the Reddit community. We then present TANet, a thread-aware Transformer-based network. Unlike the existing pre-trained models that treat a conversation as a sequence of sentences, we argue that the inherent contextual dependency among the utterances plays an essential role in understanding the entire conversation and thus propose two new techniques to incorporate the structural information into our model. The first is thread-aware attention which is computed by taking into account the contextual dependency within utterances. Second, we apply thread prediction loss to predict the relations between utterances. We evaluate our model on four datasets of real conversations, covering types of meeting transcripts, customer-service records, and forum threads. Experimental results demonstrate that TANet achieves a new state-of-the-art in terms of both automatic evaluation and human judgment.
Summ^N: A Multi-Stage Summarization Framework for Long Input Dialogues and Documents Yusen Zhang, Ansong Ni, Ziming Mao, Chen Henry Wu, Chenguang Zhu, Budhaditya Deb, Ahmed H. Awadallah, Dragomir Radev, Rui Zhang ACL 2022 [pdf] [code]

[Abs]
Text summarization helps readers capture salient information from documents, news, interviews, and meetings. However, most state-of-the-art pretrained language models (LM) are unable to efficiently process long text for many summarization tasks. In this paper, we propose Summ^N, a simple, flexible, and effective multi-stage framework for input texts that are longer than the maximum context length of typical pretrained LMs. Summ^N first splits the data samples and generates a coarse summary in multiple stages and then produces the final fine-grained summary based on it. Our framework can process input text of arbitrary length by adjusting the number of stages while keeping the LM input size fixed. Moreover, it can deal with both single-source documents and dialogues, and it can be used on top of different backbone abstractive summarization models. To the best of our knowledge, Summ^N is the first multi-stage split-then-summarize framework for long input summarization. Our experiments demonstrate that Summ^N outperforms previous state-of-the-art methods by improving ROUGE scores on three long meeting summarization datasets AMI, ICSI, and QMSum, two long TV series datasets from SummScreen, and a long document summarization dataset GovReport. Our data and code are available at https://github.com/psunlpgroup/Summ-N.
Exploring Neural Models for Query-Focused Summarization Jesse Vig, Alexander R. Fabbri, Wojciech Kryściński [pdf] [code]
Improving Abstractive Dialogue Summarization with Hierarchical Pretraining and Topic Segment MengNan Qi, Hao Liu, YuZhuo Fu, Ting Liu EMNLP 2021 Findings [pdf]
Meeting Summarization with Pre-training and Clustering Methods Andras Huebner, Wei Ji, Xiang Xiao [pdf] [code]
Context or No Context? A preliminary exploration of human-in-the-loop approach for Incremental Temporal Summarization in meetings Nicole Beckage, Shachi H Kumar, Saurav Sahay, Ramesh Manuvinakurike EMNLP 2021 | newsum [pdf]
RetrievalSum: A Retrieval Enhanced Framework for Abstractive Summarization Chenxin An, Ming Zhong, Zhichao Geng, Jianqiang Yang, Xipeng Qiu [pdf]
An Exploratory Study on Long Dialogue Summarization: What Works and What's Next Yusen Zhang, Ansong Ni, Tao Yu, Rui Zhang, Chenguang Zhu, Budhaditya Deb, Asli Celikyilmaz, Ahmed Hassan Awadallah, Dragomir Radev Findings of EMNLP 2021 Short [pdf]
DialogLM: Pre-trained Model for Long Dialogue Understanding and Summarization Ming Zhong, Yang Liu, Yichong Xu, Chenguang Zhu, Michael Zeng AAAI 2022 [pdf] [code]
Dynamic Sliding Window for Meeting Summarization Zhengyuan Liu, Nancy F. Chen SummDial@SIGDial 2021 [pdf]
MeetSum: Transforming Meeting Transcript Summarization using Transformers! Nima Sadri, Bohan Zhang, Bihan Liu [pdf]
Incremental temporal summarization in multiparty meetings Ramesh Manuvinakurike, Saurav Sahay, Wenda Chen, Lama Nachman SIGIR 2021 [pdf]
Abstractive Spoken Document Summarization using Hierarchical Model with Multi-stage Attention Diversity Optimization Potsawee Manakul, Mark J. F. Gales, Linlin Wang INTERSPEECH 2020 [pdf] [code]
What are meeting summaries? An analysis of human extractive summaries in meeting corpus Fei Liu, Yang Liu SIGDIAL 2008 [pdf]
Exploring Speaker Characteristics for Meeting Summarization Fei Liu, Yang Liu INTERSPEECH 2010 [pdf]
Automatic meeting summarization and topic detection system Tai-Chia Huang, Chia-Hsuan Hsieh, Hei-Chia Wang [pdf]
A keyphrase based approach to interactive meeting summarization Korbinian Riedhammer, Benoit Favre, Dilek Hakkani-Tür 2008 IEEE Spoken Language Technology Workshop [pdf]
A global optimization framework for meeting summarization Dan Gillick, Korbinian Riedhammer, Benoit Favre, Dilek Hakkani-Tür 2009 IEEE International Conference on Acoustics, Speech and Signal Processing [pdf]
Evaluating the effectiveness of features and sampling in extractive meeting summarization Shasha Xie, Yang Liu, Hui Lin SLT 2008 [pdf]
Abstractive Meeting Summarization Using Dependency Graph Fusion Siddhartha Banerjee, Prasenjit Mitra, Kazunari Sugiyama WWW 2015 [pdf]
Automatic Community Creation for Abstractive Spoken Conversation Summarization Karan Singla, Evgeny Stepanov, Ali Orkan Bayer, Giuseppe Carenini, Giuseppe Riccardi ACL 2017 workshop [pdf] [bib]
Unsupervised Abstractive Meeting Summarization with Multi-Sentence Compression and Budgeted Submodular Maximization Guokan Shang, Wensi Ding, Zekun Zhang, Antoine Jean-Pierre Tixier, Polykarpos Meladianos, Michalis Vazirgiannis, Jean-Pierre Lorré ACL18 [pdf] [code]
Abstractive meeting summarization based on an attentional neural model Nouha Dammak, Yassine BenAyed [pdf]
A Study of Text Summarization Techniques for Generating Meeting Minutes Tu My Doan, Francois Jacquenet, Christine Largeron, Marc Bernard RCIS 2020 [pdf]
Meeting Summarization, A Challenge for Deep Learning Francois Jacquenet, Marc Bernard, Christine Largeron IWANN 2019 [pdf]
Generating Abstractive Summaries from Meeting Transcripts Siddhartha Banerjee, Prasenjit Mitra, Kazunari Sugiyama Proceedings of the 2015 ACM Symposium on Document Engineering, DocEng 2015 [pdf]
Align then Summarize: Automatic Alignment Methods for Summarization Corpus Creation Paul Tardy, David Janiszek, Yannick Estève, Vincent Nguyen LREC 2020 [pdf] [bib]
Dialogue Discourse-Aware Graph Model and Data Augmentation for Meeting Summarization Xiachong Feng, Xiaocheng Feng, Bing Qin, Xinwei Geng IJCAI21 [pdf] [code]
How Domain Terminology Affects Meeting Summarization Performance Jia Jin Koay, Alexander Roustai, Xiaojin Dai, Dillon Burns, Alec Kerrigan, Fei Liu COLING20 Short [pdf] [code]
How to Interact and Change? Abstractive Dialogue Summarization with Dialogue Act Weight and Topic Change Info Jiasheng Di, Xiao Wei, Zhenyu Zhang KSEM 2020 [pdf] [code]
Abstractive Dialogue Summarization with Sentence-Gated Modeling Optimized by Dialogue Acts Chih-Wen Goo, Yun-Nung Chen SLT18 [pdf] [code]
A Sliding-Window Approach to Automatic Creation of Meeting Minutes Jia Jin Koay, Alexander Roustai, Xiaojin Dai, Fei Liu [pdf]
Hierarchical Learning for Generation with Long Source Sequences Tobias Rohde, Xiaoxia Wu, Yinhan Liu [pdf] [code]
A Hierarchical Network for Abstractive Meeting Summarization with Cross-Domain Pretraining Chenguang Zhu, Ruochen Xu, Michael Zeng, Xuedong Huang Findings of EMNLP20 [pdf] [code] [unofficial-code]
Abstractive Meeting Summarization via Hierarchical Adaptive Segmental Network Learning Zhou Zhao, Haojie Pan, Changjie Fan, Yan Liu, Linlin Li, Min Yang WWW19 [pdf]
Restructuring Conversations using Discourse Relations for Zero-shot Abstractive Dialogue Summarization Prakhar Ganesh, Saket Dingliwal [pdf]
Keep Meeting Summaries on Topic: Abstractive Multi-Modal Meeting Summarization Manling Li, Lingyu Zhang, Heng Ji, Richard J. Radke ACL19 [pdf]
Automatic analysis of multiparty meetings Steve Renals [pdf]
A Multimodal Meeting Browser that Implements an Important Utterance Detection Model based on Multimodal Information Fumio Nihei, Yukiko I. Nakano [pdf]
Exploring Methods for Predicting Important Utterances Contributing to Meeting Summarization Fumio Nihei, Yukiko I. Nakano [pdf]
Fusing Verbal and Nonverbal Information for Extractive Meeting Summarization Fumio Nihei, Yukiko I. Nakano, Yutaka Takase GIFT18 [pdf]
Meeting Extracts for Discussion Summarization Based on Multimodal Nonverbal Information Fumio Nihei, Yukiko I. Nakano, Yutaka Takase ICMI16 [pdf]
Extractive Summarization of Meeting Recordings Gabriel Murray, Steve Renals, Jean Carletta [pdf]
Multimodal Summarization of Meeting Recordings Berna Erol, Dar-Shyang Lee, Jonathan Hull ICME 2003 [pdf]
Few-Shot Learning of an Interleaved Text Summarization Model by Pretraining with Synthetic Data Sanjeev Kumar Karn, Francine Chen, Yan-Ying Chen, Ulli Waltinger, Hinrich Schütze EACL21 [pdf]
Leverage Unlabeled Data for Abstractive Speech Summarization with Self-Supervised Learning and Back-Summarization SPECOM 2020 [pdf]
Focused Meeting Summarization via Unsupervised Relation Extraction Lu Wang, Claire Cardie SIGDIAL 2012 [pdf]
QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, Dragomir Radev NAACL21 [pdf] [data]
Domain-Independent Abstract Generation for Focused Meeting Summarization Lu Wang, Claire Cardie ACL 2013 [pdf]
Summarizing Decisions in Spoken Meetings Lu Wang, Claire Cardie ACL 2011 [pdf]
Extracting Decisions from Multi-Party Dialogue Using Directed Graphical Models and Semantic Similarity Trung Bui, Matthew Frampton, John Dowding, Stanley Peters SIGDIAL 2009 [pdf] [bib]
ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive Summarization with Argument Mining Alexander R. Fabbri, Faiaz Rahman, Imad Rizvi, Borui Wang, Haoran Li, Yashar Mehdad, Dragomir Radev ACL 2021 [pdf] [code]

Chat Summarization

Dialogue Summarization with Static-Dynamic Structure Fusion Graph Shen Gao, Xin Cheng, Mingzhe Li, Xiuying Chen, Jinpeng Li, Dongyan Zhao, Rui Yan [pdf] [code]

[Abs]
Dialogue, the most fundamental and specially privileged arena of language, has gained increasing ubiquity across the Web in recent years. Quickly going through the long dialogue context and capturing salient information scattered over the whole dialogue session benefit users in many real-world Web applications such as email thread summarization and meeting minutes drafting. Dialogue summarization is a challenging task in that dialogue has a dynamic interactive nature and presumably inconsistent information flow among various speakers. Many researchers address this task by modeling dialogue with pre-computed static graph structure using external linguistic toolkits. However, such methods heavily depend on the reliability of external tools, and the static graph construction is disjoint from the graph representation learning phase, so the graph cannot be dynamically adapted for the downstream summarization task. In this paper, we propose a Static-Dynamic graph-based Dialogue Summarization model (SDDS), which fuses prior knowledge from human expertise and adaptively learns the graph structure in an end-to-end learning fashion. To verify the effectiveness of SDDS, we conduct experiments on three benchmark datasets (SAMSum, MediaSum, and DialogSum) and the results verify the superiority of SDDS.
Mind the Gap! Injecting Commonsense Knowledge for Abstractive Dialogue Summarization Seungone Kim, Se June Joo, Hyungjoo Chae, Chaehyeong Kim, Seung-won Hwang, Jinyoung Yeo COLING 2022 [pdf]

[Abs]
In this paper, we propose to leverage the unique characteristics of dialogues sharing commonsense knowledge across participants, to resolve the difficulties in summarizing them. We present SICK, a framework that uses commonsense inferences as additional context. Compared to previous work that solely relies on the input dialogue, SICK uses an external knowledge model to generate a rich set of commonsense inferences and selects the most probable one with a similarity-based selection method. Built upon SICK, SICK++ utilizes commonsense as supervision, where the task of generating commonsense inferences is added upon summarizing the dialogue in a multi-task learning setting. Experimental results show that with injected commonsense knowledge, our framework generates more informative and consistent summaries than existing methods.
A Finer-grain Universal Dialogue Semantic Structures based Model For Abstractive Dialogue Summarization Yuejie Lei, Fujia Zheng, Yuanmeng Yan, Keqing He, Weiran Xu EMNLP 2021 Findings [pdf] [code]
Capturing Speaker Incorrectness: Speaker-Focused Post-Correction for Abstractive Dialogue Summarization Dongyub Lee, Jungwoo Lim, Taesun Whang, Chanhee Lee, Seungwoo Cho, Mingun Park, Heuiseok Lim EMNLP 2021 | newsum [pdf]
Who says like a style of Vitamin: Towards Syntax-Aware Dialogue Summarization using Multi-task Learning Seolhwa Lee, Kisu Yang, Chanjun Park, João Sedoc, Heuiseok Lim [pdf]
Controllable Neural Dialogue Summarization with Personal Named Entity Planning Zhengyuan Liu, Nancy F. Chen EMNLP 2021 [pdf]
GupShup: Summarizing Open-Domain Code-Switched Conversations Laiba Mehnaz, Debanjan Mahata, Rakesh Gosangi, Uma Sushmitha Gunturi, Riya Jain, Gauri Gupta, Amardeep Kumar, Isabelle G. Lee, Anish Acharya, Rajiv Ratn Shah EMNLP 2021 [pdf] [code]
Topic-Aware Contrastive Learning for Abstractive Dialogue Summarization Junpeng Liu, Yanyan Zou, Hainan Zhang, Hongshen Chen, Zhuoye Ding, Caixia Yuan, Xiaojie Wang EMNLP 2021 Findings [pdf] [code]
Give the Truth: Incorporate Semantic Slot into Abstractive Dialogue Summarization Lulu Zhao, Weihao Zeng, Weiran Xu, Jun Guo EMNLP 2021 Findings [pdf]
Low-Resource Dialogue Summarization with Domain-Agnostic Multi-Source Pretraining Yicheng Zou, Bolin Zhu, Xingwu Hu, Tao Gui, Qi Zhang EMNLP 2021 [pdf] [code]
Enhancing Semantic Understanding with Self-Supervised Methods for Abstractive Dialogue Summarization Hyunjae Lee, Jaewoong Yun, Hyunjin Choi, Seongho Joe, Youngjune L. Gwon Interspeech 2021 [pdf]
Dialogue summarization with supporting utterance flow modeling and fact regularization Wang Chen, Piji Li, Hou Pong Chan, Irwin King Knowledge-Based Systems [pdf]
Situation-Based Multiparticipant Chat Summarization: a Concept, an Exploration-Annotation Tool and an Example Collection Anna Smirnova, Evgeniy Slobodkin, George Chernishev ACL 2021 Student Research Workshop [pdf] [tool] [data]
Coreference-Aware Dialogue Summarization Zhengyuan Liu, Ke Shi, Nancy F. Chen SIGDIAL 2021 [pdf]
Incorporating Commonsense Knowledge into Abstractive Dialogue Summarization via Heterogeneous Graph Networks Xiachong Feng, Xiaocheng Feng, Bing Qin CCL 2021 [pdf]
Hierarchical Speaker-Aware Sequence-to-Sequence Model for Dialogue Summarization Yuejie Lei, Yuanmeng Yan, Zhiyuan Zeng, Keqing He, Ximing Zhang, Weiran Xu ICASSP21 [pdf]
Summary Grounded Conversation Generation Chulaka Gunasekara, Guy Feigenblat, Benjamin Sznajder, Sachindra Joshi, David Konopnicki Findings of ACL 2021 [pdf]
Controllable Abstractive Dialogue Summarization with Sketch Supervision Chien-Sheng Wu, Linqing Liu, Wenhao Liu, Pontus Stenetorp, Caiming Xiong ACL-Findings 2021 [pdf] [code]
Structure-Aware Abstractive Conversation Summarization via Discourse and Action Graphs Jiaao Chen, Diyi Yang NAACL21 [pdf] [code]
Planning with Learned Entity Prompts for Abstractive Summarization Shashi Narayan, Yao Zhao, Joshua Maynez, Gonçalo Simoes, Ryan McDonald TACL 2021 [pdf]
Improving Abstractive Dialogue Summarization with Graph Structures and Topic Words Lulu Zhao, Weiran Xu, Jun Guo COLING20 [pdf]
Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization Jiaao Chen, Diyi Yang EMNLP20 [pdf] [code]
SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization Bogdan Gliwa, Iwona Mochol, Maciej Biesek, Aleksander Wawer EMNLP19 [pdf] [data]

Medical Dialogue Summarization

COSSUM: Towards Conversation-Oriented Structured Summarization for Automatic Medical Insurance Assessment Sheng Xu, Xiaojun Wan, Sen Hu, Mengdi Zhou, Teng Xu, Hongbin Wang, Haitao Mi KDD 2022 [pdf]

[Abs]
In the medical insurance industry, a lot of human labor is required to collect information from claimants. Human assessors need to converse with claimants in order to record key information and organize it into a structured summary. With the purpose of helping save human labor, we propose the task of conversation-oriented structured summarization, which aims to automatically produce the desired structured summary from a conversation. One major challenge of the task is that the structured summary contains multiple fields of different types. To tackle this problem, we propose a unified approach COSSUM based on prompting to generate the values of all fields simultaneously. By learning all fields together, our approach can capture the inherent relationship between them. Moreover, we propose a specially designed curriculum learning strategy for model training. Both automatic and human evaluations are performed, and the results show the effectiveness of our proposed approach.
Counseling Summarization using Mental Health Knowledge Guided Utterance Filtering Aseem Srivastava, Tharun Suresh, Sarah Peregrine (Grin) Lord, Md. Shad Akhtar, Tanmoy Chakraborty KDD 2022 ADS Track [pdf]

[Abs]
The psychotherapy intervention technique is a multifaceted conversation between a therapist and a patient. Unlike general clinical discussions, psychotherapy's core components (viz. symptoms) are hard to distinguish, thus becoming a complex problem to summarize later. A structured counseling conversation may contain discussions about symptoms, history of mental health issues, or the discovery of the patient's behavior. It may also contain discussion filler words irrelevant to a clinical summary. We refer to these elements of structured psychotherapy as counseling components. In this paper, the aim is mental health counseling summarization to build upon domain knowledge and to help clinicians quickly glean meaning. We create a new dataset after annotating 12.9K utterances of counseling components and reference summaries for each dialogue. Further, we propose ConSum, a novel counseling-component guided summarization model. ConSum comprises three independent modules. First, to assess the presence of depressive symptoms, it filters utterances utilizing the Patient Health Questionnaire (PHQ-9), while the second and third modules aim to classify counseling components. Finally, we propose a problem-specific Mental Health Information Capture (MHIC) evaluation metric for counseling summaries. Our comparative study shows that we improve on performance and generate cohesive, semantic, and coherent summaries. We comprehensively analyze the generated summaries to investigate the capturing of psychotherapy elements. Human and clinical evaluations of the summaries show that ConSum generates quality summaries. Further, mental health experts validate the clinical acceptability of ConSum. Lastly, we discuss the uniqueness of mental health counseling summarization in the real world and show evidence of its deployment on an online application with the support of http://mpathic.ai/
Adding more data does not always help: A study in medical conversation summarization with PEGASUS Varun Nair, Namit Katariya, Xavier Amatriain, Ilya Valmianski, Anitha Kannan [pdf]
Leveraging Pretrained Models for Automatic Summarization of Doctor-Patient Conversations Longxiang Zhang, Renato Negrinho, Arindam Ghosh, Vasudevan Jagannathan, Hamid Reza Hassanzadeh, Thomas Schaaf, and Matthew R. Gormley Findings of EMNLP 2021 [pdf]
Medically Aware GPT-3 as a Data Generator for Medical Dialogue Summarization Bharath Chintagunta, Namit Katariya, Xavier Amatriain, Anitha Kannan NAACL | NLPMC 2021 [pdf1] [pdf2]
Generating SOAP Notes from Doctor-Patient Conversations Using Modular Summarization Techniques Kundan Krishna, Sopan Khosla, Jeffrey P. Bigham, Zachary C. Lipton ACL 2021 [pdf] [code]
Summarizing Medical Conversations via Identifying Important Utterances Yan Song, Yuanhe Tian, Nan Wang, Fei Xia COLING 2020 [pdf] [code] [bib]
Dr.Summarize: Global Summarization of Medical Dialogue by Exploiting Local Structures Anirudh Joshi, Namit Katariya, Xavier Amatriain, Anitha Kannan Findings of EMNLP 2020 [pdf] [bib]
Medical Dialogue Summarization for Automated Reporting in Healthcare Sabine Molenaar, Lientje Maas, Verónica Burriel, Fabiano Dalpiaz, Sjaak Brinkkemper Advanced Information Systems Engineering Workshops 2020 [pdf] [bib]
Generating Medical Reports from Patient-Doctor Conversations using Sequence-to-Sequence Models Seppo Enarvi, Marilisa Amoia, Miguel Del-Agua Teba, Brian Delaney, Frank Diehl, Stefan Hahn, Kristina Harris, Liam McGrath, Yue Pan, Joel Pinto, Luca Rubini, Miguel Ruiz, Gagandeep Singh, Fabian Stemmer, Weiyi Sun, Paul Vozila, Thomas Lin, Ranjani Ramamurthy ACL 2020 Short [pdf] [bib]
Automatically Generating Psychiatric Case Notes From Digital Transcripts of Doctor-Patient Conversations Nazmul Kazi, Indika Kahanda NAACL 2019 [pdf] [bib]
Alignment Annotation for Clinic Visit Dialogue to Clinical Note Sentence Language Generation Wen-wai Yim, Meliha Yetisgen, Jenny Huang, Micah Grossman LREC 2020 [pdf] [bib]
Topic-aware Pointer-Generator Networks for Summarizing Spoken Conversations Zhengyuan Liu, Angela Ng, Sheldon Lee, Ai Ti Aw, Nancy F. Chen ASRU 2019 [pdf]

Customer Service Summarization

Other Roles Matter! Enhancing Role-Oriented Dialogue Summarization via Role Interactions Haitao Lin, Junnan Zhu, Lu Xiang, Yu Zhou, Jiajun Zhang, Chengqing Zong ACL 2022 [pdf] [code]

[Abs]
Role-oriented dialogue summarization is to generate summaries for different roles in the dialogue, e.g., merchants and consumers. Existing methods handle this task by summarizing each role’s content separately and thus are prone to ignore the information from other roles. However, we believe that other roles’ content could benefit the quality of summaries, such as the omitted information mentioned by other roles. Therefore, we propose a novel role interaction enhanced method for role-oriented dialogue summarization. It adopts cross attention and decoder self-attention interactions to interactively acquire other roles’ critical information. The cross attention interaction aims to select other roles’ critical dialogue utterances, while the decoder self-attention interaction aims to obtain key information from other roles’ summaries. Experimental results have shown that our proposed method significantly outperforms strong baselines on two public role-oriented dialogue summarization datasets. Extensive analyses have demonstrated that other roles’ content could help generate summaries with more complete semantics and correct topic structures.
An End-to-End Dialogue Summarization System for Sales Calls Abedelkadir Asi, Song Wang, Roy Eisenstadt, Dean Geckt, Yarin Kuper, Yi Mao, Royi Ronen NAACL 2022 [pdf]
Heuristic-based Inter-training to Improve Few-shot Multi-perspective Dialog Summarization Benjamin Sznajder, Chulaka Gunasekara, Guy Lev, Sachin Joshi, Eyal Shnarch, Noam Slonim [pdf]
Dialogue Summaries as Dialogue States (DS2), Template-Guided Summarization for Few-shot Dialogue State Tracking Jamin Shin, Hangyeol Yu, Hyeongdon Moon, Andrea Madotto, Juneyoung Park Findings of ACL 2022 [pdf] [code]
TWEETSUMM - A Dialog Summarization Dataset for Customer Service Guy Feigenblat, Chulaka Gunasekara, Benjamin Sznajder, Sachindra Joshi, David Konopnicki, Ranit Aharonov [pdf] [data]
Extractive Dialogue Summarization Without Annotation Based on Distantly Supervised Machine Reading Comprehension in Customer Service Bing Ma, Haifeng Sun, Jingyu Wang, Qi Qi, and Jianxin Liao TASLP [pdf]
TODSum: Task-Oriented Dialogue Summarization with State Tracking Lulu Zhao, Fujia Zheng, Keqing He, Weihao Zeng, Yuejie Lei, Huixing Jiang, Wei Wu, Weiran Xu, Jun Guo, Fanyu Meng [pdf]
CSDS: A Fine-grained Chinese Dataset for Customer Service Dialogue Summarization Haitao Lin, Liqun Ma, Junnan Zhu, Lu Xiang, Yu Zhou, Jiajun Zhang, Chengqing Zong EMNLP 2021 [pdf] [data]
Distant Supervision based Machine Reading Comprehension for Extractive Summarization in Customer Service Bing Ma, Cao Liu, Jingyu Wang, Shujie Hu, Fan Yang, Xunliang Cai, Guanglu Wan, Jiansong Chen, Jianxin Liao SIGIR 2021 [pdf]
Unsupervised Abstractive Dialogue Summarization for Tete-a-Tetes Xinyuan Zhang, Ruiyi Zhang, Manzil Zaheer, Amr Ahmed AAAI 2021 [pdf]
Topic-Oriented Spoken Dialogue Summarization for Customer Service with Saliency-Aware Topic Modeling Yicheng Zou, Lujun Zhao, Yangyang Kang, Jun Lin, Minlong Peng, Zhuoren Jiang, Changlong Sun, Qi Zhang, Xuanjing Huang, Xiaozhong Liu AAAI 2021 [pdf] [code]
Unsupervised Summarization for Chat Logs with Topic-Oriented Ranking and Context-Aware Auto-Encoders Yicheng Zou, Jun Lin, Lujun Zhao, Yangyang Kang, Zhuoren Jiang, Changlong Sun, Qi Zhang, Xuanjing Huang, Xiaozhong Liu AAAI 2021 [pdf] [code]
Abstractive Dialog Summarization with Semantic Scaffolds Lin Yuan, Zhou Yu [pdf]
Automatic Dialogue Summary Generation for Customer Service Chunyi Liu, Peng Wang, Jiang Xu, Zang Li and Jieping Ye KDD 2019 [pdf]

Domain Adaptation

DIONYSUS: A Pre-trained Model for Low-Resource Dialogue Summarization Yu Li, Baolin Peng, Pengcheng He, Michel Galley, Zhou Yu, Jianfeng Gao [pdf]

[Abs]
Dialogue summarization has recently garnered significant attention due to its wide range of applications. However, existing methods for summarizing dialogues are suboptimal because they do not take into account the inherent structure of dialogue and rely heavily on labeled data, which can lead to poor performance in new domains. In this work, we propose DIONYSUS (dynamic input optimization in pre-training for dialogue summarization), a pre-trained encoder-decoder model for summarizing dialogues in any new domain. To pre-train DIONYSUS, we create two pseudo summaries for each dialogue example: one is produced by a fine-tuned summarization model, and the other is a collection of dialogue turns that convey important information. We then choose one of these pseudo summaries based on the difference in information distribution across different types of dialogues. This selected pseudo summary serves as the objective for pre-training DIONYSUS using a self-supervised approach on a large dialogue corpus. Our experiments show that DIONYSUS outperforms existing methods on six datasets, as demonstrated by its ROUGE scores in zero-shot and few-shot settings.
Domain-Oriented Prefix-Tuning: Towards Efficient and Generalizable Fine-tuning for Zero-Shot Dialogue Summarization Lulu Zhao, Fujia Zheng, Weihao Zeng, Keqing He, Weiran Xu, Huixing Jiang, Wei Wu, Yanan Wu NAACL 2022 [pdf] [code]
AdaptSum: Towards Low-Resource Domain Adaptation for Abstractive Summarization Tiezheng Yu, Zihan Liu, Pascale Fung NAACL 2021 [pdf] [code]
Domain Adaptation to Summarize Human Conversations Oana Sandu, Giuseppe Carenini, Gabriel Murray, Raymond Ng ACL 2010 Workshop [pdf]

Others

Summarizing Community-based Question-Answer Pairs Ting-Yao Hsu, Yoshi Suhara, Xiaolan Wang EMNLP 2022 [pdf] [code]

[Abs]
Community-based Question Answering (CQA), which allows users to acquire their desired information, has increasingly become an essential component of online services in various domains such as E-commerce, travel, and dining. However, an overwhelming number of CQA pairs makes it difficult for users without particular intent to find useful information spread over CQA pairs. To help users quickly digest the key information, we propose the novel CQA summarization task that aims to create a concise summary from CQA pairs. To this end, we first design a multi-stage data annotation process and create a benchmark dataset, COQASUM, based on the Amazon QA corpus. We then compare a collection of extractive and abstractive summarization methods and establish a strong baseline approach DedupLED for the CQA summarization task. Our experiment further confirms two key challenges, sentence-type transfer and deduplication removal, towards the CQA summarization task. Our data and code are publicly available.
Curriculum Prompt Learning with Self-Training for Abstractive Dialogue Summarization Changqun Li, Linlin Wang, Xin Lin, Gerard de Melo, Liang He EMNLP 2022 [pdf]

[Abs]
Succinctly summarizing dialogue is a task of growing interest, but inherent challenges, such as insufficient training data and low information density impede our ability to train abstractive models. In this work, we propose a novel curriculum-based prompt learning method with self-training to address these problems. Specifically, prompts are learned using a curriculum learning strategy that gradually increases the degree of prompt perturbation, thereby improving the dialogue understanding and modeling capabilities of our model. Unlabeled dialogue is incorporated by means of self-training so as to reduce the dependency on labeled data. We further investigate topic-aware prompts to better plan for the generation of summaries. Experiments confirm that our model substantially outperforms strong baselines and achieves new state-of-the-art results on the AMI and ICSI datasets. Human evaluations also show the superiority of our model with regard to the summary generation quality.
STRUDEL: Structured Dialogue Summarization for Dialogue Comprehension Borui Wang, Chengcheng Feng, Arjun Nair, Madelyn Mao, Jai Desai, Asli Celikyilmaz, Haoran Li, Yashar Mehdad, Dragomir Radev EMNLP 2022 [pdf]

[Abs]
Abstractive dialogue summarization has long been viewed as an important standalone task in natural language processing, but no previous work has explored the possibility of whether abstractive dialogue summarization can also be used as a means to boost an NLP system's performance on other important dialogue comprehension tasks. In this paper, we propose a novel type of dialogue summarization task - STRUctured DiaLoguE Summarization - that can help pre-trained language models to better understand dialogues and improve their performance on important dialogue comprehension tasks. We further collect human annotations of STRUDEL summaries over 400 dialogues and introduce a new STRUDEL dialogue comprehension modeling framework that integrates STRUDEL into a graph-neural-network-based dialogue reasoning module over transformer encoder language models to improve their dialogue comprehension abilities. In our empirical experiments on two important downstream dialogue comprehension tasks - dialogue question answering and dialogue response prediction - we show that our STRUDEL dialogue comprehension model can significantly improve the dialogue comprehension performance of transformer encoder language models.
Enhancing Dialogue Summarization with Topic-Aware Global- and Local- Level Centrality Xinnian Liang, Shuangzhi Wu, Chenhao Cui, Jiaqi Bai, Chao Bian, Zhoujun Li EACL 2023 [pdf] [code]

[Abs]
Dialogue summarization aims to condense a given dialogue into a simple and focused summary text. Typically, both the roles' viewpoints and conversational topics change in the dialogue stream. Thus how to effectively handle the shifting topics and select the most salient utterance becomes one of the major challenges of this task. In this paper, we propose a novel topic-aware Global-Local Centrality (GLC) model to help select the salient context from all sub-topics. The centralities are constructed at both the global and local levels. The global one aims to identify vital sub-topics in the dialogue and the local one aims to select the most important context in each sub-topic. Specifically, the GLC collects sub-topic based on the utterance representations. And each utterance is aligned with one sub-topic. Based on the sub-topics, the GLC calculates global- and local-level centralities. Finally, we combine the two to guide the model to capture both salient context and sub-topics when generating summaries. Experimental results show that our model outperforms strong baselines on three public dialogue summarization datasets: CSDS, MC, and SAMSUM. Further analysis demonstrates that our GLC can exactly identify vital contents from sub-topics.
SWING: Balancing Coverage and Faithfulness for Dialogue Summarization Kung-Hsiang Huang, Siffi Singh, Xiaofei Ma, Wei Xiao, Feng Nan, Nicholas Dingwall, William Yang Wang, Kathleen McKeown Findings of EACL 2023 [pdf] [code]

[Abs]
Missing information is a common issue of dialogue summarization where some information in the reference summaries is not covered in the generated summaries. To address this issue, we propose to utilize natural language inference (NLI) models to improve coverage while avoiding introducing factual inconsistencies. Specifically, we use NLI to compute fine-grained training signals to encourage the model to generate content in the reference summaries that have not been covered, as well as to distinguish between factually consistent and inconsistent generated sentences. Experiments on the DialogSum and SAMSum datasets confirm the effectiveness of the proposed approach in balancing coverage and faithfulness, validated with automatic metrics and human evaluations. Additionally, we compute the correlation between commonly used automatic metrics with human judgments in terms of three different dimensions regarding coverage and factual consistency to provide insight into the most suitable metric for evaluating dialogue summaries.
Human-in-the-loop Abstractive Dialogue Summarization Jiaao Chen, Mohan Dodda, Diyi Yang Findings of ACL 2023 [pdf]

[Abs]
Abstractive dialogue summarization has received increasing attention recently. Despite the fact that most of the current dialogue summarization systems are trained to maximize the likelihood of human-written summaries and have achieved significant results, there is still a huge gap in generating high-quality summaries as determined by humans, such as coherence and faithfulness, partly due to the misalignment in maximizing a single human-written summary. To this end, we propose to incorporate different levels of human feedback into the training process. This will enable us to guide the models to capture the behaviors humans care about for summaries. Specifically, we ask humans to highlight the salient information to be included in summaries to provide the local feedback, and to make overall comparisons among summaries in terms of coherence, accuracy, coverage, concise and overall quality, as the global feedback. We then combine both local and global feedback to fine-tune the dialog summarization policy with Reinforcement Learning. Experiments conducted on multiple datasets demonstrate the effectiveness and generalization of our methods over the state-of-the-art supervised baselines, especially in terms of human judgments.
ED-FAITH: Evaluating Dialogue Summarization on Faithfulness Sicong Huang, Asli Celikyilmaz, Haoran Li [pdf]

[Abs]
Abstractive summarization models typically generate content unfaithful to the input, thus highlighting the significance of evaluating the faithfulness of generated summaries. Most faithfulness metrics are only evaluated on news domain, can they be transferred to other summarization tasks? In this work, we first present a systematic study of faithfulness metrics for dialogue summarization. We evaluate common faithfulness metrics on dialogue datasets and observe that most metrics correlate poorly with human judgements despite performing well on news datasets. Given these findings, to improve existing metrics’ performance on dialogue summarization, we first finetune on in-domain dataset, then apply unlikelihood training on negative samples, and show that they can successfully improve metric performance on dialogue data. Inspired by the strong zero-shot performance of the T0 language model, we further propose T0-Score – a new metric for faithfulness evaluation, which shows consistent improvement against baseline metrics across multiple domains.
Towards Understanding Omission in Dialogue Summarization Yicheng Zou, Kaitao Song, Xu Tan, Zhongkai Fu, Tao Gui, Qi Zhang, Dongsheng Li [pdf]

[Abs]
Dialogue summarization aims to condense the lengthy dialogue into a concise summary, and has recently achieved significant progress. However, the result of existing methods is still far from satisfactory. Previous works indicated that omission is a major factor in affecting the quality of summarization, but few of them have further explored the omission problem, such as how omission affects summarization results and how to detect omission, which is critical for reducing omission and improving summarization quality. Moreover, analyzing and detecting omission relies on summarization datasets with omission labels (i.e., which dialogue utterances are omitted in the summarization), which are not available in the current literature. In this paper, we propose the OLDS dataset, which provides high-quality Omission Labels for Dialogue Summarization. By analyzing this dataset, we find that a large improvement in summarization quality can be achieved by providing ground-truth omission labels for the summarization model to recover omission information, which demonstrates the importance of omission detection for omission mitigation in dialogue summarization. Therefore, we formulate an omission detection task and demonstrate our proposed dataset can support the training and evaluation of this task well. We also call for research action on omission detection based on our proposed datasets. Our dataset and codes are publicly available.
Analyzing and Evaluating Faithfulness in Dialogue Summarization Bin Wang, Chen Zhang, Yan Zhang, Yiming Chen, Haizhou Li EMNLP 2022 [pdf] [code]

[Abs]
Dialogue summarization is abstractive in nature, making it suffer from factual errors. The factual correctness of summaries has the highest priority before practical applications. Many efforts have been made to improve faithfulness in text summarization. However, there is a lack of systematic study on dialogue summarization systems. In this work, we first perform the fine-grained human analysis on the faithfulness of dialogue summaries and observe that over 35% of generated summaries are faithfully inconsistent respective the source dialogues. Furthermore, we present a new model-level faithfulness evaluation method. It examines generation models with multi-choice questions created by rule-based transformations. Experimental results show that our evaluation schema is a strong proxy for the factual correctness of summarization models. The human-annotated faithfulness samples and the evaluation toolkit are released to facilitate future research toward faithful dialogue summarization.
Taxonomy of Abstractive Dialogue Summarization: Scenarios, Approaches and Future Directions Qi Jia, Siyu Ren, Yizhu Liu, Kenny Q. Zhu [pdf]

[Abs]
Abstractive dialogue summarization is to generate a concise and fluent summary covering the salient information in a dialogue among two or more interlocutors. It has attracted great attention in recent years based on the massive emergence of social communication platforms and an urgent requirement for efficient dialogue information understanding and digestion. Different from news or articles in traditional document summarization, dialogues bring unique characteristics and additional challenges, including different language styles and formats, scattered information, flexible discourse structures and unclear topic boundaries. This survey provides a comprehensive investigation on existing work for abstractive dialogue summarization from scenarios, approaches to evaluations. It categorizes the task into two broad categories according to the type of input dialogues, i.e., open-domain and task-oriented, and presents a taxonomy of existing techniques in three directions, namely, injecting dialogue features, designing auxiliary training tasks and using additional data. A list of datasets under different scenarios and widely-accepted evaluation metrics are summarized for completeness. After that, the trends of scenarios and techniques are summarized, together with deep insights on correlations between extensively exploited features and different scenarios. Based on these analyses, we recommend future directions including more controlled and complicated scenarios, technical innovations and comparisons, publicly available datasets in special domains, etc.
Leveraging Non-dialogue Summaries for Dialogue Summarization Seongmin Park, Dongchan Shin, Jihwa Lee Transcript Understanding Workshop at COLING 2022 [pdf]

[Abs]
To mitigate the lack of diverse dialogue summarization datasets in academia, we present methods to utilize non-dialogue summarization data for enhancing dialogue summarization systems. We apply transformations to document summarization data pairs to create training data that better befit dialogue summarization. The suggested transformations also retain desirable properties of non-dialogue datasets, such as improved faithfulness to the source text. We conduct extensive experiments across both English and Korean to verify our approach. Although absolute gains in ROUGE naturally plateau as more dialogue summarization samples are introduced, utilizing non-dialogue data for training significantly improves summarization performance in zero- and few-shot settings and enhances faithfulness across all training regimes.
Improving Abstractive Dialogue Summarization with Speaker-Aware Supervised Contrastive Learning Zhichao Geng, Ming Zhong, Zhangyue Yin, Xipeng Qiu, Xuanjing Huang COLING 2022 [pdf]

[Abs]
Pre-trained models have brought remarkable success on the text summarization task. For dialogue summarization, the subdomain of text summarization, utterances are concatenated to flat text before being processed. As a result, existing summarization systems based on pre-trained models are unable to recognize the unique format of the speaker-utterance pair well in the dialogue. To investigate this issue, we conduct probing tests and manual analysis, and find that the powerful pre-trained model can not identify different speakers well in the conversation, which leads to various factual errors. Moreover, we propose three speaker-aware supervised contrastive learning (SCL) tasks: Token-level SCL, Turn-level SCL, and Global-level SCL. Comprehensive experiments demonstrate that our methods achieve significant performance improvement on two mainstream dialogue summarization datasets. According to detailed human evaluations, pre-trained models equipped with SCL tasks effectively generate summaries with better factual consistency.
View Dialogue in 2D: A Two-stream Model in Time-speaker Perspective for Dialogue Summarization and beyond Keli Xie, Dongchen He, Jiaxin Zhuang, Siyuan Lu, Zhongfeng Wang COLING 2022 [pdf] [code]

[Abs]
Existing works on dialogue summarization often follow the common practice in document summarization and view the dialogue, which comprises utterances of different speakers, as a single utterance stream ordered by time. However, this single-stream approach without specific attention to the speaker-centered points has limitations in fully understanding the dialogue. To better capture the dialogue information, we propose a 2D view of dialogue based on a time-speaker perspective, where the time and speaker streams of dialogue can be obtained as strengthened input. Based on this 2D view, we present an effective two-stream model called ATM to combine the two streams. Extensive experiments on various summarization datasets demonstrate that ATM significantly surpasses other models regarding diverse metrics and beats the state-of-the-art models on the QMSum dataset in ROUGE scores. Besides, ATM achieves great improvements in summary faithfulness and human evaluation. Moreover, results on machine reading comprehension datasets show the generalization ability of the proposed methods and shed light on other dialogue-based tasks. Our code will be publicly available online.
Summarizing Dialogues with Negative Cues Junpeng Liu, Yanyan Zou, Yuxuan Xi, Shengjie Li, Mian Ma, Zhuoye Ding COLING 2022 [pdf]

[Abs]
Abstractive dialogue summarization aims to convert a long dialogue content into its short form where the salient information is preserved while the redundant pieces are ignored. Different from the well-structured text, such as news and scientific articles, dialogues often consist of utterances coming from two or more interlocutors, where the conversations are often informal, verbose, and repetitive, sprinkled with false-starts, backchanneling, reconfirmations, hesitations, speaker interruptions and the salient information is often scattered across the whole chat. The above properties of conversations make it difficult to directly concentrate on scattered outstanding utterances and thus present new challenges of summarizing dialogues. In this work, rather than directly forcing a summarization system to merely pay more attention to the salient pieces, we propose to explicitly have the model perceive the redundant parts of an input dialogue history during the training phase. To be specific, we design two strategies to construct examples without salient pieces as negative cues. Then, the sequence-to-sequence likelihood loss is cooperated with the unlikelihood objective to drive the model to focus less on the unimportant information and also pay more attention to the salient pieces. Extensive experiments on the benchmark dataset demonstrate that our simple method significantly outperforms the baselines with regard to both semantic matching and factual consistent based metrics. The human evaluation also proves the performance gains.
ClidSum: A Benchmark Dataset for Cross-Lingual Dialogue Summarization Jiaan Wang, Fandong Meng, Ziyao Lu, Duo Zheng, Zhixu Li, Jianfeng Qu, Jie Zhou EMNLP 2022 [pdf] [code]

[Abs]
We present ClidSum, a benchmark dataset for building cross-lingual summarization systems on dialogue documents. It consists of 67k+ dialogue documents from two subsets (i.e., SAMSum and MediaSum) and 112k+ annotated summaries in different target languages. Based on the proposed ClidSum, we introduce two benchmark settings for supervised and semi-supervised scenarios, respectively. We then build various baseline systems in different paradigms (pipeline and end-to-end) and conduct extensive experiments on ClidSum to provide deeper analyses. Furthermore, we propose mDialBART which extends mBART-50 (a multi-lingual BART) via further pre-training. The multiple objectives used in the further pre-training stage help the pre-trained model capture the structural characteristics as well as important content in dialogues and the transformation from source to the target language. Experimental results show the superiority of mDialBART, as an end-to-end model, outperforms strong pipeline models on ClidSum. Finally, we discuss specific challenges that current approaches faced with this task and give multiple promising directions for future research.
A Focused Study on Sequence Length for Dialogue Summarization Bin Wang, Chen Zhang, Chengwei Wei, Haizhou Li [pdf]

[Abs]
Output length is critical to dialogue summarization systems. The dialogue summary length is determined by multiple factors, including dialogue complexity, summary objective, and personal preferences. In this work, we approach dialogue summary length from three perspectives. First, we analyze the length differences between existing models' outputs and the corresponding human references and find that summarization models tend to produce more verbose summaries due to their pretraining objectives. Second, we identify salient features for summary length prediction by comparing different model settings. Third, we experiment with a length-aware summarizer and show notable improvement on existing models if summary length can be well incorporated. Analysis and experiments are conducted on popular DialogSum and SAMSum datasets to validate our findings.
DialogSum Challenge: Results of the Dialogue Summarization Shared Task Yulong Chen, Naihao Deng, Yang Liu, Yue Zhang [pdf]

[Abs]
We report the results of DialogSum Challenge, the shared task on summarizing real-life scenario dialogues at INLG 2022. Four teams participate in this shared task and three submit their system reports, exploring different methods to improve the performance of dialogue summarization. Although there is a great improvement over the baseline models regarding automatic evaluation metrics, such as Rouge scores, we find that there is a salient gap between model generated outputs and human annotated summaries by human evaluation from multiple aspects. These findings demonstrate the difficulty of dialogue summarization and suggest that more fine-grained evaluation metrics are in need.
Effectiveness of French Language Models on Abstractive Dialogue Summarization Task Yongxin Zhou, François Portet, Fabien Ringeval LREC 2022 [pdf]

[Abs]
Pre-trained language models have established the state-of-the-art on various natural language processing tasks, including dialogue summarization, which allows the reader to quickly access key information from long conversations in meetings, interviews or phone calls. However, such dialogues are still difficult to handle with current models because the spontaneity of the language involves expressions that are rarely present in the corpora used for pre-training the language models. Moreover, the vast majority of the work accomplished in this field has been focused on English. In this work, we present a study on the summarization of spontaneous oral dialogues in French using several language specific pre-trained models: BARThez, and BelGPT-2, as well as multilingual pre-trained models: mBART, mBARThez, and mT5. Experiments were performed on the DECODA (Call Center) dialogue corpus whose task is to generate abstractive synopses from call center conversations between a caller and one or several agents depending on the situation. Results show that the BARThez models offer the best performance far above the previous state-of-the-art on DECODA. We further discuss the limits of such pre-trained models and the challenges that must be addressed for summarizing spontaneous dialogues.
Data Augmentation for Low-Resource Dialogue Summarization Yongtai Liu, Joshua Maynez, Gonçalo Simões, Shashi Narayan Findings of NAACL 2022 [pdf]

[Abs]
We present DADS, a novel Data Augmentation technique for low-resource Dialogue Summarization. Our method generates synthetic examples by replacing sections of text from both the input dialogue and summary while preserving the augmented summary to correspond to a viable summary for the augmented dialogue. We utilize pretrained language models that produce highly likely dialogue alternatives while still being free to generate diverse alternatives. We applied our data augmentation method to the SAMSum dataset in low resource scenarios, mimicking real world problems such as chat, thread, and meeting summarization where large scale supervised datasets with human-written summaries are scarce. Through both automatic and human evaluations, we show that DADS shows strong improvements for low resource scenarios while generating topically diverse summaries without introducing additional hallucinations to the summaries.
An End-to-End Dialogue Summarization System for Sales Calls Abedelkadir Asi, Song Wang, Roy Eisenstadt, Dean Geckt, Yarin Kuper, Yi Mao, Royi Ronen NAACL 2022 Industry Track [pdf]

[Abs]
Summarizing sales calls is a routine task performed manually by salespeople. We present a production system which combines generative models fine-tuned for customer-agent setting, with a human-in-the-loop user experience for an interactive summary curation process. We address challenging aspects of dialogue summarization task in a real-world setting including long input dialogues, content validation, lack of labeled data and quality evaluation. We show how GPT-3 can be leveraged as an offline data labeler to handle training data scarcity and accommodate privacy constraints in an industrial setting. Experiments show significant improvements by our models in tackling the summarization and content validation tasks on public datasets.
Few-shot fine-tuning SOTA summarization models for medical dialogues David Fraile Navarro, Mark Dras, Shlomo Berkovsky NAACL 2022 Student Research Workshop [pdf] [code]

[Abs]
Abstractive summarization of medical dialogues presents a challenge for standard training approaches, given the paucity of suitable datasets. We explore the performance of state-of-the-art models with zero-shot and few-shot learning strategies and measure the impact of pretraining with general domain and dialogue-specific text on the summarization performance.
DialSummEval: Revisiting Summarization Evaluation for Dialogues Mingqi Gao, Xiaojun Wan NAACL 2022 [pdf] [code]

[Abs]
Dialogue summarization is receiving increasing attention from researchers due to its extraordinary difficulty and unique application value. We observe that current dialogue summarization models have flaws that may not be well exposed by frequently used metrics such as ROUGE. In our paper, we re-evaluate 18 categories of metrics in terms of four dimensions: coherence, consistency, fluency and relevance, as well as a unified human evaluation of various models for the first time. Some noteworthy trends which are different from the conventional summarization tasks are identified. We will release DialSummEval, a multi-faceted dataset of human judgments containing the outputs of 14 models on SAMSum.
Domain-Oriented Prefix-Tuning: Towards Efficient and Generalizable Fine-tuning for Zero-Shot Dialogue Summarization Lulu Zhao, Fujia Zheng, Weihao Zeng, Keqing He, Weiran Xu, Huixing Jiang, Wei Wu, Yanan Wu NAACL 2022 [pdf] [code]

[Abs]
The most advanced abstractive dialogue summarizers lack generalization ability on new domains and the existing researches for domain adaptation in summarization generally rely on large-scale pre-trainings. To explore the lightweight fine-tuning methods for domain adaptation of dialogue summarization, in this paper, we propose an efficient and generalizable Domain-Oriented Prefix-tuning model, which utilizes a domain word initialized prefix module to alleviate domain entanglement and adopts discrete prompts to guide the model to focus on key contents of dialogues and enhance model generalization. We conduct zero-shot experiments and build domain adaptation benchmarks on two multi-domain dialogue summarization datasets, TODSum and QMSum. Adequate experiments and qualitative analysis prove the effectiveness of our methods.
From spoken dialogue to formal summary: An utterance rewriting for dialogue summarization Yue Fang, Hainan Zhang, Hongshen Chen, Zhuoye Ding, Bo Long, Yanyan Lan, Yanquan Zhou NAACL 2022 [pdf]

[Abs]
Due to the dialogue characteristics of unstructured contexts and multi-parties with first-person perspective, many successful text summarization works have failed when dealing with dialogue summarization. In dialogue summarization task, the input dialogue is usually spoken style with ellipsis and co-references but the output summaries are more formal and complete. Therefore, the dialogue summarization model should be able to complete the ellipsis content and co-reference information and then produce a suitable summary accordingly. However, the current state-of-the-art models pay more attention on the topic or structure of summary, rather than the consistency of dialogue summary with its input dialogue context, which may suffer from the personal and logical inconsistency problem. In this paper, we propose a new model, named ReWriteSum, to tackle this problem. Firstly, an utterance rewriter is conducted to complete the ellipsis content of dialogue content and then obtain the rewriting utterances. Then, the co-reference data augmentation mechanism is utilized to replace the referential person name with its specific name to enhance the personal information. Finally, the rewriting utterances and the co-reference replacement data are used in the standard BART model. Experimental results on both SAMSum and DialSum datasets show that our ReWriteSum significantly outperforms baseline models, in terms of both metric-based and human evaluations. Further analysis on multi-speakers also shows that ReWriteSum can obtain relatively higher improvement with more speakers, validating the correctness and property of ReWriteSum.
Unsupervised Abstractive Dialogue Summarization with Word Graphs and POV Conversion Seongmin Park, Jihwa Lee WIT Workshop @ ACL 2022 [pdf] [code]
MSAMSum: Towards Benchmarking Multi-lingual Dialogue Summarization Xiachong Feng, Xiaocheng Feng, Bing Qin ACL 2022 DialDoc Workshop [pdf] [data]
The Cross-lingual Conversation Summarization Challenge Yulong Chen, Ming Zhong, Xuefeng Bai, Naihao Deng, Jing Li, Xianchao Zhu, Yue Zhang [pdf]
Post-Training Dialogue Summarization using Pseudo-Paraphrasing Qi Jia, Yizhu Liu, Haifeng Tang, Kenny Q. Zhu Findings of NAACL 2022 [pdf] [code]

[Abs]
Previous dialogue summarization techniques adapt large language models pretrained on the narrative text by injecting dialogue-specific features into the models. These features either require additional knowledge to recognize or make the resulting models harder to tune. To bridge the format gap between dialogues and narrative summaries in dialogue summarization tasks, we propose to post-train pretrained language models (PLMs) to rephrase from dialogue to narratives. After that, the model is fine-tuned for dialogue summarization as usual. Comprehensive experiments show that our approach significantly improves vanilla PLMs on dialogue summarization and outperforms other SOTA models by the summary quality and implementation costs.
CONFIT: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning Xiangru Tang, Arjun Nair, Borui Wang, Bingyao Wang, Jai Desai, Aaron Wade, Haoran Li, Asli Celikyilmaz, Yashar Mehdad, Dragomir Radev [pdf]
Are We Summarizing the Right Way? A Survey of Dialogue Summarization Data Sets Don Tuggener, Margot Mieskes, Jan Deriu, Mark Cieliebak EMNLP 2021 | newsum [pdf]
Dialogue Inspectional Summarization with Factual Inconsistency Awareness Leilei Gan, Yating Zhang, Kun Kuang, Lin Yuan, Shuo Li, Changlong Sun, Xiaozhong Liu, Fei Wu [pdf]
Do Boat and Ocean Suggest Beach? Dialogue Summarization with External Knowledge Tianqing Fang, Haojie Pan, Hongming Zhang, Yangqiu Song, Kun Xu, Dong Yu AKBC 2021 [pdf] [code]
Prompt scoring system for dialogue summarization using GPT3 George Prodan, Elena Pelican [pdf]
Simple Conversational Data Augmentation for Semi-supervised Abstractive Dialogue Summarization Jiaao Chen, Diyi Yang EMNLP 2021 [pdf] [code]
A Bag of Tricks for Dialogue Summarization Muhammad Khalifa, Miguel Ballesteros, Kathleen McKeown EMNLP 2021 Short [pdf]
Hierarchical Summarization for Longform Spoken Dialog Daniel Li, Thomas Chen, Albert Tung, Lydia Chilton UIST 2021 [pdf]
RepSum: Unsupervised Dialogue Summarization based on Replacement Strategy Xiyan Fu, Yating Zhang, Tianyi Wang, Xiaozhong Liu, Changlong Sun, Zhenglu Yang ACL 2021 [pdf] [code]
Language Model as an Annotator: Exploring DialoGPT for Dialogue Summarization Xiachong Feng, Xiaocheng Feng, Libo Qin, Bing Qin, Ting Liu ACL 2021 [pdf] [code]
A Two-Phase Approach for Abstractive Podcast Summarization Chujie Zheng, Kunpeng Zhang, Harry Jiannan Wang, Ling Fan TREC 2020 Podcasts Track [pdf]
Hierarchical Learning for Generation with Long Source Sequences Tobias Rohde, Xiaoxia Wu, Yinhan Liu [pdf] [code]
Improving Online Forums Summarization via Unifying Hierarchical Attention Networks with Convolutional Neural Networks Sansiri Tarnpradab, Fereshteh Jafariakinabad, Kien A. Hua [pdf] [code]
Extractive Summarization of Call Transcripts Pratik K. Biswas, Aleksandr Iakubovich [pdf]
Legal Summarization for Multi-role Debate Dialogue via Controversy Focus Mining and Multi-task Learning Xinyu Duan, Yating Zhang, Lin Yuan, Xin Zhou, Xiaozhong Liu, Tianyi Wang, Ruocheng Wang, Qiong Zhang, Changlong Sun, Fei Wu CIKM 2019 [pdf]
Collabot: Personalized Group Chat Summarization Naama Tepper, Anat Hashavit, Maya Barnea, Inbal Ronen, Lior Leiba WSDM 2018 [pdf]
Summarizing Dialogic Arguments from Social Media Amita Misra, Shereen Oraby, Shubhangi Tandon, Sharath TS, Pranav Anand, Marilyn Walker SemDial 2017 [pdf]
The SENSEI Annotated Corpus: Human Summaries of Reader Comment Conversations in On-line News Emma Barker, Monica Lestari Paramita, Ahmet Aker, Emina Kurtic, Mark Hepple, Robert Gaizauskas SIGDIAL 2016 [pdf]
Semantic Similarity Applied to Spoken Dialogue Summarization Iryna Gurevych, Michael Strube COLING 2004 [pdf] [bib] Switchboard dialogues

Long Document

SmartBook: AI-Assisted Situation Report Generation Revanth Gangi Reddy, Yi R. Fung, Qi Zeng, Manling Li, Ziqi Wang, Paul Sullivan, Heng Ji [pdf] [code]

LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization Kalpesh Krishna, Erin Bransom, Bailey Kuehl, Mohit Iyyer, Pradeep Dasigi, Arman Cohan, Kyle Lo EACL 2023 [pdf] [code]

[Abs]
While human evaluation remains best practice for accurately judging the faithfulness of automatically-generated summaries, few solutions exist to address the increased difficulty and workload when evaluating long-form summaries. Through a survey of 162 papers on long-form summarization, we first shed light on current human evaluation practices surrounding long-form summaries. We find that 73% of these papers do not perform any human evaluation on model-generated summaries, while other works face new difficulties that manifest when dealing with long documents (e.g., low inter-annotator agreement). Motivated by our survey, we present LongEval, a set of guidelines for human evaluation of faithfulness in long-form summaries that addresses the following challenges: (1) How can we achieve high inter-annotator agreement on faithfulness scores? (2) How can we minimize annotator workload while maintaining accurate faithfulness scores? and (3) Do humans benefit from automated alignment between summary and source snippets? We deploy LongEval in annotation studies on two long-form summarization datasets in different domains (SQuALITY and PubMed), and we find that switching to a finer granularity of judgment (e.g., clause-level) reduces inter-annotator variance in faithfulness scores (e.g., std-dev from 18.5 to 6.8). We also show that scores from a partial annotation of fine-grained units highly correlates with scores from a full annotation workload (0.89 Kendall's tau using 50% judgments). We release our human judgments, annotation templates, and our software as a Python library for future research.
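The partial-annotation finding above can be illustrated with a small sketch. It is not the LongEval library: the unit labels are made-up data, `faithfulness_score` and `kendall_tau` are hypothetical helpers, the tau-a variant (ties count as neither concordant nor discordant) is an assumption, and "annotate the first half of the units" stands in for whatever sampling scheme the authors used.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a between two equally long score lists.

    Tied pairs count as neither concordant nor discordant (an assumption;
    the paper does not say which tau variant it reports).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def faithfulness_score(unit_labels):
    """Fraction of fine-grained units judged faithful (1) rather than not (0)."""
    return sum(unit_labels) / len(unit_labels)


# Hypothetical unit-level judgments for four summaries (1 = faithful unit).
full = [[1, 1, 1, 1], [1, 0, 1, 0], [0, 0, 1, 0], [1, 1, 0, 1]]
# Partial annotation: judge only the first half of each summary's units.
partial = [labels[: len(labels) // 2] for labels in full]

full_scores = [faithfulness_score(u) for u in full]
partial_scores = [faithfulness_score(u) for u in partial]
tau = kendall_tau(full_scores, partial_scores)
print(round(tau, 3))
```

A tau close to 1 means that ranking summaries by the cheaper partial scores would largely agree with ranking them by the full annotation.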
LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization Laura Nguyen, Thomas Scialom, Benjamin Piwowarski, Jacopo Staiano EACL 2023 [pdf] [code]

[Abs]
Text Summarization is a popular task and an active area of research for the Natural Language Processing community. By definition, it requires to account for long input texts, a characteristic which poses computational challenges for neural models. Moreover, real-world documents come in a variety of complex, visually-rich, layouts. This information is of great relevance, whether to highlight salient content or to encode long-range interactions between textual passages. Yet, all publicly available summarization datasets only provide plain text content. To facilitate research on how to exploit visual/layout information to better capture long-range dependencies in summarization models, we present LoRaLay, a collection of datasets for long-range summarization with accompanying visual/layout information. We extend existing and popular English datasets (arXiv and PubMed) with layout information and propose four novel datasets -- consistently built from scholar resources -- covering French, Spanish, Portuguese, and Korean languages. Further, we propose new baselines merging layout-aware and long-range models -- two orthogonal approaches -- and obtain state-of-the-art results, showing the importance of combining both lines of research.
GoSum: Extractive Summarization of Long Documents by Reinforcement Learning and Graph Organized discourse state Junyi Bian, Xiaodi Huang, Hong Zhou, Shanfeng Zhu [pdf]

[Abs]
Handling long texts with structural information and excluding redundancy between summary sentences are essential in extractive document summarization. In this work, we propose GoSum, a novel reinforcement-learning-based extractive model for long-paper summarization. GoSum encodes states by building a heterogeneous graph from different discourse levels for each input document. We evaluate the model on two datasets of scientific articles summarization: PubMed and arXiv where it outperforms all extractive summarization models and most of the strong abstractive baselines.
Novel Chapter Abstractive Summarization using Spinal Tree Aware Sub-Sentential Content Selection Hardy Hardy, Miguel Ballesteros, Faisal Ladhak, Muhammad Khalifa, Vittorio Castelli, Kathleen McKeown [pdf]

[Abs]
Summarizing novel chapters is a difficult task due to the input length and the fact that sentences that appear in the desired summaries draw content from multiple places throughout the chapter. We present a pipelined extractive-abstractive approach where the extractive step filters the content that is passed to the abstractive component. Extremely lengthy input also results in a highly skewed dataset towards negative instances for extractive summarization; we thus adopt a margin ranking loss for extraction to encourage separation between positive and negative examples. Our extraction component operates at the constituent level; our approach to this problem enriches the text with spinal tree information which provides syntactic context (in the form of constituents) to the extraction model. We show an improvement of 3.71 Rouge-1 points over best results reported in prior work on an existing novel chapter dataset.
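The margin ranking loss mentioned above can be sketched as a pairwise hinge loss. This is a minimal illustration, not the authors' implementation: the function name, the margin value, the averaging over all positive/negative pairs, and the example scores are all assumptions.

```python
def margin_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Mean hinge loss over every (positive, negative) score pair.

    A pair contributes zero once the positive constituent outscores the
    negative one by at least `margin`; otherwise it contributes the shortfall.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            total += max(0.0, margin - (p - n))
    return total / (len(pos_scores) * len(neg_scores))


# Hypothetical extractor scores: few positives, more negatives.
pos = [2.0, 0.5]
neg = [0.0, 0.2, 1.0]
loss = margin_ranking_loss(pos, neg, margin=1.0)
print(round(loss, 2))
```

Because the loss only compares positives against negatives, it does not reward the model for simply predicting "negative" everywhere, which is the usual motivation for using it on heavily skewed data.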
How Far are We from Robust Long Abstractive Summarization? Huan Yee Koh, Jiaxin Ju, He Zhang, Ming Liu, Shirui Pan EMNLP 2022 [pdf] [code]

[Abs]
Abstractive summarization has made tremendous progress in recent years. In this work, we perform fine-grained human annotations to evaluate long document abstractive summarization systems (i.e., models and metrics) with the aim of implementing them to generate reliable summaries. For long document abstractive models, we show that the constant strive for state-of-the-art ROUGE results can lead us to generate more relevant summaries but not factual ones. For long document evaluation metrics, human evaluation results show that ROUGE remains the best at evaluating the relevancy of a summary. It also reveals important limitations of factuality metrics in detecting different types of factual errors and the reasons behind the effectiveness of BARTScore. We then suggest promising directions in the endeavor of developing factual consistency metrics. Finally, we release our annotated long document dataset with the hope that it can contribute to the development of metrics across a broader range of summarization settings.
Toward Unifying Text Segmentation and Long Document Summarization Sangwoo Cho, Kaiqiang Song, Xiaoyang Wang, Fei Liu, Dong Yu EMNLP 2022 [pdf] [code]

[Abs]
Text segmentation is important for signaling a document's structure. Without segmenting a long document into topically coherent sections, it is difficult for readers to comprehend the text, let alone find important information. The problem is only exacerbated by a lack of segmentation in transcripts of audio/video recordings. In this paper, we explore the role that section segmentation plays in extractive summarization of written and spoken documents. Our approach learns robust sentence representations by performing summarization and segmentation simultaneously, which is further enhanced by an optimization-based regularizer to promote selection of diverse summary sentences. We conduct experiments on multiple datasets ranging from scientific articles to spoken transcripts to evaluate the model's performance. Our findings suggest that the model can not only achieve state-of-the-art performance on publicly available benchmarks, but demonstrate better cross-genre transferability when equipped with text segmentation. We perform a series of analyses to quantify the impact of section segmentation on summarizing written and spoken documents of substantial length and complexity.
HeterGraphLongSum: Heterogeneous Graph Neural Network with Passage Aggregation for Extractive Long Document Summarization Tuan-Anh Phan, Ngoc-Dung Ngoc Nguyen, Khac-Hoai Nam Bui COLING 2022 [pdf] [code]

[Abs]
Graph Neural Network (GNN)-based models have proven effective in various Natural Language Processing (NLP) tasks in recent years. Specifically, in the case of the Extractive Document Summarization (EDS) task, modeling documents under graph structure is able to analyze the complex relations between semantic units (e.g., word-to-word, word-to-sentence, sentence-to-sentence) and enrich sentence representations via valuable information from their neighbors. However, long-form document summarization using graph-based methods is still an open research issue. The main challenge is to represent long documents in a graph structure in an effective way. In this regard, this paper proposes a new heterogeneous graph neural network (HeterGNN) model to improve the performance of long document summarization (HeterGraphLongSum). Specifically, the main idea is to add the passage nodes into the heterogeneous graph structure of word and sentence nodes for enriching the final representation of sentences. In this regard, HeterGraphLongSum is designed with three types of semantic units such as word, sentence, and passage. Experiments on two benchmark datasets for long documents such as Pubmed and Arxiv indicate promising results of the proposed model for the extractive long document summarization problem. Especially, HeterGraphLongSum is able to achieve state-of-the-art performance without relying on any pre-trained language models (e.g., BERT). The source code is available for further exploitation on the Github.
Multi Graph Neural Network for Extractive Long Document Summarization Xuan-Dung Doan, Le-Minh Nguyen, Khac-Hoai Nam Bui COLING 2022 [pdf] [code]

[Abs]
Heterogeneous Graph Neural Networks (HeterGNN) have been recently introduced as an emergent approach for extractive document summarization (EDS) by exploiting the cross-relations between words and sentences. However, applying HeterGNN for long documents is still an open research issue. One of the main challenges is the lack of inter-sentence connections. In this regard, this paper explores how to apply HeterGNN for long documents by building a graph on sentence-level nodes (homogeneous graph) and combining it with HeterGNN for capturing the semantic information in terms of both inter- and intra-sentence connections. Experiments on two benchmark datasets of long documents such as PubMed and ArXiv show that our method is able to achieve state-of-the-art results in this research field.
HEGEL: Hypergraph Transformer for Long Document Summarization Haopeng Zhang, Xiao Liu, Jiawei Zhang EMNLP 2022 [pdf]

[Abs]
Extractive summarization for long documents is challenging due to the extended structured input context. The long-distance sentence dependency hinders cross-sentence relations modeling, the critical step of extractive summarization. This paper proposes HEGEL, a hypergraph neural network for long document summarization by capturing high-order cross-sentence relations. HEGEL updates and learns effective sentence representations with hypergraph transformer layers and fuses different types of sentence dependencies, including latent topics, keywords coreference, and section structure. We validate HEGEL by conducting extensive experiments on two benchmark datasets, and experimental results demonstrate the effectiveness and efficiency of HEGEL.
GRETEL: Graph Contrastive Topic Enhanced Language Model for Long Document Extractive Summarization Qianqian Xie, Jimin Huang, Tulika Saha, Sophia Ananiadou COLING 2022 [pdf] [code]

[Abs]
Recently, neural topic models (NTMs) have been incorporated into pre-trained language models (PLMs), to capture the global semantic information for text summarization. However, in these methods, there remain limitations in the way they capture and integrate the global semantic information. In this paper, we propose a novel model, the graph contrastive topic enhanced language model (GRETEL), that incorporates the graph contrastive topic model with the pre-trained language model, to fully leverage both the global and local contextual semantics for long document extractive summarization. To better capture and incorporate the global semantic information into PLMs, the graph contrastive topic model integrates the hierarchical transformer encoder and the graph contrastive learning to fuse the semantic information from the global document context and the gold summary. To this end, GRETEL encourages the model to efficiently extract salient sentences that are topically related to the gold summary, rather than redundant sentences that cover sub-optimal topics. Experimental results on both general domain and biomedical datasets demonstrate that our proposed method outperforms SOTA methods.
Sparse Optimization for Unsupervised Extractive Summarization of Long Documents with the Frank-Wolfe Algorithm Alicia Y. Tsai, Laurent El Ghaoui SustaiNLP at EMNLP 2020 [pdf]

[Abs]
We address the problem of unsupervised extractive document summarization, especially for long documents. We model the unsupervised problem as a sparse auto-regression one and approximate the resulting combinatorial problem via a convex, norm-constrained problem. We solve it using a dedicated Frank-Wolfe algorithm. To generate a summary with k sentences, the algorithm only needs to execute ≈k iterations, making it very efficient. We explain how to avoid explicit calculation of the full gradient and how to include sentence embedding information. We evaluate our approach against two other unsupervised methods using both lexical (standard) ROUGE scores, as well as semantic (embedding-based) ones. Our method achieves better results with both datasets and works especially well when combined with embeddings for highly paraphrased summaries.
An Efficient Coarse-to-Fine Facet-Aware Unsupervised Summarization Framework based on Semantic Blocks Xinnian Liang, Jing Li, Shuangzhi Wu, Jiali Zeng, Yufan Jiang, Mu Li, Zhoujun Li COLING 2022 [pdf] [code]

[Abs]
Unsupervised summarization methods have achieved remarkable results by incorporating representations from pre-trained language models. However, existing methods fail to consider efficiency and effectiveness at the same time when the input document is extremely long. To tackle this problem, in this paper, we propose an efficient Coarse-to-Fine Facet-Aware Ranking (C2F-FAR) framework for unsupervised long document summarization, which is based on the semantic block. The semantic block refers to continuous sentences in the document that describe the same facet. Specifically, we address this problem by converting the one-step ranking method into the hierarchical multi-granularity two-stage ranking. In the coarse-level stage, we propose a new segment algorithm to split the document into facet-aware semantic blocks and then filter insignificant blocks. In the fine-level stage, we select salient sentences in each block and then extract the final summary from selected sentences. We evaluate our framework on four long document summarization datasets: Gov-Report, BillSum, arXiv, and PubMed. Our C2F-FAR can achieve new state-of-the-art unsupervised summarization results on Gov-Report and BillSum. In addition, our method is 4 to 28 times faster than previous methods.
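The two-stage ranking described above can be sketched roughly as follows. This is only an illustration of the coarse-then-fine control flow, not C2F-FAR itself: the blocks are assumed to be already segmented, and `salience` is a stand-in word-frequency scorer rather than the pre-trained-model representations the paper relies on. All function and parameter names are hypothetical.

```python
from collections import Counter


def tokens(text):
    """Lowercased whitespace tokens with surrounding punctuation stripped."""
    return [w.lower().strip(".,;:!?") for w in text.split()]


def salience(text, doc_freq):
    """Average document frequency of the words in `text` (stand-in scorer)."""
    words = [w for w in tokens(text) if w]
    return sum(doc_freq[w] for w in words) / len(words) if words else 0.0


def coarse_to_fine(blocks, keep_blocks=2, per_block=1):
    """Rank blocks first, then rank sentences only inside the kept blocks."""
    doc_freq = Counter(w for block in blocks for s in block for w in tokens(s) if w)
    # Coarse stage: score whole blocks and filter out the least salient ones.
    ranked = sorted(
        range(len(blocks)),
        key=lambda i: salience(" ".join(blocks[i]), doc_freq),
        reverse=True,
    )
    kept = sorted(ranked[:keep_blocks])  # restore document order
    summary = []
    for i in kept:
        # Fine stage: pick the top sentences within each surviving block.
        best = sorted(blocks[i], key=lambda s: salience(s, doc_freq), reverse=True)
        best = best[:per_block]
        summary.extend(s for s in blocks[i] if s in best)
    return summary


# Hypothetical pre-segmented document: three blocks of sentences.
blocks = [
    ["The model ranks blocks.", "The model ranks sentences."],
    ["Lunch was late."],
    ["The model ranks blocks quickly."],
]
summary = coarse_to_fine(blocks, keep_blocks=2, per_block=1)
print(summary)
```

Sentence-level scoring runs only on blocks that survive the coarse filter, which is where a two-stage design of this kind would save work on very long inputs.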
Investigating Efficiently Extending Transformers for Long Input Summarization Jason Phang, Yao Zhao, Peter J. Liu [pdf] [code]

[Abs]
While large pretrained Transformer models have proven highly capable at tackling natural language tasks, handling long sequence inputs continues to be a significant challenge. One such task is long input summarization, where inputs are longer than the maximum input context of most pretrained models. Through an extensive set of experiments, we investigate what model architectural changes and pretraining paradigms can most efficiently adapt a pretrained Transformer for long input summarization. We find that a staggered, block-local Transformer with global encoder tokens strikes a good balance of performance and efficiency, and that an additional pretraining phase on long sequences meaningfully improves downstream summarization performance. Based on our findings, we introduce PEGASUS-X, an extension of the PEGASUS model with additional long input pretraining to handle inputs of up to 16K tokens. PEGASUS-X achieves strong performance on long input summarization tasks comparable with much larger models while adding few additional parameters and not requiring model parallelism to train.
An Empirical Survey on Long Document Summarization: Datasets, Models and Metrics Huan Yee Koh, Jiaxin Ju, Ming Liu, Shirui Pan ACM Computing Surveys [pdf]

[Abs]
Long documents such as academic articles and business reports have been the standard format to detail out important issues and complicated subjects that require extra attention. An automatic summarization system that can effectively condense long documents into short and concise texts to encapsulate the most important information would thus be significant in aiding the reader's comprehension. Recently, with the advent of neural architectures, significant research efforts have been made to advance automatic text summarization systems, and numerous studies on the challenges of extending these systems to the long document domain have emerged. In this survey, we provide a comprehensive overview of the research on long document summarization and a systematic evaluation across the three principal components of its research setting: benchmark datasets, summarization models, and evaluation metrics. For each component, we organize the literature within the context of long document summarization and conduct an empirical analysis to broaden the perspective on current research progress. The empirical analysis includes a study on the intrinsic characteristics of benchmark datasets, a multi-dimensional analysis of summarization models, and a review of the summarization evaluation metrics. Based on the overall findings, we conclude by proposing possible directions for future exploration in this rapidly growing field.
MemSum: Extractive Summarization of Long Documents Using Multi-Step Episodic Markov Decision Processes Nianlong Gu, Elliott Ash, Richard Hahnloser ACL 2022 [pdf] [code]

[Abs]
We introduce MemSum (Multi-step Episodic Markov decision process extractive SUMmarizer), a reinforcement-learning-based extractive summarizer enriched at each step with information on the current extraction history. When MemSum iteratively selects sentences into the summary, it considers a broad information set that would intuitively also be used by humans in this task: 1) the text content of the sentence, 2) the global text context of the rest of the document, and 3) the extraction history consisting of the set of sentences that have already been extracted. With a lightweight architecture, MemSum obtains state-of-the-art test-set performance (ROUGE) in summarizing long documents taken from PubMed, arXiv, and GovReport. Ablation studies demonstrate the importance of local, global, and history information. A human evaluation confirms the high quality and low redundancy of the generated summaries, stemming from MemSum’s awareness of extraction history.
Semantic Self-Segmentation for Abstractive Summarization of Long Legal Documents in Low-Resource Regimes Gianluca Moro, Luca Ragazzi AAAI 2022 [pdf]
Factorizing Content and Budget Decisions in Abstractive Summarization of Long Documents by Sampling Summary Views Marcio Fonseca, Yftah Ziser, Shay B. Cohen EMNLP 2022 [pdf]

[Abs]
We argue that disentangling content selection from the budget used to cover salient content improves the performance and applicability of abstractive summarizers. Our method, FactorSum, does this disentanglement by factorizing summarization into two steps through an energy function: (1) generation of abstractive summary views covering salient information in subsets of the input document (document views); (2) combination of these views into a final summary, following a budget and content guidance. This guidance may come from different sources, including from an advisor model such as BART or BigBird, or in oracle mode – from the reference. This factorization achieves significantly higher ROUGE scores on multiple benchmarks for long document summarization, namely PubMed, arXiv, and GovReport. Most notably, our model is effective for domain adaptation. When trained only on PubMed samples, it achieves a 46.29 ROUGE-1 score on arXiv, outperforming PEGASUS trained in domain by a large margin. Our experimental results indicate that the performance gains are due to more flexible budget adaptation and processing of shorter contexts provided by partial document views.
Leveraging Locality in Abstractive Text Summarization Yixin Liu, Ansong Ni, Linyong Nan, Budhaditya Deb, Chenguang Zhu, Ahmed H. Awadallah, Dragomir Radev EMNLP 2022 [pdf]

[Abs]
Neural attention models have achieved significant improvements on many natural language processing tasks. However, the quadratic memory complexity of the self-attention module with respect to the input length hinders their applications in long text summarization. Instead of designing more efficient attention modules, we approach this problem by investigating if models with a restricted context can have competitive performance compared with the memory-efficient attention models that maintain a global context by treating the input as a single sequence. Our model is applied to individual pages, which contain parts of inputs grouped by the principle of locality, during both the encoding and decoding stages. We empirically investigated three kinds of locality in text summarization at different levels of granularity, ranging from sentences to documents. Our experimental results show that our model has a better performance compared with strong baseline models with efficient attention modules, and our analysis provides further insights into our locality-aware modeling strategy.
SNaC: Coherence Error Detection for Narrative Summarization Tanya Goyal, Junyi Jessy Li, Greg Durrett EMNLP 2022 [pdf] [data]

[Abs]
Progress in summarizing long texts is inhibited by the lack of appropriate evaluation frameworks. A long summary that appropriately covers the facets of that text must also present a coherent narrative, but current automatic and human evaluation methods fail to identify gaps in coherence. In this work, we introduce SNaC, a narrative coherence evaluation framework for fine-grained annotations of long summaries. We develop a taxonomy of coherence errors in generated narrative summaries and collect span-level annotations for 6.6k sentences across 150 book and movie summaries. Our work provides the first characterization of coherence errors generated by state-of-the-art summarization models and a protocol for eliciting coherence judgments from crowdworkers. Furthermore, we show that the collected annotations allow us to benchmark past work in coherence modeling and train a strong classifier for automatically localizing coherence errors in generated summaries. Finally, our SNaC framework can support future work in long document summarization and coherence evaluation, including improved summarization modeling and post-hoc summary correction.
Sequence-Based Extractive Summarisation for Scientific Articles Daniel Kershaw, Rob Koeling [pdf]
LDKP: A Dataset for Identifying Keyphrases from Long Scientific Documents Debanjan Mahata, Naveen Agarwal, Dibya Gautam, Amardeep Kumar, Swapnil Parekh, Yaman Kumar Singla, Anish Acharya, Rajiv Ratn Shah [pdf] [data1] [data2]
HIBRIDS: Attention with Hierarchical Biases for Structure-aware Long Document Summarization Shuyang Cao, Lu Wang ACL 2022 [pdf] [code] [data]

[Abs]
Document structure is critical for efficient information consumption. However, it is challenging to encode it efficiently into the modern Transformer architecture. In this work, we present HIBRIDS, which injects Hierarchical Biases foR Incorporating Document Structure into attention score calculation. We further present a new task, hierarchical question-summary generation, for summarizing salient content in the source document into a hierarchy of questions and summaries, where each follow-up question inquires about the content of its parent question-summary pair. We also annotate a new dataset with 6,153 question-summary hierarchies labeled on government reports. Experiment results show that our model produces better question-summary hierarchies than comparisons on both hierarchy quality and content coverage, a finding also echoed by human judges. Additionally, our model improves the generation of long-form summaries from long government reports and Wikipedia articles, as measured by ROUGE scores.
HiStruct+: Improving Extractive Text Summarization with Hierarchical Structure Information Qian Ruan, Malte Ostendorff, Georg Rehm [pdf] [code]
Long Document Summarization with Top-down and Bottom-up Inference Bo Pang, Erik Nijkamp, Wojciech Kryściński, Silvio Savarese, Yingbo Zhou, Caiming Xiong [pdf]
Summ^N: A Multi-Stage Summarization Framework for Long Input Dialogues and Documents Yusen Zhang, Ansong Ni, Ziming Mao, Chen Henry Wu, Chenguang Zhu, Budhaditya Deb, Ahmed H. Awadallah, Dragomir Radev, Rui Zhang ACL 2022 [pdf]
DYLE: Dynamic Latent Extraction for Abstractive Long-Input Summarization Ziming Mao, Chen Henry Wu, Ansong Ni, Yusen Zhang, Rui Zhang, Tao Yu, Budhaditya Deb, Chenguang Zhu, Ahmed H. Awadallah, Dragomir Radev ACL 2022 [pdf] [code]

[Abs]
Transformer-based models have achieved state-of-the-art performance on short-input summarization. However, they still struggle with summarizing longer text. In this paper, we present DYLE, a novel dynamic latent extraction approach for abstractive long-input summarization. DYLE jointly trains an extractor and a generator and treats the extracted text snippets as the latent variable, allowing dynamic snippet-level attention weights during decoding. To provide adequate supervision, we propose simple yet effective heuristics for oracle extraction as well as a consistency loss term, which encourages the extractor to approximate the averaged dynamic weights predicted by the generator. We evaluate our method on different long-document and long-dialogue summarization tasks: GovReport, QMSum, and arXiv. Experiment results show that DYLE outperforms all existing methods on GovReport and QMSum, with gains up to 6.1 ROUGE, while yielding strong results on arXiv. Further analysis shows that the proposed dynamic weights provide interpretability of our generation process.
SciBERTSUM: Extractive Summarization for Scientific Documents Athar Sefid, C Lee Giles [pdf] [code]
Neural Content Extraction for Poster Generation of Scientific Papers Sheng Xu, Xiaojun Wan [pdf]
LongT5: Efficient Text-To-Text Transformer for Long Sequences Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang [pdf]
The Influence of Data Pre-processing and Post-processing on Long Document Summarization Xinwei Du, Kailun Dong, Yuchen Zhang, Yongsheng Li, Ruei-Yu Tsay [pdf]
End-to-End Segmentation-based News Summarization Yang Liu, Chenguang Zhu, Michael Zeng [pdf]
Leveraging Information Bottleneck for Scientific Document Summarization Jiaxin Ju, Ming Liu, Huan Yee Koh, Yuan Jin, Lan Du, Shirui Pan EMNLP 2021 Findings [pdf]
Generating Summaries for Scientific Paper Review Ana Sabina Uban, Cornelia Caragea [pdf]
Sparsity and Sentence Structure in Encoder-Decoder Attention of Summarization Systems Potsawee Manakul, Mark J. F. Gales EMNLP 2021 short paper [pdf] [code]
Bringing Structure into Summaries: a Faceted Summarization Dataset for Long Scientific Documents Rui Meng, Khushboo Thaker, Lei Zhang, Yue Dong, Xingdi Yuan, Tong Wang, Daqing He ACL 2021 short [pdf] [data]
Sliding Selector Network with Dynamic Memory for Extractive Summarization of Long Documents Peng Cui, Le Hu NAACL21 [pdf] [code]
Long-Span Summarization via Local Attention and Content Selection Potsawee Manakul, Mark J. F. Gales ACL 2021 [pdf]
Globalizing BERT-based Transformer Architectures for Long Document Summarization Quentin Grail, Julien Perez, Eric Gaussier EACL 2021 [pdf]
Discourse-Aware Unsupervised Summarization for Long Scientific Documents Yue Dong, Andrei Mircea Romascanu, Jackie Chi Kit Cheung EACL21 [pdf] [code]
Enhancing Scientific Papers Summarization with Citation Graph Chenxin An, Ming Zhong, Yiran Chen, Danqing Wang, Xipeng Qiu, Xuanjing Huang AAAI 2021 [pdf] [code]
Efficient Attentions for Long Document Summarization Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, Lu Wang NAACL 2021 [pdf] [code] [data]
Can We Automate Scientific Reviewing? Weizhe Yuan, Pengfei Liu, and Graham Neubig [pdf] [code]
Long Document Summarization in a Low Resource Setting using Pretrained Language Models Ahsaas Bajaj, Pavitra Dangati, Kalpesh Krishna, Pradhiksha Ashok Kumar, Rheeya Uppaal, Bradford Windsor, Eliot Brenner, Dominic Dotterrer, Rajarshi Das, Andrew McCallum ACL 2021 Student Research Workshop [pdf]
Summaformers @ LaySumm 20, LongSumm 20 Sayar Ghosh Roy, Nikhil Pinnaparaju, Risubh Jain, Manish Gupta, Vasudeva Varma SDP EMNLP 2020 [pdf]
On Generating Extended Summaries of Long Documents Sajad Sotudeh, Arman Cohan, Nazli Goharian SDU21 [pdf] [code]
Self-Supervised Learning for Visual Summary Identification in Scientific Publications Shintaro Yamamoto, Anne Lauscher, Simone Paolo Ponzetto, Goran Glavaš, Shigeo Morishima [pdf]
Systematically Exploring Redundancy Reduction in Summarizing Long Documents Wen Xiao, Giuseppe Carenini AACL20 [pdf] [code]
On Extractive and Abstractive Neural Document Summarization with Transformer Language Models Sandeep Subramanian, Raymond Li, Jonathan Pilault, Christopher Pal EMNLP20 [pdf]
Dimsum @LaySumm 20: BART-based Approach for Scientific Document Summarization Tiezheng Yu, Dan Su, Wenliang Dai, Pascale Fung [pdf] [code]
SciSummPip: An Unsupervised Scientific Paper Summarization Pipeline Jiaxin Ju, Ming Liu, Longxiang Gao, Shirui Pan [pdf] [code]
Enhancing Extractive Text Summarization with Topic-Aware Graph Neural Networks Peng Cui, Le Hu, Yuanchao Liu COLING20 [pdf]
Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles Yao Lu, Yue Dong, Laurent Charlin EMNLP20 Short [pdf] [data]
A Divide-and-Conquer Approach to the Summarization of Long Documents Alexios Gidiotis, Grigorios Tsoumakas IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING [pdf]
TLDR: Extreme Summarization of Scientific Documents Isabel Cachola, Kyle Lo, Arman Cohan, Daniel S. Weld Findings of EMNLP20 [pdf] [data]
Extractive Summarization of Long Documents by Combining Global and Local Context Wen Xiao, Giuseppe Carenini EMNLP19 [pdf] [code]
ScisummNet: A Large Annotated Corpus and Content-Impact Models for Scientific Paper Summarization with Citation Networks Michihiro Yasunaga, Jungo Kasai, Rui Zhang, Alexander R. Fabbri, Irene Li, Dan Friedman, Dragomir R. Radev AAAI19 [pdf] [data]
TalkSumm: A Dataset and Scalable Annotation Method for Scientific Paper Summarization Based on Conference Talks Guy Lev, Michal Shmueli-Scheuer, Jonathan Herzig, Achiya Jerbi, David Konopnicki ACL19 [pdf] [data]
A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, Nazli Goharian NAACL18 [pdf] [data]

Factual Consistency

Toolkit: factsumm

ChatGPT as a Factual Inconsistency Evaluator for Abstractive Text Summarization Zheheng Luo, Qianqian Xie, Sophia Ananiadou [pdf]

A Meta-Evaluation of Faithfulness Metrics for Long-Form Hospital-Course Summarization Griffin Adams, Jason Zucker, Noémie Elhadad [pdf]

[Abs]
Long-form clinical summarization of hospital admissions has real-world significance because of its potential to help both clinicians and patients. The faithfulness of summaries is critical to their safe usage in clinical settings. To better understand the limitations of abstractive systems, as well as the suitability of existing evaluation metrics, we benchmark faithfulness metrics against fine-grained human annotations for model-generated summaries of a patient's Brief Hospital Course. We create a corpus of patient hospital admissions and summaries for a cohort of HIV patients, each with complex medical histories. Annotators are presented with summaries and source notes, and asked to categorize manually highlighted summary elements (clinical entities like conditions and medications as well as actions like "following up") into one of three categories: "Incorrect," "Missing," and "Not in Notes." We meta-evaluate a broad set of proposed faithfulness metrics and, across metrics, explore the importance of domain adaptation (e.g. the impact of in-domain pre-training and metric fine-tuning), the use of source-summary alignments, and the effects of distilling a single metric from an ensemble of pre-existing metrics. Off-the-shelf metrics with no exposure to clinical text correlate well yet overly rely on summary extractiveness. As a practical guide to long-form clinical narrative summarization, we find that most metrics correlate best to human judgments when provided with one summary sentence at a time and a minimal set of relevant source context.
Faithfulness-Aware Decoding Strategies for Abstractive Summarization David Wan, Mengwen Liu, Kathleen McKeown, Markus Dreyer, Mohit Bansal EACL 2023 [pdf] [code]

[Abs]
Despite significant progress in understanding and improving faithfulness in abstractive summarization, the question of how decoding strategies affect faithfulness is less studied. We present a systematic study of the effect of generation techniques such as beam search and nucleus sampling on faithfulness in abstractive summarization. We find a consistent trend where beam search with large beam sizes produces the most faithful summaries while nucleus sampling generates the least faithful ones. We propose two faithfulness-aware generation methods to further improve faithfulness over current generation techniques: (1) ranking candidates generated by beam search using automatic faithfulness metrics and (2) incorporating lookahead heuristics that produce a faithfulness score on the future summary. We show that both generation methods significantly improve faithfulness across two datasets as evaluated by four automatic faithfulness metrics and human evaluation. To reduce computational cost, we demonstrate a simple distillation approach that allows the model to generate faithful summaries with just greedy decoding. Our code is publicly available.
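The first method in the abstract above, ranking beam-search candidates with an automatic faithfulness metric, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, and `token_support` is a toy stand-in for a real faithfulness metric.

```python
from typing import Callable, List, Sequence


def rerank_by_faithfulness(
    source: str,
    candidates: Sequence[str],
    metric: Callable[[str, str], float],
) -> List[str]:
    """Return candidates sorted from most to least faithful under `metric`."""
    # sorted() is stable, so tied candidates keep their original beam order.
    return sorted(candidates, key=lambda c: metric(source, c), reverse=True)


def token_support(source: str, summary: str) -> float:
    """Toy stand-in metric (an assumption): share of summary tokens found in the source."""
    source_tokens = set(source.lower().split())
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    return sum(t in source_tokens for t in summary_tokens) / len(summary_tokens)


source = "the council approved the budget on monday"
beams = ["the council rejected the budget", "the council approved the budget"]
best = rerank_by_faithfulness(source, beams, token_support)[0]
```

In practice the `metric` argument would be replaced by a learned faithfulness scorer; the re-ranking step itself stays the same.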
DIFFQG: Generating Questions to Summarize Factual Changes Jeremy R. Cole, Palak Jain, Julian Martin Eisenschlos, Michael J.Q. Zhang, Eunsol Choi, Bhuwan Dhingra EACL 2023 [pdf] [code]

[Abs]
Identifying the difference between two versions of the same article is useful to update knowledge bases and to understand how articles evolve. Paired texts occur naturally in diverse situations: reporters write similar news stories and maintainers of authoritative websites must keep their information up to date. We propose representing factual changes between paired documents as question-answer pairs, where the answer to the same question differs between two versions. We find that question-answer pairs can flexibly and concisely capture the updated contents. Provided with paired documents, annotators identify questions that are answered by one passage but answered differently or cannot be answered by the other. We release DIFFQG which consists of 759 QA pairs and 1153 examples of paired passages with no factual change. These questions are intended to be both unambiguous and information-seeking and involve complex edits, pushing beyond the capabilities of current question generation and factual change detection systems. Our dataset summarizes the changes between two versions of the document as questions and answers, studying automatic update summarization in a novel way.
Improving Faithfulness by Augmenting Negative Summaries from Fake Documents Tianshu Wang, Faisal Ladhak, Esin Durmus, He He EMNLP 2022 [pdf] [code]

[Abs]
Current abstractive summarization systems tend to hallucinate content that is unfaithful to the source document, posing a risk of misinformation. To mitigate hallucination, we must teach the model to distinguish hallucinated summaries from faithful ones. However, the commonly used maximum likelihood training does not disentangle factual errors from other model errors. To address this issue, we propose a back-translation-style approach to augment negative samples that mimic factual errors made by the model. Specifically, we train an elaboration model that generates hallucinated documents given the reference summaries, and then generates negative summaries from the fake documents. We incorporate the negative samples into training through a controlled generator, which produces faithful/unfaithful summaries conditioned on the control codes. Additionally, we find that adding textual entailment data through multitasking further boosts the performance. Experiments on three datasets (XSum, Gigaword, and WikiHow) show that our method consistently improves faithfulness without sacrificing informativeness according to both human and automatic evaluation.
Learning with Rejection for Abstractive Text Summarization Meng Cao, Yue Dong, Jingyi He, Jackie Chi Kit Cheung EMNLP 2022 [pdf] [code]

[Abs]
State-of-the-art abstractive summarization systems frequently hallucinate content that is not supported by the source document, mainly due to noise in the training dataset. Existing methods opt to drop the noisy samples or tokens from the training set entirely, reducing the effective training set size and creating an artificial propensity to copy words from the source. In this work, we propose a training objective for abstractive summarization based on rejection learning, in which the model learns whether or not to reject potentially noisy tokens. We further propose a regularized decoding objective that penalizes non-factual candidate summaries during inference by using the rejection probability learned during training. We show that our method considerably improves the factuality of generated summaries in automatic and human evaluations when compared to five baseline models, and that it does so while increasing the abstractiveness of the generated summaries.
X-FACTOR: A Cross-metric Evaluation of Factual Correctness in Abstractive Summarization Subhajit Chaudhury, Sarathkrishna Swaminathan, Chulaka Gunasekara, Maxwell Crouse, Srinivas Ravishankar, Daiki Kimura, Keerthiram Murugesan, Ramón Fernandez Astudillo, Tahira Naseem, Pavan Kapanipathi, Alexander Gray EMNLP 2022 [pdf]

[Abs]
Abstractive summarization models often produce factually inconsistent summaries that are not supported by the original article. Recently, a number of fact-consistent evaluation techniques have been proposed to address this issue; however, a detailed analysis of how these metrics agree with one another has yet to be conducted. In this paper, we present X-FACTOR, a cross-evaluation of three high-performing fact-aware abstractive summarization methods. First, we show that summarization models are often fine-tuned on datasets that contain factually inconsistent summaries and propose a fact-aware filtering mechanism that improves the quality of training data and, consequently, the factuality of these models. Second, we propose a corrector module that can be used to improve the factual consistency of generated summaries. Third, we present a re-ranking technique that samples summary instances from the output distribution of a summarization model and re-ranks the sampled instances based on their factuality. Finally, we provide a detailed cross-metric agreement analysis that shows how tuning a model to output summaries based on a particular factuality metric influences factuality as determined by the other metrics. Our goal in this work is to facilitate research that improves the factuality and faithfulness of abstractive summarization models.
LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization Kalpesh Krishna, Erin Bransom, Bailey Kuehl, Mohit Iyyer, Pradeep Dasigi, Arman Cohan, Kyle Lo EACL 2023 [pdf] [code]

[Abs]
While human evaluation remains best practice for accurately judging the faithfulness of automatically-generated summaries, few solutions exist to address the increased difficulty and workload when evaluating long-form summaries. Through a survey of 162 papers on long-form summarization, we first shed light on current human evaluation practices surrounding long-form summaries. We find that 73% of these papers do not perform any human evaluation on model-generated summaries, while other works face new difficulties that manifest when dealing with long documents (e.g., low inter-annotator agreement). Motivated by our survey, we present LongEval, a set of guidelines for human evaluation of faithfulness in long-form summaries that addresses the following challenges: (1) How can we achieve high inter-annotator agreement on faithfulness scores? (2) How can we minimize annotator workload while maintaining accurate faithfulness scores? and (3) Do humans benefit from automated alignment between summary and source snippets? We deploy LongEval in annotation studies on two long-form summarization datasets in different domains (SQuALITY and PubMed), and we find that switching to a finer granularity of judgment (e.g., clause-level) reduces inter-annotator variance in faithfulness scores (e.g., std-dev from 18.5 to 6.8). We also show that scores from a partial annotation of fine-grained units correlate highly with scores from a full annotation workload (0.89 Kendall's tau using 50% judgments). We release our human judgments, annotation templates, and our software as a Python library for future research.
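The partial-annotation comparison mentioned in the abstract above can be illustrated with a small sketch. It is not the LongEval library: the function names are hypothetical, the data is made up, the subset is simply the first half of each summary's units (an assumption; any sampling scheme could be substituted), and the correlation is a plain tau-a that counts tied pairs as neither concordant nor discordant.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def faithfulness_score(unit_judgments: Sequence[bool]) -> float:
    """Summary-level score: fraction of fine-grained units judged faithful."""
    return sum(unit_judgments) / len(unit_judgments)


# Made-up unit-level judgments for three summaries (illustrative only).
judgments = [
    [True, True, True, True],
    [True, False, True, False],
    [False, False, False, True],
]
full_scores = [faithfulness_score(j) for j in judgments]
partial_scores = [faithfulness_score(j[: len(j) // 2]) for j in judgments]
tau = kendall_tau_a(full_scores, partial_scores)
```

A high tau here means the cheaper partial annotation ranks summaries in nearly the same order as the full annotation.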
mFACE: Multilingual Summarization with Factual Consistency Evaluation Roee Aharoni, Shashi Narayan, Joshua Maynez, Jonathan Herzig, Elizabeth Clark, Mirella Lapata [pdf]

[Abs]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets. Despite promising results, current models still suffer from generating factually inconsistent summaries, reducing their utility for real-world application. Several recent efforts attempt to address this by devising models that automatically detect factual inconsistencies in machine generated summaries. However, they focus exclusively on English, a language with abundant resources. In this work, we leverage factual consistency evaluation models to improve multilingual summarization. We explore two intuitive approaches to mitigate hallucinations based on the signal provided by a multilingual NLI model, namely data filtering and controlled generation. Experimental results in the 45 languages from the XLSum dataset show gains over strong baselines in both automatic and human evaluation.
Improving Faithfulness of Abstractive Summarization by Controlling Confounding Effect of Irrelevant Sentences Asish Ghoshal, Arash Einolghozati, Ankit Arun, Haoran Li, Lili Yu, Yashar Mehdad, Scott Wen-tau Yih, Asli Celikyilmaz [pdf]

[Abs]
Lack of factual correctness is an issue that still plagues state-of-the-art summarization systems despite their impressive progress on generating seemingly fluent summaries. In this paper, we show that factual inconsistency can be caused by irrelevant parts of the input text, which act as confounders. To that end, we leverage information-theoretic measures of causal effects to quantify the amount of confounding and precisely quantify how they affect the summarization performance. Based on insights derived from our theoretical results, we design a simple multi-task model to control such confounding by leveraging human-annotated relevant sentences when available. Crucially, we give a principled characterization of data distributions where such confounding can be large thereby necessitating the use of human annotated relevant sentences to generate factual summaries. Our approach improves faithfulness scores by 20% over strong baselines on AnswerSumm (Fabbri et al., 2021), a conversation summarization dataset where lack of faithfulness is a significant issue due to the subjective nature of the task. Our best method achieves the highest faithfulness score while also achieving state-of-the-art results on standard metrics like ROUGE and METEOR. We corroborate these improvements through human evaluation.
Improved Beam Search for Hallucination Mitigation in Abstractive Summarization Arvind Krishna Sridhar, Erik Visser [pdf]

[Abs]
Advancement in large pretrained language models has significantly improved their performance for conditional language generation tasks including summarization albeit with hallucinations. To reduce hallucinations, conventional methods proposed improving beam search or using a fact checker as a postprocessing step. In this paper, we investigate the use of the Natural Language Inference (NLI) entailment metric to detect and prevent hallucinations in summary generation. We propose an NLI-assisted beam re-ranking mechanism by computing entailment probability scores between the input context and summarization model-generated beams during saliency-enhanced greedy decoding. Moreover, a diversity metric is introduced to compare its effectiveness against vanilla beam search. Our proposed algorithm significantly outperforms vanilla beam decoding on XSum and CNN/DM datasets.
Revisiting text decomposition methods for NLI-based factuality scoring of summaries John Glover, Federico Fancellu, Vasudevan Jagannathan, Matthew R. Gormley, Thomas Schaaf [pdf]

[Abs]
Scoring the factuality of a generated summary involves measuring the degree to which a target text contains factual information using the input document as support. Given the similarities in the problem formulation, previous work has shown that Natural Language Inference models can be effectively repurposed to perform this task. As these models are trained to score entailment at a sentence level, several recent studies have shown that decomposing either the input document or the summary into sentences helps with factuality scoring. But is fine-grained decomposition always a winning strategy? In this paper we systematically compare different granularities of decomposition -- from document to sub-sentence level, and we show that the answer is no. Our results show that incorporating additional context can yield improvement, but that this does not necessarily apply to all datasets. We also show that small changes to previously proposed entailment-based scoring methods can result in better performance, highlighting the need for caution in model and methodology selection for downstream tasks.
HaRiM+: Evaluating Summary Quality with Hallucination Risk Seonil Son, Junsoo Park, Jeong-in Hwang, Junghwa Lee, Hyungjong Noh, Yeonsoo Lee AACL 2022 [pdf]

[Abs]
One of the challenges of developing a summarization model arises from the difficulty in measuring the factual inconsistency of the generated text. In this study, we reinterpret the decoder overconfidence-regularizing objective suggested in (Miao et al., 2021) as a hallucination risk measurement to better estimate the quality of generated summaries. We propose a reference-free metric, HaRiM+, which only requires an off-the-shelf summarization model to compute the hallucination risk based on token likelihoods. Deploying it requires no additional training of models or ad-hoc modules, which usually need alignment to human judgments. For summary-quality estimation, HaRiM+ records state-of-the-art correlation to human judgment on three summary-quality annotation sets: FRANK, QAGS, and SummEval. We hope that our work, which merits the use of summarization models, facilitates the progress of both automated evaluation and generation of summary.
ED-FAITH: Evaluating Dialogue Summarization on Faithfulness Sicong Huang, Asli Celikyilmaz, Haoran Li [pdf]

[Abs]
Abstractive summarization models typically generate content unfaithful to the input, thus highlighting the significance of evaluating the faithfulness of generated summaries. Most faithfulness metrics are only evaluated on the news domain; can they be transferred to other summarization tasks? In this work, we first present a systematic study of faithfulness metrics for dialogue summarization. We evaluate common faithfulness metrics on dialogue datasets and observe that most metrics correlate poorly with human judgements despite performing well on news datasets. Given these findings, to improve existing metrics’ performance on dialogue summarization, we first finetune on an in-domain dataset, then apply unlikelihood training on negative samples, and show that they can successfully improve metric performance on dialogue data. Inspired by the strong zero-shot performance of the T0 language model, we further propose T0-Score – a new metric for faithfulness evaluation, which shows consistent improvement against baseline metrics across multiple domains.
Evaluating the Factual Consistency of Large Language Models Through Summarization Derek Tam, Anisha Mascarenhas, Shiyue Zhang, Sarah Kwan, Mohit Bansal, Colin Raffel [pdf]

[Abs]
While large language models (LLMs) have proven to be effective on a large variety of tasks, they are also known to hallucinate information. To measure whether an LLM prefers factually consistent continuations of its input, we propose a new benchmark called FIB (Factual Inconsistency Benchmark) that focuses on the task of summarization. Specifically, our benchmark involves comparing the scores an LLM assigns to a factually consistent versus a factually inconsistent summary for an input news article. For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent. To generate summaries that are factually inconsistent, we generate summaries from a suite of summarization models that we have manually annotated as factually inconsistent. A model's factual consistency is then measured according to its accuracy, i.e., the proportion of documents where it assigns a higher score to the factually consistent summary. To validate the usefulness of FIB, we evaluate 23 large language models ranging from 1B to 176B parameters from six different model families including BLOOM and OPT. We find that existing LLMs generally assign a higher score to factually consistent summaries than to factually inconsistent summaries. However, if the factually inconsistent summaries occur verbatim in the document, then LLMs assign a higher score to these factually inconsistent summaries than factually consistent summaries. We validate design choices in our benchmark including the scoring method and source of distractor summaries. Our code and benchmark data are publicly available.
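The accuracy measure described in the abstract above (the proportion of documents where the consistent summary receives the higher score) can be sketched as follows. This is an illustrative sketch only: the function names and example data are hypothetical, and `overlap_score` is a toy stand-in for the score an LLM would assign.

```python
from typing import Callable, Sequence, Tuple

# (document, factually consistent summary, factually inconsistent summary)
Example = Tuple[str, str, str]


def pairwise_consistency_accuracy(
    examples: Sequence[Example],
    score: Callable[[str, str], float],
) -> float:
    """Share of examples where the consistent summary scores strictly higher."""
    if not examples:
        raise ValueError("need at least one example")
    wins = sum(score(doc, good) > score(doc, bad) for doc, good, bad in examples)
    return wins / len(examples)


def overlap_score(document: str, summary: str) -> float:
    """Toy stand-in for a model score (an assumption): token overlap with the document."""
    doc_tokens = set(document.lower().split())
    tokens = summary.lower().split()
    return sum(t in doc_tokens for t in tokens) / max(len(tokens), 1)


examples = [
    ("the cat sat on the mat", "the cat sat", "the dog sat"),
    ("rain fell in paris today", "rain fell in paris", "snow fell in rome"),
    ("alpha beta", "gamma", "delta"),  # artificial tie: counted as a miss
]
accuracy = pairwise_consistency_accuracy(examples, overlap_score)
```

Ties count as misses in this sketch because the comparison is strict; whether the benchmark handles ties this way is not stated in the abstract.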
Improving Factual Consistency in Summarization with Compression-Based Post-Editing Alexander R. Fabbri, Prafulla Kumar Choubey, Jesse Vig, Chien-Sheng Wu, Caiming Xiong EMNLP 2022 [pdf] [code]

[Abs]
State-of-the-art summarization models still struggle to be factually consistent with the input text. A model-agnostic way to address this problem is post-editing the generated summaries. However, existing approaches typically fail to remove entity errors if a suitable input entity replacement is not available or may insert erroneous content. In our work, we focus on removing extrinsic entity errors, or entities not in the source, to improve consistency while retaining the summary's essential information and form. We propose to use sentence-compression data to train the post-editing model to take a summary with extrinsic entity errors marked with special tokens and output a compressed, well-formed summary with those errors removed. We show that this model improves factual consistency while maintaining ROUGE, improving entity precision by up to 30% on XSum, and that this model can be applied on top of another post-editor, improving entity precision by up to a total of 38%. We perform an extensive comparison of post-editing approaches that demonstrate trade-offs between factual consistency, informativeness, and grammaticality, and we analyze settings where post-editors show the largest improvements.
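The abstract above reports gains in entity precision and defines extrinsic entity errors as entities not in the source. A rough sketch of such a check follows, assuming the common definition of entity precision as the fraction of summary entities that appear in the source. The function names are hypothetical, the entities are assumed to be extracted already, and matching is a simple case-insensitive substring test rather than whatever the paper uses.

```python
from typing import List, Sequence


def extrinsic_entities(summary_entities: Sequence[str], source_text: str) -> List[str]:
    """Entities mentioned in the summary that never appear in the source."""
    source_lower = source_text.lower()
    return [e for e in summary_entities if e.lower() not in source_lower]


def entity_precision(summary_entities: Sequence[str], source_text: str) -> float:
    """Fraction of summary entities supported by the source text."""
    if not summary_entities:
        return 1.0  # convention assumed here: no entities means nothing unsupported
    unsupported = extrinsic_entities(summary_entities, source_text)
    return 1.0 - len(unsupported) / len(summary_entities)


source_text = "Alice Chen met Bob Lee in Oslo."
summary_entities = ["Alice Chen", "Oslo", "Berlin"]  # hypothetical pre-extracted entities
flagged = extrinsic_entities(summary_entities, source_text)
precision = entity_precision(summary_entities, source_text)
```

The `flagged` list corresponds to the spans a post-editor of this kind would mark for removal.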
Evaluating and Improving Factuality in Multimodal Abstractive Summarization David Wan, Mohit Bansal EMNLP 2022 [pdf] [code]

[Abs]
Current metrics for evaluating factuality for abstractive document summarization have achieved high correlations with human judgment, but they do not account for the vision modality and thus are not adequate for vision-and-language summarization. We propose CLIPBERTScore, a simple weighted combination of CLIPScore and BERTScore to leverage the robustness and strong factuality detection performance between image-summary and document-summary, respectively. Next, due to the lack of meta-evaluation benchmarks to evaluate the quality of multimodal factuality metrics, we collect human judgments of factuality with respect to documents and images. We show that this simple combination of two metrics in the zero-shot setting achieves higher correlations than existing factuality metrics for document summarization, outperforms an existing multimodal summarization metric, and performs competitively with strong multimodal factuality metrics specifically fine-tuned for the task. Our thorough analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks. Finally, we demonstrate two practical downstream applications of our CLIPBERTScore metric: for selecting important images to focus on during training, and as a reward for reinforcement learning to improve factuality of multimodal summary generation with respect to automatic and human evaluation. Our data and code are publicly available at this https URL
FRSUM: Towards Faithful Abstractive Summarization via Enhancing Factual Robustness Wenhao Wu, Wei Li, Jiachen Liu, Xinyan Xiao, Ziqiang Cao, Sujian Li, Hua Wu EMNLP 2022 [pdf]

[Abs]
Despite being able to generate fluent and grammatical text, current Seq2Seq summarization models still suffer from the unfaithful generation problem. In this paper, we study the faithfulness of existing systems from a new perspective of factual robustness, which is the ability to correctly generate factual information over adversarial unfaithful information. We first measure a model's factual robustness by its success rate to defend against adversarial attacks when generating factual information. The factual robustness analysis on a wide range of current systems shows its good consistency with human judgments on faithfulness. Inspired by these findings, we propose to improve the faithfulness of a model by enhancing its factual robustness. Specifically, we propose a novel training strategy, namely FRSUM, which teaches the model to defend against both explicit adversarial samples and implicit factual adversarial perturbations. Extensive automatic and human evaluation results show that FRSUM consistently improves the faithfulness of various Seq2Seq models, such as T5 and BART.
Questioning the Validity of Summarization Datasets and Improving Their Factual Consistency Yanzhu Guo, Chloé Clavel, Moussa Kamal Eddine, Michalis Vazirgiannis EMNLP 2022 [pdf] [data]

[Abs]
The topic of summarization evaluation has recently attracted a surge of attention due to the rapid development of abstractive summarization systems. However, the formulation of the task is rather ambiguous: neither the linguistic nor the natural language processing community has succeeded in giving a mutually agreed-upon definition. Due to this lack of well-defined formulation, a large number of popular abstractive summarization datasets are constructed in a manner that neither guarantees validity nor meets one of the most essential criteria of summarization: factual consistency. In this paper, we address this issue by combining state-of-the-art factual consistency models to identify the problematic instances present in popular summarization datasets. We release SummFC, a filtered summarization dataset with improved factual consistency, and demonstrate that models trained on this dataset achieve improved performance in nearly all quality aspects. We argue that our dataset should become a valid benchmark for developing and evaluating summarization systems.
Mutual Information Alleviates Hallucinations in Abstractive Summarization Liam van der Poel, Ryan Cotterell, Clara Meister EMNLP 2022 [pdf] [code]

[Abs]
Despite significant progress in the quality of language generated from abstractive summarization models, these models still exhibit the tendency to hallucinate, i.e., output content not supported by the source document. A number of works have tried to fix--or at least uncover the source of--the problem with limited success. In this paper, we identify a simple criterion under which models are significantly more likely to assign more probability to hallucinated content during generation: high model uncertainty. This finding offers a potential explanation for hallucinations: models default to favoring text with high marginal probability, i.e., high-frequency occurrences in the training set, when uncertain about a continuation. It also motivates possible routes for real-time intervention during decoding to prevent such hallucinations. We propose a decoding strategy that switches to optimizing for pointwise mutual information of the source and target token--rather than purely the probability of the target token--when the model exhibits uncertainty. Experiments on the XSum dataset show that our method decreases the probability of hallucinated tokens while maintaining the Rouge and BertS scores of top-performing decoding strategies.
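The decoding rule described in this abstract (score by probability when the model is confident, switch to pointwise mutual information when it is uncertain) can be sketched for a single greedy step. This is a simplified illustration under stated assumptions, not the authors' implementation: the entropy threshold, the weight `lam`, the function names, and the toy distributions are hypothetical, and the marginal distribution is assumed to give every candidate token a non-zero probability.

```python
import math
from typing import Dict


def entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def pick_token(cond: Dict[str, float], marginal: Dict[str, float],
               threshold: float, lam: float) -> str:
    """Choose the next token for one greedy decoding step.

    cond:     p(token | source, prefix) from the summarization model.
    marginal: p(token | prefix) from a source-free language model (assumed > 0).
    threshold, lam: hypothetical hyperparameters, not taken from the abstract.
    """
    uncertain = entropy(cond) >= threshold
    scores = {}
    for token, p in cond.items():
        if p <= 0:
            continue
        score = math.log(p)
        if uncertain:
            # PMI-style score: penalize tokens that are likely regardless of the source.
            score -= lam * math.log(marginal[token])
        scores[token] = score
    return max(scores, key=scores.get)


# Toy distributions, for illustration only.
cond = {"London": 0.40, "Leeds": 0.35, "the": 0.25}
marginal = {"London": 0.60, "Leeds": 0.05, "the": 0.35}
print(pick_token(cond, marginal, threshold=1.0, lam=1.0))  # uncertain step
print(pick_token(cond, marginal, threshold=2.0, lam=1.0))  # confident step
```

With the lower threshold the step counts as uncertain and the source-specific token "Leeds" wins; with the higher threshold the plain most-probable token "London" is chosen.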
Correcting Diverse Factual Errors in Abstractive Summarization via Post-Editing and Language Model Infilling Vidhisha Balachandran, Hannaneh Hajishirzi, William Cohen, Yulia Tsvetkov EMNLP 2022 [pdf] [code]

[Abs]
Abstractive summarization models often generate inconsistent summaries containing factual errors or hallucinated content. Recent works focus on correcting factual errors in generated summaries via post-editing. Such correction models are trained using adversarial non-factual summaries constructed using heuristic rules for injecting errors. However, generating non-factual summaries using heuristics often does not generalize well to actual model errors. In this work, we propose to generate hard, representative synthetic examples of non-factual summaries through infilling language models. With this data, we train a more robust fact-correction model to post-edit the summaries to improve factual consistency. Through quantitative and qualitative experiments on two popular summarization datasets -- CNN/DM and XSum -- we show that our approach vastly outperforms prior methods in correcting erroneous summaries. Our model -- FactEdit -- improves factuality scores by over ~11 points on CNN/DM and over ~31 points on XSum on average across multiple summarization models, producing more factual summaries while maintaining competitive summarization quality.
Phrase-Level Localization of Inconsistency Errors in Summarization by Weak Supervision Masato Takatsuka, Tetsunori Kobayashi, Yoshihiko Hayashi COLING 2022 [pdf] [code]

[Abs]
Although the fluency of automatically generated abstractive summaries has improved significantly with advanced methods, the inconsistency that remains in summarization is recognized as an issue to be addressed. In this study, we propose a methodology for localizing inconsistency errors in summarization. A synthetic dataset that contains a variety of factual errors likely to be produced by a common summarizer is created by applying sentence fusion, compression, and paraphrasing operations. In creating the dataset, we automatically label erroneous phrases and the dependency relations between them as “inconsistent,” which can contribute to detecting errors more adequately than existing models that rely only on dependency arc-level labels. Subsequently, this synthetic dataset is employed as weak supervision to train a model called SumPhrase, which jointly localizes errors in a summary and their corresponding sentences in the source document. The empirical results demonstrate that our SumPhrase model can detect factual errors in summarization more effectively than existing weakly supervised methods owing to the phrase-level labeling. Moreover, the joint identification of error-corresponding original sentences is proven to be effective in improving error detection accuracy.
Just ClozE! A Fast and Simple Method for Evaluating the Factual Consistency in Abstractive Summarization Yiyang Li, Lei Li, Qing Yang, Marina Litvak, Natalia Vanetik, Dingxin Hu, Yuze Li, Yanquan Zhou, Dongliang Xu, Xuanyu Zhang EMNLP 2022 [pdf]

[Abs]
The issue of factual consistency in abstractive summarization has attracted much attention in recent years, and the evaluation of factual consistency between summary and document has become an important and urgent task. Most of the current evaluation metrics are adopted from question answering (QA). However, the application of QA-based metrics is extremely time-consuming in practice, causing the iteration cycle of abstractive summarization research to be severely prolonged. In this paper, we propose a new method called ClozE to evaluate factual consistency by a cloze model, instantiated based on a masked language model (MLM), with strong interpretability and substantially higher speed. We demonstrate that ClozE can reduce the evaluation time by nearly 96% relative to QA-based metrics while retaining their interpretability and performance through experiments on six human-annotated datasets and a meta-evaluation benchmark GO FIGURE (Gabriel et al., 2020). We also implement experiments to further demonstrate more characteristics of ClozE in terms of performance and speed. In addition, we conduct an experimental analysis of the limitations of ClozE, which suggests future research directions. The code and models for ClozE will be released upon acceptance of the paper.
Extractive is not Faithful: An Investigation of Broad Unfaithfulness Problems in Extractive Summarization Shiyue Zhang, David Wan, Mohit Bansal [pdf] [code]

[Abs]
The problems of unfaithful summaries have been widely discussed under the context of abstractive summarization. Though extractive summarization is less prone to the common unfaithfulness issues of abstractive summaries, does that mean extractive is equal to faithful? Turns out that the answer is no. In this work, we define a typology with five types of broad unfaithfulness problems (including and beyond not-entailment) that can appear in extractive summaries, including incorrect coreference, incomplete coreference, incorrect discourse, incomplete discourse, as well as other misleading information. We ask humans to label these problems out of 1500 English summaries produced by 15 diverse extractive systems. We find that 33% of the summaries have at least one of the five issues. To automatically detect these problems, we find that 5 existing faithfulness evaluation metrics for summarization have poor correlations with human judgment. To remedy this, we propose a new metric, ExtEval, that is designed for detecting unfaithful extractive summaries and is shown to have the best performance. We hope our work can increase the awareness of unfaithfulness problems in extractive summarization and help future work to evaluate and resolve these issues. Our data and code are publicly available at this https URL
Entity-based SpanCopy for Abstractive Summarization to Improve the Factual Consistency Wen Xiao, Giuseppe Carenini [pdf] [code]

[Abs]
Despite the success of recent abstractive summarizers on automatic evaluation metrics, the generated summaries still present factual inconsistencies with the source document. In this paper, we focus on entity-level factual inconsistency, i.e. reducing the mismatched entities between the generated summaries and the source documents. We therefore propose a novel entity-based SpanCopy mechanism, and explore its extension with a Global Relevance component. Experiment results on four summarization datasets show that SpanCopy can effectively improve the entity-level factual consistency with essentially no change in the word-level and entity-level saliency. The code is available at this https URL
Jointly Learning Guidance Induction and Faithful Summary Generation via Conditional Variational Autoencoders Wang Xu, Tiejun Zhao Findings of NAACL 2022 [pdf]

[Abs]
Abstractive summarization can generate high quality results with the development of the neural network. However, generating factually consistent summaries is a challenging task for abstractive summarization. Recent studies extract additional information from the source document with off-the-shelf tools as a clue to guide the summary generation, which has proven effective at improving faithfulness. Unlike these works, we present a novel framework based on conditional variational autoencoders, which induces the guidance information and generates the summary equipped with the guidance synchronously. Experiments on the XSUM and CNNDM datasets show that our approach can generate relevant and fluent summaries that are more faithful than the existing state-of-the-art approaches, according to multiple factual consistency metrics.
Masked Summarization to Generate Factually Inconsistent Summaries for Improved Factual Consistency Checking Hwanhee Lee, Kang Min Yoo, Joonsuk Park, Hwaran Lee, Kyomin Jung Findings of NAACL 2022 [pdf] [code]

[Abs]
Despite the recent advances in abstractive summarization systems, it is still difficult to determine whether a generated summary is factually consistent with the source text. To this end, the latest approach is to train a factual consistency classifier on factually consistent and inconsistent summaries. Luckily, the former is readily available as reference summaries in existing summarization datasets. However, generating the latter remains a challenge, as they need to be factually inconsistent, yet closely relevant to the source text to be effective. In this paper, we propose to generate factually inconsistent summaries using source texts and reference summaries with key information masked. Experiments on seven benchmark datasets demonstrate that factual consistency classifiers trained on summaries generated using our method generally outperform existing models and show a competitive correlation with human judgments. We also analyze the characteristics of the summaries generated using our method. We will release the pre-trained model and the code at https://github.com/hwanheelee1993/MFMA.
Improving the Faithfulness of Abstractive Summarization via Entity Coverage Control Haopeng Zhang, Semih Yavuz, Wojciech Kryscinski, Kazuma Hashimoto, Yingbo Zhou Findings of NAACL 2022 [pdf] [code]

[Abs]
Abstractive summarization systems leveraging pre-training language models have achieved superior results on benchmark datasets. However, such models have been shown to be more prone to hallucinate facts that are unfaithful to the input context. In this paper, we propose a method to remedy entity-level extrinsic hallucinations with Entity Coverage Control (ECC). We first compute entity coverage precision and prepend the corresponding control code for each training example, which implicitly guides the model to recognize faithfulness contents in the training phase. We further extend our method via intermediate fine-tuning on large but noisy data extracted from Wikipedia to unlock zero-shot summarization. We show that the proposed method leads to more faithful and salient abstractive summarization in supervised fine-tuning and zero-shot settings according to our experimental results on three benchmark datasets XSum, Pubmed, and SAMSum of very different domains and styles.
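The first step described in this abstract (compute entity coverage precision, then prepend a control code to each training example) might look roughly like the sketch below. Everything concrete here is an assumption for illustration: entities are taken as pre-extracted strings, coverage is checked by case-insensitive substring match, and the three bucket boundaries and the `<cov-...>` token names are hypothetical rather than the paper's.

```python
from typing import Iterable


def entity_coverage_precision(summary_entities: Iterable[str], source_text: str) -> float:
    """Share of summary entities that also appear in the source text."""
    entities = [e for e in summary_entities if e.strip()]
    if not entities:
        return 1.0  # assumption: a summary without entities has nothing to hallucinate
    source = source_text.lower()
    covered = sum(1 for e in entities if e.lower() in source)
    return covered / len(entities)


def control_code(precision: float) -> str:
    """Map a precision value to a hypothetical three-way control token."""
    if precision >= 1.0:
        return "<cov-high>"
    if precision >= 0.5:
        return "<cov-mid>"
    return "<cov-low>"


def add_control_code(source_text: str, summary_entities: Iterable[str]) -> str:
    """Prepend the control token to the model input for one training example."""
    precision = entity_coverage_precision(summary_entities, source_text)
    return f"{control_code(precision)} {source_text}"


source = "Alice met Bob in Paris."
print(add_control_code(source, ["Alice", "Paris", "Berlin"]))
```

At inference time one would presumably prepend the highest-coverage token to steer generation toward entities found in the source, but that usage is an inference from the abstract rather than something it states.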
FactPEGASUS: Factuality-Aware Pre-training and Fine-tuning for Abstractive Summarization David Wan, Mohit Bansal NAACL 2022 [pdf] [code]

[Abs]
We present FactPEGASUS, an abstractive summarization model that addresses the problem of factuality during pre-training and fine-tuning: (1) We augment the sentence selection strategy of PEGASUS’s (Zhang et al., 2019) pre-training objective to create pseudo-summaries that are both important and factual; (2) We introduce three complementary components for fine-tuning. The corrector removes hallucinations present in the reference summary, the contrastor uses contrastive learning to better differentiate nonfactual summaries from factual ones, and the connector bridges the gap between the pre-training and fine-tuning for better transfer of knowledge. Experiments on three downstream tasks demonstrate that FactPEGASUS substantially improves factuality evaluated by multiple automatic metrics and humans. Our thorough analysis suggests that FactPEGASUS is more factual than using the original pre-training objective in zero-shot and few-shot settings, retains factual behavior more robustly than strong baselines, and does not rely entirely on becoming more extractive to improve factuality.
SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization Philippe Laban, Tobias Schnabel, Paul N. Bennett, Marti A. Hearst TACL 2022 Volume 10 [pdf] [code]

[Abs]
In the summarization domain, a key requirement for summaries is to be factually consistent with the input document. Previous work has found that natural language inference (NLI) models do not perform competitively when applied to inconsistency detection. In this work, we revisit the use of NLI for inconsistency detection, finding that past work suffered from a mismatch in input granularity between NLI datasets (sentence-level), and inconsistency detection (document level). We provide a highly effective and light-weight method called SummaCConv that enables NLI models to be successfully used for this task by segmenting documents into sentence units and aggregating scores between pairs of sentences. We furthermore introduce a new benchmark called SummaC (Summary Consistency) which consists of six large inconsistency detection datasets. On this dataset, SummaCConv obtains state-of-the-art results with a balanced accuracy of 74.4%, a 5% improvement compared with prior work.
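The sentence-level idea in this abstract (segment the document and summary into sentences, score sentence pairs with an NLI model, then aggregate) can be sketched as follows. This is only an illustration under stated assumptions: the aggregation shown (max over document sentences, then mean over summary sentences) is one simple choice and not necessarily the one used by SummaCConv, the regex splitter is naive, and `toy_entail_prob` is a word-overlap stand-in, not a real NLI model.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def consistency_score(document: str, summary: str,
                      entail_prob: Callable[[str, str], float]) -> float:
    """Aggregate pairwise entailment scores into one summary-level score."""
    doc_sents = split_sentences(document)
    sum_sents = split_sentences(summary)
    if not doc_sents or not sum_sents:
        raise ValueError("document and summary must each contain a sentence")
    # For each summary sentence, keep its best-supporting document sentence.
    per_sentence = [max(entail_prob(d, s) for d in doc_sents) for s in sum_sents]
    return sum(per_sentence) / len(per_sentence)


def toy_entail_prob(premise: str, hypothesis: str) -> float:
    """Stand-in scorer: fraction of hypothesis words found in the premise."""
    p = set(re.findall(r"\w+", premise.lower()))
    h = set(re.findall(r"\w+", hypothesis.lower()))
    return len(p & h) / len(h) if h else 0.0


document = "The cat sat on the mat. It rained all day."
print(consistency_score(document, "The cat sat on the mat.", toy_entail_prob))
print(consistency_score(document, "The dog barked.", toy_entail_prob))
```

In practice `entail_prob` would wrap an actual NLI classifier; it is kept as a plain callable here so the aggregation logic can be read and tested on its own.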
Hallucinated but Factual! Inspecting the Factuality of Hallucinations in Abstractive Summarization Meng Cao, Yue Dong, Jackie Cheung ACL 2022 [pdf] [code]

[Abs]
State-of-the-art abstractive summarization systems often generate hallucinations; i.e., content that is not directly inferable from the source text. Despite being assumed to be incorrect, we find that much hallucinated content is actually consistent with world knowledge, which we call factual hallucinations. Including these factual hallucinations in a summary can be beneficial because they provide useful background information. In this work, we propose a novel detection approach that separates factual from non-factual hallucinations of entities. Our method is based on an entity’s prior and posterior probabilities according to pre-trained and finetuned masked language models, respectively. Empirical results suggest that our method vastly outperforms two baselines in both accuracy and F1 scores and has a strong correlation with human judgments on factuality classification tasks. Furthermore, we use our method as a reward signal to train a summarization system using an off-line reinforcement learning (RL) algorithm that can significantly improve the factuality of generated summaries while maintaining the level of abstractiveness.
Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors Liyan Tang, Tanya Goyal, Alexander R. Fabbri, Philippe Laban, Jiacheng Xu, Semih Yavuz, Wojciech Kryściński, Justin F. Rousseau, Greg Durrett [pdf] [code]
Falsesum: Generating Document-level NLI Examples for Recognizing Factual Inconsistency in Summarization Prasetya Ajie Utama, Joshua Bambrick, Nafise Sadat Moosavi, Iryna Gurevych NAACL 2022 [pdf] [code]

[Abs]
Neural abstractive summarization models are prone to generate summaries that are factually inconsistent with their source documents. Previous work has introduced the task of recognizing such factual inconsistency as a downstream application of natural language inference (NLI). However, state-of-the-art NLI models perform poorly in this context due to their inability to generalize to the target task. In this work, we show that NLI models can be effective for this task when the training data is augmented with high-quality task-oriented examples. We introduce Falsesum, a data generation pipeline leveraging a controllable text generation model to perturb human-annotated summaries, introducing varying types of factual inconsistencies. Unlike previously introduced document-level NLI datasets, our generated dataset contains examples that are diverse and inconsistent yet plausible. We show that models trained on a Falsesum-augmented NLI dataset improve the state-of-the-art performance across four benchmarks for detecting factual inconsistency in summarization.
Faithful to the Document or to the World? Mitigating Hallucinations via Entity-linked Knowledge in Abstractive Summarization Yue Dong, John Wieting, Pat Verga [pdf]
Learning to Revise References for Faithful Summarization Griffin Adams, Han-Chin Shing, Qing Sun, Christopher Winestock, Kathleen McKeown, Noémie Elhadad [pdf] [code]
Factual Error Correction for Abstractive Summaries Using Entity Retrieval Hwanhee Lee, Cheoneum Park, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Juae Kim, Kyomin Jung [pdf]
Evaluating Factuality in Text Simplification Ashwin Devaraj, William Sheffield, Byron C. Wallace, Junyi Jessy Li ACL 2022 [pdf] [code]
FactGraph: Evaluating Factuality in Summarization with Semantic Graph Representations Leonardo F. R. Ribeiro, Mengwen Liu, Iryna Gurevych, Markus Dreyer, Mohit Bansal NAACL 2022 [pdf] [code]

[Abs]
Despite recent improvements in abstractive summarization, most current approaches generate summaries that are not factually consistent with the source document, severely restricting their trust and usage in real-world applications. Recent works have shown promising improvements in factuality error identification using text or dependency arc entailments; however, they do not consider the entire semantic graph simultaneously. To this end, we propose FactGraph, a method that decomposes the document and the summary into structured meaning representations (MR), which are more suitable for factuality evaluation. MRs describe core semantic concepts and their relations, aggregating the main content in both document and summary in a canonical form, and reducing data sparsity. FactGraph encodes such graphs using a graph encoder augmented with structure-aware adapters to capture interactions among the concepts based on the graph connectivity, along with text representations using an adapter-based text encoder. Experiments on different benchmarks for evaluating factuality show that FactGraph outperforms previous approaches by up to 15%. Furthermore, FactGraph improves performance on identifying content verifiability errors and better captures subsentence-level factual inconsistencies.
Don't Say What You Don't Know: Improving the Consistency of Abstractive Summarization by Constraining Beam Search Daniel King, Zejiang Shen, Nishant Subramani, Daniel S. Weld, Iz Beltagy, Doug Downey [pdf] [code]
CONFIT: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning Xiangru Tang, Arjun Nair, Borui Wang, Bingyao Wang, Jai Desai, Aaron Wade, Haoran Li, Asli Celikyilmaz, Yashar Mehdad, Dragomir Radev NAACL 2022 [pdf]

[Abs]
Factual inconsistencies in generated summaries severely limit the practical applications of abstractive dialogue summarization. Although significant progress has been achieved by using pre-trained neural language models, substantial amounts of hallucinated content are found during the human evaluation. In this work, we first devised a typology of factual errors to better understand the types of hallucinations generated by current models and conducted human evaluation on popular dialogue summarization datasets. We further propose a training strategy that improves the factual consistency and overall quality of summaries via a novel contrastive fine-tuning, called CONFIT. To tackle top factual errors from our annotation, we introduce additional contrastive loss with carefully designed hard negative samples and self-supervised dialogue-specific loss to capture the key information between speakers. We show that our model significantly reduces all kinds of factual errors on both SAMSum dialogue summarization and AMI meeting summarization. On both datasets, we achieve significant improvements over state-of-the-art baselines using both automatic metrics, ROUGE and BARTScore, and human evaluation.
QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization Alexander R. Fabbri, Chien-Sheng Wu, Wenhao Liu, Caiming Xiong NAACL 2022 [pdf] [code]

[Abs]
Factual consistency is an essential quality of text summarization models in practical settings. Existing work in evaluating this dimension can be broadly categorized into two lines of research, entailment-based and question answering (QA)-based metrics, and different experimental setups often lead to contrasting conclusions as to which paradigm performs the best. In this work, we conduct an extensive comparison of entailment and QA-based metrics, demonstrating that carefully choosing the components of a QA-based metric, especially question generation and answerability classification, is critical to performance. Building on those insights, we propose an optimized metric, which we call QAFactEval, that leads to a 14% average improvement over previous QA-based metrics on the SummaC factual consistency benchmark, and also outperforms the best-performing entailment-based metric. Moreover, we find that QA-based and entailment-based metrics can offer complementary signals and be combined into a single metric for a further performance boost.
CO2Sum:Contrastive Learning for Factual-Consistent Abstractive Summarization Wei Liu, Huanqin Wu, Wenjing Mu, Zhen Li, Tao Chen, Dan Nie [pdf]
Are Factuality Checkers Reliable? Adversarial Meta-evaluation of Factuality in Summarization Yiran Chen, Pengfei Liu, Xipeng Qiu EMNLP 2021 Findings [pdf] [code]
Dialogue Inspectional Summarization with Factual Inconsistency Awareness Leilei Gan, Yating Zhang, Kun Kuang, Lin Yuan, Shuo Li, Changlong Sun, Xiaozhong Liu, Fei Wu [pdf]
Fine-grained Factual Consistency Assessment for Abstractive Summarization Models Sen Zhang, Jianwei Niu, Chuyuan Wei [pdf]
MoFE: Mixture of Factual Experts for Controlling Hallucinations in Abstractive Summarization Prafulla Kumar Choubey, Jesse Vig, Wenhao Liu, Nazneen Fatema Rajani [pdf]
Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries Xiangru Tang, Alexander R. Fabbri, Ziming Mao, Griffin Adams, Borui Wang, Haoran Li, Yashar Mehdad, Dragomir Radev NAACL 2022 [pdf]

[Abs]
Current pre-trained models applied for summarization are prone to factual inconsistencies that misrepresent the source text. Evaluating the factual consistency of summaries is thus necessary to develop better models. However, the human evaluation setup for evaluating factual consistency has not been standardized. To determine the factors that affect the reliability of the human evaluation, we crowdsource evaluations for factual consistency across state-of-the-art models on two news summarization datasets using the rating-based Likert Scale and ranking-based Best-Worst Scaling. Our analysis reveals that the ranking-based Best-Worst Scaling offers a more reliable measure of summary quality across datasets and that the reliability of Likert ratings highly depends on the target dataset and the evaluation design. To improve crowdsourcing reliability, we extend the scale of the Likert rating and present a scoring algorithm for Best-Worst Scaling that we call value learning. Our crowdsourcing guidelines will be publicly available to facilitate future work on factual consistency in summarization.
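For readers unfamiliar with ranking-based Best-Worst Scaling, the conventional counting score is sketched below: each item's score is the number of times it was picked as best minus the number of times it was picked as worst, divided by how often it was shown. This is the standard baseline scoring, not the "value learning" algorithm the abstract proposes, and the judgment format and system names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple

# One judgment: (items shown to the annotator, item picked as best, item picked as worst)
Judgment = Tuple[Sequence[str], str, str]


def bws_scores(judgments: Iterable[Judgment]) -> Dict[str, float]:
    """Counting-based Best-Worst Scaling score in [-1, 1] for every shown item."""
    best = defaultdict(int)
    worst = defaultdict(int)
    shown = defaultdict(int)
    for items, best_item, worst_item in judgments:
        for item in items:
            shown[item] += 1
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Three hypothetical annotations over summaries from systems A, B and C.
judgments = [
    (("A", "B", "C"), "A", "C"),
    (("A", "B", "C"), "A", "B"),
    (("A", "B", "C"), "B", "C"),
]
print(bws_scores(judgments))
```

Here system A is preferred overall, B is neutral, and C is ranked worst most often.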
MiRANews: Dataset and Benchmarks for Multi-Resource-Assisted News Summarization Xinnuo Xu, Ondřej Dušek, Shashi Narayan, Verena Rieser, Ioannis Konstas EMNLP 2021 Findings [pdf] [data]
Inspecting the Factuality of Hallucinated Entities in Abstractive Summarization Meng Cao, Yue Dong, Jackie Chi Kit Cheung [pdf]
CLIFF: Contrastive Learning for Improving Faithfulness and Factuality in Abstractive Summarization Shuyang Cao, Lu Wang EMNLP 2021 [pdf] [code]
Faithful or Extractive? On Mitigating the Faithfulness-Abstractiveness Trade-off in Abstractive Summarization Faisal Ladhak, Esin Durmus, He He, Claire Cardie, Kathleen McKeown ACL 2022 [pdf] [code]

[Abs]
Despite recent progress in abstractive summarization, systems still suffer from faithfulness errors. While prior work has proposed models that improve faithfulness, it is unclear whether the improvement comes from an increased level of extractiveness of the model outputs as one naive way to improve faithfulness is to make summarization models more extractive. In this work, we present a framework for evaluating the effective faithfulness of summarization systems, by generating a faithfulness-abstractiveness trade-off curve that serves as a control at different operating points on the abstractiveness spectrum. We then show that the Maximum Likelihood Estimation (MLE) baseline as well as recently proposed methods for improving faithfulness, fail to consistently improve over the control at the same level of abstractiveness. Finally, we learn a selector to identify the most faithful and abstractive summary for a given document, and show that this system can attain higher faithfulness scores in human evaluations while being more abstractive than the baseline system on two datasets. Moreover, we show that our system is able to achieve a better faithfulness-abstractiveness trade-off than the control at the same level of abstractiveness.
Factual Consistency Evaluation for Text Summarization via Counterfactual Estimation Yuexiang Xie, Fei Sun, Yang Deng, Yaliang Li, Bolin Ding EMNLP 2021 Findings [pdf] [code]
Improving Factual Consistency of Abstractive Summarization on Customer Feedback Yang Liu, Yifei Sun, Vincent Gao ACL 2021 Proceedings of The 4th Workshop on e-Commerce and NLP [pdf]
AgreeSum: Agreement-Oriented Multi-Document Summarization Richard Yuanzhe Pang, Adam D. Lelkes, Vinh Q. Tran, Cong Yu Findings of ACL 2021 [pdf] [data]
Focus Attention: Promoting Faithfulness and Diversity in Summarization Rahul Aralikatte, Shashi Narayan, Joshua Maynez, Sascha Rothe, Ryan McDonald ACL 2021 [pdf]
Improving Factual Consistency of Abstractive Summarization via Question Answering Feng Nan, Cicero Nogueira dos Santos, Henghui Zhu, Patrick Ng, Kathleen McKeown, Ramesh Nallapati, Dejiao Zhang, Zhiguo Wang, Andrew O. Arnold, Bing Xiang ACL 2021 [pdf] [code]
Discourse Understanding and Factual Consistency in Abstractive Summarization Saadia Gabriel, Antoine Bosselut, Jeff Da, Ari Holtzman, Jan Buys, Kyle Lo, Asli Celikyilmaz, Yejin Choi EACL21 [pdf] [code]
Improving Faithfulness in Abstractive Summarization with Contrast Candidate Generation and Selection Sihao Chen, Fan Zhang, Kazoo Sone and Dan Roth NAACL21 [pdf] [code]
Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics Artidoro Pagnoni, Vidhisha Balachandran and Yulia Tsvetkov NAACL21 [pdf] [code]
Annotating and Modeling Fine-grained Factuality in Summarization Tanya Goyal, Greg Durrett NAACL21 [pdf] [code]
SAFEval: Summarization Asks for Fact-based Evaluation Thomas Scialom, Paul-Alexis Dray, Patrick Gallinari, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang [pdf] [code]
Enhancing Factual Consistency of Abstractive Summarization Chenguang Zhu, William Hinthorn, Ruochen Xu, Qingkai Zeng, Michael Zeng, Xuedong Huang, Meng Jiang NAACL21 [pdf]
Entity-level Factual Consistency of Abstractive Text Summarization Feng Nan, Ramesh Nallapati, Zhiguo Wang, Cicero Nogueira dos Santos, Henghui Zhu, Dejiao Zhang, Kathleen McKeown, Bing Xiang EACL21 [pdf] [code]
On the Faithfulness for E-commerce Product Summarization Peng Yuan, Haoran Li, Song Xu, Youzheng Wu, Xiaodong He, Bowen Zhou COLING20 [pdf] [code]
FFCI: A Framework for Interpretable Automatic Evaluation of Summarization Fajri Koto, Jey Han Lau, Timothy Baldwin [pdf] [code]
GSum: A General Framework for Guided Neural Abstractive Summarization Zi-Yi Dou, Pengfei Liu, Hiroaki Hayashi, Zhengbao Jiang, Graham Neubig NAACL21 [pdf] [code]
Truth or Error? Towards systematic analysis of factual errors in abstractive summaries Klaus-Michael Lux, Maya Sappelli, Martha Larson EMNLP | Eval4NLP 20 [pdf]
Detecting Hallucinated Content in Conditional Neural Sequence Generation Chunting Zhou, Jiatao Gu, Mona Diab, Paco Guzman, Luke Zettlemoyer, Marjan Ghazvininejad [pdf] [code]
Go Figure! A Meta Evaluation of Factuality in Summarization Saadia Gabriel, Asli Celikyilmaz, Rahul Jha, Yejin Choi, Jianfeng Gao Findings of ACL 2021 [pdf]
Constrained Abstractive Summarization: Preserving Factual Consistency with Constrained Generation Yuning Mao, Xiang Ren, Heng Ji, Jiawei Han [pdf]
Factual Error Correction for Abstractive Summarization Models Meng Cao, Yue Dong, Jiapeng Wu, Jackie Chi Kit Cheung EMNLP20 short [pdf] [code]
Multi-Fact Correction in Abstractive Text Summarization. Yue Dong, Shuohang Wang, Zhe Gan, Yu Cheng, Jackie Chi Kit Cheung, Jingjing Liu EMNLP20 [pdf]
Evaluating the Factual Consistency of Abstractive Text Summarization Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher EMNLP20 [pdf] [code]
Reducing Quantity Hallucinations in Abstractive Summarization Zheng Zhao, Shay B. Cohen, Bonnie Webber Findings of EMNLP [pdf]
On Faithfulness and Factuality in Abstractive Summarization Joshua Maynez, Shashi Narayan, Bernd Bohnet, Ryan McDonald ACL20 [pdf] [data]
Improving Truthfulness of Headline Generation Kazuki Matsumaru, Sho Takase, Naoaki Okazaki ACL20 [pdf]
Optimizing the Factual Correctness of a Summary: A Study of Summarizing Radiology Reports Yuhao Zhang, Derek Merck, Emily Bao Tsai, Christopher D. Manning, Curtis P. Langlotz ACL20 [pdf]
FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization Esin Durmus, He He, Mona Diab ACL20 [pdf] [code]
Asking and Answering Questions to Evaluate the Factual Consistency of Summaries Alex Wang, Kyunghyun Cho, Mike Lewis ACL20 [pdf] [code]
Knowledge Graph-Augmented Abstractive Summarization with Semantic-Driven Cloze Reward Luyang Huang, Lingfei Wu, Lu Wang ACL20 [pdf]
Mind The Facts: Knowledge-Boosted Coherent Abstractive Text Summarization Beliz Gunel, Chenguang Zhu, Michael Zeng, Xuedong Huang NIPS19 [pdf]
Assessing The Factual Accuracy of Generated Text Ben Goodrich, Vinay Rao, Mohammad Saleh, Peter J Liu KDD19 [pdf]
Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, Iryna Gurevych ACL19 [pdf] [data]
Ensure the Correctness of the Summary: Incorporate Entailment Knowledge into Abstractive Sentence Summarization Haoran Li, Junnan Zhu, Jiajun Zhang, Chengqing Zong COLING18 [pdf] [code]
Faithful to the Original: Fact-Aware Neural Abstractive Summarization Ziqiang Cao, Furu Wei, Wenjie Li, Sujian Li AAAI18 [pdf]
FAR-ASS: Fact-aware reinforced abstractive sentence summarization MengLi Zhang, Gang Zhou, Wanting Yu, Wenfen Liu [pdf]

Contrastive Learning

COLO: A Contrastive Learning based Re-ranking Framework for One-Stage Summarization COLING 2022 [pdf] [code]

[Abs]
Traditional training paradigms for extractive and abstractive summarization systems use only token-level or sentence-level training objectives. However, the output summary is always evaluated at the summary level, which leads to an inconsistency between training and evaluation. In this paper, we propose a Contrastive Learning based re-ranking framework for one-stage summarization called COLO. By modeling a contrastive objective, we show that the summarization model is able to directly generate summaries according to the summary-level score without additional modules and parameters. Extensive experiments demonstrate that COLO boosts the extractive and abstractive results of one-stage systems on the CNN/DailyMail benchmark to 44.58 and 46.33 ROUGE-1 score while preserving the parameter efficiency and inference efficiency. Compared with state-of-the-art multi-stage systems, we save more than 100 GPU training hours and obtain a 3~8 speed-up ratio during inference while maintaining comparable results.
CLIFF: Contrastive Learning for Improving Faithfulness and Factuality in Abstractive Summarization Shuyang Cao, Lu Wang EMNLP 2021 [pdf] [code]
Sequence Level Contrastive Learning for Text Summarization Shusheng Xu, Xingxing Zhang, Yi Wu, Furu Wei AAAI 2022 [pdf](https://arxiv.org/abs/2109.03481)
Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization Chujie Zheng, Kunpeng Zhang, Harry Jiannan Wang, Ling Fan, Zhe Wang [pdf] [code]
Constructing Contrastive samples via Summarization for Text Classification with limited annotations Yangkai Du, Tengfei Ma, Lingfei Wu, Fangli Xu, Xuhong Zhang, Shouling Ji Findings of EMNLP 2021 Short [pdf]
Alleviating Exposure Bias via Contrastive Learning for Abstractive Text Summarization Shichao Sun, Wenjie Li [pdf] [code]
SimCLS: A Simple Framework for Contrastive Learning of Abstractive Summarization Yixin Liu, Pengfei Liu ACL 2021 short [pdf] [code]
Contrastive Learning with Adversarial Perturbations for Conditional Text Generation Seanie Lee, Dong Bok Lee, Sung Ju Hwang ICLR 2021 [pdf]
DeepChannel: Salience Estimation by Contrastive Learning for Extractive Document Summarization Jiaxin Shi, Chen Liang, Lei Hou, Juanzi Li, Zhiyuan Liu, Hanwang Zhang AAAI 2019 [pdf] [code]
Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning Hanlu Wu, Tengfei Ma, Lingfei Wu, Tariro Manyumwa, Shouling Ji EMNLP 2020 [pdf] [code]
Contrastive Attention Mechanism for Abstractive Sentence Summarization Xiangyu Duan, Hongfei Yu, Mingming Yin, Min Zhang, Weihua Luo, Yue Zhang EMNLP 2019 [pdf] [code]

Evaluation

ChatGPT as a Factual Inconsistency Evaluator for Abstractive Text Summarization Zheheng Luo, Qianqian Xie, Sophia Ananiadou [pdf]

[Abs]
Long-form clinical summarization of hospital admissions has real-world significance because of its potential to help both clinicians and patients. The faithfulness of summaries is critical to their safe usage in clinical settings. To better understand the limitations of abstractive systems, as well as the suitability of existing evaluation metrics, we benchmark faithfulness metrics against fine-grained human annotations for model-generated summaries of a patient's Brief Hospital Course. We create a corpus of patient hospital admissions and summaries for a cohort of HIV patients, each with complex medical histories. Annotators are presented with summaries and source notes, and asked to categorize manually highlighted summary elements (clinical entities like conditions and medications as well as actions like "following up") into one of three categories: "Incorrect," "Missing," and "Not in Notes." We meta-evaluate a broad set of proposed faithfulness metrics and, across metrics, explore the importance of domain adaptation (e.g. the impact of in-domain pre-training and metric fine-tuning), the use of source-summary alignments, and the effects of distilling a single metric from an ensemble of pre-existing metrics. Off-the-shelf metrics with no exposure to clinical text correlate well yet overly rely on summary extractiveness. As a practical guide to long-form clinical narrative summarization, we find that most metrics correlate best to human judgments when provided with one summary sentence at a time and a minimal set of relevant source context.
Large Language Models are Diverse Role-Players for Summarization Evaluation Ning Wu, Ming Gong, Linjun Shou, Shining Liang, Daxin Jiang [pdf]

GPTScore: Evaluate as You Desire Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, Pengfei Liu [pdf] [code]

[Abs]
Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models. Nevertheless, assessing the quality of the generation is an even more arduous task than the generation itself, and this issue has not been given adequate consideration recently. This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities (e.g., zero-shot instruction) from generative pre-trained models to score generated texts. Experimental results on four text generation tasks, 22 evaluation aspects, and corresponding 37 datasets demonstrate that this approach can effectively allow us to achieve what one desires to evaluate for texts simply by natural language instructions. This nature helps us overcome several long-standing challenges in text evaluation--how to achieve customized, multi-faceted evaluation without the need for annotated samples. We make our code publicly available at this https URL.
Needle in a Haystack: An Analysis of Finding Qualified Workers on MTurk for Summarization Lining Zhang, João Sedoc, Simon Mille, Yufang Hou, Sebastian Gehrmann, Daniel Deutsch, Elizabeth Clark, Yixin Liu, Miruna Clinciu, Saad Mahamood, Khyathi Chandu [pdf]

[Abs]
The acquisition of high-quality human annotations through crowdsourcing platforms like Amazon Mechanical Turk (MTurk) is more challenging than expected. The annotation quality might be affected by various aspects like annotation instructions, Human Intelligence Task (HIT) design, and wages paid to annotators, etc. To avoid potentially low-quality annotations which could mislead the evaluation of automatic summarization system outputs, we investigate the recruitment of high-quality MTurk workers via a three-step qualification pipeline. We show that we can successfully filter out bad workers before they carry out the evaluations and obtain high-quality annotations while optimizing the use of resources. This paper can serve as basis for the recruitment of qualified annotators in other challenging annotation tasks.
DocAsRef: A Pilot Empirical Study on Repurposing Reference-Based Summary Quality Metrics Reference-Freely Forrest Sheng Bao, Ruixuan Tu, Ge Luo [pdf]

[Abs]
Summary quality assessment metrics have two categories: reference-based and reference-free. Reference-based metrics are theoretically more accurate but are limited by the availability and quality of the human-written references, which are both difficult to ensure. This inspires the development of reference-free metrics, which are independent from human-written references, in the past few years. However, existing reference-free metrics cannot be both zero-shot and accurate. In this paper, we propose a zero-shot but accurate reference-free approach in a sneaky way: feeding documents, based upon which summaries are generated, as references into reference-based metrics. Experimental results show that this zero-shot approach can give us the best-performing reference-free metrics on nearly all aspects on several recently-released datasets, even beating reference-free metrics specifically trained for this task sometimes. We further investigate what reference-based metrics can benefit from such repurposing and whether our additional tweaks help.
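The repurposing idea in the abstract above can be illustrated with a minimal sketch. It assumes a toy unigram-overlap F1 as a stand-in for a real reference-based metric; the function names `unigram_f1` and `doc_as_ref_score` are hypothetical and are not taken from the paper or its code.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 on whitespace tokens (toy stand-in for a reference-based metric)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def doc_as_ref_score(document: str, summary: str) -> float:
    """Reference-free use: the source document takes the place of the reference."""
    return unigram_f1(candidate=summary, reference=document)


document = "the cat sat on the mat while the dog slept"
print(doc_as_ref_score(document, "the cat sat on the mat"))
print(doc_as_ref_score(document, "stocks fell sharply today"))
```

Any reference-based metric could be substituted for `unigram_f1`; only the argument passed as the reference changes.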
RISE: Leveraging Retrieval Techniques for Summarization Evaluation David Uthus, Jianmo Ni [pdf] [code]

[Abs]
Evaluating automatically-generated text summaries is a challenging task. While there have been many interesting approaches, they still fall short of human evaluations. We present RISE, a new approach for evaluating summaries by leveraging techniques from information retrieval. RISE is first trained as a retrieval task using a dual-encoder retrieval setup, and can then be subsequently utilized for evaluating a generated summary given an input document, without gold reference summaries. RISE is especially well suited when working on new datasets where one may not have reference summaries available for evaluation. We conduct comprehensive experiments on the SummEval benchmark (Fabbri et al., 2021) and the results show that RISE has higher correlation with human evaluations compared to many past approaches to summarization evaluation. Furthermore, RISE also demonstrates data-efficiency and generalizability across languages.
Universal Evasion Attacks on Summarization Scoring Wenchuan Mu, Kwan Hui Lim [pdf]

[Abs]
The automatic scoring of summaries is important as it guides the development of summarizers. Scoring is also complex, as it involves multiple aspects such as fluency, grammar, and even textual entailment with the source text. However, summary scoring has not been considered a machine learning task to study its accuracy and robustness. In this study, we place automatic scoring in the context of regression machine learning tasks and perform evasion attacks to explore its robustness. Attack systems predict a non-summary string from each input, and these non-summary strings achieve competitive scores with good summarizers on the most popular metrics: ROUGE, METEOR, and BERTScore. Attack systems also "outperform" state-of-the-art summarization methods on ROUGE-1 and ROUGE-L, and score the second-highest on METEOR. Furthermore, a BERTScore backdoor is observed: a simple trigger can score higher than any automatic summarization method. The evasion attacks in this work indicate the low robustness of current scoring systems at the system level. We hope that our highlighting of these proposed attacks will facilitate the development of summary scores.
Self-Repetition in Abstractive Neural Summarizers Nikita Salkar, Thomas Trikalinos, Byron C. Wallace, Ani Nenkova [pdf]

[Abs]
We provide a quantitative and qualitative analysis of self-repetition in the output of neural summarizers. We measure self-repetition as the number of n-grams of length four or longer that appear in multiple outputs of the same system. We analyze the behavior of three popular architectures (BART, T5, and Pegasus), fine-tuned on five datasets. In a regression analysis, we find that the three architectures have different propensities for repeating content across output summaries for inputs, with BART being particularly prone to self-repetition. Fine-tuning on more abstractive data, and on data featuring formulaic language, is associated with a higher rate of self-repetition. In qualitative analysis we find systems produce artefacts such as ads and disclaimers unrelated to the content being summarized, as well as formulaic phrases common in the fine-tuning domain. Our approach to corpus-level analysis of self-repetition may help practitioners clean up training data for summarizers and ultimately support methods for minimizing the amount of self-repetition.
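The self-repetition measure described in the abstract above can be sketched as follows. This is a simplified illustration, assuming whitespace tokenization and a fixed n-gram length of four (the abstract counts n-grams of length four or longer); the helper names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def self_repetition(outputs, n=4):
    """Count distinct n-grams that occur in at least two different outputs."""
    output_freq = Counter()
    for text in outputs:
        # Each output contributes an n-gram at most once, so the counter
        # holds the number of outputs containing that n-gram.
        output_freq.update(ngrams(text.lower().split(), n))
    return sum(1 for count in output_freq.values() if count >= 2)


summaries = [
    "click here to read more about the storm",
    "the mayor resigned click here to read more",
    "rain is expected tomorrow",
]
print(self_repetition(summaries))
```

In this toy input the shared 4-grams come from a formulaic phrase, the kind of artefact the abstract reports finding in system outputs.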
How to Find Strong Summary Coherence Measures? A Toolbox and a Comparative Study for Summary Coherence Measure Evaluation Julius Steen, Katja Markert COLING 2022 [pdf] [code]

[Abs]
Automatically evaluating the coherence of summaries is of great significance both to enable cost-efficient summarizer evaluation and as a tool for improving coherence by selecting high-scoring candidate summaries. While many different approaches have been suggested to model summary coherence, they are often evaluated using disparate datasets and metrics. This makes it difficult to understand their relative performance and identify ways forward towards better summary coherence modelling. In this work, we conduct a large-scale investigation of various methods for summary coherence modelling on an even playing field. Additionally, we introduce two novel analysis measures, intra-system correlation and bias matrices, that help identify biases in coherence measures and provide robustness against system-level confounders. While none of the currently available automatic coherence measures are able to assign reliable coherence scores to system summaries across all evaluation metrics, large-scale language models fine-tuned on self-supervised tasks show promising results, as long as fine-tuning takes into account that they need to generalize across different summary lengths.
PrefScore: Pairwise Preference Learning for Reference-free Summarization Quality Assessment Ge Luo, Hebi Li, Youbiao He, Forrest Sheng Bao COLING 2022 [pdf] [code]

[Abs]
Evaluating machine-generated summaries without a human-written reference summary has been a need for a long time. Inspired by preference labeling in existing work of summarization evaluation, we propose to judge summary quality by learning the preference rank of summaries using the Bradley-Terry power ranking model from inferior summaries generated by corrupting base summaries. Extensive experiments on several datasets show that our weakly supervised scheme can produce scores highly correlated with human ratings.
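For readers unfamiliar with the Bradley-Terry model mentioned in the abstract above, a minimal sketch of its pairwise preference probability follows. It assumes each summary has a scalar quality score whose exponential acts as its Bradley-Terry strength; how PrefScore itself produces and trains those scores is not stated in the abstract, so the function names and the loss shown here are illustrative assumptions.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that A is preferred over B, with strengths exp(score)."""
    # exp(a) / (exp(a) + exp(b)) rewritten as a logistic function of the difference.
    return 1.0 / (1.0 + math.exp(score_b - score_a))


def pairwise_loss(score_better: float, score_worse: float) -> float:
    """Negative log-likelihood of the observed preference (better over worse)."""
    return -math.log(preference_probability(score_better, score_worse))


# A base summary should outrank its corrupted version: the loss is small
# when the scores agree with that ordering and large when they do not.
print(pairwise_loss(2.0, 0.0))
print(pairwise_loss(0.0, 2.0))
```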
SummScore: A Comprehensive Evaluation Metric for Summary Quality Based on Cross-Encoder Wuhang Lin, Shasha Li, Chen Zhang, Bin Ji, Jie Yu, Jun Ma, Zibo Yi APWeb-WAIM2022 [pdf]

[Abs]
Text summarization models are often trained to produce summaries that meet human quality requirements. However, the existing evaluation metrics for summary text are only rough proxies for summary quality, suffering from low correlation with human scoring and inhibition of summary diversity. To solve these problems, we propose SummScore, a comprehensive metric for summary quality evaluation based on CrossEncoder. Firstly, by adopting the original-summary measurement mode and comparing the semantics of the original text, SummScore gets rid of the inhibition of summary diversity. With the help of the text-matching pre-training Cross-Encoder, SummScore can effectively capture the subtle differences between the semantics of summaries. Secondly, to improve the comprehensiveness and interpretability, SummScore consists of four fine-grained submodels, which measure Coherence, Consistency, Fluency, and Relevance separately. We use semi-supervised multi-rounds of training to improve the performance of our model on extremely limited annotated data. Extensive experiments show that SummScore significantly outperforms existing evaluation metrics in the above four dimensions in correlation with human scoring. We also provide the quality evaluation results of SummScore on 16 mainstream summarization models for later research.
Does Summary Evaluation Survive Translation to Other Languages? Spencer Braun, Oleg Vasilyev, Neslihan Iskender, John Bohannon NAACL 2022 [pdf] [code]

[Abs]
The creation of a quality summarization dataset is an expensive, time-consuming effort, requiring the production and evaluation of summaries by both trained humans and machines. The returns to such an effort would increase significantly if the dataset could be used in additional languages without repeating human annotations. To investigate how much we can trust machine translation of summarization datasets, we translate the English SummEval dataset to seven languages and compare performances across automatic evaluation measures. We explore equivalence testing as the appropriate statistical paradigm for evaluating correlations between human and automated scoring of summaries. We also consider the effect of translation on the relative performance between measures. We find some potential for dataset reuse in languages similar to the source and along particular dimensions of summary quality. Our code and data can be found at https://github.com/PrimerAI/primer-research/.
Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics Daniel Deutsch, Rotem Dror, Dan Roth NAACL 2022 [pdf] [code]

[Abs]
How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations. We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice and propose changes to rectify this disconnect. First, we calculate the system score for an automatic metric using the full test set instead of the subset of summaries judged by humans, which is currently standard practice. We demonstrate how this small change leads to more precise estimates of system-level correlations. Second, we propose to calculate correlations only on pairs of systems that are separated by small differences in automatic scores which are commonly observed in practice. This allows us to demonstrate that our best estimate of the correlation of ROUGE to human judgments is near 0 in realistic scenarios. The results from the analyses point to the need to collect more high-quality human judgments and to improve automatic metrics when differences in system scores are small.
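The first change proposed in the abstract above (scoring each system with the metric on the full test set, while human scores cover only a judged subset) can be sketched as below. The choice of Kendall's tau, the data layout, and the toy numbers are assumptions for illustration and are not taken from the paper.

```python
from itertools import combinations
from statistics import mean


def kendall_tau(xs, ys):
    """Kendall's tau-a over paired values (ties count as neither concordant nor discordant)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / pairs


def system_level_correlation(metric_scores, human_scores):
    """Correlate per-system mean metric scores with per-system mean human scores."""
    systems = sorted(metric_scores)
    metric_means = [mean(metric_scores[s]) for s in systems]  # full test set
    human_means = [mean(human_scores[s]) for s in systems]    # judged subset only
    return kendall_tau(metric_means, human_means)


# Hypothetical data: four metric scores per system, two human judgments per system.
metric_scores = {"A": [0.40, 0.42, 0.44, 0.41], "B": [0.38, 0.39, 0.37, 0.40], "C": [0.45, 0.47, 0.46, 0.44]}
human_scores = {"A": [3.5, 4.0], "B": [3.0, 2.5], "C": [4.5, 4.0]}
print(system_level_correlation(metric_scores, human_scores))
```

The two score dictionaries deliberately have different lengths per system, which reflects the point that the metric average need not be restricted to the human-judged instances.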
SueNes: A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling Forrest Bao, Ge Luo, Hebi Li, Minghui Qiu, Yinfei Yang, Youbiao He, Cen Chen NAACL 2022 [pdf] [code]

[Abs]
Canonical automatic summary evaluation metrics, such as ROUGE, focus on lexical similarity, which captures neither semantics nor linguistic quality well, and require a reference summary, which is costly to obtain. Recently, there have been a growing number of efforts to alleviate either or both of the two drawbacks. In this paper, we present a proof-of-concept study of a weakly supervised summary evaluation approach without the presence of reference summaries. Massive data in existing summarization datasets are transformed for training by pairing documents with corrupted reference summaries. In cross-domain tests, our strategy outperforms baselines with promising improvements, and shows a great advantage in gauging linguistic qualities over all metrics.
Reference-free Summarization Evaluation via Semantic Correlation and Compression Ratio Yizhu Liu, Qi Jia, Kenny Zhu NAACL 2022 [pdf] [code]

[Abs]
A document can be summarized in a number of ways. Reference-based evaluation of summarization has been criticized for its inflexibility. The more sufficient the number of abstracts, the more accurate the evaluation results. However, it is difficult to collect sufficient reference summaries. In this paper, we propose a new automatic reference-free evaluation metric that compares semantic distribution between source document and summary by pretrained language models and considers summary compression ratio. The experiments show that this metric is more consistent with human evaluation in terms of coherence, consistency, relevance and fluency.
MaskEval: Weighted MLM-Based Evaluation for Text Summarization and Simplification Yu Lu Liu, Rachel Bawden, Thomas Scialom, Benoît Sagot, Jackie Chi Kit Cheung [pdf] [code]
TRUE: Re-evaluating Factual Consistency Evaluation NAACL 2022 [pdf]
Play the Shannon Game With Language Models: A Human-Free Approach to Summary Evaluation Nicholas Egan, Oleg Vasilyev, John Bohannon AAAI 2022 [pdf] [code]
Differentiable N-gram Objective on Abstractive Summarization Yunqi Zhu, Wensheng Zhang, Mingjin Zhu [pdf] [code]
DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence Wei Zhao, Michael Strube, Steffen Eger [pdf] [code]
WIDAR -- Weighted Input Document Augmented ROUGE Raghav Jain, Vaibhav Mavi, Anubhav Jangra, Sriparna Saha ECIR 2022 [pdf] [code]
InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation Pierre Colombo, Chloe Clave, Pablo Piantanida AAAI 2022 [pdf]
Evaluation of Summarization Systems across Gender, Age, and Race Anna Jørgensen, Anders Søgaard EMNLP 2021 | newsum [pdf]
Evaluation of Abstractive Summarisation Models with Machine Translation in Deliberative Processes M. Arana-Catania, Rob Procter, Yulan He, Maria Liakata EMNLP 2021 New Frontiers in Summarization Workshop [pdf]
Finding a Balanced Degree of Automation for Summary Evaluation Shiyue Zhang, Mohit Bansal EMNLP 2021 [pdf] [code]
QuestEval: Summarization Asks for Fact-based Evaluation Thomas Scialom, Paul-Alexis Dray, Patrick Gallinari, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang EMNLP 2021 [pdf] [code]
BARTScore: Evaluating Generated Text as Text Generation Weizhe Yuan, Graham Neubig, Pengfei Liu [pdf] [code]
A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy Wang Chen, Piji Li, Irwin King ACL 2021 [pdf] [code]
Evaluating the Efficacy of Summarization Evaluation across Languages Fajri Koto, Jey Han Lau, Timothy Baldwin Findings of ACL 2021 [pdf]
Question-aware Transformer Models for Consumer Health Question Summarization Shweta Yadav, Deepak Gupta, Asma Ben Abacha, Dina Demner-Fushman [pdf]
Towards Human-Free Automatic Quality Evaluation of German Summarization Neslihan Iskender, Oleg Vasilyev, Tim Polzehl, John Bohannon, Sebastian Möller [pdf]
Reliability of Human Evaluation for Text Summarization: Lessons Learned and Challenges Ahead Neslihan Iskender, Tim Polzehl, Sebastian Möller EACL21 [pdf] [code]
SummVis: Interactive Visual Analysis of Models, Data, and Evaluation for Text Summarization Jesse Vig, Wojciech Kryscinski, Karan Goel, Nazneen Fatema Rajani ACL 2021 demo [pdf] [data]
Is human scoring the best criteria for summary evaluation? Oleg Vasilyev, John Bohannon Findings of ACL 2021 [pdf]
How to Evaluate a Summarizer: Study Design and Statistical Analysis for Manual Linguistic Quality Evaluation Julius Steen, Katja Markert EACL21 [pdf] [code]
HOLMS: Alternative Summary Evaluation with Large Language Models Yassine Mrabet, Dina Demner-Fushman COLING20 [pdf] [bib]
FFCI: A Framework for Interpretable Automatic Evaluation of Summarization Fajri Koto, Jey Han Lau, Timothy Baldwin [pdf] [code]
Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning Hanlu Wu, Tengfei Ma, Lingfei Wu, Tariro Manyumwa, Shouling Ji EMNLP20 [pdf] [code]
SacreROUGE: An Open-Source Library for Using and Developing Summarization Evaluation Metrics Daniel Deutsch, Dan Roth [pdf] [code]
SummEval: Re-evaluating Summarization Evaluation Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, Dragomir Radev [pdf] [code]
HIGHRES: Highlight-based Reference-less Evaluation of Summarization Hardy, Shashi Narayan, Andreas Vlachos ACL19 [pdf] [code]

Multi-Document

Compressed Heterogeneous Graph for Abstractive Multi-Document Summarization Miao Li, Jianzhong Qi, Jey Han Lau AAAI 2023 [pdf] [code]

[Abs]
Multi-document summarization (MDS) aims to generate a summary for a number of related documents. We propose HGSUM, an MDS model that extends an encoder-decoder architecture, to incorporate a heterogeneous graph to represent different semantic units (e.g., words and sentences) of the documents. This contrasts with existing MDS models which do not consider different edge types of graphs and as such do not capture the diversity of relationships in the documents. To preserve only key information and relationships of the documents in the heterogeneous graph, HGSUM uses graph pooling to compress the input graph. And to guide HGSUM to learn compression, we introduce an additional objective that maximizes the similarity between the compressed graph and the graph constructed from the ground-truth summary during training. HGSUM is trained end-to-end with graph similarity and standard cross-entropy objectives. Experimental results over MULTI-NEWS, WCEP-100, and ARXIV show that HGSUM outperforms state-of-the-art MDS models. The code for our model and experiments is available at: this https URL.
Do Multi-Document Summarization Models Synthesize? Jay DeYoung, Stephanie C. Martinez, Iain J. Marshall, Byron C. Wallace [pdf]

[Abs]
Multi-document summarization entails producing concise synopses of collections of inputs. For some applications, the synopsis should accurately synthesize inputs with respect to a key property or aspect. For example, a synopsis of film reviews all written about a particular movie should reflect the average critic consensus. As a more consequential example, consider narrative summaries that accompany biomedical systematic reviews of clinical trial results. These narratives should fairly summarize the potentially conflicting results from individual trials. In this paper we ask: To what extent do modern multi-document summarization models implicitly perform this type of synthesis? To assess this we perform a suite of experiments that probe the degree to which conditional generation models trained for summarization using standard methods yield outputs that appropriately synthesize inputs. We find that existing models do partially perform synthesis, but do so imperfectly. In particular, they are over-sensitive to changes in input ordering and under-sensitive to changes in input compositions (e.g., the ratio of positive to negative movie reviews). We propose a simple, general method for improving model synthesis capabilities by generating an explicitly diverse set of candidate outputs, and then selecting from these the string best aligned with the expected aggregate measure for the inputs, or abstaining when the model produces no good candidate. This approach improves model synthesis performance. We hope that highlighting the need for synthesis (in some summarization settings) motivates further research into multi-document summarization methods and learning objectives that explicitly account for the need to synthesize.
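The select-or-abstain step described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `tolerance` threshold, and the toy keyword-based sentiment measure are all hypothetical stand-ins, and a real setup would use a trained scorer and model-generated candidates.

```python
from typing import Callable, Optional, Sequence


def select_or_abstain(
    candidates: Sequence[str],
    measure: Callable[[str], float],
    target: float,
    tolerance: float,
) -> Optional[str]:
    """Return the candidate whose measured value is closest to `target`.

    Returns None (abstain) when even the closest candidate is off by more
    than `tolerance`. Both parameters are illustrative assumptions.
    """
    if not candidates:
        return None
    best = min(candidates, key=lambda c: abs(measure(c) - target))
    if abs(measure(best) - target) > tolerance:
        return None
    return best


def toy_measure(text: str) -> float:
    """Hypothetical scorer: share of positive words among sentiment words."""
    words = text.lower().split()
    pos = sum(w in {"good", "great"} for w in words)
    neg = sum(w in {"bad", "poor"} for w in words)
    total = pos + neg
    return 0.5 if total == 0 else pos / total


candidates = ["great good film", "good but bad pacing", "bad poor film"]
# Inputs are evenly split between positive and negative reviews.
choice = select_or_abstain(candidates, toy_measure, target=0.5, tolerance=0.1)
```

Here `target` plays the role of the expected aggregate measure for the inputs (for example, the fraction of positive reviews), and abstention is modelled as returning `None`.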
Exploring the Challenges of Open Domain Multi-Document Summarization John Giorgi, Luca Soldaini, Bo Wang, Gary Bader, Kyle Lo, Lucy Lu Wang, Arman Cohan [pdf] [code]

[Abs]
Multi-document summarization (MDS) has traditionally been studied assuming a set of ground-truth topic-related input documents is provided. In practice, the input document set is unlikely to be available a priori and would need to be retrieved based on an information need, a setting we call open-domain MDS. We experiment with current state-of-the-art retrieval and summarization models on several popular MDS datasets extended to the open-domain setting. We find that existing summarizers suffer large reductions in performance when applied as-is to this more realistic task, though training summarizers with retrieved inputs can reduce their sensitivity to retrieval errors. To further probe these findings, we conduct perturbation experiments on summarizer inputs to study the impact of different types of document retrieval errors. Based on our results, we provide practical guidelines to help facilitate a shift to open-domain MDS. We release our code and experimental results alongside all data or model artifacts created during our investigation.
How "Multi" is Multi-Document Summarization? Ruben Wolhandler, Arie Cattan, Ori Ernst, Ido Dagan EMNLP 2022 [pdf] [code]

[Abs]
The task of multi-document summarization (MDS) aims at models that, given multiple documents as input, are able to generate a summary that combines disperse information, originally spread across these documents. Accordingly, it is expected that both reference summaries in MDS datasets, as well as system summaries, would indeed be based on such dispersed information. In this paper, we argue for quantifying and assessing this expectation. To that end, we propose an automated measure for evaluating the degree to which a summary is "disperse", in the sense of the number of source documents needed to cover its content. We apply our measure to empirically analyze several popular MDS datasets, with respect to their reference summaries, as well as the output of state-of-the-art systems. Our results show that certain MDS datasets barely require combining information from multiple documents, where a single document often covers the full summary content. Overall, we advocate using our metric for assessing and improving the degree to which summarization datasets require combining multi-document information, and similarly how summarization models actually meet this challenge. Our code is available at this https URL.
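The idea of counting "the number of source documents needed to cover" a summary can be illustrated with a greedy set-cover sketch. The abstract does not specify how the paper's measure is computed, so this is an assumption-laden illustration only: content units are modelled as plain string sets, and `docs_needed_to_cover` is a hypothetical helper name.

```python
from typing import List, Set


def docs_needed_to_cover(summary_units: Set[str], documents: List[Set[str]]) -> int:
    """Greedily count documents needed to cover the summary's content units.

    Greedy set cover approximates the minimum number of documents; units
    found in no document are left uncovered and do not add to the count.
    """
    remaining = set(summary_units)
    used = 0
    while remaining:
        # Pick the document that covers the most still-uncovered units.
        best = max(documents, key=lambda d: len(remaining & d), default=set())
        gained = remaining & best
        if not gained:
            break  # nothing left that any document can cover
        remaining -= gained
        used += 1
    return used


# A summary fully covered by one document is not "disperse" in this sense.
single = docs_needed_to_cover({"a", "b", "c"}, [{"a", "b", "c"}, {"a"}])
# A summary whose units are spread over three documents needs all three.
spread = docs_needed_to_cover({"a", "b", "c"}, [{"a"}, {"b"}, {"c"}])
```

A low count on reference summaries would correspond to the abstract's observation that a single document often covers the full summary content.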
Analyzing the Dialect Diversity in Multi-document Summaries Olubusayo Olabisi, Aaron Hudson, Antonie Jetter, Ameeta Agrawal COLING 2022 [pdf] [code]

[Abs]
Social media posts provide a compelling, yet challenging source of data of diverse perspectives from many socially salient groups. Automatic text summarization algorithms make this data accessible at scale by compressing large collections of documents into short summaries that preserve salient information from the source text. In this work, we take a complementary approach to analyzing and improving the quality of summaries generated from social media data in terms of their ability to represent salient as well as diverse perspectives. We introduce a novel dataset, DivSumm, of dialect diverse tweets and human-written extractive and abstractive summaries. Then, we study the extent of dialect diversity reflected in human-written reference summaries as well as system-generated summaries. The results of our extensive experiments suggest that humans annotate fairly well-balanced dialect diverse summaries, and that cluster-based pre-processing approaches seem beneficial in improving the overall quality of the system-generated summaries without loss in diversity.
Document-aware Positional Encoding and Linguistic-guided Encoding for Abstractive Multi-document Summarization Congbo Ma, Wei Emma Zhang, Pitawelayalage Dasun Dileepa Pitawela, Yutong Qu, Haojie Zhuang, Hu Wang [pdf]

[Abs]
One key challenge in multi-document summarization is to capture the relations among input documents that distinguish between single document summarization (SDS) and multi-document summarization (MDS). Few existing MDS works address this issue. One effective way is to encode document positional information to assist models in capturing cross-document relations. However, existing MDS models, such as Transformer-based models, only consider token-level positional information. Moreover, these models fail to capture sentences' linguistic structure, which inevitably causes confusions in the generated summaries. Therefore, in this paper, we propose document-aware positional encoding and linguistic-guided encoding that can be fused with Transformer architecture for MDS. For document-aware positional encoding, we introduce a general protocol to guide the selection of document encoding functions. For linguistic-guided encoding, we propose to embed syntactic dependency relations into the dependency relation mask with a simple but effective non-linear encoding learner for feature learning. Extensive experiments show the proposed model can generate summaries with high quality.
Multi-Document Scientific Summarization from a Knowledge Graph-Centric View Pancheng Wang, Shasha Li, Kunyuan Pang, Liangliang He, Dong Li, Jintao Tang, Ting Wang COLING 2022 [pdf] [code]

[Abs]
Multi-Document Scientific Summarization (MDSS) aims to produce coherent and concise summaries for clusters of topic-relevant scientific papers. This task requires precise understanding of paper content and accurate modeling of cross-paper relationships. Knowledge graphs convey compact and interpretable structured information for documents, which makes them ideal for content modeling and relationship modeling. In this paper, we present KGSum, an MDSS model centred on knowledge graphs during both the encoding and decoding process. Specifically, in the encoding process, two graph-based modules are proposed to incorporate knowledge graph information into paper encoding, while in the decoding process, we propose a two-stage decoder by first generating knowledge graph information of summary in the form of descriptive sentences, followed by generating the final summary. Empirical results show that the proposed architecture brings substantial improvements over baselines on the Multi-Xscience dataset.
Generating a Structured Summary of Numerous Academic Papers: Dataset and Method Shuaiqi LIU, Jiannong Cao, Ruosong Yang, Zhiyuan Wen IJCAI 2022 [pdf] [data]

[Abs]
Writing a survey paper on one research topic usually needs to cover the salient content from numerous related papers, which can be modeled as a multi-document summarization (MDS) task. Existing MDS datasets usually focus on producing the structureless summary covering a few input documents. Meanwhile, previous structured summary generation works focus on summarizing a single document into a multi-section summary. These existing datasets and methods cannot meet the requirements of summarizing numerous academic papers into a structured summary. To deal with the scarcity of available data, we propose BigSurvey, the first large-scale dataset for generating comprehensive summaries of numerous academic papers on each topic. We collect target summaries from more than seven thousand survey papers and utilize their 430 thousand reference papers’ abstracts as input documents. To organize the diverse content from dozens of input documents and ensure the efficiency of processing long text sequences, we propose a summarization method named category-based alignment and sparse transformer (CAST). The experimental results show that our CAST method outperforms various advanced summarization methods.
Proposition-Level Clustering for Multi-Document Summarization Ori Ernst, Avi Caciularu, Ori Shapira, Ramakanth Pasunuru, Mohit Bansal, Jacob Goldberger, Ido Dagan NAACL 2022 [pdf] [code]

[Abs]
Text clustering methods were traditionally incorporated into multi-document summarization (MDS) as a means for coping with considerable information repetition. Particularly, clusters were leveraged to indicate information saliency as well as to avoid redundancy. Such prior methods focused on clustering sentences, even though closely related sentences usually also contain non-aligned parts. In this work, we revisit the clustering approach, grouping together sub-sentential propositions, aiming at more precise information alignment. Specifically, our method detects salient propositions, clusters them into paraphrastic clusters, and generates a representative sentence for each cluster via text fusion. Our summarization method improves over the previous state-of-the-art MDS method in the DUC 2004 and TAC 2011 datasets, both in automatic ROUGE scores and human preference.
Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities Zejiang Shen, Kyle Lo, Lauren Yu, Nathan Dahlberg, Margo Schlanger, Doug Downey [pdf] [data]

[Abs]
With the advent of large language models, methods for abstractive summarization have made great strides, creating potential for use in applications to aid knowledge workers processing unwieldy document collections. One such setting is the Civil Rights Litigation Clearinghouse (CRLC) (this https URL), which posts information about large-scale civil rights lawsuits, serving lawyers, scholars, and the general public. Today, summarization in the CRLC requires extensive training of lawyers and law students who spend hours per case understanding multiple relevant documents in order to produce high-quality summaries of key events and outcomes. Motivated by this ongoing real-world summarization effort, we introduce Multi-LexSum, a collection of 9,280 expert-authored summaries drawn from ongoing CRLC writing. Multi-LexSum presents a challenging multi-document summarization task given the length of the source documents, often exceeding two hundred pages per case. Furthermore, Multi-LexSum is distinct from other datasets in its multiple target summaries, each at a different granularity (ranging from one-sentence "extreme" summaries to multi-paragraph narrations of over five hundred words). We present extensive analysis demonstrating that despite the high-quality summaries in the training data (adhering to strict content and style guidelines), state-of-the-art summarization models perform poorly on this task. We release Multi-LexSum for further research in summarization methods as well as to facilitate development of applications to assist in the CRLC's mission at this https URL.
AnswerSumm: A Manually-Curated Dataset and Pipeline for Answer Summarization Alexander R. Fabbri, Xiaojian Wu, Srini Iyer, Haoran Li, Mona Diab NAACL 2022 [pdf] [code]

[Abs]
Community Question Answering (CQA) fora such as Stack Overflow and Yahoo! Answers contain a rich resource of answers to a wide range of community-based questions. Each question thread can receive a large number of answers with different perspectives. One goal of answer summarization is to produce a summary that reflects the range of answer perspectives. A major obstacle for this task is the absence of a dataset to provide supervision for producing such summaries. Recent works propose heuristics to create such data, but these are often noisy and do not cover all answer perspectives present. This work introduces a novel dataset of 4,631 CQA threads for answer summarization curated by professional linguists. Our pipeline gathers annotations for all subtasks of answer summarization, including relevant answer sentence selection, grouping these sentences based on perspectives, summarizing each perspective, and producing an overall summary. We analyze and benchmark state-of-the-art models on these subtasks and introduce a novel unsupervised approach for multi-perspective data augmentation that boosts summarization performance according to automatic evaluation. Finally, we propose reinforcement learning rewards to improve factual consistency and answer coverage and analyze areas for improvement.
The patient is more dead than alive: exploring the current state of the multi-document summarisation of the biomedical literature Yulia Otmakhova, Karin Verspoor, Timothy Baldwin, Jey Han Lau ACL 2022 [pdf]

[Abs]
Although multi-document summarisation (MDS) of the biomedical literature is a highly valuable task that has recently attracted substantial interest, evaluation of the quality of biomedical summaries lacks consistency and transparency. In this paper, we examine the summaries generated by two current models in order to understand the deficiencies of existing evaluation approaches in the context of the challenges that arise in the MDS task. Based on this analysis, we propose a new approach to human evaluation and identify several challenges that must be overcome to develop effective biomedical MDS systems.
Predicting Intervention Approval in Clinical Trials through Multi-Document Summarization Georgios Katsimpras, Georgios Paliouras ACL 2022 [pdf] [code]

[Abs]
Clinical trials offer a fundamental opportunity to discover new treatments and advance medical knowledge. However, the uncertainty of the outcome of a trial can lead to unforeseen costs and setbacks. In this study, we propose a new method to predict the effectiveness of an intervention in a clinical trial. Our method relies on generating an informative summary from multiple documents available in the literature about the intervention under study. Specifically, our method first gathers all the abstracts of PubMed articles related to the intervention. Then, an evidence sentence, which conveys information about the effectiveness of the intervention, is extracted automatically from each abstract. Based on the set of evidence sentences extracted from the abstracts, a short summary about the intervention is constructed. Finally, the produced summaries are used to train a BERT-based classifier, in order to infer the effectiveness of an intervention. To evaluate our proposed method, we introduce a new dataset which is a collection of clinical trials together with their associated PubMed articles. Our experiments demonstrate the effectiveness of producing short informative summaries and using them to predict the effectiveness of an intervention.
Discriminative Marginalized Probabilistic Neural Method for Multi-Document Summarization of Medical Literature Gianluca Moro, Luca Ragazzi, Lorenzo Valgimigli, Davide Freddi ACL 2022 [pdf] [code]

[Abs]
Although current state-of-the-art Transformer-based solutions have succeeded in a wide range of single-document NLP tasks, they still struggle to address multi-input tasks such as multi-document summarization. Many solutions truncate the inputs, thus ignoring potential summary-relevant contents, which is unacceptable in the medical domain where each piece of information can be vital. Others leverage linear model approximations to apply multi-input concatenation, worsening the results because all information is considered, even if it is conflicting or noisy with respect to a shared background. Despite the importance and social impact of medicine, there are no ad-hoc solutions for multi-document summarization. For this reason, we propose a novel discriminative marginalized probabilistic method (DAMEN) trained to discriminate critical information from a cluster of topic-related medical documents and generate a multi-document summary via token probability marginalization. Results prove we outperform the previous state-of-the-art on a biomedical dataset for multi-document summarization of systematic literature reviews. Moreover, we perform extensive ablation studies to motivate the design choices and prove the importance of each module of our method.
ACM -- Attribute Conditioning for Abstractive Multi Document Summarization Aiswarya Sankar, Ankit Chadha [pdf]
Improving Multi-Document Summarization through Referenced Flexible Extraction with Credit-Awareness Yun-Zhu Song, Yi-Syuan Chen, Hong-Han Shuai NAACL 2022 [pdf] [code]

[Abs]
A notable challenge in Multi-Document Summarization (MDS) is the extremely-long length of the input. In this paper, we present an extract-then-abstract Transformer framework to overcome the problem. Specifically, we leverage pre-trained language models to construct a hierarchical extractor for salient sentence selection across documents and an abstractor for rewriting the selected contents as summaries. However, learning such a framework is challenging since the optimal contents for the abstractor are generally unknown. Previous works typically create pseudo extraction oracle to enable the supervised learning for both the extractor and the abstractor. Nevertheless, we argue that the performance of such methods could be restricted due to the insufficient information for prediction and inconsistent objectives between training and testing. To this end, we propose a loss weighting mechanism that makes the model aware of the unequal importance for the sentences not in the pseudo extraction oracle, and leverage the fine-tuned abstractor to generate summary references as auxiliary signals for learning the extractor. Moreover, we propose a reinforcement learning method that can efficiently apply to the extractor for harmonizing the optimization between training and testing. Experiment results show that our framework substantially outperforms strong baselines with comparable model sizes and achieves the best results on the Multi-News, Multi-XScience, and WikiCatSum corpora.
NeuS: Neutral Multi-News Summarization for Mitigating Framing Bias Nayeon Lee, Yejin Bang, Tiezheng Yu, Andrea Madotto, Pascale Fung NAACL 2022 [pdf] [code]

[Abs]
Media news framing bias can increase political polarization and undermine civil society. The need for automatic mitigation methods is therefore growing. We propose a new task, a neutral summary generation from multiple news articles of the varying political leanings to facilitate balanced and unbiased news reading. In this paper, we first collect a new dataset, illustrate insights about framing bias through a case study, and propose a new effective metric and model (NeuS-Title) for the task. Based on our discovery that title provides a good signal for framing bias, we present NeuS-Title that learns to neutralize news content in hierarchical order from title to article. Our hierarchical multi-task learning is achieved by formatting our hierarchical data pair (title, article) sequentially with identifier-tokens (“TITLE=>”, “ARTICLE=>”) and fine-tuning the auto-regressive decoder with the standard negative log-likelihood objective. We then analyze and point out the remaining challenges and future directions. One of the most interesting observations is that neural NLG models can hallucinate not only factually inaccurate or unverifiable content but also politically biased content.
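The sequential formatting of a (title, article) pair with identifier tokens can be sketched as below. Only the two token strings come from the abstract; the single-space separator, the whitespace stripping, and the helper name `format_title_article` are assumptions made for illustration and may differ from the released code.

```python
TITLE_TOKEN = "TITLE=>"      # identifier token quoted in the abstract
ARTICLE_TOKEN = "ARTICLE=>"  # identifier token quoted in the abstract


def format_title_article(title: str, article: str) -> str:
    """Join a title and an article into one target sequence, title first.

    The single-space separator is an assumption, not the paper's spec.
    """
    return f"{TITLE_TOKEN} {title.strip()} {ARTICLE_TOKEN} {article.strip()}"


example = format_title_article(
    "Senate passes budget bill",
    "The Senate approved the budget bill on Tuesday.",
)
```

Placing the title before the article means an auto-regressive decoder generates the neutral title first and conditions the article on it, which matches the title-to-article order described above.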
Read Top News First: A Document Reordering Approach for Multi-Document News Summarization Chao Zhao, Tenghao Huang, Somnath Basu Roy Chowdhury, Muthu Kumar Chandrasekaran, Kathleen McKeown, Snigdha Chaturvedi Findings of ACL 2022 [pdf] [code]
A Multi-Document Coverage Reward for RELAXed Multi-Document Summarization Jacob Parnell, Inigo Jauregi Unanue, Massimo Piccardi ACL 2022 [pdf] [code]

[Abs]
Multi-document summarization (MDS) has made significant progress in recent years, in part facilitated by the availability of new, dedicated datasets and capacious language models. However, a standing limitation of these models is that they are trained against limited references and with plain maximum-likelihood objectives. As for many other generative tasks, reinforcement learning (RL) offers the potential to improve the training of MDS models; yet, it requires a carefully-designed reward that can ensure appropriate leverage of both the reference summaries and the input documents. For this reason, in this paper we propose fine-tuning an MDS baseline with a reward that balances a reference-based metric such as ROUGE with coverage of the input documents. To implement the approach, we utilize RELAX (Grathwohl et al., 2018), a contemporary gradient estimator which is both low-variance and unbiased, and we fine-tune the baseline in a few-shot style for both stability and computational efficiency. Experimental results over the Multi-News and WCEP MDS datasets show significant improvements of up to +0.95 pp average ROUGE score and +3.17 pp METEOR score over the baseline, and competitive results with the literature. In addition, they show that the coverage of the input documents is increased, and evenly across all documents.
PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization Wen Xiao, Iz Beltagy, Giuseppe Carenini, Arman Cohan ACL 2022 [pdf] [code]

[Abs]
We introduce PRIMERA, a pre-trained model for multi-document representation with a focus on summarization that reduces the need for dataset-specific architectures and large amounts of fine-tuning labeled data. PRIMERA uses our newly proposed pre-training objective designed to teach the model to connect and aggregate information across documents. It also uses efficient encoder-decoder transformers to simplify the processing of concatenated input documents. With extensive experiments on 6 multi-document summarization datasets from 3 different domains on zero-shot, few-shot and full-supervised settings, PRIMERA outperforms current state-of-the-art dataset-specific and pre-trained models on most of these settings with large margins.
PeerSum: A Peer Review Dataset for Abstractive Multi-document Summarization Miao Li, Jianzhong Qi, Jey Han Lau [pdf] [data]
A Proposition-Level Clustering Approach for Multi-Document Summarization Ori Ernst, Avi Caciularu, Ori Shapira, Ramakanth Pasunuru, Mohit Bansal, Jacob Goldberger, Ido Dagan [pdf] [code]
MS^2: Multi-Document Summarization of Medical Studies Jay DeYoung, Iz Beltagy, Madeleine van Zuylen, Bailey Kuehl, Lucy Lu Wang EMNLP 2021 [pdf] [data]
SgSum: Transforming Multi-document Summarization into Sub-graph Selection Moye Chen, Wei Li, Jiachen Liu, Xinyan Xiao, Hua Wu, Haifeng Wang EMNLP 2021 [pdf] [code]
Topic-Guided Abstractive Multi-Document Summarization Peng Cui, Le Hu Findings of EMNLP 2021 [pdf]
Modeling Endorsement for Multi-Document Abstractive Summarization Logan Lebanoff, Bingqing Wang, Zhe Feng, Fei Liu EMNLP 2021|newsum [pdf]
Incorporating Linguistic Knowledge for Abstractive Multi-document Summarization Congbo Ma, Wei Emma Zhang, Hu Wang, Shubham Gupta, Mingyu Guo [pdf]
Capturing Relations between Scientific Papers: An Abstractive Model for Related Work Section Generation Xiuying Chen, Hind Alamro, Mingzhe Li, Shen Gao, Xiangliang Zhang, Dongyan Zhao, Rui Yan ACL 2021 [pdf] [data]
Highlight-Transformer: Leveraging Key Phrase Aware Attention to Improve Abstractive Multi-Document Summarization Shuaiqi Liu, Jiannong Cao, Ruosong Yang, Zhiyuan Wen ACL 2021 Findings [pdf]
Entity-Aware Abstractive Multi-Document Summarization Hao Zhou, Weidong Ren, Gongshen Liu, Bo Su, Wei Lu ACL 2021 Findings [pdf] [code]
TWAG: A Topic-Guided Wikipedia Abstract Generator Fangwei Zhu, Shangqing Tu, Jiaxin Shi, Juanzi Li, Lei Hou, Tong Cui ACL 2021 [pdf] [code]
AgreeSum: Agreement-Oriented Multi-Document Summarization Richard Yuanzhe Pang, Adam D. Lelkes, Vinh Q. Tran, Cong Yu Findings of ACL 2021 [pdf] [data]
Analysis of GraphSum's Attention Weights to Improve the Explainability of Multi-Document Summarization M. Lautaro Hickmann, Fabian Wurzberger, Megi Hoxhalli, Arne Lochner, Jessica Töllich, Ansgar Scherp [pdf]
Extending Multi-Document Summarization Evaluation to the Interactive Setting Ori Shapira, Ramakanth Pasunuru, Hadar Ronen, Mohit Bansal, Yael Amsterdamer, Ido Dagan NAACL21 [pdf] [code]
Efficiently Summarizing Text and Graph Encodings of Multi-Document Clusters Ramakanth Pasunuru, Mengwen Liu, Mohit Bansal, Sujith Ravi, Markus Dreyer NAACL21 [pdf] [code]
Self-Supervised and Controlled Multi-Document Opinion Summarization Hady Elsahar, Maximin Coavoux, Jos Rozen, Matthias Gallé EACL 2021 [pdf]
MS2: Multi-Document Summarization of Medical Studies Jay DeYoung, Iz Beltagy, Madeleine van Zuylen, Bailey Kuehl, Lucy Lu Wang [pdf] [data]
Nutri-bullets: Summarizing Health Studies by Composing Segments Darsh J Shah, Lili Yu, Tao Lei, Regina Barzilay AAAI21 [pdf] [code]
Multi-document Summarization using Semantic Role Labeling and Semantic Graph for Indonesian News Article Yuly Haruka Berliana Gunawan, Masayu Leylia Khodra [pdf]
Flight of the PEGASUS? Comparing Transformers on Few-Shot and Zero-Shot Multi-document Abstractive Summarization Travis Goodwin, Max Savery, Dina Demner-Fushman COLING20 [pdf]
Abstractive Multi-Document Summarization via Joint Learning with Single-Document Summarization Hanqi Jin, Xiaojun Wan Findings of EMNLP [pdf] [code]
Coarse-to-Fine Query Focused Multi-Document Summarization Yumo Xu, Mirella Lapata EMNLP20 [pdf] [code] [code]
WSL-DS: Weakly Supervised Learning with Distant Supervision for Query Focused Multi-Document Abstractive Summarization Md Tahmid Rahman Laskar, Enamul Hoque, Jimmy Xiangji Huang COLING20 Short [pdf] [code]
AQuaMuSe: Automatically Generating Datasets for Query-Based Multi-Document Summarization Sayali Kulkarni, Sheide Chammas, Wan Zhu, Fei Sha, Eugene Ie [pdf] [data]
Multi-document Summarization with Maximal Marginal Relevance-guided Reinforcement Learning Yuning Mao, Yanru Qu, Yiqing Xie, Xiang Ren, Jiawei Han EMNLP20 [pdf] [code]
Heterogeneous Graph Neural Networks for Extractive Document Summarization Danqing Wang, Pengfei Liu, Yining Zheng, Xipeng Qiu, Xuanjing Huang ACL20 [pdf] [code]
Multi-Granularity Interaction Network for Extractive and Abstractive Multi-Document Summarization Hanqi Jin, Tianming Wang, Xiaojun Wan ACL20 [pdf]
SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization Yang Gao, Wei Zhao, Steffen Eger ACL20 [pdf] [code]
Leveraging Graph to Improve Abstractive Multi-Document Summarization Wei Li, Xinyan Xiao, Jiachen Liu, Hua Wu, Haifeng Wang, Junping Du ACL20 [pdf] [code]
Generating Representative Headlines for News Stories Xiaotao Gu, Yuning Mao, Jiawei Han, Jialu Liu, Hongkun Yu, You Wu, Cong Yu, Daniel Finnie, Jiaqi Zhai, Nicholas Zukoski WWW20 [pdf] [code]
Learning to Create Sentence Semantic Relation Graphs for Multi-Document Summarization Diego Antognini, Boi Faltings EMNLP19 [pdf]
Improving the Similarity Measure of Determinantal Point Processes for Extractive Multi-Document Summarization Sangwoo Cho, Logan Lebanoff, Hassan Foroosh, Fei Liu ACL19 [pdf] [code]
Hierarchical Transformers for Multi-Document Summarization Yang Liu, Mirella Lapata ACL19 [pdf] [code]
MeanSum: A Neural Model for Unsupervised Multi-Document Abstractive Summarization Eric Chu, Peter J. Liu ICML19 [pdf] [code]
Generating Wikipedia by Summarizing Long Sequences Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, Noam Shazeer ICLR18 [pdf] [code]
Adapting the Neural Encoder-Decoder Framework from Single to Multi-Document Summarization Logan Lebanoff, Kaiqiang Song, Fei Liu EMNLP18 [pdf] [code]
Graph-based Neural Multi-Document Summarization Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, Dragomir Radev CoNLL17 [pdf]
Improving Multi-Document Summarization via Text Classification Ziqiang Cao, Wenjie Li, Sujian Li, Furu Wei AAAI17 [pdf]
Automatic generation of related work through summarizing citations Jingqiang Chen, Hai Zhuge [pdf] [data]
An Unsupervised Multi-Document Summarization Framework Based on Neural Document Model Shulei Ma, Zhi-Hong Deng, Yunlun Yang COLING16 [pdf]
Event-Centric Summary Generation Lucy Vanderwende, Michele Banko, Arul Menezes ACL04 [pdf]

Cross-Lingual

Towards Unifying Multi-Lingual and Cross-Lingual Summarization Jiaan Wang, Fandong Meng, Duo Zheng, Yunlong Liang, Zhixu Li, Jianfeng Qu, Jie Zhou ACL 2023 [pdf]

[Abs]
To adapt text summarization to the multilingual world, previous work proposes multi-lingual summarization (MLS) and cross-lingual summarization (CLS). However, these two tasks have been studied separately due to the different definitions, which limits the compatible and systematic research on both of them. In this paper, we aim to unify MLS and CLS into a more general setting, i.e., many-to-many summarization (M2MS), where a single model could process documents in any language and generate their summaries also in any language. As the first step towards M2MS, we conduct preliminary studies to show that M2MS can better transfer task knowledge across different languages than MLS and CLS. Furthermore, we propose Pisces, a pre-trained M2MS model that learns language modeling, cross-lingual ability and summarization ability via three-stage pre-training. Experimental results indicate that our Pisces significantly outperforms the state-of-the-art baselines, especially in the zero-shot directions, where there is no training data from the source-language documents to the target-language summaries.
XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation in Low Resource Languages Dhaval Taunk, Shivprasad Sagare, Anupam Patil, Shivansh Subramanian, Manish Gupta, Vasudeva Varma [pdf] [code]

[Abs]
Lack of encyclopedic text contributors, especially on Wikipedia, makes automated text generation for low resource (LR) languages a critical problem. Existing work on Wikipedia text generation has focused on English only, where English reference articles are summarized to generate English Wikipedia pages. But, for low-resource languages, the scarcity of reference articles makes monolingual summarization ineffective in solving this problem. Hence, in this work, we propose XWikiGen, which is the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text. Accordingly, we contribute a benchmark dataset spanning ∼69K Wikipedia articles covering five domains and eight languages. We harness this dataset to train a two-stage system where the input is a set of citations and a section title and the output is a section-specific LR summary. The proposed system is based on a novel idea of neural unsupervised extractive summarization to coarsely identify salient information followed by a neural abstractive model to generate the section-specific text. Extensive experiments show that multi-domain training is better than the multi-lingual setup on average.
CroCoSum: A Benchmark Dataset for Cross-Lingual Code-Switched Summarization Ruochen Zhang, Carsten Eickhoff [pdf] [code]

[Abs]
Cross-lingual summarization (CLS) has attracted increasing interest in recent years due to the availability of large-scale web-mined datasets and the advancements of multilingual language models. However, given the rareness of naturally occurring CLS resources, the majority of datasets are forced to rely on translation which can contain overly literal artifacts. This restricts our ability to observe naturally occurring CLS pairs that capture organic diction, including instances of code-switching. This alternation between languages in mid-message is a common phenomenon in multilingual settings yet has been largely overlooked in cross-lingual contexts due to data scarcity. To address this gap, we introduce CroCoSum, a dataset of cross-lingual code-switched summarization of technology news. It consists of over 24,000 English source articles and 18,000 human-curated Chinese news summaries, with more than 92% of the summaries containing code-switched phrases. For reference, we evaluate the performance of existing approaches including pipeline, end-to-end, and zero-shot methods. We show that leveraging existing resources as a pretraining step does not improve performance on CroCoSum, indicating the limited generalizability of existing resources. Finally, we discuss the challenges of evaluating cross-lingual summarizers on code-switched generation through qualitative error analyses.
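As a rough illustration of the "share of summaries containing code-switched phrases" statistic mentioned above, the sketch below flags a Chinese summary as code-switched when it mixes CJK characters with Latin-script words. This is only a script-based heuristic assumed for illustration, not the procedure used by the CroCoSum authors; the function names `is_code_switched` and `code_switch_rate` are hypothetical.

```python
import re

# Assumption: a summary counts as code-switched if it contains both
# CJK ideographs and at least one Latin-script word of 2+ letters.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
LATIN_WORD = re.compile(r"[A-Za-z]{2,}")


def is_code_switched(text):
    """Return True if the text mixes CJK characters with Latin-script words."""
    return bool(CJK_CHAR.search(text)) and bool(LATIN_WORD.search(text))


def code_switch_rate(summaries):
    """Fraction of summaries flagged as code-switched (0.0 for an empty list)."""
    if not summaries:
        return 0.0
    return sum(is_code_switched(s) for s in summaries) / len(summaries)
```

A script-based check like this would miss code-switching between languages that share a script and would over-count summaries that merely contain a Latin acronym, so it should be read as a first approximation only.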
Large Scale Multi-Lingual Multi-Modal Summarization Dataset Yash Verma, Anubhav Jangra, Raghvendra Kumar, Sriparna Saha [pdf] [code]

[Abs]
Significant developments in techniques such as encoder-decoder models have enabled us to represent information comprising multiple modalities. This information can further enhance many downstream tasks in the field of information retrieval and natural language processing; however, improvements in multi-modal techniques and their performance evaluation require large-scale multi-modal data which offers sufficient diversity. Multi-lingual modeling for a variety of tasks like multi-modal summarization, text generation, and translation leverages information derived from high-quality multi-lingual annotated data. In this work, we present the current largest multi-lingual multi-modal summarization dataset (M3LS), and it consists of over a million instances of document-image pairs along with a professionally annotated multi-modal summary for each pair. It is derived from news articles published by the British Broadcasting Corporation (BBC) over a decade and spans 20 languages, targeting diversity across five language roots. It is also the largest summarization dataset for 13 languages and consists of cross-lingual summarization data for 2 languages. We formally define the multi-lingual multi-modal summarization task utilizing our dataset and report baseline scores from various state-of-the-art summarization techniques in a multi-lingual setting. We also compare it with many similar datasets to analyze the uniqueness and difficulty of M3LS. https://arxiv.org/abs/2302.06560
Understanding Translationese in Cross-Lingual Summarization Jiaan Wang, Fandong Meng, Tingyi Zhang, Yunlong Liang, Jiarong Xu, Zhixu Li, Jie Zhou [pdf]

[Abs]
Given a document in a source language, cross-lingual summarization (CLS) aims at generating a concise summary in a different target language. Unlike monolingual summarization (MS), naturally occurring source-language documents paired with target-language summaries are rare. To collect large-scale CLS samples, existing datasets typically involve translation in their creation. However, the translated text is distinguished from the text originally written in that language, i.e., translationese. Though many efforts have been devoted to CLS, none of them notice the phenomenon of translationese. In this paper, we first confirm that the different approaches to constructing CLS datasets will lead to different degrees of translationese. Then we design systematic experiments to investigate how translationese affects CLS model evaluation and performance when it appears in source documents or target summaries. In detail, we find that (1) the translationese in documents or summaries of test sets might lead to the discrepancy between human judgment and automatic evaluation; (2) the translationese in training sets would harm model performance in the real scene; (3) though machine-translated documents involve translationese, they are very useful for building CLS systems on low-resource languages under specific training strategies. Furthermore, we give suggestions for future CLS research including dataset and model developments. We hope that our work could let researchers notice the phenomenon of translationese in CLS and take it into account in the future.
Searching for Effective Multilingual Fine-Tuning Methods: A Case Study in Summarization Yiwei Qin, Graham Neubig, Pengfei Liu [pdf] [code]

[Abs]
Recently, a large number of tuning strategies have been proposed to adapt pre-trained language models to downstream tasks. In this paper, we perform an extensive empirical evaluation of various tuning strategies for multilingual learning, particularly in the context of text summarization. Specifically, we explore the relative advantages of three families of multilingual tuning strategies (a total of five models) and empirically evaluate them for summarization over 45 languages. Experimentally, we not only established a new state-of-the-art on the XL-Sum dataset but also derive a series of observations that hopefully can provide hints for future research on the design of multilingual tuning strategies.
ClidSum: A Benchmark Dataset for Cross-Lingual Dialogue Summarization Jiaan Wang, Fandong Meng, Ziyao Lu, Duo Zheng, Zhixu Li, Jianfeng Qu, Jie Zhou EMNLP 2022 [pdf] [code]
A Survey on Cross-Lingual Summarization Jiaan Wang, Fandong Meng, Duo Zheng, Yunlong Liang, Zhixu Li, Jianfeng Qu, Jie Zhou TACL 2022 [pdf]
Overcoming Catastrophic Forgetting in Zero-Shot Cross-Lingual Generation Tu Vu, Aditya Barua, Brian Lester, Daniel Cer, Mohit Iyyer, Noah Constant [pdf]
MSAMSum: Towards Benchmarking Multi-lingual Dialogue Summarization Xiachong Feng, Xiaocheng Feng, Bing Qin ACL 2022 DialDoc Workshop [pdf] [data]
The Cross-lingual Conversation Summarization Challenge Yulong Chen, Ming Zhong, Xuefeng Bai, Naihao Deng, Jing Li, Xianchao Zhu, Yue Zhang [pdf]
Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization Ruipeng Jia, Xingxing Zhang, Yanan Cao, Shi Wang, Zheng Lin, Furu Wei ACL 2022 [pdf]

[Abs]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English summarization dataset and then applied to summarization datasets of other languages. Given English gold summaries and documents, sentence-level labels for extractive summarization are usually generated using heuristics. However, these monolingual labels created on English datasets may not be optimal on datasets of other languages, because there are syntactic or semantic discrepancies between different languages. In this way, it is possible to translate the English dataset to other languages and obtain different sets of labels again using heuristics. To fully leverage the information of these different sets of labels, we propose NLSSum (Neural Label Search for Summarization), which jointly learns hierarchical weights for these different sets of labels together with our summarization model. We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations across these two datasets.
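The abstract above mentions that sentence-level labels for extractive summarization are "usually generated using heuristics". One widely used heuristic of this kind is greedy oracle selection: keep adding the document sentence that most improves overlap with the gold summary. The sketch below is a minimal version under stated assumptions: it uses unigram F1 as a simplified stand-in for ROUGE, whitespace tokenization, and hypothetical function names. It is not the NLSSum label search itself.

```python
from collections import Counter


def unigram_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap F1, a simplified stand-in for ROUGE-1 (assumption)."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_oracle_labels(doc_sentences, gold_summary, max_sentences=3):
    """Return a 0/1 label per sentence by greedily maximizing overlap with the gold summary."""
    reference = gold_summary.lower().split()
    selected = []
    best_score = 0.0
    while len(selected) < max_sentences:
        best_idx = None
        for i in range(len(doc_sentences)):
            if i in selected:
                continue
            # Score the current selection plus sentence i, in document order.
            candidate = []
            for j in sorted(selected + [i]):
                candidate.extend(doc_sentences[j].lower().split())
            score = unigram_f1(candidate, reference)
            if score > best_score:
                best_score, best_idx = score, i
        if best_idx is None:
            break  # no remaining sentence improves the score
        selected.append(best_idx)
    return [1 if i in selected else 0 for i in range(len(doc_sentences))]
```

Because the labels depend on token overlap with the gold summary, running the same procedure on a translated copy of the dataset can yield a different label set, which is the situation the abstract describes.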
Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation Changtong Zan, Liang Ding, Li Shen, Yu Cao, Weifeng Liu, Dacheng Tao [pdf]
A Variational Hierarchical Model for Neural Cross-Lingual Summarization Yunlong Liang, Fandong Meng, Chulun Zhou, Jinan Xu, Yufeng Chen, Jinsong Su, Jie Zhou ACL 2022 [pdf] [code]

[Abs]
The goal of the cross-lingual summarization (CLS) is to convert a document in one language (e.g., English) to a summary in another one (e.g., Chinese). The CLS task is essentially the combination of machine translation (MT) and monolingual summarization (MS), and thus there exists the hierarchical relationship between MT&MS and CLS. Existing studies on CLS mainly focus on utilizing pipeline methods or jointly training an end-to-end model through an auxiliary MT or MS objective. However, it is very challenging for the model to directly conduct CLS as it requires both the abilities to translate and summarize. To address this issue, we propose a hierarchical model for the CLS task, based on the conditional variational auto-encoder. The hierarchical model contains two kinds of latent variables at the local and global levels, respectively. At the local level, there are two latent variables, one for translation and the other for summarization. As for the global level, there is another latent variable for cross-lingual summarization conditioned on the two local-level variables. Experiments on two language directions (English-Chinese) verify the effectiveness and superiority of the proposed approach. In addition, we show that our model is able to generate better cross-lingual summaries than comparison models in the few-shot setting.
CptGraphSum: Let key clues guide the cross-lingual abstractive summarization Shuyu Jiang, Dengbiao Tu, Xingshu Chen, Rui Tang, Wenxian Wang, Haizhou Wang [pdf]
CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summarization for 1500+ Language Pairs Tahmid Hasan, Abhik Bhattacharjee, Wasi Uddin Ahmad, Yuan-Fang Li, Yong-Bin Kang, Rifat Shahriyar [pdf] [code]
Improving Neural Cross-Lingual Summarization via Employing Optimal Transport Distance for Knowledge Distillation Thong Nguyen, Luu Anh Tuan AAAI 2022 [pdf] [code]
Evaluation of Abstractive Summarisation Models with Machine Translation in Deliberative Processes Miguel Arana-Catania, Rob Procter, Yulan He, Maria Liakata EMNLP 2021 | newsum [pdf]
Models and Datasets for Cross-Lingual Summarisation Laura Perez-Beltrachini, Mirella Lapata EMNLP 2021 [pdf] [data]
MassiveSumm: a very large-scale, very multilingual, news summarisation dataset Daniel Varab, Natalie Schluter EMNLP 2021 [pdf] [code]
Bridging the Gap: Cross-Lingual Summarization with Compression Rate Yu Bai, Heyan Huang, Kai Fan, Yang Gao, Zewen Chi, Boxing Chen [pdf]
Contrastive Aligned Joint Learning for Multilingual Summarization Danqing Wang, Jiaze Chen, Hao Zhou, Xipeng Qiu, Lei Li ACL 2021 Findings [pdf] [data]
XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages T. Hasan, A. Bhattacharjee, M. S. Islam, K. Samin, Y. Li, Y. Kang, M. S. Rahman, R. Shahriyar Findings of ACL 2021 [pdf] [data]
ZmBART: An Unsupervised Cross-lingual Transfer Framework for Language Generation Kaushal Kumar Maurya, Maunendra Sankar Desarkar, Yoshinobu Kano, Kumari Deepshikha Findings of ACL 2021 [pdf] [code]
mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs Zewen Chi, Li Dong, Shuming Ma, Shaohan Huang, Xian-Ling Mao, Heyan Huang, Furu Wei [pdf] [code]
Evaluating the Efficacy of Summarization Evaluation across Languages Fajri Koto, Jey Han Lau, Timothy Baldwin Findings of ACL 2021 [pdf]
Cross-Lingual Abstractive Summarization with Limited Parallel Resources Yu Bai, Yang Gao, Heyan Huang ACL 2021 [pdf] [code]
Unsupervised Approach to Multilingual User Comments Summarization Aleš Žagar, Marko Robnik-Šikonja EACL21 [pdf] [code]
MultiHumES: Multilingual Humanitarian Dataset for Extractive Summarization Jenny Paola Yela-Bello, Ewan Oglethorpe, Navid Rekabsaz EACL21 [pdf] [data]
Cross-lingual Approach to Abstractive Summarization Aleš Žagar, Marko Robnik-Šikonja [pdf]
Mixed-Lingual Pre-training for Cross-lingual Summarization Ruochen Xu, Chenguang Zhu, Yu Shi, Michael Zeng, Xuedong Huang AACL20 [pdf]
Multi-Task Learning for Cross-Lingual Abstractive Summarization Sho Takase, Naoaki Okazaki [pdf]
WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization Faisal Ladhak, Esin Durmus, Claire Cardie, Kathleen McKeown Findings of EMNLP20 [pdf] [data]
A Deep Reinforced Model for Zero-Shot Cross-Lingual Summarization with Bilingual Semantic Similarity Rewards Zi-Yi Dou, Sachin Kumar, Yulia Tsvetkov ACL20 workshop [pdf] [code]
Jointly Learning to Align and Summarize for Neural Cross-Lingual Summarization Yue Cao, Hui Liu, Xiaojun Wan ACL20 [pdf]
Attend, Translate and Summarize: An Efficient Method for Neural Cross-Lingual Summarization Junnan Zhu, Yu Zhou, Jiajun Zhang, Chengqing Zong ACL20 [pdf] [code]
MultiSumm: Towards a Unified Model for Multi-Lingual Abstractive Summarization Yue Cao, Xiaojun Wan, Jinge Yao, Dian Yu AAAI20 [pdf] [code]
Cross-Lingual Natural Language Generation via Pre-Training Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, Heyan Huang AAAI 2020 [pdf] [code]
Global Voices: Crossing Borders in Automatic News Summarization Khanh Nguyen, Hal Daumé III EMNLP19 workshop [pdf] [data]
NCLS: Neural Cross-Lingual Summarization Junnan Zhu, Qian Wang, Yining Wang, Yu Zhou, Jiajun Zhang, Shaonan Wang, Chengqing Zong EMNLP19 [pdf] [code]
Zero-Shot Cross-Lingual Abstractive Sentence Summarization through Teaching Generation and Attention Xiangyu Duan, Mingming Yin, Min Zhang, Boxing Chen, Weihua Luo ACL19 [pdf] [code]
A Robust Abstractive System for Cross-Lingual Summarization Jessica Ouyang, Boya Song, Kathy McKeown NAACL19 [pdf]
Cross-Lingual Korean Speech-to-Text Summarization HyoJeon Yoon, Dinh Tuyen Hoang, Ngoc Thanh Nguyen, Dosam Hwang ACIIDS19 [pdf]
Cross-language document summarization via extraction and ranking of multiple summaries Xiaojun Wan, Fuli Luo, Xue Sun, Songfang Huang & Jin-ge Yao [pdf]
Zero-Shot Cross-Lingual Neural Headline Generation Shi-qi Shen, Yun Chen, Cheng Yang, Zhi-yuan Liu, Mao-song Sun TASLP18 [pdf]
Cross-Language Text Summarization using Sentence and Multi-Sentence Compression Elvys Linhares Pontes, Stéphane Huet, Juan-Manuel Torres-Moreno, Andréa Carneiro Linhares NLDB18 [pdf]
Abstractive Cross-Language Summarization via Translation Model Enhanced Predicate Argument Structure Fusing Jiajun Zhang, Yu Zhou, Chengqing Zong TASLP16 [pdf]
Phrase-based Compressive Cross-Language Summarization Jin-ge Yao, Xiaojun Wan, Jianguo Xiao EMNLP15 [pdf]
Multilingual Single-Document Summarization with MUSE Marina Litvak, Mark Last MultiLing13 [pdf]
Using bilingual information for cross-language document summarization Xiaojun Wan ACL11 [pdf]
A Graph-based Approach to Cross-language Multi-document Summarization Florian Boudin, Stéphane Huet, Juan-Manuel Torres-Moreno [pdf]
Cross-language document summarization based on machine translation quality prediction Xiaojun Wan, Huiying Li, Jianguo Xiao ACL10 [pdf]
Evaluation of a Cross-lingual Romanian-English Multi-document Summariser Constantin Orasan, Oana Andreea Chiorean LREC08 [pdf]
Cross-lingual C*ST*RD: English access to Hindi information Anton Leuski, Chin-Yew Lin, Liang Zhou, Ulrich Germann, Franz Josef Och, Eduard Hovy [pdf]

Multi-modal

Exploiting Pseudo Image Captions for Multimodal Summarization Chaoya Jiang, Rui Xie, Wei Ye, Jinan Sun, Shikun Zhang Findings ACL 2023 [pdf]

[Abs]
Cross-modal contrastive learning in vision language pretraining (VLP) faces the challenge of (partial) false negatives. In this paper, we study this problem from the perspective of Mutual Information (MI) optimization. It is common sense that InfoNCE loss used in contrastive learning will maximize the lower bound of MI between anchors and their positives, while we theoretically prove that MI involving negatives also matters when noises commonly exist. Guided by a more general lower bound form for optimization, we propose a contrastive learning strategy regulated by progressively refined cross-modal similarity, to more accurately optimize MI between an image/text anchor and its negative texts/images instead of improperly minimizing it. Our method performs competitively on four downstream cross-modal tasks and systematically balances the beneficial and harmful effects of (partial) false negative samples under theoretical guidance.
Learning Summary-Worthy Visual Representation for Abstractive Summarization in Video Zenan Xu, Xiaojun Meng, Yasheng Wang, Qinliang Su, Zexuan Qiu, Xin Jiang, Qun Liu IJCAI 2023 [pdf]

[Abs]
Multimodal abstractive summarization for videos (MAS) requires generating a concise textual summary to describe the highlights of a video according to multimodal resources, in our case, the video content and its transcript. Inspired by the success of the large-scale generative pre-trained language model (GPLM) in generating high-quality textual content (e.g., summary), recent MAS methods have proposed to adapt the GPLM to this task by equipping it with the visual information, which is often obtained through a general-purpose visual feature extractor. However, the generally extracted visual features may overlook some summary-worthy visual information, which impedes model performance. In this work, we propose a novel approach to learning the summary-worthy visual representation that facilitates abstractive summarization. Our method exploits the summary-worthy information from both the cross-modal transcript data and the knowledge that distills from the pseudo summary. Extensive experiments on three public multimodal datasets show that our method outperforms all competing baselines. Furthermore, with the advantages of summary-worthy visual information, our model can have a significant improvement on small datasets or even datasets with limited training data.
VideoXum: Cross-modal Visual and Textural Summarization of Videos Jingyang Lin, Hang Hua, Ming Chen, Yikang Li, Jenhao Hsiao, Chiuman Ho, Jiebo Luo [pdf] [code]

[Abs]
Video summarization aims to distill the most important information from a source video to produce either an abridged clip or a textual narrative. Traditionally, different methods have been proposed depending on whether the output is a video or text, thus ignoring the correlation between the two semantically related tasks of visual summarization and textual summarization. We propose a new joint video and text summarization task. The goal is to generate both a shortened video clip along with the corresponding textual summary from a long video, collectively referred to as a cross-modal summary. The generated shortened video clip and text narratives should be semantically well aligned. To this end, we first build a large-scale human-annotated dataset -- VideoXum (X refers to different modalities). The dataset is reannotated based on ActivityNet. After we filter out the videos that do not meet the length requirements, 14,001 long videos remain in our new dataset. Each video in our reannotated dataset has human-annotated video summaries and the corresponding narrative summaries. We then design a novel end-to-end model -- VTSUM-BILP to address the challenges of our proposed task. Moreover, we propose a new metric called VT-CLIPScore to help evaluate the semantic consistency of cross-modality summary. The proposed model achieves promising performance on this new task and establishes a benchmark for future research.
Sample Efficient Multimodal Semantic Augmentation for Incremental Summarization Sumanta Bhattacharyya, Ramesh Manuvinakurike, Sahisnu Mazumder, Saurav Sahay [pdf]

[Abs]
In this work, we develop a prompting approach for incremental summarization of task videos. We develop a sample-efficient few-shot approach for extracting semantic concepts as an intermediate step. We leverage an existing model for extracting the concepts from the images and extend it to videos and introduce a clustering and querying approach for sample efficiency, motivated by the recent advances in perceiver-based architectures. Our work provides further evidence that an approach with richer input context with relevant entities and actions from the videos and using these as prompts could enhance the summaries generated by the model. We show the results on a relevant dataset and discuss possible directions for the work.
Large Scale Multi-Lingual Multi-Modal Summarization Dataset Yash Verma, Anubhav Jangra, Raghvendra Kumar, Sriparna Saha [pdf] [code]

[Abs]
Significant developments in techniques such as encoder-decoder models have enabled us to represent information comprising multiple modalities. This information can further enhance many downstream tasks in the field of information retrieval and natural language processing; however, improvements in multi-modal techniques and their performance evaluation require large-scale multi-modal data which offers sufficient diversity. Multi-lingual modeling for a variety of tasks like multi-modal summarization, text generation, and translation leverages information derived from high-quality multi-lingual annotated data. In this work, we present the current largest multi-lingual multi-modal summarization dataset (M3LS), and it consists of over a million instances of document-image pairs along with a professionally annotated multi-modal summary for each pair. It is derived from news articles published by the British Broadcasting Corporation (BBC) over a decade and spans 20 languages, targeting diversity across five language roots. It is also the largest summarization dataset for 13 languages and consists of cross-lingual summarization data for 2 languages. We formally define the multi-lingual multi-modal summarization task utilizing our dataset and report baseline scores from various state-of-the-art summarization techniques in a multi-lingual setting. We also compare it with many similar datasets to analyze the uniqueness and difficulty of M3LS. https://arxiv.org/abs/2302.06560
Assist Non-native Viewers: Multimodal Cross-Lingual Summarization for How2 Videos Nayu Liu, Kaiwen Wei, Xian Sun, Hongfeng Yu, Fanglong Yao, Li Jin, Guo Zhi, Guangluan Xu EMNLP 2022 [pdf] [data]

[Abs]
Multimodal summarization for videos aims to generate summaries from multi-source information (videos, audio transcripts), which has achieved promising progress. However, existing works are restricted to monolingual video scenarios, ignoring the demands of non-native video viewers to understand the cross-language videos in practical applications. It stimulates us to propose a new task, named Multimodal Cross-Lingual Summarization for videos (MCLS), which aims to generate cross-lingual summaries from multimodal inputs of videos. First, to make it applicable to MCLS scenarios, we conduct a Video-guided Dual Fusion network (VDF) that integrates multimodal and cross-lingual information via diverse fusion strategies at both encoder and decoder. Moreover, to alleviate the problem of high annotation costs and limited resources in MCLS, we propose a triple-stage training framework to assist MCLS by transferring the knowledge from monolingual multimodal summarization data, which includes: 1) multimodal summarization on sufficient prevalent language videos with a VDF model; 2) knowledge distillation (KD) guided adjustment on bilingual transcripts; 3) multimodal summarization for cross-lingual videos with a KD induced VDF model. Experiment results on the reorganized How2 dataset show that the VDF model alone outperforms previous methods for multimodal summarization, and the performance further improves by a large margin via the proposed triple-stage training framework.
TLDW: Extreme Multimodal Summarisation of News Videos Peggy Tang, Kun Hu, Lei Zhang, Jiebo Luo, Zhiyong Wang [pdf]

[Abs]
Multimodal summarisation with multimodal output is drawing increasing attention due to the rapid growth of multimedia data. While several methods have been proposed to summarise visual-text contents, their multimodal outputs are not succinct enough at an extreme level to address the information overload issue. To the end of extreme multimodal summarisation, we introduce a new task, eXtreme Multimodal Summarisation with Multimodal Output (XMSMO) for the scenario of TL;DW - Too Long; Didn't Watch, akin to TL;DR. XMSMO aims to summarise a video-document pair into a summary with an extremely short length, which consists of one cover frame as the visual summary and one sentence as the textual summary. We propose a novel unsupervised Hierarchical Optimal Transport Network (HOT-Net) consisting of three components: hierarchical multimodal encoders, hierarchical multimodal fusion decoders, and optimal transport solvers. Our method is trained, without using reference summaries, by optimising the visual and textual coverage from the perspectives of the distance between the semantic distributions under optimal transport plans. To facilitate the study on this task, we collect a large-scale dataset XMSMO-News by harvesting 4,891 video-document pairs. The experimental results show that our method achieves promising performance in terms of ROUGE and IoU metrics.
Hierarchical3D Adapters for Long Video-to-text Summarization Pinelopi Papalampidi, Mirella Lapata [pdf]

[Abs]
In this paper, we focus on video-to-text summarization and investigate how to best utilize multimodal information for summarizing long inputs (e.g., an hour-long TV show) into long outputs (e.g., a multi-sentence summary). We extend SummScreen (Chen et al., 2021), a dialogue summarization dataset consisting of transcripts of TV episodes with reference summaries, and create a multimodal variant by collecting corresponding full-length videos. We incorporate multimodal information into a pre-trained textual summarizer efficiently using adapter modules augmented with a hierarchical structure while tuning only 3.8% of model parameters. Our experiments demonstrate that multimodal information offers superior performance over more memory-heavy and fully fine-tuned textual summarization methods.
Modeling Paragraph-Level Vision-Language Semantic Alignment for Multi-Modal Summarization Xinnian Liang, Chenhao Cui, Shuangzhi Wu, Jiali Zeng, Yufan Jiang, Zhoujun Li [pdf]

[Abs]
Most current multi-modal summarization methods follow a cascaded manner, where an off-the-shelf object detector is first used to extract visual features, then these features are fused with language representations to generate the summary with an encoder-decoder model. The cascaded way cannot capture the semantic alignments between images and paragraphs, which are crucial to a precise summary. In this paper, we propose ViL-Sum to jointly model paragraph-level Vision-Language Semantic Alignment and Multi-Modal Summarization. The core of ViL-Sum is a joint multi-modal encoder with two well-designed tasks, image reordering and image selection. The joint multi-modal encoder captures the interactions between modalities, where the reordering task guides the model to learn paragraph-level semantic alignment and the selection task guides the model to select summary-related images for the final summary. Experimental results show that our proposed ViL-Sum significantly outperforms current state-of-the-art methods. In further analysis, we find that two well-designed tasks and joint multi-modal encoder can effectively guide the model to learn reasonable paragraphs-images and summary-images relations.
MHMS: Multimodal Hierarchical Multimedia Summarization Jielin Qiu, Jiacheng Zhu, Mengdi Xu, Franck Dernoncourt, Trung Bui, Zhaowen Wang, Bo Li, Ding Zhao, Hailin Jin [pdf]
Video Summarization Based on Video-text Representation Li Haopeng, Ke Qiuhong, Gong Mingming, Zhang Rui [pdf]
UniMS: A Unified Framework for Multimodal Summarization with Knowledge Distillation Zhengkun Zhang, Xiaojun Meng, Yasheng Wang, Xin Jiang, Qun Liu, Zhenglu Yang AAAI 2022 [pdf]
Hierarchical Cross-Modality Semantic Correlation Learning Model for Multimodal Summarization Litian Zhang, Xiaoming Zhang, Junshu Pan, Feiran Huang AAAI 2022 [pdf] [data]
Attention-based Multi-hypothesis Fusion for Speech Summarization Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe [pdf]
Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization Tiezheng Yu, Wenliang Dai, Zihan Liu, Pascale Fung EMNLP 2021 [pdf] [code]
Multi-Modal Supplementary-Complementary Summarization using Multi-Objective Optimization Anubhav Jangra, Sriparna Saha, Adam Jatowt, Mohammad Hasanuzzaman SIGIR 2021 [pdf]
Self-Supervised Multimodal Opinion Summarization Jinbae Im, Moonki Kim, Hoyeop Lee, Hyunsouk Cho, Sehee Chung ACL21 [pdf] [code]
GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization Jia-Hong Huang, Luka Murn, Marta Mrak, Marcel Worring ICMR21 [pdf]
Multimodal Sentence Summarization via Multimodal Selective Encoding Haoran Li, Junnan Zhu, Jiajun Zhang, Xiaodong He, Chengqing Zong COLING20 [pdf]
Multistage Fusion with Forget Gate for Multimodal Summarization in Open-Domain Videos Nayu Liu, Xian Sun, Hongfeng Yu, Wenkai Zhang, Guangluan Xu EMNLP20 [pdf]
MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention Aman Khullar, Udit Arora EMNLP20 Workshop [pdf] [code]
VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles Mingzhe Li, Xiuying Chen, Shen Gao, Zhangming Chan, Dongyan Zhao, Rui Yan EMNLP20 [pdf] [data]
Multi-modal Summarization for Video-containing Documents Xiyan Fu, Jun Wang, Zhenglu Yang [pdf] [code]
Text-Image-Video Summary Generation Using Joint Integer Linear Programming Anubhav Jangra, Adam Jatowt, Mohammad Hasanuzzaman, Sriparna Saha ECIR20 [pdf]
Aspect-Aware Multimodal Summarization for Chinese E-Commerce Products Haoran Li, Peng Yuan, Song Xu, Youzheng Wu, Xiaodong He, Bowen Zhou AAAI20 [pdf] [code]
Convolutional Hierarchical Attention Network for Query-Focused Video Summarization Shuwen Xiao, Zhou Zhao, Zijian Zhang, Xiaohui Yan, Min Yang AAAI20 [pdf]
Multimodal Summarization with Guidance of Multimodal Reference Junnan Zhu, Yu Zhou, Jiajun Zhang, Haoran Li, Chengqing Zong, Changliang Li AAAI20 [pdf]
EmotionCues: Emotion-Oriented Visual Summarization of Classroom Videos Haipeng Zeng, Xinhuan Shu, Yanbang Wang, Yong Wang, Liguo Zhang, Ting-Chuen Pong, Huamin Qu [pdf]
A Survey on Automatic Summarization Using Multi-Modal Summarization System for Asynchronous Collections Shilpadevi Vasant Bhagwat, Sheetal S. Thokal [pdf]
Extractive summarization of documents with images based on multi-modal RNN Jingqiang Chen, Hai Zhuge [pdf]
Keep Meeting Summaries on Topic: Abstractive Multi-Modal Meeting Summarization Manling Li, Lingyu Zhang, Heng Ji, Richard J. Radke ACL19 [pdf]
Multimodal Abstractive Summarization for How2 Videos Shruti Palaskar, Jindřich Libovický, Spandana Gella, Florian Metze ACL19 [pdf]
MSMO: Multimodal Summarization with Multimodal Output Junnan Zhu, Haoran Li, Tianshang Liu, Yu Zhou, Jiajun Zhang, Chengqing Zong EMNLP18 [pdf] [data]
Abstractive Text-Image Summarization Using Multi-Modal Attentional Hierarchical RNN Jingqiang Chen, Hai Zhuge EMNLP18 [pdf]
Multi-modal Sentence Summarization with Modality Attention and Image Filtering Haoran Li, Junnan Zhu, Tianshang Liu, Jiajun Zhang, Chengqing Zong IJCAI18 [pdf]
Multimodal Abstractive Summarization for Open-Domain Videos Jindrich Libovický, Shruti Palaskar, Spandana Gella, Florian Metze NIPS18 [pdf] [data]
Read, Watch, Listen, and Summarize: Multi-Modal Summarization for Asynchronous Text, Image, Audio and Video Haoran Li, Junnan Zhu, Cong Ma, Jiajun Zhang, Chengqing Zong [pdf]
Fusing Verbal and Nonverbal Information for Extractive Meeting Summarization Fumio Nihei, Yukiko I. Nakano, Yutaka Takase GIFT18 [pdf]
Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video Haoran Li, Junnan Zhu, Cong Ma, Jiajun Zhang, Chengqing Zong EMNLP17 [pdf]
Meeting Extracts for Discussion Summarization Based on Multimodal Nonverbal Information Fumio Nihei, Yukiko I. Nakano, Yutaka Takase ICMI16 [pdf]
Summarizing a multimodal set of documents in a Smart Room Maria Fuentes, Horacio Rodríguez, Jordi Turmo LREC12 [pdf]
Multi-modal summarization of key events and top players in sports tournament videos Dian Tjondronegoro, Xiaohui Tao, Johannes Sasongko, Cher Han Lau [pdf]
Multimodal Summarization of Complex Sentences Naushad UzZaman, Jeffrey P. Bigham, James F. Allen [pdf]
Summarization of Multimodal Information Saif Ahmad, Paulo C F de Oliveira, Khurshid Ahmad LREC04 [pdf]
Multimodal Summarization of Meeting Recordings Berna Erol, Dar-Shyang Lee, Jonathan Hull ICME03 [pdf]

Sentiment Related

Why Do You Feel This Way? Summarizing Triggers of Emotions in Social Media Posts Hongli Zhan, Tiberiu Sosea, Cornelia Caragea, Junyi Jessy Li EMNLP 2022 [pdf] [code]

[Abs]
Crises such as the COVID-19 pandemic continuously threaten our world and emotionally affect billions of people worldwide in distinct ways. Understanding the triggers leading to people’s emotions is of crucial importance. Social media posts can be a good source of such analysis, yet these texts tend to be charged with multiple emotions, with triggers scattering across multiple sentences. This paper takes a novel angle, namely, emotion detection and trigger summarization, aiming to both detect perceived emotions in text, and summarize events and their appraisals that trigger each emotion. To support this goal, we introduce CovidET (Emotions and their Triggers during Covid-19), a dataset of ~1,900 English Reddit posts related to COVID-19, which contains manual annotations of perceived emotions and abstractive summaries of their triggers described in the post. We develop strong baselines to jointly detect emotions and summarize emotion triggers. Our analyses show that CovidET presents new challenges in emotion-specific summarization, as well as multi-emotion detection in long social media posts.
Making the Best Use of Review Summary for Sentiment Analysis Sen Yang, Leyang Cui, Jun Xie, Yue Zhang COLING20 [pdf] [code] [bib]
A Unified Dual-view Model for Review Summarization and Sentiment Classification with Inconsistency Loss Hou Pong Chan, Wang Chen, Irwin King SIGIR20 [pdf] [code]
A Hierarchical End-to-End Model for Jointly Improving Text Summarization and Sentiment Classification Shuming Ma, Xu Sun, Junyang Lin, Xuancheng Ren IJCAI18 [pdf]
Two-level Text Summarization from Online News Sources with Sentiment Analysis Tarun B. Mirani, Sreela Sasi IEEE17 [pdf]
Creating Video Summarization From Emotion Perspective Yijie Lan, Shikui Wei, Ruoyu Liu, Yao Zhao ICSP16 [pdf]

Pre-trained Language Model Based

Socratic Pretraining: Question-Driven Pretraining for Controllable Summarization Artidoro Pagnoni, Alexander R. Fabbri, Wojciech Kryściński, Chien-Sheng Wu ACL 2023 [pdf] [code]

[Abs]
In long document controllable summarization, where labeled data is scarce, pretrained models struggle to adapt to the task and effectively respond to user queries. In this paper, we introduce Socratic pretraining, a question-driven, unsupervised pretraining objective specifically designed to improve controllability in summarization tasks. By training a model to generate and answer relevant questions in a given context, Socratic pretraining enables the model to more effectively adhere to user-provided queries and identify relevant content to be summarized. We demonstrate the effectiveness of this approach through extensive experimentation on two summarization domains, short stories and dialogue, and multiple control strategies: keywords, questions, and factoid QA pairs. Our pretraining method relies only on unlabeled documents and a question generation system and outperforms pre-finetuning approaches that use additional supervised data. Furthermore, our results show that Socratic pretraining cuts task-specific labeled data requirements in half, is more faithful to user-provided queries, and achieves state-of-the-art performance on QMSum and SQuALITY.
An Analysis of Abstractive Text Summarization Using Pre-trained Models Tohida Rehman, Suchandan Das, Debarshi Kumar Sanyal, Samiran Chattopadhyay [pdf]

[Abs]
People nowadays use search engines like Google, Yahoo, and Bing to find information on the Internet. Due to the explosion in data, it is helpful for users if they are provided relevant summaries of the search results rather than just links to webpages. Text summarization has become a vital approach to help consumers swiftly grasp vast amounts of information. In this paper, different pre-trained models for text summarization are evaluated on different datasets. Specifically, we have used three different pre-trained models, namely, google/pegasus-cnn-dailymail, T5-base, facebook/bart-large-cnn. We have considered three different datasets, namely, CNN-dailymail, SAMSum and BillSum to get the output from the above three models. The pre-trained models are compared over these different datasets, each of 2000 examples, through ROUGE and BLEU metrics.
Z-Code++: A Pre-trained Language Model Optimized for Abstractive Summarization Pengcheng He, Baolin Peng, Liyang Lu, Song Wang, Jie Mei, Yang Liu, Ruochen Xu, Hany Hassan Awadalla, Yu Shi, Chenguang Zhu, Wayne Xiong, Michael Zeng, Jianfeng Gao, Xuedong Huang [pdf]

[Abs]
This paper presents Z-Code++, a new pre-trained language model optimized for abstractive text summarization. The model extends the state of the art encoder-decoder model using three techniques. First, we use a two-phase pre-training process to improve model's performance on low-resource summarization tasks. The model is first pre-trained using text corpora for language understanding, and then is continually pre-trained on summarization corpora for grounded text generation. Second, we replace self-attention layers in the encoder with disentangled attention layers, where each word is represented using two vectors that encode its content and position, respectively. Third, we use fusion-in-encoder, a simple yet effective method of encoding long sequences in a hierarchical manner. Z-Code++ creates new state of the art on 9 out of 13 text summarization tasks across 5 languages. Our model is parameter-efficient in that it outperforms the 600x larger PaLM-540B on XSum, and the finetuned 200x larger GPT3-175B on SAMSum. In zero-shot and few-shot settings, our model substantially outperforms the competing models.
MVP: Multi-task Supervised Pre-training for Natural Language Generation Tianyi Tang, Junyi Li, Wayne Xin Zhao, Ji-Rong Wen [pdf] [code]

[Abs]
Pre-trained language models (PLMs) have achieved notable success in natural language generation (NLG) tasks. Up to now, most of the PLMs are pre-trained in an unsupervised manner using large-scale general corpus. In the meanwhile, an increasing number of models pre-trained with less labeled data showcase superior performance compared to unsupervised models. Motivated by the success of supervised pre-training, we propose Multi-task superVised Pre-training (MVP) for natural language generation. For pre-training the text generation model MVP, we collect a labeled pre-training corpus from 45 datasets over seven generation tasks. For each task, we further pre-train specific soft prompts to stimulate the model capacity in performing a specific task. Extensive experiments have demonstrated the effectiveness of our supervised pre-training in a number of NLG tasks, and our general methods achieve state-of-the-art performance on 12 of 17 datasets.
E2S2: Encoding-Enhanced Sequence-to-Sequence Pretraining for Language Understanding and Generation Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, Dacheng Tao [pdf]
Does Pretraining for Summarization Require Knowledge Transfer? Kundan Krishna, Jeffrey Bigham, Zachary C. Lipton EMNLP 2021 Findings [pdf] [code]
ARMAN: Pre-training with Semantically Selecting and Reordering of Sentences for Persian Abstractive Summarization Alireza Salemi, Emad Kebriaei, Ghazal Neisi Minaei, Azadeh Shakery EMNLP 2021 [pdf] [code]
Leveraging Lead Bias for Zero-shot Abstractive News Summarization Chenguang Zhu, Ziyi Yang, Robert Gmyr, Michael Zeng, Xuedong Huang SIGIR 2021 [pdf]
ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation Yu Sun, Shuohuan Wang, Shikun Feng, Siyu Ding, Chao Pang, Junyuan Shang, Jiaxiang Liu, Xuyi Chen, Yanbin Zhao, Yuxiang Lu, Weixin Liu, Zhihua Wu, Weibao Gong, Jianzhong Liang, Zhizhou Shang, Peng Sun, Wei Liu, Xuan Ouyang, Dianhai Yu, Hao Tian, Hua Wu, Haifeng Wang [pdf]
BANG: Bridging Autoregressive and Non-autoregressive Generation with Large Scale Pretraining Weizhen Qi, Yeyun Gong, Jian Jiao, Yu Yan, Weizhu Chen, Dayiheng Liu, Kewen Tang, Houqiang Li, Jiusheng Chen, Ruofei Zhang, Ming Zhou, Nan Duan ICML 2021 [pdf] [code]
Fact-level Extractive Summarization with Hierarchical Graph Mask on BERT Ruifeng Yuan, Zili Wang, Wenjie Li COLING20 [pdf] [code]
Towards Zero-Shot Conditional Summarization with Adaptive Multi-Task Fine-Tuning Travis Goodwin, Max Savery, Dina Demner-Fushman Findings of EMNLP [pdf] [code]
Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation Alexander R. Fabbri, Simeng Han, Haoyuan Li, Haoran Li, Marjan Ghazvininejad, Shafiq Joty, Dragomir Radev, Yashar Mehdad [pdf]
Pre-trained Summarization Distillation Sam Shleifer, Alexander M. Rush [pdf] [code]
Pre-training for Abstractive Document Summarization by Reinstating Source Text Yanyan Zou, Xingxing Zhang, Wei Lu, Furu Wei, Ming Zhou EMNLP20 [pdf] [code]
PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation Bin Bi, Chenliang Li, Chen Wu, Ming Yan, Wei Wang, Songfang Huang, Fei Huang, Luo Si EMNLP20 [pdf]
TED: A Pretrained Unsupervised Summarization Model with Theme Modeling and Denoising Ziyi Yang, Chenguang Zhu, Robert Gmyr, Michael Zeng, Xuedong Huang, Eric Darve Findings of EMNLP20 [pdf]
QURIOUS: Question Generation Pretraining for Text Generation Shashi Narayan, Gonçalo Simoes, Ji Ma, Hannah Craighead, Ryan Mcdonald ACL20 Short [pdf]
PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization Jingqing Zhang, Yao Zhao, Mohammad Saleh, Peter J. Liu ICML20 [pdf] [code]
Abstractive Text Summarization based on Language Model Conditioning and Locality Modeling Dmitrii Aksenov, Julián Moreno-Schneider, Peter Bourgonje, Robert Schwarzenberg, Leonhard Hennig, Georg Rehm LREC20 [pdf]
Abstractive Summarization with Combination of Pre-trained Sequence-to-Sequence and Saliency Models Itsumi Saito, Kyosuke Nishida, Kosuke Nishida, Junji Tomita [pdf]
Learning by Semantic Similarity Makes Abstractive Summarization Better Wonjin Yoon, Yoon Sun Yeo, Minbyul Jeong, Bong-Jun Yi, Jaewoo Kang ICML20 [pdf] [code]
Text Summarization with Pretrained Encoders Yang Liu, Mirella Lapata EMNLP19 [pdf] [code]
HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization Xingxing Zhang, Furu Wei, Ming Zhou ACL19 [pdf]
MASS: Masked Sequence to Sequence Pre-training for Language Generation Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu ICML19 [pdf] [code]
Pretraining-Based Natural Language Generation for Text Summarization Haoyu Zhang, Jianjun Xu, Ji Wang [pdf]
Fine-tune BERT for Extractive Summarization Yang Liu [pdf] [code]
Unified Language Model Pre-training for Natural Language Understanding and Generation Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon NIPS19 [pdf] [code]
Self-Supervised Learning for Contextualized Extractive Summarization Hong Wang, Xin Wang, Wenhan Xiong, Mo Yu, Xiaoxiao Guo, Shiyu Chang, William Yang Wang ACL19 [pdf] [code]
Efficient Adaptation of Pretrained Transformers for Abstractive Summarization Andrew Hoang, Antoine Bosselut, Asli Celikyilmaz, Yejin Choi [pdf] [code]

Controllable

Summarization with Precise Length Control Lesly Miculicich, Yujia Xie, Song Wang, Pengcheng He [pdf]

[Abs]
Many applications of text generation such as summarization benefit from accurately controlling the text length. Existing approaches on length-controlled summarization either result in degraded performance or can only control the length approximately. In this work, we present a framework to generate summaries with precisely the specified number of tokens or sentences, while maintaining or even improving the text quality. In addition, we jointly train the models to predict the lengths, so our model can generate summaries with optimal length. We evaluate the proposed framework on the CNNDM dataset and show improved performance compared to existing methods.
HydraSum: Disentangling Style Features in Text Summarization with Multi-Decoder Models Tanya Goyal, Nazneen Rajani, Wenhao Liu, Wojciech Kryscinski EMNLP 2022 [pdf] [code]

[Abs]
Summarization systems make numerous “decisions” about summary properties during inference, e.g. degree of copying, specificity and length of outputs, etc. However, these are implicitly encoded within model parameters and specific styles cannot be enforced. To address this, we introduce HydraSum, a new summarization architecture that extends the single decoder framework of current models to a mixture-of-experts version with multiple decoders. We show that HydraSum’s multiple decoders automatically learn contrasting summary styles when trained under the standard training objective without any extra supervision. Through experiments on three summarization datasets (CNN, Newsroom and XSum), we show that HydraSum provides a simple mechanism to obtain stylistically-diverse summaries by sampling from either individual decoders or their mixtures, outperforming baseline models. Finally, we demonstrate that a small modification to the gating strategy during training can enforce an even stricter style partitioning, e.g. high- vs low-abstractiveness or high- vs low-specificity, allowing users to sample from a larger area in the generation space and vary summary styles along multiple dimensions.
Socratic Pretraining: Question-Driven Pretraining for Controllable Summarization Artidoro Pagnoni, Alexander R. Fabbri, Wojciech Kryściński, Chien-Sheng Wu [pdf] [code]

[Abs]
In long document controllable summarization, where labeled data is scarce, pretrained models struggle to adapt to the task and effectively respond to user queries. In this paper, we introduce Socratic pretraining, a question-driven, unsupervised pretraining objective specifically designed to improve controllability in summarization tasks. By training a model to generate and answer relevant questions in a given context, Socratic pretraining enables the model to more effectively adhere to user-provided queries and identify relevant content to be summarized. We demonstrate the effectiveness of this approach through extensive experimentation on two summarization domains, short stories and dialogue, and multiple control strategies: keywords, questions, and factoid QA pairs. Our pretraining method relies only on unlabeled documents and a question generation system and outperforms pre-finetuning approaches that use additional supervised data. Furthermore, our results show that Socratic pretraining cuts task-specific labeled data requirements in half, is more faithful to user-provided queries, and achieves state-of-the-art performance on QMSum and SQuALITY.
Attend to the Right Context: A Plug-and-Play Module for Content-Controllable Summarization Wen Xiao, Lesly Miculicich, Yang Liu, Pengcheng He, Giuseppe Carenini [pdf] [code]

[Abs]
Content-Controllable Summarization generates summaries focused on the given controlling signals. Due to the lack of large-scale training corpora for the task, we propose a plug-and-play module RelAttn to adapt any general summarizers to the content-controllable summarization task. RelAttn first identifies the relevant content in the source documents, and then makes the model attend to the right context by directly steering the attention weight. We further apply an unsupervised online adaptive parameter searching algorithm to determine the degree of control in the zero-shot setting, while such parameters are learned in the few-shot setting. By applying the module to three backbone summarization models, experiments show that our method effectively improves all the summarizers, and outperforms the prefix-based method and a widely used plug-and-play model in both zero- and few-shot settings. Tellingly, more benefit is observed in the scenarios when more control is needed.
MACSUM: Controllable Summarization with Mixed Attributes Yusen Zhang, Yang Liu, Ziyi Yang, Yuwei Fang, Yulong Chen, Dragomir Radev, Chenguang Zhu, Michael Zeng, Rui Zhang [pdf] [code]

[Abs]
Controllable summarization allows users to generate customized summaries with specified attributes. However, due to the lack of designated annotations of controlled summaries, existing works have to craft pseudo datasets by adapting generic summarization benchmarks. Furthermore, most research focuses on controlling single attributes individually (e.g., a short summary or a highly abstractive summary) rather than controlling a mix of attributes together (e.g., a short and highly abstractive summary). In this paper, we propose MACSum, the first human-annotated summarization dataset for controlling mixed attributes. It contains source texts from two domains, news articles and dialogues, with human-annotated summaries controlled by five designed attributes (Length, Extractiveness, Specificity, Topic, and Speaker). We propose two simple and effective parameter-efficient approaches for the new task of mixed controllable summarization based on hard prompt tuning and soft prefix tuning. Results and analysis demonstrate that hard prompt models yield the best performance on all metrics and human evaluations. However, mixed-attribute control is still challenging for summarization tasks. Our dataset and code are publicly available.
SentBS: Sentence-level Beam Search for Controllable Summarization Chenhui Shen, Liying Cheng, Lidong Bing, Yang You, Luo Si EMNLP 2022 [pdf] [code]

[Abs]
A wide range of control perspectives have been explored in controllable text generation. Structure-controlled summarization is recently proposed as a useful and interesting research direction. However, current structure-controlling methods have limited effectiveness in enforcing the desired structure. To address this limitation, we propose a sentence-level beam search generation method (SentBS), where evaluation is conducted throughout the generation process to select suitable sentences for subsequent generations. We experiment with different combinations of decoding methods to be used as subcomponents by SentBS and evaluate results on the structure-controlled dataset MReD. Experiments show that all explored combinations for SentBS can improve the agreement between the generated text and the desired structure, with the best method significantly reducing the structural discrepancies suffered by the existing model, by approximately 68%.
Readability Controllable Biomedical Document Summarization Findings of EMNLP 2022 [pdf]

[Abs]
Different from general documents, it is recognised that the ease with which people can understand a biomedical text is eminently varied, owing to the highly technical nature of biomedical documents and the variance of readers' domain knowledge. However, existing biomedical document summarization systems have paid little attention to readability control, leaving users with summaries that are incompatible with their levels of expertise. In recognition of this urgent demand, we introduce a new task of readability controllable summarization for biomedical documents, which aims to recognise users' readability demands and generate summaries that better suit their needs: technical summaries for experts and plain language summaries (PLS) for laymen. To establish this task, we construct a corpus consisting of biomedical papers with technical summaries and PLSs written by the authors, and benchmark multiple advanced controllable abstractive and extractive summarization models based on pre-trained language models (PLMs) with prevalent controlling and generation techniques. Moreover, we propose a novel masked language model (MLM) based metric and its variant to effectively evaluate the readability discrepancy between lay and technical summaries. Experimental results from automated and human evaluations show that though current control techniques allow for a certain degree of readability adjustment during generation, the performance of existing controllable summarization methods is far from desirable in this task.
EDU-level Extractive Summarization with Varying Summary Lengths Yuping Wu, Ching-Hsun Tseng, Jiayu Shang, Shengzhong Mao, Goran Nenadic, Xiao-Jun Zeng [pdf]

[Abs]
Extractive models usually formulate text summarization as extracting the top-k important sentences from a document as the summary. Few works have exploited extracting the finer-grained Elementary Discourse Unit (EDU), and there is little analysis and justification for the extractive unit selection. To fill this gap, this paper first conducts an oracle analysis to compare the upper bound of performance for models based on EDUs and sentences. The analysis provides evidence from both theoretical and experimental perspectives to justify that EDUs make more concise and precise summaries than sentences without losing salient information. Then, considering this merit of EDUs, this paper further proposes an EDU-level extractive model with Varying summary Lengths (EDU-VL) and develops the corresponding learning algorithm. EDU-VL learns to encode and predict probabilities of EDUs in a document, to encode EDU-level candidate summaries with different lengths based on various k values, and to select the best candidate summary in an end-to-end training manner. Finally, the proposed approach is evaluated on single- and multi-document benchmark datasets and shows improved performance in comparison with state-of-the-art models.
Topic-Aware Evaluation and Transformer Methods for Topic-Controllable Summarization Tatiana Passali, Grigorios Tsoumakas [pdf] [code]

[Abs]
Topic-controllable summarization is an emerging research area with a wide range of potential applications. However, existing approaches suffer from significant limitations. First, there is currently no established evaluation metric for this task. Furthermore, existing methods are built upon recurrent architectures, which can significantly limit their performance compared to more recent Transformer-based architectures, while they also require modifications to the model's architecture for controlling the topic. In this work, we propose a new topic-oriented evaluation measure to automatically evaluate the generated summaries based on the topic affinity between the generated summary and the desired topic. We also conducted a user study that validates the reliability of this measure. Finally, we propose simple, yet powerful methods for topic-controllable summarization either incorporating topic embeddings into the model's architecture or employing control tokens to guide the summary generation. Experimental results show that control tokens can achieve better performance compared to more complicated embedding-based approaches while being at the same time significantly faster.
Length Control in Abstractive Summarization by Pretraining Information Selection Yizhu Liu, Qi Jia, Kenny Zhu ACL 2022 [pdf] [code]

[Abs]
Previous length-controllable summarization models mostly control lengths at the decoding stage, whereas the encoding or the selection of information from the source document is not sensitive to the designed length. They also tend to generate summaries as long as those in the training data. In this paper, we propose a length-aware attention mechanism (LAAM) to adapt the encoding of the source based on the desired length. Our approach works by training LAAM on a summary length balanced dataset built from the original training data, and then fine-tuning as usual. Results show that this approach is effective in generating high-quality summaries with desired lengths and even those short lengths never seen in the original training set.
A Character-Level Length-Control Algorithm for Non-Autoregressive Sentence Summarization Puyuan Liu, Xiang Zhang, Lili Mou [pdf] [code]
EntSUM: A Data Set for Entity-Centric Summarization Mounica Maddela, Mayank Kulkarni, Daniel Preotiuc-Pietro ACL 2022 [pdf] [code] [data]
Reinforced Abstractive Summarization with Adaptive Length Controlling Mingyang Song, Yi Feng, Liping Jing [pdf]
HydraSum -- Disentangling Stylistic Features in Text Summarization using Multi-Decoder Models Tanya Goyal, Nazneen Fatema Rajani, Wenhao Liu, Wojciech Kryściński [pdf]
RetrievalSum: A Retrieval Enhanced Framework for Abstractive Summarization Chenxin An, Ming Zhong, Zhichao Geng, Jianqiang Yang, Xipeng Qiu [pdf]
Aspect-Controllable Opinion Summarization Reinald Kim Amplayo, Stefanos Angelidis, Mirella Lapata EMNLP 2021 [pdf] [code]
Extract, Denoise, and Enforce: Evaluating and Predicting Lexical Constraints for Conditional Text Generation Yuning Mao, Wenchang Ma, Deren Lei, Xiang Ren [pdf] [code]
Planning with Learned Entity Prompts for Abstractive Summarization Shashi Narayan, Yao Zhao, Joshua Maynez, Gonçalo Simoes, Ryan McDonald TACL [pdf]
GSum: A General Framework for Guided Neural Abstractive Summarization Zi-Yi Dou, Pengfei Liu, Hiroaki Hayashi, Zhengbao Jiang, Graham Neubig NAACL21 [pdf] [code]
Abstractive summarization with combination of pre-trained sequence-to-sequence and saliency models Itsumi Saito, Kyosuke Nishida, Kosuke Nishida, Junji Tomita [pdf]
Self-Supervised and Controlled Multi-Document Opinion Summarization Hady Elsahar, Maximin Coavoux, Jos Rozen, Matthias Gallé EACL 2021 [pdf]
Controllable Summarization with Constrained Markov Decision Process Hou Pong Chan, Lu Wang, Irwin King TACL 2021 [pdf] [code]
LenAtten: An Effective Length Controlling Unit For Text Summarization Zhongyi Yu, Zhenghao Wu, Hao Zheng, Zhe XuanYuan, Jefferson Fong, Weifeng Su Findings of ACL 2021 (short) [pdf] [code]
Controllable Abstractive Dialogue Summarization with Sketch Supervision Chien-Sheng Wu, Linqing Liu, Wenhao Liu, Pontus Stenetorp, Caiming Xiong ACL-Findings 2021 [pdf] [code]
Enhancing Factual Consistency of Abstractive Summarization Chenguang Zhu, William Hinthorn, Ruochen Xu, Qingkai Zeng, Michael Zeng, Xuedong Huang, Meng Jiang NAACL21 [pdf]
Inference Time Style Control for Summarization Shuyang Cao, Lu Wang NAACL21 short [pdf] [code]
CTRLsum: Towards Generic Controllable Text Summarization Junxian He, Wojciech Kryściński, Bryan McCann, Nazneen Rajani, Caiming Xiong [pdf] [code]
Constrained Abstractive Summarization: Preserving Factual Consistency with Constrained Generation Yuning Mao, Xiang Ren, Heng Ji, Jiawei Han [pdf]
Keywords-Guided Abstractive Sentence Summarization Haoran Li, Junnan Zhu, Jiajun Zhang, Chengqing Zong, Xiaodong He AAAI20 [pdf]
SemSUM: Semantic Dependency Guided Neural Abstractive Summarization Hanqi Jin, Tianming Wang, Xiaojun Wan AAAI2020 [pdf] [code]
Interpretable Multi-Headed Attention for Abstractive Summarization at Controllable Lengths Ritesh Sarkhel, Moniba Keymanesh, Arnab Nandi, Srinivasan Parthasarathy COLING20 [pdf]
Controllable Abstractive Sentence Summarization with Guiding Entities Changmeng Zheng, Yi Cai, Guanjie Zhang, Qing Li COLING20 [pdf] [code]
Summarizing Text on Any Aspects: A Knowledge-Informed Weakly-Supervised Approach Bowen Tan, Lianhui Qin, Eric P. Xing, Zhiting Hu EMNLP20 Short [pdf] [code]
Length-controllable Abstractive Summarization by Guiding with Summary Prototype Itsumi Saito, Kyosuke Nishida, Kosuke Nishida, Atsushi Otsuka, Hisako Asano, Junji Tomita, Hiroyuki Shindo, Yuji Matsumoto [pdf]
The Summary Loop: Learning to Write Abstractive Summaries Without Examples Philippe Laban, Andrew Hsi, John Canny, Marti A. Hearst ACL20 [pdf]
Hooks in the Headline: Learning to Generate Headlines with Controlled Styles Di Jin, Zhijing Jin, Joey Tianyi Zhou, Lisa Orii, Peter Szolovits ACL20 [pdf] [code]
BiSET: Bi-directional Selective Encoding with Template for Abstractive Summarization Kai Wang, Xiaojun Quan, Rui Wang ACL19 [pdf] [code]
Improving Abstractive Document Summarization with Salient Information Modeling Yongjian You, Weijia Jia, Tianyi Liu, Wenmian Yang ACL19 [pdf] [code]
Positional Encoding to Control Output Sequence Length Sho Takase, Naoaki Okazaki NAACL19 [pdf] [code]
Query Focused Abstractive Summarization: Incorporating Query Relevance, Multi-Document Coverage, and Summary Length Constraints into seq2seq Models Tal Baumel, Matan Eyal, Michael Elhadad [pdf]
Guiding Generation for Abstractive Text Summarization based on Key Information Guide Network Chenliang Li, Weiran Xu, Si Li, Sheng Gao NAACL18 [pdf]
Controllable Abstractive Summarization Angela Fan, David Grangier, Michael Auli ACL2018 Workshop [pdf]
Retrieve, Rerank and Rewrite: Soft Template Based Neural Summarization Ziqiang Cao, Wenjie Li, Sujian Li, Furu Wei ACL18 [pdf]
Controlling Length in Abstractive Summarization Using a Convolutional Neural Network Yizhu Liu, Zhiyi Luo, Kenny Zhu EMNLP18 [pdf] [code]
Generating Wikipedia By Summarizing Long Sequence Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, Noam Shazeer ICLR18 [pdf] [code]
Controlling Output Length in Neural Encoder-Decoders Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, Manabu Okumura EMNLP16 [pdf] [code]

Abstractive

Exploiting Summarization Data to Help Text Simplification Renliang Sun, Zhixian Yang, Xiaojun Wan EACL 2023 [pdf] [code]

[Abs]
One of the major problems with text simplification is the lack of high-quality data. The sources of simplification datasets are limited to Wikipedia and Newsela, restricting further development of this field. In this paper, we analyzed the similarity between text summarization and text simplification and exploited summarization data to help simplify. First, we proposed an alignment algorithm to extract sentence pairs from summarization datasets. Then, we designed four attributes to characterize the degree of simplification and proposed a method to filter suitable pairs. We named these pairs Sum4Simp (S4S). Next, we conducted human evaluations to show that S4S is high-quality and compared it with a real simplification dataset. Finally, we conducted experiments to illustrate that the S4S can improve the performance of several mainstream simplification models, especially in low-resource scenarios.
Curriculum-Guided Abstractive Summarization Sajad Sotudeh, Hanieh Deilamsalehy, Franck Dernoncourt, Nazli Goharian [pdf]

[Abs]
Recent Transformer-based summarization models have provided a promising approach to abstractive summarization. They go beyond sentence selection and extractive strategies to deal with more complicated tasks such as novel word generation and sentence paraphrasing. Nonetheless, these models have two shortcomings: (1) they often perform poorly in content selection, and (2) their training strategy is not quite efficient, which restricts model performance. In this paper, we explore two orthogonal ways to compensate for these pitfalls. First, we augment the Transformer network with a sentence cross-attention module in the decoder, encouraging more abstraction of salient content. Second, we include a curriculum learning approach to reweight the training samples, bringing about an efficient learning procedure. Our second approach to enhance the training strategy of Transformers networks makes stronger gains as compared to the first approach. We apply our model on extreme summarization dataset of Reddit TIFU posts. We further look into three cross-domain summarization datasets (Webis-TLDR-17, CNN/DM, and XSum), measuring the efficacy of curriculum learning when applied in summarization. Moreover, a human evaluation is conducted to show the efficacy of the proposed method in terms of qualitative criteria, namely, fluency, informativeness, and overall quality.
R-TeaFor: Regularized Teacher-Forcing for Abstractive Summarization Guan-Yu Lin, Pu-Jen Cheng EMNLP 2022 [pdf]

[Abs]
Teacher-forcing is widely used in training sequence generation models to improve sampling efficiency and to stabilize training. However, teacher-forcing is vulnerable to the exposure bias problem. Previous works have attempted to address exposure bias by modifying the training data to simulate model-generated results. Nevertheless, they do not consider the pairwise relationship between the original training data and the modified ones, which provides more information during training. Hence, we propose Regularized Teacher-Forcing (R-TeaFor) to utilize this relationship for better regularization. Empirically, our experiments show that R-TeaFor outperforms previous summarization state-of-the-art models, and the results can be generalized to different pre-trained models.
Improving abstractive summarization with energy-based re-ranking Diogo Pernes, Afonso Mendes, André F.T. Martins GEM at EMNLP 2022 [pdf] [code]

[Abs]
Current abstractive summarization systems present important weaknesses which prevent their deployment in real-world applications, such as the omission of relevant information and the generation of factual inconsistencies (also known as hallucinations). At the same time, automatic evaluation metrics such as CTC scores have been recently proposed that exhibit a higher correlation with human judgments than traditional lexical-overlap metrics such as ROUGE. In this work, we intend to close the loop by leveraging the recent advances in summarization metrics to create quality-aware abstractive summarizers. Namely, we propose an energy-based model that learns to re-rank summaries according to one or a combination of these metrics. We experiment using several metrics to train our energy-based re-ranker and show that it consistently improves the scores achieved by the predicted summaries. Nonetheless, human evaluation results show that the re-ranking approach should be used with care for highly abstractive summaries, as the available metrics are not yet sufficiently reliable for this purpose.
Salience Allocation as Guidance for Abstractive Summarization Fei Wang, Kaiqiang Song, Hongming Zhang, Lifeng Jin, Sangwoo Cho, Wenlin Yao, Xiaoyang Wang, Muhao Chen, Dong Yu EMNLP 2022 [pdf] [code]

[Abs]
Abstractive summarization models typically learn to capture the salient information from scratch implicitly. Recent literature adds extractive summaries as guidance for abstractive summarization models to provide hints of salient content and achieves better performance. However, extractive summaries as guidance could be over strict, leading to information loss or noisy signals. Furthermore, it cannot easily adapt to documents with various abstractiveness. As the number and allocation of salience content pieces vary, it is hard to find a fixed threshold deciding which content should be included in the guidance. In this paper, we propose a novel summarization approach with a flexible and reliable salience guidance, namely SEASON (SaliencE Allocation as Guidance for Abstractive SummarizatiON). SEASON utilizes the allocation of salience expectation to guide abstractive summarization and adapts well to articles in different abstractiveness. Automatic and human evaluations on two benchmark datasets show that the proposed method is effective and reliable. Empirical results on more than one million news articles demonstrate a natural fifteen-fifty salience split for news article sentences, providing a useful insight for composing news articles.
Towards Summary Candidates Fusion Mathieu Ravaut, Shafiq Joty, Nancy F. Chen EMNLP 2022 [pdf] [code]

[Abs]
Sequence-to-sequence deep neural models fine-tuned for abstractive summarization can achieve great performance on datasets with enough human annotations. Yet, it has been shown that they have not reached their full potential, with a wide gap between the top beam search output and the oracle beam. Recently, re-ranking methods have been proposed, to learn to select a better summary candidate. However, such methods are limited by the summary quality aspects captured by the first-stage candidates. To bypass this limitation, we propose a new paradigm in second-stage abstractive summarization called SummaFusion that fuses several summary candidates to produce a novel abstractive second-stage summary. Our method works well on several summarization datasets, improving both the ROUGE scores and qualitative properties of fused summaries. It is especially good when the candidates to fuse are worse, such as in the few-shot setup where we set a new state-of-the-art. We will make our code and checkpoints available at this https URL.
Generation of Patient After-Visit Summaries to Support Physicians Pengshan Cai, Fei Liu, Adarsha Bajracharya, Joe Sills, Alok Kapoor, Weisong Liu, Dan Berlowitz, David Levy, Richeek Pradhan, Hong Yu [pdf] [code]

[Abs]
An after-visit summary (AVS) is a summary note given to patients after their clinical visit. It recaps what happened during their clinical visit and guides patients’ disease self-management. Studies have shown that a majority of patients found after-visit summaries useful. However, many physicians face excessive workloads and do not have time to write clear and informative summaries. In this paper, we study the problem of automatic generation of after-visit summaries and examine whether those summaries can convey the gist of clinical visits. We report our findings on a new clinical dataset that contains a large number of electronic health record (EHR) notes and their associated summaries. Our results suggest that generation of lay language after-visit summaries remains a challenging task. Crucially, we introduce a feedback mechanism that alerts physicians when an automatic summary fails to capture the important details of the clinical notes or when it contains hallucinated facts that are potentially detrimental to the summary quality. Automatic and human evaluation demonstrates the effectiveness of our approach in providing writing feedback and supporting physicians.
ArgLegalSumm: Improving Abstractive Summarization of Legal Documents with Argument Mining Mohamed Elaraby, Diane Litman COLING 2022 [pdf] [code]

[Abs]
A challenging task when generating summaries of legal documents is the ability to address their argumentative nature. We introduce a simple technique to capture the argumentative structure of legal documents by integrating argument role labeling into the summarization process. Experiments with pretrained language models show that our proposed approach improves performance over strong baselines.
Source-summary Entity Aggregation in Abstractive Summarization José Ángel González, Annie Louis, Jackie Chi Kit Cheung COLING 2022 [pdf] [code]

[Abs]
In a text, entities mentioned earlier can be referred to in later discourse by a more general description. For example, Celine Dion and Justin Bieber can be referred to by Canadian singers or celebrities. In this work, we study this phenomenon in the context of summarization, where entities from a source text are generalized in the summary. We call such instances source-summary entity aggregations. We categorize these aggregations into two types and analyze them in the Cnn/Dailymail corpus, showing that they are reasonably frequent. We then examine how well three state-of-the-art summarization systems can generate such aggregations within summaries. We also develop techniques to encourage them to generate more aggregations. Our results show that there is significant room for improvement in producing semantically correct aggregations.
Summarizing Patients Problems from Hospital Progress Notes Using Pre-trained Sequence-to-Sequence Models Yanjun Gao, Dmitry Dligach, Timothy Miller, Dongfang Xu, Matthew M. Churpek, Majid Afshar COLING 2022 [pdf]

[Abs]
Automatically summarizing patients' main problems from daily progress notes using natural language processing methods helps to battle against information and cognitive overload in hospital settings and potentially assists providers with computerized diagnostic decision support. Problem list summarization requires a model to understand, abstract, and generate clinical documentation. In this work, we propose a new NLP task that aims to generate a list of problems in a patient's daily care plan using input from the provider's progress notes during hospitalization. We investigate the performance of T5 and BART, two state-of-the-art seq2seq transformer architectures, in solving this problem. We provide a corpus built on top of progress notes from publicly available electronic health record progress notes in the Medical Information Mart for Intensive Care (MIMIC)-III. T5 and BART are trained on general domain text, and we experiment with a data augmentation method and a domain adaptation pre-training method to increase exposure to medical vocabulary and knowledge. Evaluation methods include ROUGE, BERTScore, cosine similarity on sentence embedding, and F-score on medical concepts. Results show that T5 with domain adaptive pre-training achieves significant performance gains compared to a rule-based system and general domain pre-trained language models, indicating a promising direction for tackling the problem summarization task.
Semantic-Preserving Abstractive Text Summarization with Siamese Generative Adversarial Net Xin Sheng, Linli Xu, Yinlong Xu, Deqiang Jiang, Bo Ren Findings of NAACL 2022 [pdf]

[Abs]
We propose a novel siamese generative adversarial net for abstractive text summarization (SSPGAN), which can preserve the main semantics of the source text. Different from previous generative adversarial net based methods, SSPGAN is equipped with a siamese semantic-preserving discriminator, which can not only be trained to discriminate the machine-generated summaries from the human-summarized ones, but also ensure the semantic consistency between the source text and target summary. As a consequence of the min-max game between the generator and the siamese semantic-preserving discriminator, the generator can generate a summary that conveys the key content of the source text more accurately. Extensive experiments on several text summarization benchmarks in different languages demonstrate that the proposed model can achieve significant improvements over the state-of-the-art methods.
ExtraPhrase: Efficient Data Augmentation for Abstractive Summarization Mengsay Loem, Sho Takase, Masahiro Kaneko, Naoaki Okazaki NAACL 2022 Student Research Workshop [pdf] [code]

[Abs]
Neural models trained with large amounts of parallel data have achieved impressive performance in abstractive summarization tasks. However, large-scale parallel corpora are expensive and challenging to construct. In this work, we introduce a low-cost and effective strategy, ExtraPhrase, to augment training data for abstractive summarization tasks. ExtraPhrase constructs pseudo training data in two steps: extractive summarization and paraphrasing. We extract major parts of an input text in the extractive summarization step and obtain its diverse expressions with the paraphrasing step. Through experiments, we show that ExtraPhrase improves the performance of abstractive summarization tasks by more than 0.50 points in ROUGE scores compared to the setting without data augmentation. ExtraPhrase also outperforms existing methods such as back-translation and self-training. We also show that ExtraPhrase is significantly effective when the amount of genuine training data is remarkably small, i.e., a low-resource setting. Moreover, ExtraPhrase is more cost-efficient than the existing approaches.
BRIO: Bringing Order to Abstractive Summarization Yixin Liu, Pengfei Liu, Dragomir Radev, Graham Neubig ACL 2022 [pdf] [code]

[Abs]
Abstractive summarization models are commonly trained using maximum likelihood estimation, which assumes a deterministic (one-point) target distribution in which an ideal model will assign all the probability mass to the reference summary. This assumption may lead to performance degradation during inference, where the model needs to compare several system-generated (candidate) summaries that have deviated from the reference summary. To address this problem, we propose a novel training paradigm which assumes a non-deterministic distribution so that different candidate summaries are assigned probability mass according to their quality. Our method achieves a new state-of-the-art result on the CNN/DailyMail (47.78 ROUGE-1) and XSum (49.07 ROUGE-1) datasets. Further analysis also shows that our model can estimate probabilities of candidate summaries that are more correlated with their level of quality.
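
The idea of assigning probability mass to candidates according to their quality can be illustrated with a pairwise margin ranking loss. This is a minimal sketch and not the authors' code: the abstract does not give the loss, so the functional form, the `margin` value, and the name `ranking_loss` are all assumptions made for illustration.

```python
def ranking_loss(scores, margin=0.01):
    """Pairwise margin ranking loss over candidate summaries.

    `scores` holds the model's score (for example a length-normalized
    log-probability) for each candidate, with candidates already sorted
    from best to worst by a quality metric such as ROUGE.
    The loss is zero when every better candidate outscores every worse
    one by at least (rank gap * margin). Hypothetical formulation.
    """
    loss = 0.0
    n = len(scores)
    for i in range(n):
        for j in range(i + 1, n):
            # Candidate i is better than candidate j, so penalize the
            # model whenever j is scored too close to (or above) i.
            loss += max(0.0, scores[j] - scores[i] + (j - i) * margin)
    return loss


if __name__ == "__main__":
    well_ordered = [-0.1, -0.5, -0.9]   # model agrees with the quality ranking
    mis_ordered = [-0.9, -0.5, -0.1]    # model prefers the worst candidate
    print(ranking_loss(well_ordered, margin=0.1))
    print(ranking_loss(mis_ordered, margin=0.1))
```

In practice such a term would typically be combined with the usual likelihood loss; the weighting between the two is not specified here.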
SummaReranker: A Multi-Task Mixture-of-Experts Re-ranking Framework for Abstractive Summarization Mathieu Ravaut, Shafiq Joty, Nancy F. Chen ACL 2022 [pdf] [code]

[Abs]
Sequence-to-sequence neural networks have recently achieved great success in abstractive summarization, especially through fine-tuning large pre-trained language models on the downstream dataset. These models are typically decoded with beam search to generate a unique summary. However, the search space is very large, and with the exposure bias, such decoding is not optimal. In this paper, we show that it is possible to directly train a second-stage model performing re-ranking on a set of summary candidates. Our mixture-of-experts SummaReranker learns to select a better candidate and consistently improves the performance of the base model. With a base PEGASUS, we push ROUGE scores by 5.44% on CNN-DailyMail (47.16 ROUGE-1), 1.31% on XSum (48.12 ROUGE-1) and 9.34% on Reddit TIFU (29.83 ROUGE-1), reaching a new state-of-the-art. Our code and checkpoints will be available at https://github.com/ntunlp/SummaReranker.
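
Second-stage re-ranking can be sketched as sorting a pool of candidates by a scoring function. The sketch below is only illustrative: `unigram_f1` and `rerank` are hypothetical names, and the scorer peeks at the reference, so it corresponds to oracle selection rather than to the trained mixture-of-experts re-ranker described in the abstract.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Toy stand-in for a summary quality metric: unigram overlap F1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def rerank(candidates, scorer):
    """Return candidates sorted from highest to lowest score."""
    return sorted(candidates, key=scorer, reverse=True)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    # Pretend these came from beam search, in beam order.
    candidates = ["the cat sat", "a dog barked loudly", "the cat sat on the mat"]
    ranked = rerank(candidates, lambda c: unigram_f1(c, reference))
    print("top beam:", candidates[0])
    print("oracle  :", ranked[0])
```

A learned re-ranker would replace the lambda with a model that scores each candidate from the source document alone, since no reference is available at inference time.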
Adaptive Beam Search to Enhance On-device Abstractive Summarization Harichandana B S S, Sumit Kumar IEEE INDICON 2021 [pdf]
PLSUM: Generating PT-BR Wikipedia by Summarizing Multiple Websites André Seidel Oliveira, Anna Helena Reali Costa ENIAC 2021 [pdf]
Pointer over Attention: An Improved Bangla Text Summarization Approach Using Hybrid Pointer Generator Network Nobel Dhar, Gaurob Saha, Prithwiraj Bhattacharjee, Avi Mallick, Md Saiful Islam [pdf]
Template-aware Attention Model for Earnings Call Report Generation Yangchen Huang, Prashant K. Dhingra, Seyed Danial Mohseni Taheri EMNLP 2021| newsum [pdf]
Rewards with Negative Examples for Reinforced Topic-Focused Abstractive Summarization Khalil Mrini, Can Liu, Markus Dreyer EMNLP 2021| newsum [pdf]
Knowledge and Keywords Augmented Abstractive Sentence Summarization Shuo Guan, Ping Zhu, Zhihua Wei EMNLP 2021| newsum [pdf] [code]
Sentence-level Planning for Especially Abstractive Summarization Andreas Marfurt, James Henderson EMNLP 2021| newsum [pdf] [code]
Learn to Copy from the Copying History: Correlational Copy Network for Abstractive Summarization Haoran Li, Song Xu, Peng Yuan, Yujia Wang, Youzheng Wu, Xiaodong He, Bowen Zhou EMNLP 2021 [pdf] [code]
Enhance Long Text Understanding via Distilled Gist Detector from Abstractive Summarization Yan Liu, Yazheng Yang [pdf]
VieSum: How Robust Are Transformer-based Models on Vietnamese Summarization? Hieu Nguyen, Long Phan, James Anibal, Alec Peltekian, Hieu Tran [pdf]
Enriching and Controlling Global Semantics for Text Summarization Thong Nguyen, Anh Tuan Luu, Truc Lu, Tho Quan EMNLP 2021 [pdf]
Augmented Abstractive Summarization With Document-Level Semantic Graph Qiwei Bi, Haoyuan Li, Kun Lu, Hanfang Yang Journal of Data Science [pdf]
ARMAN: Pre-training with Semantically Selecting and Reordering of Sentences for Persian Abstractive Summarization Alireza Salemi, Emad Kebriaei, Ghazal Neisi Minaei, Azadeh Shakery [pdf] [data]
Subjective Bias in Abstractive Summarization Lei Li, Wei Liu, Marina Litvak, Natalia Vanetik, Jiacheng Pei, Yinan Liu, Siya Qi [pdf] [code]
Neural Abstractive Unsupervised Summarization of Online News Discussions Ignacio Tampe Palma, Marcelo Mendoza, Evangelos Milios [pdf]
Attention Temperature Matters in Abstractive Summarization Distillation Shengqiang Zhang, Xingxing Zhang, Hangbo Bao, Furu Wei ACL 2022 [pdf] [code]

[Abs]
Recent progress of abstractive text summarization largely relies on large pre-trained sequence-to-sequence Transformer models, which are computationally expensive. This paper aims to distill these large models into smaller ones for faster inference and with minimal performance loss. Pseudo-labeling based methods are popular in sequence-to-sequence model distillation. In this paper, we find simply manipulating attention temperatures in Transformers can make pseudo labels easier to learn for student models. Our experiments on three summarization datasets show our proposed method consistently improves vanilla pseudo-labeling based methods. Further empirical analysis shows that both pseudo labels and summaries produced by our students are shorter and more abstractive.
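
To make the notion of an attention temperature concrete, the sketch below shows how a temperature divides the raw attention scores before the softmax. It is a generic illustration under stated assumptions: the function name `attention_weights` is hypothetical, and where the temperature is applied in the teacher and which values are used are not given in the abstract.

```python
import math


def attention_weights(scores, temperature=1.0):
    """Softmax over raw attention scores with a temperature.

    Higher temperatures flatten the distribution (attention spreads over
    more positions); lower temperatures sharpen it.
    """
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    raw = [2.0, 1.0, 0.0]
    print(attention_weights(raw, temperature=1.0))
    print(attention_weights(raw, temperature=2.0))  # smoother weights
```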
BASS: Boosting Abstractive Summarization with Unified Semantic Graph Wenhao Wu, Wei Li, Xinyan Xiao, Jiachen Liu, Ziqiang Cao, Sujian Li, Hua Wu, Haifeng Wang ACL21 [pdf]
Enriching Transformers with Structured Tensor-Product Representations for Abstractive Summarization Yichen Jiang, Asli Celikyilmaz, Paul Smolensky, Paul Soulos, Sudha Rao, Hamid Palangi, Roland Fernandez, Caitlin Smith, Mohit Bansal, Jianfeng Gao NAACL21 [pdf] [code]
Uncertainty-Aware Abstractive Summarization Alexios Gidiotis, Grigorios Tsoumakas [pdf]
What's in a Summary? Laying the Groundwork for Advances in Hospital-Course Summarization Griffin Adams, Emily Alsentzer, Mert Ketenci, Jason Zucker, Noémie Elhadad NAACL21 [pdf]
Generating abstractive summaries of Lithuanian news articles using a transformer model Lukas Stankevičius, Mantas Lukoševičius [pdf]
Summarization, Simplification, and Generation: The Case of Patents Silvia Casola, Alberto Lavelli [pdf]
Quantifying Appropriateness of Summarization Data for Curriculum Learning Ryuji Kano, Takumi Takahashi, Toru Nishino, Motoki Taniguchi, Tomoki Taniguchi, Tomoko Ohkuma EACL21 [pdf]
Text Summarization of Czech News Articles Using Named Entities Petr Marek, Štěpán Müller, Jakub Konrád, Petr Lorenc, Jan Pichl, Jan Šedivý Journal [pdf]
Planning with Entity Chains for Abstractive Summarization Shashi Narayan, Yao Zhao, Joshua Maynez, Gonçalo Simoes, Ryan McDonald [pdf]
Attention Head Masking for Inference Time Content Selection in Abstractive Summarization Shuyang Cao, Lu Wang NAACL21 short [pdf] [code]
A New Approach to Overgenerating and Scoring Abstractive Summaries Kaiqiang Song, Bingqing Wang, Zhe Feng, Fei Liu NAACL21 [pdf] [code]
Exploring Explainable Selection to Control Abstractive Summarization Wang Haonan, Gao Yang, Bai Yu, Mirella Lapata, Huang Heyan AAAI21 [pdf] [code]
Friendly Topic Assistant for Transformer Based Abstractive Summarization Zhengjue Wang, Zhibin Duan, Hao Zhang, Chaojie Wang, Long Tian, Bo Chen, Mingyuan Zhou EMNLP20 [pdf] [code]
Neural Abstractive Text Summarizer for Telugu Language Mohan Bharath B, Aravindh Gowtham B, Akhil M ICSCSP20 [pdf]
Topic-Aware Abstractive Text Summarization Chujie Zheng, Kunpeng Zhang, Harry Jiannan Wang, Ling Fan [pdf] [code]
Multi-hop Inference for Question-driven Summarization Yang Deng, Wenxuan Zhang, Wai Lam EMNLP20 [pdf]
Quantitative Argument Summarization and Beyond: Cross-Domain Key Point Analysis Roy Bar-Haim, Yoav Kantor, Lilach Eden, Roni Friedman, Dan Lahav, Noam Slonim EMNLP20 [pdf]
Learning to Fuse Sentences with Transformers for Summarization Logan Lebanoff, Franck Dernoncourt, Doo Soon Kim, Lidan Wang, Walter Chang, Fei Liu EMNLP20 short [pdf] [code]
A Cascade Approach to Neural Abstractive Summarization with Content Selection and Fusion Logan Lebanoff, Franck Dernoncourt, Doo Soon Kim, Walter Chang, Fei Liu AACL20 [pdf] [code]
AutoSurvey: Automatic Survey Generation based on a Research Draft Hen-Hsen Huang IJCAI20 [pdf] [code]
Neural Abstractive Summarization with Structural Attention Tanya Chowdhury, Sachin Kumar, Tanmoy Chakraborty IJCAI20 [pdf]
A Unified Model for Financial Event Classification, Detection and Summarization Quanzhi Li, Qiong Zhang IJCAI20 Special Track on AI in FinTech [pdf]
Discriminative Adversarial Search for Abstractive Summarization Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano ICML20 [pdf]
Controlling the Amount of Verbatim Copying in Abstractive Summarization Kaiqiang Song, Bingqing Wang, Zhe Feng, Liu Ren, Fei Liu AAAI20 [pdf] [code]
GRET: Global Representation Enhanced Transformer Rongxiang Weng, Haoran Wei, Shujian Huang, Heng Yu, Lidong Bing, Weihua Luo, Jiajun Chen AAAI20 [pdf]
Abstractive Summarization of Spoken and Written Instructions with BERT Alexandra Savelieva, Bryan Au-Yeung, Vasanth Ramani KDD Converse 2020 [pdf]
Concept Pointer Network for Abstractive Summarization Wang Wenbo, Gao Yang, Huang Heyan, Zhou Yuxiang EMNLP19 [pdf] [code]
Co-opNet: Cooperative Generator–Discriminator Networks for Abstractive Summarization with Narrative Flow Saadia Gabriel, Antoine Bosselut, Ari Holtzman, Kyle Lo, Asli Celikyilmaz, Yejin Choi [pdf]
Contrastive Attention Mechanism for Abstractive Sentence Summarization Xiangyu Duan, Hongfei Yu, Mingming Yin, Min Zhang, Weihua Luo, Yue Zhang EMNLP19 [pdf] [code]
An Entity-Driven Framework for Abstractive Summarization Eva Sharma, Luyang Huang, Zhe Hu, Lu Wang EMNLP19 [pdf] [code]
Abstract Text Summarization: A Low Resource Challenge Shantipriya Parida, Petr Motlicek EMNLP19 [pdf] [code]
Attention Optimization for Abstractive Document Summarization Min Gui, Junfeng Tian, Rui Wang, Zhenglu Yang EMNLP19 [pdf] [code]
Scoring Sentence Singletons and Pairs for Abstractive Summarization Logan Lebanoff, Kaiqiang Song, Franck Dernoncourt, Doo Soon Kim, Seokhwan Kim, Walter Chang, Fei Liu ACL19 [pdf] [code]
Inducing Document Structure for Aspect-based Summarization Lea Frermann, Alexandre Klementiev ACL19 [pdf] [code]
Generating Summaries with Topic Templates and Structured Convolutional Decoders Laura Perez-Beltrachini, Yang Liu, Mirella Lapata ACL19 [pdf] [code]
Summary Refinement through Denoising Nikola I. Nikolov, Alessandro Calmanovici, Richard H.R. Hahnloser RANLP19 [pdf] [code]
Closed-Book Training to Improve Summarization Encoder Memory Yichen Jiang, Mohit Bansal EMNLP18 [pdf]
Improving Neural Abstractive Document Summarization with Structural Regularization Wei Li, Xinyan Xiao, Yajuan Lyu, Yuanzhuo Wang EMNLP18 [pdf]
Bottom-Up Abstractive Summarization Sebastian Gehrmann, Yuntian Deng, Alexander M. Rush EMNLP18 [pdf] [code]
A Unified Model for Extractive and Abstractive Summarization using Inconsistency Loss Wan-Ting Hsu, Chieh-Kai Lin, Ming-Ying Lee, Kerui Min, Jing Tang, Min Sun ACL18 [pdf]
Soft Layer-Specific Multi-Task Summarization with Entailment and Question Generation Han Guo, Ramakanth Pasunuru, Mohit Bansal ACL18 [pdf]
Abstractive Document Summarization via Bidirectional Decoder Xin Wan, Chen Li, Ruijia Wang, Ding Xiao, Chuan Shi ADMA18 [pdf]
Entity Commonsense Representation for Neural Abstractive Summarization Reinald Kim Amplayo, Seonjae Lim, Seung-won Hwang NAACL18 [pdf]
Get To The Point: Summarization with Pointer-Generator Networks Abigail See, Peter J. Liu, Christopher D. Manning ACL17 [pdf] [code]
Selective Encoding for Abstractive Sentence Summarization Qingyu Zhou, Nan Yang, Furu Wei, Ming Zhou ACL17 [pdf]
Abstractive Document Summarization with a Graph-Based Attentional Neural Model Jiwei Tan, Xiaojun Wan, Jianguo Xiao ACL17 [pdf]
Toward Abstractive Summarization Using Semantic Representations Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman Sadeh, Noah A. Smith NAACL15 [pdf]
Abstractive Meeting Summarization with Entailment and Fusion Yashar Mehdad, Giuseppe Carenini, Frank Tompa, Raymond T. Ng ENLG13 [pdf]

Graph-Based

Abstractive Summarization Guided by Latent Hierarchical Document Structure Yifu Qiu, Shay B. Cohen EMNLP 2022 [pdf] [code]

[Abs]
Sequential abstractive neural summarizers often do not use the underlying structure in the input article or dependencies between the input sentences. This structure is essential to integrate and consolidate information from different parts of the text. To address this shortcoming, we propose a hierarchy-aware graph neural network (HierGNN) which captures such dependencies through three main steps: 1) learning a hierarchical document structure through a latent structure tree learned by a sparse matrix-tree computation; 2) propagating sentence information over this structure using a novel message-passing node propagation mechanism to identify salient information; 3) using graph-level attention to concentrate the decoder on salient information. Experiments confirm HierGNN improves strong sequence models such as BART, with a 0.55 and 0.75 margin in average ROUGE-1/2/L for CNN/DM and XSum. Further human evaluation demonstrates that summaries produced by our model are more relevant and less redundant than the baselines, into which HierGNN is incorporated. We also find HierGNN synthesizes summaries by fusing multiple source sentences more, rather than compressing a single source sentence, and that it processes long inputs more effectively.
Hierarchical Heterogeneous Graph Attention Network for Syntax-Aware Summarization Zixing Song, Irwin King AAAI 2022 [pdf]
Summarization with Graphical Elements Maartje ter Hoeve, Julia Kiseleva, Maarten de Rijke [pdf] [code]
HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization Ye Liu, Jian-Guo Zhang, Yao Wan, Congying Xia, Lifang He, Philip S. Yu EMNLP 2021 short [pdf]
Centrality Meets Centroid: A Graph-based Approach for Unsupervised Document Summarization Haopeng Zhang, Jiawei Zhang [pdf]
Neural Extractive Summarization with Hierarchical Attentive Heterogeneous Graph Network Ruipeng Jia, Yanan Cao, Hengzhu Tang, Fang Fang, Cong Cao, Shi Wang EMNLP20 [pdf] [code]
Enhancing Extractive Text Summarization with Topic-Aware Graph Neural Networks Peng Cui, Le Hu, Yuanchao Liu COLING20 [pdf]
Heterogeneous Graph Neural Networks for Extractive Document Summarization Danqing Wang, Pengfei Liu, Yining Zheng, Xipeng Qiu, Xuanjing Huang ACL20 [pdf] [code]
Structured Neural Summarization Patrick Fernandes, Miltiadis Allamanis, Marc Brockschmidt ICLR19 [pdf] [code]
Hierarchical Transformers for Multi-Document Summarization Yang Liu, Mirella Lapata ACL19 [pdf] [code]
Learning to Create Sentence Semantic Relation Graphs for Multi-Document Summarization Diego Antognini, Boi Faltings EMNLP19 [pdf]
Graph-based Neural Multi-Document Summarization Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, Dragomir Radev CoNLL17 [pdf]
Abstractive Document Summarization with a Graph-Based Attentional Neural Model Jiwei Tan, Xiaojun Wan, Jianguo Xiao ACL17 [pdf]

Unsupervised

Improving Sentence Similarity Estimation for Unsupervised Extractive Summarization Shichao Sun, Ruifeng Yuan, Wenjie Li, Sujian Li ICASSP 2023 [pdf] [code]

[Abs]
Unsupervised extractive summarization aims to extract salient sentences from a document as the summary without labeled data. Recent work mostly studies how to leverage sentence similarity to rank sentences in the order of salience. However, sentence similarity estimation using pre-trained language models mostly takes little account of document-level information and has a weak correlation with sentence salience ranking. In this paper, we propose two novel strategies to improve sentence similarity estimation for unsupervised extractive summarization. We use contrastive learning to optimize a document-level objective that sentences from the same document are more similar than those from different documents. Moreover, we use mutual learning to enhance the relationship between sentence similarity estimation and sentence salience ranking, where an extra signal amplifier is used to refine the pivotal information. Experimental results demonstrate the effectiveness of our strategies.
Generating Multiple-Length Summaries via Reinforcement Learning for Unsupervised Sentence Summarization Dongmin Hyun, Xiting Wang, Chanyoung Park, Xing Xie, Hwanjo Yu [pdf] [code]

[Abs]
Sentence summarization shortens given texts while maintaining core contents of the texts. Unsupervised approaches have been studied to summarize texts without human-written summaries. However, recent unsupervised models are extractive, which remove words from texts and thus they are less flexible than abstractive summarization. In this work, we devise an abstractive model based on reinforcement learning without ground-truth summaries. We formulate the unsupervised summarization based on the Markov decision process with rewards representing the summary quality. To further enhance the summary quality, we develop a multi-summary learning mechanism that generates multiple summaries with varying lengths for a given text, while making the summaries mutually enhance each other. Experimental results show that the proposed model substantially outperforms both abstractive and extractive models, yet frequently generating new words not contained in input texts.
Referee: Reference-Free Sentence Summarization with Sharper Controllability through Symbolic Knowledge Distillation Melanie Sclar, Peter West, Sachin Kumar, Yulia Tsvetkov, Yejin Choi EMNLP 2022 [pdf] [code]

[Abs]
We present Referee, a novel framework for sentence summarization that can be trained reference-free (i.e., requiring no gold summaries for supervision), while allowing direct control for compression ratio. Our work is the first to demonstrate that reference-free, controlled sentence summarization is feasible via the conceptual framework of Symbolic Knowledge Distillation (West et al., 2022), where latent knowledge in pre-trained language models is distilled via explicit examples sampled from the teacher models, further purified with three types of filters: length, fidelity, and Information Bottleneck. Moreover, we uniquely propose iterative distillation of knowledge, where student models from the previous iteration of distillation serve as teacher models in the next iteration. Starting off from a relatively modest set of GPT3-generated summaries, we demonstrate how iterative knowledge distillation can lead to considerably smaller, but better summarizers with sharper controllability. A useful by-product of this iterative distillation process is a high-quality dataset of sentence-summary pairs with varying degrees of compression ratios. Empirical results demonstrate that the final student models vastly outperform the much larger GPT3-Instruct model in terms of the controllability of compression ratios, without compromising the quality of resulting summarization.
UPER: Boosting Multi-Document Summarization with an Unsupervised Prompt-based Extractor Shangqing Tu, Jifan Yu, Fangwei Zhu, Juanzi Li, Lei Hou, Jian-Yun Nie COLING 2022 [pdf] [code]

[Abs]
Multi-Document Summarization (MDS) commonly employs the 2-stage extract-then-abstract paradigm, which first extracts a relatively short meta-document, then feeds it into the deep neural networks to generate an abstract. Previous work usually takes the ROUGE score as the label for training a scoring model to evaluate source documents. However, the trained scoring model is prone to under-fitting for low-resource settings, as it relies on the training data. To extract documents effectively, we construct prompting templates that invoke the underlying knowledge in Pre-trained Language Model (PLM) to calculate the document and keyword’s perplexity, which can assess the document’s semantic salience. Our unsupervised approach can be applied as a plug-in to boost other metrics for evaluating a document’s salience, thus improving the subsequent abstract generation. We get positive results on 2 MDS datasets, 2 data settings, and 2 abstractive backbone models, showing our method’s effectiveness. Our code is available at https://github.com/THU-KEG/UPER
Learning Non-Autoregressive Models from Search for Unsupervised Sentence Summarization Puyuan Liu, Chenyang Huang, Lili Mou ACL 2022 [pdf] [code]

[Abs]
Text summarization aims to generate a short summary for an input text. In this work, we propose a Non-Autoregressive Unsupervised Summarization (NAUS) approach, which does not require parallel data for training. Our NAUS first performs edit-based search towards a heuristically defined score, and generates a summary as pseudo-groundtruth. Then, we train an encoder-only non-autoregressive Transformer based on the search result. We also propose a dynamic programming approach for length-control decoding, which is important for the summarization task. Experiments on two datasets show that NAUS achieves state-of-the-art performance for unsupervised summarization, yet largely improving inference efficiency. Further, our algorithm is able to perform explicit length-transfer summary generation.
Unsupervised Extractive Opinion Summarization Using Sparse Coding Somnath Basu Roy Chowdhury, Chao Zhao, Snigdha Chaturvedi ACL 2022 [pdf] [code]

[Abs]
Opinion summarization is the task of automatically generating summaries that encapsulate information expressed in multiple user reviews. We present Semantic Autoencoder (SemAE) to perform extractive opinion summarization in an unsupervised manner. SemAE uses dictionary learning to implicitly capture semantic information from the review text and learns a latent representation of each sentence over semantic units. Our extractive summarization algorithm leverages the representations to identify representative opinions among hundreds of reviews. SemAE is also able to perform controllable summarization to generate aspect-specific summaries using only a few samples. We report strong performance on SPACE and AMAZON datasets and perform experiments to investigate the functioning of our model.
Want To Reduce Labeling Cost? GPT-3 Can Help Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, Michael Zeng Findings of EMNLP 2021 [pdf]
Improving Unsupervised Extractive Summarization with Facet-Aware Modeling Xinnian Liang, Shuangzhi Wu, Mu Li, Zhoujun Li ACL 2021 Findings [pdf] [code]
MRCBert: A Machine Reading Comprehension Approach for Unsupervised Summarization Saurabh Jain, Guokai Tang, Lim Sze Chi [pdf] [code]
Centrality Meets Centroid: A Graph-based Approach for Unsupervised Document Summarization Haopeng Zhang, Jiawei Zhang [pdf]
Unsupervised Opinion Summarization with Content Planning Reinald Kim Amplayo, Stefanos Angelidis, Mirella Lapata AAAI21 [pdf] [code]
Biased TextRank: Unsupervised Graph-Based Content Extraction Ashkan Kazemi, Verónica Pérez-Rosas, Rada Mihalcea COLING20 [pdf] [code]
Unsupervised Extractive Summarization by Pre-training Hierarchical Transformers Shusheng Xu, Xingxing Zhang, Yi Wu, Furu Wei, Ming Zhou [pdf] [code]
Q-learning with Language Model for Edit-based Unsupervised Summarization Ryosuke Kohita, Akifumi Wachi, Yang Zhao, Ryuki Tachibana EMNLP20 [pdf] [code]
Abstractive Document Summarization without Parallel Data Nikola I. Nikolov, Richard H.R. Hahnloser LREC20 [pdf] [code]
Unsupervised Neural Single-Document Summarization of Reviews via Learning Latent Discourse Structure and its Ranking Masaru Isonuma, Junichiro Mori, Ichiro Sakata ACL19 [pdf] [code]
Sentence Centrality Revisited for Unsupervised Summarization Hao Zheng, Mirella Lapata ACL19 [pdf] [code]
Discrete Optimization for Unsupervised Sentence Summarization with Word-Level Extraction Raphael Schumann, Lili Mou, Yao Lu, Olga Vechtomova, Katja Markert ACL20 [pdf] [code]
SummAE: Zero-Shot Abstractive Text Summarization using Length-Agnostic Auto-Encoders Peter J. Liu, Yu-An Chung, Jie Ren [pdf] [code]
MeanSum: A Neural Model for Unsupervised Multi-Document Abstractive Summarization Eric Chu, Peter J. Liu ICML19 [pdf] [code]
SEQ3: Differentiable Sequence-to-Sequence-to-Sequence Autoencoder for Unsupervised Abstractive Sentence Compression Christos Baziotis, Ion Androutsopoulos, Ioannis Konstas, Alexandros Potamianos NAACL19 [pdf] [code]
Learning to Encode Text as Human-Readable Summaries using Generative Adversarial Networks Yaushian Wang, Hung-Yi Lee EMNLP18 [pdf] [code]
Unsupervised Abstractive Meeting Summarization with Multi-Sentence Compression and Budgeted Submodular Maximization Guokan Shang, Wensi Ding, Zekun Zhang, Antoine Tixier, Polykarpos Meladianos, Michalis Vazirgiannis, Jean-Pierre Lorré ACL18 [pdf] [code]

Concept-map-based

Fast Concept Mention Grouping for Concept Map–based Multi-Document Summarization Tobias Falke, Iryna Gurevych NAACL19 [pdf] [code]
Bringing Structure into Summaries: Crowdsourcing a Benchmark Corpus of Concept Maps Tobias Falke, Iryna Gurevych EMNLP17 [pdf] [code]

Timeline

Follow the Timeline! Generating Abstractive and Extractive Timeline Summary in Chronological Order Xiuying Chen, Mingzhe Li, Shen Gao, Zhangming Chan, Dongyan Zhao, Xin Gao, Xiangliang Zhang, Rui Yan TOIS [pdf] [code]

[Abs]
Nowadays, time-stamped web documents related to a general news query flood the Internet, and timeline summarization targets concisely summarizing the evolution trajectory of events along the timeline. Unlike traditional document summarization, timeline summarization needs to model the time series information of the input events and summarize important events in chronological order. To tackle this challenge, in this paper, we propose a Unified Timeline Summarizer (UTS) that can generate abstractive and extractive timeline summaries in time order. Concretely, in the encoder part, we propose a graph-based event encoder that relates multiple events according to their content dependency and learns a global representation of each event. In the decoder part, to ensure the chronological order of the abstractive summary, we propose to extract the feature of event-level attention in its generation process with sequential information retained and use it to simulate the evolutionary attention of the ground truth summary. The event-level attention can also be used to assist in extracting summary, where the extracted summary also comes in time sequence. We augment the previous Chinese large-scale timeline summarization dataset and collect a new English timeline dataset. Extensive experiments conducted on these datasets and on the out-of-domain Timeline 17 dataset show that UTS achieves state-of-the-art performance in terms of both automatic and human evaluations.
CrisisLTLSum: A Benchmark for Local Crisis Event Timeline Extraction and Summarization Hossein Rajaby Faghihi, Bashar Alhafni, Ke Zhang, Shihao Ran, Joel Tetreault, Alejandro Jaimes [pdf] [data]

[Abs]
Social media has increasingly played a key role in emergency response: first responders can use public posts to better react to ongoing crisis events and deploy the necessary resources where they are most needed. Timeline extraction and abstractive summarization are critical technical tasks to leverage large numbers of social media posts about events. Unfortunately, there are few datasets for benchmarking technical approaches for those tasks. This paper presents CrisisLTLSum, the largest dataset of local crisis event timelines available to date. CrisisLTLSum contains 1,000 crisis event timelines across four domains: wildfires, local fires, traffic, and storms. We built CrisisLTLSum using a semi-automated cluster-then-refine approach to collect data from the public Twitter stream. Our initial experiments indicate a significant gap between the performance of strong baselines compared to the human performance on both tasks. Our dataset, code, and models are publicly available.
Joint Learning-based Heterogeneous Graph Attention Network for Timeline Summarization Jingyi You, Dongyuan Li, Hidetaka Kamigaito, Kotaro Funakoshi, Manabu Okumura NAACL 2022 [pdf] [data]

[Abs]
Previous studies on the timeline summarization (TLS) task ignored the information interaction between sentences and dates, and adopted pre-defined unlearnable representations for them. They also considered date selection and event detection as two independent tasks, which makes it impossible to integrate their advantages and obtain a globally optimal summary. In this paper, we present a joint learning-based heterogeneous graph attention network for TLS (HeterTls), in which date selection and event detection are combined into a unified framework to improve the extraction accuracy and remove redundant sentences simultaneously. Our heterogeneous graph involves multiple types of nodes, the representations of which are iteratively learned across the heterogeneous graph attention layer. We evaluated our model on four datasets, and found that it significantly outperformed the current state-of-the-art baselines with regard to ROUGE scores and date selection metrics.
Updated Headline Generation: Creating Updated Summaries for Evolving News Stories Sheena Panthaplackel, Adrian Benton, Mark Dredze ACL 2022 [pdf] [code]

[Abs]
We propose the task of updated headline generation, in which a system generates a headline for an updated article, considering both the previous article and headline. The system must identify the novel information in the article update, and modify the existing headline accordingly. We create data for this task using the NewsEdits corpus by automatically identifying contiguous article versions that are likely to require a substantive headline update. We find that models conditioned on the prior headline and body revisions produce headlines judged by humans to be as factual as gold headlines while making fewer unnecessary edits compared to a standard headline generation model. Our experiments establish benchmarks for this new contextual summarization task.
Abstractive summarization of hospitalisation histories with transformer networks Alexander Yalunin, Dmitriy Umerenkov, Vladimir Kokh [pdf]
Multi-TimeLine Summarization (MTLS): Improving Timeline Summarization by Generating Multiple Summaries Yi Yu, Adam Jatowt, Antoine Doucet, Kazunari Sugiyama, Masatoshi Yoshikawa ACL 2021 [pdf] [data]
Summarize Dates First: A Paradigm Shift in Timeline Summarization Moreno La Quatra, Luca Cagliero, Elena Baralis, Alberto Messina, Maurizio Montagnuolo SIGIR 2021 [pdf] [data]
Examining the State-of-the-Art in News Timeline Summarization Demian Gholipour Ghalandari, Georgiana Ifrim ACL20 [pdf] [code]
Learning towards Abstractive Timeline Summarization Xiuying Chen, Zhangming Chan, Shen Gao, Meng-Hsuan Yu, Dongyan Zhao, Rui Yan IJCAI19 [pdf] [data]

Opinion

Simple Yet Effective Synthetic Dataset Construction for Unsupervised Opinion Summarization Ming Shen, Jie Ma, Shuai Wang, Yogarshi Vyas, Kalpit Dixit, Miguel Ballesteros, Yassine Benajiba EACL 2023 Findings [pdf]

[Abs]
Opinion summarization provides an important solution for summarizing opinions expressed among a large number of reviews. However, generating aspect-specific and general summaries is challenging due to the lack of annotated data. In this work, we propose two simple yet effective unsupervised approaches to generate both aspect-specific and general opinion summaries by training on synthetic datasets constructed with aspect-related review contents. Our first approach, Seed Words Based Leave-One-Out (SW-LOO), identifies aspect-related portions of reviews simply by exact-matching aspect seed words and outperforms existing methods by 3.4 ROUGE-L points on SPACE and 0.5 ROUGE-1 point on OPOSUM+ for aspect-specific opinion summarization. Our second approach, Natural Language Inference Based Leave-One-Out (NLI-LOO) identifies aspect-related sentences utilizing an NLI model in a more general setting without using seed words and outperforms existing approaches by 1.2 ROUGE-L points on SPACE for aspect-specific opinion summarization and remains competitive on other metrics.
Opinion Summarization by Weak-Supervision from Mix-structured Data Yizhu Liu, Qi Jia, Kenny Zhu EMNLP 2022 [pdf] [code]

[Abs]
Opinion summarization of multiple reviews suffers from the lack of reference summaries for training. Most previous approaches construct multiple reviews and their summary based on textual similarities between reviews, resulting in information mismatch between the review input and the summary. In this paper, we convert each review into a mix of structured and unstructured data, which we call opinion-aspect pairs (OAs) and implicit sentences (ISs). We propose a new method to synthesize training pairs of such mix-structured data as input and the textual summary as output, and design a summarization model with OA encoder and IS encoder. Experiments show that our approach outperforms previous methods on Yelp, Amazon and RottenTomatos datasets.
OpineSum: Entailment-based self-training for abstractive opinion summarization Annie Louis, Joshua Maynez [pdf]

[Abs]
A typical product or place often has hundreds of reviews, and summarization of these texts is an important and challenging problem. Recent progress on abstractive summarization in domains such as news has been driven by supervised systems trained on hundreds of thousands of news articles paired with human-written summaries. However for opinion texts, such large scale datasets are rarely available. Unsupervised methods, self-training, and few-shot learning approaches bridge that gap. In this work, we present a novel self-training approach, OpineSum, for abstractive opinion summarization. The summaries in this approach are built using a novel application of textual entailment and capture the consensus of opinions across the various reviews for an item. This method can be used to obtain silver-standard summaries on a large scale and train both unsupervised and few-shot abstractive summarization systems. OpineSum achieves state-of-the-art performance in both settings.
Zero-Shot Opinion Summarization with GPT-3 Adithya Bhaskar, Alexander R. Fabbri, Greg Durrett [pdf] [code]

[Abs]
Very large language models such as GPT-3 have shown impressive performance across a wide variety of tasks, including text summarization. In this paper, we show that this strong performance extends to opinion summarization. We explore several pipeline methods for applying GPT-3 to summarize a large collection of user reviews in a zero-shot fashion, notably approaches based on recursive summarization and selecting salient content to summarize through supervised clustering or extraction. On two datasets, an aspect-oriented summarization dataset of hotel reviews and a generic summarization dataset of Amazon and Yelp reviews, we show that the GPT-3 models achieve very strong performance in human evaluation. We argue that standard evaluation metrics do not reflect this, and evaluate against several new measures targeting faithfulness, factuality, and genericity to contrast these different methods.
Unsupervised Opinion Summarisation in the Wasserstein Space Jiayu Song, Iman Munire Bilal, Adam Tsakalidis, Rob Procter, Maria Liakata [pdf]

[Abs]
Opinion summarisation synthesises opinions expressed in a group of documents discussing the same topic to produce a single summary. Recent work has looked at opinion summarisation of clusters of social media posts. Such posts are noisy and have unpredictable structure, posing additional challenges for the construction of the summary distribution and the preservation of meaning compared to online reviews, which has been so far the focus of opinion summarisation. To address these challenges we present WassOS, an unsupervised abstractive summarization model which makes use of the Wasserstein distance. A Variational Autoencoder is used to get the distribution of documents/posts, and the distributions are disentangled into separate semantic and syntactic spaces. The summary distribution is obtained using the Wasserstein barycenter of the semantic and syntactic distributions. A latent variable sampled from the summary distribution is fed into a GRU decoder with a transformer layer to produce the final summary. Our experiments on multiple datasets including Twitter clusters, Reddit threads, and reviews show that WassOS almost always outperforms the state-of-the-art on ROUGE metrics and consistently produces the best summaries with respect to meaning preservation according to human evaluations.
Noisy Pairing and Partial Supervision for Opinion Summarization Hayate Iso, Xiaolan Wang, Yoshi Suhara [pdf]

[Abs]
Current opinion summarization systems simply generate summaries reflecting important opinions from customer reviews, but the generated summaries may not attract the reader's attention. Although it is helpful to automatically generate professional reviewer-like summaries from customer reviews, collecting many training pairs of customer and professional reviews is generally tricky. We propose a weakly supervised opinion summarization framework, Noisy Pairing and Partial Supervision (NAPA) that can build a stylized opinion summarization system with no customer-professional review pairs. Experimental results show consistent improvements in automatic evaluation metrics, and qualitative analysis shows that our weakly supervised opinion summarization system can generate summaries that look more like those written by professional reviewers.
Unsupervised Opinion Summarization Using Approximate Geodesics Somnath Basu Roy Chowdhury, Nicholas Monath, Avinava Dubey, Amr Ahmed, Snigdha Chaturvedi [pdf]

[Abs]
Opinion summarization is the task of creating summaries capturing popular opinions from user reviews. In this paper, we introduce Geodesic Summarizer (GeoSumm), a novel system to perform unsupervised extractive opinion summarization. GeoSumm involves an encoder-decoder based representation learning model, that generates representations of text as a distribution over latent semantic units. GeoSumm generates these representations by performing dictionary learning over pre-trained text representations at multiple decoder layers. We then use these representations to quantify the relevance of review sentences using a novel approximate geodesic distance based scoring mechanism. We use the relevance scores to identify popular opinions in order to compose general and aspect-specific summaries. Our proposed model, GeoSumm, achieves state-of-the-art performance on three opinion summarization datasets. We perform additional experiments to analyze the functioning of our model and showcase the generalization ability of GeoSumm across different domains.
Template-based Abstractive Microblog Opinion Summarisation Iman Munire Bilal, Bo Wang, Adam Tsakalidis, Dong Nguyen, Rob Procter, Maria Liakata TACL 2022 [pdf]

[Abs]
We introduce the task of microblog opinion summarisation (MOS) and share a dataset of 3100 gold-standard opinion summaries to facilitate research in this domain. The dataset contains summaries of tweets spanning a 2-year period and covers more topics than any other public Twitter summarisation dataset. Summaries are abstractive in nature and have been created by journalists skilled in summarising news articles following a template separating factual information (main story) from author opinions. Our method differs from previous work on generating gold-standard summaries from social media, which usually involves selecting representative posts and thus favours extractive summarisation models. To showcase the dataset's utility and challenges, we benchmark a range of abstractive and extractive state-of-the-art summarisation models and achieve good performance, with the former outperforming the latter. We also show that fine-tuning is necessary to improve performance and investigate the benefits of using different sample sizes.
Efficient Few-Shot Fine-Tuning for Opinion Summarization Arthur Bražinskas, Ramesh Nallapati, Mohit Bansal, Markus Dreyer Findings of NAACL 2022 [pdf] [code]

[Abs]
Abstractive summarization models are typically pre-trained on large amounts of generic texts, then fine-tuned on tens or hundreds of thousands of annotated samples. However, in opinion summarization, large annotated datasets of reviews paired with reference summaries are not available and would be expensive to create. This calls for fine-tuning methods robust to overfitting on small datasets. In addition, generically pre-trained models are often not accustomed to the specifics of customer reviews and, after fine-tuning, yield summaries with disfluencies and semantic mistakes. To address these problems, we utilize an efficient few-shot method based on adapters which, as we show, can easily store in-domain knowledge. Instead of fine-tuning the entire model, we add adapters and pre-train them in a task-specific way on a large corpus of unannotated customer reviews, using held-out reviews as pseudo summaries. Then, fine-tune the adapters on the small available human-annotated dataset. We show that this self-supervised adapter pre-training improves summary quality over standard fine-tuning by 2.0 and 1.3 ROUGE-L points on the Amazon and Yelp datasets, respectively. Finally, for summary personalization, we condition on aspect keyword queries, automatically created from generic datasets. In the same vein, we pre-train the adapters in a query-based manner on customer reviews and then fine-tune them on annotated datasets. This results in better-organized summary content reflected in improved coherence and fewer redundancies.
DSGPT: Domain-Specific Generative Pre-Training of Transformers for Text Generation in E-commerce Title and Review Summarization Xueying Zhang, Yunjiang Jiang, Yue Shang, Zhaomeng Cheng, Chi Zhang, Xiaochuan Fan, Yun Xiao, Bo Long SIGIR 2021 [pdf]
Convex Aggregation for Opinion Summarization Hayate Iso, Xiaolan Wang, Yoshihiko Suhara, Stefanos Angelidis, Wang-Chiew Tan EMNLP 2021 Findings [pdf] [code]
Measuring Similarity of Opinion-bearing Sentences Wenyi Tay, Xiuzhen Zhang, Stephen Wan, Sarvnaz Karimi EMNLP 2021 | newsum [pdf] [data]
Comparative Opinion Summarization via Collaborative Decoding Hayate Iso, Xiaolan Wang, Yoshihiko Suhara [pdf] [data]
Learning Opinion Summarizers by Selecting Informative Reviews Arthur Bražinskas, Mirella Lapata, Ivan Titov EMNLP 2021 [pdf] [code]
Aspect-Controllable Opinion Summarization Reinald Kim Amplayo, Stefanos Angelidis, Mirella Lapata EMNLP 2021 [pdf] [code]
CUSTOM: Aspect-Oriented Product Summarization for E-Commerce Jiahui Liang, Junwei Bao, Yifan Wang, Youzheng Wu, Xiaodong He, Bowen Zhou [pdf] [code]
TransSum: Translating Aspect and Sentiment Embeddings for Self-Supervised Opinion Summarization Ke Wang, Xiaojun Wan ACL 2021 Findings [pdf]
Unsupervised Abstractive Opinion Summarization by Generating Sentences with Tree-Structured Topic Guidance Masaru Isonuma, Junichiro Mori, Danushka Bollegala, Ichiro Sakata TACL 2021 [pdf]
PASS: Perturb-and-Select Summarizer for Product Reviews Nadav Oved, Ran Levy ACL 2021 [pdf]
Self-Supervised Multimodal Opinion Summarization Jinbae Im, Moonki Kim, Hoyeop Lee, Hyunsouk Cho, Sehee Chung ACL21 [pdf] [code]
MRCBert: A Machine Reading Comprehension Approach for Unsupervised Summarization Saurabh Jain, Guokai Tang, Lim Sze Chi [pdf] [code]
Informative and Controllable Opinion Summarization Reinald Kim Amplayo, Mirella Lapata EACL21 [pdf] [code]
Self-Supervised and Controlled Multi-Document Opinion Summarization Hady Elsahar, Maximin Coavoux, Jos Rozen, Matthias Gallé EACL21 [pdf]
Unsupervised Opinion Summarization with Content Planning Reinald Kim Amplayo, Stefanos Angelidis, Mirella Lapata AAAI21 [pdf] [code]
Extractive Opinion Summarization in Quantized Transformer Spaces Stefanos Angelidis, Reinald Kim Amplayo, Yoshihiko Suhara, Xiaolan Wang, Mirella Lapata TACL [pdf] [code]
Few-Shot Learning for Opinion Summarization Arthur Bražinskas, Mirella Lapata, Ivan Titov EMNLP20 [pdf] [code]
Unsupervised Opinion Summarization as Copycat-Review Generation Arthur Bražinskas, Mirella Lapata, Ivan Titov ACL20 [pdf] [code]
Unsupervised Opinion Summarization with Noising and Denoising Reinald Kim Amplayo, Mirella Lapata ACL20 [pdf] [code]
OPINIONDIGEST: A Simple Framework for Opinion Summarization Yoshihiko Suhara, Xiaolan Wang, Stefanos Angelidis, Wang-Chiew Tan ACL20 Short [pdf] [code]
Weakly-Supervised Opinion Summarization by Leveraging External Information Chao Zhao, Snigdha Chaturvedi AAAI20 [pdf] [code]
MeanSum: A Neural Model for Unsupervised Multi-Document Abstractive Summarization Eric Chu, Peter J. Liu ICML19 [pdf] [code]

Reinforcement Learning

Reinforcement Learning for Abstractive Question Summarization with Question-aware Semantic Rewards Shweta Yadav, Deepak Gupta, Asma Ben Abacha, Dina Demner-Fushman ACL 2021 short [pdf] [code]
RewardsOfSum: Exploring Reinforcement Learning Rewards for Summarisation Jacob Parnell, Inigo Jauregi Unanue, Massimo Piccardi 5th Workshop on Structured Prediction for NLP ACL-IJCNLP 2021 [pdf]
Reinforced Generative Adversarial Network for Abstractive Text Summarization Tianyang Xu, Chunyun Zhang [pdf]
Answers Unite! Unsupervised Metrics for Reinforced Summarization Models Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano EMNLP19 [pdf]
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstractive Summarization Siyao Li, Deren Lei, Pengda Qin, William Yang Wang EMNLP19 [pdf]
Reinforced Extractive Summarization with Question-Focused Rewards Kristjan Arumae, Fei Liu ACL18 [pdf]
Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting Yen-Chun Chen, Mohit Bansal ACL18 [pdf] [code]
Multi-Reward Reinforced Summarization with Saliency and Entailment Ramakanth Pasunuru, Mohit Bansal NAACL18 [pdf]
Deep Communicating Agents for Abstractive Summarization Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, Yejin Choi NAACL18 [pdf]
Ranking Sentences for Extractive Summarization with Reinforcement Learning Shashi Narayan, Shay B. Cohen, Mirella Lapata NAACL18 [pdf] [code]
A Deep Reinforced Model For Abstractive Summarization Romain Paulus, Caiming Xiong, Richard Socher ICLR18 [pdf]

Reward Learning

Recursively Summarizing Books with Human Feedback Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, Paul Christiano [pdf] [code]
Learning to summarize from human feedback Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano [pdf] [code]
Better Rewards Yield Better Summaries: Learning to Summarise Without References Florian Böhm, Yang Gao, Christian M. Meyer, Ori Shapira, Ido Dagan, Iryna Gurevych EMNLP19 [pdf] [code]

Extractive

DiffuSum: Generation Enhanced Extractive Summarization with Diffusion Haopeng Zhang, Xiao Liu, Jiawei Zhang ACL 2023 Findings [pdf]

[Abs]
Extractive summarization aims to form a summary by directly extracting sentences from the source document. Existing works mostly formulate it as a sequence labeling problem by making individual sentence label predictions. This paper proposes DiffuSum, a novel paradigm for extractive summarization, by directly generating the desired summary sentence representations with diffusion models and extracting sentences based on sentence representation matching. In addition, DiffuSum jointly optimizes a contrastive sentence encoder with a matching loss for sentence representation alignment and a multi-class contrastive loss for representation diversity. Experimental results show that DiffuSum achieves the new state-of-the-art extractive results on CNN/DailyMail with ROUGE scores of 44.83/22.56/40.56. Experiments on the other two datasets with different summary lengths also demonstrate the effectiveness of DiffuSum. The strong performance of our framework shows the great potential of adapting generative models for extractive summarization.
Extractive Text Summarization Using Generalized Additive Models with Interactions for Sentence Selection Vinícius Camargo da Silva, João Paulo Papa, Kelton Augusto Pontara da Costa [pdf]

[Abs]
Automatic Text Summarization (ATS) is becoming relevant with the growth of textual data; however, with the popularization of public large-scale datasets, some recent machine learning approaches have focused on dense models and architectures that, despite producing notable results, usually turn out in models difficult to interpret. Given the challenge behind interpretable learning-based text summarization and the importance it may have for evolving the current state of the ATS field, this work studies the application of two modern Generalized Additive Models with interactions, namely Explainable Boosting Machine and GAMI-Net, to the extractive summarization problem based on linguistic features and binary classification.
Noise-injected Consistency Training and Entropy-constrained Pseudo Labeling for Semi-supervised Extractive Summarization Yiming Wang, Qianren Mao, Junnan Liu, Weifeng Jiang, Hongdong Zhu, Jianxin Li COLING 2022 [pdf] [code]

[Abs]
Labeling large amounts of extractive summarization data is often prohibitively expensive due to time, financial, and expertise constraints, which poses great challenges to incorporating summarization systems in practical applications. This limitation can be overcome by semi-supervised approaches: consistency-training and pseudo-labeling to make full use of unlabeled data. Research on the two, however, is conducted independently, and very few works try to connect them. In this paper, we first use the noise-injected consistency training paradigm to regularize model predictions. Subsequently, we propose a novel entropy-constrained pseudo labeling strategy, which obtains high-confidence labels from unlabeled predictions by comparing the entropy of supervised and unsupervised predictions. By combining consistency training and pseudo-labeling, this framework enforces a low-density separation between classes, which decently improves the performance of supervised learning over an insufficient labeled extractive summarization dataset.
Summarize, Outline, and Elaborate: Long-Text Generation via Hierarchical Supervision from Extractive Summaries Xiaofei Sun, Chun Fan, Zijun Sun, Yuxian Meng, Fei Wu, Jiwei Li [pdf]

[Abs]
The difficulty of generating coherent long texts lies in the fact that existing models overwhelmingly focus on the tasks of local word prediction, and cannot make high level plans on what to generate or capture the high-level discourse dependencies between chunks of texts. Inspired by how humans write, where a list of bullet points or a catalog is first outlined, and then each bullet point is expanded to form the whole article, we propose SOE, a pipelined system that involves summarizing, outlining and elaborating for long text generation: the model first outlines the summaries for different segments of long texts, and then elaborates on each bullet point to generate the corresponding segment. To avoid the labor-intensive process of summary soliciting, we propose the reconstruction strategy, which extracts segment summaries in an unsupervised manner by selecting its most informative part to reconstruct the segment. The proposed generation system comes with the following merits: (1) the summary provides high-level guidance for text generation and avoids the local minimum of individual word predictions; (2) the high-level discourse dependencies are captured in the conditional dependencies between summaries and are preserved during the summary expansion process and (3) additionally, we are able to consider significantly more contexts by representing contexts as concise summaries. Extensive experiments demonstrate that SOE produces long texts with significantly better quality, along with faster convergence speed.
Extractive Summarisation for German-language Data: A Text-level Approach with Discourse Features Freya Hewett, Manfred Stede COLING 2022 [pdf] [data]

[Abs]
We examine the link between facets of Rhetorical Structure Theory (RST) and the selection of content for extractive summarisation, for German-language texts. For this purpose, we produce a set of extractive summaries for a dataset of German-language newspaper commentaries, a corpus which already has several layers of annotation. We provide an in-depth analysis of the connection between summary sentences and several RST-based features and transfer these insights to various automated summarisation models. Our results show that RST features are informative for the task of extractive summarisation, particularly nuclearity and relations at sentence-level.
Text Summarization with Oracle Expectation Yumo Xu, Mirella Lapata [pdf] [code]

[Abs]
Extractive summarization produces summaries by identifying and concatenating the most important sentences in a document. Since most summarization datasets do not come with gold labels indicating whether document sentences are summary-worthy, different labeling algorithms have been proposed to extrapolate oracle extracts for model training. In this work, we identify two flaws with the widely used greedy labeling approach: it delivers suboptimal and deterministic oracles. To alleviate both issues, we propose a simple yet effective labeling algorithm that creates soft, expectation-based sentence labels. We define a new learning objective for extractive summarization which incorporates learning signals from multiple oracle summaries and prove it is equivalent to estimating the oracle expectation for each document sentence. Without any architectural modifications, the proposed labeling scheme achieves superior performance on a variety of summarization benchmarks across domains and languages, in both supervised and zero-shot settings.
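
The greedy labeling approach that the abstract above criticizes can be sketched as follows. This is a minimal illustration, not the authors' code: unigram F1 stands in for ROUGE, and the names `unigram_f1` and `greedy_oracle` and the `max_sentences` parameter are hypothetical. It shows why the resulting oracle is deterministic and hard-labeled: each document gets exactly one 0/1 label vector.

```python
from collections import Counter


def unigram_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap F1, used here as a simple stand-in for ROUGE-1."""
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(doc_sentences, reference, max_sentences=3):
    """Greedily add the sentence that most improves the score; stop when no sentence helps.

    Returns one hard 0/1 label per document sentence.
    """
    ref = reference.lower().split()
    toks = [s.lower().split() for s in doc_sentences]
    selected = []
    best = 0.0
    while len(selected) < max_sentences:
        scored = []
        for i in range(len(toks)):
            if i in selected:
                continue
            # Score the candidate extract in document order.
            cand = [t for j in sorted(selected + [i]) for t in toks[j]]
            scored.append((unigram_f1(cand, ref), i))
        if not scored:
            break
        # Highest score wins; ties go to the earlier sentence.
        score, idx = max(scored, key=lambda g: (g[0], -g[1]))
        if score <= best:
            break
        best = score
        selected.append(idx)
    return [1 if i in selected else 0 for i in range(len(toks))]


doc = ["the cat sat on the mat", "dogs bark loudly", "the mat was red"]
labels = greedy_oracle(doc, "the cat sat on the red mat")
print(labels)
```

Because the procedure always returns the same single extract, sentences that would appear in other near-optimal extracts receive a label of 0, which is the limitation that soft, expectation-based labels are meant to address.
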
OTExtSum: Extractive Text Summarisation with Optimal Transport Peggy Tang, Kun Hu, Rui Yan, Lei Zhang, Junbin Gao, Zhiyong Wang Findings of NAACL 2022 [pdf] [code]

[Abs]
Extractive text summarisation aims to select salient sentences from a document to form a short yet informative summary. While learning-based methods have achieved promising results, they have several limitations, such as dependence on expensive training and lack of interpretability. Therefore, in this paper, we propose a novel non-learning-based method by for the first time formulating text summarisation as an Optimal Transport (OT) problem, namely Optimal Transport Extractive Summariser (OTExtSum). Optimal sentence extraction is conceptualised as obtaining an optimal summary that minimises the transportation cost to a given document regarding their semantic distributions. Such a cost is defined by the Wasserstein distance and used to measure the summary’s semantic coverage of the original document. Comprehensive experiments on four challenging and widely used datasets - MultiNews, PubMed, BillSum, and CNN/DM demonstrate that our proposed method outperforms the state-of-the-art non-learning-based methods and several recent learning-based methods in terms of the ROUGE metric.
Post-Editing Extractive Summaries by Definiteness Prediction Jad Kabbara, Jackie Chi Kit Cheung EMNLP 2021 Findings [pdf]
Decision-Focused Summarization Chao-Chun Hsu, Chenhao Tan EMNLP 2021 [pdf] [code]
Monolingual versus Multilingual BERTology for Vietnamese Extractive Multi-Document Summarization Huy To Quoc, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen, Anh Gia-Tuan Nguyen [pdf]
Multiplex Graph Neural Network for Extractive Text Summarization Baoyu Jing, Zeyu You, Tao Yang, Wei Fan, Hanghang Tong EMNLP 2021 Short [pdf]
Unsupervised Extractive Summarization-Based Representations for Accurate and Explainable Collaborative Filtering Reinald Adrian Pugoy, Hung-Yu Kao ACL 2021 [pdf]
Deep Differential Amplifier for Extractive Summarization Ruipeng Jia, Yanan Cao, Fang Fang, Yuchen Zhou, Zheng Fang, Yanbing Liu, Shi Wang ACL 2021 [pdf]
Incorporating Domain Knowledge for Extractive Summarization of Legal Case Documents Paheli Bhattacharya, Soham Poddar, Koustav Rudra, Kripabandhu Ghosh, Saptarshi Ghosh ICAIL 2021 [pdf]
Topic Modeling Based Extractive Text Summarization Kalliath Abdul Rasheed Issam, Shivam Patel, Subalalitha C. N Journal [pdf]
Demoting the Lead Bias in News Summarization via Alternating Adversarial Learning Linzi Xing, Wen Xiao, Giuseppe Carenini ACL2021-short [pdf] [code]
Genetic Algorithms For Extractive Summarization William Chen, Kensal Ramos, Kalyan Naidu Mullaguri [pdf]
Extractive Summarization Considering Discourse and Coreference Relations based on Heterogeneous Graph Yin Jou Huang, Sadao Kurohashi EACL21 [pdf]
AREDSUM: Adaptive Redundancy-Aware Iterative Sentence Ranking for Extractive Document Summarization Keping Bi, Rahul Jha, Bruce Croft, Asli Celikyilmaz EACL21 [pdf]
Unsupervised Extractive Summarization using Pointwise Mutual Information Vishakh Padmakumar, He He EACL21 [pdf] [code]
Better Highlighting: Creating Sub-Sentence Summary Highlights Sangwoo Cho, Kaiqiang Song, Chen Li, Dong Yu, Hassan Foroosh, Fei Liu EMNLP20 [pdf] [code]
SupMMD: A Sentence Importance Model for Extractive Summarization using Maximum Mean Discrepancy Umanga Bista, Alexander Patrick Mathews, Aditya Krishna Menon, Lexing Xie [pdf] [code]
Stepwise Extractive Summarization and Planning with Structured Transformers Shashi Narayan, Joshua Maynez, Jakub Adamek, Daniele Pighin, Blaž Bratanič, Ryan McDonald EMNLP20 [pdf] [code]
A Discourse-Aware Neural Extractive Model for Text Summarization Jiacheng Xu, Zhe Gan, Yu Cheng, Jingjing Liu ACL20 [pdf] [code]
Reading Like HER: Human Reading Inspired Extractive Summarization Ling Luo, Xiang Ao, Yan Song, Feiyang Pan, Min Yang, Qing He EMNLP19 [pdf]
Exploiting Discourse-Level Segmentation for Extractive Summarization Zhengyuan Liu, Nancy Chen EMNLP19 [pdf]
DeepChannel: Salience Estimation by Contrastive Learning for Extractive Document Summarization Jiaxin Shi, Chen Liang, Lei Hou, Juanzi Li, Zhiyuan Liu, Hanwang Zhang AAAI19 [pdf] [code]
Extractive Summarization with SWAP-NET: Sentences and Words from Alternating Pointer Networks Aishwarya Jadhav, Vaibhav Rajan ACL18 [pdf]
Neural Document Summarization by Jointly Learning to Score and Select Sentences Qingyu Zhou, Nan Yang, Furu Wei, Shaohan Huang, Ming Zhou, Tiejun Zhao ACL18 [pdf]
Neural Latent Extractive Document Summarization Xingxing Zhang, Mirella Lapata, Furu Wei, Ming Zhou ACL18 [pdf]
Generative Adversarial Network for Abstractive Text Summarization Linqing Liu, Yao Lu, Min Yang, Qiang Qu, Jia Zhu, Hongyan Li AAAI18 [pdf] [code]
Improving Neural Abstractive Document Summarization with Explicit Information Selection Modeling Wei Li, Xinyan Xiao, Yajuan Lyu, Yuanzhuo Wang EMNLP18 [pdf]
Extractive Summarization Using Multi-Task Learning with Document Classification Masaru Isonuma, Toru Fujino, Junichiro Mori, Yutaka Matsuo, Ichiro Sakata EMNLP17 [pdf]
SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents Ramesh Nallapati, Feifei Zhai, Bowen Zhou AAAI17 [pdf] [code]
Text Summarization through Entailment-based Minimum Vertex Cover Anand Gupta, Manpreet Kaur, Shachar Mirkin, Adarsh Singh, Aseem Goyal ENLG13 [pdf]

Extractive-Abstractive

EASE: Extractive-Abstractive Summarization with Explanations Haoran Li, Arash Einolghozati, Srinivasan Iyer, Bhargavi Paranjape, Yashar Mehdad, Sonal Gupta, Marjan Ghazvininejad EMNLP 2021 | newsum [pdf]
Semantic Extractor-Paraphraser based Abstractive Summarization Anubhav Jangra, Raghav Jain, Vaibhav Mavi, Sriparna Saha, Pushpak Bhattacharyya [pdf]
Contextualized Rewriting for Text Summarization Guangsheng Bao, Yue Zhang AAAI21 [pdf]
Jointly Extracting and Compressing Documents with Summary State Representations Afonso Mendes, Shashi Narayan, Sebastião Miranda, Zita Marinho, André F. T. Martins, Shay B. Cohen NAACL19 [pdf] [code]

VAE

Deep Recurrent Generative Decoder for Abstractive Text Summarization Piji Li, Wai Lam, Lidong Bing, Zihao Wang EMNLP17 [pdf]
Document Summarization with VHTM: Variational Hierarchical Topic-Aware Mechanism Xiyan Fu, Jun Wang, Jinghan Zhang, Jinmao Wei, Zhenglu Yang AAAI20 [pdf]

Syntactic

Compressive Summarization with Plausibility and Salience Modeling Shrey Desai, Jiacheng Xu, Greg Durrett EMNLP20 [pdf] [code]
StructSum: Incorporating Latent and Explicit Sentence Dependencies for Single Document Summarization Vidhisha Balachandran, Artidoro Pagnoni, Jay Yoon Lee, Dheeraj Rajagopal, Jaime Carbonell, Yulia Tsvetkov EACL21 [pdf] [code]
Joint Parsing and Generation for Abstractive Summarization Kaiqiang Song, Logan Lebanoff, Qipeng Guo, Xipeng Qiu, Xiangyang Xue, Chen Li, Dong Yu, Fei Liu AAAI20 [pdf] [code]
Neural Extractive Text Summarization with Syntactic Compression Jiacheng Xu, Greg Durrett EMNLP19 [pdf] [code]
Single Document Summarization as Tree Induction Yang Liu, Ivan Titov, Mirella Lapata NAACL19 [pdf] [code]

QA Related

Less is More: Summary of Long Instructions is Better for Program Synthesis Kirby Kuznia, Swaroop Mishra, Mihir Parmar, Chitta Baral EMNLP 2022 [pdf] [code]

[Abs]
Despite the success of large pre-trained language models (LMs) such as Codex, they show below-par performance on the larger and more complicated programming related questions. We show that LMs benefit from the summarized version of complicated questions. Our findings show that superfluous information often present in problem description such as human characters, background stories, and names (which are included to help humans in understanding a task) does not help models in understanding a task. To this extent, we create a meta-dataset from the frequently used APPS dataset and the newly created CodeContests dataset for the program synthesis task. Our meta-dataset consists of human and synthesized summaries of the long and complicated programming questions. Experimental results on Codex show that our proposed approach outperforms baseline by 8.13% on the APPS dataset and 11.88% on the CodeContests dataset on an average in terms of strict accuracy. Our analysis shows that summaries significantly improve performance for introductory (9.86%) and interview (11.48%) related programming questions. However, it shows improvement by a small margin (2%) for competitive programming questions, implying the scope for future research direction.
Focus-Driven Contrastive Learning for Medical Question Summarization Ming Zhang, Shuai Dou, Ziyang Wang, Yunfang Wu COLING 2022 [pdf]

[Abs]
Automatic medical question summarization can significantly help the system to understand consumer health questions and retrieve correct answers. The Seq2Seq model based on maximum likelihood estimation (MLE) has been applied in this task, which faces two general problems: the model can not capture well question focus and the traditional MLE strategy lacks the ability to understand sentence-level semantics. To alleviate these problems, we propose a novel question focus-driven contrastive learning framework (QFCL). Specially, we propose an easy and effective approach to generate hard negative samples based on the question focus, and exploit contrastive learning at both encoder and decoder to obtain better sentence level representations. On three medical benchmark datasets, our proposed model achieves new state-of-the-art results, and obtains a performance gain of 5.33, 12.85 and 3.81 points over the baseline BART model on three datasets respectively. Further human judgement and detailed analysis prove that our QFCL model learns better sentence representations with the ability to distinguish different sentence meanings, and generates high-quality summaries by capturing question focus.
Educational Question Generation of Children Storybooks via Question Type Distribution Learning and Event-centric Summarization Zhenjie Zhao, Yufang Hou, Dakuo Wang, Mo Yu, Chengzhong Liu, Xiaojuan Ma ACL 2022 [pdf] [code]

[Abs]
Generating educational questions of fairytales or storybooks is vital for improving children’s literacy ability. However, it is challenging to generate questions that capture the interesting aspects of a fairytale story with educational meaningfulness. In this paper, we propose a novel question generation method that first learns the question type distribution of an input story paragraph, and then summarizes salient events which can be used to generate high-cognitive-demand questions. To train the event-centric summarizer, we finetune a pre-trained transformer-based sequence-to-sequence model using silver samples composed by educational question-answer pairs. On a newly proposed educational question-answering dataset FairytaleQA, we show good performance of our method on both automatic and human evaluation metrics. Our work indicates the necessity of decomposing question type distribution learning and event-centric summary generation for educational question generation.
Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics Daniel Deutsch, Dan Roth [pdf]
Using Question Answering Rewards to Improve Abstractive Summarization Chulaka Gunasekara, Guy Feigenblat, Benjamin Sznajder, Ranit Aharonov, Sachindra Joshi EMNLP 2021 Findings [pdf]
Question-Based Salient Span Selection for More Controllable Text Summarization Daniel Deutsch, Dan Roth [pdf]
Text Summarization with Latent Queries Yumo Xu, Mirella Lapata [pdf]
Summarizing Chinese Medical Answer with Graph Convolution Networks and Question-focused Dual Attention Ningyu Zhang, Shumin Deng, Juan Li, Xi Chen, Wei Zhang, Huajun Chen Findings of EMNLP [pdf]
Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary Daniel Deutsch, Tania Bedrax-Weiss, Dan Roth [pdf] [code]
Guiding Extractive Summarization with Question-Answering Rewards Kristjan Arumae, Fei Liu NAACL19 [pdf] [code]
A Semantic QA-Based Approach for Text Summarization Evaluation Ping Chen, Fei Wu, Tong Wang, Wei Ding AAAI18 [pdf]

Query

Generating Query Focused Summaries without Fine-tuning the Transformer-based Pre-trained Models Deen Abdullah, Shamanth Nayak, Gandharv Suri, Yllias Chali [pdf]

[Abs]
Fine-tuning Natural Language Processing (NLP) models for each new data set requires more computational time, with an associated increase in carbon footprint and cost. Fine-tuning helps pre-trained models adapt to the latest data sets, but what if we avoid the fine-tuning steps and attempt to generate summaries using just the pre-trained models to reduce computational time and cost? In this paper, we omit the fine-tuning steps and investigate whether a Maximal Marginal Relevance (MMR)-based approach can help pre-trained models obtain query-focused summaries directly from a new data set that was not used to pre-train the models. First, we used topic modelling on the Wikipedia Current Events Portal (WCEP) and Debatepedia datasets to generate queries for summarization tasks. Then, using MMR, we ranked the sentences of the documents according to the queries. Next, we passed the ranked sentences to seven transformer-based pre-trained models to perform the summarization tasks. Finally, we used the MMR approach again to select the query-relevant sentences from the generated summaries of the individual pre-trained models and constructed the final summary. As indicated by the experimental results, our MMR-based approach successfully ranked and selected the most relevant sentences as summaries and showed better performance than the individual pre-trained models.
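
The MMR ranking step mentioned in the abstract above can be illustrated with a short sketch. It assumes bag-of-words cosine similarity and a trade-off weight `lam`; the paper's actual sentence representations and settings are not given here, so the helper names (`bow`, `cosine`, `mmr_rank`) and the default `lam=0.7` are hypothetical choices for illustration only.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lower-cased bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mmr_rank(query, sentences, lam=0.7, k=None):
    """Return sentence indices in MMR order.

    Each step picks the sentence maximizing
    lam * sim(sentence, query) - (1 - lam) * max sim(sentence, already selected).
    """
    q = bow(query)
    vecs = [bow(s) for s in sentences]
    k = len(sentences) if k is None else min(k, len(sentences))
    selected = []
    remaining = list(range(len(sentences)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(vecs[i], q)
            redundancy = max((cosine(vecs[i], vecs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


sentences = [
    "The cat sat on the mat.",
    "The cat sat on the mat today.",
    "Stock markets fell sharply.",
]
print(mmr_rank("cat mat", sentences, lam=0.5))
```

With `lam=1.0` the ranking reduces to pure query relevance; lowering `lam` penalizes sentences that repeat what has already been selected, which is why the near-duplicate second sentence is pushed down in the example.
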
OASum: Large-Scale Open Domain Aspect-based Summarization Xianjun Yang, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Xiaoman Pan, Linda Petzold, Dong Yu [pdf] [code]

[Abs]
Aspect or query-based summarization has recently caught more attention, as it can generate differentiated summaries based on users' interests. However, the current datasets for aspect or query-based summarization either focus on specific domains, contain relatively small-scale instances, or include only a few aspect types. Such limitations hinder further explorations in this direction. In this work, we take advantage of crowd-sourced knowledge on Wikipedia and automatically create a high-quality, large-scale open-domain aspect-based summarization dataset named OASum, which contains more than 3.7 million instances with around 1 million different aspects on 2 million Wikipedia pages. We provide benchmark results on OASum and demonstrate its ability for diverse aspect-based summarization generation. To overcome the data scarcity problem on specific domains, we also perform zero-shot, few-shot, and fine-tuning on seven downstream datasets. Specifically, zero/few-shot and fine-tuning results show that the model pre-trained on our corpus demonstrates a strong aspect or query-focused generation ability compared with the backbone model. Our dataset and pre-trained checkpoints are publicly available.
Constrained Regeneration for Cross-Lingual Query-Focused Extractive Summarization Elsbeth Turcan, David Wan, Faisal Ladhak, Petra Galuscakova, Sukanta Sen, Svetlana Tchistiakova, Weijia Xu, Marine Carpuat, Kenneth Heafield, Douglas Oard, Kathleen McKeown COLING 2022 [pdf]

[Abs]
Query-focused summaries of foreign-language, retrieved documents can help a user understand whether a document is actually relevant to the query term. A standard approach to this problem is to first translate the source documents and then perform extractive summarization to find relevant snippets. However, in a cross-lingual setting, the query term does not necessarily appear in the translations of relevant documents. In this work, we show that constrained machine translation and constrained post-editing can improve human relevance judgments by including a query term in a summary when its translation appears in the source document. We also present several strategies for selecting only certain documents for regeneration which yield further improvements.
Focus-Driven Contrastive Learning for Medical Question Summarization Ming Zhang, Shuai Dou, Ziyang Wang, Yunfang Wu COLING 2022 [pdf]

[Abs]
Automatic medical question summarization can significantly help the system to understand consumer health questions and retrieve correct answers. The Seq2Seq model based on maximum likelihood estimation (MLE) has been applied in this task, which faces two general problems: the model can not capture well question focus and the traditional MLE strategy lacks the ability to understand sentence-level semantics. To alleviate these problems, we propose a novel question focus-driven contrastive learning framework (QFCL). Specially, we propose an easy and effective approach to generate hard negative samples based on the question focus, and exploit contrastive learning at both encoder and decoder to obtain better sentence level representations. On three medical benchmark datasets, our proposed model achieves new state-of-the-art results, and obtains a performance gain of 5.33, 12.85 and 3.81 points over the baseline BART model on three datasets respectively. Further human judgement and detailed analysis prove that our QFCL model learns better sentence representations with the ability to distinguish different sentence meanings, and generates high-quality summaries by capturing question focus.
Domain Adaptation with Pre-trained Transformers for Query Focused Abstractive Text Summarization Md Tahmid Rahman Laskar, Enamul Hoque, Jimmy Xiangji Huang [pdf] [code]
Exploring Neural Models for Query-Focused Summarization Jesse Vig, Alexander R. Fabbri, Wojciech Kryściński Findings of NAACL 2022 [pdf] [code]

[Abs]
Query-focused summarization (QFS) aims to produce summaries that answer particular questions of interest, enabling greater user control and personalization. While recently released datasets, such as QMSum or AQuaMuSe, facilitate research efforts in QFS, the field lacks a comprehensive study of the broad space of applicable modeling methods. In this paper we conduct a systematic exploration of neural approaches to QFS, considering two general classes of methods: two-stage extractive-abstractive solutions and end-to-end models. Within those categories, we investigate existing models and explore strategies for transfer learning. We also present two modeling extensions that achieve state-of-the-art performance on the QMSum dataset, up to a margin of 3.38 ROUGE-1, 3.72 ROUGE-2, and 3.28 ROUGE-L when combined with transfer learning strategies. Results from human evaluation suggest that the best models produce more comprehensive and factually consistent summaries compared to a baseline model. Code and checkpoints are made publicly available: https://github.com/salesforce/query-focused-sum.
Aspect-Oriented Summarization through Query-Focused Extraction Ojas Ahuja, Jiacheng Xu, Akshay Gupta, Kevin Horecka, Greg Durrett [pdf]
Query-Focused Extractive Summarisation for Finding Ideal Answers to Biomedical and COVID-19 Questions Diego Mollá (1 and 2), Urvashi Khanna (1), Dima Galat (1), Vincent Nguyen (2 and 3), Maciej Rybinski (3) ((1) Macquarie University, (2) CSIRO Data61, (3) Australian National University) [pdf]
Summary-Oriented Question Generation for Informational Queries Xusen Yin, Li Zhou, Kevin Small, Jonathan May Proceedings of the 1st Workshop on Document-grounded Dialogue and Conversational Question Answering (DialDoc 2021) [pdf]
Reinforcement Learning for Abstractive Question Summarization with Question-aware Semantic Rewards Shweta Yadav, Deepak Gupta, Asma Ben Abacha, Dina Demner-Fushman ACL 2021 short [pdf] [code]
Generating Query Focused Summaries from Query-Free Resources Yumo Xu, Mirella Lapata ACL 2021 [pdf] [code]
Improve Query Focused Abstractive Summarization by Incorporating Answer Relevance Dan Su, Tiezheng Yu, Pascale Fung ACL21 [pdf] [code]
D2S: Document-to-Slide Generation Via Query-Based Text Summarization Edward Sun, Yufang Hou, Dakuo Wang, Yunfeng Zhang, Nancy X.R. Wang NAACL21 [pdf] [code]

EncoderFusion

Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning Xuebo Liu, Longyue Wang, Derek F. Wong, Liang Ding, Lidia S. Chao, Zhaopeng Tu ICLR21 [pdf]
Improving Abstractive Text Summarization with History Aggregation Pengcheng Liao, Chuang Zhang, Xiaojun Chen, Xiaofei Zhou [pdf] [code]

Discourse

Discourse-Aware Unsupervised Summarization for Long Scientific Documents Yue Dong, Andrei Mircea Romascanu, Jackie Chi Kit Cheung EACL21 [pdf] [code]
Discourse Understanding and Factual Consistency in Abstractive Summarization Saadia Gabriel, Antoine Bosselut, Jeff Da, Ari Holtzman, Jan Buys, Kyle Lo, Asli Celikyilmaz, Yejin Choi EACL21 [pdf] [code]
Predicting Discourse Trees from Transformer-based Neural Summarizers Wen Xiao, Patrick Huber, Giuseppe Carenini NAACL21 [pdf] [code]
Do We Really Need That Many Parameters In Transformer For Extractive Summarization? Discourse Can Help ! Wen Xiao, Patrick Huber, Giuseppe Carenini EMNLP20 Workshop [pdf]
Dialogue Discourse-Aware Graph Convolutional Networks for Abstractive Meeting Summarization Xiachong Feng, Xiaocheng Feng, Bing Qin, Xinwei Geng, Ting Liu [pdf]
Restructuring Conversations using Discourse Relations for Zero-shot Abstractive Dialogue Summarization Prakhar Ganesh, Saket Dingliwal [pdf]
Unsupervised Neural Single-Document Summarization of Reviews via Learning Latent Discourse Structure and its Ranking Masaru Isonuma, Junichiro Mori, Ichiro Sakata ACL19 [pdf] [code]
Exploiting Discourse-Level Segmentation for Extractive Summarization Zhengyuan Liu, Nancy Chen EMNLP19 [pdf]
A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, Nazli Goharian NAACL18 [pdf] [data]

Movie

Movie Summarization via Sparse Graph Construction Pinelopi Papalampidi, Frank Keller, Mirella Lapata AAAI21 [pdf] [code]

Low Resource

Indian Language Summarization using Pretrained Sequence-to-Sequence Models Ashok Urlana, Sahil Manoj Bhatt, Nirmal Surange, Manish Shrivastava FIRE-2022, Indian Language Summarization (ILSUM) track [pdf]

TeSum: Human-Generated Abstractive Summarization Corpus for Telugu Ashok Urlana, Nirmal Surange, Pavan Baswani, Priyanka Ravva, Manish Shrivastava LREC 2022 [pdf] [code and data]

[Abs]
Expert human annotation for summarization is definitely an expensive task, and cannot be done at large scale. But with this work, we show that even with a crowd-sourced summary generation approach, quality can be controlled by aggressive expert-informed filtering and sampling-based human evaluation. We propose a pipeline that crowd-sources summarization data and then aggressively filters the content via automatic and partial expert evaluation. Using this pipeline we create a high-quality Telugu Abstractive Summarization dataset (TeSum) which we validate with sampling-based human evaluation. We also provide baseline numbers for various models commonly used for summarization. A number of recently released summarization datasets scraped web content, relying on the assumption that a summary is made available with the article by the publishers. While this assumption holds for multiple resources (or news sites) in English, it should not be generalised across languages without thorough analysis and verification. Our analysis clearly shows that this assumption does not hold true for most Indian language news resources. We show that our proposed filtration pipeline can even be applied to these large-scale scraped datasets to extract better quality article-summary pairs.
LR-Sum: Summarization for Less-Resourced Languages Chester Palen-Michel, Constantine Lignos [pdf] [code]

[Abs]
This preprint describes work in progress on LR-Sum, a new permissively-licensed dataset created with the goal of enabling further research in automatic summarization for less-resourced languages. LR-Sum contains human-written summaries for 40 languages, many of which are less-resourced. We describe our process for extracting and filtering the dataset from the Multilingual Open Text corpus (Palen-Michel et al., 2022). The source data is public domain newswire collected from Voice of America websites, and LR-Sum is released under a Creative Commons license (CC BY 4.0), making it one of the most openly-licensed multilingual summarization datasets. We describe how we plan to use the data for modeling experiments and discuss limitations of the dataset.
Implementing Deep Learning-Based Approaches for Article Summarization in Indian Languages Rahul Tangsali, Aabha Pingle, Aditya Vyawahare, Isha Joshi, Raviraj Joshi ILSUM at FIRE 2022 [pdf]

[Abs]
Research on text summarization for low-resource Indian languages has been limited by the scarce availability of relevant datasets. This paper presents a summary of various deep-learning approaches used for the ILSUM 2022 Indic language summarization datasets. The ILSUM 2022 dataset consists of news articles written in Indian English, Hindi, and Gujarati respectively, and their ground-truth summarizations. In our work, we explore different pre-trained seq2seq models and fine-tune those with the ILSUM 2022 datasets. In our case, the fine-tuned SoTA PEGASUS model worked the best for English, the fine-tuned IndicBART model with augmented data for Hindi, and again the fine-tuned PEGASUS model along with a translation mapping-based approach for Gujarati. Our scores on the obtained inferences were evaluated using ROUGE-1, ROUGE-2, and ROUGE-4 as the evaluation metrics.
PSP: Pre-trained Soft Prompts for Few-Shot Abstractive Summarization Xiaochen Liu, Yu Bai, Jiawei Li, Yinan Hu, Yang Gao [pdf]

[Abs]
Few-shot abstractive summarization has become a challenging task in natural language generation. To support it, we developed a novel soft prompts architecture coupled with a prompt pre-training plus prompt fine-tuning paradigm, which is effective and tunes only extremely light parameters. To meet the structure of the generation models, the soft prompts comprise continuous input embeddings across an encoder and a decoder. Importantly, a new inner-prompt placed in the text is introduced to capture document-level information. The aim is to devote attention to understanding the document that better prompts the model to generate document-related content. In the training process, the prompt pre-training with self-supervised pseudo-data firstly teaches the model basic summarizing capability. Then, with few-shot examples, only the designed lightweight soft prompts are fine-tuned. Experimental results on the CNN/DailyMail and XSum datasets show that our method, with only 0.1% of the parameters, outperforms full-model tuning where all model parameters are tuned. It also surpasses Prompt Tuning by a large margin and delivers competitive results against Prefix-Tuning with 3% of the parameters.
Towards Summarizing Healthcare Questions in Low-Resource Setting Shweta Yadav, Cornelia Caragea COLING 2022 [pdf]

[Abs]
The current advancement in abstractive document summarization depends to a large extent on a considerable amount of human-annotated datasets. However, the creation of large-scale datasets is often not feasible in closed domains, such as medical and healthcare domains, where human annotation requires domain expertise. This paper presents a novel data selection strategy to generate diverse and semantic questions in a low-resource setting with the aim to summarize healthcare questions. Our method exploits the concept of guided semantic-overlap and diversity-based objective functions to optimally select the informative and diverse set of synthetic samples for data augmentation. Our extensive experiments on benchmark healthcare question summarization datasets demonstrate the effectiveness of our proposed data selection strategy by achieving new state-of-the-art results. Our human evaluation shows that our method generates diverse, fluent, and informative summarized questions.
Automatic Summarization of Russian Texts: Comparison of Extractive and Abstractive Methods Valeriya Goloviznina, Evgeny Kotelnikov Dialogue-2022 [pdf]

[Abs]
The development of large and super-large language models, such as GPT-3, T5, Switch Transformer, ERNIE, etc., has significantly improved the performance of text generation. One of the important research directions in this area is the generation of texts with arguments. The solution of this problem can be used in business meetings, political debates, dialogue systems, for preparation of student essays. One of the main domains for these applications is the economic sphere. The key problem of the argument text generation for the Russian language is the lack of annotated argumentation corpora. In this paper, we use translated versions of the Argumentative Microtext, Persuasive Essays and UKP Sentential corpora to fine-tune RuBERT model. Further, this model is used to annotate the corpus of economic news by argumentation. Then the annotated corpus is employed to fine-tune the ruGPT-3 model, which generates argument texts. The results show that this approach improves the accuracy of the argument generation by more than 20 percentage points (63.2% vs. 42.5%) compared to the original ruGPT-3 model.
Indian Legal Text Summarization: A Text Normalisation-based Approach Satyajit Ghosh, Mousumi Dutta, Tanaya Das [pdf]

[Abs]
In the Indian court system, pending cases have long been a problem. There are more than 4 crore (40 million) cases outstanding. Manually summarising hundreds of documents is a time-consuming and tedious task for legal stakeholders. Many state-of-the-art models for text summarization have emerged as machine learning has progressed. Domain-independent models don't do well with legal texts, and fine-tuning those models for the Indian Legal System is problematic due to a lack of publicly available datasets. To improve the performance of domain-independent models, the authors have proposed a methodology for normalising legal texts in the Indian context. The authors experimented with two state-of-the-art domain-independent models for legal text summarization, namely BART and PEGASUS. BART and PEGASUS are put through their paces in terms of extractive and abstractive summarization to understand the effectiveness of the text normalisation approach. Summarised texts are evaluated by domain experts on multiple parameters and using ROUGE metrics. The results show that the proposed text normalisation approach is effective for legal texts with domain-independent models.
Domain Specific Fine-tuning of Denoising Sequence-to-Sequence Models for Natural Language Summarization Brydon Parker, Alik Sokolov, Mahtab Ahmed, Matt Kalebic, Sedef Akinli Kocak, Ofer Shai [pdf] [code] [data]
An Overview of Indian Language Datasets used for Text Summarization Shagun Sinha, Girish Nath Jha [pdf]
AraBART: a Pretrained Arabic Sequence-to-Sequence Model for Abstractive Summarization Moussa Kamal Eddine, Nadi Tomeh, Nizar Habash, Joseph Le Roux, Michalis Vazirgiannis [pdf] [code]
ExtraPhrase: Efficient Data Augmentation for Abstractive Summarization Mengsay Loem, Sho Takase, Masahiro Kaneko, Naoaki Okazaki [pdf]
Mitigating Data Scarceness through Data Synthesis, Augmentation and Curriculum for Abstractive Summarization Ahmed Magooda, Diane Litman Findings of EMNLP 2021 Short [pdf]
Exploring Multitask Learning for Low-Resource Abstractive Summarization Ahmed Magooda, Mohamed Elaraby, Diane Litman EMNLP 2021 short [pdf]
Few-Shot Learning of an Interleaved Text Summarization Model by Pretraining with Synthetic Data Sanjeev Kumar Karn, Francine Chen, Yan-Ying Chen, Ulli Waltinger, Hinrich Schütze EACL21 [pdf]
AdaptSum: Towards Low-Resource Domain Adaptation for Abstractive Summarization Tiezheng Yu, Zihan Liu, Pascale Fung NAACL21 [pdf] [code]
Meta-Transfer Learning for Low-Resource Abstractive Summarization Yi-Syuan Chen, Hong-Han Shuai AAAI21 [pdf] [code]

Personalized

Unsupervised Summarization with Customized Granularities Ming Zhong, Yang Liu, Suyu Ge, Yuning Mao, Yizhu Jiao, Xingxing Zhang, Yichong Xu, Chenguang Zhu, Michael Zeng, Jiawei Han [pdf]
Transformer Reasoning Network for Personalized Review Summarization Hongyan Xu, Hongtao Liu, Pengfei Jiao, Wenjun Wang SIGIR 2021 [pdf]
PENS: A Dataset and Generic Framework for Personalized News Headline Generation Xiang Ao, Xiting Wang, Ling Luo, Ying Qiao, Qing He, Xing Xie ACL 2021 [pdf] [data]
Collabot: Personalized Group Chat Summarization Naama Tepper, Anat Hashavit, Maya Barnea, Inbal Ronen, Lior Leiba WSDM 2018 [pdf]
Joint Optimization of User-desired Content in Multi-document Summaries by Learning from User Feedback Avinesh P.V.S, Christian M. Meyer ACL 2017 [pdf] [code]
Context Enhanced Personalized Social Summarization Po Hu, Donghong Ji, Chong Teng, Yujing Guo COLING12 [pdf]
Summarize What You Are Interested In: An Optimization Framework for Interactive Personalized Summarization Rui Yan, Jian-Yun Nie, Xiaoming Li EMNLP 2011 [pdf]
In-Browser Summarisation: Generating Elaborative Summaries Biased Towards the Reading Context Stephen Wan, Cécile Paris ACL 2008 [pdf]
Personalized Summarization Agent Using Non-negative Matrix Factorization Sun Park PRICAI 2008 [pdf]
Aspect-Based Personalized Text Summarization Shlomo Berkovsky, Timothy Baldwin, Ingrid Zukerman AH 2008 [pdf]
User-model based personalized summarization Alberto Díaz, Pablo Gervás [pdf]
Machine Learning of Generic and User-Focused Summarization Inderjeet Mani, Eric Bloedorn AAAI 1998 [pdf]

Interactive

Make The Most of Prior Data: A Solution for Interactive Text Summarization with Preference Feedback Duy-Hung Nguyen, Nguyen Viet Dung Nghiem, Bao-Sinh Nguyen, Dung Tien Tien Le, Shahab Sabahi, Minh-Tien Nguyen, Hung Le Findings of NAACL 2022 [pdf]

[Abs]
For summarization, human preferences are critical to tame outputs of the summarizer in favor of human interests, as ground-truth summaries are scarce and ambiguous. Practical settings require dynamic exchanges between humans and AI agents wherein feedback is provided in an online manner, a few at a time. In this paper, we introduce a new framework to train summarization models with preference feedback interactively. By properly leveraging offline data and a novel reward model, we improve the performance regarding ROUGE scores and sample-efficiency. Our experiments on three different datasets confirm the benefit of the proposed framework in active, few-shot and online settings of preference learning.
Interactive Query-Assisted Summarization via Deep Reinforcement Learning Ori Shapira, Ramakanth Pasunuru, Mohit Bansal, Ido Dagan, Yael Amsterdamer NAACL 2022 [pdf] [code]

[Abs]
Interactive summarization is a task that facilitates user-guided exploration of information within a document set. While one would like to employ state of the art neural models to improve the quality of interactive summarization, many such technologies cannot ingest the full document set or cannot operate at sufficient speed for interactivity. To that end, we propose two novel deep reinforcement learning models for the task that address, respectively, the subtask of summarizing salient information that adheres to user queries, and the subtask of listing suggested queries to assist users throughout their exploration. In particular, our models allow encoding the interactive session state and history to refrain from redundancy. Together, these models compose a state of the art solution that addresses all of the task requirements. We compare our solution to a recent interactive summarization system, and show through an experimental study involving real users that our models are able to improve informativeness while preserving positive user experience.
Hone as You Read: A Practical Type of Interactive Summarization Tanner Bohn, Charles X. Ling [pdf]

Speech

Leveraging Large Text Corpora for End-to-End Speech Summarization Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Atsunori Ogawa, Marc Delcroix, Ryo Masumura ICASSP 2023 [pdf]

[Abs]
End-to-end speech summarization (E2E SSum) is a technique to directly generate summary sentences from speech. Compared with the cascade approach, which combines automatic speech recognition (ASR) and text summarization models, the E2E approach is more promising because it mitigates ASR errors, incorporates nonverbal information, and simplifies the overall system. However, since collecting a large amount of paired data (i.e., speech and summary) is difficult, the training data is usually insufficient to train a robust E2E SSum system. In this paper, we present two novel methods that leverage a large amount of external text summarization data for E2E SSum training. The first technique is to utilize a text-to-speech (TTS) system to generate synthesized speech, which is used for E2E SSum training with the text summary. The second is a TTS-free method that directly inputs phoneme sequence instead of synthesized speech to the E2E SSum model. Experiments show that our proposed TTS- and phoneme-based methods improve several metrics on the How2 dataset. In particular, our best system outperforms a previous state-of-the-art one by a large margin (i.e., METEOR score improvements of more than 6 points). To the best of our knowledge, this is the first work to use external language resources for E2E SSum. Moreover, we report a detailed analysis of the How2 dataset to confirm the validity of our proposed E2E SSum system.
Speech Summarization using Restricted Self-Attention Roshan Sharma, Shruti Palaskar, Alan W Black, Florian Metze ICASSP 2022 [pdf]

Prompt

UniSumm and SummZoo: Unified Model and Diverse Benchmark for Few-Shot Summarization Yulong Chen, Yang Liu, Ruochen Xu, Ziyi Yang, Chenguang Zhu, Michael Zeng, Yue Zhang ACL 2023 [pdf]

[Abs]
The high annotation costs and diverse demands of various summarization tasks motivate the development of few-shot summarization. However, despite the emergence of many summarization tasks and datasets, the current training paradigm for few-shot summarization systems ignores potentially shareable knowledge in heterogeneous datasets. To this end, we propose UniSumm, a unified few-shot summarization model that is pre-trained with multiple summarization tasks and can be prefix-tuned to excel at any few-shot summarization task. Meanwhile, to better evaluate few-shot summarizers, under the principles of diversity and robustness, we assemble and release a new benchmark SummZoo. It consists of 8 summarization tasks with multiple sets of few-shot samples for each task, covering diverse domains. Experimental results and analysis show that UniSumm outperforms strong baselines by a large margin across all sub-tasks in SummZoo under both automatic and human evaluations and achieves comparable results in human evaluation compared with a GPT-3.5 model.
Few-shot Query-Focused Summarization with Prefix-Merging Ruifeng Yuan, Zili Wang, Ziqiang Cao, Wenjie Li EMNLP 2022 [pdf]

[Abs]
Query-focused summarization has been considered an important extension of text summarization. It aims to generate a concise highlight for a given query. Different from text summarization, query-focused summarization has long been plagued by the lack of high-quality large-scale datasets. In this paper, we investigate whether we can integrate and transfer the knowledge of text summarization and question answering to assist few-shot learning in query-focused summarization. Here, we propose prefix-merging, a prefix-based pretraining strategy for few-shot learning in query-focused summarization. Drawing inspiration from prefix-tuning, we integrate the task knowledge from text summarization and question answering into a properly designed prefix and apply the merged prefix to query-focused summarization. With only a small number of trainable parameters, prefix-merging outperforms fine-tuning on query-focused summarization. We further discuss the influence of different prefix designs and propose a visualized explanation for how prefix-merging works.
UniSumm: Unified Few-shot Summarization with Multi-Task Pre-Training and Prefix-Tuning Yulong Chen, Yang Liu, Ruochen Xu, Ziyi Yang, Chenguang Zhu, Michael Zeng, Yue Zhang [pdf] [code]

[Abs]
The diverse demands of different summarization tasks and their high annotation costs are driving a need for few-shot summarization. However, despite the emergence of many summarization tasks and datasets, the current training paradigm for few-shot summarization systems ignores potentially shareable knowledge in heterogeneous datasets. To this end, we propose UniSumm, a unified few-shot summarization model that is pre-trained with multiple summarization tasks and can be prefix-tuned to excel at any few-shot summarization dataset. Meanwhile, to better evaluate few-shot summarization systems, under the principles of diversity and robustness, we assemble and publicize a new benchmark SummZoo. It consists of 8 diverse summarization tasks with multiple sets of few-shot samples for each task, covering both monologue and dialogue domains. Experimental results and ablation studies show that UniSumm outperforms strong baseline systems by a large margin across all tasks in SummZoo under both automatic and human evaluations. Our code and benchmark are publicly released.
News Summarization and Evaluation in the Era of GPT-3 Tanya Goyal, Junyi Jessy Li, Greg Durrett [pdf] [code]

[Abs]
The recent success of zero- and few-shot prompting with models like GPT-3 has led to a paradigm shift in NLP research. In this paper, we study its impact on text summarization, focusing on the classic benchmark domain of news summarization. First, we investigate how zero-shot GPT-3 compares against fine-tuned models trained on large summarization datasets. We show that not only do humans overwhelmingly prefer GPT-3 summaries, but these also do not suffer from common dataset-specific issues such as poor factuality. Next, we study what this means for evaluation, particularly the role of gold standard test sets. Our experiments show that both reference-based and reference-free automatic metrics, e.g. recently proposed QA- or entailment-based factuality approaches, cannot reliably evaluate zero-shot summaries. Finally, we discuss future research challenges beyond generic summarization, specifically, keyword- and aspect-based summarization, showing how dominant fine-tuning approaches compare to zero-shot prompting. To support further research, we release: (a) a corpus of 10K generated summaries from fine-tuned and zero-shot models across 4 standard summarization benchmarks, (b) 1K human preference judgments and rationales comparing different systems for generic- and keyword-based summarization.
To Adapt or to Fine-tune: A Case Study on Abstractive Summarization Zheng Zhao, Pinzhen Chen [pdf]

[Abs]
Recent advances in the field of abstractive summarization leverage pre-trained language models rather than train a model from scratch. However, such models are sluggish to train and accompanied by a massive overhead. Researchers have proposed a few lightweight alternatives such as smaller adapters to mitigate the drawbacks. Nonetheless, it remains uncertain whether using adapters benefits the task of summarization, in terms of improved efficiency without an unpleasant sacrifice in performance. In this work, we carry out multifaceted investigations on fine-tuning and adapters for summarization tasks with varying complexity: language, domain, and task transfer. In our experiments, fine-tuning a pre-trained language model generally attains a better performance than using adapters; the performance gap positively correlates with the amount of training data used. Notably, adapters exceed fine-tuning under extremely low-resource conditions. We further provide insights on multi-linguality, model convergence, and robustness, hoping to shed light on the pragmatic choice of fine-tuning or adapters in abstractive summarization.
Discourse-Aware Prompt Design for Text Generation Marjan Ghazvininejad, Vladimir Karpukhin, Asli Celikyilmaz [pdf]

Temp

SIMSUM: Document-level Text Simplification via Simultaneous Summarization Sofia Blinova, Xinyu Zhou, Martin Jaggi, Carsten Eickhoff, Seyed Ali Bahrainian ACL 2023 [pdf] [code]

[Abs]
Document-level text simplification is a specific type of simplification which involves simplifying documents consisting of several sentences by rewriting them into fewer or more sentences. In this paper, we propose a new two-stage framework SIMSUM for automated document-level text simplification. Our model is designed with explicit summarization and simplification models and guides the generation using the main keywords of a source text. In order to evaluate our new model, we use two existing benchmark datasets for simplification, namely D-Wikipedia and Wiki-Doc. We compare our model’s performance with the state of the art and show that SIMSUM achieves top results on the D-Wikipedia dataset in terms of SARI (+1.20), D-SARI (+1.64), and FKGL (-0.35) scores, improving over the best baseline models. In order to evaluate the quality of the generated text, we analyze the outputs from different models qualitatively and demonstrate the merit of our new model. Our code and datasets are available.
Characterizing Political Bias in Automatic Summaries: A Case Study of Trump and Biden Karen Zhou, Chenhao Tan [pdf]

[Abs]
Growing literature has shown that powerful NLP systems may encode social biases; however, the political bias of summarization models remains relatively unknown. In this work, we use an entity replacement method to investigate the portrayal of politicians in automatically generated summaries of news articles. We develop a computational framework based on political entities and lexical resources, and use it to assess biases about Donald Trump and Joe Biden in both extractive and abstractive summarization models. We find consistent differences, such as stronger associations of a collective US government (i.e., administration) with Biden than with Trump. These summary dissimilarities are most prominent when the entity is heavily featured in the source article. Our systematic characterization provides a framework for future studies of bias in summarization.
Learning to Generate Overlap Summaries through Noisy Synthetic Data Naman Bansal, Mousumi Akter, Shubhra Kanti Karmaker Santu EMNLP 2022 [pdf]

[Abs]
Semantic Overlap Summarization (SOS) is a novel and relatively under-explored seq-to-seq task which entails summarizing common information from multiple alternate narratives. One of the major challenges for solving this task is the lack of existing datasets for supervised training. To address this challenge, we propose a novel data augmentation technique, which allows us to create a large amount of synthetic data for training a seq-to-seq model that can perform the SOS task. Through extensive experiments using narratives from the news domain, we show that the models fine-tuned using the synthetic dataset provide significant performance improvements over the pre-trained vanilla summarization techniques and are close to the models fine-tuned on the golden training data, which essentially demonstrates the effectiveness of our proposed data augmentation technique for training seq-to-seq models on the SOS task.
What to Read in a Contract? Party-Specific Summarization of Important Obligations, Entitlements, and Prohibitions in Legal Documents Abhilasha Sancheti, Aparna Garimella, Balaji Vasan Srinivasan, Rachel Rudinger [pdf]

[Abs]
Legal contracts, such as employment or lease agreements, are important documents as they govern the obligations and entitlements of the various contracting parties. However, these documents are typically long and written in legalese, resulting in many manual hours spent understanding them. In this paper, we address the task of summarizing legal contracts for each of the contracting parties, to enable faster reviewing and improved understanding of them. Specifically, we collect a dataset consisting of pairwise importance comparison annotations by legal experts for ~293K sentence pairs from lease agreements. We propose a novel extractive summarization system to automatically produce a summary consisting of the most important obligations, entitlements, and prohibitions in a contract. It consists of two modules: (1) a content categorizer to identify sentences containing each of the categories (i.e., obligation, entitlement, and prohibition) for a party, and (2) an importance ranker to compare the importance among sentences of each category for a party to obtain a ranked list. The final summary is produced by selecting the most important sentences of a category for each of the parties. We demonstrate the effectiveness of our proposed system by comparing it against several text ranking baselines via automatic and human evaluation.
SumREN: Summarizing Reported Speech about Events in News Revanth Gangi Reddy, Heba Elfardy, Hou Pong Chan, Kevin Small, Heng Ji AAAI 2023 [pdf] [code]

[Abs]
A primary objective of news articles is to establish the factual record for an event, frequently achieved by conveying both the details of the specified event (i.e., the 5 Ws; Who, What, Where, When and Why regarding the event) and how people reacted to it (i.e., reported statements). However, existing work on news summarization almost exclusively focuses on the event details. In this work, we propose the novel task of summarizing the reactions of different speakers, as expressed by their reported statements, to a given event. To this end, we create a new multi-document summarization benchmark, SUMREN, comprising 745 summaries of reported statements from various public figures obtained from 633 news articles discussing 132 events. We propose an automatic silver training data generation approach for our task, which helps smaller models like BART achieve GPT-3 level performance on this task. Finally, we introduce a pipeline-based framework for summarizing reported speech, which we empirically show to generate summaries that are more abstractive and factual than baseline query-focused summarization approaches.
Harnessing Abstractive Summarization for Fact-Checked Claim Detection COLING 2022 [pdf] [code]

[Abs]
Social media platforms have become new battlegrounds for anti-social elements, with misinformation being the weapon of choice. Fact-checking organizations try to debunk as many claims as possible while staying true to their journalistic processes but cannot cope with the rapid dissemination of misinformation. We believe that the solution lies in partial automation of the fact-checking life cycle, saving human time for tasks which require high cognition. We propose a new workflow for efficiently detecting previously fact-checked claims that uses abstractive summarization to generate crisp queries. These queries can then be executed on a general-purpose retrieval system associated with a collection of previously fact-checked claims. We curate an abstractive text summarization dataset comprising noisy claims from Twitter and their gold summaries. It is shown that retrieval performance improves 2x by using popular out-of-the-box summarization models and 3x by fine-tuning them on the accompanying dataset compared to verbatim querying. Our approach achieves Recall@5 and MRR of 35% and 0.3, compared to baseline values of 10% and 0.1, respectively. Our dataset, code, and models are publicly available.
Stage-wise Stylistic Headline Generation: Style Generation and Summarized Content Insertion Jiaao Zhan, Yang Gao, Yu Bai, Qianhui Liu IJCAI 2022 [pdf]

[Abs]
A quality headline with a high click-rate should not only summarize the content of an article, but also reflect a style that attracts users. Such demand has drawn rising attention to the task of stylistic headline generation (SHG). An intuitive method is to first generate plain headlines leveraged by document-headline parallel data then transfer them to a target style. However, this inevitably suffers from error propagation. Therefore, to unify the two sub-tasks and explicitly decompose style-relevant attributes and summarize content, we propose an end-to-end stage-wise SHG model containing the style generation component and the content insertion component, where the former generates stylistic-relevant intermediate outputs and the latter receives these outputs then inserts the summarized content. The intermediate outputs are observable, making the style generation easy to control. Our system is comprehensively evaluated by both quantitative and qualitative metrics, and it achieves state-of-the-art results in SHG over three different stylistic datasets.
Beyond Text Generation: Supporting Writers with Continuous Automatic Text Summaries Hai Dang, Karim Benharrak, Florian Lehmann, Daniel Buschek ACM UIST 2022 [pdf]

[Abs]
We propose a text editor to help users plan, structure and reflect on their writing process. It provides continuously updated paragraph-wise summaries as margin annotations, using automatic text summarization. Summary levels range from full text, to selected (central) sentences, down to a collection of keywords. To understand how users interact with this system during writing, we conducted two user studies (N=4 and N=8) in which people wrote analytic essays about a given topic and article. As a key finding, the summaries gave users an external perspective on their writing and helped them to revise the content and scope of their drafted paragraphs. People further used the tool to quickly gain an overview of the text and developed strategies to integrate insights from the automated summaries. More broadly, this work explores and highlights the value of designing AI tools for writers, with Natural Language Processing (NLP) capabilities that go beyond direct text generation and correction.
SETSum: Summarization and Visualization of Student Evaluations of Teaching Yinuo Hu, Shiyue Zhang, Viji Sathy, Abigail Panter, Mohit Bansal NAACL 2022 Demo [pdf] [code]

[Abs]
Student Evaluations of Teaching (SETs) are widely used in colleges and universities. Typically SET results are summarized for instructors in a static PDF report. The report often includes summary statistics for quantitative ratings and an unsorted list of open-ended student comments. The lack of organization and summarization of the raw comments hinders those interpreting the reports from fully utilizing informative feedback, making accurate inferences, and designing appropriate instructional improvements. In this work, we introduce a novel system, SETSUM, that leverages sentiment analysis, aspect extraction, summarization, and visualization techniques to provide organized illustrations of SET findings to instructors and other reviewers. Ten university professors from diverse departments serve as evaluators of the system and all agree that SETSUM helps them interpret SET results more efficiently; and 6 out of 10 instructors prefer our system over the standard static PDF report (while the remaining 4 would like to have both). This demonstrates that our work holds the potential of reforming the SET reporting conventions in the future.
ASPECTNEWS: Aspect-Oriented Summarization of News Documents Ojas Ahuja, Jiacheng Xu, Akshay Gupta, Kevin Horecka, Greg Durrett ACL 2022 [pdf] [code]

[Abs]
Generic summaries try to cover an entire document and query-based summaries try to answer document-specific questions. But real users’ needs often fall in between these extremes and correspond to aspects, high-level topics discussed among similar types of documents. In this paper, we collect a dataset of realistic aspect-oriented summaries, AspectNews, which covers different subtopics about articles in news sub-domains. We annotate data across two domains of articles, earthquakes and fraud investigations, where each article is annotated with two distinct summaries focusing on different aspects for each domain. A system producing a single generic summary cannot concisely satisfy both aspects. Our focus in evaluation is how well existing techniques can generalize to these domains without seeing in-domain training data, so we turn to techniques to construct synthetic training data that have been used in query-focused summarization work. We compare several training schemes that differ in how strongly keywords are used and how oracle summaries are extracted. Our evaluation shows that our final approach yields (a) focused summaries, better than those from a generic summarization system or from keyword matching; (b) a system sensitive to the choice of keywords.
The Triangle-Densest-k-Subgraph Problem: Hardness, Lovász Extension, and Application to Document Summarization Aritra Konar, Nicholas D. Sidiropoulos AAAI 2022 [pdf]
Applying Automatic Text Summarization for Fake News Detection Philipp Hartl, Udo Kruschwitz [pdf] [code]
Graph Enhanced Contrastive Learning for Radiology Findings Summarization Jinpeng Hu, Zhuo Li, Zhihong Chen, Zhen Li, Xiang Wan, Tsung-Hui Chang ACL 2022 [pdf] [code]

[Abs]
The impression section of a radiology report summarizes the most prominent observation from the findings section and is the most important section for radiologists to communicate to physicians. Summarizing findings is time-consuming and can be prone to error for inexperienced radiologists, and thus automatic impression generation has attracted substantial attention. With the encoder-decoder framework, most previous studies explore incorporating extra knowledge (e.g., static pre-defined clinical ontologies or extra background information). Yet, they encode such knowledge by a separate encoder to treat it as an extra input to their models, which is limited in leveraging their relations with the original findings. To address the limitation, we propose a unified framework for exploiting both extra knowledge and the original findings in an integrated way so that the critical information (i.e., key words and their relations) can be extracted in an appropriate way to facilitate impression generation. In detail, each input findings section is encoded by a text encoder, and a graph is constructed from its entities and dependency tree. Then, a graph encoder (e.g., graph neural networks (GNNs)) is adopted to model relation information in the constructed graph. Finally, to emphasize the key words in the findings, contrastive learning is introduced to map positive samples (constructed by masking non-key words) closer and push apart negative ones (constructed by masking key words). The experimental results on two datasets, OpenI and MIMIC-CXR, confirm the effectiveness of our proposed method, where the state-of-the-art results are achieved.
Differentiable Multi-Agent Actor-Critic for Multi-Step Radiology Report Summarization Sanjeev Kumar Karn, Ning Liu, Hinrich Schuetze, Oladimeji Farri ACL 2022 [pdf]

[Abs]
The IMPRESSIONS section of a radiology report about an imaging study is a summary of the radiologist’s reasoning and conclusions, and it also aids the referring physician in confirming or excluding certain diagnoses. A cascade of tasks is required to automatically generate an abstractive summary of the typical information-rich radiology report. These tasks include acquisition of salient content from the report and generation of a concise, easily consumable IMPRESSIONS section. Prior research on radiology report summarization has focused on single-step end-to-end models, which subsume the task of salient content acquisition. To fully explore the cascade structure and explainability of radiology report summarization, we introduce two innovations. First, we design a two-step approach: extractive summarization followed by abstractive summarization. Second, we additionally break down the extractive part into two independent tasks: extraction of salient (1) sentences and (2) keywords. Experiments on English radiology reports from two clinical sites show our novel approach leads to a more precise summary compared to single-step and to two-step-with-single-extractive-process baselines with an overall improvement in F1 score of 3-4%.
AUTOSUMM: Automatic Model Creation for Text Summarization Sharmila Reddy Nangi, Atharv Tyagi, Jay Mundra, Sagnik Mukherjee, Raj Snehal, Niyati Chhaya, Aparna Garimella EMNLP 2021 [pdf]

Extend

SOM-NCSCM : An Efficient Neural Chinese Sentence Compression Model Enhanced with Self-Organizing Map Kangli Zi, Shi Wang, Yu Liu, Jicun Li, Yanan Cao, Cungen Cao EMNLP 2021 [pdf] [data]

Retrieve-augmented

Training Data is More Valuable than You Think: A Simple and Effective Method by Retrieving from Training Data Shuohang Wang, Yichong Xu, Yuwei Fang, Yang Liu, Siqi Sun, Ruochen Xu, Chenguang Zhu, Michael Zeng ACL 2022 [pdf] [code]

Chart-to-text

Chart-to-Text: A Large-Scale Benchmark for Chart Summarization Shankar Kanthara, Rixie Tiffany Ko Leong, Xiang Lin, Ahmed Masry, Megh Thakkar, Enamul Hoque, Shafiq Joty ACL 2022 [pdf] [data]

[Abs]
Charts are commonly used for exploring data and communicating insights. Generating natural language summaries from charts can be very helpful for people in inferring key insights that would otherwise require a lot of cognitive and perceptual efforts. We present Chart-to-text, a large-scale benchmark with two datasets and a total of 44,096 charts covering a wide range of topics and chart types. We explain the dataset construction process and analyze the datasets. We also introduce a number of state-of-the-art neural models as baselines that utilize image captioning and data-to-text generation techniques to tackle two problem variations: one assumes the underlying data table of the chart is available while the other needs to extract data from chart images. Our analysis with automatic and human evaluation shows that while our best models usually generate fluent summaries and yield reasonable BLEU scores, they also suffer from hallucinations and factual errors as well as difficulties in correctly explaining complex patterns and trends in charts.

Podcast

Podcast Summary Assessment: A Resource for Evaluating Summary Assessment Methods Potsawee Manakul, Mark J. F. Gales [pdf] [code]

[Abs]
Automatic summary assessment is useful for both machine-generated and human-produced summaries. Automatically evaluating the summary text given the document enables, for example, summary generation system development and detection of inappropriate summaries. Summary assessment can be run in a number of modes: ranking summary generation systems; ranking summaries of a particular document; and estimating the quality of a document-summary pair on an absolute scale. Existing datasets with annotation for summary assessment are usually based on news summarization datasets such as CNN/DailyMail or XSum. In this work, we describe a new dataset, the podcast summary assessment corpus, a collection of podcast summaries that were evaluated by human experts at TREC2020. Compared to existing summary assessment data, this dataset has two unique aspects: (i) long-input, speech podcast based, documents; and (ii) an opportunity to detect inappropriate reference summaries in podcast corpus. First, we examine existing assessment methods, including model-free and model-based methods, and provide benchmark results for this long-input summary assessment dataset. Second, with the aim of filtering reference summary-document pairings for training, we apply summary assessment for data selection. The experimental results on these two aspects provide interesting insights on the summary assessment and generation tasks. The podcast summary assessment data is available.
Towards Abstractive Grounded Summarization of Podcast Transcripts Kaiqiang Song, Chen Li, Xiaoyang Wang, Dong Yu, Fei Liu ACL 2022 [pdf] [code]

[Abs]
Podcasts have shown a recent rise in popularity. Summarization of podcasts is of practical benefit to both content providers and consumers. It helps people quickly decide whether they will listen to a podcast and/or reduces the cognitive load of content providers to write summaries. Nevertheless, podcast summarization faces significant challenges including factual inconsistencies of summaries with respect to the inputs. The problem is exacerbated by speech disfluencies and recognition errors in transcripts of spoken language. In this paper, we explore a novel abstractive summarization method to alleviate these issues. Our approach learns to produce an abstractive summary while grounding summary segments in specific regions of the transcript to allow for full inspection of summary details. We conduct a series of analyses of the proposed approach on a large podcast dataset and show that the approach can achieve promising results. Grounded summaries bring clear benefits in locating the summary and transcript segments that contain inconsistent information, and hence improve summarization quality in terms of automatic and human evaluation.

Sports

Soccer Game Summarization using Audio Commentary, Metadata, and Captions Sushant Gautam, Cise Midoglu, Saeed Shafiee Sabet, Dinesh Baniya Kshatri, Pål Halvorsen NarSUM 2022 [pdf] [code]
Knowledge Enhanced Sports Game Summarization Jiaan Wang, Zhixu Li, Tingyi Zhang, Duo Zheng, Jianfeng Qu, An Liu, Lei Zhao, Zhigang Chen WSDM 2022 [pdf] [code]
SportsSum2.0: Generating High-Quality Sports News from Live Text Commentary Jiaan Wang, Zhixu Li, Qiang Yang, Jianfeng Qu, Zhigang Chen, Qingsheng Liu, Guoping Hu CIKM 2021 short [pdf] [data]
Generating Sports News from Live Commentary: A Chinese Dataset for Sports Game Summarization Kuan-Hao Huang, Chen Li, Kai-Wei Chang AACL 2020 [pdf] [data]
Generate Football News from Live Webcast Scripts Based on Character-CNN with Five Strokes Xue-Qiang Lv, Xin-Dong You, Wen-Chao Wang, Jian-She Zhou [pdf]
Content Selection for Real-time Sports News Construction from Commentary Texts Jin-ge Yao, Jianmin Zhang, Xiaojun Wan, Jianguo Xiao INLG 2017 [pdf]
Towards Constructing Sports News from Live Text Commentary Jianmin Zhang, Jin-ge Yao, Xiaojun Wan ACL 2016 [pdf]
Sports News Generation from Live Webcast Scripts Based on Rules and Templates Maofu Liu, Qiaosong Qi, Huijun Hu, Han Ren NLPCC 2016 [pdf]
Research on Summary Sentences Extraction Oriented to Live Sports Text Liya Zhu, Wenchao Wang, Yujing Chen, Xueqiang Lv, Jianshe Zhou NLPCC 2016 [pdf]

Scientific

SciLit: A Platform for Joint Scientific Literature Discovery, Summarization and Citation Generation Nianlong Gu, Richard H.R. Hahnloser ACL 2023 Demo [pdf] [demo]

[Abs]
Scientific writing involves retrieving, summarizing, and citing relevant papers, which can be time-consuming processes. Although in many workflows these processes are serially linked, there are opportunities for natural language processing (NLP) to provide end-to-end assistive tools. We propose SciLit, a pipeline that automatically recommends relevant papers, extracts highlights, and suggests a reference sentence as a citation of a paper, taking into consideration the user-provided context and keywords. SciLit efficiently recommends papers from large databases of hundreds of millions of papers using a two-stage pre-fetching and re-ranking literature search system that flexibly deals with addition and removal of a paper database. We provide a convenient user interface that displays the recommended papers as extractive summaries and that offers abstractively-generated citing sentences which are aligned with the provided context and which mention the chosen keyword(s). Our assistive tool for literature discovery and scientific writing is available at https://scilit.vercel.app
The Semantic Reader Project: Augmenting Scholarly Documents through AI-Powered Interactive Reading Interfaces Kyle Lo, Joseph Chee Chang, Andrew Head, Jonathan Bragg, Amy X. Zhang, Cassidy Trier, Chloe Anastasiades, Tal August, Russell Authur, Danielle Bragg, Erin Bransom, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Yen-Sung Chen, Evie Yu-Yen Cheng, Yvonne Chou, Doug Downey, Rob Evans, Raymond Fok, Fangzhou Hu, Regan Huff, Dongyeop Kang, Tae Soo Kim, Rodney Kinney, Aniket Kittur, Hyeonsu Kang, Egor Klevak, Bailey Kuehl, Michael Langan, Matt Latzke, Jaron Lochner, Kelsey MacMillan, Eric Marsh, Tyler Murray, Aakanksha Naik, Ngoc-Uyen Nguyen, Srishti Palani, Soya Park, Caroline Paulic, Napol Rachatasumrit, Smita Rao, Paul Sayre, Zejiang Shen, Pao Siangliulue, Luca Soldaini, Huy Tran, Madeleine van Zuylen, Lucy Lu Wang, Christopher Wilhelm, Caroline Wu, Jiangjiang Yang, Angele Zamarron, Marti A. Hearst, Daniel S. Weld [pdf]

[Abs]
Scholarly publications are key to the transfer of knowledge from scholars to others. However, research papers are information-dense, and as the volume of the scientific literature grows, the need for new technology to support the reading process grows. In contrast to the process of finding papers, which has been transformed by Internet technology, the experience of reading research papers has changed little in decades. The PDF format for sharing research papers is widely used due to its portability, but it has significant downsides including: static content, poor accessibility for low-vision readers, and difficulty reading on mobile devices. This paper explores the question "Can recent advances in AI and HCI power intelligent, interactive, and accessible reading interfaces -- even for legacy PDFs?" We describe the Semantic Reader Project, a collaborative effort across multiple institutions to explore automatic creation of dynamic reading interfaces for research papers. Through this project, we've developed ten research prototype interfaces and conducted usability studies with more than 300 participants and real-world users showing improved reading experiences for scholars. We've also released a production reading interface for research papers that will incorporate the best features as they mature. We structure this paper around challenges scholars and the public face when reading research papers -- Discovery, Efficiency, Comprehension, Synthesis, and Accessibility -- and present an overview of our progress and remaining open challenges.
Summaries as Captions: Generating Figure Captions for Scientific Documents with Automated Text Summarization Chieh-Yang Huang, Ting-Yao Hsu, Ryan Rossi, Ani Nenkova, Sungchul Kim, Gromit Yeuk-Yin Chan, Eunyee Koh, Clyde Lee Giles, Ting-Hao 'Kenneth' Huang [pdf]

[Abs]
Effective figure captions are crucial for clear comprehension of scientific figures, yet poor caption writing remains a common issue in scientific articles. Our study of arXiv cs.CL papers found that 53.88% of captions were rated as unhelpful or worse by domain experts, showing the need for better caption generation. Previous efforts in figure caption generation treated it as a vision task, aimed at creating a model to understand visual content and complex contextual information. Our findings, however, demonstrate that over 75% of figure captions' tokens align with corresponding figure-mentioning paragraphs, indicating great potential for language technology to solve this task. In this paper, we present a novel approach for generating figure captions in scientific documents using text summarization techniques. Our approach extracts sentences referencing the target figure, then summarizes them into a concise caption. In the experiments on real-world arXiv papers (81.2% were published at academic conferences), our method, using only text data, outperformed previous approaches in both automatic and human evaluations. We further conducted data-driven investigations into the two core challenges: (i) low-quality author-written captions and (ii) the absence of a standard for good captions. We found that our models could generate improved captions for figures with original captions rated as unhelpful, and the model trained on captions with more than 30 tokens produced higher-quality captions. We also found that good captions often include the high-level takeaway of the figure. Our work proves the effectiveness of text summarization in generating figure captions for scholarly articles, outperforming prior vision-based approaches. Our findings have practical implications for future figure captioning systems, improving scientific communication clarity.
CiteBench: A benchmark for Scientific Citation Text Generation Martin Funkquist, Ilia Kuznetsov, Yufang Hou, Iryna Gurevych [pdf] [code]

[Abs]
The publication rates are skyrocketing across many fields of science, and it is difficult to stay up to date with the latest research. This makes automatically summarizing the latest findings and helping scholars to synthesize related work in a given area an attractive research objective. In this paper we study the problem of citation text generation, where given a set of cited papers and citing context the model should generate a citation text. While citation text generation has been tackled in prior work, existing studies use different datasets and task definitions, which makes it hard to study citation text generation systematically. To address this, we propose CiteBench: a benchmark for citation text generation that unifies the previous datasets and enables standardized evaluation of citation text generation models across task settings and domains. Using the new benchmark, we investigate the performance of multiple strong baselines, test their transferability between the datasets, and deliver new insights into task definition and evaluation to guide the future research in citation text generation. We make CiteBench publicly available.
MORTY: Structured Summarization for Targeted Information Extraction from Scholarly Articles Mohamad Yaser Jaradeh, Markus Stocker, Sören Auer ICADL 2022 [pdf]

[Abs]
Information extraction from scholarly articles is a challenging task due to the sizable document length and implicit information hidden in text, figures, and citations. Scholarly information extraction has various applications in exploration, archival, and curation services for digital libraries and knowledge management systems. We present MORTY, an information extraction technique that creates structured summaries of text from scholarly articles. Our approach condenses the article's full-text to property-value pairs as a segmented text snippet called structured summary. We also present a sizable scholarly dataset combining structured summaries retrieved from a scholarly knowledge graph and corresponding publicly available scientific articles, which we openly publish as a resource for the research community. Our results show that structured summarization is a suitable approach for targeted information extraction that complements other commonly used methods such as question answering and named entity recognition.
Scientific Paper Extractive Summarization Enhanced by Citation Graphs Xiuying Chen, Mingzhe Li, Shen Gao, Rui Yan, Xin Gao, Xiangliang Zhang EMNLP 2022 [pdf]

[Abs]
In a citation graph, adjacent paper nodes share related scientific terms and topics. The graph thus conveys unique structure information of document-level relatedness that can be utilized in the paper summarization task, for exploring beyond the intra-document information. In this work, we focus on leveraging citation graphs to improve scientific paper extractive summarization under different settings. We first propose a Multi-granularity Unsupervised Summarization model (MUS) as a simple and low-cost solution to the task. MUS finetunes a pre-trained encoder model on the citation graph by link prediction tasks. Then, the abstract sentences are extracted from the corresponding paper considering multi-granularity information. Preliminary results demonstrate that citation graph is helpful even in a simple unsupervised framework. Motivated by this, we next propose a Graph-based Supervised Summarization model (GSS) to achieve more accurate results on the task when large-scale labeled data are available. Apart from employing the link prediction as an auxiliary task, GSS introduces a gated sentence encoder and a graph information fusion module to take advantage of the graph information to polish the sentence representation. Experiments on a public benchmark dataset show that MUS and GSS bring substantial improvements over the prior state-of-the-art model.
Comparative Graph-based Summarization of Scientific Papers Guided by Comparative Citations Jingqiang Chen, Chaoxiang Cai, Xiaorui Jiang, Kejia Chen COLING 2022 [pdf]

[Abs]
With the rapid growth of scientific papers, understanding the changes and trends in a research area is rather time-consuming. The first challenge is to find related and comparable articles for the research. Comparative citations compare co-cited papers in a citation sentence and can serve as good guidance for researchers to track a research area. We thus go through comparative citations to find comparable objects and build a comparative scientific summarization corpus (CSSC). And then, we propose the comparative graph-based summarization (CGSUM) method to create comparative summaries using citations as guidance. The comparative graph is constructed using sentences as nodes and three different relationships of sentences as edges. The relationship that sentences occur in the same paper is used to calculate the salience of sentences, the relationship that sentences occur in two different papers is used to calculate the difference between sentences, and the relationship that sentences are related to citations is used to calculate the commonality of sentences. Experiments show that CGSUM outperforms comparative baselines on CSSC and performs well on DUC2006 and DUC2007.
CSL: A Large-scale Chinese Scientific Literature Dataset Yudong Li, Yuqing Zhang, Zhe Zhao, Linlin Shen, Weijie Liu, Weiquan Mao, Hui Zhang COLING 2022 [pdf] [code]

[Abs]
Scientific literature serves as a high-quality corpus, supporting a lot of Natural Language Processing (NLP) research. However, existing datasets are centered around the English language, which restricts the development of Chinese scientific NLP. In this work, we present CSL, a large-scale Chinese Scientific Literature dataset, which contains the titles, abstracts, keywords and academic fields of 396k papers. To our knowledge, CSL is the first scientific document dataset in Chinese. The CSL can serve as a Chinese corpus. Also, this semi-structured data is a natural annotation that can constitute many supervised NLP tasks. Based on CSL, we present a benchmark to evaluate the performance of models across scientific domain tasks, i.e., summarization, keyword generation and text classification. We analyze the behavior of existing text-to-text models on the evaluation tasks and reveal the challenges for Chinese scientific NLP tasks, which provides a valuable reference for future research. Data and code are available.
Multi-Document Scientific Summarization from a Knowledge Graph-Centric View Pancheng Wang, Shasha Li, Kunyuan Pang, Liangliang He, Dong Li, Jintao Tang, Ting Wang COLING 2022 [pdf] [code]

[Abs]
Multi-Document Scientific Summarization (MDSS) aims to produce coherent and concise summaries for clusters of topic-relevant scientific papers. This task requires precise understanding of paper content and accurate modeling of cross-paper relationships. Knowledge graphs convey compact and interpretable structured information for documents, which makes them ideal for content modeling and relationship modeling. In this paper, we present KGSum, an MDSS model centred on knowledge graphs during both the encoding and decoding process. Specifically, in the encoding process, two graph-based modules are proposed to incorporate knowledge graph information into paper encoding, while in the decoding process, we propose a two-stage decoder by first generating knowledge graph information of summary in the form of descriptive sentences, followed by generating the final summary. Empirical results show that the proposed architecture brings substantial improvements over baselines on the Multi-Xscience dataset.
On Extractive Summarization for Profile-centric Neural Expert Search in Academia Rennan C. Lima, Rodrygo L. T. Santos SIGIR 2022 Short [pdf]

[Abs]
Identifying academic experts is crucial for the progress of science, enabling researchers to connect, form networks, and collaborate on the most pressing research problems. A key challenge for ranking experts in response to a query is how to infer their expertise from the publications they coauthored. Profile-centric approaches represent candidate experts by concatenating all their publications into a text-based profile. Despite offering a complete picture of each candidate's scientific output, such lengthy profiles make it inefficient to leverage state-of-the-art neural architectures for inferring expertise. To overcome this limitation, we investigate the suitability of extractive summarization as a mechanism to reduce candidate profiles for semantic encoding using Transformers. Our thorough experiments with a representative academic search test collection demonstrate the benefits of encoding summarized profiles for an improved expertise inference.
Generating a Structured Summary of Numerous Academic Papers: Dataset and Method Shuaiqi Liu, Jiannong Cao, Ruosong Yang, Zhiyuan Wen IJCAI 2022 [pdf] [data]

[Abs]
Writing a survey paper on one research topic usually needs to cover the salient content from numerous related papers, which can be modeled as a multi-document summarization (MDS) task. Existing MDS datasets usually focus on producing the structureless summary covering a few input documents. Meanwhile, previous structured summary generation works focus on summarizing a single document into a multi-section summary. These existing datasets and methods cannot meet the requirements of summarizing numerous academic papers into a structured summary. To deal with the scarcity of available data, we propose BigSurvey, the first large-scale dataset for generating comprehensive summaries of numerous academic papers on each topic. We collect target summaries from more than seven thousand survey papers and utilize their 430 thousand reference papers’ abstracts as input documents. To organize the diverse content from dozens of input documents and ensure the efficiency of processing long text sequences, we propose a summarization method named category-based alignment and sparse transformer (CAST). The experimental results show that our CAST method outperforms various advanced summarization methods.
TSTR: Too Short to Represent, Summarize with Details! Intro-Guided Extended Summary Generation Sajad Sotudeh, Nazli Goharian NAACL 2022 [pdf] [code]

[Abs]
Many scientific papers such as those in arXiv and PubMed data collections have abstracts with varying lengths of 50-1000 words and average length of approximately 200 words, where longer abstracts typically convey more information about the source paper. Up to recently, scientific summarization research has typically focused on generating short, abstract-like summaries following the existing datasets used for scientific summarization. In domains where the source text is relatively long-form, such as in scientific documents, such summary is not able to go beyond the general and coarse overview and provide salient information from the source document. The recent interest to tackle this problem motivated curation of scientific datasets, arXiv-Long and PubMed-Long, containing human-written summaries of 400-600 words, hence, providing a venue for research in generating long/extended summaries. Extended summaries facilitate a faster read while providing details beyond coarse information. In this paper, we propose TSTR, an extractive summarizer that utilizes the introductory information of documents as pointers to their salient information. The evaluations on two existing large-scale extended summarization datasets indicate statistically significant improvement in terms of Rouge and average Rouge (F1) scores (except in one case) as compared to strong baselines and state-of-the-art. Comprehensive human evaluations favor our generated extended summaries in terms of cohesion and completeness.
X-SCITLDR: Cross-Lingual Extreme Summarization of Scholarly Documents Sotaro Takeshita, Tommaso Green, Niklas Friedrich, Kai Eckert, Simone Paolo Ponzetto JCDL 2022 [pdf] [data]
Target-aware Abstractive Related Work Generation with Contrastive Learning Xiuying Chen, Hind Alamro, Mingzhe Li, Shen Gao, Rui Yan, Xin Gao, Xiangliang Zhang SIGIR 2022 [pdf] [code]
CiteSum: Citation Text-guided Scientific Extreme Summarization and Domain Adaptation with Limited Supervision Yuning Mao, Ming Zhong, Jiawei Han EMNLP 2022 [pdf] [data]

[Abs]
Scientific extreme summarization (TLDR) aims to form ultra-short summaries of scientific papers. Previous efforts on curating scientific TLDR datasets failed to scale up due to the heavy human annotation and domain expertise required. In this paper, we propose a simple yet effective approach to automatically extracting TLDR summaries for scientific papers from their citation texts. Based on the proposed approach, we create a new benchmark CiteSum without human annotation, which is around 30 times larger than the previous human-curated dataset SciTLDR. We conduct a comprehensive analysis of CiteSum, examining its data characteristics and establishing strong baselines. We further demonstrate the usefulness of CiteSum by adapting models pre-trained on CiteSum (named CITES) to new tasks and domains with limited supervision. For scientific extreme summarization, CITES outperforms most fully-supervised methods on SciTLDR without any fine-tuning and obtains state-of-the-art results with only 128 examples. For news extreme summarization, CITES achieves significant gains on XSum over its base model (not pre-trained on CiteSum), e.g., +7.2 ROUGE-1 zero-shot performance and state-of-the-art few-shot performance. For news headline generation, CITES performs the best among unsupervised and zero-shot methods on Gigaword.

Post-Editing

An Exploration of Post-Editing Effectiveness in Text Summarization Vivian Lai, Alison Smith-Renner, Ke Zhang, Ruijia Cheng, Wenjuan Zhang, Joel Tetreault, Alejandro Jaimes NAACL 2022 [pdf] [code]

[Abs]
Automatic summarization methods are efficient but can suffer from low quality. In comparison, manual summarization is expensive but produces higher quality. Can humans and AI collaborate to improve summarization performance? In similar text generation tasks (e.g., machine translation), human-AI collaboration in the form of "post-editing" AI-generated text reduces human workload and improves the quality of AI output. Therefore, we explored whether post-editing offers advantages in text summarization. Specifically, we conducted an experiment with 72 participants, comparing post-editing provided summaries with manual summarization for summary quality, human efficiency, and user experience on formal (XSum news) and informal (Reddit posts) text. This study sheds valuable insights on when post-editing is useful for text summarization: it helped in some cases (e.g., when participants lacked domain knowledge) but not in others (e.g., when provided summaries include inaccurate information). Participants' different editing strategies and needs for assistance offer implications for future human-AI summarization systems.

Human

What Makes a Good and Useful Summary? Incorporating Users in Automatic Summarization Research Maartje Ter Hoeve, Julia Kiseleva, Maarten de Rijke NAACL 2022 [pdf] [code]

[Abs]
Automatic text summarization has enjoyed great progress over the years and is used in numerous applications, impacting the lives of many. Despite this development, there is little research that meaningfully investigates how the current research focus in automatic summarization aligns with users’ needs. To bridge this gap, we propose a survey methodology that can be used to investigate the needs of users of automatically generated summaries. Importantly, these needs are dependent on the target group. Hence, we design our survey in such a way that it can be easily adjusted to investigate different user groups. In this work we focus on university students, who make extensive use of summaries during their studies. We find that the current research directions of the automatic summarization community do not fully align with students’ needs. Motivated by our findings, we present ways to mitigate this mismatch in future research on automatic summarization: we propose research directions that impact the design, the development and the evaluation of automatically generated summaries.
Mapping the Design Space of Human-AI Interaction in Text Summarization Ruijia Cheng, Alison Smith-Renner, Ke Zhang, Joel Tetreault, Alejandro Jaimes-Larrarte NAACL 2022 [pdf] [code]

[Abs]
Brief Hospital Course (BHC) summaries are succinct summaries of an entire hospital encounter, embedded within discharge summaries, written by senior clinicians responsible for the overall care of a patient. Methods to automatically produce summaries from inpatient documentation would be invaluable in reducing clinician manual burden of summarising documents under high time-pressure to admit and discharge patients. Automatically producing these summaries from the inpatient course, is a complex, multi-document summarisation task, as source notes are written from various perspectives (e.g. nursing, doctor, radiology), during the course of the hospitalisation. We demonstrate a range of methods for BHC summarisation demonstrating the performance of deep learning summarisation models across extractive and abstractive summarisation scenarios. We also test a novel ensemble extractive and abstractive summarisation model that incorporates a medical concept ontology (SNOMED) as a clinical guidance signal and shows superior performance in 2 real-world clinical data sets.

Medical

Background Knowledge Grounding for Readable, Relevant, and Factual Biomedical Lay Summaries Domenic Rosati [pdf]
SPeC: A Soft Prompt-Based Calibration on Mitigating Performance Variability in Clinical Notes Summarization Yu-Neng Chuang, Ruixiang Tang, Xiaoqian Jiang, Xia Hu [pdf]

[Abs]
Electronic health records (EHRs) store an extensive array of patient information, encompassing medical histories, diagnoses, treatments, and test outcomes. These records are crucial for enabling healthcare providers to make well-informed decisions regarding patient care. Summarizing clinical notes further assists healthcare professionals in pinpointing potential health risks and making better-informed decisions. This process contributes to reducing errors and enhancing patient outcomes by ensuring providers have access to the most pertinent and current patient data. Recent research has shown that incorporating prompts with large language models (LLMs) substantially boosts the efficacy of summarization tasks. However, we show that this approach also leads to increased output variance, resulting in notably divergent outputs even when prompts share similar meanings. To tackle this challenge, we introduce a model-agnostic Soft Prompt-Based Calibration (SPeC) pipeline that employs soft prompts to diminish variance while preserving the advantages of prompt-based summarization. Experimental findings on multiple clinical note tasks and LLMs indicate that our method not only bolsters performance but also effectively curbs variance for various LLMs, providing a more uniform and dependable solution for summarizing vital medical information.
NapSS: Paragraph-level Medical Text Simplification via Narrative Prompting and Sentence-matching Summarization Junru Lu, Jiazheng Li, Byron C. Wallace, Yulan He, Gabriele Pergola Findings of EACL 2023 [pdf] [code]

[Abs]
Accessing medical literature is difficult for laypeople as the content is written for specialists and contains medical jargon. Automated text simplification methods offer a potential means to address this issue. In this work, we propose a summarize-then-simplify two-stage strategy, which we call NapSS, identifying the relevant content to simplify while ensuring that the original narrative flow is preserved. In this approach, we first generate reference summaries via sentence matching between the original and the simplified abstracts. These summaries are then used to train an extractive summarizer, learning the most relevant content to be simplified. Then, to ensure the narrative consistency of the simplified text, we synthesize auxiliary narrative prompts combining key phrases derived from the syntactical analyses of the original text. Our model achieves results significantly better than the seq2seq baseline on an English medical corpus, yielding 3%~4% absolute improvements in terms of lexical similarity, and providing a further 1.1% improvement of SARI score when combined with the baseline. We also highlight shortcomings of existing evaluation methods, and introduce new metrics that take into account both lexical and high-level semantic similarity. A human evaluation conducted on a random sample of the test set further establishes the effectiveness of the proposed approach. Code and models are released.
Leveraging Summary Guidance on Medical Report Summarization Yunqi Zhu, Xuebing Yang, Yuanyuan Wu, Wensheng Zhang [pdf] [code]

[Abs]
This study presents three deidentified large medical text datasets, named DISCHARGE, ECHO and RADIOLOGY, which contain 50K, 16K and 378K pairs of report and summary that are derived from MIMIC-III, respectively. We implement convincing baselines of automated abstractive summarization on the proposed datasets with pre-trained encoder-decoder language models, including BERT2BERT, T5-large and BART. Further, based on the BART model, we leverage the sampled summaries from the train set as prior knowledge guidance, for encoding additional contextual representations of the guidance with the encoder and enhancing the decoding representations in the decoder. The experimental results confirm the improvement of ROUGE scores and BERTScore made by the proposed method, outperforming the larger model T5-large.
Toward expanding the scope of radiology report summarization to multiple anatomies and modalities Jean-Benoit Delbrouck, Maya Varma, Curtis P. Langlotz Machine Learning For Health (ML4H) [pdf]

[Abs]
Radiology report summarization is a growing area of research. Given the Findings and/or Background sections of a radiology report, the goal is to generate a summary (called an Impression section) that highlights the key observations and conclusions of the radiology study. Recent efforts have released systems that achieve promising performance as measured by widely used summarization metrics such as BLEU and ROUGE. However, the research area of radiology report summarization currently faces important limitations. First, most of the results are reported on private datasets. This limitation prevents the ability to reproduce results and fairly compare different systems and solutions. Secondly, to the best of our knowledge, most research is carried out on chest X-rays. Sometimes, studies even omit to mention the concerned modality and anatomy in the radiology reports used for their experiments. To palliate these limitations, we propose a new dataset of six different modalities and anatomies based on the MIMIC-III database. We further release our results and the data splits used to carry out our experiments. Finally, we propose a simple report summarization system that outperforms the previous replicable research on the existing dataset.
Summarisation of Electronic Health Records with Clinical Concept Guidance Thomas Searle, Zina Ibrahim, James Teo, Richard Dobson [pdf]

[Abs]
Automatic text summarization systems commonly involve humans for preparing data or evaluating model performance, yet, there lacks a systematic understanding of humans’ roles, experience, and needs when interacting with or being assisted by AI. From a human-centered perspective, we map the design opportunities and considerations for human-AI interaction in text summarization and broader text generation tasks. We first conducted a systematic literature review of 70 papers, developing a taxonomy of five interactions in AI-assisted text generation and relevant design dimensions. We designed text summarization prototypes for each interaction. We then interviewed 16 users, aided by the prototypes, to understand their expectations, experience, and needs regarding efficiency, control, and trust with AI in text summarization and propose design considerations accordingly.

Semantic Overlap

SEM-F1: an Automatic Way for Semantic Evaluation of Multi-Narrative Overlap Summaries at Scale Naman Bansal, Mousumi Akter, Shubhra Kanti Karmaker Santu EMNLP 2022 [pdf]

[Abs]
Recent work has introduced an important yet relatively under-explored NLP task called Semantic Overlap Summarization (SOS) that entails generating a summary from multiple alternative narratives which conveys the common information provided by those narratives. Previous work also published a benchmark dataset for this task by collecting 2,925 alternative narrative pairs from the web and manually annotating 411 different reference summaries by engaging human annotators. In this paper, we exclusively focus on the automated evaluation of the SOS task using the benchmark dataset. More specifically, we first use the popular ROUGE metric from text-summarization literature and conduct a systematic study to evaluate the SOS task. Our experiments discover that ROUGE is not suitable for this novel task and therefore, we propose a new sentence-level precision-recall style automated evaluation metric, called SEM-F1 (Semantic F1). It is inspired by the benefits of the sentence-wise annotation technique using overlap labels reported by the previous work. Our experiments show that the proposed SEM-F1 metric yields a higher correlation with human judgment and higher inter-rater agreement compared to the ROUGE metric.
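
The abstract above describes SEM-F1 as a sentence-level precision-recall style metric. The following is a minimal sketch of that general idea, not the authors' implementation: the function names (`tokens`, `jaccard`, `sentence_level_f1`) are hypothetical, and the Jaccard token overlap is only an assumed stand-in for whatever sentence similarity the paper actually uses. Each candidate sentence is matched to its most similar reference sentence (precision), each reference sentence to its most similar candidate sentence (recall), and the two averages are combined into an F1 score.

```python
import re


def tokens(sentence):
    """Lowercased alphanumeric tokens of a sentence, as a set."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a, b):
    """Assumed stand-in similarity: token-set Jaccard overlap in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def sentence_level_f1(candidate, reference, sim=jaccard):
    """Return (precision, recall, f1) over two lists of sentences.

    precision: mean over candidate sentences of the best match in reference.
    recall:    mean over reference sentences of the best match in candidate.
    """
    if not candidate or not reference:
        return 0.0, 0.0, 0.0
    precision = sum(max(sim(c, r) for r in reference) for c in candidate) / len(candidate)
    recall = sum(max(sim(r, c) for c in candidate) for r in reference) / len(reference)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    candidate = ["The senate passed the bill."]
    reference = ["The senate passed the bill.", "The president will sign it."]
    print(sentence_level_f1(candidate, reference))
```

Passing a different `sim` callable (for example, cosine similarity over sentence embeddings) changes the notion of semantic match without touching the precision-recall logic.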
Semantic Overlap Summarization among Multiple Alternative Narratives: An Exploratory Study Naman Bansal, Mousumi Akter, Shubhra Kanti Karmaker Santu COLING 2022 [pdf] [data]

[Abs]
In this paper, we introduce an important yet relatively unexplored NLP task called Semantic Overlap Summarization (SOS), which entails generating a single summary from multiple alternative narratives which can convey the common information provided by those narratives. As no benchmark dataset is readily available for this task, we created one by collecting 2,925 alternative narrative pairs from the web and then, went through the tedious process of manually creating 411 different reference summaries by engaging human annotators. As a way to evaluate this novel task, we first conducted a systematic study by borrowing the popular ROUGE metric from text-summarization literature and discovered that ROUGE is not suitable for our task. Subsequently, we conducted further human annotations to create 200 document-level and 1,518 sentence-level ground-truth overlap labels. Our experiments show that the sentence-wise annotation technique with three overlap labels, i.e., Absent (A), Partially-Present (PP), and Present (P), yields a higher correlation with human judgment and higher inter-rater agreement compared to the ROUGE metric.

Tutorial

Beyond Opinion Mining: Summarizing Opinions of Customer Reviews Reinald Kim Amplayo, Arthur Bražinskas, Yoshi Suhara, Xiaolan Wang, Bing Liu SIGIR Tutorial 2022 [pdf]
