ROUGE-SEM: Better evaluation of summarization using ROUGE combined with semantics

This project contains the source code for the paper ROUGE-SEM: Better evaluation of summarization using ROUGE combined with semantics, published in Expert Systems with Applications. If you use this code, please cite the article as follows.

M. Zhang, C. Li, M. Wan et al., ROUGE-SEM: Better evaluation of summarization using ROUGE combined with semantics. Expert Systems With Applications (2023), doi: https://doi.org/10.1016/j.eswa.2023.121364.

Highlighted Features

A framework of ROUGE combined with semantics is proposed for summarization evaluation.
Candidate summaries are classified by their semantic and lexical similarity to the reference.
Variants of ROUGE-SEM outperform the corresponding variants of ROUGE consistently.

Requirements

We use Python 3.7 under Conda and strongly recommend that you create a new environment.

Prerequisite: Python 3.7 or higher

conda create -n ROUGE-SEM python=3.7
conda activate ROUGE-SEM

Environment

Install all packages listed in requirements.txt:

Python 3.7
PyTorch 1.4.0+cu100
HuggingFace Transformers 4.16.2
boto3 1.24.32
numpy 1.21.4
pandas 1.1.5
regex 2022.7.9
sentencepiece 0.1.96
sklearn latest
scipy
datasets
pandas
scikit-learn
prettytable
gradio
setuptools
summ-eval

pip3 install -r requirements.txt

Set Up for ROUGE

Read more from this link.

git clone https://github.com/summanlp/evaluation
export ROUGE_EVAL_HOME="yourPath/evaluation/ROUGE-RELEASE-1.5.5/data/"
pip install pyrouge
pyrouge_set_rouge_path yourPath/evaluation/ROUGE-RELEASE-1.5.5
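
Before running the pipeline, it can help to confirm that the ROUGE paths above are in place. The following is a minimal sketch, not part of this repository: `check_rouge_setup` is a hypothetical helper, and it assumes that `ROUGE-1.5.5.pl` sits one directory above the `data/` folder that `ROUGE_EVAL_HOME` points to, as in the layout of the cloned `evaluation` repository.

```python
import os
from pathlib import Path


def check_rouge_setup(env=None):
    """Return a list of problems with the ROUGE setup (empty if none found).

    Hypothetical helper: it only inspects paths and does not run ROUGE.
    """
    env = os.environ if env is None else env
    problems = []

    data_dir = env.get("ROUGE_EVAL_HOME")
    if not data_dir:
        problems.append("ROUGE_EVAL_HOME is not set")
        return problems

    data_path = Path(data_dir)
    if not data_path.is_dir():
        problems.append(f"ROUGE_EVAL_HOME is not a directory: {data_dir}")
        return problems

    # Assumption: the Perl script lives next to the data/ directory.
    script = data_path.parent / "ROUGE-1.5.5.pl"
    if not script.is_file():
        problems.append(f"ROUGE-1.5.5.pl not found at {script}")
    return problems


if __name__ == "__main__":
    for problem in check_rouge_setup():
        print("WARNING:", problem)
```

An empty result only means the expected files exist; it does not guarantee that the Perl dependencies of ROUGE-1.5.5 are installed.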

Datasets

SummEval

More details can be found at this link. Please request and download the data from the original paper.

DialSummEval

More details can be found at this link. Please request and download the data from the original paper.

Models

Our released models can be downloaded here. You can load these models with HuggingFace's Transformers.

Example Use Cases

Command-line interface

source run.sh

Evaluate Text Summarization Step by Step

Given the source documents, reference summaries, and candidate summaries to be evaluated, you can compute the ROUGE-SEM score for the candidates with the following steps:

Calculate Lexical Similarity

python calculate_lexical_similarity.py -r reference.txt -c candidate.txt
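
To illustrate what a lexical similarity score measures, here is a minimal sketch of ROUGE-N as clipped n-gram overlap. It is not the implementation used by `calculate_lexical_similarity.py`, which relies on the ROUGE-1.5.5 toolkit set up above; the function name `rouge_n` and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter


def rouge_n(reference: str, candidate: str, n: int = 1) -> dict:
    """Compute ROUGE-N precision, recall and F1 from clipped n-gram counts."""

    def ngrams(text: str) -> Counter:
        # Assumption: lowercase whitespace tokenization, no stemming.
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    print(rouge_n("the cat sat on the mat", "the cat lay on the mat"))
```

Because the score depends only on shared surface tokens, a faithful paraphrase with different wording scores low, which is the lexical bias the semantic module is meant to offset.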

Calculate Semantic Similarity

python calculate_semantic_similarity.py -r reference.txt -c candidate.txt
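
Embedding-based metrics typically compare a reference and a candidate by the cosine of the angle between their sentence embeddings. The sketch below shows only that comparison step, under the assumption that embeddings have already been produced by some encoder; whether `calculate_semantic_similarity.py` uses exactly this measure is not stated here, and `cosine_similarity` is a name chosen for the example.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal dimension."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy vectors standing in for encoder outputs.
    print(cosine_similarity([0.2, 0.7, 0.1], [0.25, 0.6, 0.05]))
```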

Candidate Summary Classifier

python candidate_summary_classifier.py -lex_score lexical_similarity.csv -sem_score semantic_similarity.csv
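
The classifier assigns each candidate to one of the four categories named in the paper's abstract: good-summary, pearl-summary, glass-summary, and bad-summary. The sketch below shows one plausible decision rule. The mapping of categories to score combinations (pearl as high semantic but low lexical similarity, glass as the reverse) and the 0.5 thresholds are assumptions for illustration, not values taken from `candidate_summary_classifier.py`.

```python
def classify_summary(
    lex_score: float,
    sem_score: float,
    lex_threshold: float = 0.5,  # hypothetical threshold
    sem_threshold: float = 0.5,  # hypothetical threshold
) -> str:
    """Assign a category from lexical and semantic similarity scores."""
    high_lex = lex_score >= lex_threshold
    high_sem = sem_score >= sem_threshold
    if high_sem and high_lex:
        return "good-summary"
    if high_sem:
        # Assumed: semantically close to the reference, lexically distant.
        return "pearl-summary"
    if high_lex:
        # Assumed: lexically close to the reference, semantically distant.
        return "glass-summary"
    return "bad-summary"


if __name__ == "__main__":
    for lex, sem in [(0.8, 0.9), (0.2, 0.9), (0.8, 0.1), (0.1, 0.2)]:
        print(lex, sem, classify_summary(lex, sem))
```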

Categorized Summary Rewriter

python categorized_summary_rewriter.py -category categorized_summary.csv -c candidate.txt
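
According to the abstract, back-translation is applied to pearl-summary and glass-summary candidates, the two categories that ROUGE evaluates inaccurately. A minimal sketch of that routing follows. The translation functions are passed in as plain callables because the actual translation model is not specified here; `back_translate`, `rewrite_candidates`, and both callables are hypothetical names.

```python
from typing import Callable, List

# Categories that are rewritten, per the paper's abstract.
REWRITE_CATEGORIES = {"pearl-summary", "glass-summary"}


def back_translate(
    text: str,
    to_pivot: Callable[[str], str],
    from_pivot: Callable[[str], str],
) -> str:
    """Translate into a pivot language and back to obtain a paraphrase."""
    return from_pivot(to_pivot(text))


def rewrite_candidates(
    candidates: List[str],
    categories: List[str],
    to_pivot: Callable[[str], str],
    from_pivot: Callable[[str], str],
) -> List[str]:
    """Rewrite only the candidates whose category calls for it."""
    if len(candidates) != len(categories):
        raise ValueError("candidates and categories must align one to one")
    rewritten = []
    for text, category in zip(candidates, categories):
        if category in REWRITE_CATEGORIES:
            rewritten.append(back_translate(text, to_pivot, from_pivot))
        else:
            # good-summary and bad-summary pass through unchanged.
            rewritten.append(text)
    return rewritten
```

In practice the two callables would wrap a machine translation model for a chosen pivot language; any pair of functions from string to string fits this interface.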

Rewritten Summary Scorer

python rewritten_summary_scorer.py -r reference.txt -c new_candidate.csv

Citation

@article{ZHANG2023121364,
title = {ROUGE-SEM: Better evaluation of summarization using ROUGE combined with semantics},
journal = {Expert Systems with Applications},
pages = {121364},
year = {2023},
issn = {0957-4174},
doi = {https://doi.org/10.1016/j.eswa.2023.121364},
url = {https://www.sciencedirect.com/science/article/pii/S0957417423018663},
author = {Ming Zhang and Chengzhang Li and Meilin Wan and Xuejun Zhang and Qingwei Zhao},
keywords = {Automatic summarization evaluation, Semantic similarity, Lexical similarity, Contrastive learning, Back-translation},
abstract = {With the development of pre-trained language models and large-scale datasets, automatic text summarization has attracted much attention from the community of natural language processing, but the progress of automatic summarization evaluation has stagnated. Although there have been efforts to improve automatic summarization evaluation, ROUGE has remained one of the most popular metrics for nearly 20 years due to its competitive evaluation performance. However, ROUGE is not perfect, there are studies have shown that it is suffering from inaccurate evaluation of abstractive summarization and limited diversity of generated summaries, both caused by lexical bias. To avoid the bias of lexical similarity, more and more meaningful embedding-based metrics have been proposed to evaluate summaries by measuring semantic similarity. Due to the challenge of accurately measuring semantic similarity, none of them can fully replace ROUGE as the default automatic evaluation toolkit for text summarization. To address the aforementioned problems, we propose a compromise evaluation framework (ROUGE-SEM) for improving ROUGE with semantic information, which compensates for the lack of semantic awareness through a semantic similarity module. According to the differences in semantic similarity and lexical similarity, summaries are classified into four categories for the first time, including good-summary, pearl-summary, glass-summary, and bad-summary. In particular, the back-translation technique is adopted to rewrite pearl-summary and glass-summary that are inaccurately evaluated by ROUGE to alleviate lexical bias. Through this pipeline framework, summaries are first classified by candidate summary classifier, then rewritten by categorized summary rewriter, and finally scored by rewritten summary scorer, which are efficiently evaluated in a manner consistent with human behavior. 
When measured using Pearson, Spearman, and Kendall rank coefficients, our proposal achieves comparable or higher correlations with human judgments than several state-of-the-art automatic summarization evaluation metrics in dimensions of coherence, consistency, fluency, and relevance. This also suggests that improving ROUGE with semantics is a promising direction for automatic summarization evaluation.}
}

Get Involved

If you have any questions, please contact me at zhangming@hccl.ioa.ac.cn. For suggestions, requests, or bug reports, please create a GitHub issue. Don't hesitate to send us an e-mail or open an issue if something is broken.
