InstruSum

This is a repository for our paper "Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization".

Quick Links

  • Benchmark Dataset
  • Prompts for LLM-based Evaluation Methods
  • Citation

Benchmark Dataset

InstruSum can be downloaded from the Hugging Face Hub under Salesforce/InstruSum using the Hugging Face Datasets library. We also provide a notebook, demo.ipynb, for exploring the dataset and performing basic analysis.

InstruSum contains four subsets: dataset, human_eval, llm_eval, and system_outputs.
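Each subset can be loaded directly with the datasets library. Below is a minimal loading sketch, assuming each subset name maps to a configuration of Salesforce/InstruSum on the Hugging Face Hub:

```python
# Minimal loading sketch. Assumes each subset is exposed as a configuration
# of Salesforce/InstruSum on the Hugging Face Hub.
from datasets import load_dataset

subsets = ["dataset", "human_eval", "llm_eval", "system_outputs"]
data = {name: load_dataset("Salesforce/InstruSum", name) for name in subsets}

for name, dataset_dict in data.items():
    print(name, dataset_dict)
```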

dataset

The dataset subset contains 100 examples written by the paper's authors. Each example consists of an article, a summary instruction, an LLM-generated summary, and a hybrid LLM-human summary.
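To inspect the schema without hard-coding any column names, you can print the fields of a single example. A short sketch follows; the "dataset" configuration name is an assumption carried over from the subset name:

```python
from datasets import load_dataset

# Load the core subset (the "dataset" configuration is assumed) and inspect
# the first example without hard-coding column or split names.
core = load_dataset("Salesforce/InstruSum", "dataset")
split_name = list(core.keys())[0]
example = core[split_name][0]

for key, value in example.items():
    preview = str(value).replace("\n", " ")[:80]
    print(f"{key}: {preview}")
```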

human_eval

This subset contains human evaluation results for the 100 examples in the dataset subset. Five systems were evaluated: OpenAI's text-davinci-002, text-davinci-003, gpt-3.5-turbo-0301, and gpt-4-0314, along with the hybrid LLM-human summary. Each summary was judged on 4 aspects (an aggregation sketch follows the list):

  • Overall Quality: This rating assesses the overall quality of the summary in relation to the summary requirement.
  • Missing Information: Does the summary omit any crucial information from the article concerning the summary requirement?
  • Irrelevant Information: Does the summary include any information that is not relevant to the summary requirement?
  • Factual Consistency: Is the summary consistent with the facts presented in the article, without contradicting or misrepresenting any information?
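
As one illustration of how these annotations could be used, the sketch below averages the overall-quality ratings per system. The column names system and overall are hypothetical placeholders rather than the actual schema; adapt them to the fields exposed by the human_eval subset.

```python
from collections import defaultdict
from datasets import load_dataset

# Hypothetical aggregation sketch: the column names "system" and "overall"
# are placeholders; check the actual human_eval schema before running.
human_eval = load_dataset("Salesforce/InstruSum", "human_eval")
rows = human_eval[list(human_eval.keys())[0]]

totals, counts = defaultdict(float), defaultdict(int)
for row in rows:
    totals[row["system"]] += row["overall"]
    counts[row["system"]] += 1

for system in sorted(totals):
    print(f"{system}: mean overall quality = {totals[system] / counts[system]:.2f}")
```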

llm_eval

This subset contains LLM-based automatic evaluation results for the 100 examples in the dataset subset.

We used 11 LLMs and 4 evaluation protocols:

  • LLMRank: listwise ranking
  • LLMCompare: pairwise comparison
  • LLMEval: pointwise scoring by text completion
  • LLMScore: pointwise scoring by model-predicted log-likelihood (a minimal scoring sketch follows this list)
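
For concreteness, here is one way to implement an LLMScore-style scorer with an open model: it computes the length-normalized log-likelihood of a candidate summary conditioned on the article and instruction. The model choice, prompt template, and normalization below are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative LLMScore-style scorer: mean log-likelihood of the summary tokens
# given a simple article+instruction prompt. The prompt template and model are
# assumptions, not the paper's exact setup.
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def llm_score(article: str, instruction: str, summary: str) -> float:
    prompt = f"Article:\n{article}\n\nInstruction: {instruction}\n\nSummary:\n"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prompt + summary, return_tensors="pt").input_ids.to(model.device)

    with torch.no_grad():
        logits = model(full_ids).logits

    # Log-probability of each token given its prefix (shift targets by one).
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Keep only the summary tokens (approximation: assumes the prompt tokenization
    # is a prefix of the prompt+summary tokenization) and average over length.
    summary_log_probs = token_log_probs[:, prompt_ids.shape[1] - 1:]
    return summary_log_probs.mean().item()
```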

In total, we evaluated 40 LLM-based evaluation methods (combinations of LLM and protocol, marked below) over three quality aspects:

| LLM | LLMRank | LLMCompare | LLMEval | LLMScore |
| --- | --- | --- | --- | --- |
| text-davinci-002 | ✓ | ✓ | ✓ | ✓ |
| text-davinci-003 | ✓ | ✓ | ✓ | ✓ |
| gpt-3.5-turbo-0301 | ✓ | ✓ | ✓ | |
| gpt-3.5-turbo-0613 | ✓ | ✓ | ✓ | |
| gpt-3.5-turbo-instruct | ✓ | ✓ | ✓ | ✓ |
| gpt-4-0314 | ✓ | ✓ | ✓ | |
| gpt-4-1106-preview | ✓ | ✓ | ✓ | |
| llama-2-7b-chat | ✓ | ✓ | ✓ | ✓ |
| llama-2-13b-chat | ✓ | ✓ | ✓ | ✓ |
| llama-2-70b-chat | ✓ | ✓ | ✓ | ✓ |
| mistral-instruct | ✓ | ✓ | ✓ | ✓ |

system_outputs

This subset contains the outputs of the 11 LLMs listed above (the same models as in the llm_eval subset) for the 100 examples in the dataset subset.
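
Taken together, the llm_eval, human_eval, and system_outputs subsets support meta-evaluation, i.e., measuring how well an LLM-based evaluator agrees with human judgments. A hedged sketch using placeholder column names (example_id, system, overall, score) that may differ from the actual schema:

```python
from datasets import load_dataset
from scipy.stats import kendalltau

# Meta-evaluation sketch: correlate one evaluator's scores with human
# overall-quality ratings. The column names "example_id", "system",
# "overall", and "score" are placeholders; map them to the real schema.
human_eval = load_dataset("Salesforce/InstruSum", "human_eval")
llm_eval = load_dataset("Salesforce/InstruSum", "llm_eval")

human_rows = human_eval[list(human_eval.keys())[0]]
llm_rows = llm_eval[list(llm_eval.keys())[0]]

human_scores = {(r["example_id"], r["system"]): r["overall"] for r in human_rows}
pairs = [
    (human_scores[(r["example_id"], r["system"])], r["score"])
    for r in llm_rows
    if (r["example_id"], r["system"]) in human_scores
]

tau, _ = kendalltau([h for h, _ in pairs], [s for _, s in pairs])
print(f"Kendall's tau between human and LLM scores: {tau:.3f}")
```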

Prompts for LLM-based Evaluation Methods

We provide the prompts for the 4 LLM-based evaluation protocols across 3 quality aspects ("overall", "missing", "irrelevant") in the prompts folder.

Citation

Please cite our paper if you use InstruSum in your work:

@article{liu2023benchmarking,
  title={Benchmarking generation and evaluation capabilities of large language models for instruction controllable summarization},
  author={Liu, Yixin and Fabbri, Alexander R and Chen, Jiawen and Zhao, Yilun and Han, Simeng and Joty, Shafiq and Liu, Pengfei and Radev, Dragomir and Wu, Chien-Sheng and Cohan, Arman},
  journal={arXiv preprint arXiv:2311.09184},
  year={2023}
}
