A Brief Introduction of the ROUGE Summary Evaluation Package
by Chin-Yew LIN
University of Southern California/Information Sciences Institute
(1) Correct the resampling routine, which ignored the last evaluation
item in the evaluation list; as a result, the average scores reported
by ROUGE were based only on the first N-1 evaluation items.
Thanks to Barry Schiffman at Columbia University for reporting this bug.
This bug only affects ROUGE-1.5.X. For pre-1.5 ROUGE, it only affects
the computation of confidence interval (CI) estimation, i.e. the CI was
estimated from only the first N-1 evaluation items, but it *does not* affect
the average scores.
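For readers unfamiliar with the technique, bootstrap resampling over all N evaluation items can be sketched as follows. This is an illustrative Python sketch, not the Perl implementation; the function name and the percentile indexing are our assumptions.

```python
import random

def bootstrap_ci(scores, n_samples=1000, cf=95, seed=0):
    """Illustrative bootstrap confidence interval. Note that the
    resampling loop draws from ALL N items; the bug described above
    effectively dropped the last item from this loop."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        # resample N items with replacement and record the sample mean
        sample = [rng.choice(scores) for _ in scores]
        means.append(sum(sample) / len(sample))
    means.sort()
    lower = means[int((100 - cf) / 200 * n_samples)]
    upper = means[min(n_samples - 1, int((100 + cf) / 200 * n_samples))]
    return lower, upper
```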
(2) Correct stemming on multi-token BE heads and modifiers.
Previously, only single-token heads and modifiers were assumed.
(3) Change the read_text and read_text_LCS functions to read the exact
number of words or bytes required by users. Previous versions carried
out whitespace compression and other string cleanup actions before
enforcing the length limit.
(4) Add the capability to score summaries in Basic Element (BE)
format using option "-3", which stands for BE triple. There are 6
different modes in BE scoring. We suggest using *"-3 HMR"* on BEs
extracted from Minipar parse trees, based on our correlation analysis
of BE-based scoring vs. human judgments on DUC 2002 & 2003 automatic
summaries.
(5) ROUGE now generates three scores (recall, precision, and F-measure)
for each evaluation. Previously, only one score (recall) was
generated. Precision and F-measure scores are useful when the target
summary length is not enforced; recall scores alone were sufficient
when the DUC guidelines dictated a limit on summary length. For
comparison with previous DUC results, please use the recall scores.
The default alpha weighting for computing the F-measure is 0.5. Users
can specify an alpha weighting that fits their application scenario
using option "-p alpha-weight", where *alpha-weight* is a number between
0 and 1 inclusive.
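The alpha-weighted F-measure is the weighted harmonic mean of precision and recall. A minimal Python sketch (the function name is ours, not ROUGE's):

```python
def rouge_f(recall, precision, alpha=0.5):
    """Alpha-weighted F-measure: 1/F = alpha/P + (1-alpha)/R.
    alpha -> 1 favors precision; alpha -> 0 favors recall."""
    if recall == 0.0 or precision == 0.0:
        return 0.0
    return (precision * recall) / ((1 - alpha) * precision + alpha * recall)
```

With alpha=0.5 this reduces to the familiar balanced F1 score.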
(6) Pre-1.5 versions of ROUGE used the model average to compute the
overall ROUGE scores when there are multiple references. Starting from
v1.5+, ROUGE provides an option to use the best matching score among
the references as the final score. The model average option is specified
using "-f A" (for Average) and the best model option is specified
using "-f B" (for the Best). The "-f A" option is better when using
ROUGE in summarization evaluations, while the "-f B" option is better
when using ROUGE in machine translation (MT) and definition
question-answering (DQA) evaluations: in a typical MT or DQA
evaluation scenario, matching a single reference translation or
definition answer is sufficient, whereas in summarization it is very
likely that multiple different but equally good summaries exist.
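The difference between the two formulas can be made concrete with a small Python sketch (the function name is ours; this is not ROUGE's internal code):

```python
def combine_reference_scores(scores, formula="A"):
    """Combine per-reference scores into one final score.
    'A': model average over all references;
    'B': best (maximum) matching reference score."""
    if formula == "B":
        return max(scores)
    return sum(scores) / len(scores)
```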
(7) ROUGE v1.5+ also provides the option to specify whether model unit
level averaging will be used (macro-average, i.e. treating every model
unit equally) or token level averaging (micro-average, i.e. treating
every token equally). In summarization evaluation, we suggest using
model unit level averaging, and this is the default setting in ROUGE.
To specify another averaging mode, use "-t 0" (default) for model unit
level averaging, "-t 1" for token level averaging, and "-t 2" to output
raw token counts in models, peers, and matches.
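The macro/micro distinction can be sketched in Python as follows; the per-unit `(matched, total)` count representation is our simplification, not ROUGE's data structure:

```python
def average_rouge(unit_counts, mode=0):
    """unit_counts: list of (matched, total) token counts per model unit.
    mode 0: macro-average (every unit weighted equally, the default);
    mode 1: micro-average (every token weighted equally);
    mode 2: raw pooled counts for later score computation."""
    matched = sum(m for m, _ in unit_counts)
    total = sum(t for _, t in unit_counts)
    if mode == 0:
        return sum(m / t for m, t in unit_counts) / len(unit_counts)
    if mode == 1:
        return matched / total
    return matched, total
```

Note that short units carry the same weight as long ones under macro-averaging, which is why the two modes can disagree.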
(8) ROUGE now offers the option to use a file list as the configuration
file. The input format of the summary files is specified using the
"-z INPUT-FORMAT" option, where INPUT-FORMAT can be SEE, SPL, ISI, or
SIMPLE. When "-z" is specified, ROUGE assumes that the ROUGE
evaluation configuration file is a file list with one evaluation
instance per line in the following format:
peer_path1 model_path1 model_path2 ... model_pathN
peer_path2 model_path1 model_path2 ... model_pathN
...
peer_pathM model_path1 model_path2 ... model_pathN
The first file path is the peer summary (system summary), followed by
a list of model summaries (reference summaries) separated by white
space (spaces or tabs).
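Parsing one line of such a file list is straightforward; this is a hedged Python sketch (the function name and the example paths are ours):

```python
def parse_filelist_line(line):
    """Split one evaluation instance: the first path is the peer
    (system) summary, the rest are model (reference) summaries.
    str.split() handles both spaces and tabs."""
    peer, *models = line.split()
    return peer, models
```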
(9) When stemming is applied, a new WordNet exception database based
on WordNet 2.0 is used. The new database is included in the data
directory.
(1) Use the "-h" option to see a list of options.
[-a (evaluate all systems)]
[-d (print per evaluation scores)]
[-b n-bytes|-l n-words]
[-m (use Porter stemmer)]
[-s (remove stopwords)]
[-r number-of-samples (for resampling)]
[-2 max-gap-length (if < 0 then no gap length limit)]
[-u (include unigram in skip-bigram; default: no)]
[-U (same as -u but also compute regular skip-bigram)]
[-w weight (weighting factor for WLCS)]
[-x (do not calculate ROUGE-L)]
[-f A|B (scoring formula)]
[-p alpha (0 <= alpha <=1)]
[-t 0|1|2 (count by token instead of sentence)]
ROUGE-eval-config-file: Specify the evaluation setup. Three sample configuration files come
with the ROUGE evaluation package: ROUGE-test.xml, verify.xml, and verify-spl.xml.
systemID: Specify which system in the ROUGE-eval-config-file to evaluate.
If the '-a' option is used, all systems are evaluated and users do not need to
provide this argument.
When running ROUGE without supplying any options (except -a), the following defaults are used:
(1) ROUGE-L is computed;
(2) 95% confidence interval;
(3) No stemming;
(4) Stopwords are included in the calculations;
(5) ROUGE looks for its data directory first through the ROUGE_EVAL_HOME environment variable. If
it is not set, the current directory is used.
(6) Use model average scoring formula.
(7) Equal importance is assigned to ROUGE recall and precision in computing the ROUGE F-measure, i.e. alpha=0.5.
(8) Compute average ROUGE by averaging sentence (unit) ROUGE scores.
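For reference, the default ROUGE-L metric bases recall and precision on the length of the longest common subsequence (LCS) between peer and model. Below is a simplified single-reference Python sketch; the sentence-level union LCS used by ROUGE on multi-sentence summaries is omitted, and the function names are ours:

```python
def lcs_len(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(peer_tokens, model_tokens):
    """Return (recall, precision) of simplified ROUGE-L for a single
    reference: LCS length normalized by model and peer lengths."""
    lcs = lcs_len(peer_tokens, model_tokens)
    return lcs / len(model_tokens), lcs / len(peer_tokens)
```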
-2: Compute skip-bigram (ROUGE-S) co-occurrence; also specify the maximum gap length between the two words of a skip-bigram.
-u: Compute skip-bigram as with -2 but also include unigrams, i.e. treat each unigram as a "start-sentence-symbol unigram"; -2 has to be specified.
-3: Compute BE score.
H -> head only scoring (does not apply to Minipar-based BEs).
HM -> head and modifier pair scoring.
HMR -> head, modifier and relation triple scoring.
HM1 -> H and HM scoring (same as HM for Minipar-based BEs).
HMR1 -> HM and HMR scoring (same as HMR for Minipar-based BEs).
HMR2 -> H, HM and HMR scoring (same as HMR for Minipar-based BEs).
-a: Evaluate all systems specified in the ROUGE-eval-config-file.
-c: Specify CF% (0 <= CF <= 100) confidence interval to compute. The default is 95% (i.e. CF=95).
-d: Print per evaluation average score for each system.
-e: Specify ROUGE_EVAL_HOME directory where the ROUGE data files can be found.
This will overwrite the ROUGE_EVAL_HOME specified in the environment variable.
-f: Select scoring formula: 'A' => model average; 'B' => best model
-h: Print usage information.
-b: Only use the first n bytes in the system/peer summary for the evaluation.
-l: Only use the first n words in the system/peer summary for the evaluation.
-m: Stem both model and system summaries using Porter stemmer before computing various statistics.
-n: Compute ROUGE-N up to max-ngram length.
-p: Relative importance of recall and precision ROUGE scores. Alpha -> 1 favors precision, Alpha -> 0 favors recall.
-s: Remove stopwords in model and system summaries before computing various statistics.
-t: Compute average ROUGE by averaging over the whole test corpus instead of sentences (units).
0: use sentence as counting unit; 1: use token as counting unit; 2: same as 1 but output raw counts
instead of precision, recall, and F-measure scores. 2 is useful when the computation of the final
precision, recall, and F-measure scores will be conducted later.
-r: Specify the number of sampling points in bootstrap resampling (default is 1000).
A smaller number will speed up the evaluation but give a less reliable confidence interval.
-w: Compute ROUGE-W that gives consecutive matches of length L in an LCS a weight of 'L^weight' instead of just 'L' as in LCS.
Typically this is set to 1.2 or other number greater than 1.
-v: Print debugging information for diagnostic purposes.
-x: Do not calculate ROUGE-L.
-z: ROUGE-eval-config-file is a list of peer-model pair per line in the specified format (SEE|SPL|ISI|SIMPLE).
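To make the -2/-u options above concrete, here is a hedged Python sketch of skip-bigram extraction. Treating the gap limit as the number of words allowed between the pair is our reading of max-gap-length, and the "<s>" marker mirrors the "start-sentence-symbol unigram" trick; neither is taken from ROUGE's source.

```python
from itertools import combinations

def skip_bigrams(tokens, max_gap=-1, include_unigrams=False):
    """All ordered in-sentence word pairs (ROUGE-S style). A negative
    max_gap means no gap limit; include_unigrams prepends a
    start-of-sentence symbol so every unigram also forms a pair
    (ROUGE-SU style)."""
    if include_unigrams:
        tokens = ["<s>"] + list(tokens)
    return [(tokens[i], tokens[j])
            for i, j in combinations(range(len(tokens)), 2)
            if max_gap < 0 or j - i - 1 <= max_gap]
```

Counting matches of such pairs between peer and model summaries, then normalizing, yields skip-bigram recall and precision.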
(2) Please read RELEASE-NOTE.txt for information about updates from previous versions.
(3) The following files, which come with this package in the "sample-output"
directory, are the expected output of the evaluation runs below:
(a) use "data" as ROUGE_EVAL_HOME, compute 95% confidence interval,
compute ROUGE-L (longest common subsequence, default),
compute ROUGE-S* (skip bigram) without gap length limit,
compute also ROUGE-SU* (skip bigram with unigram),
run resampling 1000 times,
compute ROUGE-N (N=1 to 4),
compute ROUGE-W (weight = 1.2), and
compute these ROUGE scores for all systems:
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -a ROUGE-test.xml
(b) Same as (a) but apply Porter's stemmer on the input:
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -m -a ROUGE-test.xml
(c) Same as (b) but apply also a stopword list on the input:
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -m -s -a ROUGE-test.xml
(d) Same as (a) but apply a summary length limit of 10 words:
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -l 10 -a ROUGE-test.xml
(e) Same as (d) but apply Porter's stemmer on the input:
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -l 10 -m -a ROUGE-test.xml
(f) Same as (e) but apply also a stopword list on the input:
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -l 10 -m -s -a ROUGE-test.xml
(g) Same as (a) but apply a summary length limit of 75 bytes:
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -b 75 -a ROUGE-test.xml
(h) Same as (g) but apply Porter's stemmer on the input:
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -b 75 -m -a ROUGE-test.xml
(i) Same as (h) but apply also a stopword list on the input:
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -b 75 -m -s -a ROUGE-test.xml
Sample DUC 2002 data (1 system and 1 model only per DUC 2002 topic), its BE and
ROUGE evaluation configuration files in XML and file list formats,
and the expected output are also included for your reference.
(a) Use DUC2002-BE-F.in.26.lst, a BE file list, as the ROUGE evaluation configuration file:
command> ROUGE-1.5.4.pl -3 HM -z SIMPLE DUC2002-BE-F.in.26.lst 26
(b) Use DUC2002-BE-F.in.26.simple.xml as ROUGE XML evaluation configuration file:
command> ROUGE-1.5.4.pl -3 HM DUC2002-BE-F.in.26.simple.xml 26
(c) Use DUC2002-BE-L.in.26.lst, a BE file list, as the ROUGE evaluation configuration file:
command> ROUGE-1.5.4.pl -3 HM -z SIMPLE DUC2002-BE-L.in.26.lst 26
(d) Use DUC2002-BE-L.in.26.simple.xml as ROUGE XML evaluation configuration file:
command> ROUGE-1.5.4.pl -3 HM DUC2002-BE-L.in.26.simple.xml 26
(e) Use DUC2002-ROUGE.in.26.spl.lst, a sentence-per-line (SPL) summary file list, as the ROUGE evaluation configuration file:
command> ROUGE-1.5.4.pl -n 4 -z SPL DUC2002-ROUGE.in.26.spl.lst 26
(f) Use DUC2002-ROUGE.in.26.spl.xml as ROUGE XML evaluation configuration file:
command> ROUGE-1.5.4.pl -n 4 DUC2002-ROUGE.in.26.spl.xml 26
(1) You need to have DB_File installed. If the Perl script complains
about database version incompatibility, you can create a new
WordNet-2.0.exc.db by running the buildExceptionDB.pl script in
the "data/WordNet-2.0-Exceptions" subdirectory.
(2) You also need to install XML::DOM from http://www.cpan.org.
Direct link: http://www.cpan.org/modules/by-module/XML/XML-DOM-1.43.tar.gz.
You might need to install extra Perl modules that are required by
XML::DOM.
(3) Set up an environment variable ROUGE_EVAL_HOME that points to the
"data" subdirectory. For example, if your "data" subdirectory is
located at "/usr/local/ROUGE-1.5.4/data", then you can set up
ROUGE_EVAL_HOME as follows:
(a) Using csh or tcsh:
$command_prompt>setenv ROUGE_EVAL_HOME /usr/local/ROUGE-1.5.4/data
(b) Using bash:
$command_prompt>export ROUGE_EVAL_HOME=/usr/local/ROUGE-1.5.4/data
(4) Running ROUGE-1.5.4.pl without supplying any arguments will give
you a description of how to use the ROUGE script.
(5) Please look into the included ROUGE-test.xml, verify.xml, and
verify-spl.xml evaluation configuration files when preparing your
own evaluation setup. A more detailed description will be provided
later. ROUGE-test.xml and verify.xml specify that the input from
systems and references is in SEE (Summary Evaluation Environment)
format (http://www.isi.edu/~cyl/SEE), while verify-spl.xml specifies
that inputs are in sentence-per-line format.
(1) Please look into the "docs" directory for more information about
ROUGE.
(2) ROUGE-Note-v1.4.2.pdf explains how ROUGE works. It was published in
Proceedings of the Workshop on Text Summarization Branches Out
(WAS 2004), Barcelona, Spain, 2004.
(3) NAACL2003.pdf presents the initial idea of applying n-gram
co-occurrence statistics to the automatic evaluation of
summarization. It was published in Proceedings of the 2003 Language
Technology Conference (HLT-NAACL 2003), Edmonton, Canada, 2003.
(4) NTCIR2004.pdf discusses the effect of sample size on the
reliability of automatic evaluation results using data in the past
Document Understanding Conference (DUC) as examples. It was
published in Proceedings of the 4th NTCIR Meeting, Tokyo, Japan, 2004.
(5) ACL2004.pdf shows how ROUGE can be applied to the automatic evaluation
of machine translation. It was published in Proceedings of the 42nd
Annual Meeting of the Association for Computational Linguistics
(ACL 2004), Barcelona, Spain, 2004.
(6) COLING2004.pdf proposes a new meta-evaluation framework, ORANGE, for
automatic evaluation of automatic evaluation methods. We showed
that ROUGE-S and ROUGE-L were significantly better than BLEU,
NIST, WER, and PER automatic MT evaluation methods under the
ORANGE framework. It was published in Proceedings of the 20th
International Conference on Computational Linguistics (COLING 2004),
Geneva, Switzerland, 2004.
(7) For information about BE, please go to http://www.isi.edu/~cyl/BE.
Thanks for using the ROUGE evaluation package. If you have any
questions or comments, please send them to firstname.lastname@example.org. I will do my
best to answer your questions.