We provide software for statistical significance testing. It was originally designed for standard IR evaluation, where one or more methods are represented by vectors of real-valued performance scores. However, it can be used to compare any equal-length series of performance measurements.
This utility consumes matrix input. Each row represents a single evaluation event. Each row element is an event-specific value of an effectiveness or efficiency metric, such as classification accuracy or retrieval time. In IR, commonly used metrics include ERR, NDCG, and MAP. We provide a Python3 wrapper for this utility.
Our software employs permutation algorithms both for unadjusted pairwise significance testing and for testing with adjustment for multiple comparisons. The advantage of permutation algorithms is that they make relatively mild assumptions about the statistical nature of the data. In particular, they do not assume that observations are normally distributed i.i.d. variables.
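To make the idea concrete, here is a minimal Python sketch of a paired permutation test and of a single-step max-statistic adjustment. It is not the permtest implementation: the function names, the use of the absolute mean difference as the test statistic, and the default number of permutations are all assumptions made for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def paired_permutation_pvalue(a, b, n_perm=10000, seed=0):
    """Two-sided p-value for the mean difference of two paired score vectors."""
    if len(a) != len(b):
        raise ValueError("score vectors must have equal length")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis the two method labels are exchangeable
        # within each query, so every difference keeps or flips its sign.
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(permuted)) >= observed - 1e-12:
            hits += 1
    # Count the observed arrangement itself so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)


def max_stat_adjusted_pvalues(baseline, others, n_perm=10000, seed=0):
    """Adjusted p-values for comparing each row of `others` to `baseline`."""
    if not others:
        return []
    rows = [list(baseline)] + [list(r) for r in others]
    n = len(baseline)
    if any(len(r) != n for r in rows):
        raise ValueError("all score vectors must have equal length")
    rng = random.Random(seed)
    observed = [abs(mean([r[q] - baseline[q] for q in range(n)])) for r in others]
    exceed = [0] * len(others)
    for _ in range(n_perm):
        totals = [0.0] * len(rows)
        for q in range(n):
            # Shuffle method labels within one query (complete null hypothesis).
            column = [r[q] for r in rows]
            rng.shuffle(column)
            for i, value in enumerate(column):
                totals[i] += value
        # Largest absolute mean difference from the (permuted) baseline row.
        max_stat = max(abs(totals[i] - totals[0]) / n for i in range(1, len(rows)))
        for j, obs in enumerate(observed):
            if max_stat >= obs - 1e-12:
                exceed[j] += 1
    return [(e + 1) / (n_perm + 1) for e in exceed]
```

For example, `paired_permutation_pvalue(scores_a, scores_b)` compares two runs query by query, while `max_stat_adjusted_pvalues(baseline_scores, [scores_a, scores_b])` compares several runs to one baseline. The second function judges every observed difference against the largest difference seen in each permutation, which is what makes its p-values account for multiple comparisons.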
The code is released under the Apache License Version 2.0 http://www.apache.org/licenses/.
For technical/theoretical details see:
Leonid Boytsov, Anna Belova, Peter Westfall, 2013, Deciding on an Adjustment for Multiplicity in IR Experiments. In Proceedings of SIGIR 2013.
If you use our software, please consider citing this paper.
EvalUtil:
- The test program itself: permtest. It accepts a matrix of performance scores (ERR, MAP, etc.). We provide a Python3 wrapper for this test program.
- Each row of the matrix represents one retrieval method (called a run in TREC terminology).
- Column I represents performance scores for the I-th query.
- In the case of binary classification, all values are 0s and 1s. The first row represents ground truth labels.
- An R script, SignTest.R, which carries out a sign test for binary classification. The input format is the same as for permtest (in the case of binary classification). Unlike permtest, SignTest.R can compare only two outputs/systems at a time, but it can handle multiple classes. To this end, it relies on the SignTest.
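The classification layout and the sign test described above can be sketched in Python as follows. This is an illustrative exact two-sided sign test, not a port of SignTest.R; the function name and the in-memory list-of-rows representation are assumptions. Labels are compared only for equality, so the same sketch also works with more than two classes.

```python
from math import comb


def sign_test_pvalue(truth, pred_a, pred_b):
    """Exact two-sided sign test comparing two classifiers on the same items."""
    if not (len(truth) == len(pred_a) == len(pred_b)):
        raise ValueError("all rows must have equal length")
    # Only items where exactly one system is correct are informative;
    # items where both are right or both are wrong count as ties and are dropped.
    a_wins = sum(1 for t, a, b in zip(truth, pred_a, pred_b) if a == t and b != t)
    b_wins = sum(1 for t, a, b in zip(truth, pred_a, pred_b) if b == t and a != t)
    n = a_wins + b_wins
    if n == 0:
        return 1.0
    k = min(a_wins, b_wins)
    # Under the null hypothesis the number of wins is Binomial(n, 0.5).
    # Double the smaller tail and cap the result at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Matrix layout from the description: the first row holds ground-truth labels,
# each following row holds the predictions of one system.
matrix = [
    [1, 0, 1, 1, 0, 1, 0, 0],  # ground truth
    [1, 0, 1, 1, 0, 1, 0, 0],  # system A
    [0, 1, 0, 0, 1, 0, 0, 0],  # system B
]
p_value = sign_test_pvalue(matrix[0], matrix[1], matrix[2])
```

Because the test looks only at which system was right on each disagreement, it needs no assumption about the size of the differences.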
ConvScripts:
- Scripts that convert TREC output files to the matrix format.
- Each script accepts a registry file that lists the names of files containing the output of a TREC evaluation utility, e.g., trec_eval.
- Each such file should represent a single run.
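The conversion step can be sketched in Python as follows. This is not one of the ConvScripts: the function names are hypothetical, the registry file is assumed to list one path per line, and the input is assumed to be per-query `trec_eval -q` output, where each line has three whitespace-separated fields (measure, query ID, value) and the query ID `all` marks the summary line.

```python
def read_registry(path):
    """Return the run file names listed in a registry file (assumed: one per line)."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def parse_trec_eval(text, metric):
    """Return {query_id: score} for one metric from per-query trec_eval output."""
    scores = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        name, query_id, value = fields
        # The "all" line is an aggregate over queries, not a per-query score.
        if name == metric and query_id != "all":
            scores[query_id] = float(value)
    return scores


def build_matrix(run_texts, metric):
    """Build a matrix with one row per run and one column per query."""
    per_run = [parse_trec_eval(text, metric) for text in run_texts]
    query_ids = sorted(per_run[0])
    for scores in per_run[1:]:
        # The test needs equal-length rows, so every run must cover the same queries.
        if sorted(scores) != query_ids:
            raise ValueError("runs do not cover the same set of queries")
    return query_ids, [[scores[q] for q in query_ids] for scores in per_run]


run_a = "map\t301\t0.2500\nmap\t302\t0.5000\nmap\tall\t0.3750\nP_10\t301\t0.1000\n"
run_b = "map\t301\t0.1000\nmap\t302\t0.9000\nmap\tall\t0.5000\n"
query_ids, matrix = build_matrix([run_a, run_b], "map")
```

Here `matrix[i][j]` is the score of the i-th run on the j-th query in `query_ids`. Raising an error on mismatched query sets is a deliberate choice in this sketch, because silently dropping queries would change which observations are being compared.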
A working example:
- Compile the utility in EvalUtil
- Go to the directory SampleData
- Run the shell script sample_run.sh
- Read the comments inside the script