PermTest

General information

We provide software for statistical significance testing. This was originally designed for a standard IR evaluation, where one or more method is represented by vectors of real-value performance scores. However, it can be used to compare any equal-length series (of performance measurements).

This utility consumes matrix input. Each row represents a single evaluation event. Each row element is an event-specific value of an effectiveness or efficiency metric such as classification accuracy, retrieval time, etc. In IR, we commonly use the following metrics: ERR, NDCG, or MAP. We provide a Python3 wrapper for this utility.

Our software employs permutation algorithms for unadjusted pair-wise significance testing and testing with adjustment for multiple comparisons. The advantage of permutation algorithms is that they make relatively mild assumptions about statistical nature of data. In particular, they do not assume observations are normal i.i.d. variables.

The code is released under the Apache License Version 2.0 http://www.apache.org/licenses/.

For technical/theoretical details see:

Leonid Boytsov, Anna Belova, Peter Westfall, 2013, Deciding on an Adjustment for Multiplicity in IR Experiments. In Proceedings of SIGIR 2013. [BibTex]

If you use our software, please, consider citing this paper.

Software description

EvalUtil:

The test program itself: permtest. It accepts a matrix of performance scores (ERR, MAP, etc). We provide a Python3 wrapper for this test program.
Each row of the matrix represent one retrieval method (called run in TREC terminology).
Column I represents performance scores for the I-th query.
In the case of binary classification, all values are 0s and 1s. The first row represents ground truth labels.
An R-script SignTest.R which carries out a sign test for the purpose of binary classification. The input format is the same as for the utility permtest (in the case of binary classification). However, SignTest.R can compare only two outputs/systems at a time, but it can handle multiple classes. To this end, it relies on the SignTest.

ConvScripts:

Scripts to convert TREC output file (to the matrix format).
Each script accepts a registry file, which lists names of the files, which contain an output of a TREC evalution utility, e.g., trec_eval.
Each such file should represent a single run.

A working example:

Compile the Eval util
Go to the directory SampleData
Run the shell script sample_run.sh
Read the comments inside the script

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
ConvScriptsTREC		ConvScriptsTREC
EvalUtil		EvalUtil
SampleData		SampleData
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ConvScriptsTREC

ConvScriptsTREC

EvalUtil

EvalUtil

SampleData

SampleData

README.md

README.md

Repository files navigation

PermTest

General information

Software description

About

Releases 1

Packages

Contributors 3

Languages

searchivarius/PermTest

Folders and files

Latest commit

History

Repository files navigation

PermTest

General information

Software description

About

Topics

Resources

Stars

Watchers

Forks

Languages