MT Quality Estimation Significance Test
Instructions for MT-QE-eval: Significance Test for Evaluation of Quality Estimation


These instructions describe how to test whether the increase in Pearson correlation of one QE system over a baseline system is statistically significant, as described in the following paper:

Yvette Graham. "Improving Evaluation of Machine Translation Quality Estimation", ACL 2015.


Requirements:

  1. R statistical software

     To install R from the command line:
       > sudo apt-get install r-base

  2. R's "psych" package

     - If your institution uses a proxy server, you need to tell R about it
       BEFORE installing any package:
         A) Open the R command line by typing "R".
         B) Type the following command into R, remembering to provide your
            actual credentials and proxy server details:
              > Sys.setenv(http_proxy="")
       If not, continue below.
     - Open the R command line (by typing "R") and enter the following:
         > install.packages("psych")
       You will be asked to choose a CRAN site. Once you have selected one,
       you might need to answer "y" to some questions. When "psych" has
       finished installing, type the following to quit R:
         > quit("no")
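
To confirm that the "psych" package is available before running the tests, a quick check along the following lines can be run in R. This is only a sketch; the variable name have_psych is illustrative and is not part of the distributed scripts.

```r
# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise.
have_psych <- requireNamespace("psych", quietly = TRUE)

if (have_psych) {
  message("psych is installed")
} else {
  message('psych is not installed; run install.packages("psych") first')
}
```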

How to run:

Example data is included in the files:

./task-1.2.csv # System predictions and human scores (HTER in the example data) for each sentence.
./metrics.12   # Names of the metrics you wish to carry out pairwise tests for.

Run pairwise significance tests as follows:

R --no-save < pearson-sig.R

This creates a file containing a matrix of p-values from pairwise tests for all QE systems. For each pair of QE systems A and B, the Williams test is carried out to test the significance of the increase in correlation with human scores of QE system A over that of QE system B.
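
To illustrate what a single pairwise test computes, the following is a minimal base-R sketch of the standard Williams test formula for two dependent correlations that share one variable (here, the human scores). It is not the code in pearson-sig.R: the function name williams_test and the simulated scores are assumptions made purely for illustration, and the real script may obtain the p-value differently (for example through the "psych" package).

```r
# Williams test for the difference between two dependent correlations.
#   r12: correlation of human scores with QE system A
#   r13: correlation of human scores with QE system B
#   r23: correlation between the predictions of systems A and B
#   n:   number of sentences
williams_test <- function(r12, r13, r23, n) {
  # Determinant of the 3x3 correlation matrix.
  k <- 1 - r12^2 - r13^2 - r23^2 + 2 * r12 * r13 * r23
  denom <- 2 * k * (n - 1) / (n - 3) + ((r12 + r13)^2 / 4) * (1 - r23)^3
  t_stat <- (r12 - r13) * sqrt((n - 1) * (1 + r23) / denom)
  # One-tailed p-value for "r12 is greater than r13", with n - 3 degrees of freedom.
  p <- pt(t_stat, df = n - 3, lower.tail = FALSE)
  list(t = t_stat, p = p)
}

# Simulated scores for illustration only (NOT taken from task-1.2.csv).
set.seed(1)
n <- 200
human <- rnorm(n)
sys_a <- human + rnorm(n, sd = 0.8)  # hypothetical system A, less noisy
sys_b <- human + rnorm(n, sd = 1.5)  # hypothetical system B, more noisy

res <- williams_test(cor(human, sys_a), cor(human, sys_b), cor(sys_a, sys_b), n)
print(res)
```

A small p-value suggests that system A's correlation with human scores is significantly higher than system B's. Because the test is one-tailed, swapping A and B gives the complementary p-value.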