# ir-measures

This is an interactive demonstration of the [`ir-measures`](https://ir-measur.es/) tool.

Let's start by installing the package via `pip`:

In [1]:
!pip install -q git+https://github.com/terrierteam/ir_measures


We'll now grab all the data we need. Let's use data from Round 1 of the TREC COVID task.

In [2]:
!wget https://ir.nist.gov/covidSubmit/data/qrels-rnd1.txt
!wget https://ir.nist.gov/covidSubmit/archive/round1/sab20.1.meta.docs
!wget https://ir.nist.gov/covidSubmit/archive/round1/run2 -O GUIR_s2_run2


## Command Line Interface

The official evaluation measures for the task were P@5, nDCG@10, AP, and Bpref. Let's also compute the binary-judgment measures (P, AP, and Bpref) with a minimum relevance threshold of 2, to see how the systems do on highly relevant documents. We'll also check the judgment rate of the top 10 documents to ensure that they are sufficiently labeled.
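To make the thresholded measures concrete, here is a minimal pure-Python sketch of what precision with a relevance threshold and the judgment rate compute for a single query. It is only an illustration on made-up toy data, not the code that `ir_measures` runs; the function names `precision_at_k` and `judged_at_k` and the toy qrels are assumptions.

```python
def precision_at_k(ranking, qrels, k, min_rel=1):
    """Fraction of the top-k documents whose relevance label is >= min_rel."""
    top = ranking[:k]
    hits = sum(1 for doc_id in top if qrels.get(doc_id, 0) >= min_rel)
    return hits / k


def judged_at_k(ranking, qrels, k):
    """Fraction of the top-k documents that have any relevance judgment."""
    top = ranking[:k]
    return sum(1 for doc_id in top if doc_id in qrels) / k


# Toy data (hypothetical): 0 = not relevant, 1 = partially, 2 = highly relevant.
toy_qrels = {"d1": 2, "d2": 1, "d3": 0, "d4": 2}
toy_ranking = ["d1", "d2", "d9", "d3", "d4"]  # "d9" was never judged

p5 = precision_at_k(toy_ranking, toy_qrels, k=5)
p5_rel2 = precision_at_k(toy_ranking, toy_qrels, k=5, min_rel=2)
judged5 = judged_at_k(toy_ranking, toy_qrels, k=5)
print(p5, p5_rel2, judged5)
```

Raising the threshold can only lower precision, since fewer documents count as relevant. An unjudged document such as `d9` is treated as non-relevant by the precision function and lowers the judgment rate.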

We can pass these measures to the `ir_measures` command line tool using their natural names:

In [3]:
!ir_measures qrels-rnd1.txt sab20.1.meta.docs 'P@5 P(rel=2)@5 nDCG@10 AP AP(rel=2) Bpref Bpref(rel=2) Judged@10'

P@5	0.7800
P(rel=2)@5	0.4867
nDCG@10	0.6080
AP	0.3128
AP(rel=2)	0.2513
Bpref	0.4832
Bpref(rel=2)	0.3193
Judged@10	0.9667


In [4]:
!ir_measures qrels-rnd1.txt GUIR_s2_run2 'P@5 P(rel=2)@5 nDCG@10 AP AP(rel=2) Bpref Bpref(rel=2) Judged@10'

P@5	0.6867
P(rel=2)@5	0.5667
nDCG@10	0.6032
AP	0.2601
AP(rel=2)	0.2744
Bpref	0.4177
Bpref(rel=2)	0.3748
Judged@10	0.8033


In the above examples, ir-measures automatically runs [trec_eval](https://github.com/usnistgov/trec_eval) twice: once for the measures that do not use a custom relevance threshold and once for those that have `rel=2`. It also runs the judgment rate script from [OpenNIR](https://opennir.net).
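The split into two passes can be pictured with a small sketch that groups the measure names by their relevance threshold. This is not the tool's actual dispatch logic: the helper `group_by_rel` is hypothetical, a missing `rel` is assumed to mean a threshold of 1, and the sketch ignores that `Judged@10` is handled by a separate script.

```python
import re
from collections import defaultdict

# Matches an optional "rel=N" parameter inside a measure name such as "AP(rel=2)".
REL_PARAM = re.compile(r"rel=(\d+)")


def group_by_rel(measure_spec):
    """Group whitespace-separated measure names by relevance threshold.

    Assumption: a measure without an explicit rel=N uses a threshold of 1.
    """
    groups = defaultdict(list)
    for name in measure_spec.split():
        match = REL_PARAM.search(name)
        rel = int(match.group(1)) if match else 1
        groups[rel].append(name)
    return dict(groups)


spec = "P@5 P(rel=2)@5 nDCG@10 AP AP(rel=2) Bpref Bpref(rel=2) Judged@10"
groups = group_by_rel(spec)
for rel, names in sorted(groups.items()):
    print(rel, names)
```

Each group could then be evaluated in a single pass over the run, which is why two thresholds lead to two passes.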

If you have [ir_datasets](https://ir-datasets.com/) installed, you can specify a dataset identifier in place of the qrels file. This takes care of automatically downloading the necessary qrels for you.

In [5]:
!pip install -q ir_datasets


In [6]:
!ir_measures cord19/trec-covid/round1 sab20.1.meta.docs 'P@5 P(rel=2)@5 nDCG@10 AP AP(rel=2) Bpref Bpref(rel=2) Judged@10'

[INFO] [starting] https://ir.nist.gov/covidSubmit/data/qrels-rnd1.txt
[INFO] [finished] https://ir.nist.gov/covidSubmit/data/qrels-rnd1.txt: [00:00] [150kB] [225kB/s]
P@5	0.7800
P(rel=2)@5	0.4867
nDCG@10	0.6080
AP	0.3128
AP(rel=2)	0.2513
Bpref	0.4832
Bpref(rel=2)	0.3193
Judged@10	0.9667


In [7]:
!ir_measures cord19/trec-covid/round1 GUIR_s2_run2 'P@5 P(rel=2)@5 nDCG@10 AP AP(rel=2) Bpref Bpref(rel=2) Judged@10'

P@5	0.6867
P(rel=2)@5	0.5667
nDCG@10	0.6032
AP	0.2601
AP(rel=2)	0.2744
Bpref	0.4177
Bpref(rel=2)	0.3748
Judged@10	0.8033


You can pass other options to the command line tool as well. For instance, if you want per-query results in JSONL format, you can specify the `-q -o jsonl` flags. (This output is pretty long, so we'll just show the first 10 lines using the `head` command.)

In [8]:
!ir_measures cord19/trec-covid/round1 sab20.1.meta.docs 'P@5 P(rel=2)@5 nDCG@10 AP AP(rel=2) Bpref Bpref(rel=2) Judged@10' -q -o jsonl | head

{"query_id": "1", "measure": "AP(rel=2)", "value": 0.20605122809198714}
{"query_id": "1", "measure": "Bpref(rel=2)", "value": 0.2818877551020408}
{"query_id": "1", "measure": "P(rel=2)@5", "value": 0.8}
{"query_id": "1", "measure": "nDCG@10", "value": 0.8347683953473287}
{"query_id": "2", "measure": "AP(rel=2)", "value": 0.2337564330531729}
{"query_id": "2", "measure": "Bpref(rel=2)", "value": 0.27218934911242604}
{"query_id": "2", "measure": "P(rel=2)@5", "value": 0.4}
{"query_id": "2", "measure": "nDCG@10", "value": 0.6649794681010974}
{"query_id": "3", "measure": "AP(rel=2)", "value": 0.15448036380988914}
{"query_id": "3", "measure": "Bpref(rel=2)", "value": 0.21180555555555558}
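Because every line of this output is a standalone JSON object, it is easy to post-process. The sketch below parses a few of the lines shown above with the standard `json` module and groups the values by measure. The variable names are illustrative, and the means cover only this excerpt, not the full run.

```python
import json
from collections import defaultdict

# A few lines copied from the per-query output above.
jsonl_output = """\
{"query_id": "1", "measure": "P(rel=2)@5", "value": 0.8}
{"query_id": "1", "measure": "nDCG@10", "value": 0.8347683953473287}
{"query_id": "2", "measure": "P(rel=2)@5", "value": 0.4}
{"query_id": "2", "measure": "nDCG@10", "value": 0.6649794681010974}
"""

# measure name -> {query_id: value}
per_measure = defaultdict(dict)
for line in jsonl_output.splitlines():
    record = json.loads(line)
    per_measure[record["measure"]][record["query_id"]] = record["value"]

# Mean over the queries present in this excerpt only.
means = {name: sum(vals.values()) / len(vals) for name, vals in per_measure.items()}
print(means)
```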


## Python Interface

We can run the same commands directly in Python as well.

In [9]:
import ir_measures
from ir_measures import * # import natural measure names

In [10]:
# read qrels and run files
qrels = list(ir_measures.read_trec_qrels('qrels-rnd1.txt'))
sab = list(ir_measures.read_trec_run('sab20.1.meta.docs'))
guir = list(ir_measures.read_trec_run('GUIR_s2_run2'))

In [11]:
ir_measures.calc_aggregate([P@5, P(rel=2)@5, nDCG@10, AP, AP(rel=2), Bpref, Bpref(rel=2), Judged@10], qrels, sab)

{AP: 0.31280026201626626,
 AP(rel=2): 0.2513444681464884,
 Bpref: 0.48316577337967653,
 Bpref(rel=2): 0.3193286425239503,
 Judged@10: 0.9666666666666667,
 P(rel=2)@5: 0.48666666666666675,
 P@5: 0.7799999999999999,
 nDCG@10: 0.607996962444336}
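Expressions such as `P(rel=2)@5` are ordinary Python: calling an object and applying the `@` operator. Here is a toy sketch of how such syntax can be supported with `__call__` and `__matmul__`. The `ToyMeasure` class and the `toy_P` and `toy_nDCG` names are hypothetical and are not the classes `ir_measures` defines.

```python
class ToyMeasure:
    """Toy measure object supporting M(param=value) and M @ cutoff."""

    def __init__(self, name, params=None):
        self.name = name
        self.params = dict(params or {})

    def __call__(self, **params):
        # toy_P(rel=2) returns a new measure with the extra parameter set.
        return ToyMeasure(self.name, {**self.params, **params})

    def __matmul__(self, cutoff):
        # toy_P @ 5 returns a new measure with a rank cutoff.
        return ToyMeasure(self.name, {**self.params, "cutoff": cutoff})

    def __repr__(self):
        args = ",".join(f"{k}={v}" for k, v in self.params.items() if k != "cutoff")
        text = f"{self.name}({args})" if args else self.name
        if "cutoff" in self.params:
            text += f"@{self.params['cutoff']}"
        return text


toy_P = ToyMeasure("P")
toy_nDCG = ToyMeasure("nDCG")
measures = [toy_P@5, toy_P(rel=2)@5, toy_nDCG@10]
print(measures)
```

A call binds more tightly than `@`, so `toy_P(rel=2)@5` first sets the threshold and then applies the cutoff. Each step returns a new object, so the base `toy_P` is never modified.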

In [12]:
ir_measures.calc_aggregate([P@5, P(rel=2)@5, nDCG@10, AP, AP(rel=2), Bpref, Bpref(rel=2), Judged@10], qrels, guir)

{AP: 0.26009942341913844,
 AP(rel=2): 0.2743896188055524,
 Bpref: 0.41773486708968066,
 Bpref(rel=2): 0.3748322644003913,
 Judged@10: 0.8033333333333333,
 P(rel=2)@5: 0.5666666666666667,
 P@5: 0.6866666666666668,
 nDCG@10: 0.6031939620699799}

The above code is inefficient because it processes the qrels twice, once for each run. You can use an evaluator to eliminate this extra work.

In [13]:
evaluator = ir_measures.evaluator([P@5, P(rel=2)@5, nDCG@10, AP, AP(rel=2), Bpref, Bpref(rel=2), Judged@10], qrels)

In [14]:
evaluator.calc_aggregate(guir)

{AP: 0.26009942341913844,
 AP(rel=2): 0.2743896188055524,
 Bpref: 0.41773486708968066,
 Bpref(rel=2): 0.3748322644003913,
 Judged@10: 0.8033333333333333,
 P(rel=2)@5: 0.5666666666666667,
 P@5: 0.6866666666666668,
 nDCG@10: 0.6031939620699799}

In [15]:
from timeit import timeit
time = timeit(lambda: ir_measures.calc_aggregate([P@5, P(rel=2)@5, nDCG@10, AP, AP(rel=2), Bpref, Bpref(rel=2), Judged@10], qrels, guir), number=10)
print(f'ir_measures.calc_aggregate: {time/10*1000:0.2f}ms/invocation')
time = timeit(lambda: evaluator.calc_aggregate(guir), number=10)
print(f'evaluator.calc_aggregate:   {time/10*1000:0.2f}ms/invocation')

ir_measures.calc_aggregate: 49.33ms/invocation
evaluator.calc_aggregate:   35.08ms/invocation
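The saving comes from doing the qrels preparation once rather than on every call. The sketch below shows that idea with a hypothetical `ToyEvaluator` that indexes the qrels in its constructor and then scores any number of runs. It computes only a simplified mean P@k over the queries in the qrels, on made-up data, and does not reflect how the library's evaluator is built.

```python
from collections import defaultdict


class ToyEvaluator:
    """Indexes qrels once, then scores any number of runs against them."""

    def __init__(self, qrels, k=5, min_rel=1):
        # qrels: iterable of (query_id, doc_id, relevance) tuples
        self.k = k
        self.min_rel = min_rel
        self.index = defaultdict(dict)
        for query_id, doc_id, relevance in qrels:
            self.index[query_id][doc_id] = relevance

    def mean_precision(self, run):
        # run: iterable of (query_id, doc_id, score) tuples
        by_query = defaultdict(list)
        for query_id, doc_id, score in run:
            by_query[query_id].append((score, doc_id))
        values = []
        for query_id, judged in self.index.items():
            ranked = sorted(by_query.get(query_id, []), reverse=True)[: self.k]
            hits = sum(1 for _, doc_id in ranked if judged.get(doc_id, 0) >= self.min_rel)
            values.append(hits / self.k)
        return sum(values) / len(values)


# Toy data (hypothetical).
toy_qrels = [("q1", "d1", 2), ("q1", "d2", 0), ("q2", "d3", 1)]
run_a = [("q1", "d1", 0.9), ("q1", "d2", 0.5), ("q2", "d3", 0.7)]
run_b = [("q1", "d2", 0.9), ("q1", "d1", 0.1), ("q2", "d4", 0.8)]

toy_evaluator = ToyEvaluator(toy_qrels, k=1)  # qrels are indexed here, once
score_a = toy_evaluator.mean_precision(run_a)
score_b = toy_evaluator.mean_precision(run_b)
print(score_a, score_b)
```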


You can also get per-query results using `iter_calc`. This allows us to analyse per-query performance and conduct statistical tests.

In [16]:
count = 0
for metric in ir_measures.iter_calc([P@5, P(rel=2)@5, nDCG@10, AP, AP(rel=2), Bpref, Bpref(rel=2), Judged@10], qrels, guir):
  print(metric)
  count += 1
  if count >= 10: break # only show top 10 items

# per-query Bpref(rel=2) for each run, keyed by query_id
sab_bpref_rel2 = {m.query_id: m.value for m in ir_measures.iter_calc([Bpref(rel=2)], qrels, sab)}
guir_bpref_rel2 = {m.query_id: m.value for m in ir_measures.iter_calc([Bpref(rel=2)], qrels, guir)}

# paired t-test over the queries, aligned by query_id
from scipy.stats import ttest_rel
qids = list(sab_bpref_rel2.keys())
ttest_rel([sab_bpref_rel2[q] for q in qids], [guir_bpref_rel2[q] for q in qids])

Metric(query_id='1', measure=AP(rel=2), value=0.07145298238302196)
Metric(query_id='1', measure=Bpref(rel=2), value=0.1992984693877551)
Metric(query_id='1', measure=P(rel=2)@5, value=0.2)
Metric(query_id='1', measure=nDCG@10, value=0.4519974004479882)
Metric(query_id='2', measure=AP(rel=2), value=0.2360970952991868)
Metric(query_id='2', measure=Bpref(rel=2), value=0.2869822485207101)
Metric(query_id='2', measure=P(rel=2)@5, value=0.8)
Metric(query_id='2', measure=nDCG@10, value=0.7798385221413586)
Metric(query_id='3', measure=AP(rel=2), value=0.1820361550264641)
Metric(query_id='3', measure=Bpref(rel=2), value=0.2517361111111111)


Ttest_relResult(statistic=-1.9683404208802118, pvalue=0.0586544719955888)
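The statistic reported by a paired t-test is the mean of the per-query differences divided by its standard error. A minimal sketch of that calculation with the standard library follows. The scores are hypothetical and the function name `paired_t_statistic` is an assumption. The p-value additionally requires the t distribution with n - 1 degrees of freedom, which is not computed here.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t statistic of a paired t-test: mean difference over its standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)  # sample std dev (n - 1)
    return mean_diff / std_err


# Hypothetical per-query scores for two systems over the same four queries.
system_a = [0.2, 0.4, 0.1, 0.5]
system_b = [0.3, 0.6, 0.1, 0.8]
t = paired_t_statistic(system_a, system_b)
print(f"t = {t:.3f} with {len(system_a) - 1} degrees of freedom")
```

A negative statistic, as in the result above, means the first system scored lower on average than the second.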

## PyTerrier Integration

ir-measures is easy to use in other tools. Here, we see how [PyTerrier](https://pyterrier.readthedocs.io/) uses ir-measures for specifying evaluation criteria in experiments:

In [17]:
!pip install -q git+https://github.com/terrier-org/pyterrier.git


In [18]:
import pyterrier as pt
if not pt.started():
  pt.init()

terrier-assemblies 5.6 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.6 jar not found, downloading to /root/.pyterrier...
Done
PyTerrier 0.7.0 has loaded Terrier 5.6 (built by craigmacdonald on 2021-09-17 13:27)


In [19]:
# NOTE: this example uses TREC COVID complete, rather than round1
dataset = pt.get_dataset('irds:cord19/trec-covid')
pt.Experiment(
    [pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed', wmodel='DPH'),
     pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed', wmodel='BM25')],
    dataset.get_topics('description'),
    dataset.get_qrels(),
    eval_metrics=[P@5, P(rel=2)@5, nDCG@10, AP, AP(rel=2), Bpref, Bpref(rel=2), Judged@10],
#                 ^ using ir_measures
)

Downloading trec-covid index to /root/.pyterrier/corpora/trec-covid/index/terrier_stemmed



[INFO] [starting] https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml
[INFO] [finished] https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml: [00:00] [18.7kB] [8.76MB/s]
[INFO] [starting] https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt
[INFO] [finished] https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt: [00:03] [1.14MB] [303kB/s]


| | name | P@5 | P(rel=2)@5 | nDCG@10 | AP | AP(rel=2) | Bpref | Bpref(rel=2) | Judged@10 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | BR(DPH) | 0.72 | 0.608 | 0.64234 | 0.196439 | 0.173106 | 0.337958 | 0.30189 | 0.964 |
| 1 | BR(BM25) | 0.716 | 0.592 | 0.624045 | 0.223192 | 0.196708 | 0.364892 | 0.326601 | 0.93 |
