# Post-Hoc Analysis of Rankings and Relevance Judgments

This notebook exemplifies how `ir_axioms` supports the post-hoc analysis of run files and qrels, using the passage retrieval task of the TREC Deep Learning track in 2019 and 2020 as an example (with a single BM25 run). We calculate the distribution of axiom preferences in rankings and evaluate the consistency of rankings and relevance judgments with retrieval axioms.

## Preparation

Install the `ir_axioms` framework and [PyTerrier](https://github.com/terrier-org/pyterrier). In Google Colab, we do this automatically.

In [None]:
from sys import modules

if 'google.colab' in modules:
    !pip install -q ir_axioms[examples] python-terrier

Next, we initialize PyTerrier. The following sections import the required libraries and load the data from [ir_datasets](https://ir-datasets.com/).

In [None]:
from pyterrier import started, init

if not started():
    init(tqdm="auto")

## Datasets and Index
Using PyTerrier's `get_dataset()`, we load the MS MARCO passage ranking dataset.

In [2]:
from pyterrier.datasets import get_dataset, Dataset

# Load dataset.
dataset_name = "msmarco-passage"
dataset: Dataset = get_dataset(f"irds:{dataset_name}")
dataset_train: Dataset = get_dataset(f"irds:{dataset_name}/trec-dl-2019/judged")
dataset_test: Dataset = get_dataset(f"irds:{dataset_name}/trec-dl-2020/judged")

Now define paths where we will store temporary files, datasets, and the search index.

In [3]:
from pathlib import Path

cache_dir = Path("cache/")
index_dir = cache_dir / "indices" / dataset_name.split("/")[0]

If the index is not ready yet, now is a good time to create it and index the MS MARCO passages.
(Lean back and relax as this may take a while...)

In [4]:
from pyterrier.index import IterDictIndexer

if not index_dir.exists():
    indexer = IterDictIndexer(str(index_dir.absolute()))
    indexer.index(
        dataset.get_corpus_iter(),
        fields=["text"]
    )

## Baseline Run

We use PyTerrier's `BatchRetrieve` to create a baseline search pipeline for retrieving with BM25 from the index we just created.

In [5]:
from pyterrier.batchretrieve import BatchRetrieve

bm25 = BatchRetrieve(str(index_dir.absolute()), wmodel="BM25")

## Import Axioms
Here we list the axioms we want to use in our experiments.
Because some axioms require API calls or are computationally expensive, we cache all axioms using the tilde operator (`~`) of `ir_axioms`.

In [6]:
from ir_axioms.axiom import (
    ArgUC, QTArg, QTPArg, aSL, PROX1, PROX2, PROX3, PROX4, PROX5, TFC1, TFC3, RS_TF, RS_TF_IDF, RS_BM25, RS_PL2, RS_QL,
    AND, LEN_AND, M_AND, LEN_M_AND, DIV, LEN_DIV, M_TDC, LEN_M_TDC, STMC1, STMC1_f, STMC2, STMC2_f, LNC1, TF_LNC, LB1,
    REG, ANTI_REG, REG_f, ANTI_REG_f, ASPECT_REG, ASPECT_REG_f
)

axioms = [
    ~ArgUC(), ~QTArg(), ~QTPArg(), ~aSL(),
    ~LNC1(), ~TF_LNC(), ~LB1(),
    ~PROX1(), ~PROX2(), ~PROX3(), ~PROX4(), ~PROX5(),
    ~REG(), ~REG_f(), ~ANTI_REG(), ~ANTI_REG_f(), ~ASPECT_REG(), ~ASPECT_REG_f(),
    ~AND(), ~LEN_AND(), ~M_AND(), ~LEN_M_AND(), ~DIV(), ~LEN_DIV(),
    ~RS_TF(), ~RS_TF_IDF(), ~RS_BM25(), ~RS_PL2(), ~RS_QL(),
    ~TFC1(), ~TFC3(), ~M_TDC(), ~LEN_M_TDC(),
    ~STMC1(), ~STMC1_f(), ~STMC2(), ~STMC2_f(),
]
axiom_names = [axiom.axiom.name for axiom in axioms]
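
The `~` operator and the `axiom.axiom.name` access above suggest that `~` returns a wrapper object that keeps the original axiom in an `axiom` attribute. How `ir_axioms` implements this internally is not shown in this notebook. The following is only a minimal sketch of the general pattern (overloading `__invert__` to return a memoizing wrapper), using the hypothetical classes `ToyAxiom` and `CachedToyAxiom`, which are not part of `ir_axioms`.

```python
from typing import Dict, Tuple


class ToyAxiom:
    """Hypothetical stand-in for an axiom; not an ir_axioms class."""

    name = "TOY"

    def __init__(self) -> None:
        self.calls = 0  # Counts how often the (expensive) computation runs.

    def preference(self, query: str, doc_a: str, doc_b: str) -> int:
        self.calls += 1
        # Toy rule: prefer the shorter document (1: a, -1: b, 0: tie).
        return (len(doc_a) < len(doc_b)) - (len(doc_a) > len(doc_b))

    def __invert__(self) -> "CachedToyAxiom":
        # `~axiom` wraps the axiom in a caching object.
        return CachedToyAxiom(self)


class CachedToyAxiom:
    """Memoizes preferences and keeps the wrapped axiom as `.axiom`."""

    def __init__(self, axiom: ToyAxiom) -> None:
        self.axiom = axiom
        self._cache: Dict[Tuple[str, str, str], int] = {}

    def preference(self, query: str, doc_a: str, doc_b: str) -> int:
        key = (query, doc_a, doc_b)
        if key not in self._cache:
            # Only the first request for a pair reaches the wrapped axiom.
            self._cache[key] = self.axiom.preference(query, doc_a, doc_b)
        return self._cache[key]


cached = ~ToyAxiom()
first = cached.preference("q", "short", "a longer document")
second = cached.preference("q", "short", "a longer document")
print(cached.axiom.name, first, second, cached.axiom.calls)
```

In this sketch, the second call is answered from the dictionary, so the wrapped axiom is evaluated only once for the same query and document pair.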

## Axiomatic Experiments

The `AxiomaticExperiment` class provides the entry point to the post-hoc analysis of rankings and relevance judgments.
We pass it the retrieval pipelines to evaluate, the dataset and index, and the axioms to compute preferences for.
With the `depth` parameter, you can control how many preferences are computed.

Computing time roughly scales with `len(retrieval_systems)` x `len(topics)` x `depth` x `depth` x `len(axioms)`.
Additional parameters help optimize computational costs by filtering missing qrels and/or topics.
In our experience, caching helps most with multiple runs for the same benchmark (like TREC runs), because different runs' result sets often overlap.
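
As a rough illustration of this scaling, the following sketch multiplies the factors for the setup used in this notebook: one run, the 54 judged topics of TREC DL 2020, depth 10, and the 37 axioms listed above. The helper function is hypothetical, and the result is a count of axiom evaluations without caching, not a runtime measurement.

```python
def estimated_preference_computations(
    num_systems: int, num_topics: int, depth: int, num_axioms: int
) -> int:
    """One preference per system, topic, ordered document pair, and axiom."""
    return num_systems * num_topics * depth * depth * num_axioms


# One BM25 run, 54 judged topics, depth 10, 37 axioms.
estimate = estimated_preference_computations(1, 54, 10, 37)
print(estimate)

# Doubling the depth quadruples the number of document pairs per topic.
print(estimated_preference_computations(1, 54, 20, 37) // estimate)
```

Because `depth` enters quadratically, it is usually the first parameter to reduce when an experiment takes too long.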

In [7]:
from ir_axioms.backend.pyterrier.experiment import AxiomaticExperiment

experiment = AxiomaticExperiment(
    retrieval_systems=[bm25],
    topics=dataset_test.get_topics(),
    qrels=dataset_test.get_qrels(),
    index=index_dir,
    dataset=dataset_name,
    axioms=axioms,
    axiom_names=axiom_names,
    depth=10,
    filter_by_qrels=False,
    filter_by_topics=False,
    verbose=True,
    cache_dir=cache_dir,
)

### Calculate All Preferences
With the `preferences` member, you can compute a `DataFrame` containing all pairwise preferences up to the specified depth.

In [8]:
experiment.preferences

Computing system axiomatic preferences:   0%|          | 0/1 [00:00<?, ?system/s]

Computing query axiom preferences:   0%|          | 0/54 [00:00<?, ?query/s]

Unnamed: 0,qid,docid_a,docno_a,rank_a,score_a,query,label_a,iteration_a,docid_b,docno_b,...,RS-PL2_preference,RS-QL_preference,TFC1_preference,TFC3_preference,M-TDC_preference,LEN-M-TDC_preference,STMC1_preference,STMC1-fastText_preference,STMC2_preference,STMC2-fastText_preference
0,1030303,8726436,8726436,0,54.354218,who is aziz hashim,3.0,0,8726436,8726436,...,0,0,0,0,0,0,0,0,0,0
1,1030303,8726436,8726436,0,54.354218,who is aziz hashim,3.0,0,8726433,8726433,...,1,1,0,0,0,0,1,1,0,0
2,1030303,8726436,8726436,0,54.354218,who is aziz hashim,3.0,0,8726435,8726435,...,1,1,0,0,0,0,1,1,0,0
3,1030303,8726436,8726436,0,54.354218,who is aziz hashim,3.0,0,8726429,8726429,...,1,1,-1,0,0,0,1,1,0,0
4,1030303,8726436,8726436,0,54.354218,who is aziz hashim,3.0,0,8726437,8726437,...,1,1,0,0,0,0,-1,1,0,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5395,997622,7965342,7965342,9,30.023091,where is the show shameless filmed,0.0,0,4643397,4643397,...,1,-1,0,0,0,0,-1,1,0,-1
5396,997622,7965342,7965342,9,30.023091,where is the show shameless filmed,0.0,0,4518222,4518222,...,1,-1,0,0,0,0,1,1,0,0
5397,997622,7965342,7965342,9,30.023091,where is the show shameless filmed,0.0,0,4810071,4810071,...,-1,-1,0,0,0,0,-1,1,0,0
5398,997622,7965342,7965342,9,30.023091,where is the show shameless filmed,0.0,0,4558331,4558331,...,1,-1,0,0,0,0,1,1,0,0
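
Each `*_preference` column in this table holds one axiom's verdict for the document pair (a, b) as 1, 0, or -1. To make the structure concrete, here is a minimal sketch that enumerates all ordered pairs of a ranking and applies a toy preference function. It assumes that 1 means document a should be ranked above document b. The function `toy_tf_preference` is a simplified, hypothetical rule based on query term counts and is not the implementation of any `ir_axioms` axiom; the query and documents are invented.

```python
from itertools import product
from typing import List, Tuple


def toy_tf_preference(query: str, doc_a: str, doc_b: str) -> int:
    """Return 1 if doc_a has more query term occurrences, -1 if fewer, else 0."""
    terms = set(query.lower().split())
    tf_a = sum(token in terms for token in doc_a.lower().split())
    tf_b = sum(token in terms for token in doc_b.lower().split())
    return (tf_a > tf_b) - (tf_a < tf_b)


def pairwise_preferences(query: str, ranking: List[str]) -> List[Tuple[int, int, int]]:
    """Compute (rank_a, rank_b, preference) for all ordered pairs, self-pairs included."""
    return [
        (rank_a, rank_b, toy_tf_preference(query, ranking[rank_a], ranking[rank_b]))
        for rank_a, rank_b in product(range(len(ranking)), repeat=2)
    ]


ranking = [
    "neural models",
    "neural ranking models for neural retrieval",
    "classic retrieval",
]
rows = pairwise_preferences("neural ranking models", ranking)
for row in rows:
    print(row)
```

As in the first row of the table above, a document paired with itself yields a preference of 0, and swapping a and b flips the sign.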


### Evaluate Inconsistent Document Pairs
Investigate document pairs that are inconsistent with retrieval axioms (and relevance judgments).
These can provide insights to improve the evaluated ranking function.

In [12]:
experiment.inconsistent_pairs.mean()


qid                                        inf
docid_a                           4.859517e+06
docno_a                                    inf
rank_a                            6.326316e+00
score_a                           3.505220e+01
label_a                           2.223158e+00
iteration_a                       0.000000e+00
docid_b                           4.742632e+06
docno_b                                    inf
rank_b                            2.810526e+00
score_b                           3.751082e+01
label_b                           4.757895e-01
iteration_b                       0.000000e+00
ORIG_preference                  -1.000000e+00
ORACLE_preference                 1.000000e+00
ArgUC_preference                 -2.105263e-02
QTArg_preference                 -4.842105e-02
QTPArg_preference                 2.105263e-02
aSL_preference                    2.105263e-03
LNC1_preference                  -2.105263e-03
TF-LNC_preference                 1.263158e-02
LB1_preferenc
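
In the output above, `ORIG_preference` averages -1 and `ORACLE_preference` averages 1, while `rank_a` is on average larger than `rank_b` and `label_a` larger than `label_b`. This suggests that every listed pair is one where the run placed document b ahead of document a although a was judged more relevant. The following minimal sketch applies such a filter to plain dictionaries. The exact criterion used by `ir_axioms` is an assumption here, the helper names are hypothetical, and the toy pairs are invented.

```python
from typing import Dict, List

Pair = Dict[str, float]


def preference(value_a: float, value_b: float) -> int:
    """Return 1 if a's value is higher, -1 if lower, and 0 on a tie."""
    return (value_a > value_b) - (value_a < value_b)


def inconsistent_pairs(pairs: List[Pair]) -> List[Pair]:
    """Keep pairs where the judgments prefer a but the run ranked b first."""
    result = []
    for pair in pairs:
        # A smaller rank means an earlier position, so the arguments are swapped.
        orig = preference(pair["rank_b"], pair["rank_a"])
        oracle = preference(pair["label_a"], pair["label_b"])
        if orig < 0 < oracle:
            result.append({**pair, "ORIG_preference": orig, "ORACLE_preference": oracle})
    return result


toy_pairs = [
    {"rank_a": 6, "label_a": 3.0, "rank_b": 2, "label_b": 0.0},  # inconsistent
    {"rank_a": 0, "label_a": 3.0, "rank_b": 1, "label_b": 1.0},  # consistent
    {"rank_a": 4, "label_a": 1.0, "rank_b": 3, "label_b": 1.0},  # tied labels
]
found = inconsistent_pairs(toy_pairs)
print(found)
```

Averaging the axiom preference columns over such pairs then indicates which axioms tend to side with the relevance judgments (positive mean) or with the run (negative mean).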

### Evaluate Consistency of Runs with Axioms
This evaluation shows how consistent axioms are with the run's original order (ORIG) and the relevance judgments (ORACLE).
High consistency with the ORACLE axiom indicates that for your run, the axiom captures the relevance judgments well.
High consistency with the ORIG axiom indicates that your run adheres to the axiom's constraint well.

In [10]:
experiment.preference_consistency

Unnamed: 0,axiom,ORIG_consistency,ORACLE_consistency
0,ArgUC,0.494475,0.494505
1,QTArg,0.662539,0.538922
2,QTPArg,0.591623,0.673367
3,aSL,0.462185,0.531746
4,LNC1,0.578947,0.5625
5,TF-LNC,0.58042,0.613333
6,LB1,0.664957,0.625352
7,PROX1,0.568596,0.605611
8,PROX2,0.592907,0.63474
9,PROX3,0.666667,0.482759
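
The `ORIG_consistency` values above can be reproduced from the counts in the preference distribution table of this notebook: the number of pairs where the axiom agrees with ORIG, divided by the number of pairs where the axiom expresses any preference. The sketch below shows this ratio. That `ir_axioms` computes it in exactly this way is an assumption based on the matching numbers, and the function name is hypothetical.

```python
def consistency(num_agree: int, num_disagree: int) -> float:
    """Share of non-zero axiom preferences that agree with the reference."""
    decided = num_agree + num_disagree
    return num_agree / decided if decided > 0 else float("nan")


# Counts (axiom == ORIG, axiom != ORIG) for ArgUC and QTArg,
# taken from the preference distribution table in this notebook.
print(round(consistency(179, 183), 6))
print(round(consistency(214, 109), 6))
```

Under this reading, pairs for which the axiom returns 0 do not affect the consistency, so an axiom that rarely decides can still show a high value based on few pairs.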


### Evaluate Preference Distribution
Calculating the distribution of each axiom's preferences compared to the original ranking order (ORIG) can help identify decisive axioms for your dataset.

In [11]:
experiment.preference_distribution

Unnamed: 0,axiom,axiom == 0,axiom == ORIG,axiom != ORIG
0,ArgUC,2068,179,183
1,QTArg,2107,214,109
2,QTPArg,2048,226,156
3,aSL,2192,110,128
4,LNC1,2373,33,24
5,TF-LNC,2287,83,60
6,LB1,1845,389,196
7,PROX1,1169,717,544
8,PROX2,1133,769,528
9,PROX3,2340,60,30
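
Each row of this table sums to 2,430, which equals 54 × 45 and thus suggests that the counts cover every pair of distinct documents among the top 10 results of the 54 topics. A minimal sketch of such a count, given one axiom's preferences and the ORIG preferences for the same pairs, could look as follows. The function is hypothetical, assumes that ORIG is never 0 for the counted pairs, and uses invented toy values.

```python
from collections import Counter
from typing import Dict, Sequence

COLUMNS = ("axiom == 0", "axiom == ORIG", "axiom != ORIG")


def preference_distribution(
    axiom_prefs: Sequence[int], orig_prefs: Sequence[int]
) -> Dict[str, int]:
    """Count undecided, agreeing, and disagreeing axiom preferences."""
    counts: Counter = Counter()
    for axiom_pref, orig_pref in zip(axiom_prefs, orig_prefs):
        if axiom_pref == 0:
            counts["axiom == 0"] += 1
        elif axiom_pref == orig_pref:
            counts["axiom == ORIG"] += 1
        else:
            counts["axiom != ORIG"] += 1
    return {column: counts[column] for column in COLUMNS}


dist = preference_distribution([0, 1, -1, 1, 0], [1, 1, 1, 1, 1])
print(dist)

# Share of pairs for which the axiom expresses a preference at all.
decisive_share = 1 - dist["axiom == 0"] / sum(dist.values())
print(decisive_share)
```

Applied to the table above, PROX1 and PROX2 decide more than half of the 2,430 pairs, whereas LNC1 decides only 57 of them.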
