# KwikSort Re-Ranking with Estimating the ORACLE Axiom

The notebook below exemplifies how `ir_axioms` can be used to re-rank a result set in PyTerrier using the KwikSort algorithm and an estimation of the ORACLE axiom.
As an example, we use topics and qrels from the passage retrieval task of the TREC Deep Learning tracks 2019 and 2020, with BM25 as the baseline.
We first train an estimator of the ORACLE axiom on preferences inferred from the 2019 qrels and topics.
Then, we re-rank using that trained `EstimatorAxiom` and evaluate nDCG@10, reciprocal rank, and average precision for the baseline and the re-ranked pipeline using PyTerrier's standard `Experiment` functionality.

## Preparation

Install the `ir_axioms` framework and [PyTerrier](https://github.com/terrier-org/pyterrier). In Google Colab, we do this automatically.

In [1]:
from sys import modules

if 'google.colab' in modules:
    !pip install -q ir_axioms[examples] python-terrier

We initialize PyTerrier before importing the required libraries and loading the data from [ir_datasets](https://ir-datasets.com/).

In [2]:
from pyterrier import started, init

if not started():
    init(tqdm="auto", no_download=True)

PyTerrier 0.8.0 has loaded Terrier 5.6 (built by craigmacdonald on 2021-09-17 13:27)

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


## Datasets and Index
Using PyTerrier's `get_dataset()`, we load the MS MARCO passage ranking dataset.

In [3]:
from pyterrier.datasets import get_dataset, Dataset

# Load dataset.
dataset_name = "msmarco-passage"
dataset: Dataset = get_dataset(f"irds:{dataset_name}")
dataset_train: Dataset = get_dataset(f"irds:{dataset_name}/trec-dl-2019/judged")
dataset_test: Dataset = get_dataset(f"irds:{dataset_name}/trec-dl-2020/judged")

Now define paths where we will store temporary files, datasets, and the search index.

In [4]:
from pathlib import Path

cache_dir = Path("cache/")
index_dir = cache_dir / "indices" / dataset_name.split("/")[0]

If the index is not ready yet, now is a good time to create it and index the MS MARCO passages.
(Lean back and relax as this may take a while...)

In [5]:
from pyterrier.index import IterDictIndexer

if not index_dir.exists():
    indexer = IterDictIndexer(str(index_dir.absolute()))
    indexer.index(
        dataset.get_corpus_iter(),
        fields=["text"]
    )

## Baseline Run

We use PyTerrier's `BatchRetrieve` to create a baseline search pipeline for retrieving with BM25 from the index we just created.

In [6]:
from pyterrier.batchretrieve import BatchRetrieve

bm25 = BatchRetrieve(str(index_dir.absolute()), wmodel="BM25")

## Import Axioms
Here we're listing which axioms we want to use in our experiments.
Because some axioms require API calls or are computationally expensive, we cache all axioms except `ORIG` using the tilde operator (`~`) from `ir_axioms`.
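
To illustrate the idea behind a caching operator, here is a minimal sketch of how `~` can be overloaded in Python to wrap an object in a cache. The classes `Preference` and `CachedPreference` and the function `shorter_first` are hypothetical and are not taken from `ir_axioms`, whose actual caching may work differently.

```python
from typing import Callable, Dict, Tuple


class Preference:
    """Hypothetical stand-in for an axiom: wraps a pairwise preference function."""

    def __init__(self, func: Callable[[str, str, str], int]):
        self.func = func
        self.calls = 0  # Counts how often the wrapped function actually runs.

    def preference(self, query: str, doc1: str, doc2: str) -> int:
        self.calls += 1
        return self.func(query, doc1, doc2)

    def __invert__(self) -> "CachedPreference":
        # `~axiom` returns a caching wrapper around this object.
        return CachedPreference(self)


class CachedPreference:
    """Hypothetical wrapper that memoizes preferences per (query, doc1, doc2)."""

    def __init__(self, inner: Preference):
        self.inner = inner
        self._cache: Dict[Tuple[str, str, str], int] = {}

    def preference(self, query: str, doc1: str, doc2: str) -> int:
        key = (query, doc1, doc2)
        if key not in self._cache:
            self._cache[key] = self.inner.preference(query, doc1, doc2)
        return self._cache[key]


def shorter_first(query: str, doc1: str, doc2: str) -> int:
    # Toy preference: 1 if doc1 is shorter, -1 if longer, 0 if equal length.
    return (len(doc1) < len(doc2)) - (len(doc1) > len(doc2))


base = Preference(shorter_first)
cached = ~base
first = cached.preference("q", "short", "a longer document")
second = cached.preference("q", "short", "a longer document")
print(first, second, base.calls)  # The second lookup is served from the cache.
```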

In [7]:
from ir_axioms.axiom import (
    ArgUC, QTArg, QTPArg, aSL, PROX1, PROX2, PROX3, PROX4, PROX5, TFC1, TFC3, RS_TF, RS_TF_IDF, RS_BM25, RS_PL2, RS_QL,
    AND, LEN_AND, M_AND, LEN_M_AND, DIV, LEN_DIV, M_TDC, LEN_M_TDC, STMC1, STMC1_f, STMC2, STMC2_f, LNC1, TF_LNC, LB1,
    REG, ANTI_REG, REG_f, ANTI_REG_f, ASPECT_REG, ASPECT_REG_f, ORIG
)

axioms = [
    ~ArgUC(), ~QTArg(), ~QTPArg(), ~aSL(),
    ~LNC1(), ~TF_LNC(), ~LB1(),
    ~PROX1(), ~PROX2(), ~PROX3(), ~PROX4(), ~PROX5(),
    ~REG(), ~REG_f(), ~ANTI_REG(), ~ANTI_REG_f(), ~ASPECT_REG(), ~ASPECT_REG_f(),
    ~AND(), ~LEN_AND(), ~M_AND(), ~LEN_M_AND(), ~DIV(), ~LEN_DIV(),
    ~RS_TF(), ~RS_TF_IDF(), ~RS_BM25(), ~RS_PL2(), ~RS_QL(),
    ~TFC1(), ~TFC3(), ~M_TDC(), ~LEN_M_TDC(),
    ~STMC1(), ~STMC1_f(), ~STMC2(), ~STMC2_f(),
    ORIG()
]

## KwikSort Re-ranking with Estimating the ORACLE Axiom
We have now defined the axioms with which we want to estimate the ORACLE axiom.
As a reminder, the ORACLE axiom replicates the perfect ordering induced by human relevance judgments (i.e., from qrels).
We combine the preferences from all axioms in a random forest classifier.
The resulting output preferences can be used with KwikSort to re-rank the top-20 baseline results.
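
The interplay of pairwise preferences and KwikSort can be sketched in plain Python. This is a simplified illustration, not the `ir_axioms` implementation: `oracle_preference` and `kwiksort` are hypothetical helpers, the pivot is always the middle element, and the tie-breaking rule (ties stay on the side of the pivot they had in the input ranking) is an assumption.

```python
from typing import Callable, Dict, List


def oracle_preference(qrels: Dict[str, int]) -> Callable[[str, str], int]:
    """Build a preference function from relevance labels (unjudged counts as 0)."""

    def preference(doc1: str, doc2: str) -> int:
        rel1, rel2 = qrels.get(doc1, 0), qrels.get(doc2, 0)
        # 1 if doc1 should rank higher, -1 if lower, 0 if no preference.
        return (rel1 > rel2) - (rel1 < rel2)

    return preference


def kwiksort(docs: List[str], preference: Callable[[str, str], int]) -> List[str]:
    """Recursively partition documents around the middle pivot."""
    if len(docs) <= 1:
        return list(docs)
    pivot_index = len(docs) // 2
    pivot = docs[pivot_index]
    before: List[str] = []
    after: List[str] = []
    for index, doc in enumerate(docs):
        if index == pivot_index:
            continue
        pref = preference(doc, pivot)
        # Assumption: on ties, keep the document on its original side of the pivot.
        if pref > 0 or (pref == 0 and index < pivot_index):
            before.append(doc)
        else:
            after.append(doc)
    return kwiksort(before, preference) + [pivot] + kwiksort(after, preference)


# Toy relevance labels; "d5" is unjudged.
qrels = {"d1": 0, "d2": 3, "d3": 1, "d4": 2}
baseline = ["d1", "d2", "d3", "d4", "d5"]
reranked = kwiksort(baseline, oracle_preference(qrels))
print(reranked)  # ['d2', 'd4', 'd3', 'd1', 'd5']
```

With consistent (transitive) preferences such as the toy oracle above, KwikSort reproduces the exact ordering. Estimated preferences may contradict each other, in which case the result depends on the chosen pivots.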

In [8]:
from sklearn.ensemble import RandomForestClassifier
from ir_axioms.modules.pivot import MiddlePivotSelection
from ir_axioms.backend.pyterrier.estimator import EstimatorKwikSortReranker

random_forest = RandomForestClassifier(
    max_depth=3,
)
kwiksort_random_forest = bm25 % 20 >> EstimatorKwikSortReranker(
    axioms=axioms,
    estimator=random_forest,
    index=index_dir,
    dataset=dataset_name,
    pivot_selection=MiddlePivotSelection(),
    cache_dir=cache_dir,
    verbose=True,
)

After setting up the trainable PyTerrier module, we fit it on the training topics and relevance judgments.

In [9]:
kwiksort_random_forest.fit(dataset_train.get_topics(), dataset_train.get_qrels())

Collecting axiom preferences:   0%|          | 0/699 [00:00<?, ?document pair/s]

## Experimental Evaluation
Because our axiomatic re-rankers are PyTerrier modules, we can now use PyTerrier's `Experiment` interface to evaluate various metrics and to compare our new approach to the BM25 baseline ranking.
Refer to the PyTerrier [documentation](https://pyterrier.readthedocs.io/en/latest/experiments.html) to learn more about running experiments.
(For ranking positions after the top 20, we append the results of the baseline ranking using the `^` operator.)
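
The effect of concatenating two rankings for a single query can be sketched as follows. The helper `concatenate` is hypothetical and works on plain lists of document IDs; it assumes that documents already present in the first ranking are skipped when the second ranking is appended, and it ignores scores, which PyTerrier also has to adjust.

```python
from typing import List


def concatenate(head: List[str], tail: List[str]) -> List[str]:
    """Append documents from `tail` that do not already occur in `head`."""
    seen = set(head)
    return head + [doc for doc in tail if doc not in seen]


baseline = ["d1", "d2", "d3", "d4", "d5"]
reranked_top3 = ["d2", "d3", "d1"]  # Re-ranked cutoff of the baseline.
full = concatenate(reranked_top3, baseline)
print(full)  # ['d2', 'd3', 'd1', 'd4', 'd5']
```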

In [10]:
from pyterrier.pipelines import Experiment
from ir_measures import nDCG, MAP, RR

experiment = Experiment(
    [bm25, kwiksort_random_forest ^ bm25],
    dataset_test.get_topics(),
    dataset_test.get_qrels(),
    [nDCG @ 10, RR, MAP],
    ["BM25", "KwikSort Random Forest"],
    verbose=True,
)
experiment.sort_values(by="nDCG@10", ascending=False, inplace=True)

pt.Experiment:   0%|          | 0/2 [00:00<?, ?system/s]

Reranking query axiomatically:   0%|          | 0/54 [00:00<?, ?query/s]

In [11]:
experiment

| | name | nDCG@10 | RR | AP |
|---|---|---|---|---|
| 1 | KwikSort Random Forest | 0.498611 | 0.839206 | 0.36593 |
| 0 | BM25 | 0.493627 | 0.802359 | 0.358724 |
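
For readers unfamiliar with the reported metrics, the following sketch computes reciprocal rank and nDCG@k for a single toy query. It assumes linear gains (the relevance label itself) and a `log2(rank + 1)` discount; the helper names are hypothetical, and `ir_measures` may differ in details such as relevance thresholds or the handling of unjudged documents.

```python
from math import log2
from typing import Dict, List


def reciprocal_rank(ranking: List[str], qrels: Dict[str, int], min_rel: int = 1) -> float:
    """Return 1 / rank of the first document with a label of at least `min_rel`."""
    for rank, doc in enumerate(ranking, start=1):
        if qrels.get(doc, 0) >= min_rel:
            return 1.0 / rank
    return 0.0


def dcg(gains: List[int]) -> float:
    """Discounted cumulative gain with linear gains (an assumption)."""
    return sum(gain / log2(rank + 1) for rank, gain in enumerate(gains, start=1))


def ndcg_at_k(ranking: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """Normalize the DCG of the top-k ranking by the DCG of an ideal ranking."""
    gains = [qrels.get(doc, 0) for doc in ranking[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy judgments and ranking for one query.
qrels = {"d1": 0, "d2": 3, "d3": 1}
ranking = ["d1", "d2", "d3"]
rr = reciprocal_rank(ranking, qrels)
ndcg = ndcg_at_k(ranking, qrels, k=10)
print(rr)  # 0.5, because the first relevant document is at rank 2.
print(round(ndcg, 3))
```

`Experiment` reports these values averaged over all test queries.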


## Extra: Feature Importances
Inspecting the feature importances from the random forest classifier can help to identify axioms that are not used for re-ranking.
If an axiom's feature importance is zero for most of your applications, you may consider omitting it from the ranking pipeline.

In [12]:
random_forest.feature_importances_

array([0.01500606, 0.01318037, 0.01402221, 0.00507727, 0.        ,
       0.00927955, 0.02071664, 0.02384842, 0.05429309, 0.02249094,
       0.03615352, 0.03983559, 0.04633239, 0.02072293, 0.07855608,
       0.11243486, 0.        , 0.        , 0.11392089, 0.00919257,
       0.10809633, 0.0156373 , 0.04798415, 0.01167902, 0.0207403 ,
       0.        , 0.        , 0.05891347, 0.01428831, 0.01471213,
       0.        , 0.00168695, 0.0019021 , 0.03305155, 0.00926728,
       0.01168474, 0.015293  , 0.        ])
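
The array above contains 38 values, one per axiom passed to the re-ranker. To read it more easily, the importances can be paired with axiom names, as in the sketch below. The helpers `rank_features` and `unused_features` are hypothetical, the names and values are toy data rather than results from this notebook, and the pairing only makes sense under the assumption that the features follow the order of the `axioms` list, which should be verified against `ir_axioms`.

```python
from typing import List, Sequence, Tuple


def rank_features(names: Sequence[str], importances: Sequence[float]) -> List[Tuple[str, float]]:
    """Pair names with importances and sort by importance, highest first."""
    if len(names) != len(importances):
        raise ValueError("Expected exactly one importance per feature name.")
    return sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)


def unused_features(names: Sequence[str], importances: Sequence[float]) -> List[str]:
    """Return the names of features whose importance is exactly zero."""
    return [name for name, importance in zip(names, importances) if importance == 0.0]


# Toy data for illustration only (not taken from the output above).
names = ["AXIOM_A", "AXIOM_B", "AXIOM_C", "AXIOM_D"]
importances = [0.25, 0.0, 0.75, 0.0]

ranked = rank_features(names, importances)
unused = unused_features(names, importances)
print(ranked[0])  # ('AXIOM_C', 0.75)
print(unused)  # ['AXIOM_B', 'AXIOM_D']
```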