# Re-Ranking Comparison

This notebook showcases three axiomatic re-ranking approaches in `ir_axioms` using PyTerrier:

1. KwikSort with manually combined axioms,
2. KwikSort with an estimation of the ORACLE axiom, and
3. aggregated axiom preferences as features for LambdaMART.

We use topics and qrels from the passage retrieval task of the TREC Deep Learning tracks 2019 (for training) and 2020 (for testing) as an example, with BM25 as the baseline.
In this notebook, we evaluate nDCG@10, reciprocal rank, and average precision for the baseline and the re-ranked pipelines using PyTerrier's standard `Experiment` functionality.
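
To make two of these metrics concrete, here is a minimal pure-Python sketch of reciprocal rank and nDCG@k. It is only an illustration: the function names, the linear gain, and the toy labels are assumptions made for this example, and the notebook itself relies on `ir_measures` rather than on this code.

```python
from math import log2


def reciprocal_rank(relevances: list[int], min_rel: int = 1) -> float:
    """Return 1/rank of the first result whose label is at least min_rel."""
    for rank, rel in enumerate(relevances, start=1):
        if rel >= min_rel:
            return 1.0 / rank
    return 0.0


def dcg(relevances: list[int]) -> float:
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(rel / log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg_at_k(ranked: list[int], judged: list[int], k: int = 10) -> float:
    """Normalize DCG@k by the DCG of the ideal ordering of all judged labels."""
    ideal = dcg(sorted(judged, reverse=True)[:k])
    return dcg(ranked[:k]) / ideal if ideal > 0 else 0.0


# Hypothetical graded labels of one ranking, listed in rank order.
ranking = [0, 2, 1, 0, 3]
rr = reciprocal_rank(ranking)  # First relevant result is at rank 2.
ndcg = ndcg_at_k(ranking, judged=ranking, k=10)
```

Gain and discount conventions differ between evaluation tools, so values from this sketch need not match a specific tool exactly.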


## Preparation

Install the `ir_axioms` framework and [PyTerrier](https://github.com/terrier-org/pyterrier). In Google Colab, we do this automatically.


In [None]:
from sys import modules

if "google.colab" in modules:
    !pip install -q ir_axioms[experiments] python-terrier

## Datasets and Index

Using PyTerrier's `get_dataset()`, we load the MS MARCO passage ranking dataset.


In [None]:
from pyterrier.datasets import get_dataset, Dataset

# Load dataset.
dataset_name = "msmarco-passage"
dataset: Dataset = get_dataset(f"irds:{dataset_name}")
dataset_train: Dataset = get_dataset(f"irds:{dataset_name}/trec-dl-2019/judged")
dataset_test: Dataset = get_dataset(f"irds:{dataset_name}/trec-dl-2020/judged")

Now define paths where we will store temporary files, datasets, and the search index.


In [None]:
from pathlib import Path

cache_dir = Path("cache/")
index_dir = cache_dir / "indices" / dataset_name.split("/")[0]
results_dir = Path("results")

If the index is not ready yet, now is a good time to create it and index the MS MARCO passages.
(Lean back and relax as this may take a while...)


In [None]:
from pyterrier.index import IterDictIndexer

if not index_dir.exists():
    indexer = IterDictIndexer(str(index_dir.absolute()))
    indexer.index(dataset.get_corpus_iter(), fields=["text"])

## Baseline Run

We use PyTerrier's `BatchRetrieve` to create a baseline search pipeline for retrieving with BM25 from the index we just created.


In [None]:
from pyterrier.batchretrieve import BatchRetrieve

bm25 = BatchRetrieve(str(index_dir.absolute()), wmodel="BM25")

## Combine and Import Axioms

Here, we import the axioms that we want to use in our experiments.


In [None]:
from ir_axioms.axiom import (
    ArgUC,
    QTArg,
    QTPArg,
    aSL,
    PROX1,
    PROX2,
    PROX3,
    PROX4,
    PROX5,
    TFC1,
    TFC3,
    RS_TF,
    RS_TF_IDF,
    RS_BM25,
    RS_PL2,
    RS_QL,
    AND,
    LEN_AND,
    M_AND,
    LEN_M_AND,
    DIV,
    LEN_DIV,
    M_TDC,
    LEN_M_TDC,
    STMC1,
    STMC1_f,
    STMC2,
    STMC2_f,
    LNC1,
    TF_LNC,
    LB1,
    REG,
    ANTI_REG,
    REG_f,
    ANTI_REG_f,
    ASPECT_REG,
    ASPECT_REG_f,
    ORIG,
)

First, we combine many of the axioms implemented in `ir_axioms` to form a majority vote.
That is, we only keep preferences on which at least 50% (or 0.5) of the axioms agree.
Because some axioms require API calls or are computationally expensive, we cache the voting result using the tilde operator (`~`).
We are going to use that vote axiom in a `KwikSortReranker` later.
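
The voting idea can be sketched in plain Python. This is not how `ir_axioms` implements `VoteAxiom`: the `majority_vote` and `fallback` helpers and the toy axioms are hypothetical, and we assume that `|` falls back to the second axiom whenever the first one is undecided.

```python
from typing import Callable, Sequence

# A hypothetical stand-in for an axiom: maps a document pair to
# 1 (prefer first), -1 (prefer second), or 0 (no preference).
Preference = Callable[[str, str], int]


def majority_vote(axioms: Sequence[Preference], minimum_votes: float) -> Preference:
    """Keep a preference only if enough axioms (as a fraction) agree on it."""

    def vote(doc1: str, doc2: str) -> int:
        prefs = [axiom(doc1, doc2) for axiom in axioms]
        needed = minimum_votes * len(prefs)
        if prefs.count(1) >= needed and prefs.count(1) > prefs.count(-1):
            return 1
        if prefs.count(-1) >= needed and prefs.count(-1) > prefs.count(1):
            return -1
        return 0

    return vote


def fallback(first: Preference, second: Preference) -> Preference:
    """Use the second axiom when the first is undecided (assumed `|` semantics)."""
    return lambda d1, d2: first(d1, d2) or second(d1, d2)


# Toy axioms, made up for this example.
shorter = lambda d1, d2: (len(d1) < len(d2)) - (len(d1) > len(d2))
alphabetical = lambda d1, d2: (d1 < d2) - (d1 > d2)
undecided = lambda d1, d2: 0
keep_order = lambda d1, d2: 1  # Stand-in for the original ranking.

combined = fallback(majority_vote([shorter, alphabetical, undecided], 0.5), keep_order)
agreed = combined("bb", "a")   # Two of three axioms prefer "a".
fell_back = combined("a", "b")  # Only one axiom decides, so the fallback applies.
```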


In [None]:
from ir_axioms.axiom import VoteAxiom

majority_vote_axiom = (
    ~VoteAxiom(
        [
            ArgUC(),
            QTArg(),
            QTPArg(),
            aSL(),
            LNC1(),
            TF_LNC(),
            LB1(),
            PROX1(),
            PROX2(),
            PROX3(),
            PROX4(),
            PROX5(),
            REG(),
            REG_f(),
            ANTI_REG(),
            ANTI_REG_f(),
            ASPECT_REG(),
            ASPECT_REG_f(),
            AND(),
            LEN_AND(),
            M_AND(),
            LEN_M_AND(),
            DIV(),
            LEN_DIV(),
            RS_TF(),
            RS_TF_IDF(),
            RS_BM25(),
            RS_PL2(),
            RS_QL(),
            TFC1(),
            TFC3(),
            M_TDC(),
            LEN_M_TDC(),
            STMC1(),
            STMC1_f(),
            STMC2(),
            STMC2_f(),
        ],
        minimum_votes=0.5,
    )
    | ORIG()
)

Then, for estimating the ORACLE axiom and for generating axiomatic features for learning to rank with LambdaMART, we define a list of all axioms that we want to use in our experiments.
Again, we implement caching for the axioms (using `~`).
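
How a `~` operator can add caching is easiest to see in a small sketch. The `Axiom` and `CachedAxiom` classes below are hypothetical and only mimic the idea with an in-memory dictionary; they are not the classes or the cache storage used by `ir_axioms`.

```python
from typing import Callable


class Axiom:
    """Hypothetical minimal axiom wrapping a pairwise preference function."""

    def __init__(self, preference: Callable[[str, str], int]):
        self._preference = preference

    def preference(self, doc1: str, doc2: str) -> int:
        return self._preference(doc1, doc2)

    def __invert__(self) -> "CachedAxiom":
        # `~axiom` returns a caching wrapper around this axiom.
        return CachedAxiom(self)


class CachedAxiom(Axiom):
    """Memoize preferences so each document pair is computed only once."""

    def __init__(self, inner: Axiom):
        self._inner = inner
        self._cache: dict[tuple[str, str], int] = {}

    def preference(self, doc1: str, doc2: str) -> int:
        key = (doc1, doc2)
        if key not in self._cache:
            self._cache[key] = self._inner.preference(doc1, doc2)
        return self._cache[key]


calls = []


def expensive(doc1: str, doc2: str) -> int:
    calls.append((doc1, doc2))  # Record how often the real work happens.
    return (len(doc1) > len(doc2)) - (len(doc1) < len(doc2))


cached = ~Axiom(expensive)
first = cached.preference("long document", "short")
second = cached.preference("long document", "short")  # Served from the cache.
```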


In [None]:
all_axioms = [
    ~ArgUC(),
    ~QTArg(),
    ~QTPArg(),
    ~aSL(),
    ~LNC1(),
    ~TF_LNC(),
    ~LB1(),
    ~PROX1(),
    ~PROX2(),
    ~PROX3(),
    ~PROX4(),
    ~PROX5(),
    ~REG(),
    ~REG_f(),
    ~ANTI_REG(),
    ~ANTI_REG_f(),
    ~ASPECT_REG(),
    ~ASPECT_REG_f(),
    ~AND(),
    ~LEN_AND(),
    ~M_AND(),
    ~LEN_M_AND(),
    ~DIV(),
    ~LEN_DIV(),
    ~RS_TF(),
    ~RS_TF_IDF(),
    ~RS_BM25(),
    ~RS_PL2(),
    ~RS_QL(),
    ~TFC1(),
    ~TFC3(),
    ~M_TDC(),
    ~LEN_M_TDC(),
    ~STMC1(),
    ~STMC1_f(),
    ~STMC2(),
    ~STMC2_f(),
    ORIG(),
]

## Re-ranking Approaches

We will now compare the three different axiomatic re-ranking approaches.
Please refer to the other notebooks in this repository for more detailed explanations of each of the approaches.


### KwikSort Re-ranking

For the first re-ranker, we apply the KwikSort algorithm with our previously defined vote axiom to re-rank the top-20 baseline results.
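
As an illustration of the idea behind KwikSort with a middle pivot, here is a minimal pure-Python sketch. It is not the `ir_axioms` implementation: the `kwiksort` function, the `prefer` callback, the tie handling, and the toy scores are assumptions made for this example.

```python
from typing import Callable


def kwiksort(docs: list[str], prefer: Callable[[str, str], int]) -> list[str]:
    """Order documents by recursively partitioning around a middle pivot."""
    if len(docs) <= 1:
        return list(docs)
    pivot_index = len(docs) // 2
    pivot = docs[pivot_index]
    before, after = [], []
    for index, doc in enumerate(docs):
        if index == pivot_index:
            continue
        pref = prefer(doc, pivot)
        # Preferred documents go before the pivot; on a tie, a document
        # stays on the side of the pivot where it was in the input.
        if pref > 0 or (pref == 0 and index < pivot_index):
            before.append(doc)
        else:
            after.append(doc)
    return kwiksort(before, prefer) + [pivot] + kwiksort(after, prefer)


# Hypothetical scores that a (combined) axiom might implicitly express.
scores = {"d1": 1, "d2": 3, "d3": 2, "d4": 3}


def prefer(doc1: str, doc2: str) -> int:
    return (scores[doc1] > scores[doc2]) - (scores[doc1] < scores[doc2])


baseline = ["d1", "d2", "d3", "d4"]
# Only the top-20 results are re-ranked, similar to `bm25 % 20` in the pipeline.
reranked = kwiksort(baseline[:20], prefer)
```

In this toy run, `d2` and `d4` tie and therefore keep their baseline order.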


In [None]:
from ir_axioms.modules.pivot import MiddlePivotSelection
from ir_axioms.backend.pyterrier.transformers import KwikSortReranker

kwiksort = bm25 % 20 >> KwikSortReranker(
    axiom=majority_vote_axiom,
    index=index_dir,
    dataset=dataset_name,
    pivot_selection=MiddlePivotSelection(),
    cache_dir=cache_dir,
    verbose=True,
)

### KwikSort Re-ranking with an Estimated ORACLE Axiom

The second re-ranker estimates the ORACLE axiom by training a random forest classifier on the preferences of all reference axioms.
The resulting output preferences are used with KwikSort to re-rank the top-20 baseline results.
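
The training data for such an estimator can be pictured as follows. This is a sketch under assumptions, not the `ir_axioms` code: we assume that the ORACLE preference favors the document with the higher relevance label, and the helper names, toy axioms, and toy qrels are hypothetical.

```python
from itertools import permutations
from typing import Callable, Sequence

Preference = Callable[[str, str], int]


def oracle_preference(qrels: dict[str, int], doc1: str, doc2: str) -> int:
    """Prefer the document with the higher relevance label (assumed ORACLE)."""
    rel1, rel2 = qrels.get(doc1, 0), qrels.get(doc2, 0)
    return (rel1 > rel2) - (rel1 < rel2)


def training_pairs(
    docs: Sequence[str], axioms: Sequence[Preference], qrels: dict[str, int]
) -> tuple[list[list[int]], list[int]]:
    """Return one feature vector and one target per ordered document pair."""
    features, targets = [], []
    for doc1, doc2 in permutations(docs, 2):
        # Features: what each reference axiom says about the pair.
        features.append([axiom(doc1, doc2) for axiom in axioms])
        # Target: what the relevance judgments say about the pair.
        targets.append(oracle_preference(qrels, doc1, doc2))
    return features, targets


# Toy axioms and judgments, made up for this example.
longer = lambda d1, d2: (len(d1) > len(d2)) - (len(d1) < len(d2))
alphabetical = lambda d1, d2: (d1 < d2) - (d1 > d2)
toy_qrels = {"a": 0, "bbb": 2, "cc": 1}

features, targets = training_pairs(["a", "bbb", "cc"], [longer, alphabetical], toy_qrels)
```

Lists of this shape can be passed to the `fit(features, targets)` method of a scikit-learn classifier.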


In [None]:
from sklearn.ensemble import RandomForestClassifier
from ir_axioms.modules.pivot import MiddlePivotSelection
from ir_axioms.backend.pyterrier.estimator import EstimatorKwikSortReranker

random_forest = RandomForestClassifier(
    max_depth=3,
)
kwiksort_random_forest = bm25 % 20 >> EstimatorKwikSortReranker(
    axioms=all_axioms,
    estimator=random_forest,
    index=index_dir,
    dataset=dataset_name,
    pivot_selection=MiddlePivotSelection(),
    cache_dir=cache_dir,
    verbose=True,
)

We fit the estimator using preferences from the training dataset.


In [None]:
kwiksort_random_forest.fit(dataset_train.get_topics(), dataset_train.get_qrels())

### Aggregating Axiomatic Features for LTR with LambdaMART

For the third re-ranker, we aggregate axiomatic preferences into three features per axiom.
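
To see what three features per axiom look like, the sketch below applies the aggregation functions used in this notebook to a made-up preference list. We assume here that the list holds one axiom's preferences for a single document compared with the other documents in the ranking; the values are hypothetical.

```python
# The same three aggregations as in the notebook cell.
aggregations = [
    lambda prefs: sum(p >= 0 for p in prefs) / len(prefs),
    lambda prefs: sum(p == 0 for p in prefs) / len(prefs),
    lambda prefs: sum(p <= 0 for p in prefs) / len(prefs),
]

# Hypothetical preferences of one axiom for one document against four others.
prefs = [1, 0, -1, 1]

# One feature value per aggregation function.
axiom_features = [aggregate(prefs) for aggregate in aggregations]
```

With 38 axioms in `all_axioms`, this scheme would yield 3 × 38 = 114 feature values per document.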


In [None]:
from ir_axioms.backend.pyterrier.transformers import AggregatedAxiomaticPreferences

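aggregations = [
    # Share of preferences that are non-negative.
    lambda prefs: sum(p >= 0 for p in prefs) / len(prefs),
    # Share of preferences that are neutral.
    lambda prefs: sum(p == 0 for p in prefs) / len(prefs),
    # Share of preferences that are non-positive.
    lambda prefs: sum(p <= 0 for p in prefs) / len(prefs),
]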
features = ~(
    bm25 % 20
    >> AggregatedAxiomaticPreferences(
        axioms=all_axioms,
        index=index_dir,
        aggregations=aggregations,
        dataset=dataset_name,
        cache_dir=cache_dir,
        verbose=True,
    )
)

After aggregating the preference features, we initialize a LambdaMART ranker for optimizing nDCG@10 and apply it to the top-20 baseline results.


In [None]:
from lightgbm import LGBMRanker
from pyterrier.ltr import apply_learned_model

lambda_mart = LGBMRanker(
    num_iterations=1000,
    metric="ndcg",
    eval_at=[10],
    importance_type="gain",
)
ltr = features >> apply_learned_model(lambda_mart, form="ltr")

We fit this re-ranker on the training dataset as well, holding out the last 5 topics for validation.


In [None]:
ltr.fit(
    dataset_train.get_topics()[:-5],
    dataset_train.get_qrels(),
    dataset_train.get_topics()[-5:],
    dataset_train.get_qrels(),
)

## Experimental Evaluation

Because our axiomatic re-rankers are PyTerrier modules, we can now use PyTerrier's `Experiment` interface to evaluate various metrics and to compare our new approaches to the BM25 baseline ranking.
Refer to the PyTerrier [documentation](https://pyterrier.readthedocs.io/en/latest/experiments.html) to learn more about running experiments.
(Because the re-rankers only return the top-20 results, we use the `^` operator to append the baseline results for the ranking positions after the top-20.)
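
The effect of this concatenation on the document order can be sketched in a few lines. The `concatenate` helper and the toy rankings are hypothetical; PyTerrier's operator works on result data frames and also has to handle scores and ranks, which this sketch ignores.

```python
def concatenate(reranked: list[str], baseline: list[str]) -> list[str]:
    """Append baseline documents that are missing from the re-ranked list."""
    seen = set(reranked)
    return reranked + [doc for doc in baseline if doc not in seen]


# Toy example: the re-ranker only reorders the top-3 of five baseline results.
baseline_ranking = ["d1", "d2", "d3", "d4", "d5"]
reranked_top = ["d3", "d1", "d2"]

full_ranking = concatenate(reranked_top, baseline_ranking)
```

This keeps the evaluated rankings the same length as the baseline ranking, so that metrics like average precision are computed over comparable result lists.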


In [None]:
from pyterrier.pipelines import Experiment
from ir_measures import nDCG, MAP, RR

results_dir.mkdir(exist_ok=True)
experiment = Experiment(
    [bm25, kwiksort ^ bm25, kwiksort_random_forest ^ bm25, ltr ^ bm25],
    dataset_test.get_topics(),
    dataset_test.get_qrels(),
    [nDCG @ 10, RR, MAP],
    ["BM25", "KwikSort", "KwikSort Random Forest", "Axiomatic LTR"],
    verbose=True,
)
experiment.sort_values(by="nDCG@10", ascending=False, inplace=True)

In [None]:
experiment