# Introduction to PyTerrier

_DSAIT4050: Information retrieval lecture, TU Delft_

**Part 7: Neural ranking models**

In this part, we'll learn how to use **neural ranking models** with PyTerrier. Since PyTerrier itself does not implement any neural rankers, we'll use the [**Fast-Forward indexes** library](https://github.com/mrjleo/fast-forward-indexes), which focuses on efficient re-ranking using neural models and provides PyTerrier integration (using transformers). In order to learn more about Fast-Forward indexes, have a look at the [documentation](https://mrjleo.github.io/fast-forward-indexes/docs/v0.7.0/) and the [corresponding paper](https://dl.acm.org/doi/abs/10.1145/3485447.3511955).


In [None]:
%pip install python-terrier==0.12.1 fast-forward-indexes==0.7.0

**Important**: To execute this notebook in reasonable time, you'll need a CUDA-capable GPU. If you have one, follow the [official tutorials](https://pytorch.org/get-started/locally/) and install PyTorch with CUDA acceleration. If you do not have one, you can use Google Colab: Under "Edit" -> "Notebook settings" -> "Hardware accelerator", select a GPU.

If the installation was successful, the following should return `True`:


In [None]:
import torch

torch.cuda.is_available()

_Side note_: In this notebook, we focus on the **retrieve-and-re-rank** setting. PyTerrier supports **dense retrieval** models through [plugins](https://pyterrier.readthedocs.io/en/latest/ext/pyterrier-dr/index.html). Another library that provides many pre-trained models and dense retrieval indexes is [pyserini](https://github.com/castorini/pyserini).


In [14]:
import pyterrier as pt

In this notebook, we'll use the [FiQA dataset](https://sites.google.com/view/fiqa/home), which is a dataset for financial question answering. Since QA pipelines usually include a retrieval/ranking step, FiQA provides a corpus, queries (questions), and corresponding QRels:


In [15]:
dataset = pt.get_dataset("irds:beir/fiqa")

Due to the domain of this dataset (finance), the questions and documents are relatively complex. We'll see how much a large neural re-ranking model (based on BERT) manages to improve over term matching (BM25).

Let's create a lexical index first:


In [None]:
from pathlib import Path

indexer = pt.IterDictIndexer(
    str(Path.cwd()),  # this will be ignored
    type=pt.index.IndexingType.MEMORY,
)
index_ref = indexer.index(dataset.get_corpus_iter(), fields=["text"])

As our baseline, we measure BM25 performance without any re-ranking:


In [None]:
from pyterrier.measures import RR, nDCG, MAP

bm25 = pt.terrier.Retriever(index_ref, wmodel="BM25")
testset = pt.get_dataset("irds:beir/fiqa/test")
pt.Experiment(
    [bm25],
    testset.get_topics(),
    testset.get_qrels(),
    eval_metrics=[RR @ 10, nDCG @ 10, MAP @ 100],
)

## Fast-Forward indexes

Fast-Forward indexes use _dual-encoder models_ (the same kind used in dense retrieval) for _interpolation-based re-ranking_. The benefit over cross-encoders is that document representations need to be computed only once (during indexing) and can simply be looked up during re-ranking.
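The look-up idea can be sketched in plain Python. This is only a toy illustration: `toy_encode` is a hypothetical stand-in for a neural encoder (it merely counts a few hand-picked words), and the `doc_vectors` dictionary plays the role of the index. Neither is part of the Fast-Forward indexes API.

```python
# Toy illustration of dual-encoder scoring with precomputed document vectors.
# `toy_encode` is a hypothetical stand-in for a neural encoder such as TAS-B.

VOCAB = ["stock", "bond", "tax", "loan"]


def toy_encode(text):
    """Map a text to a vector of word counts over a tiny fixed vocabulary."""
    tokens = text.lower().split()
    return [float(tokens.count(term)) for term in VOCAB]


def dot(u, v):
    """Dot product between two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


docs = {
    "d1": "stock dividends and stock splits",
    "d2": "tax on a student loan",
    "d3": "bond yields",
}

# Indexing step: every document is encoded exactly once.
doc_vectors = {doc_id: toy_encode(text) for doc_id, text in docs.items()}


def score(query, doc_ids):
    """Encode the query once, then look up the stored document vectors."""
    q_vec = toy_encode(query)
    return {doc_id: dot(q_vec, doc_vectors[doc_id]) for doc_id in doc_ids}


scores = score("how is stock tax calculated", ["d1", "d2", "d3"])
print(scores)
```

At query time, only the query passes through the encoder; the document side is a dictionary look-up. A cross-encoder, by contrast, would have to process every query-document pair jointly.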

### The encoders

We'll start by instantiating the encoders. [TAS-B](https://github.com/sebastian-hofstaetter/tas-balanced-dense-retrieval) is a single-vector dual-encoder model based on BERT, where the query and document encoders are identical (Siamese architecture). A pre-trained model (trained on MS MARCO) is [available on the Hugging Face hub](https://huggingface.co/sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco). We'll use this model in a transfer setting (i.e., without fine-tuning) on the FiQA dataset.

The encoders can be loaded as follows:


In [18]:
from fast_forward.encoder import TASBEncoder
import torch

q_encoder = d_encoder = TASBEncoder(
    device="cuda:0" if torch.cuda.is_available() else "cpu"
)

Now we can use these encoders to compute vector representations of text pieces, such as queries:


In [None]:
q_encoder(["query 1", "query 2"])

### The index

We've already created an index for the BM25 retriever earlier. For the dense vector representations, we'll need to create a separate index:


In [20]:
from fast_forward.index import OnDiskIndex, Mode

ff_index_path = Path.cwd() / "07_data" / "ffindex_fiqa_tasb.h5"
ff_index_path.parent.mkdir(exist_ok=True, parents=True)
ff_index = OnDiskIndex(
    ff_index_path,
    query_encoder=q_encoder,
    mode=Mode.MAXP,
)

`Mode.MAXP` determines how documents with multiple vectors are scored. Since each of our documents has only a single representation, it has no effect here; we set it only to tell the index that we're working with documents rather than passages (more on that [here](https://mrjleo.github.io/fast-forward-indexes/docs/v0.7.0/fast_forward/index.html#ranking-mode)).
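As a rough illustration of maxP scoring, the sketch below scores a document by the maximum of its passage scores. The vectors and the helper `maxp_score` are made up for this example and do not mirror the library's internals.

```python
def dot(u, v):
    """Dot product between two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def maxp_score(query_vec, passage_vecs):
    """Score a document by its best-matching passage (maxP)."""
    return max(dot(query_vec, p) for p in passage_vecs)


query_vec = [1.0, 0.5]

# A long document split into three passages, each with its own vector.
long_doc = [[0.25, 0.5], [1.0, 0.5], [0.0, 1.0]]
# A document with a single vector: the maximum is just that one score.
short_doc = [[0.5, 0.5]]

long_score = maxp_score(query_vec, long_doc)
short_score = maxp_score(query_vec, short_doc)
print(long_score, short_score)
```

With a single vector per document, taking the maximum changes nothing, which is why the mode has no effect on the scores in this notebook.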

Now we can index the corpus using our document encoder. This is done using the `Indexer` utility class. We'll use an iterator that simply yields the documents in the correct format.

**The next cell will take a while to execute, even with GPU acceleration.** You can adjust the batch size according to your available VRAM.


In [None]:
from fast_forward.util import Indexer


def docs_iter():
    for d in dataset.get_corpus_iter():
        yield {"doc_id": d["docno"], "text": d["text"]}


Indexer(ff_index, d_encoder, batch_size=8).from_dicts(docs_iter())

Once indexing is done, we can always load the index instead of indexing everything again:


In [None]:
from fast_forward.index import OnDiskIndex, Mode

ff_index = OnDiskIndex.load(
    Path.cwd() / "07_data" / "ffindex_fiqa_tasb.h5",
    query_encoder=q_encoder,
    mode=Mode.MAXP,
)

At this point, if you have enough RAM, you can load the entire index (i.e., all vector representations) into the main memory:


In [24]:
ff_index = ff_index.to_memory()

## Re-ranking BM25 results

We've already performed re-ranking in the learning-to-rank notebook. In order to use a Fast-Forward index for re-ranking, we wrap it in an `FFScore` transformer:


In [29]:
from fast_forward.util.pyterrier import FFScore

ff_score = FFScore(ff_index)

In order to see how this works, we take the queries from the testset and retrieve a small number of candidate documents for each one using BM25:


In [None]:
candidates = (bm25 % 5)(testset.get_topics())
candidates

As usual, this data frame contains a score for each query-document pair. We can apply our new re-ranking transformer to the candidates:


In [None]:
re_ranked = ff_score(candidates)
re_ranked

As you can see, the `score` column has now been updated to reflect the re-ranking scores. Furthermore, there is a new column, `score_0`, which contains the original retrieval scores. As mentioned earlier, Fast-Forward indexes focus on _interpolation-based re-ranking_. In essence, the idea is to take both lexical retrieval scores $s_{\text{lex}}$ and semantic re-ranking scores $s_{\text{sem}}$ into account, such that the final score $s$ is computed as follows:

$$s = \alpha s_{\text{lex}} + (1-\alpha) s_{\text{sem}}$$

We can perform the interpolation using the `FFInterpolate` transformer:


In [None]:
from fast_forward.util.pyterrier import FFInterpolate

ff_int = FFInterpolate(alpha=0.5)
ff_int(re_ranked)

In the data frame above, both scores have been fused into one with equal weights.
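The interpolation itself is simple arithmetic. The sketch below applies the formula to a few made-up (lexical, semantic) score pairs for a single query; the document IDs and scores are purely illustrative and not taken from FiQA.

```python
def interpolate(s_lex, s_sem, alpha):
    """Compute alpha * s_lex + (1 - alpha) * s_sem."""
    return alpha * s_lex + (1 - alpha) * s_sem


# Hypothetical candidates for one query: doc_id -> (BM25 score, dense score).
candidates = {
    "d1": (12.0, 90.0),
    "d2": (10.0, 100.0),
    "d3": (14.0, 80.0),
}

alpha = 0.5
fused = {
    doc_id: interpolate(s_lex, s_sem, alpha)
    for doc_id, (s_lex, s_sem) in candidates.items()
}

# Rank documents by the fused score, best first.
ranking = sorted(fused, key=fused.get, reverse=True)
print(fused, ranking)
```

In this toy example the lexical and semantic scores live on very different scales, so the semantic scores dominate the fused ranking at $\alpha=0.5$. This is one reason why the weight is worth tuning.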

Now we're ready to take everything for a spin. Let's compare our re-ranker to BM25:


In [None]:
pt.Experiment(
    [bm25, bm25 % 1000 >> ff_score >> ff_int],
    testset.get_topics(),
    testset.get_qrels(),
    eval_metrics=[RR @ 10, nDCG @ 10, MAP @ 100],
    names=["BM25", "BM25 >> FF"],
)

As you can see, the performance already improved quite nicely. But thinking back, we chose $\alpha=0.5$ pretty much arbitrarily. How do we know this is really the best value?

### Validation

PyTerrier offers several functions to determine the best hyperparameters for a ranker. In the following, we'll use [`pyterrier.GridSearch`](https://pyterrier.readthedocs.io/en/latest/tuning.html#pyterrier.GridSearch) to find the best value for $\alpha$.

**Important**: When you tune hyperparameters of your model, **do not use the same data you use for testing (i.e., the testset)**. Otherwise, your results are invalid, because you optimized your method for the testing data. Instead, we'll use the development (validation) data:


In [34]:
devset = pt.get_dataset("irds:beir/fiqa/dev")

PyTerrier's `GridSearch` class can be used to automatically run an experiment multiple times in order to find the hyperparameters that result in the best performance.
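Conceptually, a grid search evaluates the pipeline once per candidate value and keeps the best one. Below is a minimal sketch of that loop in plain Python. It uses made-up validation scores and reciprocal rank as the metric (the notebook itself tunes for MAP), and none of the names are part of PyTerrier.

```python
def interpolate(s_lex, s_sem, alpha):
    """Compute alpha * s_lex + (1 - alpha) * s_sem."""
    return alpha * s_lex + (1 - alpha) * s_sem


def reciprocal_rank(ranking, relevant):
    """Return 1 / rank of the first relevant document, or 0 if none is found."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical validation data: query -> {doc_id: (lexical, semantic)}.
dev_scores = {
    "q1": {"d1": (10.0, 2.0), "d2": (4.0, 6.0)},
    "q2": {"d3": (8.0, 1.0), "d4": (6.0, 5.0)},
}
# Hypothetical relevance judgements: query -> set of relevant doc_ids.
dev_qrels = {"q1": {"d1"}, "q2": {"d4"}}


def mean_rr(alpha):
    """Mean reciprocal rank on the validation queries for a given alpha."""
    total = 0.0
    for qid, docs in dev_scores.items():
        fused = {d: interpolate(lex, sem, alpha) for d, (lex, sem) in docs.items()}
        ranking = sorted(fused, key=fused.get, reverse=True)
        total += reciprocal_rank(ranking, dev_qrels[qid])
    return total / len(dev_scores)


# Evaluate every candidate value and keep the one with the highest metric.
grid = [0.1, 0.3, 0.5, 0.7, 0.9]
results = {alpha: mean_rr(alpha) for alpha in grid}
best_alpha = max(results, key=results.get)
print(results, best_alpha)
```

The best value found on this toy data says nothing about FiQA; the point is only that the selection happens on validation queries, never on the test queries.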

We'll use a pipeline similar to the one before, but we limit the number of candidate documents to `100` in order to reduce the runtime. We provide a list of values for `alpha` and a metric (MAP), which is used to decide which value results in the best performance:


In [None]:
pt.GridSearch(
    bm25 % 100 >> ff_score >> ff_int,
    {ff_int: {"alpha": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}},
    devset.get_topics(),
    devset.get_qrels(),
    "map",
    verbose=True,
)

In this case, out of the options we provided, `alpha=0.3` was the best.

Conveniently, the best value has already been set for us in the transformer:


In [None]:
ff_int.alpha

Let's repeat our experiment on the testset with the optimal hyperparameter:


In [None]:
pt.Experiment(
    [bm25, bm25 % 1000 >> ff_score >> ff_int],
    testset.get_topics(),
    testset.get_qrels(),
    eval_metrics=[RR @ 10, nDCG @ 10, MAP @ 100],
    names=["BM25", "BM25 >> FF"],
)

As you can see, the value of `alpha` makes a big difference!

_Final remarks_: In this notebook, we've used neural models only for re-ranking. In practice, however, it is not uncommon to use the re-ranking scores we computed here as a feature for a learning-to-rank model.

## Further reading

Check out the [section on neural models](https://pyterrier.readthedocs.io/en/latest/neural.html) in the PyTerrier documentation.
