# Introduction to PyTerrier

_DSAIT4050: Information retrieval lecture, TU Delft_

**Part 7: Neural ranking models**

In this part, we'll learn how to use **neural ranking models** with PyTerrier. Since PyTerrier itself does not implement any neural rankers, we'll use the [**Fast-Forward indexes** library](https://github.com/mrjleo/fast-forward-indexes), which focuses on efficient re-ranking using neural models and provides PyTerrier integration (using transformers). In order to learn more about Fast-Forward indexes, have a look at the [documentation](https://mrjleo.github.io/fast-forward-indexes/docs/v0.7.0/) and the [corresponding paper](https://dl.acm.org/doi/abs/10.1145/3485447.3511955).


In [7]:
%pip install python-terrier==0.12.1 fast-forward-indexes==0.7.0

Note: you may need to restart the kernel to use updated packages.





**Important**: To execute this notebook in reasonable time, you'll need a CUDA-capable GPU. If you have one, follow the [official tutorials](https://pytorch.org/get-started/locally/) and install PyTorch with CUDA acceleration. If you do not have one, you can use Google Colab: Under "Edit" -> "Notebook settings" -> "Hardware accelerator", select a GPU.

If the installation was successful, the following should return `True`:


In [3]:
import torch
import os

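# Raw string so the backslashes are not treated as escape sequences.
# This path is machine-specific; adjust it to your local JDK installation.
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk-19"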

torch.cuda.is_available()

True

_Side note_: In this notebook, we focus on the **retrieve-and-re-rank** setting. PyTerrier supports **dense retrieval** models through [plugins](https://pyterrier.readthedocs.io/en/latest/ext/pyterrier-dr/index.html). Another library that provides many pre-trained models and dense retrieval indexes is [pyserini](https://github.com/castorini/pyserini).


In [4]:
import pyterrier as pt

In this notebook, we'll use the [FiQA dataset](https://sites.google.com/view/fiqa/home), which is a dataset for financial question answering. Since QA pipelines usually include a retrieval/ranking step, FiQA provides a corpus, queries (questions), and corresponding QRels:


In [5]:
dataset = pt.get_dataset("irds:beir/fiqa")

Due to the domain of this dataset (finance), the questions and documents are relatively complex. We'll see how much a large neural re-ranking model (based on BERT) manages to improve over term matching (BM25).

Let's create a lexical index first:


In [6]:
from pathlib import Path

indexer = pt.IterDictIndexer(
    str(Path.cwd()),  # this will be ignored
    type=pt.index.IndexingType.MEMORY,
)
index_ref = indexer.index(dataset.get_corpus_iter(), fields=["text"])

Java started (triggered by TerrierIndexer.__init__) and loaded: pyterrier.java, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]
beir/fiqa documents: 100%|██████████| 57638/57638 [00:15<00:00, 3650.87it/s]


As our baseline, we measure BM25 performance without any re-ranking:


In [7]:
from pyterrier.measures import RR, nDCG, MAP

bm25 = pt.terrier.Retriever(index_ref, wmodel="BM25")
testset = pt.get_dataset("irds:beir/fiqa/test")
pt.Experiment(
    [bm25],
    testset.get_topics(),
    testset.get_qrels(),
    eval_metrics=[RR @ 10, nDCG @ 10, MAP @ 100],
)





|  | name | RR@10 | nDCG@10 | AP@100 |
|---|---|---|---|---|
| 0 | TerrierRetr(BM25) | 0.310271 | 0.252589 | 0.20864 |
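To make one of the reported metrics concrete, here is a minimal sketch of how RR@k could be computed by hand. The function name, the rankings, and the relevance judgments are made up for illustration; PyTerrier computes its metrics through its own measures, not through this code.

```python
def reciprocal_rank_at_k(ranking: list[str], relevant: set[str], k: int) -> float:
    """Return 1/rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranking[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical rankings (best first) and relevance judgments for two queries.
rankings = {"q1": ["d3", "d7", "d1"], "q2": ["d5", "d2", "d9"]}
qrels = {"q1": {"d7"}, "q2": {"d4"}}

per_query = {
    qid: reciprocal_rank_at_k(ranking, qrels[qid], k=10)
    for qid, ranking in rankings.items()
}
# The reported value is the mean over all queries.
mean_rr = sum(per_query.values()) / len(per_query)
print(per_query, mean_rr)
```

In this toy example, the first query finds its relevant document at rank 2 and the second query finds none, so the mean is the average of 0.5 and 0.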


## Fast-Forward indexes

Fast-Forward indexes use _dual-encoder models_ (the same kind of models used in dense retrieval) for _interpolation-based re-ranking_. The benefit of this (compared to cross-encoders) is that document representations only need to be computed once (during the indexing step) and can simply be looked up during re-ranking, so only the query has to be encoded at query time.
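The following toy sketch illustrates the look-up idea. It assumes dot-product scoring and uses a hypothetical `toy_encode` function (simple term counts) as a stand-in for a trained neural encoder; it is not how the Fast-Forward library is implemented internally.

```python
# Hypothetical stand-in for a trained encoder: maps text to a tiny vector
# of term counts. A real dual-encoder would be a transformer model.
def toy_encode(text: str) -> list[float]:
    vocab = ["fund", "gold", "tax"]
    words = text.lower().split()
    return [float(words.count(term)) for term in vocab]


def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


# Indexing step: every document is encoded exactly once and stored by ID.
corpus = {
    "d1": "emergency fund savings",
    "d2": "gold as a hedge",
    "d3": "tax on gold fund",
}
doc_vectors = {doc_id: toy_encode(text) for doc_id, text in corpus.items()}


# Re-ranking step: only the query is encoded; document vectors are looked up.
def semantic_scores(query: str, candidate_ids: list[str]) -> dict[str, float]:
    q_vec = toy_encode(query)
    return {doc_id: dot(q_vec, doc_vectors[doc_id]) for doc_id in candidate_ids}


scores = semantic_scores("invest in gold", ["d1", "d2", "d3"])
print(scores)
```

A cross-encoder, by contrast, would have to process every query-document pair jointly at query time, so nothing could be precomputed.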

### The encoders

We'll start by instantiating the encoders. [TAS-B](https://github.com/sebastian-hofstaetter/tas-balanced-dense-retrieval) is a single-vector dual-encoder model based on BERT, where the query and document encoders are identical (Siamese architecture). A pre-trained model (trained on MS MARCO) is [available on the Hugging Face hub](https://huggingface.co/sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco). We'll use this model in a transfer setting (i.e., without fine-tuning) on the FiQA dataset.

The encoders can be loaded as follows:


In [8]:
from fast_forward.encoder import TASBEncoder
import torch

q_encoder = d_encoder = TASBEncoder(
    device="cuda:0" if torch.cuda.is_available() else "cpu"
)


Now we can use these encoders to compute vector representations of text pieces, such as queries:


In [9]:
q_encoder(["query 1", "query 2"])

array([[-0.04583394, -0.41078392, -0.42934185, ..., -0.16365711,
        -0.17415644,  0.41219893],
       [-0.17898682, -0.43222326, -0.42240688, ..., -0.07488572,
         0.01203857,  0.10395049]], dtype=float32)

### The index

We've already created an index for the BM25 retriever earlier. For the dense vector representations, we'll need to create a separate index:


In [11]:
from fast_forward.index import OnDiskIndex, Mode

ff_index_path = Path.cwd() / "07_data" / "ffindex_fiqa_tasb.h5"
ff_index_path.parent.mkdir(exist_ok=True, parents=True)
ff_index = OnDiskIndex(
    ff_index_path,
    query_encoder=q_encoder,
    mode=Mode.MAXP,
)

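_Note_: If the index file already exists (for example, because this cell was run before), `OnDiskIndex` raises a `ValueError` stating that the file exists. In that case, skip the indexing step and load the existing index with `OnDiskIndex.load`, as shown further below.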

`Mode.MAXP` determines how documents that have multiple vectors are scored. Since each of our documents has only a single representation, the choice has no effect on the scores here; we set it only to tell the index that we're working with documents and not passages (more on that [here](https://mrjleo.github.io/fast-forward-indexes/docs/v0.7.0/fast_forward/index.html#ranking-mode)).
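As a rough illustration of the maxP idea (a document is scored by its best-matching passage), consider the sketch below. The passage scores are made up, and the code only mirrors the concept, not the library's internals.

```python
# Hypothetical passage-level scores for one query.
passage_scores = {
    "docA": [71.2, 88.5, 64.0],  # a document split into three passages
    "docB": [80.1],              # a single vector per document, as in this notebook
}

# maxP: the document score is the maximum over its passage scores.
doc_scores = {doc_id: max(scores) for doc_id, scores in passage_scores.items()}
print(doc_scores)
```

For a document with a single vector, the maximum is just that vector's score, which is why the mode makes no difference for our corpus.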

Now we can index the corpus using our document encoder. This is done using the `Indexer` utility class. We'll use an iterator that simply yields the documents in the correct format.

**The next cell will take a while to execute, even with GPU acceleration.** You can adjust the batch size according to your available VRAM.


In [12]:
from fast_forward.util import Indexer


def docs_iter():
    for d in dataset.get_corpus_iter():
        yield {"doc_id": d["docno"], "text": d["text"]}


Indexer(ff_index, d_encoder, batch_size=8).from_dicts(docs_iter())


Once indexing is done, we can always load the index instead of indexing everything again:


In [13]:
from fast_forward.index import OnDiskIndex, Mode

ff_index = OnDiskIndex.load(
    Path.cwd() / "07_data" / "ffindex_fiqa_tasb.h5",
    query_encoder=q_encoder,
    mode=Mode.MAXP,
)

100%|██████████| 57638/57638 [00:00<00:00, 2089809.86it/s]


At this point, if you have enough RAM, you can load the entire index (i.e., all vector representations) into the main memory:


In [14]:
ff_index = ff_index.to_memory()

## Re-ranking BM25 results

We've already performed re-ranking in the learning-to-rank notebook. In order to use a Fast-Forward index for re-ranking, we wrap it in an `FFScore` transformer:


In [15]:
from fast_forward.util.pyterrier import FFScore

ff_score = FFScore(ff_index)

In order to see how this works, we take the queries from the testset and retrieve a small number of candidate documents for each one using BM25:


In [16]:
candidates = (bm25 % 5)(testset.get_topics())
candidates

|  | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 4641 | 36224 | 376148 | 0 | 41.677305 | where should i park my rainy day emergency fund |
| 1 | 4641 | 47916 | 497993 | 1 | 29.149791 | where should i park my rainy day emergency fund |
| 2 | 4641 | 55690 | 580025 | 2 | 26.773005 | where should i park my rainy day emergency fund |
| 3 | 4641 | 24501 | 253614 | 3 | 26.640181 | where should i park my rainy day emergency fund |
| 4 | 4641 | 3157 | 32833 | 4 | 24.265187 | where should i park my rainy day emergency fund |
| ... | ... | ... | ... | ... | ... | ... |
| 3235 | 2399 | 33136 | 343489 | 0 | 33.280064 | where do web sites get foreign exchange curren... |
| 3236 | 2399 | 6704 | 69171 | 1 | 31.335596 | where do web sites get foreign exchange curren... |
| 3237 | 2399 | 4148 | 43046 | 2 | 31.265964 | where do web sites get foreign exchange curren... |
| 3238 | 2399 | 46670 | 484891 | 3 | 29.869008 | where do web sites get foreign exchange curren... |


As usual, this data frame contains a score for each query-document pair. We can apply our new re-ranking transformer to the candidates:


In [17]:
re_ranked = ff_score(candidates)
re_ranked

|  | qid | docno | score | query | score_0 | rank |
|---|---|---|---|---|---|---|
| 0 | 9979 | 35369 | 104.525070 | what is the best way to invest in gold as a he... | 23.946875 | 0 |
| 1 | 9979 | 96351 | 102.627594 | what is the best way to invest in gold as a he... | 22.793557 | 1 |
| 2 | 9979 | 327271 | 102.549774 | what is the best way to invest in gold as a he... | 22.893632 | 2 |
| 3 | 9979 | 483734 | 101.910439 | what is the best way to invest in gold as a he... | 27.970351 | 3 |
| 4 | 9979 | 30584 | 98.504639 | what is the best way to invest in gold as a he... | 22.506184 | 4 |
| ... | ... | ... | ... | ... | ... | ... |
| 3235 | 10034 | 44955 | 96.867401 | tax implications of holding ewu or other such ... | 28.036300 | 0 |
| 3236 | 10034 | 181942 | 96.055405 | tax implications of holding ewu or other such ... | 24.774710 | 1 |
| 3237 | 10034 | 180146 | 95.081093 | tax implications of holding ewu or other such ... | 23.568159 | 2 |
| 3238 | 10034 | 197478 | 94.961800 | tax implications of holding ewu or other such ... | 25.703399 | 3 |


As you can see, the `score` column has now been updated to reflect the re-ranking scores. Furthermore, there is a new column, `score_0`, which contains the original retrieval scores. As mentioned earlier, Fast-Forward indexes focus on _interpolation-based re-ranking_. In essence, the idea is to take both lexical retrieval scores $s_{\text{lex}}$ and semantic re-ranking scores $s_{\text{sem}}$ into account, such that the final score $s$ is computed as follows:

$$s = \alpha s_{\text{lex}} + (1-\alpha) s_{\text{sem}}$$

We can perform the interpolation using the `FFInterpolate` transformer:


In [18]:
from fast_forward.util.pyterrier import FFInterpolate

ff_int = FFInterpolate(alpha=0.5)
ff_int(re_ranked)

|  | qid | docno | query | score | rank |
|---|---|---|---|---|---|
| 0 | 9979 | 35369 | what is the best way to invest in gold as a he... | 64.235972 | 1 |
| 1 | 9979 | 96351 | what is the best way to invest in gold as a he... | 62.710576 | 3 |
| 2 | 9979 | 327271 | what is the best way to invest in gold as a he... | 62.721703 | 2 |
| 3 | 9979 | 483734 | what is the best way to invest in gold as a he... | 64.940395 | 0 |
| 4 | 9979 | 30584 | what is the best way to invest in gold as a he... | 60.505411 | 4 |
| ... | ... | ... | ... | ... | ... |
| 3235 | 10034 | 44955 | tax implications of holding ewu or other such ... | 62.451851 | 0 |
| 3236 | 10034 | 181942 | tax implications of holding ewu or other such ... | 60.415057 | 1 |
| 3237 | 10034 | 180146 | tax implications of holding ewu or other such ... | 59.324626 | 3 |
| 3238 | 10034 | 197478 | tax implications of holding ewu or other such ... | 60.332599 | 2 |


In the data frame above, both scores have been fused into one with equal weights.
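As a sanity check, the interpolation can be reproduced by hand. The sketch below copies the lexical scores (`score_0`) and semantic scores (`score`) for query `9979` from the data frames above and applies the formula with $\alpha=0.5$; the helper function `interpolate` is only for illustration and is not part of the library.

```python
def interpolate(s_lex: float, s_sem: float, alpha: float) -> float:
    """s = alpha * s_lex + (1 - alpha) * s_sem"""
    return alpha * s_lex + (1 - alpha) * s_sem


# (docno, s_lex, s_sem) for query 9979, copied from the data frames above.
rows = [
    ("35369", 23.946875, 104.525070),
    ("96351", 22.793557, 102.627594),
    ("327271", 22.893632, 102.549774),
    ("483734", 27.970351, 101.910439),
    ("30584", 22.506184, 98.504639),
]

fused = {docno: interpolate(s_lex, s_sem, alpha=0.5) for docno, s_lex, s_sem in rows}
# Sort by fused score, best first, to obtain the new ranking.
ranking = sorted(fused, key=fused.get, reverse=True)

for docno in ranking:
    print(docno, round(fused[docno], 6))
```

Up to floating-point rounding, the fused scores match the `score` column produced by `FFInterpolate`. Document `483734` moves from rank 3 to rank 0 because its comparatively high BM25 score outweighs its slightly lower semantic score.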

Now we're ready to take everything for a spin. Let's compare our re-ranker to BM25:


In [19]:
pt.Experiment(
    [bm25, bm25 % 1000 >> ff_score >> ff_int],
    testset.get_topics(),
    testset.get_qrels(),
    eval_metrics=[RR @ 10, nDCG @ 10, MAP @ 100],
    names=["BM25", "BM25 >> FF"],
)

|  | name | RR@10 | nDCG@10 | AP@100 |
|---|---|---|---|---|
| 0 | BM25 | 0.310271 | 0.252589 | 0.20864 |
| 1 | BM25 >> FF | 0.370925 | 0.308329 | 0.25481 |


As you can see, re-ranking already improves the results noticeably (for example, nDCG@10 rises from 0.253 to 0.308). But thinking back, we chose $\alpha=0.5$ pretty much arbitrarily. How do we know this is really the best value?

### Validation

PyTerrier offers several functions to determine the best hyperparameters for a ranker. In the following, we'll use [`pyterrier.GridSearch`](https://pyterrier.readthedocs.io/en/latest/tuning.html#pyterrier.GridSearch) to find the best value for $\alpha$.

**Important**: When you tune hyperparameters of your model, **do not use the same data you use for testing (i.e., the testset)**. Otherwise, your results are invalid, because you optimized your method for the testing data. Instead, we'll use the development (validation) data:


In [20]:
devset = pt.get_dataset("irds:beir/fiqa/dev")

PyTerrier's `GridSearch` function can be used to automatically run an experiment multiple times in order to find the hyperparameters that result in the best performance.

We'll use a similar pipeline as before, but we limit the number of candidate documents to `100` in order to reduce the runtime. We provide a list of values for `alpha` and a metric (MAP), which is used to decide which value results in the best performance:


In [21]:
pt.GridSearch(
    bm25 % 100 >> ff_score >> ff_int,
    {ff_int: {"alpha": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}},
    devset.get_topics(),
    devset.get_qrels(),
    "map",
    verbose=True,
)

GridScan: 100%|██████████| 9/9 [01:07<00:00,  7.55s/it]

Best map is 0.286447
Best setting is ['<fast_forward.util.pyterrier.FFInterpolate object at 0x000002310B1082D0> alpha=0.3']





(TerrierRetr(BM25) >> RankCutoff(100) >> FFScore(2409670228240, 2407286455632) >> <fast_forward.util.pyterrier.FFInterpolate object at 0x000002310B1082D0>)

In this case, out of the options we provided, `alpha=0.3` was the best.

Conveniently, the best value has already been set for us in the transformer:


In [22]:
ff_int.alpha

0.3
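Conceptually, a grid search of this kind boils down to a simple loop: evaluate the pipeline on the validation data for each candidate value and keep the best one. The sketch below illustrates this with made-up scores and relevance judgments, and it uses reciprocal rank as a stand-in metric instead of MAP; it is not how `pt.GridSearch` is implemented.

```python
def interpolate(s_lex: float, s_sem: float, alpha: float) -> float:
    return alpha * s_lex + (1 - alpha) * s_sem


def reciprocal_rank(ranking: list[str], relevant: set[str]) -> float:
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical validation data: per query, doc_id -> (s_lex, s_sem).
dev_candidates = {
    "q1": {"d1": (30.0, 91.0), "d2": (20.0, 100.0), "d3": (25.0, 80.0)},
    "q2": {"d4": (40.0, 70.0), "d5": (10.0, 95.0), "d6": (35.0, 85.0)},
}
dev_qrels = {"q1": {"d2"}, "q2": {"d6"}}


def evaluate(alpha: float) -> float:
    """Mean reciprocal rank on the validation queries for a given alpha."""
    total = 0.0
    for qid, docs in dev_candidates.items():
        fused = {d: interpolate(lex, sem, alpha) for d, (lex, sem) in docs.items()}
        ranking = sorted(fused, key=fused.get, reverse=True)
        total += reciprocal_rank(ranking, dev_qrels[qid])
    return total / len(dev_candidates)


# Try every candidate value and keep the one with the best validation score.
alphas = [0.1, 0.3, 0.5, 0.7, 0.9]
results = {alpha: evaluate(alpha) for alpha in alphas}
best_alpha = max(results, key=results.get)
print(results, best_alpha)
```

Note that only validation data enters the loop; the test queries are reserved for the final evaluation.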

Let's repeat our experiment on the testset with the optimal hyperparameter:


In [24]:
pt.Experiment(
    [bm25, bm25 % 1000 >> ff_score >> ff_int],
    testset.get_topics(),
    testset.get_qrels(),
    eval_metrics=[RR @ 10, nDCG @ 10, MAP @ 100],
    names=["BM25", "BM25 >> FF"],
)

|  | name | RR@10 | nDCG@10 | AP@100 |
|---|---|---|---|---|
| 0 | BM25 | 0.310271 | 0.252589 | 0.20864 |
| 1 | BM25 >> FF | 0.386422 | 0.321607 | 0.26807 |


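As you can see, the value of `alpha` makes a measurable difference: compared to $\alpha=0.5$, RR@10 improves from 0.371 to 0.386 and nDCG@10 from 0.308 to 0.322.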

_Final remarks_: We've used neural models for re-ranking only in this notebook. However, in practice, it is not uncommon to use the re-ranking scores we computed here as a feature for a learning-to-rank model.

## Further reading

Check out the [section on neural models](https://pyterrier.readthedocs.io/en/latest/neural.html) in the PyTerrier documentation.
