# Introduction to PyTerrier

_IN4325: Information retrieval lecture, TU Delft_

**Part 6: Learning to rank**

In this part, we'll dive into learning-to-rank (LTR) models. Specifically, we'll cover how to use PyTerrier transformers to

- compute query-document features and
- train and evaluate LTR models.

In order to run everything in this notebook, you'll need [NLTK](https://www.nltk.org/), [scikit-learn](https://scikit-learn.org/), and [LightGBM](https://github.com/microsoft/LightGBM/tree/master/python-package) installed:


In [None]:
pip install python-terrier==0.10.0 nltk scikit-learn lightgbm

In [None]:
import pyterrier as pt

if not pt.started():
    pt.init(tqdm="notebook")

We'll use NLTK for tokenization later. This requires some data that we need to download first:


In [None]:
import nltk

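nltk.download("punkt")
# Recent NLTK releases load the tokenizer data from "punkt_tab" instead.
nltk.download("punkt_tab")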

As before, we'll use the `nfcorpus` dataset. In this notebook, we restrict ourselves to a subset of the queries (`nontopic`), purely to keep the computations fast.


In [None]:
dataset = pt.get_dataset("irds:nfcorpus")

As LTR models rely on query and document features, we'll include some metadata in our index, namely the titles, abstracts, and URLs.

Note that this seems to slow down retrieval quite a bit (even when we're not retrieving the metadata from the index), so this notebook might run slower on your machine than the previous ones. To mitigate this at least somewhat, we'll cache the results of most transformers using PyTerrier's `~` (cache) operator.


In [None]:
from pathlib import Path

index = pt.index.IterDictIndexer(
    str(Path.cwd()),  # this will be ignored
    meta={
        "docno": 16,
        "title": 256,
        "abstract": 65536,
        "url": 128,
    },
    type=pt.index.IndexingType.MEMORY,
).index(dataset.get_corpus_iter(), fields=["title", "abstract", "url"])

## LTR paradigm

The idea of learning-to-rank is to use a feature-based supervised machine learning model for ranking. PyTerrier supports end-to-end LTR pipelines, including first-stage retrieval, computation of features, training, and evaluation.

### First-stage retrieval

LTR models are commonly used in a two-stage process (_retrieve-and-re-rank_): A lightweight retrieval model is used for _candidate selection_ given a query, and the LTR model subsequently _re-ranks_ the candidates. This is because applying the LTR model directly on the whole corpus would be too expensive.
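
To make the cost argument concrete, here is a minimal sketch of retrieve-and-re-rank in plain Python. The corpus, the query, and both scoring functions are made up for illustration (they are not PyTerrier code); the counter only shows that the expensive scorer sees the candidates and nothing else.

```python
# Toy corpus; documents and query are invented for this example.
corpus = {
    "d1": "vitamin d and bone health",
    "d2": "statins and cholesterol",
    "d3": "vitamin c and the common cold",
    "d4": "bone density in older adults",
}

expensive_calls = 0  # counts how often the re-ranker is invoked


def cheap_score(query, doc):
    """First stage: number of query terms that occur in the document."""
    doc_terms = set(doc.split())
    return sum(term in doc_terms for term in query.split())


def expensive_score(query, doc):
    """Stand-in for a costly re-ranker: query term frequency normalized by length."""
    global expensive_calls
    expensive_calls += 1
    doc_terms = doc.split()
    overlap = sum(doc_terms.count(term) for term in set(query.split()))
    return overlap / len(doc_terms)


def retrieve_and_rerank(query, k):
    # Stage 1: score the whole corpus cheaply and keep the top-k candidates.
    candidates = sorted(
        corpus, key=lambda d: cheap_score(query, corpus[d]), reverse=True
    )[:k]
    # Stage 2: apply the expensive model to the candidates only.
    return sorted(
        candidates, key=lambda d: expensive_score(query, corpus[d]), reverse=True
    )


ranking = retrieve_and_rerank("vitamin bone", k=2)
print(ranking, expensive_calls)
```

With `k=2`, the expensive scorer runs twice, no matter how large the corpus is.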

We'll use good old BM25 for first-stage retrieval. In order to keep the runtime of this notebook down, we limit the number of documents to be re-ranked to `100`. We also include the metadata of the retrieved documents so we can use it to compute features later:


In [None]:
first_stage_bm25 = pt.BatchRetrieve(
    index,
    wmodel="BM25",
    num_results=100,
    metadata=["docno", "title", "abstract", "url"],
)

### Computing features

In order to compute features, we can use PyTerrier transformers. Specifically, the `**` operator (_feature union_) collects features computed by transformers in a designated column in the data frame.

To illustrate this, we can use other retrievers to compute features. Because they are placed after the first stage and combined with the feature union operator, PyTerrier uses these models for _scoring_ the candidates rather than retrieval. Here, we initialize two `pyterrier.BatchRetrieve` objects with the PL2 and DPH weighting models and include them in the pipeline:


In [None]:
pipeline_with_features = ~first_stage_bm25 >> (
    pt.BatchRetrieve(index, wmodel="PL2") ** pt.BatchRetrieve(index, wmodel="DPH")
)

What this pipeline does is the following: For each query,

1. retrieve the top-`100` documents using the first-stage retriever (BM25), and
2. compute the PL2 and DPH scores for each query-document pair (these are the features).

Let's run this pipeline on a single query from the test set:


In [None]:
test_queries = pt.get_dataset("irds:nfcorpus/test/nontopic").get_topics()
pipeline_with_features(test_queries[test_queries["qid"] == "PLAIN-102"])

Each row corresponds to one of the candidate documents for this query. The `score` column contains the first-stage retrieval score (BM25), by which the documents are ordered. Finally, the `features` column contains the list of features. In our case, the first feature is the PL2 score, and the second feature is the DPH score.
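
Conceptually, the feature union can be pictured as follows. This is only a sketch in plain Python, not how PyTerrier implements it: `feature_union`, `overlap`, and `length_ratio` are hypothetical names, and the two scorers merely stand in for PL2 and DPH.

```python
def feature_union(candidates, scorers):
    """Attach one feature per scorer to every candidate, keeping the original order."""
    rows = []
    for cand in candidates:
        features = [scorer(cand["query"], cand["text"]) for scorer in scorers]
        # The first-stage score stays untouched; features go into their own column.
        rows.append({**cand, "features": features})
    return rows


# Hypothetical stand-ins for the PL2 and DPH weighting models.
def overlap(query, text):
    return float(len(set(query.split()) & set(text.split())))


def length_ratio(query, text):
    return len(query.split()) / len(text.split())


candidates = [
    {"qid": "q1", "query": "vitamin bone", "docno": "d1",
     "text": "vitamin d and bone health", "score": 2.0},
    {"qid": "q1", "query": "vitamin bone", "docno": "d3",
     "text": "vitamin c and the common cold", "score": 1.0},
]

rows = feature_union(candidates, [overlap, length_ratio])
for row in rows:
    print(row["docno"], row["score"], row["features"])
```

The order of the scorers determines the order of the features, which matters later when we refer to features by their index.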

#### Custom features

It is also easy to compute our own features. This can be done with custom transformers.

Say, for example, we want to compute very simple similarity scores of the query to the title and abstract of each document, respectively. We can do this by implementing a function that takes as input a single row of the data frame (as above) and outputs a list of features (as a `numpy.ndarray`). In our case, we compute the Jaccard similarity of the query to the title and abstract:


In [None]:
import numpy as np


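def _jaccard_sim(row):
    # Jaccard similarity (size of the intersection divided by size of the union)
    # of the query tokens with the title tokens and the abstract tokens.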
    query_tokens = set(nltk.word_tokenize(row["query"].lower()))
    title_tokens = set(nltk.word_tokenize(row["title"].lower()))
    abstract_tokens = set(nltk.word_tokenize(row["abstract"].lower()))
    js_query_title = len(query_tokens & title_tokens) / len(query_tokens | title_tokens)
    js_query_abstract = len(query_tokens & abstract_tokens) / len(
        query_tokens | abstract_tokens
    )
    return np.array([js_query_title, js_query_abstract])

_Side note: This way of doing it is inefficient, because the same query is tokenized again for every candidate document. Can you think of a better way of implementing this?_

We can now include this function as a transformer in our pipeline by using [`pyterrier.apply.doc_features`](https://pyterrier.readthedocs.io/en/latest/apply.html#pyterrier.apply.doc_features).

You might have noticed that we're accessing the `title` and `abstract` columns in the data frame. This is possible because we added them as metadata during indexing and specified the metadata to be retrieved by `first_stage_bm25`. Alternatively, you can use the [`pyterrier.text.get_text`](https://pyterrier.readthedocs.io/en/latest/text.html#pyterrier.text.get_text) transformer to retrieve metadata from the index.

Our new pipeline looks like this:


In [None]:
pipeline_with_features = ~first_stage_bm25 >> (
    pt.apply.doc_features(_jaccard_sim)
    ** pt.BatchRetrieve(index, wmodel="PL2")
    ** pt.BatchRetrieve(index, wmodel="DPH")
)

Running the same query through the new pipeline, we can see that our four features show up in the list:


In [None]:
pipeline_with_features(test_queries[test_queries["qid"] == "PLAIN-102"])

If you only want to compute a single feature in your custom transformer, you can use [`pyterrier.apply.doc_score`](https://pyterrier.readthedocs.io/en/latest/apply.html#pyterrier.apply.doc_score). Let's add two more features:

1. By returning `row["score"]`, we're simply adding the first-stage retrieval score to the feature set.
2. We'll also include the length of the URL as a feature.

We now have a complete pipeline with six features in total:


In [None]:
pipeline_complete = ~first_stage_bm25 >> (
    pt.apply.doc_features(_jaccard_sim)
    ** pt.BatchRetrieve(index, wmodel="PL2")
    ** pt.BatchRetrieve(index, wmodel="DPH")
    ** pt.apply.doc_score(lambda row: row["score"])
    ** pt.apply.doc_score(lambda row: len(row["url"]))
)

### Training LTR models

The actual models used for re-ranking are not implemented in PyTerrier itself; rather, PyTerrier provides a transformer for trainable models (i.e., regression or LTR models) that implement a scikit-learn-like API (i.e., `fit` and `predict` methods). These trainable transformers are [`pyterrier.Estimator`](https://pyterrier.readthedocs.io/en/latest/transformer.html#pt-transformer-estimator) objects.

We'll start by training a simple support vector regression (SVR) model from scikit-learn. Estimators can be created using [`pyterrier.ltr.apply_learned_model`](https://pyterrier.readthedocs.io/en/latest/ltr.html#pyterrier.ltr.apply_learned_model):


In [None]:
from sklearn.svm import SVR

ltr_svm = ~pipeline_complete >> pt.ltr.apply_learned_model(SVR())

Before we can do re-ranking, the model needs to be trained. The `nfcorpus` dataset provides a train/dev/test split, so we can easily load the training data.

**Depending on your hardware, the next cell might take a while to execute.**


In [None]:
ltr_svm.fit(
    pt.get_dataset("irds:nfcorpus/train/nontopic").get_topics(),
    pt.get_dataset("irds:nfcorpus/train/nontopic").get_qrels(),
)
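
To see what a trainable re-ranking stage boils down to, here is a rough sketch in plain Python. It assumes a pointwise setup: the model is fitted on feature vectors with relevance labels, and at inference time its predictions replace the `score` column before the rows are re-sorted. `CovarianceScorer` and `rerank` are hypothetical and deliberately crude; they are not part of PyTerrier or scikit-learn.

```python
class CovarianceScorer:
    """Toy pointwise model with a scikit-learn-like fit/predict interface."""

    def fit(self, X, y):
        n = len(X)
        mean_y = sum(y) / n
        self.weights_ = []
        for j in range(len(X[0])):
            column = [x[j] for x in X]
            mean_j = sum(column) / n
            # Weight each feature by its covariance with the relevance label.
            cov = sum((v - mean_j) * (t - mean_y) for v, t in zip(column, y)) / n
            self.weights_.append(cov)
        return self

    def predict(self, X):
        return [sum(w * v for w, v in zip(self.weights_, x)) for x in X]


def rerank(model, rows):
    """Replace each row's score with the model prediction and sort by it."""
    scores = model.predict([row["features"] for row in rows])
    rescored = [{**row, "score": s} for row, s in zip(rows, scores)]
    return sorted(rescored, key=lambda row: row["score"], reverse=True)


# Made-up training data: feature vectors of query-document pairs and their labels.
X_train = [[0.9, 0.1], [0.2, 0.8], [0.8, 0.3], [0.1, 0.9]]
y_train = [1, 0, 1, 0]
model = CovarianceScorer().fit(X_train, y_train)

# Candidates as they might come out of the feature pipeline, ordered by BM25 score.
candidates = [
    {"docno": "d1", "score": 3.0, "features": [0.2, 0.7]},
    {"docno": "d2", "score": 2.0, "features": [0.9, 0.2]},
]
reranked = rerank(model, candidates)
print([row["docno"] for row in reranked])
```

Any object that offers `fit` and `predict` in this style could take the place of the toy model, which is what makes it possible to plug in `SVR`.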

We can also directly use gradient boosting methods from [XGBoost](https://xgboost.readthedocs.io/en/latest/) and [LightGBM](https://lightgbm.readthedocs.io/en/stable/) by specifying `form="ltr"`. Let's train a [`lightgbm.LGBMRanker`](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRanker.html), which defaults to a LambdaMART model:


In [None]:
from lightgbm import LGBMRanker

ltr_lambdamart = ~pipeline_complete >> pt.ltr.apply_learned_model(
    LGBMRanker(
        metric="ndcg",
        importance_type="gain",
    ),
    form="ltr",
)

This model also makes use of validation (dev) data, which we pass to `fit` after the training topics and qrels:


In [None]:
ltr_lambdamart.fit(
    pt.get_dataset("irds:nfcorpus/train/nontopic").get_topics(),
    pt.get_dataset("irds:nfcorpus/train/nontopic").get_qrels(),
    pt.get_dataset("irds:nfcorpus/dev/nontopic").get_topics(),
    pt.get_dataset("irds:nfcorpus/dev/nontopic").get_qrels(),
)
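
Unlike a pointwise regressor, a ranker such as `LGBMRanker` needs to know which rows belong to the same query. LightGBM takes this information as a list of group sizes (the `group` argument of `LGBMRanker.fit`). The sketch below shows how such a list can be derived from rows sorted by query ID; the data is made up, and we assume here that this is roughly the bookkeeping `form="ltr"` takes care of for us.

```python
from itertools import groupby

# (qid, docno, label) triples; rows of the same query must be adjacent.
training_rows = [
    ("q1", "d1", 2),
    ("q1", "d4", 0),
    ("q1", "d7", 1),
    ("q2", "d2", 0),
    ("q2", "d3", 1),
]

labels = [label for _, _, label in training_rows]
# One entry per query: the number of consecutive rows that belong to it.
group_sizes = [
    len(list(rows)) for _, rows in groupby(training_rows, key=lambda r: r[0])
]
print(group_sizes)
```

The group sizes always sum to the number of training rows, since every row belongs to exactly one query.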

Finally, we can compare the baseline performance (no LTR) with the SVM and LambdaMART models on the test set:


In [None]:
from pyterrier.measures import nDCG, RR, MAP

pt.Experiment(
    [first_stage_bm25, ltr_svm, ltr_lambdamart],
    pt.get_dataset("irds:nfcorpus/test/nontopic").get_topics(),
    pt.get_dataset("irds:nfcorpus/test/nontopic").get_qrels(),
    names=["BM25", "BM25 >> LTR (SVM)", "BM25 >> LTR (LambdaMART)"],
    eval_metrics=[nDCG @ 10, RR @ 10, MAP],
)
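
If you are wondering what two of these metrics measure, the following sketch computes nDCG@k and RR@k by hand for a single query. It assumes linear gains and a log2 rank discount for nDCG (some definitions use exponential gains instead), and the relevance labels are made up. The function names are hypothetical; `pt.Experiment` relies on its own measure implementations.

```python
import math


def ndcg_at_k(ranked_gains, judged_gains, k):
    """nDCG@k with linear gains and a log2 discount."""

    def dcg(gains):
        return sum(
            g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1)
        )

    # The ideal ranking orders all judged documents by decreasing relevance.
    ideal = dcg(sorted(judged_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0


def rr_at_k(ranked_gains, k):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, g in enumerate(ranked_gains[:k], start=1):
        if g > 0:
            return 1.0 / rank
    return 0.0


# Relevance labels of the retrieved documents, in ranked order.
ranked = [0, 2, 0, 1]
# Labels of all judged documents for the query, including one that was not retrieved.
judged = [2, 1, 1, 0, 0]

ndcg = ndcg_at_k(ranked, judged, k=10)
rr = rr_at_k(ranked, k=10)
print(round(ndcg, 4), rr)
```

The values reported by `pt.Experiment` are averages of such per-query scores over all test queries.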

### FeaturesBatchRetrieve

So far, we have used the feature union operator (`**`) to append PL2 and DPH scores to our feature list. This is not optimal, because each of the operations requires another index access to compute the features. If we're only interested in those retrieval-based features, we can use [`pyterrier.FeaturesBatchRetrieve`](https://pyterrier.readthedocs.io/en/latest/ltr.html#pyterrier.FeaturesBatchRetrieve) instead, which computes everything at once:


In [None]:
bm25_fbr = pt.FeaturesBatchRetrieve(
    index,
    wmodel="BM25",
    features=["WMODEL:BM25", "WMODEL:PL2", "WMODEL:DPH"],
    num_results=100,
)

### Feature ablation

Given our approach above (`bm25_fbr`) with three features, we might be interested to know which of these features has the greatest impact on ranking performance. In order to find out, we could create three separate pipelines, where each of them has one of the features removed, and then compare the performance.

Luckily, PyTerrier has us covered and provides transformers to make our lives easier: [`pyterrier.ltr.ablate_features`](https://pyterrier.readthedocs.io/en/latest/ltr.html#pyterrier.ltr.ablate_features) can be included in a pipeline to dynamically remove a set of features; [`pyterrier.ltr.keep_features`](https://pyterrier.readthedocs.io/en/latest/ltr.html#pyterrier.ltr.keep_features) does the opposite. Hence, we can simply use `ablate_features` in a loop to get the effect we want:


In [None]:
ltr_lambdamart_abl = {
    feature: bm25_fbr
    >> pt.ltr.ablate_features(feature)
    >> pt.ltr.apply_learned_model(
        LGBMRanker(
            metric="ndcg",
            importance_type="gain",
        ),
        form="ltr",
    )
    for feature in [0, 1, 2]  # feature indices: 0 = BM25, 1 = PL2, 2 = DPH
}
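
In terms of a single feature vector, ablating and keeping features amounts to the following. This is a plain-Python sketch with hypothetical helper names and made-up scores, not the PyTerrier implementation.

```python
def ablate(features, fids):
    """Return the feature vector without the positions listed in fids."""
    drop = set(fids)
    return [value for i, value in enumerate(features) if i not in drop]


def keep(features, fids):
    """Return only the positions listed in fids, in their original order."""
    wanted = set(fids)
    return [value for i, value in enumerate(features) if i in wanted]


# Made-up BM25, PL2, and DPH scores for one query-document pair.
features = [12.3, 7.1, 9.8]

without_pl2 = ablate(features, [1])
only_bm25 = keep(features, [0])
print(without_pl2, only_bm25)
```

Since the model only ever sees the reduced vectors, each ablated pipeline needs its own trained model.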

We have to train each of these models individually:


In [None]:
for pipeline in ltr_lambdamart_abl.values():
    pipeline.fit(
        pt.get_dataset("irds:nfcorpus/train/nontopic").get_topics(),
        pt.get_dataset("irds:nfcorpus/train/nontopic").get_qrels(),
        pt.get_dataset("irds:nfcorpus/dev/nontopic").get_topics(),
        pt.get_dataset("irds:nfcorpus/dev/nontopic").get_qrels(),
    )

Finally, let's compare them:


In [None]:
pipelines, names = [], []
for feature, pipeline in ltr_lambdamart_abl.items():
    pipelines.append(pipeline)
    names.append(f"LambdaMART (feature {feature} removed)")

pt.Experiment(
    pipelines,
    pt.get_dataset("irds:nfcorpus/test/nontopic").get_topics(),
    pt.get_dataset("irds:nfcorpus/test/nontopic").get_qrels(),
    names=names,
    eval_metrics=[nDCG @ 10, RR @ 10, MAP],
)

## Further reading

Check out the [LTR section](https://pyterrier.readthedocs.io/en/latest/ltr.html) in the PyTerrier documentation.
