# PyTerrier Tutorial Notebook - Neural Re-Ranking

This is one of a series of Colab notebooks created for the PyTerrier Tutorial entitled '**IR From Bag-of-words to BERT and Beyond through Practical Experiments**'. It demonstrates the use of [PyTerrier](https://github.com/terrier-org/pyterrier) on the [CORD19 test collection](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge).

In particular, in this notebook you will:

 - Re-rank documents using neural models like KNRM, Vanilla BERT, EPIC, and monoT5.

## Setup

In the following, we will set up the libraries required to execute the notebook.

### PyTerrier installation

The following cell installs the latest release of the [PyTerrier](https://github.com/terrier-org/pyterrier) package.

In [None]:
%pip install -q --upgrade python-terrier

### PyTerrier plugins installation

We install the [OpenNIR](https://opennir.net/) and [monoT5](https://github.com/terrierteam/pyterrier_t5) PyTerrier plugins. You can safely ignore the package versioning errors.

In [None]:
%pip install -q --upgrade git+https://github.com/Georgetown-IR-Lab/OpenNIR
%pip install -q --upgrade git+https://github.com/terrierteam/pyterrier_t5

## Preliminary steps

**[PyTerrier](https://github.com/terrier-org/pyterrier) initialization** 

Let's get [PyTerrier](https://github.com/terrier-org/pyterrier) started. This will download the latest version of the [Terrier](http://terrier.org/) IR platform. We also import the [OpenNIR](https://opennir.net/) PyTerrier bindings.

In [None]:
import pyterrier as pt
from pyterrier.measures import * # allow for natural measure names
import onir_pt

### [TREC-COVID19](https://ir.nist.gov/covidSubmit/) Dataset download

The following cell downloads the [TREC-COVID19](https://ir.nist.gov/covidSubmit/) dataset that we will use in the remainder of this notebook.

In [None]:
dataset = pt.datasets.get_dataset('irds:cord19/trec-covid')
topics = dataset.get_topics(variant='description')
qrels = dataset.get_qrels()

### Terrier inverted index download

To save a few minutes, we use a pre-built Terrier inverted index for the TREC-COVID19 collection (the `'terrier_stemmed_positions'` variant, see the [dataset page](http://data.terrier.org/trec-covid.dataset.html)). The download took a few seconds for us.

In [None]:
index = pt.get_dataset('trec-covid').get_index('terrier_stemmed_positions')

## Re-Rankers from scratch

Let's start exploring a few neural re-ranking methods! We can build them from scratch using `onir_pt.reranker`.

An OpenNIR re-ranking model consists of:
 - `ranker` (e.g., `drmm`, `knrm`, or `pacrr`). This defines the neural ranking architecture.
 - `vocab` (e.g., `wordvec_hash`, or `bert`). This defines how text is encoded by the model. This approach makes it easy to swap out different text representations.

We'll start with a neural ranking model that doesn't use contextualized embeddings.

The following line takes a few minutes to run, as it downloads and prepares the word vectors.

In [None]:
knrm = onir_pt.reranker('knrm', 'wordvec_hash', text_field='title_abstract')

Let's look at how well this model works at ranking!

In [None]:
br = pt.BatchRetrieve(index) % 50
# build a sub-pipeline to get the concatenated title and abstract text
get_title_abstract = pt.text.get_text(dataset, 'title') >> pt.text.get_text(dataset, 'abstract') >> pt.apply.title_abstract(lambda r: r['title'] + ' ' + r['abstract'])
pipeline = br >> get_title_abstract >> knrm
pt.Experiment(
    [br, pipeline],
    topics,
    qrels,
    names=['DPH', 'DPH >> KNRM'],
    eval_metrics=[AP(rel=2), nDCG, nDCG@10, P(rel=2)@10]
)

This doesn't work very well because the model is not trained; it's using random weights to combine the scores from the similarity matrix.
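To make "combining the scores from the similarity matrix" concrete, here is a simplified, pure-Python sketch of kernel pooling in the style of the KNRM paper. It is not OpenNIR's implementation: the random embeddings, the kernel means in `mus`, the single `sigma`, and the helper names are all assumptions chosen for illustration.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two term embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def kernel_features(query_vecs, doc_vecs, mus, sigma=0.1):
    """One soft-match feature per kernel, summed (in log space) over query terms."""
    # Similarity matrix: one row per query term, one column per document term.
    sim = [[cosine(q, d) for d in doc_vecs] for q in query_vecs]
    feats = []
    for mu in mus:
        total = 0.0
        for row in sim:
            # RBF kernel counts how many document terms have similarity near mu.
            k = sum(math.exp(-((s - mu) ** 2) / (2 * sigma ** 2)) for s in row)
            total += math.log(max(k, 1e-10))  # clamp to avoid log(0)
        feats.append(total)
    return feats

def knrm_score(features, weights, bias=0.0):
    """Linear combination of the kernel features, squashed with tanh."""
    return math.tanh(sum(w * f for w, f in zip(weights, features)) + bias)

# Placeholder data: random embeddings stand in for real word vectors.
rng = random.Random(0)
query_vecs = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(3)]
doc_vecs = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]
mus = [-0.9, -0.5, -0.1, 0.3, 0.7, 1.0]

features = kernel_features(query_vecs, doc_vecs, mus)
untrained_weights = [rng.uniform(-1, 1) for _ in mus]  # what an untrained model has
print(knrm_score(features, untrained_weights))
```

With random `untrained_weights`, the final score has no reason to correlate with relevance, which is why the untrained pipeline above does not help. Training adjusts those combination weights.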

## Loading a trained re-ranker

You can train re-ranking models in PyTerrier using the `fit` method. This takes a bit of time, so we'll download a model that's already been trained. If you'd like to train the model yourself, you can use:

```python
# transfer training signals from a medical sample of MS MARCO
from sklearn.model_selection import train_test_split
train_ds = pt.datasets.get_dataset('irds:msmarco-passage/train/medical')
train_topics, valid_topics = train_test_split(train_ds.get_topics(), test_size=50, random_state=42) # split into training and validation sets

# Index MS MARCO
indexer = pt.index.IterDictIndexer('./terrier_msmarco-passage')
tr_index_ref = indexer.index(train_ds.get_corpus_iter(), fields=('text',), meta=('docno',))

pipeline = (pt.BatchRetrieve(tr_index_ref) % 100 # get top 100 results
            >> pt.text.get_text(train_ds, 'text') # fetch the document text
            >> pt.apply.generic(lambda df: df.rename(columns={'text': 'title_abstract'})) # rename to the column that knrm reads (its text_field)
            >> knrm) # apply neural re-ranker

pipeline.fit(
    train_topics,
    train_ds.get_qrels(),
    valid_topics,
    train_ds.get_qrels())
```

In [None]:
del knrm # free up the memory before loading a new version of the ranker
knrm = onir_pt.reranker.from_checkpoint('https://macavaney.us/knrm.medmarco.tar.gz', text_field='title_abstract', expected_md5="d70b1d4f899690dae51161537e69ed5a")

In [None]:
pipeline = br >> get_title_abstract >> knrm
pt.Experiment(
    [br, pipeline],
    topics,
    qrels,
    names=['DPH', 'DPH >> KNRM'],
    baseline=0,
    eval_metrics=[AP(rel=2), nDCG, nDCG@10, P(rel=2)@10]
)

That's a little better than before, but it still underperforms our first-stage ranking model.
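To help read the measures in these tables, the sketch below computes `P(rel=2)@10` and `nDCG@10` by hand for a single ranking. It assumes graded labels where 2 is the highest grade, a linear gain with a log2 rank discount for nDCG, and made-up labels; the function names are hypothetical and this is not the code PyTerrier runs internally.

```python
import math

def precision_at_k(ranked_labels, k=10, rel=2):
    """Fraction of the top-k documents whose label is at least `rel`."""
    return sum(1 for label in ranked_labels[:k] if label >= rel) / k

def dcg_at_k(labels, k):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(label / math.log2(rank + 2) for rank, label in enumerate(labels[:k]))

def ndcg_at_k(ranked_labels, all_labels, k=10):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(all_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal if ideal > 0 else 0.0

# Hypothetical labels of the top-10 documents for one query, in ranked order.
ranked = [2, 0, 1, 2, 0, 0, 1, 0, 0, 0]
# All judged labels for the query, including relevant documents not retrieved.
judged = ranked + [2, 1]

print(precision_at_k(ranked, k=10, rel=2))
print(ndcg_at_k(ranked, judged, k=10))
```

Setting `rel=2` means partially relevant documents (label 1 in this sketch) do not count as hits for precision, while nDCG still gives them partial credit.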

## Vanilla BERT

Contextualized language models, such as [BERT](https://arxiv.org/abs/1810.04805), are much more powerful neural models that have been shown to be effective for ranking.

We'll try using a "vanilla" (or "mono") version of the BERT model, which scores each query-document pair jointly. BERT itself is pre-trained for the tasks of masked language modeling and next sentence prediction.

In [None]:
del knrm # clear out memory from KNRM
vbert = onir_pt.reranker('vanilla_transformer', 'bert', text_field='title_abstract', vocab_config={'train': True})

Let's see how this model does on TREC COVID.

In [None]:
pipeline = br % 50 >> get_title_abstract >> vbert
pt.Experiment(
    [br, pipeline],
    topics,
    qrels,
    names=['DPH', 'DPH >> VBERT'],
    baseline=0,
    eval_metrics=[AP(rel=2), nDCG, nDCG@10, P(rel=2)@10]
)

As we see, although the model is pre-trained, it doesn't do very well at ranking on our benchmark. This is because it's not tuned for the task of relevance ranking.

We can train the model for ranking (as shown above for KNRM) or we can download a trained model. Here, we use the [SLEDGE](https://arxiv.org/abs/2010.05987) model, which is a Vanilla BERT model trained on scientific text and tuned on medical queries.

In [None]:
sledge = onir_pt.reranker.from_checkpoint('https://macavaney.us/scibert-medmarco.tar.gz', text_field='title_abstract', expected_md5="854966d0b61543ffffa44cea627ab63b")

In [None]:
pipeline = br % 50 >> get_title_abstract >> sledge
pt.Experiment(
    [br, pipeline],
    topics,
    qrels,
    names=['DPH', 'DPH >> SLEDGE'],
    baseline=0,
    eval_metrics=[AP(rel=2), nDCG, nDCG@10, P(rel=2)@10, 'mrt']
)

That's much better! We're able to significantly improve upon the first stage ranker. But we can see that this is pretty slow to run.
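Passing `baseline=0` makes `pt.Experiment` compare each system to the first one per query, reporting how many queries improved or degraded and a paired t-test p-value. The sketch below shows the underlying arithmetic on hypothetical per-query nDCG@10 values; the function names are illustrative, and it stops at the t statistic because converting it to a p-value needs a t-distribution (e.g. `scipy.stats.ttest_rel`).

```python
import math
from statistics import mean, stdev

def paired_t_statistic(baseline, system):
    """t statistic of the per-query differences (system minus baseline)."""
    diffs = [s - b for b, s in zip(baseline, system)]
    # Note: raises ZeroDivisionError if all differences are identical.
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

def improved_degraded(baseline, system):
    """Number of queries where the system is better / worse than the baseline."""
    improved = sum(1 for b, s in zip(baseline, system) if s > b)
    degraded = sum(1 for b, s in zip(baseline, system) if s < b)
    return improved, degraded

# Hypothetical per-query nDCG@10 values, aligned by query.
baseline_scores = [0.50, 0.62, 0.40, 0.71, 0.55, 0.48]
reranked_scores = [0.58, 0.70, 0.39, 0.80, 0.66, 0.59]

print(improved_degraded(baseline_scores, reranked_scores))
print(paired_t_statistic(baseline_scores, reranked_scores))
```

A large positive t statistic over many queries corresponds to a small p-value, which is what supports calling an improvement significant.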

## EPIC

Some models focus on query-time computational efficiency. The [EPIC](https://arxiv.org/abs/2004.14245) model builds light-weight document representations that are independent of the query. This means that they can be computed ahead of time. You can index the corpus yourself with the following code (but it takes a while):

```python
indexed_epic = onir_pt.indexed_epic.from_checkpoint('https://macavaney.us/epic.msmarco.tar.gz', index_path='./epic_cord19')
indexed_epic.index(dataset.get_corpus_iter(), fields=('title', 'abstract'))
```

Instead, we'll download a copy of the EPIC-processed documents:

In [None]:
import os
if not os.path.exists('epic_cord19.zip'):
  !wget http://macavaney.us/epic_cord19.zip
  !unzip epic_cord19.zip
indexed_epic = onir_pt.indexed_epic.from_checkpoint('https://macavaney.us/epic.msmarco.tar.gz', index_path='./epic_cord19')

We can now run this model over the results of a first-stage ranker. Note how we do not need to fetch the document text with `pt.text.get_text`, which further saves time.

In [None]:
br = pt.BatchRetrieve(index) % 50
pipeline = (br >> indexed_epic.reranker())
pt.Experiment(
    [br, pipeline],
    dataset.get_topics('description'),
    dataset.get_qrels(),
    names=['DPH', 'DPH >> EPIC (indexed)'],
    eval_metrics=[AP(rel=2), nDCG, nDCG@10, P(rel=2)@10, "mrt"]
)
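The reason EPIC is cheap at query time can be sketched in a few lines: document vectors over the vocabulary are computed once offline, so re-ranking reduces to a sparse dot product with the query's term weights. The weights, document IDs, and function names below are invented for illustration and are not the `onir_pt` implementation.

```python
def epic_style_score(query_weights, doc_vector):
    """Dot product between query term importances and a precomputed document vector."""
    return sum(w * doc_vector.get(term, 0.0) for term, w in query_weights.items())

def rerank(query_weights, candidates, doc_index):
    """Score first-stage candidates using only precomputed vectors (no document text)."""
    scored = [(docno, epic_style_score(query_weights, doc_index[docno]))
              for docno in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical precomputed document representations (built once, at indexing time).
doc_index = {
    'd1': {'coronavirus': 0.9, 'transmission': 0.7, 'mask': 0.2},
    'd2': {'vaccine': 0.8, 'coronavirus': 0.4, 'trial': 0.6},
}
# Hypothetical query term importances (computed at query time; queries are short).
query_weights = {'coronavirus': 0.5, 'transmission': 1.0}

print(rerank(query_weights, ['d2', 'd1'], doc_index))
```

Only the query side needs a model forward pass at query time in this scheme, which is consistent with the lower `mrt` you would expect compared with running BERT over every candidate document.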

## Tuning re-ranking threshold

[Prior work suggests](https://arxiv.org/pdf/1904.12683.pdf) that the re-ranking cutoff threshold is an important model hyperparameter. Let's see how this parameter affects EPIC.

In [None]:
cutoffs = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
dph = pt.BatchRetrieve(index)
res = pt.Experiment(
    [dph % cutoff >> indexed_epic.reranker() for cutoff in cutoffs],
    dataset.get_topics('description'),
    dataset.get_qrels(),
    names=[f'c={cutoff}' for cutoff in cutoffs],
    eval_metrics=[AP(rel=2), nDCG, nDCG@10, P(rel=2)@10, "mrt"]
)
res

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline
plt.plot(res['name'], res['nDCG@10'], label='nDCG@10')
plt.plot(res['name'], res['P(rel=2)@10'], label='P(rel=2)@10')
plt.ylabel('value')
plt.legend()
plt.show()
plt.clf()
plt.plot(res['name'], res['mrt'])
plt.ylabel('mrt')
plt.show()

It appears that the optimal re-ranking threshold for this collection is around 50-70. This also avoids excessive re-ranking time, which grows roughly linearly with larger thresholds. In practice, this parameter should be tuned on a held-out validation set to avoid over-fitting.
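A minimal sketch of that tuning procedure is shown below, assuming you already have per-query metric values for each cutoff (for example from `pt.Experiment(..., perquery=True)`). The helper names and the numbers are hypothetical.

```python
import random

def split_qids(qids, validation_fraction=0.5, seed=42):
    """Shuffle query IDs reproducibly and split them into validation and test sets."""
    shuffled = sorted(qids)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * validation_fraction)
    return shuffled[:n_valid], shuffled[n_valid:]

def best_cutoff(per_query_scores, qids):
    """Pick the cutoff with the highest mean metric over the given queries.

    per_query_scores maps cutoff -> {qid: metric value}.
    """
    def mean_score(cutoff):
        return sum(per_query_scores[cutoff][q] for q in qids) / len(qids)
    return max(per_query_scores, key=mean_score)

# Hypothetical per-query nDCG@10 values for three cutoffs.
per_query_scores = {
    10:  {'1': 0.50, '2': 0.60, '3': 0.40, '4': 0.70},
    50:  {'1': 0.60, '2': 0.70, '3': 0.50, '4': 0.60},
    100: {'1': 0.55, '2': 0.65, '3': 0.45, '4': 0.65},
}

valid_qids, test_qids = split_qids(['1', '2', '3', '4'])
cutoff = best_cutoff(per_query_scores, valid_qids)  # tune on validation queries only
print(cutoff, test_qids)  # then report effectiveness at this cutoff on test_qids
```

The key design point is that the cutoff is chosen without looking at the test queries, so the reported test effectiveness is not inflated by the selection step.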

## monoT5

The [monoT5](https://arxiv.org/abs/2003.06713) model scores documents using a sequence-to-sequence language model (T5), which is prompted to generate a relevance label for each query-document pair. Let's see how this approach works on TREC COVID.

The `MonoT5ReRanker` class from `pyterrier_t5` automatically loads a version of the monoT5 ranker that is trained on the MS MARCO passage dataset.

In [None]:
from pyterrier_t5 import MonoT5ReRanker
monoT5 = MonoT5ReRanker(text_field='title_abstract')

In [None]:
br = pt.BatchRetrieve(index) % 50
pipeline = (br >> get_title_abstract >> monoT5)
pt.Experiment(
    [br, pipeline],
    dataset.get_topics('description'),
    dataset.get_qrels(),
    names=['DPH', 'DPH >> T5'],
    eval_metrics=[AP(rel=2), nDCG, nDCG@10, P(rel=2)@10, "mrt"]
)
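For intuition about what monoT5 computes, the sketch below follows the scoring scheme described in the monoT5 paper: the query and document are placed in a text prompt, and the score is derived from the model's logits for the target tokens "true" and "false". No model is loaded here; the logit values are placeholders and the function names are hypothetical.

```python
import math

def monot5_prompt(query, document):
    """Input template used to ask the model whether the document is relevant."""
    return f"Query: {query} Document: {document} Relevant:"

def monot5_score(true_logit, false_logit):
    """Log-probability of 'true' under a softmax over the two target tokens."""
    m = max(true_logit, false_logit)  # subtract the max for numerical stability
    log_norm = m + math.log(math.exp(true_logit - m) + math.exp(false_logit - m))
    return true_logit - log_norm

print(monot5_prompt("what causes covid-19", "SARS-CoV-2 is the virus that ..."))
# Placeholder logits: a confident 'true' scores close to 0, a confident 'false' much lower.
print(monot5_score(4.0, -2.0))
print(monot5_score(-2.0, 4.0))
```

Documents are then sorted by this score, so only the relative ordering of the values matters for the final ranking.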