# Dense Retrieval for Natural Questions RAG

This notebook demonstrates [PyTerrier](http://github.com/terrier-org/pyterrier) and PyTerrier-RAG. It runs on Google Colab with a (free) T4 GPU, but requires 60GB of disk space (which is very tight; not all Colab instances have that much).

## Installation

Let's install what we need:
 - PyTerrier - core platform
 - PyTerrier_t5 - MonoT5 reranker
 - PyTerrier_dr - dense (biencoder) retrieval
 - pyterrier_rag - Support for RAG datasets and answer generators (aka readers)

In [1]:
#%pip install -q python-terrier pyterrier_t5 pyterrier_dr

In [2]:
#%pip install -q pyterrier-rag

In [3]:
import pyterrier as pt
pt.utils.set_tqdm('notebook')

## Retrievers

Let's load a dense index of Wikipedia. Conveniently, we've stored this as a [Huggingface dataset](https://huggingface.co/datasets/pyterrier/ragwiki-terrier). This is 60GB in size (it also contains the text of the documents) - downloading takes about 1 hour on Google Colab.

We'll also need a sparse index, to load the content of the documents.

Finally, let's make a monoT5 reranker, which we can use to rerank the E5 results.


In [4]:
e5_index = pt.Artifact.from_hf('pyterrier/ragwiki-e5.flex')

In [5]:
from pyterrier_t5 import MonoT5ReRanker

sparse_index = pt.Artifact.from_hf('pyterrier/ragwiki-terrier')
monoT5 = MonoT5ReRanker(batch_size=64, verbose=False)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565



Let's formulate our ranking pipelines: (1) a plain E5 dense retrieval, which we decorate with the title and text; (2) the same pipeline with monoT5 reranking the top 10 results.

Here we are using two PyTerrier operators to make a pipeline:
 - `%` - apply a rank cutoff to the pipeline on its left.
 - `>>` - compose (aka then), which means apply the right-hand side to the output of the left-hand side.
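As a rough illustration of how two such operators can be built with Python operator overloading, here is a toy sketch. The `Stage` class and the helper functions are hypothetical and are not PyTerrier's implementation; only the behaviour of `%` and `>>` described above is taken from this notebook.

```python
# Toy illustration only: hypothetical classes, not PyTerrier's implementation.
class Stage:
    """Wraps a function that maps a list of result rows to a list of result rows."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, rows):
        return self.fn(rows)

    def __rshift__(self, other):
        # a >> b: feed the output of a into b
        return Stage(lambda rows: other(self(rows)))

    def __mod__(self, k):
        # a % k: keep only the rows of a's output whose rank is below k
        return Stage(lambda rows: [r for r in self(rows) if r["rank"] < k])


def add_ranks(rows):
    # Sort by descending score and assign ranks starting at 0
    ordered = sorted(rows, key=lambda r: r["score"], reverse=True)
    return [{**r, "rank": i} for i, r in enumerate(ordered)]


def add_text(rows):
    # Stand-in for a text loader: attach a placeholder text to each row
    return [{**r, "text": f"text of {r['docno']}"} for r in rows]


retriever = Stage(add_ranks)
loader = Stage(add_text)

# Python gives % a higher precedence than >>, so this is (retriever % 2) >> loader
pipeline = retriever % 2 >> loader

rows = [
    {"docno": "d1", "score": 0.1},
    {"docno": "d2", "score": 0.9},
    {"docno": "d3", "score": 0.5},
]
top = pipeline(rows)
print([r["docno"] for r in top])
```

Because `%` has a higher precedence than `>>` in Python, a cutoff written in the middle of a pipeline applies only to the stage immediately to its left, not to everything composed before it.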

In [6]:
from pyterrier_dr import E5

e5_query_encoder = E5()
e5_ret = e5_query_encoder >> e5_index.np_retriever(batch_size=4096) >> sparse_index.text_loader(["title", "text"])
monoT5_ret =  e5_query_encoder >> e5_index.np_retriever(batch_size=4096) % 10 >> sparse_index.text_loader(["title", "text"]) >> monoT5

Java started (triggered by TerrierIndex.index_ref) and loaded: pyterrier.java, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]


Let's compare the results...

In [15]:
(e5_ret%3).search("What are chemical reactions?")

NumpyRetriever scoring:   0%|          | 0/5131 [00:00<?, ?docbatch/s]

Unnamed: 0,qid,query,query_vec,docno,docid,score,rank,title,text
0,1,What are chemical reactions?,"[-0.018842425, 0.0220169, -0.061589867, 0.0121...",860125,860125,0.88379,0,"""Chemical reaction""",Chemical reaction A chemical reaction is a pro...
1,1,What are chemical reactions?,"[-0.018842425, 0.0220169, -0.061589867, 0.0121...",12495298,12495298,0.882468,1,"""Chemical reaction""",redistribution of substances in the human body...
2,1,What are chemical reactions?,"[-0.018842425, 0.0220169, -0.061589867, 0.0121...",53321,53321,0.876791,2,"""Chemical reaction""",are called reactants or reagents. Chemical rea...


In [14]:
(monoT5_ret%3).search("What are chemical reactions?")

NumpyRetriever scoring:   0%|          | 0/5131 [00:00<?, ?docbatch/s]

Unnamed: 0,qid,query,query_vec,docno,docid,title,text,score,rank
0,1,What are chemical reactions?,"[-0.018842425, 0.0220169, -0.061589867, 0.0121...",860125,860125,"""Chemical reaction""",Chemical reaction A chemical reaction is a pro...,-0.031418,0
1,1,What are chemical reactions?,"[-0.018842425, 0.0220169, -0.061589867, 0.0121...",12495298,12495298,"""Chemical reaction""",redistribution of substances in the human body...,-0.066888,1
2,1,What are chemical reactions?,"[-0.018842425, 0.0220169, -0.061589867, 0.0121...",53321,53321,"""Chemical reaction""",are called reactants or reagents. Chemical rea...,-0.080431,2


**NB:** we're using `.np_retriever()` here. `.torch_retriever()` might be faster, but would need more GPU memory. There are also `.faiss_ivf_retriever()` and `.faiss_hnsw_retriever()`, which would likely be faster; see the [PyTerrier_DR FlexIndex documentation](https://pyterrier.readthedocs.io/en/latest/ext/pyterrier-dr/indexing-retrieval.html).
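To make the trade-off concrete, here is a minimal sketch of exhaustive dense scoring in document batches, which is what the `docbatch` progress bar above suggests `.np_retriever()` does. It uses plain Python and made-up two-dimensional vectors; `dot` and `brute_force_top_k` are hypothetical names, and this is not PyTerrier_DR's code.

```python
import heapq


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def brute_force_top_k(query_vec, doc_vecs, k, batch_size):
    """Score every document against the query, one batch at a time, keeping the k best."""
    best = []  # min-heap of (score, docid); best[0] is the weakest kept result
    for start in range(0, len(doc_vecs), batch_size):
        batch = doc_vecs[start:start + batch_size]
        for offset, vec in enumerate(batch):
            item = (dot(query_vec, vec), start + offset)
            if len(best) < k:
                heapq.heappush(best, item)
            elif item > best[0]:
                heapq.heapreplace(best, item)
    # Highest score first
    return sorted(best, reverse=True)


# Made-up vectors for illustration
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]
top = brute_force_top_k([1.0, 0.0], doc_vecs, k=2, batch_size=3)
print(top)
```

Every document is scored for every query, so the cost grows linearly with the size of the index. Approximate indexes such as IVF or HNSW avoid scoring every document, which is why they would likely be faster.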



## Readers

### Fusion in Decoder

Let's now look at the readers that will generate the answers. The first one we use is Fusion in Decoder (FiD) - a T5-based model that encodes each document separately, but combines these representations in the decoder step.

In PyTerrier terms, a reader takes as input the following columns:
 - qid
 - query
 - docno
 - title & text

And returns:
 - qid
 - query
 - qanswer
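The input/output contract above can be sketched with plain Python dictionaries standing in for dataframe rows. `toy_reader` is a hypothetical placeholder, not a PyTerrier-RAG class: a real reader generates an answer from the query and all of its passages, whereas this one simply echoes the title of the first passage.

```python
from collections import defaultdict


def toy_reader(rows):
    """Hypothetical stand-in for a reader: many passage rows in, one answer row per query out."""
    by_query = defaultdict(list)
    for row in rows:
        by_query[(row["qid"], row["query"])].append(row)

    answers = []
    for (qid, query), docs in by_query.items():
        # A real reader would generate text here; we just reuse the first title
        answers.append({"qid": qid, "query": query, "qanswer": docs[0]["title"]})
    return answers


# Made-up retrieval results for a single query
retrieved = [
    {"qid": "1", "query": "what is x", "docno": "10", "title": "X", "text": "X is ..."},
    {"qid": "1", "query": "what is x", "docno": "11", "title": "X (album)", "text": "..."},
]
answers = toy_reader(retrieved)
print(answers)
```

The point of the sketch is the shape change: several rows per query (one per retrieved passage) go in, and exactly one row per query comes out, with the passage columns replaced by `qanswer`.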

We provide a checkpoint trained for NQ on Huggingface at `terrierteam/t5fid_base_nq`.

We further formulate two RAG pipelines - one using E5 and one using E5 reranked by monoT5 as input to FiD.

In [9]:
import pyterrier_rag.readers
fid = pyterrier_rag.readers.T5FiD("terrierteam/t5fid_base_nq")

e5_fid = e5_ret %3 >> fid
monot5_fid = monoT5_ret %3 >> fid

When we invoke search on this pipeline, we now have a qanswer column that contains the answer.

In [10]:
monot5_fid.search("What are chemical reactions?")

NumpyRetriever scoring:   0%|          | 0/5131 [00:00<?, ?docbatch/s]

Unnamed: 0,qid,query,qanswer
0,1,What are chemical reactions?,chemical equations


# Datasets & Experiments

Let's compare the effectiveness of these two approaches on the Natural Questions dataset. The topics are downloaded automatically.

In [11]:
dataset = pt.get_dataset('rag:nq')
dataset.get_topics('dev').head(2)

Unnamed: 0,qid,query
0,dev_0,who sings does he love me with reba
1,dev_1,how many pages is invisible man by ralph ellison


And their corresponding gold-standard answers:

In [12]:
dataset.get_topics('dev').head(2).merge(dataset.get_answers('dev'))

Unnamed: 0,qid,query,gold_answer
0,dev_0,who sings does he love me with reba,Linda Davis
1,dev_1,how many pages is invisible man by ralph ellison,581 (second edition)


Now let's run an experiment using Natural Questions.

The first four arguments correspond closely to the main details of our experiment - specifically, we're going to compare `e5_fid` and `monot5_fid` on 100 dev topics (this takes about 2 minutes). We'll evaluate our answers using Exact Match and F1.
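For readers unfamiliar with these two measures, the sketch below shows Exact Match and token-level F1 as they are commonly defined for extractive QA (lowercasing, stripping punctuation and English articles before comparing). The function names are hypothetical, and `pyterrier_rag.measures` may normalise text or handle multiple gold answers differently; treat this as an approximation of the idea rather than the library's code.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# A partial answer gets no Exact Match credit but some F1 credit
em = exact_match("581", "581 (second edition)")
f1 = token_f1("581", "581 (second edition)")
print(em, f1)
```

Under these definitions, an answer of "581" against the gold answer "581 (second edition)" has precision 1 and recall 1/3, so it scores 0 on Exact Match and 0.5 on F1. This is why F1 is usually higher than EM.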

The additional arguments are:
 - `batch_size` - how many queries to run and evaluate at once. Not always necessary, but makes the progress bars more granular
 - `verbose` - display progress bars for this experiment
 - `precompute_prefix` - optimise the experiment such that any pipeline prefix shared by all systems (here, the E5 query encoding) is only computed once.
 - `names` - for naming rows in the output dataframe
 - `baseline` - the index of the system to compare against; with `baseline=0`, E5 monoT5 FiD is compared to E5 FiD
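The idea behind `precompute_prefix` can be sketched as follows: find the stages that all pipelines share at the start, run them once, and feed the shared result to the remainder of each pipeline. This is a toy illustration with hypothetical functions and pipelines represented as plain lists; it is not how `pt.Experiment` is implemented internally.

```python
calls = {"encode": 0}


def encode(queries):
    # Stand-in for an expensive shared first stage; counts how often it runs
    calls["encode"] += 1
    return [q + " [encoded]" for q in queries]


def retrieve_a(encoded):
    return [e + " -> A" for e in encoded]


def retrieve_b(encoded):
    return [e + " -> B" for e in encoded]


def common_prefix(pipelines):
    """Longest run of leading stages that is the same object in every pipeline."""
    prefix = []
    for stages in zip(*pipelines):
        if all(stage is stages[0] for stage in stages):
            prefix.append(stages[0])
        else:
            break
    return prefix


def run(stages, data):
    for stage in stages:
        data = stage(data)
    return data


pipelines = [[encode, retrieve_a], [encode, retrieve_b]]

prefix = common_prefix(pipelines)
shared = run(prefix, ["q1", "q2"])  # the shared prefix runs once
results = [run(p[len(prefix):], shared) for p in pipelines]
print(calls["encode"], results)
```

In this sketch only stages that are the very same object count as shared, so two separately constructed but equivalent stages would not be merged.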

In [13]:
pt.Experiment(
    [e5_fid, monot5_fid],
    dataset.get_topics('dev').head(100), # NB: remove .head(100) to run on all dev topics
    dataset.get_answers('dev'),
    [pyterrier_rag.measures.F1, pyterrier_rag.measures.EM],
    batch_size=25,
    verbose=True,
    precompute_prefix=True,
    names=['e5 fid', 'e5_monoT5_fid'],
    baseline=0
)

Precomputing results of 100 topics on shared pipeline component E5.base()
  warn("precompute_prefix with batch_size is very experimental. Please report any problems")


pt.Experiment precomputation:   0%|          | 0/4 [00:00<?, ?batches/s]

pt.Experiment:   0%|          | 0/8 [00:00<?, ?batches/s]

NumpyRetriever scoring:   0%|          | 0/5131 [00:00<?, ?docbatch/s]

Unnamed: 0,name,EM,F1,EM +,EM -,EM p-value,F1 +,F1 -,F1 p-value
0,e5 fid,0.49,0.608476,,,,,,
1,e5_monoT5_fid,0.42,0.557,6.0,13.0,0.108656,6.0,14.0,0.10999
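The `+` and `-` columns can be read as per-query comparisons against the baseline row. The sketch below assumes they count the queries on which a system scored higher or lower than the baseline, and that the p-value comes from a paired test over the same per-query differences; the per-query scores are made up and the helper names are hypothetical. It stops at the paired t statistic rather than computing a p-value.

```python
import math
import statistics


def wins_and_losses(baseline_scores, system_scores):
    """Count queries where the system beats, or is beaten by, the baseline."""
    pairs = list(zip(baseline_scores, system_scores))
    better = sum(1 for b, s in pairs if s > b)
    worse = sum(1 for b, s in pairs if s < b)
    return better, worse


def paired_t_statistic(baseline_scores, system_scores):
    """t statistic of the per-query differences (system minus baseline)."""
    diffs = [s - b for b, s in zip(baseline_scores, system_scores)]
    std_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_error


# Made-up per-query Exact Match scores for five queries
baseline_em = [1, 0, 1, 1, 0]
system_em = [1, 1, 0, 0, 0]

better, worse = wins_and_losses(baseline_em, system_em)
t = paired_t_statistic(baseline_em, system_em)
print(better, worse, t)
```

This reading is consistent with the table above: EM is binary per query, so the difference in mean EM should equal (wins - losses) / number of queries, and indeed (6 - 13) / 100 = -0.07 = 0.42 - 0.49.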


# That's all folks.

There are lots of other retrievers possible in PyTerrier - for instance [query expansion](https://pyterrier.readthedocs.io/en/latest/rewrite.html), [doc2query](https://github.com/terrierteam/pyterrier_doc2query), or [dense retrieval](https://github.com/terrierteam/pyterrier_dr) (including the [ColBERT](https://github.com/terrierteam/pyterrier_colbert) multi-representation dense model), or learned sparse retrieval, à la [SPLADE](https://github.com/cmacdonald/pyt_splade).

PyTerrier-RAG also provides easy access to lots of other datasets.