# Dense Retrieval for Natural Questions RAG

This notebook demonstrates [PyTerrier](http://github.com/terrier-org/pyterrier) and PyTerrier-RAG. It runs on Google Colab with a (free) T4 GPU, but requires 60GB of disk space (which is very tight; not all Colab instances have that much).

## Installation

Let's install what we need:
 - PyTerrier - core platform
 - PyTerrier_t5 - MonoT5 reranker
 - PyTerrier_dr - dense (biencoder) retrieval
 - pyterrier_rag - Support for RAG datasets and answer generators (aka readers)

In [1]:
%pip install -q python-terrier pyterrier_t5 pyterrier_dr


In [None]:
%pip install -q pyterrier-rag



In [None]:
import pyterrier as pt


## Retrievers

Let's load a dense index of Wikipedia. Conveniently, we've stored this as a [Hugging Face dataset](https://huggingface.co/datasets/pyterrier/ragwiki-e5.flex). This is 60GB in size - downloading takes about 1 hour on Google Colab.

We'll also need a sparse index, to load the content of the documents.

Finally, let's create a monoT5 reranker, which we can use to rerank the E5 results.


In [8]:
e5_index = pt.Artifact.from_hf('pyterrier/ragwiki-e5.flex')

extracting docnos.npids [207 B]
extracting pt_meta.json [81 B]
extracting vecs.f4 [60.1 GB]



In [None]:
from pyterrier_t5 import MonoT5ReRanker

sparse_index = pt.Artifact.from_hf('pyterrier/ragwiki-terrier')
monoT5 = MonoT5ReRanker(batch_size=64, verbose=False)

extracting data.direct.bf [1.9 GB]
extracting data.document.fsarrayfile [340.7 MB]
extracting data.inverted.bf [1.5 GB]
extracting data.lexicon.fsomapfile [330.0 MB]
extracting data.lexicon.fsomaphash [1017 B]
extracting data.lexicon.fsomapid [15.3 MB]
extracting data.meta-0.fsomapfile [1.3 GB]
extracting data.meta.idx [160.3 MB]
extracting data.meta.zdata [8.2 GB]
extracting data.properties [4.1 KB]
extracting pt_meta.json [79 B]


Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565




Let's formulate our ranking pipelines: (1) plain E5 dense retrieval, which we decorate with the title and text of each passage; and (2) the same pipeline with monoT5 reranking.

Here we are using two PyTerrier operators to make a pipeline:
 - `%` - apply a rank cutoff to the transformer on its left.
 - `>>` - compose (aka "then"), which applies the right-hand side to the output of the left-hand side.
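
To make the two operators concrete, here is a minimal pure-Python sketch of how operator overloading can give `>>` and `%` these meanings. The `Stage` class and the `retrieve` and `rerank` stages are hypothetical stand-ins invented for illustration, not PyTerrier's implementation, and the toy handles a single query only. It also shows a point worth keeping in mind when writing pipelines: in Python, `%` binds more tightly than `>>`, so `a % 2 >> b` means `(a % 2) >> b`.

```python
class Stage:
    """Toy pipeline stage (hypothetical): maps a list of result rows to a list of rows."""

    def __init__(self, fn, name):
        self.fn = fn
        self.name = name

    def __call__(self, rows):
        return self.fn(rows)

    def __rshift__(self, other):
        # a >> b: run a, then feed its output to b
        return Stage(lambda rows: other(self(rows)), f"({self.name} >> {other.name})")

    def __mod__(self, k):
        # a % k: run a, then keep only the k highest-scoring rows
        def cutoff(rows):
            ranked = sorted(self(rows), key=lambda r: r["score"], reverse=True)
            return ranked[:k]
        return Stage(cutoff, f"({self.name} % {k})")


# Hypothetical stages: a "retriever" that returns three scored rows,
# and a "reranker" that simply negates every score.
retrieve = Stage(
    lambda rows: [
        {"docno": "d1", "score": 0.2},
        {"docno": "d2", "score": 0.9},
        {"docno": "d3", "score": 0.5},
    ],
    "retrieve",
)
rerank = Stage(lambda rows: [dict(r, score=-r["score"]) for r in rows], "rerank")

# % has higher precedence than >>, so the cutoff applies to `retrieve` alone.
pipeline = retrieve % 2 >> rerank
print(pipeline.name)
print(pipeline([]))
```

The cutoff is applied before the second stage ever runs, which is why a cutoff placed directly after a retriever limits how many passages the (more expensive) later stages have to process.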

In [10]:
from pyterrier_dr import E5

e5_query_encoder = E5()
e5_ret = e5_query_encoder >> e5_index >> sparse_index.text_loader(["title", "text"])
# monoT5 scores the passage text, so the text must be loaded before reranking
monoT5_ret = e5_query_encoder >> e5_index % 10 >> sparse_index.text_loader(["title", "text"]) >> monoT5


terrier-assemblies 5.11 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.8 jar not found, downloading to /root/.pyterrier...
Done


Java started (triggered by TerrierIndex.index_ref) and loaded: pyterrier.java, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]


13:07:27.406 [main] WARN org.terrier.structures.BaseCompressingMetaIndex -- Structure meta reading lookup file directly from disk (SLOW) - try index.meta.index-source=fileinmem in the index properties file. 160.3 MiB of memory would be required.
13:07:27.484 [main] WARN org.terrier.structures.BaseCompressingMetaIndex -- Structure meta reading data file directly from disk (SLOW) - try index.meta.data-source=fileinmem in the index properties file. 8.2 GiB of memory would be required.


Let's compare the results...

In [11]:
(e5_ret%3).search("What are chemical reactions?")


| | qid | query | query_vec | docid | score | rank | docno | text | title |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | What are chemical reactions? | [-0.018842446, 0.022016881, -0.061589867, 0.01...] | 860125 | 0.88379 | 0 | 860125 | Chemical reaction A chemical reaction is a pro... | "Chemical reaction" |
| 1 | 1 | What are chemical reactions? | [-0.018842446, 0.022016881, -0.061589867, 0.01...] | 12495298 | 0.882468 | 1 | 12495298 | redistribution of substances in the human body... | "Chemical reaction" |
| 2 | 1 | What are chemical reactions? | [-0.018842446, 0.022016881, -0.061589867, 0.01...] | 53321 | 0.876791 | 2 | 53321 | are called reactants or reagents. Chemical rea... | "Chemical reaction" |


In [None]:
(monoT5_ret%3).search("What are chemical reactions?")


Re-ranking can change both the order and the membership of the top 3 passages. While order may not matter so much for our readers, any passage that enters or leaves the top 3 would likely change the reader's generated answer.


You'll see that all of our retrievers give as output the same columns:
 - qid - unique identifier of the question
 - query - text of the question
 - docno - unique identifier of the passage
 - title and text (of the passage)
 - score and rank - to induce an ordering

## Readers

### Fusion in Decoder

Let's now look at the readers that will generate the answers. The first one we use is Fusion-in-Decoder (FiD) - a T5-based model that encodes each document separately, but combines these representations in the decoder step.

In PyTerrier terms, a reader takes as input the following columns:
 - qid
 - query
 - docno
 - title & text

And returns:
 - qid
 - query
 - qanswer
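
The input/output contract above can be sketched in plain Python: many passage rows per question go in, and one answer row per question comes out. The `toy_reader` function below is a hypothetical stand-in written only to show the shape of the data; it does no generation at all and simply returns the title of the first passage as the "answer", whereas a real reader such as FiD would pass the question and passages to a language model.

```python
from collections import defaultdict


def toy_reader(rows):
    """Collapse retrieved passages (many rows per query) into one answer row per query."""
    by_query = defaultdict(list)
    for row in rows:
        by_query[(row["qid"], row["query"])].append(row)

    answers = []
    for (qid, query), passages in by_query.items():
        # Hypothetical placeholder for answer generation: use the first passage's title.
        answers.append({"qid": qid, "query": query, "qanswer": passages[0]["title"]})
    return answers


# Illustrative input rows with the columns a reader expects.
retrieved = [
    {"qid": "1", "query": "What are chemical reactions?", "docno": "p1",
     "title": "Chemical reaction", "text": "A chemical reaction is a process..."},
    {"qid": "1", "query": "What are chemical reactions?", "docno": "p2",
     "title": "Reagent", "text": "Substances consumed in a reaction..."},
    {"qid": "2", "query": "What is a catalyst?", "docno": "p3",
     "title": "Catalysis", "text": "Catalysis is the process of..."},
]

for row in toy_reader(retrieved):
    print(row)
```

Note that the output no longer has `docno`, `title` or `text` columns: once the reader has run, each question is represented by a single row carrying its generated answer.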

We provide a checkpoint trained for NQ on Hugging Face at `terrierteam/t5fid_base_nq`.

We further formulate two RAG pipelines - one using E5 alone and one using E5 with monoT5 reranking as input to FiD.

In [None]:
import pyterrier_rag.readers
fid = pyterrier_rag.readers.T5FiD("terrierteam/t5fid_base_nq")

e5_fid = e5_ret % 3 >> fid
monot5_fid = monoT5_ret % 3 >> fid

When we invoke search on this pipeline, we now have a qanswer column that contains the answer.

In [None]:
monot5_fid.search("What are chemical reactions?")

# Datasets & Experiments

Let's compare the effectiveness of these two approaches on the Natural Questions dataset. The topics are downloaded automatically.

In [None]:
dataset = pt.get_dataset('rag:nq')
dataset.get_topics('dev').head(2)

And their corresponding gold-standard answers:

In [None]:
dataset.get_topics('dev').head(2).merge(dataset.get_answers('dev'))

Now let's run an experiment using Natural Questions.

The first four arguments correspond closely to the main details of our experiment: specifically, we're going to compare `e5_fid` and `monot5_fid` on 100 dev topics (this takes about 2 minutes). We'll evaluate our answers using Exact Match and F1.
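
For readers unfamiliar with these two measures, the sketch below shows how Exact Match and token-level F1 are commonly defined for question answering. It assumes the widely used normalisation (lowercasing, removing punctuation and the articles a/an/the) and takes the best score over all gold answers; the measures in `pyterrier_rag.measures` may differ in detail, and the function names here are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """1.0 if the normalised prediction equals any normalised gold answer, else 0.0."""
    return float(any(normalize(prediction) == normalize(g) for g in gold_answers))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall against a single gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def f1(prediction, gold_answers):
    """Best token-level F1 over all gold answers."""
    return max(token_f1(prediction, g) for g in gold_answers)


print(exact_match("The chemical reaction", ["chemical reaction"]))
print(f1("a chemical process", ["chemical reaction"]))
```

Exact Match is strict: a prediction either matches a gold answer after normalisation or it does not. F1 gives partial credit for overlapping tokens, so it tends to be the more forgiving of the two.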

The additional arguments are:
 - `batch_size` - how many queries to run and evaluate at once. Not always necessary, but makes the progress bars more granular.
 - `verbose` - display progress bars for this experiment.
 - `precompute_prefix` - optimise the experiment so that any stages shared at the start of all pipelines are computed only once.
 - `names` - for naming rows in the output dataframe.
 - `baseline` - the index of the pipeline to compare against. With `baseline=0`, we compare to E5 with FiD, to see how much monoT5 reranking helps.
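
The idea behind `precompute_prefix` can be illustrated with a small sketch: find the stages that all pipelines share at the start, run them once, and then run only the differing remainders. The code below is a toy under stated assumptions, not PyTerrier's implementation; it represents each pipeline as a plain list of hypothetical stage names.

```python
def common_prefix(pipelines):
    """Return the longest run of leading stages shared by every pipeline."""
    prefix = []
    for stages in zip(*pipelines):
        if all(s == stages[0] for s in stages):
            prefix.append(stages[0])
        else:
            break
    return prefix


# Hypothetical stage names, for illustration only.
pipeline_a = ["encode_query", "retrieve", "load_text", "read"]
pipeline_b = ["encode_query", "retrieve", "load_text", "rerank", "read"]

shared = common_prefix([pipeline_a, pipeline_b])
remainders = [p[len(shared):] for p in (pipeline_a, pipeline_b)]

print(shared)      # stages that need to run only once
print(remainders)  # stages that still run separately for each pipeline
```

The saving depends on how much the pipelines actually have in common: the prefix ends at the first stage where any two pipelines differ.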

In [None]:
pt.Experiment(
    [e5_fid, monot5_fid],
    dataset.get_topics('dev').head(100), # NB: remove .head(100) to run on all dev topics
    dataset.get_answers('dev'),
    [pyterrier_rag.measures.F1, pyterrier_rag.measures.EM],
    batch_size=25,
    verbose=True,
    precompute_prefix=True,
    names=['e5_fid', 'e5_monoT5_fid'],
    baseline=0
)

# That's all folks.

There are lots of other retrievers possible in PyTerrier - for instance [query expansion](https://pyterrier.readthedocs.io/en/latest/rewrite.html), [doc2query](https://github.com/terrierteam/pyterrier_doc2query), or [dense retrieval](https://github.com/terrierteam/pyterrier_dr) (including the [ColBERT](https://github.com/terrierteam/pyterrier_colbert) multi-representation dense model) or learned sparse retrieval, à la [SPLADE](https://github.com/cmacdonald/pyt_splade).

PyTerrier-RAG also provides easy access to lots of other datasets.