# Sparse Retrieval for Natural Questions RAG

This notebook demonstrates [PyTerrier](http://github.com/terrier-org/pyterrier) and PyTerrier-RAG. It runs on Google Colab with a (free) T4 GPU.

## Installation

Let's install what we need:
 - PyTerrier - core platform
 - PyTerrier_t5 - MonoT5 reranker
 - pyterrier_rag - Support for RAG datasets and answer generators (aka readers)

In [1]:
%pip install -q python-terrier pyterrier_t5


In [2]:
%pip install -q git+https://github.com/cmacdonald/pyterrier_rag.git


In [3]:
import pyterrier as pt
import pyterrier_rag

## Retrievers

Let's load a sparse index of Wikipedia. Conveniently, we've stored this as a [Hugging Face dataset](https://huggingface.co/datasets/pyterrier/ragwiki-terrier). It is 12GB in size (it also contains the text of the documents), and downloading takes about 10 minutes on Google Colab.

We'll use that index to get a BM25 retriever. Terrier doesn't like question marks in queries, so we'll strip these before retrieval and restore them afterwards.
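As a rough illustration of the strip-and-restore idea (not PyTerrier's actual implementation), the sketch below keeps the original question in a separate field, hands a punctuation-free copy to the retriever, and then puts the original back. The helper names `strip_for_retrieval` and `restore_query`, and the `original_query` key, are hypothetical and exist only for this example.

```python
import re

def strip_for_retrieval(row: dict) -> dict:
    """Keep the original question and add a punctuation-free copy for the retriever."""
    cleaned = re.sub(r"[^\w\s]", " ", row["query"])   # replace punctuation with spaces
    cleaned = re.sub(r"\s+", " ", cleaned).strip()    # collapse repeated whitespace
    return {**row, "original_query": row["query"], "query": cleaned}

def restore_query(row: dict) -> dict:
    """Put the original question back once retrieval is done."""
    restored = dict(row)
    restored["query"] = restored.pop("original_query")
    return restored

question = {"qid": "1", "query": "What are chemical reactions?"}
stripped = strip_for_retrieval(question)
restored = restore_query(stripped)
print(stripped["query"])   # What are chemical reactions
print(restored["query"])   # What are chemical reactions?
```

Restoring the question matters for the later stages: the reranker and the readers should see the question as the user wrote it, not the cleaned form used for term matching.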

Finally, let's make a monoT5 reranker, which we can use to rerank the BM25 results.


In [4]:
from pyterrier_t5 import MonoT5ReRanker

sparse_index = pt.Artifact.from_hf('pyterrier/ragwiki-terrier')

bm25_ret = pt.rewrite.tokenise() >> sparse_index.bm25(include_fields=['docno', 'text', 'title'], threads=5) >> pt.rewrite.reset()
monoT5 = MonoT5ReRanker(batch_size=64, verbose=False)

Let's formulate our monoT5 reranking pipeline - we'll take the top 10 documents from BM25 and rerank those using monoT5. Here we are using two PyTerrier operators to make a pipeline:
 - `%` - apply a rank cutoff to the left.
 - `>>` - compose (aka. then), which means apply the right-hand side to the output of the left-hand side.
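To make the semantics of the two operators concrete, here is a toy sketch in plain Python. It is not how PyTerrier implements its transformers; the `Stage` class and the `toy_retrieve` and `toy_rerank` functions are invented for illustration and work on lists of dicts rather than dataframes.

```python
class Stage:
    """A toy pipeline stage: wraps a function from a list of rows to a list of rows."""

    def __init__(self, fn, name):
        self.fn = fn
        self.name = name

    def __call__(self, rows):
        return self.fn(rows)

    def __rshift__(self, other):
        # a >> b: run a, then feed its output to b
        return Stage(lambda rows: other(self(rows)), f"({self.name} >> {other.name})")

    def __mod__(self, k):
        # a % k: run a, then keep only the k best-ranked rows
        return Stage(lambda rows: sorted(self(rows), key=lambda r: r["rank"])[:k],
                     f"({self.name} % {k})")

def toy_retrieve(rows):
    # Ignores its input and returns five fake passages, best first.
    return [{"docno": f"d{i}", "score": 10.0 - i, "rank": i} for i in range(5)]

def toy_rerank(rows):
    # Demo-only rescoring rule: prefer higher docno values.
    rescored = sorted(rows, key=lambda r: r["docno"], reverse=True)
    return [{**r, "score": -float(i), "rank": i} for i, r in enumerate(rescored)]

retriever = Stage(toy_retrieve, "retriever")
reranker = Stage(toy_rerank, "reranker")

pipeline = retriever % 3 >> reranker
print(pipeline.name)                        # ((retriever % 3) >> reranker)
print([r["docno"] for r in pipeline([])])   # ['d2', 'd1', 'd0']
```

Because Python gives `%` a higher precedence than `>>`, an expression such as `retriever % 3 >> reranker` applies the cutoff to the retriever first and only then passes the surviving rows to the reranker.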

In [5]:
monoT5_ret = bm25_ret % 10 >> monoT5

Let's compare the results...

In [6]:
(bm25_ret%3).search("What are chemical reactions?")

| | qid | docid | docno | text | title | rank | score | query |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1027780 | 1027780 | Chemical change Chemical changes occur when a ... | "Chemical change" | 0 | 29.908465 | What are chemical reactions? |
| 1 | 1 | 53321 | 53321 | are called reactants or reagents. Chemical rea... | "Chemical reaction" | 1 | 29.825437 | What are chemical reactions? |
| 2 | 1 | 860125 | 860125 | Chemical reaction A chemical reaction is a pro... | "Chemical reaction" | 2 | 29.526942 | What are chemical reactions? |


In [7]:
(monoT5_ret%3).search("What are chemical reactions?")

| | qid | docid | docno | text | title | query | score | rank |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 860125 | 860125 | Chemical reaction A chemical reaction is a pro... | "Chemical reaction" | What are chemical reactions? | -0.029645 | 0 |
| 0 | 1 | 53321 | 53321 | are called reactants or reagents. Chemical rea... | "Chemical reaction" | What are chemical reactions? | -0.08068 | 1 |
| 2 | 1 | 3147077 | 3147077 | the course of a reaction. Reaction mechanisms ... | Chemistry | What are chemical reactions? | -0.658414 | 2 |


Interestingly, the re-ranking had some impact: 860125 was 3rd under BM25 but became 1st under monoT5. While order may not matter so much for our readers, the inclusion of 3147077 and the removal of 1027780 would likely change the reader's generated answer.
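The same comparison can be made programmatically. The sketch below uses the docnos from the two outputs above; `compare_rankings` is a hypothetical helper written for this example, not part of PyTerrier.

```python
bm25_top3 = ["1027780", "53321", "860125"]     # docnos from the BM25 output
monot5_top3 = ["860125", "53321", "3147077"]   # docnos from the monoT5 output

def compare_rankings(before, after):
    """Summarise which passages entered, left, or moved between two ranked lists."""
    entered = [d for d in after if d not in before]
    left = [d for d in before if d not in after]
    # Map each passage that changed position to its (old rank, new rank).
    moved = {d: (before.index(d), after.index(d))
             for d in after
             if d in before and before.index(d) != after.index(d)}
    return entered, left, moved

entered, left, moved = compare_rankings(bm25_top3, monot5_top3)
print(entered)  # ['3147077']
print(left)     # ['1027780']
print(moved)    # {'860125': (2, 0)}
```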


You'll see that all of our retrievers give as output the same columns:
 - qid - unique identifier of the question
 - query - text of the question
 - docno - unique identifier of the passage
 - title and text (of the passage)
 - score and rank - to impose an ordering

## Readers

### Fusion in Decoder

Let's now look at the readers that will generate the answers. The first one we use is Fusion-in-Decoder (FiD), a T5-based model that encodes each document separately but combines these representations in the decoder step.
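A back-of-the-envelope calculation suggests why encoding passages separately scales well. Encoder self-attention compares every token with every other token in the same input, so its cost grows with the square of the input length. The sketch below counts those token pairs for the two layouts; the passage length of 250 tokens is an assumption chosen purely for illustration, and the count ignores the decoder and every other cost.

```python
def encoder_attention_pairs(num_passages: int, tokens_per_passage: int, separate: bool) -> int:
    """Count token pairs scored by encoder self-attention (a rough proxy for cost)."""
    if separate:
        # FiD-style: each passage attends only within itself.
        return num_passages * tokens_per_passage ** 2
    # One long concatenated input: every token attends to every other token.
    return (num_passages * tokens_per_passage) ** 2

TOKENS = 250  # assumed passage length in tokens, for illustration only

for n in (3, 100):
    separate_cost = encoder_attention_pairs(n, TOKENS, separate=True)
    concat_cost = encoder_attention_pairs(n, TOKENS, separate=False)
    print(f"{n} passages: separate={separate_cost:,} concatenated={concat_cost:,} "
          f"ratio={concat_cost // separate_cost}")
```

Under these assumptions the concatenated layout costs as many times more as there are passages, which is one reason a FiD-style reader can be given far more passages than a reader that packs everything into a single prompt.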

In PyTerrier terms, a reader takes as input the following columns:
 - qid
 - query
 - docno
 - title & text

And returns:
 - qid
 - query
 - qanswer
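The column contract above can be sketched with a stand-in reader that works on plain dicts. Everything here is hypothetical: `toy_reader` and `placeholder_generate` are invented for this example, and a real reader would call a language model where the placeholder simply echoes the top passage's title.

```python
from collections import defaultdict

def placeholder_generate(query: str, context: str) -> str:
    # Stands in for a language model call: returns the title of the first context line.
    return context.split(":", 1)[0]

def toy_reader(rows):
    """Group retrieved passages by question and emit one answer row per qid."""
    by_question = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["rank"]):
        by_question[(row["qid"], row["query"])].append(row)

    answers = []
    for (qid, query), passages in by_question.items():
        # Build one context string per question, best-ranked passage first.
        context = "\n".join(f'{p["title"]}: {p["text"]}' for p in passages)
        answers.append({"qid": qid, "query": query,
                        "qanswer": placeholder_generate(query, context)})
    return answers

retrieved = [
    {"qid": "1", "query": "What are chemical reactions?", "docno": "53321",
     "title": "Chemical reaction", "text": "are called reactants or reagents.", "rank": 1},
    {"qid": "1", "query": "What are chemical reactions?", "docno": "860125",
     "title": "Chemical reaction", "text": "A chemical reaction is a process.", "rank": 0},
]
answers = toy_reader(retrieved)
print(answers)
```

Note that many input rows (one per passage) collapse into a single output row per question, which is why `docno`, `title` and `text` no longer appear in the reader's output.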

We provide a checkpoint trained for NQ on Huggingface at terrierteam/t5fid_base_nq.

We further formulate two RAG pipelines - one using BM25 and one using monoT5 as input to FiD.

In [8]:
import pyterrier_rag.readers
fid = pyterrier_rag.readers.T5FiD("terrierteam/t5fid_base_nq")

bm25_fid = bm25_ret %3 >> fid
monot5_fid = monoT5_ret %3 >> fid

When we invoke search on this pipeline, we now have a qanswer column that contains the answer.

In [9]:
monot5_fid.search("What are chemical reactions?")

| | qid | query | qanswer |
|---|---|---|---|
| 0 | 1 | What are chemical reactions? | chemical equations |


### FlanT5

Our second reader is FlanT5 - an instruction-tuned model - we'll use it zero-shot.

In [10]:
flant5 = pyterrier_rag.readers.Seq2SeqLMReader(model_name_or_path='google/flan-t5-base', model=None)
monoT5_flant5 =  bm25_ret % 10 >> monoT5 %3 >> flant5
monoT5_flant5.search("What are chemical reactions?")

| | qid | query | qanswer |
|---|---|---|---|
| 0 | 1 | What are chemical reactions? | chemical transformation of one set of chemical... |


Interestingly, the answer from FlanT5 is a bit longer and more detailed.

## Datasets & Experiments

Let's compare the effectiveness of these three approaches on the Natural Questions dataset. These topics are automatically downloaded.

In [11]:
dataset = pt.get_dataset('rag:nq')
dataset.get_topics('dev').head(2)

| | qid | query |
|---|---|---|
| 0 | dev_0 | who sings does he love me with reba |
| 1 | dev_1 | how many pages is invisible man by ralph ellison |


And their corresponding gold-standard answers:

In [12]:
dataset.get_topics('dev').head(2).merge(dataset.get_answers('dev'))

| | qid | query | gold_answer |
|---|---|---|---|
| 0 | dev_0 | who sings does he love me with reba | Linda Davis |
| 1 | dev_1 | how many pages is invisible man by ralph ellison | 581 (second edition) |


Now let's run an experiment using Natural Questions.

The first four arguments correspond closely to the main details of our experiment: we're going to compare `bm25_fid`, `monot5_fid` and `monoT5_flant5` on 100 dev topics (this takes about 2 minutes). We'll evaluate our answers using Exact Match (EM) and F1.
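For readers unfamiliar with these two measures, here is a minimal sketch of how answer-level Exact Match and token-level F1 are commonly computed in question answering. It assumes SQuAD-style normalisation (lowercasing, dropping punctuation and English articles) and a single gold answer; the measures in `pyterrier_rag.measures` may differ in their details.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> list:
    """Lowercase, drop punctuation and English articles, then split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalised token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())  # shared tokens, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

gold = "Linda Davis"
print(exact_match("linda davis", gold), f1("linda davis", gold))
print(exact_match("Reba McEntire and Linda Davis", gold),
      round(f1("Reba McEntire and Linda Davis", gold), 3))
```

EM rewards only answers that match the gold string after normalisation, while F1 gives partial credit to answers that contain the gold tokens alongside extra words.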

The additional arguments are:
 - `batch_size` - how many queries to run and evaluate at once. Not always necessary, but makes the progress bars more granular
 - `verbose` - display progress bars for this experiment
 - `precompute_prefix` - optimise the experiment such that BM25 is only computed once.
 - `names` - for naming rows in the output dataframe
 - `baseline` - we'll compare to monoT5 with FiD, to see how much it helps compared to BM25, and how much FlanT5 does better than FiD.
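The idea behind `precompute_prefix` can be illustrated with a toy model in which each pipeline is a list of stage labels. This is a sketch of the general caching idea, not PyTerrier's implementation; the stage labels and the `common_prefix`, `run_stage` and `run_all` helpers are hypothetical.

```python
from collections import Counter

calls = Counter()  # how many times each stage was executed

def run_stage(name, data):
    # Stand-in for running a stage: record the call and append the stage label.
    calls[name] += 1
    return data + [name]

def common_prefix(pipelines):
    """Longest run of leading stages shared by every pipeline."""
    prefix = []
    for stages in zip(*pipelines):
        if all(s == stages[0] for s in stages):
            prefix.append(stages[0])
        else:
            break
    return prefix

def run_all(pipelines, precompute):
    shared = common_prefix(pipelines) if precompute else []
    cached = []
    for name in shared:              # run the shared prefix exactly once
        cached = run_stage(name, cached)
    results = []
    for stages in pipelines:         # then run only what differs per pipeline
        data = list(cached)
        for name in stages[len(shared):]:
            data = run_stage(name, data)
        results.append(data)
    return results

pipelines = [
    ["bm25", "cutoff3", "fid"],
    ["bm25", "cutoff10", "monoT5", "cutoff3", "fid"],
    ["bm25", "cutoff10", "monoT5", "cutoff3", "flant5"],
]

calls.clear()
naive = run_all(pipelines, precompute=False)
naive_bm25 = calls["bm25"]

calls.clear()
cached_results = run_all(pipelines, precompute=True)
shared_bm25 = calls["bm25"]

print(naive_bm25, shared_bm25)  # 3 1
```

In this toy model only the first stage is common to all three pipelines, so the saving is two retrieval runs; the outputs are identical either way.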

In [13]:
pt.Experiment(
    [bm25_fid, monot5_fid, monoT5_flant5],
    dataset.get_topics('dev').head(100), # NB: remove .head(100) to run on all dev topics
    dataset.get_answers('dev'),
    [pyterrier_rag.measures.F1, pyterrier_rag.measures.EM],
    batch_size=25,
    verbose=True,
    precompute_prefix=True,
    names=['bm25 fid', 'monoT5_fid', 'monoT5 FlanT5 0z'],
    baseline=1
)

Precomputing results of 100 topics on shared pipeline component (pt.apply.query() >> TerrierRetr(BM25) >> pt.apply.generic())

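| | name | F1 | EM | F1 + | F1 - | F1 p-value | EM + | EM - | EM p-value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | bm25 fid | 0.265524 | 0.17 | 4.0 | 18.0 | 0.020276 | 3.0 | 11.0 | 0.031801 |
| 1 | monoT5_fid | 0.351333 | 0.25 | | | | | | |
| 2 | monoT5 FlanT5 0z | 0.298913 | 0.21 | 5.0 | 13.0 | 0.061812 | 3.0 | 7.0 | 0.207501 |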


From the results, we can see that monoT5 with FiD was the most effective answer generator (0.35 F1, 0.25 EM). Applying monoT5 to rerank the top 10 passages of BM25 improved the answers to 18 questions compared to raw BM25 (see the F1 - column of the `bm25 fid` row), while 4 questions got worse (F1 +). The improvement brought by monoT5 is significant at p < 0.05 for both F1 and EM (see the calculated p-values).

FlanT5 gave a better answer than FiD for 5 questions (F1 +), but a worse one for 13 (F1 -). FiD is likely better as it has been fine-tuned on the NQ dataset, while FlanT5 is used zero-shot. However, the difference is not statistically significant for either F1 or EM (paired t-test).
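The "+" and "-" counts and the paired test can be reproduced by hand from per-query scores. The sketch below uses only the standard library and stops at the t statistic; turning it into a p-value needs the t distribution with n - 1 degrees of freedom (for example via `scipy.stats.ttest_rel`). The per-query scores are invented for illustration and are not taken from the experiment above.

```python
import math
from statistics import mean, stdev

def compare_to_baseline(system, baseline):
    """Count per-query wins and losses and compute the paired t statistic."""
    diffs = [s - b for s, b in zip(system, baseline)]
    better = sum(d > 0 for d in diffs)   # queries where the system beats the baseline
    worse = sum(d < 0 for d in diffs)    # queries where the system is beaten
    # Paired t statistic: mean difference divided by its standard error.
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
    return better, worse, t_stat

# Invented per-query F1 scores for six queries, for illustration only.
baseline_f1 = [1.0, 0.0, 0.5, 1.0, 0.0, 0.8]
system_f1 = [1.0, 0.5, 0.5, 0.0, 0.0, 1.0]

better, worse, t_stat = compare_to_baseline(system_f1, baseline_f1)
print(better, worse, round(t_stat, 3))
```

Because the test is paired, it looks at the per-query differences rather than at the two means in isolation, which is why a system can win on more queries and still not differ significantly.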

FiD can take more than just 3 passages: because each passage is encoded separately, the number of passages is not tied to the encoder's input length. Let's see how well it does with 100 passages selected by monoT5 from the top 200 of BM25. This experiment takes about 10 minutes on a Colab T4 GPU. Again, the BM25 results are pre-computed and reused for both pipelines.

In [14]:
pt.Experiment(
    [monot5_fid, bm25_ret % 200 >> monoT5 % 100 >> fid],
    dataset.get_topics('dev').head(100), # NB: remove .head(100) to run on all dev topics
    dataset.get_answers('dev'),
    [pyterrier_rag.measures.F1, pyterrier_rag.measures.EM],
    batch_size=25,
    verbose=True,
    precompute_prefix=True,
    names=['monoT5 3p FiD', 'monoT5 100p FiD'],
    baseline=0
)

Precomputing results of 100 topics on shared pipeline component (pt.apply.query() >> TerrierRetr(BM25) >> pt.apply.generic())

| | name | F1 | EM | F1 + | F1 - | F1 p-value | EM + | EM - | EM p-value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | monoT5 3p FiD | 0.351333 | 0.25 | | | | | | |
| 1 | monoT5 100p FiD | 0.547905 | 0.4 | 29.0 | 8.0 | 1.3e-05 | 20.0 | 5.0 | 0.002304 |


Great! According to F1, giving FiD 100 passages yields a significant improvement $(p<0.05)$ over 3 passages (0.55 vs. 0.35), improving the answers of 29 queries while degrading 8.



## That's all folks.

There are lots of other retrievers possible in PyTerrier - for instance [query expansion](https://pyterrier.readthedocs.io/en/latest/rewrite.html), [doc2query](https://github.com/terrierteam/pyterrier_doc2query), or [dense retrieval](https://github.com/terrierteam/pyterrier_dr) (including the [ColBERT](https://github.com/terrierteam/pyterrier_colbert) multi-representation dense model).

PyTerrier-RAG also provides easy access to lots of other datasets.