# Sparse Retrieval for Natural Questions RAG

This notebook demonstrates [PyTerrier](http://github.com/terrier-org/pyterrier) and [PyTerrier-RAG](https://github.com/terrierteam/pyterrier_rag). This notebook runs on Google Colab if you select a Runtime with a T4 GPU (these are free to use), and will take approx. 12 minutes to execute fully.

## Installation

Let's install what we need:
 - PyTerrier - core platform
 - PyTerrier_t5 - MonoT5 reranker
 - pyterrier_rag - Support for RAG datasets and answer generators (aka readers)

In [None]:
#%pip install -q --root-user-action=ignore python-terrier pyterrier_t5

In [None]:
#%pip install -q  --root-user-action=ignore pyterrier-rag

In [13]:
import pyterrier as pt
pt.utils.set_tqdm('notebook')
import pyterrier_rag

## Retrievers

Let's load a sparse index of Wikipedia. Conveniently, we've stored this as a [Hugging Face dataset](https://huggingface.co/datasets/pyterrier/ragwiki-terrier). It is 12GB in size (it also contains the text of the documents), so downloading takes about 10 minutes on Google Colab.

We'll use that index to get a BM25 retriever (you can see how this was created at https://huggingface.co/datasets/pyterrier/ragwiki-terrier#reproduction). 

Terrier doesn't like question marks in queries, so we'll strip these before retrieval and restore the original query afterwards.
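
The strip-and-restore pattern can be sketched in plain Python. This is only an illustration of the idea, not how `pt.rewrite.tokenise()` and `pt.rewrite.reset()` are implemented; the helper names `strip_for_terrier` and `restore`, and the use of a `query_0` key to stash the original text, are assumptions made for this sketch.

```python
import re

def strip_for_terrier(row):
    """Stash the original question and replace punctuation with spaces."""
    cleaned = re.sub(r"[^\w\s]", " ", row["query"])
    cleaned = " ".join(cleaned.split())  # collapse repeated whitespace
    return {**row, "query_0": row["query"], "query": cleaned}

def restore(row):
    """Put the original question back once retrieval is done."""
    restored = dict(row)
    restored["query"] = restored.pop("query_0")
    return restored

row = {"qid": "1", "query": "What are chemical reactions?"}
stripped = strip_for_terrier(row)   # what the retriever would see
restored = restore(stripped)        # what later stages would see
```

Restoring matters because later stages, such as rerankers and readers, work better with the question as it was originally asked.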

Finally, let's make a monoT5 reranker, which we will use to rerank the BM25 results.


In [14]:
from pyterrier_t5 import MonoT5ReRanker

sparse_index = pt.Artifact.from_hf('pyterrier/ragwiki-terrier')

bm25_ret = pt.rewrite.tokenise() >> sparse_index.bm25(include_fields=['docno', 'text', 'title'], threads=5) >> pt.rewrite.reset()
monoT5 = MonoT5ReRanker(batch_size=64, verbose=False)



Let's formulate our monoT5 reranking pipeline: we'll take the top 10 documents from BM25 and rerank those using monoT5. Here we use two PyTerrier operators to make a pipeline:
 - `%` - apply a rank cutoff to the pipeline on its left.
 - `>>` - compose (aka "then"), which applies the right-hand side to the output of the left-hand side.
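
To make the two operators concrete, here is a toy sketch of how `%` and `>>` could be built with Python operator overloading. The `Stage` class and the two stand-in stages are hypothetical and are not PyTerrier's implementation; the sketch only mirrors the behaviour described above.

```python
class Stage:
    """A toy pipeline stage wrapping a function from rows to rows."""

    def __init__(self, fn, name):
        self.fn = fn
        self.name = name

    def __call__(self, rows):
        return self.fn(rows)

    def __rshift__(self, other):
        # a >> b: feed the output of a into b
        return Stage(lambda rows: other(self(rows)), f"({self.name} >> {other.name})")

    def __mod__(self, k):
        # a % k: keep only the k highest-scoring rows produced by a
        def cutoff(rows):
            ranked = sorted(self(rows), key=lambda r: r["score"], reverse=True)
            return ranked[:k]
        return Stage(cutoff, f"({self.name} % {k})")

# Hypothetical first stage: returns four scored documents for any input.
retriever = Stage(
    lambda rows: [
        {"docno": "a", "score": 3.0},
        {"docno": "b", "score": 2.0},
        {"docno": "c", "score": 1.0},
        {"docno": "d", "score": 0.5},
    ],
    "retriever",
)

# Hypothetical reranker: negates the scores, which reverses the order.
reranker = Stage(
    lambda rows: sorted(
        ({**r, "score": -r["score"]} for r in rows),
        key=lambda r: r["score"],
        reverse=True,
    ),
    "reranker",
)

# Python gives % higher precedence than >>, so this parses as
# (retriever % 3) >> reranker.
pipeline = retriever % 3 >> reranker
top = [r["docno"] for r in pipeline([])]
```

Because `%` binds more tightly than `>>`, the cutoff is applied to the retriever before its output reaches the reranker, so document `d` is never reranked.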

In [15]:
monoT5_ret = bm25_ret % 10 >> monoT5

Let's compare the results...

In [16]:
(bm25_ret%3).search("What are chemical reactions?")

| | qid | docid | docno | text | title | rank | score | query |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1027780 | 1027780 | Chemical change Chemical changes occur when a ... | "Chemical change" | 0 | 29.908465 | What are chemical reactions? |
| 1 | 1 | 53321 | 53321 | are called reactants or reagents. Chemical rea... | "Chemical reaction" | 1 | 29.825437 | What are chemical reactions? |
| 2 | 1 | 860125 | 860125 | Chemical reaction A chemical reaction is a pro... | "Chemical reaction" | 2 | 29.526942 | What are chemical reactions? |


In [17]:
(monoT5_ret%3).search("What are chemical reactions?")

| | qid | docid | docno | text | title | query | score | rank |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 860125 | 860125 | Chemical reaction A chemical reaction is a pro... | "Chemical reaction" | What are chemical reactions? | -0.029645 | 0 |
| 0 | 1 | 53321 | 53321 | are called reactants or reagents. Chemical rea... | "Chemical reaction" | What are chemical reactions? | -0.08068 | 1 |
| 2 | 1 | 3147077 | 3147077 | the course of a reaction. Reaction mechanisms ... | Chemistry | What are chemical reactions? | -0.658413 | 2 |


Interestingly, the re-ranking had some impact: 860125 was 3rd under BM25 but became first under monoT5. While order may not matter so much for our readers, the inclusion of 3147077 and removal of 1027780 would likely change the reader's generated answer.
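
The same comparison can be done programmatically. The sketch below hard-codes the `docno` values from the two outputs above; in the notebook you would read them from the `docno` column of each result dataframe instead.

```python
# Top-3 docnos copied from the BM25 and monoT5 outputs above, in rank order.
bm25_top3 = ["1027780", "53321", "860125"]
monot5_top3 = ["860125", "53321", "3147077"]

# Passages that entered or left the top 3 after reranking.
added = [d for d in monot5_top3 if d not in bm25_top3]
removed = [d for d in bm25_top3 if d not in monot5_top3]

# Rank change for passages present in both lists (negative = moved up).
moved = {
    d: monot5_top3.index(d) - bm25_top3.index(d)
    for d in bm25_top3
    if d in monot5_top3
}
```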


You'll see that all of our retrievers give as output the same columns:
 - qid - unique identifier of the question
 - query - text of the question
 - docno - unique identifier of the passage
 - title and text (of the passage)
 - score and rank - which define the ordering of the passages

## Readers

### Fusion in Decoder

Let's now look at the readers that will generate the answers. The first one we use is Fusion in Decoder (FiD) - a T5-based model that encodes each document separately, but combines these representations in the decoder step.

In PyTerrier terms, a reader takes as input the following columns:
 - qid
 - query
 - docno
 - title & text

And returns:
 - qid
 - query
 - qanswer
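
The change of shape, from many passage rows per question to one answer row per question, can be sketched with a stand-in reader. `toy_reader` is hypothetical: it simply returns the title of the top-ranked passage as the "answer", which is not what FiD does, but it consumes and produces the columns listed above.

```python
def toy_reader(rows):
    """Collapse ranked passages into one answer row per question."""
    best = {}
    for row in rows:
        key = (row["qid"], row["query"])
        # remember the top-ranked passage (lowest rank value) per question
        if key not in best or row["rank"] < best[key]["rank"]:
            best[key] = row
    return [
        {"qid": qid, "query": query, "qanswer": row["title"]}
        for (qid, query), row in best.items()
    ]

question = "What are chemical reactions?"
passages = [
    {"qid": "1", "query": question, "docno": "53321", "rank": 1,
     "title": "Chemical reaction",
     "text": "are called reactants or reagents. Chemical rea..."},
    {"qid": "1", "query": question, "docno": "860125", "rank": 0,
     "title": "Chemical reaction",
     "text": "Chemical reaction A chemical reaction is a pro..."},
    {"qid": "1", "query": question, "docno": "3147077", "rank": 2,
     "title": "Chemistry",
     "text": "the course of a reaction. Reaction mechanisms ..."},
]
answers = toy_reader(passages)
```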

We provide a checkpoint trained for NQ on Hugging Face at `terrierteam/t5fid_base_nq`.

We further formulate two RAG pipelines - one using BM25 and one using monoT5 as input to FiD.

In [18]:
import pyterrier_rag.readers
fid = pyterrier_rag.readers.T5FiD("terrierteam/t5fid_base_nq")

bm25_fid = bm25_ret %3 >> fid
monot5_fid = monoT5_ret %3 >> fid

When we invoke search on this pipeline, we now have a qanswer column that contains the answer.

In [19]:
monot5_fid.search("What are chemical reactions?")

| | qid | query | qanswer |
|---|---|---|---|
| 0 | 1 | What are chemical reactions? | chemical equations |


### FlanT5

Our second reader is FlanT5, an instruction-tuned model that we use zero-shot.
We instantiate a backend (`Seq2SeqLMBackend`) and pass it to the `Reader` class.

`Concatenator` takes the document texts and titles (returned by the retrieval part of the pipeline) and puts them into the prompt.
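
As a rough picture of what this step does, the sketch below joins passages into a single context string and substitutes it into a template with `{{ qcontext }}` and `{{ query }}` placeholders. The helpers `concatenate` and `fill_prompt` are hypothetical, and the exact text layout produced by the real `Concatenator` and `PromptTransformer` may well differ.

```python
def concatenate(rows):
    """Join title and text of each passage, in rank order, into one string."""
    ordered = sorted(rows, key=lambda r: r["rank"])
    return "\n".join(f"{r['title']}: {r['text']}" for r in ordered)

def fill_prompt(template, query, qcontext):
    """Substitute the two placeholders with plain string replacement."""
    return template.replace("{{ qcontext }}", qcontext).replace("{{ query }}", query)

template = "Context: {{ qcontext }} \n Question: {{ query }} \n Answer:"
rows = [
    {"rank": 1, "title": "B", "text": "second passage"},
    {"rank": 0, "title": "A", "text": "first passage"},
]
qcontext = concatenate(rows)
prompt_text = fill_prompt(template, "What are chemical reactions?", qcontext)
```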

In [75]:
from pyterrier_rag.backend import Seq2SeqLMBackend
from pyterrier_rag.prompt import Concatenator, PromptTransformer
from pyterrier_rag.readers import Reader

prompt = "Use the context information to answer the Question: \n Context: {{ qcontext }} \n Question: {{ query }} \n Answer:"
prompt = PromptTransformer(instruction=prompt, system_message='')

flant5 = Reader(Seq2SeqLMBackend('google/flan-t5-base'), prompt=prompt)
monoT5_flant5 = bm25_ret % 10 >> monoT5 %3 >> Concatenator() >> flant5
results_flant5 = monoT5_flant5.search("What are chemical reactions?")
results_flant5

| | prompt | qid | query_0 | qanswer |
|---|---|---|---|---|
| 0 | \n### Human: Use the context information to an... | 1 | What are chemical reactions? | a process that leads to the chemical transform... |


It is interesting to see that FlanT5's answer is a bit longer and more detailed.

# Datasets & Experiments

Let's compare the effectiveness of these three approaches on the Natural Questions dataset. The topics are downloaded automatically.

In [31]:
dataset = pt.get_dataset('rag:nq')
dataset.get_topics('dev').head(2)

| | qid | query |
|---|---|---|
| 0 | dev_0 | who sings does he love me with reba |
| 1 | dev_1 | how many pages is invisible man by ralph ellison |


And their corresponding gold-standard answers:

In [32]:
dataset.get_topics('dev').head(2).merge(dataset.get_answers('dev'))

| | qid | query | gold_answer |
|---|---|---|---|
| 0 | dev_0 | who sings does he love me with reba | Linda Davis |
| 1 | dev_1 | how many pages is invisible man by ralph ellison | 581 (second edition) |


Now let's run an experiment using Natural Questions.

The first four arguments correspond closely to the main details of our experiment: we're going to compare `bm25_fid`, `monot5_fid` and `monoT5_flant5` on 100 dev topics (this takes about 2 minutes). We'll evaluate our answers using Exact Match (EM) and F1.

The additional arguments are:
 - `batch_size` - how many queries to run and evaluate at once. Not always necessary, but makes the progress bars more granular
 - `verbose` - display progress bars for this experiment
 - `precompute_prefix` - optimise the experiment such that BM25 is only computed once.
 - `names` - for naming rows in the output dataframe
 - `baseline` - we'll compare to monoT5 with FiD, to see how much it helps compared to BM25, and how much FlanT5 does better than FiD.
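
For readers unfamiliar with the two measures, here is a sketch of Exact Match and token-level F1 as they are commonly defined for question answering. The normalisation used here (lowercasing, removing punctuation and English articles) is an assumption borrowed from common QA evaluation practice; `pyterrier_rag.measures.EM` and `pyterrier_rag.measures.F1` may differ in detail.

```python
import re
import string
from collections import Counter

def normalise(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1.0 if the normalised prediction equals any normalised gold answer."""
    return float(any(normalise(prediction) == normalise(g) for g in gold_answers))

def f1(prediction, gold_answers):
    """Best token-overlap F1 between the prediction and any gold answer."""
    pred_tokens = normalise(prediction).split()
    best = 0.0
    for gold in gold_answers:
        gold_tokens = normalise(gold).split()
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

em_full = exact_match("linda davis", ["Linda Davis"])
em_partial = exact_match("Davis", ["Linda Davis"])
f1_partial = f1("Davis", ["Linda Davis"])
```

A partially correct answer such as "Davis" scores 0 on EM but receives partial credit under F1, which is why the two measures can disagree.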

In [76]:
pt.Experiment(
    [bm25_fid, monot5_fid, monoT5_flant5],
    dataset.get_topics('dev').head(100), # NB: remove .head(100) to run on all dev topics
    dataset.get_answers('dev'),
    [pyterrier_rag.measures.F1, pyterrier_rag.measures.EM],
    batch_size=25,
    verbose=True,
    precompute_prefix=True,
    names=['bm25 fid', 'monoT5_fid', 'monoT5 FlanT5 0z'],
    baseline=1
)

Precomputing results of 100 topics on shared pipeline component (pt.apply.query() >> TerrierRetr(BM25) >> pt.apply.generic())
  warn("precompute_prefix with batch_size is very experimental. Please report any problems")
pt.Experiment precomputation: 100%|██████████| 4/4 [00:20<00:00,  5.11s/batches]
pt.Experiment: 100%|██████████| 12/12 [01:50<00:00,  9.18s/batches]


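| | name | EM | F1 | EM + | EM - | EM p-value | F1 + | F1 - | F1 p-value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | bm25 fid | 0.17 | 0.265524 | 3.0 | 11.0 | 0.031801 | 4.0 | 18.0 | 0.020276 |
| 1 | monoT5_fid | 0.25 | 0.351333 | | | | | | |
| 2 | monoT5 FlanT5 0z | 0.21 | 0.272857 | 4.0 | 8.0 | 0.250199 | 6.0 | 18.0 | 0.024181 |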


From the results, we can see that monoT5 with FiD was the most effective answer generator, with 0.35 F1 and 0.25 EM. Applying monoT5 to rerank the top 10 passages of BM25 improved the answers to 18 questions compared to raw BM25 (see the F1 - column of the `bm25 fid` row, which counts the questions where that system scores below the baseline). The improvement brought by monoT5 is significant for both F1 and EM (p < 0.05; see the p-value columns).

FlanT5 gave a better answer than FiD for 6 questions (F1 +), but a worse one for 18 (F1 -). FiD is likely better because it has been fine-tuned on the NQ dataset, while FlanT5 is used zero-shot. According to a paired t-test, the difference is significant for F1 (p = 0.024) but not for EM (p = 0.25).
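
The `+` and `-` columns and the paired t-test all derive from per-query score differences against the baseline. The sketch below shows that calculation on made-up scores for five questions; the numbers are illustrative only and are not taken from the experiment. It stops at the t statistic, since turning it into a p-value requires the t distribution, which the Python standard library does not provide.

```python
import math
from statistics import mean, stdev

def compare_to_baseline(system, baseline):
    """Count per-query wins and losses and compute the paired t statistic."""
    diffs = [s - b for s, b in zip(system, baseline)]
    improved = sum(d > 0 for d in diffs)   # the "+" column
    degraded = sum(d < 0 for d in diffs)   # the "-" column
    # t = mean(d) / (sd(d) / sqrt(n)), with n - 1 degrees of freedom
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
    return improved, degraded, t_stat

# Made-up per-query F1 scores for five questions (illustrative only).
baseline_f1 = [1.0, 0.5, 0.0, 1.0, 0.0]
system_f1 = [1.0, 0.0, 0.5, 0.0, 0.0]

improved, degraded, t_stat = compare_to_baseline(system_f1, baseline_f1)
```

Queries where both systems score the same count towards neither column, which is why the `+` and `-` counts in the table do not add up to 100.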

FiD can take more than just 3 passages, because it encodes each passage separately rather than packing them all into one input. Let's see how well it does with 100 passages selected by monoT5 from the top 200 of BM25. This experiment takes about 10 minutes on a Colab T4 GPU (the run logged below took about 22 minutes). Again, the BM25 results are pre-computed and reused for both pipelines.

In [34]:
pt.Experiment(
    [monot5_fid, bm25_ret % 200 >> monoT5 % 100 >> fid],
    dataset.get_topics('dev').head(100), # NB: remove .head(100) to run on all dev topics
    dataset.get_answers('dev'),
    [pyterrier_rag.measures.F1, pyterrier_rag.measures.EM],
    precompute_prefix=True,
    names=['monoT5 3p FiD', 'monoT5 100p FiD'],
    baseline=0
)

Precomputing results of 100 topics on shared pipeline component (pt.apply.query() >> TerrierRetr(BM25) >> pt.apply.generic())
  warn("precompute_prefix with batch_size is very experimental. Please report any problems")
pt.Experiment precomputation: 100%|██████████| 4/4 [00:06<00:00,  1.62s/batches]
pt.Experiment: 100%|██████████| 8/8 [22:01<00:00, 165.21s/batches]


| | name | EM | F1 | EM + | EM - | EM p-value | F1 + | F1 - | F1 p-value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | monoT5 3p FiD | 0.25 | 0.351333 | | | | | | |
| 1 | monoT5 100p FiD | 0.4 | 0.547905 | 20.0 | 5.0 | 0.002304 | 29.0 | 8.0 | 1.3e-05 |


Great! According to F1, giving FiD 100 passages yields a significant improvement $(p<0.05)$ over 3 passages (0.55 vs. 0.35), improving the answers of 29 queries and degrading 8. EM also improves significantly (0.40 vs. 0.25).



# That's all folks.

There are lots of other retrievers possible in PyTerrier - for instance [query expansion](https://pyterrier.readthedocs.io/en/latest/rewrite.html), [doc2query](https://github.com/terrierteam/pyterrier_doc2query), [SPLADE learned sparse](https://github.com/cmacdonald/pyt_splade) or [dense retrieval](https://github.com/terrierteam/pyterrier_dr) (including the [ColBERT](https://github.com/terrierteam/pyterrier_colbert) multi-representation dense model).

PyTerrier-RAG also provides easy access to lots of other datasets.