# Sparse Retrieval for Natural Questions RAG

This notebook demonstrates [PyTerrier](http://github.com/terrier-org/pyterrier) and [PyTerrier-RAG](https://github.com/terrierteam/pyterrier_rag). It runs on Google Colab if you select a runtime with a T4 GPU (these are free to use), and takes approximately 12 minutes to execute fully.

## Installation

Let's install what we need:
 - PyTerrier - core platform
 - PyTerrier_t5 - MonoT5 reranker
 - pyterrier_rag - Support for RAG datasets and answer generators (aka readers)

In [None]:
%pip install -q vllm pyterrier_t5

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pyterrier-rag 0.1.0 requires transformers==4.44.2, but you have transformers 4.51.3 which is incompatible.
Note: you may need to restart the kernel to use updated packages.


In [None]:
%pip install -q pyterrier-rag

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
vllm 0.8.3 requires transformers>=4.51.0, but you have transformers 4.44.2 which is incompatible.
Note: you may need to restart the kernel to use updated packages.


In [3]:
import pyterrier as pt
import pyterrier_rag


## Retrievers

Let's load a sparse index of Wikipedia. Conveniently, we've stored this as a [Hugging Face dataset](https://huggingface.co/datasets/pyterrier/ragwiki-terrier). It is 12GB in size (it also contains the text of the documents), so downloading takes about 10 minutes on Google Colab.

We'll use that index to get a BM25 retriever (you can see how this was created at https://huggingface.co/datasets/pyterrier/ragwiki-terrier#reproduction). 

Terrier doesn't like question marks in queries, so we'll strip these before retrieval and restore the original query afterwards.
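
The strip-and-restore idea can be sketched in plain Python. This is only an illustration, not PyTerrier's implementation: the helper names `strip_for_retrieval` and `restore_query`, and the `query_0` key used to stash the original text, are assumptions made for this example.

```python
import re

def strip_for_retrieval(row):
    """Stash the original query and return a copy with punctuation removed."""
    cleaned = re.sub(r"[^\w\s]", " ", row["query"])  # drop '?' and other punctuation
    cleaned = " ".join(cleaned.split())               # collapse repeated whitespace
    return {**row, "query_0": row["query"], "query": cleaned}

def restore_query(row):
    """Put the stashed original query back and drop the stash (hypothetical helper)."""
    restored = {k: v for k, v in row.items() if k != "query_0"}
    restored["query"] = row["query_0"]
    return restored

topic = {"qid": "1", "query": "What are chemical reactions?"}
stripped = strip_for_retrieval(topic)
print(stripped["query"])   # What are chemical reactions
restored = restore_query(stripped)
print(restored["query"])   # What are chemical reactions?
```

Keeping the original text alongside the cleaned one means later stages, such as an answer generator, still see the question exactly as it was asked.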

Finally, let's make a monoT5 reranker, which we can use to rerank the BM25 results.


In [None]:
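# Download (and cache) the prebuilt Terrier index of Wikipedia from Hugging Face
sparse_index = pt.Artifact.from_hf('pyterrier/ragwiki-terrier')
# tokenise() cleans the query for Terrier; reset() restores the original query afterwards
bm25_ret = pt.rewrite.tokenise() >> sparse_index.bm25(include_fields=['docno', 'text', 'title'], threads=5) >> pt.rewrite.reset()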

In [6]:
(bm25_ret%3).search("What are chemical reactions?")

| | qid | docid | docno | text | title | rank | score | query |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1027780 | 1027780 | Chemical change Chemical changes occur when a ... | "Chemical change" | 0 | 29.908465 | What are chemical reactions? |
| 1 | 1 | 53321 | 53321 | are called reactants or reagents. Chemical rea... | "Chemical reaction" | 1 | 29.825437 | What are chemical reactions? |
| 2 | 1 | 860125 | 860125 | Chemical reaction A chemical reaction is a pro... | "Chemical reaction" | 2 | 29.526942 | What are chemical reactions? |


## Readers

In [None]:
from pyterrier_rag.backend import VLLMBackend
from pyterrier_rag.prompt import Concatenator
from pyterrier_rag.readers import Reader

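# Serve the generator with vLLM; the model weights are downloaded on first use
model_name_or_path = 'mistralai/Mistral-7B-v0.1'
mistral = VLLMBackend(model_name_or_path)

# Retrieve the top 3 passages, concatenate them into a context, then generate an answer
mistral_reader = Reader(mistral)
bm25_mistral = bm25_ret % 3 >> Concatenator() >> mistral_reader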
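
To give a feel for what happens between retrieval and generation in the pipeline above, here is a rough plain-Python sketch of joining the top-ranked passages for each query into one context string and wrapping it in a prompt. It is not pyterrier_rag's `Concatenator`: the functions `build_contexts` and `build_prompt`, the prompt wording, and the toy passages are all assumptions for illustration.

```python
from collections import defaultdict

def build_contexts(results, top_k=3):
    """Group ranked results by qid and join the top_k passages into one string per query."""
    by_query = defaultdict(list)
    for row in sorted(results, key=lambda r: (r["qid"], r["rank"])):
        if len(by_query[row["qid"]]) < top_k:
            by_query[row["qid"]].append(f'{row["title"]}: {row["text"]}')
    return {qid: "\n".join(passages) for qid, passages in by_query.items()}

def build_prompt(query, context):
    """Hypothetical prompt template; real readers may format this differently."""
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Made-up rows with the same columns as the retriever output (qid, rank, title, text)
results = [
    {"qid": "1", "rank": 1, "title": "Chemical reaction", "text": "passage B"},
    {"qid": "1", "rank": 0, "title": "Chemical change", "text": "passage A"},
]
contexts = build_contexts(results, top_k=3)
prompt = build_prompt("What are chemical reactions?", contexts["1"])
print(prompt)
```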

## Datasets & Experiments

Let's compare the effectiveness of this approach on the Natural Questions dataset. The topics are downloaded automatically.

In [11]:
dataset = pt.get_dataset('rag:nq')
dataset.get_topics('dev').head(2)

| | qid | query |
|---|---|---|
| 0 | dev_0 | who sings does he love me with reba |
| 1 | dev_1 | how many pages is invisible man by ralph ellison |


And their corresponding gold answers:

In [12]:
dataset.get_topics('dev').head(2).merge(dataset.get_answers('dev'))

| | qid | query | gold_answer |
|---|---|---|---|
| 0 | dev_0 | who sings does he love me with reba | Linda Davis |
| 1 | dev_1 | how many pages is invisible man by ralph ellison | 581 (second edition) |


Now let's run an experiment using Natural Questions.

The first four arguments correspond closely to the main details of our experiment: we're going to compare `bm25_fid`, `monot5_fid`, and `monoT5_flant5` on 100 dev topics (this takes about 2 minutes). We'll evaluate our answers using Exact Match and F1.
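
For readers unfamiliar with these two measures, here is a minimal sketch of how Exact Match and token-level F1 are commonly computed for question answering. It assumes SQuAD-style normalisation (lowercasing, dropping punctuation and English articles) and a single gold answer per question; `pyterrier_rag.measures` may normalise or aggregate differently, so treat this as an illustration rather than the library's code.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalised strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalisation."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Linda Davis", "linda davis"))     # 1.0
print(f1("the singer Linda Davis", "Linda Davis"))   # roughly 0.8
```

Exact Match rewards only answers that match the gold string after normalisation, while F1 gives partial credit when a longer generated answer contains the gold tokens.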

The additional arguments are:
 - `batch_size` - how many queries to run and evaluate at once. Not always necessary, but makes the progress bars more granular
 - `verbose` - display progress bars for this experiment
 - `precompute_prefix` - optimise the experiment such that BM25 is only computed once.
 - `names` - for naming rows in the output dataframe
 - `baseline` - we'll compare to monoT5 with FiD, to see how much it helps compared to BM25, and how much FlanT5 does better than FiD.
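
When a baseline is set, the experiment output gains `+` and `-` columns for each measure. A simplified sketch of what such counts represent, assuming they are the number of queries on which a system scores above or below the baseline, looks like this. The function `compare_to_baseline` and the per-query scores are made up for illustration, and the p-value (a paired significance test over the same per-query scores) is left out.

```python
def compare_to_baseline(system_scores, baseline_scores):
    """Count queries improved/degraded versus the baseline and report the mean difference."""
    improved = sum(1 for qid, s in system_scores.items() if s > baseline_scores[qid])
    degraded = sum(1 for qid, s in system_scores.items() if s < baseline_scores[qid])
    mean_delta = sum(
        system_scores[qid] - baseline_scores[qid] for qid in system_scores
    ) / len(system_scores)
    return improved, degraded, mean_delta

# Hypothetical per-query F1 scores for a baseline and one other system
baseline = {"q1": 1.0, "q2": 0.5, "q3": 0.0, "q4": 1.0}
system = {"q1": 1.0, "q2": 0.0, "q3": 0.5, "q4": 0.0}

improved, degraded, mean_delta = compare_to_baseline(system, baseline)
print(improved, degraded, mean_delta)   # 1 2 -0.25
```

Looking at improved and degraded counts alongside the mean helps show whether a difference comes from many small changes or a few large ones.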

In [None]:
pt.Experiment(
    [bm25_mistral],
    dataset.get_topics('dev').head(100), # NB: remove .head(100) to run on all dev topics
    dataset.get_answers('dev'),
    [pyterrier_rag.measures.F1, pyterrier_rag.measures.EM],
    batch_size=25,
    verbose=True,
    precompute_prefix=True,
    names=['bm25 mistral'],
)

Precomputing results of 100 topics on shared pipeline component (pt.apply.query() >> TerrierRetr(BM25) >> pt.apply.generic())
  warn("precompute_prefix with batch_size is very experimental. Please report any problems")

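| | name | F1 | EM | F1 + | F1 - | F1 p-value | EM + | EM - | EM p-value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | bm25 fid | 0.265524 | 0.17 | 4.0 | 18.0 | 0.020276 | 3.0 | 11.0 | 0.031801 |
| 1 | monoT5_fid | 0.351333 | 0.25 | | | | | | |
| 2 | monoT5 FlanT5 0z | 0.298913 | 0.21 | 5.0 | 13.0 | 0.061812 | 3.0 | 7.0 | 0.207501 |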


# That's all folks.

There are lots of other retrievers possible in PyTerrier - for instance [query expansion](https://pyterrier.readthedocs.io/en/latest/rewrite.html), [doc2query](https://github.com/terrierteam/pyterrier_doc2query), [SPLADE learned sparse](https://github.com/cmacdonald/pyt_splade) or [dense retrieval](https://github.com/terrierteam/pyterrier_dr) (including the [ColBERT](https://github.com/terrierteam/pyterrier_colbert) multi-representation dense model).

PyTerrier-RAG also provides easy access to lots of other datasets.