# Sparse Retrieval for Natural Questions RAG

This notebook demonstrates [PyTerrier](http://github.com/terrier-org/pyterrier) and [PyTerrier-RAG](https://github.com/terrierteam/pyterrier_rag). It runs on Google Colab if you select a runtime with a T4 GPU (these are free to use), and takes approximately 12 minutes to execute fully.

## Installation

Let's install what we need:
 - PyTerrier - core platform
 - PyTerrier_t5 - MonoT5 reranker
 - pyterrier_rag - Support for RAG datasets and answer generators (aka readers)

In [1]:
%pip install -q --root-user-action=ignore vllm pyterrier_t5

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install -q --root-user-action=ignore pyterrier-rag

Note: you may need to restart the kernel to use updated packages.


In [3]:
import pyterrier as pt
import pyterrier_rag

## Retrievers

Let's load a sparse index of Wikipedia. Conveniently, we've stored this as a [Hugging Face dataset](https://huggingface.co/datasets/pyterrier/ragwiki-terrier). It is 12GB in size (it also contains the text of the documents); downloading takes about 10 minutes on Google Colab.

We'll use that index to get a BM25 retriever (you can see how this was created at https://huggingface.co/datasets/pyterrier/ragwiki-terrier#reproduction). 

Terrier doesn't like question marks in queries, so we'll strip them before retrieval and restore the original query afterwards.
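To illustrate the strip-and-restore step, here is a minimal pure-Python sketch of the idea behind `pt.rewrite.tokenise()` and `pt.rewrite.reset()`. It assumes the original query is stashed under a `query_0` key and restored later; the helper names `tokenise_query` and `reset_query` are hypothetical, and the real transformers operate on dataframes and may differ in details such as case handling.

```python
import re

def tokenise_query(row):
    """Save the original query under 'query_0', then keep only alphanumeric tokens."""
    cleaned = " ".join(re.findall(r"[A-Za-z0-9]+", row["query"]))
    return {**row, "query_0": row["query"], "query": cleaned}

def reset_query(row):
    """Restore the query saved by tokenise_query and drop the stashed copy."""
    restored = dict(row)
    restored["query"] = restored.pop("query_0")
    return restored

row = {"qid": "1", "query": "What are chemical reactions?"}
tokenised = tokenise_query(row)   # this form would be sent to the retriever
restored = reset_query(tokenised) # later stages see the original question again
print(tokenised["query"])
print(restored["query"])
```

Restoring the original text matters for RAG: the reader should see the question as the user wrote it, not the form that was cleaned for the retriever.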

Finally, `pyterrier_t5` provides a monoT5 reranker that can be used to rerank the BM25 results; the pipelines in this notebook use BM25 alone.


In [4]:
sparse_index = pt.Artifact.from_hf('pyterrier/ragwiki-terrier')
bm25_ret = pt.rewrite.tokenise() >> sparse_index.bm25(include_fields=['docno', 'text', 'title'], threads=5) >> pt.rewrite.reset()

Java started (triggered by tokenise) and loaded: pyterrier.java, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]


In [5]:
(bm25_ret%3).search("What are chemical reactions?")

| | qid | docid | docno | text | title | rank | score | query |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1027780 | 1027780 | Chemical change Chemical changes occur when a ... | "Chemical change" | 0 | 29.908465 | What are chemical reactions? |
| 1 | 1 | 53321 | 53321 | are called reactants or reagents. Chemical rea... | "Chemical reaction" | 1 | 29.825437 | What are chemical reactions? |
| 2 | 1 | 860125 | 860125 | Chemical reaction A chemical reaction is a pro... | "Chemical reaction" | 2 | 29.526942 | What are chemical reactions? |
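For intuition about the scores in these results, here is a small sketch of a textbook BM25 variant in plain Python. It is illustrative only: Terrier's BM25 implementation differs in details (for example its IDF formulation, tokenisation and stemming), and the function name `bm25_scores` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query with a textbook BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalisation: longer-than-average documents are penalised.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "a chemical reaction is a process".split(),
    "the river flows north".split(),
    "chemical change and chemical reaction".split(),
]
scores = bm25_scores(["chemical", "reaction"], docs)
print(scores)
```

The third toy document scores highest because it is short and repeats a query term, while the second contains no query terms and scores zero.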


## Readers

In [6]:
from pyterrier_rag.backend import VLLMBackend
from pyterrier_rag.prompt import Concatenator
from pyterrier_rag.readers import Reader

model_name_or_path='mistralai/Mistral-7B-v0.1'
mistral = VLLMBackend(model_name_or_path)

mistral_reader = Reader(mistral)
bm25_mistral = bm25_ret %3 >> Concatenator() >> mistral_reader

INFO 07-14 15:06:52 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 07-14 15:06:53 [__init__.py:239] Automatically detected platform cuda.


OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/mistralai/Mistral-7B-v0.1.
403 Client Error. (Request ID: Root=1-68751d0f-27491bc6453b1982366b6641;034cb3c2-8a59-4db0-bd47-2f6b4481fde1)

Cannot access gated repo for url https://huggingface.co/mistralai/Mistral-7B-v0.1/resolve/main/config.json.
Access to model mistralai/Mistral-7B-v0.1 is restricted and you are not in the authorized list. Visit https://huggingface.co/mistralai/Mistral-7B-v0.1 to ask for access.
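The reader pipeline above passes the top three BM25 passages through `Concatenator()` before generation. As a rough sketch of that idea, the following plain-Python function joins retrieved passages into a single prompt. The template, the function name `build_prompt` and the `max_passages` parameter are hypothetical and are not the format that `pyterrier_rag` actually uses.

```python
def build_prompt(question, passages, max_passages=3):
    """Join the top passages into one context block, followed by the question."""
    context = "\n\n".join(
        f"[{i + 1}] {p['title']}: {p['text']}"
        for i, p in enumerate(passages[:max_passages])
    )
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Toy passages standing in for retrieved results (title and text columns).
passages = [
    {"title": "Chemical change", "text": "Chemical changes occur when a substance combines with another."},
    {"title": "Chemical reaction", "text": "The substances initially involved are called reactants."},
    {"title": "Chemical reaction", "text": "A chemical reaction is a process that transforms substances."},
    {"title": "River", "text": "A river is a natural flowing watercourse."},
]
prompt = build_prompt("What are chemical reactions?", passages)
print(prompt)
```

Limiting the number of passages keeps the prompt within the model's context window, which is the same role the `% 3` rank cutoff plays in the pipeline.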

## Datasets & Experiments

Let's measure the effectiveness of this approach on the Natural Questions dataset. The topics are downloaded automatically.

In [None]:
dataset = pt.get_dataset('rag:nq')
dataset.get_topics('dev').head(2)

And their corresponding gold-standard answers:

In [None]:
dataset.get_topics('dev').head(2).merge(dataset.get_answers('dev'))

Now let's run an experiment using Natural Questions.

Here we only have one pipeline, `bm25_mistral`.

The additional arguments are:
 - `batch_size` - how many queries to run and evaluate at once. Not always necessary, but makes the progress bars more granular
 - `verbose` - display progress bars for this experiment
 - `precompute_prefix` - optimise the experiment so that a shared first stage such as BM25 is computed only once. This only helps when several pipelines are compared, so it is not passed below
 - `names` - for naming rows in the output dataframe


In [None]:
pt.Experiment(
    [bm25_mistral],
    dataset.get_topics('dev').head(100), # NB: remove .head(100) to run on all dev topics
    dataset.get_answers('dev'),
    [pyterrier_rag.measures.F1, pyterrier_rag.measures.EM],
    batch_size=25,
    verbose=True,
    names=['bm25 mistral'],
)
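The experiment reports F1 and exact match (EM). As a sketch of what these measures typically compute in question answering, here is a plain-Python version following the common SQuAD-style definitions. The normalisation steps and the function names are assumptions for illustration; the implementations in `pyterrier_rag.measures` may differ in detail.

```python
import re
import string
from collections import Counter

def normalise(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1.0 if the normalised prediction equals any normalised gold answer."""
    return float(any(normalise(prediction) == normalise(g) for g in gold_answers))

def token_f1(prediction, gold_answers):
    """Best token-overlap F1 between the prediction and any gold answer."""
    pred_tokens = normalise(prediction).split()
    best = 0.0
    for gold in gold_answers:
        gold_tokens = normalise(gold).split()
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

em = exact_match("The Eiffel Tower.", ["Eiffel Tower"])
f1 = token_f1("Eiffel Tower in Paris", ["Eiffel Tower"])
print(em, f1)
```

EM rewards only answers that match a gold answer after normalisation, while F1 gives partial credit for overlapping tokens, so verbose but correct generations tend to score better on F1 than on EM.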

## That's all folks.

There are lots of other retrievers possible in PyTerrier - for instance [query expansion](https://pyterrier.readthedocs.io/en/latest/rewrite.html), [doc2query](https://github.com/terrierteam/pyterrier_doc2query), [SPLADE learned sparse](https://github.com/cmacdonald/pyt_splade) or [dense retrieval](https://github.com/terrierteam/pyterrier_dr) (including the [ColBERT](https://github.com/terrierteam/pyterrier_colbert) multi-representation dense model).

PyTerrier-RAG also provides easy access to lots of other datasets.