# Sparse Retrieval for Natural Questions RAG

This notebook demonstrates [PyTerrier](http://github.com/terrier-org/pyterrier) and [PyTerrier-RAG](https://github.com/terrierteam/pyterrier_rag). It runs on Google Colab if you select a runtime with a T4 GPU (these are free to use), and takes approximately 12 minutes to execute fully.

## Installation

Let's install what we need:
 - PyTerrier - core platform
 - PyTerrier_t5 - MonoT5 reranker
 - pyterrier_rag - Support for RAG datasets and answer generators (aka readers)

In [1]:
#%pip install -q --root-user-action=ignore vllm pyterrier_t5

In [2]:
#%pip install -q --root-user-action=ignore pyterrier-rag

In [3]:
import pyterrier as pt
pt.utils.set_tqdm('notebook')
import pyterrier_rag

## Retrievers

Let's load a sparse index of Wikipedia. Conveniently, we've stored this as a [Huggingface dataset](https://huggingface.co/datasets/pyterrier/ragwiki-terrier). It is 12GB in size (it also contains the text of the documents), so downloading takes about 10 minutes on Google Colab.

We'll use that index to get a BM25 retriever (you can see how this was created at https://huggingface.co/datasets/pyterrier/ragwiki-terrier#reproduction). 

Terrier doesn't like question marks in queries, so we'll strip them before retrieval with `pt.rewrite.tokenise()` and restore the original query afterwards with `pt.rewrite.reset()`.
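The strip-and-restore idea can be sketched in plain Python. This is only an illustration of the pattern, not PyTerrier's implementation: the helpers `tokenise_query` and `reset_query` are hypothetical, and keeping the original text in a `query_0` field is an assumption about how the original query is set aside.

```python
import re

def tokenise_query(row: dict) -> dict:
    """Set the original query aside and keep only word characters in `query`."""
    cleaned = " ".join(re.findall(r"\w+", row["query"]))
    # `query_0` is an assumed name for the saved original query
    return {**row, "query_0": row["query"], "query": cleaned}

def reset_query(row: dict) -> dict:
    """Restore the original query that was set aside by tokenise_query."""
    restored = dict(row)
    restored["query"] = restored.pop("query_0")
    return restored

row = {"qid": "1", "query": "What are chemical reactions?"}
stripped = tokenise_query(row)   # the retriever would see this version
restored = reset_query(stripped) # later stages see the original question again
print(stripped["query"])
print(restored["query"])
```

Restoring the query matters for RAG because the reader should be prompted with the question as the user wrote it, not the punctuation-free version handed to the retriever.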

Finally, a monoT5 reranker could be used to rerank the BM25 results; the cells below use BM25 alone.


In [4]:
sparse_index = pt.Artifact.from_hf('pyterrier/ragwiki-terrier')
bm25_ret = pt.rewrite.tokenise() >> sparse_index.bm25(include_fields=['docno', 'text', 'title'], threads=5) >> pt.rewrite.reset()

Java started (triggered by tokenise) and loaded: pyterrier.java, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]


In [5]:
(bm25_ret%3).search("What are chemical reactions?")

| | qid | docid | docno | text | title | rank | score | query |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1027780 | 1027780 | Chemical change Chemical changes occur when a ... | "Chemical change" | 0 | 29.908465 | What are chemical reactions? |
| 1 | 1 | 53321 | 53321 | are called reactants or reagents. Chemical rea... | "Chemical reaction" | 1 | 29.825437 | What are chemical reactions? |
| 2 | 1 | 860125 | 860125 | Chemical reaction A chemical reaction is a pro... | "Chemical reaction" | 2 | 29.526942 | What are chemical reactions? |
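
The `score` column in the results above is a BM25 score. As a rough illustration of how such scores arise, here is a minimal sketch of a textbook Okapi BM25 variant over three toy documents. The function `bm25_scores`, the whitespace tokenisation, the parameter values and the toy documents are all assumptions made for illustration; Terrier's own BM25 may differ in its details (for example stemming, stopword removal and the exact IDF form).

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a textbook BM25 formula."""
    tokenised = [d.lower().split() for d in docs]
    n_docs = len(tokenised)
    avg_len = sum(len(d) for d in tokenised) / n_docs
    # document frequency: in how many documents each term occurs
    df = Counter(term for d in tokenised for term in set(d))
    scores = []
    for d in tokenised:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # length normalisation: longer documents need more matches
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# toy documents, invented for this sketch
docs = [
    "a chemical reaction transforms substances",
    "the football match ended in a draw",
    "chemical change and chemical reaction",
]
scores = bm25_scores("chemical reaction", docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Documents with no query terms score zero, and repeated query terms raise the score with diminishing returns, which is why sparse retrieval only needs an inverted index over terms.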


## Readers

In [6]:
from pyterrier_rag.backend import OpenAIBackend
from pyterrier_rag.prompt import Concatenator
from pyterrier_rag.readers import Reader
from pyterrier_rag.prompt import PromptTransformer, prompt

from fastchat.model import get_conversation_template

In [7]:
model_name = "llama-3-8b-instruct"

In [21]:
system_message = r"""You are an expert Q&A system that is trusted around the world. 
        Always answer the query using the provided context information,
        and not prior knowledge.
        Some rules to follow:
        1. Never directly reference the given context in your answer
        2. Avoid statements like 'Based on the context, ...' or 
        'The context information ...' or anything along those lines."""
prompt_text = """Context information is below.
            ---------------------
            {{ qcontext }}
            ---------------------
            Given the context information and not prior knowledge, answer the query.
            Query: {{ query }}
            Answer: """

template = get_conversation_template("meta-llama-3.1-sp")
prompt = PromptTransformer(
    conversation_template=template,
    system_message=system_message,
    instruction=prompt_text,
    api_type="openai"
)


In [32]:

generation_args={
    "temperature": 0.1,
    "max_tokens": 128,
}

# this could equally be a real OpenAI model
llama = OpenAIBackend(model_name, 
                      api_key="ida_0sjUm0LEoXlnDsdJV19hvyVfpyxvubmGZGBTiops", 
                      generation_args=generation_args,
                      base_url="http://api.llm.apps.os.dcs.gla.ac.uk/v1")

llama_reader = Reader(llama, prompt=prompt)
bm25_llama = bm25_ret % 5 >> Concatenator() >> llama_reader

In [33]:
results = bm25_llama.search("what are chemical reactions?")


In [34]:
print(results.iloc[0].qanswer)

A process that leads to the chemical transformation of one set of chemical substances to another.
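
To make the data flow between retrieval and generation concrete, here is a plain-Python sketch of the two steps before the LLM call: joining the top-ranked passages into one context string, and filling the `{{ qcontext }}` and `{{ query }}` placeholders of the prompt. The helpers `concatenate` and `fill_prompt` are hypothetical and are not part of the pyterrier_rag API; simple string substitution is assumed in place of whatever templating the library actually uses, and the passages are invented placeholders.

```python
def concatenate(docs, max_docs=5):
    """Join the text of the top-ranked documents into one context string."""
    return "\n".join(d["text"] for d in docs[:max_docs])

def fill_prompt(template, query, qcontext):
    """Substitute the two placeholders used by the prompt template."""
    return (template
            .replace("{{ qcontext }}", qcontext)
            .replace("{{ query }}", query))

template = (
    "Context information is below.\n"
    "---------------------\n"
    "{{ qcontext }}\n"
    "---------------------\n"
    "Query: {{ query }}\n"
    "Answer: "
)

# placeholder passages standing in for retrieved documents
docs = [
    {"docno": "d1", "text": "first passage text"},
    {"docno": "d2", "text": "second passage text"},
]

filled = fill_prompt(template, "what are chemical reactions?", concatenate(docs))
print(filled)
```

The filled prompt is what would be sent, together with the system message, to the OpenAI-compatible endpoint.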


## Datasets & Experiments

Let's evaluate the effectiveness of this approach on the Natural Questions dataset. The topics are downloaded automatically.

In [35]:
dataset = pt.get_dataset('rag:nq')
dataset.get_topics('dev').head(2)

| | qid | query |
|---|---|---|
| 0 | dev_0 | who sings does he love me with reba |
| 1 | dev_1 | how many pages is invisible man by ralph ellison |


And their corresponding gold-standard answers:

In [36]:
dataset.get_topics('dev').head(2).merge(dataset.get_answers('dev'))

| | qid | query | gold_answer |
|---|---|---|---|
| 0 | dev_0 | who sings does he love me with reba | Linda Davis |
| 1 | dev_1 | how many pages is invisible man by ralph ellison | 581 (second edition) |


Now let's run an experiment using Natural Questions.

Here we only have one pipeline - `bm25_llama`.

The additional arguments are:
 - `batch_size` - how many queries to run and evaluate at once. Not always necessary, but makes the progress bars more granular
 - `verbose` - display progress bars for this experiment
 - `precompute_prefix` - optimise the experiment such that BM25 is only computed once when several pipelines share it as a common prefix (not passed below, as there is only one pipeline)
 - `names` - for naming rows in the output dataframe
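
The idea behind precomputing a shared prefix can be sketched without PyTerrier. The functions below (`retrieve`, `reader_short`, `reader_long`) are hypothetical stand-ins for an expensive first stage and two different readers; the sketch only shows why running the common first stage once saves work, and is not how `pt.Experiment` is implemented.

```python
retrieval_calls = 0

def retrieve(query):
    """Stand-in for an expensive shared first stage such as BM25."""
    global retrieval_calls
    retrieval_calls += 1
    return [f"passage about {query}"]

def reader_short(query, passages):
    return f"short answer from {len(passages)} passage(s)"

def reader_long(query, passages):
    return f"long answer from {len(passages)} passage(s)"

queries = ["q1", "q2", "q3"]

# Run the shared prefix once per query...
prefix_results = {q: retrieve(q) for q in queries}

# ...then run only the differing suffix of each pipeline on the cached results.
outputs = {
    name: [reader(q, prefix_results[q]) for q in queries]
    for name, reader in [("short", reader_short), ("long", reader_long)]
}
print(retrieval_calls)
```

With two pipelines and three queries, the first stage runs three times rather than six; the saving grows with the number of pipelines that share the prefix.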


In [37]:
pt.Experiment(
    [bm25_llama],
    dataset.get_topics('dev').head(100), # NB: remove .head() to run on all dev topics
    dataset.get_answers('dev'),
    [pyterrier_rag.measures.F1, pyterrier_rag.measures.EM],
    batch_size=25,
    verbose=True,
    names=['bm25 llama'],
)


| | name | F1 | EM |
|---|---|---|---|
| 0 | bm25 llama | 0.312947 | 0.21 |
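
For readers unfamiliar with the two measures: EM (exact match) checks whether the generated answer equals the gold answer after normalisation, and F1 rewards partial token overlap. Below is a minimal sketch assuming SQuAD-style normalisation (lowercasing, removing punctuation and English articles). The functions `normalise`, `exact_match` and `token_f1` are written for illustration; the implementations in `pyterrier_rag.measures` may differ in detail.

```python
import re
import string
from collections import Counter

def normalise(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalised strings are identical, else 0.0."""
    return float(normalise(prediction) == normalise(gold))

def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalise(prediction).split()
    gold_tokens = normalise(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

em_exact = exact_match("Linda Davis.", "linda davis")
em_verbose = exact_match("It was sung by Linda Davis", "Linda Davis")
f1_verbose = token_f1("It was sung by Linda Davis", "Linda Davis")
print(em_exact, em_verbose, f1_verbose)
```

A verbose but correct answer gets no EM credit and only partial F1 credit, which is one reason F1 is typically higher than EM in results like those above.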


## That's all folks.

There are lots of other retrievers possible in PyTerrier - for instance [query expansion](https://pyterrier.readthedocs.io/en/latest/rewrite.html), [doc2query](https://github.com/terrierteam/pyterrier_doc2query), [SPLADE learned sparse](https://github.com/cmacdonald/pyt_splade) or [dense retrieval](https://github.com/terrierteam/pyterrier_dr) (including the [ColBERT](https://github.com/terrierteam/pyterrier_colbert) multi-representation dense model).

PyTerrier-RAG also provides easy access to lots of other datasets.