# Retrieval Augmented Generation

## Configure LLM

In [1]:
import dspy

# Set up the LM.
OPENAI_API_KEY = open("../.secrets/openai-api_key.txt").read().strip()  # strip any trailing newline
turbo = dspy.OpenAI(model='gpt-3.5-turbo-instruct', max_tokens=250, api_key=OPENAI_API_KEY)
# configure language model
dspy.settings.configure(lm=turbo)
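
A key file often ends with a newline, and some setups keep the key in an environment variable instead. The sketch below shows one way to handle both cases. The helper name `load_api_key` and the environment-variable fallback are assumptions for illustration, not part of the original notebook or of DSPy.

```python
import os
from pathlib import Path


def load_api_key(path: str, env_var: str = "OPENAI_API_KEY") -> str:
    """Hypothetical helper: prefer the environment variable, else read the file."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fall back to the key file and drop surrounding whitespace/newlines.
        key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"No API key found in ${env_var} or {path}")
    return key


# Possible usage with the cell above (left commented out so the sketch has no side effects):
# OPENAI_API_KEY = load_api_key("../.secrets/openai-api_key.txt")
```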


## Set Up ColBERTv2 Retriever

This notebook uses a free server that hosts a ColBERTv2 search index over Wikipedia 2017 "abstracts", that is, the first paragraph of each article in the 2017 dump.

In [2]:
colbertv2 = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
# configure retrieval model
dspy.settings.configure(rm=colbertv2)

## Load Dataset

In [3]:
from dspy.datasets import HotPotQA

# Load the dataset.
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=69, dev_size=50, test_size=0)
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]



In [8]:
trainset[3].question, trainset[3].answer

('The Victorians - Their Story In Pictures is a documentary series written by an author born in what year?',
 '1950')
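
The `with_inputs('question')` call in the loading cell marks `question` as the input field, so the remaining fields (here `answer`) act as labels. The toy class below sketches that idea only. `ToyExample` is a hypothetical stand-in written for illustration and is not DSPy's implementation.

```python
class ToyExample:
    """Toy stand-in for a dataset example (not DSPy's actual class)."""

    def __init__(self, **fields):
        self._fields = dict(fields)
        self._input_keys = set()

    def with_inputs(self, *keys):
        # Return a copy whose listed keys are treated as inputs.
        copy = ToyExample(**self._fields)
        copy._input_keys = set(keys)
        return copy

    def inputs(self):
        return {k: v for k, v in self._fields.items() if k in self._input_keys}

    def labels(self):
        return {k: v for k, v in self._fields.items() if k not in self._input_keys}


example = ToyExample(
    question="The Victorians - Their Story In Pictures is a documentary series "
             "written by an author born in what year?",
    answer="1950",
).with_inputs("question")

print(example.inputs())  # only the question
print(example.labels())  # only the answer
```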

## Define Signature

Some advanced tasks need more verbose signatures, which DSPy lets you define as classes. A class-based signature lets you:

- Clarify the nature of the task, expressed as the class docstring
- Supply hints about an input field, expressed as the `desc` keyword argument of `dspy.InputField`
- Supply constraints on an output field, expressed as the `desc` keyword argument of `dspy.OutputField`

In [9]:
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

## Build the Pipeline

1. Declare the sub-modules in `__init__`
2. Use the sub-modules in `forward` to define the control flow

In [10]:
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        # Works because a retrieval model was configured via dspy.settings earlier.
        self.retrieve = dspy.Retrieve(k=num_passages)
        # Use the GenerateAnswer signature defined in the previous step.
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        # Retrieve the top `num_passages` passages (3 by default) from ColBERTv2.
        context = self.retrieve(question).passages
        # Use the chain-of-thought module to answer the question given the context.
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)
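
To see the same two-step structure without a language model or a retrieval server, here is a minimal pure-Python sketch. Everything in it is an assumption made for illustration: `ToyRetriever` ranks passages by word overlap (not ColBERTv2), the generator is a stub that returns the title of the top passage, and the three-passage corpus is made up.

```python
import re
from typing import Callable, Dict, List


def tokenize(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


class ToyRetriever:
    """Ranks passages by word overlap with the query (illustrative only)."""

    def __init__(self, corpus: List[str], k: int = 3):
        self.corpus = corpus
        self.k = k

    def __call__(self, query: str) -> List[str]:
        query_words = tokenize(query)
        ranked = sorted(
            self.corpus,
            key=lambda passage: len(query_words & tokenize(passage)),
            reverse=True,
        )
        return ranked[: self.k]


class ToyRAG:
    def __init__(self, corpus: List[str], generate: Callable, num_passages: int = 3):
        # Step 1: declare the sub-modules.
        self.retrieve = ToyRetriever(corpus, k=num_passages)
        self.generate = generate

    def forward(self, question: str) -> Dict:
        # Step 2: use the sub-modules to define the control flow.
        context = self.retrieve(question)
        answer = self.generate(context, question)
        return {"context": context, "answer": answer}


def stub_generate(context: List[str], question: str) -> str:
    # Stand-in for the LM call: return the title of the best passage.
    return context[0].split(" | ")[0]


corpus = [
    "Paris | Paris is the capital of France.",
    "Berlin | Berlin is the capital of Germany.",
    "Rome | Rome is the capital of Italy.",
]
result = ToyRAG(corpus, stub_generate, num_passages=2).forward("What is the capital of France?")
print(result["answer"])
```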

## Optimize the Pipeline

Optimization, in this case, means collecting and selecting good demonstrations (worked input/output examples) to include in the prompt(s).
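
The idea of collecting demonstrations can be sketched in a few lines: run the program on training examples, keep the runs that pass a metric, and stop once enough have been gathered. This is a simplified illustration under stated assumptions, not DSPy's `BootstrapFewShot` implementation. The function name `bootstrap_demos`, the `max_demos` parameter, and the toy data are all hypothetical.

```python
def bootstrap_demos(program, trainset, metric, max_demos=4):
    """Collect up to `max_demos` examples on which `program` passes `metric`."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        prediction = program(example["question"])
        if metric(example, prediction):
            # Keep the successful run as a demonstration for later prompts.
            demos.append({"question": example["question"], "answer": prediction})
    return demos


# Toy program: a lookup table that gets one answer wrong.
knowledge = {"2 + 2?": "4", "Largest planet?": "Saturn", "Capital of France?": "Paris"}


def toy_program(question):
    return knowledge.get(question, "unknown")


toy_trainset = [
    {"question": "2 + 2?", "answer": "4"},
    {"question": "Largest planet?", "answer": "Jupiter"},
    {"question": "Capital of France?", "answer": "Paris"},
]


def toy_metric(example, prediction):
    return example["answer"] == prediction


demos = bootstrap_demos(toy_program, toy_trainset, toy_metric, max_demos=2)
print([d["question"] for d in demos])
```

The failed example ("Largest planet?") is skipped, so only validated runs end up as demonstrations.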



In [11]:
from dspy.teleprompt import BootstrapFewShot

# Validation logic: check that the predicted answer is correct.
# Also check that the retrieved context does actually contain that answer.
def validate_context_and_answer(example, pred, trace=None):
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM and answer_PM

# Set up a basic teleprompter, which will compile our RAG program.
teleprompter = BootstrapFewShot(metric=validate_context_and_answer)

# Compile!
compiled_rag = teleprompter.compile(RAG(), trainset=trainset)

Bootstrapped 4 full traces after 12 examples in round 0.
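
The validation function combines an exact-match check on the answer with a check that the retrieved passages contain it. The sketch below approximates both checks in plain Python, assuming SQuAD-style normalization (lowercasing, dropping punctuation and articles, collapsing whitespace) and a simple substring test. The `toy_*` names are hypothetical, and the real `dspy.evaluate` functions may differ in detail. The sample strings are taken from outputs shown elsewhere in this notebook.

```python
import re
import string
import unicodedata

_PUNCTUATION = set(string.punctuation)


def toy_normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = unicodedata.normalize("NFD", text).lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def toy_exact_match(prediction: str, gold_answers: list) -> bool:
    return toy_normalize(prediction) in {toy_normalize(a) for a in gold_answers}


def toy_passage_match(passages: list, gold_answers: list) -> bool:
    # Simplification: a passage "matches" if it contains a normalized gold answer.
    normalized = [toy_normalize(p) for p in passages]
    return any(toy_normalize(a) in p for a in gold_answers for p in normalized)


print(toy_exact_match("Dancer", ["dancer"]))
print(toy_exact_match(
    "Deloitte and PricewaterhouseCoopers",
    ["PricewaterhouseCoopers, Deloitte Touche Tohmatsu"],
))
print(toy_passage_match(
    ["Give a Girl a Break | Give a Girl a Break is a 1953 musical comedy film"],
    ["1953"],
))
```

Under these assumptions the first check passes despite the capitalization difference, while the second fails because the word order and wording differ from the gold answer.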





## Inference

In [12]:
# Ask any question you like to this simple RAG program.
my_question = "What castle did David Gregory inherit?"

# Get the prediction. This contains `pred.context` and `pred.answer`.
pred = compiled_rag(my_question)

# Print the contexts and the answer.
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts (truncated): {[c[:200] + '...' for c in pred.context]}")

Question: What castle did David Gregory inherit?
Predicted Answer: Kinnairdy Castle
Retrieved Contexts (truncated): ['David Gregory (physician) | David Gregory (20 December 1625 – 1720) was a Scottish physician and inventor. His surname is sometimes spelt as Gregorie, the original Scottish spelling. He inherited Kinn...', 'Gregory Tarchaneiotes | Gregory Tarchaneiotes (Greek: Γρηγόριος Ταρχανειώτης , Italian: "Gregorio Tracanioto" or "Tracamoto" ) was a "protospatharius" and the long-reigning catepan of Italy from 998 t...', 'David Gregory (mathematician) | David Gregory (originally spelt Gregorie) FRS (? 1659 – 10 October 1708) was a Scottish mathematician and astronomer. He was professor of mathematics at the University ...']


In [13]:
turbo.inspect_history(n=1)




Answer questions with short factoid answers.

---

Question: At My Window was released by which American singer-songwriter?
Answer: John Townes Van Zandt

Question: "Everything Has Changed" is a song from an album released under which record label ?
Answer: Big Machine Records

Question: The Victorians - Their Story In Pictures is a documentary series written by an author born in what year?
Answer: 1950

Question: Which Pakistani cricket umpire who won 3 consecutive ICC umpire of the year awards in 2009, 2010, and 2011 will be in the ICC World Twenty20?
Answer: Aleem Sarwar Dar

Question: Having the combination of excellent foot speed and bat speed helped Eric Davis, create what kind of outfield for the Los Angeles Dodgers?
Answer: "Outfield of Dreams"

Question: Who is older, Aleksandr Danilovich Aleksandrov or Anatoly Fomenko?
Answer: Aleksandr Danilovich Aleksandrov

Question: The Organisation that allows a community to influence their operation or use and to enjoy the benefits a

In [14]:
# Ask any question you like to this simple RAG program.
my_question = "Who is the first president of the USA?"

# Get the prediction. This contains `pred.context` and `pred.answer`.
pred = compiled_rag(my_question)

# Print the contexts and the answer.
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts (truncated): {[c[:200] + '...' for c in pred.context]}")

Question: Who is the first president of the USA?
Predicted Answer: George Washington
Retrieved Contexts (truncated): ['George Washington (disambiguation) | George Washington (1732–1799) was the first President of the United States....', 'Lansdowne portrait | The Lansdowne portrait is an iconic oil-on-canvas portrait of George Washington, the first President of the United States. The portrait was painted by Gilbert Stuart on April 12, ...', 'His Excellency: George Washington | His Excellency: George Washington is a 2004 biography of the first President of the United States, General George Washington. It was written by Joseph Ellis, a prof...']


In [15]:
# Ask any question you like to this simple RAG program.
my_question = "Who is the first female prime minister of India?"

# Get the prediction. This contains `pred.context` and `pred.answer`.
pred = compiled_rag(my_question)

# Print the contexts and the answer.
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts (truncated): {[c[:200] + '...' for c in pred.context]}")

Question: Who is the first female prime minister of India?
Predicted Answer: Indira Gandhi
Retrieved Contexts (truncated): ['Indira Gandhi | Indira Priyadarshini Gandhi (; Nehru; 19 November 1917 – 31 October 1984) was an Indian politician and central figure of the Indian National Congress party. She was the first and to da...', 'First Indira Gandhi ministry | Indira Gandhi was sworn in as Prime Minister of India for the first time on 24 January 1966. In her ministry, the ministers were as follows....', 'Manorama Madhwaraj | Manorama Madhwaraj was the first woman cabinet minister of India elected as an MLA in the 5th, 8th and 9th Karnataka Legislative Assembly. On all of these occasions was elected fr...']


## Inspect Learned Parameters

In [16]:
for name, parameter in compiled_rag.named_predictors():
    print(name)
    print(parameter.demos[0])

generate_answer
Example({'augmented': True, 'context': ['Tae Kwon Do Times | Tae Kwon Do Times is a magazine devoted to the martial art of taekwondo, and is published in the United States of America. While the title suggests that it focuses on taekwondo exclusively, the magazine also covers other Korean martial arts. "Tae Kwon Do Times" has published articles by a wide range of authors, including He-Young Kimm, Thomas Kurz, Scott Shaw, and Mark Van Schuyver.', "Kwon Tae-man | Kwon Tae-man (born 1941) was an early Korean hapkido practitioner and a pioneer of the art, first in Korea and then in the United States. He formed one of the earliest dojang's for hapkido in the United States in Torrance, California, and has been featured in many magazine articles promoting the art.", 'Hee Il Cho | Cho Hee Il (born October 13, 1940) is a prominent Korean-American master of taekwondo, holding the rank of 9th "dan" in the martial art. He has written 11 martial art books, produced 70 martial art tra

## Evaluate Pipeline

In [18]:
from dspy.evaluate.evaluate import Evaluate

# Set up the `evaluate_on_hotpotqa` function. We'll use this many times below.
evaluate_on_hotpotqa = Evaluate(devset=devset, num_threads=1, display_progress=False, display_table=5)

# Evaluate the `compiled_rag` program with the `answer_exact_match` metric.
metric = dspy.evaluate.answer_exact_match
evaluate_on_hotpotqa(compiled_rag, metric=metric)

| | question | example_answer | gold_titles | context | pred_answer | answer_exact_match |
|---|---|---|---|---|---|---|
| 0 | Give a Girl a Break featured Marge Champion as what kind of talent? | dancer | {'Marge Champion', 'Give a Girl a Break'} | ['Give a Girl a Break \| Give a Girl a Break is a 1953 musical comedy film directed by Stanley Donen, starring Debbie Reynolds and... | Dancer | ✔️ [True] |
| 1 | In what year was the film that Steve Hoban is best known for released? | 2009 | {'Splice (film)', 'Steve Hoban'} | ['Steve Hoban \| Steven "Steve" Hoban (born 1964) is a Canadian film producer. He has been nominated for three Genie Awards and won another. He... | 2009 | ✔️ [True] |
| 2 | What football player left the Superliga side OB when he won Goalkeeper of the Year award in Denmark and Norway in 2010? | Anders Lindegaard | {'Sten Grytebust', 'Anders Lindegaard'} | ['Anders Lindegaard \| Anders Rozenkrantz Lindegaard (] ; born 13 April 1984) is a Danish footballer who plays for English club Burnley as a goalkeeper.... | Anders Lindegaard | ✔️ [True] |
| 3 | Álvaro Raposo De Oliveira was born in a city located in the valleys of which rivers ? | the Chillón, Rímac and Lurín rivers | {'Lima', 'Álvaro Raposo de Oliveira'} | ['Álvaro Raposo de Oliveira \| Álvaro Raposo De Oliveira (born September 5, 1990, in Lima) is a Peru professional tennis player.', 'Miguel Sutil \| Miguel... | Amazon River | False |
| 4 | What are the accounting companies that make up the "Big Four" excluding Ernst & Young? | PricewaterhouseCoopers, Deloitte Touche Tohmatsu | {'Arthur Andersen', 'Ernst & Young'} | ['Ernst & Young \| EY (formerly Ernst & Young) is a multinational professional services firm headquartered in London, England. EY is one of the largest... | Deloitte and PricewaterhouseCoopers | False |


40.0
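
The returned number reads as a percentage: with `dev_size=50`, a score of 40.0 would correspond to 20 of the 50 dev examples passing the metric. The sketch below shows that aggregation on toy data. It assumes the score is simply the share of passing examples times 100, and `toy_evaluate` is a hypothetical helper, not the `Evaluate` class used above.

```python
def toy_evaluate(program, devset, metric) -> float:
    """Return the percentage of dev examples on which `metric` passes."""
    hits = sum(bool(metric(example, program(example["question"]))) for example in devset)
    return round(100.0 * hits / len(devset), 2)


# Toy dev set of 50 items; the toy program answers the first 20 correctly.
toy_devset = [{"question": i, "answer": i} for i in range(50)]


def toy_program(question):
    return question if question < 20 else -1


def toy_metric(example, prediction):
    return example["answer"] == prediction


score = toy_evaluate(toy_program, toy_devset, toy_metric)
print(score)  # 20 of 50 correct
```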

In [19]:
def gold_passages_retrieved(example, pred, trace=None):
    gold_titles = set(map(dspy.evaluate.normalize_text, example['gold_titles']))
    found_titles = set(map(dspy.evaluate.normalize_text, [c.split(' | ')[0] for c in pred.context]))

    return gold_titles.issubset(found_titles)

compiled_rag_retrieval_score = evaluate_on_hotpotqa(compiled_rag, metric=gold_passages_retrieved)

| | question | example_answer | gold_titles | context | pred_answer | gold_passages_retrieved |
|---|---|---|---|---|---|---|
| 0 | Give a Girl a Break featured Marge Champion as what kind of talent? | dancer | {'Marge Champion', 'Give a Girl a Break'} | ['Give a Girl a Break \| Give a Girl a Break is a 1953 musical comedy film directed by Stanley Donen, starring Debbie Reynolds and... | Dancer | ✔️ [True] |
| 1 | In what year was the film that Steve Hoban is best known for released? | 2009 | {'Splice (film)', 'Steve Hoban'} | ['Steve Hoban \| Steven "Steve" Hoban (born 1964) is a Canadian film producer. He has been nominated for three Genie Awards and won another. He... | 2009 | False |
| 2 | What football player left the Superliga side OB when he won Goalkeeper of the Year award in Denmark and Norway in 2010? | Anders Lindegaard | {'Sten Grytebust', 'Anders Lindegaard'} | ['Anders Lindegaard \| Anders Rozenkrantz Lindegaard (] ; born 13 April 1984) is a Danish footballer who plays for English club Burnley as a goalkeeper.... | Anders Lindegaard | False |
| 3 | Álvaro Raposo De Oliveira was born in a city located in the valleys of which rivers ? | the Chillón, Rímac and Lurín rivers | {'Lima', 'Álvaro Raposo de Oliveira'} | ['Álvaro Raposo de Oliveira \| Álvaro Raposo De Oliveira (born September 5, 1990, in Lima) is a Peru professional tennis player.', 'Miguel Sutil \| Miguel... | Amazon River | False |
| 4 | What are the accounting companies that make up the "Big Four" excluding Ernst & Young? | PricewaterhouseCoopers, Deloitte Touche Tohmatsu | {'Arthur Andersen', 'Ernst & Young'} | ['Ernst & Young \| EY (formerly Ernst & Young) is a multinational professional services firm headquartered in London, England. EY is one of the largest... | Deloitte and PricewaterhouseCoopers | ✔️ [True] |


In [20]:
compiled_rag_retrieval_score

38.0
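
The retrieval metric relies on each passage having the form `Title | text`: it takes the part before the first ` | ` as the title and requires every gold title to appear among the retrieved ones. A small worked example on made-up data, using plain lowercasing in place of `dspy.evaluate.normalize_text` (a simplification), might look like this. The `toy_*` names and passages are hypothetical.

```python
def toy_titles(context: list) -> set:
    """Extract lowercased titles from 'Title | text' passages."""
    return {passage.split(" | ")[0].strip().lower() for passage in context}


def toy_gold_passages_retrieved(gold_titles: set, context: list) -> bool:
    # True only if every gold title is among the retrieved titles.
    return {title.lower() for title in gold_titles}.issubset(toy_titles(context))


gold = {"Alpha", "Beta"}
partial = ["Alpha | Alpha is the first toy page.", "Gamma | Gamma is unrelated."]
full = partial + ["Beta | Beta is the second toy page."]

print(toy_gold_passages_retrieved(gold, partial))  # one gold title is missing
print(toy_gold_passages_retrieved(gold, full))     # both gold titles are present
```

Because the check requires all gold titles, a program can answer correctly from a single retrieved passage and still fail this metric.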