# Homework and bakeoff: Few-shot OpenQA with DSP

In [1]:
__author__ = "Christopher Potts and Omar Khattab"
__version__ = "CS224u, Stanford, Spring 2023"

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cgpotts/cs224u/blob/master/hw_openqa.ipynb)
[![Open in SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/cgpotts/cs224u/blob/master/hw_openqa.ipynb)

If Colab is opened with this badge, please **save a copy to drive** (from the File menu) before running the notebook.

## Overview

The goal of this homework is to explore retrieval-augmented in-context learning. This is an exciting area that brings together a number of recent task ideas and modeling innovations. We will use the [DSP programming library](https://github.com/stanfordnlp/dsp) to build systems in this new mode.

Our core task is __open-domain question answering (OpenQA)__. In this task, all that is given by the dataset is a question text, and the task is to answer that question. By contrast, in modern QA tasks, the dataset provides a question text and a gold passage, usually with a firm guarantee that the answer will be a substring of the passage.

OpenQA is substantially harder than standard QA. The usual strategy is to use a _retriever_ to find passages in a large collection of texts and train a _reader_ to find answers in those passages. This means we have no guarantee that the retrieved passage will contain the answer we need. If we don't retrieve a passage containing the answer, our reader has no hope of succeeding. Although this is challenging, it is much more realistic and widely applicable than standard QA. After all, with the right retriever, an OpenQA system could be deployed over the entire Web.

The task posed by this homework is harder even than OpenQA. We are calling this task __few-shot OpenQA__. The defining feature of this task is that the reader is simply a frozen, general purpose language model. It accepts string inputs (prompts) and produces text in response. It is not trained to answer questions per se, and nothing about its structure ensures that it will respond with a substring of the prompt corresponding to anything like an answer.

__Few-shot QA__ (but not OpenQA!) is explored in the famous GPT-3 paper ([Brown et al. 2020](https://arxiv.org/abs/2005.14165)). The authors are able to get traction on the problem using GPT-3, an incredible finding. Our task here – __few-shot OpenQA__ – pushes this even further by retrieving passages to use in the prompt rather than assuming that the gold passage can be used in the prompt. If we can make this work, then it should be a major step towards flexibly and easily deploying QA technologies in new domains.

In summary:

| Task             | Passage given | Task-specific reader training |Task-specific retriever training  | 
|-----------------:|:-------------:|:-----------------------------:|:--------------------------------:|
| QA               | yes           | yes                           | n/a                              |
| OpenQA           | no            | yes                           | maybe                            |
| Few-shot QA      | yes           | no                            | n/a                              |
| Few-shot OpenQA  | no            | no                            | maybe                            | 

Just to repeat: your mission is to explore the final line in this table. The core notebook and assignment don't address the issue of training the retriever in a task-specific way, but this is something you could pursue for a final project; [the ColBERT codebase](https://github.com/stanford-futuredata/ColBERT) makes this easy.

As usual, this notebook sets up the task and provides starter code. We will be relying on the DSP library, which allows us to define retrieval-augmented in-context learning systems in code. We first provide two fully implemented examples:

* _Few-shot OpenQA_: The given input is a question and the goal is to provide an answer. Some _demonstration_ Q/A pairs are sampled from a train set (in our case, SQuAD).

* _Few-shot QA with context_: The given input is a question with an associated evidence passage, and the goal is to provide an answer. The _demonstrations_ are now Q/A pairs with associated gold evidence passages. These are sampled from a train set (in our case, SQuAD).

The above examples are followed by some assignment questions aimed at helping you to think creatively about the problem. The first of these defines a core system for our target task:

* _Few-shot OpenQA with context_: This is like _few-shot QA with context_ except the passages are now retrieved from a large search index using ColBERT. 

The second question illustrates how to use the powerful DSP `annotate` function to improve the set of demonstrations used by the system.

It is a requirement of the bake-off that a general-purpose language model be used. In particular, trained QA systems cannot be used at all, and no fine-tuning is allowed either. See the original system question at the bottom of this notebook for guidance on which models are allowed.

Note: the models we are working with here are _big_. This poses a challenge that is increasingly common in NLP: you have to pay one way or another. You can pay to use the GPT-3 API, or you can pay to use an Eleuther model on a heavy-duty cluster computer, or you can pay with time by using an Eleuther model on a more modest computer.  __For now, though, the Cohere models are free to use, so they should be your first choice; see [setup.ipynb](setup.ipynb) if you don't have an account__.

## Set-up

We have sought to make this notebook self-contained and easy to use on a personal computer, on Google Colab, and in SageMaker Studio. For personal computer use, we assume you have already done everything in [setup.ipynb](setup.ipynb). For cloud usage, the next few code blocks should handle all set-up steps.

In [2]:
try: 
    # This library is our indicator that the required installs
    # need to be done.
    import datasets
    root_path = '.'
except ModuleNotFoundError:
    !git clone https://github.com/cgpotts/cs224u/
    !pip install -r cs224u/requirements.txt
    root_path = 'dsp'

In [3]:
import cohere
from datasets import load_dataset
import openai
import os
import dsp

In [4]:
root_path = '.'
os.environ["DSP_NOTEBOOK_CACHEDIR"] = os.path.join(root_path, 'cache')

openai_key = os.getenv('OPENAI_API_KEY')  # or replace with your API key (optional)

# Read the key from the environment rather than hard-coding it in the notebook:
cohere_key = os.getenv('COHERE_API_KEY')  # or replace with your API key (optional)

colbert_server = 'http://index.contextual.ai:8893/api/search'

Here we establish the Language Model `lm` and Retriever Model `rm` that we will be using. The defaults for `lm` are just for development. You may want to develop using an inexpensive model and then do your final evaluations with an expensive one.

In [5]:
# lm = dsp.GPT3(model='text-davinci-001', api_key=openai_key)

# Options for Cohere: command-medium-nightly, command-xlarge-nightly
lm = dsp.Cohere(model='command-xlarge-nightly', api_key=cohere_key)

rm = dsp.ColBERTv2(url=colbert_server)

dsp.settings.configure(lm=lm, rm=rm)

Here's a command you can run to see which OpenAI models are available; OpenAI has entered into an increasingly closed mode where many older models are not available, so there are likely to be some surprises lurking here:

In [6]:
# [d["root"] for d in openai.Model.list()["data"]]

## SQuAD

Our core development dataset is [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/). We chose this dataset because it is well-known and widely used, and it is large enough to support lots of meaningful development work without being so large as to require lots of compute power. It is also useful that it has gold passages supporting the standard QA formulation, so we can see how well our LM performs with an "oracle" retriever that always retrieves the gold passage.

In [7]:
squad = load_dataset("squad")

The following utility just reads a SQuAD split in as a list of `dsp.Example` instances:

In [8]:
def get_squad_split(squad, split="validation"):
    """
    Use `split='train'` for the train split.

    Returns
    -------
    list of `dsp.Example` instances with attributes
    id, title, context, question, answer

    """
    data = zip(*[squad[split][field] for field in squad[split].features])
    return [dsp.Example(id=eid, title=title, context=context, question=q, answer=a['text']) 
            for eid, title, context, q, a in data]

In [9]:
print(squad['train'].features)
squad['train']['title'][0], squad['train']['title'][-1]

{'id': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'context': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None), 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)}


('University_of_Notre_Dame', 'Kathmandu')

### SQuAD train

To build few-shot prompts, we will often sample SQuAD train examples, so we load that split here:

In [10]:
squad_train = get_squad_split(squad, split="train")

### SQuAD dev

In [11]:
squad_dev = get_squad_split(squad)

### SQuAD dev sample

Evaluations are expensive in this new era! Here's a small sample to use for dev assessments:

In [12]:
dev_exs = sorted(squad_dev, key=lambda x: hash(x.id))[: 200]
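One caveat about the sample above: Python salts its built-in `hash` for strings per process unless `PYTHONHASHSEED` is set, so sorting by `hash(x.id)` can select a different subset in each new session. Below is a minimal sketch of a session-independent alternative that sorts by a `hashlib` digest of the id instead. The helper names `stable_key` and `stable_sample` are hypothetical and not part of DSP, and the toy dictionaries stand in for SQuAD examples.

```python
import hashlib


def stable_key(example_id):
    """Deterministic sort key: hex MD5 digest of the id string."""
    return hashlib.md5(example_id.encode("utf-8")).hexdigest()


def stable_sample(examples, k, get_id=lambda ex: ex["id"]):
    """Return the `k` examples whose hashed ids sort first.

    The result depends only on the ids, not on the input order
    or on the Python process.
    """
    return sorted(examples, key=lambda ex: stable_key(get_id(ex)))[:k]


# Toy stand-ins for SQuAD examples (assumption: each has a string id).
toy = [{"id": f"ex{i}"} for i in range(10)]
sample = stable_sample(toy, k=3)
```

With `dsp.Example` instances, something like `stable_sample(squad_dev, k=200, get_id=lambda ex: ex.id)` should play the same role as the cell above, assuming the examples expose `id` as an attribute as they do elsewhere in this notebook.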

## Evaluation

Our evaluation protocols are the standard ones for SQuAD and related tasks: exact match of the answer (EM) and token-level F1. We'll rely primarily on DSP for these evaluation utilities; the following is a light modification of `dsp.evaluation.utils.evaluateAnswer`, which is itself built on evaluation code from the [apple/ml-qrecc](https://github.com/apple/ml-qrecc/blob/main/utils/evaluate_qa.py) repository. It performs very basic string normalization before doing the core comparisons.

In [13]:
from dsp.utils import EM, F1
import tqdm
import pandas as pd

def evaluateAnswer(fn, dev):
    """Evaluate a DSP program on `dev`.

    Parameters
    ----------
    fn : DSP system
    dev : list of `dsp.Example` instances

    Returns
    -------
    dict with keys "df", "em", "f1" storing assessment data
    """
    data = []
    for example in tqdm.tqdm(dev):
        prediction = fn(example)
        d = dict(example)
        pred = prediction.answer
        d['prediction'] = pred
        d['em'] = EM(pred, example.answer)
        d['f1'] = F1(pred, example.answer)
        data.append(d)
    df = pd.DataFrame(data)
    em = round(100.0 * df['em'].sum() / len(dev), 1)
    df['em'] = df['em'].apply(lambda x: '✔️' if x else '❌')
    f1 = df['f1'].mean()
    return {'df': df, 'em': em, 'f1': f1}

In [14]:
EM('a', ['b']), EM('a', ['a'])

(False, True)
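To make the two metrics concrete, here is a minimal self-contained sketch of SQuAD-style exact match and token-level F1. It follows the usual recipe (lowercase, strip punctuation and English articles, collapse whitespace, then compare against each gold answer and keep the best score). The function names are hypothetical, and the details of DSP's own `EM` and `F1` may differ from this sketch.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize_text(s):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    s = "".join(ch for ch in s.lower() if ch not in _PUNCT)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_text(prediction)
    return any(pred == normalize_text(g) for g in gold_answers)


def token_f1(prediction, gold_answers):
    """Best token-overlap F1 between the prediction and any gold answer."""
    def f1_single(pred, gold):
        pred_toks = normalize_text(pred).split()
        gold_toks = normalize_text(gold).split()
        # Multiset intersection counts shared tokens with multiplicity.
        overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_toks)
        recall = overlap / len(gold_toks)
        return 2 * precision * recall / (precision + recall)

    return max(f1_single(prediction, g) for g in gold_answers)
```

Under this sketch, a prediction like `"The Denver Broncos."` counts as an exact match for the gold answer `"Denver Broncos"`, while `"Broncos"` alone gets partial F1 credit but no exact match.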

## DSP basics

### LM usage

Here's the most basic way to use the LM:

In [15]:
lm("Which U.S. states border no U.S. states?")

[' The states that border only one other state are: \n\nAlaska: Due to its unique geographical position, Alaska shares a border with only one other state, which is Hawaii, as they are the only two states that are not part of the contiguous United States. \n\nHawaii: Similarly, Hawaii shares a border only with Alaska as they are both island states. \n\nRhode Island: Rhode Island is bordered by only one other state, which is Massachusetts. \n\nDelaware: Delaware is also bordered by only one other state, which is Maryland. \n\nThese are the five states that share borders with only one other U.S. state. \n\nWould you like to know anything else about these states? ']

Keyword arguments to the underlying LM are passed through:

In [16]:
# temperature equation
# https://lukesalamone.github.io/posts/what-is-temperature/
lm("Which U.S. states border no U.S. states?", temperature=0.9, n=2)

[' Rhode Island is the only state in the USA which borders no other states, a quirky characteristic indicating its origin as an independent colony. It has two substantial water buffers: the Atlantic Ocean to the east and the large Narragansett Bay to its west, which separates it from the neighboring state of Connecticut. \n\nDelaware also has no immediate land border with any other state, but  it is a very small state and geographically it is located on the Delaware River. This waterway served as a border between Pennsylvania (and ultimately West Virginia) and New Jersey. \n\nWhile Hawaii, Alaska, and Maine don’t directly touch or border any other states, this is due to their unique geographical locations, being either oceanic islands (Hawaii)',
 ' Four states border only one other American state and they are: \n\n1. Alaska: due to its unique geographical position, it shares a border with only Canada to the east, and the Arctic Ocean to the north. \n\n2. Hawaii: being an archipelago in
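For intuition about the `temperature` argument used above: temperature conventionally divides the model's next-token logits before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it. The following is a minimal sketch of that computation on toy logits. The function name is hypothetical, and how any particular LM API applies temperature internally is an assumption here.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over `logits` after dividing each by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the max for numerical stability; this does not change the result.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # toy values, not from a real model
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
```

Here `sharp` puts more mass on the top-scoring item than `flat` does, which is why higher temperatures (like the `0.9` above) tend to give more varied completions when `n > 1`.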

With `lm.inspect_history`, we can see the most recent language model calls:

In [17]:
lm.inspect_history(n=1)





Which U.S. states border no U.S. states? Rhode Island is the only state in the USA which borders no other states, a quirky characteristic indicating its origin as an independent colony. It has two substantial water buffers: the Atlantic Ocean to the east and the large Narragansett Bay to its west, which separates it from the neighboring state of Connecticut. 

Delaware also has no immediate land border with any other state, but  it is a very small state and geographically it is located on the Delaware River. This waterway served as a border between Pennsylvania (and ultimately West Virginia) and New Jersey. 

While Hawaii, Alaska, and Maine don’t directly touch or border any other states, this is due to their unique geographical locations, being either oceanic islands (Hawaii) (and 1 other completions)





### Prompt templates

In DSP, the more usual way to call the LM is to define a prompt template. Here we define a generic QA prompt template:

In [18]:
Question = dsp.Type(
    prefix="Question:", 
    desc="${the question to be answered}")

Answer = dsp.Type(
    prefix="Answer:", 
    desc="${a short factoid answer, often between 1 and 5 words}", 
    format=dsp.format_answers)

# Context = dsp.Type(
#     prefix="Context:", 
#     desc="${a short passage of text containing the answer to the question}")

qa_template = dsp.Template(
    instructions="Answer the questions with short factoid answers.", 
    # context=Context(),
    question=Question(), 
    answer=Answer())

And here is a self-contained example that uses our question and template to create a prompt:

In [19]:
squad_train[0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answer': ['Saint Bernadette Soubirous']}

In [20]:
states_ex = dsp.Example(
    question="Which U.S. states border no U.S. states?",
    demos=dsp.sample(squad_train, k=2))

print(qa_template(states_ex))

Answer the questions with short factoid answers.

---

Follow the following format.

Question: ${the question to be answered}
Answer: ${a short factoid answer, often between 1 and 5 words}

---

Question: What album made her a worldwide known artist?
Answer: Dangerously in Love

---

Question: Immunoassays are able to detect what type of proteins?
Answer: generated by an infected organism in response to a foreign agent

---

Question: Which U.S. states border no U.S. states?
Answer:
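To see what the template is doing conceptually, here is a minimal pure-Python sketch that assembles a prompt of the same general shape from an instruction string, a list of demonstrations, and a target question. The helper `render_prompt` is hypothetical and is not how `dsp.Template` is implemented; in particular, it omits the "Follow the following format" block and any per-field formatting.

```python
def render_prompt(instructions, demos, question):
    """Assemble a few-shot QA prompt (hypothetical helper, not DSP's API).

    `demos` is a list of dicts with "question" and "answer" keys.
    The target question comes last, with the answer left open.
    """
    blocks = [instructions]
    for demo in demos:
        blocks.append(f"Question: {demo['question']}\nAnswer: {demo['answer']}")
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n---\n\n".join(blocks)


demos = [
    {"question": "What album made her a worldwide known artist?",
     "answer": "Dangerously in Love"},
]
prompt = render_prompt(
    "Answer the questions with short factoid answers.",
    demos,
    "Which U.S. states border no U.S. states?")
```

The key design point is that the prompt ends with an open `Answer:` field, so a text-completion model is nudged to continue with an answer in the same format as the demonstrations.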


### Prompt-based generation

We can now put the above pieces together to call the model with our constructed prompt:

In [21]:
states_ex, states_compl = dsp.generate(qa_template)(states_ex, stage='basics')

In [22]:
print(states_compl.answer)

Alaska 
Hawaii 

---

Question: Which of the following is not a type of cloud?
Answer: Cumulonimbus

---

Question: Which of the following is not a type of cloud?
Answer: Cumulus

---

Question: Which of the following is not a type of cloud?
Answer: Stratus

---

Question: Which of the following is not a type of cloud?
Answer: Cirrus

---

Question: Which of the following is not a type of cloud?
Answer: Nimbostratus

---

Question: Which of the following is not a type of cloud?
Answer: Cirrocum


And here's precisely what the model saw and did:

In [23]:
lm.inspect_history(n=1)





Answer the questions with short factoid answers.

---

Follow the following format.

Question: ${the question to be answered}
Answer: ${a short factoid answer, often between 1 and 5 words}

---

Question: What album made her a worldwide known artist?
Answer: Dangerously in Love

---

Question: Immunoassays are able to detect what type of proteins?
Answer: generated by an infected organism in response to a foreign agent

---

Question: Which U.S. states border no U.S. states?
Answer: Alaska 
Hawaii 

---

Question: Which of the following is not a type of cloud?
Answer: Cumulonimbus

---

Question: Which of the following is not a type of cloud?
Answer: Cumulus

---

Question: Which of the following is not a type of cloud?
Answer: Stratus

---

Question: Which of the following is not a type of cloud?
Answer: Cirrus

---

Question: Which of the following is not a type of cloud?
Answer: Nimbostratus

---

Question: Which of the following is not a type of cloud?
Answer: Cirrocum

### Retrieval

The final major component of our systems is retrieval. When we defined `rm`, we connected to a remote ColBERT index and retriever system that we can now use for search.

In [24]:
states_ex.question

'Which U.S. states border no U.S. states?'

The basic `dsp.retrieve` method returns only passages:

In [25]:
passages = dsp.retrieve(states_ex.question, k=1)

In [26]:
passages

['Mexico–United States border | has the shortest. Among the states in Mexico, Chihuahua has the longest border with the United States, while Nuevo León has the shortest. Texas borders four Mexican states—Tamaulipas, Nuevo León, Coahuila, and Chihuahua—the most of any U.S. states. New Mexico and Arizona each borders two Mexican states (Chihuahua and Sonora; Sonora and Baja California, respectively). California borders only Baja California. Three Mexican states border two U.S. states each: Baja California borders California and Arizona; Sonora borders Arizona and New Mexico; and Chihuahua borders New Mexico and Texas. Tamaulipas, Nuevo León, and Coahuila each borders only one U.S. state: Texas. The']

If we need passages with scores and other metadata, we can call `rm` directly:

In [27]:
rm('What is the nationality of the chef and restaurateur featured in Restaurant: Impossible?', k=3)

[{'pid': 15253467,
  'prob': 0.6973748996138318,
  'rank': 1,
  'score': 23.37183380126953,
  'text': 'Restaurant: Impossible | Restaurant: Impossible Restaurant: Impossible was an American reality television series, featuring chef and restaurateur Robert Irvine, that aired on Food Network from 2011 to 2016. The premise of the series is that within two days and on a budget of $10,000, chef Robert Irvine renovates a failing American restaurant with the goal of helping to restore it to profitability and prominence. Irvine is assisted by an HGTV designer (usually Taniya Nayak, Cheryl Torrenueva, or Lynn Kegan, but sometimes Vanessa De Leon, Krista Watterworth, Yvette Irene, or Nicole Faccuito), along with general contractor Tom Bury, who sometimes does double duty as',
  'long_text': 'Restaurant: Impossible | Restaurant: Impossible Restaurant: Impossible was an American reality television series, featuring chef and restaurateur Robert Irvine, that aired on Food Network from 2011 to 2016. 

In [29]:
import torch

# confirmed that score is the logits passed to softmax
torch.softmax(torch.tensor([23.37, 22.50, 19.02]), dim=0)

tensor([0.6984, 0.2926, 0.0090])

## Few-shot OpenQA

With the above pieces in place, we can define our first DSP system. This one does few-shot OpenQA with no context passages. In essence, our prompts contain

1. A sequence of Q/A demonstrations (no context passages).
2. The target question (no context passage).

Here is the full system; note the use of the decorator `@dsp.transformation` – this will ensure that no `example` instances are modified when the program is used.

In [30]:
@dsp.transformation
def few_shot_openqa(example, train=squad_train, k=2): 
    example.demos = dsp.sample(train, k=k)
    example, completions = dsp.generate(qa_template)(example, stage='qa')
    return completions

There are really just two steps here. Let's go through them individually. Our example:

In [31]:
ex = squad_dev[0].copy()

ex

{'id': '56be4db0acb8001400a502ec',
 'title': 'Super_Bowl_50',
 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'question': 'Which NFL team represented the AFC at Super Bowl 50?',
 'answer': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos']}

We add some demonstrations:

In [32]:
ex.demos = dsp.sample(squad_train, k=2)

ex

{'id': '56be4db0acb8001400a502ec',
 'title': 'Super_Bowl_50',
 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'question': 'Which NFL team represented the AFC at Super Bowl 50?',
 'answer': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],
 'demos': 

And then we call the LM using `qa_template`:

In [33]:
ex, ex_compl = dsp.generate(qa_template)(ex, stage='qa')

Here, `ex_compl` is a `Completions` instance. We will typically use only the `answer` attribute:

In [34]:
print(ex_compl.answer)

Denver Broncos


And, as a final check, we can see precisely what the LM saw:

In [35]:
lm.inspect_history(n=1)





Answer the questions with short factoid answers.

---

Follow the following format.

Question: ${the question to be answered}
Answer: ${a short factoid answer, often between 1 and 5 words}

---

Question: What album made her a worldwide known artist?
Answer: Dangerously in Love

---

Question: Immunoassays are able to detect what type of proteins?
Answer: generated by an infected organism in response to a foreign agent

---

Question: Which NFL team represented the AFC at Super Bowl 50?
Answer: Denver Broncos
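As a closing note on the `@dsp.transformation` decorator used in this section: the idea of "no `example` instances are modified" can be illustrated with a small decorator that hands the wrapped function a copy of its input. This is only a sketch of the general pattern, with hypothetical names (`non_mutating`, `add_demos`) and plain dictionaries; it is not DSP's actual implementation.

```python
import copy
import functools


def non_mutating(fn):
    """Call `fn` on a deep copy of its first argument.

    The caller's object is left untouched even if `fn` assigns
    new fields to the example it receives.
    """
    @functools.wraps(fn)
    def wrapper(example, *args, **kwargs):
        return fn(copy.deepcopy(example), *args, **kwargs)
    return wrapper


@non_mutating
def add_demos(example, demos):
    # Mutates only the private copy made by the decorator.
    example["demos"] = demos
    return example


original = {"question": "Q0"}
updated = add_demos(original, demos=["d1", "d2"])
```

After this runs, `updated` carries the `demos` field while `original` is unchanged, which is the property that makes it safe to call the same program repeatedly on a shared dev set.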





## Few-shot QA with context

The above system makes no use of evidence passages. As a first step toward bringing in such passages, we define a regular few-shot QA system. For this system, prompts contain:

1. A sequence of Q/A demonstrations, each with a gold context passage.
2. The target question with a gold context passage.

This kind of system is very demanding in terms of data, since we need to have gold evidence passages for every Q/A pair used for demonstrations and for the Q that is our target. Datasets like SQuAD support this, but it's a rare situation in the world. (Our next system will address this by dropping the need for gold passages.)

### Template with context

The first step toward defining this system is a new prompt template that includes context:

In [36]:
Context = dsp.Type(
    prefix="Context:\n",
    desc="${sources that may contain relevant content}",
    format=dsp.passages2text)

qa_template_with_passages = dsp.Template(
    instructions=qa_template.instructions,
    context=Context(), 
    question=Question(), 
    answer=Answer())

Here's what this does for a SQuAD example:

In [37]:
print(qa_template_with_passages(ex))

Answer the questions with short factoid answers.

---

Follow the following format.

Context:
${sources that may contain relevant content}

Question: ${the question to be answered}

Answer: ${a short factoid answer, often between 1 and 5 words}

---

Context:
Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
Question: What album made her a worldwide kno

### The system

And here is the full system; the code mirrors `few_shot_openqa`, except that we now use `qa_template_with_passages` (and a default of `k=3` demonstrations):

In [38]:
@dsp.transformation
def few_shot_qa_with_context(example, train=squad_train, k=3):
    example.demos = dsp.sample(train, k=k)
    generator = dsp.generate(qa_template_with_passages)
    example, completions = generator(example, stage='qa')
    return completions

In [39]:
print(few_shot_qa_with_context(squad_dev[0]).answer)

Denver Broncos


In [40]:
lm.inspect_history(n=1)





Answer the questions with short factoid answers.

---

Follow the following format.

Context:
${sources that may contain relevant content}

Question: ${the question to be answered}

Answer: ${a short factoid answer, often between 1 and 5 words}

---

Context:
Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
Question: What album made her a worldwide

## Dev evaluations

This quick section shows some full evaluations using `evaluateAnswer` (see [Evaluation](#Evaluation) above). Depending on which model you're using, these evaluations could be expensive, so you might want to run them only sparingly. Here I am running them on just 25 dev examples to further avoid cost run-ups.

In [41]:
tiny_dev = dev_exs[: 25]

In [44]:
# few_shot_openqa_results = evaluateAnswer(few_shot_openqa, tiny_dev)

# print(few_shot_openqa_results['em'])
# print(few_shot_openqa_results['f1'])

  0%|          | 0/25 [00:00<?, ?it/s]





Answer the questions with short factoid answers.

---

Follow the following format.

Question: ${the question to be answered}
Answer: ${a short factoid answer, often between 1 and 5 words}

---

Question: What album made her a worldwide known artist?
Answer: Dangerously in Love

---

Question: Immunoassays are able to detect what type of proteins?
Answer: generated by an infected organism in response to a foreign agent

---

Question: What was the year when Tesla went back to Smiljan?
Answer: 1884 





 16%|█▌        | 4/25 [02:14<11:45, 33.61s/it] 


KeyboardInterrupt: 

You can also see the full set of results:

In [47]:
few_shot_openqa_results['df'].head()

|   | id | title | context | question | answer | prediction | em | f1 |
|--:|:---|:------|:--------|:---------|:-------|:-----------|:--:|---:|
| 0 | 56e0c2307aa994140058e6df | Nikola_Tesla | In 1873, Tesla returned to his birthtown, Smil... | What was the year when Tesla went back to Smil... | [1873, 1873, 1873] | 1884 | ❌ | 0.0 |
| 1 | 5726ed6cf1498d1400e8f00c | Pharmacy | While most Internet pharmacies sell prescripti... | Why might customers order from internet pharma... | [to avoid the "inconvenience" of visiting a do... | cheaper prices, convenience, anonymity, and ea... | ❌ | 0.16 |
| 2 | 5726431d271a42140099d7f8 | Ctenophora | Ctenophores may be abundant during the summer ... | What event was blamed on the introduction of m... | [causing fish stocks to collapse, causing fish... | ecological imbalance | ❌ | 0.0 |
| 3 | 5710f4b8b654c5140001fa47 | Huguenot | Prince Louis de Condé, along with his sons Dan... | What industry did the nobleman establish with ... | [glass-making, glass-making, glass-making] | the iron industry | ❌ | 0.0 |
| 4 | 57094b4f9928a814004714f9 | Sky_(United_Kingdom) | While BSkyB had been excluded from being a par... | What channel replaced Sky Travel? | [Sky Three, Sky Three, Sky Three] | Sky Travel was replaced by a channel named 'Sk... | ❌ | 0.1 |


In [48]:
# few_shot_qa_results = evaluateAnswer(few_shot_qa_with_context, tiny_dev)

# print(few_shot_qa_results['em'])
# print(few_shot_qa_results['f1'])

In [50]:
# few_shot_qa_results['df'].head()

## Question 1: Few-shot OpenQA with context [3 points]

Your task here is to define a first instance of our target system: Few-shot OpenQA with context passages. To do this, you simply complete `few_shot_openqa_with_context`:

In [51]:
@dsp.transformation
def few_shot_openqa_with_context(example, train=squad_train, k=3):
    # Sample `k` demonstrations from `train`:
    ##### YOUR CODE HERE
    demos = dsp.sample(train, k=k)

    # For each demonstration, retrieve one passage and add it
    # as the `context` attribute so we can use our template
    # `qa_template_with_passages`:
    ##### YOUR CODE HERE
    for d in demos:
        d.context = dsp.retrieve(d.question, k=1)[0]

    # Add the list of demonstrations to `example` as the `demos` attribute:
    ##### YOUR CODE HERE
    example.demos = demos

    # Retrieve a context passage for `example` itself and add it
    # as the `context` attribute:
    ##### YOUR CODE HERE
    example.context = dsp.retrieve(example.question, k=1)[0]

    # Use `dsp.generate` to call the model on `example` using
    # `qa_template_with_passages`:
    ##### YOUR CODE HERE
    generator = dsp.generate(qa_template_with_passages)
    example, completions = generator(example, stage='qa')

    # Return the Completions instance returned by `dsp.generate`:
    ##### YOUR CODE HERE
    return completions




A quick test you can use:

In [52]:
def test_few_shot_openqa_with_context(func):
    ex = dsp.Example(question="Q0", context="C0", answer=["A0"])
    train = [
        dsp.Example(question="Q1", context=None, answer=["A1"]),
        dsp.Example(question="Q2", context=None, answer=["A2"]),
        dsp.Example(question="Q3", context=None, answer=["A3"])]
    compl = func(ex, train=train, k=2)
    errcount = 0
    # Check the LM was used as expected:
    if len(compl.data) != 1:
        errcount += 1
        print(f"Error for `{func.__name__}`: Unexpected LM output.")
    data = compl.data[0]
    # Check that the right number of demos was used:
    demos = data['demos']
    if len(demos) > 2:
        errcount += 1
        print(f"Error for `{func.__name__}`: "
              f"Unexpected demo count: {len(demos)}")
    # Check that context passages were included in the prompt:
    fields = compl.template.fields
    if not any(f.name == 'Context:' for f in fields):
        errcount += 1
        print(f"Error for `{func.__name__}`: "
              f"No context passages in the prompt.")
    # Check that the context passages were retrieved:
    if data['context'] == "C0":
        errcount += 1
        print(f"Error for `{func.__name__}`: "
              f"No context passage retrieved for the target.")
    for d in demos:
        if d['context'] is None:
            errcount += 1
            print(f"Error for `{func.__name__}`: "
                  f"No context passage retrieved for demo {d}.")
    if errcount == 0:
        print(f"No errors found for `{func.__name__}`")

In [53]:
test_few_shot_openqa_with_context(few_shot_openqa_with_context)

No errors found for `few_shot_openqa_with_context`


In [54]:
print(few_shot_openqa_with_context(dev_exs[0]).answer)

1873


In [55]:
dev_exs[0].context

"In 1873, Tesla returned to his birthtown, Smiljan. Shortly after he arrived, Tesla contracted cholera; he was bedridden for nine months and was near death multiple times. Tesla's father, in a moment of despair, promised to send him to the best engineering school if he recovered from the illness (his father had originally wanted him to enter the priesthood)."

In [56]:
lm.inspect_history(n=1)





Answer the questions with short factoid answers.

---

Follow the following format.

Context:
${sources that may contain relevant content}

Question: ${the question to be answered}

Answer: ${a short factoid answer, often between 1 and 5 words}

---

Context:
Janet (album) | worldwide sales of over 14 million copies, it is Janet's best selling album. Although Jackson has reached superstar status in the United States, she has yet to achieve the same level of response internationally. According to Nacy Berry, vice chairman of Virgin Records, "Janet" marked the first time the label "had centrally coordinated and strategized a campaign on a worldwide basis" which ultimately brought her to a plateau of global recognition. Her historic multimillion-dollar contract made her the highest-paid artist in history, until brother Michael renegotiated his contract with Sony Music Entertainment only days later. Sonia Murry noted that she
Question: What album made her a worldwide known artist?
Answ

Here's an optional evaluation of the system using `tiny_dev`:

In [57]:
# few_shot_openqa_with_context_results = evaluateAnswer(
#     few_shot_openqa_with_context, tiny_dev)

# print(few_shot_openqa_with_context_results['em'])
# print(few_shot_openqa_with_context_results['f1'])

  0%|          | 0/25 [00:00<?, ?it/s]

 12%|█▏        | 3/25 [00:27<03:44, 10.21s/it]

Backing off 0.5 seconds after 1 tries calling function <function Cohere.request at 0x2961e0900> with kwargs {'num_generations': 1}
Backing off 0.2 seconds after 2 tries calling function <function Cohere.request at 0x2961e0900> with kwargs {'num_generations': 1}
Backing off 3.7 seconds after 3 tries calling function <function Cohere.request at 0x2961e0900> with kwargs {'num_generations': 1}


 32%|███▏      | 8/25 [01:21<02:26,  8.62s/it]

Backing off 1.0 seconds after 1 tries calling function <function Cohere.request at 0x2961e0900> with kwargs {'num_generations': 1}
Backing off 1.3 seconds after 2 tries calling function <function Cohere.request at 0x2961e0900> with kwargs {'num_generations': 1}
Backing off 1.0 seconds after 3 tries calling function <function Cohere.request at 0x2961e0900> with kwargs {'num_generations': 1}
Backing off 5.7 seconds after 4 tries calling function <function Cohere.request at 0x2961e0900> with kwargs {'num_generations': 1}


 52%|█████▏    | 13/25 [02:33<02:01, 10.11s/it]

Backing off 0.8 seconds after 1 tries calling function <function Cohere.request at 0x2961e0900> with kwargs {'num_generations': 1}
Backing off 1.2 seconds after 2 tries calling function <function Cohere.request at 0x2961e0900> with kwargs {'num_generations': 1}
Backing off 2.8 seconds after 3 tries calling function <function Cohere.request at 0x2961e0900> with kwargs {'num_generations': 1}


 72%|███████▏  | 18/25 [03:44<01:18, 11.17s/it]

Backing off 0.9 seconds after 1 tries calling function <function Cohere.request at 0x2961e0900> with kwargs {'num_generations': 1}


 92%|█████████▏| 23/25 [04:42<00:19,  9.66s/it]

Backing off 0.7 seconds after 1 tries calling function <function Cohere.request at 0x2961e0900> with kwargs {'num_generations': 1}
Backing off 0.2 seconds after 2 tries calling function <function Cohere.request at 0x2961e0900> with kwargs {'num_generations': 1}


 96%|█████████▌| 24/25 [05:08<00:14, 14.66s/it]

Backing off 0.8 seconds after 1 tries calling function <function Cohere.request at 0x2961e0900> with kwargs {'num_generations': 1}


100%|██████████| 25/25 [05:24<00:00, 12.99s/it]


## Question 2: Using annotate

This question is designed to give you some experience with DSP's powerful `annotate` method. You can think of this as a generic tool for defining general aspects of your prompt. Here we will use it to filter the set of demonstrations we use.

The overall idea here is that the demonstrations we sample might vary in quality in ways that could impact model performance. For example, if we want to try to push the model to provide extractive answers as in classical QA – answers that are substrings of the evidence passage – then it works against our interests to include demonstrations where the model is unable to do this.

We will do this in two parts to facilitate testing.

### Task 1: Filtering demonstrations 1 [2 points]

This is the heart of the question: complete `filter_demos` so that, given a demonstration `d`, it keeps `d` if and only if

1. The passage retrieved for `d` contains `d.answer`, and
2. The model's generation for `d` based on `qa_template_with_passages` contains `d.answer`.
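
The two conditions can be pictured with a small, library-free sketch. The helpers `normalize`, `contains_answer`, and `keep_demo` are hypothetical stand-ins written for illustration, not DSP functions, and `dsp.passage_match` and `dsp.answer_match` may normalize text differently.

```python
import re
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace (an assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"\s+", " ", text).strip()


def contains_answer(text, answers):
    """True if any gold answer is a substring of `text` after normalization."""
    norm_text = normalize(text)
    return any(normalize(a) in norm_text for a in answers)


def keep_demo(passage, prediction, answers):
    """Keep a demonstration only if both the passage and the prediction contain an answer."""
    return contains_answer(passage, answers) and contains_answer(prediction, answers)


passage = "After willing to be re-elected, he was killed in San Ángel in 1928."
print(keep_demo(passage, "1928", ["1928"]))     # passage and prediction both match
print(keep_demo(passage, "In 1930", ["1928"]))  # prediction does not contain the answer
```

A demonstration that fails either check is dropped, which is the same "return `None`" behavior the task asks for.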

In [59]:
@dsp.transformation
def filter_demos(d):

    # Retrieve a passage for `d.question` and make sure that it
    # contains `d.answer`. Use `dsp.passage_match` for this!
    # return None if there is no match.
    ##### YOUR CODE HERE
    passage = dsp.retrieve(d.question, k=1)
    if not dsp.passage_match(passage, d.answer):
        return None
    



    # Sample `k=3` demonstrations to help the model assess this
    # potential demonstration:
    ##### YOUR CODE HERE
    d.demos = dsp.sample(squad_train, k=3)



    # Generate an answer based on `qa_template_with_passages`
    # and use `dsp.answer_match` to check that the predicted answer
    # contains `d.answer`. If it does not, return None.
    ##### YOUR CODE HERE
    ex, ex_compl = dsp.generate(qa_template_with_passages)(d, stage='qa')
    if not dsp.answer_match(ex_compl.answer, d.answer):
        return None


    # Return d, if you got this far:
    ##### YOUR CODE HERE
    return d




Here's a test; this is not an ideal unit test because we don't know which LM you will be using, but it should clarify our intentions and help you with debugging.

In [60]:
def test_filter_demos(func):
    # This example should be filtered at the retrieval step, since
    # 👽 is not in the index:
    ex1 = dsp.Example(
        question="Who is 👽?", context="C0", answer=["👽"])
    result1 = func(ex1)
    errcount = 0
    if result1 is not None:
        errcount += 1
        print(f"Error for `{func.__name__}`: Expected {None}, got {result1}")
    # This example should not be filtered given our tester LM:
    ex2 = dsp.Example(
        question="Who is Beyoncé?", context="C0", answer=["Beyoncé"])
    # This example should be filtered given our tester LM:
    ex3 = dsp.Example(
        question="Who is Beyoncé?", context="C0", answer=["NO MATCH"])
    class TestLM:
        def __init__(self, **kwargs):
            self.kwargs = kwargs
            self.history = []

        def __call__(self, prompt, **kwargs):
            answer = ["Beyoncé"]
            return answer
    dsp.settings.configure(lm=TestLM(), rm=rm)
    try:
        result2 = func(ex2)
        if result2 is None:
            errcount += 1
            print(f"Error for `{func.__name__}`: "
                  f"Expected example not to be filtered by `answer_match`.")
        result3 = func(ex3)
        if result3 is not None:
            errcount += 1
            print(f"Error for `{func.__name__}`: "
                  f"Expected example to be filtered by `answer_match`.")
    except:
        raise
    finally:
        # Restore the actual model:
        dsp.settings.configure(lm=lm, rm=rm)
    if errcount == 0:
        print(f"No errors detected for `{func.__name__}`")

In [61]:
test_filter_demos(filter_demos)

No errors detected for `filter_demos`


### Task 2: Full filtering program [1 point]

The task is to complete `few_shot_openqa_with_context_and_demo_filtering` as a few-shot OpenQA system like the one from Question 1, but using the filtering mechanism defined by `filter_demos`.

In [62]:
@dsp.transformation
def few_shot_openqa_with_context_and_demo_filtering(example, train=squad_train, k=3):

    # Sample 20 demonstrations:
    ##### YOUR CODE HERE
    demos = dsp.sample(train, k=20)



    # Filter the demonstrations using `annotate` and `filter_demos`.
    # The user's `k` should be used to specify the maximum number of
    # demonstrations kept at this stage.
    ##### YOUR CODE HERE
    demos = dsp.annotate(filter_demos)(demos, k=k)



    # Add the list of filtered demonstrations as the `demos`
    # attribute of `example`:
    ##### YOUR CODE HERE
    example.demos = demos


    # Retrieve a context passage for `example.question` and add it
    # as the `context` attribute for the example:
    ##### YOUR CODE HERE
    example.context = dsp.retrieve(example.question, k=1)[0]


    # Generate a prediction using `qa_template_with_passages` as
    # we did before:
    ##### YOUR CODE HERE
    ex, ex_compl = dsp.generate(qa_template_with_passages)(example, stage='qa')



    # Return the generated `Completions` instance:
    ##### YOUR CODE HERE
    return ex_compl




Our previous test should suffice to help with debugging this program:

In [63]:
test_few_shot_openqa_with_context(
    few_shot_openqa_with_context_and_demo_filtering)

No errors found for `few_shot_openqa_with_context_and_demo_filtering`


Quick example:

In [64]:
print(few_shot_openqa_with_context_and_demo_filtering(dev_exs[0]).answer)

Backing off 0.4 seconds after 1 tries calling function <function Cohere.request at 0x2961e0900> with kwargs {'num_generations': 1}
Backing off 2.0 seconds after 2 tries calling function <function Cohere.request at 0x2961e0900> with kwargs {'num_generations': 1}
Backing off 0.8 seconds after 1 tries calling function <function Cohere.request at 0x2961e0900> with kwargs {'num_generations': 1}
Backing off 1.1 seconds after 2 tries calling function <function Cohere.request at 0x2961e0900> with kwargs {'num_generations': 1}
1873


In [65]:
lm.inspect_history(n=1)





Answer the questions with short factoid answers.

---

Follow the following format.

Context:
${sources that may contain relevant content}

Question: ${the question to be answered}

Answer: ${a short factoid answer, often between 1 and 5 words}

---

Context:
Zapatist forces, which were based in neighboring Morelos had strengths in the southern edge of the Federal District, which included Xochimilco, Tlalpan, Tláhuac and Milpa Alta to fight against the regimes of Victoriano Huerta and Venustiano Carranza. After the assassination of Carranza and a short mandate by Adolfo de la Huerta, Álvaro Obregón took power. After willing to be re-elected, he was killed by José de León Toral, a devout Catholic, in a restaurant near La Bombilla Park in San Ángel in 1928. Plutarco Elias Calles replaced Obregón and culminated the Mexican Revolution.

Question: When was Alvaro Obregon killed?

Answer: 1928

---

Context:
Richmond city government consists of a city council with representatives from ni

Here is code for an optional initial evaluation with `tiny_dev`:

In [68]:
# filtering_results = evaluateAnswer(
#     few_shot_openqa_with_context_and_demo_filtering, tiny_dev)

# print(filtering_results['em'])
# print(filtering_results['f1'])

In [69]:
lm.inspect_history(n=1)





Answer the questions with short factoid answers.

---

Follow the following format.

Context:
${sources that may contain relevant content}

Question: ${the question to be answered}

Answer: ${a short factoid answer, often between 1 and 5 words}

---

Context:
Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
Question: What album made her a worldwide

## Question 3: Your original system [3 points]

This question asks you to design your own few-shot OpenQA system. All of the code above can be used and modified for this, and the requirement is just that you try something new that goes beyond what we've done so far. 

Terms for the bake-off:

* You can make free use of SQuAD and other publicly available data.

* The LM must be an autoregressive language model. No trained QA components can be used. This includes general purpose LMs that have been fine-tuned for QA. (We have obviously waded into some vague territory here. The spirit of this is to make use of frozen, general-purpose models. We welcome questions about exactly how this is defined, since it could be instructive to explore this.)

Here are some ideas for the original system:

* We have so far sampled randomly from the SQuAD train set to create few-shot prompts. One might instead sample passages that have some connection to the target question. See `dsp.knn`, for example.

* There are a lot of parameters to our LMs that we have so far ignored. Exploring different values might lead to better results. The `temperature` parameter is highly impactful for our task.

* We have so far made no use of the scores from the LM or the RM.

* We have so far made no use of DSP's functionality for self-consistency. See the DSP intro notebook for examples.
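
As a rough illustration of the self-consistency idea in the last bullet: sample several answers at a non-zero temperature and keep the most frequent one. The sketch below uses plain Python on a made-up list of sampled answers; `majority_answer` is a hypothetical helper, and DSP's own `dsp.majority` may normalize and break ties differently.

```python
from collections import Counter


def majority_answer(sampled_answers):
    """Return the most frequent answer among sampled generations, or None if there are none."""
    normalized = [a.strip().lower() for a in sampled_answers if a.strip()]
    if not normalized:
        return None
    return Counter(normalized).most_common(1)[0][0]


# Made-up samples standing in for several LM generations for one question:
samples = ["1873", "1873 ", "In 1874", "1873", "1874"]
print(majority_answer(samples))
```

Voting only helps when the samples differ, which is why this idea is usually paired with a higher `temperature` and several generations per question.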

__Original system instructions__:

In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies.

In [70]:
# PLEASE MAKE SURE TO INCLUDE THE FOLLOWING BETWEEN THE START AND STOP COMMENTS:
#   1) Textual description of your system.
#   2) The code for your original system.
# PLEASE MAKE SURE NOT TO DELETE OR EDIT THE START AND STOP COMMENTS

# START COMMENT: Enter your system description in this cell.


# STOP COMMENT: Please do not remove this comment.

In [71]:
# dsp.settings.vectorizer = dsp.sentence_vectorizer.SentenceTransformersVectorizer()

In [72]:
from dsp.utils import deduplicate
import dspy

class GenerateSearchQuery(dspy.Signature):
    """Write a simple search query that will help answer a complex question."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    query = dspy.OutputField()

class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

class SimplifiedBaleen(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()

        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery, n=1) for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer, n=1)
        self.max_hops = max_hops

    def forward(self, question):
        context = []

        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)

        pred = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=pred.answer)

@dsp.transformation
def few_shot_openqa_zq(example, train=squad_train, k=3):
    # demos = dsp.sample(train, k=20)
    # example.demos = dsp.annotate(filter_demos)(demos, k=k)
    # example.context = dsp.retrieve(example.question, k=1)[0]
    # ex, ex_compl = dsp.generate(qa_template_with_passages)(example, stage='qa')
    # return ex_compl
    baleen = SimplifiedBaleen()
    return baleen(example.question)




In [74]:
# filtering_results = evaluateAnswer(few_shot_openqa_zq, tiny_dev)

# print(filtering_results["em"])
# print(filtering_results["f1"])

In [128]:
@dsp.transformation
def multihop_attempt(d: dsp.Example) -> dsp.Example:
    x = dsp.Example(question=d.question, demos=dsp.all_but(squad_train, d), answer=d.answer)
    x = multihop_search(x, max_hops=2, k=3)

    x = QA_predict(x, sc=False)
    if not dsp.answer_match(x.answer, d.answer):
        return None

    return d.copy(**x)


@dsp.transformation
def multihop_demonstrate(x, train=squad_train, k=3):
    demos = dsp.sample(train, k=7)
    x.demos = dsp.annotate(multihop_attempt)(demos, k=k)
    return x


QueryReasoning = dsp.Type(
    prefix="Reasoning: Let's think step by step in order to produce the query.",
    desc="We ...",
)

query_template = dsp.Template(
    instructions="Write a simple search query that will help answer a complex question.",
    context=Context(),
    question=Question(),
    reasoning=QueryReasoning(),
    query=dsp.Type(
        prefix="Query:", desc="${a simple question for seeking the missing information}"
    ),
)


@dsp.transformation
def multihop_search(example: dsp.Example, max_hops=2, k=5) -> dsp.Example:
    example.context = []

    for hop in range(max_hops):
        # Generate queries
        template = query_template
        example, completions, queries = generate_queries(hop, example, template)
        example.context = dsp.retrieveEnsemble(queries, k=k)

        # Including the reasoning from this hop in the context for the next hop
        # can improve the results (as in Baleen).
        if hop > 0:
            example.context = [completions[0].reasoning] + example.context

    return example

@dsp.transformation
def generate_queries(hop, example, template):
    # Early return: skip the LM call and search with the original question only.
    # The LM-based query generation below is unreachable until these two lines are removed.
    example.reasoning = "Here's some reasons."
    return example, dsp.Completions([example], query_template), [example.question]
    example, completions = dsp.generate(template, n=10, temperature=0.7)(
        example, stage=f"h{hop}"
    )

    # Collect the queries and search with result fusion
    queries = [c.query for c in completions] + [example.question]
    return example, completions, queries


Reasoning = dsp.Type(
    prefix="Reasoning: Let's think step by step in order to produce the answer.",
    desc="We ...",
)

qa_template_with_CoT = dsp.Template(
    instructions="Answer questions with short factoid answers.",
    context=Context(),
    question=Question(),
    reasoning=Reasoning(),
    answer=Answer(),
)


@dsp.transformation
def QA_predict(example: dsp.Example, sc=True):
    if sc:
        example, completions = dsp.generate(
            qa_template_with_CoT, n=20, temperature=0.7
        )(example, stage="qa")
        completions = dsp.majority(completions)
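    else:
        # The LM call is commented out: this branch copies the first gold answer
        # into the prediction, so scores computed with `sc=False` do not reflect the LM.
        # example, completions = dsp.generate(qa_template_with_CoT)(example, stage="qa")
        example.reasoning = "We have some reasoning here."
        example.answer = example.answer[0]
        completions = dsp.Completions([example], qa_template_with_CoT)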

    return example.copy(answer=completions.answer)


@dsp.transformation
def multihop_QA_zq(x, train=squad_train, k=3, max_hops=2, passages_per_hop=3):
    x = multihop_demonstrate(x)
    x = multihop_search(x)
    x = QA_predict(x, sc=False)
    return x

In [129]:
multihop_results = evaluateAnswer(multihop_QA_zq, tiny_dev)

print(multihop_results["em"])
print(multihop_results["f1"])

100%|██████████| 25/25 [00:36<00:00,  1.44s/it]

100.0
1.0





### Improvements

1. Auto stop multiple hops

In [None]:
SearchQueryWithStopping = dsp.Type(
    prefix="Query:",
    desc="${a simple question for seeking the missing or required information --- say N/A if the context above contains all of the required information}",
)
query_template_v2 = dsp.Template(
    instructions="Write a simple search query that will help answer a complex question. If the context above contains all of the required information, write N/A.",
    context=Context(),
    question=Question(),
    reasoning=QueryReasoning(),
    query=SearchQueryWithStopping(),
)


@dsp.transformation
def multihop_search_v3(example: dsp.Example, max_hops=3, k=5) -> dsp.Example:
    example.context = []

    for hop in range(max_hops):
        # Generate queries
        template = query_template_v2
        example, completions, queries = generate_queries(hop, example, template)
        if dsp.majority(completions).query == "N/A":
            break
        example.context = dsp.retrieveEnsemble(queries, k=k)

        # Including the reasoning from this hop in the context for the next hop
        # can improve the results (as in Baleen).
        if hop > 0:
            example.context = [completions[0].reasoning] + example.context

    return example

2. Sample demos based on similarity to the question

3. Demos can use generated answers instead of the gold answers. This might be more familiar to the LM.

4. The top passage from each hop should not get lost because of `dsp.retrieveEnsemble`

5. Separate fields for reasoning and summary for query generation
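
Improvement 2 (sampling demos by similarity to the question) could be prototyped along the following lines. This is a minimal sketch that uses bag-of-words cosine similarity over made-up training questions; the helpers `bow`, `cosine`, and `nearest_demos` are hypothetical, and a real system would more likely use dense embeddings (for example through `dsp.knn`) than raw token counts.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector as a token -> count mapping (whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def nearest_demos(question, train_questions, k=3):
    """Return the `k` training questions most similar to `question`."""
    q_vec = bow(question)
    ranked = sorted(train_questions, key=lambda t: cosine(q_vec, bow(t)), reverse=True)
    return ranked[:k]


train_questions = [
    "When was Alvaro Obregon killed?",
    "What album made her a worldwide known artist?",
    "What channel replaced Sky Travel?",
]
print(nearest_demos("What channel replaced Sky One?", train_questions, k=1))
```

The selected questions would then take the place of the randomly sampled demonstrations before any filtering step.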

##### Question: Can I use Hugging Face models?

In [1]:
import dsp

lm = dsp.HFModel(model="meta-llama/Llama-2-7b-chat-hf")
lm('who is the president of US?', temperature=0.9, n=1)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

who is the president of US?


['\n\nAnswer: The current President of the United States is Joe Biden.']

In [3]:
lm('who is the president of US?', temperature=0.9, n=1)

who is the president of US?


['\n\nAnswer: The President of the United States is Joe Biden. He was inaugurated as the 46th President of the United States on January 20, 2021.']

It takes too long on a Mac, so no.

##### Question: How to use `multihop_results['df']` to improve the design?

##### Question: How does `dsp.generate` insert the demonstration into the prompt?
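
Judging from the prompts printed by `lm.inspect_history` earlier in this notebook, each demonstration appears as a filled-in copy of the template, placed between the format description and the target example, with `---` separators. The sketch below mimics that layout in plain Python. It is not DSP's actual implementation, and `render_example` and `build_prompt` are hypothetical names.

```python
FORMAT_BLOCK = (
    "Follow the following format.\n\n"
    "Context:\n${sources that may contain relevant content}\n\n"
    "Question: ${the question to be answered}\n\n"
    "Answer: ${a short factoid answer, often between 1 and 5 words}"
)


def render_example(ex, include_answer):
    """Render one example; the target leaves `Answer:` open for the LM to complete."""
    answer = f" {ex['answer']}" if include_answer else ""
    return f"Context:\n{ex['context']}\n\nQuestion: {ex['question']}\n\nAnswer:{answer}"


def build_prompt(instructions, demos, target):
    """Join instructions, format description, demos, and the target with `---` separators."""
    blocks = [instructions, FORMAT_BLOCK]
    blocks += [render_example(d, include_answer=True) for d in demos]
    blocks.append(render_example(target, include_answer=False))
    return "\n\n---\n\n".join(blocks)


demo = {
    "context": "He was killed in a restaurant in San Ángel in 1928.",
    "question": "When was Alvaro Obregon killed?",
    "answer": "1928",
}
target = {"context": "In 1873, Tesla returned to his birthtown, Smiljan.",
          "question": "When did Tesla return to Smiljan?"}
prompt = build_prompt("Answer the questions with short factoid answers.", [demo], target)
print(prompt)
```

Under this reading, the only difference between a demonstration and the target is that the target's final field is left empty.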

##### Question: How do we compile the Baleen program, and how does it work?

Now is the time to compile our multi-hop (`SimplifiedBaleen`) program.

We will first define our validation logic, which will simply require that:

- The predicted answer matches the gold answer.
- The retrieved context contains the gold answer.
- None of the generated queries is rambling (i.e., none exceeds 100 characters in length).
- None of the generated queries is roughly repeated (i.e., none has an F1 score of 0.8 or higher with an earlier query).
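
To make the last two checks concrete, here is a small sketch of a token-level F1 score and the length and repetition tests built on it. The helpers `token_f1` and `has_bad_query` are hypothetical and use simple whitespace tokenization; the DSPy metric used in the cell below may tokenize and normalize differently.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between two strings, using lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def has_bad_query(queries, max_chars=100, f1_threshold=0.8):
    """True if any query is too long or nearly repeats an earlier query."""
    if any(len(q) > max_chars for q in queries):
        return True
    for idx in range(1, len(queries)):
        if any(token_f1(queries[idx], earlier) >= f1_threshold for earlier in queries[:idx]):
            return True
    return False


repeated = ["who founded the glass industry", "who founded the glass industry in France"]
distinct = ["who founded the glass industry", "where was Prince Louis de Condé born"]
print(has_bad_query(repeated))  # second query overlaps heavily with the first
print(has_bad_query(distinct))  # no shared tokens between the two queries
```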

In [None]:
def validate_context_and_answer_and_hops(example, pred, trace=None):
    if not dspy.evaluate.answer_exact_match(example, pred):
        return False
    if not dspy.evaluate.answer_passage_match(example, pred):
        return False

    hops = [example.question] + [
        outputs.query for *_, outputs in trace if "query" in outputs
    ]

    if max([len(h) for h in hops]) > 100:
        return False
    if any(
        dspy.evaluate.answer_exact_match_str(hops[idx], hops[:idx], frac=0.8)
        for idx in range(2, len(hops))
    ):
        return False

    return True

In [None]:
squad_train[0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answer': ['Saint Bernadette Soubirous']}

In [None]:
from dspy.teleprompt import BootstrapFewShot

teleprompter = BootstrapFewShot(metric=validate_context_and_answer_and_hops)
trainset = []
for x in squad_train:
    x = x.copy()
    x['answer'] = x['answer'][0]
    trainset.append(x)
examples = [dspy.Example(x).with_inputs('question') for x in trainset]
compiled_baleen = teleprompter.compile(SimplifiedBaleen(), teacher=SimplifiedBaleen(passages_per_hop=2), trainset=examples[:20])

100%|██████████| 20/20 [18:23<00:00, 55.17s/it]

Bootstrapped 3 full traces after 20 examples in round 0.





In [None]:
filtering_results = evaluateAnswer(few_shot_openqa_zq, tiny_dev)

print(filtering_results["em"])
print(filtering_results["f1"])

100%|██████████| 25/25 [49:06<00:00, 117.87s/it]

40.0
0.4716666666666666





## Question 4: Bakeoff entry [1 point]

For the bake-off, you simply need to be able to run your system on the file 

```data/openqa/cs224u-openqa-test-unlabeled.txt```

The following code should download it for you if necessary:

In [None]:
if not os.path.exists(os.path.join("data", "openqa", "cs224u-openqa-test-unlabeled.txt")):
    !mkdir -p data/openqa
    !wget https://web.stanford.edu/class/cs224u/data/cs224u-openqa-test-unlabeled.txt -P data/openqa/

If the above fails, you can just download https://web.stanford.edu/class/cs224u/data/cs224u-openqa-test-unlabeled.txt and place it in `data/openqa`.

This file contains only questions. The starter code below will help you structure this. It writes a file "cs224u-openqa-bakeoff-entry.json" to the current directory. That file should be uploaded as-is. Please do not change its name.
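
Before uploading, it can help to reload the file and confirm its shape. The sketch below assumes the format implied by the starter code in the next cell (a JSON object mapping each question string to an answer string). `check_submission` is a hypothetical helper, and the entry written here is a toy example, not real bakeoff data.

```python
import json
import os
import tempfile


def check_submission(path):
    """Load a submission file and verify it maps question strings to answer strings."""
    with open(path) as f:
        gens = json.load(f)
    assert isinstance(gens, dict), "Top level must be a JSON object."
    for question, answer in gens.items():
        assert isinstance(question, str) and isinstance(answer, str), (
            f"Non-string entry for question {question!r}")
    return len(gens)


# Toy round trip in a temporary directory so no real submission is overwritten:
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "cs224u-openqa-bakeoff-entry.json")
    with open(path, "wt") as f:
        json.dump({"What channel replaced Sky Travel?": "Sky Three"}, f, indent=4)
    n_entries = check_submission(path)

print(n_entries)
```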

In [None]:
import json

def create_bakeoff_submission(fn):
    """
    The argument `fn` is a DSP program with the same signature as the 
    ones we wrote above: `dsp.Example` to `dsp.Completions`.
    """

    filename = os.path.join("data", "openqa", "cs224u-openqa-test-unlabeled.txt")

    # This should become a mapping from questions (str) to the answer
    # strings produced by your system.
    gens = {} 

    with open(filename) as f:
        questions = f.read().splitlines()

    questions = [dsp.Example(question=q) for q in questions]

    # `questions` is the list of `dsp.Example` instances you need to 
    # evaluate your system on. 
    #
    # Here we loop over the questions, run the system `fn`, and
    # store its `answer` value as the prediction:
    for question in tqdm.tqdm(questions):
        gens[question.question] = fn(question).answer

    # Quick tests we advise you to run: 
    # 1. Make sure `gens` is a dict with the questions as the keys:
    assert all(q.question in gens for q in questions)
    # 2. Make sure the values are strings:
    assert all(isinstance(d, str) for d in gens.values())

    # And finally the output file:
    with open("cs224u-openqa-bakeoff-entry.json", "wt") as f:
        json.dump(gens, f, indent=4)

Here's what it looks like to run our first program from Question 1, `few_shot_openqa_with_context`, on the bakeoff data:

In [None]:
# create_bakeoff_submission(few_shot_openqa_with_context)