# Homework and bakeoff: Few-shot OpenQA with DSP

In [None]:
__author__ = "Christopher Potts and Omar Khattab"
__version__ = "CS224u, Stanford, Spring 2023"

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cgpotts/cs224u/blob/master/hw_openqa.ipynb)
[![Open in SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/cgpotts/cs224u/blob/master/hw_openqa.ipynb)

If Colab is opened with this badge, please **save a copy to drive** (from the File menu) before running the notebook.

## Overview

The goal of this homework is to explore retrieval-augmented in-context learning. This is an exciting area that brings together a number of recent task ideas and modeling innovations. We will use the [DSP programming library](https://github.com/stanfordnlp/dsp) to build systems in this new mode.

Our core task is __open-domain question answering (OpenQA)__. In this task, all that is given by the dataset is a question text, and the task is to answer that question. By contrast, in modern QA tasks, the dataset provides a question text and a gold passage, usually with a firm guarantee that the answer will be a substring of the passage.

OpenQA is substantially harder than standard QA. The usual strategy is to use a _retriever_ to find passages in a large collection of texts and train a _reader_ to find answers in those passages. This means we have no guarantee that the retrieved passage will contain the answer we need. If we don't retrieve a passage containing the answer, our reader has no hope of succeeding. Although this is challenging, it is much more realistic and widely applicable than standard QA. After all, with the right retriever, an OpenQA system could be deployed over the entire Web.
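
The dependence of the reader on the retriever can be made concrete with a toy sketch. It assumes a word-overlap retriever over three made-up passages; the function names are hypothetical and nothing here reflects how ColBERT or DSP work internally.

```python
def retrieve(question, passages, k=1):
    """Rank passages by word overlap with the question (toy scoring)."""
    q_tokens = set(question.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: len(q_tokens & set(p.lower().split())),
        reverse=True)
    return ranked[:k]


def answerable(gold_answer, retrieved):
    """An extractive reader can succeed only if some passage has the answer."""
    return any(gold_answer.lower() in p.lower() for p in retrieved)


# Made-up collection, for illustration only:
passages = [
    "Paris is the capital of France.",
    "The Nile is a river in Africa.",
    "Mount Everest is the highest mountain on Earth."]

# The retriever finds a passage that contains the answer:
hit = retrieve("What is the capital of France?", passages, k=1)

# The collection lacks the relevant passage, so retrieval cannot help:
miss = retrieve("Which river flows through Africa?", passages[:1], k=1)

print(answerable("Paris", hit), answerable("Nile", miss))
```

In the second case no reader, however strong, can extract the answer from the retrieved text.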

The task posed by this homework is harder even than OpenQA. We are calling this task __few-shot OpenQA__. The defining feature of this task is that the reader is simply a frozen, general purpose language model. It accepts string inputs (prompts) and produces text in response. It is not trained to answer questions per se, and nothing about its structure ensures that it will respond with a substring of the prompt corresponding to anything like an answer.

__Few-shot QA__ (but not OpenQA!) is explored in the famous GPT-3 paper ([Brown et al. 2020](https://arxiv.org/abs/2005.14165)). The authors are able to get traction on the problem using GPT-3, an incredible finding. Our task here – __few-shot OpenQA__ – pushes this even further by retrieving passages to use in the prompt rather than assuming that the gold passage can be used in the prompt. If we can make this work, then it should be a major step towards flexibly and easily deploying QA technologies in new domains.

In summary:

| Task             | Passage given | Task-specific reader training |Task-specific retriever training  | 
|-----------------:|:-------------:|:-----------------------------:|:--------------------------------:|
| QA               | yes           | yes                           | n/a                              |
| OpenQA           | no            | yes                           | maybe                            |
| Few-shot QA      | yes           | no                            | n/a                              |
| Few-shot OpenQA  | no            | no                            | maybe                            | 

Just to repeat: your mission is to explore the final line in this table. The core notebook and assignment don't address the issue of training the retriever in a task-specific way, but this is something you could pursue for a final project; [the ColBERT codebase](https://github.com/stanford-futuredata/ColBERT) makes this easy.

As usual, this notebook sets up the task and provides starter code. We will be relying on the DSP library, which allows us to define retrieval-augmented in-context learning systems in code. We first provide two fully implemented examples:

* _Few-shot OpenQA_: The given input is a question and the goal is to provide an answer. Some _demonstration_ Q/A pairs are sampled from a train set (in our case, SQuAD).

* _Few-shot QA with context_: The given input is a question with an associated evidence passage, and the goal is to provide an answer. The _demonstrations_ are now Q/A pairs with associated gold evidence passages. These are sampled from a train set (in our case, SQuAD).

The above examples are followed by some assignment questions aimed at helping you to think creatively about the problem. The first of these defines a core system for our target task:

* _Few-shot OpenQA with context_: This is like _few-shot QA with context_ except the passages are now retrieved from a large search index using ColBERT. 

The second question illustrates how to use the powerful DSP `annotate` function to improve the set of demonstrations used by the system.

It is a requirement of the bake-off that a general-purpose language model be used. In particular, trained QA systems cannot be used at all, and no fine-tuning is allowed either. See Question 3 (the original system question) below for guidance on which models are allowed.

Note: the models we are working with here are _big_. This poses a challenge that is increasingly common in NLP: you have to pay one way or another. You can pay to use the GPT-3 API, or you can pay to use an Eleuther model on a heavy-duty cluster computer, or you can pay with time by using an Eleuther model on a more modest computer.  __For now, though, the Cohere models are free to use, so they should be your first choice; see [setup.ipynb](setup.ipynb) if you don't have an account__.

## Set-up

We have sought to make this notebook self-contained and easy to use on a personal computer, on Google Colab, and in Sagemaker Studio. For personal computer use, we assume you have already done everything in [setup.ipynb](setup.ipynb). For cloud usage, the next few code blocks should handle all set-up steps.

In [None]:
try: 
    # This library is our indicator that the required installs
    # need to be done.
    import datasets
    root_path = '.'
except ModuleNotFoundError:
    !git clone https://github.com/cgpotts/cs224u/
    !pip install -r cs224u/requirements.txt
    root_path = 'dsp'

In [None]:
import cohere
from datasets import load_dataset
import openai
import os
import dsp

In [None]:
os.environ["DSP_NOTEBOOK_CACHEDIR"] = os.path.join(root_path, 'cache')

openai_key = os.getenv('OPENAI_API_KEY')  # or replace with your API key (optional)

cohere_key = os.getenv('COHERE_API_KEY')  # or replace with your API key (optional)

colbert_server = 'http://index.contextual.ai:8893/api/search'

Here we establish the Language Model `lm` and Retriever Model `rm` that we will be using. The defaults for `lm` are just for development. You may want to develop using an inexpensive model and then do your final evaluations with an expensive one.

In [None]:
lm = dsp.GPT3(model='text-davinci-001', api_key=openai_key)

# Options for Cohere: command-medium-nightly, command-xlarge-nightly
#lm = dsp.Cohere(model='command-xlarge-nightly', api_key=cohere_key)

rm = dsp.ColBERTv2(url=colbert_server)

dsp.settings.configure(lm=lm, rm=rm)

Here's a command you can run to see which OpenAI models are available; OpenAI has entered into an increasingly closed mode where many older models are not available, so there are likely to be some surprises lurking here:

In [None]:
# [d["root"] for d in openai.Model.list()["data"]]

## SQuAD

Our core development dataset is [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/). We chose this dataset because it is well-known and widely used, and it is large enough to support lots of meaningful development work, without, though, being so large as to require lots of compute power. It is also useful that it has gold passages supporting the standard QA formulation, so we can see how well our LM performs with an "oracle" retriever that always retrieves the gold passage.

In [None]:
squad = load_dataset("squad")

The following utility just reads a SQuAD split in as a list of `dsp.Example` instances:

In [None]:
def get_squad_split(squad, split="validation"):
    """
    Use `split='train'` for the train split.

    Returns
    -------
    list of `dsp.Example` instances with attributes
    id, title, context, question, answer

    """
    data = zip(*[squad[split][field] for field in squad[split].features])
    return [dsp.Example(id=eid, title=title, context=context, question=q, answer=a['text']) 
            for eid, title, context, q, a in data]

### SQuAD train

To build few-shot prompts, we will often sample SQuAD train examples, so we load that split here:

In [None]:
squad_train = get_squad_split(squad, split="train")

### SQuAD dev

In [None]:
squad_dev = get_squad_split(squad)

### SQuAD dev sample

Evaluations are expensive in this new era! Here's a small sample to use for dev assessments:

In [None]:
dev_exs = sorted(squad_dev, key=lambda x: hash(x.id))[: 200]
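
One caveat about the cell above: Python salts the built-in `hash` for strings per process unless `PYTHONHASHSEED` is set, so the sample can differ between sessions. If a reproducible sample matters, a content-based digest is one option. The sketch below assumes string IDs; `stable_key` is a hypothetical helper and the IDs are placeholders standing in for `x.id` values.

```python
import hashlib


def stable_key(example_id):
    """Hex digest of the ID; identical across Python sessions."""
    return hashlib.md5(example_id.encode("utf-8")).hexdigest()


# Placeholder IDs standing in for `x.id` values from `squad_dev`:
ids = ["id-0001", "id-0002", "id-0003", "id-0004"]

# Same pseudo-random order every time the code is run:
sample = sorted(ids, key=stable_key)[:2]
print(sample)
```

With the real data, the same idea would read `sorted(squad_dev, key=lambda x: stable_key(x.id))[: 200]`.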

## Evaluation

Our evaluation protocols are the standard ones for SQuAD and related tasks: exact match of the answer (EM) and token-level F1. We'll rely primarily on DSP for these evaluation utilities; the following is a light modification of `dsp.evaluation.utils.evaluateAnswer`, which is itself built on evaluation code from the [apple/ml-qrecc](https://github.com/apple/ml-qrecc/blob/main/utils/evaluate_qa.py) repository. It performs very basic string normalization before doing the core comparisons.

In [None]:
from dsp.utils import EM, F1
import tqdm
import pandas as pd

def evaluateAnswer(fn, dev):
    """Evaluate a DSP program on `dev`.

    Parameters
    ----------
    fn : DSP system
    dev : list of `dsp.Example` instances

    Returns
    -------
    dict with keys "df", "em", "f1" storing assessment data
    """
    data = []
    for example in tqdm.tqdm(dev):
        prediction = fn(example)
        d = dict(example)
        pred = prediction.answer
        d['prediction'] = pred
        d['em'] = EM(pred, example.answer)
        d['f1'] = F1(pred, example.answer)
        data.append(d)
    df = pd.DataFrame(data)
    em = round(100.0 * df['em'].sum() / len(dev), 1)
    df['em'] = df['em'].apply(lambda x: '✔️' if x else '❌')
    f1 = df['f1'].mean()
    return {'df': df, 'em': em, 'f1': f1}
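
To make the two metrics concrete, here is a minimal plain-Python sketch in the style of SQuAD evaluation. It is an illustration rather than the `dsp.utils` implementation: the function names and the exact normalization steps (lowercasing, dropping punctuation and English articles) are assumptions.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, squeeze whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, golds):
    """True if the normalized prediction equals any normalized gold answer."""
    return any(normalize(prediction) == normalize(g) for g in golds)


def token_f1(prediction, golds):
    """Best token-level F1 of the prediction against any gold answer."""
    def f1_single(pred, gold):
        pred_toks = normalize(pred).split()
        gold_toks = normalize(gold).split()
        # Multiset intersection counts each shared token at most
        # as often as it appears on both sides.
        overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_toks)
        recall = overlap / len(gold_toks)
        return 2 * precision * recall / (precision + recall)
    return max(f1_single(prediction, g) for g in golds)


print(exact_match("The Eiffel Tower!", ["Eiffel Tower"]))
print(token_f1("Eiffel Tower in Paris", ["Eiffel Tower"]))
```

EM is all-or-nothing, whereas F1 gives partial credit: the second prediction has perfect recall but only half of its tokens are correct.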

## DSP basics

### LM usage

Here's the most basic way to use the LM:

In [None]:
lm("Which U.S. states border no U.S. states?")

Keyword arguments to the underlying LM are passed through:

In [None]:
lm("Which U.S. states border no U.S. states?", temperature=0.9, n=4)

With `lm.inspect_history`, we can see the most recent language model calls:

In [None]:
lm.inspect_history(n=1)

### Prompt templates

In DSP, the more usual way to call the LM is to define a prompt template. Here we define a generic QA prompt template:

In [None]:
Question = dsp.Type(
    prefix="Question:", 
    desc="${the question to be answered}")

Answer = dsp.Type(
    prefix="Answer:", 
    desc="${a short factoid answer, often between 1 and 5 words}", 
    format=dsp.format_answers)

qa_template = dsp.Template(
    instructions="Answer questions with short factoid answers.", 
    question=Question(), 
    answer=Answer())

And here is a self-contained example that uses our question and template to create a prompt:

In [None]:
states_ex = dsp.Example(
    question="Which U.S. states border no U.S. states?",
    demos=dsp.sample(squad_train, k=2))

print(qa_template(states_ex))
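
For intuition about what a few-shot prompt contains, the following plain-Python sketch assembles one by hand. It is not how `dsp.Template` renders prompts (the real format has more structure); `render_prompt` and the two demonstrations are hypothetical.

```python
def render_prompt(instructions, demos, question):
    """Join instructions, solved demonstrations, and the open target question."""
    blocks = [instructions]
    for demo in demos:
        blocks.append(
            f"Question: {demo['question']}\nAnswer: {demo['answer']}")
    # The target ends with a bare "Answer:" for the LM to complete.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Made-up demonstrations standing in for sampled train examples:
demos = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"}]

prompt = render_prompt(
    "Answer questions with short factoid answers.",
    demos,
    "Which U.S. states border no U.S. states?")

print(prompt)
```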

### Prompt-based generation

We can now put the above pieces together to call the model with our constructed prompt:

In [None]:
states_ex, states_compl = dsp.generate(qa_template)(states_ex, stage='basics')

In [None]:
print(states_compl.answer)

And here's precisely what the model saw and did:

In [None]:
lm.inspect_history(n=1)

### Retrieval

The final major component of our systems is retrieval. When we defined `rm`, we connected to a remote ColBERT index and retriever system that we can now use for search.

In [None]:
states_ex.question

The basic `dsp.retrieve` method returns only passages:

In [None]:
passages = dsp.retrieve(states_ex.question, k=1)

In [None]:
passages

If we need passages with scores and other metadata, we can call `rm` directly:

In [None]:
rm(states_ex.question, k=1)

## Few-shot OpenQA

With the above pieces in place, we can define our first DSP system. This one does few-shot OpenQA with no context passages. In essence, our prompts contain

1. A sequence of Q/A demonstrations (no context passages).
2. The target question (no context passage).

Here is the full system; note the use of the decorator `@dsp.transformation` – this will ensure that no `example` instances are modified when the program is used.

In [None]:
@dsp.transformation
def few_shot_openqa(example, train=squad_train, k=2): 
    example.demos = dsp.sample(train, k=k)
    example, completions = dsp.generate(qa_template)(example, stage='qa')
    return completions

There are really just two steps here. Let's go through them individually. Our example:

In [None]:
ex = squad_dev[0].copy()

ex

We add some demonstrations:

In [None]:
ex.demos = dsp.sample(squad_train, k=2)

ex

And then we call the LM using `qa_template`:

In [None]:
ex, ex_compl = dsp.generate(qa_template)(ex, stage='qa')

Here, `ex_compl` is a `Completions` instance. We will typically use only the `answer` attribute:

In [None]:
print(ex_compl.answer)

And, as a final check, we can see precisely what the LM saw:

In [None]:
lm.inspect_history(n=1)

## Few-shot QA with context

The above system makes no use of evidence passages. As a first step toward bringing in such passages, we define a regular few-shot QA system. For this system, prompts contain:

1. A sequence of Q/A demonstrations, each with a gold context passage.
2. The target question with a gold context passage.

This kind of system is very demanding in terms of data, since we need to have gold evidence passages for every Q/A pair used for demonstrations and for the Q that is our target. Datasets like SQuAD support this, but it's a rare situation in the world. (Our next system will address this by dropping the need for gold passages.)

### Template with context

The first step toward defining this system is a new prompt template that includes context:

In [None]:
Context = dsp.Type(
    prefix="Context:\n",
    desc="${sources that may contain relevant content}",
    format=dsp.passages2text)

qa_template_with_passages = dsp.Template(
    instructions=qa_template.instructions,
    context=Context(), 
    question=Question(), 
    answer=Answer())

Here's what this does for a SQuAD example:

In [None]:
print(qa_template_with_passages(ex))

### The system

And here is the full system; the code is identical to `few_shot_openqa` except we now use `qa_template_with_passages`:

In [None]:
@dsp.transformation
def few_shot_qa_with_context(example, train=squad_train, k=3):
    example.demos = dsp.sample(train, k=k)
    generator = dsp.generate(qa_template_with_passages)
    example, completions = generator(example, stage='qa')
    return completions

In [None]:
print(few_shot_qa_with_context(squad_dev[0]).answer)

In [None]:
lm.inspect_history(n=1)

## Dev evaluations

This quick section shows some full evaluations using `evaluateAnswer` (see [Evaluation](#Evaluation) above). Depending on which model you're using, these evaluations could be expensive, so you might want to run them only sparingly. Here I am running them on just 25 dev examples to further avoid cost run-ups.

In [None]:
tiny_dev = dev_exs[: 25]

In [None]:
# few_shot_openqa_results = evaluateAnswer(few_shot_openqa, tiny_dev)
#
# print(few_shot_openqa_results['em'])
# print(few_shot_openqa_results['f1'])

You can also see the full set of results:

In [None]:
# few_shot_openqa_results['df'].head()

In [None]:
# few_shot_qa_results = evaluateAnswer(few_shot_qa_with_context, tiny_dev)
#
# print(few_shot_qa_results['em'])
# print(few_shot_qa_results['f1'])

## Question 1: Few-shot OpenQA with context [3 points]

Your task here is to define a first instance of our target system: Few-shot OpenQA with context passages. To do this, you simply complete `few_shot_openqa_with_context`:

In [None]:
@dsp.transformation
def few_shot_openqa_with_context(example, train=squad_train, k=3):
    pass
    # Sample `k` demonstrations from `train`:
    ##### YOUR CODE HERE



    # For each demonstration, retrieve one passage and add it
    # as the `context` attribute so we can use our template
    # `qa_template_with_passages`:
    ##### YOUR CODE HERE



    # Add the list of demonstrations to `example` as the `demos` attribute:
    ##### YOUR CODE HERE



    # Retrieve a context passage for `example` itself and add it
    # as the `context` attribute:
    ##### YOUR CODE HERE



    # Use `dsp.generate` to call the model on `example` using
    # `qa_template_with_passages`:
    ##### YOUR CODE HERE



    # Return the Completions instance returned by `dsp.generate`:
    ##### YOUR CODE HERE




A quick test you can use:

In [None]:
def test_few_shot_openqa_with_context(func):
    ex = dsp.Example(question="Q0", context="C0", answer=["A0"])
    train = [
        dsp.Example(question="Q1", context=None, answer=["A1"]),
        dsp.Example(question="Q2", context=None, answer=["A2"]),
        dsp.Example(question="Q3", context=None, answer=["A3"])]
    compl = func(ex, train=train, k=2)
    errcount = 0
    # Check the LM was used as expected:
    if len(compl.data) != 1:
        errcount += 1
        print(f"Error for `{func.__name__}`: Unexpected LM output.")
    data = compl.data[0]
    # Check that the right number of demos was used:
    demos = data['demos']
    if len(demos) > 2:
        errcount += 1
        print(f"Error for `{func.__name__}`: "
              f"Unexpected demo count: {len(demos)}")
    # Check that context passages were included in the prompt:
    fields = compl.template.fields
    if not any(f.name == 'Context:' for f in fields):
        errcount += 1
        print(f"Error for `{func.__name__}`: "
              f"No context passages in the prompt.")
    # Check that the context passages were retrieved:
    if data['context'] == "C0":
        errcount += 1
        print(f"Error for `{func.__name__}`: "
              f"No context passage retrieved for the target.")
    for d in demos:
        if d['context'] is None:
            errcount += 1
            print(f"Error for `{func.__name__}`: "
                  f"No context passage retrieved for demo {d}.")
    if errcount == 0:
        print(f"No errors found for `{func.__name__}`")

In [None]:
test_few_shot_openqa_with_context(few_shot_openqa_with_context)

In [None]:
print(few_shot_openqa_with_context(dev_exs[0]).answer)

In [None]:
lm.inspect_history(n=1)

Here's an optional evaluation of the system using `tiny_dev`:

In [None]:
# few_shot_openqa_with_context_results = evaluateAnswer(
#     few_shot_openqa_with_context, tiny_dev)
#
# few_shot_openqa_with_context_results['f1']

## Question 2: Using annotate

This question is designed to give you some experience with DSP's powerful `annotate` method. You can think of this as a generic tool for defining general aspects of your prompt. Here we will use it to filter the set of demonstrations we use.

The overall idea here is that the demonstrations we sample might vary in quality in ways that could impact model performance. For example, if we want to try to push the model to provide extractive answers as in classical QA – answers that are substrings of the evidence passage – then it works against our interests to include demonstrations where the model is unable to do this.

We will do this in two parts to facilitate testing.

### Task 1: Filtering demonstrations [2 points]

This is the heart of the question: complete `filter_demos` so that, given a demonstration `d`, it returns `d` if and only if

1. The passage retrieved for `d` contains `d.answer`, and
2. The model's generation for `d` based on `qa_template_with_passages` contains `d.answer`.
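
The logic of these two conditions can be sketched in plain Python with stand-in components. This is only an illustration of the control flow: `retrieve_fn` and `generate_fn` are hypothetical stubs rather than DSP functions, and the simple case-insensitive substring check is an assumption that may differ from what `dsp.passage_match` and `dsp.answer_match` do.

```python
def contains_answer(text, answers):
    """Case-insensitive check that any gold answer string occurs in `text`."""
    return any(a.lower() in text.lower() for a in answers)


def keep_demo(demo, retrieve_fn, generate_fn):
    """Return `demo` if both conditions hold, else None."""
    passage = retrieve_fn(demo["question"])
    if not contains_answer(passage, demo["answer"]):
        return None  # Condition 1 failed: passage lacks the answer.
    prediction = generate_fn(demo["question"], passage)
    if not contains_answer(prediction, demo["answer"]):
        return None  # Condition 2 failed: generation lacks the answer.
    return demo


# Stubs that always return the same passage and the same answer:
def fake_retrieve(question):
    return "Beyoncé is an American singer."


def fake_generate(question, passage):
    return "Beyoncé"


good = {"question": "Who is Beyoncé?", "answer": ["Beyoncé"]}
bad = {"question": "Who is Beyoncé?", "answer": ["NO MATCH"]}

print(keep_demo(good, fake_retrieve, fake_generate) is good)
print(keep_demo(bad, fake_retrieve, fake_generate) is None)
```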

In [None]:
@dsp.transformation
def filter_demos(d):

    # Retrieve a passage for `d.question` and make sure that it
    # contains `d.answer`. Use `dsp.passage_match` for this!
    # return None if there is no match.
    ##### YOUR CODE HERE



    # Sample `k=3` demonstrations to help the model assess this
    # potential demonstration:
    ##### YOUR CODE HERE



    # Generate an answer based on `qa_template_with_passages`
    # and use `dsp.answer_match` to check that the predicted answer
    # contains `d.answer`. If it does not, return None.
    ##### YOUR CODE HERE



    # Return d, if you got this far:
    ##### YOUR CODE HERE




Here's a test; this is not an ideal unit test because we don't know which LM you will be using, but it should clarify our intentions and help you with debugging.

In [None]:
def test_filter_demos(func):
    # This example should be filtered at the retrieval step, since
    # 👽 is not in the index:
    ex1 = dsp.Example(
        question="Who is 👽?", context="C0", answer=["👽"])
    result1 = func(ex1)
    errcount = 0
    if result1 is not None:
        errcount += 1
        print(f"Error for `{func.__name__}`: Expected {None}, got {result1}")
    # This example should not be filtered given our tester LM:
    ex2 = dsp.Example(
        question="Who is Beyoncé?", context="C0", answer=["Beyoncé"])
    # This example should be filtered given our tester LM:
    ex3 = dsp.Example(
        question="Who is Beyoncé?", context="C0", answer=["NO MATCH"])
    class TestLM:
        def __init__(self, **kwargs):
            self.kwargs = kwargs
            self.history = []

        def __call__(self, prompt, **kwargs):
            answer = ["Beyoncé"]
            return answer
    dsp.settings.configure(lm=TestLM(), rm=rm)
    try:
        result2 = func(ex2)
        if result2 is None:
            errcount += 1
            print(f"Error for `{func.__name__}`: "
                  f"Expected example not to be filtered by `answer_match`.")
        result3 = func(ex3)
        if result3 is not None:
            errcount += 1
            print(f"Error for `{func.__name__}`: "
                  f"Expected example to be filtered by `answer_match`.")
    except:
        raise
    finally:
        # Restore the actual model:
        dsp.settings.configure(lm=lm, rm=rm)
    if errcount == 0:
        print(f"No errors detected for `{func.__name__}`")

In [None]:
test_filter_demos(filter_demos)

### Task 2: Full filtering program [1 point]

The task is to complete `few_shot_openqa_with_context_and_demo_filtering` as a few-shot OpenQA system like the one from Question 1, but using the filtering mechanism defined by `filter_demos`.

In [None]:
@dsp.transformation
def few_shot_openqa_with_context_and_demo_filtering(example, train=squad_train, k=3):

    # Sample 20 demonstrations:
    ##### YOUR CODE HERE



    # Filter the demonstrations using `annotate` and `filter_demos`.
    # The user's `k` should be used to specify the maximum number of
    # demonstrations kept at this stage.
    ##### YOUR CODE HERE



    # Add the list of filtered demonstrations as the `demos`
    # attribute of `example`:
    ##### YOUR CODE HERE



    # Retrieve a context passage for `example.question` and add it
    # as the `context` attribute for the example:
    ##### YOUR CODE HERE



    # Generate a prediction using `qa_template_with_passages` as
    # we did before:
    ##### YOUR CODE HERE



    # Return the generated `Completions` instance:
    ##### YOUR CODE HERE




Our previous test should suffice to help with debugging this program:

In [None]:
test_few_shot_openqa_with_context(
    few_shot_openqa_with_context_and_demo_filtering)

Quick example:

In [None]:
print(few_shot_openqa_with_context_and_demo_filtering(dev_exs[0]).answer)

In [None]:
lm.inspect_history(n=1)

Here is code for an optional initial evaluation with `tiny_dev`:

In [None]:
# filtering_results = evaluateAnswer(
#     few_shot_openqa_with_context_and_demo_filtering, tiny_dev)

# filtering_results['f1']

## Question 3: Your original system [3 points]

This question asks you to design your own few-shot OpenQA system. All of the code above can be used and modified for this, and the requirement is just that you try something new that goes beyond what we've done so far. 

Terms for the bake-off:

* You can make free use of SQuAD and other publicly available data.

* The LM must be an autoregressive language model. No trained QA components can be used. This includes general purpose LMs that have been fine-tuned for QA. (We have obviously waded into some vague territory here. The spirit of this is to make use of frozen, general-purpose models. We welcome questions about exactly how this is defined, since it could be instructive to explore this.)

Here are some ideas for the original system:

* We have so far sampled randomly from the SQuAD train set to create few-shot prompts. One might instead sample passages that have some connection to the target question. See `dsp.knn`, for example.

* There are a lot of parameters to our LMs that we have so far ignored. Exploring different values might lead to better results. The `temperature` parameter is highly impactful for our task.

* We have so far made no use of the scores from the LM or the RM.

* We have so far made no use of DSP's functionality for self-consistency. See the DSP intro notebook for examples.
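
As a rough illustration of the self-consistency idea in the last bullet: sample several completions for the same prompt (for example with `n` greater than 1 and a nonzero `temperature`) and keep the most frequent answer. The sketch below shows only the voting step in plain Python; `majority_answer` is a hypothetical helper, not DSP's implementation, and the sampled answers are made up.

```python
from collections import Counter


def majority_answer(sampled_answers):
    """Return the most frequent answer after light normalization."""
    normalized = [a.strip().lower() for a in sampled_answers]
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner


# Made-up completions for a single question:
samples = ["Paris", "paris ", "Lyon", "Paris", "Marseille"]

print(majority_answer(samples))
```

Normalizing before counting matters, since surface variants of the same answer would otherwise split the vote.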

__Original system instructions__:

In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies.

In [None]:
# PLEASE MAKE SURE TO INCLUDE THE FOLLOWING BETWEEN THE START AND STOP COMMENTS:
#   1) Textual description of your system.
#   2) The code for your original system.
# PLEASE MAKE SURE NOT TO DELETE OR EDIT THE START AND STOP COMMENTS

# START COMMENT: Enter your system description in this cell.


# STOP COMMENT: Please do not remove this comment.

## Question 4: Bakeoff entry [1 point]

For the bake-off, you simply need to be able to run your system on the file 

```data/openqa/cs224u-openqa-test-unlabeled.txt```

The following code should download it for you if necessary:

In [None]:
if not os.path.exists(os.path.join("data", "openqa", "cs224u-openqa-test-unlabeled.txt")):
    !mkdir -p data/openqa
    !wget https://web.stanford.edu/class/cs224u/data/cs224u-openqa-test-unlabeled.txt -P data/openqa/

If the above fails, you can just download https://web.stanford.edu/class/cs224u/data/cs224u-openqa-test-unlabeled.txt and place it in `data/openqa`.

This file contains only questions. The starter code below will help you structure this. It writes a file "cs224u-openqa-bakeoff-entry.json" to the current directory. That file should be uploaded as-is. Please do not change its name.

In [None]:
import json

def create_bakeoff_submission(fn):
    """
    The argument `fn` is a DSP program with the same signature as the 
    ones we wrote above: `dsp.Example` to `dsp.Completions`.
    """

    filename = os.path.join("data", "openqa", "cs224u-openqa-test-unlabeled.txt")

    # This should become a mapping from questions (str) to the
    # answer strings (str) predicted by your system.
    gens = {} 

    with open(filename) as f:
        questions = f.read().splitlines()

    questions = [dsp.Example(question=q) for q in questions]

    # `questions` is the list of `dsp.Example` instances you need to 
    # evaluate your system on. 
    #
    # Here we loop over the questions, run the system `fn`, and
    # store its `answer` value as the prediction:
    for question in tqdm.tqdm(questions):
        gens[question.question] = fn(question).answer

    # Quick tests we advise you to run: 
    # 1. Make sure `gens` is a dict with the questions as the keys:
    assert all(q.question in gens for q in questions)
    # 2. Make sure the values are strings (the predicted answers):
    assert all(isinstance(d, str) for d in gens.values())

    # And finally the output file:
    with open("cs224u-openqa-bakeoff-entry.json", "wt") as f:
        json.dump(gens, f, indent=4)

Here's what it looks like to run one of our programs, `few_shot_openqa_with_context`, on the bakeoff data:

In [None]:
# create_bakeoff_submission(few_shot_openqa_with_context)
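
Before uploading, it can be worth re-checking the written entry. The following is a minimal sketch under the assumption that the entry is a flat JSON object mapping each question string to an answer string; `find_entry_problems` is a hypothetical helper, and the small JSON string stands in for the contents of the real file.

```python
import json


def find_entry_problems(gens, questions):
    """Report questions with no entry and entries whose value is not a string."""
    missing = [q for q in questions if q not in gens]
    non_string = [q for q, a in gens.items() if not isinstance(a, str)]
    return {"missing": missing, "non_string": non_string}


# A tiny stand-in for the contents of the entry file:
entry_text = '{"Q1": "A1", "Q2": ["A2"]}'

problems = find_entry_problems(json.loads(entry_text), ["Q1", "Q2", "Q3"])
print(problems)
```

For the real file, `gens` would come from `json.load` on "cs224u-openqa-bakeoff-entry.json" and `questions` from the lines of the unlabeled test file; both lists in the result should then be empty.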