# Multi-hop QA Program 1: Direct Query

This notebook is a stand-alone version of Program 1 from the intro notebook.

### Installation

If you haven't installed **DSP** already, let's do that.

Note: If you're running this from a cloned copy of the repo, then you can skip this block.

In [None]:
try:  # On Google Colab, clone the repo so that the cache is downloaded too.
    import google.colab
    !git -C dsp/ pull || git clone https://github.com/stanfordnlp/dsp
except ImportError:  # Not on Colab: nothing to clone.
    pass

!pip install -U pip dsp-ml

### Setting Up

We'll start by setting up the language model (LM) and retrieval model (RM).

We will work with the **GPT-3.5** LM (`text-davinci-002`) and the **ColBERTv2** RM.

To use GPT-3.5, you'll need an OpenAI API key. For ColBERTv2, we've set up a server hosting a Wikipedia (Dec 2018) search index, so you don't need to set one up yourself!

To make things easy, we've set up a cache in this repository. _If you want to run this notebook without changing the code or examples, you don't need an API key. All examples are cached._
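
To illustrate why cached examples need no API key, here is a minimal sketch of prompt-level caching. It is not how dsp implements its cache; `PromptCache` and `fake_lm` are hypothetical names introduced only for this illustration, and the cache lives in memory rather than on disk.

```python
import hashlib
import json


class PromptCache:
    """Hypothetical in-memory cache keyed by the prompt and its keyword arguments."""

    def __init__(self, generate_fn):
        self._generate_fn = generate_fn
        self._store = {}
        self.misses = 0  # number of times the underlying model was actually called

    def __call__(self, prompt, **kwargs):
        # Identical (prompt, kwargs) pairs always hash to the same key.
        payload = json.dumps([prompt, kwargs], sort_keys=True)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._generate_fn(prompt, **kwargs)
        return self._store[key]


def fake_lm(prompt, **kwargs):
    # Stand-in for a real LM call (assumption: no network access is needed here).
    return "stub answer for: " + prompt


cached_lm = PromptCache(fake_lm)
first = cached_lm("Question: ...", temperature=0.0)
second = cached_lm("Question: ...", temperature=0.0)
print(cached_lm.misses)  # 1
```

The second call is served from the stored result, so an unchanged notebook would never reach the model.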

In [1]:
%load_ext autoreload
%autoreload 2

try: import google.colab; root_path = 'dsp'
# The root path is ../ if you're running this from the demo folder of the cloned repository
except: root_path = '../'

import os
os.environ["DSP_NOTEBOOK_CACHEDIR"] = os.path.join(root_path, 'cache')

# Add ../ to the path to import dsp if you're running this directly from the cloned copy of the repo (without pip installing dsp)
import sys
sys.path.insert(0, '../')

import dsp

openai_key = os.getenv('OPENAI_API_KEY')  # or replace with your API key (optional)
colbert_server = 'http://ec2-44-228-128-229.us-west-2.compute.amazonaws.com:8893/api/search'

lm = dsp.GPT3(model='text-davinci-002', api_key=openai_key)
rm = dsp.ColBERTv2(url=colbert_server)

dsp.settings.configure(lm=lm, rm=rm)

Not loading Cohere because it is not installed.


### Task Examples

Next, let's look at a few examples of the task. Each example consists of a question and one or more gold answers.

We have six training examples (`train`), which we'll feed into the programs. These will help define the task.

Notice that our examples only have input (`question`) and output (`answer`) fields. When our advanced programs build sophisticated pipelines, training "demonstrations" for other fields will be constructed automatically.

In [2]:
train = [('Who produced the album that included a re-recording of "Lithium"?', ['Butch Vig']),
         ('Who was the director of the 2009 movie featuring Peter Outerbridge as William Easton?', ['Kevin Greutert']),
         ('The heir to the Du Pont family fortune sponsored what wrestling team?', ['Foxcatcher', 'Team Foxcatcher', 'Foxcatcher Team']),
         ('In what year was the star of To Hell and Back born?', ['1925']),
         ('Which award did the first book of Gary Zukav receive?', ['U.S. National Book Award', 'National Book Award']),
         ('What city was the victim of Joseph Druces working in?', ['Boston, Massachusetts', 'Boston']),]

train = [dsp.Example(question=question, answer=answer) for question, answer in train]

The development examples (`dev`) will be used to assess the behavior of each program we build. This tiny set is not meant to be a reliable benchmark, but it is instructive for illustration.

In [3]:
dev = [('Who has a broader scope of profession: E. L. Doctorow or Julia Peterkin?', ['E. L. Doctorow', 'E.L. Doctorow', 'Doctorow']),
       ('What documentary about the Gilgo Beach Killer debuted on A&E?', ['The Killing Season']),
       ('Right Back At It Again contains lyrics co-written by the singer born in what city?', ['Gainesville, Florida', 'Gainesville']),
       ('What year was the party of the winner of the 1971 San Francisco mayoral election founded?', ['1828']),
       ('Which author is English: John Braine or Studs Terkel?', ['John Braine']),
       ('Anthony Dirrell is the brother of which super middleweight title holder?', ['Andre Dirrell']),
       ('In which city is the sports nutrition business established by Oliver Cookson based ?', ['Cheshire', 'Cheshire, UK']),
       ('Find the birth date of the actor who played roles in First Wives Club and Searching for the Elephant.', ['February 13, 1980']),
       ('Kyle Moran was born in the town on what river?', ['Castletown', 'Castletown River']),
       ("What is the name of one branch of Robert D. Braun's speciality?", ['aeronautical engineering', 'astronautical engineering', 'aeronautics', 'astronautics']),
       ("Where was the actress who played the niece in the Priest film born?", ['Surrey', 'Guildford, Surrey']),
       ('Name the movie in which the daughter of Noel Harrison plays Violet Trefusis.', ['Portrait of a Marriage']),
       ('What year was the father of the Princes in the Tower born?', ['1442'])]

dev = [dsp.Example(question=question, answer=answer) for question, answer in dev]

### Program Definition

Direct Query is the simplest program for this task. We'll prompt **GPT-3.5** to answer each question based only on its internal parametric knowledge, without retrieval.

We'll start with the `Template`, which specifies how we communicate with the LM.

Specifically, the question–answer template (`qa_template`) will include a question and a short answer for each example.

In [4]:
Question = dsp.Type(prefix="Question:", desc="${the question to be answered}")
Answer = dsp.Type(prefix="Answer:", desc="${a short factoid answer, often between 1 and 5 words}", format=dsp.format_answers)

qa_template = dsp.Template(instructions="Answer questions with short factoid answers.", question=Question(), answer=Answer())
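
To make the role of the template concrete, the following sketch assembles a prompt by hand in the same layout that `lm.inspect_history` prints later in this notebook. It is only an approximation in plain Python: `render_prompt`, `INSTRUCTIONS`, and `FIELDS` are hypothetical names, and `dsp.Template` does its own rendering internally.

```python
from typing import List, Sequence, Tuple

INSTRUCTIONS = "Answer questions with short factoid answers."
FIELDS = [
    ("Question:", "${the question to be answered}"),
    ("Answer:", "${a short factoid answer, often between 1 and 5 words}"),
]


def render_prompt(demos: Sequence[Tuple[str, str]], question: str) -> str:
    """Build instructions, a format description, the demos, and the open query."""
    format_block = "\n".join(f"{prefix} {desc}" for prefix, desc in FIELDS)
    demo_blocks: List[str] = [f"Question: {q}\nAnswer: {a}" for q, a in demos]
    # The final block leaves the answer empty for the LM to complete.
    query_block = f"Question: {question}\nAnswer:"
    parts = [
        INSTRUCTIONS,
        "---",
        "Follow the following format.\n\n" + format_block,
        "---",
        "\n\n".join(demo_blocks + [query_block]),
    ]
    return "\n\n".join(parts)


demos = [("In what year was the star of To Hell and Back born?", "1925")]
prompt = render_prompt(
    demos, "Who has a broader scope of profession: E. L. Doctorow or Julia Peterkin?"
)
print(prompt)
```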

Then, let's define the actual program, `Direct_Query_QA`. It accepts a string (`question`) and returns another string (its short `answer`).

In [5]:
def Direct_Query_QA(question: str) -> str:
    # k=7 exceeds the six training examples, so all six are used as demonstrations.
    demos = dsp.sample(train, k=7)
    example = dsp.Example(question=question, demos=demos)

    example, completions = dsp.generate(qa_template)(example, stage='qa')
    return completions.answer
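
The program asks for `k=7` demonstrations although `train` holds only six examples. The sketch below shows one way such sampling can behave, assuming it shuffles with a fixed seed and then truncates to at most `k` items. This is an assumption for illustration, not a statement of how `dsp.sample` is implemented; `sample_demos` is a hypothetical helper.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_demos(train: Sequence[T], k: int, seed: int = 0) -> List[T]:
    """Shuffle a copy of `train` deterministically and keep at most `k` items."""
    rng = random.Random(seed)
    shuffled = list(train)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled[:k]  # slicing past the end simply returns everything


questions = ["q1", "q2", "q3", "q4", "q5", "q6"]
picked = sample_demos(questions, k=7)
print(len(picked))  # 6
```

Under this assumption, every training example appears in the prompt, only in a shuffled order, and repeated calls with the same seed give the same order.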

Let's invoke the program on a sample question.

In [6]:
print(dev[0].question)
print(Direct_Query_QA(dev[0].question))

Who has a broader scope of profession: E. L. Doctorow or Julia Peterkin?
E. L. Doctorow


Let's inspect the last call to the LM to learn more about the behavior of the program.

In [7]:
lm.inspect_history(n=1)





Answer questions with short factoid answers.

---

Follow the following format.

Question: ${the question to be answered}
Answer: ${a short factoid answer, often between 1 and 5 words}

---

Question: Which award did the first book of Gary Zukav receive?
Answer: U.S. National Book Award

Question: The heir to the Du Pont family fortune sponsored what wrestling team?
Answer: Foxcatcher

Question: Who was the director of the 2009 movie featuring Peter Outerbridge as William Easton?
Answer: Kevin Greutert

Question: Who produced the album that included a re-recording of "Lithium"?
Answer: Butch Vig

Question: What city was the victim of Joseph Druces working in?
Answer: Boston, Massachusetts

Question: In what year was the star of To Hell and Back born?
Answer: 1925

Question: Who has a broader scope of profession: E. L. Doctorow or Julia Peterkin?
Answer: E. L. Doctorow





Finally, let's evaluate the program on the development set.

In [8]:
from dsp.evaluation.utils import evaluate

evaluate(Direct_Query_QA, dev)

100%|██████████| 13/13 [00:00<00:00, 121.15it/s]

Answered 3 / 13 (23.1%) correctly.





| | question | answer | prediction | correct |
|---|---|---|---|---|
| 0 | Who has a broader scope of profession: E. L. Doctorow or Julia Peterkin? | ['E. L. Doctorow', 'E.L. Doctorow', 'Doctorow'] | E. L. Doctorow | ✔️ |
| 1 | What documentary about the Gilgo Beach Killer debuted on A&E? | ['The Killing Season'] | The Long Island Serial Killer | ❌ |
| 2 | Right Back At It Again contains lyrics co-written by the singer born in what city? | ['Gainesville, Florida', 'Gainesville'] | Melbourne, Australia | ❌ |
| 3 | What year was the party of the winner of the 1971 San Francisco mayoral election founded? | ['1828'] | 1966 | ❌ |
| 4 | Which author is English: John Braine or Studs Terkel? | ['John Braine'] | John Braine | ✔️ |
| 5 | Anthony Dirrell is the brother of which super middleweight title holder? | ['Andre Dirrell'] | Andre Dirrell | ✔️ |
| 6 | In which city is the sports nutrition business established by Oliver Cookson based ? | ['Cheshire', 'Cheshire, UK'] | Manchester, England | ❌ |
| 7 | Find the birth date of the actor who played roles in First Wives Club and Searching for the Elephant. | ['February 13, 1980'] | July 30, 1953 | ❌ |
| 8 | Kyle Moran was born in the town on what river? | ['Castletown', 'Castletown River'] | Hudson River | ❌ |
| 9 | What is the name of one branch of Robert D. Braun's speciality? | ['aeronautical engineering', 'astronautical engineering', 'aeronautics', 'astronautics'] | Aerospace engineering | ❌ |


23.1
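
To see how a prediction can be scored against several gold answers, here is a minimal sketch of a normalized exact-match check applied to a few rows from the table above. The normalization (lowercasing, dropping punctuation and English articles) is an assumption in the style of common QA benchmarks, not a confirmed description of `evaluate`; `normalize` and `matches_any` are hypothetical helpers.

```python
import re
import string
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def matches_any(prediction: str, gold_answers: Sequence[str]) -> bool:
    """A prediction counts as correct if it equals any gold answer after normalization."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)


# (prediction, gold answers) pairs copied from rows 0, 1, 4, and 9 of the table.
rows = [
    ("E. L. Doctorow", ["E. L. Doctorow", "E.L. Doctorow", "Doctorow"]),
    ("The Long Island Serial Killer", ["The Killing Season"]),
    ("John Braine", ["John Braine"]),
    ("Aerospace engineering",
     ["aeronautical engineering", "astronautical engineering", "aeronautics", "astronautics"]),
]
flags = [matches_any(pred, golds) for pred, golds in rows]
print(flags)  # [True, False, True, False]

# The reported score is 3 correct answers out of 13 questions.
accuracy = round(100 * 3 / 13, 1)
print(accuracy)  # 23.1
```

Under exact matching, a reasonable answer such as "Aerospace engineering" is still counted as wrong because it is not one of the listed gold strings, which is one reason to treat scores on such a small set with caution.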