# Using GPT-4 to bootstrap few-shot CoT demonstrations for GPT-3.5

The [Scoped Negation (ScoNe) benchmark of She et al. (2023)](https://aclanthology.org/2023.acl-short.154/) seeks to stress-test models on their ability to reason about negation. In the original paper, the `text-davinci-002` and `text-davinci-003` models were more or less at chance on the hardest ScoNe categories.

This notebook starts with a very simple Chain-of-Thought-based module for ScoNe. With this simple program, `gpt-3.5-turbo` scores at chance (50% on a balanced test set) on the "one scoping negation" category, one of the two hardest in ScoNe.

We figured that bootstrapping demonstrations would help, but `turbo` struggled to create good demonstrations that included CoT steps. When we switched to using `gpt4-turbo` just to create these demonstrations (which involves under 50 calls to that model), `turbo` regularly achieved 85–90% accuracy. **This is a single compilation step using `dspy.BootstrapFewShotWithRandomSearch`.**

## Set-up

In [1]:
import glob
import os
import pandas as pd
import random

import dspy
from dspy.evaluate import Evaluate
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

In [2]:
os.environ["DSP_NOTEBOOK_CACHEDIR"] = os.path.join('.', 'cache')

In [3]:
# We'll rely on turbo for everything except bootstrapping CoT demos:

turbo = dspy.OpenAI(model='gpt-3.5-turbo-1106', max_tokens=250, model_type='chat')

dspy.settings.configure(lm=turbo)

In [4]:
# GPT-4 will be used only to bootstrap CoT demos:

gpt4T = dspy.OpenAI(model='gpt-4-1106-preview', max_tokens=350, model_type='chat')

In [5]:
# Setting this to True will redo the bootstrapping process. When
# it is False, the saved demonstrations are loaded instead, but
# turbo is still used to evaluate the zero-shot and full programs.
RUN_FROM_SCRATCH = False

## ScoNe

In [6]:
!git clone https://github.com/selenashe/ScoNe.git

Cloning into 'ScoNe'...
remote: Enumerating objects: 77, done.
remote: Counting objects: 100% (77/77), done.
remote: Compressing objects: 100% (55/55), done.
remote: Total 77 (delta 42), reused 42 (delta 20), pack-reused 0
Receiving objects: 100% (77/77), 116.25 KiB | 1.21 MiB/s, done.
Resolving deltas: 100% (42/42), done.


### Data loader

In [7]:
def load_scone(dirname):
    dfs = []
    for filename in glob.glob(dirname + "/*.csv"):
        df = pd.read_csv(filename, index_col=0)
        df['category'] = os.path.basename(filename).replace(".csv", "")
        dfs.append(df)
    data_df = pd.concat(dfs)

    def as_example(row):
        # The 'one_scoped' file is from an earlier dataset, MoNLI, and
        # so is formatted a bit differently:
        suffix = '' if row['category'] == 'one_scoped' else '_edited'
        # Reformat the hypothesis to be an embedded clause in a question:
        hkey = 'sentence2' + suffix
        question = row[hkey][0].lower() + row[hkey][1: ].strip(".")
        question = f"Can we logically conclude for sure that {question}?"
        # Binary task formulation:
        label = "Yes" if row['gold_label' + suffix] == 'entailment' else "No"
        return dspy.Example({
            "context": row['sentence1' + suffix],
            "question": question,
            "answer": label,
            "category": row['category']
        }).with_inputs("context", "question")

    return list(data_df.apply(as_example, axis=1).values)
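
To make the reformatting inside `as_example` concrete, here is a minimal standalone sketch of the two string transformations it applies. The helper names `hypothesis_to_question` and `label_to_answer` and the sample sentence are hypothetical and used only for illustration. The label `"neutral"` stands in for any non-entailment label.

```python
def hypothesis_to_question(hypothesis: str) -> str:
    """Embed an NLI hypothesis in a yes/no question, mirroring `as_example`."""
    # Lowercase the first character and strip periods from the rest.
    clause = hypothesis[0].lower() + hypothesis[1:].strip(".")
    return f"Can we logically conclude for sure that {clause}?"


def label_to_answer(gold_label: str) -> str:
    """Collapse the NLI label into the binary Yes/No formulation."""
    return "Yes" if gold_label == "entailment" else "No"


# Hypothetical hypothesis sentence, not taken from the ScoNe files.
question = hypothesis_to_question("The man is not walking.")
print(question)
print(label_to_answer("entailment"), label_to_answer("neutral"))
```

Anything other than `entailment` maps to "No", which is what turns the NLI task into a binary question-answering task.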

### Train and dev samples

In [8]:
all_train = load_scone("ScoNe/scone_nli/train")

random.seed(1)
random.shuffle(all_train)

# 200 random train, 50 random dev:
train, dev = all_train[: 200], all_train[200: 250]

len(train), len(dev)

(200, 50)

### Test

In [9]:
random.seed(1)

test = load_scone(dirname="ScoNe/scone_nli/test")

# We're developing a system for the full ScoNe benchmark, but we'll
# evaluate only on one of the hardest and most informative ScoNe
# categories for now -- examples with a single negation that plays
# a crucial role in the reasoning:
test = [ex for ex in test if ex.category == "one_scoped"]

In [10]:
pd.Series([ex.answer for ex in test]).value_counts()

No     100
Yes    100
dtype: int64

## Evaluation tools

In [11]:
scone_accuracy = dspy.evaluate.metrics.answer_exact_match
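
For intuition about what the metric checks, here is a simplified stand-in that follows the usual DSPy metric shape of `(example, pred, trace=None)`. It is a sketch, not the library's implementation: `simple_exact_match` is a hypothetical name, and the real `answer_exact_match` applies its own text normalization.

```python
from types import SimpleNamespace


def simple_exact_match(example, pred, trace=None) -> bool:
    """Return True when the predicted answer equals the gold answer.

    Simplified assumption: compare case-insensitively after trimming
    surrounding whitespace and a trailing period.
    """
    def norm(text: str) -> str:
        return text.strip().rstrip(".").strip().lower()

    return norm(example.answer) == norm(pred.answer)


# SimpleNamespace objects stand in for a dspy.Example and a prediction.
gold = SimpleNamespace(answer="No")
match = simple_exact_match(gold, SimpleNamespace(answer="no."))
mismatch = simple_exact_match(gold, SimpleNamespace(answer="Yes"))
print(match, mismatch)
```

Because the task is binary and the test set is balanced, a metric of this kind yields 50% for a model that guesses.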

In [12]:
evaluator = Evaluate(devset=test, num_threads=1, display_progress=True, display_table=0)

## Zero-shot CoT

In [13]:
class ScoNeSignature(dspy.Signature):
    ("""You are given some context (a premise) and a question (a hypothesis). """
    """You must indicate with Yes/No answer whether we can logically """
    """conclude the hypothesis from the premise.""")

    context = dspy.InputField()
    question = dspy.InputField()
    answer = dspy.OutputField(desc="Yes or No")

In [14]:
class ScoNeCoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_answer = dspy.ChainOfThought(ScoNeSignature)

    def forward(self, context, question):
        return self.generate_answer(context=context, question=question)

In [15]:
cot_zeroshot = ScoNeCoT()

In [16]:
evaluator(cot_zeroshot, metric=scone_accuracy)

Average Metric: 100 / 200  (50.0): 100%|█████████████████████████| 200/200 [00:00<00:00, 733.75it/s]

Average Metric: 100 / 200  (50.0%)





50.0

## Optimized few-shot with bootstrapped demonstrations

In [17]:
bootstrap_optimizer = BootstrapFewShotWithRandomSearch(
    max_bootstrapped_demos=8,
    max_labeled_demos=8,
    num_candidate_programs=10,
    num_threads=8,
    metric=scone_accuracy,
    teacher_settings=dict(lm=gpt4T))

Going to sample between 1 and 8 traces per predictor.
Will attempt to train 10 candidate sets.
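
The random-search part of this optimizer can be pictured with a toy loop: draw several candidate demonstration sets of varying size, score each on a validation set, and keep the best. The sketch below is purely conceptual and does not use DSPy. `random_search`, `toy_score`, and the toy data are hypothetical, and the real optimizer additionally bootstraps reasoning traces with the teacher model and keeps only those that pass the metric.

```python
import random


def random_search(trainset, devset, score_fn, num_candidates=10, max_demos=8, seed=0):
    """Try several random demo subsets; keep the one scoring best on devset."""
    rng = random.Random(seed)
    # Start from the zero-shot baseline (no demonstrations).
    best_demos, best_score = [], score_fn([], devset)
    for _ in range(num_candidates):
        size = rng.randint(1, max_demos)   # between 1 and max_demos demos
        demos = rng.sample(trainset, size)
        score = score_fn(demos, devset)
        if score > best_score:
            best_demos, best_score = demos, score
    return best_demos, best_score


def toy_score(demos, devset):
    """Toy metric: fraction of dev labels that appear among the demo labels."""
    seen = {label for _, label in demos}
    return sum(label in seen for _, label in devset) / len(devset)


# Hypothetical toy data: (question, label) pairs with alternating labels.
toy_train = [(f"q{i}", "Yes" if i % 2 else "No") for i in range(20)]
toy_dev = [(f"d{i}", "Yes" if i % 2 else "No") for i in range(10)]

baseline = toy_score([], toy_dev)
best_demos, best_score = random_search(toy_train, toy_dev, toy_score)
print(baseline, best_score, len(best_demos))
```

Selecting on a held-out validation set, as `valset=dev` does in the compile call, keeps the choice of demonstrations separate from the final test evaluation.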


In [18]:
if RUN_FROM_SCRATCH:
    cot_fewshot = bootstrap_optimizer.compile(cot_zeroshot, trainset=train, valset=dev)
else:
    cot_fewshot = ScoNeCoT()
    cot_fewshot.load("scone-cot_fewshot-turbo-gpt4-demos.json")

Average Metric: 24 / 50  (48.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1096.32it/s]


Average Metric: 24 / 50  (48.0%)
Score: 48.0 for set: [0]
New best score: 48.0 for seed -3
Scores so far: [48.0]
Best score: 48.0


Average Metric: 25 / 50  (50.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1034.71it/s]


Average Metric: 25 / 50  (50.0%)
Score: 50.0 for set: [8]
New best score: 50.0 for seed -2
Scores so far: [48.0, 50.0]
Best score: 50.0


  6%|███▎                                                         | 11/200 [00:00<00:00, 899.26it/s]


Bootstrapped 8 full traces after 12 examples in round 0.


Average Metric: 27 / 50  (54.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1225.04it/s]


Average Metric: 27 / 50  (54.0%)
Score: 54.0 for set: [8]
New best score: 54.0 for seed -1
Scores so far: [48.0, 50.0, 54.0]
Best score: 54.0
Average of max per entry across top 1 scores: 0.54
Average of max per entry across top 2 scores: 0.7
Average of max per entry across top 3 scores: 0.76
Average of max per entry across top 5 scores: 0.76
Average of max per entry across top 8 scores: 0.76
Average of max per entry across top 9999 scores: 0.76


  4%|██▊                                                           | 9/200 [00:00<00:00, 815.06it/s]


Bootstrapped 7 full traces after 10 examples in round 0.


Average Metric: 37 / 50  (74.0): 100%|█████████████████████████████| 50/50 [00:00<00:00, 884.47it/s]


Average Metric: 37 / 50  (74.0%)
Score: 74.0 for set: [8]
New best score: 74.0 for seed 0
Scores so far: [48.0, 50.0, 54.0, 74.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.78
Average of max per entry across top 3 scores: 0.86
Average of max per entry across top 5 scores: 0.92
Average of max per entry across top 8 scores: 0.92
Average of max per entry across top 9999 scores: 0.92


  2%|█▏                                                            | 4/200 [00:00<00:00, 309.09it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


Average Metric: 28 / 50  (56.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1111.93it/s]


Average Metric: 28 / 50  (56.0%)
Score: 56.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.8
Average of max per entry across top 3 scores: 0.82
Average of max per entry across top 5 scores: 0.92
Average of max per entry across top 8 scores: 0.92
Average of max per entry across top 9999 scores: 0.92


  0%|▎                                                             | 1/200 [00:00<00:00, 712.23it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 31 / 50  (62.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1043.32it/s]


Average Metric: 31 / 50  (62.0%)
Score: 62.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.86
Average of max per entry across top 3 scores: 0.9
Average of max per entry across top 5 scores: 0.94
Average of max per entry across top 8 scores: 0.94
Average of max per entry across top 9999 scores: 0.94


  2%|█▏                                                            | 4/200 [00:00<00:00, 837.65it/s]


Bootstrapped 4 full traces after 5 examples in round 0.


Average Metric: 23 / 50  (46.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1104.00it/s]


Average Metric: 23 / 50  (46.0%)
Score: 46.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.86
Average of max per entry across top 3 scores: 0.9
Average of max per entry across top 5 scores: 0.94
Average of max per entry across top 8 scores: 0.96
Average of max per entry across top 9999 scores: 0.96


  2%|█▏                                                            | 4/200 [00:00<00:00, 802.55it/s]


Bootstrapped 4 full traces after 5 examples in round 0.


Average Metric: 34 / 50  (68.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1116.66it/s]


Average Metric: 34 / 50  (68.0%)
Score: 68.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.92
Average of max per entry across top 3 scores: 0.98
Average of max per entry across top 5 scores: 0.98
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


  2%|█▌                                                            | 5/200 [00:00<00:00, 855.28it/s]


Bootstrapped 5 full traces after 6 examples in round 0.


Average Metric: 30 / 50  (60.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1148.03it/s]


Average Metric: 30 / 50  (60.0%)
Score: 60.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0, 60.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.92
Average of max per entry across top 3 scores: 0.98
Average of max per entry across top 5 scores: 0.98
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


  1%|▌                                                             | 2/200 [00:00<00:00, 723.34it/s]


Bootstrapped 2 full traces after 3 examples in round 0.


Average Metric: 27 / 50  (54.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1109.09it/s]


Average Metric: 27 / 50  (54.0%)
Score: 54.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0, 60.0, 54.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.92
Average of max per entry across top 3 scores: 0.98
Average of max per entry across top 5 scores: 0.98
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


  3%|█▊                                                            | 6/200 [00:00<00:00, 828.15it/s]


Bootstrapped 6 full traces after 7 examples in round 0.


Average Metric: 28 / 50  (56.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1036.51it/s]


Average Metric: 28 / 50  (56.0%)
Score: 56.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0, 60.0, 54.0, 56.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.92
Average of max per entry across top 3 scores: 0.98
Average of max per entry across top 5 scores: 0.98
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


  2%|█▌                                                            | 5/200 [00:00<00:00, 790.78it/s]


Bootstrapped 4 full traces after 6 examples in round 0.


Average Metric: 25 / 50  (50.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1128.36it/s]


Average Metric: 25 / 50  (50.0%)
Score: 50.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0, 60.0, 54.0, 56.0, 50.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.92
Average of max per entry across top 3 scores: 0.98
Average of max per entry across top 5 scores: 0.98
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


  4%|██▍                                                           | 8/200 [00:00<00:00, 845.75it/s]


Bootstrapped 8 full traces after 9 examples in round 0.


Average Metric: 31 / 50  (62.0): 100%|█████████████████████████████| 50/50 [00:00<00:00, 921.83it/s]

Average Metric: 31 / 50  (62.0%)
Score: 62.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0, 60.0, 54.0, 56.0, 50.0, 62.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.92
Average of max per entry across top 3 scores: 0.98
Average of max per entry across top 5 scores: 0.98
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0
13 candidate programs found.
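
The "Average of max per entry across top k scores" lines in the log appear to report, for the k best candidate programs, how many validation examples at least one of them gets right. The sketch below shows that computation under this reading. `avg_of_max_per_entry` and the 0/1 results are hypothetical and not taken from the run above.

```python
def avg_of_max_per_entry(per_example_scores, program_scores, k):
    """Average, over examples, of the best result among the top-k programs."""
    # Rank programs by overall score, best first, and keep the top k.
    ranked = sorted(
        zip(program_scores, per_example_scores),
        key=lambda pair: pair[0],
        reverse=True,
    )
    top = [subscores for _, subscores in ranked[:k]]
    # For each example, take the best result any of those programs achieved.
    best_per_example = [max(column) for column in zip(*top)]
    return sum(best_per_example) / len(best_per_example)


# Hypothetical 0/1 results for three programs on four validation examples.
per_example = [
    [1, 0, 0, 1],  # program A: 2 of 4 correct
    [1, 1, 1, 0],  # program B: 3 of 4 correct
    [0, 0, 1, 0],  # program C: 1 of 4 correct
]
totals = [sum(row) for row in per_example]
results = {k: avg_of_max_per_entry(per_example, totals, k) for k in (1, 2, 3)}
print(results)
```

Under this reading, a value of 1.0 for the top 8 means every validation example is solved by at least one of the eight best candidates, even though the best single candidate scores 74%.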





In [19]:
evaluator(cot_fewshot, metric=scone_accuracy)

Average Metric: 171 / 200  (85.5): 100%|█████████████████████████| 200/200 [00:00<00:00, 557.50it/s]

Average Metric: 171 / 200  (85.5%)





85.5

In [20]:
cot_fewshot.save("scone-cot_fewshot-turbo-gpt4-demos.json")

## Example prompt with prediction

In [21]:
turbo.inspect_history(n=1)





You are given some context (a premise) and a question (a hypothesis). You must indicate with Yes/No answer whether we can logically conclude the hypothesis from the premise.

---

Follow the following format.

Context: ${context}

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: Yes or No

---

Context: It is not true that there is not a single person walking in the city.

Question: Can we logically conclude for sure that it is not true that there is not a single celebrity walking in the city?

Reasoning: Let's think step by step in order to produce the answer. We know that the double negative in the context implies that there is at least one person walking in the city. However, the context does not provide any information about the status or occupation of the person walking in the city. Therefore, we cannot logically conclude that the person walking in the city is a celebrity.

Answer: No

---

Context: the boy, not girl