# ImpPres LLM Baseline

You have to implement in this notebook a baseline for ImpPres classification using an LLM.
This baseline must be implemented using DSPy.



In [None]:
# Configure the DSPy environment with the language model - for grok the parameters must be:
# env variable should be in os.environ['XAI_API_KEY']
# "xai/grok-3-mini"
import os
import dspy

os.environ["XAI_API_KEY"] = ""

lm = dspy.LM('xai/grok-3-mini', api_key=os.environ['XAI_API_KEY'])
dspy.configure(lm=lm)

In [10]:
from typing import Literal

class ClassifyNLI(dspy.Signature):
    premise: str = dspy.InputField()
    hypothesis: str = dspy.InputField()
    label: Literal['entailment', 'contradiction', 'neutral'] = dspy.OutputField()

pred_model = dspy.Predict(ClassifyNLI)

class CoTNLI(dspy.Signature):
    premise: str = dspy.InputField()
    hypothesis: str = dspy.InputField()
    label: Literal['entailment', 'contradiction', 'neutral'] = dspy.OutputField()

CoT_model = dspy.ChainOfThought(CoTNLI)

## Load ImpPres dataset

In [11]:
from datasets import load_dataset

sections = ['presupposition_all_n_presupposition', 
            'presupposition_both_presupposition', 
            'presupposition_change_of_state', 
            'presupposition_cleft_existence', 
            'presupposition_cleft_uniqueness', 
            'presupposition_only_presupposition', 
            'presupposition_possessed_definites_existence', 
            'presupposition_possessed_definites_uniqueness', 
            'presupposition_question_presupposition']

dataset = {}
for section in sections:
    print(f"Loading dataset for section: {section}")
    dataset[section] = load_dataset("facebook/imppres", section)

Loading dataset for section: presupposition_all_n_presupposition
Loading dataset for section: presupposition_both_presupposition
Loading dataset for section: presupposition_change_of_state
Loading dataset for section: presupposition_cleft_existence
Loading dataset for section: presupposition_cleft_uniqueness
Loading dataset for section: presupposition_only_presupposition
Loading dataset for section: presupposition_possessed_definites_existence
Loading dataset for section: presupposition_possessed_definites_uniqueness
Loading dataset for section: presupposition_question_presupposition


In [12]:
dataset

{'presupposition_all_n_presupposition': DatasetDict({
     all_n_presupposition: Dataset({
         features: ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID'],
         num_rows: 1900
     })
 }),
 'presupposition_both_presupposition': DatasetDict({
     both_presupposition: Dataset({
         features: ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID'],
         num_rows: 1900
     })
 }),
 'presupposition_change_of_state': DatasetDict({
     change_of_state: Dataset({
         features: ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UID', 'pairID', 'paradigmID'],
         num_rows: 1900
     })
 }),
 'presupposition_cleft_existence': DatasetDict({
     cleft_existence: Dataset({
         features: ['premise', 'hypothesis', 'trigger', 'trigger1', 'trigger2', 'presupposition', 'gold_label', 'UI

In [13]:
from dspy import Example

examples = [
    Example(
        premise=row["premise"],
        hypothesis=row["hypothesis"],
        trigger=row["trigger"],
        label=row["gold_label"]
    ).with_inputs("premise", "hypothesis")
    for row in dataset['presupposition_all_n_presupposition']['all_n_presupposition']
]

trainset = examples[:30]

In [14]:
from dspy.teleprompt import BootstrapFewShot, COPRO, KNNFewShot
from sentence_transformers import SentenceTransformer

class EmbedderWrapper:
    def __init__(self, model_name):
        self.model = SentenceTransformer(model_name)
    def __call__(self, texts):
        return self.model.encode(texts, convert_to_numpy=True)
embedder = EmbedderWrapper('all-MiniLM-L6-v2')

label_names = ['entailment', 'contradiction', 'neutral']
label2id = {label: i for i, label in enumerate(label_names)}

def exact_match_metric(example, pred, trace=None):
    return label2id[pred.label.strip().lower()] == example['label']

bootstrap_optimizer = BootstrapFewShot(metric=exact_match_metric)
optimized_bootstrap = bootstrap_optimizer.compile(student=pred_model, trainset=trainset)

KNN_optimizer = KNNFewShot(k=5, trainset=trainset, vectorizer=embedder)
optimized_knn = KNN_optimizer.compile(pred_model)

corpo = COPRO(metric=exact_match_metric, max_trials=5)
optimized_cot = corpo.compile(CoT_model, trainset=trainset, eval_kwargs={})

 43%|████▎     | 13/30 [01:32<02:00,  7.11s/it]


Bootstrapped 4 full traces after 13 examples for up to 1 rounds, amounting to 13 attempts.


2025/07/06 18:22:53 INFO dspy.teleprompt.copro_optimizer: Iteration Depth: 1/3.
2025/07/06 18:22:53 INFO dspy.teleprompt.copro_optimizer: At Depth 1/3, Evaluating Prompt Candidate #1/10 for Predictor 1 of 1.
2025/07/06 18:23:17 INFO dspy.evaluate.evaluate: Average Metric: 8 / 30 (26.7%)
2025/07/06 18:23:17 INFO dspy.teleprompt.copro_optimizer: At Depth 1/3, Evaluating Prompt Candidate #2/10 for Predictor 1 of 1.
2025/07/06 18:23:36 INFO dspy.evaluate.evaluate: Average Metric: 8 / 30 (26.7%)
2025/07/06 18:23:36 INFO dspy.teleprompt.copro_optimizer: At Depth 1/3, Evaluating Prompt Candidate #3/10 for Predictor 1 of 1.
2025/07/06 18:23:58 INFO dspy.evaluate.evaluate: Average Metric: 7 / 30 (23.3%)
2025/07/06 18:23:58 INFO dspy.teleprompt.copro_optimizer: At Depth 1/3, Evaluating Prompt Candidate #4/10 for Predictor 1 of 1.
2025/07/06 18:24:21 INFO dspy.evaluate.evaluate: Average Metric: 9 / 30 (30.0%)
2025/07/06 18:24:21 INFO dspy.teleprompt.copro_optimizer: At Depth 1/3, Evaluating Promp

## Evaluate Metrics

Let's use the huggingface `evaluate` package to compute the performance of the baseline.


In [15]:
import evaluate
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
precision = evaluate.load("precision")
recall = evaluate.load("recall")

In [16]:
def compute_metrics(results):
    label2id = {"entailment": 0, "neutral": 1, "contradiction": 2}
    pred_labels = [label2id[r['pred_label']] for r in results]
    gold_labels = [label2id[r['gold_label']] for r in results]

    return {
        "accuracy": accuracy.compute(predictions=pred_labels, references=gold_labels)["accuracy"],
        "precision": precision.compute(predictions=pred_labels, references=gold_labels, average="macro")["precision"],
        "recall": recall.compute(predictions=pred_labels, references=gold_labels, average="macro")["recall"],
        "f1": f1.compute(predictions=pred_labels, references=gold_labels, average="macro")["f1"],
    }

## Your Turn

Compute the classification metrics on the baseline LLM model on each test section of the ANLI dataset for samples that have a non-empty 'reason' field.

You also must show a comparison between the DeBERTa baseline model and this LLM baseline model. The comparison metric should compute the agreement between the two models:
* On how many samples they are both correct [Correct]
* On how many samples Model1 is correct and Model2 is incorrect [Correct1]
* On how many samples Model1 is incorrect and Model2 is correct [Correct2]
* On how many samples both are incorrect [Incorrect]

In [17]:
from tqdm import tqdm
label_names = ["entailment", "neutral", "contradiction"]
def evaluate_dspy_on_dataset(dataset, model_module):
    results = []
    for example in tqdm(dataset):
        pred = model_module(premise=example["premise"], hypothesis=example["hypothesis"])
        results.append({
            'pred_label': pred.label.lower().strip(),
            'gold_label': label_names[example["gold_label"]],
        })
    return results

In [18]:
import pandas as pd

label2id = {label: i for i, label in enumerate(label_names)}

def run_model(model):
    results_dict = {}
    all_results = []
    section = "presupposition_all_n_presupposition"

    # Get the single split (e.g., 'all_n_presupposition')
    split_name = list(dataset[section].keys())[0]  
    ds = dataset[section][split_name]

    # Evaluate only on a small subset
    subset = ds.select(range(200))  

    results = evaluate_dspy_on_dataset(subset, model)
    all_results.extend(results)
    metrics = compute_metrics(results)
    results_dict[section] = metrics

    print("\nDSPy Evaluation Result:")
    df = pd.DataFrame.from_dict(results_dict, orient="index")
    df = df[["accuracy", "precision", "recall", "f1"]]
    df = df.round(4)
    print(df)

print("\nEvaluating bootstrap model:")
run_model(optimized_bootstrap)
print("\nEvaluating KNN model:")
run_model(optimized_knn)
print("\nEvaluating CoT model:")
run_model(optimized_cot)


Evaluating bootstrap model:


100%|██████████| 200/200 [14:46<00:00,  4.43s/it]



DSPy Evaluation Result:
                                     accuracy  precision  recall      f1
presupposition_all_n_presupposition     0.955     0.9674  0.9471  0.9551

Evaluating KNN model:


 80%|████████  | 4/5 [00:22<00:05,  5.69s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:16<00:04,  4.20s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:16<00:04,  4.00s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:13<00:03,  3.38s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:09<00:02,  2.30s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:09<00:02,  2.26s/it]t]  


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:20<00:05,  5.19s/it]t]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:13<00:03,  3.30s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:15<00:03,  3.97s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:20<00:05,  5.10s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:11<00:02,  2.88s/it]s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:12<00:03,  3.00s/it]s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:10<00:02,  2.68s/it]it]  


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:11<00:02,  2.90s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:12<00:03,  3.20s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:13<00:03,  3.44s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:18<00:04,  4.54s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:17<00:04,  4.31s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:27<00:06,  6.78s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:27<00:06,  6.98s/it]s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:15<00:03,  3.87s/it]s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:18<00:04,  4.54s/it]s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:29<00:07,  7.29s/it]s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:15<00:03,  3.78s/it]s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:15<00:03,  3.98s/it]s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:23<00:05,  5.75s/it]s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:18<00:04,  4.70s/it]s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:19<00:04,  5.00s/it]s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:24<00:06,  6.08s/it]s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:19<00:04,  4.83s/it]s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:16<00:04,  4.02s/it]s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:17<00:04,  4.45s/it]s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:09<00:02,  2.27s/it]s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 583.94it/s]t]  


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:22<00:05,  5.73s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:37<00:09,  9.43s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 620.14it/s]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:10<00:02,  2.61s/it]it]  


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:17<00:04,  4.27s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:18<00:04,  4.53s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:20<00:05,  5.01s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:18<00:04,  4.71s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:17<00:04,  4.31s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:15<00:03,  3.93s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 498.33it/s]t]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:18<00:04,  4.58s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:21<00:05,  5.39s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:18<00:04,  4.50s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:14<00:03,  3.52s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:18<00:04,  4.54s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 458.72it/s]t]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:17<00:04,  4.29s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 900.89it/s]t]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:11<00:02,  2.97s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 860.63it/s]t]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 982.50it/s]t]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 951.95it/s]t]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:17<00:04,  4.34s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:19<00:04,  4.80s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:16<00:04,  4.16s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:16<00:04,  4.07s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:16<00:04,  4.09s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:17<00:04,  4.46s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:21<00:05,  5.35s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:16<00:04,  4.20s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:26<00:06,  6.60s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:15<00:03,  3.81s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:19<00:04,  4.77s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:24<00:06,  6.18s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:19<00:04,  4.79s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:21<00:05,  5.43s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:30<00:07,  7.71s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:18<00:04,  4.68s/it]s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:22<00:05,  5.53s/it]it]  


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:23<00:05,  5.79s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:18<00:04,  4.65s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:14<00:03,  3.58s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:25<00:06,  6.27s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 804.05it/s]t]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:14<00:03,  3.68s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:16<00:04,  4.06s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 169.59it/s]t]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:12<00:03,  3.10s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 855.81it/s]t]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 526.46it/s]t]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:18<00:04,  4.74s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:24<00:06,  6.18s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:16<00:04,  4.23s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:14<00:03,  3.71s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:16<00:04,  4.23s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:30<00:07,  7.73s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 609.02it/s]t]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 400.36it/s]t]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 187.72it/s]t]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:13<00:03,  3.41s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:17<00:04,  4.27s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:17<00:04,  4.47s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:07<00:01,  1.79s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:25<00:06,  6.45s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:12<00:03,  3.09s/it]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:17<00:04,  4.30s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:13<00:03,  3.30s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:17<00:04,  4.26s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:19<00:04,  4.88s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:14<00:03,  3.53s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:15<00:03,  3.76s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:16<00:04,  4.09s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:06<00:01,  1.63s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:18<00:04,  4.70s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:16<00:04,  4.11s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:17<00:04,  4.29s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:17<00:04,  4.32s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:20<00:05,  5.07s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:25<00:06,  6.37s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:17<00:04,  4.46s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:20<00:05,  5.04s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 935.92it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:17<00:04,  4.26s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:20<00:05,  5.00s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 678.64it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:16<00:04,  4.08s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:16<00:04,  4.03s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:18<00:04,  4.54s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:19<00:04,  4.89s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:24<00:06,  6.05s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:19<00:04,  4.81s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 848.36it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:16<00:04,  4.11s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:06<00:01,  1.73s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 309.11it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:30<00:07,  7.68s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:16<00:04,  4.24s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:20<00:05,  5.00s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:15<00:03,  3.81s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 506.01it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:18<00:04,  4.58s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:08<00:02,  2.23s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 378.36it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:17<00:04,  4.30s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:13<00:03,  3.41s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:20<00:05,  5.19s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:16<00:04,  4.03s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 503.40it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 742.52it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:16<00:04,  4.08s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:08<00:02,  2.13s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 449.69it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:08<00:02,  2.07s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 596.93it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:18<00:04,  4.67s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 143.09it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:14<00:03,  3.56s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:08<00:02,  2.18s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 324.19it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:07<00:01,  1.84s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:06<00:01,  1.54s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 158.62it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 152.03it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:22<00:05,  5.54s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:03<00:00,  1.03it/s]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:14<00:03,  3.53s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 578.44it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:15<00:03,  3.79s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:07<00:01,  1.98s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 489.42it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 622.30it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 298.84it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:08<00:02,  2.12s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 249.74it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:13<00:03,  3.35s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:15<00:03,  3.86s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 555.56it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 482.44it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:22<00:05,  5.56s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 353.45it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 608.38it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:52<00:13, 13.09s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 328.55it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:20<00:05,  5.18s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:20<00:05,  5.12s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 368.36it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 437.83it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:19<00:04,  4.98s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 249.19it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 642.14it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 466.14it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 564.32it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 578.29it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:16<00:04,  4.12s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 685.88it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:15<00:03,  3.81s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:17<00:04,  4.33s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:16<00:04,  4.07s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:18<00:04,  4.53s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:15<00:03,  3.88s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:16<00:04,  4.13s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:21<00:05,  5.27s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:07<00:01,  1.89s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:00<00:00, 798.04it/s]it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


 80%|████████  | 4/5 [00:19<00:04,  4.99s/it]/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


100%|██████████| 200/200 [1:00:18<00:00, 18.09s/it]



DSPy Evaluation Result:
                                     accuracy  precision  recall      f1
presupposition_all_n_presupposition      0.95     0.9642  0.9418  0.9505

Evaluating CoT model:


100%|██████████| 200/200 [16:05<00:00,  4.83s/it]


DSPy Evaluation Result:
                                     accuracy  precision  recall      f1
presupposition_all_n_presupposition     0.915     0.9433  0.9012  0.9154



