# Explingo Experiment Runner

This notebook:
1. Loads the gold-standard dataset, prepares the metrics functions, and verifies that the metric functions give the maximum score on the gold-standard dataset, and lower scores on less aligned datasets
2. Runs the prompt-design, few-shot, and bootstrap-few-shot experiments on a testing dataset

## Setup
Import necessary libraries and prepare the LLM

**Note: To run these cells, you need a `keys.yaml` file in the top-level Explingo directory with the following line:**
```yaml
openai_api_key: <your_openai_api_key>
```

In [1]:
import pandas as pd

from experiment_runner import ExplingoExperimentRunner
import os
import yaml
import dspy
import metrics
import random

In [2]:
with open(os.path.join("..", "keys.yaml"), "r") as file:
    config = yaml.safe_load(file)
    openai_api_key = config["openai_api_key"]

llm = dspy.OpenAI(model='gpt-4o', api_key=openai_api_key, max_tokens=1000)

Now, we create the main experiment runner object. This object takes in a dataset, and then
1. Splits the dataset into a training dataset and a testing dataset (see notes below)
2. Sets up the evaluation metrics (see notes below). The fluency metric is set up to use sample from the dataset as reference
3. Runs the experiments on the testing dataset

Some examples in the testing datasets include gold-standard narratives; others include only a sample explanation.
- The former makes up the gold-standard dataset used for tuning the evaluation metrics and providing few-shot examples.
- The latter makes up the testing dataset used for evaluation and for bootstrapping few-shot examples

We use the following metrics, all scored on a scale from 0-4:
- Accuracy: the narrative accurately describes the information in the explanation
- Fluency: the narrative is coherent and natural, as compared to the gold-standard explanations. We pass in a small list of sample narratives from the gold-standard dataset to compare against
- Conciseness: the narrative is not too long, as compared to the gold-standard explanations. For now, any narrative that is no longer than the longest gold-standard narrative will score 4
- Completeness: the narrative includes all relevant information from the original explanation 

**Note: You can set `verbose=1` to see the narratives generated, or `verbose=2` to see the explanations, narratives, and rationalizations**

In [6]:
# iterate all datasets in the eval_data folder
runners = {}
total_eval = 0
for dataset in os.listdir(os.path.join("eval_data")):
    runners[dataset] = ExplingoExperimentRunner(llm=llm, openai_api_key=openai_api_key, dataset_filepath = os.path.join("eval_data", dataset), verbose=0)
    total_eval += len(runners[dataset].eval_data)
    
print("Total eval examples:", total_eval)
results = []

eval_data\housing_1.json
Total number of examples: 35
Labeled training examples: 5
Labeled evaluation examples: 15
Unlabeled training examples: 5
Unlabeled evaluation examples: 10
---
eval_data\housing_2.json
Total number of examples: 22
Labeled training examples: 5
Labeled evaluation examples: 7
Unlabeled training examples: 5
Unlabeled evaluation examples: 5
---
eval_data\housing_3.json
Total number of examples: 22
Labeled training examples: 5
Labeled evaluation examples: 8
Unlabeled training examples: 5
Unlabeled evaluation examples: 4
---
eval_data\mushroom_1.json
Total number of examples: 30
Labeled training examples: 5
Labeled evaluation examples: 6
Unlabeled training examples: 5
Unlabeled evaluation examples: 14
---
eval_data\mushroom_2.json
Total number of examples: 30
Labeled training examples: 5
Labeled evaluation examples: 6
Unlabeled training examples: 5
Unlabeled evaluation examples: 14
---
eval_data\pdf_1.json
Total number of examples: 30
Labeled training examples: 5
Label

## Basic prompt design experiment

We begin with basic prompts. With 4 metrics (without completeness), each with a score of 0-2, the maximum score is 8. 

We generate narratives/rationalizations on `max_iters=5` sample explanations, and return the average total score.

In [4]:
# Utilities for cleaner results

def pretty_print(result):
    s = f"Total score: {result[0]}"
    s2 = ", ".join([f"{k}: {v}" for k, v in result[1].items()])
    print(f"{s} ({s2})")
    
def update_results(method, dataset, scores, kwargs):
    result = {"dataset": dataset, "prompt": prompt, "total score": scores[0]}
    result.update(scores[1])
    result.update(kwargs)
    results.append(result)

In [5]:
prompts = ["You are helping users understand an ML model's prediction. Given an explanation and information about the model, "
           "convert the explanation into a human-readable narrative.",
           "You are helping users who do not have experience working with ML understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative. Make your answers sound as natural as possible.",
           "You are helping users understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative. Be sure to explicitly mention all values from the explanation in your response.",
]

for dataset in runners:
    runner = runners[dataset]
    print(f"Dataset: {dataset}")
    for prompt in prompts:
        print(f"Prompt: {prompt}")
        scores = runner.run_basic_prompting_experiment(prompt=prompt)
        update_results("basic_prompting", dataset, scores, {"prompt": prompt})
        pretty_print(scores)
        print("--")
    print("=====")

Dataset: housing_1.json
Prompt: You are helping users understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative.


KeyboardInterrupt: 

## Few-shot experiment

Next, we repeat the experiment with the addition of N few-shot examples from the gold-standard dataset.

In [7]:
for dataset in runners:
    runner = runners[dataset]
    print(f"Dataset: {dataset}")
    for i in [1, 3, 5]:
        if dataset == "housing_1.json" and i in [1, 3]:
            print(f"Skipping {dataset} for n={i}")
            continue
        print(f"Few-shot n: {i}")
        scores = runner.run_few_shot_experiment(n_few_shot=i, prompt=prompts[0])
        update_results("few_shot", dataset, scores, {"n_few_shot": i, "prompt": prompts[0]})
        pretty_print(scores)
        print("--")
    print("=====")

Dataset: housing_1.json
Skipping housing_1.json for n=1
Skipping housing_1.json for n=3
Few-shot n: 5
Total score: 15.24 (accuracy: 3.52, completeness: 4.0, fluency: 3.72, conciseness: 4.0)
--
=====
Dataset: housing_2.json
Few-shot n: 1
Total score: 16.0 (accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4.0)
--
Few-shot n: 3
Total score: 16.0 (accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4.0)
--
Few-shot n: 5
Total score: 16.0 (accuracy: 4.0, completeness: 4.0, fluency: 4.0, conciseness: 4.0)
--
=====
Dataset: housing_3.json
Few-shot n: 1
Total score: 15.75 (accuracy: 4.0, completeness: 3.8333333333333335, fluency: 3.9166666666666665, conciseness: 4.0)
--
Few-shot n: 3
Total score: 15.75 (accuracy: 4.0, completeness: 3.8333333333333335, fluency: 3.9166666666666665, conciseness: 4.0)
--
Few-shot n: 5
Total score: 15.916666666666666 (accuracy: 4.0, completeness: 4.0, fluency: 3.9166666666666665, conciseness: 4.0)
--
=====
Dataset: mushroom_1.json
Few-shot n: 1

In [8]:
llm.inspect_history(n=1)




You are helping users understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative.
---
Follow the following format
Context: what the model predicts
Explanation: explanation of the model's prediction
Explanation Format: format the explanation is given in
Narrative: human-readable narrative version of the explanation
---
Example 1
Context: The model predicts whether a student will pass their class
Explanation: (Family eductional support, yes, 1.21), (In a romantic relationship, no, 1.12), (Age, 17, -0.45)
Explanation Format: SHAP feature contribution in (feature_name, feature_value, contribution) format
Narrative: The lack of a romantic relationship and having family support suggest the student is more likely to pass. However, being 17 indicates a lower probability of passing.
Example 2
Context: The model predicts whether a student will pass their class
Explanation: (In a romantic relationship, yes,

"\n\n\nYou are helping users understand an ML model's prediction. Given an explanation and information about the model, convert the explanation into a human-readable narrative.\n---\nFollow the following format\nContext: what the model predicts\nExplanation: explanation of the model's prediction\nExplanation Format: format the explanation is given in\nNarrative: human-readable narrative version of the explanation\n---\nExample 1\nContext: The model predicts whether a student will pass their class\nExplanation: (Family eductional support, yes, 1.21), (In a romantic relationship, no, 1.12), (Age, 17, -0.45)\nExplanation Format: SHAP feature contribution in (feature_name, feature_value, contribution) format\nNarrative: The lack of a romantic relationship and having family support suggest the student is more likely to pass. However, being 17 indicates a lower probability of passing.\nExample 2\nContext: The model predicts whether a student will pass their class\nExplanation: (In a romantic

## Bootstrapped few-shot
Next, we repeat the experiment with the addition of 3 examples bootstrapped by DSPy to optimize the evaluation metrics.

In [11]:
for dataset in runners:
    runner = runners[dataset]
    print(f"Dataset: {dataset}")
    for i, j in [[0, 1], [0, 3], [3, 3]]:
        print(f"Few-shot n: {i}, Bootstrapped n: {j}")
        scores = runner.run_bootstrap_few_shot_experiment(n_labeled_few_shot=i, n_bootstrapped_few_shot=j)
        update_results("bootstrap_few_shot", dataset, scores, {"n_few_shot": i, "n_bootstrapped_few_shot": j, "prompt": prompts[0]})
        pretty_print(scores)
        print("--")
    results_df = pd.DataFrame(results)
    results_df.to_csv(f"results_{i}_{j}.csv")
    print("=====")

Dataset: housing_1.json
Few-shot n: 0, Bootstrapped n: 1


 30%|███       | 3/10 [00:00<00:00, 214.08it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 206.66it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 222.27it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 138.97it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 100.96it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 91.98it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 82.40it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 87.70it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 80.11it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 87.29it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 74.03it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 71.82it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 79.36it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 74.71it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 89.75it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 79.04it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 93.29it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 73.34it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 176.06it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 206.00it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 75.31it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 94.62it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 77.18it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 140.89it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 77.86it/s]


Bootstrapped 1 full traces after 4 examples in round 0.
Total score: 13.64 (accuracy: 3.04, completeness: 4.0, fluency: 2.6, conciseness: 4.0)
--
Few-shot n: 0, Bootstrapped n: 3


 50%|█████     | 5/10 [00:11<00:11,  2.22s/it]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 90.89it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 113.09it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 84.46it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 87.69it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 86.42it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 92.13it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 80.49it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 83.09it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 87.20it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 88.73it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 75.62it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 72.27it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 90.04it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 81.07it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 85.00it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 80.79it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 89.41it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 100.05it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 98.44it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 94.77it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 88.27it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 84.26it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 72.91it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


 50%|█████     | 5/10 [00:00<00:00, 81.47it/s]


Bootstrapped 3 full traces after 6 examples in round 0.
Total score: 13.76 (accuracy: 3.2, completeness: 3.92, fluency: 2.64, conciseness: 4.0)
--
Few-shot n: 3, Bootstrapped n: 3


 40%|████      | 4/10 [00:24<00:36,  6.02s/it]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 79.35it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 82.96it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 73.04it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 131.28it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 71.53it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 86.28it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 234.76it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 165.56it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 199.37it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 125.43it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 160.09it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 37.71it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 147.16it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 75.83it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 140.47it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 91.49it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 80.19it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 87.26it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 79.18it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 82.09it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 120.49it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 180.88it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 69.72it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 108.63it/s]


Bootstrapped 3 full traces after 5 examples in round 0.
Total score: 14.92 (accuracy: 3.36, completeness: 3.92, fluency: 3.64, conciseness: 4.0)
--
=====
Dataset: housing_2.json
Few-shot n: 0, Bootstrapped n: 1


100%|██████████| 10/10 [00:43<00:00,  4.38s/it]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 83.62it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 78.59it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 91.70it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 91.77it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 78.95it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 93.45it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 73.72it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 81.92it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 77.07it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 87.57it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 76.51it/s]


Bootstrapped 0 full traces after 10 examples in round 0.
Total score: 14.416666666666666 (accuracy: 4.0, completeness: 4.0, fluency: 2.4166666666666665, conciseness: 4.0)
--
Few-shot n: 0, Bootstrapped n: 3


100%|██████████| 10/10 [00:00<00:00, 86.09it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 83.44it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 86.31it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 72.62it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 101.89it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 75.31it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 85.10it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 67.64it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 64.88it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 76.73it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 81.35it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 79.49it/s]


Bootstrapped 0 full traces after 10 examples in round 0.
Total score: 14.416666666666666 (accuracy: 4.0, completeness: 4.0, fluency: 2.4166666666666665, conciseness: 4.0)
--
Few-shot n: 3, Bootstrapped n: 3


 40%|████      | 4/10 [00:19<00:29,  4.93s/it]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 98.25it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 71.94it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 142.41it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 92.35it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 97.39it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 87.10it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 10.72it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 92.19it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 147.38it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 78.18it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 104.54it/s]


Bootstrapped 3 full traces after 5 examples in round 0.
Total score: 15.916666666666666 (accuracy: 4.0, completeness: 4.0, fluency: 3.9166666666666665, conciseness: 4.0)
--
=====
Dataset: housing_3.json
Few-shot n: 0, Bootstrapped n: 1


 10%|█         | 1/10 [00:00<00:03,  2.89it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 85.20it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 79.95it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 94.90it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 82.38it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 165.67it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 88.99it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 79.61it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 85.10it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 76.19it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 79.57it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 81.28it/s]


Bootstrapped 1 full traces after 2 examples in round 0.
Total score: 14.833333333333334 (accuracy: 4.0, completeness: 4.0, fluency: 2.8333333333333335, conciseness: 4.0)
--
Few-shot n: 0, Bootstrapped n: 3


100%|██████████| 10/10 [00:03<00:00,  2.85it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 84.44it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 92.80it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 69.80it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 78.91it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 80.98it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 79.43it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 85.26it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 88.86it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 89.31it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 80.97it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 84.04it/s]


Bootstrapped 1 full traces after 10 examples in round 0.
Total score: 14.833333333333334 (accuracy: 4.0, completeness: 4.0, fluency: 2.8333333333333335, conciseness: 4.0)
--
Few-shot n: 3, Bootstrapped n: 3


 30%|███       | 3/10 [00:15<00:35,  5.00s/it]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 91.00it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 84.83it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 82.75it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 75.59it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 116.80it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 87.83it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 132.70it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 76.65it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 98.50it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 97.69it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 78.46it/s]


Bootstrapped 3 full traces after 4 examples in round 0.
Total score: 15.666666666666666 (accuracy: 4.0, completeness: 3.8333333333333335, fluency: 3.8333333333333335, conciseness: 4.0)
--
=====
Dataset: mushroom_1.json
Few-shot n: 0, Bootstrapped n: 1


100%|██████████| 10/10 [00:43<00:00,  4.34s/it]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 115.99it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 85.19it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 160.14it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 70.59it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 81.89it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 101.44it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 168.80it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 87.12it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 87.07it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 96.31it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 78.87it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 93.22it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 91.68it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 87.60it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 89.56it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 161.22it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 159.35it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 81.42it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 85.34it/s]


Bootstrapped 0 full traces after 10 examples in round 0.
Total score: 13.85 (accuracy: 4.0, completeness: 4.0, fluency: 1.85, conciseness: 4.0)
--
Few-shot n: 0, Bootstrapped n: 3


100%|██████████| 10/10 [00:00<00:00, 86.91it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 101.61it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 86.47it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 90.96it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 82.67it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 92.73it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 90.77it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 87.42it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 79.06it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 87.11it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 88.21it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 81.83it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 86.12it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 83.35it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 85.29it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 84.05it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 90.12it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 84.07it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 85.57it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 71.25it/s]


Bootstrapped 0 full traces after 10 examples in round 0.
Total score: 13.85 (accuracy: 4.0, completeness: 4.0, fluency: 1.85, conciseness: 4.0)
--
Few-shot n: 3, Bootstrapped n: 3


 60%|██████    | 6/10 [00:15<00:10,  2.62s/it]


Bootstrapped 3 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 90.04it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 96.99it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 87.12it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 88.12it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 90.39it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 82.42it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 83.05it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 107.26it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 81.46it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 83.00it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 82.87it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 85.87it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 167.32it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 84.30it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 84.65it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 65.58it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 139.17it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 79.07it/s]


Bootstrapped 3 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 82.25it/s]


Bootstrapped 3 full traces after 7 examples in round 0.
Total score: 14.25 (accuracy: 3.0, completeness: 3.8, fluency: 3.45, conciseness: 4.0)
--
=====
Dataset: mushroom_2.json
Few-shot n: 0, Bootstrapped n: 1


 10%|█         | 1/10 [00:00<00:03,  2.61it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 94.04it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 99.28it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 83.17it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 75.96it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 76.82it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 86.76it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 199.99it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 86.43it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 86.33it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 66.32it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 71.32it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 71.04it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 104.25it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 79.93it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 166.40it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 85.70it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 82.66it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 74.00it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 1/10 [00:00<00:00, 86.88it/s]


Bootstrapped 1 full traces after 2 examples in round 0.
Total score: 13.95 (accuracy: 3.8, completeness: 3.9, fluency: 2.25, conciseness: 4.0)
--
Few-shot n: 0, Bootstrapped n: 3


100%|██████████| 10/10 [00:02<00:00,  3.70it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 80.70it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 81.61it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 92.35it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 76.88it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 85.63it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 87.78it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 82.20it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 84.94it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 76.04it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 82.38it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 85.69it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 87.71it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 79.22it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 100.27it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 79.60it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 88.53it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 82.11it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 49.15it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 59.67it/s]


Bootstrapped 1 full traces after 10 examples in round 0.
Total score: 13.95 (accuracy: 3.8, completeness: 3.9, fluency: 2.25, conciseness: 4.0)
--
Few-shot n: 3, Bootstrapped n: 3


100%|██████████| 10/10 [00:30<00:00,  3.06s/it]


Bootstrapped 2 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 74.52it/s]


Bootstrapped 2 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 81.61it/s]


Bootstrapped 2 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 146.99it/s]


Bootstrapped 2 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 81.88it/s]


Bootstrapped 2 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 105.00it/s]


Bootstrapped 2 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 89.55it/s]


Bootstrapped 2 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 87.20it/s]


Bootstrapped 2 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 85.40it/s]


Bootstrapped 2 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 84.39it/s]


Bootstrapped 2 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 172.36it/s]


Bootstrapped 2 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 84.13it/s]


Bootstrapped 2 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 167.14it/s]


Bootstrapped 2 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 85.83it/s]


Bootstrapped 2 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 87.39it/s]


Bootstrapped 2 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 22.05it/s]


Bootstrapped 2 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 89.85it/s]


Bootstrapped 2 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 85.32it/s]


Bootstrapped 2 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 83.98it/s]


Bootstrapped 2 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 75.65it/s]


Bootstrapped 2 full traces after 10 examples in round 0.
Total score: 13.6 (accuracy: 3.8, completeness: 3.8, fluency: 2.0, conciseness: 4.0)
--
=====
Dataset: pdf_1.json
Few-shot n: 0, Bootstrapped n: 1


100%|██████████| 10/10 [00:48<00:00,  4.84s/it]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 244.85it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 177.64it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 83.69it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 91.02it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 169.30it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 83.97it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 76.28it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 143.21it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 83.74it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 159.66it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 89.58it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 90.93it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 84.88it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 86.78it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 144.04it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 94.34it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 93.83it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 176.69it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 70.99it/s]


Bootstrapped 0 full traces after 10 examples in round 0.
Total score: 13.85 (accuracy: 3.6, completeness: 3.8, fluency: 2.45, conciseness: 4.0)
--
Few-shot n: 0, Bootstrapped n: 3


100%|██████████| 10/10 [00:00<00:00, 79.08it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 82.35it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 78.29it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 75.30it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 78.15it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 80.67it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 81.22it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 87.42it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 77.00it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 83.23it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 77.37it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 83.83it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 81.55it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 91.11it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 95.53it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 77.63it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 81.41it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 82.82it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 82.48it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 82.47it/s]


Bootstrapped 0 full traces after 10 examples in round 0.
Total score: 13.85 (accuracy: 3.6, completeness: 3.8, fluency: 2.45, conciseness: 4.0)
--
Few-shot n: 3, Bootstrapped n: 3


100%|██████████| 10/10 [00:32<00:00,  3.22s/it]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 122.71it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 95.14it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 100.13it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 92.65it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 160.94it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 175.77it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 175.61it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 90.42it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 87.78it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 95.30it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 83.63it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 92.36it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 101.23it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 80.81it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 93.13it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 88.32it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 81.96it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 53.44it/s]


Bootstrapped 1 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:01<00:00,  8.99it/s]


Bootstrapped 1 full traces after 10 examples in round 0.
Total score: 13.85 (accuracy: 3.2, completeness: 2.8, fluency: 3.85, conciseness: 4.0)
--
=====
Dataset: pdf_2.json
Few-shot n: 0, Bootstrapped n: 1


100%|██████████| 10/10 [00:03<00:00,  2.88it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 69.12it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 81.19it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 68.71it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 84.39it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 42.44it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 73.42it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 40.76it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 59.51it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 43.49it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 70.36it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 60.51it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 66.33it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 70.31it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 75.14it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 89.93it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 135.98it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 89.23it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 85.42it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 83.49it/s]


Bootstrapped 0 full traces after 10 examples in round 0.
Total score: 13.2 (accuracy: 3.6, completeness: 3.8, fluency: 1.8, conciseness: 4.0)
--
Few-shot n: 0, Bootstrapped n: 3


100%|██████████| 10/10 [00:00<00:00, 87.48it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 92.06it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 93.63it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 88.36it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 76.91it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 93.75it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 83.42it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 86.93it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 90.01it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 72.16it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 147.03it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 41.87it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 75.56it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 87.54it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 162.82it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 68.62it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 78.31it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 37.87it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 36.60it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 60.59it/s]


Bootstrapped 0 full traces after 10 examples in round 0.
Total score: 13.2 (accuracy: 3.6, completeness: 3.8, fluency: 1.8, conciseness: 4.0)
--
Few-shot n: 3, Bootstrapped n: 3


100%|██████████| 10/10 [00:31<00:00,  3.16s/it]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 95.65it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 102.37it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 121.30it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 168.14it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 94.68it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 95.30it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 87.85it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 95.46it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 164.20it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 158.51it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 100.58it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 88.95it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 82.04it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 94.66it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 90.84it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 176.51it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 94.78it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 96.68it/s]


Bootstrapped 0 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 87.25it/s]


Bootstrapped 0 full traces after 10 examples in round 0.
Total score: 10.05 (accuracy: 0.4, completeness: 2.2, fluency: 3.45, conciseness: 4.0)
--
=====
Dataset: student_1.json
Few-shot n: 0, Bootstrapped n: 1


 60%|██████    | 6/10 [00:26<00:17,  4.45s/it]


Bootstrapped 1 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 99.43it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 96.19it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 97.81it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 93.94it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 101.27it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 90.36it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 94.41it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 89.67it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 101.01it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 165.72it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 87.94it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 95.50it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 89.61it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 87.72it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 88.47it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 94.94it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 92.63it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 92.81it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


 60%|██████    | 6/10 [00:00<00:00, 125.51it/s]


Bootstrapped 1 full traces after 7 examples in round 0.
Total score: 14.75 (accuracy: 4.0, completeness: 4.0, fluency: 2.75, conciseness: 4.0)
--
Few-shot n: 0, Bootstrapped n: 3


100%|██████████| 10/10 [00:14<00:00,  1.41s/it]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 72.00it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 96.62it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 102.66it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 89.76it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 95.74it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 135.02it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 89.43it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 79.10it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 84.80it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 94.01it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 80.31it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 18.49it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 162.49it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 90.54it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 89.12it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 81.81it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 76.32it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 92.43it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 83.66it/s]


Bootstrapped 3 full traces after 10 examples in round 0.
Total score: 15.15 (accuracy: 4.0, completeness: 4.0, fluency: 3.15, conciseness: 4.0)
--
Few-shot n: 3, Bootstrapped n: 3


 30%|███       | 3/10 [00:13<00:31,  4.56s/it]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 91.90it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 122.02it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 101.48it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 76.64it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 87.86it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 90.46it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 85.14it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 99.34it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 92.91it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 100.69it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 97.81it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 79.02it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 94.72it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 106.64it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 89.45it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 95.63it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 96.72it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 88.69it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 92.90it/s]


Bootstrapped 3 full traces after 4 examples in round 0.
Total score: 15.6 (accuracy: 4.0, completeness: 3.9, fluency: 3.7, conciseness: 4.0)
--
=====
Dataset: student_2.json
Few-shot n: 0, Bootstrapped n: 1


 30%|███       | 3/10 [00:01<00:02,  2.90it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 102.74it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 95.32it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 98.26it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 94.67it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 93.14it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 94.76it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 90.45it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 145.35it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 94.00it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 82.39it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 67.21it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 89.16it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 101.19it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 127.48it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 81.79it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 82.87it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 79.45it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 94.96it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


 30%|███       | 3/10 [00:00<00:00, 86.44it/s]


Bootstrapped 1 full traces after 4 examples in round 0.
Total score: 15.1 (accuracy: 4.0, completeness: 4.0, fluency: 3.1, conciseness: 4.0)
--
Few-shot n: 0, Bootstrapped n: 3


100%|██████████| 10/10 [00:30<00:00,  3.03s/it]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 84.56it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 90.78it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 174.86it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 93.42it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 90.35it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 94.33it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 83.52it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 168.57it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 155.80it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 100.31it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 109.63it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 91.97it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 81.08it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 89.87it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 72.84it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 82.45it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 94.64it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 94.21it/s]


Bootstrapped 3 full traces after 10 examples in round 0.


100%|██████████| 10/10 [00:00<00:00, 164.42it/s]


Bootstrapped 3 full traces after 10 examples in round 0.
Total score: 14.9 (accuracy: 4.0, completeness: 4.0, fluency: 2.9, conciseness: 4.0)
--
Few-shot n: 3, Bootstrapped n: 3


 40%|████      | 4/10 [00:12<00:18,  3.04s/it]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 93.38it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 82.63it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 82.12it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 90.21it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 174.19it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 100.45it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 160.93it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 167.95it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 79.59it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 135.97it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 99.16it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 183.86it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 87.27it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 83.11it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 118.24it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 94.55it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 83.30it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 104.29it/s]


Bootstrapped 3 full traces after 5 examples in round 0.


 40%|████      | 4/10 [00:00<00:00, 93.30it/s]


Bootstrapped 3 full traces after 5 examples in round 0.
Total score: 15.75 (accuracy: 3.8, completeness: 4.0, fluency: 3.95, conciseness: 4.0)
--
=====


In [12]:
result_df = pd.DataFrame(results)
result_df.to_csv("results.csv")
result_df


Unnamed: 0,dataset,prompt,total score,accuracy,completeness,fluency,conciseness,n_few_shot,n_bootstrapped_few_shot
0,housing_1.json,You are helping users understand an ML model's...,15.24,3.52,4.0,3.72,4.0,5,
1,housing_2.json,You are helping users understand an ML model's...,16.0,4.0,4.0,4.0,4.0,1,
2,housing_2.json,You are helping users understand an ML model's...,16.0,4.0,4.0,4.0,4.0,3,
3,housing_2.json,You are helping users understand an ML model's...,16.0,4.0,4.0,4.0,4.0,5,
4,housing_3.json,You are helping users understand an ML model's...,15.75,4.0,3.833333,3.916667,4.0,1,
5,housing_3.json,You are helping users understand an ML model's...,15.75,4.0,3.833333,3.916667,4.0,3,
6,housing_3.json,You are helping users understand an ML model's...,15.916667,4.0,4.0,3.916667,4.0,5,
7,mushroom_1.json,You are helping users understand an ML model's...,14.05,2.8,3.8,3.45,4.0,1,
8,mushroom_1.json,You are helping users understand an ML model's...,14.5,3.4,3.6,3.5,4.0,3,
9,mushroom_1.json,You are helping users understand an ML model's...,13.95,2.6,3.8,3.55,4.0,5,
