# __DSPy Prompt Optimizers__: Optimizing instructions & few-shot examples for LM programs

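This notebook shows how to use a DSPy prompt optimizer to tune the instructions and few-shot examples of an LM program. We set up cached LM calls, define a multihop HotPotQA program, evaluate a baseline, and then compile the program with the `BayesianSignatureOptimizer` and evaluate the result.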

### 0] Setup

First, we'll __load in the cached requests__ for this task, so that we don't need to call any LMs while running this notebook.

In [1]:
!rm -rf DSPy_optimizer_cache
!git clone https://huggingface.co/kopsahlong/DSPy_optimizer_cache
%cd DSPy_optimizer_cache/
# !git checkout master
%cd ..
import os
repo_clone_path = 'DSPy_optimizer_cache/cache' #TODO: update this cache to just contain the runs we need!!

# Set up the cache for this notebook
os.environ["DSP_NOTEBOOK_CACHEDIR"] = repo_clone_path
os.environ["DSP_CACHEDIR"] = f"{repo_clone_path}/cachedir"




We will also specify the __prompt LM model__ (in this case GPT-3.5 Turbo), the __task LM model__ (Llama 2 13B Chat), and the retrieval model (ColBERTv2) we'll be using for our task (a HotPotQA multihop retrieval task).

In [2]:
import os
import dspy
import openai

### NOTE: if you'd like to run this code without a cache, uncomment these lines to configure your OpenAI key ###
# os.environ['OPENAI_API_KEY'] = "TODO: ADD YOUR OPENAI KEY HERE"
# openai.api_key = os.environ.get('OPENAI_API_KEY')
# openai.api_base = "https://api.openai.com/v1"

prompt_model_name = "gpt-3.5-turbo-1106"
task_model_name = "meta-llama/Llama-2-13b-chat-hf"
colbert_v2_endpoint = "http://20.102.90.50:2017/wiki17_abstracts"

ports = [7140, 7141, 7142, 7143]  # Ports of the TGI servers hosting the task model

prompt_model = dspy.OpenAI(model=prompt_model_name, max_tokens=150)
task_model = dspy.HFClientTGI(model=task_model_name, port=ports, max_tokens=150)

colbertv2 = dspy.ColBERTv2(url=colbert_v2_endpoint)

dspy.settings.configure(rm=colbertv2, lm=task_model)



### 1] Define Task

Here, we define the program that we'd like to run, which is a multihop [...] (loosely inspired by a certain paper). We also load the data and define how we'd like to evaluate this task.

In [3]:
from dspy.evaluate import Evaluate
import re 
from dspy.datasets import HotPotQA

class ReturnRankedDocuments(dspy.Signature):
    """Given a question we are trying to answer and a list of passages, return a comma separated list of the numbers associated with each passage. These numbers should be ordered by helpfulness in answering the question, with most helpful passage number first, and the least helpful last."""
    question = dspy.InputField(desc="The question we're trying to answer.")
    context = dspy.InputField(desc="List of potentially related passages.")
    ranking = dspy.OutputField(desc="A comma separated list of numbers corresponding to passage indices, ranked in descending order by their helpfulness in answering our question.")

class RankingMultiHop(dspy.Module):
    def __init__(self, hops, num_passages_to_retrieve, max_passages_in_context):
        super().__init__()
        self.hops = hops
        self.num_passages_to_retrieve = num_passages_to_retrieve
        self.max_passages_in_context = max_passages_in_context
        self.retrieve = dspy.Retrieve(k=self.num_passages_to_retrieve)
        self.generate_query = dspy.ChainOfThought("context, question -> search_query")
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")
        self.generate_ranking = dspy.ChainOfThought(ReturnRankedDocuments)

    def forward(self, question):
        context = []
        full_context = []
        top_context = []
        max_passage_num = self.max_passages_in_context
        for hop in range(self.hops):
            # Get a new query
            query = self.generate_query(context = context, question = question).search_query
            # Get new passages
            context = self.retrieve(query).passages
            # Add these new passages to the previous top context 
            full_context = top_context + context
            # Get the most important indices, ranked
            most_important_indices =  self.generate_ranking(question=question, context=full_context).ranking
            indices = [int(num) for num in re.findall(r'\d+', most_important_indices)]

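            # Fall back to the first max_passage_num passages if the LM returned too few indices
            if len(indices) < max_passage_num:
                indices = range(1, max_passage_num + 1)

            # Convert 1-based passage numbers to 0-based indices.
            # Note: the ranking is produced over full_context, but the indices are applied to context.
            valid_indices = [index - 1 for index in indices if index - 1 < len(context)]
            # Note: sorting puts the indices in ascending order, which discards the LM's rank order
            top_indices = sorted(valid_indices)[:max_passage_num + 1]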
            most_important_context_list = [context[idx] for idx in top_indices]
            # Save the top context
            top_context = most_important_context_list

        return dspy.Prediction(context=context, answer=self.generate_answer(context = top_context , question = question).answer)

program = RankingMultiHop(hops=4, num_passages_to_retrieve=5, max_passages_in_context=5)

# Load and configure the datasets.
TRAIN_SIZE = 500
EVAL_SIZE = 500

hotpot_dataset = HotPotQA(train_seed=1, eval_seed=2023, test_size=0)
trainset = [x.with_inputs('question') for x in hotpot_dataset.train][:TRAIN_SIZE]
devset = [x.with_inputs('question') for x in hotpot_dataset.dev][:EVAL_SIZE]

# Set up metrics
NUM_THREADS = 10

metric = dspy.evaluate.answer_exact_match

# kwargs = dict(num_threads=NUM_THREADS, display_progress=True, display_table=None)
kwargs = dict(num_threads=NUM_THREADS, display_progress=True)
evaluate = Evaluate(devset=devset, metric=metric, **kwargs)
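
The index handling inside `forward` is easy to misread, so here is a standalone sketch of just that step, with no LM or retriever involved. The helper name `parse_ranking` and the example ranking strings are hypothetical; the regular expression and the fallback rule mirror the cell above. Unlike that cell, this sketch keeps the LM's rank order and also rejects a passage number of 0; treat both as illustrative choices rather than the program's actual behavior.

```python
import re

def parse_ranking(ranking: str, num_passages: int, max_passages: int) -> list[int]:
    """Turn an LM ranking string such as "3, 1, 2" into 0-based passage indices."""
    # Pull out every run of digits, ignoring any surrounding text the LM added.
    numbers = [int(num) for num in re.findall(r"\d+", ranking)]
    # Fall back to passages 1..max_passages if the LM returned too few numbers.
    if len(numbers) < max_passages:
        numbers = list(range(1, max_passages + 1))
    # Convert 1-based passage numbers to 0-based indices and drop out-of-range ones.
    return [n - 1 for n in numbers if 0 <= n - 1 < num_passages]

# A well-formed ranking and a response that contains no numbers at all.
print(parse_ranking("Ranking: 3, 1, 2", num_passages=5, max_passages=3))
print(parse_ranking("I am not sure", num_passages=5, max_passages=3))
```

The fallback means a malformed ranking never leaves the next hop without context, at the cost of silently ignoring the ranker in that case.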



### 2] Baseline Evaluation
Now, we'll quickly evaluate our baseline program so that we have a reference point for the prompt optimizer. We should see exact-match accuracy of about __16%__ on our trainset and __21.4%__ on our devset.

In [4]:
baseline_train_score = evaluate(program, devset=trainset)
baseline_eval_score = evaluate(program, devset=devset)

#TODO: add in retrieval eval 

Average Metric: 80 / 500  (16.0): 100%|██████████| 500/500 [01:19<00:00,  6.26it/s]


Average Metric: 80 / 500  (16.0%)


Average Metric: 107 / 500  (21.4): 100%|██████████| 500/500 [01:20<00:00,  6.24it/s]


Average Metric: 107 / 500  (21.4%)
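
The percentages above are the number of exact-match hits divided by the number of examples. As a rough illustration, the sketch below scores a few made-up predictions. `normalize` and `exact_match` are hypothetical helpers with a simplified normalization (lowercasing, stripping punctuation and extra whitespace); they are not DSPy's actual `answer_exact_match` implementation.

```python
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (a simplified normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    """Return True when prediction and gold answer agree after normalization."""
    return normalize(prediction) == normalize(gold)

# Made-up (prediction, gold answer) pairs, for illustration only.
examples = [("Paris", "paris."), ("The Beatles", "Beatles"), ("1969", "1969")]
hits = sum(exact_match(p, g) for p, g in examples)
print(f"Average Metric: {hits} / {len(examples)}  ({100 * hits / len(examples):.1f}%)")

# The same arithmetic applied to the counts reported in this notebook.
print(round(100 * 80 / 500, 1), round(100 * 107 / 500, 1))
```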


We can also inspect a trace from this program to see what a single call looks like in practice. #TODO

### 3] Bayesian Optimization

#### 3a] Inspecting Pre-Optimized Program

First, because Bayesian optimization can take a little while to run, let's load in a precompiled program to get a better understanding of what it looks like and see how it performs.

#### 3b] Optimizing a Program from Scratch

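Now that we've seen what the Bayesian prompt optimizer can do, let's run it ourselves. Because all of our calls to the LM are cached, this won't cost you anything, but it may take __~15 min__ to complete. At a high level, the optimizer generates `N` candidate instructions and few-shot example sets for each predictor in the program, then runs a series of Optuna trials, each of which evaluates a new combination of instructions and few-shot examples on the training data.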
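
To make the search space concrete: in the run recorded in this notebook, each trial picks one instruction index and one demo-set index per predictor (the `_predictor_instruction` and `_predictor_demos` parameters in the optimization log). The sketch below enumerates that space and searches it with plain random sampling as a stand-in for the Bayesian sampler. The scoring function is a made-up toy, and none of the names are DSPy APIs.

```python
import random

N = 10              # candidate instructions / demo sets per predictor, as in this notebook
NUM_PREDICTORS = 3  # the program has three ChainOfThought predictors
TRIALS = 30         # number of configurations to try

def sample_configuration(rng: random.Random) -> list[tuple[int, int]]:
    """Pick one (instruction index, demo-set index) pair per predictor."""
    return [(rng.randrange(N), rng.randrange(N)) for _ in range(NUM_PREDICTORS)]

def toy_score(config: list[tuple[int, int]]) -> float:
    """Made-up objective standing in for the metric evaluated on real data."""
    return sum(10 - abs(instr - 3) - abs(demos - 4) for instr, demos in config) / NUM_PREDICTORS

rng = random.Random(0)
best_config, best_score = None, float("-inf")
for trial in range(TRIALS):
    config = sample_configuration(rng)
    score = toy_score(config)
    # Keep the best configuration seen so far.
    if score > best_score:
        best_config, best_score = config, score

print("search space size:", (N * N) ** NUM_PREDICTORS)
print("best configuration found:", best_config, "score:", round(best_score, 2))
```

With 30 trials against a million possible configurations, the quality of the sampler matters, which is the motivation for using a model-guided search instead of the random sampling shown here.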

In [5]:
from dspy.teleprompt import BayesianSignatureOptimizer

# Define hyperparameters:
N = 10  # The number of candidate instructions and few-shot example sets that we will generate and optimize over
trials = 30  # The number of optimization trials to run (each trial tests a new combination of instructions and few-shot examples)
temperature = 1.0  # The temperature used for generating new instructions

# Compile
eval_kwargs = dict(num_threads=16, display_progress=True, display_table=0)
teleprompter = BayesianSignatureOptimizer(prompt_model=prompt_model, task_model=task_model, metric=metric, n=N, init_temperature=temperature, verbose=False)
compiled_program = teleprompter.compile(program.deepcopy(), devset=trainset, optuna_trials_num=trials, max_bootstrapped_demos=1, max_labeled_demos=2, eval_kwargs=eval_kwargs)

DSPy_optimizer_cache/cache/compiler


  1%|          | 5/500 [00:00<00:08, 57.26it/s]


Bootstrapped 1 full traces after 6 examples in round 0.


  0%|          | 2/500 [00:00<00:09, 54.86it/s]


Bootstrapped 1 full traces after 3 examples in round 0.


  0%|          | 1/500 [00:00<00:09, 51.38it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


  0%|          | 1/500 [00:00<00:09, 51.48it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


  0%|          | 2/500 [00:00<00:09, 53.30it/s]


Bootstrapped 1 full traces after 3 examples in round 0.


  1%|          | 4/500 [00:00<00:08, 56.01it/s]


Bootstrapped 1 full traces after 5 examples in round 0.


  1%|          | 4/500 [00:00<00:08, 56.18it/s]


Bootstrapped 1 full traces after 5 examples in round 0.


  1%|          | 6/500 [00:00<00:08, 55.90it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


  1%|          | 3/500 [00:00<00:09, 54.17it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


[I 2024-02-26 21:34:37,557] A new study created in memory with name: no-name-61921210-557c-461c-aca6-e0a8a107330a


Starting trial #0


Average Metric: 29 / 100  (29.0): 100%|██████████| 100/100 [00:14<00:00,  6.76it/s]


Average Metric: 29 / 100  (29.0%)


Average Metric: 32 / 100  (32.0): 100%|██████████| 100/100 [00:15<00:00,  6.53it/s]


Average Metric: 32 / 100  (32.0%)


Average Metric: 29 / 100  (29.0): 100%|██████████| 100/100 [00:15<00:00,  6.63it/s]


Average Metric: 29 / 100  (29.0%)


Average Metric: 19 / 100  (19.0): 100%|██████████| 100/100 [00:14<00:00,  6.67it/s]


Average Metric: 19 / 100  (19.0%)


Average Metric: 23 / 100  (23.0): 100%|██████████| 100/100 [00:14<00:00,  6.70it/s]
[I 2024-02-26 21:35:54,035] Trial 0 finished with value: 26.4 and parameters: {'139679081593344_predictor_instruction': 1, '139679081593344_predictor_demos': 1, '139679081593440_predictor_instruction': 5, '139679081593440_predictor_demos': 4, '139679081593584_predictor_instruction': 3, '139679081593584_predictor_demos': 0}. Best is trial 0 with value: 26.4.


Average Metric: 23 / 100  (23.0%)
Starting trial #1


Average Metric: 40 / 100  (40.0): 100%|██████████| 100/100 [00:16<00:00,  5.94it/s]


Average Metric: 40 / 100  (40.0%)


Average Metric: 29 / 100  (29.0): 100%|██████████| 100/100 [00:17<00:00,  5.76it/s]


Average Metric: 29 / 100  (29.0%)


Average Metric: 33 / 100  (33.0): 100%|██████████| 100/100 [00:16<00:00,  5.96it/s]


Average Metric: 33 / 100  (33.0%)


Average Metric: 24 / 100  (24.0): 100%|██████████| 100/100 [00:16<00:00,  5.95it/s]


Average Metric: 24 / 100  (24.0%)


Average Metric: 31 / 100  (31.0): 100%|██████████| 100/100 [00:16<00:00,  5.97it/s]
[I 2024-02-26 21:37:19,423] Trial 1 finished with value: 31.4 and parameters: {'139679081593344_predictor_instruction': 9, '139679081593344_predictor_demos': 3, '139679081593440_predictor_instruction': 8, '139679081593440_predictor_demos': 4, '139679081593584_predictor_instruction': 4, '139679081593584_predictor_demos': 2}. Best is trial 1 with value: 31.4.


Average Metric: 31 / 100  (31.0%)
Starting trial #2


Average Metric: 35 / 100  (35.0): 100%|██████████| 100/100 [00:16<00:00,  5.94it/s]


Average Metric: 35 / 100  (35.0%)


Average Metric: 33 / 100  (33.0): 100%|██████████| 100/100 [00:16<00:00,  6.18it/s]


Average Metric: 33 / 100  (33.0%)


Average Metric: 39 / 100  (39.0): 100%|██████████| 100/100 [00:16<00:00,  6.16it/s]


Average Metric: 39 / 100  (39.0%)


Average Metric: 28 / 100  (28.0): 100%|██████████| 100/100 [00:16<00:00,  6.18it/s]


Average Metric: 28 / 100  (28.0%)


Average Metric: 28 / 100  (28.0): 100%|██████████| 100/100 [00:16<00:00,  6.15it/s]
[I 2024-02-26 21:38:42,088] Trial 2 finished with value: 32.6 and parameters: {'139679081593344_predictor_instruction': 1, '139679081593344_predictor_demos': 9, '139679081593440_predictor_instruction': 0, '139679081593440_predictor_demos': 4, '139679081593584_predictor_instruction': 5, '139679081593584_predictor_demos': 8}. Best is trial 2 with value: 32.6.


Average Metric: 28 / 100  (28.0%)
Starting trial #3


Average Metric: 31 / 100  (31.0): 100%|██████████| 100/100 [00:14<00:00,  6.92it/s]


Average Metric: 31 / 100  (31.0%)


Average Metric: 33 / 100  (33.0): 100%|██████████| 100/100 [00:14<00:00,  6.93it/s]


Average Metric: 33 / 100  (33.0%)


Average Metric: 28 / 100  (28.0): 100%|██████████| 100/100 [00:15<00:00,  6.60it/s]


Average Metric: 28 / 100  (28.0%)


Average Metric: 16 / 100  (16.0): 100%|██████████| 100/100 [00:15<00:00,  6.45it/s]


Average Metric: 16 / 100  (16.0%)


Average Metric: 30 / 100  (30.0): 100%|██████████| 100/100 [00:15<00:00,  6.56it/s]
[I 2024-02-26 21:39:57,811] Trial 3 finished with value: 27.6 and parameters: {'139679081593344_predictor_instruction': 2, '139679081593344_predictor_demos': 2, '139679081593440_predictor_instruction': 3, '139679081593440_predictor_demos': 9, '139679081593584_predictor_instruction': 6, '139679081593584_predictor_demos': 0}. Best is trial 2 with value: 32.6.


Average Metric: 30 / 100  (30.0%)
Starting trial #4


Average Metric: 26 / 100  (26.0): 100%|██████████| 100/100 [00:15<00:00,  6.43it/s]


Average Metric: 26 / 100  (26.0%)


Average Metric: 27 / 100  (27.0): 100%|██████████| 100/100 [00:16<00:00,  6.19it/s]


Average Metric: 27 / 100  (27.0%)


Average Metric: 23 / 100  (23.0): 100%|██████████| 100/100 [00:15<00:00,  6.44it/s]


Average Metric: 23 / 100  (23.0%)


Average Metric: 23 / 100  (23.0): 100%|██████████| 100/100 [00:15<00:00,  6.26it/s]


Average Metric: 23 / 100  (23.0%)


Average Metric: 22 / 100  (22.0): 100%|██████████| 100/100 [00:16<00:00,  6.07it/s]
[I 2024-02-26 21:41:18,440] Trial 4 finished with value: 24.2 and parameters: {'139679081593344_predictor_instruction': 7, '139679081593344_predictor_demos': 6, '139679081593440_predictor_instruction': 1, '139679081593440_predictor_demos': 3, '139679081593584_predictor_instruction': 0, '139679081593584_predictor_demos': 2}. Best is trial 2 with value: 32.6.


Average Metric: 22 / 100  (22.0%)
Starting trial #5


Average Metric: 43 / 100  (43.0): 100%|██████████| 100/100 [00:15<00:00,  6.28it/s]


Average Metric: 43 / 100  (43.0%)


Average Metric: 36 / 100  (36.0): 100%|██████████| 100/100 [00:16<00:00,  6.14it/s]


Average Metric: 36 / 100  (36.0%)


Average Metric: 42 / 100  (42.0): 100%|██████████| 100/100 [00:16<00:00,  5.93it/s]


Average Metric: 42 / 100  (42.0%)


Average Metric: 32 / 100  (32.0): 100%|██████████| 100/100 [00:16<00:00,  6.24it/s]


Average Metric: 32 / 100  (32.0%)


Average Metric: 30 / 100  (30.0): 100%|██████████| 100/100 [00:16<00:00,  5.96it/s]
[I 2024-02-26 21:42:41,196] Trial 5 finished with value: 36.6 and parameters: {'139679081593344_predictor_instruction': 5, '139679081593344_predictor_demos': 3, '139679081593440_predictor_instruction': 4, '139679081593440_predictor_demos': 6, '139679081593584_predictor_instruction': 6, '139679081593584_predictor_demos': 8}. Best is trial 5 with value: 36.6.


Average Metric: 30 / 100  (30.0%)
Starting trial #6


Average Metric: 4 / 100  (4.0): 100%|██████████| 100/100 [00:16<00:00,  6.06it/s]
[I 2024-02-26 21:42:57,916] Trial 6 pruned. 


Average Metric: 4 / 100  (4.0%)
Trial pruned.
Starting trial #7


Average Metric: 32 / 100  (32.0): 100%|██████████| 100/100 [00:17<00:00,  5.82it/s]
[I 2024-02-26 21:43:15,274] Trial 7 pruned. 


Average Metric: 32 / 100  (32.0%)
Trial pruned.
Starting trial #8


Average Metric: 40 / 100  (40.0): 100%|██████████| 100/100 [00:15<00:00,  6.37it/s]


Average Metric: 40 / 100  (40.0%)


Average Metric: 38 / 100  (38.0): 100%|██████████| 100/100 [00:15<00:00,  6.41it/s]


Average Metric: 38 / 100  (38.0%)


Average Metric: 40 / 100  (40.0): 100%|██████████| 100/100 [00:15<00:00,  6.39it/s]


Average Metric: 40 / 100  (40.0%)


Average Metric: 34 / 100  (34.0): 100%|██████████| 100/100 [00:15<00:00,  6.49it/s]


Average Metric: 34 / 100  (34.0%)


Average Metric: 36 / 100  (36.0): 100%|██████████| 100/100 [00:15<00:00,  6.45it/s]
[I 2024-02-26 21:44:34,016] Trial 8 finished with value: 37.6 and parameters: {'139679081593344_predictor_instruction': 8, '139679081593344_predictor_demos': 9, '139679081593440_predictor_instruction': 8, '139679081593440_predictor_demos': 8, '139679081593584_predictor_instruction': 2, '139679081593584_predictor_demos': 1}. Best is trial 8 with value: 37.6.


Average Metric: 36 / 100  (36.0%)
Starting trial #9


Average Metric: 49 / 100  (49.0): 100%|██████████| 100/100 [00:16<00:00,  5.98it/s]


Average Metric: 49 / 100  (49.0%)


Average Metric: 40 / 100  (40.0): 100%|██████████| 100/100 [00:17<00:00,  5.71it/s]


Average Metric: 40 / 100  (40.0%)


Average Metric: 43 / 100  (43.0): 100%|██████████| 100/100 [00:17<00:00,  5.88it/s]


Average Metric: 43 / 100  (43.0%)


Average Metric: 36 / 100  (36.0): 100%|██████████| 100/100 [00:16<00:00,  6.09it/s]


Average Metric: 36 / 100  (36.0%)


Average Metric: 38 / 100  (38.0): 100%|██████████| 100/100 [00:16<00:00,  5.99it/s]
[I 2024-02-26 21:45:59,368] Trial 9 finished with value: 41.2 and parameters: {'139679081593344_predictor_instruction': 0, '139679081593344_predictor_demos': 4, '139679081593440_predictor_instruction': 8, '139679081593440_predictor_demos': 5, '139679081593584_predictor_instruction': 1, '139679081593584_predictor_demos': 4}. Best is trial 9 with value: 41.2.


Average Metric: 38 / 100  (38.0%)
Starting trial #10


Average Metric: 47 / 100  (47.0): 100%|██████████| 100/100 [00:07<00:00, 14.09it/s]


Average Metric: 47 / 100  (47.0%)


Average Metric: 39 / 100  (39.0): 100%|██████████| 100/100 [00:07<00:00, 13.72it/s]


Average Metric: 39 / 100  (39.0%)


Average Metric: 44 / 100  (44.0): 100%|██████████| 100/100 [00:07<00:00, 13.68it/s]


Average Metric: 44 / 100  (44.0%)


Average Metric: 34 / 100  (34.0): 100%|██████████| 100/100 [00:07<00:00, 13.83it/s]


Average Metric: 34 / 100  (34.0%)


Average Metric: 37 / 100  (37.0): 100%|██████████| 100/100 [00:06<00:00, 14.37it/s]
[I 2024-02-26 21:46:36,422] Trial 10 finished with value: 40.2 and parameters: {'139679081593344_predictor_instruction': 0, '139679081593344_predictor_demos': 4, '139679081593440_predictor_instruction': 2, '139679081593440_predictor_demos': 5, '139679081593584_predictor_instruction': 1, '139679081593584_predictor_demos': 4}. Best is trial 9 with value: 41.2.


Average Metric: 37 / 100  (37.0%)
Starting trial #11


Average Metric: 47 / 100  (47.0): 100%|██████████| 100/100 [00:06<00:00, 16.54it/s]


Average Metric: 47 / 100  (47.0%)


Average Metric: 39 / 100  (39.0): 100%|██████████| 100/100 [00:06<00:00, 16.38it/s]


Average Metric: 39 / 100  (39.0%)


Average Metric: 44 / 100  (44.0): 100%|██████████| 100/100 [00:06<00:00, 16.60it/s]


Average Metric: 44 / 100  (44.0%)


Average Metric: 34 / 100  (34.0): 100%|██████████| 100/100 [00:06<00:00, 16.14it/s]


Average Metric: 34 / 100  (34.0%)


Average Metric: 37 / 100  (37.0): 100%|██████████| 100/100 [00:06<00:00, 16.29it/s]
[I 2024-02-26 21:47:07,967] Trial 11 finished with value: 40.2 and parameters: {'139679081593344_predictor_instruction': 0, '139679081593344_predictor_demos': 4, '139679081593440_predictor_instruction': 2, '139679081593440_predictor_demos': 5, '139679081593584_predictor_instruction': 1, '139679081593584_predictor_demos': 4}. Best is trial 9 with value: 41.2.


Average Metric: 37 / 100  (37.0%)
Starting trial #12


Average Metric: 47 / 100  (47.0): 100%|██████████| 100/100 [00:06<00:00, 16.49it/s]


Average Metric: 47 / 100  (47.0%)


Average Metric: 39 / 100  (39.0): 100%|██████████| 100/100 [00:05<00:00, 16.96it/s]


Average Metric: 39 / 100  (39.0%)


Average Metric: 44 / 100  (44.0): 100%|██████████| 100/100 [00:06<00:00, 16.26it/s]


Average Metric: 44 / 100  (44.0%)


Average Metric: 34 / 100  (34.0): 100%|██████████| 100/100 [00:06<00:00, 16.58it/s]


Average Metric: 34 / 100  (34.0%)


Average Metric: 37 / 100  (37.0): 100%|██████████| 100/100 [00:06<00:00, 16.55it/s]
[I 2024-02-26 21:47:40,229] Trial 12 finished with value: 40.2 and parameters: {'139679081593344_predictor_instruction': 0, '139679081593344_predictor_demos': 4, '139679081593440_predictor_instruction': 2, '139679081593440_predictor_demos': 5, '139679081593584_predictor_instruction': 1, '139679081593584_predictor_demos': 4}. Best is trial 9 with value: 41.2.


Average Metric: 37 / 100  (37.0%)
Starting trial #13


Average Metric: 49 / 100  (49.0): 100%|██████████| 100/100 [00:07<00:00, 13.68it/s]


Average Metric: 49 / 100  (49.0%)


Average Metric: 40 / 100  (40.0): 100%|██████████| 100/100 [00:07<00:00, 13.84it/s]


Average Metric: 40 / 100  (40.0%)


Average Metric: 46 / 100  (46.0): 100%|██████████| 100/100 [00:07<00:00, 13.68it/s]


Average Metric: 46 / 100  (46.0%)


Average Metric: 35 / 100  (35.0): 100%|██████████| 100/100 [00:07<00:00, 13.84it/s]


Average Metric: 35 / 100  (35.0%)


Average Metric: 42 / 100  (42.0): 100%|██████████| 100/100 [00:07<00:00, 13.83it/s]
[I 2024-02-26 21:48:17,557] Trial 13 finished with value: 42.4 and parameters: {'139679081593344_predictor_instruction': 0, '139679081593344_predictor_demos': 4, '139679081593440_predictor_instruction': 7, '139679081593440_predictor_demos': 5, '139679081593584_predictor_instruction': 1, '139679081593584_predictor_demos': 4}. Best is trial 13 with value: 42.4.


Average Metric: 42 / 100  (42.0%)
Starting trial #14


Average Metric: 32 / 100  (32.0): 100%|██████████| 100/100 [00:15<00:00,  6.49it/s]
[I 2024-02-26 21:48:33,158] Trial 14 pruned. 


Average Metric: 32 / 100  (32.0%)
Trial pruned.
Starting trial #15


Average Metric: 37 / 100  (37.0): 100%|██████████| 100/100 [00:16<00:00,  6.01it/s]
[I 2024-02-26 21:48:49,962] Trial 15 pruned. 


Average Metric: 37 / 100  (37.0%)
Trial pruned.
Starting trial #16


Average Metric: 44 / 100  (44.0): 100%|██████████| 100/100 [00:11<00:00,  8.80it/s]


Average Metric: 44 / 100  (44.0%)


Average Metric: 44 / 100  (44.0): 100%|██████████| 100/100 [00:11<00:00,  8.69it/s]


Average Metric: 44 / 100  (44.0%)


Average Metric: 46 / 100  (46.0): 100%|██████████| 100/100 [00:11<00:00,  8.83it/s]


Average Metric: 46 / 100  (46.0%)


Average Metric: 41 / 100  (41.0): 100%|██████████| 100/100 [00:11<00:00,  8.38it/s]


Average Metric: 41 / 100  (41.0%)


Average Metric: 40 / 100  (40.0): 100%|██████████| 100/100 [00:11<00:00,  8.89it/s]
[I 2024-02-26 21:49:48,516] Trial 16 finished with value: 43.0 and parameters: {'139679081593344_predictor_instruction': 0, '139679081593344_predictor_demos': 4, '139679081593440_predictor_instruction': 9, '139679081593440_predictor_demos': 7, '139679081593584_predictor_instruction': 1, '139679081593584_predictor_demos': 3}. Best is trial 16 with value: 43.0.


Average Metric: 40 / 100  (40.0%)
Starting trial #17


Average Metric: 37 / 100  (37.0): 100%|██████████| 100/100 [00:15<00:00,  6.51it/s]
[I 2024-02-26 21:50:04,053] Trial 17 pruned. 


Average Metric: 37 / 100  (37.0%)
Trial pruned.
Starting trial #18


Average Metric: 38 / 100  (38.0): 100%|██████████| 100/100 [00:16<00:00,  6.03it/s]
[I 2024-02-26 21:50:20,853] Trial 18 pruned. 


Average Metric: 38 / 100  (38.0%)
Trial pruned.
Starting trial #19


Average Metric: 33 / 100  (33.0): 100%|██████████| 100/100 [00:14<00:00,  6.87it/s]
[I 2024-02-26 21:50:35,594] Trial 19 pruned. 


Average Metric: 33 / 100  (33.0%)
Trial pruned.
Starting trial #20


Average Metric: 27 / 100  (27.0): 100%|██████████| 100/100 [00:16<00:00,  6.03it/s]
[I 2024-02-26 21:50:52,428] Trial 20 pruned. 


Average Metric: 27 / 100  (27.0%)
Trial pruned.
Starting trial #21


Average Metric: 49 / 100  (49.0): 100%|██████████| 100/100 [00:06<00:00, 16.43it/s]


Average Metric: 49 / 100  (49.0%)


Average Metric: 40 / 100  (40.0): 100%|██████████| 100/100 [00:06<00:00, 16.21it/s]


Average Metric: 40 / 100  (40.0%)


Average Metric: 43 / 100  (43.0): 100%|██████████| 100/100 [00:06<00:00, 16.06it/s]


Average Metric: 43 / 100  (43.0%)


Average Metric: 36 / 100  (36.0): 100%|██████████| 100/100 [00:06<00:00, 15.99it/s]


Average Metric: 36 / 100  (36.0%)


Average Metric: 38 / 100  (38.0): 100%|██████████| 100/100 [00:06<00:00, 15.86it/s]
[I 2024-02-26 21:51:24,668] Trial 21 finished with value: 41.2 and parameters: {'139679081593344_predictor_instruction': 0, '139679081593344_predictor_demos': 4, '139679081593440_predictor_instruction': 8, '139679081593440_predictor_demos': 5, '139679081593584_predictor_instruction': 1, '139679081593584_predictor_demos': 4}. Best is trial 16 with value: 43.0.


Average Metric: 38 / 100  (38.0%)
Starting trial #22


Average Metric: 49 / 100  (49.0): 100%|██████████| 100/100 [00:07<00:00, 13.61it/s]


Average Metric: 49 / 100  (49.0%)


Average Metric: 41 / 100  (41.0): 100%|██████████| 100/100 [00:07<00:00, 13.61it/s]


Average Metric: 41 / 100  (41.0%)


Average Metric: 45 / 100  (45.0): 100%|██████████| 100/100 [00:07<00:00, 13.88it/s]


Average Metric: 45 / 100  (45.0%)


Average Metric: 35 / 100  (35.0): 100%|██████████| 100/100 [00:07<00:00, 13.73it/s]


Average Metric: 35 / 100  (35.0%)


Average Metric: 38 / 100  (38.0): 100%|██████████| 100/100 [00:07<00:00, 13.77it/s]
[I 2024-02-26 21:52:02,055] Trial 22 finished with value: 41.6 and parameters: {'139679081593344_predictor_instruction': 0, '139679081593344_predictor_demos': 4, '139679081593440_predictor_instruction': 9, '139679081593440_predictor_demos': 5, '139679081593584_predictor_instruction': 1, '139679081593584_predictor_demos': 4}. Best is trial 16 with value: 43.0.


Average Metric: 38 / 100  (38.0%)
Starting trial #23


Average Metric: 21 / 100  (21.0): 100%|██████████| 100/100 [00:11<00:00,  8.86it/s]
[I 2024-02-26 21:52:13,595] Trial 23 pruned. 


Average Metric: 21 / 100  (21.0%)
Trial pruned.
Starting trial #24


Average Metric: 29 / 100  (29.0): 100%|██████████| 100/100 [00:14<00:00,  6.77it/s]
[I 2024-02-26 21:52:28,612] Trial 24 pruned. 


Average Metric: 29 / 100  (29.0%)
Trial pruned.
Starting trial #25


Average Metric: 36 / 100  (36.0): 100%|██████████| 100/100 [00:12<00:00,  8.11it/s]
[I 2024-02-26 21:52:41,142] Trial 25 pruned. 


Average Metric: 36 / 100  (36.0%)
Trial pruned.
Starting trial #26


Average Metric: 36 / 100  (36.0): 100%|██████████| 100/100 [00:14<00:00,  7.01it/s]
[I 2024-02-26 21:52:55,623] Trial 26 pruned. 


Average Metric: 36 / 100  (36.0%)
Trial pruned.
Starting trial #27


Average Metric: 33 / 100  (33.0): 100%|██████████| 100/100 [00:15<00:00,  6.53it/s]
[I 2024-02-26 21:53:11,086] Trial 27 pruned. 


Average Metric: 33 / 100  (33.0%)
Trial pruned.
Starting trial #28


Average Metric: 40 / 100  (40.0): 100%|██████████| 100/100 [00:07<00:00, 13.87it/s]
[I 2024-02-26 21:53:18,558] Trial 28 pruned. 


Average Metric: 40 / 100  (40.0%)
Trial pruned.
Starting trial #29


Average Metric: 34 / 100  (34.0): 100%|██████████| 100/100 [00:13<00:00,  7.28it/s]
[I 2024-02-26 21:53:32,494] Trial 29 pruned. 


Average Metric: 34 / 100  (34.0%)
Trial pruned.
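
Reading the log: each completed trial evaluates five batches of 100 examples, and the reported trial value matches the mean of those five batch scores. The sketch below checks this for the first six trials using numbers copied from the log. The pruning check at the end assumes a median rule (a trial is stopped when its first-batch score falls below the median first-batch score of earlier completed trials). The log does not state which rule is used, so treat that part as an illustration only.

```python
from statistics import mean, median

# Per-batch scores copied from the log for the first six completed trials.
batch_scores = {
    0: [29, 32, 29, 19, 23],
    1: [40, 29, 33, 24, 31],
    2: [35, 33, 39, 28, 28],
    3: [31, 33, 28, 16, 30],
    4: [26, 27, 23, 23, 22],
    5: [43, 36, 42, 32, 30],
}
# Trial values as reported in the "Trial N finished with value" log lines.
reported = {0: 26.4, 1: 31.4, 2: 32.6, 3: 27.6, 4: 24.2, 5: 36.6}

for trial, scores in batch_scores.items():
    value = round(mean(scores), 1)
    print(f"trial {trial}: mean of batches = {value}, reported = {reported[trial]}")

def would_prune(first_batch_score, earlier_first_batch_scores):
    """Assumed median rule: prune if below the median of earlier first-batch scores."""
    return first_batch_score < median(earlier_first_batch_scores)

earlier = [scores[0] for scores in batch_scores.values()]
print(would_prune(4, earlier))   # trial 6 scored 4 on its first batch
print(would_prune(32, earlier))  # trial 7 scored 32 on its first batch
```

Under this assumed rule both trial 6 and trial 7 would be stopped after one batch, which agrees with the log, but other pruning rules could produce the same outcome.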


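#### 3c] Evaluating the Optimized Program

Finally, we evaluate the compiled program on the same trainset and devset used for the baseline.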

In [6]:
bayesian_train_score = evaluate(compiled_program, devset=trainset)
bayesian_eval_score = evaluate(compiled_program, devset=devset)

Average Metric: 215 / 500  (43.0): 100%|██████████| 500/500 [00:30<00:00, 16.32it/s]


Average Metric: 215 / 500  (43.0%)


Average Metric: 199 / 500  (39.8): 100%|██████████| 500/500 [01:21<00:00,  6.11it/s]


Average Metric: 199 / 500  (39.8%)
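
To put the two evaluations side by side, the short sketch below recomputes the scores from the counts printed in this notebook and reports the change in percentage points. The dictionary layout and variable names are illustrative only.

```python
# Correct answers out of 500, copied from the evaluation output in this notebook.
TOTAL = 500
correct = {
    "train": {"baseline": 80, "optimized": 215},
    "dev": {"baseline": 107, "optimized": 199},
}

gains = {}
for split, counts in correct.items():
    baseline = 100 * counts["baseline"] / TOTAL
    optimized = 100 * counts["optimized"] / TOTAL
    # Difference in percentage points between the optimized and baseline programs.
    gains[split] = round(optimized - baseline, 1)
    print(f"{split}: {baseline:.1f}% -> {optimized:.1f}% ({gains[split]:+.1f} points)")
```

The devset gain is the more meaningful of the two, since the trainset was also used to select the instructions and few-shot examples.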
