In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import dspy

lm = dspy.LM("ollama_chat/gpt-oss:20b", api_base="http://localhost:11434", api_key="fake")
dspy.configure(lm=lm)

To invoke the LLM directly, call the `lm` object with a list of chat messages:

In [3]:
lm(messages=[{"role": "user", "content": "Hi! How many 'r's are there in strawberry?"}])  

["There are **3** 'r's in the word *strawberry*."]

# 1. Inline signatures

Declare signatures inline as strings, with an arrow (`->`) separating the input fields from the output fields!
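
To make the string format concrete, here is a minimal stdlib-only sketch of how an inline signature breaks down into named input and output fields. `split_fields` and `parse_inline_signature` are hypothetical helpers written for illustration, not DSPy's own parser. The sketch assumes that a field without a type annotation defaults to `str`, which matches the `(str)` annotations DSPy prints in the prompts shown later in this notebook.

```python
def split_fields(spec: str) -> list[str]:
    """Split on commas that are not nested inside square brackets."""
    parts, current, depth = [], [], 0
    for ch in spec:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def parse_inline_signature(signature: str) -> tuple[dict[str, str], dict[str, str]]:
    """Return ({input_name: type}, {output_name: type}) for 'a, b: T -> c'."""
    inputs, _, outputs = signature.partition("->")

    def to_fields(spec: str) -> dict[str, str]:
        fields = {}
        for part in split_fields(spec):
            name, _, type_name = part.partition(":")
            # Assumption: an untyped field is treated as a string.
            fields[name.strip()] = type_name.strip() or "str"
        return fields

    return to_fields(inputs), to_fields(outputs)


print(parse_inline_signature("full_document -> summary"))
print(parse_inline_signature("document: str, question: str -> answer: str"))
```

The bracket-aware split is only there so that a type such as `dict[str, int]` is not cut in half at its inner comma.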

## Chain Of Thought
GPT-oss has a 128k context window! Let's make it summarize some documents!

In [4]:
import os

# exist_ok=True makes this a no-op when the folder is already there
os.makedirs("../docs", exist_ok=True)

In [5]:
!wget https://arxiv.org/pdf/2505.20286 -O "../docs/alita_paper.pdf"

--2025-08-12 22:01:54--  https://arxiv.org/pdf/2505.20286
Resolving arxiv.org (arxiv.org)... 151.101.67.42, 151.101.195.42, 151.101.3.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.67.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1113373 (1.1M) [application/pdf]
Saving to: ‘../docs/alita_paper.pdf’


2025-08-12 22:01:54 (5.12 MB/s) - ‘../docs/alita_paper.pdf’ saved [1113373/1113373]



In [6]:
from llama_index.core import SimpleDirectoryReader

docs = SimpleDirectoryReader("../docs").load_data()

In [7]:
from IPython.display import display, Markdown
display(Markdown(str(docs[0])))

Doc ID: 3ae54dff-0494-4fdf-83df-2c754f5c6576
Text: arXiv:2505.20286v1  [cs.AI]  26 May 2025 ALITA : G ENERALIST
AGENT ENABLING SCALABLE AGENTIC REASONING WITH MINIMAL PREDEFINITION
AND MAXIMAL SELF -EVOLUTION Jiahao Qiu∗1, Xuan Qi∗2, Tongcheng
Zhang∗3, Xinzhe Juan3,4, Jiacheng Guo1, Yifu Lu1, Yimin Wang3,4, Zixin
Yao1, Qihan Ren3, Xun Jiang5, Xing Zhou5, Dongrui Liu3, Ling Yang1,
Yue Wu1, Kaixua...

In [8]:
doc_text = "\n\n".join([d.get_content() for d in docs])

In [9]:
summarize = dspy.ChainOfThought('full_document -> summary')
response = summarize(full_document = doc_text)

In [10]:
display(Markdown(response.summary))

Alita successfully generated a YouTube Video Subtitle Crawler MCP, executed it to retrieve the transcript of the specified 360 VR video, and extracted the correct number “100000000” mentioned by the narrator after the dinosaur scene. The workflow involved MCP brainstorming, web search for an open‑source tool, environment setup, code generation, MCP packaging, and final answer extraction.

In [11]:
display(Markdown(response.reasoning))

The case study demonstrates Alita’s workflow for extracting a specific piece of information from a YouTube 360 VR video. The process begins with an MCP Brainstorming step, where Alita identifies the need for a “YouTube Video Subtitle Crawler” MCP to automate subtitle extraction. The Web Agent then searches open‑source repositories and locates the `youtube-transcript-api` library on GitHub. The Manager Agent synthesizes this information, writes a Python function that uses the API to fetch the transcript, and generates environment setup instructions (conda environment creation and pip install). Once the code is executed in the prepared environment, the Manager Agent packages the function into the MCP, which is then used to scrape the subtitles from the target video. By parsing the transcript, Alita identifies the number “100000000” mentioned immediately after the dinosaurs are first shown. This answer matches the correct answer provided in the dataset.

In [12]:
dspy.inspect_history()





[2025-08-12T22:01:59.599782]

System message:

Your input fields are:
1. `full_document` (str):
Your output fields are:
1. `reasoning` (str): 
2. `summary` (str):
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## full_document ## ]]
{full_document}

[[ ## reasoning ## ]]
{reasoning}

[[ ## summary ## ]]
{summary}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Given the fields `full_document`, produce the fields `summary`.


User message:

[[ ## full_document ## ]]
arXiv:2505.20286v1  [cs.AI]  26 May 2025
ALITA : G ENERALIST AGENT ENABLING SCALABLE AGENTIC
REASONING WITH MINIMAL PREDEFINITION AND MAXIMAL
SELF -EVOLUTION
Jiahao Qiu∗1, Xuan Qi∗2, Tongcheng Zhang∗3, Xinzhe Juan3,4, Jiacheng Guo1, Yifu Lu1, Yimin Wang3,4, Zixin Yao1,
Qihan Ren3, Xun Jiang5, Xing Zhou5, Dongrui Liu3, Ling Yang1, Yue Wu1, Kaixuan Huang1, Shilong Liu1,
Hongru Wang6, Mengdi Wang1
1AI Lab, Prin
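
The `[[ ## field ## ]]` markers in the history above are what allow a reply to be split back into named fields such as `reasoning` and `summary`. The sketch below is a simplified, stdlib-only illustration of that idea. `render_fields` and `parse_fields` are hypothetical helpers, not DSPy's adapter code, and the sample values are made up.

```python
import re

# Matches a marker line such as "[[ ## summary ## ]]" and captures the field name.
MARKER = re.compile(r"^\[\[ ## (\w+) ## \]\][ \t]*$", re.MULTILINE)


def render_fields(fields: dict[str, str]) -> str:
    """Lay out each field under its own marker, as in the logged prompt."""
    return "\n\n".join(f"[[ ## {name} ## ]]\n{value}" for name, value in fields.items())


def parse_fields(text: str) -> dict[str, str]:
    """Recover {field: value} from marker-delimited text, skipping `completed`."""
    markers = list(MARKER.finditer(text))
    parsed = {}
    for current, following in zip(markers, markers[1:] + [None]):
        name = current.group(1)
        if name == "completed":
            continue
        # A field's value runs until the next marker, or to the end of the text.
        end = following.start() if following is not None else len(text)
        parsed[name] = text[current.end():end].strip()
    return parsed


reply = render_fields({
    "reasoning": "The paper introduces Alita.",
    "summary": "Alita is a generalist agent.",
})
reply += "\n\n[[ ## completed ## ]]"
print(parse_fields(reply))
```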

## DSPy predict - A zero vector DB example

Adding instructions to the Signature helps constrain the LLM's reply.

> Not recommended, because the full document takes up a large share of the LLM's context window. This code only demonstrates what a long context window makes possible and how to use DSPy declarative signatures with instructions!
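
Before stuffing a whole document into the prompt, it can help to check roughly whether it fits. The sketch below uses the 128k window mentioned earlier and a crude estimate of about four characters per token. That ratio, the 4,000-token output reserve, and both helper names are assumptions for illustration; a real tokenizer will give different counts.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the chars-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(
    text: str,
    context_window: int = 128_000,
    reserved_for_output: int = 4_000,
) -> bool:
    """True if the estimated prompt size leaves room for the model's reply."""
    return estimate_tokens(text) <= context_window - reserved_for_output


sample = "word " * 20_000  # stand-in for doc_text
print(estimate_tokens(sample), fits_in_context(sample))
```

If the check fails, retrieval over a vector index (as in the next section) is the usual fallback.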

In [13]:
zero_vector_db = dspy.Predict(
    dspy.Signature(
        'document: str, question: str -> answer: str',
        instructions='Only use the document to answer the question and nothing else.'
    )
)

question = 'How does ALITA help LLMs to achieve autonomous reasoning?'
response = zero_vector_db(question=question, document=doc_text)

In [14]:
display(Markdown(response.answer))

ALITA enables large language models (LLMs) to perform autonomous reasoning by adopting a design philosophy of **minimal predefinition and maximal self‑evolution**.  
Key mechanisms include:

1. **MCP Brainstorming** – The LLM first introspects the task, identifies missing capabilities, and proposes new *Model‑agnostic Toolchains* (MCPs) that can be built on‑the‑fly.  
2. **Web Agent Retrieval** – It searches public code repositories and APIs to find existing libraries that can implement the proposed MCP, thereby avoiding the need for the model to write code from scratch.  
3. **Dynamic Environment Construction** – The LLM generates the necessary environment‑setup commands (e.g., conda or pip installs) and integrates them with the retrieved code.  
4. **Self‑Generated MCP Packaging** – The model packages the retrieved code and environment instructions into a reusable MCP, which can be invoked as a tool for the current task.  
5. **Iterative Refinement** – If the first attempt fails, the model can regenerate the MCP or adjust its reasoning chain, effectively learning from its own failures.  

By allowing the model to **create, evolve, and reuse tools in real time**, ALITA turns the LLM into an autonomous reasoner that no longer relies on a fixed set of pre‑built tools or workflows. This self‑evolving capability scales with the underlying model’s coding and reasoning power, enabling more complex, multi‑step problem solving without human‑written tool libraries.

YES!! No vector database!

In [15]:
dspy.inspect_history()





[2025-08-12T22:02:07.095044]

System message:

Your input fields are:
1. `document` (str): 
2. `question` (str):
Your output fields are:
1. `answer` (str):
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## document ## ]]
{document}

[[ ## question ## ]]
{question}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Only use the document to answer the question and nothing else.


User message:

[[ ## document ## ]]
arXiv:2505.20286v1  [cs.AI]  26 May 2025
ALITA : G ENERALIST AGENT ENABLING SCALABLE AGENTIC
REASONING WITH MINIMAL PREDEFINITION AND MAXIMAL
SELF -EVOLUTION
Jiahao Qiu∗1, Xuan Qi∗2, Tongcheng Zhang∗3, Xinzhe Juan3,4, Jiacheng Guo1, Yifu Lu1, Yimin Wang3,4, Zixin Yao1,
Qihan Ren3, Xun Jiang5, Xing Zhou5, Dongrui Liu3, Ling Yang1, Yue Wu1, Kaixuan Huang1, Shilong Liu1,
Hongru Wang6, Mengdi Wang1
1AI Lab, Princeton University 2IIIS, Tsi

# 2. Programmatic Signatures and how they integrate with the broader LLM ecosystem
In general, DSPy only needs to handle the LLM-centric operations (all of them, or just the final one), because it focuses on LLM prompting. Every other component (tools, vector database, etc.) can come from any other framework!

> We use LlamaIndex to provide vector indexing capabilities here.

Creating a vector database to ingest our documents

In [16]:
from llama_index.core import VectorStoreIndex, Settings
from llama_index.embeddings.ollama import OllamaEmbedding

Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")

index = VectorStoreIndex(docs, embed_model=Settings.embed_model)

base_retriever = index.as_retriever(similarity_top_k=6)

In [17]:
nodes = base_retriever.retrieve(question)
len(nodes)

6
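
Conceptually, `similarity_top_k=6` asks the retriever for the six chunks whose embeddings score highest against the query embedding. The sketch below shows that ranking step with toy three-dimensional vectors, assuming cosine similarity as the scoring function. The vectors, chunk names, and helper functions are made up for illustration; real embeddings here come from `nomic-embed-text` and have many more dimensions.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], chunks: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k (name, score) pairs with the highest similarity to the query."""
    scored = ((name, cosine_similarity(query, vec)) for name, vec in chunks.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


toy_chunks = {
    "abstract": [0.9, 0.1, 0.0],
    "methods": [0.7, 0.7, 0.0],
    "references": [0.0, 0.1, 0.9],
}
for name, score in top_k([1.0, 0.0, 0.0], toy_chunks, k=2):
    print(f"{name}: {score:.3f}")
```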

In [18]:
from tqdm.notebook import tqdm

class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""
    context = dspy.InputField(desc="May contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="Often between 1-10 sentences.")

class RewriteQuestion(dspy.Signature):
    question = dspy.InputField()
    rewritten_questions: list[str] = dspy.OutputField(
        desc="Decompose this question into sub questions or rewrite the original user question if necessary to improve retrieval from a vector database. Otherwise return the original question."
    )
    
class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retriever = base_retriever
        self.rewriter = dspy.Predict(RewriteQuestion)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.consolidate_answer = dspy.Predict(
            dspy.Signature(
                'original_question: str, sub_answers: list[str] -> consolidated_answer:str',
                instructions="Consolidate the sub answers into a coherent answer within a few paragraphs that answers the original question."
            )
        )
        
    def query_rewrite(self, question: str):
        return self.rewriter(question=question)
    
    def forward(self, question: str):
        question_rewrite = self.query_rewrite(question)
        sub_answers = []
        for q in tqdm(question_rewrite.rewritten_questions):
            print(f"\n----\nProcessing question: {q}")            
            context = self.retriever.retrieve(q) #the LlamaIndex component
            sub_answer = self.generate_answer(context=context, question=q)
            print(f"\nAnswer to {q}: {sub_answer}")
            print(f"\nSub question answer reasoning: {sub_answer.reasoning}\n----\n")
            sub_answers.append(sub_answer)
        prediction = self.consolidate_answer(original_question=question, sub_answers=sub_answers)
        return prediction
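
The control flow of the `RAG` module above is: rewrite the question into sub-questions, retrieve and answer each one, then consolidate. The sketch below isolates that flow with stub stages so it can run without a model or an index. The function name and all four stubs are hypothetical; in the module above these roles are played by `rewriter`, `retriever`, `generate_answer`, and `consolidate_answer`. Every stage here passes plain strings, so the list handed to the consolidation step is a genuine `list[str]`.

```python
from typing import Callable


def answer_with_decomposition(
    question: str,
    rewrite: Callable[[str], list[str]],
    retrieve: Callable[[str], list[str]],
    answer: Callable[[list[str], str], str],
    consolidate: Callable[[str, list[str]], str],
) -> str:
    """Decompose, answer each sub-question from retrieved context, then merge."""
    sub_answers: list[str] = []
    for sub_question in rewrite(question):
        context = retrieve(sub_question)
        sub_answers.append(answer(context, sub_question))
    return consolidate(question, sub_answers)


# Stub stages: each lambda stands in for an LLM call or a vector search.
result = answer_with_decomposition(
    "What is Alita? Does it use MCP?",
    rewrite=lambda q: [part.strip() + "?" for part in q.split("?") if part.strip()],
    retrieve=lambda q: [f"chunk about: {q}"],
    answer=lambda ctx, q: f"{q} -> answered from {len(ctx)} chunk(s)",
    consolidate=lambda q, answers: " | ".join(answers),
)
print(result)
```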

In [19]:
engine = RAG()
pred = engine(
    "How Agent become autonomous with Atila? Alita use Model context protocol? It write own tools? Tools deploy where?"
)

  0%|          | 0/4 [00:00<?, ?it/s]


----
Processing question: How does an agent become autonomous with Atila?

Answer to How does an agent become autonomous with Atila?: Prediction(
    reasoning='The question asks for a concise explanation of how an agent achieves autonomy using the Alita framework. Based on the provided context, Alita is a generalist agent that relies on minimal predefined tools and workflows, instead using a manager–planner–executor architecture that allows the agent to self‑evolve and adapt to diverse tasks. The answer should highlight these key design choices that enable autonomy.',
    answer='Alita achieves autonomy by using a minimal‑predefinition, self‑evolving architecture: a manager coordinates a planner that generates step‑by‑step plans, and an executor carries them out. The agent learns and refines its own tools and workflows on the fly, requiring no extensive manual design, which lets it adapt and act independently across varied tasks.'
)

Sub question answer reasoning: The question asks f

In [20]:
display(Markdown(pred.consolidated_answer))

Alita achieves autonomy through a lightweight, self‑evolving architecture. A manager agent coordinates a planner that generates step‑by‑step plans, and an executor carries them out. Because Alita relies on minimal predefined tools and workflows, it can learn, refine, and create new tools on the fly, allowing it to adapt to a wide range of tasks without extensive manual configuration.

There is no evidence that Alita incorporates the Model Context Protocol; the available information does not mention this protocol in its design.

Alita does not write its own tools. Instead, it operates with a small set of core, pre‑defined tools (such as MCP Brainstorming, ScriptGeneratingTool, and CodeRunningTool) and expands its capabilities through self‑evolution rather than generating new tools from scratch.

The tools are deployed inside Alita’s manager agent. The manager orchestrates the reasoning and execution flow, invoking the appropriate toolkits as needed during the planning and execution stages.

In [21]:
dspy.inspect_history()





[2025-08-12T22:03:12.246244]

System message:

Your input fields are:
1. `original_question` (str): 
2. `sub_answers` (list[str]):
Your output fields are:
1. `consolidated_answer` (str):
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## original_question ## ]]
{original_question}

[[ ## sub_answers ## ]]
{sub_answers}

[[ ## consolidated_answer ## ]]
{consolidated_answer}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Consolidate the sub answers into a coherent answer within a few paragraphs that answers the original question.


User message:

[[ ## original_question ## ]]
How Agent become autonomous with Atila? Alita use Model context protocol? It write own tools? Tools deploy where?

[[ ## sub_answers ## ]]
[Prediction(
    reasoning='The question asks for a concise explanation of how an agent achieves autonomy using the Alita framework. Based on the provided contex

# 3. Creating agents with DSPy
Agents in DSPy require tools. Here, each tool is a plain Python function whose docstring tells the agent when to use it.

In [22]:
class RAGAgentSignature(dspy.Signature):
    question: str = dspy.InputField()
    history: dspy.History = dspy.InputField()
    answer: str = dspy.OutputField()

class AskForMoreInfo(dspy.Signature):
    question = dspy.InputField()
    response = dspy.OutputField()

def ask_for_clarification_tool(question: str):
    """Use this tool if the user's question is unclear. This tool prompts the user for more information"""
    # Attach the instructions to the signature class, then call the predictor
    # with the question; the output field lives on the returned Prediction.
    clarification = dspy.Predict(
        AskForMoreInfo.with_instructions(
            "The user has asked an ambiguous question. Ask the user for clarifications to the question."
        )
    )
    return clarification(question=question).response

def query_alita_knowledge_base(question: str):
    """Use this tool to query the knowledge base on Alita."""
    engine = RAG()
    pred = engine(question)
    return pred.consolidated_answer

history = dspy.History(messages=[])
agent = dspy.ReAct(RAGAgentSignature, tools=[query_alita_knowledge_base, ask_for_clarification_tool,])
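
Plain functions can serve as tools because their name, docstring, and type hints are all readable at runtime, which gives an agent enough to decide when and how to call them. The sketch below shows that introspection with the standard `inspect` module. `describe_tool` and the stub tool are hypothetical and illustrate the idea only; this is not DSPy's implementation.

```python
import inspect


def describe_tool(func) -> dict:
    """Collect the name, docstring, and parameter annotations of a function."""
    signature = inspect.signature(func)
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "args": {
            # Use the type's short name when it has one (e.g. "str").
            name: getattr(param.annotation, "__name__", str(param.annotation))
            for name, param in signature.parameters.items()
        },
    }


def query_knowledge_base(question: str):
    """Use this tool to query the knowledge base."""
    return f"(stub) results for {question!r}"


print(describe_tool(query_knowledge_base))
```

This is why a clear docstring and typed parameters matter on the tool functions defined above.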

In [23]:
response = agent(question="jkok", history=history)
history.messages.append({"question": "jkok", **response})

In [24]:
response.answer

'Could you please clarify what you mean by “jkok”? Are you asking for information about a topic, a location, or something else?'

In [25]:
clarified_question = "Sorry I accidentally sent that message. Here's the question I intended to ask: How does an agent become autonomous with Alita?"
response = agent(question=clarified_question, history=history)
display(Markdown(response.answer))

  0%|          | 0/3 [00:00<?, ?it/s]


----
Processing question: What steps are required for an agent to become autonomous using Alita?

Answer to What steps are required for an agent to become autonomous using Alita?: Prediction(
    reasoning='The question asks for the procedural steps that an agent must follow to achieve autonomy when using the Alita framework. Based on the provided documents, Alita’s design emphasizes minimal predefinition, self‑generated planning, execution, and iterative self‑evolution. Therefore, the key steps are: (1) supply a high‑level goal or task description; (2) let Alita generate a multi‑step plan (MCP) without relying on pre‑built tools; (3) execute the plan, allowing the agent to perform actions; (4) evaluate outcomes and gather feedback; (5) refine the plan or internal models through self‑evolution; and (6) repeat the cycle until the task is completed autonomously.',
    answer='1. Provide a high‑level goal or task description.  \n2. Alita generates a multi‑step plan (MCP) with minimal pre

An agent becomes autonomous with Alita by following a simple, self‑driven cycle that relies on minimal pre‑definition and maximal self‑evolution:

1. **Set a high‑level goal** – The user supplies only a brief, abstract task description.  
2. **Generate a multi‑step plan** – Alita’s core planner creates a detailed plan (MCP) using only generic primitives, without needing a library of task‑specific tools.  
3. **Execute the plan** – The agent carries out the actions, interacting with the environment or APIs as required.  
4. **Evaluate and gather feedback** – After each step, the agent assesses the outcome, noting successes, failures, and any new information.  
5. **Refine internally** – Using the feedback, the agent updates its internal models, composes new strategies, or adjusts the plan—this is the self‑evolution phase.  
6. **Iterate until completion** – The cycle repeats until the original goal is achieved, with the agent requiring only minimal human oversight.

In this framework, autonomy means the agent can independently plan, reason, and execute complex tasks without relying on extensive, manually designed tools or continuous human supervision. The combination of minimal pre‑definition and maximal self‑evolution allows Alita to adapt and compose new solutions on the fly, achieving true agentic behavior.

# 4. Prompt Optimization
TODO: redo this section using gpt-oss:20b.

In [None]:
import mlflow

mlflow.set_tracking_uri('http://127.0.0.1:5000')
mlflow.set_experiment('gsm8k')

2025/08/12 22:13:30 INFO mlflow.tracking.fluent: Experiment with name 'gsm8k' does not exist. Creating a new experiment.


<Experiment: artifact_location='mlflow-artifacts:/2', creation_time=1755008010365, experiment_id='2', last_update_time=1755008010365, lifecycle_stage='active', name='gsm8k', tags={}>

In [27]:
mlflow.dspy.autolog()

In [28]:
qwen25 = dspy.LM(
    "ollama_chat/qwen2.5:latest", 
    api_base="http://localhost:11434", 
    api_key="fake"
)

dspy.configure(lm=qwen25)

Do look at the [dataset's HuggingFace page](https://huggingface.co/datasets/DigitalLearningGmbH/MATH-lighteval) for the full list of subsets!

In [None]:
from dspy.datasets import MATH

dataset = MATH(subset='counting_and_probability')

counting_and_probability/train-00000-of-(…):   0%|          | 0.00/329k [00:00<?, ?B/s]

counting_and_probability/test-00000-of-0(…):   0%|          | 0.00/175k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/771 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/474 [00:00<?, ? examples/s]

In [33]:
print(len(dataset.train), len(dataset.dev))
example = dataset.train[100]
print("Question:", example.question)
print("Answer:", example.answer)

158 158
Question: One fair die has faces 1, 1, 2, 2, 3, 3 and another has faces 4, 4, 5, 5, 6, 6. The dice  are rolled and the numbers on the top faces are added. What is the probability that the sum will be odd?
Answer: \frac{5}{9}


Defining the module and running an example

In [34]:
module = dspy.ChainOfThought("question -> answer")
module(question=example.question)

Prediction(
    reasoning="To determine the probability that the sum of the numbers on the top faces will be odd when rolling these two dice, we need to consider the parity (odd or even nature) of the outcomes.\n\n- The first die has three 1's and three 2's. Therefore, it can show an odd number with a probability of \\( \\frac{3}{6} = \\frac{1}{2} \\), and an even number with a probability of \\( \\frac{3}{6} = \\frac{1}{2} \\).\n- The second die has three 4's and three 6's. Therefore, it can show an odd number with a probability of \\( \\frac{0}{6} = 0 \\), and an even number with a probability of \\( \\frac{6}{6} = 1 \\).\n\nFor the sum to be odd, one die must show an odd number and the other must show an even number. Since the second die always shows an even number, this means that the first die must show an odd number for their sum to be odd.\n\nThe probability of the first die showing an odd number is \\( \\frac{1}{2} \\).\n\nTherefore, the probability that the sum will be odd is 

Setting up our evaluator and running it zero-shot on our model!
> This can take quite a while. It took me close to 23 minutes!

In [35]:
THREADS = 24
kwargs = dict(num_threads=THREADS, display_progress=True, display_table=5)
evaluate = dspy.Evaluate(devset=dataset.dev, metric=dataset.metric, **kwargs)

evaluate(module)

Average Metric: 0.00 / 1 (0.0%):   1%|          | 1/158 [00:12<31:41, 12.11s/it]

  string = string.replace("\%", "")


Average Metric: 20.00 / 158 (12.7%): 100%|██████████| 158/158 [22:57<00:00,  8.72s/it]

2025/08/12 22:47:13 INFO dspy.evaluate.evaluate: Average Metric: 20 / 158 (12.7%)





| | question | example_reasoning | example_answer | pred_reasoning | pred_answer | method |
|---|---|---|---|---|---|---|
| 0 | If Michael rolls three fair dice, what is the probability that he ... | We calculate the complement, or the probability that Michael does ... | \frac{2}{27} | To find the probability that Michael will roll at least two 1's wh... | \(\frac{2}{27}\) | |
| 1 | In how many ways can 8 people sit around a round table if Pierre a... | Solution 1: We choose any seat for Pierre, and then seat everyone ... | 960 | To solve this problem, we need to consider several constraints: Pi... | 1440 | |
| 2 | Compute $\dbinom{7}{2}$. | $\dbinom{7}{2}=\dfrac{7\times 6}{2}=\boxed{21}.$ | 21 | The binomial coefficient \(\binom{n}{k}\) is calculated using the ... | 21 | ✔️ [True] |
| 3 | A team averages 7 losses for every 13 games it wins. If ties are n... | If a team averages 7 losses for 13 wins, that means the team wins ... | 65\% | To find the probability that the team will win its next game, we n... | 65% | |
| 4 | I coach a soccer team with 15 members. I want to choose a starting... | There are $\binom{15}{2}=105$ ways to select the 2 captains from a... | 75075 | To solve this problem, we need to break it down into two parts: ch... | \[ \binom{15}{11} \times \binom{11}{2} = \frac{15!}{11!4!} \times ... | |


12.66

Wow, our model performed badly: just 12.66%!
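
Part of that low score may be formatting rather than maths: in rows 0 and 3 of the results table above, the predictions `\(\frac{2}{27}\)` and `65%` appear to differ from the reference answers only in LaTeX delimiters and an escaped percent sign, yet neither is marked correct. The sketch below shows one way to compare answers more leniently. `normalize_answer` and `lenient_match` are hypothetical helpers, not `dataset.metric`, and the rules are an assumption about which differences are safe to ignore.

```python
import re


def normalize_answer(answer: str) -> str:
    """Strip LaTeX math delimiters, percent signs, and whitespace."""
    answer = re.sub(r"\\[()\[\]]", "", answer)  # drop \( \) \[ \]
    answer = answer.replace("$", "")
    answer = answer.replace("\\%", "").replace("%", "")
    return re.sub(r"\s+", "", answer)


def lenient_match(gold: str, predicted: str) -> bool:
    """Compare two answers after normalizing their formatting."""
    return normalize_answer(gold) == normalize_answer(predicted)


print(lenient_match(r"\frac{2}{27}", r"\(\frac{2}{27}\)"))
print(lenient_match(r"65\%", "65%"))
print(lenient_match("960", "1440"))
```

Dropping percent signs entirely also makes `65` and `65%` compare equal, which may or may not be acceptable for a given task.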

In [36]:
kwargs = dict(num_threads=THREADS, teacher_settings=dict(lm=lm), prompt_model=qwen25)
optimizer = dspy.MIPROv2(metric=dataset.metric, auto="medium", **kwargs)

This will take quite a while! It took me more than an hour!
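
Because a run this long is costly to repeat, it can be worth making the training subset reproducible. The next cell draws 30 examples with `random.sample`, which gives a different subset on each run. The sketch below shows a seeded alternative; `sample_trainset`, the seed value, and the toy list standing in for `dataset.train` are assumptions for illustration.

```python
import random


def sample_trainset(examples: list, k: int, seed: int = 0) -> list:
    """Draw k examples without replacement using a private, seeded generator."""
    rng = random.Random(seed)  # does not touch the global random state
    return rng.sample(examples, k)


toy_train = list(range(158))  # stand-in for dataset.train (158 examples)
first = sample_trainset(toy_train, k=30)
second = sample_trainset(toy_train, k=30)
print(first == second)
```

Using a private `random.Random` instance keeps the subset stable even if other code consumes random numbers in between.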

In [43]:
import random

selected_trainset = random.sample(dataset.train, 30)
kwargs = dict(max_bootstrapped_demos=4, max_labeled_demos=4)
optimized_module = optimizer.compile(
    module, 
    trainset=selected_trainset, 
    requires_permission_to_run=False,
    **kwargs
)

2025/08/13 12:25:36 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING MEDIUM AUTO RUN SETTINGS:
num_trials: 18
minibatch: False
num_fewshot_candidates: 12
num_instruct_candidates: 6
valset size: 24

2025/08/13 12:25:36 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/08/13 12:25:36 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/08/13 12:25:36 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=12 sets of demonstrations...


Bootstrapping set 1/12
Bootstrapping set 2/12
Bootstrapping set 3/12


 67%|██████▋   | 4/6 [01:55<00:57, 29.00s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 4/12


100%|██████████| 6/6 [02:46<00:00, 27.75s/it]


Bootstrapped 3 full traces after 5 examples for up to 1 rounds, amounting to 6 attempts.
Bootstrapping set 5/12


 83%|████████▎ | 5/6 [01:29<00:17, 17.96s/it]


Bootstrapped 4 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Bootstrapping set 6/12


 67%|██████▋   | 4/6 [03:02<01:31, 45.62s/it]


Bootstrapped 3 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 7/12


100%|██████████| 6/6 [02:45<00:00, 27.64s/it]


Bootstrapped 3 full traces after 5 examples for up to 1 rounds, amounting to 6 attempts.
Bootstrapping set 8/12


 17%|█▋        | 1/6 [00:09<00:45,  9.05s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 9/12


 50%|█████     | 3/6 [00:51<00:51, 17.04s/it]


Bootstrapped 2 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 10/12


 50%|█████     | 3/6 [02:27<02:27, 49.05s/it]


Bootstrapped 1 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 11/12


 83%|████████▎ | 5/6 [02:15<00:27, 27.02s/it]


Bootstrapped 4 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Bootstrapping set 12/12


 17%|█▋        | 1/6 [00:19<01:36, 19.27s/it]
2025/08/13 12:43:39 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/08/13 12:43:39 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Error getting source code: unhashable type: 'dict'.

Running without program aware proposer.


2025/08/13 12:43:48 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=6 instructions...

2025/08/13 12:44:11 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/08/13 12:44:11 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `question`, produce the fields `answer`.

2025/08/13 12:44:11 INFO dspy.teleprompt.mipro_optimizer_v2: 1: You are a cybersecurity analyst tasked with ensuring the integrity of your company's data. A critical piece of information is that January 1, 2007 was a Monday. Your team needs to know how many Fridays there were in 2007 to verify an important log entry. Using only this fact and without writing any code, determine the number of Fridays in 2007.

Failure to provide the correct answer could result in a significant security breach. Make your reasoning clear and concise.

2025/08/13 12:44:11 INFO dspy.teleprompt.mipro_optimizer_v2: 2: You are a math teacher preparing lesson plans for your class in 2007. Given that J

Average Metric: 4.00 / 24 (16.7%): 100%|██████████| 24/24 [00:51<00:00,  2.13s/it]  

2025/08/13 12:45:02 INFO dspy.evaluate.evaluate: Average Metric: 4 / 24 (16.7%)
2025/08/13 12:45:02 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 16.67

2025/08/13 12:45:02 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 2 / 18 =====



Average Metric: 1.00 / 3 (33.3%):  12%|█▎        | 3/24 [00:36<03:54, 11.16s/it] 



Average Metric: 6.00 / 21 (28.6%):  88%|████████▊ | 21/24 [04:55<00:22,  7.41s/it]



Average Metric: 7.00 / 24 (29.2%): 100%|██████████| 24/24 [05:16<00:00, 13.20s/it]

2025/08/13 12:50:19 INFO dspy.evaluate.evaluate: Average Metric: 7 / 24 (29.2%)
2025/08/13 12:50:19 INFO dspy.teleprompt.mipro_optimizer_v2: [92mBest full score so far![0m Score: 29.17
2025/08/13 12:50:19 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 29.17 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 6'].
2025/08/13 12:50:19 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [16.67, 29.17]
2025/08/13 12:50:19 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 29.17


2025/08/13 12:50:19 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 3 / 18 =====



Average Metric: 8.00 / 24 (33.3%): 100%|██████████| 24/24 [06:49<00:00, 17.06s/it]

2025/08/13 12:57:08 INFO dspy.evaluate.evaluate: Average Metric: 8 / 24 (33.3%)
2025/08/13 12:57:08 INFO dspy.teleprompt.mipro_optimizer_v2: [92mBest full score so far![0m Score: 33.33
2025/08/13 12:57:08 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 33.33 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 2'].
2025/08/13 12:57:08 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [16.67, 29.17, 33.33]
2025/08/13 12:57:08 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 33.33


2025/08/13 12:57:08 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 4 / 18 =====



Average Metric: 2.00 / 12 (16.7%):  50%|█████     | 12/24 [02:35<01:32,  7.74s/it]



Average Metric: 6.00 / 21 (28.6%):  88%|████████▊ | 21/24 [05:42<00:36, 12.09s/it]



Average Metric: 7.00 / 24 (29.2%): 100%|██████████| 24/24 [06:29<00:00, 16.25s/it]

2025/08/13 13:03:38 INFO dspy.evaluate.evaluate: Average Metric: 7 / 24 (29.2%)





2025/08/13 13:03:39 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 29.17 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 6'].
2025/08/13 13:03:39 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [16.67, 29.17, 33.33, 29.17]
2025/08/13 13:03:39 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 33.33


2025/08/13 13:03:39 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 5 / 18 =====


Average Metric: 7.00 / 24 (29.2%): 100%|██████████| 24/24 [05:58<00:00, 14.93s/it]

2025/08/13 13:09:37 INFO dspy.evaluate.evaluate: Average Metric: 7 / 24 (29.2%)
2025/08/13 13:09:37 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 29.17 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 4'].
2025/08/13 13:09:37 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [16.67, 29.17, 33.33, 29.17, 29.17]
2025/08/13 13:09:37 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 33.33


2025/08/13 13:09:37 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 6 / 18 =====



Average Metric: 7.00 / 24 (29.2%): 100%|██████████| 24/24 [04:56<00:00, 12.34s/it]

2025/08/13 13:14:33 INFO dspy.evaluate.evaluate: Average Metric: 7 / 24 (29.2%)
2025/08/13 13:14:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 29.17 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 5'].
2025/08/13 13:14:33 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [16.67, 29.17, 33.33, 29.17, 29.17, 29.17]
2025/08/13 13:14:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 33.33


2025/08/13 13:14:33 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 18 =====



Average Metric: 1.00 / 5 (20.0%):  21%|██        | 5/24 [01:35<04:27, 14.08s/it]



Average Metric: 4.00 / 16 (25.0%):  67%|██████▋   | 16/24 [03:36<01:07,  8.40s/it]



Average Metric: 5.00 / 21 (23.8%):  88%|████████▊ | 21/24 [06:10<00:48, 16.17s/it]



Average Metric: 7.00 / 24 (29.2%): 100%|██████████| 24/24 [06:29<00:00, 16.24s/it]

2025/08/13 13:21:03 INFO dspy.evaluate.evaluate: Average Metric: 7 / 24 (29.2%)
2025/08/13 13:21:03 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 29.17 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 6'].
2025/08/13 13:21:03 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [16.67, 29.17, 33.33, 29.17, 29.17, 29.17, 29.17]
2025/08/13 13:21:03 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 33.33


2025/08/13 13:21:03 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 8 / 18 =====



Average Metric: 12.00 / 22 (54.5%):  92%|█████████▏| 22/24 [03:46<00:22, 11.18s/it]



Average Metric: 12.00 / 23 (52.2%):  96%|█████████▌| 23/24 [05:22<00:36, 36.75s/it]



Average Metric: 12.00 / 23 (52.2%): 100%|██████████| 24/24 [07:07<00:00, 57.15s/it]



Average Metric: 12.00 / 24 (50.0%): : 27it [07:24, 16.48s/it]                      

2025/08/13 13:28:28 INFO dspy.evaluate.evaluate: Average Metric: 12 / 24 (50.0%)
2025/08/13 13:28:28 INFO dspy.teleprompt.mipro_optimizer_v2: [92mBest full score so far![0m Score: 50.0
2025/08/13 13:28:28 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 50.0 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 1'].
2025/08/13 13:28:28 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [16.67, 29.17, 33.33, 29.17, 29.17, 29.17, 29.17, 50.0]
2025/08/13 13:28:28 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 50.0


2025/08/13 13:28:28 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 9 / 18 =====



Average Metric: 6.00 / 24 (25.0%): 100%|██████████| 24/24 [06:04<00:00, 15.18s/it]

2025/08/13 13:34:33 INFO dspy.evaluate.evaluate: Average Metric: 6 / 24 (25.0%)
2025/08/13 13:34:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 25.0 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 3'].
2025/08/13 13:34:33 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [16.67, 29.17, 33.33, 29.17, 29.17, 29.17, 29.17, 50.0, 25.0]
2025/08/13 13:34:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 50.0


2025/08/13 13:34:33 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 10 / 18 =====



Average Metric: 4.00 / 19 (21.1%):  79%|███████▉  | 19/24 [03:27<00:41,  8.25s/it]



Average Metric: 5.00 / 21 (23.8%):  88%|████████▊ | 21/24 [05:13<01:22, 27.53s/it]



Average Metric: 6.00 / 24 (25.0%): 100%|██████████| 24/24 [05:32<00:00, 13.84s/it]

2025/08/13 13:40:05 INFO dspy.evaluate.evaluate: Average Metric: 6 / 24 (25.0%)
2025/08/13 13:40:05 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 25.0 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 10'].
2025/08/13 13:40:05 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [16.67, 29.17, 33.33, 29.17, 29.17, 29.17, 29.17, 50.0, 25.0, 25.0]
2025/08/13 13:40:05 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 50.0


2025/08/13 13:40:05 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 11 / 18 =====



Average Metric: 12.00 / 24 (50.0%): 100%|██████████| 24/24 [00:00<00:00, 3625.29it/s]

2025/08/13 13:40:05 INFO dspy.evaluate.evaluate: Average Metric: 12 / 24 (50.0%)
2025/08/13 13:40:05 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 50.0 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 1'].
2025/08/13 13:40:05 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [16.67, 29.17, 33.33, 29.17, 29.17, 29.17, 29.17, 50.0, 25.0, 25.0, 50.0]
2025/08/13 13:40:05 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 50.0


2025/08/13 13:40:05 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 12 / 18 =====



Average Metric: 12.00 / 24 (50.0%): 100%|██████████| 24/24 [00:00<00:00, 4541.13it/s]

2025/08/13 13:40:05 INFO dspy.evaluate.evaluate: Average Metric: 12 / 24 (50.0%)
2025/08/13 13:40:05 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 50.0 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 1'].
2025/08/13 13:40:05 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [16.67, 29.17, 33.33, 29.17, 29.17, 29.17, 29.17, 50.0, 25.0, 25.0, 50.0, 50.0]
2025/08/13 13:40:05 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 50.0


2025/08/13 13:40:05 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 18 =====



Average Metric: 1.00 / 1 (100.0%):   4%|▍         | 1/24 [00:32<12:19, 32.13s/it]



Average Metric: 9.00 / 21 (42.9%):  88%|████████▊ | 21/24 [05:24<00:39, 13.30s/it]



Average Metric: 9.00 / 24 (37.5%): 100%|██████████| 24/24 [05:54<00:00, 14.78s/it]

2025/08/13 13:45:59 INFO dspy.evaluate.evaluate: Average Metric: 9 / 24 (37.5%)
2025/08/13 13:45:59 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 37.5 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 5'].
2025/08/13 13:45:59 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [16.67, 29.17, 33.33, 29.17, 29.17, 29.17, 29.17, 50.0, 25.0, 25.0, 50.0, 50.0, 37.5]
2025/08/13 13:45:59 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 50.0


2025/08/13 13:45:59 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 14 / 18 =====



Average Metric: 7.00 / 16 (43.8%):  67%|██████▋   | 16/24 [03:12<01:05,  8.20s/it]



Average Metric: 8.00 / 21 (38.1%):  88%|████████▊ | 21/24 [05:51<00:56, 18.72s/it]



Average Metric: 8.00 / 24 (33.3%): 100%|██████████| 24/24 [06:14<00:00, 15.60s/it]

2025/08/13 13:52:14 INFO dspy.evaluate.evaluate: Average Metric: 8 / 24 (33.3%)
2025/08/13 13:52:14 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 33.33 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 1'].
2025/08/13 13:52:14 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [16.67, 29.17, 33.33, 29.17, 29.17, 29.17, 29.17, 50.0, 25.0, 25.0, 50.0, 50.0, 37.5, 33.33]
2025/08/13 13:52:14 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 50.0


2025/08/13 13:52:14 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 15 / 18 =====



Average Metric: 9.00 / 24 (37.5%): 100%|██████████| 24/24 [04:19<00:00, 10.83s/it]

2025/08/13 13:56:34 INFO dspy.evaluate.evaluate: Average Metric: 9 / 24 (37.5%)
2025/08/13 13:56:34 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 37.5 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 9'].
2025/08/13 13:56:34 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [16.67, 29.17, 33.33, 29.17, 29.17, 29.17, 29.17, 50.0, 25.0, 25.0, 50.0, 50.0, 37.5, 33.33, 37.5]
2025/08/13 13:56:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 50.0


2025/08/13 13:56:34 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 16 / 18 =====



Average Metric: 6.00 / 17 (35.3%):  71%|███████   | 17/24 [02:58<01:02,  8.99s/it]



Average Metric: 9.00 / 21 (42.9%):  88%|████████▊ | 21/24 [05:29<01:00, 20.14s/it]



Average Metric: 10.00 / 24 (41.7%): 100%|██████████| 24/24 [05:48<00:00, 14.53s/it]

2025/08/13 14:02:22 INFO dspy.evaluate.evaluate: Average Metric: 10 / 24 (41.7%)
2025/08/13 14:02:22 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 41.67 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 8'].
2025/08/13 14:02:22 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [16.67, 29.17, 33.33, 29.17, 29.17, 29.17, 29.17, 50.0, 25.0, 25.0, 50.0, 50.0, 37.5, 33.33, 37.5, 41.67]
2025/08/13 14:02:23 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 50.0


2025/08/13 14:02:23 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 17 / 18 =====



Average Metric: 4.00 / 9 (44.4%):  38%|███▊      | 9/24 [01:34<01:49,  7.31s/it]



Average Metric: 5.00 / 21 (23.8%):  88%|████████▊ | 21/24 [05:31<00:42, 14.24s/it]



Average Metric: 6.00 / 24 (25.0%): 100%|██████████| 24/24 [06:04<00:00, 15.18s/it]

2025/08/13 14:08:27 INFO dspy.evaluate.evaluate: Average Metric: 6 / 24 (25.0%)
2025/08/13 14:08:27 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 25.0 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 4'].
2025/08/13 14:08:27 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [16.67, 29.17, 33.33, 29.17, 29.17, 29.17, 29.17, 50.0, 25.0, 25.0, 50.0, 50.0, 37.5, 33.33, 37.5, 41.67, 25.0]
2025/08/13 14:08:27 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 50.0


2025/08/13 14:08:27 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 18 / 18 =====



Average Metric: 7.00 / 24 (29.2%): 100%|██████████| 24/24 [05:05<00:00, 12.71s/it]

2025/08/13 14:13:32 INFO dspy.evaluate.evaluate: Average Metric: 7 / 24 (29.2%)
2025/08/13 14:13:32 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 29.17 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 1'].
2025/08/13 14:13:32 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [16.67, 29.17, 33.33, 29.17, 29.17, 29.17, 29.17, 50.0, 25.0, 25.0, 50.0, 50.0, 37.5, 33.33, 37.5, 41.67, 25.0, 29.17]
2025/08/13 14:13:32 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 50.0


2025/08/13 14:13:32 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 19 / 18 =====



Average Metric: 8.00 / 24 (33.3%): 100%|██████████| 24/24 [05:43<00:00, 14.30s/it]

2025/08/13 14:19:15 INFO dspy.evaluate.evaluate: Average Metric: 8 / 24 (33.3%)
2025/08/13 14:19:15 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 33.33 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 1'].
2025/08/13 14:19:15 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [16.67, 29.17, 33.33, 29.17, 29.17, 29.17, 29.17, 50.0, 25.0, 25.0, 50.0, 50.0, 37.5, 33.33, 37.5, 41.67, 25.0, 29.17, 33.33]
2025/08/13 14:19:15 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 50.0


2025/08/13 14:19:15 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 50.0!
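To read the run above at a glance, here is a minimal sketch that picks the best trial out of the final "Scores so far" list. The scores are copied from the log; the mapping of list position to trial number (entry 1 is trial 1) is an assumption based on how the list grows by one entry per trial.

```python
# Scores copied from the final "Scores so far" line in the log above.
scores = [16.67, 29.17, 33.33, 29.17, 29.17, 29.17, 29.17, 50.0, 25.0,
          25.0, 50.0, 50.0, 37.5, 33.33, 37.5, 41.67, 25.0, 29.17, 33.33]

best_score = max(scores)
# list.index returns the first occurrence, i.e. the earliest trial that
# reached the best score. Assumption: entry i corresponds to trial i + 1.
best_trial = scores.index(best_score) + 1
first_score = scores[0]

print(f"best score {best_score} first reached at trial {best_trial}")
print(f"gain over the first trial: {best_score - first_score:.2f} points")
```

This agrees with the log: trial 8 (Instruction 5, Few-Shot Set 1) is the first to reach 50.0, and that is the score of the program the optimizer returns.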





Evaluation

In [None]:
kwargs = dict(num_threads=THREADS, display_progress=True)
evaluate = dspy.Evaluate(devset=dataset.dev, metric=dataset.metric, **kwargs)

# Evaluate the program as usual
result = evaluate(optimized_module)

In [46]:
result

39.24
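The score is a percentage: the logs pair each "Average Metric: correct / total" line with a "Score" that is 100 × correct / total, rounded to two decimals. A small sketch to check this against pairs taken from the logs (the helper name `percent_score` is made up for illustration and is not part of DSPy):

```python
def percent_score(correct: int, total: int) -> float:
    """Share of examples the metric accepted, as a percentage with two decimals."""
    return round(100 * correct / total, 2)

# (correct, total) pairs copied from "Average Metric" lines in the logs.
for correct, total in [(7, 24), (12, 24), (44, 126), (14, 126)]:
    print(f"{correct} / {total} -> {percent_score(correct, total)}")
```

The four pairs give 29.17, 50.0, 34.92 and 11.11, matching the "Score" values logged for those evaluations.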

To save our optimized module

In [47]:
optimized_module.save("../optimized_math_qwen25_7bn.json")

In [48]:
dspy.inspect_history()

[2025-08-13T14:49:47.136059]

System message:

Your input fields are:
1. `question` (str):
Your output fields are:
1. `reasoning` (str): 
2. `answer` (str):
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Given a set of mathematical problems that involve reasoning, counting techniques, area calculations, or probability, your task is to provide a step-by-step solution that leads to an exact numerical answer. The questions will often require you to break down the problem into simpler parts and use logical steps to arrive at the final result. Ensure your response includes the reasoning process and the final answer expressed as a common fraction, integer, or other appropriate format.


User message:

[[ ## question ## ]]
What is the ne

Running the optimizer on the full set takes too long: after 3 hours it still wasn't halfway done!
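A rough back-of-the-envelope check shows why. The log for this run projects the number of program calls as minibatch size × number of batches + validation set size × number of full evals. The sketch below redoes that arithmetic and turns it into a time estimate; the 9 seconds per call is an assumption read off the faster progress bars, and the helper name is made up for illustration.

```python
def projected_program_calls(minibatch_size: int, num_batches: int,
                            valset_size: int, num_full_evals: int) -> int:
    """Same arithmetic as the 'Program Evaluation' line in the MIPROv2 log."""
    return minibatch_size * num_batches + valset_size * num_full_evals

# Values taken from the log of the full run.
calls = projected_program_calls(minibatch_size=35, num_batches=18,
                                valset_size=126, num_full_evals=4)

SECONDS_PER_CALL = 9.0  # assumption: an optimistic per-example rate on this machine
hours = calls * SECONDS_PER_CALL / 3600

print(calls)               # 1134
print(f"{hours:.1f} h")    # 2.8 h
```

Even at that optimistic rate the evaluation calls alone take close to three hours, before bootstrapping and instruction proposal, and several progress bars in the log show much slower examples.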

In [None]:
optimized_module = optimizer.compile(module, trainset=dataset.train, **kwargs)

2025/08/13 09:24:05 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING MEDIUM AUTO RUN SETTINGS:
num_trials: 18
minibatch: True
num_fewshot_candidates: 12
num_instruct_candidates: 6
valset size: 126



Projected Language Model (LM) Calls

Based on the parameters you have set, the maximum number of LM calls is projected as follows:

- Prompt Generation: 10 data summarizer calls + 6 * 1 lm calls in program + (2) lm calls in program-aware proposer = 18 prompt model calls
- Program Evaluation: 35 examples in minibatch * 18 batches + 126 examples in val set * 4 full evals = 1134 LM Program calls

Estimated Cost Calculation:

Total Cost = (Number of calls to task model * (Avg Input Token Length per Call * Task Model Price per Input Token + Avg Output Token Length per Call * Task Model Price per Output Token)
            + (Number of program calls * (Avg Input Token Length per Call * Task Prompt Price per Input Token + Avg Output Token Length per Call * Prompt Model

2025/08/13 09:24:25 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/08/13 09:24:25 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/08/13 09:24:25 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=12 sets of demonstrations...



No input received within 20 seconds. Proceeding with execution...
Bootstrapping set 1/12
Bootstrapping set 2/12
Bootstrapping set 3/12


 22%|██▏       | 7/32 [03:20<11:54, 28.59s/it]


Bootstrapped 4 full traces after 7 examples for up to 1 rounds, amounting to 7 attempts.
Bootstrapping set 4/12


  9%|▉         | 3/32 [00:49<07:56, 16.45s/it]


Bootstrapped 2 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 5/12


  9%|▉         | 3/32 [01:02<10:07, 20.96s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 6/12


 12%|█▎        | 4/32 [01:03<07:21, 15.77s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 7/12


  9%|▉         | 3/32 [00:45<07:23, 15.29s/it]


Bootstrapped 1 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 8/12


  3%|▎         | 1/32 [00:13<06:46, 13.12s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 9/12


 12%|█▎        | 4/32 [01:15<08:48, 18.86s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 10/12


  3%|▎         | 1/32 [00:08<04:12,  8.13s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 11/12


  3%|▎         | 1/32 [00:34<17:34, 34.01s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 12/12


  9%|▉         | 3/32 [01:00<09:48, 20.29s/it]
2025/08/13 09:34:38 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/08/13 09:34:38 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Error getting source code: unhashable type: 'dict'.

Running without program aware proposer.


2025/08/13 09:35:13 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=6 instructions...

2025/08/13 09:35:38 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/08/13 09:35:38 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `question`, produce the fields `answer`.

2025/08/13 09:35:38 INFO dspy.teleprompt.mipro_optimizer_v2: 1: apples (A), bananas (B), grapes (G), strawberries (S), and pineapples (P). Some combinations do not taste good together - specifically, strawberries and pineapples should not be in the same fruit salad, and grapes and bananas should not be used together. How many different fruit salads can you make using any 3 of these fruits without violating the taste or appearance rules?

2025/08/13 09:35:38 INFO dspy.teleprompt.mipro_optimizer_v2: 2: Given a problem involving probability, combinatorics, or basic statistics presented as a real-world scenario, compute the answer step-by-step by identifying key elements of the 

Average Metric: 14.00 / 126 (11.1%): 100%|██████████| 126/126 [18:08<00:00,  8.64s/it]

2025/08/13 09:53:47 INFO dspy.evaluate.evaluate: Average Metric: 14 / 126 (11.1%)
2025/08/13 09:53:47 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 11.11

2025/08/13 09:53:47 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 23 - Minibatch ==



Average Metric: 9.00 / 35 (25.7%): 100%|██████████| 35/35 [05:36<00:00,  9.61s/it]

2025/08/13 09:59:23 INFO dspy.evaluate.evaluate: Average Metric: 9 / 35 (25.7%)
2025/08/13 09:59:23 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 25.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 6'].
2025/08/13 09:59:23 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71]
2025/08/13 09:59:23 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [11.11]
2025/08/13 09:59:23 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 11.11


2025/08/13 09:59:23 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 23 - Minibatch ==



Average Metric: 14.00 / 35 (40.0%): 100%|██████████| 35/35 [07:48<00:00, 13.37s/it]

2025/08/13 10:07:11 INFO dspy.evaluate.evaluate: Average Metric: 14 / 35 (40.0%)
2025/08/13 10:07:11 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 40.0 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 2'].
2025/08/13 10:07:11 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 40.0]
2025/08/13 10:07:11 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [11.11]
2025/08/13 10:07:11 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 11.11


2025/08/13 10:07:11 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 23 - Minibatch ==



Average Metric: 12.00 / 35 (34.3%): 100%|██████████| 35/35 [05:21<00:00,  9.20s/it]

2025/08/13 10:12:33 INFO dspy.evaluate.evaluate: Average Metric: 12 / 35 (34.3%)
2025/08/13 10:12:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 34.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 6'].
2025/08/13 10:12:33 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 40.0, 34.29]
2025/08/13 10:12:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [11.11]
2025/08/13 10:12:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 11.11


2025/08/13 10:12:33 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 23 - Minibatch ==



Average Metric: 9.00 / 35 (25.7%): 100%|██████████| 35/35 [06:33<00:00, 11.25s/it]

2025/08/13 10:19:07 INFO dspy.evaluate.evaluate: Average Metric: 9 / 35 (25.7%)
2025/08/13 10:19:07 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 25.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 4'].
2025/08/13 10:19:07 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 40.0, 34.29, 25.71]
2025/08/13 10:19:07 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [11.11]
2025/08/13 10:19:07 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 11.11


2025/08/13 10:19:07 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 23 - Minibatch ==



Average Metric: 12.00 / 35 (34.3%): 100%|██████████| 35/35 [05:28<00:00,  9.38s/it]

2025/08/13 10:24:36 INFO dspy.evaluate.evaluate: Average Metric: 12 / 35 (34.3%)
2025/08/13 10:24:36 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 34.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 5'].
2025/08/13 10:24:36 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 40.0, 34.29, 25.71, 34.29]
2025/08/13 10:24:36 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [11.11]
2025/08/13 10:24:36 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 11.11


2025/08/13 10:24:36 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 23 - Full Evaluation =====
2025/08/13 10:24:36 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 40.0) from minibatch trials...



Average Metric: 44.00 / 126 (34.9%): 100%|██████████| 126/126 [14:35<00:00,  6.95s/it]

2025/08/13 10:39:11 INFO dspy.evaluate.evaluate: Average Metric: 44 / 126 (34.9%)
2025/08/13 10:39:11 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 34.92
2025/08/13 10:39:11 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [11.11, 34.92]
2025/08/13 10:39:11 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 34.92
2025/08/13 10:39:11 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/08/13 10:39:11 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 23 - Minibatch ==



Average Metric: 12.00 / 35 (34.3%): 100%|██████████| 35/35 [06:04<00:00, 10.42s/it]

2025/08/13 10:45:16 INFO dspy.evaluate.evaluate: Average Metric: 12 / 35 (34.3%)
2025/08/13 10:45:16 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 34.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 6'].
2025/08/13 10:45:16 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 40.0, 34.29, 25.71, 34.29, 34.29]
2025/08/13 10:45:16 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [11.11, 34.92]
2025/08/13 10:45:16 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 34.92


2025/08/13 10:45:16 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 23 - Minibatch ==



Average Metric: 1.00 / 5 (20.0%):  14%|█▍        | 5/35 [01:16<06:00, 12.01s/it]



Average Metric: 9.00 / 32 (28.1%):  91%|█████████▏| 32/35 [08:47<00:59, 19.74s/it]



Average Metric: 10.00 / 35 (28.6%): 100%|██████████| 35/35 [09:04<00:00, 15.56s/it]

2025/08/13 10:54:21 INFO dspy.evaluate.evaluate: Average Metric: 10 / 35 (28.6%)
2025/08/13 10:54:21 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 28.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 1'].
2025/08/13 10:54:21 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 40.0, 34.29, 25.71, 34.29, 34.29, 28.57]
2025/08/13 10:54:21 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [11.11, 34.92]
2025/08/13 10:54:21 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 34.92


2025/08/13 10:54:21 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 23 - Minibatch ==



Average Metric: 2.00 / 2 (100.0%):   6%|▌         | 2/35 [00:28<06:50, 12.43s/it]



Average Metric: 14.00 / 35 (40.0%): 100%|██████████| 35/35 [07:20<00:00, 12.57s/it]

2025/08/13 11:01:41 INFO dspy.evaluate.evaluate: Average Metric: 14 / 35 (40.0%)
2025/08/13 11:01:41 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 40.0 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 3'].
2025/08/13 11:01:41 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 40.0, 34.29, 25.71, 34.29, 34.29, 28.57, 40.0]
2025/08/13 11:01:41 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [11.11, 34.92]
2025/08/13 11:01:41 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 34.92


2025/08/13 11:01:41 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 11 / 23 - Minibatch ==



Average Metric: 9.00 / 35 (25.7%): 100%|██████████| 35/35 [00:00<00:00, 3971.34it/s]

2025/08/13 11:01:41 INFO dspy.evaluate.evaluate: Average Metric: 9 / 35 (25.7%)
2025/08/13 11:01:41 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 25.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 2'].
2025/08/13 11:01:41 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 40.0, 34.29, 25.71, 34.29, 34.29, 28.57, 40.0, 25.71]
2025/08/13 11:01:41 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [11.11, 34.92]
2025/08/13 11:01:41 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 34.92


2025/08/13 11:01:41 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 23 - Minibatch ==



Average Metric: 11.00 / 35 (31.4%): 100%|██████████| 35/35 [04:35<00:00,  7.88s/it]

2025/08/13 11:06:17 INFO dspy.evaluate.evaluate: Average Metric: 11 / 35 (31.4%)
2025/08/13 11:06:17 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 31.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 3'].
2025/08/13 11:06:17 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 40.0, 34.29, 25.71, 34.29, 34.29, 28.57, 40.0, 25.71, 31.43]
2025/08/13 11:06:17 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [11.11, 34.92]
2025/08/13 11:06:17 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 34.92


2025/08/13 11:06:17 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 23 - Full Evaluation =====
2025/08/13 11:06:17 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 35.715) from minibatch trials...



Average Metric: 4.00 / 21 (19.0%):  17%|█▋        | 21/126 [01:14<06:57,  3.98s/it] 



Average Metric: 36.00 / 120 (30.0%):  95%|█████████▌| 120/126 [1:15:53<04:29, 44.85s/it]   

[W 2025-08-13 12:22:15,643] Trial 11 failed with parameters: {'0_predictor_instruction': 3, '0_predictor_demos': 3} because of the following error: KeyboardInterrupt().
Traceback (most recent call last):
  File "/Users/tituslim/Documents/Personal Learning Folder/Personal Projects/dspy-playground/.venv/lib/python3.12/site-packages/optuna/study/_optimize.py", line 201, in _run_trial
    value_or_values = func(trial)
                      ^^^^^^^^^^^
  File "/Users/tituslim/Documents/Personal Learning Folder/Personal Projects/dspy-playground/.venv/lib/python3.12/site-packages/dspy/teleprompt/mipro_optimizer_v2.py", line 614, in objective
    best_score, best_program, total_eval_calls = self._perform_full_evaluation(
                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tituslim/Documents/Personal Learning Folder/Personal Projects/dspy-playground/.venv/lib/python3.12/site-packages/dspy/teleprompt/mipro_optimizer_v2.py", line 798, in _perform_full_ev

KeyboardInterrupt: 
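The "next top averaging program (Avg Score: 35.715)" picked for the interrupted full evaluation can be reproduced from the minibatch lines above. This is only a sketch of the bookkeeping the log suggests (average the minibatch scores per instruction/few-shot pair, skip pairs that already had a full evaluation), not MIPROv2's actual implementation. That the trial 7 full evaluation used Instruction 4 with Few-Shot Set 2 is inferred from its 40.0 average, since the log does not name it.

```python
from collections import defaultdict

# (instruction index, few-shot set index, minibatch score), copied from the log.
minibatch_trials = [
    (1, 6, 25.71), (4, 2, 40.0), (0, 6, 34.29), (2, 4, 25.71), (3, 5, 34.29),
    (4, 6, 34.29), (5, 1, 28.57), (3, 3, 40.0), (4, 2, 25.71), (3, 3, 31.43),
]
# Assumption: this pair was the one fully evaluated at trial 7.
already_fully_evaluated = {(4, 2)}

# Group the minibatch scores by configuration.
scores_by_config = defaultdict(list)
for instruction, demo_set, score in minibatch_trials:
    scores_by_config[(instruction, demo_set)].append(score)

# Average per configuration, then drop the ones that already had a full eval.
averages = {cfg: sum(vals) / len(vals) for cfg, vals in scores_by_config.items()}
candidates = {cfg: avg for cfg, avg in averages.items()
              if cfg not in already_fully_evaluated}

best_config = max(candidates, key=candidates.get)
print(best_config, round(candidates[best_config], 3))
```

The result is Instruction 3 with Few-Shot Set 3 at an average of 35.715, the same pair that appears in the traceback as `{'0_predictor_instruction': 3, '0_predictor_demos': 3}`.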