# Tutorial: ProgramOfThought

`dspy.ProgramOfThought` automatically generates and refines Python code to solve downstream tasks.

Install the latest DSPy via `pip install -U dspy` and follow along.

## 1) Using PythonInterpreter

`ProgramOfThought` integrates an adapted Python interpreter to execute code generated by LMs. 

As a brief demonstration of how the interpreter works, we'll create an instance of `dspy.PythonInterpreter` and run the same kind of execution that `ProgramOfThought` performs under the hood.

In [17]:
import dspy
interpreter = dspy.PythonInterpreter()
expr = "value = 2*5 + 4\nvalue"
answer = interpreter.execute(expr)
answer

14

## 2) Demonstrating ProgramOfThought
As an example, we'll define a signature with an input question and an output answer. Then we'll create and invoke the `ProgramOfThought` program, which uses an LM to generate code that represents the question, executes that code with the interpreter, and outputs the final result as the answer to the question.

Let's use Meta's `Llama-3-70b-Instruct`. You can easily swap this out for [other providers or local models](https://github.com/stanfordnlp/dspy/blob/main/examples/migration.ipynb).

In [3]:
llama3_70b = dspy.LM("openai/meta-llama/Meta-Llama-3-70b-Instruct", api_base="API_BASE", api_key="None")

dspy.settings.configure(lm=llama3_70b)

Let's now define our module with a brief signature that specifies the input question and output answer. We can then call `ProgramOfThought` on the signature and pass in our sample problem.

In [4]:
class BasicGenerateAnswer(dspy.Signature):
    question = dspy.InputField()
    answer = dspy.OutputField()

pot = dspy.ProgramOfThought(BasicGenerateAnswer)
problem = "2*5 + 4"
pot(question=problem).answer

'14'

Great! The module successfully produced the same correct answer. Let's see how exactly it used the LM to do so:

In [5]:
dspy.inspect_history()

[2025-01-06T21:58:40.879405]

System message:

Your input fields are:
1. `question` (str)
2. `final_generated_code` (str): python code that answers the question
3. `code_output` (str): output of previously-generated python code

Your output fields are:
1. `reasoning` (str)
2. `answer` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## final_generated_code ## ]]
{final_generated_code}

[[ ## code_output ## ]]
{code_output}

[[ ## reasoning ## ]]
{reasoning}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Given the final code `question`, `final_generated_code`, `code_output`, provide the final `answer`.


User message:

[[ ## question ## ]]
2*5 + 4

[[ ## final_generated_code ## ]]
def calculate_expression():
    # Multiply 2 and 5
    multiplication_result = 2 * 5
    
    # Add 4 to the result
    fin

We see that the generated Python code defines a function for the intermediate computations and, when executed through the `PythonInterpreter`, returns the final result, which gives the right answer.
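The generate-then-execute flow described above can be sketched in plain Python. This is only an illustration under stated assumptions, not DSPy's implementation: `run_code`, `solve_with_retries`, and `fake_generate_code` are hypothetical names, the "LM" is a stub that returns a buggy draft first and a fixed draft once it sees an error, and plain `exec` stands in for the sandboxed interpreter (it should not be used on untrusted code).

```python
import contextlib
import io


def run_code(code: str) -> str:
    """Execute code in a fresh namespace and return whatever it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # illustration only; unsafe for untrusted code
    return buffer.getvalue().strip()


def solve_with_retries(generate_code, question: str, max_iters: int = 3) -> str:
    """Ask for code, run it, and feed any error back until a run succeeds."""
    error = None
    for _ in range(max_iters):
        code = generate_code(question, error)
        try:
            return run_code(code)
        except Exception as exc:
            # Keep the error text so the next draft can react to it.
            error = f"{type(exc).__name__}: {exc}"
    raise RuntimeError(f"no working code after {max_iters} attempts: {error}")


def fake_generate_code(question: str, previous_error):
    """Hypothetical stand-in for an LM: buggy first draft, fixed second draft."""
    if previous_error is None:
        return "print(result)"  # fails with NameError
    return "result = 2 * 5 + 4\nprint(result)"


answer = solve_with_retries(fake_generate_code, "2*5 + 4")
print(answer)  # 14
```

The first draft raises a `NameError`, the error text is passed back to the generator, and the second draft runs cleanly, which is the kind of refinement a retry limit is meant to bound.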

## 3) Comparing with ChainOfThought

Now we turn to a more complex problem to demonstrate how the `ProgramOfThought` module can be helpful. 

Problem: **Compute 12! / sum of prime numbers between 1 and 30.**

This is a fairly challenging computation. Let's see how `ChainOfThought` performs first:

In [6]:
problem = "Compute 12! / sum of prime numbers between 1 and 30."

cot = dspy.ChainOfThought(BasicGenerateAnswer)
cot(question=problem).answer

'3,710,009'

In [7]:
dspy.inspect_history()

[2025-01-06T21:59:08.539739]

System message:

Your input fields are:
1. `question` (str)

Your output fields are:
1. `reasoning` (str)
2. `answer` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Given the fields `question`, produce the fields `answer`.


User message:

[[ ## question ## ]]
Compute 12! / sum of prime numbers between 1 and 30.

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## answer ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## reasoning ## ]]
To solve this problem, we need to calculate 12! (12 factorial) and the sum of prime numbers between 1 and 30. 

First, let's calculate 12!. 12! = 12 * 11 *

So `ChainOfThought` reasons through the steps fairly well, getting the correct values for both 12! and the sum of the prime numbers between 1 and 30.

But it fails at the last step, the division, incorrectly computing `479,001,600 / 129 = 3,710,009` when the correct answer is `3713190.69767` (verified with a calculator).
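The figures quoted above can be checked with a few lines of standard-library Python. This is a small verification sketch; the helper `is_prime` and the variable names are ours and are not taken from the generated code in the log.

```python
import math


def is_prime(n: int) -> bool:
    """Trial division up to the integer square root of n."""
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))


factorial_12 = math.factorial(12)                        # 479001600
prime_sum = sum(n for n in range(1, 31) if is_prime(n))  # 129
quotient = factorial_12 / prime_sum

print(factorial_12, prime_sum, quotient)
```

The quotient agrees with `3713190.69767` to five decimal places, which confirms that only the final division in the `ChainOfThought` response was wrong.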

Let's see how `ProgramOfThought` fares:

In [8]:
pot(question=problem).answer

'3713190.697674419'

In [9]:
dspy.inspect_history()

[2025-01-06T21:59:13.140776]

System message:

Your input fields are:
1. `question` (str)
2. `final_generated_code` (str): python code that answers the question
3. `code_output` (str): output of previously-generated python code

Your output fields are:
1. `reasoning` (str)
2. `answer` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## final_generated_code ## ]]
{final_generated_code}

[[ ## code_output ## ]]
{code_output}

[[ ## reasoning ## ]]
{reasoning}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Given the final code `question`, `final_generated_code`, `code_output`, provide the final `answer`.


User message:

[[ ## question ## ]]
Compute 12! / sum of prime numbers between 1 and 30.

[[ ## final_generated_code ## ]]
def is_prime(n):
    """Check if a number is prime."""
    if n < 2:
        r

Because the Python interpreter executes the generated code exactly, `ProgramOfThought` mitigates the computation errors that `ChainOfThought` can make, improving correctness particularly for numerical and logical queries.

## 4) Computation with Contextual Reasoning

Now let's try a more involved example: performing computation within a complex math word problem.

### Step 1: Define a helper function to search Wikipedia
We'll use a `dspy.ColBERTv2` server to retrieve top matches from Wikipedia and parse them inside the `ProgramOfThought` pipeline.

In [12]:
def search_wikipedia(query: str):
    results = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')(query, k=3)
    return [x['text'] for x in results]

### Step 2: Multi-Hop Search with ProgramOfThought
We'll take inspiration from the [Multi-Hop Search task](https://dspy.ai/tutorials/multihop_search/) and simply tweak the final `generate_answer` layer to use `ProgramOfThought` in place of `ChainOfThought` to ensure accurate computations given a question and retrieved context.

We pose a challenging word problem that requires retrieval to gather information, then uses those facts to perform a computation and produce a final result.

In [15]:
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")


class GenerateSearchQuery(dspy.Signature):
    """Write a simple search query that will help answer the non-numerical components of a complex question."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    query = dspy.OutputField()

from dspy.dsp.utils import deduplicate

class MultiHopSearchWithPoT(dspy.Module):
    def __init__(self, num_hops):
        super().__init__()
        self.num_hops = num_hops
        self.generate_query = dspy.ChainOfThought(GenerateSearchQuery)
        self.generate_answer = dspy.ProgramOfThought(GenerateAnswer, max_iters=3)

    def forward(self, question):
        context = []
        for _ in range(self.num_hops):
            query = self.generate_query(context=context, question=question).query
            context = deduplicate(context + search_wikipedia(query))
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

multi_hop_pot = MultiHopSearchWithPoT(num_hops=2)
question = (
    "What is the square of the total sum of the atomic number of the metal "
    "that makes up the gift from France to the United States in the late "
    "19th century and the sum of the number of digits in the first 10 prime numbers?"
)
multi_hop_pot(question=question).answer

'2025'
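The context accumulation in `forward` can be tried without an LM or a retrieval server. The sketch below assumes that `deduplicate` drops repeated passages while keeping first-seen order. `dedupe_keep_order` is our own stand-in for it, and `fake_search` and its passages are hypothetical placeholders for `search_wikipedia`.

```python
def dedupe_keep_order(items):
    """Drop repeated items while preserving first-seen order."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


def fake_search(query):
    """Hypothetical retriever: two hops that share one passage."""
    corpus = {
        "hop 1": ["passage A", "passage B"],
        "hop 2": ["passage B", "passage C"],
    }
    return corpus[query]


context = []
for query in ["hop 1", "hop 2"]:
    # Each hop appends its passages; duplicates across hops are removed.
    context = dedupe_keep_order(context + fake_search(query))

print(context)  # ['passage A', 'passage B', 'passage C']
```

Each hop widens the context with new passages only, so the final answer step sees every retrieved fact once.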

In [16]:
dspy.inspect_history()

[2025-01-06T22:00:34.427037]

System message:

Your input fields are:
1. `context` (str): may contain relevant facts
2. `question` (str)
3. `final_generated_code` (str): python code that answers the question
4. `code_output` (str): output of previously-generated python code

Your output fields are:
1. `reasoning` (str)
2. `answer` (str): often between 1 and 5 words

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## context ## ]]
{context}

[[ ## question ## ]]
{question}

[[ ## final_generated_code ## ]]
{final_generated_code}

[[ ## code_output ## ]]
{code_output}

[[ ## reasoning ## ]]
{reasoning}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Given the final code `context`, `question`, `final_generated_code`, `code_output`, provide the final `answer`.


User message:

[[ ## context ## ]]
[1] «Goddess of Democracy | The Goddess of Democ

Notice how the retrieved context includes passages about the Statue of Liberty and copper. This retrieval helps with the first part of the question: step-by-step reasoning identifies the Statue of Liberty as the gift from France to the US in the late 19th century, determines that it is made of copper, and finds the atomic number of copper (29).

The second part of the question is broken down into Python logic, summing the number of digits in the first 10 prime numbers programmatically.

By combining these two subproblems, the solution correctly aggregates the results and outputs the final answer: **2025**.
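The arithmetic behind that answer can be reproduced by hand. The sketch below is not the code the LM generated (the log above is truncated). It assumes the reading of the question that matches the reported result: add copper's atomic number to the total count of digits across the first 10 primes, then square the sum. `first_primes` and the variable names are ours.

```python
def first_primes(count: int) -> list:
    """Collect the first `count` primes by trial division against earlier primes."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


COPPER_ATOMIC_NUMBER = 29  # fact recovered from the retrieved context

primes = first_primes(10)                        # 2, 3, 5, 7, 11, ..., 29
digit_count = sum(len(str(p)) for p in primes)   # 4 one-digit + 6 two-digit = 16
result = (COPPER_ATOMIC_NUMBER + digit_count) ** 2

print(result)  # 2025
```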