# Tutorial: Math Reasoning

Let's walk through a quick example of setting up a `dspy.ChainOfThought` module and optimizing it for answering algebra questions.

Install the latest DSPy via `pip install -U dspy` and follow along.

<details>
<summary>Recommended: Set up MLflow Tracing to understand what's happening under the hood.</summary>

### MLflow DSPy Integration

<a href="https://mlflow.org/">MLflow</a> is an LLMOps tool that natively integrates with DSPy and offers explainability and experiment tracking. In this tutorial, you can use MLflow to visualize prompts and optimization progress as traces to better understand DSPy's behavior. You can set up MLflow by following the four steps below.

1. Install MLflow

```bash
%pip install "mlflow>=2.20"
```

2. Start MLflow UI in a separate terminal
```bash
mlflow ui --port 5000
```

3. Connect the notebook to MLflow
```python
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("DSPy")
```

4. Enable tracing
```python
mlflow.dspy.autolog()
```

Once you have completed the steps above, you can see traces for each program execution in the notebook. They provide great visibility into the model's behavior and help you better understand DSPy's concepts throughout the tutorial.

![MLflow Trace](./mlflow-tracing-math.png)

To learn more about the integration, see the [MLflow DSPy Documentation](https://mlflow.org/docs/latest/llms/dspy/index.html).

</details>

Let's tell DSPy that we will use OpenAI's `gpt-4o-mini` in our modules. To authenticate, DSPy will read your `OPENAI_API_KEY` environment variable. You can easily swap this out for [other providers or local models](https://github.com/stanfordnlp/dspy/blob/main/examples/migration.ipynb).

In [1]:
import dspy

gpt4o_mini = dspy.LM('openai/gpt-4o-mini', max_tokens=2000)
gpt4o = dspy.LM('openai/gpt-4o', max_tokens=2000)
dspy.configure(lm=gpt4o_mini)  # we'll use gpt-4o-mini as the default LM, unless otherwise specified
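
Because authentication depends on an environment variable, a missing key typically only surfaces at the first LM call. The following is a minimal sketch, using only the standard library, of checking for the key up front. The helper name `missing_keys` is hypothetical and is not part of DSPy.

```python
import os

def missing_keys(required=("OPENAI_API_KEY",), env=None):
    """Return the names of required variables that are unset or empty.

    `env` defaults to the real process environment; passing a dict makes
    the helper easy to try out without touching real credentials.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Hypothetical usage: warn before creating any LM objects.
absent = missing_keys()
if absent:
    print("Set these environment variables first:", ", ".join(absent))
```

The same check extends to other providers by passing their variable names in `required`.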

Next, let's load some data examples from the [MATH](https://arxiv.org/abs/2103.03874) benchmark. We'll use a training split for optimization and evaluate on a held-out dev set.

Note that the next step requires the following package:
```bash
%pip install git+https://github.com/hendrycks/math.git
```

In [2]:
from dspy.datasets import MATH

dataset = MATH(subset='algebra')
print(len(dataset.train), len(dataset.dev))

350 350


Let's inspect one example from the training set.

In [3]:
example = dataset.train[0]
print("Question:", example.question)
print("Answer:", example.answer)

Question: The doctor has told Cal O'Ree that during his ten weeks of working out at the gym, he can expect each week's weight loss to be $1\%$ of his weight at the end of the previous week. His weight at the beginning of the workouts is $244$ pounds. How many pounds does he expect to weigh at the end of the ten weeks? Express your answer to the nearest whole number.
Answer: 221


Now let's define our module. It's extremely simple: just a chain-of-thought step that takes a `question` and produces an `answer`.

In [4]:
module = dspy.ChainOfThought("question -> answer")
module(question=example.question)

Prediction(
    reasoning="Cal O'Ree's weight loss each week is $1\\%$ of his weight at the end of the previous week. This means that at the end of each week, he retains $99\\%$ of his weight from the previous week. \n\nIf we denote his weight at the beginning as \\( W_0 = 244 \\) pounds, then his weight at the end of week \\( n \\) can be expressed as:\n\\[\nW_n = W_{n-1} \\times 0.99\n\\]\nThis can be simplified to:\n\\[\nW_n = W_0 \\times (0.99)^n\n\\]\nAfter 10 weeks, his weight will be:\n\\[\nW_{10} = 244 \\times (0.99)^{10}\n\\]\n\nNow, we calculate \\( (0.99)^{10} \\):\n\\[\n(0.99)^{10} \\approx 0.904382\n\\]\n\nNow, we can calculate his expected weight after 10 weeks:\n\\[\nW_{10} \\approx 244 \\times 0.904382 \\approx 220.5\n\\]\n\nRounding to the nearest whole number, Cal O'Ree can expect to weigh approximately \\( 221 \\) pounds at the end of the ten weeks.",
    answer='221'
)
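
As a quick sanity check on the arithmetic in this trace, the compounding can be reproduced in plain Python. This sketch simply applies the 1% weekly loss ten times. The exact product is about 220.67 rather than the 220.5 shown in the generated reasoning, although both round to the gold answer of 221.

```python
weight = 244.0  # starting weight in pounds, from the question

for week in range(1, 11):
    weight *= 0.99  # each week he keeps 99% of the previous week's weight

print(f"{weight:.2f}")  # 220.67
print(round(weight))    # 221
```

Small numeric slips like this in the reasoning text are one reason tool-assisted modules can help on math tasks.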

Next, let's set up an evaluator for the zero-shot module above, before prompt optimization.

In [5]:
THREADS = 24
kwargs = dict(num_threads=THREADS, display_progress=True, display_table=5)
evaluate = dspy.Evaluate(devset=dataset.dev, metric=dataset.metric, **kwargs)

evaluate(module)

Average Metric: 259.00 / 350 (74.0%): 100%|██████████| 350/350 [01:30<00:00,  3.85it/s]

2024/11/28 18:41:55 INFO dspy.evaluate.evaluate: Average Metric: 259 / 350 (74.0%)





| | question | example_reasoning | example_answer | pred_reasoning | pred_answer | method |
|---|---|---|---|---|---|---|
| 0 | What is the smallest integer value of $c$ such that the function $... | The given function has a domain of all real numbers if and only if... | 1 | To determine the smallest integer value of \( c \) such that the f... | 1 | ✔️ [True] |
| 1 | What is the least value of $x$ that is a solution of $\|{-x+3}\|=7$? | In order to have $\|{-x+3}\| = 7$, we must have $-x + 3 = 7$ or $-x ... | -4 | To solve the equation \( \|{-x+3}\|=7 \), we need to consider the de... | -4 | ✔️ [True] |
| 2 | Evaluate $\left\lceil -\frac{7}{4}\right\rceil$. | $-\frac{7}{4}$ is between $-1$ and $-2$, so $\left\lceil -\frac{7}... | -1 | To evaluate \(\left\lceil -\frac{7}{4}\right\rceil\), we first nee... | -1 | ✔️ [True] |
| 3 | A triangle has vertices at coordinates $(11,1)$, $(2,3)$ and $(3,7... | We must find the distance between each pair of points by using the... | 10 | To find the length of the longest side of the triangle with vertic... | 10 | ✔️ [True] |
| 4 | Let $f(x) = x + 2$ and $g(x) = 1/f(x)$. What is $g(f(-3))$? | First, we find that $f(-3) = (-3) + 2 = -1$. Then, $$g(f(-3)) = g(... | 1 | To find \( g(f(-3)) \), we first need to evaluate \( f(-3) \). The... | 1 | ✔️ [True] |


74.0
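
Conceptually, the evaluator runs the program on every dev example, scores each prediction with the metric, and reports the percentage that pass. The sketch below illustrates that loop with a thread pool. It is not DSPy's implementation: `exact_match`, `evaluate_program`, and the toy data are hypothetical stand-ins, and the real `dataset.metric` may normalize answers differently.

```python
from concurrent.futures import ThreadPoolExecutor

def exact_match(gold: str, predicted: str) -> bool:
    """Hypothetical metric: answers match after trimming whitespace."""
    return gold.strip() == predicted.strip()

def evaluate_program(program, devset, metric, num_threads=4):
    """Score `program` on each (question, gold) pair; return percent correct."""
    def score_one(item):
        question, gold = item
        return metric(gold, program(question))

    # Examples are independent, so they can be scored concurrently.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        scores = list(pool.map(score_one, devset))
    return round(100 * sum(scores) / len(scores), 2)

# Toy stand-ins: a lookup-table "program" that gets one of four answers wrong.
toy_devset = [("1+1", "2"), ("2*3", "6"), ("10-4", "6"), ("9/3", "3")]
toy_answers = {"1+1": "2", "2*3": "6", "10-4": "5", "9/3": "3"}

toy_score = evaluate_program(toy_answers.get, toy_devset, exact_match)
print(toy_score)  # 75.0
```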

<details>
<summary>Tracking Evaluation Results in MLflow Experiment</summary>

<br/>

To track and visualize the evaluation results over time, you can record the results in an MLflow Experiment.


```python
import mlflow

# Start an MLflow Run to record the evaluation
with mlflow.start_run(run_name="math_evaluation"):
    kwargs = dict(num_threads=THREADS, display_progress=True, return_all_scores=True, return_outputs=True)
    evaluate = dspy.Evaluate(devset=dataset.dev, metric=dataset.metric, **kwargs)

    # Evaluate the program as usual
    aggregated_score, outputs, all_scores = evaluate(module)

    # Log the aggregated score
    mlflow.log_metric("correctness", aggregated_score)
    # Log the detailed evaluation results as a table
    mlflow.log_table(
        {
            "Question": [example.question for example in dataset.dev],
            "Gold Answer": [example.answer for example in dataset.dev],
            "Predicted Answer": outputs,
            "Correctness": all_scores,
        },
        artifact_file="eval_results.json",
    )
```

To learn more about the integration, see the [MLflow DSPy Documentation](https://mlflow.org/docs/latest/llms/dspy/index.html).

</details>

Lastly, let's optimize our module. Since we want strong reasoning, we'll use the larger GPT-4o as the teacher model (used to bootstrap reasoning for the small LM at optimization time), but not as the prompt model (used to craft instructions) or the task model (the one being optimized).

GPT-4o will be invoked only a small number of times. The model involved directly in optimization and in the resulting (optimized) program will be GPT-4o-mini.

We will also specify `max_bootstrapped_demos=4`, which means we want at most four bootstrapped examples in the prompt, and `max_labeled_demos=4`, which means that, in total between bootstrapped and pre-labeled examples, we want at most four.

In [None]:
kwargs = dict(num_threads=THREADS, teacher_settings=dict(lm=gpt4o), prompt_model=gpt4o_mini)
optimizer = dspy.MIPROv2(metric=dataset.metric, auto="medium", **kwargs)

kwargs = dict(requires_permission_to_run=False, max_bootstrapped_demos=4, max_labeled_demos=4)
optimized_module = optimizer.compile(module, trainset=dataset.train, **kwargs)
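
To make the teacher's role more concrete, here is a rough sketch of the bootstrapping idea: run a teacher over training questions and keep only the traces whose final answer passes the metric, up to a fixed budget. This is a simplified illustration under stated assumptions, not MIPROv2's actual implementation, which also proposes instructions and searches over candidate prompts. The function `bootstrap_demos` and the toy teacher are hypothetical.

```python
def bootstrap_demos(teacher, trainset, metric, max_demos=4):
    """Collect up to `max_demos` teacher traces whose answers pass the metric."""
    demos = []
    for question, gold in trainset:
        reasoning, answer = teacher(question)
        if metric(gold, answer):  # keep only traces that reach the gold answer
            demos.append(
                {"question": question, "reasoning": reasoning, "answer": answer}
            )
        if len(demos) >= max_demos:
            break
    return demos

# Toy stand-in (hypothetical): a "teacher" that doubles numbers but fails on 3.
def toy_teacher(question):
    n = int(question)
    answer = "7" if n == 3 else str(2 * n)
    return f"Doubling {n} gives {answer}.", answer

toy_trainset = [(str(n), str(2 * n)) for n in range(1, 8)]
demos = bootstrap_demos(toy_teacher, toy_trainset, lambda gold, pred: gold == pred)
print([d["question"] for d in demos])  # ['1', '2', '4', '5']
```

The failed trace for question `3` is discarded, which is why a stronger teacher tends to yield more usable demonstrations from the same training set.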

In [7]:
evaluate(optimized_module)

Average Metric: 310.00 / 350 (88.6%): 100%|██████████| 350/350 [01:31<00:00,  3.84it/s]

2024/11/28 18:59:19 INFO dspy.evaluate.evaluate: Average Metric: 310 / 350 (88.6%)





| | question | example_reasoning | example_answer | pred_reasoning | pred_answer | method |
|---|---|---|---|---|---|---|
| 0 | What is the smallest integer value of $c$ such that the function $... | The given function has a domain of all real numbers if and only if... | 1 | The function \( f(x) = \frac{x^2 + 1}{x^2 - x + c} \) will have a ... | 1 | ✔️ [True] |
| 1 | What is the least value of $x$ that is a solution of $\|{-x+3}\|=7$? | In order to have $\|{-x+3}\| = 7$, we must have $-x + 3 = 7$ or $-x ... | -4 | The equation \( \|{-x+3}\|=7 \) implies two possible cases: 1. \(-x ... | -4 | ✔️ [True] |
| 2 | Evaluate $\left\lceil -\frac{7}{4}\right\rceil$. | $-\frac{7}{4}$ is between $-1$ and $-2$, so $\left\lceil -\frac{7}... | -1 | To evaluate \(\left\lceil -\frac{7}{4}\right\rceil\), we first nee... | -1 | ✔️ [True] |
| 3 | A triangle has vertices at coordinates $(11,1)$, $(2,3)$ and $(3,7... | We must find the distance between each pair of points by using the... | 10 | To find the length of the sides of the triangle formed by the vert... | 10 | ✔️ [True] |
| 4 | Let $f(x) = x + 2$ and $g(x) = 1/f(x)$. What is $g(f(-3))$? | First, we find that $f(-3) = (-3) + 2 = -1$. Then, $$g(f(-3)) = g(... | 1 | To find \( g(f(-3)) \), we first need to evaluate \( f(-3) \). Usi... | 1 | ✔️ [True] |


88.57

Neat. It was pretty straightforward to improve quality from 74% to over 88% on a held-out set here.
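
To put the two evaluation runs side by side, the short calculation below uses only the counts reported above (259 and 310 correct out of 350). It shows the gain in percentage points and, as an additional framing, the share of errors eliminated.

```python
dev_size = 350
correct_before = 259  # zero-shot module
correct_after = 310   # optimized module

acc_before = 100 * correct_before / dev_size
acc_after = 100 * correct_after / dev_size
gain_points = acc_after - acc_before

errors_before = dev_size - correct_before
errors_after = dev_size - correct_after
error_reduction = 100 * (errors_before - errors_after) / errors_before

print(f"{acc_before:.2f}% -> {acc_after:.2f}% (+{gain_points:.2f} points)")
print(f"errors: {errors_before} -> {errors_after} ({error_reduction:.1f}% fewer)")
```

In other words, the optimized program answers 51 more dev questions correctly and removes a little over half of the original errors.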

That said, for reasoning tasks like this, you will often want to consider more advanced strategies, like:

- A `dspy.ReAct` module with access to a calculator function or `dspy.PythonInterpreter`
- Ensembling multiple optimized prompts with a majority vote (or an Aggregator module) on top
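
As an illustration of the second strategy, here is a minimal majority-vote sketch in plain Python. It assumes you have already collected one answer string per optimized program for a given question. The helper `majority_vote` is hypothetical and stands in for whatever aggregation you choose.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer after trimming whitespace.

    Ties go to the answer that appeared first, since Counter orders
    equal counts by first occurrence.
    """
    counts = Counter(answer.strip() for answer in answers)
    return counts.most_common(1)[0][0]

# Hypothetical answers from five optimized programs for one question.
candidate_answers = ["221", "220", "221 ", "221", "222"]
print(majority_vote(candidate_answers))  # 221
```

Plain string voting treats equivalent forms such as `0.5` and `1/2` as different answers, so normalizing answers before counting is usually worthwhile.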


Just to understand what changed, let's view the prompt after optimization. Alternatively, if you enabled MLflow tracing by following the instructions above, you can compare the prompts before and after optimization in the rich trace UI.

In [8]:
dspy.inspect_history()





[2024-11-28T18:59:19.176586]

System message:

Your input fields are:
1. `question` (str)

Your output fields are:
1. `reasoning` (str)
2. `answer` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Analyze the `question` provided, and systematically apply mathematical reasoning to derive the `answer`. Ensure to articulate each step of your thought process in the `reasoning` field, detailing how you identify relationships and formulate equations to arrive at the solution.


User message:

[[ ## question ## ]]
If $|4x+2|=10$ and $x<0$, what is the value of $x$?

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## answer ## ]]`, and then ending with the marker for `[[ ## com