# Introduction

[<img align="center" src="https://colab.research.google.com/assets/colab-badge.svg" />](https://colab.research.google.com/github/marshmellow77/automated-prompt-engineering/blob/main/automated-prompt-engineering.ipynb)


This notebook demonstrates how to use Google's Gemini model to automate prompt engineering.

Prompt engineering is a powerful way to improve the responses of large language models (LLMs). But it is also a manual, tedious, iterative process, and it quickly accumulates technical debt because each handcrafted prompt is specific to a model (and its version) as well as to the task at hand.

In this notebook we will learn how to use the DSPy library to autonomously and automatically create prompts that are optimised for a specific model and the task at hand.


# Manual Prompt Engineering

Manual prompt engineering is very tedious - let's look at an example where we carefully handcraft a prompt for our task and model.

## Setup

In [1]:
# As of 3 April 2024, VertexAI is not yet integrated into DSPy. But there already exists a PR for it which we can leverage.
!pip install -U git+https://github.com/marshmellow77/dspy.git@seedstart-random-search#egg=dspy-ai

Collecting dspy-ai
  Cloning https://github.com/marshmellow77/dspy.git (to revision seedstart-random-search) to /tmp/pip-install-4b67w7yz/dspy-ai_dc07ac251c49487c898a2cb09653e4f7
  Running command git clone --filter=blob:none --quiet https://github.com/marshmellow77/dspy.git /tmp/pip-install-4b67w7yz/dspy-ai_dc07ac251c49487c898a2cb09653e4f7
  Running command git checkout -q seedstart-random-search
  error: pathspec 'seedstart-random-search' did not match any file(s) known to git
  error: subprocess-exited-with-error

  × git checkout -q seedstart-random-search did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.

In [2]:
!pip install --upgrade google-cloud-aiplatform
!pip install Jinja2

Collecting google-cloud-aiplatform
  Downloading google_cloud_aiplatform-1.74.0-py2.py3-none-any.whl.metadata (31 kB)
Downloading google_cloud_aiplatform-1.74.0-py2.py3-none-any.whl (6.5 MB)
Installing collected packages: google-cloud-aiplatform
  Attempting uninstall: google-cloud-aiplatform
    Found existing installation: google-cloud-aiplatform 1.73.0
    Uninstalling google-cloud-aiplatform-1.73.0:
      Successfully uninstalled google-cloud-aiplatform-1.73.0
Successfully installed google-cloud-aiplatform-1.74.0




In [1]:
import os
import sys

IS_COLAB = "google.colab" in sys.modules
if not IS_COLAB:
    raise ValueError("This notebook should be run using Google Colab.")

if not os.path.exists("/opt/deeplearning/metadata/env_version") and not os.getenv(
    "DL_ANACONDA_HOME"
):
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

In [2]:
import vertexai

project_id = "my-project-269206"
vertexai.init(project=project_id)

In [3]:
from vertexai.generative_models import GenerativeModel

gemini_pro = GenerativeModel("gemini-1.0-pro")

## Zero shot attempt

Let's first try Gemini Pro on a math word problem.

In [4]:
prompt = """Given the fields `question`, produce the fields `answer`.

Question: Heather is going to sew 150 aprons that are to be used for a kiddie crew program.
She already was able to sew 13 aprons, and today, she sewed three times as many aprons.
How many aprons should she sew tomorrow if she wants to sew half of the remaining number of aprons needed?

Answer:"""

# The correct answer is 49.

In [5]:
config = {"temperature": 0.1}

In [6]:
response = gemini_pro.generate_content(contents=prompt, generation_config=config)
print(response.text)

## Answer

Heather needs to sew a total of 150 aprons for the kiddie crew program. She has already sewn 13 aprons, and today, she sewed three times as many, which is 13 x 3 = 39 aprons.

Therefore, she has sewn a total of 13 + 39 = 52 aprons so far.

The remaining number of aprons needed is 150 - 52 = 98 aprons.

Heather wants to sew half of the remaining number of aprons tomorrow, which is 98 / 2 = 49 aprons.

Therefore, Heather should sew **49 aprons** tomorrow to complete half of the remaining number of aprons needed. 



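In the run shown here Gemini Pro arrived at the correct answer of 49, but zero-shot prompting is not reliable on problems like this, as the baseline evaluation later in this notebook shows. Let's use best practices, including chain-of-thought and few-shot prompting, to make Gemini's performance more dependable!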
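As a sanity check on the reference answer of 49, the arithmetic in the question can be reproduced in a few lines of plain Python. The variable names below are illustrative only and are not part of the notebook.

```python
# Reproduce the apron problem step by step (variable names are illustrative).
total_needed = 150                     # aprons Heather has to sew in total
sewn_before = 13                       # aprons already sewn
sewn_today = 3 * sewn_before           # "three times as many" today -> 39
remaining = total_needed - (sewn_before + sewn_today)  # 150 - 52 = 98
sew_tomorrow = remaining // 2          # half of the remaining aprons

print(sew_tomorrow)
```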

## Few shot prompting with Chain of Thought

In [7]:
prompt = """Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: <Question>
Rationale: Let's think step by step ...
Answer: <Answer>

---

Question: A gumball machine has red, green, and blue gumballs. The machine has half as many blue gumballs as red gumballs.
For each blue gumball, the machine has 4 times as many green gumballs. If the machine has 16 red gumballs how many gumballs are in the machine?
Rationale: Let's think step by step.
First, we can find the number of blue gumballs in the machine.
Since the machine has half as many blue gumballs as red gumballs, and there are 16 red gumballs, there must be 16 / 2 = 8 blue gumballs.
Next, we can find the number of green gumballs in the machine.
Since the machine has 4 times as many green gumballs as blue gumballs, there must be 8 x 4 = 32 green gumballs.
Finally, we can add up the number of red, blue, and green gumballs to find the total number of gumballs in the machine: 16 + 8 + 32 = 56.
Answer: 56

---

Question: Rachel makes $12.00 as a waitress in a coffee shop. In one hour, she serves 20 different people and they all leave her a $1.25 tip. How much money did she make in that hour?
Rationale: Let's think step by step.
First, we need to find out how much money Rachel made from tips. She served 20 people and each person left her a $1.25 tip, so she made 20 * $1.25 = $25.00 in tips.
Next, we need to add her hourly wage to the money she made from tips to find out how much money she made in total. She made $12.00 per hour, so in one hour she made $12.00 + $25.00 = $37.00.
Answer: 37

---

Question: Heather is going to sew 150 aprons that are to be used for a kiddie crew program. She already was able to sew 13 aprons, and today, she sewed three times as many aprons. How many aprons should she sew tomorrow if she wants to sew half of the remaining number of aprons needed?
Rationale:"""
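The handwritten prompt above follows a fixed template, so its assembly can be sketched as a small helper. This is only an illustration of the template, assuming demonstrations are stored as dictionaries with `question`, `rationale` and `answer` keys; the function name `build_few_shot_prompt` is hypothetical and not part of any library.

```python
def build_few_shot_prompt(instruction, demos, question):
    """Assemble a few-shot chain-of-thought prompt in the layout used above.

    `demos` is a list of dicts with 'question', 'rationale' and 'answer' keys
    (an assumed structure for this sketch).
    """
    format_block = (
        "Follow the following format.\n\n"
        "Question: <Question>\n"
        "Rationale: Let's think step by step ...\n"
        "Answer: <Answer>"
    )
    blocks = [instruction, format_block]
    for demo in demos:
        blocks.append(
            f"Question: {demo['question']}\n"
            f"Rationale: Let's think step by step.\n{demo['rationale']}\n"
            f"Answer: {demo['answer']}"
        )
    # The final block is left open so the model continues with its own rationale.
    blocks.append(f"Question: {question}\nRationale:")
    return "\n\n---\n\n".join(blocks)


demo = {
    "question": "If the machine has 16 red gumballs how many gumballs are in the machine?",
    "rationale": "There are 16 / 2 = 8 blue and 8 x 4 = 32 green gumballs, so 16 + 8 + 32 = 56.",
    "answer": "56",
}
example_prompt = build_few_shot_prompt(
    "Given the fields `question`, produce the fields `answer`.",
    [demo],
    "How many aprons should Heather sew tomorrow?",
)
print(example_prompt)
```

The tedious part remains writing the rationales themselves, which is what the rest of this notebook sets out to automate.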

In [8]:
response = gemini_pro.generate_content(contents=prompt, generation_config=config)
print(response.text)

Let's think step by step.

Heather already sewed 13 aprons, and today she sewed three times as many, so she sewed 3 * 13 = 39 aprons today.

In total, she has already sewn 13 + 39 = 52 aprons.

She needs to sew 150 aprons in total, so she still needs to sew 150 - 52 = 98 aprons.

If she wants to sew half of the remaining number of aprons needed tomorrow, she needs to sew 98 / 2 = 49 aprons tomorrow.

Answer: 49


Nice, this worked!

Now we have a good prompt for our model and the task at hand (math word problems). But there are a few issues:
* Our prompt works well on our model, but what if we want to use another model or another version (e.g. Gemini Ultra or Gemini 1.5)? Will it still work for those models?
* We had to develop a few examples, and coming up with the rationale for each one was especially tedious.

The question is, could we automate this process so that next time we need to repeat this exercise we can just automatically create few shot examples that are optimised for our model and the task at hand?

# Automated prompt engineering with DSPy

DSPy is a library that allows us to automate this process. Let's see how it works.

## Setup

In [19]:
pip install dspy

Collecting dspy
  Downloading dspy-2.5.43-py3-none-any.whl.metadata (7.3 kB)
Collecting asyncer==0.0.8 (from dspy)
  Downloading asyncer-0.0.8-py3-none-any.whl.metadata (6.7 kB)
Collecting backoff (from dspy)
  Downloading backoff-2.2.1-py3-none-any.whl.metadata (14 kB)
Collecting datasets (from dspy)
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting diskcache (from dspy)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Collecting json-repair (from dspy)
  Downloading json_repair-0.31.0-py3-none-any.whl.metadata (11 kB)
Collecting litellm==1.53.7 (from litellm[proxy]==1.53.7->dspy)
  Downloading litellm-1.53.7-py3-none-any.whl.metadata (33 kB)
Collecting magicattr~=0.1.6 (from dspy)
  Downloading magicattr-0.1.6-py2.py3-none-any.whl.metadata (3.2 kB)
Collecting optuna (from dspy)
  Downloading optuna-4.1.0-py3-none-any.whl.metadata (16 kB)
Collecting ujson (from dspy)
  Downloading ujson-5.10.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_6

In [9]:
import dspy

In [10]:
dspy_gemini_pro = dspy.GoogleVertexAI(
    "gemini-1.0-pro",
    temperature=0,
)

dspy.settings.configure(lm=dspy_gemini_pro)

## Dataset

We will use the [GSM8K dataset](https://paperswithcode.com/dataset/gsm8k), which consists of linguistically diverse grade school math word problems.

In [11]:
from dspy.datasets.gsm8k import GSM8K, gsm8k_metric

gms8k = GSM8K()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/7.94k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

100%|██████████| 7473/7473 [00:00<00:00, 18101.90it/s]
100%|██████████| 1319/1319 [00:00<00:00, 25629.53it/s]


In [12]:
train, val, test = gms8k.train[:60], gms8k.dev[:20], gms8k.test[:20]

In [13]:
train[0]

Example({'question': "The result from the 40-item Statistics exam Marion and Ella took already came out. Ella got 4 incorrect answers while Marion got 6 more than half the score of Ella. What is Marion's score?", 'gold_reasoning': "Ella's score is 40 items - 4 items = <<40-4=36>>36 items. Half of Ella's score is 36 items / 2 = <<36/2=18>>18 items. So, Marion's score is 18 items + 6 items = <<18+6=24>>24 items.", 'answer': '24'}) (input_keys={'question'})

In [14]:
train[0].gold_reasoning

"Ella's score is 40 items - 4 items = <<40-4=36>>36 items. Half of Ella's score is 36 items / 2 = <<36/2=18>>18 items. So, Marion's score is 18 items + 6 items = <<18+6=24>>24 items."

We can see that the dataset has a field `gold_reasoning`, which already provides the reasoning. Since generating this reasoning is what we want to automate, let's blank out this field in the training and validation datasets.

In [15]:
# Iterate through datasets and modify the dicts
for dataset in [train, val]:
    for example in dataset:
        example["gold_reasoning"] = ""

In [16]:
train[0].gold_reasoning

''

## Defining the signature

Signatures allow you to tell the LM what it needs to do, rather than specifying how to ask the LM to do it.

In [17]:
class GSM8KSignature(dspy.Signature):
    """Answer math problems with numbers or short phrases."""

    question = dspy.InputField()
    answer = dspy.OutputField(desc="Usually a number or short phrase.")

Now we can use this signature to run a test with Gemini.

In [18]:
generate_answer = dspy.Predict(GSM8KSignature)
pred = generate_answer(question=test[0].question)

print(f"Question: {test[0].question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Actual Answer: {test[0].answer}")

 		You are using the client GoogleVertexAI, which will be removed in DSPy 2.6.
 		Changing the client is straightforward and will let you use new features (Adapters) that improve the consistency of LM outputs, especially when using chat LMs. 

 		Learn more about the changes and how to migrate at
 		https://github.com/stanfordnlp/dspy/blob/main/examples/migration.ipynb


Question: Amber, Micah, and Ahito ran 52 miles in total. Amber ran 8 miles. Micah ran 3.5 times what Amber ran. How many miles did Ahito run?
Predicted Answer: Question: Amber, Micah, and Ahito ran 52 miles in total. Amber ran 8 miles. Micah ran 3.5 times what Amber ran. How many miles did Ahito run?
Answer: 28 miles
Actual Answer: 16


In [19]:
dspy_gemini_pro.inspect_history(n=1)




Answer math problems with numbers or short phrases.

---

Follow the following format.

Question: ${question}
Answer: Usually a number or short phrase.

---

Question: Amber, Micah, and Ahito ran 52 miles in total. Amber ran 8 miles. Micah ran 3.5 times what Amber ran. How many miles did Ahito run?
Answer:Question: Amber, Micah, and Ahito ran 52 miles in total. Amber ran 8 miles. Micah ran 3.5 times what Amber ran. How many miles did Ahito run?
Answer: 28 miles





'\n\n\nAnswer math problems with numbers or short phrases.\n\n---\n\nFollow the following format.\n\nQuestion: ${question}\nAnswer: Usually a number or short phrase.\n\n---\n\nQuestion: Amber, Micah, and Ahito ran 52 miles in total. Amber ran 8 miles. Micah ran 3.5 times what Amber ran. How many miles did Ahito run?\nAnswer:\x1b[32mQuestion: Amber, Micah, and Ahito ran 52 miles in total. Amber ran 8 miles. Micah ran 3.5 times what Amber ran. How many miles did Ahito run?\nAnswer: 28 miles\x1b[0m\n\n\n'

Gemini didn't get this one right: the correct answer is 16, but the model echoed the question and answered 28 miles. Let's evaluate Gemini on the test dataset to establish a baseline.
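The prompt printed by `inspect_history` shows how the signature's docstring and field descriptions ended up in the text sent to the model. The following is a simplified illustration of that mapping, written to mimic the layout of the printed prompt; it is not DSPy's actual rendering code, and `render_prompt` is a hypothetical name.

```python
def render_prompt(instructions, output_desc, question):
    """Mimic the zero-shot prompt layout seen in the inspected history.

    A simplified sketch for a signature with one input field (question)
    and one output field (answer); not how DSPy implements it internally.
    """
    parts = [
        instructions,                                    # the signature docstring
        "---",
        "Follow the following format.",
        "Question: ${question}\nAnswer: " + output_desc,  # field layout and description
        "---",
        f"Question: {question}\nAnswer:",                # the actual input, left open
    ]
    return "\n\n".join(parts)


rendered = render_prompt(
    "Answer math problems with numbers or short phrases.",
    "Usually a number or short phrase.",
    "Amber ran 8 miles. Micah ran 3.5 times what Amber ran. How many miles did Micah run?",
)
print(rendered)
```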

## Model evaluation with zero shot

To run the evaluation programmatically, we define a DSPy module. Modules abstract a prompting technique (like chain of thought or ReAct). Crucially, they are generalized to handle any DSPy signature.

In [20]:
class GSM8KModule(dspy.Module):
    def __init__(self):
        super().__init__()
        # here we use the dspy.Predict module which uses zero shot prompting to generate answers
        self.prog = dspy.Predict(GSM8KSignature)

    def forward(self, question):
        return self.prog(question=question)

In [21]:
gsm8k_zero_shot = GSM8KModule()

In [22]:
from dspy.evaluate import Evaluate

NUM_THREADS = 4 # number of threads to use for parallel processing
evaluate = Evaluate(
    devset=test, # the test set
    metric=gsm8k_metric, # the metric to use -> this will convert responses to integers to compare with the gold answers
    num_threads=NUM_THREADS,
    display_progress=True,
    display_table=20, # how many rows to display
)
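To make the idea behind the metric concrete, here is a rough sketch of a numeric-match metric of the kind described in the comment above: pull the last number out of the model's response and compare it with the gold answer. This is an assumption-laden illustration, not the implementation of `gsm8k_metric`; the function names are hypothetical, and negative numbers and unusual formats are not handled.

```python
import re


def extract_last_number(text):
    """Return the last number in `text` as a float, or None if there is none."""
    # Drop thousands separators so "1,000" is read as one number.
    numbers = re.findall(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(numbers[-1]) if numbers else None


def numeric_match(gold_answer, predicted_text):
    """True when the last number in the prediction equals the gold answer."""
    gold = extract_last_number(gold_answer)
    pred = extract_last_number(predicted_text)
    return gold is not None and pred is not None and gold == pred


print(numeric_match("25", "Answer: 25 minutes"))
print(numeric_match("16", "Answer: 28 miles"))
```

A metric like this explains why a response such as "Answer: 25 minutes" can count as correct while a response that merely repeats the question does not.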

In [23]:
evaluate(gsm8k_zero_shot)

Average Metric: 3.00 / 20 (15.0%): 100%|██████████| 20/20 [00:05<00:00,  3.82it/s]

2024/12/14 06:31:24 INFO dspy.evaluate.evaluate: Average Metric: 3 / 20 (15.0%)





Unnamed: 0,question,gold_reasoning,example_answer,pred_answer,gsm8k_metric
0,"Amber, Micah, and Ahito ran 52 miles in total. Amber ran 8 miles. ...",Amber ran <<8=8>>8 miles. Micah ran 3.5 * 8 miles = <<3.5*8=28>>28...,16,"Question: Amber, Micah, and Ahito ran 52 miles in total. Amber ran...",
1,Miguel uses 2 pads of paper a week for his drawing. If there are 3...,Miguel uses 30 x 2 = <<30*2=60>>60 sheets of paper every week. The...,240,Question: Miguel uses 2 pads of paper a week for his drawing. If t...,
2,"At a certain grade level, three-fourths of students have a desktop...",Twenty students represent 1 - 3/4 = 1/4 of the students at that le...,80,"Question: At a certain grade level, three-fourths of students have...",
3,Comet Halley orbits the sun every 75 years. Bill's dad saw the Com...,Bill saw the Comet for the second time when he was 30 years * 3= <...,15,Answer: 60 years old,
4,Tom plants 10 trees a year. Every year he also chops down 2 trees ...,He gets 10-2=<<10-2=8>>8 new trees a year After 10 years he has 8*...,91,Answer: 100 trees. Here's the breakdown: * Starts with 50 trees. *...,
5,John picks 4 bananas on Wednesday. Then he picks 6 bananas on Thur...,"Combining Wednesday and Thursday, John has 4 bananas + 6 bananas =...",22,Answer: 22,✔️ [True]
6,Peyton scheduled after-work activities of a one hour yoga class on...,Peyton’s cooking class will last 3 * 1 = <<3*1=3>>3 hours. The mus...,8,Question: Peyton scheduled after-work activities of a one hour yog...,
7,Ben has 4 tubes of blue paint and 3 tubes of yellow paint. Jasper ...,Jasper has 4/2= <<4/2=2>>2 tubes of blue paint Jasper has 3*3=<<3*...,11,Answer: 10,
8,Elaina is holding the final concert in her tour. To celebrate her ...,"The concert, minus the encore, lasted for 65-minute concert – 15-m...",25,Answer: 25 minutes,✔️ [True]
9,Hannah slips on a banana peel and breaks her arm. The doctor charg...,First find the length of the visit in hours: 30 minutes / 60 minut...,482,$200 + $300/hour * 0.5 hours + $4/pill * 30 pills + $6/hour * 2 ho...,


15.0

# Bootstrapping few shot examples

Now we will leverage Gemini Ultra to bootstrap few shot examples which will (hopefully) improve Gemini Pro's performance on the test dataset. With Gemini Ultra we will create a few reasoning examples which we can include in the prompt that we will eventually send to Gemini Pro. Ultra will produce a few candidates and test them on a validation dataset using the `gsm8k_metric`, i.e. the metric we want to optimise for. Once the best candidates have been identified these examples will then be used to create a few shot prompt.
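The procedure described above can be sketched in plain Python to show the moving parts. This is a conceptual toy, not DSPy's implementation: the teacher and student are stand-in functions rather than LLM calls, and all names (`bootstrap_demos`, `random_search`, `toy_teacher`, `toy_student`) are hypothetical.

```python
import random


def exact_match(gold, predicted):
    """Toy metric: answers match after stripping whitespace."""
    return gold.strip() == predicted.strip()


def bootstrap_demos(teacher, trainset, metric, max_demos):
    """Ask the teacher for rationale + answer; keep only traces the metric accepts."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        rationale, answer = teacher(example["question"])
        if metric(example["answer"], answer):
            demos.append(
                {"question": example["question"], "rationale": rationale, "answer": answer}
            )
    return demos


def random_search(student, teacher, trainset, valset, metric,
                  max_demos, num_candidates, seed=0):
    """Build several candidate demo sets and keep the one scoring best on valset."""
    rng = random.Random(seed)
    best_score, best_demos = -1.0, []
    for _ in range(num_candidates):
        shuffled = list(trainset)
        rng.shuffle(shuffled)  # a different training order yields a different demo set
        demos = bootstrap_demos(teacher, shuffled, metric, max_demos)
        correct = sum(metric(ex["answer"], student(ex["question"], demos)) for ex in valset)
        score = 100.0 * correct / len(valset)
        if score > best_score:
            best_score, best_demos = score, demos
    return best_demos, best_score


# Toy stand-ins for the teacher and student models (questions look like "5+6").
def toy_teacher(question):
    a, b = (int(x) for x in question.split("+"))
    total = a + b + (1 if a == 7 else 0)  # deliberately wrong when a == 7
    return f"{a} + {b} = {total}", str(total)


def toy_student(question, demos):
    a, b = (int(x) for x in question.split("+"))
    return str(a + b) if demos else "0"  # only "succeeds" when given demonstrations


trainset = [{"question": f"{a}+{a + 1}", "answer": str(2 * a + 1)} for a in range(5, 10)]
valset = [{"question": "1+2", "answer": "3"}, {"question": "4+4", "answer": "8"}]
best_demos, best_score = random_search(
    toy_student, toy_teacher, trainset, valset, exact_match, max_demos=3, num_candidates=2
)
print(best_score, [d["question"] for d in best_demos])
```

The key design point is the filter inside `bootstrap_demos`: a teacher trace only becomes a demonstration if its final answer passes the metric, so no human has to write or check the rationales.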

First we define a Chain of Thought module:

In [24]:
class ZeroShotCoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought(
            GSM8KSignature,
        )

    def forward(self, question):
        return self.prog(question=question)

In [25]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

Now we can start the bootstrapping:

In [27]:
from datetime import datetime

RUN_FROM_SCRATCH = True
bootstrapped_demos = 8 # max number of bootstrapped (teacher-generated) demonstrations per prompt
labeled_demos = 3 # max number of labeled examples taken directly from the training dataset
candidate_programs = 2 # how many candidate programs the random search creates and evaluates
teacher_model_id = "gemini-1.0-ultra"

if RUN_FROM_SCRATCH:
    dspy_gemini_ultra = dspy.GoogleVertexAI(
        teacher_model_id,
        temperature=0,
    )
    dspy.settings.configure(lm=dspy_gemini_ultra, timeout=0)
    config = dict(
        max_bootstrapped_demos=bootstrapped_demos,
        max_labeled_demos=labeled_demos,
        num_candidate_programs=candidate_programs,
        num_threads=4,
        stop_at_score=100.0,
    )
    bootstrap_optimizer = BootstrapFewShotWithRandomSearch(
        metric=gsm8k_metric, **config
    )
    cot_fewshot = bootstrap_optimizer.compile(ZeroShotCoT(), trainset=train, valset=val)

    # save the bootstrap demonstrations for future use
    timestamp_str = datetime.now().strftime("%Y%m%d-%H%M%S")
    filename = f"{timestamp_str}_{teacher_model_id}_{bootstrapped_demos}_{labeled_demos}_{candidate_programs}.json"
    cot_fewshot.save(filename)
else:
    cot_fewshot = ZeroShotCoT()
    cot_fewshot.load("20240403-173150_gemini-1.0-ultra_8_3_2.json")

Going to sample between 1 and 8 traces per predictor.
Will attempt to bootstrap 2 candidate sets.
  0%|          | 0/20 [00:00<?, ?it/s]

ERROR:backoff:Giving up request(...) after 1 tries (google.api_core.exceptions.BadRequest: 400 POST https://us-central1-aiplatform.googleapis.com/v1beta1/projects/my-project-269206/locations/us-central1/publishers/google/models/gemini-1.0-ultra:generateContent?%24alt=json%3Benum-encoding%3Dint: Project `1087042431891` is not allowed to use Publisher Model `projects/my-project-269206/locations/us-central1/publishers/google/models/gemini-1.0-ultra`)

BadRequest: 400 POST https://us-central1-aiplatform.googleapis.com/v1beta1/projects/my-project-269206/locations/us-central1/publishers/google/models/gemini-1.0-ultra:generateContent?%24alt=json%3Benum-encoding%3Dint: Project `1087042431891` is not allowed to use Publisher Model `projects/my-project-269206/locations/us-central1/publishers/google/models/gemini-1.0-ultra`

After this step we have our examples ready, and we can test Gemini Pro on the same test dataset as above.

In [None]:
dspy.settings.configure(lm=dspy_gemini_pro, timeout=0)

In [None]:
evaluate(cot_fewshot)

Average Metric: 18 / 20  (90.0): 100%|██████████| 20/20 [00:07<00:00,  2.65it/s]

Average Metric: 18 / 20  (90.0%)





Unnamed: 0,question,gold_reasoning,example_answer,rationale,pred_answer,gsm8k_metric
0,"Amber, Micah, and Ahito ran 52 miles in total. Amber ran 8 miles. Micah ran 3.5 times what Amber ran. How many miles did Ahito...",Amber ran <<8=8>>8 miles. Micah ran 3.5 * 8 miles = <<3.5*8=28>>28 miles Ahito ran the rest = 52 - 8 - 28 = <<52-8-28=16>>16...,16,"find out how many miles Ahito ran. First, let's find out how many miles Micah ran: 3.5 x 8 = 28 miles. Now, let's add...",16,✔️ [True]
1,"Miguel uses 2 pads of paper a week for his drawing. If there are 30 sheets of paper on a pad of paper, how many...","Miguel uses 30 x 2 = <<30*2=60>>60 sheets of paper every week. Therefore, he uses 60 x 4 = <<60*4=240>>240 sheets of paper every month.",240,"find the number of sheets of paper Miguel uses every month. First, we need to find the total number of pads of paper he uses...",240,✔️ [True]
2,"At a certain grade level, three-fourths of students have a desktop computer at home. If 20 students do not have desktop computers, how many students...","Twenty students represent 1 - 3/4 = 1/4 of the students at that level. So, there are 4 x 20 students = <<4*20=80>>80 students at...",80,"find the total number of students. First, we need to find the fraction of students who have desktop computers: 1 - 3/4 = 1/4. Now,...",80,✔️ [True]
3,Comet Halley orbits the sun every 75 years. Bill's dad saw the Comet when he was 30 years old. Bill saw the comet a second...,Bill saw the Comet for the second time when he was 30 years * 3= <<30*3=90>>90 years old. Comet Halley can be seen every 75...,15,"find Bill's age when he saw the Comet. First, we need to find the number of years that passed between the two sightings: 75 years...",90,False
4,Tom plants 10 trees a year. Every year he also chops down 2 trees a year. He starts with 50 trees. After 10 years 30%...,He gets 10-2=<<10-2=8>>8 new trees a year After 10 years he has 8*10=<<8*10=80>>80 new trees So he had 80+50=<<80+50=130>>130 trees He lost 130*.3=<<130*.3=39>>39 trees That...,91,"find out how many trees Tom has left. First, let's find out how many trees Tom plants in 10 years: 10 trees/year x 10 years...",91,✔️ [True]
5,"John picks 4 bananas on Wednesday. Then he picks 6 bananas on Thursday. On Friday, he picks triple the number of bananas he did on...","Combining Wednesday and Thursday, John has 4 bananas + 6 bananas = <<4+6=10>>10 bananas. On Friday, he picks 3 * 4 bananas = <<3*4=12>>12 bananas....",22,"find the total number of bananas John has. First, let's find the number of bananas John picked on Friday: 4 x 3 = 12 bananas....",22,✔️ [True]
6,"Peyton scheduled after-work activities of a one hour yoga class on Monday, a cooking class that lasts three times as long as Monday’s yoga on...",Peyton’s cooking class will last 3 * 1 = <<3*1=3>>3 hours. The museum tour will take 3 / 2 = 1 1/2 hours. Peyton’s after-work...,8,calculate the total hours of Peyton's after-work activities. 1. **Yoga class (Monday):** 1 hour 2. **Cooking class (Tuesday):** 3 x 1 hour = 3 hours...,8,✔️ [True]
7,"Ben has 4 tubes of blue paint and 3 tubes of yellow paint. Jasper has half as many tubes of blue paint as Ben, and...",Jasper has 4/2= <<4/2=2>>2 tubes of blue paint Jasper has 3*3=<<3*3=9>>9 tubes of yellow paint Jasper has a total of 2+9 =<<2+9=11>>11 tubes of paint,11,"find out how many tubes of paint Jasper has. First, let's find out how many tubes of blue paint Jasper has: 4 / 2 =...",11,✔️ [True]
8,"Elaina is holding the final concert in her tour. To celebrate her final concert, she makes the concert twice as long as her usual concerts....","The concert, minus the encore, lasted for 65-minute concert – 15-minute encore = <<65-15=50>>50 minutes. This is twice as long as her usual concerts so...",25,"find the runtime of Elaina's usual concerts. First, we need to subtract the encore's runtime from the total runtime of the final concert: 65 -...",25,✔️ [True]
9,"Hannah slips on a banana peel and breaks her arm. The doctor charges her $200 for the cast, $300/hour for a 30-minute visit, $4/pill for...",First find the length of the visit in hours: 30 minutes / 60 minutes/hour = <<30/60=.5>>.5 hours Then find the total cost of the visit:...,482,"calculate the total cost of the doctor's visit. First, we need to find the cost of the visit: $300/hour x 0.5 hours = $150. Next,...",$482,✔️ [True]


90.0

Nice, this improved Gemini Pro's performance significantly: accuracy on the same test set rose from 35% to 90% :)
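To make the score concrete, here is a minimal sketch of how an answer-matching metric of this kind could work. It is not DSPy's actual `gsm8k_metric` implementation; the helpers `parse_integer_answer` and `answers_match` are hypothetical names written for illustration, and the sketch assumes answers are integers. It shows why a prediction such as `$482` can still count as correct against the gold answer `482`, and how 18 correct answers out of 20 give 90.0.

```python
import re
from typing import Optional


def parse_integer_answer(text: str) -> Optional[int]:
    """Hypothetical helper: return the first integer found in an answer string."""
    cleaned = text.replace(",", "")  # drop thousands separators such as "1,550"
    match = re.search(r"-?\d+", cleaned)
    return int(match.group()) if match else None


def answers_match(gold: str, pred: str) -> bool:
    """Hypothetical metric: compare the integers extracted from both answers."""
    gold_value = parse_integer_answer(gold)
    pred_value = parse_integer_answer(pred)
    return gold_value is not None and gold_value == pred_value


# (gold answer, predicted answer) pairs copied from three rows of the table above
rows = [("16", "16"), ("15", "90"), ("482", "$482")]
results = [answers_match(gold, pred) for gold, pred in rows]
print(results)

# The reported score is the share of correct answers, as a percentage.
print(round(100 * 18 / 20, 1))
```

Comparing parsed numbers rather than raw strings keeps formatting differences (currency symbols, separators) from being counted as wrong answers, while a genuinely wrong prediction such as `90` against `15` still fails.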

In [None]:
dspy_gemini_pro.inspect_history(n=1)





Answer math problems with numbers or short phrases.

---

Follow the following format.

Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: Usually a number or short phrase.

---

Question: A tank contains 6000 liters of water, 2000 liters evaporated, and then 3500 liters were drained by Bob. How many liters are in the tank if it now rains for 30 minutes and every 10 minutes 350 liters of rain are added to the tank?
Reasoning: Let's think step by step in order to find the total amount of water in the tank. First, 2000 liters evaporated from the initial 6000 liters, leaving 6000 - 2000 = 4000 liters. Next, Bob drained 3500 liters, resulting in 4000 - 3500 = 500 liters. Since it rained for 30 minutes, and 350 liters are added every 10 minutes, a total of 350 x 3 = 1050 liters were added. Therefore, the final amount of water in the tank is 500 + 1050 = 1550 liters.
Answer: 1550

---

Question: Louise is baking cakes for a gatheri