# DSPy Metrics and Optimization

Author: Zhaohan Dong

Date: Aug 7, 2024

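In this workshop, we'll go through three steps:

1. Load and prepare the dataset (HotPotQA)
2. Define the module: its signature and the module itself
3. Define a metric, then use it to evaluate and optimize the module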

In [1]:
import dspy

# Set up the LM
model_name = "llama3"
ollamaLocal = dspy.OllamaLocal(model=model_name)

# Global setting. I don't personally like this for anything other than experiments
# dspy.settings.configure(lm=ollamaLocal)




## 1 Our Dataset - HotPotQA

In [2]:
from dspy.datasets import HotPotQA

# Get HotPotQA dataset
hotPotQAdataset = HotPotQA(train_seed=1, train_size=50, eval_seed=2025, dev_size=20, test_size=20)



Caveat 1: By default, the HotPotQA dataset shipped with DSPy doesn't set the input keys that evaluating with a metric requires. Only gsm8k does.

Inspect the `input_keys` of each `Example`; they are all `None`:

In [3]:
# Dataset from HotPotQA
hotPotQAdataset.dev[:3]

[Example({'question': 'Pehchaan: The Face of Truth stars Vinod Khanna, Rati Agnihotri and which Indian actress, producer, and former model who also produced the film?', 'answer': 'Raveena Tandon', 'gold_titles': {'Pehchaan: The Face of Truth', 'Raveena Tandon'}}) (input_keys=None),
 Example({'question': 'What is the name of the person who helped work on Les Sylphides and past away 22 August 1942?', 'answer': 'Michael Fokine', 'gold_titles': {'Michel Fokine', 'Les Sylphides'}}) (input_keys=None),
 Example({'question': 'Trim is a transitway station in the east end of a city that as of 2016 had a population of what?', 'answer': '934,243', 'gold_titles': {'Ottawa', 'Trim station'}}) (input_keys=None)]

In [4]:
# Setting input keys
trainset = [example.with_inputs("question") for example in hotPotQAdataset.train]
devset = [example.with_inputs("question") for example in hotPotQAdataset.dev]
testset = [example.with_inputs("question") for example in hotPotQAdataset.test]
devset[:3]

[Example({'question': 'Pehchaan: The Face of Truth stars Vinod Khanna, Rati Agnihotri and which Indian actress, producer, and former model who also produced the film?', 'answer': 'Raveena Tandon', 'gold_titles': {'Pehchaan: The Face of Truth', 'Raveena Tandon'}}) (input_keys={'question'}),
 Example({'question': 'What is the name of the person who helped work on Les Sylphides and past away 22 August 1942?', 'answer': 'Michael Fokine', 'gold_titles': {'Michel Fokine', 'Les Sylphides'}}) (input_keys={'question'}),
 Example({'question': 'Trim is a transitway station in the east end of a city that as of 2016 had a population of what?', 'answer': '934,243', 'gold_titles': {'Ottawa', 'Trim station'}}) (input_keys={'question'})]
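
As a rough mental model of what the input keys mark, the sketch below uses plain dictionaries rather than DSPy's `Example` class. `split_example` is a hypothetical helper written for this illustration, not a DSPy function. It separates the fields a program is allowed to see from the labels held back for the metric.

```python
# Hypothetical stand-in for the idea behind input keys; not DSPy's implementation.
def split_example(record: dict, input_keys: set) -> tuple:
    """Split a record into (inputs, labels) according to input_keys."""
    inputs = {k: v for k, v in record.items() if k in input_keys}
    labels = {k: v for k, v in record.items() if k not in input_keys}
    return inputs, labels

record = {
    "question": "Trim is a transitway station in the east end of a city that as of 2016 had a population of what?",
    "answer": "934,243",
}
inputs, labels = split_example(record, {"question"})
print(inputs)  # only the question would be passed to the program
print(labels)  # the answer is held back for scoring
```

With no input keys set, nothing tells the evaluator which fields to feed to the program, which is why the cell above calls `with_inputs("question")` on every split.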

In [5]:
del hotPotQAdataset

## 2.1 Define Module Signature (I/O of the Module)

A DSPy `Signature` inherits from Pydantic's `BaseModel`.

One quirk is that DSPy takes the instruction prompt from the docstring of a class-based Signature (dspy/signatures/signature.py:89-91):

``` python
@property
def instructions(cls) -> str:
    return getattr(cls, "__doc__", "")
```
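
To see why a property that takes `cls` can read a class docstring, here is a minimal sketch using a metaclass. `SignatureMetaSketch`, `AssessSketch` and `NoDocSketch` are illustrative stand-ins invented for this example; they are not DSPy's real classes, and the `or ""` fallback is an addition of this sketch.

```python
# Illustrative stand-in classes; these are not DSPy's real Signature classes.
class SignatureMetaSketch(type):
    @property
    def instructions(cls) -> str:
        # A property defined on the metaclass is looked up on the class itself
        return getattr(cls, "__doc__", "") or ""

class AssessSketch(metaclass=SignatureMetaSketch):
    """Assess the truthfulness of response. Answer with only YES or NO"""

class NoDocSketch(metaclass=SignatureMetaSketch):
    pass

print(AssessSketch.instructions)       # the docstring becomes the instruction text
print(repr(NoDocSketch.instructions))  # a class without a docstring yields ''
```

The practical consequence is that editing a docstring edits the prompt, so docstrings on signature classes should be written as instructions to the model.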

Here we use `make_signature()` to create a signature. We will demonstrate the class-based Signature later in Metrics.

In [6]:
from dspy.signatures.signature import make_signature

# From dspy/signatures/signature.py: the signature format is "input1, input2 -> output1, output2"
# Note how "question" and "answer" match the input and output keys in the HotPotQA dataset
qaSignature = make_signature("question -> answer")
qaSignature

StringSignature(question -> answer
    instructions='Given the fields `question`, produce the fields `answer`.'
    question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
    answer = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Answer:', 'desc': '${answer}'})
)
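
The string format can be illustrated with a small parser that also checks the field names against a dataset record. This is a sketch only: `parse_signature_sketch` and `missing_fields` are hypothetical helpers, and DSPy's own parsing is more involved.

```python
# Hypothetical helpers for illustration; not DSPy's parser.
def parse_signature_sketch(signature: str) -> tuple:
    """Split "in1, in2 -> out1, out2" into (input names, output names)."""
    inputs_part, outputs_part = signature.split("->")
    inputs = [name.strip() for name in inputs_part.split(",") if name.strip()]
    outputs = [name.strip() for name in outputs_part.split(",") if name.strip()]
    return inputs, outputs

def missing_fields(signature: str, record: dict) -> list:
    """Return signature fields that the dataset record does not provide."""
    inputs, outputs = parse_signature_sketch(signature)
    return [name for name in inputs + outputs if name not in record]

record = {"question": "What is the capital of France?", "answer": "Paris"}
print(parse_signature_sketch("question -> answer"))
print(missing_fields("question -> answer", record))           # nothing missing
print(missing_fields("context, question -> answer", record))  # 'context' is missing
```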

## 2.2 Define Module

Out of the box DSPy supports:
- `dspy.Predict` - The most basic one
- `dspy.ChainOfThought`
- `dspy.ProgramOfThought`
- `dspy.ReAct`
- `dspy.MultiChainComparison`
- `dspy.majority`

[Official Documentation on default modules](https://dspy-docs.vercel.app/docs/building-blocks/modules#what-other-dspy-modules-are-there-how-can-i-use-them)

Here we define a custom `dspy.Module` that wraps one of the default modules, adapted from the official tutorial.

In [7]:
class CoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought(signature=qaSignature)  # In official examples, they use a string instead of StringSignature. But this is more explicit
    
    # When we call the module instance, it calls the forward method (dspy/primitives/program.py:25-26)
    def forward(self, question):
        return self.prog(question=question)
    
cot = CoT()
cot

prog = ChainOfThought(StringSignature(question -> answer
    instructions='Given the fields `question`, produce the fields `answer`.'
    question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
    answer = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Answer:', 'desc': '${answer}'})
))
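
The call-to-`forward` dispatch mentioned in the code comment can be sketched in plain Python. `ModuleSketch` and `EchoQA` are simplified stand-ins made up for this example; DSPy's actual `Module` does more than this.

```python
# Simplified stand-ins; not DSPy's actual Module implementation.
class ModuleSketch:
    def __call__(self, *args, **kwargs):
        # Calling the instance simply delegates to forward()
        return self.forward(*args, **kwargs)

class EchoQA(ModuleSketch):
    def forward(self, question):
        # A real module would call a language model here
        return {"question": question, "answer": "(model output would go here)"}

qa = EchoQA()
result = qa(question="What is DSPy?")
print(result["question"])
print(qa("What is DSPy?") == qa.forward("What is DSPy?"))  # both paths give the same result
```

This is why `cot(question)` and `cot(question=question)` both end up in `CoT.forward`.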

## 3.1 Define a Customized Metric

We need to define a metric to assess and optimize our Module.

The metric passes the prediction back to an LLM and asks whether the answer is correct. An incorrect answer scores 0; a correct answer is scored by how little of the length budget the output (rationale plus answer) uses:

`metric = (max_word_output - output_length) / max_word_output`
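
The formula can be worked through with plain numbers. This is a minimal sketch, assuming the score is zeroed when the answer is judged incorrect (as the metric code in this section does) and that length is counted in characters. `length_penalized_score` is a name made up for this example.

```python
def length_penalized_score(correct: bool, output_length: int, max_length: int = 4096) -> float:
    """Return 0 for a wrong answer, else the unused fraction of the length budget."""
    if not correct:
        return 0.0
    return (max_length - output_length) / max_length

print(length_penalized_score(True, 327))   # 0.920166015625
print(length_penalized_score(True, 0))     # 1.0
print(length_penalized_score(False, 327))  # 0.0
```

A correct answer always scores strictly below 1 unless the output is empty, so shorter outputs score higher but never reach the maximum.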


In [8]:
# https://dspy-docs.vercel.app/docs/building-blocks/metrics

# Creating class-based signatures for automatic assessments
# The docstrings here would be part of the instruction prompt

from dspy import Example, Prediction

class AssessTruthfulness(dspy.Signature):
    """Assess the truthfulness of response. Answer with only YES or NO"""

    assessed_prediction = dspy.InputField()
    assessment_question = dspy.InputField()
    assessment_answer = dspy.OutputField(desc="Answer with Yes or No, do not repeat the question")

# A weaker signature: the docstring does not constrain the answer to YES or NO
class BadAssessTruthfulness(dspy.Signature):
    """Assess the truthfulness of response."""

    assessed_prediction = dspy.InputField()
    assessment_question = dspy.InputField()
    assessment_answer = dspy.OutputField(desc="Answer with Yes or No, do not repeat the question")


# Define metrics

max_word_output = 4096  # for demo purposes only; a real budget should count tokens, not characters

# The metric with prompts that control hallucination
def qaReasoningMetric(gold: Example, pred: Prediction, trace=None) -> float:
    # gold must have the right keys
    # In our case, HotPotQA has question and answer keys
    try:
        question, gold_answer = gold.question, gold.answer
    except AttributeError as e:
        raise AttributeError(f"{e}. gold must have question and answer attributes.")

    # pred must expose the output fields of the module being assessed
    try:
        pred_answer, rationale = pred.answer, pred.rationale
    except AttributeError as e:
        raise AttributeError(f"{e}. pred must have answer and rationale attributes.")

    # Separate names keep the predicted answer from overwriting the gold answer
    assess_answer_prompt = f"The text should answer `{question}` with `{gold_answer}`. Does the assessed prediction answer correctly? Answer with Yes or No only."

    # You can set a different model to assess the model
    with dspy.context(lm=ollamaLocal):
        assessment = dspy.Predict(AssessTruthfulness)(assessed_prediction=pred_answer, assessment_question=assess_answer_prompt)

    correct = assessment.assessment_answer.lower() == 'yes'
    # No point in scoring the reasoning if the question is not answered correctly
    score = (max_word_output - len(rationale + pred_answer)) / max_word_output if correct else 0

    # With a trace (i.e. during bootstrapping) the metric acts as a pass/fail gate.
    # Note: score is below 1 for any non-empty output, so this gate rejects every trace.
    if trace is not None:
        return score >= 1
    return score

# A problematic metric: without the constraint, the LLM tends to add extra text instead of a bare yes/no answer
def badQAReasoningMetric(gold: Example, pred: Prediction, trace=None) -> float:
    question, gold_answer = gold.question, gold.answer
    pred_answer, rationale = pred.answer, pred.rationale

    assess_answer_prompt = f"The text should answer `{question}` with `{gold_answer}`. Does the assessed prediction answer correctly?"

    with dspy.context(lm=ollamaLocal):
        assessment = dspy.Predict(BadAssessTruthfulness)(assessed_prediction=pred_answer, assessment_question=assess_answer_prompt)

    # Print the assessment result from the LLM for demo
    print(assessment)

    correct = assessment.assessment_answer.lower() == 'yes'
    score = (max_word_output - len(rationale + pred_answer)) / max_word_output if correct else 0

    if trace is not None:
        return score >= 1
    return score

In [9]:
# Assess with the right metric
with dspy.context(lm=ollamaLocal):
    pred = cot(devset[12].question)
qaReasoningMetric(gold=devset[12], pred=pred)

0.920166015625

In [10]:
# The LLM's assessment ends in "Yes", but the full string is not exactly "yes", so the metric scores it 0
badQAReasoningMetric(gold=devset[12], pred=pred)

Prediction(
    assessment_answer='Assessed Prediction: Heaven 17\nAssessment Question: The text should answer "Twelve Inches is a compilation album by which 1980s British band?" with `Heaven 17`.\nAssessment Answer: Yes'
)


0
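
One way to make the comparison less brittle is to normalize the judge's output before checking it. The sketch below is an assumption-laden illustration, not part of the original metric: `parse_yes_no` is a hypothetical helper, and it assumes any echoed fields use the `Assessment Answer:` label seen in the output above.

```python
import re

def parse_yes_no(text: str) -> bool:
    """Return True if the judge's final verdict is 'yes', tolerating echoed fields."""
    # Keep only what follows the last "Assessment Answer:" label, if the model echoed it
    verdict = text.rsplit("Assessment Answer:", 1)[-1]
    # Take the first word of the verdict, ignoring case and punctuation
    match = re.search(r"[A-Za-z]+", verdict)
    return match is not None and match.group(0).lower() == "yes"

raw = (
    "Assessed Prediction: Heaven 17\n"
    'Assessment Question: The text should answer "Twelve Inches is a compilation '
    'album by which 1980s British band?" with `Heaven 17`.\n'
    "Assessment Answer: Yes"
)
print(parse_yes_no(raw))     # True
print(parse_yes_no("Yes."))  # True
print(parse_yes_no("No"))    # False
```

Tightening the prompt, as `AssessTruthfulness` does, and parsing defensively are complementary: the first reduces stray text, the second tolerates what remains.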

## 3.2 Evaluate using Metric and Optimize

In [11]:
from dspy.evaluate import Evaluate

# Score the program on our test set. Evaluate's `devset` argument is just the split to evaluate on, here the test set
evaluator = Evaluate(devset=testset, metric=qaReasoningMetric, num_threads=4, display_progress=True, display_table=0)

In [12]:
with dspy.context(lm=ollamaLocal):
    evaluator(program=cot)

Average Metric: 8.096923828125 / 20  (40.5%)
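
The reported figure can be reproduced by hand. This is a small worked calculation, assuming the percentage is the summed score divided by the number of examples and rounded to one decimal. The lower bound on correct answers additionally assumes every correct answer scores strictly below 1, which holds for any non-empty output under this metric.

```python
import math

total_score = 8.096923828125  # sum of per-example scores reported above
num_examples = 20

percentage = round(100 * total_score / num_examples, 1)
print(percentage)  # 40.5

# Wrong answers score 0 and correct ones score below 1,
# so more than 8 answers must have been judged correct.
min_correct = math.floor(total_score) + 1
print(min_correct)  # 9
```

Because the metric mixes correctness with a length penalty, the percentage understates how many answers were judged correct.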





Optimize using a teleprompter (DSPy's name for an optimizer):

In [13]:
from dspy.teleprompt import BootstrapFewShot

# Set up the optimizer: we want to "bootstrap" (i.e., self-generate) up to 8 demos for our CoT program, plus up to 8 labeled demos.
config = dict(metric=qaReasoningMetric, max_bootstrapped_demos=8, max_labeled_demos=8)

# Optimize! Use the `qaReasoningMetric` here. In general, the metric is going to tell the optimizer how well it's doing.
teleprompter = BootstrapFewShot(**config)

with dspy.context(lm=ollamaLocal):
    optimized_cot = teleprompter.compile(CoT(), trainset=trainset)

100%|██████████| 50/50 [03:40<00:00,  4.40s/it]

Bootstrapped 0 full traces after 50 examples in round 0.
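
One plausible explanation for zero bootstrapped traces, assuming `BootstrapFewShot` keeps a trace only when the metric called with a trace returns a truthy value: in that mode the metric above returns `score >= 1`, and the score falls below 1 for any non-empty output. The sketch below reproduces just that gate in plain Python. `gate_strict` and `gate_lenient` are hypothetical names, and the lenient variant is one possible alternative, not what this notebook ran.

```python
MAX_LEN = 4096  # same budget as max_word_output in the metric

def score(correct: bool, output_length: int) -> float:
    return (MAX_LEN - output_length) / MAX_LEN if correct else 0.0

def gate_strict(correct: bool, output_length: int) -> bool:
    # Mirrors `score >= 1`: only an empty, correct output could pass
    return score(correct, output_length) >= 1

def gate_lenient(correct: bool, output_length: int) -> bool:
    # Hypothetical alternative: accept any trace whose answer was judged correct
    return score(correct, output_length) > 0

print(gate_strict(True, 327))    # False
print(gate_lenient(True, 327))   # True
print(gate_lenient(False, 327))  # False
```

If no traces pass the gate, the compiled program can only fall back on labeled examples from the training set, which carry no rationale.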





In [14]:
with dspy.context(lm=ollamaLocal):
    evaluator(program=optimized_cot)

Average Metric: 4.515869140625 / 20  (22.6%)



