<img src="../../docs/docs/static/img/dspy_logo.png" alt="DSPy7 Image" height="150"/>

## **DSPy Assertions**: Asserting Computational Constraints on Foundation Models

### **QuizGen**: Generating multiple choice quiz questions

[<img align="center" src="https://colab.research.google.com/assets/colab-badge.svg" />](https://colab.research.google.com/github/stanfordnlp/dspy/blob/main/examples/quiz/quiz_assertions.ipynb)


This notebook walks through an example of [**DSPy Assertions**](https://dspy-docs.vercel.app/docs/building-blocks/assertions), which let you declare computational constraints within DSPy programs.


This notebook builds upon the foundational concepts of the **DSPy** framework. Before following it, you should have gone through the [DSPy tutorial](../../intro.ipynb), the [**DSPy Assertions documentation**](https://dspy-docs.vercel.app/docs/building-blocks/assertions), and the introductory DSPy Assertions [tutorial on LongFormQA](../longformqa/longformqa_assertions.ipynb).


In [None]:
%load_ext autoreload
%autoreload 2

import sys
import os
import json

try: # When on google Colab, let's clone the notebook so we download the cache.
    import google.colab  # noqa: F401
    repo_path = 'dspy'
    
    !git -C $repo_path pull origin || git clone https://github.com/stanfordnlp/dspy $repo_path
except ImportError:  # Not running on Colab; use the local checkout.
    repo_path = '.'

if repo_path not in sys.path:
    sys.path.append(repo_path)


import pkg_resources # Install the package if it's not installed
if "dspy-ai" not in {pkg.key for pkg in pkg_resources.working_set}:
    !pip install -U pip
    !pip install dspy-ai==2.4.17
    !pip install openai~=0.28.1
    !pip install -e $repo_path

import dspy
from dspy.predict import Retry
from dspy.datasets import HotPotQA
from dspy.teleprompt import BootstrapFewShotWithRandomSearch
from dspy.evaluate.evaluate import Evaluate
from dspy.primitives.assertions import assert_transform_module, backtrack_handler

In [None]:
import openai
openai.api_key = os.getenv('OPENAI_API_KEY')

In [None]:
colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
dspy.settings.configure(rm=colbertv2_wiki17_abstracts)
turbo = dspy.OpenAI(model='gpt-4o-mini', max_tokens=500)
dspy.settings.configure(lm=turbo, trace=[], temperature=0.7)

In [None]:
# dev_size must exceed 67 because later cells inspect devset[67].
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=300, test_size=0, keep_details=True)
trainset = [x.with_inputs('question', 'answer') for x in dataset.train]
devset = [x.with_inputs('question', 'answer') for x in dataset.dev]

### 3] QuizGen

Let's introduce a new task: QuizGen. 

QuizGen takes HotPotQA data points and turns them into multiple choice quiz questions with the corresponding options. Each set of options for the question is produced in a JSON key-value pair format. For this case, we specify the generation of 4 choices.

With this program, we aim to generate quiz choices that adhere to the following guidelines:
1. The generated choices are in a JSON format.
2. The generated choices include the correct answer.
3. The generated choices include plausible distractor options besides the correct answer.
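
To make the target concrete, here is a minimal sketch of one output that would satisfy the first two guidelines. The question, the option labels (`A` to `D`), and the distractors are invented for illustration; the notebook only asks for JSON key-value pairs and does not prescribe the key names.

```python
import json

# Hypothetical data point; not taken from HotPotQA.
question = "Which planet is known as the Red Planet?"
correct_answer = "Mars"

# One possible target: four labelled options as JSON key-value pairs.
# The labels "A".."D" are an assumption; any string keys would do.
choices = {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Mercury"}
choice_string = json.dumps(choices)

parsed = json.loads(choice_string)               # guideline 1: parses as JSON
has_answer = correct_answer in parsed.values()   # guideline 2: answer included
print(choice_string)
print(len(parsed), has_answer)
```

The third guideline, plausibility of the distractors, cannot be checked mechanically in this way, which is why the notebook later delegates it to an LM-based assessment.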

In [None]:
class GenerateAnswerChoices(dspy.Signature):
    """Generate answer choices in JSON format that include the correct answer and plausible distractors for the specified question."""
    question = dspy.InputField()
    correct_answer = dspy.InputField()
    number_of_choices = dspy.InputField()
    answer_choices = dspy.OutputField(desc='JSON key-value pairs')

class QuizAnswerGenerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_choices = dspy.ChainOfThought(GenerateAnswerChoices)

    def forward(self, question, answer):
        choices = self.generate_choices(question=question, correct_answer=answer, number_of_choices=number_of_choices).answer_choices
        return dspy.Prediction(choices = choices)

number_of_choices = '4'
quiz_generator = QuizAnswerGenerator()

### 4] Evaluation - Intrinsic and Extrinsic

#### Intrinsic Metrics: passing internal computational constraints is the goal 

**Valid Formatting** - The generated answer choices should be valid JSON. We verify this by parsing the string and checking that it is an object whose keys and values are all strings.

**Correct Answer Inclusion** - This is a general check to ensure the generated quiz choices actually include the correct answer to the question.

**Plausible Distractors** - This check verifies that the generated choices include distractors that are reasonable candidate answers to the question. We define and call another **DSPy** program, ``Predict`` on ``AssessQuizChoices``, relying on the same LM to answer the question: `"Are the distractors in the answer choices plausible and not easily identifiable as incorrect?"`

In [None]:
def format_checker(choice_string):
    try:
        choices = json.loads(choice_string)
        if isinstance(choices, dict) and all(isinstance(key, str) and isinstance(value, str) for key, value in choices.items()):
            return True
    except json.JSONDecodeError:
        return False

    return False

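def is_correct_answer_included(correct_answer, generated_choices):
    try:
        choices_dict = json.loads(generated_choices)
    except json.JSONDecodeError:
        return False
    # Valid JSON that is not an object (e.g. a list) has no .values().
    return isinstance(choices_dict, dict) and correct_answer in choices_dict.values()

def is_plausibility_yes(assessment_answer):
    """Check if the first word of the assessment answer is 'yes'."""
    words = assessment_answer.split()
    # Guard against an empty answer, which would otherwise raise IndexError.
    return bool(words) and words[0].lower() == 'yes'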
    
class AssessQuizChoices(dspy.Signature):
    """Assess the quality of quiz answer choices along specified dimensions."""
    
    question = dspy.InputField()
    answer_choices = dspy.InputField()
    assessment_question = dspy.InputField()
    assessment_answer = dspy.OutputField(desc="Yes or No")
    
def format_valid_metric(gold, pred, trace=None):
    generated_choices = pred.choices
    format_valid = format_checker(generated_choices)
    score = format_valid
    return score

def is_correct_metric(gold, pred, trace=None):
    correct_answer, generated_choices = gold.answer, pred.choices
    correct_included = is_correct_answer_included(correct_answer, generated_choices)
    score = correct_included
    return score

def plausibility_metric(gold, pred, trace=None):
    question, generated_choices = gold.question, pred.choices
    plausibility_question = "Are the distractors in the answer choices plausible and not easily identifiable as incorrect?"
    plausibility_assessment = dspy.Predict(AssessQuizChoices)(question=question, answer_choices=generated_choices, assessment_question=plausibility_question)
    plausibility_result = is_plausibility_yes(plausibility_assessment.assessment_answer)
    score = plausibility_result
    return score

#### Extrinsic Metrics: Assess the overall quality and effectiveness of generated output on downstream task

The extrinsic metric measures the overall quality of the generated quiz choices through a composite metric that accounts for all three constraints.

The composite metric treats valid formatting and correct answer inclusion as hard requirements: if either one fails, the score is 0. Otherwise, it returns the average of the 3 intrinsic metrics.
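
Because of this gating, the composite score can only take three values. A minimal sketch of the arithmetic, with plain booleans standing in for the three checks (the function name `composite_score` is hypothetical and not part of the notebook):

```python
def composite_score(format_valid: bool, correct_included: bool, plausible: bool) -> float:
    """Average the three checks, but only when both hard requirements hold."""
    if not (format_valid and correct_included):
        return 0.0
    # Booleans add as 0/1, so the numerator is the number of passed checks.
    return (format_valid + correct_included + plausible) / 3.0

print(composite_score(True, True, True))    # 1.0
print(composite_score(True, True, False))   # 0.666...
print(composite_score(True, False, True))   # 0.0
```

A set of choices with implausible distractors still earns 2/3 as long as it is well formed and contains the correct answer, while any formatting or inclusion failure drops the score to 0 regardless of plausibility.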

In [None]:
def overall_metric(gold, pred, trace=None):
    question, correct_answer, generated_choices = gold.question, gold.answer, pred.choices
    format_valid = format_checker(generated_choices)
    correct_included = is_correct_answer_included(correct_answer, generated_choices)
    plausibility_question = "Are the distractors in the answer choices plausible and not easily identifiable as incorrect?"
    plausibility_assessment = dspy.Predict(AssessQuizChoices)(question=question, answer_choices=generated_choices, assessment_question=plausibility_question)
    plausibility_result = is_plausibility_yes(plausibility_assessment.assessment_answer)
    score = (format_valid + correct_included + plausibility_result) / 3.0 if correct_included and format_valid else 0
    return score

We hence define the evaluation as follows:

In [None]:
metrics = [format_valid_metric, is_correct_metric, plausibility_metric, overall_metric]

for metric in metrics:
    evaluate = Evaluate(metric=metric, devset=devset, num_threads=1, display_progress=True, display_table=5)
    evaluate(quiz_generator)

Let's take a look at an example quiz choice generation:

In [None]:
example = devset[67]
quiz_choices = quiz_generator(question=example.question, answer = example.answer)
print('Generated Quiz Choices: ', quiz_choices.choices)

In [None]:
for metric in metrics:
    evaluate = Evaluate(metric=metric, devset=devset[67:68], num_threads=1, display_progress=True, display_table=5)
    evaluate(quiz_generator)

We see that the generated quiz choices do not maintain valid JSON formatting, which causes both the formatting check and the correct answer inclusion check to fail, even though the choices are judged plausible. The correct answer is also labeled with "(Correct Answer)", which defeats the purpose of a quiz question.
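
The two failure modes can be reproduced without calling an LM. The strings below are hypothetical, written only to mirror the kinds of problems described above; they are not the model's actual output.

```python
import json

correct_answer = "Mars"

# Hypothetical outputs, invented for illustration.
plain_list = "A. Venus\nB. Mars (Correct Answer)\nC. Jupiter\nD. Mercury"
labelled_json = '{"A": "Venus", "B": "Mars (Correct Answer)", "C": "Jupiter", "D": "Mercury"}'

def parses_as_json(text):
    """Return True when text can be parsed by json.loads."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(parses_as_json(plain_list))       # False: a lettered list is not JSON
print(parses_as_json(labelled_json))    # True
# Exact-match inclusion fails when the answer carries an extra label.
print(correct_answer in json.loads(labelled_json).values())  # False
```

Even when the output is valid JSON, a "(Correct Answer)" suffix breaks the exact-match inclusion check, so both problems need to be addressed.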

Let's take a look at how we can integrate DSPy Assertions and impose constraints to produce better answer choices.

### 5] Introducing Assertions: QuizAnswerGeneratorWithAssertions
Let's include assertions that simply restate our computational constraints using DSPy Assertion semantics. Each constraint is declared with `dspy.Suggest`, which pairs a check with a feedback message.

In the first **Assertion**, we check whether the generated quiz choices are in JSON format and, if not, give the feedback: **"The format of the answer choices should be in JSON format. Please revise accordingly."**

We also check whether the set of quiz choices includes the correct answer and, if it does not, give the feedback: **"The answer choices do not include the correct answer to the question. Please revise accordingly."**

Lastly, we assess whether the distractor choices are indeed good distractor options and, if not, give the feedback: **"The answer choices are not plausible distractors or are too easily identifiable as incorrect. Please revise to provide more challenging and plausible distractors."**
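
As a rough mental model, a soft constraint of this kind behaves like a retry loop that feeds the failed constraint's message back into the next generation attempt. The sketch below is a simplified illustration in plain Python and is not how DSPy implements `dspy.Suggest`, `Retry`, or `backtrack_handler`; the names `toy_generator`, `generate_with_suggestion`, and `max_retries` are hypothetical.

```python
import json

def is_json_dict(text):
    """Return True when text parses to a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False

def toy_generator(feedback=None):
    """Stand-in for an LM call: returns valid JSON only once it has seen feedback."""
    if feedback is None:
        return "A. Venus, B. Mars, C. Jupiter, D. Mercury"
    return '{"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Mercury"}'

def generate_with_suggestion(generate, check, message, max_retries=2):
    """Retry generation with the feedback message until the check passes or retries run out."""
    feedback = None
    output = generate(feedback)
    attempts = 1
    while not check(output) and attempts <= max_retries:
        feedback = message              # feed the failed constraint back to the generator
        output = generate(feedback)
        attempts += 1
    # Soft constraint: the last output is returned even if it still fails the check.
    return output, attempts

output, attempts = generate_with_suggestion(
    toy_generator,
    is_json_dict,
    "The format of the answer choices should be in JSON format. Please revise accordingly.",
)
print(attempts, output)
```

In this toy run the first attempt fails the format check and the second attempt, made with the feedback message, passes it.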

In [None]:
class QuizAnswerGeneratorWithAssertions(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_choices = dspy.ChainOfThought(GenerateAnswerChoices)

    def forward(self, question, answer):
        choice_string = self.generate_choices(question=question, correct_answer=answer, number_of_choices=number_of_choices).answer_choices
        dspy.Suggest(format_checker(choice_string), "The format of the answer choices should be in JSON format. Please revise accordingly.", target_module=self.generate_choices)
        dspy.Suggest(is_correct_answer_included(answer, choice_string), "The answer choices do not include the correct answer to the question. Please revise accordingly.", target_module=self.generate_choices)
        plausibility_question = "Are the distractors in the answer choices plausible and not easily identifiable as incorrect?"
        plausibility_assessment = dspy.Predict(AssessQuizChoices)(question=question, answer_choices=choice_string, assessment_question=plausibility_question)
        dspy.Suggest(is_plausibility_yes(plausibility_assessment.assessment_answer), "The answer choices are not plausible distractors or are too easily identifiable as incorrect. Please revise to provide more challenging and plausible distractors.", target_module=self.generate_choices)
        return dspy.Prediction(choices = choice_string)

number_of_choices = '4'
quiz_generator_with_assertions = assert_transform_module(QuizAnswerGeneratorWithAssertions().map_named_predictors(Retry), backtrack_handler) 

Let's evaluate the `QuizAnswerGeneratorWithAssertions` now over the devset.

In [None]:
metrics = [format_valid_metric, is_correct_metric, plausibility_metric, overall_metric]

for metric in metrics:
    evaluate = Evaluate(metric=metric, devset=devset, num_threads=1, display_progress=True, display_table=5)
    evaluate(quiz_generator_with_assertions)

Now let's take a look at how our generated set of quiz choices has improved with the addition of assertions.

In [None]:
example = devset[67]
quiz_choices = quiz_generator_with_assertions(question=example.question, answer = example.answer)
print('Generated Quiz Choices: ', quiz_choices.choices)

In [None]:
for metric in metrics:
    evaluate = Evaluate(metric=metric, devset=devset[67:68], num_threads=1, display_progress=True, display_table=30)
    evaluate(quiz_generator_with_assertions)

We see that the quiz choices follow all of our constraints!

The answer choices are all plausible and no longer carry any indicator of which one is correct. They also maintain valid JSON formatting, with 4 possible answer choices that include the correct answer.

### 6] Compilation With Assertions

We can leverage **DSPy**'s `BootstrapFewShotWithRandomSearch` optimizer to automatically generate few-shot demonstrations and conduct a random search over the candidates to output the best compiled program. We evaluate this with the `overall_metric` composite metric.

We can first evaluate this on `QuizAnswerGenerator` to see how compilation performs without the inclusion of assertions. 

In [None]:
teleprompter = BootstrapFewShotWithRandomSearch(metric = overall_metric, max_bootstrapped_demos=2, num_candidate_programs=6)
compiled_quiz_generator = teleprompter.compile(student = quiz_generator, teacher = quiz_generator, trainset=trainset, valset=devset[:25])

for metric in metrics:
    evaluate = Evaluate(metric=metric, devset=devset, num_threads=1, display_progress=True, display_table=5)
    evaluate(compiled_quiz_generator)

Now we test the compilation in 2 settings with assertions:

**Compilation with Assertions**: assertion-driven example bootstrapping and counterexample bootstrapping during compilation. The teacher has assertions while the student does not, since the student learns from the teacher's assertion-driven bootstrapped examples.

**Compilation + Inference with Assertions**: assertion-driven optimizations for both the teacher and student to offer enhanced assertion-driven outputs during both compilation and inference.
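
Conceptually, the optimizer does two things: it collects teacher outputs that pass the metric as candidate demonstrations, and it then searches over random subsets of those demonstrations for the best-scoring one. The sketch below is a heavily simplified, plain-Python illustration of that idea, not the library's implementation; `bootstrap_demos`, `random_search`, and the toy teacher, metric, and scoring function are all hypothetical.

```python
import random

def bootstrap_demos(teacher, trainset, metric):
    """Keep only the teacher outputs that pass the metric as candidate demonstrations."""
    demos = []
    for example in trainset:
        prediction = teacher(example)
        if metric(example, prediction):
            demos.append((example, prediction))
    return demos

def random_search(demos, score_fn, num_candidates=6, demos_per_candidate=2, seed=0):
    """Sample demo subsets and return the best-scoring subset with its score."""
    rng = random.Random(seed)
    best_subset, best_score = [], float("-inf")
    for _ in range(num_candidates):
        k = min(demos_per_candidate, len(demos))
        subset = rng.sample(demos, k)
        score = score_fn(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score

# Toy stand-ins: examples are integers, and the "teacher" is right only for even inputs.
trainset = list(range(10))
teacher = lambda x: x * 2 if x % 2 == 0 else -1
metric = lambda x, pred: pred == x * 2

demos = bootstrap_demos(teacher, trainset, metric)
# Hypothetical validation score: here simply the sum of the demo inputs.
best_subset, best_score = random_search(demos, lambda subset: sum(x for x, _ in subset))
print(len(demos), best_subset, best_score)
```

Under this view, a teacher with assertions is more likely to produce outputs that pass the metric, which gives the search a larger pool of demonstrations to choose from.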

In [None]:
teleprompter = BootstrapFewShotWithRandomSearch(metric = overall_metric, max_bootstrapped_demos=2, num_candidate_programs=6)
compiled_with_assertions_quiz_generator = teleprompter.compile(student=quiz_generator, teacher = quiz_generator_with_assertions, trainset=trainset, valset=devset[:25])


for metric in metrics:
    evaluate = Evaluate(metric=metric, devset=devset, num_threads=1, display_progress=True, display_table=5)
    evaluate(compiled_with_assertions_quiz_generator)

In [None]:
teleprompter = BootstrapFewShotWithRandomSearch(metric = overall_metric, max_bootstrapped_demos=2, num_candidate_programs=6)
compiled_quiz_generator_with_assertions = teleprompter.compile(student=quiz_generator_with_assertions, teacher = quiz_generator_with_assertions, trainset=trainset, valset=devset[:25])

for metric in metrics:
    evaluate = Evaluate(metric=metric, devset=devset, num_threads=1, display_progress=True, display_table=5)
    evaluate(compiled_quiz_generator_with_assertions)