# Solving BIG-Bench Hard Tasks Using DSPy and Weave

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/soumik12345/prompt-engineering-recipes/blob/main/notebooks/dspy/00_big_bench.ipynb)

This notebook demonstrates how we can solve the causal judgement task from the [BIG-bench Hard](https://github.com/suzgunmirac/BIG-Bench-Hard?tab=readme-ov-file) benchmark by optimizing our prompting strategy using [DSPy](https://dspy-docs.vercel.app) and evaluating our prompts using [Weave](https://wandb.me/weave).

In [None]:
!pip install -qU dspy weave datasets rich

In [None]:
import os
from getpass import getpass

api_key = getpass("Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = api_key

In [None]:
import rich
import weave
from datasets import load_dataset
from pydantic import BaseModel, Field

import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

## Load the BIG-Bench Hard Dataset

[BIG-bench](https://github.com/google/BIG-bench) (Beyond the Imitation Game Benchmark) is a collaborative benchmark of more than 200 tasks, intended to probe large language models and extrapolate their future capabilities. [BIG-bench Hard](https://github.com/suzgunmirac/BIG-Bench-Hard) is a suite of 23 challenging BIG-bench tasks on which prior language model evaluations did not outperform the average human rater.

We're going to load this dataset from the Hugging Face Hub, split it into training and validation sets, and [publish](https://wandb.github.io/weave/guides/core-types/datasets) both on Weave. This lets us version the datasets and use [`weave.Evaluation`](https://wandb.github.io/weave/guides/core-types/evaluations/) to evaluate our prompting strategy.

In [None]:
def get_dataset(task: str = "causal_judgement", num_train_examples: int = 50):
    # load the BIG-Bench Hard dataset corresponding to the task from the Hugging Face Hub
    dataset = load_dataset("maveriq/bigbenchhard", task)["train"]
    
    # create the training and validation datasets
    rows = [{"question": data["input"], "answer": data["target"]} for data in dataset]
    train_rows = rows[0:num_train_examples]
    val_rows = rows[num_train_examples:]

    # create the training and validation examples consisting of `dspy.Example` objects
    dspy_train_examples = [dspy.Example(row).with_inputs("question") for row in train_rows]
    dspy_val_examples = [dspy.Example(row).with_inputs("question") for row in val_rows]

    # publish the datasets to Weave; this lets us version the data and use it for evaluation
    weave.publish(weave.Dataset(name=f"bigbenchhard_{task}_train", rows=train_rows))
    weave.publish(weave.Dataset(name=f"bigbenchhard_{task}_val", rows=val_rows))
    
    return dspy_train_examples, dspy_val_examples
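
Note that `get_dataset` splits by position: the first `num_train_examples` rows become the training set and everything after them becomes the validation set, with no shuffling. A minimal, dependency-free sketch of that slicing on made-up rows follows; the `split_rows` helper and the toy data are illustrative assumptions, not part of the notebook.

```python
def split_rows(rows, num_train_examples):
    """Positional split: first N rows for training, the rest for validation."""
    return rows[:num_train_examples], rows[num_train_examples:]


# Toy rows standing in for the BIG-Bench Hard records (hypothetical data).
toy_rows = [{"question": f"q{i}", "answer": "Yes" if i % 2 else "No"} for i in range(5)]
toy_train, toy_val = split_rows(toy_rows, num_train_examples=3)

print(len(toy_train), len(toy_val))
```

If the source rows happen to be ordered in some meaningful way, shuffling them with a fixed seed before slicing may give a more representative split.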

Weave is integrated with DSPy: calling `weave.init` at the start of our code automatically traces our DSPy functions, and the traces can be explored in the Weave UI. Check out the Weave integration docs for DSPy to learn more: https://wandb.github.io/weave/guides/integrations/dspy

In [None]:
weave.init(project_name="dspy-bigbenchhard")

dspy_train_examples, dspy_val_examples = get_dataset()

## The DSPy Program

We're going to use the `dspy.OpenAI` abstraction to make LLM calls to GPT-3.5 Turbo.

In [None]:
system_prompt = """
You are an expert in the field of causal reasoning. You are to analyze the given question carefully and answer with `Yes` or `No`.
You should also provide a detailed explanation justifying your answer.
"""

llm = dspy.OpenAI(model="gpt-3.5-turbo", system_prompt=system_prompt)
dspy.settings.configure(lm=llm)

DSPy is a framework that pushes building new LM pipelines away from manipulating free-form strings and closer to programming (composing modular operators to build text transformation graphs) where a compiler automatically generates optimized LM invocation strategies and prompts from a program.

According to the DSPy programming model, string-based prompting techniques are first translated into declarative modules that carry natural-language typed signatures. Each module is then parameterized so that it can learn its desired behavior by iteratively bootstrapping useful demonstrations within the pipeline.

In [None]:
class Input(BaseModel):
    query: str = Field(description="The question to be answered")


class Output(BaseModel):
    answer: str = Field(description="The answer for the question")
    confidence: float = Field(ge=0, le=1, description="The confidence score for the answer")
    explanation: str = Field(description="The explanation for the answer")


class QuestionAnswerSignature(dspy.Signature):
    input: Input = dspy.InputField()
    output: Output = dspy.OutputField()

DSPy modules are task-adaptive components—akin to neural network layers—that abstract any particular text transformation, in this case returning a structured question-answering result by executing causal reasoning.

In [None]:
class QAModule(dspy.Module):
    
    def __init__(self):
        # initialize the base dspy.Module before registering sub-modules
        super().__init__()
        self.prog = dspy.TypedPredictor(QuestionAnswerSignature)
    
    @weave.op()
    def forward(self, question) -> dict:
        return self.prog(input=Input(query=question)).output.dict()

In [None]:
model = QAModule()

prediction = model(dspy_train_examples[0]["question"])
rich.print(prediction)

## Evaluating our DSPy Program

Now that we have a baseline prompting strategy, let's evaluate it on our validation set using [`weave.Evaluation`](https://wandb.github.io/weave/guides/core-types/evaluations/) with a simple metric that checks whether the predicted answer matches the ground truth. Weave takes each example, passes it through our application, and scores the output with one or more custom scoring functions. This gives us a view of the application's performance and a rich UI to drill into individual outputs and scores.

In [None]:
@weave.op()
def weave_evaluation_metric(answer: str, model_output: dict) -> dict:
    return {
        "match": int(answer.lower() == model_output["answer"].lower())
    }
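
Exact matching after lowercasing is strict: a prediction such as `Yes.` or ` yes` would score 0 against `Yes`. If that turns out to be a problem, one option is to normalize both strings first. The sketch below is an assumption about what such a metric could look like, not part of the original notebook; `normalize_answer` and `lenient_match` are hypothetical names, and the function would still need the `@weave.op()` decorator to be used as a scorer.

```python
import string


def normalize_answer(text: str) -> str:
    """Lowercase and strip surrounding whitespace and punctuation."""
    return text.strip().strip(string.punctuation).strip().lower()


def lenient_match(answer: str, model_output: dict) -> dict:
    """Same return shape as `weave_evaluation_metric`, with normalization."""
    return {
        "match": int(normalize_answer(answer) == normalize_answer(model_output["answer"]))
    }


print(lenient_match("Yes", {"answer": " yes. "}))
```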

In [None]:
evaluation = weave.Evaluation(
    name="Naive QAModule",
    dataset=weave.ref("bigbenchhard_causal_judgement_val:v0").get(),
    scorers=[weave_evaluation_metric]
)
await evaluation.evaluate(model.forward)
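
Because the scorer returns `1` or `0` per example, the headline number to look at is the mean of `match` over the validation set, which is simply accuracy. As a sketch of that arithmetic (the `accuracy` helper and the scores are made up for illustration; Weave computes its own summary):

```python
def accuracy(score_dicts):
    """Mean of the per-example `match` values (0 or 1)."""
    if not score_dicts:
        raise ValueError("need at least one scored example")
    return sum(s["match"] for s in score_dicts) / len(score_dicts)


# Hypothetical per-example scorer outputs, not real results.
toy_scores = [{"match": 1}, {"match": 0}, {"match": 1}, {"match": 1}]
print(accuracy(toy_scores))  # 0.75
```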

## Optimizing our DSPy Program

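When compiling a DSPy program, we generally invoke a teleprompter: an optimizer that takes the program, a training set, and a metric, and returns a new optimized program. In this example, we use the `BootstrapFewShotWithRandomSearch` teleprompter, which builds on [`BootstrapFewShot`](https://dspy-docs.vercel.app/docs/deep-dive/teleprompter/bootstrap-fewshot#how-bootstrapfewshot-works) by trying several candidate sets of bootstrapped demonstrations and keeping the program that scores best on the validation examples.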

In [None]:
@weave.op()
def get_optimized_program(model: dspy.Module) -> dspy.Module:
    
    @weave.op()
    def dspy_evaluation_metric(true, prediction, trace=None):
        return prediction["answer"].lower() == true.answer.lower()


    teleprompter = BootstrapFewShotWithRandomSearch(
        metric=dspy_evaluation_metric, 
        max_bootstrapped_demos=8, 
        max_labeled_demos=8,
    )
    return teleprompter.compile(
        model, trainset=dspy_train_examples, valset=dspy_val_examples[:10]
    )


optimized_model = get_optimized_program(model)
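
Conceptually, bootstrapping few-shot demonstrations means running the program on training examples and keeping only the outputs the metric accepts, up to a cap, so they can be reused as in-prompt demonstrations. The following is a deliberately simplified, dependency-free illustration of that idea. It is not DSPy's implementation, and `bootstrap_demos`, `toy_program`, `exact_match`, and the toy data are all hypothetical.

```python
def bootstrap_demos(program, trainset, metric, max_demos):
    """Collect up to `max_demos` demonstrations whose prediction passes `metric`."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        prediction = program(example["question"])
        if metric(example, prediction):
            demos.append({"question": example["question"], **prediction})
    return demos


def toy_program(question):
    # Stand-in for an LM call: always answers "Yes".
    return {"answer": "Yes"}


def exact_match(example, prediction):
    # Case-insensitive comparison, mirroring the metric used in this notebook.
    return prediction["answer"].lower() == example["answer"].lower()


toy_trainset = [
    {"question": "q0", "answer": "Yes"},
    {"question": "q1", "answer": "No"},
    {"question": "q2", "answer": "yes"},
    {"question": "q3", "answer": "Yes"},
]
demos = bootstrap_demos(toy_program, toy_trainset, exact_match, max_demos=2)
print([d["question"] for d in demos])
```

In this sketch, `max_demos` loosely plays the role that `max_bootstrapped_demos=8` plays in the cell above.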

Now that we have our optimized program (the optimized prompting strategy), let's evaluate it once again on our validation set and compare it with our baseline DSPy program.

In [None]:
evaluation = weave.Evaluation(
    name="Optimized QAModule",
    dataset=weave.ref("bigbenchhard_causal_judgement_val:v0").get(),
    scorers=[weave_evaluation_metric]
)
await evaluation.evaluate(optimized_model.forward)