# Excelling at BIG-Bench Hard Tasks Using DSPy and Weave

[BIG-bench (Beyond the Imitation Game Benchmark)](https://github.com/google/BIG-bench) is a collaborative benchmark of more than 200 tasks, intended to probe large language models and extrapolate their future capabilities. [BIG-Bench Hard (BBH)](https://github.com/suzgunmirac/BIG-Bench-Hard) is a suite of the 23 most challenging BIG-Bench tasks, which the current generation of language models finds difficult to solve.

This tutorial demonstrates how to improve the performance of an LLM workflow on the **causal judgement task** from the BIG-Bench Hard benchmark and how to evaluate our prompting strategies. We will use [DSPy](https://dspy-docs.vercel.app/) to implement the LLM workflow and optimize our prompting strategy, and [Weave](../../quickstart.md) to track the workflow and evaluate the prompting strategies.

## Installing the Dependencies

We need the following libraries for this tutorial:

- [DSPy](https://dspy-docs.vercel.app/) for building the LLM workflow and optimizing it.
- [Weave](../../quickstart.md) to track our LLM workflow and evaluate our prompting strategies.
- [datasets](https://huggingface.co/docs/datasets/index) to access the Big-Bench Hard dataset from HuggingFace Hub.

In [None]:
!pip install -qU dspy-ai weave datasets

Since we'll be using the [OpenAI API](https://openai.com/index/openai-api/) as our LLM vendor, we also need an OpenAI API key. You can [sign up](https://platform.openai.com/signup) on the OpenAI platform to get your own API key.

In [None]:
import os
from getpass import getpass

api_key = getpass("Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = api_key

## Enable Tracking using Weave

Weave is currently integrated with DSPy. Including [`weave.init`](../../api-reference/python/weave.md#function-init) at the start of our code lets us automatically trace our DSPy functions, and the traces can then be explored in the Weave UI. Check out the [Weave integration docs for DSPy](../../guides/integrations/dspy.md) to learn more.

In [None]:
import weave

weave.init(project_name="dspy-bigbench-hard")

In this tutorial, we use a metadata class inherited from [`weave.Model`](../../guides/core-types/models.md) to manage our metadata.

In [None]:
class Metadata(weave.Model): 
    big_bench_hard_task: str = "causal_judgement"
    num_train_examples: int = 50
    openai_model: str = "gpt-3.5-turbo"
    max_bootstrapped_demos: int = 8
    max_labeled_demos: int = 8
    validation_size_for_optimization: int = 10


metadata = Metadata()

## Load the BIG-Bench Hard Dataset

We'll load this dataset from the Hugging Face Hub, split it into training and validation sets, and [publish](../../guides/core-types/datasets.md) both on Weave. Publishing lets us version the datasets and use [`weave.Evaluation`](../../guides/core-types/evaluations.md) to evaluate our prompting strategy.

In [None]:
import dspy
from datasets import load_dataset


@weave.op()
def get_dataset(metadata: Metadata):
    # load the BIG-Bench Hard dataset corresponding to the task from the Hugging Face Hub
    dataset = load_dataset("maveriq/bigbenchhard", metadata.big_bench_hard_task)["train"]
    
    # create the training and validation datasets
    rows = [{"question": data["input"], "answer": data["target"]} for data in dataset]
    train_rows = rows[0:metadata.num_train_examples]
    val_rows = rows[metadata.num_train_examples:]

    # create the training and validation examples consisting of `dspy.Example` objects
    dspy_train_examples = [dspy.Example(row).with_inputs("question") for row in train_rows]
    dspy_val_examples = [dspy.Example(row).with_inputs("question") for row in val_rows]

    # publish the datasets to Weave; this lets us version the data and use it for evaluation
    weave.publish(weave.Dataset(name=f"bigbenchhard_{metadata.big_bench_hard_task}_train", rows=train_rows))
    weave.publish(weave.Dataset(name=f"bigbenchhard_{metadata.big_bench_hard_task}_val", rows=val_rows))
    
    return dspy_train_examples, dspy_val_examples


dspy_train_examples, dspy_val_examples = get_dataset(metadata)

| ![](../assets/dspy_prompt_optimization/datasets.gif) |
|---|
| The datasets, once published, can be explored in the Weave UI |
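
The split inside `get_dataset` is plain list slicing, so it can be checked without downloading anything. The sketch below uses made-up placeholder rows (not real BBH data) and a hypothetical helper, `split_rows`, to show how `num_train_examples` determines the sizes of the two sets.

```python
def split_rows(rows, num_train_examples):
    """Mirror the slicing in get_dataset: the first N rows train, the rest validate."""
    train_rows = rows[:num_train_examples]
    val_rows = rows[num_train_examples:]
    return train_rows, val_rows


# Placeholder rows with the same keys the tutorial builds; the content is invented.
toy_rows = [
    {"question": f"question {i}", "answer": "Yes" if i % 2 else "No"}
    for i in range(8)
]

train_rows, val_rows = split_rows(toy_rows, num_train_examples=5)
print(len(train_rows), len(val_rows))
```

Because the slice is positional, the split is deterministic but not shuffled. If the source rows happened to be ordered by label, shuffling with a fixed seed before slicing would be worth considering.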

## The DSPy Program

We'll use the [`dspy.OpenAI`](https://dspy-docs.vercel.app/api/language_model_clients/OpenAI) abstraction to make LLM calls to [GPT-3.5 Turbo](https://platform.openai.com/docs/models/gpt-3-5-turbo).

In [None]:
system_prompt = """
You are an expert in the field of causal reasoning and judgement.
You are to analyze the given question carefully and answer in `Yes` or `No`.
You should also provide a detailed explanation justifying your answer.
"""

llm = dspy.OpenAI(model=metadata.openai_model, system_prompt=system_prompt)
dspy.settings.configure(lm=llm)

DSPy is a framework that pushes building new LM pipelines away from manipulating free-form strings and closer to programming (composing modular operators to build text transformation graphs) where a compiler automatically generates optimized LM invocation strategies and prompts from a program.

In the DSPy programming model, string-based prompting techniques are first translated into declarative modules that carry natural-language typed signatures. Each module is then parameterized so that it can learn its desired behavior by iteratively bootstrapping useful demonstrations within the pipeline.

Check the following papers to learn more about the DSPy paradigm:

- [DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines](https://arxiv.org/abs/2310.03714)
- [DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines](https://arxiv.org/abs/2312.13382)
- [Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs](https://arxiv.org/abs/2406.11695v1)

### Writing the Causal Reasoning Signature

A [signature](https://dspy-docs.vercel.app/docs/building-blocks/signatures) is a declarative specification of input/output behavior of a [DSPy module](https://dspy-docs.vercel.app/docs/building-blocks/modules). Signatures allow you to tell the LM what it needs to do, rather than specify how we should ask the LM to do it. This enables us to organize our prompting strategy using modular and clean code, in which LM calls can be optimized into high-quality prompts (or automatic finetunes).

In [None]:
from pydantic import BaseModel, Field


class CausalReasoningInput(BaseModel):
    query: str = Field(description="The question to be answered")


class CausalReasoningOutput(BaseModel):
    answer: str = Field(description="The answer for the question")
    confidence: float = Field(ge=0, le=1, description="The confidence score for the answer")
    explanation: str = Field(description="The explanation for the answer")


class CausalReasoningSignature(dspy.Signature):
    input: CausalReasoningInput = dspy.InputField()
    output: CausalReasoningOutput = dspy.OutputField()

[DSPy modules](https://dspy-docs.vercel.app/docs/building-blocks/modules) are task-adaptive components—akin to neural network layers—that abstract any particular text transformation, in this case returning a structured question-answering result by executing causal reasoning.

In [None]:
class CausalReasoningModule(dspy.Module):
    
    def __init__(self):
        # initialize the base dspy.Module before registering the predictor
        super().__init__()
        self.prog = dspy.TypedPredictor(CausalReasoningSignature)
    
    @weave.op()
    def forward(self, question) -> dict:
        program_input = CausalReasoningInput(query=question)
        return self.prog(input=program_input).output.dict()

Note that we use [`dspy.TypedPredictor`](https://dspy-docs.vercel.app/docs/building-blocks/typed_predictors), which enforces the type constraints on the input and output fields of a DSPy signature. As a result, we always get the output as a `pydantic.BaseModel` that structures the result of the LLM workflow according to a fixed schema, making it easy to parse.
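
To make the idea of a fixed output schema concrete, here is a minimal sketch of the kind of check a typed output implies. It uses only the standard library rather than pydantic or DSPy. The class `CausalOutputSketch`, the helper `parse_causal_output`, and the sample JSON string are hypothetical, and the sketch does not reflect how `dspy.TypedPredictor` works internally.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CausalOutputSketch:
    # Same three fields as CausalReasoningOutput in the tutorial.
    answer: str
    confidence: float
    explanation: str


def parse_causal_output(raw: str) -> dict:
    """Parse a JSON string and enforce the constraints declared on the schema."""
    data = json.loads(raw)
    parsed = CausalOutputSketch(
        answer=str(data["answer"]),
        confidence=float(data["confidence"]),
        explanation=str(data["explanation"]),
    )
    # Stand-in for Field(ge=0, le=1) on the confidence score.
    if not 0.0 <= parsed.confidence <= 1.0:
        raise ValueError(f"confidence out of range: {parsed.confidence}")
    return asdict(parsed)


# Invented model response, used only to exercise the parser.
sample = '{"answer": "Yes", "confidence": 0.8, "explanation": "The action was necessary for the outcome."}'
result = parse_causal_output(sample)
print(result["answer"], result["confidence"])
```

A response with a missing key or an out-of-range confidence raises an exception here instead of flowing silently into the evaluation, which is the practical benefit of a typed output.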

Let's test our LLM workflow, i.e., the `CausalReasoningModule` on an example from the causal reasoning subset of Big-Bench Hard.

In [None]:
import rich

baseline_module = CausalReasoningModule()

prediction = baseline_module(dspy_train_examples[0]["question"])
rich.print(prediction)

## Evaluating our DSPy Program

Now that we have a baseline prompting strategy, let's evaluate it on our validation set using [`weave.Evaluation`](../../guides/core-types/evaluations.md) on a simple metric that matches the predicted answer with the ground truth. Weave will take each example, pass it through your application and score the output on multiple custom scoring functions. By doing this, you'll have a view of the performance of your application, and a rich UI to drill into individual outputs and scores.

First, we need a simple evaluation metric function that tells us whether the answer in the baseline module's output matches the ground truth answer.

In [None]:
@weave.op()
def weave_evaluation_metric(answer: str, model_output: dict) -> dict:
    # `model_output` is the dict returned by `CausalReasoningModule.forward`
    return {
        "match": int(answer.lower() == model_output["answer"].lower())
    }
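
Exact string comparison treats `"Yes."` or `" yes "` as a mismatch with `"Yes"`. If the model's answers turn out to vary in formatting, a more forgiving comparison may be worth trying. The following is only a sketch: `normalize_answer` and `lenient_match` are hypothetical helpers, not part of Weave or DSPy, and the sample outputs are invented.

```python
import string


def normalize_answer(text: str) -> str:
    """Lowercase the text and strip surrounding whitespace and punctuation."""
    return text.strip().lower().strip(string.punctuation + string.whitespace)


def lenient_match(answer: str, model_output: dict) -> dict:
    """Return the same shape as weave_evaluation_metric, after normalization."""
    is_match = normalize_answer(answer) == normalize_answer(model_output["answer"])
    return {"match": int(is_match)}


print(lenient_match("Yes", {"answer": " yes. "}))
print(lenient_match("No", {"answer": "Yes"}))
```

To use such a function as a Weave scorer, it would need the `@weave.op()` decorator, in the same way as `weave_evaluation_metric`.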

In [None]:
# Define the evaluation and run it
evaluation = weave.Evaluation(
    name="baseline_causal_reasoning_module",
    dataset=weave.ref("bigbenchhard_causal_judgement_val:v0").get(),
    scorers=[weave_evaluation_metric]
)

await evaluation.evaluate(baseline_module.forward)

If you're running from a Python script, you can use the following code to run the evaluation:

```python
import asyncio
asyncio.run(evaluation.evaluate(baseline_module.forward))
```

## Optimizing our DSPy Program

Now that we have a baseline DSPy program, let's try to improve its performance on causal reasoning using a [DSPy teleprompter](https://dspy-docs.vercel.app/docs/building-blocks/optimizers). A teleprompter is an optimizer: an algorithm that tunes the parameters of a DSPy program (i.e., the prompts and/or the LM weights) to maximize the metrics you specify, such as accuracy. Compiling a DSPy program generally means invoking a teleprompter, which takes the program, a training set, and a metric, and returns a new optimized program. In this tutorial, we use the [`BootstrapFewShotWithRandomSearch`](https://dspy-docs.vercel.app/api/category/optimizers) teleprompter.

In [None]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch


@weave.op()
def get_optimized_program(model: dspy.Module, metadata: Metadata) -> dspy.Module:
    
    @weave.op()
    def dspy_evaluation_metric(true, prediction, trace=None):
        return prediction["answer"].lower() == true.answer.lower()


    teleprompter = BootstrapFewShotWithRandomSearch(
        metric=dspy_evaluation_metric, 
        max_bootstrapped_demos=metadata.max_bootstrapped_demos,
        max_labeled_demos=metadata.max_labeled_demos,
    )
    valset = dspy_val_examples[:metadata.validation_size_for_optimization]
    return teleprompter.compile(model, trainset=dspy_train_examples, valset=valset)


optimized_module = get_optimized_program(baseline_module, metadata)
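
Roughly speaking, a bootstrap few-shot optimizer with random search runs the program over training examples, keeps the input/output pairs that pass the metric as candidate demonstrations, and then tries several random subsets of them, keeping the subset that scores best on the validation set. The toy sketch below illustrates that loop with a stand-in "program" and invented data. It is not DSPy's implementation, and every name in it (`toy_program`, `bootstrap_demos`, `random_search`) is hypothetical.

```python
import random


def toy_program(question, demos):
    """Stand-in for an LM call: copy the answer of a demo sharing the first word."""
    key = question.split()[0]
    for demo in demos:
        if demo["question"].split()[0] == key:
            return demo["answer"]
    return "No"  # default guess when no demo is relevant


def metric(example, prediction):
    return prediction.lower() == example["answer"].lower()


def bootstrap_demos(trainset, seed_demos):
    """Keep only the training examples whose prediction passes the metric."""
    kept = []
    for example in trainset:
        prediction = toy_program(example["question"], seed_demos)
        if metric(example, prediction):
            kept.append({"question": example["question"], "answer": prediction})
    return kept


def random_search(candidates, valset, num_demos, num_trials, seed=0):
    """Try random demo subsets and keep the one with the best validation score."""
    rng = random.Random(seed)
    best_demos, best_score = [], -1.0
    for _ in range(num_trials):
        demos = rng.sample(candidates, k=min(num_demos, len(candidates)))
        hits = sum(metric(ex, toy_program(ex["question"], demos)) for ex in valset)
        score = hits / len(valset)
        if score > best_score:
            best_demos, best_score = demos, score
    return best_demos, best_score


# Invented data; the last training row is one the toy program gets wrong.
trainset = [
    {"question": "deliberate push", "answer": "Yes"},
    {"question": "accidental slip", "answer": "No"},
    {"question": "deliberate shove", "answer": "Yes"},
    {"question": "accidental bump", "answer": "No"},
    {"question": "deliberate nudge", "answer": "No"},
]
valset = [
    {"question": "deliberate kick", "answer": "Yes"},
    {"question": "accidental trip", "answer": "No"},
]

candidates = bootstrap_demos(trainset, seed_demos=trainset[:2])
best_demos, best_score = random_search(candidates, valset, num_demos=2, num_trials=10)
print(len(candidates), best_score)
```

In this toy, `num_demos` plays a role loosely analogous to `max_bootstrapped_demos`, and the example that fails the metric is filtered out before the search begins.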

Now that we have our optimized program (the optimized prompting strategy), let's evaluate it once again on our validation set and compare it with our baseline DSPy program.

In [None]:
evaluation = weave.Evaluation(
    name="optimized_causal_reasoning_module",
    dataset=weave.ref("bigbenchhard_causal_judgement_val:v0").get(),
    scorers=[weave_evaluation_metric]
)
await evaluation.evaluate(optimized_module.forward)