In [None]:
%env SERPER_API_KEY=YOUR_SERPER_API_KEY
%env OPENAI_API_KEY=YOUR_OPENAI_API_KEY

## Setting Up

The aim of this notebook is to showcase how to build agents over LangChain tools with the `Avatar` module, and how to optimize the actor with the `AvatarOptimizer` optimizer for a given toolset on each dataset. We'll be testing our module on 2 datasets:

* ArxivQA
* SearchQA

Before loading our datasets and going to the execution part, we'll need to configure the `lm` in `dspy.settings`. For the purpose of this notebook we'll be using `gpt-4o-mini`.

In [None]:
import os
import dspy

dspy.settings.configure(
    lm=dspy.OpenAI(
        model="gpt-4o-mini",
        api_key=os.getenv("OPENAI_API_KEY"),
        max_tokens=4000,
        temperature=0,
    )
)

## Defining Signature

Across both datasets the problem is essentially question answering, so we'll create two similar signatures, `SearchQASignature` and `ArxivQASignature`. The only difference between them is that `ArxivQASignature` also takes `paper_id` as input, mainly for the Arxiv API tool.

In [None]:
class SearchQASignature(dspy.Signature):
    """You will be given a question. Your task is to answer the question."""
    
    question: str = dspy.InputField(
        prefix="Question:",
        desc="question to ask",
        format=lambda x: x.strip(),
    )
    answer: str = dspy.OutputField(
        prefix="Answer:",
        desc="answer to the question",
    )

class ArxivQASignature(dspy.Signature):
    """You will be given a question and an Arxiv Paper ID. Your task is to answer the question."""
    
    question: str = dspy.InputField(
        prefix="Question:",
        desc="question to ask",
        format=lambda x: x.strip(),
    )
    paper_id: str = dspy.InputField(
        prefix="Paper ID:",
        desc="Arxiv Paper ID",
    )
    answer: str = dspy.OutputField(
        prefix="Answer:",
        desc="answer to the question",
    )

## Loading Datasets

We'll load two datasets to evaluate our model on: `searchqa` and `arxiv_qa`. We can use the DSPy `DataLoader` to load these datasets from HuggingFace into the DSPy-friendly format of a list of `Example` objects.

In [None]:
from random import sample
from dspy.datasets import DataLoader

dl = DataLoader()

In [None]:
searchqa = dl.from_huggingface(
    "lucadiliello/searchqa",
    split="train",
    input_keys=("question",),
)

arxiv_qa = dl.from_huggingface(
    "taesiri/arxiv_qa",
    split="train",
    input_keys=("question", "paper_id"),
)

For demonstration purposes we'll operate on a subset of each dataset: 200 examples for the training set and 100 examples for the test set.

In [None]:
import random

# Set a random seed for reproducibility
random.seed(42)


# Draw 300 distinct examples per dataset and split them 200/100, so that the
# training and test sets never share an example.
sqa_sample = sample(searchqa, 300)
sqa_train = [
    dspy.Example(question=example.question, answer=",".join(example.answers)).with_inputs("question")
    for example in sqa_sample[:200]
]
sqa_test = [
    dspy.Example(question=example.question, answer=",".join(example.answers)).with_inputs("question")
    for example in sqa_sample[200:]
]

aqa_sample = sample(arxiv_qa, 300)
aqa_train = [
    dspy.Example(question=example.question, paper_id=example.paper_id, answer=example.answer).with_inputs("question", "paper_id")
    for example in aqa_sample[:200]
]
aqa_test = [
    dspy.Example(question=example.question, paper_id=example.paper_id, answer=example.answer).with_inputs("question", "paper_id")
    for example in aqa_sample[200:]
]

## Setting Up Tools

We'll set up `Avatar` modules for both signatures; every tool in `tools` is available to each dataset, i.e. `searchqa` and `arxiv_qa`. `Tool` is the pydantic model that `Avatar` expects each entry of `tools` to be. The tools below set the following fields:

* `name` : Name of the tool
* `desc` : Description that tells the agent when to use the tool
* `input_type` : Type of input the tool accepts (optional)
* `tool` : The actual tool object

In [None]:
from dspy.predict.avatar import Tool, Avatar
from langchain_community.utilities import GoogleSerperAPIWrapper, ArxivAPIWrapper

tools = [
    Tool(
        tool=GoogleSerperAPIWrapper(),
        name="WEB_SEARCH",
        desc="If you have a question, you can use this tool to search the web for the answer."
    ),
    Tool(
        tool=ArxivAPIWrapper(),
        name="ARXIV_SEARCH",
        desc="Pass the arxiv paper id to get the paper information.",
        input_type="Arxiv Paper ID",
    ),
]

Once we have defined our `tools`, we can create an `Avatar` object by passing it the `tools` and the `signature`. It takes 2 more optional parameters, `verbose` and `max_iters`: `verbose` controls whether logs are displayed, and `max_iters` caps the number of iterations in multi-step execution.

An Avatar agent stops iterating over tools once it reaches `max_iters` or when it outputs `Finish`. You can also create custom tools; all you need to make sure is that the tool:

* Is passed as a class object.
* Implements the `__init__` and `run` methods.
* Takes 1 string as input and returns 1 string as output.

If your tool doesn't take a string as input or doesn't return one, you can write a custom wrapper to handle the conversion for now. In the future we'll try to enable more diverse tool usage.
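The wrapper idea above can be sketched as a small adapter. This is only an illustration: `StringToolWrapper`, `parse_input`, and `lookup_squares` are hypothetical names that are not part of DSPy or LangChain, and serializing the result as JSON is just one possible choice.

```python
import json
from typing import Any, Callable


class StringToolWrapper:
    """Hypothetical adapter that exposes `run(str) -> str` around any callable."""

    def __init__(self, func: Callable[[Any], Any], parse_input: Callable[[str], Any] = str):
        self.func = func                # the underlying non-string tool
        self.parse_input = parse_input  # converts the agent's string into the tool's input

    def run(self, query: str) -> str:
        result = self.func(self.parse_input(query))
        # Pass strings through untouched; serialize everything else to JSON.
        if isinstance(result, str):
            return result
        return json.dumps(result, default=str)


def lookup_squares(n: int) -> dict:
    """Toy tool that takes an int and returns a dict (neither is a string)."""
    return {"n": n, "square": n * n}


square_tool = StringToolWrapper(lookup_squares, parse_input=int)
print(square_tool.run("4"))
```

An instance like `square_tool` satisfies the three requirements listed above, so it could be handed to `Tool(tool=square_tool, name=..., desc=...)` in the same way as the LangChain wrappers.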

In [None]:
arxiv_agent = Avatar(
    tools=tools,
    signature=ArxivQASignature,
    verbose=True,
)

search_agent = Avatar(
    tools=tools,
    signature=SearchQASignature,
    verbose=True,
)
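For intuition about `max_iters` and `Finish`, the stopping rule described above can be sketched in plain Python. This is not Avatar's actual implementation: `run_agent_loop`, `toy_policy`, and `never_finish` are hypothetical stand-ins, the policy functions replace the LLM-driven actor, and the default of 3 iterations is arbitrary.

```python
from typing import Callable, Dict, List, Tuple

FINISH = "Finish"


def run_agent_loop(
    choose_action: Callable[[str, List[Tuple[str, str]]], Tuple[str, str]],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_iters: int = 3,
) -> List[Tuple[str, str]]:
    """Call tools until the policy picks Finish or max_iters is reached."""
    history: List[Tuple[str, str]] = []
    for _ in range(max_iters):
        tool_name, tool_input = choose_action(question, history)
        if tool_name == FINISH:
            break
        # Record which tool ran and what it returned.
        history.append((tool_name, tools[tool_name](tool_input)))
    return history


def toy_policy(question, history):
    """Search once, then finish. A real actor would be an LLM call."""
    return (FINISH, "") if history else ("WEB_SEARCH", question)


def never_finish(question, history):
    """Always asks for another search, so only max_iters stops the loop."""
    return ("WEB_SEARCH", question)


toy_tools = {"WEB_SEARCH": lambda q: f"results for {q}"}

print(run_agent_loop(toy_policy, toy_tools, "q"))
print(len(run_agent_loop(never_finish, toy_tools, "q", max_iters=3)))
```

The first call ends because the policy chooses `Finish`; the second ends only because the iteration cap is hit.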

## Evaluation

Open-ended QA tasks are hard to evaluate with rigid metrics like exact match, so we'll use an improvised LLM-as-a-judge to evaluate our model on the test set.
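To see why exact match is too rigid here, consider this minimal sketch. The `normalize` and `exact_match` helpers are illustrative only and are not part of DSPy; the normalization rules (lowercasing, dropping punctuation and articles) are an assumed, commonly used convention.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True only if the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)


reference = "Marie Curie"
short_answer = "marie curie."
verbose_answer = "The answer is Marie Curie, who won two Nobel Prizes."

print(exact_match(short_answer, reference))    # True
print(exact_match(verbose_answer, reference))  # False, although the answer is correct
```

An agent that answers in full sentences would be penalized by such a metric even when it is right, which is why a judge model is used instead.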

In [None]:
class Evaluator(dspy.Signature):
    """Please act as an impartial judge and evaluate the quality of the responses provided by multiple AI assistants to the user question displayed below. You should choose the assistant that offers a better user experience by interacting with the user more effectively and efficiently, and providing a correct final response to the user's question.
    
Rules:
1. Avoid Position Biases: Ensure that the order in which the responses were presented does not influence your decision. Evaluate each response on its own merits.
2. Length of Responses: Do not let the length of the responses affect your evaluation. Focus on the quality and relevance of the response. A good response is targeted and addresses the user's needs effectively, rather than simply being detailed.
3. Objectivity: Be as objective as possible. Consider the user's perspective and overall experience with each assistant."""
    
    question: str = dspy.InputField(
        prefix="Question:",
        desc="question to ask",
    )
    reference_answer: str = dspy.InputField(
        prefix="Reference Answer:",
        desc="Ground truth answer to the question.",
    )
    answer: str = dspy.InputField(
        prefix="Answer:",
        desc="Answer to the question given by the model.",
    )
    rationale: str = dspy.OutputField(
        prefix="Rationale:",
        desc="Explanation of why the answer is correct or incorrect.",
    )
    is_correct: bool = dspy.OutputField(
        prefix="Correct:",
        desc="Whether the answer is correct.",
    )


evaluator = dspy.TypedPredictor(Evaluator)


def metric(example, prediction, trace=None):
    return int(
        evaluator(
            question=example.question,
            answer=prediction.answer,
            reference_answer=example.answer
        ).is_correct
    ) 

We can't use `dspy.Evaluate` for evaluation, because `Avatar` changes its signature on every iteration by adding the actions and their results to it as fields. So we'll write our own hacky thread-safe evaluator instead.

In [None]:
import tqdm

from concurrent.futures import ThreadPoolExecutor

def process_example(example, signature):
    try:
        avatar = Avatar(
            signature,
            tools=tools,
            verbose=False,
        )
        prediction = avatar(**example.inputs().toDict())

        return metric(example, prediction)
    except Exception as e:
        print(e)
        return 0


def multi_thread_executor(test_set, signature, num_threads=60):
    total_score = 0
    total_examples = len(test_set)

    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = [executor.submit(process_example, example, signature) for example in test_set]

        for future in tqdm.tqdm(futures, total=total_examples, desc="Processing examples"):
            total_score += future.result()

    avg_metric = total_score / total_examples
    return avg_metric

In [None]:
sqa_score = multi_thread_executor(sqa_test, SearchQASignature)
print(f"Average Score on SearchQA: {sqa_score:.2f}")

In [None]:
aqa_score = multi_thread_executor(aqa_test, ArxivQASignature)
print(f"Average Score on ArxivQA: {aqa_score:.2f}")

## Optimization

To optimize the `Actor` we'll use `AvatarOptimizer`. It's a DSPy implementation of the [Avatar](https://github.com/zou-group/avatar/) method, which optimizes the `Actor` for the given `tools` using a comparator module that refines the Actor's instruction. Note that the Actor is the module that directs tool execution and flow; it is not the signature we pass in, and the optimizer doesn't change that signature's instruction. `AvatarOptimizer` takes the following parameters:

* `metric`: Metric that we'll be optimizing for
* `max_iters`: Maximum number of iterations for the optimizer
* `lower_bound`: Lower bound for the metric to classify example as negative
* `upper_bound`: Upper bound for the metric to classify example as positive
* `max_positive_inputs`: Maximum number of positive inputs sampled for comparator
* `max_negative_inputs`: Maximum number of negative inputs sampled for comparator
* `optimize_for`: Whether we want to maximize the metric or minimize it during optimization

Once the optimizer is done we can get the optimized actor and use it for the evaluation.
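To make the roles of `lower_bound`, `upper_bound`, `max_positive_inputs`, and `max_negative_inputs` concrete, here is a rough sketch of how scored examples could be bucketed and capped before being shown to a comparator. It is not the optimizer's actual code: `split_by_bounds` is a hypothetical helper, and the inclusive comparisons and the random sampling strategy are assumptions.

```python
import random
from typing import List, Tuple


def split_by_bounds(
    scored: List[Tuple[str, float]],
    lower_bound: float,
    upper_bound: float,
    max_positive_inputs: int,
    max_negative_inputs: int,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Bucket examples by score, then cap each bucket by random sampling."""
    # Assumption: both bounds are inclusive.
    positives = [ex for ex, score in scored if score >= upper_bound]
    negatives = [ex for ex, score in scored if score <= lower_bound]
    rng = random.Random(seed)
    positives = rng.sample(positives, min(len(positives), max_positive_inputs))
    negatives = rng.sample(negatives, min(len(negatives), max_negative_inputs))
    return positives, negatives


scored_examples = [("q1", 1.0), ("q2", 0.0), ("q3", 1.0), ("q4", 0.0), ("q5", 0.5)]
pos, neg = split_by_bounds(
    scored_examples,
    lower_bound=0,
    upper_bound=1,
    max_positive_inputs=1,
    max_negative_inputs=10,
)
print(pos, neg)
```

With a binary metric like the judge above, every example lands in one of the two buckets; an intermediate score such as the `0.5` for `q5` would fall between the bounds and be ignored.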

In [None]:
from dspy.teleprompt import AvatarOptimizer

teleprompter = AvatarOptimizer(
    metric=metric,
    max_iters=1,
    max_negative_inputs=10,
    max_positive_inputs=10,
)

In [None]:
optimized_arxiv_agent = teleprompter.compile(
    student=arxiv_agent,
    trainset=aqa_train
)

optimized_search_agent = teleprompter.compile(
    student=search_agent,
    trainset=sqa_train
)

Now we can evaluate our optimized actor modules. For this, the thread-safe evaluator we wrote above is also available as the `thread_safe_evaluator` method of `AvatarOptimizer`.

In [None]:
teleprompter.thread_safe_evaluator(aqa_test, optimized_arxiv_agent)

In [None]:
teleprompter.thread_safe_evaluator(sqa_test, optimized_search_agent)