<a href="https://colab.research.google.com/github/HackerCupAI/starter-kits/blob/main/rag/demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!--- @wandbcode{rag-hackercup} -->

<img src="http://wandb.me/logo-im-png" width="400" alt="Weights & Biases" />


# Introduction




In this notebook, we will build a few Code Generation agents for the [HackerCup AI](https://hackercupai.github.io/) challenge.

We will build three different agents using different techniques and evaluate them using [W&B Weave](https://weave-docs.wandb.ai/).


<img src="https://raw.githubusercontent.com/wandb/weave/master/docs/static/img/evals-hero.png" width="800" height="450">

A more detailed walkthrough of the approach used in this notebook is available in the following YouTube video.
Hint: Click on the image to watch the video 😎

<a target="_blank" href="https://www.youtube.com/watch?v=cObBj2UpWK8">
<img src="https://img.youtube.com/vi/cObBj2UpWK8/0.jpg" width="600" height="450">
</a>

## Weave


Weave is a lightweight toolkit for tracking and evaluating LLM applications, built by Weights & Biases. We will use Weave to trace and evaluate the agents we build.

Our goal is to bring rigor, best-practices, and composability to the inherently experimental process of developing AI applications, without introducing cognitive overhead.

If you want to learn more about Weave, you can [get started](https://weave-docs.wandb.ai/quickstart) by decorating Python functions with `@weave.op`.

## Setup 

**Note: You need to run this cell only once.**
It clones the starter-kits repo, sets the `rag` folder as the working directory, and installs the project dependencies.

**You can comment out the cell after you have run it once.**

In [1]:
# Clone the starter-kits repo
!git clone https://github.com/HackerCupAI/starter-kits
# Change directory to the rag folder. Running the next line twice in the same session will raise an error.
%cd starter-kits/rag
# Install dependencies
!pip install -r requirements.txt

Cloning into 'starter-kits'...
remote: Enumerating objects: 548, done.
remote: Counting objects: 100% (364/364), done.
remote: Compressing objects: 100% (223/223), done.
remote: Total 548 (delta 192), reused 281 (delta 131), pack-reused 184 (from 1)
Receiving objects: 100% (548/548), 13.42 MiB | 20.76 MiB/s, done.
Resolving deltas: 100% (267/267), done.
/Users/tcapelle/work/starter-kits/rag/starter-kits/rag


In [1]:
import weave

WEAVE_PROJECT = "hackercup"
weave_client = weave.init(WEAVE_PROJECT)

Logged in as Weights & Biases user: capecape.
View Weave data at https://wandb.ai/capecape/hackercup/weave


## Dataset
We will use the [HackerCup dataset](https://huggingface.co/datasets/hackercupai/hackercup) in this notebook, specifically the **practice** dataset from the **2023** season.

We have already processed the dataset and saved it as a [`weave.Dataset`](https://weave-docs.wandb.ai/guides/core-types/datasets/). You can either use the Dataset by running the next cell or download the dataset using the instructions below.

We will use the dataset to load some practice problems and solutions from the HackerCup dataset and evaluate our agents on it.

In [9]:
from typing import Any
from openai import AsyncOpenAI
from instructor import from_openai

from utils import (
    Problem,
    Solution,
    check_correctness,
    async_client,
    FAST_LLM,
    STRONG_LLM,
    format_response,
)


practice_dataset_uri = "weave:///parambharat/hackercup/object/practice_dataset:R35fXf9N3FE2IOesg7bRPaPAxiE9YbpirhXO9HcHs8w"
problems_dataset = weave.ref(practice_dataset_uri).get().rows[:]
problems = list(map(lambda x: Problem(**x), problems_dataset))
problem = problems[0]
print("Sample Problem:\n\n", problem.model_dump_json(indent=2))

Sample Problem:

 {
  "problem_dir": "data/2023/practice",
  "problem_name": "two_apples_a_day",
  "problem_description": "“An apple a day keeps the doctor away” is Steve’s motto. His other motto, “You can never have too much of a good thing,” holds true for both apples and mottos. Steve would like to eat two apples per day for the next \\(N\\) days, but with strict adherence to his third motto “Consistency is key.” Specifically, he’d like the sum of the two apple weights he eats over the next \\(N\\) days to be the same for each day.\n\nSteve has already purchased \\(2*N-1\\) apples, the \\(i\\)th of which weighs \\(A_i\\) ounces. He'd like to buy one more apple that's as light as possible to fulfill his goal. Steve can buy an apple of any positive integer weight in ounces from the store. Is it possible for him to reach his goal, and if so, what weight apple should he buy?\n\n{{PHOTO_ID:1563872647765708|WIDTH:600}}\n\n\n*The above image depicts the solution to the first sample. Each d

Alternatively, you can download the dataset by running the download script from the [submit-first-solution](https://github.com/HackerCupAI/starter-kits/tree/main/submit_first_solution). Specifically, you can run the following command to download the dataset:

```bash
python download.py --year 2023 --dataset_folder data
```


This should create a `dataset` folder with the problems and solutions.

Here's an example of what the data looks like for the `dim_sum_delivery` problem from the `2023` season:

```
data/dataset/2023/practice
...
├── dim_sum_delivery.cpp
├── dim_sum_delivery.in
├── dim_sum_delivery.md
├── dim_sum_delivery.out
├── dim_sum_delivery_sample_input.txt
├── dim_sum_delivery_sample_output.txt
├── dim_sum_delivery_sol.md
...
```

Each problem has an `in`, `out`, `md`, `cpp`, and `sol` file, plus sample input and output files:

- The `in` file contains the input data for the problem.
- The `out` file contains the expected output for the problem.
- The `md` file contains the problem statement.
- The `cpp` file contains the source code of the solution.
- The `sol` file (`*_sol.md`) contains the detailed solution to the problem.
- The `sample_input.txt` and `sample_output.txt` files contain the sample input and output for the problem. These are the test cases that will be available to the agent during development and evaluation.
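
To make the layout concrete, here is a minimal sketch of reading one problem's statement and sample test case from disk. The `ProblemFiles` dataclass and the `read_problem_files` helper are hypothetical names used only for illustration; the starter kit has its own loading utilities, which may work differently.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ProblemFiles:
    """Hypothetical container for the files that make up one problem."""

    name: str
    statement: str
    sample_input: str
    sample_output: str


def read_problem_files(folder: Path, name: str) -> ProblemFiles:
    """Read the statement and the sample test case for `name` from `folder`."""
    return ProblemFiles(
        name=name,
        statement=(folder / f"{name}.md").read_text(),
        sample_input=(folder / f"{name}_sample_input.txt").read_text(),
        sample_output=(folder / f"{name}_sample_output.txt").read_text(),
    )
```

Assuming the folder layout shown above, a call might look like `read_problem_files(Path("data/dataset/2023/practice"), "dim_sum_delivery")`.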

In [3]:
import asyncio
import logging

from nest_asyncio import apply

apply()

# Some logging to see the progress
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO
)

logger = logging.getLogger(__name__)

## Zero-shot Agent

For our first agent, we will use a `zero-shot solver`.
It's a simple LLM API call with a detailed prompt to solve the problem.

Before that, we need two things: the problems loaded into a more structured format, and a way to run the generated code and evaluate the solution.

First we'll start with loading some utilities. While there are other utilities we load, the ones we care about the most are `load_problem` and `check_correctness`.

The `load_problem` function will load a problem from our dataset into a more structured format.
The `check_correctness` function will run the generated code and evaluate the solution against the expected output for the sample test cases.
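
To give an idea of what such a checker involves, here is a minimal sketch that runs a program in a separate Python process, enforces a timeout, and compares the output with the expected text. The function name `run_and_compare` and its report strings are assumptions made for this illustration; the actual `check_correctness` in `utils` may be implemented differently.

```python
import subprocess
import sys


def run_and_compare(code: str, input_data: str, expected: str, timeout: float) -> str:
    """Run `code` in a fresh Python process and compare its stdout to `expected`."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            input=input_data,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timed out"
    if completed.returncode != 0:
        # The program crashed; report its error output.
        return f"failed: {completed.stderr}"
    # Ignore leading/trailing whitespace so a final newline does not matter.
    if completed.stdout.strip() == expected.strip():
        return "passed"
    return "wrong answer"
```

Running the code in a subprocess keeps a crashing or hanging program from taking the notebook kernel down with it, which matters when the code comes from an LLM.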

In [4]:
# Simple check to see if the code evaluation works
# We will use this to check the programs our agents generate

program_code = "print('hello, world!')"
input_data = ""
expected_output = "hello, world!"
timeout = 2

test_result = check_correctness(program_code, input_data, expected_output, timeout)
print("Example 1: ", test_result)
test_result = check_correctness("print('goodbye')", input_data, "hi there", timeout)
print("Example 2: ", test_result)

🍩 https://wandb.ai/capecape/hackercup/r/call/0191c23d-167b-7551-835b-4be7301fbdd8
Example 1:  passed
🍩 https://wandb.ai/capecape/hackercup/r/call/0191c23d-16ac-7db1-bbd3-186fd184e21a
Example 2:  WRONG ANSWER!!

<expected>
'hi there'
</expected>
---
<got>
'goodbye
'
</got>


Now that we have a way to load a problem and evaluate a solution, let's define a prompt and create a simple agent to solve the problem.

Here is one such prompt. It contains instructions for the model on how to solve the problem and the format of the response we expect. Feel free to tweak the prompt, but this one should work reasonably well for our use case.

In [5]:
from agent import SOLVER_INSTRUCTIONS

print(SOLVER_INSTRUCTIONS)

2024-09-05 14:51:09,477 : INFO : PyTorch version 2.4.0 available.


You are a world-class competitive programmer tasked with solving a programming problem. 
You will be provided with a problem statement, and you need to create a Python3 solution for it. 
Your task it to develop a winning solution to the problem in Python3 programming language.
You will do this in a step-by-step manner.

Step 1: Extract the core question and the problem-solving information from the problem statement.
Step 2: Describe the algorithm used to solve the problem.
Step 3: Write a short tutorial on the algorithm and how it works.
Step 4: Generate a step by step plan to solve the problem.
Step 5: Generate the pseudocode to solve the problem.
Step 6: Write the final solution in Python3 programming language to solve the problem.

Competition Guidelines:
    a. Do not use any external libraries; stick to Python 3 standard library
    b. Handle input and output using standard input/output (stdin/stdout)
    c. Use helper functions to improve readability of the code.
    c. Use the `

**Note**: The `Solution` model defines the response format we expect from the model, and we use `format_response` to enforce it.
If you change the `SOLVER_INSTRUCTIONS`, you need to change the `Solution` model to match the new format.

In [6]:
@weave.op
async def draft_solution(
        problem: Problem, model: str = FAST_LLM, temperature: float = 0.0
) -> Solution:
    user_prompt = f"""{problem.as_xml}
---
Let's think step by step to solve the problem:
"""

    response = await async_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SOLVER_INSTRUCTIONS},
            {"role": "user", "content": user_prompt},
        ],
        response_model=None,
        temperature=temperature,
    )
    formatted_response = await format_response(
        response.choices[0].message.content, Solution
    )
    return formatted_response

With the main solution drafter ready, we can define the `zero_shot_solver` agent.
The agent will use the `draft_solution` function to draft a solution and the `check_correctness` function to check the correctness of the generated solution and return the result.



In [7]:
@weave.op
async def zero_shot_solver(
        problem: Problem, model: str = FAST_LLM, temperature: float = 0.0, timeout: int = 10
) -> dict:
    logger.info("Drafting intial zero-shot solution")
    solution = await draft_solution(
        problem=problem,
        model=model,
        temperature=temperature,
    )
    test_report = check_correctness(
        solution.source_code, problem.sample_input, problem.sample_output, timeout
    )
    logger.info(f"Draft solution result: {repr(test_report)}")
    return {"solution": solution, "test_report": test_report, "stage": "zero-shot"}

In [10]:
# test the zero-shot agent on the sample problem
zero_shot_result = await zero_shot_solver(problem)
print("*" * 80)
print(zero_shot_result["solution"].source_code)
print("*" * 80)
print(zero_shot_result["test_report"])

2024-09-05 14:52:48,341 : INFO : Drafting intial zero-shot solution
2024-09-05 14:53:09,978 : INFO : HTTP Request: POST http://195.242.24.252:8000/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 14:53:34,759 : INFO : HTTP Request: POST http://195.242.24.252:8000/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 14:53:37,542 : INFO : Draft solution result: 'failed:   File "<string>", line 1\n    T = int(input())\\nfor t in range(1, T + 1):\\n    N = int(input())\\n    A = list(map(int, input().split()))\\n    \\n    A.sort()\\n    \\n    if A[-1] - A[0] == 0:\\n        print(f"Case #{t}: -1")\\n    else:\\n        print(f"Case #{t}: {A[-1] + (A[-1] - A[0])}")\n                     ^\nSyntaxError: unexpected character after line continuation character\n'


********************************************************************************
T = int(input())\nfor t in range(1, T + 1):\n    N = int(input())\n    A = list(map(int, input().split()))\n    \n    A.sort()\n    \n    if A[-1] - A[0] == 0:\n        print(f"Case #{t}: -1")\n    else:\n        print(f"Case #{t}: {A[-1] + (A[-1] - A[0])}")
********************************************************************************
failed:   File "<string>", line 1
    T = int(input())\nfor t in range(1, T + 1):\n    N = int(input())\n    A = list(map(int, input().split()))\n    \n    A.sort()\n    \n    if A[-1] - A[0] == 0:\n        print(f"Case #{t}: -1")\n    else:\n        print(f"Case #{t}: {A[-1] + (A[-1] - A[0])}")
                     ^
SyntaxError: unexpected character after line continuation character
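
The report above suggests that the generated source contains literal `\n` character pairs instead of real line breaks, so Python sees one long invalid line. Below is a minimal sketch of one possible repair step. `unescape_if_needed` is a hypothetical helper, not part of the starter kit. Blindly replacing `\n` would corrupt valid programs whose string literals contain that escape, so the sketch only applies the replacement when the original text fails to compile and the repaired text compiles.

```python
def unescape_if_needed(source_code: str) -> str:
    """Turn literal "\\n" and "\\t" pairs into real whitespace, but only if that fixes a SyntaxError."""
    try:
        compile(source_code, "<generated>", "exec")
        return source_code  # Already valid Python; leave it untouched.
    except SyntaxError:
        pass

    candidate = source_code.replace("\\n", "\n").replace("\\t", "\t")
    try:
        compile(candidate, "<generated>", "exec")
    except SyntaxError:
        return source_code  # The repair did not help; keep the original text.
    return candidate
```

A helper like this could be applied to `solution.source_code` before it is passed to `check_correctness`. Note that `compile` only parses the code and does not execute it.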



Let's build a simple evaluation using weave to evaluate the zero-shot agent.
You'll quickly see how this simple evaluation framework can become very powerful and will scale to very complex workflows.
Our agent already takes care of running the code, evaluating the solution against the expected output for the sample test cases and returning the report in the model output.
We expect that the `test_report` is `"passed"` in the agent output so we can use that to evaluate the agent. 

But first we need to load all the problems and convert them to a more structured format. A good agent should be able to handle all the problems in the dataset.

In [None]:
# This is a simple depiction of the evaluation.
# We expect the output to be `"passed"` for all the problems if the agent is working correctly.
examples = [{"problem": problem, "expected": "passed"} for problem in problems]


# A simple scorer that checks if the code generated by agent passed the test case
@weave.op
def scorer(expected: str, model_output: dict) -> dict:
    return {"passed": expected == model_output["test_report"]}


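# This is a simple evaluation that checks if the code generated by the agent passed the test.
# Note: the name `eval` shadows Python's built-in `eval`. If this cell is skipped, the
# evaluation cells below fail with an AttributeError raised on the built-in function.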
eval = weave.Evaluation(dataset=examples, scorers=[scorer])

Now we are ready to evaluate the zero-shot agent.
We will create a `weave.Model` instance for the zero-shot agent.
This will help us conduct robust experiments and comparisons by helping us track various settings and parameters for the agent.
For now, we will focus on the `LLM` and the `temperature` for the model.


In [None]:
# Nothing fancy here, just a model that takes in a problem and returns a solution


class ZeroshotAgent(weave.Model):
    model: str = FAST_LLM
    temperature: float = 0.0
    timeout: int = 30

    @weave.op
    async def predict(self, problem: Problem):
        return await zero_shot_solver(
            Problem(**problem),
            model=self.model,
            temperature=self.temperature,
            timeout=self.timeout,
        )

In [None]:
# Evaluate the zero shot agent for all the models and temperatures
eval_models = [FAST_LLM, STRONG_LLM]
eval_temperatures = [0.0, 0.5, 1.0]
tasks = []
for LLM in eval_models:
    for temperature in eval_temperatures:
        zeroshot_agent = ZeroshotAgent(model=LLM, temperature=temperature, timeout=30)
        zeroshot_results = eval.evaluate(zeroshot_agent)
        tasks.append(zeroshot_results)

# Phew that's 2(models)*3(temps)*5(problems) = 30 evaluations

zeroshot_results = await asyncio.gather(*tasks)

AttributeError: 'builtin_function_or_method' object has no attribute 'evaluate'

Once you have the results you should also be able to visit your weave dashboard to see the results.

## RAG Agent

The RAG agent is more complex: it uses a retriever to find similar problems and their solutions, then passes them to the model as few-shot examples to generate a new solution. We will use the [codecontests](https://huggingface.co/datasets/deepmind/code_contests) dataset as the source of similar problems and solutions.

Retrieving similar problems and solutions for a given problem statement is a non-trivial task. It involves indexing a large corpus of problems and solutions and then using a search algorithm to find the closest matches. We will use the `bm25` algorithm to index the problems and solutions. However, two problems with similar wording are not necessarily similar problems; for example, many unrelated problems mention `Alice` and `Bob`. A keyword search algorithm like BM25 may therefore retrieve problems that share vocabulary with the problem statement but require a different approach.
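
To illustrate how keyword scoring behaves, here is a minimal pure-Python sketch of Okapi BM25. It is not the retriever used in this notebook: the function name `bm25_scores`, the whitespace tokenizer, and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: str, documents: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with the Okapi BM25 formula."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalized through `b`.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores
```

With a query such as `"alice tree"`, every document that mentions `alice` receives a positive score regardless of the algorithm it describes, which is the limitation discussed above.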

While we could use `semantic search`, it would require a lot of data and compute. Therefore, we will use the `bm25` algorithm to index the problems and solutions, and use our zero-shot agent to generate a draft solution for a given problem statement. We can then look for similar problems and solutions by comparing the AST (Abstract Syntax Tree) of the draft solution with the ASTs of the indexed solutions. This is a very simplistic approach and is not perfect by any means, but it's a good starting point.


For now, you can just load the retriever below. If you wish to use your own data, you might need to pre-process the data and create the retriever yourself; check out `starter-kits/rag/retriever.py` for more details.


However, BM25 alone is not enough to find similar problems and solutions, because two problems with similar solutions might have different problem statements and vice versa.

We can use semantic search to mitigate this by reranking an initial candidate pool retrieved with BM25 and keeping the most similar problems and solutions. Restricting semantic search to this pool keeps our compute requirements in check. We can use `cosine similarity` to rank the candidates.
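
As a rough illustration of the reranking step, the sketch below orders a small candidate pool by cosine similarity to a query vector. The helper names `cosine_similarity` and `rerank_by_cosine` are hypothetical, and the embedding vectors are assumed to have been computed already; the `rerank_docs` function imported below may work differently.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank_by_cosine(
    query_vec: list[float],
    candidates: list[tuple[str, list[float]]],
    top_n: int,
) -> list[str]:
    """Return the ids of the `top_n` candidates most similar to `query_vec`."""
    ranked = sorted(
        candidates,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:top_n]]
```

Because only the BM25 candidates are embedded and compared, the cost grows with the pool size (`top_k`) rather than with the size of the whole corpus.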

In [41]:
from agent import describe_examples, format_examples, generate_solution
from retriever import Retriever, rerank_docs

logger.info("Loading retriever ... this may take a while ...")
retriever = Retriever()

2024-09-05 14:44:53,336 : INFO : Loading retriever ... this may take a while ...
2024-09-05 14:44:58,993 : DEBUG : Building index from IDs objects      
                                                                               

We are now ready to build the RAG agent.

As we laid out earlier, the RAG agent takes in a problem, retrieves similar problems and their solutions, and uses them to generate a new solution. We will use the `draft_solution` function to draft a solution for the given problem statement. We then look for similar problems and solutions by comparing the AST (Abstract Syntax Tree) of the draft with the solutions in our dataset. Finally, we present these as few-shot examples to the model to generate a new solution for the given problem statement.
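
One way to picture a structural comparison is sketched below using Python's standard `ast` module. It counts AST node types, so variable names and constants do not affect the result. The functions `ast_signature` and `structural_similarity` are hypothetical and serve only as an illustration; the retriever in `retriever.py` may represent and compare code differently.

```python
import ast
from collections import Counter


def ast_signature(source_code: str) -> Counter:
    """Count the AST node types in `source_code`, ignoring names and constant values."""
    tree = ast.parse(source_code)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def structural_similarity(code_a: str, code_b: str) -> float:
    """Jaccard-style overlap between two node-type multisets, in the range [0, 1]."""
    sig_a, sig_b = ast_signature(code_a), ast_signature(code_b)
    intersection = sum((sig_a & sig_b).values())
    union = sum((sig_a | sig_b).values())
    return intersection / union if union else 0.0
```

Two loops that differ only in variable names and bounds score 1.0 under this measure, while a single assignment scores lower against either of them. Note that `ast.parse` raises `SyntaxError` on invalid code, so a draft that does not parse would need to be handled separately.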

In [42]:
@weave.op
async def rag_solver(
        retriever: Retriever,
        problem: Problem,
        model: str = FAST_LLM,
        temperature: float = 0.0,
        timeout: int = 10,
) -> dict:
    """The RAG Solver"""

    zero_shot_result = await zero_shot_solver(
        problem=problem,
        model=model,
        temperature=temperature,
        timeout=timeout,
    )
    solution = zero_shot_result["solution"]
    test_report = zero_shot_result["test_report"]
    if test_report == "passed":
        return zero_shot_result
    logger.info("Iterating on a RAG solution")

    @weave.op
    async def create_examplars(
            problem: Problem, solution: Solution, top_k: int = 50, top_n: int = 5
    ):
        logger.info(f"Generating examplars:")
        retrieve_docs = retriever.retrieve(solution.source_code, top_k)
        reranked_docs = await rerank_docs(problem, solution, retrieve_docs, top_n)
        analyses = await describe_examples(reranked_docs)
        examplars = format_examples(reranked_docs, analyses)
        return examplars

    @weave.op
    async def rag_solution(
            problem: Problem,
            draft_solution: Solution,
            model: str = STRONG_LLM,
            temperature: float = 0.0,
            timeout: int = timeout,
    ) -> dict:
        logger.info(f"Generating RAG solution:")
        examplars = await create_examplars(problem, draft_solution)
        rag_solution = await generate_solution(
            problem=problem,
            examples=examplars,
            model=model,
            temperature=temperature,
        )
        test_report = check_correctness(
            rag_solution.source_code,
            problem.sample_input,
            problem.sample_output,
            timeout,
        )
        logger.info(f"RAG Solution Result: {repr(test_report)}")
        return {"solution": rag_solution, "test_report": test_report}

    rag_result = await rag_solution(problem, solution, model, temperature, timeout)
    solution = rag_result["solution"]
    test_report = rag_result["test_report"]
    return {"solution": solution, "stage": "rag", "test_report": test_report}

In [43]:
rag_result = await rag_solver(retriever, problem, timeout=30)
print("*" * 80)
print(rag_result["solution"].source_code)
print("*" * 80)
print(rag_result["test_report"])

2024-09-05 14:45:00,525 : INFO : Drafting intial zero-shot solution
2024-09-05 14:45:22,199 : INFO : HTTP Request: POST http://195.242.24.252:8010/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 14:45:46,219 : INFO : HTTP Request: POST http://195.242.24.252:8010/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 14:45:48,510 : INFO : Draft solution result: 'failed:   File "<string>", line 1\n    T = int(input())\\n\\nfor t in range(1, T + 1):\\n    N = int(input())\\n    A = list(map(int, input().split()))\\n    \\n    A.sort()\\n    \\n    if A[-1] - A[0] == 0:\\n        print(f"Case #{t}: -1")\\n    else:\\n        print(f"Case #{t}: {A[-1] + (A[-1] - A[0])}")\n                     ^\nSyntaxError: unexpected character after line continuation character\n'
2024-09-05 14:45:48,511 : INFO : Iterating on a RAG solution
2024-09-05 14:45:49,182 : INFO : Generating RAG solution:
2024-09-05 14:45:49,795 : INFO : Generating examplars:
                                 

TypeError: 'NoneType' object is not iterable

We are now ready to evaluate the RAG agent.
We will create a `weave.Model` instance for the RAG agent and evaluate it using the same evaluation framework we used for the zero-shot agent.

In [None]:
class RAGAgent(weave.Model):
    retriever: Retriever
    model: str = FAST_LLM
    temperature: float = 0.0
    timeout: int = 30

    @weave.op
    async def predict(self, problem: Problem):
        return await rag_solver(
            retriever=self.retriever,
            problem=Problem(**problem),
            model=self.model,
            temperature=self.temperature,
            timeout=self.timeout,
        )

In [None]:
# Evaluate the RAG agent for all the models and temperatures
tasks = []
for LLM in eval_models:
    for temperature in eval_temperatures:
        rag_agent = RAGAgent(
            retriever=retriever, model=LLM, temperature=temperature, timeout=30
        )
        rag_results = eval.evaluate(rag_agent)
        tasks.append(rag_results)

# Again, 30 evals for the RAG agent with different models and temperatures

rag_results = await asyncio.gather(*tasks)

2024-09-05 10:17:02,156 : INFO : Drafting intial zero-shot solution
2024-09-05 10:17:02,470 : INFO : Drafting intial zero-shot solution
2024-09-05 10:17:02,786 : INFO : Drafting intial zero-shot solution
2024-09-05 10:17:03,163 : INFO : Drafting intial zero-shot solution
2024-09-05 10:17:03,497 : INFO : Drafting intial zero-shot solution
2024-09-05 10:17:03,776 : INFO : Drafting intial zero-shot solution
2024-09-05 10:17:04,099 : INFO : Drafting intial zero-shot solution
2024-09-05 10:17:04,410 : INFO : Drafting intial zero-shot solution
2024-09-05 10:17:04,762 : INFO : Drafting intial zero-shot solution
2024-09-05 10:17:05,039 : INFO : Drafting intial zero-shot solution
2024-09-05 10:17:05,337 : INFO : Drafting intial zero-shot solution
2024-09-05 10:17:05,647 : INFO : Drafting intial zero-shot solution
2024-09-05 10:17:06,132 : INFO : Drafting intial zero-shot solution
2024-09-05 10:17:06,445 : INFO : Drafting intial zero-shot solution
2024-09-05 10:17:06,746 : INFO : Drafting intial

2024-09-05 10:18:10,658 : INFO : Draft solution result: "WRONG ANSWER!!\n\n<expected>\n'Case #1: 5\nCase #2: -2\nCase #3: 0\nCase #4: -2\n'\n</expected>\n---\n<got>\n''\n</got>"
2024-09-05 10:18:10,659 : INFO : Iterating on a RAG solution
2024-09-05 10:18:11,207 : INFO : Generating RAG solution:
2024-09-05 10:18:11,840 : INFO : Generating examplars:
2024-09-05 10:18:11,946 : INFO : Retrying request to /embeddings in 0.952920 seconds
2024-09-05 10:18:12,363 : INFO : HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-09-05 10:18:12,664 : INFO : HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-09-05 10:18:13,220 : INFO : HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-09-05 10:18:13,516 : INFO : HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-09-05 10:18:13,569 : INFO : HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-09-05 10:18:14,393 : IN

Traceback (most recent call last):
  File "/Users/tcapelle/miniforge3/envs/weave/lib/python3.11/site-packages/weave/flow/eval.py", line 168, in predict_and_score
    model_output = await async_call(model_predict, **model_predict_args)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tcapelle/miniforge3/envs/weave/lib/python3.11/site-packages/weave/trace/op.py", line 333, in wrapper
    res, _ = await _execute_call(wrapper, call, *args, **kwargs)  # type: ignore
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tcapelle/miniforge3/envs/weave/lib/python3.11/site-packages/weave/trace/op.py", line 213, in _call_async
    return handle_exception(e)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/tcapelle/miniforge3/envs/weave/lib/python3.11/site-packages/weave/trace/op.py", line 211, in _call_async
    res = await func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/folders/sf/tgv7vcv96x557p38bvvp1ms40000

Traceback (most recent call last):
  File "/Users/tcapelle/miniforge3/envs/weave/lib/python3.11/site-packages/weave/flow/eval.py", line 287, in eval_example
    eval_row = await self.predict_and_score(model, example)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tcapelle/miniforge3/envs/weave/lib/python3.11/site-packages/weave/trace/op.py", line 333, in wrapper
    res, _ = await _execute_call(wrapper, call, *args, **kwargs)  # type: ignore
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tcapelle/miniforge3/envs/weave/lib/python3.11/site-packages/weave/trace/op.py", line 213, in _call_async
    return handle_exception(e)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/tcapelle/miniforge3/envs/weave/lib/python3.11/site-packages/weave/trace/op.py", line 211, in _call_async
    res = await func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tcapelle/miniforge3/envs/weave/lib/python3.11/site-packages/weave/fl

2024-09-05 10:21:30,762 : INFO : Draft solution result: 'passed'


2024-09-05 10:21:32,002 : INFO : Retrying request to /embeddings in 0.874893 seconds
2024-09-05 10:21:32,003 : INFO : Retrying request to /chat/completions in 0.898895 seconds


2024-09-05 10:21:32,006 : INFO : Retrying request to /chat/completions in 0.986961 seconds
2024-09-05 10:21:32,007 : INFO : Retrying request to /chat/completions in 0.944196 seconds
2024-09-05 10:21:32,008 : INFO : Retrying request to /chat/completions in 0.966591 seconds
2024-09-05 10:21:32,385 : INFO : HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-09-05 10:21:32,459 : INFO : HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-09-05 10:21:32,463 : INFO : HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-09-05 10:21:32,531 : INFO : HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-09-05 10:21:32,533 : INFO : HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-09-05 10:21:32,536 : INFO : HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-09-05 10:21:32,538 : INFO : HTTP Request: POST https://api.openai.com/v1/embeddi

2024-09-05 10:28:06,597 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:28:08,367 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:28:10,028 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:28:11,260 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:28:12,937 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:28:15,437 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:28:15,440 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:28:15,442 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:28:15,443 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "

2024-09-05 10:28:40,188 : INFO : RAG Solution Result: "WRONG ANSWER!!\n\n<expected>\n'Case #1: 5\nCase #2: -2\nCase #3: 0\nCase #4: -2\n'\n</expected>\n---\n<got>\n'Case #1: -1\nCase #2: -1\nCase #3: -1\nCase #4: 0\n'\n</got>"


2024-09-05 10:28:40,198 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


2024-09-05 10:28:42,337 : INFO : RAG Solution Result: "WRONG ANSWER!!\n\n<expected>\n'Case #1: 5\nCase #2: -2\nCase #3: 0\nCase #4: -2\n'\n</expected>\n---\n<got>\n''\n</got>"


2024-09-05 10:28:47,153 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:28:49,380 : INFO : RAG Solution Result: "WRONG ANSWER!!\n\n<expected>\n'Case #1: 4\nCase #2: 7\nCase #3: 1\nCase #4: -1\nCase #5: 6\nCase #6: -1\nCase #7: 1000000002\n'\n</expected>\n---\n<got>\n'Case #1: 3\nCase #2: 19\nCase #3: 1\nCase #4: 4\nCase #5: 14\nCase #6: 20\nCase #7: 666666670\n'\n</got>"


2024-09-05 10:28:52,969 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:29:25,050 : INFO : RAG Solution Result: 'timed out'
2024-09-05 10:29:25,061 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:29:25,064 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:29:25,066 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:29:25,067 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:29:25,070 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:29:25,073 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:29:25,075 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:29:25,076 : INFO

2024-09-05 10:30:22,512 : INFO : RAG Solution Result: 'passed'
2024-09-05 10:30:54,776 : INFO : RAG Solution Result: 'timed out'
2024-09-05 10:30:57,015 : INFO : RAG Solution Result: "WRONG ANSWER!!\n\n<expected>\n'Case #1: 5\nCase #2: -2\nCase #3: 0\nCase #4: -2\n'\n</expected>\n---\n<got>\n'Case #1: -1\nCase #2: -1\nCase #3: -1\nCase #4: 0\n'\n</got>"


## Reflection Agent



While the RAG agent improves on the zero-shot agent, it is far from perfect: it can still hallucinate and produce incorrect solutions.
One way to mitigate this is reflection.
We use another LLM call to reflect on the failed solution and its test report, then rework the solution.
We can then use the improved solution to generate new few-shot examples and repeat the process in a loop until a solution passes or the iteration limit is reached.

Again, this is not the best approach to solve the problem and has a lot of room for improvement, but it should help us get towards a working solution.
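
To make the idea concrete, here is a minimal, self-contained sketch of a draft, test, and retry loop. It is not the notebook's implementation: `draft`, `run_tests`, `toy_draft`, and `toy_tests` are hypothetical stand-ins for the LLM call and the test harness, and the real agent uses `rag_solver` and `rework_solution` instead.

```python
from typing import Callable, Optional, Tuple


def solve_with_reflection(
    draft: Callable[[Optional[str]], str],  # hypothetical: returns code, optionally using feedback
    run_tests: Callable[[str], str],        # hypothetical: returns "passed" or a failure report
    max_iterations: int = 2,
) -> Tuple[str, str, int]:
    """Draft, test, and retry with feedback until tests pass or the budget runs out."""
    feedback = None
    code, report = "", "failed"
    for attempt in range(1, max_iterations + 1):
        code = draft(feedback)
        report = run_tests(code)
        if report == "passed":
            return code, report, attempt
        # Feed the failing code and its test report into the next attempt.
        feedback = f"Previous attempt:\n{code}\n\nTest report:\n{report}"
    return code, report, max_iterations


# Toy stand-ins: the first draft is wrong, the draft made with feedback is right.
def toy_draft(feedback: Optional[str]) -> str:
    return "print(1 + 1)" if feedback else "print(1 + 2)"


def toy_tests(code: str) -> str:
    return "passed" if code == "print(1 + 1)" else "WRONG ANSWER!!"


code, report, attempts = solve_with_reflection(toy_draft, toy_tests)
print(report, attempts)
```

The key design choice is that the test report, not just the failed code, is passed back to the model, so each retry has concrete evidence of what went wrong.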

Here are the instructions we give the LLM for reflecting on the solution and test results. Feel free to change them to improve the agent's performance.

In [None]:
from agent import REFLECTION_INSTRUCTIONS, rework_solution

print(REFLECTION_INSTRUCTIONS)

You are a world-class competitive programmer with a keen eye for detail and problem solving. 
Your expertise is in algorithms and data structures. 
You have incorrectly answered the following programming problem. 
Your task is to reflect on the problem, your solution, and the correct answer.
You will then use this information help you answer the same question in the future. 
First, explain why you answered the question incorrectly.
Second, list the keywords that describe the type of your errors from most general to most specific.
Third, solve the problem again, step-by-step, based on your knowledge of the correct answer.
Fourth, create a list of detailed instructions to help you correctly solve this problem in the future.
Finally, create a list of general advice to help you solve similar types of problems in the future.
Be concise in your response; however, capture all of the essential information.

{problem}
<incorrect_solution>
{incorrect_solution}
</incorrect_solution>
<test_report>
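
The printed template is cut off above, but its `{problem}` and `{incorrect_solution}` placeholders suggest it is filled with `str.format`-style substitution. The sketch below shows that pattern with a hypothetical, shortened `TEMPLATE`; the `{test_report}` placeholder and the closing tag are assumptions, not the actual contents of `agent.py`.

```python
# Hypothetical, shortened stand-in for REFLECTION_INSTRUCTIONS.
# The {test_report} placeholder and closing tag are assumed for illustration.
TEMPLATE = """You have incorrectly answered the following programming problem.

{problem}
<incorrect_solution>
{incorrect_solution}
</incorrect_solution>
<test_report>
{test_report}
</test_report>"""

prompt = TEMPLATE.format(
    problem="Print the sum of two integers.",
    incorrect_solution="print(a - b)",
    test_report="WRONG ANSWER!!",
)
print(prompt)
```

If you edit a template like this, remember that literal braces in the template text must be doubled (`{{` and `}}`) for `str.format`; braces inside the substituted values are left untouched.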

In [None]:
@weave.op
async def rag_solver_with_reflection(
        retriever: Retriever,
        problem: Problem,
        model: str = FAST_LLM,
        temperature: float = 0.0,
        max_iterations: int = 2,
        timeout: int = 10,
):
    """Alternate RAG drafting and reflection until the tests pass or `max_iterations` is reached."""
    num_iterations = 0
    test_report = "failed"
    solution = None
    while test_report != "passed" and num_iterations < max_iterations:
        rag_result = await rag_solver(
            retriever=retriever,
            problem=problem,
            timeout=timeout,
            model=model,
            temperature=temperature,
        )
        solution = rag_result["solution"]
        test_report = rag_result["test_report"]
        if test_report == "passed":
            return rag_result
        rework_result = await rework_solution(
            problem=problem,
            incorrect_solution=solution,
            test_report=test_report,
            model=model,
            temperature=temperature,
            timeout=timeout,
        )
        solution = rework_result["solution"]
        test_report = rework_result["test_report"]
        if test_report == "passed":
            return {
                "solution": solution,
                "stage": "reflection",
                "test_report": test_report,
            }
        num_iterations += 1
    logger.info("Failed to generate a solution")
    return {"solution": solution, "stage": "failed", "test_report": test_report}

In [None]:
reflection_result = await rag_solver_with_reflection(
    retriever, problem, max_iterations=2, timeout=30
)

print("*" * 80)
print(reflection_result["solution"].source_code)
print("*" * 80)
print(reflection_result["test_report"])

2024-09-05 10:31:03,780 : INFO : Drafting intial zero-shot solution
2024-09-05 10:31:16,477 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:31:30,492 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:31:40,511 : INFO : Draft solution result: "WRONG ANSWER!!\n\n<expected>\n'Case #1: 4\nCase #2: 7\nCase #3: 1\nCase #4: -1\nCase #5: 6\nCase #6: -1\nCase #7: 1000000002\n'\n</expected>\n---\n<got>\n'Case #1: -1\nCase #2: -1\nCase #3: -1\nCase #4: -1\nCase #5: -1\nCase #6: 1\nCase #7: -1\n'\n</got>"
2024-09-05 10:31:40,513 : INFO : Iterating on a RAG solution
2024-09-05 10:31:41,179 : INFO : Generating RAG solution:
2024-09-05 10:31:41,900 : INFO : Generating examplars:
2024-09-05 10:31:42,434 : INFO : HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-09-05 10:31:43,018 : INFO : HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-

********************************************************************************
def can_form_pairs(weights, target_weight, N):
    from collections import Counter
    count = Counter(weights)
    for weight in sorted(count.keys()):
        while count[weight] > 0:
            complement = target_weight - weight
            if complement in count and count[complement] > 0:
                if weight == complement:
                    if count[weight] < 2:
                        return False
                    count[weight] -= 2
                else:
                    count[weight] -= 1
                    count[complement] -= 1
            else:
                return False
    return True

def solve():
    import sys
    input = sys.stdin.read
    data = input().splitlines()
    T = int(data[0])
    results = []
    index = 1
    for case_number in range(1, T + 1):
        N = int(data[index])
        weights = list(map(int, data[index + 1].split()))
        index += 2
        tota

Great, we are now ready to evaluate a more complex agent that uses reflection.
This agent first tries to solve the problem using the retriever.
If the solution fails, it asks the model to reflect on the problem and rework the solution, repeating this process until the solution is correct or the iteration limit is reached.

Best of all, we can evaluate the RAG reflection agent with the same evaluation framework we used for the zero-shot and RAG agents.

In [None]:
class RAGReflectionAgent(weave.Model):
    retriever: Retriever
    max_iterations: int = 2
    timeout: int = 30
    model: str = STRONG_LLM
    temperature: float = 0.0

    @weave.op
    async def predict(self, problem: Problem):
        return await rag_solver_with_reflection(
            self.retriever,
            Problem(**problem),
            model=self.model,
            temperature=self.temperature,
            max_iterations=self.max_iterations,
            timeout=self.timeout,
        )
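
The `Problem(**problem)` call suggests that the evaluation hands `predict` each dataset row as a plain mapping rather than a `Problem` instance, so the agent rebuilds the object before solving. A minimal sketch of that round trip, using a hypothetical `ToyProblem` dataclass in place of the notebook's `Problem`:

```python
from dataclasses import asdict, dataclass


@dataclass
class ToyProblem:  # hypothetical stand-in for the notebook's Problem
    name: str
    description: str


# A dataset row as a plain dict, as an evaluation framework might pass it.
row = asdict(ToyProblem(name="demo", description="Add two numbers."))

# Rebuild the typed object from the dict by unpacking it as keyword arguments.
rebuilt = ToyProblem(**row)
print(rebuilt)
```

Unpacking with `**` only works when the dict keys match the constructor's field names, which is why the dataset rows and the problem class need to stay in sync.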

In [None]:
# Evaluate the RAG reflection agent for all the models and temperatures
tasks = []
for LLM in eval_models:
    for temperature in eval_temperatures:
        rag_reflection_agent = RAGReflectionAgent(
            retriever=retriever, model=LLM, temperature=temperature, timeout=30
        )
        # evaluate() is not awaited here; the awaitables are collected and run together below
        tasks.append(eval.evaluate(rag_reflection_agent))
rag_reflection_results = await asyncio.gather(*tasks)
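
Because `asyncio.gather` returns results in the same order as the awaitables it was given, the results can be matched back to their (model, temperature) configuration. The sketch below illustrates this with a hypothetical `fake_evaluate` coroutine standing in for `eval.evaluate(agent)`; the model names are placeholders.

```python
import asyncio
from itertools import product


async def fake_evaluate(model: str, temperature: float) -> dict:
    # Hypothetical stand-in for eval.evaluate(agent); sleeps instead of calling an LLM.
    # Higher temperatures finish sooner here, so completion order differs from start order.
    await asyncio.sleep(0.01 * (1.0 - temperature))
    return {"model": model, "temperature": temperature}


async def run_grid(models, temperatures):
    configs = list(product(models, temperatures))
    results = await asyncio.gather(*(fake_evaluate(m, t) for m, t in configs))
    # gather preserves input order, so zip pairs each config with its own result.
    return dict(zip(configs, results))
```

In a notebook cell you would call it with `grid = await run_grid(["model-a", "model-b"], [0.0, 0.5])`; in a plain script, wrap the call in `asyncio.run(...)`.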

2024-09-05 10:37:39,184 : INFO : Drafting intial zero-shot solution
2024-09-05 10:37:39,489 : INFO : Drafting intial zero-shot solution
2024-09-05 10:37:39,771 : INFO : Drafting intial zero-shot solution
2024-09-05 10:37:40,055 : INFO : Drafting intial zero-shot solution
2024-09-05 10:37:40,376 : INFO : Drafting intial zero-shot solution
2024-09-05 10:37:40,864 : INFO : Drafting intial zero-shot solution
2024-09-05 10:37:41,169 : INFO : Drafting intial zero-shot solution
2024-09-05 10:37:41,504 : INFO : Drafting intial zero-shot solution
2024-09-05 10:37:41,819 : INFO : Drafting intial zero-shot solution
2024-09-05 10:37:42,090 : INFO : Drafting intial zero-shot solution
2024-09-05 10:37:42,390 : INFO : Drafting intial zero-shot solution
2024-09-05 10:37:42,699 : INFO : Drafting intial zero-shot solution
2024-09-05 10:37:42,981 : INFO : Drafting intial zero-shot solution
2024-09-05 10:37:43,280 : INFO : Drafting intial zero-shot solution
2024-09-05 10:37:43,585 : INFO : Drafting intial

Traceback (most recent call last):
  File "/Users/tcapelle/miniforge3/envs/weave/lib/python3.11/site-packages/weave/flow/eval.py", line 168, in predict_and_score
    model_output = await async_call(model_predict, **model_predict_args)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tcapelle/miniforge3/envs/weave/lib/python3.11/site-packages/weave/trace/op.py", line 333, in wrapper
    res, _ = await _execute_call(wrapper, call, *args, **kwargs)  # type: ignore
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tcapelle/miniforge3/envs/weave/lib/python3.11/site-packages/weave/trace/op.py", line 213, in _call_async
    return handle_exception(e)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/tcapelle/miniforge3/envs/weave/lib/python3.11/site-packages/weave/trace/op.py", line 211, in _call_async
    res = await func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/folders/sf/tgv7vcv96x557p38bvvp1ms40000

Traceback (most recent call last):
  File "/Users/tcapelle/miniforge3/envs/weave/lib/python3.11/site-packages/weave/flow/eval.py", line 287, in eval_example
    eval_row = await self.predict_and_score(model, example)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tcapelle/miniforge3/envs/weave/lib/python3.11/site-packages/weave/trace/op.py", line 333, in wrapper
    res, _ = await _execute_call(wrapper, call, *args, **kwargs)  # type: ignore
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tcapelle/miniforge3/envs/weave/lib/python3.11/site-packages/weave/trace/op.py", line 213, in _call_async
    return handle_exception(e)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/tcapelle/miniforge3/envs/weave/lib/python3.11/site-packages/weave/trace/op.py", line 211, in _call_async
    res = await func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tcapelle/miniforge3/envs/weave/lib/python3.11/site-packages/weave/fl

2024-09-05 10:40:49,278 : INFO : Draft solution result: 'passed'


2024-09-05 10:40:51,323 : INFO : Draft solution result: 'passed'
2024-09-05 10:40:53,336 : INFO : Draft solution result: 'failed:   File "<string>", line 37\n    print("\n          ^\nSyntaxError: unterminated string literal (detected at line 37)\n'
2024-09-05 10:40:53,338 : INFO : Iterating on a RAG solution
2024-09-05 10:40:53,902 : INFO : Generating RAG solution:
2024-09-05 10:40:54,494 : INFO : Generating examplars:
                                 

2024-09-05 10:41:05,874 : INFO : Retrying request to /embeddings in 0.932150 seconds


2024-09-05 10:41:05,880 : INFO : Retrying request to /chat/completions in 0.948708 seconds
2024-09-05 10:41:05,881 : INFO : Retrying request to /chat/completions in 0.860106 seconds
2024-09-05 10:41:05,881 : INFO : Retrying request to /chat/completions in 0.865750 seconds
2024-09-05 10:41:05,881 : INFO : Retrying request to /chat/completions in 0.850348 seconds


2024-09-05 10:41:05,891 : INFO : Retrying request to /chat/completions in 0.873927 seconds
2024-09-05 10:41:05,891 : INFO : Retrying request to /chat/completions in 0.944654 seconds
2024-09-05 10:41:05,892 : INFO : Retrying request to /chat/completions in 0.796671 seconds


2024-09-05 10:41:05,895 : INFO : Retrying request to /embeddings in 0.860000 seconds
2024-09-05 10:41:05,896 : INFO : Retrying request to /embeddings in 0.989289 seconds
2024-09-05 10:41:05,896 : INFO : Retrying request to /embeddings in 0.961960 seconds
2024-09-05 10:41:05,897 : INFO : Retrying request to /embeddings in 0.880524 seconds
2024-09-05 10:41:05,899 : INFO : Retrying request to /embeddings in 0.757564 seconds
2024-09-05 10:41:06,201 : INFO : HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-09-05 10:41:06,229 : INFO : HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-09-05 10:41:06,233 : INFO : HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-09-05 10:41:06,278 : INFO : HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-09-05 10:41:06,300 : INFO : HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-09-05 10:41:06,302 : INFO : HTTP Re

2024-09-05 10:47:04,351 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:47:10,135 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:47:12,894 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:47:12,897 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:47:15,781 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:47:17,054 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:47:17,062 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:47:19,632 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:47:19,634 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "

2024-09-05 10:48:16,205 : INFO : RAG Solution Result: 'timed out'
2024-09-05 10:48:16,211 : INFO : Reflecting and improving solution
2024-09-05 10:48:16,226 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:48:16,228 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:48:19,003 : INFO : RAG Solution Result: "WRONG ANSWER!!\n\n<expected>\n'Case #1: 3\nCase #2: 1\nCase #3: 0\nCase #4: 199\nCase #5: 100\nCase #6: 1999999999999\n'\n</expected>\n---\n<got>\n'Case #1: 2\nCase #2: 0\nCase #3: 0\nCase #4: 200\nCase #5: 66\nCase #6: 2000000000000\n'\n</got>"
2024-09-05 10:48:19,007 : INFO : Reflecting and improving solution
2024-09-05 10:48:21,541 : INFO : RAG Solution Result: "WRONG ANSWER!!\n\n<expected>\n'Case #1: 5\nCase #2: -2\nCase #3: 0\nCase #4: -2\n'\n</expected>\n---\n<got>\n''\n</got>"
2024-09-05 10:48:21,543 : INFO : Reflecting and improving solution
2024-09-05 10:48:21,549 : INFO

2024-09-05 10:49:40,237 : INFO : Retrying request to /chat/completions in 0.785914 seconds
2024-09-05 10:49:44,559 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:49:46,886 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:49:46,888 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:49:46,889 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:49:46,890 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:49:54,604 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:49:54,605 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:49:54,606 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-

2024-09-05 10:55:37,518 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:55:37,520 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:55:37,522 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:55:37,523 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:55:37,525 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:55:37,527 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:55:46,660 : INFO : Retrying request to /chat/completions in 0.842058 seconds
2024-09-05 10:55:46,998 : INFO : HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-09-05 10:55:47,026 : INFO : HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-09-05 10:55:

2024-09-05 10:57:53,044 : INFO : Retrying request to /chat/completions in 0.914691 seconds
2024-09-05 10:57:53,045 : INFO : Retrying request to /chat/completions in 0.817028 seconds
2024-09-05 10:57:53,046 : INFO : Retrying request to /chat/completions in 0.767608 seconds
2024-09-05 10:57:53,047 : INFO : Retrying request to /chat/completions in 0.849657 seconds
2024-09-05 10:57:55,670 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:57:55,675 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:57:55,677 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:57:55,678 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:57:55,680 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 10:57:55,681 : INFO : HTTP Request: POST https://api.openai.

2024-09-05 11:02:16,983 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 11:02:16,986 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 11:02:16,988 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 11:02:19,094 : INFO : RAG Solution Result: "WRONG ANSWER!!\n\n<expected>\n'Case #1: 4\nCase #2: 7\nCase #3: 1\nCase #4: -1\nCase #5: 6\nCase #6: -1\nCase #7: 1000000002\n'\n</expected>\n---\n<got>\n'Case #1: 1\nCase #2: 1\nCase #3: 1\nCase #4: 2\nCase #5: 2\nCase #6: 4\nCase #7: 3\n'\n</got>"
2024-09-05 11:02:19,096 : INFO : Reflecting and improving solution
2024-09-05 11:02:21,301 : INFO : RAG Solution Result: 'failed:   File "<string>", line 88\n    print("\n          ^\nSyntaxError: unterminated string literal (detected at line 88)\n'
2024-09-05 11:02:21,303 : INFO : Reflecting and improving solution
2024-09-05 11:02:22,569 : INFO : HTTP R

2024-09-05 11:02:25,744 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 11:02:25,746 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 11:02:25,747 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 11:02:27,907 : INFO : RAG Solution Result: "WRONG ANSWER!!\n\n<expected>\n'Case #1: 3\nCase #2: 1\nCase #3: 0\nCase #4: 199\nCase #5: 100\nCase #6: 1999999999999\n'\n</expected>\n---\n<got>\n'Case #1: 2\nCase #2: 0\nCase #3: 0\nCase #4: 182\nCase #5: 66\nCase #6: 1999999999998\n'\n</got>"
2024-09-05 11:02:27,909 : INFO : Reflecting and improving solution
2024-09-05 11:02:30,032 : INFO : RAG Solution Result: "WRONG ANSWER!!\n\n<expected>\n'Case #1: 3\nCase #2: 1\nCase #3: 0\nCase #4: 199\nCase #5: 100\nCase #6: 1999999999999\n'\n</expected>\n---\n<got>\n'Case #1: 2\nCase #2: 1\nCase #3: 0\nCase #4: 20\nCase #5: 100\nCase #6: 1000000000000\n'\n

2024-09-05 11:02:48,383 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 11:02:48,389 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 11:03:20,509 : INFO : RAG Solution Result: 'timed out'
2024-09-05 11:03:20,517 : INFO : Reflecting and improving solution
2024-09-05 11:03:22,841 : INFO : RAG Solution Result: "WRONG ANSWER!!\n\n<expected>\n'Case #1: 5\nCase #2: -2\nCase #3: 0\nCase #4: -2\n'\n</expected>\n---\n<got>\n'Case #1: 0\nCase #2: -2\nCase #3: 0\nCase #4: -2\n'\n</got>"
2024-09-05 11:03:22,844 : INFO : Reflecting and improving solution
2024-09-05 11:03:22,852 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 11:03:22,853 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 11:03:22,855 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-0

2024-09-05 11:04:01,222 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 11:04:03,577 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 11:04:06,830 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 11:04:06,832 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 11:04:10,145 : INFO : Reworked solution result: "WRONG ANSWER!!\n\n<expected>\n'Case #1: 4\nCase #2: 7\nCase #3: 1\nCase #4: -1\nCase #5: 6\nCase #6: -1\nCase #7: 1000000002\n'\n</expected>\n---\n<got>\n'Case #1: 1\nCase #2: 1\nCase #3: 1\nCase #4: 2\nCase #5: 2\nCase #6: 1\nCase #7: 1\n'\n</got>"
2024-09-05 11:04:10,147 : INFO : Failed to generate a solution
2024-09-05 11:04:10,151 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 11:04:12,260 : INFO : Reworked solution result: "WR

2024-09-05 11:04:12,267 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 11:04:14,434 : INFO : Reworked solution result: "WRONG ANSWER!!\n\n<expected>\n'Case #1: 5\nCase #2: -2\nCase #3: 0\nCase #4: -2\n'\n</expected>\n---\n<got>\n'Case #1: -4\nCase #2: -2\nCase #3: -2\nCase #4: 0\n'\n</got>"
2024-09-05 11:04:14,436 : INFO : Failed to generate a solution


2024-09-05 11:04:15,512 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 11:04:31,381 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 11:04:33,606 : INFO : Reworked solution result: "WRONG ANSWER!!\n\n<expected>\n'Case #1: 3\nCase #2: 1\nCase #3: 0\nCase #4: 199\nCase #5: 100\nCase #6: 1999999999999\n'\n</expected>\n---\n<got>\n'Case #1: 2\nCase #2: 0\nCase #3: 0\nCase #4: 182\nCase #5: 66\nCase #6: 1999999999998\n'\n</got>"
2024-09-05 11:04:33,608 : INFO : Failed to generate a solution
2024-09-05 11:04:33,613 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


2024-09-05 11:04:36,800 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 11:04:38,934 : INFO : Reworked solution result: 'failed:   File "<string>", line 2\n    // Calculate the minimum cost to get the required resources\n    ^^\nSyntaxError: invalid syntax\n'
2024-09-05 11:04:38,936 : INFO : Failed to generate a solution
2024-09-05 11:04:38,940 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


2024-09-05 11:04:51,486 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 11:05:23,786 : INFO : Reworked solution result: 'timed out'
2024-09-05 11:05:23,792 : INFO : Failed to generate a solution
2024-09-05 11:05:23,800 : INFO : HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-05 11:05:26,023 : INFO : Reworked solution result: "WRONG ANSWER!!\n\n<expected>\n'Case #1: 5\nCase #2: -2\nCase #3: 0\nCase #4: -2\n'\n</expected>\n---\n<got>\n'Case #1: 0\nCase #2: -2\nCase #3: 0\nCase #4: -2\n'\n</got>"
2024-09-05 11:05:26,025 : INFO : Failed to generate a solution
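
The failure reports in the log above share one shape: an `<expected>` block and a `<got>` block, each holding one `Case #N: answer` line per test case. As a minimal sketch, such a report can be reduced to just the cases that differ, which is easier to scan than the raw string. The helper names `parse_cases` and `mismatched_cases` are hypothetical and not part of the starter kit.

```python
import re


def parse_cases(block: str) -> dict[int, str]:
    """Map case number -> answer text for lines shaped like 'Case #1: 5'."""
    pairs = re.findall(r"Case #(\d+):[ \t]*(.*)", block)
    return {int(number): answer.strip() for number, answer in pairs}


def mismatched_cases(report: str) -> list[tuple[int, str, str]]:
    """Return (case, expected, got) for every case whose answer differs.

    Assumes the '<expected>...</expected>' / '<got>...</got>' layout seen in
    the log output; any other report (e.g. 'timed out') yields an empty list.
    """
    expected = re.search(r"<expected>(.*?)</expected>", report, re.DOTALL)
    got = re.search(r"<got>(.*?)</got>", report, re.DOTALL)
    if expected is None or got is None:
        return []
    expected_cases = parse_cases(expected.group(1))
    got_cases = parse_cases(got.group(1))
    return [
        (number, answer, got_cases.get(number, "<missing>"))
        for number, answer in sorted(expected_cases.items())
        if got_cases.get(number) != answer
    ]


# Report string copied from the first log line above.
report = (
    "WRONG ANSWER!!\n\n<expected>\n'Case #1: 5\nCase #2: -2\nCase #3: 0\n"
    "Case #4: -2\n'\n</expected>\n---\n<got>\n'Case #1: 0\nCase #2: -2\n"
    "Case #3: 0\nCase #4: -2\n'\n</got>"
)
print(mismatched_cases(report))  # [(1, '5', '0')]
```

Here only the first case is wrong, which suggests the solution handles most of the sample input and misses one scenario.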


Okay, that completes the demo!

Key takeaways from this demo:
1. We used LLM agents to attempt challenging competitive programming problems from the HackerCup AI challenge.
2. We built three agents, each using a different technique:
    - Zero-shot agent
    - RAG agent
    - RAG reflection agent
3. We used Weave to trace and evaluate the agents and to compare their performance.
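
The RAG reflection agent's log messages ("Reflecting and improving solution", "Reworked solution result", "Failed to generate a solution") suggest a draft, test, rework pattern. The loop below is a minimal sketch of that pattern, not the notebook's actual implementation: `solve_with_reflection`, the `PASSED` marker, and the three callables it receives are hypothetical stand-ins for the LLM calls and the test runner.

```python
from typing import Callable

PASSED = "passed"  # assumed marker for a successful test run


def solve_with_reflection(
    problem: str,
    generate: Callable[[str], str],           # problem -> draft code
    check: Callable[[str], str],              # code -> test report
    reflect: Callable[[str, str, str], str],  # problem, code, report -> new code
    max_reworks: int = 1,
) -> tuple[str, bool]:
    """Draft a solution, test it, and rework it while the tests keep failing."""
    code = generate(problem)
    report = check(code)
    reworks = 0
    while report != PASSED and reworks < max_reworks:
        # Feed the failure report back so the next attempt can address it.
        code = reflect(problem, code, report)
        report = check(code)
        reworks += 1
    return code, report == PASSED


# Toy stand-ins so the sketch runs without any LLM or sandbox.
def toy_generate(problem: str) -> str:
    return "return 0"


def toy_check(code: str) -> str:
    return PASSED if code == "return 42" else "WRONG ANSWER!!"


def toy_reflect(problem: str, code: str, report: str) -> str:
    return "return 42"


print(solve_with_reflection("toy problem", toy_generate, toy_check, toy_reflect))
# ('return 42', True)
```

Capping the number of reworks keeps the cost per problem bounded, since every rework means at least one more model call and one more test run.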

We hope you found this demo useful and that it gave you some ideas for using LLM agents to solve challenging problems.
