<a href="https://colab.research.google.com/github/tcapelle/hackercup_rag/blob/main/rag_code_agent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!--- @wandbcode{rag-hackercup} -->

<img src="http://wandb.me/logo-im-png" width="400" alt="Weights & Biases" />

# W&B Lightning Competition - AI Hacker Cup


[Weights & Biases](https://wandb.ai/site?utm_source=colab&utm_medium=code&utm_campaign=lightning-ai-hacker-cup) are running a 7-day Lightning Competition focussed on solving practice problems for the  [2024 NeurIPS AI Hacker Cup](https://hackercupai.github.io/) challenge.

#### Goal
The goal is to solve all 5 of the 2023 practice questions for the AI Hacker Cup using MistralAI's models. We’re offering free MistralAI API access via the code in this colab to get people started.

#### Competition GitHub
The competition [repo here](https://github.com/tcapelle/hackercup_rag) contains this colab, the code for the Code Generation Agent, details on how to make a submission, and the competition rules.

#### Discord
You can join the official NeurIPS AI Hacker Cup [discord here](https://discord.gg/wWeN9hTH32) to share ideas and discuss winning solutions.

## Prizes

Weights & Biases are giving away a pair of Meta Ray-Ban Smart Glasses for:
-  the first individual to submit code that correctly solves 4 out of 5 problems
-  the first individual to submit code that correctly solves 5 out of 5 problems

## Entry Submissions, Rules & Deadline

See the [competition README](https://github.com/tcapelle/hackercup_rag) for how to make a submission and for the competition rules.

## W&B Weave

[W&B Weave](https://weave-docs.wandb.ai/tutorial-eval?utm_source=colab&utm_medium=code&utm_campaign=lightning-ai-hacker-cup) is used in this competition to run the evaluations. It is a lightweight toolkit for tracking and evaluating LLM applications, built by Weights & Biases. 

<img src="https://raw.githubusercontent.com/wandb/weave/master/docs/static/img/evals-hero.png" width="800" height="450">

If you want to learn more about Weave, you can [get started](https://weave-docs.wandb.ai/quickstart?utm_source=colab&utm_medium=code&utm_campaign=lightning-ai-hacker-cup) by decorating Python functions with `@weave.op`.

# Using RAG for a Code Generation Agent

This colab demonstrates how to retrieve over a dataset of coding question-answer pairs (the [CodeContests](https://huggingface.co/datasets/deepmind/code_contests) dataset from DeepMind) in order to find similar questions that might help our Agent generate the correct solution.
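
To illustrate the retrieval idea in isolation, here is a minimal sketch that ranks a tiny, made-up corpus of problems by similarity to a query. It is an assumption-laden stand-in: it uses bag-of-words cosine similarity from the standard library rather than whatever the `Retriever` in `retriever.py` actually does, and `tokenize`, `cosine_similarity`, `retrieve_similar`, and `toy_corpus` are hypothetical names introduced only for this example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Turn text into lowercase bag-of-words token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_similar(query: str, corpus: list, k: int = 2) -> list:
    """Return the k corpus entries whose descriptions best match the query."""
    query_vec = tokenize(query)
    scored = [
        (cosine_similarity(query_vec, tokenize(doc["description"])), doc)
        for doc in corpus
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


# Hypothetical miniature corpus standing in for question-answer pairs.
toy_corpus = [
    {"name": "pair_sums", "description": "Pair up numbers so every pair has the same sum"},
    {"name": "shortest_path", "description": "Find the shortest path in a weighted graph"},
    {"name": "coin_change", "description": "Count the ways to make change with given coins"},
]

top = retrieve_similar("Make the sum of each pair of apple weights the same", toy_corpus, k=1)
print(top[0]["name"])  # pair_sums
```

In a full pipeline, the solutions attached to the retrieved problems would then be offered to the model as few-shot examples.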

A more detailed walkthrough of the approach we will use in this notebook can be found in the following **[YouTube video](https://www.youtube.com/watch?v=cObBj2UpWK8)**:

<a target="_blank" href="https://www.youtube.com/watch?v=cObBj2UpWK8">
<img src="https://img.youtube.com/vi/cObBj2UpWK8/0.jpg" width="400" height="300">
</a>

## Setup 

**Note: You need to run this cell only once.**

We will clone the competition repo, set it as our working directory, and install the dependencies for the project.

**You can comment out the cell after you have run it once.**

In [None]:
# Clone the competition repo
!git clone https://github.com/tcapelle/hackercup_rag
# Change directory to the cloned repo. Running the next line twice in the same session will raise an error.
%cd hackercup_rag
# Install dependencies
!pip install -r requirements.txt -qq

To run this colab, create a [free Weights & Biases (W&B) account here](https://wandb.ai/site?utm_source=colab&utm_medium=code&utm_campaign=lightning-ai-hacker-cup) and then copy your API key from https://wandb.ai/authorize into the input box below when requested.

In [1]:
import os
import weave

WEAVE_PROJECT = "ai-hacker-cup"
weave_client = weave.init(WEAVE_PROJECT)

Logged in as Weights & Biases user: capecape.
View Weave data at https://wandb.ai/capecape/ai-hacker-cup/weave


In [2]:
# Set the URL for the Mistral model api we'll be using
os.environ["BASE_URL"] = "http://195.242.24.252:8000/v1"

# Select which MistralAI models to use, depending on whether you want a fast or a strong LLM
FAST_LLM = "open-mistral-nemo-2407"
STRONG_LLM = "mistral-large-latest"
os.environ["FAST_LLM"] = FAST_LLM
os.environ["STRONG_LLM"] = STRONG_LLM

# Set the max tokens for the models and how many parallel requests to make in Weave Evaluations
os.environ["MAX_TOKENS"] = "4096"
os.environ["WEAVE_PARALLELISM"] = "1"

## Challenges Dataset
We will use the **practice** dataset from the **2023** [HackerCup dataset](https://huggingface.co/datasets/hackercupai/hackercup).

We have already processed the dataset and saved it as a [`weave.Dataset`](https://weave-docs.wandb.ai/guides/core-types/datasets/?utm_source=colab&utm_medium=code&utm_campaign=lightning-ai-hacker-cup). You can either use the Dataset by running the next cell or download the dataset using the instructions below.

We will use this challenge dataset to load some practice problems and solutions from the HackerCup dataset and evaluate our agents on it.

In [3]:
from agent import rag_solver, rework_solution
from utils import Problem

practice_dataset_uri = "weave:///parambharat/hackercup/object/practice_dataset:R35fXf9N3FE2IOesg7bRPaPAxiE9YbpirhXO9HcHs8w"
problems_dataset = weave.ref(practice_dataset_uri).get().rows[:]
problems = list(map(lambda x: Problem(**x), problems_dataset))
problem = problems[0]  # Select the first problem

print("Sample Problem:\n\n", problem.model_dump_json(indent=2))

2024-09-06 17:18:45,768 : INFO : Use pytorch device_name: mps
2024-09-06 17:18:45,768 : INFO : Load pretrained SentenceTransformer: jinaai/jina-embeddings-v2-base-code


Sample Problem:

 {
  "problem_dir": "data/2023/practice",
  "problem_name": "two_apples_a_day",
  "problem_description": "“An apple a day keeps the doctor away” is Steve’s motto. His other motto, “You can never have too much of a good thing,” holds true for both apples and mottos. Steve would like to eat two apples per day for the next \\(N\\) days, but with strict adherence to his third motto “Consistency is key.” Specifically, he’d like the sum of the two apple weights he eats over the next \\(N\\) days to be the same for each day.\n\nSteve has already purchased \\(2*N-1\\) apples, the \\(i\\)th of which weighs \\(A_i\\) ounces. He'd like to buy one more apple that's as light as possible to fulfill his goal. Steve can buy an apple of any positive integer weight in ounces from the store. Is it possible for him to reach his goal, and if so, what weight apple should he buy?\n\n{{PHOTO_ID:1563872647765708|WIDTH:600}}\n\n\n*The above image depicts the solution to the first sample. Each d

#### [Alternative] Download the raw challenges dataset

You can alternatively download the full raw challenges dataset; see the README for instructions.

#### Turn on logging and asyncio for notebooks

In [4]:
import asyncio
import logging
from nest_asyncio import apply

apply()
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO
)
logger = logging.getLogger(__name__)

## Running a RAG + Reflection Agent

### RAG Agent with Reflection

We will combine a RAG Agent with Reflection in order to:

- Retrieve similar types of questions from the CodeContests dataset, generate a solution, reflect on the solution and test results and improve it.
- We then use this improved solution to generate new few-shot examples and repeat the process in a loop until we converge to a solution or the iteration limit is reached.

`agent.py` contains the prompts used for analysis (`ANALYSIS_INSTRUCTIONS`), reflection (`REFLECTION_INSTRUCTIONS`) and problem solving (`SOLVER_INSTRUCTIONS`). Feel free to edit them to improve the system.

In [5]:
from agent import REFLECTION_INSTRUCTIONS

print(REFLECTION_INSTRUCTIONS)

You are a world-class competitive programmer with a keen eye for detail and problem solving. 
Your expertise is in algorithms and data structures. 
You have incorrectly answered the following programming problem. 
Your task is to reflect on the problem, your solution, and the correct answer.
You will then use this information help you answer the same question in the future. 
First, explain why you answered the question incorrectly.
Second, list the keywords that describe the type of your errors from most general to most specific.
Third, solve the problem again, step-by-step, based on your knowledge of the correct answer.
Fourth, create a list of detailed instructions to help you correctly solve this problem in the future.
Finally, create a list of general advice to help you solve similar types of problems in the future.
Be concise in your response; however, capture all of the essential information.

{problem}
<incorrect_solution>
{incorrect_solution}
</incorrect_solution>
<test_report>

### Retriever

The code used for retrieval over the CodeContests dataset can be found in `retriever.py`. Here we'll initialise our retriever.

In [6]:
from retriever import Retriever

retriever = Retriever()

2024-09-06 17:18:55,859 : DEBUG : Building index from IDs objects      
                                                                               

### RAG Solver Pipeline

Here we run the code generation pipeline which:
- given a problem, retrieves similar problems from the CodeContests dataset
- generates candidate code for the problem
- executes the code
- checks if the executed code generates the correct solution
- terminates if the solution is correct; otherwise it retries, up to `max_iterations` times

Note: `code_execution_timeout` limits the time available for the generated Python code to execute, since the model sometimes generates recursive code that never terminates.
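
As a rough illustration of why such a timeout matters, the sketch below runs a snippet of generated code in a separate Python process and gives up after a fixed number of seconds. This is only one possible approach and is not taken from `agent.py`: `run_with_timeout`, `check_output`, and the `"failed: ..."` report strings are hypothetical, while `"passed"` mirrors the value the solver in this notebook reports on success.

```python
import subprocess
import sys


def run_with_timeout(source_code: str, stdin_text: str, timeout: int = 10) -> str:
    """Run source_code in a fresh Python process; return its stdout or a failure report."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", source_code],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout,  # the child process is killed once this many seconds elapse
        )
    except subprocess.TimeoutExpired:
        return f"failed: timed out after {timeout}s"
    if completed.returncode != 0:
        # Surface the traceback so it can be fed back to the model.
        return f"failed: {completed.stderr.strip()}"
    return completed.stdout


def check_output(actual: str, expected: str) -> str:
    """Compare program output with the expected output, ignoring outer whitespace."""
    return "passed" if actual.strip() == expected.strip() else "failed: wrong answer"


good_code = "print(int(input()) * 2)"
looping_code = "while True: pass"

print(check_output(run_with_timeout(good_code, "21\n"), "42"))  # passed
print(run_with_timeout(looping_code, "", timeout=1))  # failed: timed out after 1s
```

Running the code in a child process rather than with `exec` means a runaway program can be killed without taking the notebook kernel down with it.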

In [7]:
@weave.op
async def rag_solver_with_reflection(
        retriever: Retriever,
        problem: Problem,
        model: str = FAST_LLM,
        temperature: float = 0.7,
        max_iterations: int = 2,
        code_execution_timeout: int = 10,
):
    num_iterations = 0
    test_report = "failed"
    solution = None
    while test_report != "passed" and num_iterations < max_iterations:
        rag_result = await rag_solver(
            retriever=retriever,
            problem=problem,
            timeout=code_execution_timeout,
            model=model,
            temperature=temperature,
        )
        solution = rag_result["solution"]
        test_report = rag_result["test_report"]
        if test_report == "passed":
            logger.info(f"Passing solution generated successfully for problem: {problem.problem_name}")
            return rag_result
        
        logger.info(f"Solution failed, reworking solution. Problem: {problem.problem_name}")
        rework_result = await rework_solution(
            problem=problem,
            incorrect_solution=solution,
            test_report=test_report,
            model=model,
            temperature=temperature,
            timeout=code_execution_timeout,
        )
        solution = rework_result["solution"]
        test_report = rework_result["test_report"]
        if test_report == "passed":
            logger.info(f"Re-worked solution passed for problem: {problem.problem_name}")
            return {
                "solution": solution,
                "stage": "reflection",
                "test_report": test_report,
            }
        num_iterations += 1
        logger.info(f"Re-worked solution failed, trying iteration {num_iterations}. Problem: {problem.problem_name}")
    logger.info(f"Failed to generate a solution after {num_iterations} iterations. Problem: {problem.problem_name}")
    return {"solution": solution, "stage": "failed", "test_report": test_report}

Let's run the pipeline on 1 problem:

In [8]:
reflection_result = await rag_solver_with_reflection(
    retriever, problem, STRONG_LLM, max_iterations=2, code_execution_timeout=30
)

print("*" * 80)
print(reflection_result["solution"].source_code)
print("*" * 80)
print(reflection_result["test_report"])

2024-09-06 17:19:04,144 : INFO : Drafting intial zero-shot solution
2024-09-06 17:19:25,815 : INFO : HTTP Request: POST http://195.242.24.252:8000/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-06 17:19:33,982 : INFO : HTTP Request: POST http://195.242.24.252:8000/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-06 17:19:37,661 : INFO : Draft solution result: "WRONG ANSWER!!\n\n<expected>\n'Case #1: 4\nCase #2: 7\nCase #3: 1\nCase #4: -1\nCase #5: 6\nCase #6: -1\nCase #7: 1000000002\n'\n</expected>\n---\n<got>\n'Case #1: -1\nCase #2: -1\nCase #3: -1\nCase #4: -1\nCase #5: 5\nCase #6: 4\nCase #7: -1\n'\n</got>"
2024-09-06 17:19:37,662 : INFO : Iterating on a RAG solution
2024-09-06 17:19:38,306 : INFO : Generating RAG solution:
2024-09-06 17:19:38,937 : INFO : Generating examplars:
Batches: 100%|██████████| 1/1 [00:00<00:00,  1.45it/s]
Batches: 100%|██████████| 25/25 [00:08<00:00,  2.98it/s]
2024-09-06 17:19:59,399 : INFO : HTTP Request: POST http://195.242.24.252:8000/v1/chat/completions "

********************************************************************************
import sys

def main():    input = sys.stdin.read
data = input().split()

T = int(data[0])
index = 1
results = []

for _ in range(T):
    N = int(data[index])
    weights = list(map(int, data[index + 1:index + 2 * N]))
    index += 2 * N

    total_sum = sum(weights)
    target_sum = total_sum // N

    if total_sum % N == 0:
        additional_weight = target_sum - weights[2 * N - 1]
    else:
        additional_weight = -1

    results.append(f"Case #{_ + 1}: {additional_weight}")

sys.stdout.write("\n".join(results) + "\n")

if __name__ == "__main__":
    main()
********************************************************************************
failed:   File "<string>", line 25
    sys.stdout.write("
                     ^
SyntaxError: unterminated string literal (detected at line 25)



# Evaluation

Now we are ready to evaluate against the expected solutions.

### Create a Weave Model
First we create a Weave ["Model"](https://weave-docs.wandb.ai/guides/core-types/models?utm_source=colab&utm_medium=code&utm_campaign=lightning-ai-hacker-cup), which has a `predict` function that Weave Evaluations will call to generate a solution. It also has various attributes that we can set to adjust the behaviour of our pipeline.

In [None]:
class RAGReflectionAgent(weave.Model):
    retriever: Retriever
    max_iterations: int = 2
    code_execution_timeout: int = 30
    model: str = STRONG_LLM
    temperature: float = 0.7

    @weave.op
    async def predict(self, problem: dict):
        return await rag_solver_with_reflection(
            self.retriever,
            Problem(**problem),
            model=self.model,
            temperature=self.temperature,
            max_iterations=self.max_iterations,
            code_execution_timeout=self.code_execution_timeout,
        )

### Create a Scorer and define the Evals Dataset

We expect the output of the "test_report" from our solver above to be `"passed"` if the solution is correct. You can think of `expected_result` in the `evals_dataset` as the label that the `test_report` from our solver needs to return in order to ensure the generated solution is correct.

Weave Evaluations expect a list of dictionaries. 

In [None]:
evals_dataset = [{"problem": problem, "expected_result": "passed"} for problem in problems]

Weave Evaluations use a scorer function that returns a metric and its result in a dict. Here we define a metric that checks whether the code generated by the agent passed the test case.

In [None]:
@weave.op
def scorer(expected_result: str, model_output: dict) -> dict:
    if model_output is None or model_output["test_report"] is None:
        return {"solution_passed": False}
    return {"solution_passed": expected_result == model_output["test_report"]}


# Weave Evaluations take a dataset and scoring functions.
# This is an evaluation that checks if the code generated by the agent passed the test
evaluator = weave.Evaluation(dataset=evals_dataset, scorers=[scorer])
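
To make the scoring step concrete, here is a minimal pure-Python sketch of what an evaluation over this kind of dataset computes: call the model on every row, score each output against `expected_result`, and aggregate. It is not Weave's implementation. `toy_scorer`, `toy_evaluate`, the canned outputs, and the summary keys are hypothetical and exist only for illustration.

```python
from typing import Callable, Optional


def toy_scorer(expected_result: str, model_output: Optional[dict]) -> dict:
    """Same logic as the scorer above: a missing output counts as a failure."""
    if model_output is None or model_output.get("test_report") is None:
        return {"solution_passed": False}
    return {"solution_passed": expected_result == model_output["test_report"]}


def toy_evaluate(rows: list, predict: Callable[[str], Optional[dict]]) -> dict:
    """Score every row and report how many solutions passed."""
    passed_flags = [
        toy_scorer(row["expected_result"], predict(row["problem"]))["solution_passed"]
        for row in rows
    ]
    return {
        "passed_count": sum(passed_flags),
        "pass_rate": sum(passed_flags) / len(passed_flags),
    }


# Canned outputs standing in for real agent predictions (made up for this example).
canned_outputs = {
    "problem_a": {"test_report": "passed"},
    "problem_b": {"test_report": "failed: wrong answer"},
    "problem_c": None,  # e.g. the pipeline returned nothing
}
rows = [{"problem": name, "expected_result": "passed"} for name in canned_outputs]

summary = toy_evaluate(rows, canned_outputs.get)
print(summary["passed_count"], "of", len(rows), "passed")  # 1 of 3 passed
```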

### Run the Evaluation
Now we instantiate the Agent and run the evaluation. Results from the evaluation will be displayed in the W&B Weave UI.

In [None]:
# Evaluate the RAG reflection agent
tasks = []

LLM = STRONG_LLM
eval_temperature = 0.7

# Instantiate the agent, which is a subclass of `weave.Model`
rag_reflection_agent = RAGReflectionAgent(
    retriever=retriever, model=LLM, temperature=eval_temperature, code_execution_timeout=30
)

# Evaluate the agent by passing it to the evaluator
# Weave Evaluations are async, so we use `asyncio.gather` to run them in parallel
rag_reflection_results = evaluator.evaluate(rag_reflection_agent)
tasks.append(rag_reflection_results)

rag_reflection_results = await asyncio.gather(*tasks)

logger.info(rag_reflection_results)
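
The cell above appends a single evaluation to `tasks`, so `asyncio.gather` only adds concurrency once several agents are evaluated together. The sketch below shows that pattern with stand-in coroutines. `fake_evaluation` and `run_all` are hypothetical helpers, and the sleep merely simulates waiting on model calls.

```python
import asyncio


async def fake_evaluation(agent_name: str, delay: float) -> dict:
    """Stand-in for an evaluation coroutine; sleeps instead of calling a model."""
    await asyncio.sleep(delay)
    return {"agent": agent_name, "status": "finished"}


async def run_all() -> list:
    # Each call creates a coroutine; nothing runs until it is awaited or gathered.
    tasks = [
        fake_evaluation("fast_agent", 0.02),
        fake_evaluation("strong_agent", 0.01),
    ]
    # gather runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*tasks)


# In a plain script; inside a notebook you would write `await run_all()` instead.
results = asyncio.run(run_all())
print([result["agent"] for result in results])  # ['fast_agent', 'strong_agent']
```

Note that the results come back in the order the coroutines were passed in, not the order in which they finished.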