# Comparing a Multi-Agent System to a Single-Model Baseline for Code Generation

This notebook presents a comparative study evaluating the performance of a multi-agent system for automated code generation against a single large language model (LLM) baseline. The multi-agent system, designed with specialized agents for tasks such as preprocessing, retrieval, planning, coding, and debugging, aims to tackle complex coding challenges more effectively than a monolithic single-model approach.

The objective of this study is to quantify the difference in performance between the two approaches across a set of coding problems. We will evaluate performance based on metrics such as the success rate in generating correct code, the time taken to produce a solution, and potentially other relevant factors like the number of iterations or debugging steps.

This notebook will cover:

1.  **Setup:** Loading necessary libraries, configuring the environment, and preparing the coding problems for evaluation.
2.  **Baseline (Single Model):** Implementing and running the single-model approach on the chosen coding problems.
3.  **Multi-Agent System:** Integrating and running the existing multi-agent system within the notebook environment.
4.  **Evaluation:** Running both approaches on the same set of problems and collecting performance data.
5.  **Analysis:** Comparing the collected data to draw conclusions about the effectiveness of each approach.
6.  **Results and Discussion:** Presenting the findings and discussing the implications of the results.
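
The metrics named in the objective (success rate and time taken) reduce to simple aggregates over per-run records. Below is a minimal sketch, assuming each run is stored as a dictionary with `approach`, `success`, and `time_taken` keys, which are the field names used for the results later in this notebook. The `summarize` helper and the sample records are hypothetical and are not measured results.

```python
from collections import defaultdict
from statistics import mean


def summarize(results: list[dict]) -> dict[str, dict[str, float]]:
    """Group run records by approach and compute success rate and mean time."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for record in results:
        grouped[record["approach"]].append(record)

    summary = {}
    for approach, records in grouped.items():
        summary[approach] = {
            # Fraction of runs whose generated code passed the test
            "success_rate": sum(1 for r in records if r["success"]) / len(records),
            # Average wall-clock time per run, in seconds
            "mean_time_s": mean(r["time_taken"] for r in records),
        }
    return summary


# Made-up records for illustration only; these are not measured results.
example_results = [
    {"day": 1, "approach": "single-model", "success": True, "time_taken": 4.0},
    {"day": 2, "approach": "single-model", "success": False, "time_taken": 6.0},
    {"day": 1, "approach": "system", "success": True, "time_taken": 40.0},
    {"day": 2, "approach": "system", "success": True, "time_taken": 60.0},
]
summary = summarize(example_results)
print(summary)
```

Keeping one record per (day, approach, model) run makes it straightforward to add further metrics, such as the number of debugging steps, as extra keys.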


## Setup

In [29]:
import os
import sys
import dotenv
import json
from loguru import logger

# Append the models path in order to import the models
PROJECT_ROOT = os.path.join(os.getcwd(), 'src/')
print(PROJECT_ROOT)
sys.path.append(PROJECT_ROOT)

# Load env variables
dotenv.load_dotenv()

# Set log level INFO
logger.remove()
logger.add(sys.stderr, level="INFO")

/home/twanh/workspace/thesis/thesis-advent-of-agents/src/


2

In [30]:
# Import models from the system
from models.base_model import BaseLanguageModel
from models.gemini_model import GeminiLanguageModel
from models.openai_model import OpenAILanguageModel
from models.deepseek_model import DeepseekLanguageModel

### Load puzzles and input/outputs

In [31]:
# Get the correct paths
test_data_folder = os.path.join(PROJECT_ROOT, '..', 'experiments', 'test_data')
puzzles_folder = os.path.join(test_data_folder, 'puzzles/')
input_output_file = os.path.join(test_data_folder, 'answers2024.json')
puzzle_files = [os.path.join(puzzles_folder, f) for f in os.listdir(puzzles_folder) if os.path.isfile(os.path.join(puzzles_folder, f))]

In [32]:
# Build a lookup of the expected answers, keyed by day
json_data = {}
with open(input_output_file, 'r') as f:
    json_data = {item['day']: item for item in json.load(f)}

puzzle_data = []
for file_path in puzzle_files:
    # Get the day of the puzzle file
    file_name = os.path.basename(file_path)
    day_str = file_name.split('_')[-1].split('.')[0]
    day = int(day_str)

    if day in json_data:
        with open(file_path, 'r') as f:
            puzzle_description = f.read()

        puzzle_info = {
            "year": json_data[day]['year'],
            "day": day,
            "description": puzzle_description,
            "input": json_data[day]['input'],
            "expected_output": json_data[day]['part1']
        }

        puzzle_data.append(puzzle_info)

# Sort by day
puzzle_data.sort(key=lambda x: x['day'])
print(len(puzzle_data)) # should be 25

25
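
The loading cell above relies on each puzzle file name ending in `_<day>.<extension>`. Below is a minimal sketch of that parsing step in isolation. The file names are hypothetical, since the actual names in `puzzles/` are not shown in this notebook.

```python
import os


def day_from_filename(file_path: str) -> int:
    """Return the day encoded after the last underscore in the file name."""
    file_name = os.path.basename(file_path)
    # Take the part after the last underscore, then drop the extension
    day_str = file_name.split('_')[-1].split('.')[0]
    return int(day_str)


# Hypothetical paths, used only to illustrate the naming assumption
print(day_from_filename('puzzles/puzzle_2024_07.md'))
print(day_from_filename('puzzles/day_12.txt'))
```

A file whose name does not end in an underscore followed by a number would raise a `ValueError` at the `int(...)` call, so stray files in the folder would need to be filtered out first.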


### Model Configurations

Create the model configurations for the baseline and for the system. The Advent of Agents system can assign a different model to each agent, but to keep the comparison fair every agent uses the same model as the single-model baseline.

The models that will be tested are:

<!-- TODO: Update model list -->
- Gemini
- OpenAI

In [None]:
from main import _get_model

# Full set of candidate models
models_to_test = ('gemini-2.0-flash', 'gemini-2.5-pro-preview-05-06', 'gpt-4.1-mini', 'o4-mini', 'o3-mini')

# Restrict this run to a single model (overrides the tuple above)
models_to_test = ("gpt-4.1-mini",)

configs = [_get_model(model) for model in models_to_test]
print(configs)

[<models.openai_model.OpenAILanguageModel object at 0x7016e3f8aa80>]


## Baseline

### Baseline prompt

The prompt is based on the one the Advent of Agents system uses. However, the baseline has no access to the information the other agents provide, so the puzzle reaches the model only through `full_description`.
The steps in the prompt are the same, except that the first step (analyzing the generated plan) is removed.

In [36]:
BASELINE_PROMPT = """
# Advent of Code Implementation Agent

You are an expert coding agent specializing in implementing solutions for Advent of Code puzzles.
Your task is to convert a detailed solution plan into clean, efficient, and correct Python code that solves the given problem.
You excel at translating algorithmic plans into precise implementations.


It will be provided as the following JSON

```json
{{
    "full_description": "The full description of the problem (string)",
}}
```

## YOUR RESPONSIBILITIES

Your primary goal is to produce a complete, correct, and efficient Python implementation that:

1. Correctly solves both the provided examples and will work for the actual puzzle input
2. Follows good software engineering practices
3. Includes appropriate comments and documentation
4. Handles edge cases and potential errors
5. Is executable via command line as: `python3 [program].py [puzzleinputfile]`

## IMPLEMENTATION PROCESS

Follow these steps meticulously:

-----------------------------------------
STEP 1. Design Your Code Structure
-----------------------------------------

- Create a clear, modular structure with well-named functions matching the plan's major steps
- Define appropriate data structures with explicit type hints
- Plan your function signatures and interfaces before implementation
- Use the keywords and underlying concepts to think about which algorithms to use to solve the problem.

-----------------------------------------
STEP 2. Implement Core Logic
-----------------------------------------

- Write robust implementations of all algorithms described in the plan
- Include detailed comments explaining complex logic
- Follow Python best practices (PEP 8, appropriate naming conventions)
- Use type hints throughout your code


-----------------------------------------
STEP 3. Handle Edge Cases Explicitly
-----------------------------------------

- Add specific code to handle all edge cases mentioned in the plan
- Anticipate and handle additional edge cases common in Advent of Code:
  - Empty input
  - Boundary conditions (min/max values)
- Use the test cases to reason about your code and make sure it would solve the test cases correctly

----------------------------------------
STEP 4. Test Against Examples
----------------------------------------

- Include code that runs and validates against all provided examples
- Add assertions to verify intermediate results match expected values
- Print debugging information that would help diagnose issues to STDERR
    - STDOUT can only be used to print the final result.

----------------------------------------
STEP 5. Optimize If Necessary
-----------------------------------------

- Review your solution for performance bottlenecks
- Apply optimizations where appropriate, explaining your choices
- Ensure the solution will scale to handle the full problem input

-----------------------------------------
STEP 6. Finalize Solution
-----------------------------------------

- Ensure your code has a clear entry point (typically a `main()` function)
- Include code to read from the puzzle input file specified as a command-line argument
- Make sure that your code follows the proper structure as documented (example code template) below.
- Add a brief summary comment at the top explaining the approach
- Verify all functions have appropriate docstrings


-----------------------------------------
OUTPUT FORMAT
-----------------------------------------
Your response must be a valid JSON object with the following structure:

The generated code should be provided as the value of the code key in the JSON object. Ensure that the code is properly escaped to be a valid JSON string. This means that any double quotes within the code should be escaped with a backslash (\\"), and newlines should be represented as \\n

```json
{{
  "code": "Complete Python code as a string with all necessary formatting. MAKE SURE THAT THIS IS VALID JSON"
}}
```


## EXAMPLE CODE TEMPLATE

```python
\"\"\"
Advent of Code [Year] Day [Number]: [Title]
Solution implementation based on the provided plan.

Usage: python3 solution.py [input_file]
\"\"\"
from typing import List, Dict, Tuple, Set, Optional
import sys
from collections import defaultdict, deque
import re
# Import other necessary libraries

def parse_input(input_file: str) -> [appropriate_return_type]:
    \"\"\"Parse the puzzle input from file into appropriate data structures.

    Args:
        input_file: Path to the input file

    Returns:
        [Description of return value]
    \"\"\"
    with open(input_file, 'r') as f:
        # Process file content
        pass
    # Implementation...

def solve_part_one(parsed_data: [type]) -> [type]:
    \"\"\"Solve part one of the puzzle.

    Args:
        parsed_data: Processed input data

    Returns:
        Solution for part one
    \"\"\"
    # Implementation...

def main():
    # Check command line arguments
    if len(sys.argv) < 2:
        print("Usage: python3 solution.py [input_file]")
        return

    input_file = sys.argv[1]

    # Parse input
    parsed_data = parse_input(input_file)

    # Solve part one
    part_one_solution = solve_part_one(parsed_data)
    # ONLY PRINT THE RESULT, NO OTHER TEXT
    print(part_one_solution)

    # Test with examples (if available)
    # [Example testing code]

if __name__ == "__main__":
    main()
```

Remember to follow the plan closely while filling in implementation details that the planner may have omitted. Your goal is to bridge the gap between algorithmic description and working code.


-----------------------------------

Your input is:

{json_input}

"""
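
`BASELINE_PROMPT` is filled in with `str.format`, which is why the literal braces of the JSON examples are doubled (`{{`, `}}`) while `{json_input}` is a real placeholder. Below is a minimal sketch of that mechanism with a shortened stand-in template. `TEMPLATE` and the sample description are hypothetical.

```python
import json

# Shortened stand-in for BASELINE_PROMPT; only the templating behaviour matters here.
TEMPLATE = """
Example schema:
{{
    "full_description": "..."
}}

Your input is:

{json_input}
"""

# json.dumps escapes quotes and newlines, so the inserted value is valid JSON
json_inp = json.dumps({'full_description': 'Line one\nHe said "hi"'})

# Doubled braces become single literal braces; {json_input} is substituted
prompt = TEMPLATE.format(json_input=json_inp)
print(prompt)
```

Braces inside the substituted value are not re-interpreted by `format`, so puzzle descriptions that contain `{` or `}` do not need extra escaping.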

### Run baseline

In [37]:
from utils.utils import extract_json_from_markdown
from agents.debugging_agent import DebuggingAgent
from utils.util_types import TestCase

def run_and_test_baseline(puzzle: str, puzzle_input: str, expected_output: str, model: BaseLanguageModel) -> tuple[bool, str]:

    # Create the prompts
    json_inp = json.dumps({'full_description': puzzle})
    prompt = BASELINE_PROMPT.format(json_input=json_inp)
    # Prompt the model
    resp = model.prompt(prompt)

    # Extract the solution
    try:
        code = json.loads(extract_json_from_markdown(resp)[0]).get('code')
    except json.JSONDecodeError:
        print("Failed because decode")
        # TODO: Add retrying?
        return False, "NO CODE"

    # Use debugging agent to test the final solution
    dba = DebuggingAgent(
        'debugging',
        model=model,
        expected_output=expected_output,
        puzzle_input=puzzle_input,
    )
    run_result = dba._run_test(
        code,
        TestCase(
            input_=puzzle_input,
            expected_output=expected_output,
        ),
    )

    return run_result.success, code
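
The extraction step above expects the model to wrap its JSON answer in a fenced block. The sketch below shows how such an answer could be pulled out and decoded. It is not the project's `extract_json_from_markdown`, whose implementation is not shown here, and `extract_fenced_json` and `code_from_response` are hypothetical helpers.

```python
import json
import re
from typing import Optional

# Three backticks, built programmatically so this example nests cleanly in markdown
FENCE = "`" * 3

# Matches a fenced block with an optional "json" language tag
FENCE_RE = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_fenced_json(text: str) -> list[str]:
    """Return the raw contents of every fenced block in `text`."""
    return [match.strip() for match in FENCE_RE.findall(text)]


def code_from_response(resp: str) -> Optional[str]:
    """Return the 'code' field of the first fenced JSON block, or None."""
    blocks = extract_fenced_json(resp)
    if not blocks:
        return None  # no fenced block at all
    try:
        payload = json.loads(blocks[0])
    except json.JSONDecodeError:
        return None  # block present but not valid JSON
    return payload.get("code") if isinstance(payload, dict) else None


# Simulated model reply containing one fenced JSON object
resp = "Here is the solution:\n" + FENCE + "json\n" + json.dumps({"code": "print(42)\n"}) + "\n" + FENCE
print(repr(code_from_response(resp)))
```

Returning `None` for every failure mode (no block, invalid JSON, missing key) lets the caller handle them in one place rather than catching several exception types.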


In [None]:
# baseline_results = {}
# total = 0
# solved = 0
# # TODO: remove slice when running full tests
# for baseline_model in configs[:1]:
#     baseline_results[baseline_model.model_name] = {}
#     print(f"Using model: {baseline_model.model_name}")
#     for puzzle in puzzle_data:
#         total += 1
#         print(f"Running day {puzzle['day']}")
#         # run_and_test_baseline returns (success, code)
#         success, _code = run_and_test_baseline(
#             puzzle=puzzle['description'],
#             puzzle_input=puzzle['input'],
#             expected_output=str(puzzle['expected_output']),
#             model=baseline_model
#         )
#         baseline_results[baseline_model.model_name][puzzle['day']] = success
#         if success:
#             print("Solved!")
#             solved += 1
#         else:
#             print("Not solved")

# print(f"Solved: {solved}/{total}")

Using model: gemini-2.0-flash
Running day 1

2025-05-19 11:28:45.169 | INFO     | agents.debugging_agent:_run_test:337 - Running code with test case
2025-05-19 11:28:45.172 | INFO     | agents.debugging_agent:_run_code:307 - Running code
2025-05-19 11:28:45.244 | INFO     | agents.debugging_agent:_run_test:345 - Test case is successful 1646452=1646452

Solved!
Running day 2

2025-05-19 11:28:49.242 | INFO     | agents.debugging_agent:_run_test:337 - Running code with test case
2025-05-19 11:28:49.245 | INFO     | agents.debugging_agent:_run_code:307 - Running code
2025-05-19 11:28:49.364 | INFO     | agents.debugging_agent:_run_test:345 - Test case is successful 524=524

Solved!
Running day 3
Failed because decode
Not solved
Running day 4

2025-05-19 11:28:58.379 | INFO     | agents.debugging_agent:_run_test:337 - Running code with test case
2025-05-19 11:28:58.380 | INFO     | agents.debugging_agent:_run_code:307 - Running code
2025-05-19 11:28:58.481 | INFO     | agents.debugging_agent:_run_test:345 - Test case is successful 2464=2464

Solved!
Running day 5

2025-05-19 11:29:03.091 | INFO     | agents.debugging_agent:_run_test:337 - Running code with test case
2025-05-19 11:29:03.095 | INFO     | agents.debugging_agent:_run_code:307 - Running code
2025-05-19 11:29:03.267 | INFO     | agents.debugging_agent:_run_test:358 - Got: 5997, expected: 5391


Not solved
Running day 6
Failed because decode
Not solved
Running day 7
Failed because decode
Not solved
Running day 8
Failed because decode
Not solved
Running day 9

2025-05-19 11:29:23.402 | INFO     | agents.debugging_agent:_run_test:337 - Running code with test case
2025-05-19 11:29:23.406 | INFO     | agents.debugging_agent:_run_code:307 - Running code
2025-05-19 11:29:28.592 | INFO     | agents.debugging_agent:_run_test:358 - Got: None, expected: 6279058075753

Not solved
Running day 10

2025-05-19 11:29:33.808 | INFO     | agents.debugging_agent:_run_test:337 - Running code with test case
2025-05-19 11:29:33.810 | INFO     | agents.debugging_agent:_run_code:307 - Running code
2025-05-19 11:29:33.894 | INFO     | agents.debugging_agent:_run_test:345 - Test case is successful 459=459

Solved!
Running day 11

2025-05-19 11:29:37.910 | INFO     | agents.debugging_agent:_run_test:337 - Running code with test case
2025-05-19 11:29:37.913 | INFO     | agents.debugging_agent:_run_code:307 - Running code
2025-05-19 11:29:38.215 | INFO     | agents.debugging_agent:_run_test:345 - Test case is successful 193899=193899

Solved!
Running day 12

2025-05-19 11:29:44.255 | INFO     | agents.debugging_agent:_run_test:337 - Running code with test case
2025-05-19 11:29:44.260 | INFO     | agents.debugging_agent:_run_code:307 - Running code
2025-05-19 11:29:44.486 | INFO     | agents.debugging_agent:_run_test:345 - Test case is successful 1449902=1449902

Solved!
Running day 13

2025-05-19 11:29:51.834 | INFO     | agents.debugging_agent:_run_test:337 - Running code with test case
2025-05-19 11:29:51.837 | INFO     | agents.debugging_agent:_run_code:307 - Running code
2025-05-19 11:29:51.918 | INFO     | agents.debugging_agent:_run_test:345 - Test case is successful 29517=29517

Solved!
Running day 14

2025-05-19 11:29:58.081 | INFO     | agents.debugging_agent:_run_test:337 - Running code with test case
2025-05-19 11:29:58.083 | INFO     | agents.debugging_agent:_run_code:307 - Running code
2025-05-19 11:29:58.168 | INFO     | agents.debugging_agent:_run_test:345 - Test case is successful 230461440=230461440

Solved!
Running day 15

2025-05-19 11:30:05.967 | INFO     | agents.debugging_agent:_run_test:337 - Running code with test case
2025-05-19 11:30:05.972 | INFO     | agents.debugging_agent:_run_code:307 - Running code
2025-05-19 11:30:06.074 | INFO     | agents.debugging_agent:_run_test:358 - Got: 1453784, expected: 1478649


Not solved
Running day 16
Failed because decode
Not solved
Running day 17
Failed because decode
Not solved
Running day 18

2025-05-19 11:30:22.867 | INFO     | agents.debugging_agent:_run_test:337 - Running code with test case
2025-05-19 11:30:22.870 | INFO     | agents.debugging_agent:_run_code:307 - Running code
2025-05-19 11:30:22.991 | INFO     | agents.debugging_agent:_run_test:345 - Test case is successful 334=334

Solved!
Running day 19
Failed because decode
Not solved
Running day 20

2025-05-19 11:30:36.278 | INFO     | agents.debugging_agent:_run_test:337 - Running code with test case
2025-05-19 11:30:36.282 | INFO     | agents.debugging_agent:_run_code:307 - Running code
2025-05-19 11:30:41.355 | INFO     | agents.debugging_agent:_run_test:358 - Got: None, expected: 1415

Not solved
Running day 21
Failed because decode
Not solved
Running day 22

2025-05-19 11:30:52.069 | INFO     | agents.debugging_agent:_run_test:337 - Running code with test case
2025-05-19 11:30:52.073 | INFO     | agents.debugging_agent:_run_code:307 - Running code
2025-05-19 11:30:53.665 | INFO     | agents.debugging_agent:_run_test:345 - Test case is successful 13234715490=13234715490

Solved!
Running day 23

2025-05-19 11:30:58.704 | INFO     | agents.debugging_agent:_run_test:337 - Running code with test case
2025-05-19 11:30:58.707 | INFO     | agents.debugging_agent:_run_code:307 - Running code
2025-05-19 11:31:03.740 | INFO     | agents.debugging_agent:_run_test:358 - Got: None, expected: 1485

Not solved
Running day 24

2025-05-19 11:31:11.193 | INFO     | agents.debugging_agent:_run_test:337 - Running code with test case
2025-05-19 11:31:11.195 | INFO     | agents.debugging_agent:_run_code:307 - Running code
2025-05-19 11:31:11.264 | INFO     | agents.debugging_agent:_run_test:345 - Test case is successful 54715147844840=54715147844840

Solved!
Running day 25

2025-05-19 11:31:17.338 | INFO     | agents.debugging_agent:_run_test:337 - Running code with test case
2025-05-19 11:31:17.342 | INFO     | agents.debugging_agent:_run_code:307 - Running code
2025-05-19 11:31:17.433 | INFO     | agents.debugging_agent:_run_test:358 - Got: 22250, expected: 2900


Not solved
Solved: 11/25
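
In the log above, eight of the 14 unsolved days stopped at "Failed because decode", meaning the reply never reached the test stage. The `TODO` in `run_and_test_baseline` mentions retrying. Below is a minimal sketch of such a retry loop, assuming the model can be wrapped as a plain `ask(prompt) -> str` callable. `prompt_until_valid_json` and the stand-in model are hypothetical and not part of the project.

```python
import json
from typing import Callable, Optional


def prompt_until_valid_json(
    ask: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Call `ask(prompt)` until the reply parses as a JSON object, or give up."""
    for _attempt in range(max_attempts):
        reply = ask(prompt)
        try:
            payload = json.loads(reply)
        except json.JSONDecodeError:
            continue  # malformed reply: ask again
        if isinstance(payload, dict):
            return payload
    return None  # every attempt failed to produce a JSON object


# Stand-in for a model: the first reply is truncated, the second is valid JSON.
replies = iter(['{"code": "print(1)', '{"code": "print(1)"}'])
result = prompt_until_valid_json(lambda _prompt: next(replies), "solve the puzzle")
print(result)
```

Whether the baseline should be allowed retries is a study-design choice: each retry is an extra model call, so the attempt budget would need to be reported alongside the success rate.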


## System

In [38]:
from agents.base_agent import BaseAgent
from agents.coding_agent import CodingAgent
from agents.debugging_agent import DebuggingAgent
from agents.planning_agent import PlanningAgent
from agents.pre_processing_agent import PreProcessingAgent
from agents.retreival_agent import RetrievalAgent
from core.orchestrator import Orchestrator
from utils.util_types import AgentSettings
from core.state import MainState
from utils.util_types import Puzzle

In [39]:
def setup_system(model: BaseLanguageModel, puzzle_input: str, expected_output: str) -> Orchestrator:
    agents = (
        (
            PreProcessingAgent(
                'preprocess', model=model,
            ),
            AgentSettings(enabled=True, can_debug=False),
        ),
        (
            RetrievalAgent(
                'retreival',
                model=model,
                connection_string=os.getenv('DB_CONNECTION_STRING') or '',
                openai_key=os.getenv('OPENAI_API_KEY') or '',
                # Use default weights
                weights=None,
            ),
            AgentSettings(enabled=True, can_debug=False),
        ),
        (
            PlanningAgent(
                'planning',
                model=model,
                n_plans=3,
            ),
            AgentSettings(enabled=True, can_debug=False),
        ),
        (
            CodingAgent('coding', model=model),
            AgentSettings(enabled=True, can_debug=False),
        ),
        (
            DebuggingAgent(
                'debugging',
                model=model,
                expected_output=expected_output,
                puzzle_input=puzzle_input,
            ),
            AgentSettings(enabled=True, can_debug=True),
        ),
    )

    orchestrator = Orchestrator(agents, {})
    return orchestrator


In [40]:
def run__and_test_system(day: int, puzzle_desc: str, puzzle_input: str, expected_output: str, model: BaseLanguageModel) -> tuple[bool, str]:

    orch = setup_system(model, puzzle_input, expected_output)

    puzzle = Puzzle(
        description=puzzle_desc,
        solution=None,
        year=2024,
        day=day,
    )

    state = MainState(puzzle=puzzle)
    ret_state = orch.solve_puzzle(state)

    return ret_state.is_solved, ret_state.final_code

In [None]:
# system_results = {}
# system_total = 0
# system_solved = 0
# # TODO: remove slice when running full tests
# for system_model in configs[:1]:
#     system_results[system_model.model_name] = {}
#     print(f"Using model: {system_model.model_name}")
#     for puzzle in puzzle_data:
#         system_total += 1
#         print(f"Running day {puzzle['day']}")
#         # run__and_test_system returns (is_solved, final_code)
#         success, _code = run__and_test_system(
#             puzzle['day'], puzzle['description'], puzzle['input'], puzzle['expected_output'], system_model
#         )
#         system_results[system_model.model_name][puzzle['day']] = success
#         if success:
#             system_solved += 1

# print(f"Solved: {system_solved}/{system_total}")

2025-05-19 11:50:19.448 | INFO     | core.retreival:init_db:154 - Database initialization complete.
2025-05-19 11:50:19.454 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: preprocess


Using model: gemini-2.0-flash
Running day 1

2025-05-19 11:50:21.825 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-19 11:50:45.588 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-19 11:50:45.593 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-19 11:50:45.595 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-19 11:50:50.859 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-19 11:50:56.336 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-19 11:51:02.126 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Solved puzzle
Running day 2

2025-05-19 11:51:09.283 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-19 11:51:42.538 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-19 11:51:42.544 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-19 11:51:42.548 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-19 11:51:52.295 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-19 11:52:01.821 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-19 11:52:12.063 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Solved puzzle
Running day 3

2025-05-19 11:52:20.914 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-19 11:52:39.361 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-19 11:52:39.364 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-19 11:52:39.366 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-19 11:52:46.726 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-19 11:52:53.768 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-19 11:53:00.133 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Solved puzzle
Running day 4

2025-05-19 11:53:06.900 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-19 11:53:34.550 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-19 11:53:34.552 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-19 11:53:34.553 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-19 11:53:43.200 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-19 11:53:51.595 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-19 11:54:02.757 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Solved puzzle
Running day 5

2025-05-19 11:54:12.703 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-19 11:54:49.768 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-19 11:54:49.772 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-19 11:54:49.774 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-19 11:54:59.999 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-19 11:55:11.363 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-19 11:55:20.329 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Solved puzzle
Running day 6

2025-05-19 11:55:29.490 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-19 11:55:53.171 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-19 11:55:53.173 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-19 11:55:53.174 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-19 11:56:02.306 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-19 11:56:13.316 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-19 11:56:21.800 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Solved puzzle
Running day 7

2025-05-19 11:57:38.494 | INFO     | core.retreival:init_db:154 - Database initialization complete.
2025-05-19 11:57:38.496 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: preprocess
2025-05-19 11:57:42.177 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-19 11:57:55.384 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-19 11:57:55.387 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-19 11:57:55.389 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3


KeyboardInterrupt: 

## Comparing baseline vs system

In [41]:
import time
import datetime
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [42]:
results = []

for model in configs:

    for puzzle in puzzle_data:
    
        puzzle_day = puzzle['day']
        puzzle_description = puzzle['description']
        input_ = puzzle['input']
        expected_output = puzzle['expected_output']

        # Run single model on puzzle
        print(f"---- RUNNING SINGLE MODEL ON DAY {puzzle_day} WITH {model.model_name} ----")

        bl_start_time = time.time()
        bl_suc, bl_code = run_and_test_baseline(puzzle_description, input_, expected_output, model)
        bl_end_time = time.time()

        print(f"Baseline Success: {bl_suc}")

        results.append({
            "day": puzzle_day,
            "approach": 'single-model',
            "model": model.model_name,
            "success": bl_suc,
            "time_taken": bl_end_time - bl_start_time,
            "code": bl_code
        })

        # Run system on puzzle
        print(f"---- RUNNING AOA SYSTEM ON DAY {puzzle_day} WITH {model.model_name} ----")
        sys_start_time = time.time()
        sys_suc, sys_code = run__and_test_system(puzzle_day, puzzle_description, input_, expected_output, model)
        sys_end_time = time.time()

        results.append({
            "day": puzzle_day,
            "approach": 'system',
            "model": model.model_name,
            "success": sys_suc,
            "time_taken": sys_end_time - sys_start_time,
            "code": sys_code
        })

results_df = pd.DataFrame(results)

timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
results_file_path = f'results-base-line-gpt-4.1-mini-{timestamp}'

results_df.to_csv(results_file_path, index=False)

---- RUNNING SINGLE MODEL ON DAY 1 WITH gpt-4.1-mini ----


2025-05-19 15:53:33.798 | INFO     | agents.debugging_agent:_run_test:337 - Running code with test case
2025-05-19 15:53:33.801 | INFO     | agents.debugging_agent:_run_code:307 - Running code
2025-05-19 15:53:33.864 | INFO     | agents.debugging_agent:_run_test:345 - Test case is successful 1646452=1646452
2025-05-19 15:53:34.033 | INFO     | core.retreival:init_db:154 - Database initialization complete.
2025-05-19 15:53:34.034 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: preprocess


Baseline Success: True
---- RUNNING AOA SYSTEM ON DAY 1 WITH gpt-4.1-mini ----


2025-05-19 15:53:39.501 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-19 15:54:34.278 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-19 15:54:34.283 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-19 15:54:34.285 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-19 15:54:51.269 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-19 15:55:37.659 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-19 15:55:53.113 | INFO     | core.orchestrator:solve_puzzle:58 - Running

---- RUNNING SINGLE MODEL ON DAY 2 WITH gpt-4.1-mini ----


2025-05-19 15:56:24.147 | INFO     | agents.debugging_agent:_run_test:337 - Running code with test case
2025-05-19 15:56:24.153 | INFO     | agents.debugging_agent:_run_code:307 - Running code
2025-05-19 15:56:24.246 | INFO     | agents.debugging_agent:_run_test:345 - Test case is successful 524=524
2025-05-19 15:56:24.408 | INFO     | core.retreival:init_db:154 - Database initialization complete.
2025-05-19 15:56:24.409 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: preprocess


Baseline Success: True
---- RUNNING AOA SYSTEM ON DAY 2 WITH gpt-4.1-mini ----


2025-05-19 15:56:31.325 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-19 15:57:26.852 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-19 15:57:26.854 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-19 15:57:26.855 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-19 15:57:43.834 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-19 15:57:57.689 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-19 15:58:13.415 | INFO     | core.orchestrator:solve_puzzle:58 - Running

---- RUNNING SINGLE MODEL ON DAY 3 WITH gpt-4.1-mini ----


2025-05-19 15:58:38.187 | INFO     | agents.debugging_agent:_run_test:337 - Running code with test case
2025-05-19 15:58:38.191 | INFO     | agents.debugging_agent:_run_code:307 - Running code
2025-05-19 15:58:38.261 | INFO     | agents.debugging_agent:_run_test:345 - Test case is successful 167650499=167650499
2025-05-19 15:58:38.358 | INFO     | core.retreival:init_db:154 - Database initialization complete.
2025-05-19 15:58:38.359 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: preprocess


Baseline Success: True
---- RUNNING AOA SYSTEM ON DAY 3 WITH gpt-4.1-mini ----


2025-05-19 15:58:43.819 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-19 15:58:59.032 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-19 15:58:59.033 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-19 15:58:59.034 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-19 15:59:16.055 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-19 15:59:36.157 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-19 15:59:55.354 | INFO     | core.orchestrator:solve_puzzle:58 - Running

---- RUNNING SINGLE MODEL ON DAY 4 WITH gpt-4.1-mini ----


2025-05-19 16:00:14.790 | INFO     | agents.debugging_agent:_run_test:337 - Running code with test case
2025-05-19 16:00:14.793 | INFO     | agents.debugging_agent:_run_code:307 - Running code
2025-05-19 16:00:14.845 | INFO     | agents.debugging_agent:_run_test:358 - Got: , expected: 2464
2025-05-19 16:00:14.937 | INFO     | core.retreival:init_db:154 - Database initialization complete.
2025-05-19 16:00:14.938 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: preprocess


Baseline Success: False
---- RUNNING AOA SYSTEM ON DAY 4 WITH gpt-4.1-mini ----


2025-05-19 16:00:21.112 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival


IndexError: list index out of range
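
The `IndexError` above stops the comparison loop before `results_df` is built and saved, so that cell writes nothing to disk for the days that had already finished. One possible way to avoid this is to wrap each run so that an exception is recorded as a failed attempt instead of ending the experiment. The sketch below is only an illustration under assumptions: `timed_run` and the extra `error` field are hypothetical names that do not exist in the notebook, the two `_`-prefixed functions are stand-ins for the real runners, and it assumes the runners return a `(success, code)` tuple as they do in the loop.

```python
import time
import traceback


def timed_run(run_fn, *args, **kwargs):
    """Call run_fn, time it, and return a result record even if it raises.

    Assumption: run_fn returns a (success, code) tuple, like the runners
    used in the comparison loop.
    """
    start = time.perf_counter()
    try:
        success, code = run_fn(*args, **kwargs)
        error = None
    except Exception as exc:
        # Any crash counts as an unsolved puzzle; keep the message for later analysis.
        # KeyboardInterrupt is not an Exception subclass, so manual interrupts still stop the run.
        success, code = False, None
        error = f"{type(exc).__name__}: {exc}"
        traceback.print_exc()
    elapsed = time.perf_counter() - start
    return {"success": success, "time_taken": elapsed, "code": code, "error": error}


# Hypothetical stand-ins for the real runners, used only to demonstrate the wrapper.
def _always_solves(description):
    return True, "print(42)"


def _crashes(description):
    return [][0]  # raises IndexError, mimicking the failure above


ok_record = timed_run(_always_solves, "demo puzzle")
failed_record = timed_run(_crashes, "demo puzzle")
print(ok_record["success"], failed_record["success"], failed_record["error"])
```

In the comparison loop, the returned dictionary could be merged with the `day`, `approach`, and `model` fields before being appended to `results`. Whether a crash should count as a failure or be excluded from the success rate is a design decision for the study. Recording the `error` string leaves both options open.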