# Model comparison



## Setup

In [7]:
import os
import sys
import dotenv
import json
from loguru import logger

# Append the models path in order to import the models
PROJECT_ROOT = os.path.join(os.getcwd(), 'src/')
print(PROJECT_ROOT)
sys.path.append(PROJECT_ROOT)

# Load env variables
dotenv.load_dotenv()

# Set log level INFO
logger.remove()
logger.add(sys.stderr, level="INFO")

/home/twanh/workspace/thesis/thesis-advent-of-agents/src/


4

In [8]:
# Import models from the system
from models.base_model import BaseLanguageModel
from models.gemini_model import GeminiLanguageModel
from models.openai_model import OpenAILanguageModel
from models.deepseek_model import DeepseekLanguageModel

### Load puzzles and input/outputs

In [69]:
# Get the correct paths
test_data_folder = os.path.join(PROJECT_ROOT, '..', 'experiments', 'test_data')
puzzles_folder = os.path.join(test_data_folder, 'puzzles/')
input_output_file = os.path.join(test_data_folder, 'answers2024.json')
puzzle_files = [os.path.join(puzzles_folder, f) for f in os.listdir(puzzles_folder) if os.path.isfile(os.path.join(puzzles_folder, f))]

In [70]:
# Create a data structure that lets us look up answers by day
json_data = {}
with open(input_output_file, 'r') as f:
    json_data = {item['day']: item for item in json.load(f)}

puzzle_data = []
for file_path in puzzle_files:
    # Get the day of the puzzle file
    file_name = os.path.basename(file_path)
    day_str = file_name.split('_')[-1].split('.')[0]
    day = int(day_str)

    if day in json_data:
        with open(file_path, 'r') as f:
            puzzle_description = f.read()

        puzzle_info = {
            "year": json_data[day]['year'],
            "day": day,
            "description": puzzle_description,
            "input": json_data[day]['input'],
            "expected_output": json_data[day]['part1']
        }

        puzzle_data.append(puzzle_info)

# Sort by day
puzzle_data.sort(key=lambda x: x['day'])
print(len(puzzle_data)) # should be 25

25
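The day lookup above depends on how the puzzle files are named: the loop takes whatever sits between the last underscore and the following dot. A minimal sketch of that parsing step in isolation is shown here. The helper name `day_from_filename` and the example file names are hypothetical, since the actual names in `puzzles/` are not shown.

```python
import os


def day_from_filename(file_path: str) -> int:
    """Return the day encoded at the end of a puzzle file name.

    Assumes names of the form '<anything>_<day>.<ext>' (hypothetical naming).
    """
    file_name = os.path.basename(file_path)
    # Text after the last underscore, up to the first dot that follows it
    day_str = file_name.split('_')[-1].split('.')[0]
    # int() accepts zero-padded values such as '07'
    return int(day_str)


print(day_from_filename('puzzles/puzzle_07.txt'))
print(day_from_filename('puzzles/2024_day_25.md'))
```

A file whose name does not end in a number (for example a stray `README.md`) would raise a `ValueError` in `int()`, so it may be worth filtering such files out before the loop.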


### Model Configurations

| Agent          | Configuration 1 (Ultimate Power) | Configuration 2 (Pre-AOC Power) | Configuration 3 (Reasoning Focus) | Configuration 4 (Speed Optimized) | Configuration 5 (P&D Reasoning) | Configuration 6 (Balanced Mix) |
| :------------- | :------------------------------ | :------------------------------ | :-------------------------------- | :------------------------------- | :------------------------------ | :----------------------------- |
| **preprocessing** | Claude 3.5 Sonnet               | Claude 3.5 Sonnet               | o3-mini (Reasoning)               | Claude 3.5 Haiku                  | GPT-4o mini                     | GPT-4o mini                     |
| **retrieval**  | Gemini 2.5 Pro                  | GPT-4.1                        | o4-mini (Reasoning)               | Gemini 2.0 Flash                 | Gemini 2.0 Flash                | Claude 3.5 Sonnet               |
| **planning**   | o3 (Reasoning)                  | o4-mini (Reasoning)            | o4-mini (Reasoning)               | GPT-4o mini                      | o4-mini (Reasoning)             | GPT-4.1                         |
| **coding**     | GPT-4.1                         | GPT-4.1                        | o3-mini (Reasoning)               | Gemini 2.0 Flash                 | GPT-4.1                         | Gemini 2.0 Flash                |
| **debugging**  | Claude 3.7 Sonnet               | o3-mini (Reasoning)            | o4-mini (Reasoning)               | o3-mini (Reasoning)              | o3-mini (Reasoning)             | o4-mini (Reasoning)             |

##### Ultimate Power
Aim: Achieve maximum possible performance by strategically deploying the most powerful models where they matter most.
Strategy: Places o3 (the most powerful reasoning model) in Planning where complex problem decomposition is critical. Uses elite models for each other role, pairing Claude 3.7 Sonnet's strong analytical abilities with debugging, and GPT-4.1's code generation excellence with the coding agent. Balanced across providers to leverage each company's strengths.

##### Pre-AOC Power
Aim: Maximize performance while controlling for potential AOC training data advantage.
Strategy: Deliberately excludes Gemini 2.5 models which may have seen AOC 2024 data. Still achieves high capability with Claude 3.5 Sonnet for preprocessing, GPT-4.1 for retrieval and coding, and reasoning-capable models strategically placed for planning (o4-mini) and debugging (o3-mini).

##### Reasoning Focus
Aim: Evaluate whether reasoning-flagged models throughout the entire pipeline significantly improve performance.
Strategy: Uses only models with explicit reasoning capabilities for all agent roles. Assigns o3-mini to preprocessing and coding, and o4-mini to retrieval, planning, and debugging, so that every stage of the pipeline reasons explicitly. This will reveal whether reasoning throughout is worth the potential cost increase.

##### Speed Optimized
Aim: Maximize speed while maintaining robust performance on complex puzzles.
Strategy: Employs the fastest models in the collection (Haiku, Flash, mini variants) while reserving reasoning capabilities (o3-mini) only for debugging where errors are most critical to resolve quickly. Uses the speed of Gemini 2.0 Flash for coding while ensuring planning is still effective with GPT-4o mini.

##### P&D Reasoning
Aim: Test the hypothesis that reasoning capabilities mainly benefit Planning and Debugging.
Strategy: Concentrates reasoning models only in Planning and Debugging roles, using capable but non-reasoning models elsewhere. If this configuration performs nearly as well as the Reasoning Focus configuration but with lower cost/latency, it indicates that reasoning is most valuable in these specific roles.
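The comparison described here comes down to two numbers per configuration: the fraction of puzzles solved and the mean solve time. A minimal sketch of that aggregation follows, assuming result rows that carry `config_name`, `success`, and `time` keys. The `summarize` helper is hypothetical and the rows are made-up placeholders, not measured results.

```python
from statistics import mean


def summarize(rows: list[dict]) -> dict[str, dict]:
    """Group result rows by config and compute solve rate and mean time."""
    by_config: dict[str, list[dict]] = {}
    for row in rows:
        by_config.setdefault(row["config_name"], []).append(row)

    summary = {}
    for name, group in by_config.items():
        # Failed runs may have no timing, so skip None values
        times = [r["time"] for r in group if r["time"] is not None]
        summary[name] = {
            "solve_rate": sum(bool(r["success"]) for r in group) / len(group),
            "mean_time": mean(times) if times else None,
        }
    return summary


# Placeholder rows for illustration only; NOT real experiment output
rows = [
    {"config_name": "reasoning_focus", "success": True, "time": 200.0},
    {"config_name": "reasoning_focus", "success": True, "time": 300.0},
    {"config_name": "pd_reasoning", "success": True, "time": 100.0},
    {"config_name": "pd_reasoning", "success": False, "time": None},
]
summary = summarize(rows)
for name, stats in summary.items():
    print(name, stats)
```

Cost is not part of this sketch because the result rows record only timing, so a cost comparison would need token usage logged separately.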

##### Balanced Mix
Aim: Create a balanced, cost-effective pipeline using strengths from each provider.
Strategy: Distributes roles across providers to leverage each company's strengths: Google for fast coding, OpenAI for planning and debugging, and Claude for context-handling in retrieval. Avoids using the most expensive models while maintaining good coverage of capabilities across the pipeline.

In [81]:
from main import _get_model

# Configuration 1 -- Ultimate power
config_ultimate_power = {
    "preprocessing": _get_model("claude-3-5-sonnet-20241022"),
    "retrieval": _get_model("gemini-2.5-pro-preview-05-06"),
    "planning": _get_model("o3"),
    "coding": _get_model("gpt-4.1-2025-04-14"),
    "debugging": _get_model("claude-3-7-sonnet@20250219"),
}

# Configuration 2 -- PreAOC power
config_pre_aoc_power = {
    "preprocessing": _get_model("claude-3-5-sonnet-20241022"),
    "retrieval": _get_model("gpt-4.1-2025-04-14"),
    "planning": _get_model("o4-mini-2025-04-16"),
    "coding": _get_model("gpt-4.1-2025-04-14"),
    "debugging": _get_model("o3-mini-2025-01-31"),
}

# Configuration 3 -- Reasoning
config_reasoning_focus = {
    "preprocessing": _get_model("o3-mini-2025-01-31"),
    "retrieval": _get_model("o4-mini-2025-04-16"),
    "planning": _get_model("o4-mini-2025-04-16"),
    "coding": _get_model("o3-mini-2025-01-31"),
    "debugging": _get_model("o4-mini-2025-04-16"),
}

# Configuration 4 -- Speed
config_speed_optimized = {
    "preprocessing": _get_model("claude-3-5-haiku-20241022"),
    "retrieval": _get_model("gemini-2.0-flash"),
    "planning": _get_model("gpt-4o-mini-2024-07-18"),
    "coding": _get_model("gemini-2.0-flash"),
    "debugging": _get_model("o3-mini-2025-01-31"),
}

# Configuration 5 -- Planning and Debugging Reasoning
config_pd_reasoning = {
    "preprocessing": _get_model("gpt-4o-mini-2024-07-18"),
    "retrieval": _get_model("gemini-2.0-flash"),
    "planning": _get_model("o4-mini-2025-04-16"),
    "coding": _get_model("gpt-4.1-2025-04-14"),
    "debugging": _get_model("o3-mini-2025-01-31"),
}

# Configuration 6 -- Balanced
config_balanced_mix = {
    "preprocessing": _get_model("gpt-4o-mini-2024-07-18"),
    "retrieval": _get_model("claude-3-5-sonnet-20241022"),
    "planning": _get_model("gpt-4.1-2025-04-14"),
    "coding": _get_model("gemini-2.0-flash"),
    "debugging": _get_model("o4-mini-2025-04-16"),
}

configurations_to_test = {
#    "ultimate_power": config_ultimate_power,  # Temporarily disabled because it has already been run
    "pre_aoc_power": config_pre_aoc_power,
    "reasoning_focus": config_reasoning_focus,
    "speed_optimized": config_speed_optimized,
    "pd_reasoning": config_pd_reasoning,
    "balanced_mix": config_balanced_mix,
}
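Since each configuration is a hand-written dictionary, a missing or misspelled role would only surface as a `KeyError` once a run is already underway. A small check like the following sketch could catch that up front. `REQUIRED_ROLES` and `missing_roles` are hypothetical names, and the example uses placeholder strings instead of real model objects.

```python
REQUIRED_ROLES = ("preprocessing", "retrieval", "planning", "coding", "debugging")


def missing_roles(config: dict) -> list[str]:
    """Return the agent roles that a configuration does not define."""
    return [role for role in REQUIRED_ROLES if role not in config]


# Stand-in configuration: strings replace the model objects for illustration
example_config = {
    "preprocessing": "model-a",
    "retrieval": "model-b",
    "planning": "model-c",
    "coding": "model-d",
}
print(missing_roles(example_config))
```

In the notebook this could be applied as `assert not missing_roles(cfg), name` for every `name, cfg` pair in `configurations_to_test` before starting the runs.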


## System

In [72]:
from agents.base_agent import BaseAgent
from agents.coding_agent import CodingAgent
from agents.debugging_agent import DebuggingAgent
from agents.planning_agent import PlanningAgent
from agents.pre_processing_agent import PreProcessingAgent
from agents.retreival_agent import RetrievalAgent
from core.orchestrator import Orchestrator
from utils.util_types import AgentSettings
from core.state import MainState
from utils.util_types import Puzzle

In [73]:

def setup_system(config: dict[str, BaseLanguageModel], expected_output: str, puzzle_input: str):

    preprocessing_model = config["preprocessing"]
    retrieval_model = config["retrieval"]
    planning_model = config["planning"]
    coding_model = config["coding"]
    debugging_model = config["debugging"]

    agents = (
        (
            PreProcessingAgent(
                'preprocess', model=preprocessing_model,
            ),
            AgentSettings(enabled=True, can_debug=False),
        ),
        (
            RetrievalAgent(
                'retreival',
                model=retrieval_model,
                connection_string=os.getenv('DB_CONNECTION_STRING') or '',
                openai_key=os.getenv('OPENAI_API_KEY') or '',
                weights=None, # Use default weights
            ),
            AgentSettings(enabled=True, can_debug=False),
        ),
        (
            PlanningAgent(
                'planning',
                model=planning_model,
                n_plans=3, # Keep n_plans fixed for consistent comparison
            ),
            AgentSettings(enabled=True, can_debug=False),
        ),
        (
            CodingAgent('coding', model=coding_model),
            AgentSettings(enabled=True, can_debug=False),
        ),
        (
            DebuggingAgent(
                'debugging',
                model=debugging_model,
                expected_output=expected_output,
                puzzle_input=puzzle_input,
            ),
            AgentSettings(enabled=True, can_debug=True),
        ),
    )

    orchestrator = Orchestrator(agents, {})
    return orchestrator


In [None]:
import time
def run_and_test_system(
    day: int,
    puzzle_desc: str,
    puzzle_input: str,
    expected_output: str,
    config: dict[str, BaseLanguageModel],
) -> dict[str, str | int | float | None]:

    orchestrator = setup_system(
        config,
        expected_output=expected_output,
        puzzle_input=puzzle_input
    )


    puzzle = Puzzle(
        description=puzzle_desc,
        solution=None,
        year=2024,
        day=day,
    )

    state = MainState(puzzle=puzzle)

    try:
        start_time = time.time()
        ret_state = orchestrator.solve_puzzle(state)
        end_time = time.time()
        return {
            'success': ret_state.is_solved,
            'day': day,
            'code': ret_state.final_code,
            'debug_attempts': ret_state.debug_attempts,
            'debug_suggestions': ret_state.debug_suggestions,
            'n_retreived_puzzles': len(ret_state.retreived_puzzles),
            'keywords': ','.join(ret_state.keywords),
            'concepts': ','.join(ret_state.underlying_concepts),
            'time': end_time - start_time
        }

    except Exception as e:
        print(f"Runtime error during puzzle solving for Day {day}: {e}")
        return {
            'success': False,
            'day': day,  # Keep the day so failed runs can still be identified in the CSV
            'code': None,
            'debug_attempts': None,
            'debug_suggestions': None,
            'n_retreived_puzzles': None,
            'keywords': None,
            'concepts': None,
            'time': None,
        }
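One way to keep success and failure rows on an identical schema is to build both from a single helper, so that the resulting CSV always has the same columns. The sketch below assumes the field names used in the function above. `RESULT_FIELDS` and `make_result_row` are hypothetical and not part of the project.

```python
RESULT_FIELDS = (
    'success', 'day', 'code', 'debug_attempts', 'debug_suggestions',
    'n_retreived_puzzles', 'keywords', 'concepts', 'time',
)


def make_result_row(day: int, **values) -> dict:
    """Build a result row with every field present; unset fields are None."""
    unknown = set(values) - set(RESULT_FIELDS)
    if unknown:
        raise KeyError(f"Unknown result fields: {sorted(unknown)}")
    row = {field: None for field in RESULT_FIELDS}
    row['success'] = False  # Default to a failed run
    row['day'] = day
    row.update(values)
    return row


failed = make_result_row(3)
solved = make_result_row(4, success=True, time=12.5)
print(failed.keys() == solved.keys())
```

With this shape, the `except` branch reduces to `return make_result_row(day)`, and a typo in a field name raises immediately instead of silently adding a new column.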

In [75]:
import datetime
import pandas as pd
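SAVE_DIR = os.path.join(PROJECT_ROOT, '../', 'experiments/', 'results/', 'model_comparison/')
# Make sure the output directory exists before any results are written
os.makedirs(SAVE_DIR, exist_ok=True)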

def run_config(config: dict[str, BaseLanguageModel], config_name: str, puzzle_data):

    all_results = []
    for puzzle in puzzle_data:
        # Load puzzle
        puzzle_day = puzzle['day']
        puzzle_description = puzzle['description']
        input_ = puzzle['input']
        expected_output = puzzle['expected_output']
        print(f"Running day {puzzle_day}")

        # Run the tests
        results = run_and_test_system(
            puzzle_day,
            puzzle_description,
            input_,
            expected_output,
            config,
        )

        results['config_name'] = config_name
        all_results.append(results)

    # Save the results
    timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f'comp-{config_name}-{timestamp}.csv'
    filepath = os.path.join(SAVE_DIR, filename)

    results_df = pd.DataFrame(all_results)
    print(f"Saving results for {config_name} to {filepath}")
    results_df.to_csv(filepath, index=False)
    
    return results_df

In [None]:
# Run all configs

results_list = []
for config_name, config in configurations_to_test.items():

    print(f"----- Running config {config_name}")
    print(config)
    results_config = run_config(config, config_name, puzzle_data)
    results_list.append(results_config)

timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f'all_configs-{timestamp}.csv'
filepath = os.path.join(SAVE_DIR, filename)
results = pd.concat(results_list)
results.to_csv(filepath, index=False)

----- Running config pre_aoc_power
{'preprocessing': <models.anthropic_model.AnthropicLanguageModel object at 0x704d852b96d0>, 'retrieval': <models.openai_model.OpenAILanguageModel object at 0x704d852bb1d0>, 'planning': <models.openai_model.OpenAILanguageModel object at 0x704d852dea80>, 'coding': <models.openai_model.OpenAILanguageModel object at 0x704d852d3560>, 'debugging': <models.openai_model.OpenAILanguageModel object at 0x704d852c29f0>}
Running day 1


2025-05-20 20:36:10.090 | INFO     | core.retreival:init_db:154 - Database initialization complete.
2025-05-20 20:36:10.103 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: preprocess
2025-05-20 20:36:20.064 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-20 20:38:06.208 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-20 20:38:06.213 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-20 20:38:06.215 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-20 20:38:21.283 | INFO     | agents.planning_agent:process:152

Running day 2


2025-05-20 20:39:34.152 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-20 20:40:27.039 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-20 20:40:27.042 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-20 20:40:27.043 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-20 20:40:55.090 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-20 20:41:20.383 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-20 20:41:47.220 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Running day 3


2025-05-20 20:42:06.423 | ERROR    | agents.pre_processing_agent:process:76 - Error parsing JSON: Invalid control character at: line 4 column 121 (char 320)
2025-05-20 20:42:14.554 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-20 20:42:36.210 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-20 20:42:36.211 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-20 20:42:36.212 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-20 20:43:32.787 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-20 20:44:14.772 | INFO

Running day 4


2025-05-20 20:45:24.740 | ERROR    | agents.pre_processing_agent:process:76 - Error parsing JSON: Invalid control character at: line 37 column 100 (char 1548)
2025-05-20 20:45:34.031 | ERROR    | agents.pre_processing_agent:process:76 - Error parsing JSON: Invalid control character at: line 32 column 98 (char 1433)
2025-05-20 20:45:44.165 | ERROR    | agents.pre_processing_agent:process:76 - Error parsing JSON: Invalid control character at: line 38 column 93 (char 1439)
2025-05-20 20:45:53.488 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-20 20:46:52.889 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-20 20:46:52.894 | I

Running day 5


2025-05-20 20:48:32.483 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-20 20:49:42.152 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-20 20:49:42.157 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-20 20:49:42.160 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-20 20:50:00.168 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-20 20:50:19.423 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-20 20:50:42.571 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Running day 6


2025-05-20 20:51:19.738 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-20 20:52:14.546 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-20 20:52:14.550 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-20 20:52:14.553 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-20 20:52:31.539 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-20 20:53:00.499 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-20 20:53:25.597 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Running day 7


2025-05-20 20:53:52.826 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-20 20:54:49.767 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-20 20:54:49.779 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-20 20:54:49.781 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-20 20:55:09.672 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-20 20:55:35.027 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-20 20:56:05.131 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Running day 8


2025-05-20 20:56:25.403 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-20 20:57:07.925 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-20 20:57:07.939 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-20 20:57:07.941 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-20 20:57:55.114 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-20 20:58:47.290 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-20 20:59:31.886 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Running day 9


2025-05-20 21:00:05.055 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-20 21:00:07.668 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-20 21:00:07.670 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-20 21:00:07.671 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-20 21:00:38.950 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-20 21:01:01.672 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-20 21:01:51.558 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Running day 10


2025-05-20 21:04:42.611 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-20 21:05:18.853 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-20 21:05:18.856 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-20 21:05:18.858 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-20 21:05:44.546 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-20 21:06:16.873 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-20 21:06:50.509 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Running day 11


2025-05-20 21:07:35.078 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-20 21:08:35.489 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-20 21:08:35.490 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-20 21:08:35.492 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-20 21:08:56.414 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-20 21:09:29.987 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-20 21:09:52.788 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Running day 12


2025-05-20 21:11:26.228 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-20 21:12:35.813 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-20 21:12:35.818 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-20 21:12:35.821 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-20 21:13:00.227 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-20 21:13:35.264 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-20 21:14:02.146 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Running day 13


2025-05-20 21:14:34.111 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-20 21:15:13.666 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-20 21:15:13.669 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-20 21:15:13.671 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-20 21:15:42.845 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-20 21:16:09.160 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-20 21:16:33.438 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Running day 14


2025-05-20 21:16:57.291 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-20 21:18:29.557 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-20 21:18:29.559 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-20 21:18:29.561 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-20 21:19:13.388 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-20 21:19:39.884 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-20 21:20:04.513 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Running day 15


2025-05-20 21:20:30.582 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-20 21:21:11.591 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-20 21:21:11.593 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-20 21:21:11.594 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-20 21:21:54.698 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-20 21:22:24.019 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-20 21:22:47.196 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Running day 16


2025-05-20 21:30:03.026 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-20 21:30:42.337 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-20 21:30:42.339 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-20 21:30:42.341 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-20 21:31:13.976 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-20 21:31:46.457 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-20 21:32:04.567 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Running day 17


2025-05-20 21:32:52.795 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-20 21:33:34.482 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-20 21:33:34.484 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-20 21:33:34.486 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-20 21:34:05.602 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-20 21:34:31.713 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-20 21:35:09.301 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Running day 18


2025-05-20 21:36:06.745 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-20 21:36:43.417 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-20 21:36:43.422 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-20 21:36:43.425 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-20 21:37:02.222 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-20 21:37:26.104 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-20 21:38:00.412 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Running day 19


2025-05-20 21:38:28.980 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-20 21:39:09.063 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-20 21:39:09.065 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-20 21:39:09.065 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-20 21:39:30.323 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-20 21:39:57.350 | INFO     | agents.planning_agent:process:152 - Creating plan 3/3
2025-05-20 21:40:23.673 | INFO     | core.orchestrator:solve_puzzle:58 - Running

Running day 20


2025-05-20 21:40:51.403 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: retreival
2025-05-20 21:41:46.313 | INFO     | core.orchestrator:solve_puzzle:58 - Running agent: planning
2025-05-20 21:41:46.318 | INFO     | agents.planning_agent:process:149 - Generating 3 plans
2025-05-20 21:41:46.320 | INFO     | agents.planning_agent:process:152 - Creating plan 1/3
2025-05-20 21:42:19.893 | INFO     | agents.planning_agent:process:152 - Creating plan 2/3
2025-05-20 21:42:46.124 | ERROR    | models.openai_model:prompt:43 - Error while prompting OpenAILanguageModel(model_name='o4-mini-2025-04-16'): Error code: 429 - {'error': {'message': 'You exceeded your current q