Paper: CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization
Data: CO-Bench
- [2025.5.24] Check out FrontierCO, a benchmark of real-world, challenging (and in some cases unsolved) combinatorial optimization problems. See the Evaluation on FrontierCO section below for code to evaluate agents on this benchmark.
- [2025.4.6] Released CO-Bench.
Download the raw data from https://huggingface.co/datasets/CO-Bench/CO-Bench to the local directory data:
huggingface-cli download CO-Bench/CO-Bench --repo-type dataset --local-dir data
or
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='CO-Bench/CO-Bench',
    repo_type='dataset',
    local_dir='data'
)
Agents are implemented in the agents module. Currently supported agents include GreedyRefine, DirectAnswer, BestOfN, FunSearch, AIDE, ChainOfExperts, ReEvo, EoH, and MCTS-AHD.
LLMs are supported via LiteLLM.
Each agent implements the following functions (a minimal custom-agent sketch follows the list):
- step(): Returns the next candidate code for evaluation.
- feedback(): Accepts the evaluation results of the previous candidate code.
- finalize(): Returns the final code.
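For reference, here is a minimal sketch of a custom agent conforming to this interface. It is hypothetical (not part of the agents module): a real step() would query an LLM, while here it returns a fixed stub, and feedback()/finalize() simply track the best-scoring candidate seen so far.

class BestSoFarAgent:
    """Hypothetical agent: proposes code, keeps the best-scoring candidate."""

    def __init__(self, problem_description):
        self.problem_description = problem_description
        self.last_code = None
        self.best_code = None
        self.best_score = float('-inf')

    def step(self):
        # A real agent would prompt an LLM here; we return a fixed stub.
        self.last_code = 'def solve(**kwargs):\n    return {}'
        return self.last_code

    def feedback(self, score, feedback_str):
        # Remember the candidate with the highest dev score so far.
        if score is not None and score > self.best_score:
            self.best_score = score
            self.best_code = self.last_code

    def finalize(self):
        return self.best_code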
Below is code to run the GreedyRefine agent on the Aircraft landing task for 64 iterations.
from agents import GreedyRefine, DirectAnswer, FunSearch, AIDE, ChainOfExperts, ReEvo, BestOfN
from evaluation import Evaluator, get_data

# Load data
data = get_data('Aircraft landing', src_dir='data')

# Define agent, here we use GreedyRefine
agent = GreedyRefine(
    problem_description=data.problem_description,
    timeout=10,
    model='openai/o3-mini',  # We use LiteLLM to call the API
)

# Load evaluator
evaluator = Evaluator(data, timeout=10)

# Run for 64 iterations
for it in range(64):
    code = agent.step()
    if code is None:  # agent decides to terminate
        break
    feedback = evaluator.evaluate(code)  # Run evaluation
    agent.feedback(feedback.dev_score, feedback.dev_feedback)  # Use dev set score as feedback

# Get the final solution
code = agent.finalize()
feedback = evaluator.evaluate(code)
print(feedback.test_feedback)  # Test set score
Evaluation on All Tasks
TASK_LIST = ['Aircraft landing', 'Assignment problem', 'Assortment problem', 'Bin packing - one-dimensional', 'Capacitated warehouse location', 'Common due date scheduling', 'Constrained guillotine cutting', 'Constrained non-guillotine cutting', 'Container loading', 'Container loading with weight restrictions', 'Corporate structuring', 'Crew scheduling', 'Equitable partitioning problem', 'Euclidean Steiner problem', 'Flow shop scheduling', 'Generalised assignment problem', 'Graph colouring', 'Hybrid Reentrant Shop Scheduling', 'Job shop scheduling', 'MIS', 'Multi-Demand Multidimensional Knapsack problem', 'Multidimensional knapsack problem', 'Open shop scheduling', 'Packing unequal circles', 'Packing unequal circles area', 'Packing unequal rectangles and squares', 'Packing unequal rectangles and squares area', 'Resource constrained shortest path', 'Set covering', 'Set partitioning', 'TSP', 'Uncapacitated warehouse location', 'Unconstrained guillotine cutting', 'Vehicle routing: period routing', 'p-median - capacitated', 'p-median - uncapacitated']
for task in TASK_LIST:
    ...  # Run evaluation on one task (expanded in the sketch below)
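Expanded, the per-task body simply reuses the GreedyRefine loop from above. This is a sketch assembled from the calls already shown; the timeouts, model, and iteration budget are the same placeholder values and should be adjusted as needed.

results = {}
for task in TASK_LIST:
    data = get_data(task, src_dir='data')
    agent = GreedyRefine(
        problem_description=data.problem_description,
        timeout=10,
        model='openai/o3-mini',
    )
    evaluator = Evaluator(data, timeout=10)
    for it in range(64):
        code = agent.step()
        if code is None:  # agent decides to terminate
            break
        feedback = evaluator.evaluate(code)
        agent.feedback(feedback.dev_score, feedback.dev_feedback)
    feedback = evaluator.evaluate(agent.finalize())
    results[task] = feedback.test_feedback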
Using Agents on Custom Problems
Step 1: Write a concise problem description and a solve template. For example:
problem_description = '''The Traveling Salesman Problem (TSP) is a classic combinatorial optimization problem where, given a set of cities with known pairwise distances, the objective is to find the shortest possible tour that visits each city exactly once and returns to the starting city. More formally, given a complete graph G = (V, E) with vertices V representing cities and edges E with weights representing distances, we seek to find a Hamiltonian cycle (a closed path visiting each vertex exactly once) of minimum total weight.

Implement in Solve Function

def solve(**kwargs):
    """
    Solve a TSP instance.

    Args:
        - nodes (list): List of (x, y) coordinates representing cities in the TSP problem
          Format: [(x1, y1), (x2, y2), ..., (xn, yn)]

    Returns:
        dict: Solution information with:
            - 'tour' (list): List of node indices representing the solution path
              Format: [0, 3, 1, ...] where numbers are indices into the nodes list
    """
    return {
        'tour': [],
    }
'''
Step 2: Define the agent:
from agents import GreedyRefine

agent = GreedyRefine(
    problem_description=problem_description,
    timeout=10,
    model='openai/o3-mini',
)
Step 3: Define the evaluate function and run the loop. Use evaluate to score each candidate on your data, and let the agent iteratively improve its solution based on the feedback:
evaluate = ...  # Define evaluate() to return a score (float) and feedback (str)

# Run for 64 iterations
for it in range(64):
    code = agent.step()
    dev_score, dev_feedback = evaluate(code)  # Run evaluation on the dev data
    agent.feedback(dev_score, dev_feedback)

# Get the final solution
code = agent.finalize()
print(code)
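For concreteness, below is one possible evaluate() for the TSP template above. It is a sketch, not part of CO-Bench: it executes the candidate code, runs solve() on a few random instances, and returns the negated mean tour length as the score (so higher is better), with an error message as feedback when the candidate fails.

import math
import random

def evaluate(code):
    """Hypothetical dev-set evaluator for the TSP solve() template."""
    namespace = {}
    try:
        exec(code, namespace)  # defines solve() from the candidate code
        lengths = []
        for _ in range(5):  # a handful of random instances as a dev set
            nodes = [(random.random(), random.random()) for _ in range(50)]
            tour = namespace['solve'](nodes=nodes)['tour']
            assert sorted(tour) == list(range(len(nodes))), 'tour is not a permutation'
            lengths.append(sum(
                math.dist(nodes[tour[i]], nodes[tour[(i + 1) % len(tour)]])
                for i in range(len(tour))
            ))
        mean_len = sum(lengths) / len(lengths)
        return -mean_len, f'Mean tour length over {len(lengths)} instances: {mean_len:.4f}'
    except Exception as e:
        return float('-inf'), f'Error: {e}'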
Docker environment for sandboxed agent execution and solution evaluation: coming soon.
Evaluation on FrontierCO
Download the raw data from https://huggingface.co/datasets/CO-Bench/FrontierCO to the local directory data:
huggingface-cli download CO-Bench/FrontierCO --repo-type dataset --local-dir data
or
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='CO-Bench/FrontierCO',
    repo_type='dataset',
    local_dir='data'
)
Note: In FrontierCO, instead of returning a single solution, the solve function must yield increasingly better solutions over time. The solve template looks like this:
def solve(**kwargs):
    """
    Solve a TSP instance.

    Args:
        - nodes (list): List of (x, y) coordinates representing cities in the TSP problem.
          Format: [(x1, y1), (x2, y2), ..., (xn, yn)]

    Yields:
        dict: Solution information with:
            - 'tour' (list): List of node indices representing the solution path.
              Format: [0, 3, 1, ...] where numbers are indices into the nodes list.
    """
    # Your function must yield multiple solutions over time, not just return one.
    # Use Python's yield keyword repeatedly to produce a stream of solutions.
    # Each yielded solution should be better than the previous one.
    # Evaluation is based on the last yielded solution before timeout.
    while True:
        yield {
            'tour': [],
        }
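For intuition, here is a hypothetical solver (not from the benchmark) that follows this protocol: it yields a nearest-neighbour tour immediately, then yields again after every improving 2-opt move, so the last yield before timeout is always the best tour found so far.

import math

def solve(**kwargs):
    nodes = kwargs['nodes']
    n = len(nodes)
    dist = lambda i, j: math.dist(nodes[i], nodes[j])

    # Greedy nearest-neighbour construction: a valid tour, yielded early.
    tour, remaining = [0], set(range(1, n))
    while remaining:
        nxt = min(remaining, key=lambda j: dist(tour[-1], j))
        tour.append(nxt)
        remaining.remove(nxt)
    yield {'tour': list(tour)}

    # 2-opt local search: yield each time a segment reversal shortens the tour.
    improved = True
    while improved:
        improved = False
        for i in range(1, n - 1):
            for j in range(i + 1, n):
                a, b = tour[i - 1], tour[i]
                c, d = tour[j], tour[(j + 1) % n]
                if dist(a, c) + dist(b, d) < dist(a, b) + dist(c, d) - 1e-12:
                    tour[i:j + 1] = reversed(tour[i:j + 1])
                    improved = True
                    yield {'tour': list(tour)}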
Compared with CO-Bench, we use new agent implementations in FrontierCO:
from agents import YieldGreedyRefine, YieldFunSearch, YieldReEvo
# And a new evaluator that fetches the solutions yielded by the solver,
# evaluating only the last solution before timeout:
from evaluation import YieldingEvaluator, get_new_data

# Load data
data = get_new_data(task, src_dir='data', data_dir='data')

# Define agent (example: YieldGreedyRefine)
agent = YieldGreedyRefine(
    problem_description=data.problem_description,
    timeout=300,  # 300s timeout during solver development
    model='openai/o3-mini',  # We use LiteLLM to call the API
)

# Load YieldingEvaluator (300s timeout during solver development)
evaluator = YieldingEvaluator(data, timeout=300)

# Run for 64 iterations
for it in range(64):
    code = agent.step()
    if code is None:  # agent decides to terminate
        break
    feedback = evaluator.evaluate(code)  # Run evaluation
    agent.feedback(feedback.dev_score, feedback.dev_feedback)  # Use dev set score as feedback

# Get the final solution
code = agent.finalize()

# For final evaluation, run the solver for 1 hour
final_evaluator = YieldingEvaluator(data, timeout=60 * 60)
feedback = final_evaluator.evaluate(code)
print(feedback.test_feedback)  # Test set score
Citation
@article{Sun2025COBench,
  title={CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization},
  author={Weiwei Sun and Shengyu Feng and Shanda Li and Yiming Yang},
  journal={ArXiv},
  year={2025},
  volume={abs/2504.04310},
  url={https://arxiv.org/abs/2504.04310},
}

@article{Feng2025FrontierCO,
  title={A Comprehensive Evaluation of Contemporary ML-Based Solvers for Combinatorial Optimization},
  author={Shengyu Feng and Weiwei Sun and Shanda Li and Ameet Talwalkar and Yiming Yang},
  journal={ArXiv},
  year={2025},
  volume={abs/2505.16952},
  url={https://arxiv.org/abs/2505.16952},
}