# MLE-Bench Demo

Two ways to evaluate AI agents:

**1. Competition Runs** – Full Kaggle competitions (build models, make predictions, score against leaderboard)

**2. Technique-Tasks** – Focused ML skill assessments: `imbalance`, `missing`, `encoding`, `cv`, `scaling`, `leakage`

**Competitions:** Edit `experiments/splits/dev.txt` to choose which competitions to run.  
Current: `random-acts-of-pizza`, `spooky-author-identification`, `dogs-vs-cats-redux-kernels-edition`
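
The split file can also be edited from Python. The sketch below assumes the file holds one competition ID per line and that blank lines and `#` comments can be ignored; that format is an assumption, and `read_split` and `write_split` are hypothetical helper names, not part of the SDK. The demo writes to a temporary file rather than the real `experiments/splits/dev.txt`.

```python
from pathlib import Path
import tempfile


def read_split(path):
    """Return competition IDs from a split file.

    Assumption: one ID per line; blank lines and '#' comments are skipped.
    """
    ids = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            ids.append(line)
    return ids


def write_split(path, competition_ids):
    """Write competition IDs to a split file, one per line."""
    Path(path).write_text("\n".join(competition_ids) + "\n")


# Demo on a temporary file so the real split file is left untouched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "dev.txt"
    write_split(demo_path, ["random-acts-of-pizza", "spooky-author-identification"])
    demo_ids = read_split(demo_path)
    print(demo_ids)
```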

---
*First, read `RUN_DEMO.md` for setup instructions.*

## Setup

First, let's import the SDK and check the server is running.

In [22]:
import sys
sys.path.insert(0, '.')

from sdk import Client

client = Client()
print(f"Server: {client.health()}")

Server: {'status': 'ok'}
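
If the server is still starting, the health check may need a few attempts. The helper below is a minimal sketch, not part of the SDK: `wait_until_healthy` is a hypothetical name, and it assumes that `client.health()` either returns a dict like `{'status': 'ok'}` or raises an exception while the server is unreachable. A stand-in function is used here so the example runs without a server.

```python
import time


def wait_until_healthy(health_fn, attempts=5, delay=1.0, sleep=time.sleep):
    """Call health_fn until it reports {'status': 'ok'} or attempts run out.

    health_fn is any zero-argument callable (for example, client.health).
    Returns True on success, False otherwise.
    """
    for attempt in range(1, attempts + 1):
        try:
            if health_fn().get("status") == "ok":
                return True
        except Exception as exc:  # assumption: an unreachable server raises
            print(f"Attempt {attempt}: server not reachable ({exc})")
        if attempt < attempts:
            sleep(delay)
    return False


# Stand-in for client.health: fails once, then succeeds.
_responses = iter([ConnectionError("refused"), {"status": "ok"}])


def fake_health():
    item = next(_responses)
    if isinstance(item, Exception):
        raise item
    return item


ready = wait_until_healthy(fake_health, attempts=3, sleep=lambda s: None)
print(f"Server ready: {ready}")
```

With the real client, the call would look like `wait_until_healthy(client.health)`.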


---
# Option A: Competition Run (Original MLE-Bench)

Run the agent on a full Kaggle competition. The agent will:
1. Read the competition description
2. Explore and analyze the data
3. Build and train models
4. Generate predictions
5. Submit for scoring against the leaderboard

**Uncomment the cell below to run a competition instead of technique-tasks.**

In [None]:
# OPTION A: Competition Run (uncomment to use)
# run_id = client.run(
#     agent_id="aide/dev",
#     competition_set="experiments/splits/dev.txt",
#     lite=True,
#     technique_tasks=False,  # Regular competition run
# )
# print(f"Started competition run: {run_id}")
# final_status = client.wait_for_completion(run_id, poll_interval=10, timeout=600)
# print(f"Final status: {final_status['status']}")

Started run: ac9f1374-2d14-4165-8f51-1cf8d97b3026
Status: running


---
# Option B: Technique-Tasks (ML Skill Assessment)

Run the agent on focused skill assessments. The agent will:
1. Read the technique-specific prompt (e.g., "analyze class imbalance")
2. Use the same competition data
3. Produce a focused analysis
4. Get graded on that specific skill

Available tasks: `imbalance`, `missing`, `encoding`, `cv`, `scaling`, `leakage`

**Alternate model**: To try an older agent preconfigured with more internal search steps, set `agent_id="aide/gpt-4-turbo-dev"` in the `client.run(...)` call. That agent is typically configured to use 8 steps.
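
Task names are plain strings, so it can help to check them before starting a run. This is a sketch only: `validate_tasks` is a hypothetical helper, and how the server treats an unknown task name is not shown in this notebook. The allowed names are the six listed above.

```python
AVAILABLE_TASKS = ("imbalance", "missing", "encoding", "cv", "scaling", "leakage")


def validate_tasks(tasks):
    """Return the task list without duplicates, or raise on unknown names."""
    unknown = [t for t in tasks if t not in AVAILABLE_TASKS]
    if unknown:
        raise ValueError(
            f"Unknown technique-tasks: {unknown}. Choose from {list(AVAILABLE_TASKS)}"
        )
    # dict.fromkeys drops duplicates while keeping the original order.
    return list(dict.fromkeys(tasks))


selected = validate_tasks(["imbalance", "missing", "imbalance"])
print(selected)
```

The validated list can then be passed as `tasks=selected` in the cell below.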

In [23]:
# OPTION B: Technique-Tasks Run (using this for the demo)
run_id = client.run(
    agent_id="aide/dev",
    competition_set="experiments/splits/dev.txt",
    lite=True,
    tasks=["imbalance"],    # Which tasks to run
)
print(f"Started run: {run_id}")

# Wait for completion (polls every 10s)
client.wait_for_completion(run_id, poll_interval=10, timeout=600)

Started run: 48a02dac-c084-489e-84ed-2ed549fe05b9
[Poll 1] Run 48a02dac... status: running (elapsed: 0.0s)
[Poll 2] Run 48a02dac... status: running (elapsed: 10.0s)
[Poll 3] Run 48a02dac... status: running (elapsed: 20.0s)
[Poll 4] Run 48a02dac... status: running (elapsed: 30.0s)
[Poll 5] Run 48a02dac... status: running (elapsed: 40.0s)
[Poll 6] Run 48a02dac... status: running (elapsed: 50.0s)
[Poll 7] Run 48a02dac... status: running (elapsed: 60.1s)
[Poll 8] Run 48a02dac... status: running (elapsed: 70.1s)
[Poll 9] Run 48a02dac... status: running (elapsed: 80.1s)
[Poll 10] Run 48a02dac... status: running (elapsed: 90.1s)
[Poll 11] Run 48a02dac... status: running (elapsed: 100.1s)
[Poll 12] Run 48a02dac... status: running (elapsed: 110.1s)
[Poll 13] Run 48a02dac... status: running (elapsed: 120.1s)
[Poll 14] Run 48a02dac... status: running (elapsed: 130.1s)
[Poll 15] Run 48a02dac... status: running (elapsed: 140.1s)
[Poll 16] Run 48a02dac... status: running (elapsed: 150.1s)
[Poll 17] 

{'run_id': '48a02dac-c084-489e-84ed-2ed549fe05b9',
 'status': 'completed',
 'created_at': 1765498632.0443048,
 'updated_at': 1765499122.214145,
 'request': {'competition_set': 'experiments/splits/dev.txt',
  'agent_id': 'aide/dev',
  'lite': True,
  'n_seeds': 1,
  'n_workers': 1,
  'retain': False,
  'data_dir': None,
  'gitlink': None,
  'notes': None,
  'tasks': ['imbalance']},
 'message': None,
 'run_group': '2025-12-12T00-17-15-UTC_run-group_aide',
 'run_dir': 'runs/2025-12-12T00-17-15-UTC_run-group_aide',
 'logs': [{'cmd': ['/Users/sanketshah/miniforge3/bin/python',
    '-m',
    'mlebench.cli',
    'prepare',
    '--list',
    'experiments/splits/dev.txt',
    '--lite'],
   'returncode': 0,
   'stderr': "ating checksum for `/Users/sanketshah/Library/Caches/mle-bench/data/random-acts-of-pizza/random-acts-of-pizza.zip`...\n[2025-12-11 19:17:15,012] [data.py:92] Checksum for `/Users/sanketshah/Library/Caches/mle-bench/data/random-acts-of-pizza/random-acts-of-pizza.zip` matches the 
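
The `created_at` and `updated_at` fields in the record above look like Unix timestamps in seconds (an assumption; the SDK does not document them here). Under that assumption, the wall-clock duration of the run can be worked out directly from the two values:

```python
from datetime import datetime, timezone

# Values copied from the run record shown above.
created_at = 1765498632.0443048
updated_at = 1765499122.214145

# Assumption: both fields are Unix epoch seconds.
duration_s = updated_at - created_at
started_utc = datetime.fromtimestamp(created_at, tz=timezone.utc)

print(f"Started (UTC): {started_utc.isoformat(timespec='seconds')}")
print(f"Duration: {duration_s:.1f}s (~{duration_s / 60:.1f} min)")
```

The difference is about 490 seconds, which is inside the 600-second `timeout` passed to `wait_for_completion`.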

## Review Results

Check what the agent produced and how it was graded.

In [25]:
run = client.get_run(run_id)
print(f"Run ID: {run['run_id']}")
print(f"Status: {run['status']}")
print(f"Run Group: {run.get('run_group')}")
print(f"Run Dir: {run.get('run_dir')}")

if run.get('logs'):
    print(f"\nLogs ({len(run['logs'])} entries):")
    for log in run['logs'][-3:]:
        print(f"  {log}")

Run ID: 48a02dac-c084-489e-84ed-2ed549fe05b9
Status: completed
Run Group: 2025-12-12T00-17-15-UTC_run-group_aide
Run Dir: runs/2025-12-12T00-17-15-UTC_run-group_aide

Logs (3 entries):
  {'stage': 'Running technique-task: imbalance'}
  {'cmd': ['/Users/sanketshah/miniforge3/bin/python', 'run_agent.py', '--agent-id', 'aide/dev', '--competition-set', '/var/folders/5h/x0v7_l5x6j38zfc0y9sqfcwr0000gn/T/tmp2h29tkkb.txt', '--n-seeds', '1', '--n-workers', '1', '--technique-task', 'imbalance'], 'returncode': 0, 'stdout': '', 'stderr': '[2025-12-11 19:17:16,223] [run_agent.py:134] Launching run group: 2025-12-12T00-17-15-UTC_run-group_aide\n[2025-12-11 19:17:16,225] [run_agent.py:154] Creating 1 workers to serve 3 tasks...\n[2025-12-11 19:17:16,443] [utils.py:154] Container created: competition-random-acts-of-pizza-2025-12-12T00-17-16-UTC-389dace6edfd43138d73895a4f3147d2\n[2025-12-11 19:19:33,159] [utils.py:154] Container created: competition-spaceship-titanic-2025-12-12T00-19-32-UTC-0a7d99be1bf
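
Raw log entries are long, so a compact summary can be easier to scan. The sketch below relies only on the keys visible in the output above (`stage`, `cmd`, `returncode`); `summarize_logs` is a hypothetical helper, and the failing entry in the sample data is invented purely to show the failure path.

```python
def summarize_logs(logs):
    """Return one short line per log entry."""
    lines = []
    for entry in logs:
        if "stage" in entry:
            lines.append(f"stage: {entry['stage']}")
        elif "cmd" in entry:
            rc = entry.get("returncode")
            status = "ok" if rc == 0 else f"FAILED (returncode={rc})"
            # Show only the first few command parts to keep the line short.
            cmd = " ".join(str(part) for part in entry["cmd"][:4])
            lines.append(f"cmd: {cmd} ... -> {status}")
        else:
            lines.append(f"other: {sorted(entry)}")
    return lines


# Sample entries shaped like the ones above; the last one is made up.
sample_logs = [
    {"stage": "Running technique-task: imbalance"},
    {"cmd": ["python", "run_agent.py", "--agent-id", "aide/dev"],
     "returncode": 0, "stdout": "", "stderr": ""},
    {"cmd": ["python", "-m", "mlebench.cli", "prepare"],
     "returncode": 1, "stderr": "illustrative failure"},
]

summary = summarize_logs(sample_logs)
for line in summary:
    print(line)
```

With a real run, pass `run['logs']` instead of `sample_logs`.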

## Browse Rollout Artifacts

View all files generated by the agent during the run.

In [26]:
from pathlib import Path
import json

def show_tree(path, prefix="", max_depth=3, current_depth=0):
    """Display directory tree structure."""
    if current_depth >= max_depth:
        return
    path = Path(path)
    items = sorted(path.iterdir(), key=lambda x: (x.is_file(), x.name))
    for i, item in enumerate(items):
        is_last = i == len(items) - 1
        connector = "└── " if is_last else "├── "
        print(f"{prefix}{connector}{item.name}")
        if item.is_dir():
            extension = "    " if is_last else "│   "
            show_tree(item, prefix + extension, max_depth, current_depth + 1)

if run.get('run_dir'):
    print(f"📂 {run['run_dir']}\n")
    show_tree(run['run_dir'])

📂 runs/2025-12-12T00-17-15-UTC_run-group_aide

├── dogs-vs-cats-redux-kernels-edition_cae97058-faab-4ed4-ae3a-24ef9409c300
│   ├── code
│   ├── logs
│   │   └── entrypoint.log
│   ├── submission
│   │   └── imbalance_analysis.json
│   └── run.log
├── random-acts-of-pizza_8710921f-96a1-46b2-8d60-db480f6672d5
│   ├── code
│   ├── logs
│   │   └── entrypoint.log
│   ├── submission
│   │   └── imbalance_analysis.json
│   └── run.log
├── spaceship-titanic_c129d9ca-4660-42b6-a310-0a36e29bf1cb
│   ├── code
│   ├── logs
│   │   └── entrypoint.log
│   ├── submission
│   │   └── imbalance_analysis.json
│   └── run.log
├── metadata.json
└── technique_grades.json
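
To pull out just the submission files from a layout like the one above, a small helper can walk each competition directory. This is a sketch under the assumption that every competition folder has a `submission/` subfolder, as in this run; `collect_submissions` is a hypothetical name, and the demo builds a throwaway directory instead of reading a real run.

```python
from pathlib import Path
import tempfile


def collect_submissions(run_dir, pattern="*.json"):
    """Map each competition directory name to the files in its submission/ folder."""
    results = {}
    for comp_dir in sorted(Path(run_dir).iterdir()):
        submission_dir = comp_dir / "submission"
        if comp_dir.is_dir() and submission_dir.is_dir():
            results[comp_dir.name] = sorted(p.name for p in submission_dir.glob(pattern))
    return results


# Demo on a temporary directory that mimics the run layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("random-acts-of-pizza_demo", "spaceship-titanic_demo"):
        (root / name / "submission").mkdir(parents=True)
        (root / name / "submission" / "imbalance_analysis.json").write_text("{}")
    (root / "metadata.json").write_text("{}")  # top-level files are skipped
    found = collect_submissions(root)
    print(found)
```

With a real run, call `collect_submissions(run['run_dir'])`.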


## Inspect the Outputs

Let's look at the grading results and run artifacts.

In [None]:
from pathlib import Path
import json

if run.get('run_dir'):
    run_dir = Path(run['run_dir'])
    print(f"Run directory: {run_dir}")
    
    # Check for technique grades (technique-tasks mode)
    grades_file = run_dir / "technique_grades.json"
    if grades_file.exists():
        grades = json.loads(grades_file.read_text())
        print(f"\n📊 Technique Grades:")
        print(json.dumps(grades, indent=2))
    
    # Check for competition grades (regular mode)
    report_file = run_dir / "grading_report.json"
    if report_file.exists():
        report = json.loads(report_file.read_text())
        print(f"\n🏆 Competition Results:")
        print(json.dumps(report, indent=2))
    
    # List contents
    print(f"\n📁 Run Contents:")
    for item in sorted(run_dir.iterdir()):
        print(f"  {item.name}")
else:
    print("No run directory found yet")
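
The cell above checks for two different report files depending on how the run was launched. The same check can be folded into a helper that reports the run mode. This is a sketch: `detect_run_mode` is a hypothetical name, and the file names are the two used in the cell above.

```python
from pathlib import Path
import tempfile


def detect_run_mode(run_dir):
    """Infer the run mode from which grading file is present.

    Returns "technique-tasks", "competition", or None if neither file exists.
    """
    run_dir = Path(run_dir)
    if (run_dir / "technique_grades.json").exists():
        return "technique-tasks"
    if (run_dir / "grading_report.json").exists():
        return "competition"
    return None


# Demo on a temporary directory rather than a real run.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    mode_before = detect_run_mode(root)
    (root / "technique_grades.json").write_text("{}")
    mode_after = detect_run_mode(root)
    print(mode_before, mode_after)
```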

## Agent Decision Trace (AIDE Journal)

AIDE logs its reasoning and code at each step. Useful for debugging agent behavior.

In [None]:
if run.get('run_dir'):
    run_dir = Path(run['run_dir'])
    for comp_dir in run_dir.iterdir():
        if not comp_dir.is_dir():
            continue
        journal_file = comp_dir / "submission" / "journal.json"
        if journal_file.exists():
            print(f"📓 {journal_file}\n")
            journal = json.loads(journal_file.read_text())
            print(f"AIDE took {len(journal)} steps:")
            for i, step in enumerate(journal[:3]):
                print(f"\n--- Step {i+1} ---")
                if 'plan' in step:
                    print(f"Plan: {step['plan'][:200]}...")
                if 'code' in step:
                    print(f"Code: {len(step['code'])} chars")
            break
    else:
        print("No journal.json found")

---
## Next Steps

- **Try different tasks**: Change `tasks=["imbalance"]` to `tasks=["missing", "encoding", "cv"]`
- **Run full competitions**: Comment out the Option B cell and uncomment the Option A cell
- **Use more steps**: Change `agent_id` from `aide/dev` to `aide` for more thorough agent exploration
- **Add more competitions**: Add more competition IDs to `experiments/splits/dev.txt`

See `README.md` and `RUN_DEMO.md` for more details.