test(agent): formalize integration test ladder with fixture-driven architecture #397

## Summary

The current integration tests (M4-M7) work but have grown organically. We need a formal, progressive integration test architecture modeled on dart_monty's **integration ladder** pattern: fixture-driven, tiered, and extensible for scripting (0008) and future features.

## Current State

- **29 integration tests** across 2 files, all passing
- `run_orchestrator_integration_test.dart` — 9 tests (M4/M5/M6)
- `m7_room_integration_test.dart` — 20 tests (19 groups)
- Helpers duplicated between files
- No fixture files — all test data inline
- Server gets overwhelmed under heavy concurrent load (thread deletion warnings)
- Test 10 has a latent unicode dash bug (en-dash vs ASCII hyphen)

## Proposal: Integration Ladder

Adopt dart_monty's tier-based fixture pattern with progressive complexity.

### Tier Structure

| Tier | Layer | Focus | Example Tests |
|------|-------|-------|---------------|
| tier_01_lifecycle | L0 | Basic room lifecycle | Idle -> Running -> Completed, error rooms, 404 |
| tier_02_tools | L1 | Tool yielding and resume | Single tool, multi-tool, tool failure recovery |
| tier_03_conversation | L0+ | Multi-turn history | Accumulation, context depth, thread reuse |
| tier_04_runtime | L2 | AgentRuntime patterns | spawn, waitAll, waitAny, cancel, introspection |
| tier_05_pipelines | L2+ | Multi-agent orchestration | Fan-out/fan-in, write-review-revise, cascading |
| tier_06_advanced | L2++ | Complex compositions | Debate, consensus, MapReduce, speculative exec |
| tier_07_scripting | L3 | Monty bridge integration | HostFunctionWiring, MontyToolExecutor, script rooms |

### Fixture Format

JSON fixtures per tier, matching dart_monty's pattern:

```json
{
  "id": 1,
  "tier": 1,
  "name": "echo room basic lifecycle",
  "room": "echo",
  "prompt": "Say hello",
  "expectedState": "CompletedState",
  "responseContains": null,
  "tools": null,
  "toolResponses": null,
  "turns": 1,
  "concurrency": 1,
  "xfail": null
}
```
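As a rough sketch of how a runner might load these fixtures, the model below mirrors the fields of the example above. The class name `LadderFixture` and the function `parseFixtures` are hypothetical, and several points are assumptions rather than settled design: that each tier file is a JSON array of fixture objects, that `responseContains` and `xfail` are nullable strings, and that `turns` and `concurrency` default to 1 when absent. `tools` and `toolResponses` are left out because their shape is not specified yet.

```dart
import 'dart:convert';

/// Hypothetical model for one ladder fixture; field names mirror the JSON.
class LadderFixture {
  const LadderFixture({
    required this.id,
    required this.tier,
    required this.name,
    required this.room,
    required this.prompt,
    required this.expectedState,
    this.responseContains,
    this.turns = 1,
    this.concurrency = 1,
    this.xfail,
  });

  /// Builds a fixture from one decoded JSON object.
  factory LadderFixture.fromJson(Map<String, dynamic> json) {
    return LadderFixture(
      id: json['id'] as int,
      tier: json['tier'] as int,
      name: json['name'] as String,
      room: json['room'] as String,
      prompt: json['prompt'] as String,
      expectedState: json['expectedState'] as String,
      // Assumption: a single substring to look for, or null to skip the check.
      responseContains: json['responseContains'] as String?,
      // Assumption: missing or null counts fall back to 1.
      turns: json['turns'] as int? ?? 1,
      concurrency: json['concurrency'] as int? ?? 1,
      // Assumption: a human-readable reason string, or null if expected to pass.
      xfail: json['xfail'] as String?,
    );
  }

  final int id;
  final int tier;
  final String name;
  final String room;
  final String prompt;
  final String expectedState;
  final String? responseContains;
  final int turns;
  final int concurrency;
  final String? xfail;
}

/// Parses the contents of a tier file, assumed to be a JSON array of fixtures.
List<LadderFixture> parseFixtures(String source) {
  final decoded = jsonDecode(source) as List<dynamic>;
  return [
    for (final entry in decoded)
      LadderFixture.fromJson(entry as Map<String, dynamic>),
  ];
}
```

Failing fast on a missing required key (the casts throw) seems preferable to silently skipping a malformed fixture, but that is a design choice open to discussion.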

### Shared Test Infrastructure

Extract duplicated helpers into a shared module:

```
packages/soliplex_agent/test/
  integration/
    fixtures/
      tier_01_lifecycle.json
      tier_02_tools.json
      tier_03_conversation.json
      tier_04_runtime.json
      tier_05_pipelines.json
      tier_06_advanced.json
      tier_07_scripting.json
    helpers/
      integration_harness.dart    # Shared setup/teardown, HTTP clients
      state_waiters.dart          # waitForTerminalState, waitForYieldOrTerminal (public versions of the current private helpers)
      fixture_runner.dart         # registerLadderTests() equivalent
      assertions.dart             # Response matchers (unicode-safe)
    tier_01_lifecycle_test.dart
    tier_02_tools_test.dart
    tier_03_conversation_test.dart
    tier_04_runtime_test.dart
    tier_05_pipelines_test.dart
    tier_06_advanced_test.dart
    tier_07_scripting_test.dart
```

### Key Improvements

1. **Fixture-driven tests** — JSON fixtures enable:
   - Easy addition of new test cases without code changes
   - Parity comparison across environments (native vs WASM)
   - xfail markers for known issues (like WASM concurrency limits)

2. **Shared harness** — Single IntegrationHarness class:
   - Creates/disposes HTTP clients, API, AgUiClient
   - Manages backend health checks before test runs
   - Handles thread cleanup with bounded retries (not infinite)

3. **Unicode-safe assertions** — Normalize dashes/quotes before string comparison

4. **Sequential tier execution** — Run tiers in order to avoid server overload:
   - Tiers 1-3: sequential (single-session tests)
   - Tiers 4-6: sequential between tiers, parallel within a tier where appropriate
   - Server health check between tiers

5. **Scripting tier (tier_07)** — New tests for 0008-soliplex-scripting:
   - HostFunctionWiring binds correctly
   - MontyToolExecutor dispatches tool calls through bridge
   - ScriptingToolRegistryResolver resolves tools from script context
   - Script room runs Python code and returns results
   - Script room with external function calls (yield/resume through Monty bridge)
   - Error propagation from Python runtime
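To make the unicode-safe assertion idea (item 3) concrete, here is one possible shape for the normalization step. The function names `normalizePunctuation` and `containsNormalized` are hypothetical, and the set of characters covered is an assumption: it handles the common dash and curly-quote variants, not every Unicode look-alike.

```dart
/// Maps common Unicode dash and quote variants to their ASCII equivalents.
String normalizePunctuation(String input) {
  return input
      // U+2010..U+2015 (hyphens, en dash, em dash, bar) and U+2212 (minus).
      .replaceAll(RegExp('[\u2010-\u2015\u2212]'), '-')
      // Curly and low single quotes.
      .replaceAll(RegExp('[\u2018\u2019\u201A\u201B]'), "'")
      // Curly and low double quotes.
      .replaceAll(RegExp('[\u201C\u201D\u201E\u201F]'), '"');
}

/// Substring check that normalizes both sides before comparing.
bool containsNormalized(String actual, String expected) {
  return normalizePunctuation(actual).contains(normalizePunctuation(expected));
}
```

A matcher in `assertions.dart` could wrap `containsNormalized`, so that a model response containing an en-dash still matches an expectation written with an ASCII hyphen.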

## Known Issues to Address

- [ ] Thread deletion retry loop is unbounded — cap at 3 retries with backoff
- [ ] Test 10 unicode dash comparison (en-dash vs ASCII hyphen)
- [ ] Server instability under heavy concurrent load (tests 12+ in sequence)
- [ ] Helpers duplicated between two test files
- [ ] No dart_test.yaml configuration for integration test timeouts
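For the unbounded retry loop, a bounded version could look roughly like the sketch below. The name `retryWithBackoff`, the 200 ms starting delay, and the doubling factor are assumptions for illustration; only the cap of 3 retries comes from the checklist above.

```dart
import 'dart:async';

/// Runs [action], retrying on failure at most [maxRetries] times and doubling
/// the delay before each retry. Rethrows the last error once retries run out.
Future<T> retryWithBackoff<T>(
  Future<T> Function() action, {
  int maxRetries = 3,
  Duration initialDelay = const Duration(milliseconds: 200),
}) async {
  var delay = initialDelay;
  var retries = 0;
  while (true) {
    try {
      return await action();
    } on Exception {
      // Give up after the configured number of retries.
      if (retries >= maxRetries) rethrow;
    }
    retries++;
    await Future<void>.delayed(delay);
    delay *= 2;
  }
}
```

Thread cleanup in the harness could then call something like `retryWithBackoff(() => deleteThread(id))`, where `deleteThread` stands in for whatever delete call the harness actually uses.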

## Migration Path

1. Extract shared helpers from existing test files
2. Create fixture JSON files from existing inline test data
3. Build fixture_runner.dart (registerLadderTests equivalent)
4. Migrate existing 29 tests to tier files
5. Add tier_07 scripting tests
6. Verify all 29+ tests still pass
7. Remove old test files

## Related

- dart_monty ladder: `packages/dart_monty_platform_interface/lib/src/testing/ladder_runner.dart`
- soliplex_interpreter_monty room fixtures: `packages/soliplex_interpreter_monty/test/integration/room_fixture.dart`
- #393 — auto-history refactoring
- #394 — WASM incompatibility
