AgentSuiteLocal v0.8.9

@scottconverse scottconverse released this 06 May 08:33
· 69 commits to main since this release
cadd021

[0.8.9] — 2026-05-06

Fixed

  • QA-DD-001 (Critical) — Trust/Risk agent slug drift fixed. v0.8.8 advertised seven agents in the picker but web/src/data.js used id: "trust" while launcher.py / cli.py used trust_risk. The kernel registry only knows trust_risk, so every Trust/Risk run errored 3 s after launch with Agent 'trust' is not enabled or not registered. Fixed by aligning data.js (id and mock-run reference) and the _SETTINGS_DEFAULTS["enabled_agents"] default in agentsuitelocal/api/config.py to the canonical trust_risk.

  • TEST-CRIT-001 (Critical, Test discipline) — tests/test_execution.py restructured. The file mocked every dependency it claimed to integrate (5 of 5 tests patched _resolve_llm, the agent class, _save_state, telemetry, notifications, _workspace), the same pattern that shipped v0.8.7's missing-ollama-SDK regression. Renamed to tests/test_execution_state_machine.py with a corrected docstring stating what the file actually covers (run-status state machine, dispatch, SSE wiring) and what it does NOT cover (resolver path, agent class). Added a new tests/test_execution_integration.py that uses AGENTSUITE_LLM_PROVIDER_FACTORY to exercise the real resolver path with no patching: a MockLLMProvider from agentsuite.llm.mock, a per-test AGENTSUITE_WORKSPACE tmpdir, and unmocked _save_state / _log_telemetry / _send_notification.

  • DOC-V088-001 (Critical, Documentation) — In-app ManualView.jsx refreshed to v0.8.9. Six stale items the round-1 / round-3 doc fixes had missed: (1) Smoke step described "five quick checks" — actually four since v0.8.8; updated and notes that v0.8.8 added the kernel-inference check. (2) Kernel section claimed "you can't delete from the Kernel through the UI in v0.1" — v0.1-vintage caveat replaced with current behaviour: read-only by design, demote via file system. (3) Troubleshooting note about smoke-test failures referenced a "Phase 2 will surface these errors" roadmap promise that's been closed for months — replaced with a description of the current per-check fix-card UX. (4) "My run disappeared" answer pointed users at ~/.agentsuitelocal/runs.json (replaced by SQLite in v0.8.0); now describes the WAL-mode state.db and notes the legacy file is migrated on first launch. (5) Added a "Manual version: v0.8.9 · matches docs/user-manual.md" stamp at the top so drift is now visible. (6) The recommended-models table — already in sync from QA-DD-002.

  • ENG-088-002 (Critical, Performance/Data) — run["events"] and pipeline["events"] capped at 200 entries. The lists were unbounded and serialized to SQLite on every _save_state(), so disk write size grew linearly with run length. Long pipeline runs amplified the cost noticeably. Chose Option A (cap; drop the dead deque) over Option B (wire deque into SSE replay; drop persistent events) — smaller diff, simpler invariant, no SSE protocol change. Replaced direct ["events"].append(evt) calls in execution.py (run + pipeline emit), routers/runs.py (cancellation), and routers/pipelines.py (rejection) with a single _append_event(container, evt) helper in api.state that FIFO-evicts beyond _MAX_EVENTS_PER_RUN = 200. Removed the dead _run_event_buffers dict and _SSE_BUFFER_SIZE constant from state.py. Lifecycle markers (agent_start, agent_done, approval, error) are <10 events — well within the cap; ~190 stage_progress events of recent history fit alongside.

  • UX-V088-001 (Critical, UX) — Settings save errors no longer silently show "Saved". SettingsView.jsx's save handler did .catch(() => {}); setSaved(true) regardless of fetch outcome, giving users false confirmation when the backend was unreachable or returned non-2xx. Now: optimistic update is rolled back on failure, the topbar shows Couldn't save: <reason> (red), and saved is not set to true. Distinguishes 5xx (detail from response body), 4xx, and network errors. Affects every toggle, the API key save, run timeout, QA gate threshold, model tier — every edit-and-save in Settings.

  • ENG-088-001 (Critical, Security/Correctness) — PDF export now HTML-escapes artifact content. agentsuitelocal/api/routers/runs.py:333-344 was interpolating run_id, file paths, and artifact bodies directly into HTML inside <pre> blocks. LLM-produced artifacts routinely contain <, >, &, or literal </pre> (markdown-with-embedded-HTML, code blocks); without escaping, weasyprint parsed them as live HTML and the PDF rendered incorrectly. With a malicious artifact, injected <style> or <a href="javascript:"> would execute against the rendering context. Extracted the HTML-construction logic into _build_pdf_html(run_id, outputs_dir) and applied html.escape() to every interpolated value.

  • QA-DD-002 (Critical) — Pro-tier model name fixed. _TIER_MODEL_MAP["pro"] was gemma4:26b-moe, which 404s from https://registry.ollama.ai/v2/library/gemma4/manifests/26b-moe. Fresh installs that selected the Pro tier failed to pull. The wrong suffix was the entire bug — bare gemma4:26b (and gemma4:31b, gemma4:latest) all exist on Ollama Hub. Fixed to gemma4:26b (the closest real tag to the original 26B intent; same gemma4 family as light/balanced for consistency). Fanned out to web/src/data.js, docs/user-manual.md, docs/architecture.md, README, both discussion seeds, ManualView.jsx, and ModelView.test.jsx. The gemma4:e2b and gemma4:e4b entries — flagged by the audit as also missing — actually do exist; left unchanged.
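
The ENG-088-001 fix above comes down to escaping every interpolated value before it reaches the HTML handed to weasyprint. A minimal sketch of that idea follows. build_pdf_html is an illustrative stand-in, not the project's actual _build_pdf_html; the one-heading-per-file layout and the errors="replace" handling of binary files are assumptions.

```python
import html
from pathlib import Path


def build_pdf_html(run_id: str, outputs_dir: Path) -> str:
    """Build an HTML document for PDF export, escaping every interpolated value.

    Illustrative stand-in only; the real page layout is an assumption.
    """
    parts = [f"<h1>Run {html.escape(run_id)}</h1>"]
    if outputs_dir.is_dir():
        for path in sorted(p for p in outputs_dir.rglob("*") if p.is_file()):
            rel = path.relative_to(outputs_dir).as_posix()
            # Assumption: undecodable bytes are replaced rather than raising,
            # so a binary artifact cannot abort the export.
            body = path.read_text(encoding="utf-8", errors="replace")
            parts.append(f"<h2>{html.escape(rel)}</h2>")
            # Escaping turns a literal </pre> into text, so an artifact
            # cannot close the block and inject live markup after it.
            parts.append(f"<pre>{html.escape(body)}</pre>")
    return "<html><body>" + "\n".join(parts) + "</body></html>"
```

With this shape, an artifact body containing </pre><script> comes out as &lt;/pre&gt;&lt;script&gt; inside the <pre> block, and a missing outputs directory yields a document with no <pre> blocks at all.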

Changed (CI test-environment alignment)

  • .github/workflows/ci.yml: Playwright job now pulls gemma4:e4b instead of gemma2:2b. The smoke endpoint verifies the configured model is installed locally before running the kernel-inference probe; _SETTINGS_DEFAULTS["model_name"] is gemma4:e4b, so CI must have that model present or the smoke step rejects with "Model not installed" and the installer walk fails on Step 5 (Continue stays disabled). v0.8.8's audit-round-1 added the smoke check; the CI workflow was never updated to match. v0.8.8 Playwright hung at "Install Playwright browsers" so this regression was masked. v0.8.9 was the first run to actually surface it.
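
How the smoke endpoint performs its model-installed check is not spelled out here. One plausible shape, assuming it compares the configured name against the model list returned by Ollama's GET /api/tags endpoint, is sketched below. is_model_installed is a hypothetical helper, and the payload is passed in already parsed so the sketch needs no running Ollama server.

```python
from typing import Any


def _normalise(name: str) -> str:
    # Ollama resolves a bare model name to its ":latest" tag.
    return name if ":" in name else f"{name}:latest"


def is_model_installed(configured: str, tags_payload: dict[str, Any]) -> bool:
    """Return True if `configured` appears in a parsed /api/tags response.

    Hypothetical helper: the payload shape {"models": [{"name": ...}, ...]}
    follows Ollama's documented response; everything else is an assumption.
    """
    installed = {_normalise(m["name"]) for m in tags_payload.get("models", [])}
    return _normalise(configured) in installed


# A runner that pulled only gemma2:2b would fail the check for the default model.
ci_payload = {"models": [{"name": "gemma2:2b"}]}
print(is_model_installed("gemma4:e4b", ci_payload))  # False
```

Under that assumption, the CI failure described above is the False case: the default model_name is gemma4:e4b, but the runner only had gemma2:2b installed.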

Documentation (cross-surface currency sweep)

  • CONTRIBUTING.md reconciled with v0.8.0+ reality: dev port description now points at launcher.port.json (was the legacy launcher.log plaintext file); E2E test instructions reference gemma4:e4b (was gemma2:2b); test-count claim "108+ tests" replaced with "160+" + "see CHANGELOG for an exact figure"; the long-stale "keep main.py the single source of truth" instruction replaced with the actual v0.8.0 router-per-domain layout and the policy that new routes go in the closest existing router; bug-report instructions reference both launcher.log and launcher.port.json correctly.
  • docs/architecture.md: doc-currency stamp bumped from "as of v0.8.8" to "as of v0.8.9"; the hard-coded "Full suite as of v0.8.7: 135 passing" line replaced with a release-by-release approximate table that points at CHANGELOG for exact figures.
  • README "Updated in" hero: the trailing block stopped at v0.8.0–v0.8.2. Added paragraphs for v0.8.0–v0.8.4, v0.8.5–v0.8.7, v0.8.8, and v0.8.9 so a reader can see the full release shape from the top of the README without opening the CHANGELOG.
  • Discussion seeds and Reddit launch post: bumped the current-version line and the download filename (AgentSuiteLocal-0.8.9-setup.exe). The earlier sed pass had missed the filename because it has no v prefix.
  • README known-issues + architecture.md test-tree note: previously claimed "E2E test suite uses gemma2:2b (Gemma 2 family), not a Gemma 4 model" and pointed readers at tests/e2e/conftest.py for the documentation of that choice. Both became stale the moment CI was bumped to gemma4:e4b in this same release; the conftest pointer was also factually wrong (conftest contains zero gemma references). Both surfaces now correctly cite .github/workflows/ci.yml as the source-of-truth and describe the smoke-step model-installed check that motivates the CI choice.
  • docs/user-manual.md per-agent artifact totals: closes the audit's DOC-V088-004 (landing-page agent cards advertised 17–18 artifacts per non-Founder agent while the manual listed 5 named categories — different views of the same agent output). Added a "What you'll get back: ~N artifacts" line to each non-Founder agent matching web/src/data.js exactly: Design 18, Product 17, Engineering 17, Marketing 18, Trust/Risk 17, CIO 17. Reader sees both the named categories AND the total file count, so the numbers can't read as contradictory.

Added

  • tests/test_execution_integration.py: real-path integration coverage for TEST-CRIT-001. Two tests: a resolver smoke-test that catches v0.8.7-class regressions directly (passes — closes the main concern of TEST-CRIT-001), and a full _execute_run end-to-end that exposed a test-fixture limitation rather than a production bug (the substring-router mock provider returns prose for the extract stage; production correctly rejects it as invalid JSON). The full-flow test is xfail-marked with an explicit pointer to the fixture follow-up; the resolver test is the active regression guard.

Watchlist follow-up

  • The xfail on test_execute_run_real_path_against_factory_provider belongs on the next-sprint audit watchlist (W-1: sweep over-mocking). Hardening the mock provider to return canonical JSON for stages that demand it (extract / qa) closes the gap. Recommended approach: switch to a RecordingMockProvider keyed by stage name with explicit JSON shapes, or an agentsuite.testing.fixtures.founder_smoke_provider() factory.
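
A stage-keyed provider of the kind recommended above might look like the following sketch. Everything in it is hypothetical: the complete(stage, prompt) signature, the default prose reply, and the JSON shapes are placeholders, since the real provider interface in agentsuite.llm is not shown in these notes.

```python
import json
from dataclasses import dataclass, field


@dataclass
class RecordingMockProvider:
    """Hypothetical test double: canned JSON per stage, prose otherwise."""

    responses: dict[str, object]
    default_text: str = "mock prose response"
    calls: list[tuple[str, str]] = field(default_factory=list)

    def complete(self, stage: str, prompt: str) -> str:
        # Record every call so a test can assert on stage order and prompts.
        self.calls.append((stage, prompt))
        if stage in self.responses:
            # Stages that demand JSON get a serialised, parseable payload.
            return json.dumps(self.responses[stage])
        return self.default_text


# Placeholder shapes; the real extract / qa schemas are not given here.
provider = RecordingMockProvider(
    responses={
        "extract": {"entities": [], "summary": ""},
        "qa": {"score": 1.0, "issues": []},
    }
)
parsed = json.loads(provider.complete("extract", "any prompt"))
```

Keying on the stage name rather than on prompt substrings means the extract and qa stages always receive parseable JSON, which is the gap the xfail-marked test exposed.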

  • tests/test_event_cap.py: regression test for ENG-088-002. Asserts the helper caps at _MAX_EVENTS_PER_RUN, FIFO-evicts oldest first, initialises a missing events key, works on pipelines, and grep-checks production code for any ["events"].append(...) that would bypass the helper (excluding state.py itself).

  • 2 new SettingsView Vitest cases: regression coverage for UX-V088-001. Mocked PATCH /api/settings returning 500 + JSON detail asserts the error banner appears and Saved does not. Mocked PATCH that rejects (network error) asserts Couldn't save: Failed to fetch. Existing 6 SettingsView tests still pass.

  • tests/test_pdf_export_escape.py: regression test for ENG-088-001's bug class. Six unit tests exercise _build_pdf_html directly (no weasyprint dependency): run-id with embedded <script>, filename with &, artifact body with </pre><script> breakout, mixed </>/&, missing-outputs-dir empty case, and binary file. Each asserts the literal special characters are HTML-escaped, not present as live tags.

  • tests/test_agent_slugs.py: regression test for QA-DD-001's bug class. Asserts the four sources of truth for the enabled-agent set (launcher.py env default, cli.py env default, _SETTINGS_DEFAULTS["enabled_agents"], web/src/data.js AGENTS list) agree on the same seven slugs. Re-introducing slug drift in any one of those four files now fails CI at the lint/test gate before merge.

  • tests/test_tier_models_resolve.py: regression test for QA-DD-002's bug class. Each entry in _TIER_MODEL_MAP plus _SETTINGS_DEFAULTS["model_name"] is HEAD-checked against the Ollama OCI registry manifest endpoint that ollama pull queries internally. A 404 fails the test; a network error skips (so offline runners don't go red). Catches "named-but-nonexistent model" before it ships.
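
The capping behaviour that tests/test_event_cap.py guards (ENG-088-002) can be sketched in a few lines. This is a minimal illustration, not the code in api.state: append_event stands in for the _append_event helper, and the in-place slice deletion is one assumed way to implement FIFO eviction.

```python
from typing import Any

_MAX_EVENTS_PER_RUN = 200  # cap quoted in the release notes


def append_event(container: dict[str, Any], evt: dict[str, Any]) -> None:
    """Append evt to container["events"], evicting the oldest entries past the cap."""
    # Initialise a missing "events" key so runs and pipelines are handled alike.
    events = container.setdefault("events", [])
    events.append(evt)
    overflow = len(events) - _MAX_EVENTS_PER_RUN
    if overflow > 0:
        # Delete in place so other references to the same list see the eviction.
        del events[:overflow]


run: dict[str, Any] = {}
for i in range(250):
    append_event(run, {"type": "stage_progress", "seq": i})
print(len(run["events"]), run["events"][0]["seq"])  # 200 50
```

Because every write goes through one helper, the size of the serialized events list is bounded no matter how long a run lasts, which is the invariant the grep check in the regression test protects.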