AgentSuiteLocal v0.8.9
[0.8.9] — 2026-05-06
Fixed
- QA-DD-001 (Critical): Trust/Risk agent slug drift fixed. v0.8.8 advertised seven agents in the picker, but `web/src/data.js` used `id: "trust"` while `launcher.py` / `cli.py` used `trust_risk`. The kernel registry only knows `trust_risk`, so every Trust/Risk run errored 3 s after launch with `Agent 'trust' is not enabled or not registered`. Fixed by aligning `data.js` (id and mock-run reference) and the `_SETTINGS_DEFAULTS["enabled_agents"]` default in `agentsuitelocal/api/config.py` to the canonical `trust_risk`.
- TEST-CRIT-001 (Critical, Test discipline): `tests/test_execution.py` restructured. The file mocked every dependency it claimed to integrate (5 of 5 tests patched `_resolve_llm`, the agent class, `_save_state`, telemetry, notifications, `_workspace`), the same pattern that shipped v0.8.7's missing-ollama-SDK regression. Renamed to `tests/test_execution_state_machine.py` with a corrected docstring stating what the file actually covers (run-status state machine, dispatch, SSE wiring) and what it does NOT cover (resolver path, agent class). Added a new `tests/test_execution_integration.py` that uses `AGENTSUITE_LLM_PROVIDER_FACTORY` to exercise the real resolver path with no patching: a `MockLLMProvider` from `agentsuite.llm.mock`, a per-test `AGENTSUITE_WORKSPACE` tmpdir, and unmocked `_save_state` / `_log_telemetry` / `_send_notification`.
- DOC-V088-001 (Critical, Documentation): In-app `ManualView.jsx` refreshed to v0.8.9. Six stale items the round-1 / round-3 doc fixes had missed: (1) Smoke step described "five quick checks" (actually four since v0.8.8); updated, with a note that v0.8.8 added the kernel-inference check. (2) Kernel section claimed "you can't delete from the Kernel through the UI in v0.1"; the v0.1-vintage caveat is replaced with current behaviour: read-only by design, demote via file system. (3) Troubleshooting note about smoke-test failures referenced a "Phase 2 will surface these errors" roadmap promise that has been closed for months; replaced with a description of the current per-check fix-card UX. (4) "My run disappeared" answer pointed users at `~/.agentsuitelocal/runs.json` (replaced by SQLite in v0.8.0); now describes the WAL-mode `state.db` and notes the legacy file is migrated on first launch. (5) Added a "Manual version: v0.8.9 · matches docs/user-manual.md" stamp at the top so drift is now visible. (6) The recommended-models table was already in sync from QA-DD-002.
- ENG-088-002 (Critical, Performance/Data): `run["events"]` and `pipeline["events"]` capped at 200 entries. The lists were unbounded and serialized to SQLite on every `_save_state()`, so disk write size grew linearly with run length. Long pipeline runs amplified the cost noticeably. Chose Option A (cap; drop the dead deque) over Option B (wire deque into SSE replay; drop persistent events): smaller diff, simpler invariant, no SSE protocol change. Replaced direct `["events"].append(evt)` calls in `execution.py` (run + pipeline emit), `routers/runs.py` (cancellation), and `routers/pipelines.py` (rejection) with a single `_append_event(container, evt)` helper in `api.state` that FIFO-evicts beyond `_MAX_EVENTS_PER_RUN = 200`. Removed the dead `_run_event_buffers` dict and `_SSE_BUFFER_SIZE` constant from `state.py`. Lifecycle markers (`agent_start`, `agent_done`, `approval`, `error`) are <10 events, well within the cap; ~190 `stage_progress` events of recent history fit alongside.
- UX-V088-001 (Critical, UX): Settings save errors no longer silently show "Saved". `SettingsView.jsx`'s save handler did `.catch(() => {}); setSaved(true)` regardless of fetch outcome, giving users false confirmation when the backend was unreachable or returned non-2xx. Now: optimistic update is rolled back on failure, the topbar shows `Couldn't save: <reason>` (red), and `saved` is not set to true. Distinguishes 5xx (`detail` from response body), 4xx, and network errors. Affects every toggle, the API key save, run timeout, QA gate threshold, and model tier: every edit-and-save in Settings.
- ENG-088-001 (Critical, Security/Correctness): PDF export now HTML-escapes artifact content. `agentsuitelocal/api/routers/runs.py:333-344` was interpolating `run_id`, file paths, and artifact bodies directly into HTML inside `<pre>` blocks. LLM-produced artifacts routinely contain `<`, `>`, `&`, or literal `</pre>` (markdown-with-embedded-HTML, code blocks); without escaping, weasyprint parsed them as live HTML and the PDF rendered incorrectly. With a malicious artifact, injected `<style>` or `<a href="javascript:">` would execute against the rendering context. Extracted the HTML-construction logic into `_build_pdf_html(run_id, outputs_dir)` and applied `html.escape()` to every interpolated value.
- QA-DD-002 (Critical): Pro-tier model name fixed. `_TIER_MODEL_MAP["pro"]` was `gemma4:26b-moe`, which 404s from `https://registry.ollama.ai/v2/library/gemma4/manifests/26b-moe`. Fresh installs that selected the Pro tier failed to pull. The wrong suffix was the entire bug: bare `gemma4:26b` (and `gemma4:31b`, `gemma4:latest`) all exist on Ollama Hub. Fixed to `gemma4:26b` (the closest real tag to the original 26B intent; same gemma4 family as light/balanced for consistency). Fanned out to `web/src/data.js`, `docs/user-manual.md`, `docs/architecture.md`, README, both discussion seeds, `ManualView.jsx`, and `ModelView.test.jsx`. The `gemma4:e2b` and `gemma4:e4b` entries, flagged by the audit as also missing, actually do exist; left unchanged.
Changed (CI test-environment alignment)
- `.github/workflows/ci.yml`: Playwright job now pulls `gemma4:e4b` instead of `gemma2:2b`. The smoke endpoint verifies the configured model is installed locally before running the kernel-inference probe; `_SETTINGS_DEFAULTS["model_name"]` is `gemma4:e4b`, so CI must have that model present or the smoke step rejects with "Model not installed" and the installer walk fails on Step 5 (Continue stays disabled). v0.8.8's audit-round-1 added the smoke check; the CI workflow was never updated to match. The v0.8.8 Playwright job hung at "Install Playwright browsers", so this regression was masked; v0.8.9 was the first run to actually surface it.
Documentation (cross-surface currency sweep)
- `CONTRIBUTING.md` reconciled with v0.8.0+ reality: dev port description now points at `launcher.port.json` (was the legacy `launcher.log` plaintext file); E2E test instructions reference `gemma4:e4b` (was `gemma2:2b`); test-count claim "108+ tests" replaced with "160+" + "see CHANGELOG for an exact figure"; the long-stale "keep `main.py` the single source of truth" instruction replaced with the actual v0.8.0 router-per-domain layout and the policy that new routes go in the closest existing router; bug-report instructions reference both `launcher.log` and `launcher.port.json` correctly.
- `docs/architecture.md`: doc-currency stamp bumped from "as of v0.8.8" to "as of v0.8.9"; the hard-coded "Full suite as of v0.8.7: 135 passing" line replaced with a release-by-release approximate table that points at CHANGELOG for exact figures.
- README "Updated in" hero: the trailing block stopped at v0.8.0–v0.8.2. Added paragraphs for v0.8.0–v0.8.4, v0.8.5–v0.8.7, v0.8.8, and v0.8.9 so a reader can see the full release shape from the top of the README without opening the CHANGELOG.
- Discussion seeds and reddit launch post: bumped current-version line and the download filename (`AgentSuiteLocal-0.8.9-setup.exe`); sed had missed the latter because the filename has no `v` prefix.
- README known-issues + architecture.md test-tree note: previously claimed "E2E test suite uses `gemma2:2b` (Gemma 2 family), not a Gemma 4 model" and pointed readers at `tests/e2e/conftest.py` for the documentation of that choice. Both became stale the moment CI was bumped to `gemma4:e4b` in this same release; the conftest pointer was also factually wrong (conftest contains zero gemma references). Both surfaces now correctly cite `.github/workflows/ci.yml` as the source-of-truth and describe the smoke-step model-installed check that motivates the CI choice.
- `docs/user-manual.md` per-agent artifact totals: closes the audit's DOC-V088-004 (landing-page agent cards advertised 17–18 artifacts per non-Founder agent while the manual listed 5 named categories; these are different views of the same agent output). Added a "What you'll get back: ~N artifacts" line to each non-Founder agent matching `web/src/data.js` exactly: Design 18, Product 17, Engineering 17, Marketing 18, Trust/Risk 17, CIO 17. Reader sees both the named categories AND the total file count, so the numbers can't read as contradictory.
Added
- `tests/test_execution_integration.py`: real-path integration coverage for TEST-CRIT-001. Two tests: a resolver smoke-test that catches v0.8.7-class regressions directly (it passes, which closes the main concern of TEST-CRIT-001), and a full `_execute_run` end-to-end that exposed a test-fixture limitation rather than a production bug (the substring-router mock provider returns prose for the extract stage; production correctly rejects it as invalid JSON). The full-flow test is `xfail`-marked with an explicit pointer to the fixture follow-up; the resolver test is the active regression guard.
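The fixture limitation can be illustrated with a small, self-contained sketch: a test double that returns prose fails JSON validation, while one keyed by stage name returns well-formed JSON for the stages that demand it. Everything here is hypothetical (the `StageKeyedMockProvider` class, its `complete(stage, prompt)` method, the `parse_stage_output` stand-in for production validation, and the JSON shapes); the real `MockLLMProvider` interface is not shown in this changelog and may differ.

```python
import json


class StageKeyedMockProvider:
    """Hypothetical test double: returns a canned response per stage name."""

    def __init__(self, responses: dict[str, str], default: str = "Some prose.") -> None:
        self._responses = responses
        self._default = default
        self.calls: list[str] = []  # records which stages were requested

    def complete(self, stage: str, prompt: str) -> str:
        self.calls.append(stage)
        # Fall back to prose for any stage without an explicit canned response.
        return self._responses.get(stage, self._default)


def parse_stage_output(raw: str) -> dict:
    """Stand-in for production validation: accept only a JSON object."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON from provider: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    return parsed


def stage_is_valid(provider: StageKeyedMockProvider, stage: str) -> bool:
    """Return True if the provider's output for this stage passes validation."""
    try:
        parse_stage_output(provider.complete(stage, prompt="ignored"))
    except ValueError:
        return False
    return True


# A provider with no canned JSON behaves like the prose-returning fixture.
prose_only = StageKeyedMockProvider(responses={})

# A hardened provider supplies JSON for the stages that require it (made-up shapes).
hardened = StageKeyedMockProvider(
    responses={
        "extract": json.dumps({"facts": ["example fact"]}),
        "qa": json.dumps({"score": 0.9, "passed": True}),
    }
)
```

With a double along these lines, `stage_is_valid(prose_only, "extract")` is false while the hardened provider passes for both `extract` and `qa`, which is the gap the `xfail` marker currently documents.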
Watchlist follow-up
- The `xfail` on `test_execute_run_real_path_against_factory_provider` belongs on the next-sprint audit watchlist (W-1: sweep over-mocking). Hardening the mock provider to return canonical JSON for stages that demand it (extract / qa) closes the gap. Recommended approach: switch to a `RecordingMockProvider` keyed by stage name with explicit JSON shapes, or an `agentsuite.testing.fixtures.founder_smoke_provider()` factory.
- `tests/test_event_cap.py`: regression test for ENG-088-002. Asserts the helper caps at `_MAX_EVENTS_PER_RUN`, FIFO-evicts oldest first, initialises a missing `events` key, works on pipelines, and grep-checks production code for any `["events"].append(...)` that would bypass the helper (excluding `state.py` itself).
- 2 new SettingsView Vitest cases: regression coverage for UX-V088-001. A mocked PATCH `/api/settings` returning 500 + JSON detail asserts the error banner appears and `Saved` does not. A mocked PATCH that rejects (network error) asserts `Couldn't save: Failed to fetch`. Existing 6 SettingsView tests still pass.
- `tests/test_pdf_export_escape.py`: regression test for ENG-088-001's bug class. Six unit tests exercise `_build_pdf_html` directly (no weasyprint dependency): run-id with embedded `<script>`, filename with `&`, artifact body with `</pre><script>` breakout, mixed `<` / `>` / `&`, missing-outputs-dir empty case, and binary file. Each asserts the literal special characters are HTML-escaped, not present as live tags.
- `tests/test_agent_slugs.py`: regression test for QA-DD-001's bug class. Asserts the four sources of truth for the enabled-agent set (`launcher.py` env default, `cli.py` env default, `_SETTINGS_DEFAULTS["enabled_agents"]`, `web/src/data.js` AGENTS list) agree on the same seven slugs. Re-introducing slug drift in any one of those four files now fails CI at the lint/test gate before merge.
- `tests/test_tier_models_resolve.py`: regression test for QA-DD-002's bug class. Each entry in `_TIER_MODEL_MAP` plus `_SETTINGS_DEFAULTS["model_name"]` is HEAD-checked against the Ollama OCI registry manifest endpoint that `ollama pull` queries internally. A 404 fails the test; a network error skips (so offline runners don't go red). Catches "named-but-nonexistent model" before it ships.
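As an illustration of the escaping that ENG-088-001 and `tests/test_pdf_export_escape.py` describe, the sketch below builds the export HTML with `html.escape()` applied to every interpolated value. Only the function name and its `(run_id, outputs_dir)` signature come from this changelog; the page layout, file traversal, and decoding policy are assumptions, not the code in `routers/runs.py`.

```python
import html
import tempfile
from pathlib import Path


def _build_pdf_html(run_id: str, outputs_dir: Path) -> str:
    """Build export HTML, escaping the run id, file names, and artifact bodies."""
    parts = [f"<h1>Run {html.escape(run_id)}</h1>"]
    if outputs_dir.is_dir():  # a missing directory yields a header-only page
        for path in sorted(p for p in outputs_dir.rglob("*") if p.is_file()):
            rel = path.relative_to(outputs_dir).as_posix()
            # errors="replace" keeps binary artifacts from raising on decode.
            body = path.read_text(encoding="utf-8", errors="replace")
            parts.append(f"<h2>{html.escape(rel)}</h2>")
            parts.append(f"<pre>{html.escape(body)}</pre>")
    return "<html><body>" + "".join(parts) + "</body></html>"


# Usage: an artifact that tries to break out of its <pre> block stays inert.
with tempfile.TemporaryDirectory() as tmp:
    outputs = Path(tmp)
    (outputs / "notes & plan.md").write_text(
        "</pre><script>alert(1)</script>", encoding="utf-8"
    )
    rendered = _build_pdf_html("<b>run-1</b>", outputs)
    empty = _build_pdf_html("run-2", outputs / "does-not-exist")
```

Keeping the HTML construction in a pure function like this is what lets the regression tests run without a weasyprint dependency: they only need to inspect the returned string.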