Reliability research #7

Open
vilaca wants to merge 13 commits into main from reliability-research

vilaca (Owner) commented May 22, 2026

No description provided.

vilaca force-pushed the reliability-research branch 8 times, most recently from 2a1e035 to 756e6ef on May 23, 2026 01:18

vilaca added 12 commits May 23, 2026 02:20

Deep-dive notes on a reliability stack for self-hosted LLM tool calling (guardrails, compaction, synthetic respond tool, ablation, and eval significance), captured as next-steps.md for replication on our stack.

Empirical results from the IEEE preprint: compounding-error math, 8B+framework parity with frontier, backend hidden-variable swings, compaction strategy table, and replication takeaways. Companion to next-steps.md.

…l framework, and BFCL integration

Three additions from a fuller read of the ADR directory: the _resolve_service lesson on why strict backends need a soft-error type, the five-tier diagnostic eval framework (lambda → ablated lambda → stateful → ablated stateful → stateful+strict), the information_loss future scenario concept, and the BFCL integration architecture as external validation.

…on caveat, and related-work section

From the author's HN-thread discussion: per-guardrail impact ranking (retry nudges 24-49pt, error recovery ~10pt, rescue + compaction no eval signal but kept for production), 97-config coverage figure, and comparison to Instructor / LangChain / DSPy / Outlines / tool_choice=any explaining why they stack rather than substitute.

…and JSONL row schema

Three operational additions: --reasoning-budget 0 workaround for post-2026-04-10 llama.cpp builds on reasoning models (silent hangs); production observability wiring patterns for on_message / on_compact / on_chunk callbacks with logger + Prometheus + alerting examples; and the JSONL row schema for resumable batch eval including the per-run identity/outcome/mechanics/history fields plus the rig provenance field for multi-rig datasets.

…rdrails)

Implements the in-loop reliability guardrails from the IEEE preprint / next-steps
specs so cheaper, smaller models reach near-frontier completion rates on
multi-step tool workflows. Auto-enabled for weak-tier models and any model in the
sampling-defaults table; frontier models pay near-zero cost.

What's new (each gated by the auto-enable rule in reliability-config.ts):

- Synthetic Respond tool — gives small models a structured terminal action so
  the loop never has to disambiguate text vs. tool call. Auto-injected on the
  wire only for weak-tier / known small-model deployments; single-Respond
  batches short-circuit to text-done.
- Message tagging — every conversation append carries a MessageType
  (system_prompt / user_input / tool_call / tool_result / reasoning /
  text_response / step_nudge / prerequisite_nudge / retry_nudge /
  context_warning / summary). Stripped at the wire boundary.
- Tiered 3-phase compaction — deterministic text manipulation, no LLM call.
  P1 drops nudges + truncates tool_results to 200 chars; P2 also drops
  tool_results; P3 drops reasoning + text_response. keep_recent=2 iteration
  boundaries. LLM-summary path stays as a Phase-4 emergency fallback.
- ResponseValidator + Nudge — stateless validator emits structured retry /
  unknown-tool / step / prerequisite nudges with three-tier escalation for
  premature-terminal attempts.
- StepEnforcer + tool prerequisites — opt-in via requiredSteps / terminalTools
  on AgentOptions; declared ToolDefinition.prerequisites validated at
  registry-build time. StepEnforcementError / PrerequisiteError on exhaustion.
- ToolResolutionError — new tool-author exception type ("valid request, no
  data") with softError tagging; the hard-error counter only bumps on
  non-resolution throws, letting models fumble 8+ wrong-key lookups within
  the iteration budget while still bailing on real bugs.
- Context threshold warnings — once-per-session-per-threshold transient
  "context filling up" / "nearly full" injection at 65% / 80% with re-arm on
  drop.
- Reasoning fold + think-tag utilities ([THINK] / <think>) for future wire
  serialization.
- Per-model sampling defaults map (Ministral / Qwen3 / Granite 4 / Gemma 4)
  with strict / non-strict policy.
- Per-call sampling overrides on every provider (Ollama, Anthropic, all
  OpenAI-compat via shared adapter). New ChatOptions fields: topP, topK,
  minP, repeatPenalty, presencePenalty, recommendedSampling.
- Anthropic synthetic tool_result for unpaired tool_use (load-bearing for the
  step/prereq nudge path) + tool_choice="any" exposed via
  ChatOptions.forceToolCall (auto-enabled on weak-tier Anthropic models).
- Backend context discovery via API — Ollama /api/show, llama-server /props.
- Observability events — step-nudge, prerequisite-nudge, step-completed,
  context-warning, respond-stripped; compaction event now carries
  phase: 0|1|2|3|4.
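
The three deterministic compaction phases in the list above can be sketched as a pure function over tagged messages. This is a minimal sketch, not the code in this branch: `compactTiered`, the `TaggedMessage` shape, and the per-message `iteration` field used to approximate the `keep_recent` iteration boundaries are assumptions made for illustration. It also assumes the phases are cumulative and leaves tool_call/tool_result pairing out of scope.

```typescript
type MessageType =
  | "system_prompt" | "user_input" | "tool_call" | "tool_result"
  | "reasoning" | "text_response" | "step_nudge" | "prerequisite_nudge"
  | "retry_nudge" | "context_warning" | "summary";

interface TaggedMessage {
  type: MessageType;
  content: string;
  // Hypothetical field: index of the agent-loop iteration that appended the message.
  iteration: number;
}

// Assumption: only the three *_nudge types count as "nudges" for compaction.
const NUDGE_TYPES: ReadonlySet<MessageType> = new Set<MessageType>([
  "step_nudge",
  "prerequisite_nudge",
  "retry_nudge",
]);

const TOOL_RESULT_MAX_CHARS = 200;

// Phase 1: drop nudges, truncate tool_results.
// Phase 2: additionally drop tool_results.
// Phase 3: additionally drop reasoning and text_response.
// Messages from the newest `keepRecent` iterations are never touched.
function compactTiered(
  messages: readonly TaggedMessage[],
  phase: 1 | 2 | 3,
  keepRecent = 2,
): TaggedMessage[] {
  const newest = messages.reduce((max, m) => Math.max(max, m.iteration), 0);
  const out: TaggedMessage[] = [];

  for (const m of messages) {
    const isRecent = m.iteration > newest - keepRecent;
    if (isRecent) {
      out.push(m);
      continue;
    }
    if (NUDGE_TYPES.has(m.type)) continue;
    if (m.type === "tool_result") {
      if (phase >= 2) continue;
      out.push({ ...m, content: m.content.slice(0, TOOL_RESULT_MAX_CHARS) });
      continue;
    }
    if (phase >= 3 && (m.type === "reasoning" || m.type === "text_response")) {
      continue;
    }
    out.push(m);
  }
  return out;
}
```

Because no phase calls a model, the function is deterministic and cheap enough to run on every turn; a caller would escalate from phase 1 to phase 3 only while the history still exceeds its budget.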

Out of scope (API-only constraint): ServerManager / launch flags / nvidia-smi
/ multi-slot / proxy server / ablation framework / eval harness.
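
The soft-error rule from the ToolResolutionError item above can be illustrated with a small stand-in. This is a sketch under stated assumptions: the class body, the `ToolOutcome` shape, and the helpers `runTool` and `exceedsHardErrorBudget` are hypothetical names, not the branch's implementation. It only shows the idea that resolution failures are tagged soft and never count against the hard-error budget.

```typescript
// Hypothetical stand-in for the tool-author exception described above:
// "valid request, no data" (for example, a lookup key that does not exist).
class ToolResolutionError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ToolResolutionError";
    // Keeps `instanceof` working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

interface ToolOutcome {
  success: boolean;
  softError: boolean;
  hardError: boolean;
  message: string;
}

// Runs one tool call and tags the failure mode instead of rethrowing.
// softError and hardError are mutually exclusive, and both imply success=false.
function runTool(fn: () => string): ToolOutcome {
  try {
    return { success: true, softError: false, hardError: false, message: fn() };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    const soft = err instanceof ToolResolutionError;
    return { success: false, softError: soft, hardError: !soft, message };
  }
}

// Only hard errors count against the budget. Soft errors remain bounded by
// the loop's iteration limit, which is not modelled in this sketch.
function exceedsHardErrorBudget(
  outcomes: readonly ToolOutcome[],
  maxHardErrors: number,
): boolean {
  return outcomes.filter((o) => o.hardError).length >= maxHardErrors;
}
```

With this split, a model can miss a lookup key many times without tripping the bail-out, while a genuine tool bug such as a `TypeError` still exhausts the budget quickly.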

Test coverage: 1260 unit tests (+80) across 11 new test files. Full plan in
~/.claude/plans/recursive-stargazing-truffle.md.

CI lint flagged five rules after the reliability-stack commit. All
behavior-preserving extractions:

- run-agent.ts: split the runAgent generator under the 300-line cap by
  extracting fireUserPromptSubmit, handleNoToolCallsBranch,
  detectRespondShortCircuit, emitEnforcerObservability,
  resolveChainPointer, captureChainPointer, maybeInjectContextWarning,
  settleCleanBatch, isHardErrorBudgetExhausted, and hasAnyPrereqs.
  Replaced an inline import() type annotation with the proper
  type-import (consistent-type-imports rule).
- shared.ts: pulled resolveSampling's repetitive conditional ladders
  into mergeDefaults / mergeOverrides helpers backed by a single
  ChatOptions→ResolvedSampling field map; complexity drops well below
  the cap.
- anthropic.ts: extracted anthropicExtras(sampling, forceToolCall, hasTools)
  so both the streaming and non-streaming params builders stay under the
  per-method complexity cap.
- ollama.ts: extracted ollamaOptions(model, maxTokens, sampling) used
  by both chat() and chatNoStream(); same intent.
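
A field-map-driven merge of the kind described in the shared.ts item above might look like the following. This is only a sketch: `FIELD_MAP`, `mergeLayer`, and `resolveSamplingSketch` are hypothetical names, the snake_case wire keys are assumed, and the precedence (later layers override earlier ones) is an assumption about how instance defaults, per-model defaults, and per-call overrides combine.

```typescript
interface SamplingOptions {
  topP?: number;
  topK?: number;
  minP?: number;
  repeatPenalty?: number;
  presencePenalty?: number;
}

interface ResolvedSampling {
  top_p?: number;
  top_k?: number;
  min_p?: number;
  repeat_penalty?: number;
  presence_penalty?: number;
}

// Single source of truth for the camelCase -> snake_case mapping.
const FIELD_MAP: ReadonlyArray<readonly [keyof SamplingOptions, keyof ResolvedSampling]> = [
  ["topP", "top_p"],
  ["topK", "top_k"],
  ["minP", "min_p"],
  ["repeatPenalty", "repeat_penalty"],
  ["presencePenalty", "presence_penalty"],
];

// Copies every defined field of `layer` onto a fresh copy of `target`.
function mergeLayer(
  target: ResolvedSampling,
  layer: SamplingOptions | undefined,
): ResolvedSampling {
  const next: ResolvedSampling = { ...target };
  if (!layer) return next;
  for (const [from, to] of FIELD_MAP) {
    const value = layer[from];
    if (value !== undefined) next[to] = value;
  }
  return next;
}

// Assumed precedence: instance defaults < per-model defaults < per-call overrides.
function resolveSamplingSketch(
  instanceDefaults?: SamplingOptions,
  modelDefaults?: SamplingOptions,
  perCall?: SamplingOptions,
): ResolvedSampling {
  return [instanceDefaults, modelDefaults, perCall].reduce<ResolvedSampling>(mergeLayer, {});
}
```

Driving every layer from one table replaces a per-field conditional ladder with a single loop, which is the kind of change that lowers cyclomatic complexity without altering behaviour.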

Tests still green (1260 unit + 48 e2e, 4 PTY skipped). No public API
changes.

Applies fixes from the branch review at
~/.claude/plans/humming-bubbling-lollipop.md.

- providers: thread `providerName` through buildChatBody so the
  per-model sampling-defaults diagnostic INFO line fires for OpenAI-
  compat providers too (openai, cerebras, llamacpp, groq, openrouter,
  vercel, mistral, workersai, copilot, googleaistudio, opencodezen).
  Defaults already applied silently — only the log was missing.
- docs: move research notes off the repo root.
    next-steps.md       → docs/reliability/next-steps.md
    next_steps_paper.md → docs/reliability/paper-findings.md
  Updates 19 in-source `next-steps.md §N` references to the new path.
- tiered-compact: `changed` is now cumulative across P1/P2/P3 via a
  running OR. Previously returned only the final phase's delta, so
  a P1-only mutation followed by no-op P2 falsely reported
  `changed=false` to callers (logging-only impact, but misleading).
- anthropic: flush `pendingToolUseIds` on *any* plain-message
  boundary in `splitMessagesForAnthropic`, not just user-role. The
  conversation grammar never produces consecutive assistant runs
  today, but the guard hardens against future regressions (summary
  injection, mid-turn rewrites) that could leak unpaired tool_use
  to the wire — Anthropic 400s on that shape.
- tools/types: codify the `ToolResult` flag matrix as a doc comment.
  Enumerates the 7 reachable combinations and the mutual-exclusion
  rules (softError ⊕ hardError, both imply success=false, etc.).
- tests (+23): three new suites/sections.
    * resolveSampling three-tier merge chain (8 cases) — instance
      defaults → per-model table → per-call overrides precedence,
      camel→snake field mapping, recommendedSampling on unknown
      model is a no-op.
    * autoEnableForModel tier gating (5 cases) — weak/medium/strong
      paths, sampling-profile override of tier, forceToolCall only
      fires for weak-tier Anthropic.
    * tiered-compact (3 cases) — cumulative `changed` after multi-
      phase run, tool_call ↔ tool_result pairing across P2, metadata
      preservation through all phases.
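
The `pendingToolUseIds` flush described in the anthropic item above can be sketched as follows. The message shapes and the `pairToolUses` helper are hypothetical simplifications, not the branch's `splitMessagesForAnthropic`. The sketch only illustrates the rule that any plain (non-tool) message closes the previous tool_use batch, with a synthetic error result emitted for every id still unpaired. Flushing again at the end of the history and skipping orphan results are assumptions of this sketch.

```typescript
type HistoryMessage =
  | { role: "user" | "assistant"; text: string; toolUseIds?: string[] }
  | { role: "tool"; toolCallId: string; text: string };

type WireItem =
  | { type: "text"; role: "user" | "assistant"; text: string; toolUseIds: string[] }
  | { type: "tool_result"; tool_use_id: string; content: string; is_error: boolean };

function pairToolUses(history: readonly HistoryMessage[]): WireItem[] {
  const wire: WireItem[] = [];
  let pending: string[] = [];

  // Emits a synthetic error result for every tool_use that never got a result.
  const flush = (): void => {
    for (const id of pending) {
      wire.push({
        type: "tool_result",
        tool_use_id: id,
        content: "No result was recorded for this tool call.",
        is_error: true,
      });
    }
    pending = [];
  };

  for (const msg of history) {
    if (msg.role === "tool") {
      if (pending.includes(msg.toolCallId)) {
        pending = pending.filter((id) => id !== msg.toolCallId);
        wire.push({
          type: "tool_result",
          tool_use_id: msg.toolCallId,
          content: msg.text,
          is_error: false,
        });
      }
      // Results with no matching pending id are skipped in this sketch.
      continue;
    }
    // Any plain message, user or assistant, closes the previous batch.
    flush();
    const ids = msg.toolUseIds ?? [];
    wire.push({ type: "text", role: msg.role, text: msg.text, toolUseIds: ids });
    pending = [...ids];
  }
  flush();
  return wire;
}
```

Flushing on every plain-message boundary, rather than only on user-role messages, means an unexpected assistant-after-assistant sequence still cannot leave a tool_use without a paired result.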

No public API changes (buildChatBody gains an optional field).
1283 unit tests pass (+23). No production behavior change beyond
the diagnostic-log and Anthropic flush hardening above.

Implements the punch list from the reliability-research review:
- Anthropic: tests for the unpaired tool_use → is_error tool_result shim.
- Recovery state: per-pipeline clone + deterministic merge to eliminate
  the read-modify-write race in parallel Delegate batches.
- ChatOptions.thinking tri-state (true/false/'auto') with resolver,
  ThinkingNotSupportedError, and inline <think>/[THINK] discard path
  for Ollama's leak case. Tag-parsing helpers moved to src/utils/
  to honour the providers→core architecture boundary.
- Ollama native↔prompt auto-downgrade: per-model tool-mode cache via
  getModelInfo, prompt-mode preamble + history downgrade
  (tool→user, assistant.tool_calls→<tool_call> JSON).
- Typed Nudge.meta so observability events no longer regex-parse the
  rendered template.
- resolveSampling rejects instance-level seed (per-call only per §17)
  and gains immutability tests.
- Threshold-warning re-fire integration test across turns.
- Comment clarification on per-call vs batch-level hard-error reset.
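
One plausible reading of the `thinking` tri-state from the list above is sketched here. The resolver name `resolveThinking`, its parameters, and the error class body are hypothetical. The sketch assumes that an explicit `true` on a model without thinking support raises, and that `'auto'` (or an omitted value) follows the model's capability.

```typescript
type ThinkingSetting = boolean | "auto";

// Hypothetical stand-in for the error type named in the commit message.
class ThinkingNotSupportedError extends Error {
  constructor(model: string) {
    super(`Model "${model}" does not support thinking`);
    this.name = "ThinkingNotSupportedError";
    // Keeps `instanceof` working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Collapses the tri-state into the boolean a provider request needs.
function resolveThinking(
  setting: ThinkingSetting | undefined,
  modelSupportsThinking: boolean,
  model: string,
): boolean {
  if (setting === true) {
    // An explicit request cannot be silently downgraded.
    if (!modelSupportsThinking) throw new ThinkingNotSupportedError(model);
    return true;
  }
  if (setting === false) return false;
  // "auto" or undefined: enable only where the model can do it.
  return modelSupportsThinking;
}
```

Keeping `'auto'` distinct from `true` lets callers opt in to best-effort behaviour while still getting a loud failure when they demand thinking on a model that cannot provide it.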

Unit suite: 1289 → 1320 tests, all passing.

Splits the tool-mode/sampling/thinking branches out of `chat` and
`chatNoStream` into a private `buildChatRequest`, and the chunk→
ChatChunk shaping into a free `mapOllamaChunk`. Brings both methods
back under the eslint complexity cap (25) after the prompt-mode
auto-downgrade landed.

The strict test tsconfig (tsconfig.test.json) enforces all fields on
ToolDefinition; the unit-only `tsx --test` run elides this check, so
the missing discriminator only surfaced in CI.

Punch list from the second branch review:

- anthropic: emit a `tool-result-orphan` diagnostic provider-log entry
  when a `tool_result.tool_call_id` doesn't match any pending
  `tool_use` in `splitMessagesForAnthropic`. Anthropic 400s on that
  shape anyway; the log shortens the diagnostic path from "opaque
  400" to "exact id that went unmatched".
- text-tool-parser: case-insensitive tool-name matching with
  canonicalisation. Small models routinely lowercase ("read" vs
  "Read"); the validator's unknown-tool check and the registry's
  `get()` are both case-insensitive, but the parser was stricter and
  silently dropped lowercase calls before either could see them.
  Now the parser does the same case-fold and rewrites the name to
  the registered canonical form so downstream consumers (which
  assume exact case) stay happy. Applied to the <tool_call> tag,
  Hermes-style <function=...> tag, and bare-JSON fallback paths.
- nudges: remove unused `RESPOND_TOOL` re-export (no callers).
- docs: replace the misleading "Aim: reproduce this on our own
  stack" preface in next-steps.md with a status block clarifying
  that the file is upstream research notes from the IEEE preprint
  + Python framework, not the TS implementation. Add a small
  section→file map so readers know which `src/` paths actually
  implement each §N anchor referenced from doc-comments.
- doc-comment refs: normalise two shorthand `next-steps.md §N`
  comments to the full `docs/reliability/next-steps.md §N` path
  (shared.ts, think-tags.ts) so every reference in `src/` uses the
  same form.
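
The case-fold-then-canonicalise behaviour described in the text-tool-parser item above can be sketched for the `<tool_call>` tag path. The helper names and the `{"name": ..., "arguments": ...}` JSON payload shape are assumptions for illustration. The Hermes-style and bare-JSON paths are omitted.

```typescript
interface ParsedToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Returns the registered spelling of `raw`, ignoring case, or undefined if unknown.
function canonicalToolName(raw: string, registered: readonly string[]): string | undefined {
  const folded = raw.trim().toLowerCase();
  return registered.find((name) => name.toLowerCase() === folded);
}

// Parses the first <tool_call>{...}</tool_call> block and rewrites the tool
// name to its registered form so downstream exact-case consumers still match.
function parseToolCallTag(
  text: string,
  registered: readonly string[],
): ParsedToolCall | undefined {
  const match = /<tool_call>([\s\S]*?)<\/tool_call>/.exec(text);
  if (!match) return undefined;

  let payload: unknown;
  try {
    payload = JSON.parse(match[1] ?? "");
  } catch {
    return undefined; // Malformed JSON is not a tool call.
  }
  if (typeof payload !== "object" || payload === null) return undefined;

  const { name, arguments: args } = payload as { name?: unknown; arguments?: unknown };
  if (typeof name !== "string") return undefined;

  const canonical = canonicalToolName(name, registered);
  if (canonical === undefined) return undefined; // Truly unknown tools are still rejected.

  const safeArgs =
    typeof args === "object" && args !== null ? (args as Record<string, unknown>) : {};
  return { name: canonical, arguments: safeArgs };
}
```

Rewriting to the canonical name at the parser keeps the fix in one place: a lowercase `read` from a small model reaches the validator and registry as `Read`, and nothing downstream needs to change.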

Tests (+8): five new parser cases covering lowercase tag/bare/
Hermes-tag and the still-rejects-truly-unknown guarantee; three
new cases under `runToolCalls — parallel Delegate batches >
StepEnforcer + parallel Delegate` pinning that the shared
StepEnforcer correctly records all successful siblings (and skips
failed ones) while recovery state is cloned per-pipeline.

Unit suite: 1324 → 1332 tests, all passing.

vilaca force-pushed the reliability-research branch from 756e6ef to ef71b59 on May 23, 2026 01:21

- format: prettier --write across the branch's reliability surface
  (docs/, src/core/agent/, src/providers/, tests). Pure whitespace /
  line-wrapping; the codebase's prettier config is the source of truth.
- knip: drop `export` from five symbols that were never imported
  externally: ReliabilityError (still extended by the three subclasses
  in the same file), DEFAULT_KEEP_RECENT (sole caller is in the same
  file), NudgeKind + NudgeMeta (used inside `Nudge`), and
  ThinkExtractResult (return type of an exported function — callers
  infer it). Also drop the stale `type ThinkExtractResult` re-export
  from reasoning.ts.
- run-agent: prettier reformat bumped `runAgent` to 306 lines, over
  the eslint max-lines-per-function cap (300). Extract
  `buildStepEnforcer(options, toolRegistry)` to bring it back under;
  pure behaviour-preserving extraction of the conditional StepEnforcer
  construction block.