v0.6.65 — multi-turn cache reuse for hybrid models + harmony channel fixes

@raullenchai released this 22 May 19:59

Bug-fix release. Five PRs since 0.6.64, all addressing community-reported regressions.

Headline fix: multi-turn cache reuse on hybrid models (closes #427)

Hybrid linear-attention models (Qwen3.6, GLM-4.7 DSA, anything mixing Mamba/DeltaNet with Transformer layers) couldn't reuse the prefix cache across turns: the longest-common-prefix (LCP) scan found a match, but the linear-attention state cannot be trimmed back to the match point, so the hit was unusable and every turn forced a full re-prefill. Reported by @fishloa against `TheCluster/Qwen3.6-35B-A3B-MLX-mixed-9bit` with opencode.

  • #435 Port boundary-snapshot save path to mlx-lm 0.31+ `BatchGenerator` (`insert_segments` + `end_of_segment` signal). Captures cache state at the message boundary so the next turn's lookup gets an exact-length match.
  • #439 Wire the boundary computation into the non-streaming path (`engine.chat()` / `engine.generate()`). PR #435 only covered `stream_chat`, so the fix was a no-op for pydantic_ai / smolagents / langchain / opencode (all default to `stream:false`). Also gates the boundary split on `model_config.is_hybrid` to avoid a regression on pure Transformer models (the gpt-oss-20b harmony tool-call channel was corrupted under `insert_segments`).

Verified end-to-end on `unsloth/Qwen3.6-27B-MLX-8bit` (same hybrid family): turn-3 `HIT cached=97 remaining=95` from a turn-2 boundary save.
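
To illustrate why an LCP match alone is not enough on a hybrid model, here is a toy sketch. `ToyPrefixCache`, `lcp_len`, and the token values are hypothetical and do not reflect the project's actual cache structures; the only assumption carried over from the notes above is that a hybrid state can be reused only when the saved length equals the matched length, while a pure Transformer KV cache can be trimmed to the match point.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


def lcp_len(a: List[int], b: List[int]) -> int:
    """Length of the longest common prefix of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class ToyPrefixCache:
    """Hypothetical cache mapping a saved token sequence to an opaque state."""
    is_hybrid: bool
    entries: Dict[Tuple[int, ...], str] = field(default_factory=dict)

    def save(self, tokens: List[int]) -> None:
        self.entries[tuple(tokens)] = f"state@{len(tokens)}"

    def lookup(self, prompt: List[int]) -> int:
        """Return how many leading prompt tokens can be served from cache."""
        best = 0
        for saved in self.entries:
            n = lcp_len(list(saved), prompt)
            if self.is_hybrid and n != len(saved):
                # Assumption: linear-attention state cannot be rewound to
                # position n, so a partial match is unusable.
                continue
            best = max(best, n)
        return best


turn1_prompt = [1, 2, 3, 4]                  # tokens up to the message boundary
turn1_full = turn1_prompt + [9, 9]           # prompt plus generated tokens
turn2_prompt = turn1_prompt + [7, 8, 5, 6]   # re-rendered history diverges after the boundary

no_boundary = ToyPrefixCache(is_hybrid=True)
no_boundary.save(turn1_full)

transformer = ToyPrefixCache(is_hybrid=False)
transformer.save(turn1_full)

with_boundary = ToyPrefixCache(is_hybrid=True)
with_boundary.save(turn1_prompt)             # boundary snapshot
with_boundary.save(turn1_full)

hits = (
    no_boundary.lookup(turn2_prompt),
    transformer.lookup(turn2_prompt),
    with_boundary.lookup(turn2_prompt),
)
```

In this toy model the hybrid cache without a boundary snapshot reuses nothing, while the same cache with a snapshot saved at the message boundary reuses the whole shared prefix, which is the exact-length match described in #435.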

Harmony / GPT-OSS fixes

  • #436 Reasoning parser accepts the `<|end|>` terminator on the final channel; the final block was being truncated by a non-greedy regex when the model used the alternate terminator. Also drops a tiny smoke model from the doctor harness that was producing false negatives.
  • #438 Fixes GPT-OSS-20B tool calls being extracted as plain text. Root cause was a "defense in depth" strip in `_clean_gpt_oss_output` that ate the analysis-channel markers `HarmonyReasoningParser` depends on. Symptom: multi-tool turns looped, with the model calling `add(3,4)` repeatedly until pydantic_ai's `request_limit` was exhausted. Also adds support for hyphenated tool names (`get-weather`, `my-tool`).
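
As a rough illustration of the two parsing behaviours above, the sketch below accepts either terminator on the final channel and admits hyphens in tool names. The patterns, `extract_final`, and `extract_tool_calls` are hypothetical and are not the project's `HarmonyReasoningParser`; the marker layout is assumed from the harmony format as commonly described.

```python
import re
from typing import List, Optional, Tuple

# Illustrative pattern: the final channel ends at <|return|>, <|end|>,
# or end of text, whichever comes first.
FINAL_RE = re.compile(
    r"<\|channel\|>final<\|message\|>(.*?)(?:<\|return\|>|<\|end\|>|\Z)",
    re.DOTALL,
)

# Illustrative pattern: the tool name follows "to=functions." and the
# character class includes "-" so names like get-weather are kept whole.
TOOL_RE = re.compile(
    r"to=functions\.([A-Za-z0-9_-]+).*?<\|message\|>(.*?)<\|call\|>",
    re.DOTALL,
)


def extract_final(text: str) -> Optional[str]:
    """Return the final-channel text, or None if there is no final channel."""
    m = FINAL_RE.search(text)
    return m.group(1) if m else None


def extract_tool_calls(text: str) -> List[Tuple[str, str]]:
    """Return (tool_name, raw_arguments) pairs found in the text."""
    return [(m.group(1), m.group(2)) for m in TOOL_RE.finditer(text)]


with_end = (
    "<|channel|>analysis<|message|>thinking<|end|>"
    "<|start|>assistant<|channel|>final<|message|>Hello there<|end|>"
)
with_return = "<|channel|>final<|message|>Hello there<|return|>"
tool_turn = (
    "<|channel|>commentary to=functions.get-weather "
    '<|constrain|>json<|message|>{"city": "Paris"}<|call|>'
)
```

Both `with_end` and `with_return` yield the same final text here, and `tool_turn` yields a single call named `get-weather`.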

PR validation / SOP

  • #437 Bumped per-agent subprocess timeout 1200s → 1800s in `stress_e2e_bench`. qwen3.6-27B + pydantic_ai legitimately took >20 min in some scenarios and was being SIGKILL'd as a false-positive regression.
  • #439 New `test_plan_check` step in pr_validate: scans the PR body for unchecked `- [ ]` / `* [ ]` task-list items and fails if any remain. Catches the class of bug where PR #435 was merged with `- [ ] E2E verification against fishloa's repro` still unchecked; pr_validate's correctness matrix passes regardless of whether the fix actually reuses the cache.
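
The task-list scan described above can be sketched in a few lines. This is a minimal illustration, not the actual `test_plan_check` implementation; `unchecked_items` and `plan_is_complete` are hypothetical names.

```python
import re
from typing import List

# Matches an unchecked task-list item such as "- [ ] text" or "* [ ] text".
# Checked items ("- [x] text") do not match.
UNCHECKED_RE = re.compile(
    r"^[ \t]*[-*][ \t]+\[ \][ \t]+(.+?)[ \t]*$",
    re.MULTILINE,
)


def unchecked_items(pr_body: str) -> List[str]:
    """Return the text of every unchecked task-list item in a PR body."""
    return [m.group(1) for m in UNCHECKED_RE.finditer(pr_body)]


def plan_is_complete(pr_body: str) -> bool:
    """True when no unchecked task-list items remain."""
    return not unchecked_items(pr_body)


example_body = """## Test plan
- [x] unit tests
- [ ] E2E verification against repro
* [ ] stress bench
"""
```

A validation step built this way would fail on `example_body` and report the two open items, regardless of whether the rest of the correctness matrix passes.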

Backward compatibility

  • No API changes. Same CLI surface, same wire format.
  • Non-hybrid models behave exactly as before (the boundary path is gated off).
  • Hybrid models get the new boundary save automatically — no flag needed.

Verification matrix

  • `mlx-community/gpt-oss-20b-MXFP4-Q8` (harmony): pydantic_ai 6/6, smolagents 4/4, langchain 6/6, anthropic_sdk 5/5
  • `unsloth/Qwen3.6-27B-MLX-8bit` (hybrid): stress 8/8, fishloa repro fixed, 18 tok/s stable
  • `mlx-community/Qwen3.5-35B-A3B-8bit` (MoE): stress + agents pass
  • 3885 unit tests pass (+15 new for boundary path parity & test_plan_check)
  • `make release-smoke` clean (clean-room PyPI install)

🤖 Generated with Claude Code