Fix AdaptiveMessageBatcher oscillation on no-fixed-point loads #953

SimonHeybrock merged 1 commit into main from issue-877-batcher-oscillation on May 29, 2026

@SimonHeybrock (Member) commented May 29, 2026

Why

AdaptiveMessageBatcher could oscillate indefinitely (0→2→1→0) instead of settling on a stable batch length. An overhead-dominated load that overloads at the base window but fits with headroom one or more steps up has no level where utilisation lands in the stable dead zone: it escalates, finds every escalated level "underloaded", de-escalates back until it overloads again, and repeats. This is Finding 1 of #877, and it reproduces deterministically with a flat ~1.05 s processing time.

The fix

The fix lowers DEESCALATION_HEADROOM_RATIO from 0.75 to 0.70. The dead zone is [headroom, 1.0] in utilisation, and de-escalating one half-step raises utilisation by the consecutive-level ratio. For every workload to have a stable level, the dead zone must span one such step (headroom ≤ 1 / ratio). The nominal ratio is √2 ≈ 1.414, but pulse-quantization rounds windows so the widest consecutive ratio is ~1.43 (round(14·√2)/14 = 20/14), requiring headroom ≤ ~0.70. The previous 0.75 left a gap: at the widest step, a load whose utilisation falls between 0.70 and 0.75 is classified as underloaded, yet it overloads one half-step down, which is what caused the oscillation. I verified that 1/√2 ≈ 0.707 is still insufficient at the 1.0→1.43 grid step; 0.70 is the binding value.
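The arithmetic behind the 0.70 bound can be checked with a short sketch. It assumes a grid in which level n is a 14-pulse base window scaled by √2ⁿ and rounded to whole pulses; `window_pulses`, `max_stable_headroom` and `leaves_gap` are hypothetical names for illustration, not the batcher's API, and the real grid may be built differently.

```python
import math

BASE_PULSES = 14  # assumed base window: 14 pulses, taken from round(14*sqrt(2))/14


def window_pulses(level: int) -> int:
    """Hypothetical quantised window: base * sqrt(2)**level, rounded to whole pulses."""
    return round(BASE_PULSES * math.sqrt(2) ** level)


def max_stable_headroom(levels: int) -> float:
    """Largest headroom for which [headroom, 1.0] spans every consecutive step."""
    ratios = [window_pulses(n + 1) / window_pulses(n) for n in range(levels)]
    return 1.0 / max(ratios)


def leaves_gap(headroom: float, ratio: float) -> bool:
    """True if some load is underloaded at one level yet overloaded one half-step down.

    A small tolerance keeps the exact boundary (0.70 * 20/14 == 1) from
    being misjudged because of floating-point rounding.
    """
    return headroom * ratio > 1.0 + 1e-9


widest_ratio = window_pulses(1) / window_pulses(0)  # 20 / 14
for headroom in (0.75, 1 / math.sqrt(2), 0.70):
    print(f"headroom={headroom:.3f} leaves_gap={leaves_gap(headroom, widest_ratio)}")
```

Under these assumptions only 0.70 closes the gap at the 14→20 pulse step; both 0.75 and 1/√2 leave a band of loads with no stable level.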

This deliberately does not address Finding 2 (a report after a level change is classified against the new window though its batch ran under the old one). In isolation it only ever wastes a single report, since the next steady cycle resets the counter — it is an amplifier of Finding 1, not an independent problem. The discard-the-first-report mitigation would add state to suppress a real signal for near-zero benefit, and the wider dead zone shrinks its relevance further. It is pinned as a documented-limitation test instead; the faithful fix, if ever warranted, is to classify each report against the window its batch actually ran under.
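To make Finding 2 concrete, here is a minimal sketch of the two ways a report could be classified. The `Report` type and both helper functions are hypothetical and exist only for this illustration; the batcher's actual report structure is not shown in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """Hypothetical processing report (not the batcher's real type)."""

    processing_time_s: float
    batch_window_s: float  # window the batch actually ran under


def utilisation_against_current(report: Report, current_window_s: float) -> float:
    """Behaviour described as Finding 2: divide by whatever window is active now."""
    return report.processing_time_s / current_window_s


def utilisation_faithful(report: Report) -> float:
    """The 'faithful fix': divide by the window the batch ran under."""
    return report.processing_time_s / report.batch_window_s


# A batch that ran under a 1.0 s window and took 1.05 s, reported just after
# the window was raised to 2.0 s.
late_report = Report(processing_time_s=1.05, batch_window_s=1.0)
print(utilisation_against_current(late_report, current_window_s=2.0))
print(utilisation_faithful(late_report))
```

In this example the first call yields about 0.525, which looks underloaded against the new window, while the second yields 1.05, the overload the batch actually experienced.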

Notes

The remaining genuine limitation (a load landing in the 70–100% dead zone at an escalated level cannot de-escalate even if a lower level would fit) is unchanged and still documented by dead_zone_stuck. Removing that stickiness needs active probing of lower levels, which is out of scope here.

Test plan

  • pytest tests/core/ — 406 passed.
  • New TestNoFixedPointLoad scenario fails at 0.75 and passes at 0.70, confirming it is a regression test for the fix.
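
The behaviour that the new scenario pins can be reproduced with a toy model. This is a sketch under loud assumptions, not the test itself: the controller here reacts to every single report (the real batcher uses counters), escalates by two half-steps on overload (inferred from the 0→2 jump), de-escalates by one half-step, and uses a hypothetical 14-pulse, √2-spaced window grid at an assumed 14 pulses per second.

```python
import math

BASE_PULSES = 14          # assumed base window of 14 pulses
PULSES_PER_SECOND = 14.0  # assumed pulse rate, so the base window is 1.0 s


def window_s(level: int) -> float:
    """Hypothetical quantised window length in seconds for a given level."""
    return round(BASE_PULSES * math.sqrt(2) ** level) / PULSES_PER_SECOND


def simulate(processing_time_s: float, headroom: float, steps: int = 6) -> list[int]:
    """Return the sequence of levels visited for a flat processing time."""
    level = 0
    trace = [level]
    for _ in range(steps):
        utilisation = processing_time_s / window_s(level)
        if utilisation > 1.0:
            level += 2  # assumed: escalate a full step (two half-steps)
        elif utilisation < headroom and level > 0:
            level -= 1  # de-escalate one half-step
        # otherwise utilisation is in the dead zone [headroom, 1.0]: stay put
        trace.append(level)
    return trace


print(simulate(1.05, headroom=0.75))  # cycles through 0, 2, 1
print(simulate(1.05, headroom=0.70))  # settles at level 1
```

With a flat 1.05 s load, level 1 sits at a utilisation of about 0.735, which is below 0.75 but inside the [0.70, 1.0] dead zone, so the toy controller cycles at the old headroom and settles at the new one.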

Closes #877

The adaptive batcher could oscillate indefinitely (0->2->1->0) instead of
settling. An overhead-dominated load that overloads at the base window but
fits with headroom one or more steps up has no level where utilisation lands
in the stable dead zone: it escalates, finds every escalated level
"underloaded", de-escalates back until it overloads again, and repeats.

The dead zone is [headroom, 1.0] in utilisation, and de-escalating one
half-step raises utilisation by the consecutive-level ratio. For every
workload to have a stable level the dead zone must span one such step. The
nominal ratio is sqrt(2), but pulse-quantization rounds windows so the widest
consecutive ratio is ~1.43, requiring headroom <= ~0.70. The previous 0.75
left the gap that caused the oscillation.

Lower DEESCALATION_HEADROOM_RATIO to 0.70 and add a scenario test that pins
convergence for such loads. The existing partial-deescalation scenario is
retuned to keep demonstrating a load-drop de-escalation under the wider dead
zone.

Also document the related transition mis-classification (a report after a
level change is classified against the new window though its batch ran under
the old one). In isolation it only wastes one report, so it is left as a
documented limitation rather than fixed.

Closes #877

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@SimonHeybrock
Member Author

LGTM

SimonHeybrock merged commit f7f273e into main on May 29, 2026 (13 checks passed)
SimonHeybrock deleted the issue-877-batcher-oscillation branch on May 29, 2026, 10:48

Linked issue: AdaptiveMessageBatcher: oscillation from sensitive de-escalation + mis-classified transition reports
