Fix AdaptiveMessageBatcher oscillation on no-fixed-point loads #953
Merged
The adaptive batcher could oscillate indefinitely (0->2->1->0) instead of settling. An overhead-dominated load that overloads at the base window but fits with headroom one or more steps up has no level where utilisation lands in the stable dead zone: it escalates, finds every escalated level "underloaded", de-escalates back until it overloads again, and repeats.

The dead zone is [headroom, 1.0] in utilisation, and de-escalating one half-step raises utilisation by the consecutive-level ratio. For every workload to have a stable level the dead zone must span one such step. The nominal ratio is sqrt(2), but pulse-quantization rounds windows so the widest consecutive ratio is ~1.43, requiring headroom <= ~0.70. The previous 0.75 left the gap that caused the oscillation.

Lower DEESCALATION_HEADROOM_RATIO to 0.70 and add a scenario test that pins convergence for such loads. The existing partial-deescalation scenario is retuned to keep demonstrating a load-drop de-escalation under the wider dead zone.

Also document the related transition mis-classification (a report after a level change is classified against the new window though its batch ran under the old one). In isolation it only wastes one report, so it is left as a documented limitation rather than fixed.

Closes #877

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
LGTM
Why
AdaptiveMessageBatcher could oscillate indefinitely (0→2→1→0) instead of settling on a stable batch length. An overhead-dominated load that overloads at the base window but fits with headroom one or more steps up has no level where utilisation lands in the stable dead zone: it escalates, finds every escalated level "underloaded", de-escalates back until it overloads again, and repeats. This is Finding 1 of #877, and it reproduces deterministically with a flat ~1.05 s processing time.
The fix
The dead zone is [headroom, 1.0] in utilisation, and de-escalating one half-step raises utilisation by the consecutive-level ratio. For every workload to have a stable level, the dead zone must span one such step (headroom ≤ 1 / ratio). The nominal ratio is √2 ≈ 1.414, but pulse-quantization rounds windows so the widest consecutive ratio is ~1.43 (round(14·√2)/14 = 20/14), requiring headroom ≤ ~0.70. The previous 0.75 left exactly the gap that caused the oscillation. I verified 1/√2 ≈ 0.707 is still insufficient at the 1.0→1.43 grid step; 0.70 is the binding value.

This deliberately does not address Finding 2 (a report after a level change is classified against the new window though its batch ran under the old one). In isolation it only ever wastes a single report, since the next steady cycle resets the counter; it is an amplifier of Finding 1, not an independent problem. The discard-the-first-report mitigation would add state to suppress a real signal for near-zero benefit, and the wider dead zone shrinks its relevance further. It is pinned as a documented-limitation test instead; the faithful fix, if ever warranted, is to classify each report against the window its batch actually ran under.
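
The dead-zone argument in "The fix" can be checked with a small standalone sketch. This is a toy model, not the AdaptiveMessageBatcher code: the 14-pulse base window comes from the round(14·√2)/14 expression, while the 14 Hz pulse rate (giving a 1.0 s base window), the two-half-step escalation (inferred from the 0→2 jump in the trace), the single-report reaction, and the names `window_s`, `widest_ratio`, `simulate` and `max_level` are all assumptions made for illustration.

```python
import math

BASE_PULSES = 14     # pulses in the base window (the 14 in round(14*sqrt(2))/14)
PULSE_RATE_HZ = 14   # assumed pulse rate, so the base window is 1.0 s


def pulses_at(level):
    """Pulse count at a half-step level, rounded to whole pulses."""
    return round(BASE_PULSES * math.sqrt(2) ** level)


def window_s(level):
    """Quantized window length in seconds at a half-step level."""
    return pulses_at(level) / PULSE_RATE_HZ


def widest_ratio(max_level=6):
    """Largest ratio between consecutive quantized windows (max_level is hypothetical)."""
    return max(pulses_at(k + 1) / pulses_at(k) for k in range(max_level))


def simulate(processing_s, headroom, steps=6):
    """Toy controller: escalate two half-steps on overload, de-escalate one
    half-step below the headroom, otherwise stay (dead zone [headroom, 1.0]).
    Reacting to a single report is a simplification of this sketch."""
    level = 0
    trace = [level]
    for _ in range(steps):
        utilisation = processing_s / window_s(level)
        if utilisation > 1.0:
            level += 2
        elif utilisation < headroom and level > 0:
            level -= 1
        trace.append(level)
    return trace


if __name__ == "__main__":
    ratio = widest_ratio()
    print(f"widest ratio {ratio:.4f}, headroom bound {1 / ratio:.3f}")
    print(simulate(1.05, 0.75))  # [0, 2, 1, 0, 2, 1, 0]
    print(simulate(1.05, 0.70))  # [0, 2, 1, 1, 1, 1, 1]
```

Under these assumptions the flat 1.05 s load sits at 0.735 utilisation on the 20-pulse window: below the old 0.75 headroom, so the model keeps cycling, and inside the dead zone at 0.70, so it settles at level 1.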
Notes
The remaining genuine limitation (a load landing in the 70–100% dead zone at an escalated level cannot de-escalate even if a lower level would fit) is unchanged and still documented by dead_zone_stuck. Removing that stickiness needs active probing of lower levels, which is out of scope here.
Test plan
pytest tests/core/: 406 passed.
TestNoFixedPointLoad scenario fails at 0.75 and passes at 0.70, confirming it is a regression test for the fix.
Closes #877