AdaptiveMessageBatcher: oscillation from sensitive de-escalation + mis-classified transition reports #877

@SimonHeybrock

Summary

While adding integration tests for AdaptiveMessageBatcher + RateAwareMessageBatcher, two interacting issues with the adaptive controller surfaced. Neither is rate-aware-specific — both also apply to SimpleMessageBatcher as the inner.

Finding 1: De-escalation is easy to trigger

DEESCALATION_HEADROOM_RATIO = 0.75 and DEESCALATION_UNDERLOAD_THRESHOLD = 3 mean that any workload whose processing_time_s < 0.75 * batch_length_s for three batches in a row triggers a de-escalation step. A workload that comfortably fits its window (say, 50 % utilisation) will therefore continuously de-escalate until it overloads at a smaller window, escalates, and de-escalates again — steady oscillation instead of a stable level.

Writing test_oscillation_preserves_messages exposed this directly: the first draft used processing_time=1.0 during the level-2 (2 s) run phase; the controller drifted back to level 0 within a few reports.
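The drift can be reproduced with a minimal sketch of the de-escalation rule alone. The two constants are taken from above; the window ladder `WINDOWS_S`, the function name `simulate_deescalation`, and the assumption that the underload counter resets on any non-underloaded report are hypothetical and not the actual implementation.

```python
DEESCALATION_HEADROOM_RATIO = 0.75
DEESCALATION_UNDERLOAD_THRESHOLD = 3

# Hypothetical level ladder (seconds), chosen to match the 1 s, 1.43 s and 2 s
# windows mentioned in this issue. The real ladder may differ.
WINDOWS_S = (1.0, 1.43, 2.0)


def simulate_deescalation(level: int, processing_times_s: list[float]) -> list[int]:
    """Return the level after each report, modelling only de-escalation."""
    underload = 0
    history = []
    for processing_time_s in processing_times_s:
        window_s = WINDOWS_S[level]
        if processing_time_s < DEESCALATION_HEADROOM_RATIO * window_s:
            underload += 1
        else:
            underload = 0  # assumption: counter tracks consecutive reports
        if underload >= DEESCALATION_UNDERLOAD_THRESHOLD and level > 0:
            level -= 1
            underload = 0
        history.append(level)
    return history


# Start at level 2 (2 s window) with a constant 1.0 s processing time.
history = simulate_deescalation(level=2, processing_times_s=[1.0] * 8)
print(history)  # [2, 2, 1, 1, 1, 0, 0, 0]
```

Under these assumptions a workload at 50 % utilisation of the 2 s window reaches level 0 after six reports, where it then runs at 100 % utilisation of the 1 s window.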

Finding 2: Transition-batch classification is ambiguous

The moment AdaptiveMessageBatcher escalates or de-escalates, its own batch_length_s property returns the new value. But the inner batcher's currently-active batch was created with the old window (both SimpleMessageBatcher and RateAwareMessageBatcher document this: "the current active batch keeps its boundaries"). The next report_batch(count, processing_time) therefore compares a processing time that belongs to the old window against the new threshold.

Concrete consequence:

  • After escalation (e.g., 1 s → 2 s): the first report is almost certainly processing_time < 0.75 * 2 s, so it's classified as "underloaded" and increments the de-escalation counter — through no fault of the workload.
  • After de-escalation (e.g., 2 s → 1.43 s): the first report can be spuriously classified as "overloaded" by the new, smaller threshold.

Combined with finding 1, this means every escalation is immediately followed by a free step toward de-escalation, amplifying the oscillation tendency.
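To make the misclassification concrete, here is a small sketch that classifies the same report against the old and the new window. The underload rule is the one quoted in finding 1; the overload rule (`processing_time_s > window_s`) and the function names are assumptions for illustration only.

```python
DEESCALATION_HEADROOM_RATIO = 0.75


def is_underloaded(processing_time_s: float, window_s: float) -> bool:
    return processing_time_s < DEESCALATION_HEADROOM_RATIO * window_s


def is_overloaded(processing_time_s: float, window_s: float) -> bool:
    # Assumption: a batch is overloaded when processing takes longer than
    # its window. The real controller may use a different criterion.
    return processing_time_s > window_s


# Escalation 1 s -> 2 s: the active batch was built with the 1 s window
# and took 0.9 s to process.
after_escalation = {
    "new_window": is_underloaded(0.9, 2.0),  # what the controller sees
    "old_window": is_underloaded(0.9, 1.0),  # what actually happened
}

# De-escalation 2 s -> 1.43 s: the active batch was built with the 2 s
# window and took 1.6 s to process.
after_deescalation = {
    "new_window": is_overloaded(1.6, 1.43),
    "old_window": is_overloaded(1.6, 2.0),
}

print(after_escalation)    # {'new_window': True, 'old_window': False}
print(after_deescalation)  # {'new_window': True, 'old_window': False}
```

In both cases the report is flagged only because it is measured against a window the batch was never created with.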

Why both findings show up together

Finding 2 contributes one spurious underload report per escalation. Finding 1 means that only three consecutive underload reports are needed for a level change. So in a workload with some variability, the controller can be nudged down by the transition itself rather than by a genuine change in the workload.

Fix sketches (not decided)

Options to discuss:

  1. Skip classification for one cycle after a level change — the adaptive wrapper tracks "just changed" and treats the next report_batch as neutral.
  2. Classify reports against the window the inner batch was actually created with — the adaptive wrapper captures "the threshold at the time the batch started" and uses that for classification.
  3. Raise DEESCALATION_HEADROOM_RATIO and/or DEESCALATION_UNDERLOAD_THRESHOLD so settling behavior becomes the norm. This only mitigates finding 1; finding 2 remains.
  4. Asymmetric counter reset: after a level change, reset both consecutive counters to 0 (already done) and discard the first report. Covers both findings cheaply.

Option 2 is the most physically accurate but couples the adaptive wrapper more tightly to the inner's batch lifecycle. Option 4 is the smallest change and handles both findings.
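As a rough illustration of options 1 and 4, the sketch below shows a toy controller that resets its counters on a level change and optionally discards the next report. Everything here is hypothetical: `ToyAdaptiveController`, the window ladder, `ESCALATION_OVERLOAD_THRESHOLD`, and the overload rule are stand-ins and do not reflect the actual AdaptiveMessageBatcher code.

```python
from dataclasses import dataclass

DEESCALATION_HEADROOM_RATIO = 0.75
DEESCALATION_UNDERLOAD_THRESHOLD = 3
ESCALATION_OVERLOAD_THRESHOLD = 1  # hypothetical value for this sketch


@dataclass
class ToyAdaptiveController:
    discard_after_change: bool
    windows_s: tuple[float, ...] = (1.0, 1.43, 2.0)  # hypothetical ladder
    level: int = 0
    underload: int = 0
    overload: int = 0
    _discard_next: bool = False

    def _change_level(self, step: int) -> None:
        self.level += step
        self.underload = 0
        self.overload = 0
        # The next report belongs to a batch built with the old window.
        self._discard_next = self.discard_after_change

    def report_batch(self, processing_time_s: float) -> None:
        if self._discard_next:
            self._discard_next = False
            return  # treat the transition report as neutral
        window_s = self.windows_s[self.level]
        if processing_time_s > window_s:  # assumed overload rule
            self.overload += 1
            self.underload = 0
        elif processing_time_s < DEESCALATION_HEADROOM_RATIO * window_s:
            self.underload += 1
            self.overload = 0
        else:
            self.underload = 0
            self.overload = 0
        top = len(self.windows_s) - 1
        if self.overload >= ESCALATION_OVERLOAD_THRESHOLD and self.level < top:
            self._change_level(+1)
        elif self.underload >= DEESCALATION_UNDERLOAD_THRESHOLD and self.level > 0:
            self._change_level(-1)


def underload_after_transition(discard: bool) -> int:
    ctrl = ToyAdaptiveController(discard_after_change=discard)
    ctrl.report_batch(1.05)  # overloads the 1 s window, escalates to 1.43 s
    ctrl.report_batch(1.05)  # stale report from the batch built with 1 s
    return ctrl.underload


print(underload_after_transition(discard=False))  # 1
print(underload_after_transition(discard=True))   # 0
```

Without the discard, the stale report already counts as one of the three underload reports needed to de-escalate. With it, the counter starts from zero on the first batch that was actually built with the new window.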

Scope / applies to

  • src/ess/livedata/core/message_batcher.py (AdaptiveMessageBatcher)
  • Affects any inner batcher (SimpleMessageBatcher, RateAwareMessageBatcher).
  • Existing tests/core/adaptive_batching_scenarios_test.py does not exhibit these problems because its simulation model treats batch_length_s as always matching the active window — the divergence between adaptive's view and the inner's view is the bug these tests don't model.
