Summary
While adding integration tests for AdaptiveMessageBatcher + RateAwareMessageBatcher, two interacting issues with the adaptive controller surfaced. Neither is rate-aware-specific — both also apply to SimpleMessageBatcher as the inner.
Finding 1: De-escalation is easy to trigger
DEESCALATION_HEADROOM_RATIO = 0.75 and DEESCALATION_UNDERLOAD_THRESHOLD = 3 mean that any workload whose processing_time_s < 0.75 * batch_length_s for three batches in a row triggers a de-escalation step. A workload that comfortably fits its window (say, 50 % utilisation) will therefore continuously de-escalate until it overloads at a smaller window, escalates, and de-escalates again — steady oscillation instead of a stable level.
Writing test_oscillation_preserves_messages exposed this directly: the first draft used processing_time=1.0 during the level-2 (2 s) run phase; the controller drifted back to level 0 within a few reports.
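The drift can be reproduced with a toy model of the de-escalation counter. This is only a sketch: the two constants are the values quoted above, but the level table `WINDOWS_S` and the function `simulate_drift` are hypothetical, escalation is left out entirely, and the real controller in `AdaptiveMessageBatcher` may differ.

```python
DEESCALATION_HEADROOM_RATIO = 0.75    # value quoted in this issue
DEESCALATION_UNDERLOAD_THRESHOLD = 3  # value quoted in this issue
WINDOWS_S = (1.0, 1.43, 2.0)          # hypothetical level table (levels 0, 1, 2)


def simulate_drift(processing_time_s: float, start_level: int, n_reports: int) -> list[int]:
    """Return the level in effect after each report (de-escalation only)."""
    level = start_level
    consecutive_underloads = 0
    history = []
    for _ in range(n_reports):
        threshold_s = DEESCALATION_HEADROOM_RATIO * WINDOWS_S[level]
        if processing_time_s < threshold_s:
            consecutive_underloads += 1
        else:
            consecutive_underloads = 0
        # Step down one level after enough consecutive underload reports.
        if consecutive_underloads >= DEESCALATION_UNDERLOAD_THRESHOLD and level > 0:
            level -= 1
            consecutive_underloads = 0
        history.append(level)
    return history


# Constant 1.0 s processing time, starting at the 2 s level.
history = simulate_drift(processing_time_s=1.0, start_level=2, n_reports=8)
print(history)  # [2, 2, 1, 1, 1, 0, 0, 0]
```

Under these assumptions a workload at 50 % utilisation of the 2 s window reaches level 0 after six reports, which matches the drift seen in the first draft of the test.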
Finding 2: Transition-batch classification is ambiguous
The moment AdaptiveMessageBatcher escalates or de-escalates, its own batch_length_s property returns the new value. But the inner batcher's currently-active batch was created with the old window (both SimpleMessageBatcher and RateAwareMessageBatcher document this: "the current active batch keeps its boundaries"). The next report_batch(count, processing_time) therefore compares a processing time that belongs to the old window against the new threshold.
Concrete consequence:
- After escalation (e.g., 1 s → 2 s): the first report almost certainly satisfies processing_time < 0.75 * 2 s, so it's classified as "underloaded" and increments the de-escalation counter, through no fault of the workload.
- After de-escalation (e.g., 2 s → 1.43 s): the first report can be spuriously classified as "overloaded" by the new, smaller threshold.
Combined with finding 1, this means every escalation is immediately followed by a free step toward de-escalation, amplifying the oscillation tendency.
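The mismatch is easy to see by classifying the same report against both windows. In the sketch below, `classify` and `transition_report` are hypothetical helpers; the underload rule uses the ratio quoted above, while the overload rule (`processing_time_s > window_s`) is an assumption made for illustration and may not match the real criterion.

```python
DEESCALATION_HEADROOM_RATIO = 0.75  # value quoted in this issue


def classify(processing_time_s: float, window_s: float) -> str:
    """Toy classifier; the overload rule is an assumption, not the real code."""
    if processing_time_s > window_s:  # assumed overload criterion
        return "overloaded"
    if processing_time_s < DEESCALATION_HEADROOM_RATIO * window_s:
        return "underloaded"
    return "ok"


def transition_report(processing_time_s: float, old_window_s: float, new_window_s: float) -> dict[str, str]:
    """Classify one report against the window it ran in and against the new one."""
    return {
        "old": classify(processing_time_s, old_window_s),
        "new": classify(processing_time_s, new_window_s),
    }


# Batch created with a 1 s window, reported after escalation to 2 s.
after_escalation = transition_report(0.9, old_window_s=1.0, new_window_s=2.0)
# Batch created with a 2 s window, reported after de-escalation to 1.43 s.
after_deescalation = transition_report(1.6, old_window_s=2.0, new_window_s=1.43)

print(after_escalation)    # {'old': 'ok', 'new': 'underloaded'}
print(after_deescalation)  # {'old': 'ok', 'new': 'overloaded'}
```

In both cases the report is unremarkable for the window the batch actually ran in, and only the comparison against the new threshold produces a load signal.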
Why both findings show up together
Finding 2 contributes one spurious underload report per escalation, and under finding 1 only three consecutive underload reports are needed for a level change. After an escalation, two further underload reports are therefore enough to de-escalate again. So in a workload that has some variability, the controller can be nudged down just by the transition itself, not by the workload genuinely changing.
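A small worked sequence illustrates the interaction. The helper `first_deescalation` and the sample processing times are made up for illustration; only the two constants come from this issue.

```python
DEESCALATION_HEADROOM_RATIO = 0.75    # value quoted in this issue
DEESCALATION_UNDERLOAD_THRESHOLD = 3  # value quoted in this issue


def first_deescalation(processing_times_s, batch_length_s):
    """Return the 1-based index of the report that triggers de-escalation, or None."""
    consecutive = 0
    for index, processing_time_s in enumerate(processing_times_s, start=1):
        if processing_time_s < DEESCALATION_HEADROOM_RATIO * batch_length_s:
            consecutive += 1
        else:
            consecutive = 0  # any non-underload report breaks the streak
        if consecutive >= DEESCALATION_UNDERLOAD_THRESHOLD:
            return index
    return None


# Variable workload at the 2 s level (underload threshold 1.5 s).
genuine = [1.4, 1.4, 1.6, 1.4]
# Same workload, preceded by a transition batch that ran in the old 1 s window.
with_transition = [0.9, *genuine]

print(first_deescalation(genuine, 2.0))          # None
print(first_deescalation(with_transition, 2.0))  # 3
```

The workload alone never produces three underload reports in a row, yet the sequence that starts with the transition batch de-escalates on its third report.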
Fix sketches (not decided)
Options to discuss:
1. Skip classification for one cycle after a level change: the adaptive wrapper tracks "just changed" and treats the next report_batch as neutral.
2. Classify reports against the window the inner batch was actually created with: the adaptive wrapper captures "the threshold at the time the batch started" and uses that for classification.
3. Raise DEESCALATION_HEADROOM_RATIO and/or DEESCALATION_UNDERLOAD_THRESHOLD so settling behavior becomes the norm. This only mitigates finding 1; finding 2 remains.
4. Asymmetric counter reset: after a level change, reset both consecutive counters to 0 (already done) and discard the first report. Covers both findings cheaply.
Option 2 is the most physically accurate but couples the adaptive wrapper more tightly to the inner's batch lifecycle. Option 4 is the smallest change and handles both findings.
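As a rough sketch of the "discard the first report" idea (options 1 and 4), the bookkeeping could look like the following. `ToyController` is a hypothetical stand-in that tracks only the underload counter; its `set_level` method, `discard_first_report` flag, and single-argument `report_batch` are illustrative and are not the API of `AdaptiveMessageBatcher`.

```python
from dataclasses import dataclass

DEESCALATION_HEADROOM_RATIO = 0.75    # value quoted in this issue
DEESCALATION_UNDERLOAD_THRESHOLD = 3  # value quoted in this issue


@dataclass
class ToyController:
    """Hypothetical stand-in for the adaptive wrapper's underload bookkeeping."""

    batch_length_s: float
    discard_first_report: bool = True
    consecutive_underloads: int = 0
    just_changed: bool = False

    def set_level(self, new_batch_length_s: float) -> None:
        # A level change resets the counter and marks the next report as stale.
        self.batch_length_s = new_batch_length_s
        self.consecutive_underloads = 0
        self.just_changed = True

    def report_batch(self, processing_time_s: float) -> bool:
        """Return True when this report would trigger a de-escalation step."""
        if self.just_changed:
            self.just_changed = False
            if self.discard_first_report:
                return False  # transition batch: treat as neutral
        if processing_time_s < DEESCALATION_HEADROOM_RATIO * self.batch_length_s:
            self.consecutive_underloads += 1
        else:
            self.consecutive_underloads = 0
        return self.consecutive_underloads >= DEESCALATION_UNDERLOAD_THRESHOLD


# Escalate 1 s -> 2 s, then report a batch that ran in the old 1 s window.
patched = ToyController(batch_length_s=1.0)
patched.set_level(2.0)
patched.report_batch(0.9)

current = ToyController(batch_length_s=1.0, discard_first_report=False)
current.set_level(2.0)
current.report_batch(0.9)

print(patched.consecutive_underloads, current.consecutive_underloads)  # 0 1
```

With the discard in place the transition batch no longer contributes to the counter, so a de-escalation after an escalation again requires three reports that were measured against the new window.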
Scope / applies to
- src/ess/livedata/core/message_batcher.py → AdaptiveMessageBatcher
- Affects any inner batcher (SimpleMessageBatcher, RateAwareMessageBatcher).
- Existing tests/core/adaptive_batching_scenarios_test.py does not exhibit these problems because its simulation model treats batch_length_s as always matching the active window. The divergence between the adaptive wrapper's view and the inner batcher's view is the bug these tests don't model.