LLO: Fix buffered telemetry sampling#22548
Conversation
|
👋 calvwang9, thanks for creating this pull request! To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team. Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks! |
|
✅ No conflicts with other open PRs targeting |
There was a problem hiding this comment.
Pull request overview
Risk Rating: MEDIUM
This PR fixes a telemetry sampling bug in the LLO telemeter where buffered outcome/report telemetry could be sampled out before being keyed into the seqNr buffer, causing missing telemetry on transmit-driven flushes (especially when seqNr advances much faster than transmits occur).
Changes:
- Apply sampling for buffered telemetry at flush time inside
sendBufferedTelemetry, rather than at enqueue time. - Always enqueue buffered
LLOOutcomeandLLOReporttelemetry for all seqNrs (overwrite for outcomes, append for reports), then sample right before marshal/send. - Add regression tests covering outcome/report sampling behavior under seqNr/transmit mismatches, including multi-channel report behavior.
Scrupulous human review recommended (risk areas):
telemetryBuffergrowth characteristics when transmits are delayed/stalled or whenTrackSeqNrevents are dropped (bounded channel + non-blocking send).- Operational impact of buffering full-fidelity report telemetry for all seqNrs (potentially large per-seqNr slices) prior to flush-time sampling.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| core/services/llo/telem/telemetry.go | Moves buffered telemetry sampling from enqueue-time to flush-time and buffers all outcomes/reports by seqNr. |
| core/services/llo/telem/telemetry_test.go | Adds regression tests validating flush-time sampling for outcomes and reports under transmit/seqNr mismatch scenarios. |
| @@ -352,12 +356,11 @@ func (t *telemeter) enqueueTelemetry(digest string, seqNr uint64, typ synchroniz | |||
| if _, ok := t.telemetryBuffer[digest]; !ok { | |||
| t.telemetryBuffer[digest] = make(map[uint64][]telemetryEntry) | |||
| } | |||
| if t.sampler.Sample(typ, msg) { | |||
| t.telemetryBuffer[digest][seqNr] = append(t.telemetryBuffer[digest][seqNr], telemetryEntry{ | |||
| telemType: typ, | |||
| msg: msg, | |||
| }) | |||
| } | |||
| // Sampling is applied at flush time for buffered telemetry | |||
| t.telemetryBuffer[digest][seqNr] = append(t.telemetryBuffer[digest][seqNr], telemetryEntry{ | |||
| telemType: typ, | |||
| msg: msg, | |||
| }) | |||
There was a problem hiding this comment.
I agree can we consider adding this @calvwang9 ?
|
…fix/llo-outcome-telemetry-sampling
|




Fixes a bug in the sampling mechanism for buffered telemetry
Root cause
Telemetry for LLOOutcome and LLOReports is not sent immediately, but rather stored in a buffer keyed by sequence number. This telemetry is only flushed if there is a transmit for that specific sequence number, and is dropped otherwise - this is intentional to only send useful outcome/report telemetry for sequence numbers that actually matter.
However, the new sampling mechanism was sampling outcome and report telemetry before storing in the buffer, meaning that when flushing telemetry for upon transmit, it would not always find corresponding telemetry for the given transmitting sequence number in the buffer.
In the case of HF streams, the OCR param on prod mainnet:
DeltaRound: 10ms→ ocr rounds (and therefore seqNr increments) advance every ~10ms = +~100 seqNrs/sec.DefaultMinReportIntervalNanoseconds: 40ms→ reports generated every ~40–50ms = ~20–25 Transmits/sec.The sampler only passes the first telemetry message of each second into the buffer. The probability of the sampled sequence number being one that transmits is about 1 in 4 - 5, which matches the observed average rate of 0.2-0.3 sampled telemetry messages per second.
Fix
Move the sampler for buffered telemetry to flush time rather than enqueue time, allowing the buffer to store telemetry for all sequence numbers.