Skip to content

LLO: Fix buffered telemetry sampling#22548

Merged
akuzni2 merged 3 commits into
developfrom
fix/llo-outcome-telemetry-sampling
May 26, 2026
Merged

LLO: Fix buffered telemetry sampling#22548
akuzni2 merged 3 commits into
developfrom
fix/llo-outcome-telemetry-sampling

Conversation

@calvwang9
Copy link
Copy Markdown
Contributor

Fixes a bug in the sampling mechanism for buffered telemetry

Root cause

Telemetry for LLOOutcome and LLOReports is not sent immediately, but rather stored in a buffer keyed by sequence number. This telemetry is only flushed if there is a transmit for that specific sequence number, and is dropped otherwise - this is intentional to only send useful outcome/report telemetry for sequence numbers that actually matter.

However, the new sampling mechanism was sampling outcome and report telemetry before storing in the buffer, meaning that when flushing telemetry for upon transmit, it would not always find corresponding telemetry for the given transmitting sequence number in the buffer.

In the case of HF streams, the OCR param on prod mainnet:

  • DeltaRound: 10ms → ocr rounds (and therefore seqNr increments) advance every ~10ms = +~100 seqNrs/sec.
  • DefaultMinReportIntervalNanoseconds: 40ms → reports generated every ~40–50ms = ~20–25 Transmits/sec.

The sampler only passes the first telemetry message of each second into the buffer. The probability of the sampled sequence number being one that transmits is about 1 in 4 - 5, which matches the observed average rate of 0.2-0.3 sampled telemetry messages per second.

Fix

Move the sampler for buffered telemetry to flush time rather than enqueue time, allowing the buffer to store telemetry for all sequence numbers.

@github-actions
Copy link
Copy Markdown
Contributor

👋 calvwang9, thanks for creating this pull request!

To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team.

Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks!

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 20, 2026

✅ No conflicts with other open PRs targeting develop

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Risk Rating: MEDIUM

This PR fixes a telemetry sampling bug in the LLO telemeter where buffered outcome/report telemetry could be sampled out before being keyed into the seqNr buffer, causing missing telemetry on transmit-driven flushes (especially when seqNr advances much faster than transmits occur).

Changes:

  • Apply sampling for buffered telemetry at flush time inside sendBufferedTelemetry, rather than at enqueue time.
  • Always enqueue buffered LLOOutcome and LLOReport telemetry for all seqNrs (overwrite for outcomes, append for reports), then sample right before marshal/send.
  • Add regression tests covering outcome/report sampling behavior under seqNr/transmit mismatches, including multi-channel report behavior.

Scrupulous human review recommended (risk areas):

  • telemetryBuffer growth characteristics when transmits are delayed/stalled or when TrackSeqNr events are dropped (bounded channel + non-blocking send).
  • Operational impact of buffering full-fidelity report telemetry for all seqNrs (potentially large per-seqNr slices) prior to flush-time sampling.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
core/services/llo/telem/telemetry.go Moves buffered telemetry sampling from enqueue-time to flush-time and buffers all outcomes/reports by seqNr.
core/services/llo/telem/telemetry_test.go Adds regression tests validating flush-time sampling for outcomes and reports under transmit/seqNr mismatch scenarios.

Comment on lines 345 to +363
@@ -352,12 +356,11 @@ func (t *telemeter) enqueueTelemetry(digest string, seqNr uint64, typ synchroniz
if _, ok := t.telemetryBuffer[digest]; !ok {
t.telemetryBuffer[digest] = make(map[uint64][]telemetryEntry)
}
if t.sampler.Sample(typ, msg) {
t.telemetryBuffer[digest][seqNr] = append(t.telemetryBuffer[digest][seqNr], telemetryEntry{
telemType: typ,
msg: msg,
})
}
// Sampling is applied at flush time for buffered telemetry
t.telemetryBuffer[digest][seqNr] = append(t.telemetryBuffer[digest][seqNr], telemetryEntry{
telemType: typ,
msg: msg,
})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree can we consider adding this @calvwang9 ?

@trunk-io
Copy link
Copy Markdown

trunk-io Bot commented May 20, 2026

Static BadgeStatic BadgeStatic BadgeStatic Badge

Failed Test Failure Summary Logs
TestVRFV2PlusIntegration_ExternalOwnerConsumerExample The test failed due to an 'underpriced transaction' error during a VRF integration test. Logs ↗︎
TestFunctionsConfigPoller/FunctionsPlugin The test failed because deploying the test access controller contract resulted in a 'replacement transaction underpriced' error. Logs ↗︎
TestIntegration_MercuryV3 The test failed without a specific error message, indicating an unspecified failure in the integration test. Logs ↗︎

View Full Report ↗︎Docs

ro-tex
ro-tex previously approved these changes May 20, 2026
@calvwang9 calvwang9 requested a review from ro-tex May 21, 2026 01:40
@akuzni2 akuzni2 added the build-publish Build and Publish image to SDLC label May 21, 2026
@cl-sonarqube-production
Copy link
Copy Markdown

@akuzni2 akuzni2 added this pull request to the merge queue May 26, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 26, 2026
@akuzni2 akuzni2 added this pull request to the merge queue May 26, 2026
Merged via the queue into develop with commit 9845119 May 26, 2026
266 of 269 checks passed
@akuzni2 akuzni2 deleted the fix/llo-outcome-telemetry-sampling branch May 26, 2026 20:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

build-publish Build and Publish image to SDLC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants