fix(gcp_pubsub source): treat expected stream closures as non-errors #25149
andylibrian wants to merge 1 commit into vectordotdev:master
All contributors have signed the CLA ✍️ ✅
Pull request overview
This PR updates the gcp_pubsub source error handling so routine Pub/Sub StreamingPull stream closures are treated as expected behavior (avoiding ERROR logs, component_errors_total increments, and retry delays), aligning Vector’s behavior with Google’s documented StreamingPull lifecycle.
Changes:
- Added `is_expected_closure()` to detect expected Pub/Sub StreamingPull closures (gRPC `Unavailable` with known message prefix).
- Updated `translate_error()` to immediately reconnect (`State::RetryNow`) and log at `debug!` level for expected closures.
- Added unit tests covering the new predicate and `translate_error()` behavior, plus a changelog fragment.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/sources/gcp_pubsub.rs` | Detects expected StreamingPull closures and retries immediately without emitting error events; adds unit tests. |
| `changelog.d/22304_gcp_pubsub_expected_closure.fix.md` | Documents the user-facing change in logging/metrics/retry behavior for expected closures. |
I have read the CLA Document and I hereby sign the CLA
GCP Pub/Sub sends UNAVAILABLE status with message "The StreamingPull stream closed for an expected reason" during routine stream management (subscription changes, load balancing). Previously this was emitted as a GcpPubsubReceiveError and retried with a delay. Now it is recognized as benign and retried immediately, matching the existing is_reset() behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codex review

Codex Review: Didn't find any major issues.
Summary
The `gcp_pubsub` source emits ERROR-level logs and increments `component_errors_total` when Google's Pub/Sub server closes a StreamingPull connection for an expected reason. Google's documentation describes these periodic closures as routine behavior: https://cloud.google.com/pubsub/docs/pull#streamingpull

The observed error is a gRPC `Unavailable` status with the message "The StreamingPull stream closed for an expected reason".

This causes false alerts, inflates `component_errors_total`, and introduces an unnecessary `retry_delay_secs` (default 1s) pause before reconnecting.

Root cause: `translate_error()` only special-cases HTTP/2-level resets via `is_reset()`, which inspects the `hyper::Error` → `h2::Error` source chain. The expected closure arrives as a gRPC-level `tonic::Status` with `Code::Unavailable`, which `is_reset()` cannot detect. It falls through to the else branch, emitting `GcpPubsubReceiveError` (ERROR + metric) and returning `State::RetryDelay`.

Fix: Add an `is_expected_closure()` predicate that checks for `Code::Unavailable` with a message starting with `"The StreamingPull stream closed for an expected reason"`. When matched, log at `debug!` level and return `State::RetryNow` for immediate reconnection, following the same pattern as the existing `is_reset()` handler.

If Google ever changes the message text, detection stops and behavior safely regresses to the current ERROR + delay, which is not a new failure mode.
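The predicate and the resulting branch can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `Code`, `Status`, and `State` here are simplified stand-ins for `tonic::Code`, `tonic::Status`, and the source's internal state enum, and the constant name `EXPECTED_CLOSURE_PREFIX` is a hypothetical choice. The real `translate_error()` also handles `is_reset()` and emits `GcpPubsubReceiveError`, which this sketch omits.

```rust
/// Simplified stand-in for `tonic::Code` (assumption: only two variants shown).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Code {
    Unavailable,
    Internal,
}

/// Simplified stand-in for `tonic::Status`.
#[derive(Debug, Clone)]
struct Status {
    code: Code,
    message: String,
}

/// Simplified stand-in for the source's retry state.
#[derive(Debug, PartialEq, Eq)]
enum State {
    RetryNow,
    RetryDelay,
}

/// Hypothetical constant name; the prefix text is the one quoted in the PR.
const EXPECTED_CLOSURE_PREFIX: &str =
    "The StreamingPull stream closed for an expected reason";

/// True only for `Unavailable` statuses whose message starts with the known prefix.
fn is_expected_closure(status: &Status) -> bool {
    status.code == Code::Unavailable && status.message.starts_with(EXPECTED_CLOSURE_PREFIX)
}

/// Expected closures reconnect immediately; everything else keeps the delayed retry.
fn translate_error(status: &Status) -> State {
    if is_expected_closure(status) {
        // The PR describes logging at debug level here instead of emitting an error event.
        State::RetryNow
    } else {
        // Any other status, including an `Unavailable` with different text, is still an error.
        State::RetryDelay
    }
}
```

Because the match is on a message prefix, an `Unavailable` status with any other text (for example "The service was unable to fulfill your request") still takes the delayed-retry path in this sketch.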
Vector configuration
How did you test this PR?
Unit tests (4 new tests in `src/sources/gcp_pubsub.rs`):
- `expected_closure_matches_unavailable_with_known_message`: predicate matches expected closure
- `expected_closure_does_not_match_different_message`: predicate rejects other Unavailable messages
- `expected_closure_does_not_match_different_code`: predicate rejects non-Unavailable codes
- `translate_error_retries_now_on_expected_closure`: translate_error returns `State::RetryNow`

Real GKE test (two-phase test on a GKE cluster consuming from GCP Pub/Sub):
- `master` (v0.55.0). Confirmed ERROR logs and `component_errors_total` increments continued at the expected rate. Also observed a genuine Unavailable error ("The service was unable to fulfill your request") which is correctly NOT matched by this fix.
- `Failed to fetch events` ERROR logs from the fix pod, while streams continued scaling up normally (concurrency 1→3), confirming expected closures are handled silently with immediate reconnect.

Change Type
Is this a breaking change?
Does this PR include user facing changes?
`no-changelog` label to this PR.
References
Closes #25151
Files to create/modify
- `changelog.d/22304_gcp_pubsub_expected_closure.fix.md`
- `src/sources/gcp_pubsub.rs`

Verification