-
Notifications
You must be signed in to change notification settings - Fork 4.6k
Description
Clients can configure non transparent retries either via XDS
or via ServiceConfig
however the stats
handler have no good way to retrieve if a particular RPC
method is a retry or not.
This type of observability is quite useful to determine if retry-storms
(due to poor configurations) cause incidents and/or to see if retries
are actually effective.
I thought of looking at the retry metadata
header "grpc-previous-rpc-attempts"
but this one is added by the http2_client
as a header after the HandleRPC
method is called so that didnt yield any results.
My first question is: Is this actually true or is there a way to get o11y
on the non-transparent
retries from the client?
My second question is: If it is not available, the numRetries
is available when we construct the stats.Begin
struct so if you are open to extending the stats.Begin
struct, we can add it there.
Looking forward to your thoughts and also happy to implement the feature if it is not available
Activity
purnesh42H commentedon May 8, 2025
We don't have retry metrics yet. There is a proposal in progress https://github.com/grpc/proposal/pull/488/files. A metric label
grpc.client.call.retries
is proposed which has a labelgrpc.retry_type
that can take one of three values - "retry", "hedge", or "transparent". Please take a look at the proposal to see if it has everything what you need. Tagging @yashykt in case you have more questions regarding the proposal.In tracing though we have a way to identify if an rpc attempt was a transparent retry or not. When span is created, an attribute with key
transparent-retry
is set with a boolean value indicating whether the stream is undergoing a transparent retry. Tracing is still in experimental state. You can enable tracing through dial options https://github.com/grpc/grpc-go/blob/master/examples/features/opentelemetry/client/main.go#L65davinci26 commentedon May 8, 2025
Seems reasonable, have to check our implementation of tracing to see if we include this or not but thanks for the pointer.
Cool proposal, you guys are one step ahead :D. Everything in the proposal makes sense.
Havent thought this fully so take this with a bit of a grain of salt but given the proposal above and the fact that the
otel
implementation ingrpc-go
is a wrapper aroundStatsHandler
would it make sense to extendstats.Begin
with:numRetries
RPC
is ahedge
or notThat way consumers who have an implementation of the stats handler can implement similar metrics as well.
purnesh42H commentedon May 12, 2025
@davinci26 The OpenTelemetry (OTel) plugin provides trace attributes for RPC attempts that help in tracking retries. As you can see in https://github.com/grpc/grpc-go/blob/master/stats/opentelemetry/trace.go#L50-L51 (in the populateSpan function for *stats.Begin), these include:
And yes, for retry-specific "metrics", there's a proposal in progress.
stats.Handler
is experimental and for common observability requirements, it's not recommended as the primary solution for most users. The preferred path is to identify what's missing in the official plugins (like the OTel or OpenCensus ones) and work to add those features if they benefit a broad range of users. For users who need very specific, custom metrics that aren't covered by standard tooling, the stats.Handler interface is precisely the mechanism gRPC-Go provides for such advanced customization. Information like numRetries and the type of retry (e.g., standard vs. hedged) are common requirements. The goal is to make such essential data easily accessible, ideally out-of-the-box via the OTel plugin.4 remaining items
purnesh42H commentedon May 16, 2025
@davinci26 you are right about the problem with attemptInfo. Each attempt of rpc gets a new attempt info so its wrong to maintain the counter of previous rpc attempts within that. This is a bug. I think the right way to do this is to funnel the
clientStream.numRetries
all the way to HandleRPC and record the value in "previous-rpc-attempts".clientStream.numRetries
exclusively counts non-transparent retries.Once we do that, we don't have to make it separately available in stats.Begin. Although i guess its fine even if we add
numRetries
tostats.Begin
as an exported field? @dfawley do you see any issue with exposing numRetries in stats.Begin?dfawley commentedon May 16, 2025
Retry attempts can be counted by storing data in the context in a per-call interceptor, and incrementing it in
TagRPC
.purnesh42H commentedon May 16, 2025
What is the issue with funneling cs.numRetries? We need to also make sure to exclude transparent retries https://github.com/grpc/proposal/blob/master/A72-open-telemetry-tracing.md#tracing-information. clientStream.numRetries already have what we want.
dfawley commentedon May 16, 2025
Sorry, I missed the bit about excluding transparent retries. You can wait to increment when
stats.Begin
happens, then, and check that struct'sIsTransparentRetryAttempt
field.Generally, adding things requires proof that it is either necessary or sufficiently convenient. So the question should be the other way around: what is the issue with doing it the way that is already possible?
Also...we don't support hedging yet, but when/if we do, those likely wouldn't be included in the
clientStream.numRetries
, and it would be wrong to not count hedging attempts in our tracing instrumentation.davinci26 commentedon May 16, 2025
Let me code up the suggestion above so take this statement with 75% confidence but where my head is at is that this fairly complicated/non-ergonomic. I appreciate the aversion towards extending the
api
surface butRetry attempts can be counted by storing data in the context in a per-call interceptor, and incrementing it in TagRPC. and on top You can wait to increment when stats.Begin happens, then, and check that struct's IsTransparentRetryAttempt field.
seems fairly easy to get it wrong for something that is pretty useful o11y for service owners.Am I retrying too much
Edit: I think where I am coming from is that I already got this wrong once and I couldnt think of a way to implement the retries hence my question and then we realized that even the
grpc-go
implementation didn't get this 100% right so maybe we can do better and make it easier for folks to implement this.dfawley commentedon May 16, 2025
The
stats
package's API is mainly intended for internal usage or for power users. Optimal usability is not really a goal (it already has a ton of rough edges).Our goal is that 99% of users should be able to use the OpenCensus or OpenTelemetry plugins, instead, and that those should be easy to configure and use.
davinci26 commentedon May 18, 2025
Make sense @dfawley, thanks for the feedback and the pointer!
I still have my reservations:
context
is going to create animplicit
api forgrpc-go
that is much harder to test and modify on theOSS
side. But I think this tradeoff is yours to decide, I am happy and I can support either outcome.OpenCensus
/OpenTelemetry
plugins + power users.I think the main advantage on my side is that this method is available right now with the current release, so I ended up implementing it. It works well so far.
(issue is closed on my side, but I think it is used to track the bug on the
OTEL
plugin so leaving it open)vinothkumarr227 commentedon May 19, 2025
@purnesh42H I'm looking into this. Please assign the issue to me.