
Tail sampling #1768

Open

jwilm opened this issue Dec 3, 2021 · 2 comments

jwilm commented Dec 3, 2021

Feature Request

Better support tail sampling based on computed properties such as HTTP response codes, or other properties known only late in the request lifecycle (as opposed to up-front criteria like a level filter).

Motivation

High-volume systems might want to drop uninteresting collections of spans to reduce load on downstream systems. This is already possible to some extent: filtering can do probabilistic sampling up front, and a level filter can reduce the number of spans per trace. But for decisions based on something like an HTTP response code, such as dropping nominal 2xx responses, there doesn't seem to be much support at all.

My particular use case is actually a Kafka consumer that is uninterested in a large fraction of the messages on its topic.

Proposal

The way I've been thinking about this problem is that a Subscriber/Collector implementation could buffer spans, events, etc. before handing off to a new Export/Report type that actually reports the trace. In some ways, this is not too dissimilar from what the tracing-opentelemetry::Layer impl is already doing: data is buffered in the Layer until the span is closed, at which point the opentelemetry builder is finalized, the span is started and dropped, and it is handed off to the opentelemetry library to export.
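The buffering idea above can be sketched without any `tracing` internals. The following is a minimal, hypothetical stand-in (none of these types exist in `tracing` or `tracing-opentelemetry`): spans accumulate per trace until the root span closes, and only then is the whole batch handed downstream.

```rust
use std::collections::HashMap;

/// Hypothetical buffering collector: span data accumulates per trace
/// until the root span closes, then the whole trace is handed off in
/// one call so a downstream reporter can decide its fate.
#[derive(Default)]
struct BufferingCollector {
    // trace id -> spans recorded so far (spans are just names here)
    buffers: HashMap<u64, Vec<String>>,
}

impl BufferingCollector {
    /// Record a finished (non-root) span against its trace.
    fn on_span_close(&mut self, trace_id: u64, span: String) {
        self.buffers.entry(trace_id).or_default().push(span);
    }

    /// Called when the root span of `trace_id` closes; drains and
    /// returns the buffered trace for export (or for dropping).
    fn on_root_close(&mut self, trace_id: u64) -> Vec<String> {
        self.buffers.remove(&trace_id).unwrap_or_default()
    }
}

fn main() {
    let mut collector = BufferingCollector::default();
    collector.on_span_close(1, "db query".into());
    collector.on_span_close(1, "render".into());

    // The root closing flushes the buffered trace as a unit.
    let trace = collector.on_root_close(1);
    assert_eq!(trace.len(), 2);
    // The buffer for that trace is gone afterwards.
    assert!(collector.on_root_close(1).is_empty());
}
```

In a real implementation the trace id and span data would come from the `tracing` span registry rather than being passed in by hand; the point is only that export is deferred until the full trace is known.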

This could be achieved without any additional traits by implementing a Layer as described and chaining other reporting layers onto it. However, that approach is merely convention rather than a hard abstraction. Based on my current understanding of the code, the best approach seems to be to separate trace collection from trace exporting more formally.

The benefit of an extra Report trait that operates on groups of spans is that it makes it easy to insert arbitrary tail-sampling logic between, e.g., the collection step and the opentelemetry reporter.
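To make the proposed seam concrete, here is a minimal sketch of what such a `Report` trait and an interposed tail sampler might look like. Everything here is hypothetical: `Report`, `SpanData`, `TailSampler`, and `CountingExporter` are illustration-only names, not existing `tracing` APIs.

```rust
/// Minimal stand-in for one finished span's data.
#[derive(Clone)]
#[allow(dead_code)]
struct SpanData {
    name: &'static str,
    http_status: Option<u16>,
}

/// Hypothetical `Report` trait: receives a whole trace at once,
/// after the root span has closed.
trait Report {
    fn report(&mut self, trace: Vec<SpanData>);
}

/// A tail sampler sits between collection and export: it inspects the
/// completed trace and forwards it only if it is "interesting".
struct TailSampler<R: Report> {
    inner: R,
}

impl<R: Report> Report for TailSampler<R> {
    fn report(&mut self, trace: Vec<SpanData>) {
        // Drop traces in which every span reported a nominal 2xx status.
        let all_ok = trace
            .iter()
            .all(|s| matches!(s.http_status, Some(code) if (200..300).contains(&code)));
        if !all_ok {
            self.inner.report(trace);
        }
    }
}

/// Toy exporter that just counts exported traces.
#[derive(Default)]
struct CountingExporter {
    exported: usize,
}

impl Report for CountingExporter {
    fn report(&mut self, _trace: Vec<SpanData>) {
        self.exported += 1;
    }
}

fn main() {
    let mut sampler = TailSampler { inner: CountingExporter::default() };

    // A trace that succeeded end to end: dropped by the tail sampler.
    sampler.report(vec![SpanData { name: "GET /health", http_status: Some(200) }]);

    // A trace containing a 500: forwarded to the exporter.
    sampler.report(vec![
        SpanData { name: "GET /orders", http_status: Some(500) },
        SpanData { name: "db query", http_status: None },
    ]);

    assert_eq!(sampler.inner.exported, 1);
}
```

Because `TailSampler` wraps any `Report`, sampling policies compose: the opentelemetry exporter would simply be one more `Report` implementation at the end of the chain.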

Alternatives

  • Don't support this in the tracing library directly. As mentioned, it's possible to implement today, although not in a user-friendly way. It took quite a bit of reading the tracing library code to build enough understanding to make a proposal (and I'm still not particularly confident I haven't missed something obvious).
  • Support this without adding additional traits. A "buffering Collect" implementation could be provided, to which users could then chain other layers. This doesn't compose particularly well with the provided tracing-opentelemetry::Layer, however.

Closing notes

I actually came here to ask whether this was already possible and I was just missing something, but that wasn't one of the options in the issue templates 🙃. I had done enough code reading and thinking about the problem to make a proposal, however, so here it is. If this is something the project is interested in, I believe I could provide an implementation, given a bit of guidance.

Thanks!


jwilm commented Dec 9, 2021

I ended up putting together a proof of concept using a different approach than the one described above. The commit includes the relevant changes to the OpenTelemetryLayer implementation to support tail sampling using a couple of new abstractions.

The first was a new layer which attaches shared trace data to every span in a hierarchy. This enables sharing data across spans, or aggregating spans for export by a collector like opentelemetry.

The second was basically a flag that can be attached to that shared trace data to indicate to the collection layer whether the trace should be dropped or exported. Naturally, this all-or-nothing approach is limiting, but I think it's a reasonable demonstration of something that works and could be extended without adding much overhead.
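The two pieces above can be sketched in a few lines of plain Rust. This is an illustration of the idea, not the actual proof-of-concept code: `TraceData` and its methods are hypothetical names, and the real implementation would hang this state off the `tracing` span extensions rather than clone handles by hand.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical shared trace data attached to every span in a
/// hierarchy. Cloning shares the same underlying flag.
#[derive(Clone, Default)]
struct TraceData {
    sampled: Arc<AtomicBool>,
}

impl TraceData {
    /// Any span in the trace can request export, e.g. on observing
    /// an error or a non-2xx response deep in the request lifecycle.
    fn mark_sampled(&self) {
        self.sampled.store(true, Ordering::Relaxed);
    }

    /// Consulted by the collection layer when the root span closes.
    fn should_export(&self) -> bool {
        self.sampled.load(Ordering::Relaxed)
    }
}

fn main() {
    // The root span creates the trace data; child spans clone the handle.
    let root = TraceData::default();
    let child = root.clone();

    // Nothing interesting has happened yet: the trace would be dropped.
    assert!(!root.should_export());

    // A child span observes a failure and flags the whole trace.
    child.mark_sampled();

    // When the root closes, the collection layer sees the flag and exports.
    assert!(root.should_export());
}
```

Extending beyond all-or-nothing could mean replacing the boolean with a richer decision value (e.g. a sampling priority), without changing where the state lives.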

If you end up jumping over to that repository to read the code, all the interesting stuff is in the copy of the OpenTelemetryLayer borrowed from here (src/opentelemetry/layer.rs), and the src/lib.rs file which contains the TraceContextLayer which manages the shared state.


jwilm commented Jan 22, 2022

Hi all,

I was considering opening a fresh issue with a proposal closer to our current tail sampling implementation on our tracing-opentelemetry fork. Before doing so, I wanted to ask: is there any interest in adding support for tail sampling in mainline tracing? I believe the tail sampling behavior could be completely opt-in and zero-overhead for those who don't need it. We certainly need it, and I'd love to contribute here if there's interest.
