
Tail sampling #1768

Open

jwilm opened this issue Dec 3, 2021 · 2 comments

jwilm commented Dec 3, 2021

Feature Request

Better support tail sampling based on computed properties such as HTTP response codes, or other properties known only late in the request lifecycle (as opposed to up-front criteria like a level filter).

Motivation

High-volume systems might want to drop uninteresting collections of spans to reduce load on downstream systems. This is already possible to some extent: filtering can do probabilistic sampling up front, and a level filter can reduce the number of spans per trace. But for decisions based on something like an HTTP response code, such as dropping nominal 2xx responses, there doesn't seem to be much support at all.

My particular use case is actually a Kafka consumer that is uninterested in a large fraction of the messages on its topic.

Proposal

The way I've been thinking about this problem is that a Subscriber/Collector implementation could buffer spans, events, etc. before handing off to a new Export/Report type that actually reports the trace. In some ways, this is not too dissimilar from what the tracing-opentelemetry::Layer impl is already doing: data is buffered in the Layer until the span is closed, at which point the opentelemetry builder is finalized, the span is started and dropped, and it is handed off to the opentelemetry library to export.
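The buffering idea above can be sketched without any `tracing` internals. The following is a minimal, hypothetical stand-in (none of these types exist in `tracing` or `tracing-opentelemetry`): spans accumulate per trace until the root span closes, and only then is the whole batch handed downstream.

```rust
use std::collections::HashMap;

/// Hypothetical buffering collector: span data accumulates per trace
/// until the root span closes, then the whole trace is handed off in
/// one call so a downstream reporter can decide its fate.
#[derive(Default)]
struct BufferingCollector {
    // trace id -> spans recorded so far (spans are just names here)
    buffers: HashMap<u64, Vec<String>>,
}

impl BufferingCollector {
    /// Record a finished (non-root) span against its trace.
    fn on_span_close(&mut self, trace_id: u64, span: String) {
        self.buffers.entry(trace_id).or_default().push(span);
    }

    /// Called when the root span of `trace_id` closes; drains and
    /// returns the buffered trace for export (or for dropping).
    fn on_root_close(&mut self, trace_id: u64) -> Vec<String> {
        self.buffers.remove(&trace_id).unwrap_or_default()
    }
}

fn main() {
    let mut collector = BufferingCollector::default();
    collector.on_span_close(1, "db query".into());
    collector.on_span_close(1, "render".into());

    // The root closing flushes the buffered trace as a unit.
    let trace = collector.on_root_close(1);
    assert_eq!(trace.len(), 2);
    // The buffer for that trace is gone afterwards.
    assert!(collector.on_root_close(1).is_empty());
}
```

In a real implementation the trace id and span data would come from the `tracing` span registry rather than being passed in by hand; the point is only that export is deferred until the full trace is known.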

This could be achieved without any additional traits by implementing a Layer as described and chaining other reporting layers onto it. However, that approach is merely convention rather than a hard abstraction. Based on my current understanding of the code, the best approach seems to be to separate trace collection from trace exporting more formally.

The benefit of an extra Report trait that operates on groups of spans is that it makes it easy to insert arbitrary tail-sampling logic between, e.g., the collection step and the opentelemetry reporter.
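To make the proposed seam concrete, here is a minimal sketch of what such a `Report` trait and an interposed tail sampler might look like. Everything here is hypothetical: `Report`, `SpanData`, `TailSampler`, and `CountingExporter` are illustration-only names, not existing `tracing` APIs.

```rust
/// Minimal stand-in for one finished span's data.
#[derive(Clone)]
#[allow(dead_code)]
struct SpanData {
    name: &'static str,
    http_status: Option<u16>,
}

/// Hypothetical `Report` trait: receives a whole trace at once,
/// after the root span has closed.
trait Report {
    fn report(&mut self, trace: Vec<SpanData>);
}

/// A tail sampler sits between collection and export: it inspects the
/// completed trace and forwards it only if it is "interesting".
struct TailSampler<R: Report> {
    inner: R,
}

impl<R: Report> Report for TailSampler<R> {
    fn report(&mut self, trace: Vec<SpanData>) {
        // Drop traces in which every span reported a nominal 2xx status.
        let all_ok = trace
            .iter()
            .all(|s| matches!(s.http_status, Some(code) if (200..300).contains(&code)));
        if !all_ok {
            self.inner.report(trace);
        }
    }
}

/// Toy exporter that just counts exported traces.
#[derive(Default)]
struct CountingExporter {
    exported: usize,
}

impl Report for CountingExporter {
    fn report(&mut self, _trace: Vec<SpanData>) {
        self.exported += 1;
    }
}

fn main() {
    let mut sampler = TailSampler { inner: CountingExporter::default() };

    // A trace that succeeded end to end: dropped by the tail sampler.
    sampler.report(vec![SpanData { name: "GET /health", http_status: Some(200) }]);

    // A trace containing a 500: forwarded to the exporter.
    sampler.report(vec![
        SpanData { name: "GET /orders", http_status: Some(500) },
        SpanData { name: "db query", http_status: None },
    ]);

    assert_eq!(sampler.inner.exported, 1);
}
```

Because `TailSampler` wraps any `Report`, sampling policies compose: the opentelemetry exporter would simply be one more `Report` implementation at the end of the chain.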

Alternatives

  • Don't support this in the tracing library directly. As mentioned, it's possible to implement today, although not in a user-friendly way. It took quite a bit of reading the tracing library code to build enough understanding to make a proposal (and I'm still not particularly confident I haven't missed something obvious).
  • Support this without adding additional traits. A "buffering Collect" implementation could be provided, to which users could then chain other layers. This doesn't compose particularly well with the provided tracing-opentelemetry::Layer, however.

Closing notes

I actually came here to ask whether this was already possible and I was just missing something, but that wasn't one of the options in the issue templates 🙃. I had done enough code reading and thinking about the problem to make a proposal, however, so here it is. If this is something the project is interested in, I believe I could provide an implementation, given a bit of guidance.

Thanks!


jwilm commented Dec 9, 2021

I ended up putting together a proof of concept using a different approach than the one described above. The commit includes the relevant changes to the OpenTelemetryLayer implementation to support tail sampling using a couple of new abstractions.

The first was a new layer which attaches shared trace data to every span in a hierarchy. This enables sharing data across spans, or aggregating spans for export by a collector like opentelemetry.

The second was basically a flag that can be attached to that shared trace data to indicate to the collection layer whether the trace should be dropped or exported. Naturally, this all-or-nothing approach is limiting, but I think it's a reasonable demonstration of something that works and could be extended without adding much overhead.
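The two pieces above can be sketched in a few lines of plain Rust. This is an illustration of the idea, not the actual proof-of-concept code: `TraceData` and its methods are hypothetical names, and the real implementation would hang this state off the `tracing` span extensions rather than clone handles by hand.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical shared trace data attached to every span in a
/// hierarchy. Cloning shares the same underlying flag.
#[derive(Clone, Default)]
struct TraceData {
    sampled: Arc<AtomicBool>,
}

impl TraceData {
    /// Any span in the trace can request export, e.g. on observing
    /// an error or a non-2xx response deep in the request lifecycle.
    fn mark_sampled(&self) {
        self.sampled.store(true, Ordering::Relaxed);
    }

    /// Consulted by the collection layer when the root span closes.
    fn should_export(&self) -> bool {
        self.sampled.load(Ordering::Relaxed)
    }
}

fn main() {
    // The root span creates the trace data; child spans clone the handle.
    let root = TraceData::default();
    let child = root.clone();

    // Nothing interesting has happened yet: the trace would be dropped.
    assert!(!root.should_export());

    // A child span observes a failure and flags the whole trace.
    child.mark_sampled();

    // When the root closes, the collection layer sees the flag and exports.
    assert!(root.should_export());
}
```

Extending beyond all-or-nothing could mean replacing the boolean with a richer decision value (e.g. a sampling priority), without changing where the state lives.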

If you end up jumping over to that repository to read the code, all the interesting stuff is in the copy of the OpenTelemetryLayer borrowed from here (src/opentelemetry/layer.rs), and the src/lib.rs file which contains the TraceContextLayer which manages the shared state.


jwilm commented Jan 22, 2022

Hi all,

I was considering opening a fresh issue with a proposal closer to our current tail sampling implementation on our tracing-opentelemetry fork. Before doing so, I wanted to ask: is there any interest in adding support for tail sampling in mainline tracing? I believe the tail sampling behavior could be completely opt-in and zero-overhead for those who don't need it. We certainly need it, and I'd love to contribute here if there's interest.
