Consider trace context support for probability sampling #463

Open
jmacd opened this issue Aug 17, 2021 · 2 comments

@jmacd

jmacd commented Aug 17, 2021

The OpenTelemetry project has been working to specify how it propagates information about probability sampling through a couple of OTEP drafts:

OTEP 168: Specify how to propagate consistent head sampling probability
OTEP 170: Probability sampling: Sampler Name and Adjusted Count attributes

The first of these discusses how to propagate head probability so that each Span recorded in a Trace Context that has the sampled flag set knows its "adjusted count", which is the inverse of its sampling probability. We have proposed using power-of-two sampling rates, following research by Otmar Ertl, and have come to see a dedicated tracestate field as potentially too costly to enable by default.

Using tracestate means passing around 30 bytes per context. Given this overhead, we would like to see a Version-1 W3C traceparent that adds a couple of bytes of information. Ideally we can do this with 6 or 7 bits of information, but it will require specifying much more about traceparent, including which bits of the TraceID are truly random.

This issue is a placeholder for raising this discussion in the W3C group.

@jmacd
Author

jmacd commented Aug 17, 2021

The existing parts of a traceparent are base16 encoded, and the version 0 traceparent reads like

traceparent: 00-TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT-SSSSSSSSSSSSSSSS-FF

where T, S, and F are half bytes (single base16 digits) of the TraceID, SpanID, and Flags parts.
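
For reference, the version 0 layout above can be checked with a few lines of Python. This is only a minimal sketch: the function name `parse_traceparent_v00` and the returned dictionary keys are made up for illustration, and the only rules applied are the field widths shown above, the W3C rule that all-zero IDs are invalid, and the sampled flag being the lowest flag bit.

```python
import re

# Version 00 layout: 2 hex digits of version, 32 of TraceID (T),
# 16 of SpanID (S), and 2 of flags (F), separated by dashes.
_V00 = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent_v00(value):
    """Return a dict of the parsed fields, or None if the value is invalid.

    Hypothetical helper for illustration; not part of any library.
    """
    match = _V00.fullmatch(value.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero TraceID or SpanID values are invalid in W3C Trace Context.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        # The sampled flag is the least significant bit of the flags byte.
        "sampled": bool(int(flags, 16) & 0x01),
    }
```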

The variation discussed by OpenTelemetry would be:

traceparent: 01-RRRRRRRRRRRRRRRRTTTTTTTTTTTTTTTT-SSSSSSSSSSSSSSSS-FF-PP

where R represents a half byte of the truly random portion of the TraceID (the most significant half), and P represents new information related to the head probability. The 64 defined values for PP are interpreted as the negative base-2 logarithm of the sampling probability:

0: an adjusted count of 1 (i.e., probability == 2^0)
1: an adjusted count of 2 (i.e., probability == 2^-1)
2: an adjusted count of 4 (i.e., probability == 2^-2)
...
62: an adjusted count of 2^62
63: an adjusted count of zero
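
To make the table concrete, here is a small sketch of how a receiver might decode the PP field. The function name `decode_pp` is hypothetical, and rejecting values above 63 is an assumption of this sketch; the issue only defines the 64 values listed above.

```python
from fractions import Fraction


def decode_pp(pp_hex):
    """Map a two-digit base16 PP field to (probability, adjusted_count).

    Hypothetical helper: values 0..62 mean probability 2^-p, and 63 means
    an adjusted count of zero, as in the list above.
    """
    if len(pp_hex) != 2:
        raise ValueError("PP must be exactly two base16 digits")
    p = int(pp_hex, 16)
    if not 0 <= p <= 63:
        # Assumption: anything outside the 64 defined values is rejected.
        raise ValueError("PP value out of the defined range 0..63")
    if p == 63:
        return Fraction(0), 0
    adjusted_count = 2 ** p
    return Fraction(1, adjusted_count), adjusted_count
```

Exact fractions are used instead of floats so that small probabilities such as 2^-62 stay exact.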

To distinguish an unknown head sampling probability, we would propose a new trace context flag that indicates two things: (a) the most significant 64 bits of the TraceID are truly random, and (b) the head probability is known.

This proposal follows from research by Otmar Ertl, see https://arxiv.org/pdf/2107.07703.pdf.

@dyladan
Member

dyladan commented Aug 17, 2021

Summary of discussion from working group meeting:

  • In order to satisfy the paper linked above, the actual random number itself does not need to be propagated. It is sufficient to propagate only the calculated randomness and the head probability (6 bits each, encodable as RRPP, where PP is two base16 characters of probability value and RR is two base16 characters of random value)
  • Randomness (required by this) and uniqueness (required by trace id) can be surprisingly complicated (see https://datatracker.ietf.org/doc/html/rfc4086 and https://datatracker.ietf.org/doc/html/rfc4122.html). We need to be very specific about our randomness/uniqueness requirements and how they may be met.
  • Enforcing randomness, or indeed any hard format restrictions, on trace ID was something that was opposed strongly in the past and is not likely to be accepted.
    • Alternative: add calculated randomness and head sampling probability RRPP to the end of the trace context header as a new field
    • Alternative: use a trace flag to denote that only some part of the trace id is format-restricted if and only if the flag is set.
      • Tracing systems that don't wish to use the restricted format simply propagate the header without special handling
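
The RRPP idea from the first bullet can be sketched as follows. This is an illustration under stated assumptions, not the working group's method: it assumes the "calculated randomness" is the number of leading zero bits in a uniformly random 64-bit value, capped at 62, and that a span is kept when the probability exponent does not exceed that count. The names `r_value`, `should_sample`, and `encode_rrpp` are hypothetical.

```python
import secrets


def r_value(random_bits=None):
    """Count leading zero bits of a 64-bit value, capped at 62.

    Assumption: this is one way to derive a 6-bit "calculated randomness"
    for which P(r >= p) equals 2^-p for p in 0..62.
    """
    if random_bits is None:
        random_bits = secrets.randbits(64)
    leading_zeros = 64 - random_bits.bit_length()
    return min(leading_zeros, 62)


def should_sample(r, p):
    """Decide consistently: keep the span when p <= r.

    p is the negative base-2 logarithm of the probability; 63 means
    probability zero, so nothing is sampled.
    """
    if p == 63:
        return False
    return p <= r


def encode_rrpp(r, p):
    """Encode both 6-bit values as four base16 characters (RR then PP)."""
    return f"{r:02x}{p:02x}"
```

Because every participant compares its own p against the same propagated r, a span sampled at a low probability is also sampled by every participant using a higher probability, which is the consistency property the proposal relies on.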
