Change traced flag semantics and add recorded flag #142

Merged
merged 3 commits into from
Aug 7, 2018

Contributor

@erabug erabug commented Aug 7, 2018

This PR proposes slightly changing the semantics of the traced (also called sampled) flag in trace-options, and adding a new flag.

  • traced? becomes requested? and describes whether someone explicitly requested this trace be captured
    • This value is either true or false
    • Once true, this value cannot be downgraded to false
  • recorded describes whether your parent recorded its tracing information out-of-band
    • This value is either no or maybe
    • This value can change from node to node; it only describes what happened in the last hop
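
A minimal sketch of how a node might apply these two rules when building the outgoing trace-options byte. The bit values come from the spec change in this PR; the class name `TraceFlags`, the helper `outgoingOptions`, and its parameters are hypothetical and not part of the spec.

```java
final class TraceFlags {
    // Bit values as given in the spec change in this PR.
    static final byte FLAG_REQUESTED = 1; // 00000001
    static final byte FLAG_RECORDED = 2;  // 00000010

    /**
     * Builds the trace-options byte to send downstream (hypothetical helper).
     * requested is sticky: once an upstream node set it, it stays set.
     * recorded is recomputed at every hop from this node's own behaviour.
     */
    static byte outgoingOptions(byte incoming, boolean requestedHere, boolean mayRecordHere) {
        boolean requested = (incoming & FLAG_REQUESTED) != 0 || requestedHere;
        int out = 0;
        if (requested) {
            out |= FLAG_REQUESTED;
        }
        if (mayRecordHere) {
            out |= FLAG_RECORDED;
        }
        return (byte) out;
    }
}
```

Under these assumptions, a node that receives `01` and drops its own data forwards `01`, while a node that receives `10` and drops its own data forwards `00`.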

Background

There are three use cases I've heard described:

  1. As an end-user, I would like to see an end-to-end trace for a particular request. I would like to mark a request for tracing and have that decision propagated and respected by all downstream callers.

(This is pretty hard to do, since you can't actually force anything downstream to do something. Although also called "force" or "debug" trace, this is more like a "pretty-please" request or hint.)

  2. As an end-user, I would like to see my request traced through multiple implementations/vendors. If a tracer sees that its caller was from a different implementation, it can record that information (trace ID, span ID, and implementor name) so I can go find more information from that implementor. If the upstream caller did not record any additional information (such as timing), it saves me a lookup.

(Because the hint cannot actually be trusted, i.e. you can't trust that ALL the upstream nodes actually recorded information, just understanding what your parent did is helpful.)

  3. As an end-user, I would like traces to include proxies and load balancers.

(However, at the time the request hits the proxy or load balancer, you may not know whether or not you should sample. You want to defer the decision.)

Proposal

These use cases can be solved by two different signals: (1) did someone explicitly request that I record this? (2) did my parent propagate its tracing information out of band? These signals can be sent separately as two trace-option flags.

| option | recorded? | requested? | recording probability | situation |
| --- | --- | --- | --- | --- |
| 00 | no | false | low | I definitely dropped the data and no one asked for it |
| 01 | no | true | medium | I definitely dropped the data but someone asked for it |
| 10 | maybe | false | medium | Maybe I recorded this but no one asked for it yet (maybe deferred) |
| 11 | maybe | true | high | Maybe I recorded this and someone asked for it |
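
As a rough illustration, the four combinations above can be decoded from the two low bits of trace-options. This is only a sketch: the class `TraceOptionsTable` and the method `describe` are hypothetical names, and the "probability" labels simply restate the table rather than anything a tracer computes.

```java
final class TraceOptionsTable {
    static final int FLAG_REQUESTED = 0b01; // 00000001
    static final int FLAG_RECORDED = 0b10;  // 00000010

    static boolean requested(int options) {
        return (options & FLAG_REQUESTED) != 0;
    }

    static boolean maybeRecorded(int options) {
        return (options & FLAG_RECORDED) != 0;
    }

    /** Restates one row of the table for the given options value (hypothetical helper). */
    static String describe(int options) {
        String recorded = maybeRecorded(options) ? "maybe" : "no";
        String probability;
        switch (options & 0b11) {
            case 0b00: probability = "low"; break;
            case 0b11: probability = "high"; break;
            default: probability = "medium";
        }
        return String.format("recorded=%s requested=%b probability=%s",
                recorded, requested(options), probability);
    }

    public static void main(String[] args) {
        // Print all four rows in table order: 00, 01, 10, 11.
        for (int options = 0; options < 4; options++) {
            System.out.println(describe(options));
        }
    }
}
```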

I believe these changes address the concerns of PR #122:

"sampling states"

  • Undecided - Tracing data collection decision is deferred. (2)
  • Traced - Last hop of request may be recorded. (1)
  • Not traced - Last hop of request is not recorded. (0)

The recorded? flag covers 0 and 1 and requested? covers 2.

"force/debug"

  • user requested to trace a request from e.g. Chrome Dev Tools
  • a synthetic test for which users want to collect end-to-end information

These are covered by requested?.

"deferred"

  • Injecting IDs in the first hop but not collecting a trace and not making a sampling decision
  • A load balancer waiting for a request to return and deciding on whether to keep a trace.

The two deferred cases are covered by 10.

At @SergeyKanzhelev's suggestion, I added the table to the spec. While it describes all existing flags, those are both specific to "sampling" and may not scale well if we add more unrelated flags in the future. (On a related note, @tedsuo suggested renaming the whole trace-options section to sampling-options until the point where we have unrelated flags.)

@@ -118,12 +118,15 @@ static final byte FLAG_TRACED = 1; // 00000001
boolean traced = (traceOptions & FLAG_TRACED) == FLAG_TRACED
```

#### Traced Flag (00000001)
#### Requested Flag (00000001)
When set, the least significant bit recommends the request should be traced. A caller who
defers a tracing decision leaves this flag unset.
Member

I would add something like this:

"An implementation might still decide on its own if this flag is respected or not. Consider a scenario when capturing a trace induces cost. An upstream caller could set this flag to maliciously induce cost in a tracing system. It's recommended that an implementation is careful about this and only respects the flag if the upstream caller is trusted."

When set, the least significant bit recommends the request should be traced. A caller who
defers a tracing decision leaves this flag unset.
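
The caveat raised in the review comment above (only respect the flag when the upstream caller is trusted) could look roughly like this. It is a sketch under assumptions: `RequestedFlagPolicy`, `shouldRecord`, and the inputs `callerTrusted` and `localSamplerSaysYes` are hypothetical, and how trust is established is left entirely to the implementation.

```java
final class RequestedFlagPolicy {
    static final int FLAG_REQUESTED = 0b01; // 00000001

    /**
     * Decides whether this hop records the trace (hypothetical helper).
     * The requested flag is treated as a hint and honoured only for
     * trusted callers; otherwise the local sampler decides alone.
     */
    static boolean shouldRecord(int traceOptions, boolean callerTrusted, boolean localSamplerSaysYes) {
        boolean requested = (traceOptions & FLAG_REQUESTED) != 0;
        if (requested && callerTrusted) {
            return true;
        }
        // Untrusted or absent request: fall back to local sampling so an
        // upstream caller cannot force recording cost onto this system.
        return localSamplerSaysYes;
    }
}
```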

#### Recorded Flag (00000010)
When set, the second least significant bit documents that the caller may have recorded trace data. A caller who does not record trace data out-of-band leaves this flag unset.
Member

Should we be clearer about the "Definitively not recorded" and "Maybe recorded" states here? Otherwise the discussion will flare up again when an implementation cannot definitively declare whether it recorded a trace ("undecided").

@erabug erabug force-pushed the change-traced-flag-semantics branch from 1f6bd82 to d9501f5 Compare August 7, 2018 16:57
@SergeyKanzhelev SergeyKanzhelev merged commit 9a582a2 into w3c:master Aug 7, 2018
felixbarny added a commit to felixbarny/apm-agent-java that referenced this pull request Aug 8, 2018
axw added a commit to axw/apm-agent-go that referenced this pull request Aug 9, 2018
Update TraceContext options to store two bits: requested and recorded.

See: w3c/trace-context#142
felixbarny added a commit to elastic/apm-agent-java that referenced this pull request Aug 9, 2018
@codefromthecrypt

FWIW I don't like this for many reasons.

Firstly, it is translated in a non-user-friendly way. For example, say we had two options like sampled and debug, which exist today. The way this is translated forces people to think in bit flags. The documentation doesn't show how it will look if you have both flags set, only requested/not requested. Adding "recorded" will show how odd this will look to people.
https://github.com/w3c/distributed-tracing/blob/master/trace_context/HTTP_HEADER_FORMAT.md#flag-behavior

Secondly, the mapping isn't semantically correct, at least not in a way that matches existing stuff. For example, it adds a trait, "recorded", which is hard to propagate accurately, while also muddying the "sampled" state in a way that conflates it with "debug" or "force trace". "requested" is muddier than a yes/no/abstain on whether something sampled it. It doesn't naturally carry any more information about user intent. "force trace", on the other hand, is very much an edge case and more important. The whole thing seems well-intentioned, but it misses experience that is already in the B3 and Amazon trace specifications and is muddied by this design. It also misses the real-world situation where what is recorded is quite flexible with regard to sampling. The abstraction of traceparent, if done well, allows a simple compatible route for existing sites that have used similar mechanics for several years in various open source tools... more on this below.

Finally, I'm surprised folks like @felixbarny don't speak up and "just implement it". This is a bad sign that people just accept whatever, even though it doesn't match in a reliable way. Lacking outreach to those directly impacted, lacking anyone pushing back at all, and rushing to merge when meetings happen doesn't lead to a solid standard, just "another standard". Next time, please reach out directly, especially knowing that B3 and Amazon both do not follow this and both represent a significant stake.

@justindsmith

I had a clarifying question around this as well. (This might be my ignorance of the previous specs - b3, etc - or the history of this spec.)

Does the requested flag represent:

  1. A human user's request to explicitly have a specific trace be captured
  2. A sampler's decision to have a specific trace be captured
  3. Both 1 and 2

> (1) did someone explicitly request that I record this?

I guess I'm looking for a definition of who "someone" is.

Thanks!

@codefromthecrypt

codefromthecrypt commented Aug 23, 2018 via email

@SergeyKanzhelev
Member

requested was described the same way as sampled would be. Basically it is a decision of a system upstream on whether to keep this trace or not.

recorded is about your parent's decision. The main idea is to give a vendor that wants to decide on sampling later a way to tell downstream about it without always changing the requested flag.

There is no explicit debug flag in the spec. requested can be used for this purpose.

@SergeyKanzhelev
Member

This needs clarification in the doc; I'll send a PR tomorrow morning.

@justindsmith

Thanks! That helps!
