Add the HTTP header format proposal for TraceContext propagation. #1

Merged · 2 commits · Sep 6, 2017
+89 −0
@@ -0,0 +1,3 @@
# IntelliJ IDEA
.idea
*.iml
@@ -0,0 +1,86 @@
# Trace Context HTTP Header Format

@bhs (Contributor), Apr 9, 2017:

Which systems are committed to supporting this? Which systems would you like to support it? It might be helpful to @-mention people from the latter group so we can have any debates before this is merged.
@adriancole (Collaborator), Apr 10, 2017:

Putting in my 2p even though the question was for @bogdandrutu :)

TL;DR: I'd expect tracing systems that maintain all of their tracers to make a more top-down decision, but that's not currently the case in zipkin, as this trace context is compatible with zipkin's wire format.

Currently, some zipkin-compatible tracers support vendor-specific or not widely used formats. Those tracers will have an easier time with this, since zipkin's trace context isn't inherently incompatible with this specification (at the moment). I'd expect the cross-section of google+zipkin users to be the first to ask, as google will likely land some variant of this first (grpc, cloud services and stackdriver instrumentation).

As with similar changes, when that demand occurs it usually starts in a repo or two. For example, our first requests for StackDriver and X-Ray trace support came from the sleuth issues list. Tracers run independently and can move to support something sooner or later. I often ping people across tracers on things like this so that they can weigh in before organic demand hits.

Regardless, whether or not to support this lies within the scope of each tracer to decide, until there are server implications, such as the trace context being incompatible or too wide to store in zipkin.

@bogdandrutu (Collaborator), Apr 10, 2017:

With the current format (which is compatible with Zipkin and Google), we are trying to make at least these systems work with this format. Having a common place for the specs (maybe with some simple implementations in multiple languages) is one of the goals.

Anyone who is interested in using this format is welcome to join the effort and send patches/PRs etc.

A trace context header is used to pass trace context information across systems
for an HTTP request. Our goal is to share this with the community so that various
tracing and diagnostics products can operate together, and so that services can
pass context through them even if they're not being traced (useful for load
balancers, etc.).
# Format
## Header name
`Trace-Context`

@costinm, May 23, 2017:

'Via' is a standard header with pretty similar semantics. It is also in HPACK.

@SergeyKanzhelev (Contributor), Jul 11, 2017:

Trace-Context can be mistaken for something carrying baggage or context properties; this is an id. We decided to use Request-Id for the correlation protocol in .NET. If we can converge on a format, it may be a good one to use.

Request-Id is quite descriptive for the header's purpose, I think.

## Field value
`base16(<version>)-<version_format>`

@BRMatt, Jul 25, 2017:

Is this meant to be `base16(<version>)-base16(<version_format>)`, or `base16(<version>)-<version_format>`?

The value is US-ASCII encoded (which makes it valid UTF-8). The character `-` is
used as a delimiter between fields.
### Version
A 1-byte field representing an 8-bit unsigned integer. Version 255 is reserved.

@adriancole (Collaborator), Jun 28, 2017:

The rationale here is that we expect the key to remain the same even if the value changes. This way, we can change the format in the worst case where such a thing is needed, and that can be done without a fan-out of administrative activity, such as new filter patterns to accommodate a new trace header key. We should update this to make it very clear that version changes are not expected and are highly discouraged, especially breaking ones.

@SergeyKanzhelev (Contributor), Jul 11, 2017:

@adriancole I really like explicit versioning. I think it will help a lot long term.

@bogdandrutu was the idea that the version is an incremental thing, or the format of the header? Let's say 00 is what's explained here, 01 is a binary format, 02 is an alternative length of request ID or another implementation. We've been thinking of suggesting using it to indicate an alternative way of generating identifiers.

@adriancole (Collaborator), Jul 11, 2017:

@SergeyKanzhelev cool. I am in favor of versioning at this point, notably to help folks know when not to process a header (e.g. a check-the-magic type of thing, likely used similarly in X-Ray's format).

With regard to the version also being a format flag, I'm not sure it applies here. For example, in grpc, binary headers are actually encoded as base64, and their header names have -bin appended to them. So in that case, they don't need a different format bit inside their encoded trace data, as it is already implicit in the scheme.

@SergeyKanzhelev (Contributor), Jul 11, 2017:

What if a service is called from a device via http with the size concern, and from another service with the regular header format? Does this service need to read two headers? Which one wins?

When you have a single header and allow for a binary format, things get easier, I believe.

### Version = 0
#### Format
`base16(<trace-id>)-base16(<span-id>)-base16(<trace-options>)`
All fields are required. The character `-` is used as a delimiter between fields.
#### Trace-id
The ID of the whole trace forest. It is represented as a 16-byte array,
e.g., `4bf92f3577b34da6a3ce929d0e0e4736`. A value of all zero bytes is considered
invalid. Implementations may decide to completely ignore the trace-context if the
trace-id is invalid.
#### Span-id
The ID of the caller span (the parent). It is represented as an 8-byte array,
e.g., `00f067aa0ba902b7`. A value of all zero bytes is considered invalid.
Implementations may decide to completely ignore the trace-context if the span-id
is invalid.
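As an illustrative sketch of the validity rules above (the helper name is ours, not something defined by this proposal), an implementation might check the field length and the all-zero case like this:

```python
# Hypothetical helper, not part of the spec: checks that an id field is
# valid base16 of the expected size and is not all zero bytes.
def is_valid_id(hex_value: str, size_bytes: int) -> bool:
    if len(hex_value) != 2 * size_bytes:
        return False
    try:
        value = int(hex_value, 16)
    except ValueError:
        return False  # not valid base16
    return value != 0  # an all-zero id is invalid

print(is_valid_id("4bf92f3577b34da6a3ce929d0e0e4736", 16))  # True  (trace-id)
print(is_valid_id("00f067aa0ba902b7", 8))                   # True  (span-id)
print(is_valid_id("0" * 32, 16))                            # False (all zeros)
```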
#### Trace-options

@bhs (Contributor), Apr 9, 2017:

It's odd to me that we support a versioning bit yet have this 4-byte thing with only 1 bit specified... we could alternatively just have a single byte for version 0 and skip all of the endianness discussion.

@bogdandrutu (Collaborator), Apr 10, 2017:

I was expecting this (the 4-byte options) to fill up with things like what you proposed (sampling probability). Uber also suggested an extra bit for a deferred sampling decision.

For the moment I am trying to capture the minimum requirement.

@bogdandrutu (Collaborator), Apr 10, 2017:

I am trying to collect data about how many bytes we need. So far the list contains 3 (based on Jaeger requirements).

If the list has fewer than 8 we can definitely go with 1 byte for the options in v0.

Should we have the sampling probability that @bhs proposed as a separate field?

@bogdandrutu (Collaborator), Jun 20, 2017:

Done. 1 byte used for options.

Controls tracing options such as sampling, trace level, etc. It is 1 byte
representing an 8-bit unsigned integer. The least significant bit is a
recommendation on whether the request should be traced (1 recommends the
request be traced; 0 means the caller does not make a decision to trace,
and the decision might be deferred). The flags are recommendations given by the
caller rather than strict rules to follow, for 3 reasons:
1. Trust and abuse.
2. A bug in the caller.
3. Different load between the caller service and the callee service might force
the callee to downsample.
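For illustration, reading the sampled recommendation from the trace-options byte comes down to masking the least significant bit (the constant and function names below are ours, not the spec's):

```python
SAMPLED_BIT = 0x01  # least significant bit of trace-options

def trace_recommended(trace_options: int) -> bool:
    """True when the caller recommends tracing this request."""
    return (trace_options & SAMPLED_BIT) != 0

print(trace_recommended(0x01))  # True: caller recommends tracing
print(trace_recommended(0x00))  # False: caller defers the decision
```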

@SergeyKanzhelev (Contributor), Jun 27, 2017:

Assuming sampling logic may differ a lot between vendors, why place a sampling flag in the trace context? Why not pass it as a separate header that is subject to vendor-specific logic? In some cases, to resolve the issues you describe, extension libraries will need more than just a sampling flag: most probably some additional information from baggage or other headers.

You may also have multi-tier sampling. Take the example of the local agent mode that @bogdandrutu demoed for census. You may want to sample the data you send to the backend at a rate of 0.01 and locally at 0.1. So you'd need more than one bit to carry both sampling decisions.

Another consideration: the properties of Trace-Context belong to the new span you create, so a developer or extension library can access those fields by taking the "running" span details and use those identifiers for non-tracing needs. The sampling flag will not be one of the span's properties (will it?), so you have no access to the original value of that flag. And you may not need any options, as you may only care about the identifiers. So having options in Trace-Context looks inconsistent.

@yurishkuro, Jun 27, 2017:

@SergeyKanzhelev consistent sampling across the whole trace is a very important characteristic. If the sampling decision is not propagated and the trace spans multiple implementations t1, ..., tN, then each implementation will have to make a sampling decision over and over, and the probability that you will capture the whole trace becomes very small: p1 * ... * pN.

@SergeyKanzhelev (Contributor), Jun 27, 2017:

@yurishkuro if you are making the sampling decision based on the trace-id, you will have min(p1, p2, ..., pN) full traces, not the multiplication. If all the probabilities match, all the traces will be collected fully.

When you do sampling, you may need to estimate the count of spans exhibiting certain properties based on sampled data. In the case of statistical sampling, you can just multiply the raw count of spans by the sampling percentage to get a statistically accurate number.

If the sampling decision is forced from above without information on the originator's sampling percentage, you cannot do this type of estimation.

@SergeyKanzhelev (Contributor), Jun 27, 2017:

Statistical sampling at every layer is just one example of the different types of sampling you may want to implement, and you will need more than a bit of information.

@yurishkuro, Jun 27, 2017:

> if you are making a sampling decision based on traceid...
> When you do sampling you may need to estimate the count of spans

Having the sampling bit does not prevent you from passing extra data or making more elaborate decisions, but that's not part of the proposed standard. NOT having the sampling bit pretty much guarantees that the trace will be broken, unless every implementation makes the decision based on the exact same formula, like traceId < p * 2^128, which is again not a part of the proposed standard.

The sampling bit is a recommendation. If a service can respect it and handle the volume, great. If it cannot respect it 100%, maybe it can respect it for "more important" spans like RPCs, and shed load by dropping in-process spans & metrics. Essentially it does not put any restrictions on how you want to implement sampling, but it still provides a way to achieve consistent sampling across the stack, if the volume allows it.

@SergeyKanzhelev (Contributor), Jun 27, 2017:

That is precisely my question. Do you think it will be typical for libraries to trust this flag? Or will every vendor and library implement their own logic, so this flag is never used? Should this standard define the flag, or let customers decide on a sampling algorithm across the services in their org as a decision separate from the data correlation protocol?

The beauty of a sampling flag is the ability to implement solutions like forced data collection for a period of time or for a specific span name. It also removes the need to synchronize the sampling decision algorithm.

On the negative side, services lose control of the data volume and the statistical accuracy of the collected data.

The protocol is optimized for a single team owning many components with relatively similar load. That is not always the case. Every component may be owned by a team that wants to play nice and contribute to the overall correlation story, but has a higher priority to fully control the volume and distribution of telemetry collected from its component.

@yurishkuro, Jun 27, 2017:

> On the negative side, services lose control of the data volume and statistical accuracy of collected data.

I don't agree with this assessment: the flag is a recommendation, and an implementation does not have to respect it if it thinks it can do a better job.

@yurishkuro, Jun 27, 2017:

> Do you think it will be typical for libraries to trust this flag?

Yes, I think it's a very common implementation to simply trust the flag, in the absence of other knowledge about the system.

@SergeyKanzhelev (Contributor), Jun 27, 2017:

1. I'm not against a flag that forces all layers to record a specific trace. It is very useful for troubleshooting scenarios.
2. I also agree that collecting the entire trace as often as possible is very useful.
3. It is unavoidable that downstream services will collect more telemetry than was sampled in by the front door. So there will be incomplete traces.

What I want to avoid is a situation where you rely heavily on an implementation detail of an upstream component's sampling algorithm. I also want to make sure there is a mechanism, easy to replicate in any language, to control the flood of telemetry in the case of non-matching load patterns. Third, for many Application Insights scenarios we need to keep the sampling percentage that was used to sample telemetry, so we can run statistical algorithms on the telemetry and recognize patterns. So a single bit will not generally work for us as a universal sampling mechanism.

I'd propose having a debug flag with the semantics of occasional forceful tracing across the system. The sampling flag can then be vendor specific.

The behavior of other bits is currently undefined.
#### Examples of HTTP headers
*Valid sampled Trace-Context:*
```
Value = 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
base16(<Version>) = 00
base16(<TraceId>) = 4bf92f3577b34da6a3ce929d0e0e4736
base16(<SpanId>) = 00f067aa0ba902b7
base16(<TraceOptions>) = 01 // sampled
```
*Valid not-sampled Trace-Context:*
```
Value = 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00
base16(<Version>) = 00
base16(<TraceId>) = 4bf92f3577b34da6a3ce929d0e0e4736
base16(<SpanId>) = 00f067aa0ba902b7
base16(<TraceOptions>) = 00 // not-sampled
```
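Putting the pieces together, a rough version-0 parser for values like the examples above might look as follows (the function name and error handling are illustrative, not part of the proposal):

```python
def parse_trace_context(value: str):
    """Split a version-0 Trace-Context value into (trace_id, span_id, sampled)."""
    version_hex, rest = value.split("-", 1)
    if int(version_hex, 16) != 0:
        raise ValueError("unsupported version")
    trace_id, span_id, options_hex = rest.split("-")
    # trace-id is 16 bytes (32 hex chars); span-id is 8 bytes (16 hex chars)
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("unexpected field length")
    sampled = int(options_hex, 16) & 0x01 != 0  # least significant bit
    return trace_id, span_id, sampled

value = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_trace_context(value))
# ('4bf92f3577b34da6a3ce929d0e0e4736', '00f067aa0ba902b7', True)
```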