Support OpenTracing span context #931

rcillo · 2018-08-16T09:59:09Z

https://techjira.zalando.net/browse/ARUHA-1883

In order to trace distributed systems, open tracing span context can be
transported in the metadata of events. This way, publishers and consumer
can share a span even when asynchronously communicating through Nakadi.

In order to trace distributed systems, open tracing span context can be transported in the metadata of events. This way, publishers and consumer can share a span even when asynchronously communicating through Nakadi.

codecov-io · 2018-08-16T09:59:25Z

Codecov Report

Merging #931 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@             Coverage Diff              @@
##             master     #931      +/-   ##
============================================
+ Coverage     54.15%   54.16%   +<.01%     
  Complexity     1774     1774              
============================================
  Files           321      321              
  Lines          9687     9689       +2     
  Branches        885      885              
============================================
+ Hits           5246     5248       +2     
  Misses         4123     4123              
  Partials        318      318

Impacted Files	Coverage Δ	Complexity Δ
...alando/nakadi/validation/JsonSchemaEnrichment.java	`94.16% <100%> (+0.09%)`	`42 <0> (ø)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b6f40df...a742f62. Read the comment docs.

adyach · 2018-08-16T10:02:53Z

👍

v-stepanov · 2018-08-16T10:25:31Z

docs/_data/nakadi-event-bus-api.yaml

+          uses OpenTracing for tracing transcations accross distributed services, this field is be populated with an
+          object that allows consumers to resume the context of a span. The details of the payload are vendor
+          specific and consumers are expected to contact the producer in order to obtain further details.
+        type: object


doesn't it have any defined structure?
(I ask this because we always criticize our users for not defining the schema properly :) )

It is not possible or desired to define a schema as that would enforce vendor lock-in.

Unfortunately, OpenTracing deliberately leaves the span context wire format undefined. This allows each vendor to serialize/deserialize as they see fit. We want to keep the ability to switch vendors and should not depend on any particular format.
Additionally, part of the span context are the baggage items, which are arbitrary key values.
I remember this discussion from the initial narrative and a good idea could be to apply some sanity limits (payload size) for this particular element.

I understand the reasons why it can't be defined.
I'm just afraid that it goes against what we fight for in Nakadi: all data to be defined. With this field it will never be possible to completely define the structure of the event.

@v-stepanov I agree with you. Another point is that this will eventually find its way to the data lake, where incompatible changes in types might lead to breaking their processes. Maybe we should involve them here, so that they know about this new piece of information.

I would say that having arbitrary maps<string, string> is better than having arbitrary objects.

@lmineiro do you think it's possible to assume that the extensible parts fit in a map<string, string> or should we keep it as an opaque object, that might have arbitrary nested structures?

Yes. We deliberately picked the TextMap format to, at least, provide that guarantee.

If it is a Map<String, String>, it can be defined as such in JSON schema/OpenAPI schema:

type: object additionalProperties: type: string

lmineiro · 2018-08-16T11:38:04Z

docs/_data/nakadi-event-bus-api.yaml

+      span_ctx:
+        description: |
+          Object containing an OpenTracing http://opentracing.io span context. In case the producer of this event type
+          uses OpenTracing for tracing transcations accross distributed services, this field is be populated with an


typo: transcations
"this field is be populated" -> "this field is populated"

lmineiro · 2018-08-16T11:43:18Z

docs/_data/nakadi-event-bus-api.yaml

+          Object containing an OpenTracing http://opentracing.io span context. In case the producer of this event type
+          uses OpenTracing for tracing transcations accross distributed services, this field is be populated with an
+          object that allows consumers to resume the context of a span. The details of the payload are vendor
+          specific and consumers are expected to contact the producer in order to obtain further details.


I'd actually add some short mention that any attempt at inspecting or effectively using the payload, apart from the Tracer implementation can lead to vendor lock-in which we want to avoid.

lmontrieux · 2018-08-30T10:31:42Z

👍

lmontrieux

Nice way for Nakadi users to use OpenTracing.

v-stepanov · 2018-08-30T13:20:02Z

I think we still need to discuss the fact that this new field is just an object which is somehow ruins all our efforts of having nakadi events fully defined structure. I will veto for now.

v-stepanov · 2018-08-30T13:20:10Z

👎

lmineiro · 2018-09-03T08:01:03Z

@v-stepanov I completely understand your concerns. I personally share them.

I'm afraid there's absolutely no other way to support context propagation with Nakadi. Even if we ignored the Baggage-Items (arbitrary key-value pairs that will be part of this object), any attempt at defining a schema for it would no longer be vendor-neutral. This seems to be a much higher price to pay.

Without this approach, Nakadi won't support context propagation through events.

jmhofer · 2018-09-03T09:14:14Z

What's the problem with defining a more generic format? - It's a map from string to string, so you could go for an array of objects with the fields "key" and "value", or something similar, and still have it strictly defined?

lmineiro · 2018-09-03T12:08:49Z

I'm sure there can be many creative ways to serialize the span-context. There's a high risk that it won't be supported out of the box by any Tracer implementation or the existing Opentracing libraries. The standard that was proposed (Text Map) would be easily serialized and deserialized by any Tracer as the Text Map is part of the Opentracing specification (it's just a glorified Map anyway).

The advantages of working out of the box, for me personally, outweigh any creative efforts of having a "strongly-typed" span context attribute in the event metadata. Every consumer and programming language would have to implement the deserialization.

If we don't care about the readability of the span-context by humans, I'd then propose a base64 encoded version of the Binary carrier format for this. This would allow Nakadi to have a single string attribute, at the extra expense of the consumers that would have to decode the base64 blob. What do you think @rcillo, @lmontrieux, @v-stepanov, @jmhofer?

jmhofer · 2018-09-03T13:12:34Z

Personally, I do care about readability.

As the specification ensures that we have a representation of the span context as a map from string to string, it should be easy anyway to serialize/deserialize this map to whatever format you like.

I don't mind much in this case whether it's a (unfortunately schema-less) json object, some json map encoding or even something like application/x-www-form-urlencoded.

What's important in my humble opinion is that de-/serialization must be sufficiently simple for every Nakadi user, as everyone will have to write their own version of it.

ePaul · 2018-09-03T18:03:11Z

docs/_data/nakadi-event-bus-api.yaml

+          specific and consumers are expected to contact the producer in order to obtain further details. Any attempt
+          to inspect or effectively use the payload, apart from the tracer implementation, can lead to vendor lock-in
+          which should be avoided.
+        type: object


As the previous discussion somehow already is noted as "outdated" by Github, here my two cents to this:

If we suppose it is a TextMap, i.e. a map of strings to strings (contrary to an arbitrary JSON object with nested stuff), we can just declare it that way with this syntax:

span_ctx: description: | Object containing ... type: object additionalProperties: type: string

This should take care of the concerns of @v-stepanov (not allowing arbitrary structures in here), while at the same time still allows some vendor-independence by not defining any keys or the meaning of their values.

Just to be clear. What @ePaul just described is the actual proposal. If not defining the key names from said Map is not acceptable, only then I'd consider some other encoding.

Serializing/Deserializing from/to Text Map is supported out of the box on any implementation, once again, the reason why it was proposed. No extra work is needed from any consumer, apart from the typical tracer.Extract() call, using the Text Map carrier.

@v-stepanov, from your perspective, is what @ePaul described a reasonable/good compromise for Nakadi?

@lmineiro We scheduled internal team meeting for tomorrow morning to discuss the metadata changes in Nakadi for Open Tracing. So we will come back to you after that. But we also will need to get Data Lake team opinion on that.
As I understand current options are:

just object with free structure (the initial proposal)

single string with encoded span context

flat map of key-value pairs (additionalProperties: type: string)

I think another option (mentioned by @jmhofer) to encode a flat map (not using an object) would be something like this (as a list of objects).

span_ctx: description: ... type: array items: type: object properties: key: type: string value: type: string

I would prefer the additionalProperties version, though – it has less overhead.

We discussed this all with the team. I would like to describe you what we came to. Basically we examined 3 options for this:

1) Free form object

"span_ctx" : { "key1": { "key1_1": "abc" } "key2": 43232, ... }

Our concerns here is that we will break the guarantees that for "compatible" schema mode in Nakadi all fields are defined. This will create a backdoor to put any undefined data to this field.

2) Just a single string containing encoded data

"span_ctx": "WFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5IGhpcyByZWFzb2"

With this approach the problem is that it has very poor readability, so it will not be possible to get use of this data before decoding. So, for example, in Data Lake it will be very difficult to use this data.

3) Flat string->string map
a)

"span_ctx" : { "key1": "value1", "key2": "value2", ... }

b)

"span_ctx" : [ { "key": "key1", "value": "value1" }, { "key": "key2", "value": "value2" }, ... ]

In general approaches 3a and 3b are similar as they represent flat string->string map.
(However approach 3a has problems similar to case (1) as it will not allow to define all fields of Nakadi event in compatible mode.)
Our concern regarding flat string->string map is that the open-tracing data is actually more complex and has more hierarchical structure, so in a case of map some data still will have to be compacted or transformed to string - which lowers the data quality as the actual value is different from its representation in that map.

Our another concern for any of these approaches is that none of them define the actual data that will be there and it is not clear how our users can make use of it. You say that there is standardized way to serialize/deserialize open-tracing data to/from string->string map, then I suppose there should be some standardized list of required keys it should contain, in other case it just can not work.

Situation with current suggestions looks quite strange for us. Nakadi defines the following: there is a span_ctx field in metadata, I don't know what is inside and I don't provide any guarantees but probably open-tracing implementations should be able to parse it if the publisher has put correct data there.

From our point of view, if we add something to Nakadi metadata we should be able to explain our users how to adopt it and how to benefit from it. So that the user should be able to know what to do with span_ctx if it is present in metadata. For us the most preferable solution will still be to define at least the basic schema for this field (maybe with loosing of some flexibility).

@lmineiro can you please elaborate on that to clarify those points? Maybe providing some examples of what will be the actual content there? And with some use-cases of apps that will write/read data to/from that field without prior agreements between them. Thank you.

I read the discussion with datalake team and the fact that 3b provides a more stable schema inference could be pointed out as a clear advantage. But I personally expected the span_ctx to be defined by additionalProperties: string. I just didn't proposed it to be like that since the beginning because it was not clear for me how the "baggage items" would look like. I think it's important to have the confirmation that span_ctx = additionalProperties: true suites all use cases.

Even though 3a might lead to ever growing schema inference on the datalake side, I would seriously consider it, since it's always better to have more informative schemas, even for tracing data. It's more "searchable". Having "array<string, string>" is not very good for queries.

Also, it's important to understand if is it "ever growing" or not. I assume it might be quite stable set of properties. I think it would be really useful to have some real examples pasted here with and without baggage items. Then we discuss the real thing and not something hypothetical. @lmineiro do you think it possible to provide samples with the current implementation?

The example from the narrative is what we're expected to have with the current vendor:

Example without baggage - stable

ot-tracer-spanid: b268f901d5f2b865 ot-tracer-traceid: e9435c17dabe8238 ot-tracer-sampled: true

Example with baggage - arbitrary

ot-tracer-spanid: b268f901d5f2b865 ot-tracer-traceid: e9435c17dabe8238 ot-tracer-sampled: true ot-baggage-key1: value1 ot-baggage-key2: value2

We currently don't have any valid use cases for baggage. We also advise our users to use it with care, in respect to the specification note: "Use this feature thoughtfully and with care. Every key and value is copied into every local and remote child of the associated Span, and that can add up to a lot of network and cpu overhead."

I understand the preference for 3b, it fulfills both Nakadi's goal of strictly defined schemas and it seems to be better for the searching the Data Lake (this one a Zalando internal).

I still would like to point out that every Nakadi producer and consumer out there, internal or not, will have to write their own serializer an deserializer as the OpenTracing libraries will not be able to do it out of the box - this is, in fact, the biggest advantage of 3a.

The Data Lake is, as far as I know, proprietary. I don't expect any of the external users using Nakadi to benefit from that.

Only the maintainers can make the decision which way to go forward. My choice, as Nakadi user, even if I do have access to our Data Lake, would be to have the greatest out of the box experience if and when I decide to use OpenTracing. Option 3b adds additional cost and effort to that decision while 3a allows any user to reuse existing code and infrastructure.

Nice, I think we came to a good compromise here. I fixed the type to be additionalProperties: type: string which supports 3a.

Ok, sounds good!

jmhofer · 2018-09-06T08:19:35Z

Our concern regarding flat string->string map is that the open-tracing data is actually more complex and has more hierarchical structure

@v-stepanov just to clarify: this should not be a concern, as the OpenTracing specification guarantees that the span context can always be encoded as and decoded from a text map (map from string to string), no matter how the internal vendor-specific data structure looks.

For example, in Java, using the OpenTracing API, encoding looks like this:

Map<String, String> contextMap = new HashMap<>();
tracer.inject(tracer.activeSpan().context(), TEXT_MAP, new TextMapInjectAdapter(contextMap));
// contextMap now contains the (vendor-specific) serialized span context

And decoding (from a given map) looks like this:

SpanContext spanContext = tracer.extract(TEXT_MAP, new TextMapExtractAdapter(contextMap));

Using additionalProperties string makes sure that only flat maps of strings are accepted in the span_ctx. This is considered a decent level of guarantees in terms of schema, keeping it flexible but also relatively well behaved.

v-stepanov · 2018-09-06T14:42:45Z

src/test/java/org/zalando/nakadi/validation/JSONSchemaValidationTest.java

+                "\"span_ctx\": {" +
+                "      \"ot-tracer-spanid\": \"b268f901d5f2b865\"," +
+                "      \"ot-tracer-traceid\": \"e9435c17dabe8238\"," +
+                "      \"ot-baggage-foo\": \"bar\"" +


can we also have a test that validates the case when we have a none-string field in span_ctx? (Validation should fail in this case)

@v-stepanov done 8488b55

…cing-metadata

rcillo · 2018-09-06T14:52:15Z

👍

v-stepanov · 2018-09-06T14:54:44Z

👍

rcillo · 2018-09-06T15:18:17Z

deploy validation please

jmhofer · 2018-09-13T14:32:35Z

😭

…cing-metadata

rcillo · 2018-09-24T12:38:52Z

👍

lmontrieux · 2018-09-24T12:39:48Z

👍

Support OpenTracing span context

322e837

In order to trace distributed systems, open tracing span context can be transported in the metadata of events. This way, publishers and consumer can share a span even when asynchronously communicating through Nakadi.

v-stepanov reviewed Aug 16, 2018

View reviewed changes

lmineiro reviewed Aug 16, 2018

View reviewed changes

Fixed typos and clarified recommendations on how to use the field

51753d3

lmontrieux approved these changes Aug 30, 2018

View reviewed changes

ePaul reviewed Sep 3, 2018

View reviewed changes

Refine definition of span_ctx

561d64f

Using additionalProperties string makes sure that only flat maps of strings are accepted in the span_ctx. This is considered a decent level of guarantees in terms of schema, keeping it flexible but also relatively well behaved.

rcillo requested a review from a team as a code owner September 6, 2018 14:30

v-stepanov reviewed Sep 6, 2018

View reviewed changes

rcillo added 2 commits September 6, 2018 16:44

Merge remote-tracking branch 'origin/master' into aruha-1883-open-tra…

a68032f

…cing-metadata

Add test to check error path

8488b55

Merge remote-tracking branch 'origin/master' into aruha-1883-open-tra…

a742f62

…cing-metadata

rcillo merged commit effb2ed into master Sep 24, 2018

rcillo deleted the aruha-1883-open-tracing-metadata branch September 24, 2018 13:48

zekefast mentioned this pull request Oct 2, 2018

Add support for Open Tracing's span context. zalando-nakadi/fahrschein#224

Merged

matterche mentioned this pull request Oct 8, 2018

Support OpenTracing span context dehora/nakadi-java#336

Closed

jhorstmann mentioned this pull request Oct 30, 2018

Make Metadata and span ctx immutable zalando-nakadi/fahrschein#227

Merged

Support OpenTracing span context #931

Support OpenTracing span context #931

Conversation

rcillo commented Aug 16, 2018

codecov-io commented Aug 16, 2018 • edited

Codecov Report

adyach commented Aug 16, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ePaul Sep 3, 2018 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lmontrieux commented Aug 30, 2018

lmontrieux left a comment

Choose a reason for hiding this comment

v-stepanov commented Aug 30, 2018

v-stepanov commented Aug 30, 2018

lmineiro commented Sep 3, 2018

jmhofer commented Sep 3, 2018

lmineiro commented Sep 3, 2018

jmhofer commented Sep 3, 2018

ePaul Sep 3, 2018 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lmineiro Sep 6, 2018 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jmhofer commented Sep 6, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rcillo commented Sep 6, 2018

v-stepanov commented Sep 6, 2018

rcillo commented Sep 6, 2018

jmhofer commented Sep 13, 2018

rcillo commented Sep 24, 2018

lmontrieux commented Sep 24, 2018

codecov-io commented Aug 16, 2018 •

edited

ePaul Sep 3, 2018 •

edited

ePaul Sep 3, 2018 •

edited

lmineiro Sep 6, 2018 •

edited