RFC: context tracing

Context tracing

Intro

There have been a few attempts to introduce tracing in Vert.x, but none proved worthwhile, mainly because they were difficult to support.

Those efforts focused on thread-local-based propagation, with hooks in Vert.x to observe asynchronous tasks, so that asynchronous task events and thread locals could be combined to track a context.

In essence, what is important is to capture some context before an asynchronous operation, so it's available when the asynchronous operation executes:

// Capture context
runAsync(() -> {
  // Restore context
});

Vert.x has a built-in notion of context (io.vertx.core.Context), usually associated with a deployment. Most of the Vert.x ecosystem uses it to guarantee concurrency and threading for a deployment, for instance in the Cassandra client:

public CassandraClient execute(Statement statement, Handler<AsyncResult<ResultSet>> resultHandler) {
  Context context = vertx.getOrCreateContext();
  executeWithSession(session -> {
    handleOnContext(session.executeAsync(statement), context, ar -> {
      if (ar.succeeded()) {
        resultHandler.handle(Future.succeededFuture(new ResultSetImpl(ar.result(), vertx)));
      } else {
        resultHandler.handle(Future.failedFuture(ar.cause()));
      }
    });
  }, resultHandler);
  return this;
}

Since this context is deployment scoped, it is not useful for tracing: it cannot track finer grained activities, like a flow triggered by an HttpServerRequest or an event-bus message.

This proposal aims to extend the notion of context to provide finer grained contexts and allow tracking them. Simply put, the proposal builds on the fact that most of the Vert.x ecosystem already uses the context.

In the above Cassandra example, when the context is bound to the lifecycle of an operation (e.g. an HttpServerRequest), it will be propagated naturally by the Cassandra client and by most of the Vert.x ecosystem (provided it respects this).

Goals

trace an activity or flow attached to the current Vert.x context (which implies it can be retrieved statically through the thread local that holds the Vert.x context)
hooks for transport context propagation (HTTP client/server, event-bus consumer/producer)
tracing management, i.e the ability to start/end an activity for third-party integration

Non goals

generic framework for tracing arbitrary continuations
thread local like registries

Proposed changes

allow shallow duplication of an io.vertx.core.Context with local data attached; this creates a new context which shares the same state as its creating context (same event-loop, same attribute map, same concurrency, etc...)
add a local map storage on the context, used to store state that distinguishes contexts (i.e. it can store a span id, etc...)
modify server implementations so that contexts are created when necessary

per HTTP server request
per event-bus message received
per TCP socket
etc...

Examine clients so that they capture contexts correctly (as they already should)

Some clients will capture a context for their lifetime (i.e. an HTTP client connection) and reuse it to service concurrent requests (connection pooling or HTTP/2). In this situation, each individual usage of the connection will clone the implicit context (i.e. the HTTP client connection's context) with the contextual data.

Some clients provide one-shot operations, a simple propagation will work, like in the Cassandra client case.
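
The shallow duplication and local storage described above can be illustrated with a small standalone model. This is only a sketch under assumptions: SketchContext, duplicate(), putLocal() and getLocal() are hypothetical names chosen for illustration and are not the io.vertx.core.Context API. The shared map stands in for the state a duplicate keeps in common with its creator, and the local map stands in for per-activity data such as a span id.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the proposal; this is NOT the io.vertx.core.Context API.
final class SketchContext {
  // Shared by every duplicate: stands in for the event-loop, attribute map, etc.
  private final Map<String, Object> shared;
  // Private to this duplicate: e.g. a span id.
  private final Map<String, Object> local = new ConcurrentHashMap<>();

  SketchContext() {
    this(new ConcurrentHashMap<>());
  }

  private SketchContext(Map<String, Object> shared) {
    this.shared = shared;
  }

  // Shallow duplicate: same shared state, fresh local storage.
  SketchContext duplicate() {
    return new SketchContext(shared);
  }

  void put(String key, Object value) { shared.put(key, value); }
  Object get(String key) { return shared.get(key); }
  void putLocal(String key, Object value) { local.put(key, value); }
  Object getLocal(String key) { return local.get(key); }
}

final class DuplicateDemo {
  public static void main(String[] args) {
    // Stands in for the context captured by a pooled HTTP client connection.
    SketchContext connection = new SketchContext();
    connection.put("pool", "http-client");

    // Each usage of the connection clones the context and attaches its own data.
    SketchContext request1 = connection.duplicate();
    SketchContext request2 = connection.duplicate();
    request1.putLocal("spanId", "span-1");
    request2.putLocal("spanId", "span-2");

    System.out.println(request1.get("pool") + " " + request1.getLocal("spanId"));
    System.out.println(request2.get("pool") + " " + request2.getLocal("spanId"));
    System.out.println(connection.getLocal("spanId")); // the original context has no span id
  }
}
```

In this model the two duplicates see the same shared attribute but keep distinct span ids, which is the property that lets a connection-scoped context serve concurrent requests without mixing their traces.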

Stack support

A major concern is the ability to support this feature in the Vert.x stack at a small cost. Beyond the Vert.x stack, this should also play nicely with existing Vert.x based middleware.

Previous attempts were discarded because they were too intrusive or complicated, which made them difficult to support. This effort builds on the existing Vert.x context and its thread local association (i.e. getOrCreateContext()) that is used in the stack and beyond.

An important point that is advocated by the community is the role of the Vert.x context and how it should be used when crossing Vert.x boundaries:

capture the context
use the context to perform the callback

Simple callbacks should be supported out of the box, i.e the Cassandra client example above.

Longer interactions (like a stream or a transaction) need to capture the context and use it for the lifetime of the interaction so they report to the same activity.

Event-bus integration

Event-bus has various interactions between a producer and a consumer.

send
- producer side: sendRequest
- producer side: receiveResponse (with an empty body since there is no reply expected)
- consumer side: receiveRequest
- consumer side: sendResponse
request-reply
- producer side: sendRequest
- consumer side: receiveRequest
- consumer side: sendResponse
- producer side: receiveResponse
request-reply-reply: modelled as request-reply + send
etc...

Such interactions should be mapped to RPC style and not messaging (as they could be in a tracer), mainly because otherwise tracing would be re-entrant and confusing: e.g. when an HTTP server request performs a request-reply, the reply would start a new trace although it is already in a trace.

Caveat: the consumer sendResponse has no body and is always done upon message delivery since there is no guarantee that the actual message handler will provide a response.
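
The RPC-style mapping listed above can be written down as data. The following is a minimal sketch, assuming hypothetical enum and method names (Interaction, Event, events); it only restates the event lists from this section and does not reflect any actual Vert.x tracing API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical names; the lists restate the interactions described in this section.
final class EventBusTraceSketch {
  enum Interaction { SEND, REQUEST_REPLY }

  enum Event {
    PRODUCER_SEND_REQUEST,
    PRODUCER_RECEIVE_RESPONSE,
    CONSUMER_RECEIVE_REQUEST,
    CONSUMER_SEND_RESPONSE
  }

  // Returns the tracing events reported for an interaction, in the order listed above.
  static List<Event> events(Interaction interaction) {
    switch (interaction) {
      case SEND:
        // No reply is expected: the producer reports an empty response right away.
        return Arrays.asList(
            Event.PRODUCER_SEND_REQUEST,
            Event.PRODUCER_RECEIVE_RESPONSE,
            Event.CONSUMER_RECEIVE_REQUEST,
            Event.CONSUMER_SEND_RESPONSE);
      case REQUEST_REPLY:
        return Arrays.asList(
            Event.PRODUCER_SEND_REQUEST,
            Event.CONSUMER_RECEIVE_REQUEST,
            Event.CONSUMER_SEND_RESPONSE,
            Event.PRODUCER_RECEIVE_RESPONSE);
      default:
        throw new IllegalArgumentException("Unknown interaction: " + interaction);
    }
  }

  public static void main(String[] args) {
    for (Interaction interaction : Interaction.values()) {
      System.out.println(interaction + " -> " + events(interaction));
    }
  }
}
```

Both interactions report the same four events; only the position of the producer's receiveResponse differs, which keeps a send and a request-reply inside a single RPC-shaped span pair.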

Deprecations

We might have to deprecate/remove a few static methods that use getOrCreateContext():

Vertx#runOnContext(Handler<Void>)

This method uses getOrCreateContext() under the hood and leads to capturing the context at callback time instead of call time.

It can be used this way:

asyncCall(result -> vertx.runOnContext(v -> callback.handle(result)));

Instead one should do:

// Capture the context at call time, before the asynchronous operation starts
Context current = vertx.getOrCreateContext();
asyncCall(result -> current.runOnContext(v -> callback.handle(result)));
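
The difference between the two forms can be reproduced without Vert.x. The sketch below is a standalone model under assumptions: Ctx, CURRENT, asyncCall and the simplified getOrCreateContext() are hypothetical stand-ins, and the callback is deliberately completed on a foreign thread whose thread local holds another context.

```java
import java.util.function.Consumer;

// Hypothetical model; none of these names belong to the Vert.x API.
final class CaptureSketch {
  // Stand-in for a context; it only carries a name.
  static final class Ctx {
    final String name;
    Ctx(String name) { this.name = name; }
  }

  // Thread local association between a thread and its current context.
  private static final ThreadLocal<Ctx> CURRENT = new ThreadLocal<>();

  // Simplified lookup: returns the thread's context, creating one if absent.
  static Ctx getOrCreateContext() {
    Ctx ctx = CURRENT.get();
    if (ctx == null) {
      ctx = new Ctx("created");
      CURRENT.set(ctx);
    }
    return ctx;
  }

  // Simulates a library that completes on a thread associated with another context.
  static void asyncCall(Consumer<String> callback) throws InterruptedException {
    Thread worker = new Thread(() -> {
      CURRENT.set(new Ctx("worker"));
      callback.accept("result");
    });
    worker.start();
    worker.join();
  }

  // Looks the context up inside the callback, i.e. at callback time.
  static String lookedUpAtCallbackTime() throws InterruptedException {
    CURRENT.set(new Ctx("request"));
    String[] seen = new String[1];
    asyncCall(result -> seen[0] = getOrCreateContext().name);
    return seen[0];
  }

  // Captures the context before the asynchronous call, i.e. at call time.
  static String capturedAtCallTime() throws InterruptedException {
    CURRENT.set(new Ctx("request"));
    Ctx current = getOrCreateContext();
    String[] seen = new String[1];
    asyncCall(result -> seen[0] = current.name);
    return seen[0];
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println(lookedUpAtCallbackTime()); // prints "worker"
    System.out.println(capturedAtCallTime());     // prints "request"
  }
}
```

In this model, the lookup at callback time reports to whatever context the completing thread happens to carry, while the captured reference keeps reporting to the context of the original activity.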

2...

Status

A first milestone has currently been merged in vertx-core
TODO
- Model as producer/consumer
  - one way send/receive
  - publish/receive
- EventBus on send, the trace might not be determined on timeout because the timeout can happen before the subs is resolved (especially when clustered)
- dispatch currently assumes the thread to be on the event loop which is fine and we should not dispatch at other moments or use run on context ???

Notes

Armeria's tracing support described in this page looks similar to this proposal
