Stagemonitor 0.80.0.RC1 released #46
It’s finally done, well, almost… A lot of things have changed, which is why we decided to publish a release candidate before the final release. So please try it out, but be aware that it is not production-ready yet, as there might still be some bugs lingering in there.
So what exactly is (almost) done?
Stagemonitor now supports distributed tracing
What is distributed tracing and why should I care?
Distributed tracing enables you to „debug“ distributed architectures in which a single request passes through many services. By debug I mean finding out what happened during a particular request, what caused an error and why the request was slow. Traditionally, all that was required to understand what was happening inside an application was to attach a profiler. For example, you could use previous versions of stagemonitor in your application and the call tree would show you which methods and JDBC queries were particularly slow. But when a request is served not by one service but potentially by dozens or hundreds, things get a bit more complicated. This is where distributed tracing comes in. It helps you reason about which services were involved in a request and how much time each service contributed to the total execution time.
The main concepts in OpenTracing are spans and traces. Traces consist of multiple spans from different services. A trace groups spans belonging to a single transaction. The trace propagates through a (potentially distributed) system. More information about OpenTracing can be found on the official website http://opentracing.io/.
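To make the relationship between spans and traces more concrete, here is a minimal sketch in plain Java. The class and field names are made up for illustration and are not the OpenTracing API types; the only point is that spans from different services belong to the same trace because they share a trace ID.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only. These classes are hypothetical and are not the OpenTracing API types.
final class TraceModelSketch {

    /** One timed operation performed by one service. */
    static final class Span {
        final String traceId;      // identical for all spans of the same trace
        final String spanId;       // unique per span
        final String parentSpanId; // null for the root span of a trace
        final String service;
        final long durationMillis;

        Span(String traceId, String spanId, String parentSpanId, String service, long durationMillis) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.parentSpanId = parentSpanId;
            this.service = service;
            this.durationMillis = durationMillis;
        }
    }

    /** A trace is the set of reported spans that share one trace ID. */
    static List<Span> spansOfTrace(List<Span> reported, String traceId) {
        List<Span> result = new ArrayList<>();
        for (Span span : reported) {
            if (span.traceId.equals(traceId)) {
                result.add(span);
            }
        }
        return result;
    }

    /**
     * Sums the span durations per service. A parent span's duration includes
     * the time spent in its children, so the sums overlap.
     */
    static Map<String, Long> millisPerService(List<Span> trace) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (Span span : trace) {
            result.merge(span.service, span.durationMillis, Long::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Span> reported = new ArrayList<>();
        reported.add(new Span("t1", "a", null, "shop", 120));
        reported.add(new Span("t1", "b", "a", "payment", 80));
        reported.add(new Span("t2", "c", null, "shop", 15));
        System.out.println(millisPerService(spansOfTrace(reported, "t1")));
    }
}
```

A real tracer additionally records start timestamps, tags and logs on each span, which this sketch leaves out.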
How is distributed tracing implemented in stagemonitor?
Stagemonitor uses the OpenTracing API, which is a vendor and language neutral standard for distributed tracing. The two most important advantages for stagemonitor are that the actual OpenTracing implementation can be changed at any time and that stagemonitor can take advantage of 3rd party libraries which add support for certain technologies, like OkHttp. In fact, stagemonitor already supports multiple OpenTracing implementations.
In practice, this means that sending tracing data to Elasticsearch, the „traditional“ way for stagemonitor, is no longer the only option. Zipkin is also supported as a new backend. This is enabled by the Brave OpenTracing bridge which uhm… bridges the OpenTracing API stagemonitor uses to Brave, the library of choice when it comes to reporting to Zipkin. This is also interesting for Brave users, as stagemonitor can automatically instrument their applications without any code changes.
Stagemonitor can also assist in correlating logs, as it adds useful information to slf4j’s MDC (Mapped Diagnostic Context). This information can be used to identify which application, host and instance a log line comes from (nothing new so far) as well as which trace the log line belongs to (great news, everyone).
Distributed tracing sounds nice, but why should I use stagemonitor?
There are a few things which set stagemonitor apart from other tracing solutions.
One interesting feature stagemonitor has supported from the beginning is the included profiler, which generates a call tree for a request. The call tree shows you which methods were executed during a request and which ones were particularly slow. This feature gets a new dimension in the context of distributed tracing. Usually, you would „only“ find out which application was slow, but not why. Maybe the application was slow not because it made a lot of requests to other applications, but because its code was inefficient. Also, even if you have found out that one application executes 1000 SQL statements, you do not necessarily know where in the code this happens and why. This is where stagemonitor’s distributed call tree can help out. Stagemonitor answers the questions of why the application is slow and where the slow code is located.
Another feature of stagemonitor is that there are no code changes required to your application. You do not have to manually implement any tracing code or configure 3rd party modules. Stagemonitor transparently injects tracing into your code via byte code manipulation.
Unlike most libraries capable of distributed tracing, stagemonitor also extracts metrics from the spans it collects. These metrics include response time percentiles, throughput and error rates. When you embrace distributed tracing, you are likely to store only a fraction of the actual traces in your backend (a.k.a. sampling) so as not to overwhelm your drives. This also means that you might miss some information, especially about outliers. The advantage of metrics is that they do not grow in size as your application serves more requests. Metrics are always calculated for all requests, no matter whether their corresponding spans are sampled or not.
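The difference between sampled spans and metrics can be sketched as follows. This is not how stagemonitor is implemented; the class, the bucket boundaries and the "keep every n-th span" rule are assumptions chosen to keep the example small. It only illustrates that the metrics use a fixed amount of memory and see every request, while the span store sees only a fraction.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not stagemonitor code: metrics cover every request, spans are sampled.
final class SampledTracingSketch {
    // Fixed-size histogram: memory use does not grow with the number of requests.
    private static final long[] BUCKET_UPPER_BOUNDS_MS = {10, 50, 100, 500, 1000, Long.MAX_VALUE};
    private final long[] bucketCounts = new long[BUCKET_UPPER_BOUNDS_MS.length];

    private final int sampleEveryNth;
    private final List<String> storedSpans = new ArrayList<>(); // stands in for the tracing backend
    private long requestCount;
    private long errorCount;

    SampledTracingSketch(int sampleEveryNth) {
        if (sampleEveryNth < 1) {
            throw new IllegalArgumentException("sampleEveryNth must be >= 1");
        }
        this.sampleEveryNth = sampleEveryNth;
    }

    void onRequestFinished(String operation, long durationMs, boolean error) {
        // Metrics are updated for every request, sampled or not.
        requestCount++;
        if (error) {
            errorCount++;
        }
        for (int i = 0; i < BUCKET_UPPER_BOUNDS_MS.length; i++) {
            if (durationMs <= BUCKET_UPPER_BOUNDS_MS[i]) {
                bucketCounts[i]++;
                break;
            }
        }
        // Only every n-th span is stored.
        if (requestCount % sampleEveryNth == 0) {
            storedSpans.add(operation + " took " + durationMs + " ms");
        }
    }

    long requestCount() {
        return requestCount;
    }

    int storedSpanCount() {
        return storedSpans.size();
    }

    double errorRate() {
        return requestCount == 0 ? 0.0 : (double) errorCount / requestCount;
    }

    /** Returns the upper bound of the histogram bucket that contains the given percentile. */
    long percentileUpperBoundMs(double percentile) {
        long rank = (long) Math.ceil(percentile / 100.0 * requestCount);
        long seen = 0;
        for (int i = 0; i < BUCKET_UPPER_BOUNDS_MS.length; i++) {
            seen += bucketCounts[i];
            if (seen >= rank) {
                return BUCKET_UPPER_BOUNDS_MS[i];
            }
        }
        return Long.MAX_VALUE;
    }
}
```

If a single slow, failing request happens to fall between two sampled spans, it is missing from the span store but still shows up in the error rate and in the highest histogram bucket.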
The journey has just begun and there is still a lot of work ahead of us. For example, the support for span context propagation is still quite limited. Currently, Spring’s RestTemplate is instrumented so that it sends information about the span context downstream via HTTP headers. This information is then used to correlate spans which belong to the same trace.
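The idea behind propagating the span context via HTTP headers can be sketched like this. The header names and classes below are invented for the example; the actual header names depend on the tracer implementation in use, and this is not stagemonitor's RestTemplate instrumentation. The calling service writes its trace ID and span ID into the outgoing headers, and the receiving service reads them to create a child span in the same trace.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch of span context propagation; header names are made up.
final class PropagationSketch {
    static final String TRACE_ID_HEADER = "x-example-trace-id";
    static final String SPAN_ID_HEADER = "x-example-span-id";

    /** The part of a span that has to travel between services. */
    static final class SpanContext {
        final String traceId;
        final String spanId;
        final String parentSpanId; // null for the first span of a trace

        SpanContext(String traceId, String spanId, String parentSpanId) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.parentSpanId = parentSpanId;
        }
    }

    /** Starts a new trace, for example when a request enters the system. */
    static SpanContext newRoot() {
        return new SpanContext(newId(), newId(), null);
    }

    /** Client side: copies the current span context into the outgoing request headers. */
    static void inject(SpanContext current, Map<String, String> headers) {
        headers.put(TRACE_ID_HEADER, current.traceId);
        headers.put(SPAN_ID_HEADER, current.spanId);
    }

    /** Server side: continues the caller's trace if the headers are present, otherwise starts a new one. */
    static SpanContext extract(Map<String, String> headers) {
        String traceId = headers.get(TRACE_ID_HEADER);
        String callerSpanId = headers.get(SPAN_ID_HEADER);
        if (traceId == null || callerSpanId == null) {
            return newRoot();
        }
        return new SpanContext(traceId, newId(), callerSpanId);
    }

    private static String newId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        SpanContext upstream = newRoot();
        Map<String, String> headers = new HashMap<>();
        inject(upstream, headers);              // done by the calling service
        SpanContext downstream = extract(headers); // done by the called service
        System.out.println("same trace: " + upstream.traceId.equals(downstream.traceId));
    }
}
```

A client library that is not instrumented never calls the equivalent of `inject`, so the downstream service starts a fresh trace and the two spans cannot be correlated.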
The OpenTracing API is now at the very core of stagemonitor. That meant a lot of refactoring and a lot of changes. Some changes also directly affect users of stagemonitor.
The most notable changes are the renaming of the modules
One of the formerly most central classes in stagemonitor, RequestTrace, has been replaced with the Span interface from the OpenTracing API. So, if you previously enhanced the request trace with your own custom information, you will need to migrate to the
Unfortunately, the library UADetector, which stagemonitor uses to parse the User-Agent header, has been discontinued, as the underlying database is not free anymore. This is why the parsing of the user agent is deactivated by default. If you want to enable it, set
For a comprehensive list of all features and breaking changes, please refer to the release notes.
Thank you to everyone who participated in the process and who gave feedback (@ryanrupp, @marcust, @kishoremk, @trampi). A special thanks to @adriancole, who tested the Brave/Zipkin integration, gave me valuable tips and was an overall nice guy. It’s always cool when technology connects people who previously did not have anything in common.
If you have any kind of feedback, please share it. Either as a comment to this post or as an issue in the stagemonitor repo.
Known issues in this version:
Jun 13, 2017
I can try to break down our Container to the minimum to describe how we load stuff there (when I find the time).
Also I have another issue. For me the saved search for Request Analysis in Kibana does not appear, but other Dashboards and Visualisations are available.