Add distribution metric to semantic-core and ffwd-reporter #79

ao2017 · 2020-08-21T13:10:07Z

PR raison d'être

We are adding distribution support to resolve 2 problems.
1. Accurate Aggregated Histogram Data.
Currently percentile data provided by heroic are computed locally.
It is practically impossible to aggregate that data.
With this PR we are introducing a mechanism to record data and send data sketches
to heroic. A sketch of a dataset is a small data structure that let you approximate certain characteristics of the original
dataset. We use sketches to compute rank based statistics such as percentile. Sketches are mergeable and can be used
to compute any percentile on the entire data distribution.

2. Support sophisticated data point value.
We currently have many libraries that emit Open-census metrics. We cannot export Open-census metric to heroic because heroic supports only double data point value. With the introduction of Distribution we will be able to support sophisticated data point value such as Open-census metric distribution.

What's new ?
This PR introduces Distribution. Unlike most metric supported by SemanticMetric this metric is not a DropWizard metric.
However, Distribution implements DropWizard Metric and Counting interface. This is done so we can manage Distribution metric through Semantic Registry which is an extension of DropWizard Registry.

This PR also introduces SemanticMetricDistribution. SemanticMetricDistribution is an implementation of Distribution backed by Tdigest. Check the links below for the rational of using Tdigest.

This PR also refactors FastForwardReporter to add distribution support.

All efforts were made to ensure compatibility. This changes should be an Opt-in for Heroic Users. They shouldn't have to make any change to existing code unless they want to use distribution. At least that is the goal.

Related Issues and PR's
spotify/heroic#476
spotify/ffwd-client-java#13
#78
spotify/heroic#673
spotify/heroic#677

core/src/main/java/com/spotify/metrics/core/codahale/metrics/ext/Distribution.java

core/src/main/java/com/spotify/metrics/core/SemanticMetricRegistryListenerV2.java

core/src/main/java/com/spotify/metrics/core/codahale/metrics/ext/Distribution.java

sming · 2020-08-29T16:01:34Z

This change add new interfaces and extend existing one. All efforts were made to ensure compatibility. Users shouldn't have to make any change to existing code unless they want to use distribution. At least that is the goal.

Hey @ao2017 - could you provide a Use Case-like description please? With links to the actual US/Issue and supporting links e.g. T-digest, codehale.

Also, I'm quite confused about SemanticMetricRegistryListener / MetricRegistryListener / SemanticMetricRegistryListenerV2. The first is in the README but the latter two are not. Could you help me understand what's the plan with these 3 interfaces? thanks.

sming · 2020-08-29T20:16:40Z

core/src/main/java/com/spotify/metrics/core/codahale/metrics/ext/MetricRegistryListener.java

+    /**
+     * A no-op implementation of {@link MetricRegistryListener}.
+     */
+    abstract class Base implements MetricRegistryListener {


can we call this NoOpMetricRegistryListener, or something else meaningful?

That 's the original name Let me see if changing the name won't brake anything.

core/src/main/java/com/spotify/metrics/core/codahale/metrics/ext/MetricRegistryListener.java

sming · 2020-08-29T20:19:26Z

core/src/main/java/com/spotify/metrics/core/codahale/metrics/ext/MetricRegistryListener.java

+     * @param name          the meter's name
+     * @param derivingMeter the meter
+     */
+    void onDerivingMeterAdded(String name, DerivingMeter derivingMeter);


is a distribution a Deriving Meter? I can't quite tell. Here's the definition of a deriving meter:

* A meter that takes the derivative of a value that is expected to be almost monotonically * increasing. A typical use case is to get the rate of change of a counter of the total number of * occurrences of something. * <p> * Implementations will ignore updates that are a decrease of the counter value. The rationale is * that the counter is expected to be monotonically increasing between infrequent resets (such as * when a process has been restarted). Thus, negative values should only happen on restart, and it * should be safe to discard those.

Nope, it is not. Distribution captures the recorded data distribution. It is used to compute percentile in a distributed environment. Can one use distribution to compute rate ? I doubt it won't be accurate since distribution is just capturing an approximation not the actual count.

How do distributions relate to histograms? https://github.com/infusionsoft/yammer-metrics/blob/master/metrics-core/src/main/java/com/codahale/metrics/Histogram.java

Do we need to have both? Is the newly added Distribution a superset of histogram?

So distribution is a name for the type of Histogram that computes rank related stat using actual data distribution.
Dropwizard histogram doesn't support that feature. For more info, take a look at this issue spotify/heroic#476

Do we need to have both ?
At the moment we think yes. Some users may have a preference for local histogram. And I think it is the standard in the industry to support both at least DataDog and OpenTelemetry support both types.

sming · 2020-08-29T20:21:33Z

core/src/main/java/com/spotify/metrics/core/SemanticMetricRegistryAdapter.java

@@ -107,6 +108,29 @@ public void onTimerRemoved(String name) {
        tryRemove(name);
    }

+
+    @Override
+    public void onDerivingMeterAdded(String name, DerivingMeter derivingMeter) {


I'm curious about onHistogramRemoved() and onHistogramAdded() above. Are they a different "kind" of histogram or something?

Sorry I don't understand this question :).

core/src/main/java/com/spotify/metrics/core/SemanticMetricBuilder.java

core/src/main/java/com/spotify/metrics/core/DistributionImpl.java

sming · 2020-08-29T20:43:27Z

core/src/main/java/com/spotify/metrics/core/DistributionImpl.java

+    }
+
+    /**
+     * Returns the current count.


not terribly useful :) Either remove or explain what's being counted, cheers.

Ok, something is better than guessing :) . I will update.

sming · 2020-08-29T20:44:52Z

core/src/main/java/com/spotify/metrics/core/DistributionImpl.java

+    private final AtomicReference<TDigest> distRef;
+
+    protected DistributionImpl() {
+        this.distRef = new AtomicReference<>(TDigest.createDigest(COMPRESSION_DEFAULT_LEVEL));


what's the rationale behind using an AtomicReference? With AtomicRefernce, the only guarantee is that this reference will be updated/set in a thread-safe manner and AFAICT we don't update it...?

Also, AFAICT, multiple threads could still overwrite this.distRef here, no?

That's ok, As long as each thread has its own copy we are good.

ffwd-reporter/src/main/java/com/spotify/metrics/ffwd/FastForwardReporter.java

core/src/main/java/com/spotify/metrics/core/SemanticMetricRegistry.java

spkrka · 2020-08-31T19:14:14Z

core/src/main/java/com/spotify/metrics/core/DistributionImpl.java

+
+    @Override
+    public void record(double val) {
+        distRef.get().add(val);


TDigest.add() does not appear to be threadsafe

Yeah, I am looking into this. I checked couple of Dropwizard metrics there are threadsafe so I will make this thread safe.

I think we could either

add synchronized around it and hope it's enough.

Use ThreadLocal and then merge all the thread local entries somehow inside getValueAndFlush

Push to a queue and have another thread popping them off and adding to the TDigest object

Please check the implementation I added yesterday. I thought about adding threadLocal then change mind. ThreadLocal may introduce memory leak if getValueAndFlush isn't called often.

Yes, that should work now I think (but it's hard to know if there will be much contention, some multithreaded JMH benchmarks might be useful).

Ok, sure will do. Please take a look I need a +1

mattnworb

(@malish8632 asked me to take a look since I've worked with these classes before)

The main question I have is - what is a Distribution? I think the initial PR summary is missing the motivation behind the addition or what it means to achieve (except adding new classes).

Is it necessary to have a separate interface and implementation for Distribution and DistributionImpl, when the other core types (Timer, Meter, Histogram) don't follow this pattern? Those classes implement interfaces, but those are more generic interfaces that have multiple implementations (i.e. Timer and Meter are implementations of Metered) which allows for abstraction in some places (and code reuse). I don't see the similar need for Distribution unless you plan to add more implementations of it later on, in which case DistributionImpl could use a more descriptive name.

A few suggestions/questions below - I don't think you need to go to the trouble of declaring V2 interfaces or new packages, it'd be simpler to update the version of the library to indicate changes to the API of SemanticMetricRegistry and friends.

core/src/main/java/com/spotify/metrics/core/SemanticMetricRegistryListenerV2.java

mattnworb · 2020-08-31T19:51:45Z

core/src/main/java/com/spotify/metrics/core/codahale/metrics/ext/Distribution.java

+ * under the License.
+ */
+
+package com.spotify.metrics.core.codahale.metrics.ext;


why add the new Distribution type to this package, instead of com.spotify.metrics.core? There are interfaces in core that refer to this type, so I think it would be natural to have it live in com.spotify.metrics.core as well

If you want to add it in a subpackage, I'd still recommend against this specific name, its not like any classes live in or com.spotify.metrics.core.codahale com.spotify.metrics.core.codahale.metrics

@mattnworb I don't see how SemanticMetricRegistry.addListener(SemanticMetricRegistryListener). is broken with this PR.

I think I agree with Matt - could we use com.spotify.metrics.core instead?

core/src/main/java/com/spotify/metrics/core/codahale/metrics/ext/Distribution.java

core/src/main/java/com/spotify/metrics/core/codahale/metrics/ext/MetricRegistryListener.java

ffwd-reporter/src/main/java/com/spotify/metrics/ffwd/FastForwardReporter.java

mattnworb · 2020-08-31T20:10:00Z

ffwd-reporter/src/main/java/com/spotify/metrics/ffwd/FastForwardReporter.java

@@ -396,4 +338,101 @@ public void stop() {
    public void close() {
        stop();
    }
+
+    public static final class Builder {


It'd probably be better to make separate PRs for changes like reordering the structure of the class or whitespace changes, as it makes the git history harder to follow if its combined with this PR

+1 to this comment - pls separate this into diff PR

1. Remove SemanticMetricRegistryListenerV2 2. Added Default methods to SemanticMetricRegistryListener 3. We understand that there is a chance for SemanticMetricRegistry to call a no op methods. This can lead to confusion hopefully users of this lib will look at the source code or documentation.

…yListener

core/src/main/java/com/spotify/metrics/core/SemanticMetricRegistry.java

core/src/main/java/com/spotify/metrics/core/SemanticMetricRegistryListener.java

malish8632

Looks much better. Some changes should help simplify this PR and clarify type packages.

…tricRegistry.getDistribution

mattnworb

thanks for resolving all of my earlier comments, and I find the PR summary very helpful now - thanks for both of those.

I have a few small questions below

core/src/main/java/com/spotify/metrics/core/SemanticMetricDistribution.java

mattnworb · 2020-09-04T14:44:39Z

core/src/main/java/com/spotify/metrics/core/Distribution.java

+     * is reset and a new recording starts.
+     * @return
+     */
+    ByteBuffer getValueAndFlush();


I am still curious about the idea of exposing the "value" of the Distribution as a ByteBuffer. It seems like it would couple this Distribution metric to the way in which the t-digest library serializes the underlying com.tdunning.math.stats.TDigest into bytes.

For instance, if there was an alternate implementation of Distribution in the future, it would have to return the values in the same byte format, since FastForwardReporter has the expectation that the results of getValueAndFlush() must look a certain way.

Is there any other data type that could be used? Or at least could the format of the bytes be documented here?

If we use Distribution as the return type then the serializer will have to know something about Distribution implementation. For instance, FFW reporter will have an implementation of this method for every Distribution type.
https://github.com/spotify/semantic-metrics/pull/79/files/4ae53fe81b12e35376529ab79184798e982e7ade#diff-455c8027b591558e13ad0a4d245b2518R265. That will not align with the current paradigm of not exposing the detail of the metric implementation to the serializer.

Documentation for bytes format is handled by FFW. There is a version number, we will use that number upstream to determine how to handle the bytes.

Return something immutable like ByteString

Instead of ByteString, return something more semantically meaningful like a DistributionSnapshot wrapper

I am not sure about DistributionSnapshot, that will expose Distribution implementation to ffwReporter as noted above.

Let's merge this branch. Bytebuffer is transformed into ByteString before serialization in ffw-client. I will change the return type later.

malish8632

LGTM

my concerns were addressed

spkrka · 2020-09-09T07:25:06Z

ffwd-reporter/src/main/java/com/spotify/metrics/ffwd/FastForwardReporter.java

+    private void reportDistribution(final com.spotify.ffwd.v1.Metric metric,
+                                    final Distribution distribution) {
+        ByteBuffer byteBuffer = distribution.getValueAndFlush();
+        Value value = Value.distributionValue(byteBuffer);


I just noticed this PR: spotify/ffwd-client-java#10
Can we update the ffwd API to use ByteString instead of ByteBuffer?
I think it should be safe to change that if no one is using the distribution types yet (since they are not implemented yet)

If you look at the entire PR you will see that ByteBuffer is transformed into ByteString before serialization. Moving byteString up in the stack is something we can do in the next PR.

spkrka · 2020-09-09T07:28:08Z

core/src/main/java/com/spotify/metrics/core/SemanticMetricDistribution.java

@@ -0,0 +1,88 @@
+/*
+ * Copyright (c) 2016 Spotify AB.


Update year

sming

Just a few comments @ao2017 .

sming · 2020-09-09T13:54:52Z

core/src/main/java/com/spotify/metrics/core/Distribution.java

+ * Just get an instance through SemanticMetricBuilder and record data.
+ *
+ * <p> {@link Distribution} is a good choice if you care about percentile accuracy in
+ * a distributed environment and you want to rely on P99 to set SLO.


Excellent comment!

sming · 2020-09-09T13:58:48Z

core/src/main/java/com/spotify/metrics/core/SemanticMetricDistribution.java

+ * From P99.9 to P99.999 the error rate is slightly higher than 2%.
+ *
+ */
+public final class SemanticMetricDistribution implements Distribution {

    private static final int COMPRESSION_DEFAULT_LEVEL = 100;


Might we want to change this level? Should it be configuration-driven perhaps?

I think for the time being it is better to keep every aspect of Tdigest private. Power users have the ability to extend this class and use a different compression level. The typical use case 100 is ok.

sming · 2020-09-09T14:01:06Z

core/src/main/java/com/spotify/metrics/core/SemanticMetricRegistryListener.java

@@ -93,6 +93,16 @@ public void onDerivingMeterAdded(MetricId name, DerivingMeter derivingMeter) {
        @Override
        public void onDerivingMeterRemoved(MetricId name) {
        }
+


I'm still not a fan of this class name. Is it convention?

sming · 2020-09-09T14:03:09Z

ffwd-reporter/src/main/java/com/spotify/metrics/ffwd/FastForwardReporter.java

+        private TagExtractor tagExtractor;
+        private ScheduledExecutorService executorService;
+
+        private Set<Percentile> histogramPercentiles =


Why these particular numbers? Can we comment or const them please.

sming · 2020-09-09T14:04:31Z

ffwd-reporter/src/main/java/com/spotify/metrics/ffwd/FastForwardReporter.java

+        public Builder histogramQuantiles(double... quantiles) {
+            histogramPercentiles = new HashSet<>();
+            for (double q : quantiles) {
+                histogramPercentiles.add(new Percentile(q));


Is there a reason we're not validating that the double is in range?

ao2017 added 3 commits August 21, 2020 08:30

Add distribution metric to semantic-core and ffwd-reporter

58883b5

Merge branch 'adele/add-distribution-support' into adele/distribution

6785e73

Update javadoc

dab6897

ao2017 requested a review from malish8632 August 21, 2020 13:13

ao2017 requested review from spkrka, lmuhlha and jsferrei August 21, 2020 13:14

jsferrei removed their request for review August 21, 2020 13:17

ao2017 requested review from sjoeboo and sming and removed request for spkrka August 21, 2020 13:18

Fix build failure

912e50a

ao2017 requested a review from smstone August 21, 2020 13:30

Switch to ffw-java-client version 0.22

4ae53fe

malish8632 reviewed Aug 24, 2020

View reviewed changes

core/src/main/java/com/spotify/metrics/core/codahale/metrics/ext/Distribution.java Outdated Show resolved Hide resolved

malish8632 reviewed Aug 24, 2020

View reviewed changes

core/src/main/java/com/spotify/metrics/core/SemanticMetricRegistryListenerV2.java Outdated Show resolved Hide resolved

sming reviewed Aug 29, 2020

View reviewed changes

core/src/main/java/com/spotify/metrics/core/codahale/metrics/ext/Distribution.java Outdated Show resolved Hide resolved

sming reviewed Aug 29, 2020

View reviewed changes

core/src/main/java/com/spotify/metrics/core/codahale/metrics/ext/Distribution.java Outdated Show resolved Hide resolved

sming reviewed Aug 29, 2020

View reviewed changes

spkrka reviewed Aug 31, 2020

View reviewed changes

core/src/main/java/com/spotify/metrics/core/SemanticMetricRegistry.java Outdated Show resolved Hide resolved

spkrka reviewed Aug 31, 2020

View reviewed changes

mattnworb previously requested changes Aug 31, 2020

View reviewed changes

ao2017 added 4 commits September 2, 2020 12:48

Change the name of Distribution Implementation and add Threadsafety

89b365d

Add javadoc to Distribution

c2e676a

Revert SemanticMetricRegistryAdapter changes and remove MetricRegistr…

b07eca4

…yListener

malish8632 reviewed Sep 3, 2020

View reviewed changes

core/src/main/java/com/spotify/metrics/core/SemanticMetricRegistry.java Outdated Show resolved Hide resolved

malish8632 reviewed Sep 3, 2020

View reviewed changes

core/src/main/java/com/spotify/metrics/core/SemanticMetricRegistryListener.java Show resolved Hide resolved

malish8632 suggested changes Sep 3, 2020

View reviewed changes

ao2017 added 2 commits September 3, 2020 23:49

Addresse PR's comments: remove code re-ordering and rename SemanticMe…

055de7e

…tricRegistry.getDistribution

Revert re-ordering in FastForwardReporter

339a19f

mattnworb reviewed Sep 4, 2020

View reviewed changes

malish8632 approved these changes Sep 4, 2020

View reviewed changes

Additional info regarding Distribution output

c000a1f

ao2017 requested review from mattnworb and spkrka September 8, 2020 19:50

mattnworb approved these changes Sep 8, 2020

View reviewed changes

spkrka reviewed Sep 9, 2020

View reviewed changes

Update copyright licence agremment header

b0ece44

ao2017 merged commit 6244f29 into spotify:master Sep 9, 2020

sming reviewed Sep 9, 2020

View reviewed changes

Add distribution metric to semantic-core and ffwd-reporter #79

Add distribution metric to semantic-core and ffwd-reporter #79

Conversation

ao2017 commented Aug 21, 2020 • edited Loading

sming commented Aug 29, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sming Aug 29, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mattnworb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

malish8632 left a comment

Choose a reason for hiding this comment

mattnworb left a comment

Choose a reason for hiding this comment

mattnworb Sep 4, 2020 • edited Loading

Choose a reason for hiding this comment

ao2017 Sep 4, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

malish8632 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sming left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ao2017 commented Aug 21, 2020 •

edited

Loading

sming commented Aug 29, 2020 •

edited

Loading

sming Aug 29, 2020 •

edited

Loading

mattnworb Sep 4, 2020 •

edited

Loading

ao2017 Sep 4, 2020 •

edited

Loading

sming left a comment •

edited

Loading