Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Base histogram implementation #7

Open
wants to merge 14 commits into
base: master
from

Conversation

Projects
None yet
5 participants
@smrutilal2
Copy link
Collaborator

commented Mar 15, 2019

No description provided.

@smrutilal2 smrutilal2 force-pushed the smrutilal2:histograms branch from 04848b9 to 2025ad5 Mar 15, 2019


import io.ultrabrew.metrics.util.DistributionBucket;

public class Histogram extends Timer {

This comment has been minimized.

Copy link
@arungupta

arungupta Mar 15, 2019

Contributor

Histogram should extend Gauge, not Timer. While most use cases of Histogram will be related to Timer/Latencies, there do exist use cases for non-time based measurements like object sizes, message size, bytes transferred, time elapsed (non-Timer use case) since connections were established, etc.

This comment has been minimized.

Copy link
@SeanPMiller

SeanPMiller Mar 21, 2019

Monitoring tools such as statsd use "histogram" and "timer" as synonyms. Confusion is inevitable. Whatever the histogram-enabled class is called, it is-a gauge and therefore should extend it.

This comment has been minimized.

Copy link
@SeanPMiller

SeanPMiller Mar 21, 2019

That's a long-winded way of saying I agree but not necessarily for the same reason. Having Timer distinct from Histogram doesn't make sense to me yet. You want some other type indicating that one or more data sketches (of which a histogram is only one example) are to be applied to measurements of the metric identity.


private static final Logger LOGGER = LoggerFactory.getLogger(BasicHistogramAggregator.class);
private static final String[] AGG_FIELDS = {"count", "sum", "min", "max"};
private static final int[] IDENTITY = {0, 0, Integer.MAX_VALUE, Integer.MIN_VALUE};

This comment has been minimized.

Copy link
@arungupta

arungupta Mar 15, 2019

Contributor

array type needs to be long[] because these are statistics (the gauge part of the histogram vs counts within a bucket). since time is tracked in nanos, approximately 2 seconds will reach Integer.MAX_VALUE in nanos.


@Override
public void apply(String[] tags, long latency, long timestamp) {
apply(tags, (int) latency, timestamp);

This comment has been minimized.

Copy link
@arungupta

arungupta Mar 15, 2019

Contributor

do not cast to int, as per previous explanation


@Override
protected void combine(int[] table, long baseOffset, int latency) {
add(table, baseOffset, 0, 1);

This comment has been minimized.

Copy link
@arungupta

arungupta Mar 15, 2019

Contributor

if we keep the base array type as int[] we need to write a long into 2 ints. need to adjust the offsets accordingly.

int maxVal = buckets[high];

if (key >= maxVal) {
return high + 1; // align with underflow

This comment has been minimized.

Copy link
@arungupta

arungupta Mar 16, 2019

Contributor

comment should indicate this is the overflow bucket

int bucketCount = buckets.length;
String[] names = new String[bucketCount + 1];
int i = 0;
names[i] = UNDERFLOW;

This comment has been minimized.

Copy link
@arungupta

arungupta Mar 16, 2019

Contributor

tracking underflow should be based on user preference to avoid wasting a bucket for uses where it's not relevant. minor optimization.


private int[] buckets;

public DistributionBucket(final int[] buckets) {

This comment has been minimized.

Copy link
@arungupta

arungupta Mar 16, 2019

Contributor

need to throw error if only 1 bucket is passed

This comment has been minimized.

Copy link
@arungupta

arungupta Mar 16, 2019

Contributor

also validate if buckets aren't in increasing order, or if 2 entries have same value

This comment has been minimized.

Copy link
@arungupta

arungupta Mar 16, 2019

Contributor

or sort the values and eliminate duplicates

@arungupta
Copy link
Contributor

left a comment

Please improve as discussed in person

@smrutilal2 smrutilal2 force-pushed the smrutilal2:histograms branch 2 times, most recently from b73d1b1 to 78c3951 Mar 20, 2019

@@ -109,7 +114,8 @@ public void addReporter(final Reporter reporter) {
reporters.add(reporter);
}

private <T extends Metric> T getOrCreate(final String id, final Class<T> klass) {
private <T extends Metric> T getOrCreate(final Class<T> klass, final Object... args) {
String id = (String) args[0];

This comment has been minimized.

Copy link
@lonemeow

lonemeow Mar 21, 2019

Collaborator

This is kinda ugly, I'd prefer having a formal parameter and dealing with it in initArgs construction.

This comment has been minimized.

Copy link
@smrutilal2

smrutilal2 Mar 26, 2019

Author Collaborator

I guess, you are suggesting
private <T extends Metric> T getOrCreate(final Class<T> klass, final String id, final Object... args)

Show resolved Hide resolved core/src/main/java/io/ultrabrew/metrics/MetricRegistry.java Outdated
Show resolved Hide resolved core/src/main/java/io/ultrabrew/metrics/MetricRegistry.java Outdated
Show resolved Hide resolved core/src/main/java/io/ultrabrew/metrics/Metric.java Outdated
return new CursorImpl(tagSets, fields, types, identity, sorted);
}

private long[] buildIdentity(int length) {

This comment has been minimized.

Copy link
@lonemeow

lonemeow Mar 21, 2019

Collaborator

Since neither the source nor the result array is ever modified, this seems pointless.

This comment has been minimized.

Copy link
@smrutilal2

smrutilal2 Mar 26, 2019

Author Collaborator

The point is not clear. Could you please explain your intent?

This comment has been minimized.

Copy link
@mmannerm

mmannerm Apr 8, 2019

Collaborator

what is the point of this method?

This comment has been minimized.

Copy link
@smrutilal2

smrutilal2 Apr 8, 2019

Author Collaborator

This is the identity values for the buckets.

This comment has been minimized.

Copy link
@mmannerm

mmannerm Apr 8, 2019

Collaborator

Yes, but why do you need this copy here? Its not modified anywhere?

This comment has been minimized.

Copy link
@smrutilal2

smrutilal2 Apr 9, 2019

Author Collaborator

These are the identity values used by the Cursor to reset the values after reading. The record length depends on the number of buckets used for the histogram, I initialize this array once, and use it each time the Cursor object is created.

This comment has been minimized.

Copy link
@smrutilal2

smrutilal2 Apr 9, 2019

Author Collaborator

Currently the Base aggregation like count, sum,min,max and buckets counts are tracked in the same table. Once we split that as part of #18, we can default the identity values for the buckets.

This comment has been minimized.

Copy link
@lonemeow

lonemeow Apr 15, 2019

Collaborator

So do I understand this correctly: This method is intended to produce an array that contains copy of IDENTITY followed by buckets.getCount() zeroes?

If that is the case, please make it explicit in the code instead of relying on default initialized values. A comment explaining why would also help.

Show resolved Hide resolved core/src/main/java/io/ultrabrew/metrics/data/BasicHistogramAggregator.java Outdated

final private String[] fields;
final private Type[] types;
private long[] identity;

This comment has been minimized.

Copy link
@lonemeow

lonemeow Mar 21, 2019

Collaborator

Should be final if not removed completely (see comments above).

import org.slf4j.LoggerFactory;
import sun.misc.Unsafe;

public abstract class ConcurrentIntTable {

This comment has been minimized.

Copy link
@lonemeow

lonemeow Mar 21, 2019

Collaborator

This class duplicates a lot of code from ConcurrentMonoidHashTable, is there really nothing that can be shared between them (without sacrificing performance obviously)?

private final int recordSize;
private final int maxCapacity;
private final long[] identity;
private int numAggFields;

This comment has been minimized.

Copy link
@lonemeow

lonemeow Mar 21, 2019

Collaborator

Should be final.


public class Commons {

public static void checkArgument(final boolean condition, final String message) {

This comment has been minimized.

Copy link
@lonemeow

lonemeow Mar 21, 2019

Collaborator

This method (and/or the class) could use better name, it would be nice to see straight away if it throws if the condition is true/false without reading code or javadoc.

This comment has been minimized.

Copy link
@mmannerm

mmannerm Mar 21, 2019

Collaborator

i.e. "assertTrue"


import io.ultrabrew.metrics.util.DistributionBucket;

public class Histogram extends Gauge {

This comment has been minimized.

Copy link
@SeanPMiller

SeanPMiller Mar 21, 2019

I wrote this on the first agupta comment I saw, but I question whether this should be modeled this way. Will discuss with you both.


import io.ultrabrew.metrics.util.DistributionBucket;

public class Histogram extends Gauge {

This comment has been minimized.

Copy link
@mmannerm

mmannerm Mar 21, 2019

Collaborator

There is no need to have a Histogram. Why is there a histogram in instrumentation. Histograms can be used for Gauge or Timer in the reporting side, but what is histogram really instrumenting? It does not make any sense in instrumentation side.

This comment has been minimized.

Copy link
@arungupta

arungupta Mar 21, 2019

Contributor

We need to know the type of data that has to be maintained. ie, histogram vs gauge. if we want histogram a reporter concern, the information that the caller wants a histogram has to somehow be passed down. We could create a different metric/gauge registration api to support that.

This comment has been minimized.

Copy link
@mmannerm

mmannerm Mar 21, 2019

Collaborator

There are two distinct phases as per our readme and documentation, etc. One is to instrument code, i.e. I am measuring now this and that etc. How we then report it is a separate issue. I can use histograms, latency buckets, ngrams etc to report on the timer. I must not need to change the instrumentation code when I decide to switch this.

This comment has been minimized.

Copy link
@arungupta

arungupta Mar 21, 2019

Contributor

Histograms, ngrams, etc require configuration which may be specific to a metric. In case of histograms, that's the bucket spec. Even if we treat Histograms as a reporter level concern, the instrumented code will have to pass the appropriate configurations to the reporter(s) separately from how metrics are registered.

One could argue that metrics should all be predefined in a single place and instrumented code should just use these metrics. I don't believe that'll be feasible for all applications.

We could go for code like this:

MetricRegistry metricRegistry = new MetricRegistry();

Reporter metricsReporter = new Reporter(...);
metricRegistry.addReporter(reporter);

Gauge cacheSizeGauge = metricRegistry.gauge("cacheSize");
Map<String,String> cache = new java.util.Map<>();
String[] tagList = new String[] { "host", hostName };
metricsReporter.addAggregator(cacheSizeGauge, Reporter.TYPE_HISTOGRAM, <spec>)

// ... do something

cacheSizeGauge.set(cache.size(), tagList);

Thoughts?

This comment has been minimized.

Copy link
@mmannerm

mmannerm Mar 22, 2019

Collaborator

Resolve conversation button really sucks.

The configuration is reporter specific anyway, you cant give the configuration in the instrumentation class, as you need to support multiple reporters and one might report it as p99 only and other as avg and 3rd as latency buckets.

There was already ask in our discussion with smruti the need for the config overrides on class AND metric id level

This comment has been minimized.

Copy link
@arungupta

arungupta Mar 22, 2019

Contributor

Are you thinking of providing overrides in a different class, away from where metric is originally registered? I'm not sure if that's the right approach. Configuration around a metric should be managed in a common place.. preferably where the metric is declared.

This comment has been minimized.

Copy link
@mmannerm

mmannerm Mar 22, 2019

Collaborator

That is against the whole design where instrumentation and reporting are separated

This comment has been minimized.

Copy link
@arungupta

arungupta Mar 22, 2019

Contributor

They would still be separated. Specifically, this configuration would be away from where the metric is set/modified.

This comment has been minimized.

Copy link
@mmannerm

mmannerm Mar 22, 2019

Collaborator

No, it should be set where reporter is defined.

@@ -4,6 +4,7 @@

package io.ultrabrew.metrics;

import io.ultrabrew.metrics.util.DistributionBucket;

This comment has been minimized.

Copy link
@mmannerm

mmannerm Mar 21, 2019

Collaborator

Unused import?

private final DistributionBucket buckets;
private final int bucketCount;

public BasicHistogramAggregator(Histogram histogram) {

This comment has been minimized.

Copy link
@mmannerm

mmannerm Mar 21, 2019

Collaborator

javadocs everywhere

This comment has been minimized.

Copy link
@arungupta

arungupta Mar 21, 2019

Contributor

This is a draft PR. Javadocs will follow. @smrutilal2 was trying to get feedback on whatever he had done so far while he's working on remaining parts.

return types;
}

private class CursorImpl implements Cursor {

This comment has been minimized.

Copy link
@mmannerm

mmannerm Mar 21, 2019

Collaborator

We should create a package protected simplecursorimpl and use that and only create override if really needed.


public class Commons {

public static void checkArgument(final boolean condition, final String message) {

This comment has been minimized.

Copy link
@mmannerm

mmannerm Mar 21, 2019

Collaborator

i.e. "assertTrue"

@Test
void testDefaultCardinality() {
BasicCounterAggregator aggregator = new BasicCounterAggregator("test");
int maxCapacity = Deencapsulation.getField(aggregator, "maxCapacity");

This comment has been minimized.

Copy link
@mmannerm

mmannerm Mar 21, 2019

Collaborator

it is already gone, upgrade to latest jmockit as well on the PR.

@smrutilal2 smrutilal2 added the WIP label Mar 22, 2019

Smruti Ranjan Sahoo added some commits Mar 15, 2019

Smruti Ranjan Sahoo
Review Comment Fixed:
1. Histogram extends Gauge
2. Support long agg fields
3. Intutive implementation of DistributionBucket
4. Input validation for DistributionBucket

@smrutilal2 smrutilal2 force-pushed the smrutilal2:histograms branch from d479924 to 10d4a20 Mar 28, 2019

Smruti Ranjan Sahoo
review comment fixed
1. Removed Metric type Histogram
2. Override with BasicHistogramAggergator for a specific metric
3. addressed misc comments

@smrutilal2 smrutilal2 force-pushed the smrutilal2:histograms branch from 10d4a20 to 563a006 Mar 28, 2019

Smruti Ranjan Sahoo added some commits Mar 28, 2019

Smruti Ranjan Sahoo
Smruti Ranjan Sahoo

@smrutilal2 smrutilal2 removed the WIP label Mar 28, 2019

@@ -78,8 +76,8 @@ public ConcurrentIntTable(final int recordSize, int initialCapacity, final int m

private ConcurrentIntTable(final int numAggFields, final int dataSize, int initialCapacity, final int maxCapacity, final long[] identity){

checkArgument(initialCapacity >= 0, "Illegal initial capacity");
checkArgument(maxCapacity >= initialCapacity, "max capacity should be greater than the initial capacity");
assert initialCapacity >= 0 : "Illegal initial capacity";

This comment has been minimized.

Copy link
@arungupta

arungupta Apr 3, 2019

Contributor

how about renaming this class to ConcurrentMonoidIntTable and the other one to ConcurrentMonoidLongTable?

* The histogram buckets are derived from {@link DistributionBucket}.
* <p>A sample histogram would look like:</p>
* <ul>
* <li>[0_10)</li>

This comment has been minimized.

Copy link
@arungupta

arungupta Apr 4, 2019

Contributor

While the buckets in this doc look pretty, we are likely going to have an ugly outcome with bucket naming for timer metrics because the unit of measurement is in nanoseconds while most humans think/expect in milliseconds.
Ideally, when the reporters are built, we should allow users to specify the output bucket spec unit so the devs can state they wish to see bucket names in milliseconds even though internally the library tracks them in nanoseconds.

Additionally, we should have an option to prettify the bucket names to avoid things like 52.4498324_58.230823832, rather 52.5_58.2, etc

@@ -41,8 +41,8 @@
public class BasicHistogramAggregator extends ConcurrentIntTable implements Aggregator {

This comment has been minimized.

Copy link
@arungupta

arungupta Apr 4, 2019

Contributor

as discussed in person with @smrutilal2, we should not absorb the functionality of gauge/timer into histogram aggregator, rather let them run in parallel. this will allow users to eventually have multiple aggregators depending on use cases. eg, timer+histogram+t-digest+HLL for a metric but different downstream use cases.

Smruti Ranjan Sahoo
Smruti Ranjan Sahoo

Smruti Ranjan Sahoo added some commits Apr 5, 2019

@smrutilal2

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 8, 2019

@lonemeow @mmannerm could you please review.

*/
public BasicCounterAggregator(final String metricId, final int capacity) {
super(metricId, capacity, FIELDS, TYPES, IDENTITY);
public BasicCounterAggregator(final String metricId, final int cardinality) {

This comment has been minimized.

Copy link
@mmannerm

mmannerm Apr 8, 2019

Collaborator

I would still make this method public BasicCounterAggregator(final String metricId, final int maxCardinality) instead and calculate the initial capacity from the maxCardinality or use sane default.

@@ -54,7 +63,8 @@ public SLF4JReporter(final String name) {
* @param windowSizeSeconds window size in seconds
*/
public SLF4JReporter(final String name, final int windowSizeSeconds) {
this(name, " ", " ", " ", windowSizeSeconds);
this(name, DEFAULT_TAG_DELIMITER, DEFAULT_FIELD_DELIMITER, DEFAULT_TAGFIELD_DELIMITER,

This comment has been minimized.

Copy link
@mmannerm

mmannerm Apr 8, 2019

Collaborator

I really dislike this as now I have to scroll somewhere and find out what the hell are the values here.

This comment has been minimized.

Copy link
@smrutilal2

smrutilal2 Apr 8, 2019

Author Collaborator

I extracted these constants to make it configurable through the Builder. If we don't want to make them configurable then I can inline them.

This comment has been minimized.

Copy link
@mmannerm

mmannerm Apr 8, 2019

Collaborator

? I am confused, how does replacing static final DEFAULT_TAG_DELIMITER = " " with inline " " change anything except readability?

This comment has been minimized.

Copy link
@smrutilal2

smrutilal2 Apr 9, 2019

Author Collaborator

I pulled out a constant to use it in the builder. If you really hate this then I will have to inline them in two places.

@@ -32,6 +37,10 @@
*/
public class SLF4JReporter extends TimeWindowReporter {

This comment has been minimized.

Copy link
@mmannerm

mmannerm Apr 8, 2019

Collaborator

Every change on this class looks like it could be in a completely separate PR, any reason why thats not the case?

This comment has been minimized.

Copy link
@smrutilal2

smrutilal2 Apr 8, 2019

Author Collaborator

This change is to demonstrate how the reporters are going to use the HistogramAggregator.

This comment has been minimized.

Copy link
@mmannerm

mmannerm Apr 8, 2019

Collaborator

Hmmh, can't see it anywhere? Usage of HistogramAggregator is not in default aggregators or anywhere?

This comment has been minimized.

Copy link
@smrutilal2

smrutilal2 Apr 9, 2019

Author Collaborator

You can see the usage in SLF4JReporterTest::testHistogram()

return matchAny(buckets, i -> buckets[i] == buckets[i + 1]);
}

private static boolean matchAny(final long[] buckets, Predicate<Integer> predicate) {

This comment has been minimized.

Copy link
@mmannerm

mmannerm Apr 8, 2019

Collaborator

While seems this is only called at constructor time, this will box/unbox the integers and create lots of garbage.. It may be fine here, a javadoc would be good..

@Test
void testDefaultCardinality() {
BasicGaugeAggregator aggregator = new BasicGaugeAggregator("test");
int maxCapacity = Deencapsulation.getField(aggregator, "maxCapacity");

This comment has been minimized.

Copy link
@mmannerm

mmannerm Apr 8, 2019

Collaborator

Remove Deencapsulation usage

This comment has been minimized.

Copy link
@smrutilal2

smrutilal2 Apr 8, 2019

Author Collaborator

We will address them as part of #12

@mmannerm
Copy link
Collaborator

left a comment

How are the developer supposed to supply the DistributionBuckets ? Example code? At the moment it seems too complicated for average engineers.

@smrutilal2

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 9, 2019

How are the developer supposed to supply the DistributionBuckets ? Example code? At the moment it seems too complicated for average engineers.

You can see the usage in SLF4JReporterTest::testHistogram. Client can supply the distribution buckets while creating the Reporter builder. For example:

DistributionBucket bucket = new DistributionBucket(new long[]{0, 10, 50, 100});
InfluxDBReporter r = InfluxDBReporter.builder()
        .withBaseUri(TEST_URI)
        .withDatabase("test")
        .withWindowSize(12765)
        .addHistogram("requestTime", bucket)
        .build();
Smruti Ranjan Sahoo
@mmannerm

This comment has been minimized.

Copy link
Collaborator

commented Apr 9, 2019

I still want a better example outside of tests.

*/
private String[] buildFields() {
String[] fields = new String[AGG_FIELDS.length + bucketCount];
String[] buckets = this.buckets.getBucketNames();

This comment has been minimized.

Copy link
@lonemeow

lonemeow Apr 15, 2019

Collaborator

Shadowing members with locals (and then being forced to use this) is ugly, please rename this local variable to bucketNames or something like that.

public class BasicHistogramAggregator extends ConcurrentMonoidIntTable implements Aggregator {

private static final Logger LOGGER = LoggerFactory.getLogger(BasicHistogramAggregator.class);
private static final String[] AGG_FIELDS = {"count", "sum", "min", "max", "lastValue"};

This comment has been minimized.

Copy link
@lonemeow

lonemeow Apr 15, 2019

Collaborator

Style nit: What does AGG stand for? Avoid ambiguous shorthands.

super(AGG_FIELDS.length + buckets.getCount(), maxCardinality, cardinality, IDENTITY);
this.metricId = metricId;
this.buckets = buckets;
this.bucketCount = buckets.getCount();

This comment has been minimized.

Copy link
@lonemeow

lonemeow Apr 15, 2019

Collaborator

Is this field just an optimization for buckets.getCount()? If so, is the performance impact sufficient to justify the extra complexity?

* @param key measurement value
* @return bucket index
*/
private int binarySearch(final long key) {

This comment has been minimized.

Copy link
@lonemeow

lonemeow Apr 15, 2019

Collaborator

I would probably just inline this into getBucketIndex() as the only thing it does is call this.

Also a comment explaining what's special about this particular binary search implementation (ie. how it is different from Arrays.binarySearch()) should be added.

Show resolved Hide resolved core/src/main/java/io/ultrabrew/metrics/reporters/SLF4JReporter.java
*/
public static abstract class TimeWindowReporterBuilder<B extends TimeWindowReporterBuilder, R extends TimeWindowReporter> {

protected Map<Class<? extends Metric>, Function<Metric, ? extends Aggregator>> defaultAggregators = DEFAULT_AGGREGATORS;

This comment has been minimized.

Copy link
@lonemeow

lonemeow Apr 15, 2019

Collaborator

Awfully long lines here, check formatting and max line length.

return new CursorImpl(tagSets, fields, types, identity, sorted);
}

private long[] buildIdentity(int length) {

This comment has been minimized.

Copy link
@lonemeow

lonemeow Apr 15, 2019

Collaborator

So do I understand this correctly: This method is intended to produce an array that contains copy of IDENTITY followed by buckets.getCount() zeroes?

If that is the case, please make it explicit in the code instead of relying on default initialized values. A comment explaining why would also help.


private final String[] fields;
private final Type[] types;
private final long[] identity;

This comment has been minimized.

Copy link
@lonemeow

lonemeow Apr 15, 2019

Collaborator

Since this is a non-static inner class, fields, types and identity can be accessed directly from the outer class, no need to have duplicate variables here.


assert buckets.length >= 2 : "Minimum bucket length is 2";
assert isSorted(buckets) : "Bucket should be sorted in ascending order";
assert !hasDuplicate(buckets) : "Bucket should not have duplicate entries";

This comment has been minimized.

Copy link
@lonemeow

lonemeow Apr 15, 2019

Collaborator

Please don't use assert in non-performance-critical code, almost no one ever runs JVM with assertions enabled.

final int maxCapacity, int initialCapacity, final long[] identity) {

assert initialCapacity >= 0 : "Illegal initial capacity";
assert maxCapacity >= initialCapacity : "max capacity should be greater than the initial capacity";

This comment has been minimized.

Copy link
@lonemeow

lonemeow Apr 15, 2019

Collaborator

Please don't use assert in non-performance-critical code, almost no one ever runs JVM with assertions enabled.

Smruti Ranjan Sahoo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.