Very slow performance; what am I missing? #62

Closed
akalani opened this issue Dec 15, 2015 · 8 comments

Comments

@akalani

akalani commented Dec 15, 2015

I have been playing with the t-digest on a sample dataset of latencies and comparing its performance against raw sorting for deriving quantile statistics. I am using the standard APIs (and the recommended configuration values) to build the t-digest; however, I am observing significantly worse times than raw sorting. Is this to be expected? I would appreciate any help in this regard.

Here is my sample code:

long[] latencies = new long[1000000];
...
TDigest tDigest = TDigest.createDigest(100);
for (int i=0; i < 1000000; i++) {
    tDigest.add(latencies[i]);
}
System.out.println("Median = " + tDigest.quantile(0.5));

I used a dataset with 1 million latency values (in ms). The max is 100,000. It took about 2.7 secs to build the digest and extract the 50% quantile value. With a dataset with 2 million latency values, the computation time doubled. Raw sorting is under 100ms.

Reducing the compression factor to 5 helps but computation time is still excessively high.
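
For reference, here is a minimal sketch of the side-by-side comparison described above, assuming the com.tdunning.math.stats package and the same calls shown in the snippet; how the latencies are loaded is left out:

import java.util.Arrays;

import com.tdunning.math.stats.TDigest;

public class QuantileComparison {
    public static void main(String[] args) {
        // latencies in ms; loading them is elided here
        double[] latencies = new double[1000000];
        // ... fill latencies ...

        // t-digest path
        long t0 = System.currentTimeMillis();
        TDigest tDigest = TDigest.createDigest(100);
        for (double v : latencies) {
            tDigest.add(v);
        }
        double digestMedian = tDigest.quantile(0.5);
        long t1 = System.currentTimeMillis();

        // raw-sort baseline, on a copy so the input stays untouched
        double[] sorted = latencies.clone();
        Arrays.sort(sorted);
        double sortMedian = sorted[sorted.length / 2];
        long t2 = System.currentTimeMillis();

        System.out.println("TDigest: time = " + (t1 - t0) + " ms, median = " + digestMedian);
        System.out.println("Raw: time = " + (t2 - t1) + " ms, median = " + sortMedian);
    }
}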

@tdunning
Owner

These results are anomalous.

In my own tests, I add a million or so points to a t-digest and see insertion times of about 200 ns. You are seeing times that are more than 100 times longer than that.

One question I have is whether you are initializing the latencies to interesting values or if they are all set to zero. If you have identical values, there might be a corner case that is causing an inefficient data structure.

Have you tried this experiment with the MergingDigest?

To do this, use the one line:

TDigest tDigest = MergingDigest.create(100);
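
As a rough illustration of what varied ("interesting") input might look like, here is a hedged sketch; the distribution below is made up for illustration, and the factory call is the one quoted just above (the exact factory name may differ between versions):

import java.util.Random;

import com.tdunning.math.stats.MergingDigest;
import com.tdunning.math.stats.TDigest;

public class MergingDigestCheck {
    public static void main(String[] args) {
        // Varied values avoid the identical-value corner case mentioned above;
        // the mix of values here is purely illustrative.
        Random rng = new Random(42);
        double[] latencies = new double[1000000];
        for (int i = 0; i < latencies.length; i++) {
            // mostly 20-80 ms, with a rare 100,000 ms timeout as an outlier
            latencies[i] = (rng.nextInt(1000) == 0) ? 100000 : 20 + 60 * rng.nextDouble();
        }

        TDigest tDigest = MergingDigest.create(100);  // factory call as quoted above
        long t0 = System.currentTimeMillis();
        for (double v : latencies) {
            tDigest.add(v);
        }
        long t1 = System.currentTimeMillis();
        System.out.println("MergingDigest: time = " + (t1 - t0) + " ms, median = " + tDigest.quantile(0.5));
    }
}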

@akalani
Author

akalani commented Dec 16, 2015

I initialized the latencies by randomizing and duplicating a sample set of Google ping latencies. The values are in milliseconds; the max (the timeout value) in the set is 100,000, though the max values are outliers and occur infrequently.

Could the outlier timeout values be causing performance degradation?

I have built the TDigest using the MergingDigest API, but see no improvement in performance.

@tdunning
Owner

The outliers don't cause any problems at all. Nor does the actual distribution cause any issues, except possibly when there are massive numbers of duplicates. The MergingDigest should not have any distributional issues.

Can you share your code? Your data?

@akalani
Author

akalani commented Dec 16, 2015

Here is the test that I wrote:
http://pastebin.com/Rmue62ch

The Java class takes the path to the latencies file as an input. Here is an output that it generated for the latencies dataset with 1 million records (time is in ms, and memory size is in KB):

TDigest: time = 1218
TDigest: median = 44.43894320782193
TDigest: mem size = 21496
Raw: time = 54
Raw: median = 45
Raw: mem size = 8000016

The data file is attached.
latencies_1m.txt.zip

@akalani
Author

akalani commented Dec 17, 2015

A minor correction, the memory size reported above is in bytes, not kilobytes.

@tdunning
Owner

OK. I found a few issues.

  1. You are passing in longs; the digest works in doubles. This costs a bit.

  2. Your timing was confused by the prints. I changed the code to store the times into variables.

  3. There is some warmup to be had.

Here is my adapted code:

long start = System.currentTimeMillis();
double[] latencies = readLatenciesFile(args[0]);  // helper and RECORD_COUNT are from the original test

long t1 = System.currentTimeMillis();
TDigest tDigest = MergingDigest.createDigest(200);
for (int i = 0; i < RECORD_COUNT; i++) {
    tDigest.add(latencies[i]);
}
long t2 = System.currentTimeMillis();
Arrays.sort(latencies);
long t3 = System.currentTimeMillis();

System.out.println("Read time = " + (t1 - start));
System.out.println("TDigest: time = " + (t2 - t1));
System.out.println("TDigest: median = " + tDigest.quantile(0.5));
// System.out.println("TDigest: mem size = " + SizeEstimator.estimate(tDigest));

System.out.println("Raw: time = " + (t3 - t2));
System.out.println("Raw: median = " + latencies[latencies.length / 2]);
System.out.println();
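
On point 3 (warmup): the first pass pays JIT and class-loading costs, so it helps to repeat the whole measurement several times in one JVM run and look at the later iterations, which is presumably how the blocks of numbers below were produced. A minimal sketch of that outer loop, where measureOnce is a hypothetical name standing in for the timed block above:

// measureOnce is a hypothetical stand-in for the timed block above;
// it re-reads the file so each iteration starts from unsorted data.
for (int run = 0; run < 5; run++) {
    measureOnce(args[0]);
}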

After this, I get results like this on my laptop (note that these are with compression = 200):

Read time = 551
TDigest: time = 571
TDigest: median = 44.55266098760193
Raw: time = 302
Raw: median = 44.552607870527396

Read time = 204
TDigest: time = 325
TDigest: median = 44.54911853552305
Raw: time = 78
Raw: median = 44.549239966209704

Read time = 156
TDigest: time = 295
TDigest: median = 44.5561107724475
Raw: time = 83
Raw: median = 44.55575939279197

Read time = 136
TDigest: time = 297
TDigest: median = 44.552868775685916
Raw: time = 83
Raw: median = 44.55243261312767

Read time = 139
TDigest: time = 307
TDigest: median = 44.557862132539675
Raw: time = 82
Raw: median = 44.557593756576985

This means that the t-digest is showing as about 4x slower than sorting the array, and the cost per element is about 300 ns, which isn't far from the result I had from micro-benchmarking. I think there is another 2-4x speedup available in the code through simple improvements, but I don't have time to do it, so these times will have to stand.

It makes some sense that a raw sort of a million items would be very fast, since the merging digest has to touch the data more often. If there are k elements in the insertion buffer and n in the retained buffer for the merging digest, the buffer sorting time will be N log k versus N log N (with similar constants) for the entire sort, where k = 32, log k = 5, N = 10^6, and log N = 20. The insertion will also require a pass over the retention buffer for each sort of the insertion buffer and thus will require n N/k ≈ 6 N touches (with n on the order of the compression, about 200 here). Given how well the Java library writers have optimized the built-in sort, this result isn't unreasonable.
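
To make that arithmetic concrete, here is a rough back-of-envelope sketch using the numbers above; taking n to be about 200 (on the order of the compression used here) is my assumption, and the whole thing is an order-of-magnitude estimate rather than a measurement:

public class CostEstimate {
    public static void main(String[] args) {
        double N = 1e6;   // total points inserted
        double k = 32;    // insertion buffer size, so log2(k) = 5
        double n = 200;   // retained centroids, roughly the compression (assumption)

        // Full sort of all N points: about N * log2(N) comparisons.
        double fullSort = N * (Math.log(N) / Math.log(2));      // ~2.0e7

        // Merging digest: each point is sorted within a small buffer
        // (N * log2(k)), plus one pass over the retained centroids per
        // buffer flush (n touches per flush, N/k flushes).
        double bufferSorts = N * (Math.log(k) / Math.log(2));   // ~5.0e6
        double mergePasses = n * (N / k);                       // ~6.3e6, i.e. about 6N

        System.out.printf("full sort    ~ %.1e comparisons%n", fullSort);
        System.out.printf("digest total ~ %.1e touches%n", bufferSorts + mergePasses);
    }
}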

@akalani
Author

akalani commented Dec 17, 2015

Thanks very much for your help. I will keep the above in mind.

@akalani akalani closed this as completed Dec 17, 2015
@tdunning
Owner

I just did some experiments (see the t-digests-benchmark repo). I think that your observations lead to some good speedups. I experimented with allocating bigger insertion buffers and found that, due to the phenomenon you noted (the built-in sort is fast), we can make some big improvements.

Here are the results:

Benchmark            (compression) (factor)   Mode   Samples         Mean   Mean error    Units
c.t.MergeBench.add             100        1   avgt         5     2747.205       71.506    ns/op
c.t.MergeBench.add             100       10   avgt         5      625.746       14.814    ns/op
c.t.MergeBench.add             100      100   avgt         5      416.151       14.514    ns/op
c.t.MergeBench.add             200        1   avgt         5     3137.047       97.135    ns/op
c.t.MergeBench.add             200       10   avgt         5      659.008       33.876    ns/op
c.t.MergeBench.add             200      100   avgt         5      422.534        8.808    ns/op
c.t.MergeBench.add             500        1   avgt         5     3843.771      137.530    ns/op
c.t.MergeBench.add             500       10   avgt         5      749.032       27.951    ns/op
c.t.MergeBench.add             500      100   avgt         5      442.522       14.167    ns/op
c.t.MergeBench.add            1000        1   avgt         5     5875.645      218.997    ns/op
c.t.MergeBench.add            1000       10   avgt         5      953.075       19.278    ns/op
c.t.MergeBench.add            1000      100   avgt         5      474.294       17.505    ns/op

Mean is the time you want to look at. Factor determines how much extra insertion-buffer space I am allocating. Allocating 100x more space makes a whacking big difference. I will look at this some more and decide whether that is a reasonable thing to do. If so, good news for speed, especially for higher compression factors.
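
For anyone who wants to reproduce something along these lines, here is a hedged JMH-style sketch; the parameter names mirror the table's columns, but the two-argument MergingDigest constructor and the way the buffer is scaled by factor are my assumptions, not necessarily what the real MergeBench does:

import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

import com.tdunning.math.stats.MergingDigest;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class MergeBenchSketch {
    @Param({"100", "200", "500", "1000"})
    double compression;

    @Param({"1", "10", "100"})
    int factor;

    MergingDigest digest;
    Random rng;

    @Setup
    public void setup() {
        // Assumes a constructor that takes an explicit buffer size; the
        // real MergeBench may configure the buffer differently.
        digest = new MergingDigest(compression, (int) (factor * compression));
        rng = new Random(42);
    }

    @Benchmark
    public void add() {
        digest.add(rng.nextDouble());
    }
}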
