
LogLog-Beta Algorithm support within HLL #3677

Closed · vharish836 wants to merge 1 commit

Conversation

@vharish836 (Contributor) commented Dec 9, 2016

LogLog-β is a new algorithm for estimating cardinalities, based on LogLog counting. The new algorithm uses only one formula and needs no additional bias corrections over the entire range of cardinalities; it is therefore more efficient and simpler to implement.
Refer to https://arxiv.org/pdf/1612.02557.pdf for more details.
This changeset adds support for computing the cardinality with the LogLog-β algorithm, plus a config option to enable it.
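
For reference, the single formula is E = alpha_m * m * (m - z) / (beta(z) + sum_j 2^(-M[j])), where z is the number of zero-valued registers and beta is a degree-7 polynomial in ln(z+1). Below is a minimal standalone sketch of the estimator; it is an illustration rather than the patch itself: the function name is hypothetical, and the beta coefficients are the paper's values for m = 16384 registers (the size Redis uses).

    #include <math.h>
    #include <stdint.h>

    /* Sketch of the LogLog-beta estimate over m registers M[0..m-1].
     * "ez" counts zero-valued registers; "beta" is the paper's degree-7
     * polynomial in zl = ln(ez+1), with coefficients for m = 16384. */
    double loglogbeta_estimate(const uint8_t *M, int m) {
        double E = 0;   /* sum of 2^-M[j], the usual harmonic term */
        int ez = 0;     /* number of registers still at zero */
        for (int j = 0; j < m; j++) {
            if (M[j] == 0) ez++;
            E += 1.0/(double)(1ULL << M[j]);
        }
        double zl = log(ez + 1);
        double beta = -0.370393911*ez +
                       0.070471823*zl +
                       0.17393686*pow(zl,2) +
                       0.16339839*pow(zl,3) +
                      -0.09237745*pow(zl,4) +
                       0.03738027*pow(zl,5) +
                      -0.005384159*pow(zl,6) +
                       0.00042419*pow(zl,7);
        double alpha = 0.7213/(1+1.079/m);  /* usual alpha_m approximation */
        return alpha*m*(m-ez)*(1/(E+beta));
    }

Note that this single expression covers the whole range: there is no switch to linear counting and no bias-correction table.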

I have tested the changes using the PFSELFTEST command with the new config option set.

Config option to use LogLog-Beta Algorithm for Cardinality
@antirez (Contributor) commented Dec 16, 2016

Hello, this looks cool, and I compared the two implementations in order to switch entirely to the new one. However, I found that while the error is very similar for larger cardinalities, for smaller ones (1 to 12) the loglog-beta implementation in this PR produces an extremely large error. Maybe there is some bug in the implementation, or is it the algorithm itself that is unable to deal with small cardinalities?

I was going to compare the speed as well, but I'm stopping here for now, since with this behavior for small cardinalities the switch is not possible anyway. Thanks.

@antirez (Contributor) commented Dec 16, 2016

Btw, this is the error of the old vs. new implementation over the 1-100k range (step of 1000).

[Graph: error of the old vs. new implementation over the 1-100k range]

@antirez (Contributor) commented Dec 16, 2016

Ok, never mind: I found the bug. It's a trivial rounding issue when converting the estimate to an integer:

@@ -1008,8 +1008,8 @@ uint64_t hllCount(struct hllhdr *hdr, int *invalid) {
                        0.03738027*pow(zl,5) +
                       -0.005384159*pow(zl,6) +
                        0.00042419*pow(zl,7);
-        E  = alpha*m*(m-ez)*(1/(E+beta));
+        E  = llroundl(alpha*m*(m-ez)*(1/(E+beta)));
     } else {
        /* Multiply the inverse of E by alpha_m * m^2 to get the raw estimate. */
         E = (1/E)*alpha*m*m;
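
For context on why this showed up only at small cardinalities: converting the floating-point estimate with a plain cast truncates toward zero, so an estimate of e.g. 3.94 was reported as 3, a 25% relative error when the true cardinality is 4; at large cardinalities the lost fractional part is negligible. llroundl() rounds to the nearest integer instead. A tiny standalone illustration (not part of the patch):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double est = 3.94;  /* hypothetical floating-point estimate */
        printf("truncated: %llu\n", (unsigned long long)est);  /* prints 3 */
        printf("rounded:   %lld\n", llroundl(est));            /* prints 4 */
        return 0;
    }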

Now checking whether it is worth it speed-wise.

@antirez (Contributor) commented Dec 16, 2016

Checked the speed (by disabling the cache). Basically the two implementations run at more or less exactly the same speed. I guess performance is simply dominated by the operation of summing the registers, which makes total sense if we look at the code. The code complexity is also very similar after all. However, after studying a few ranges with a smaller step in the graphs, I have the feeling that the loglog-beta error is generally smaller, so it may be worth switching to the implementation proposed in this PR. I'll generate a few "slow to generate" graphs here so that it is possible to make a more informed decision. Thanks.
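
For anyone who wants to reproduce the timing locally, here is a hypothetical standalone micro-benchmark of the O(m) register scan that both estimators share (the scan is the dominant cost; each estimator adds only a dozen or so floating-point operations after it):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        enum { M = 16384, ITERS = 100000 };
        static uint8_t regs[M];
        for (int j = 0; j < M; j++) regs[j] = rand() % 32;  /* fake registers */

        clock_t t0 = clock();
        double sink = 0;  /* accumulated so the loop is not optimized away */
        for (int i = 0; i < ITERS; i++) {
            double E = 0;  /* sum of 2^-reg, needed by old and new estimators */
            int ez = 0;    /* count of zero registers, needed by the new one */
            for (int j = 0; j < M; j++) {
                if (regs[j] == 0) ez++;
                E += 1.0/(double)(1ULL << regs[j]);
            }
            sink += E + ez;
        }
        double secs = (double)(clock() - t0)/CLOCKS_PER_SEC;
        printf("%.2f us per scan (sink=%g)\n", secs*1e6/ITERS, sink);
        return 0;
    }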

@antirez (Contributor) commented Dec 16, 2016

This is a detailed study of what happens for cardinalities between 1 and 10000, using 100 different samples per cardinality. The graph shows the absolute value of the averaged error. It is pretty clear that loglog-beta (labeled "new" in the graph) is superior. AFAIK the two algorithms will tend to show the same results for larger values, since both the old and the new algorithm tend to use the raw estimate asymptotically. I'll also run a test to verify that.

So if there are no surprises when studying the bigger values, I think it's worth switching to loglog-beta and removing the old code.

[Graph: absolute value of the averaged error, old vs. new, cardinalities 1-10000]

@antirez (Contributor) commented Dec 16, 2016

As expected, studying the range up to 10 million, the two algorithms are identical. So I'm switching to the code provided in this PR and removing the ability to use the old code completely. Thanks.

@antirez (Contributor) commented Dec 16, 2016

The PR was merged into both the unstable and 4.0 branches. The old code, and the configuration parameters to select between the two, were removed. Thanks!

@antirez closed this Dec 16, 2016
antirez added commits that referenced this pull request Dec 16, 2016 (in the unstable and 4.0 branches):
The new algorithm provides the same speed with a smaller error for
cardinalities in the range 0-100k. Before switching, the new and old
algorithm behavior was studied in detail in the context of
issue #3677. You can find a few graphs and motivations there.
@vharish836 deleted the loglog-beta branch on December 17, 2016
@vharish836 (Contributor, Author) commented:

Thanks.
