
LogLog-Beta Algorithm support within HLL #3677

Closed · vharish836 wants to merge 1 commit

Conversation

@vharish836 (Contributor) commented Dec 9, 2016

LogLog-β is a new algorithm for estimating cardinalities, based on LogLog counting. The new algorithm uses only one formula and needs no additional bias corrections over the entire range of cardinalities; it is therefore more efficient and simpler to implement.
Refer to https://arxiv.org/pdf/1612.02557.pdf for more details.
This changeset adds support for computing the cardinality with the LogLog-β algorithm, plus a config option to enable it.
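
For reference, the single formula is E = alpha_m * m * (m - z) / (beta(z) + sum_j 2^(-M[j])), where z is the number of zero-valued registers and beta is a degree-7 polynomial in ln(z+1). Below is a minimal standalone sketch of the estimator; it is an illustration rather than the patch itself: the function name is hypothetical, and the beta coefficients are the paper's values for m = 16384 registers (the size Redis uses).

    #include <math.h>
    #include <stdint.h>

    /* Sketch of the LogLog-beta estimate over m registers M[0..m-1].
     * "ez" counts zero-valued registers; "beta" is the paper's degree-7
     * polynomial in zl = ln(ez+1), with coefficients for m = 16384. */
    double loglogbeta_estimate(const uint8_t *M, int m) {
        double E = 0;   /* sum of 2^-M[j], the usual harmonic term */
        int ez = 0;     /* number of registers still at zero */
        for (int j = 0; j < m; j++) {
            if (M[j] == 0) ez++;
            E += 1.0/(double)(1ULL << M[j]);
        }
        double zl = log(ez + 1);
        double beta = -0.370393911*ez +
                       0.070471823*zl +
                       0.17393686*pow(zl,2) +
                       0.16339839*pow(zl,3) +
                      -0.09237745*pow(zl,4) +
                       0.03738027*pow(zl,5) +
                      -0.005384159*pow(zl,6) +
                       0.00042419*pow(zl,7);
        double alpha = 0.7213/(1+1.079/m);  /* usual alpha_m approximation */
        return alpha*m*(m-ez)*(1/(E+beta));
    }

Note that this single expression covers the whole range: there is no switch to linear counting and no bias-correction table.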

I have tested the changes using the PFSELFTEST command with the new config option set.

Config option to use LogLog-Beta Algorithm for Cardinality
@antirez (Contributor) commented Dec 16, 2016

Hello, this looks cool, and I compared the two implementations in order to switch entirely to the new one. However, I found that while the error is very similar for larger cardinalities, for smaller ones (1 to 12) the loglog-beta implementation in this PR produces an extremely large error. Maybe there is some bug in the implementation, or is it the algorithm itself that is unable to deal with small cardinalities?

I was going to compare the speed as well, but I'm stopping here for now, since with this behavior for small cardinalities the switch is not possible anyway. Thanks.

@antirez (Contributor) commented Dec 16, 2016

Btw, this is the error of the old vs. new implementation over the 1-100k range (step of 1000).

[Graph: error of the old vs. new implementation over the 1-100k range]

@antirez (Contributor) commented Dec 16, 2016

Ok, never mind: I found the bug. It's a trivial rounding issue when converting the estimate to an integer:

@@ -1008,8 +1008,8 @@ uint64_t hllCount(struct hllhdr *hdr, int *invalid) {
                        0.03738027*pow(zl,5) +
                       -0.005384159*pow(zl,6) +
                        0.00042419*pow(zl,7);
-        E  = alpha*m*(m-ez)*(1/(E+beta));
+        E  = llroundl(alpha*m*(m-ez)*(1/(E+beta)));
     } else {
        /* Multiply the inverse of E by alpha_m * m^2 to get the raw estimate. */
         E = (1/E)*alpha*m*m;
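
For context on why this showed up only at small cardinalities: converting the floating-point estimate with a plain cast truncates toward zero, so an estimate of e.g. 3.94 was reported as 3, a 25% relative error when the true cardinality is 4; at large cardinalities the lost fractional part is negligible. llroundl() rounds to the nearest integer instead. A tiny standalone illustration (not part of the patch):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double est = 3.94;  /* hypothetical floating-point estimate */
        printf("truncated: %llu\n", (unsigned long long)est);  /* prints 3 */
        printf("rounded:   %lld\n", llroundl(est));            /* prints 4 */
        return 0;
    }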

Now checking whether it is worth it speed-wise.

@antirez (Contributor) commented Dec 16, 2016

Checked the speed (by disabling the cache). Basically the two implementations run at more or less exactly the same speed. I guess performance is simply dominated by the operation of summing the registers, which makes total sense if we look at the code. The code complexity is also very similar after all. However, after studying a few ranges with a smaller step in the graphs, I have the feeling that the loglog-beta error is generally smaller, so it may be worth switching to the implementation proposed in this PR. I'll generate a few "slow to generate" graphs here so that it is possible to make a more informed decision. Thanks.
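
For anyone who wants to reproduce the timing locally, here is a hypothetical standalone micro-benchmark of the O(m) register scan that both estimators share (the scan is the dominant cost; each estimator adds only a dozen or so floating-point operations after it):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        enum { M = 16384, ITERS = 100000 };
        static uint8_t regs[M];
        for (int j = 0; j < M; j++) regs[j] = rand() % 32;  /* fake registers */

        clock_t t0 = clock();
        double sink = 0;  /* accumulated so the loop is not optimized away */
        for (int i = 0; i < ITERS; i++) {
            double E = 0;  /* sum of 2^-reg, needed by old and new estimators */
            int ez = 0;    /* count of zero registers, needed by the new one */
            for (int j = 0; j < M; j++) {
                if (regs[j] == 0) ez++;
                E += 1.0/(double)(1ULL << regs[j]);
            }
            sink += E + ez;
        }
        double secs = (double)(clock() - t0)/CLOCKS_PER_SEC;
        printf("%.2f us per scan (sink=%g)\n", secs*1e6/ITERS, sink);
        return 0;
    }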

@antirez (Contributor) commented Dec 16, 2016

This is a detailed study of what happens for cardinalities between 1 and 10000, using 100 different samples per cardinality. The graph shows the absolute value of the averaged error. It is pretty clear that loglog-beta (labeled "new" in the graph) is superior. AFAIK the two algorithms will tend to show the same results for larger values, since both the old and the new algorithm tend to use the raw estimate asymptotically. I'll also run a test to verify that.

So if there are no surprises when studying the bigger values, I think it's worth switching to loglog-beta and removing the old code.

[Graph: absolute value of the averaged error, old vs. new, cardinalities 1-10000]

@antirez (Contributor) commented Dec 16, 2016

As expected, studying the range up to 10 million, the two algorithms are identical. So I'm switching to the code provided in this PR and removing the ability to use the old code completely. Thanks.

@antirez (Contributor) commented Dec 16, 2016

The PR was merged into both the unstable and 4.0 branches. The old code, and the configuration parameters to select between the two, were removed. Thanks!

@antirez closed this Dec 16, 2016
antirez added commits that referenced this pull request Dec 16, 2016 (in the unstable and 4.0 branches):
The new algorithm provides the same speed with a smaller error for
cardinalities in the range 0-100k. Before switching, the new and old
algorithm behavior was studied in detail in the context of
issue #3677. You can find a few graphs and motivations there.
@vharish836 deleted the loglog-beta branch on December 17, 2016
@vharish836 (Contributor, Author) commented:

Thanks.
