LogLog-Beta Algorithm support within HLL #3677
Conversation
Config option to use LogLog-Beta Algorithm for Cardinality
Hello, this looks cool. I compared the two implementations with the intention of switching entirely to the new one, but I found that while the error is very similar for larger cardinalities, for smaller ones (1 to 12) the loglog-beta implementation in this PR produces an extremely large error. Maybe there is a bug in the implementation, or is it the algorithm per se that cannot handle small cardinalities? I was going to compare the speed as well, but I'm stopping here for now, since with this behavior for small cardinalities the switch is not possible anyway. Thanks.
Ok, never mind, I found the bug. It's a trivial thing: rounding to get the best result once the estimate is converted to an integer:

```diff
@@ -1008,8 +1008,8 @@ uint64_t hllCount(struct hllhdr *hdr, int *invalid) {
                  0.03738027*pow(zl,5) +
                  -0.005384159*pow(zl,6) +
                  0.00042419*pow(zl,7);
-        E  = alpha*m*(m-ez)*(1/(E+beta));
+        E  = llroundl(alpha*m*(m-ez)*(1/(E+beta)));
     } else {
         /* Muliply the inverse of E for alpha_m * m^2 to have the raw estimate. */
         E = (1/E)*alpha*m*m;
```

Now checking if it is worth it speed-wise.
Checked the speed (by disabling the cache): basically the two implementations run at more or less exactly the same speed. I guess the performance is simply dominated by the operation of summing the registers, which makes total sense if we look at the code. The code complexity is also very similar after all. However, after studying a few ranges with a smaller step in the graphs, I have the feeling that the loglog-beta error is generally smaller, so it may be worth switching to the implementation proposed in this PR. I'll generate a few "slow to generate" graphs here so that it is possible to make a more informed decision. Thanks.
This is a detailed study of what happens at cardinalities between 1 and 10000, using 100 different samples per cardinality. The graph shows the absolute value of the averaged error. It is pretty clear that loglog-beta (labeled "new" in the graph) is superior. AFAIK the two algorithms will tend to show the same results for larger values, since both the old and the new algorithm use the raw estimate asymptotically. I'll also run a test to verify that. So if there are no surprises when studying the bigger values, I think it is worth switching to loglog-beta and removing the old code.
As expected, studying the range up to 10 million, the two algorithms are identical. So I'm switching to the code provided in this PR, removing the ability to use the old code completely. Thanks.
The PR was merged both into the
The new algorithm provides the same speed with a smaller error for cardinalities in the range 0-100k. Before switching, the behavior of the new and old algorithms was studied in detail in the context of issue #3677. You can find a few graphs and motivations there.
Thanks.
LogLog-β is a new algorithm for estimating cardinalities based on LogLog counting. The new algorithm uses only one formula and needs no additional bias corrections for the entire range of cardinalities; it is therefore more efficient and simpler to implement.
Refer to https://arxiv.org/pdf/1612.02557.pdf for more details.
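For reference, the single estimator proposed in the paper (transcribed here to match the code in this PR) is:

```latex
\hat{n} = \frac{\alpha_m \, m \, (m - z)}{\beta_m(z) + \sum_{j=1}^{m} 2^{-M_j}}
```

where $m$ is the number of registers, $M_j$ are the register values, $z$ is the number of registers equal to zero, and $\beta_m(z)$ is a degree-7 polynomial in $\log(z+1)$ with coefficients fitted per register count $m$; $\alpha_m$ is the usual HyperLogLog constant, $0.7213/(1 + 1.079/m)$ in the code.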
This changeset adds support for computing cardinality based on the loglog-beta algorithm, along with a config option to enable it.
I have tested the changes using the PFSELFTEST command with the new config option set.