Fix uniqHLL12 and uniqCombined for cardinalities 100M+. #1844
The problem we've faced is that for big cardinalities (100M+ elements) the uniqHLL12 function returns rubbish - either 0 or a very inaccurate result, which is unacceptable for our large customers who have many unique visitors, or for our internal BI reports which compute total unique visitors across all customers' websites. Please see this Google Spreadsheet with our test results for the different uniq functions, which is generated with
There are a couple of observations from the data in the spreadsheet:
Observation 1 can be explained by the poor quality of the intHash32 function, which is used both for the Linear Counting and the HyperLogLog algorithm:
For cardinalities up to 4096*2.5 = 10240 elements, Linear Counting is used, which basically counts the number of zero buckets and then uses a simple formula to estimate the number of unique elements. But the rightmost 12 bits of intHash32 are much less random than the leftmost 12 bits, which explains the inaccuracy. In the original Google HyperLogLog paper, LSB 0 bit numbering is used and the leftmost 12 bits are used to choose the bucket number, while in the ClickHouse uniqHLL12 implementation the rightmost 12 bits are used. In general, this shouldn't matter as long as the hash function satisfies the same assumption as in the paper - that it produces uniform hash values. I understand that changing the bit order or changing the hash function will break backward compatibility of the uniqHLL12 function, and that's why I do not change that code. However, I've tested changing the bit order (see the "Change bit orders + remove big card fix + UInt64" tab in the spreadsheet), and it provides much better results than the default version. The error of Linear Counting is supposed to be
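For reference, a minimal sketch (not the actual ClickHouse code) of the two pieces discussed above - taking the 12-bit bucket index from the low vs. the high bits of the hash, and the Linear Counting estimate m * ln(m / V), where V is the number of empty buckets:

```cpp
#include <cmath>
#include <cstdint>

constexpr uint32_t bucket_count = 4096; // 2^12 buckets, as in uniqHLL12

// Bucket index from the rightmost 12 bits (what uniqHLL12 effectively does).
uint32_t bucketFromLowBits(uint32_t hash) { return hash & (bucket_count - 1); }

// Bucket index from the leftmost 12 bits (what the paper describes).
uint32_t bucketFromHighBits(uint32_t hash) { return hash >> (32 - 12); }

// Linear Counting estimate: m * ln(m / V), where V is the number of zero buckets.
double linearCountingEstimate(uint32_t zero_buckets)
{
    return bucket_count * std::log(double(bucket_count) / zero_buckets);
}
```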
Observation 2 can be explained by the large range correction in the HyperLogLog algorithm, which kicks in for cardinalities above 2^32/30 = 143M. The formula with the logarithm is supposed to correct the estimate, but again, because of the intHash32 hash function, it may not work as it should. I've tested intHash32 for collisions in the range from 1 to 2^32 - in total, 2,715,106,849 values out of 4,294,967,296 have at least one other value with the same hash, which is 63.2% of all.
Observation 3 can be explained by the same large range correction, which can produce estimates above 2^32 after it is applied; since we use the UInt32 type for the size, the value overflows when the double is cast to UInt32.
Observation 4 can be explained by the large range correction as well: when the raw estimate is more than 2^32, the logarithm of a negative number is calculated, which is NaN and gets cast to 0:
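To illustrate observations 2-4 together, here is a hedged sketch (not the actual ClickHouse code) of the large range correction as I read it, E* = -2^32 * ln(1 - E / 2^32), and of how it produces both failure modes:

```cpp
#include <cmath>
#include <cstdint>

// Raw (uncorrected) HLL estimate goes in, "corrected" UInt32 result comes out.
uint32_t correctedEstimate(double raw_estimate)
{
    constexpr double pow2_32 = 4294967296.0; // 2^32

    double corrected = raw_estimate;
    if (raw_estimate > pow2_32 / 30.0)
    {
        // Large range correction. If raw_estimate > 2^32, the argument of the
        // logarithm is negative and the result is NaN (observation 4);
        // casting NaN to UInt32 typically yields 0.
        corrected = -pow2_32 * std::log(1.0 - raw_estimate / pow2_32);
    }

    // Even when the logarithm is defined, the corrected value can exceed 2^32,
    // and this cast to UInt32 overflows (observation 3).
    return static_cast<uint32_t>(corrected);
}
```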
Observation 5 can be explained by the fact that the uniq function doesn't use the HyperLogLog algorithm, though at cardinalities over 50B it hits some sort of overflow, the reason for which I haven't figured out yet. However, overall it has quite good accuracy and performance for real-life applications. The size of its state, though, can be quite high - the maximum I saw was near 300 KiB, while uniqHLL12 always consumes just 2.5 KiB.
I've also tried to improve the HyperLogLog estimate by using one of the published improvements, "LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting", with a 7th order polynomial, but it didn't improve accuracy for either small or big set cardinalities - again, I think, because of the hash function choice.
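Roughly, the LogLog-Beta estimator from that paper has the following shape. This is only my reconstruction for illustration: the coefficients b[] are placeholders (the paper fits them empirically per bucket count), not the values actually tested.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

constexpr size_t m = 4096; // bucket count, as in uniqHLL12

// Bias-correction polynomial: beta(z) = b0*z + b1*zl + b2*zl^2 + ... + b7*zl^7,
// where z is the number of zero registers and zl = ln(z + 1).
// Coefficients are placeholders; the paper derives them by fitting.
double beta(double z, const std::array<double, 8> & b)
{
    double zl = std::log(z + 1.0);
    double result = b[0] * z;
    double p = 1.0;
    for (size_t k = 1; k < b.size(); ++k)
    {
        p *= zl;
        result += b[k] * p;
    }
    return result;
}

// registers[j] = rank (position of the first 1-bit) stored in bucket j.
double logLogBetaEstimate(const std::array<uint8_t, m> & registers, const std::array<double, 8> & b)
{
    double sum = 0.0;
    double z = 0.0; // number of zero registers
    for (uint8_t rank : registers)
    {
        sum += std::ldexp(1.0, -int(rank)); // 2^{-rank}
        if (rank == 0)
            ++z;
    }
    const double alpha = 0.7213 / (1.0 + 1.079 / m); // usual alpha_m approximation
    return alpha * m * (m - z) / (beta(z, b) + sum);
}
```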
In fact, the current implementation of HLL came into ClickHouse from another project, where we have to manage a large number of very tiny (tens of bytes) HLLs for antifraud calculations. In that project, precision doesn't really matter. And we have to use the same hash function in ClickHouse for compatibility of internal states.
We can change it to IntHash64. It is both faster and has better quality. BTW, IntHash64 is the MurmurHash finalizer. And if you use MurmurHash for numbers of 64 bits or less, it should be equivalent in quality. But this will require changing the name of the function because of the incompatibility. We can name it
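For reference, the 64-bit MurmurHash3 finalizer (fmix64) mentioned above looks like the sketch below; IntHash64 should follow the same scheme, though this is written from memory rather than copied from the repository.

```cpp
#include <cstdint>

// 64-bit MurmurHash3 finalizer (fmix64): xor-shift / multiply mixing steps.
inline uint64_t intHash64(uint64_t x)
{
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}
```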
@alexey-milovidov thanks for the explanation!
While we're at it, can you please explain why
It still works for cardinalities larger than 2^32.
But when N is much larger than 2^32, the estimate becomes poor.
(How exactly is it poor? Can we improve it? We need to calculate the variance of the estimate...)
Note: as you mentioned, we were using a similar formula in HLL to correct for hash collisions, but it was used incorrectly. While in