Fix uniqHLL12 and uniqCombined for cardinalities 100M+. #1844

Merged
merged 1 commit into yandex:master Feb 8, 2018

Conversation

@bocharov
Contributor

bocharov commented Jan 31, 2018

Proposed changes

  • Changed uniqHLL12 size() return type from UInt32 to UInt64 to prevent overflow (this shouldn't break backward compatibility).
  • Removed the "big cardinalities fix" for cardinalities >2^32/30, as it was very inaccurate, and for estimates >2^32 it was taking the log of a negative number, which is NaN and was cast to 0.
  • Added a Python script to show that intHash32 is not a good choice for the Linear Counting branch of the HyperLogLog algorithm.
  • Added a bash script to test uniq, uniqHLL12 and uniqCombined on different set cardinalities.
  • Updated the documentation of the uniq* aggregate functions to recommend uniq over uniqHLL12 and uniqCombined.

Reasoning
At Cloudflare we see 4B+ monthly unique visitors (based on either client IPv4 or IPv6). We recently switched our Analytics API and BI tools to use ClickHouse instead of CitusDB (basically a sharded Postgres).

The problem we've faced is that for big cardinalities (100M+ elements) the uniqHLL12 function returns rubbish, either 0 or a very inaccurate result, which is unacceptable for our large customers who have many unique visitors, and for our internal BI reports which aggregate total unique visitors across all customers' websites. Please see this Google Spreadsheet with our test results for the different uniq functions, generated with the test_uniq_functions.sh script. Although it only tests the uniq functions on sets of numbers from 1 to N, I think that's still a valid choice, as real-world data such as IPv4 addresses are basically UInt32 numbers from 0 to 2^32 (minus some excluded private ranges).

There are a couple of observations from the data in the spreadsheet:

  1. In the range up to 10K elements, uniqHLL12 has a really high error, up to 12.6%.
  2. Starting from 200M elements, uniqHLL12 and uniqCombined start to misbehave and the error grows quite fast, while uniq performs well.
  3. Starting from 3B elements things get really bad, and the error of uniqHLL12 and uniqCombined becomes unacceptable.
  4. Starting from 6B elements, uniqHLL12 and uniqCombined go completely nuts and return 0.
  5. uniq behaves well in all tested cases (except 50B) and its error stays under 1.09%.

Observation 1 can be explained by the poor quality of the intHash32 function, which is used for both the Linear Counting and HyperLogLog branches:

./dbms/scripts/test_intHash32_for_linear_counting.py 1 10000
Left 12 bits error: min=0.000000 max=2.895610 avg=0.849154 median=0.642690 median_low=0.642539 median_high=0.642842
Right 12 bits error: min=0.000000 max=13.200096 avg=8.233538 median=7.844685 median_low=7.844444 median_high=7.844926

For cardinalities up to 4096*2.5 = 10240 elements, Linear Counting is used, which basically counts the number of zero buckets and then uses a simple formula to estimate the number of unique elements. But the rightmost 12 bits of intHash32 are much less random than the leftmost 12 bits, which explains the inaccuracy. The original Google HyperLogLog paper uses LSB 0 bit numbering and takes the leftmost 12 bits to choose the bucket number, while the ClickHouse uniqHLL12 implementation uses the rightmost 12 bits. In general it shouldn't matter, as long as the hash function satisfies the paper's assumption of producing uniform hash values. I understand that changing the bit order or the hash function would break backward compatibility of uniqHLL12, which is why I don't change that code. However, I have tested changing the bit order, see the "Change bit orders + remove big card fix + UInt64" tab in the spreadsheet, and it gives much better results than the default version. The error of Linear Counting is supposed to be 1.30 / sqrt(4096) = 0.0203125, or ~2%, but because of the hash function choice the actual error is much higher. The error of raw HyperLogLog is supposed to be 1.04 / sqrt(4096) = 0.01625, or 1.625%.
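For reference, here is a minimal sketch of the Linear Counting branch (an idealized model with a stand-in uniform hash, not ClickHouse's intHash32, which is not reproduced here): with m = 4096 buckets, the estimate is m * ln(m / V), where V is the number of empty buckets.

    # Idealized Linear Counting sketch (stand-in hash, not ClickHouse's intHash32).
    import hashlib
    import math

    def linear_counting_estimate(values, m=4096):
        buckets = [0] * m
        for v in values:
            # Stand-in for a uniform 32-bit hash; the bucket is taken from the low 12 bits.
            h = int.from_bytes(hashlib.md5(str(v).encode()).digest()[:4], 'little')
            buckets[h & (m - 1)] = 1
        empty = m - sum(buckets)
        if empty == 0:
            return None  # Linear Counting no longer applicable; the HLL branch takes over
        return m * math.log(m / empty)

    true_n = 10000
    est = linear_counting_estimate(range(1, true_n + 1))
    # Expected relative error with a good hash: about 1.30 / sqrt(4096) ≈ 2%.
    print(est, abs(est - true_n) / true_n)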

Observation 2 can be explained by the large range correction in the HyperLogLog algorithm, which kicks in for cardinalities above 2^32/30 = 143M. The formula with the logarithm is supposed to correct the estimate, but again, because of the intHash32 hash function, it may not work as it should. I've tested intHash32 for collisions in the range from 1 to 2^32: 2,715,106,849 elements out of 4,294,967,296 have at least one other value with the same hash, i.e. 63.2% of all values.

Observation 3 can be explained by the same large range correction, which can produce estimates above 2^32; since the size was returned as UInt32, the value overflowed when the double was cast to UInt32.

Observation 4 can also be explained by the large range correction: when the raw estimate exceeds 2^32, the logarithm of a negative number is calculated, which is NaN and is cast to 0:
fixed_estimate = -pow2_32 * log(1.0 - raw_estimate / pow2_32);
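As a small numeric illustration of observations 3 and 4 (a Python sketch of the behaviour described above, not the actual C++ code path): once fixed_estimate exceeds 2^32 the UInt32 result wraps around, and once raw_estimate itself exceeds 2^32 the logarithm gets a negative argument and the NaN ends up as 0.

    # Sketch of how the large range correction misbehaves (observations 3 and 4).
    import math

    pow2_32 = 2.0 ** 32

    def large_range_correction(raw_estimate):
        arg = 1.0 - raw_estimate / pow2_32
        if arg <= 0.0:
            return float('nan')  # C's log() gives NaN/-inf here; Python's math.log would raise
        return -pow2_32 * math.log(arg)

    def to_uint32(x):
        # Models the old double -> UInt32 size(); behaviour as observed in the description above.
        if math.isnan(x):
            return 0                   # NaN ends up as 0 (observation 4)
        return int(x) % (1 << 32)      # values above 2^32 wrap around (observation 3)

    for raw_estimate in (3e9, 4.2e9, 4.5e9):
        fixed_estimate = large_range_correction(raw_estimate)
        print(raw_estimate, fixed_estimate, to_uint32(fixed_estimate))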

Observation 5 can be explained by the fact that the uniq function doesn't use the HyperLogLog algorithm, though at cardinalities over 50B it hits some sort of overflow whose cause I haven't figured out yet. Overall, however, it has quite good accuracy and performance for real-life applications. The size of its state, though, can be quite high; the maximum I saw was near 300KiB, while uniqHLL12 always consumes just 2.5KiB.

I've also tried to improve the HyperLogLog estimate using one of the published improvements, LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting, with a 7th order polynomial, but it didn't improve accuracy for either small or big set cardinalities, again, I think, because of the hash function choice.

Future
In the future it would be really nice to have a new uniq function, uniqHLL(precision = 14, regwidth = 6)(X), which would give the user the flexibility to choose parameters for the targeted state size and which would use a good hash function (perhaps MurmurHash3_x64_128), as many other implementations do. One of the good properties of the HLL structure is that it has a constant memory size, which is quite low (the Redis implementation uses 12KiB) compared to the uniq function's state, and it also compresses well. Of course, the new uniqHLL function's accuracy should be tested well on different data sets, with small and really huge cardinalities.
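For a rough sense of the state sizes involved (a back-of-the-envelope sketch; the parameters are just the hypothetical ones proposed above): a dense HLL state needs 2^precision registers of regwidth bits each.

    # Back-of-the-envelope dense HLL state size: 2^precision registers of regwidth bits each.
    def hll_state_bytes(precision, regwidth):
        return (2 ** precision) * regwidth // 8

    print(hll_state_bytes(12, 5))   # 2560 bytes = 2.5 KiB, the uniqHLL12 state size mentioned above
    print(hll_state_bytes(14, 6))   # 12288 bytes = 12 KiB, matching the Redis figure above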
I really like these 2 implementations:


@alexey-milovidov alexey-milovidov merged commit b7d0ae4 into yandex:master Feb 8, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
@alexey-milovidov


Member

alexey-milovidov commented Feb 22, 2018

IntHash32 is definitely a bad choice for HyperLogLog.

In fact, the current implementation of HLL came into ClickHouse from another project, where we have to manage a large number of very tiny (tens of bytes) HLLs for antifraud calculations. In that project, precision doesn't really matter. And we have to use the same hash function in ClickHouse for compatibility of the internal states.

We can change it to IntHash64. It is both faster and has better quality. BTW, IntHash64 is the MurmurHash finalizer, and if you use MurmurHash for numbers of 64 bits or less, it should be equivalent in quality. But this will require changing the name of the function because of the incompatibility. We can name it uniqEnhanced instead of uniqCombined and deprecate the uniqCombined function.
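(For reference, the 64-bit MurmurHash3 finalizer mentioned here looks like the sketch below; whether IntHash64 matches it bit for bit should be checked against the ClickHouse sources.)

    # Sketch of the MurmurHash3 64-bit finalizer (fmix64), which IntHash64 is said to be based on.
    MASK64 = (1 << 64) - 1

    def fmix64(x):
        x ^= x >> 33
        x = (x * 0xff51afd7ed558ccd) & MASK64
        x ^= x >> 33
        x = (x * 0xc4ceb9fe1a85ec53) & MASK64
        x ^= x >> 33
        return x

    # Nearby inputs map to very different, well-mixed outputs.
    print(hex(fmix64(1)), hex(fmix64(2)))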

uniqHLL12 should also be deprecated because it is not adaptive. When you generate a report with many rows, you usually have a power law distribution of cardinalities, and it is wasteful to spend 2.5 KB on each row.

@bocharov


Contributor

bocharov commented Feb 23, 2018

@alexey-milovidov thanks for the explanation!

While we're at it, can you please explain why the uniq function starts breaking at cardinalities of ~50B? At our scale we're still quite far from reaching that order, but it would be nice to fix the issue in advance.

As for uniqEnhanced, it sounds good to have it. If it uses the HLL data structure for its implementation, we could name it uniqHLL and give it parameters to specify the desired precision and regwidth, which determine the state size.

@alexey-milovidov


Member

alexey-milovidov commented Feb 23, 2018

uniq also uses a 32-bit hash function.
This is reasonable, because it stores the hash values directly (much less memory efficient than HyperLogLog). When switching from a 32-bit to a 64-bit hash function, the size of the uniq state doubles (in contrast, the size of the HLL grows by just 20%).
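(Roughly, and ignoring container overhead, the growth compares like this; a sketch, assuming uniq keeps one stored hash per distinct element and the HLL register width goes from 5 to 6 bits to count leading zeros of a 64-bit hash.)

    # Rough state sizes illustrating the 2x vs ~20% growth (container overhead ignored).
    def uniq_state_bytes(distinct, hash_bytes):
        return distinct * hash_bytes             # one stored hash value per distinct element

    def hll_state_bytes(precision, regwidth):
        return (2 ** precision) * regwidth // 8  # fixed number of registers

    print(uniq_state_bytes(100000, 4), uniq_state_bytes(100000, 8))  # 400000 -> 800000, x2
    print(hll_state_bytes(12, 5), hll_state_bytes(12, 6))            # 2560 -> 3072, +20%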

It still works for cardinalities larger than 2^32.
When you hash N values, you get M different hashes, where M <= N
(N from [0 .. +inf) maps to just [0 .. 2^32-1]).
If we have M (the number of different hashes), we can estimate N (the true cardinality) from that value:

        /** Correction of a systematic error due to collisions during hashing into UInt32.
          * The `fixed_res(res)` formula answers the question:
          * for how many distinct elements (fixed_res), randomly scattered
          * across 2^32 buckets, do we get res filled buckets on average.
          */
        size_t p32 = 1ULL << 32;
        size_t fixed_res = round(p32 * (log(p32) - log(p32 - res)));
        return fixed_res;

But when N is much larger than 2^32, the estimate becomes poor.

(How exactly is it poor? Can we improve? We need to calculate variance of the estimate...)
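(One way to see it, as a sketch: the expected number of distinct 32-bit hashes for a true cardinality N is M ≈ m·(1 − e^(−N/m)) with m = 2^32, and the code above inverts this to N ≈ −m·ln(1 − M/m). As N grows far beyond m, M saturates near m and the inverse becomes extremely sensitive to small fluctuations in M, so the variance of the estimate explodes.)

    # Sketch: forward model M(N) and sensitivity of the inverted estimate from the code above.
    import math

    m = 2.0 ** 32

    def expected_distinct_hashes(n):
        return m * (1.0 - math.exp(-n / m))   # expected M for true cardinality N

    def estimate_from_hashes(M):
        return -m * math.log(1.0 - M / m)     # equivalent to the fixed_res formula above

    for n in (1e9, 4e9, 16e9, 50e9):
        M = expected_distinct_hashes(n)
        sensitivity = m / (m - M)             # dN/dM: estimate shift per one extra/missing hash
        print(int(n), int(M), int(estimate_from_hashes(M)), round(sensitivity))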

Note: as you mentioned, we were using a similar formula in HLL to correct for hash collisions, but there it was used incorrectly, while in uniq it is correct.
