-
Notifications
You must be signed in to change notification settings - Fork 233
HyperLogLog
Jeremy H. Shi edited this page Jul 17, 2019
·
4 revisions
|<empty> (8bit) | rho (8bit) | reg_id (16bit)
- each hll output value in dense format takes 16384 bytes
- each hll output value in sparse format takes numNonZeroRegisters * 4 bytes
- total size of hll output vector: 16384 bytes * numDenseFormattedDims + 4 * numNonZeroRegisters * numSparseFormattedDims
- threshold for determining dense/sparse format: 4096 non zero registers since 4096 * 4 = 16384
- we only allocate the hll output vector after then end of last batch after we found out the number of non zero registers for each dimension
HyperLogLog query result format defines how we serialize/deserialize hyperloglog query results when client specify "Accept" as "application/hll". Its main purpose is to reduce the memory footprint of the result and let query client deserialize the results into a more meaningful format. The magic number can be used as a version number to support the new format.
Sparse values and dense values are interleaving with each other with a count vector to point to count of how many register IDs are in a single dim value row. If this count is less than 4kb, we think it's a sparse value. Otherwise, we will take it as a dense value.
Hyperloglog query results are serialized in the following format:
[uint32] magic_number [uint32] padding
-----------query result 0-------------------
<header>
[uint32] query result 0 size [uint8] error or result [3 bytes padding]
[uint8] num_enum_columns [uint8] num dims per width ... [padding for 8 bytes]
[uint32] result_size [uint32] raw_dim_values_vector_length
[uint8] dim_index_0... [uint8] dim_index_n [padding for 8 bytes]
[uint32] data_type_0...[uint32] data_type_n [padding for 8 bytes]
<enum cases 0>
[uint32_t] number of bytes of enum cases [uint16] column_index [2 bytes: padding]
<enum values 0> delimited by "\u0000\n" [padding for 8 bytes]
<end of header>
<raw dim values vector>
...
[padding for 8 byte alignment]
<count vector> 2 bytes * result_size
[padding for 8 byte alignment]
<raw hll value vector (sparse and dense)>
...
------------error 1----------
[uint32] query result 1 size [uint8] error or result [3 bytes padding]
...
- ** Sort ** current batch using combined hash (higher 48bit from dimension value hash + 16bit reg_id from hll value)
- Reduce current batch using same hash produced from sort step as key and max as reduction function
- Merge previous batch results and current batch results using hash value (asce) + hll value (desc) as key
- (last batch only) Create hll vector
- copy dimension (same as regular aggregations)