Join GitHub today
GitHub is home to over 31 million developers working together to host and review code, manage projects, and build software together.Sign up
HyperLogLog Value Format (uint32)
|<empty> (8bit) | rho (8bit) | reg_id (16bit)
HyperLogLog Vector (Dense/Sparse Interleaving) Format
- each hll output value in dense format takes 16384 bytes
- each hll output value in sparse format takes numNonZeroRegisters * 4 bytes
- total size of hll output vector: 16384 bytes * numDenseFormattedDims + 4 * numNonZeroRegisters * numSparseFormattedDims
- threshold for determining dense/sparse format: 4096 non zero registers since 4096 * 4 = 16384
- we only allocate the hll output vector after then end of last batch after we found out the number of non zero registers for each dimension
HyperLogLog Query Result Format
HyperLogLog query result format defines how we serialize/deserialize hyperloglog query results when client specify "Accept" as "application/hll". Its main purpose is to reduce the memory footprint of the result and let query client deserialize the results into a more meaningful format. The magic number can be used as a version number to support the new format.
Sparse values and dense values are interleaving with each other with a count vector to point to count of how many register IDs are in a single dim value row. If this count is less than 4kb, we think it's a sparse value. Otherwise, we will take it as a dense value.
Hyperloglog query results are serialized in the following format:
[uint32] magic_number [uint32] padding -----------query result 0------------------- <header> [uint32] query result 0 size [uint8] error or result [3 bytes padding] [uint8] num_enum_columns [uint8] bytes per dim ... [padding for 8 bytes] [uint32] result_size [uint32] raw_dim_values_vector_length [uint8] dim_index_0... [uint8] dim_index_n [padding for 8 bytes] [uint32] data_type_0...[uint32] data_type_n [padding for 8 bytes] <enum cases 0> [uint32_t] number of bytes of enum cases [uint16] column_index [2 bytes: padding] <enum values 0> delimited by "\u0000\n" [padding for 8 bytes] <end of header> <raw dim values vector> ... [padding for 8 byte alignment] <count vector> 2 bytes * result_size [padding for 8 byte alignment] <raw hll value vector (sparse and dense)> ... ------------error 1---------- [uint32] query result 1 size [uint8] error or result [3 bytes padding] ...
HyperLogLog Query Engine Aggregation Step
- ** Sort ** current batch using combined hash (higher 48bit from dimension value hash + 16bit reg_id from hll value)
- Reduce current batch using same hash produced from sort step as key and max as reduction function
- Merge previous batch results and current batch results using hash value (asce) + hll value (desc) as key
- (last batch only) Create hll vector
- copy dimension (same as regular aggregations)