New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[YSQL] Support HyperLogLog extension #16092
Labels
area/ysql
Yugabyte SQL (YSQL)
kind/enhancement
This is an enhancement of an existing feature
priority/medium
Medium priority issue
Comments
timothy-e
added
area/ysql
Yugabyte SQL (YSQL)
status/awaiting-triage
Issue awaiting triage
labels
Feb 13, 2023
yugabyte-ci
added
kind/bug
This issue is a bug
priority/medium
Medium priority issue
and removed
status/awaiting-triage
Issue awaiting triage
labels
Feb 13, 2023
yugabyte-ci
added
kind/enhancement
This is an enhancement of an existing feature
and removed
kind/bug
This issue is a bug
labels
Mar 3, 2023
timothy-e
added a commit
that referenced
this issue
Mar 7, 2023
Summary: > This Postgres module introduces a new data type `hll` which is a HyperLogLog data structure. HyperLogLog is a fixed-size, set-like structure used for distinct value counting with tunable precision. For example, in 1280 bytes `hll` can estimate the count of tens of billions of distinct values with only a few percent errors. This commit imports the postgresql-hll extension from https://github.com/citusdata/postgresql-hll (upstream commit 8b9821c) **without any changes**. A follow up diff (D23122) will integrate it with Yugabyte. = HLL Information = To make sure that we fully understand the HLL type, I read through the extension code to grasp exactly what might not work with Yugabyte. == How is the HLL type stored on disk? == The type is stored as binary and read into a struct. HLLs are a cool data structure, and I recommended reading more about them in detail. Fundamentally, the idea is: * The more contiguous zeros in a hash code, the less likely that hashcode is to occur * Receiving an object with a hashcode that begins with 10 zeros has 2^-10 odds of occurring, so if we receive that, we are probably on our 2^10th unique object received. * This value is very susceptible to random chance, so instead of a large number of counters, and each counter deals with a specific subset of possible hashcodes. We can then average them out to handle unlikely values. * **A hyperloglog data structure is just a fixed number of counters** This specific HLL implementation has made a few changes to optimize them for smaller count use cases. These are briefly explained below as I explain how the data is stored. It is stored as an array of bytes, retrieved with `PG_GETARG_BYTEA_P`. === Header === The first 3 bytes make up the header. The header is 6 fields, each given 4 bits: * version (1) * type (invalid, empty, explicit, sparse, compressed) * the 4 HLL parameters (nbits, log2nregs, expthresh, sparseon) The next bytes depend on the type of the HLL node. === Empty / Undefined === The HLL is stored as just a header. === Explicit === From the HLL docs: > we could just keep an explicit set of the inputs as a sorted list of 64-bit integers until we hit the 161st distinct value. This would give us the true representation of the distinct values in the stream while requiring the same amount of memory. The message consists of n elements of 8-byte chunks, where each 8-byte chunk is a 64-bit integer. === Compressed === This is the typical implementation of the HyperLogLog algorithm. There are 2 ^ `log2nregs` = `nregs` registers, each of size `nbits`. These counters are all packed in sequentially (where the `i`th register is at memory location `header_size + i * nbits`. Here, if the counts of each register `i` are given by `c[i]`, our memory looks like ``` _______ _______ _______ _______ _______ _______ _______ _______ | c[0] | c[1] | c[2] | c[3] | c[4] | ... |c[2046]|c[2047]| +++++++ +++++++ +++++++ +++++++ +++++++ +++++++ +++++++ +++++++ ^ each block is nbits bits ``` where each block is `nbits` wide. Padding is added at the end to make the compressed format a multiple of 8. === Sparse === From the HLL docs: > we could simply represent the set of registers that had non-zero values as a map from index to values. This map is stored as a list of index-value pairs that are bit-packed "short words" of length log2m + regwidth If we have a lot of registers with count 0, it doesn't make sense to use the full size of the compressed notation. Instead, we can write out the register index and its count. Here, if registers `r_0, r_1, r_2, r_3, ...` have non-zero counts, our memory looks like: ``` ______________ ________ ______________ ________ ______________ ________ | r_0 | c[r_0] | r_1 | c[r_1] | r_2 | c[r_2] | ... ++++++++++++++ ******** ++++++++++++++ ******** ++++++++++++++ ******** ^ each even block is log2nregs bits ^ each odd block is nbits bits ``` Padding is added at the end to make the sparse format a multiple of 8. == What operations exist on the HLL type? == There are only equality and non-equality operators. They work by directly comparing the binary value: first comparing the length, and then comparing using `memcmp`. This comparison will work in DocDB as well. Adding a value to an HLL requires first unpacking it into the internal C type `multiset_t`, adding the value (which may promote the HLL type), and then packing it back into the bytearray, so it's guaranteed that two HLLs with the same counts are equivalent when comparing the binary. (And HLL has thousands of test cases to verify this). == How does `set_hll_defaults` work on a multi-node system? == It works the same as it does on a single-node system. From the HLL docs: > The defaults can be changed (per connection) with: > ``` > lang=sql > SELECT hll_set_defaults(log2m, regwidth, expthresh, sparseon); > ``` The values are stored as static ints in the extension code and not persisted anywhere else. This means that even in Postgres, two simultaneous connections to the same database can have different HLL defaults. When we create an HLL, we can specify the parameters or rely on the default values. ``` lang=sql hll_empty([log2m[, regwidth[, expthresh[, sparseon]]]]) hll_add_agg(hll_hashval, [log2m[, regwidth[, expthresh[, sparseon]]]]) ``` Whatever parameters are not specified will use the default values, whatever is set by `set_hll_defaults`. The parameters are stored in the HLL object at creation. Changing the default values has no impact on an existing HLL, because we know what properties it was created with. For example, we can have multiple HLLs and their parameters depend only on what the parameters were set to at creation: ``` lang=sql yugabyte=# CREATE EXTENSION hll; CREATE EXTENSION yugabyte=# CREATE TABLE test(h hll); CREATE TABLE yugabyte=# INSERT INTO test VALUES (hll_empty()); -- created with default (11,5,-1,1) INSERT 0 1 yugabyte=# SELECT hll_set_defaults(10, 6, -1, 1); hll_set_defaults ------------------ (11,5,-1,1) (1 row) yugabyte=# INSERT INTO test VALUES (hll_empty()); -- created with default (10,6,-1,1) INSERT 0 1 yugabyte=# INSERT INTO test VALUES (hll_empty(9,7)); -- created with log2m=9, regwidth=7 INSERT 0 1 yugabyte=# SELECT hll_log2m(h), hll_regwidth(h) FROM test; hll_log2m | hll_regwidth -----------+-------------- 9 | 7 11 | 5 10 | 6 (3 rows) ``` No other session can affect the values used by the current session. Test Plan: Existing Yugabyte test suite - a follow up diff (D23122) will integrate the HLL test suite into the Yugabyte regression test framework. Reviewers: amartsinchyk, ssong Reviewed By: ssong Subscribers: mbautin, smishra, yql Differential Revision: https://phabricator.dev.yugabyte.com/D23121
timothy-e
added a commit
that referenced
this issue
Mar 21, 2023
Summary: = Changes = * Included HLL in `third-party-extensions/Makefile` * Updated `build_postgres.py` to remove `-std=c11` flag. This flag caused some issues compiling the extension, because this extension contains some C++ code. We don't need to set the flag explicitly because Postgres uses the c99 standard and modern compilers use a default value for -std higher than c99. * Create a new regression test `TestPgRegressThirdPartyExtensionsHll` that runs the HLL tests in their directory. Test Plan: ``` ybd --java-test org.yb.pgsql.TestPgRegressThirdPartyExtensionsHll ``` Reviewers: yql, ssong Reviewed By: ssong Subscribers: jason, mbautin, ssong, amartsinchyk Differential Revision: https://phabricator.dev.yugabyte.com/D23122
premkumr
pushed a commit
to premkumr/yugabyte-db
that referenced
this issue
Apr 10, 2023
…abyte Summary: = Changes = * Included HLL in `third-party-extensions/Makefile` * Updated `build_postgres.py` to remove `-std=c11` flag. This flag caused some issues compiling the extension, because this extension contains some C++ code. We don't need to set the flag explicitly because Postgres uses the c99 standard and modern compilers use a default value for -std higher than c99. * Create a new regression test `TestPgRegressThirdPartyExtensionsHll` that runs the HLL tests in their directory. Test Plan: ``` ybd --java-test org.yb.pgsql.TestPgRegressThirdPartyExtensionsHll ``` Reviewers: yql, ssong Reviewed By: ssong Subscribers: jason, mbautin, ssong, amartsinchyk Differential Revision: https://phabricator.dev.yugabyte.com/D23122
1 task
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
area/ysql
Yugabyte SQL (YSQL)
kind/enhancement
This is an enhancement of an existing feature
priority/medium
Medium priority issue
Jira Link: DB-5486
Description
Add support for the extension https://github.com/citusdata/postgresql-hll
The text was updated successfully, but these errors were encountered: