Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YSQL] Support HyperLogLog extension #16092

Closed
timothy-e opened this issue Feb 13, 2023 · 0 comments
Closed

[YSQL] Support HyperLogLog extension #16092

timothy-e opened this issue Feb 13, 2023 · 0 comments
Assignees
Labels
area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue

Comments

@timothy-e
Copy link
Contributor

timothy-e commented Feb 13, 2023

Jira Link: DB-5486

Description

Add support for the extension https://github.com/citusdata/postgresql-hll

@timothy-e timothy-e added area/ysql Yugabyte SQL (YSQL) status/awaiting-triage Issue awaiting triage labels Feb 13, 2023
@timothy-e timothy-e self-assigned this Feb 13, 2023
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue and removed status/awaiting-triage Issue awaiting triage labels Feb 13, 2023
@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature and removed kind/bug This issue is a bug labels Mar 3, 2023
timothy-e added a commit that referenced this issue Mar 7, 2023
Summary:
> This Postgres module introduces a new data type `hll` which is a HyperLogLog data structure. HyperLogLog is a fixed-size, set-like structure used for distinct value counting with tunable precision. For example, in 1280 bytes `hll` can estimate the count of tens of billions of distinct values with only a few percent errors.

This commit imports the postgresql-hll extension from https://github.com/citusdata/postgresql-hll (upstream commit 8b9821c) **without any changes**. A follow up diff (D23122) will integrate it with Yugabyte.

= HLL Information =

To make sure that we fully understand the HLL type, I read through the extension code to grasp exactly what might not work with Yugabyte.

== How is the HLL type stored on disk? ==

The type is stored as binary and read into a struct.

HLLs are a cool data structure, and I recommended reading more about them in detail. Fundamentally, the idea is:
* The more contiguous zeros in a hash code, the less likely that hashcode is to occur
* Receiving an object with a hashcode that begins with 10 zeros has 2^-10 odds of occurring, so if we receive that, we are probably on our 2^10th unique object received.
* This value is very susceptible to random chance, so instead of a large number of counters, and each counter deals with a specific subset of possible hashcodes. We can then average them out to handle unlikely values.
* **A hyperloglog data structure is just a fixed number of counters**

This specific HLL implementation has made a few changes to optimize them for smaller count use cases. These are briefly explained below as I explain how the data is stored.

It is stored as an array of bytes, retrieved with `PG_GETARG_BYTEA_P`.

=== Header ===
The first 3 bytes make up the header. The header is 6 fields, each given 4 bits:
* version (1)
* type (invalid, empty, explicit, sparse, compressed)
* the 4 HLL parameters (nbits, log2nregs, expthresh, sparseon)

The next bytes depend on the type of the HLL node.

=== Empty / Undefined ===
The HLL is stored as just a header.

=== Explicit ===
From the HLL docs:
> we could just keep an explicit set of the inputs as a sorted list of 64-bit integers until we hit the 161st distinct value. This would give us the true representation of the distinct values in the stream while requiring the same amount of memory.

The message consists of n elements of 8-byte chunks, where each 8-byte chunk is a 64-bit integer.

=== Compressed ===
This is the typical implementation of the HyperLogLog algorithm. There are 2 ^ `log2nregs` = `nregs` registers, each of size `nbits`.

These counters are all packed in sequentially (where the `i`th register is at memory location `header_size + i * nbits`.

Here, if the counts of each register `i` are given by `c[i]`, our memory looks like
```
 _______ _______ _______ _______ _______ _______ _______ _______
| c[0]  | c[1]  | c[2]  | c[3]  | c[4]  |  ...  |c[2046]|c[2047]|
 +++++++ +++++++ +++++++ +++++++ +++++++ +++++++ +++++++ +++++++
  ^ each block is nbits bits
```
where each block is `nbits` wide.

Padding is added at the end to make the compressed format a multiple of 8.

=== Sparse ===
From the HLL docs:
> we could simply represent the set of registers that had non-zero values as a map from index to values. This map is stored as a list of index-value pairs that are bit-packed "short words" of length log2m + regwidth

If we have a lot of registers with count 0, it doesn't make sense to use the full size of the compressed notation. Instead, we can write out the register index and its count. Here, if registers `r_0, r_1, r_2, r_3, ...` have non-zero counts, our memory looks like:
```
 ______________ ________ ______________ ________ ______________ ________
|     r_0      | c[r_0] |     r_1      | c[r_1] |     r_2      | c[r_2] | ...
 ++++++++++++++ ******** ++++++++++++++ ******** ++++++++++++++ ********
  ^ each even block is log2nregs bits       ^ each odd block is nbits bits
```

Padding is added at the end to make the sparse format a multiple of 8.

== What operations exist on the HLL type? ==
There are only equality and non-equality operators. They work by directly comparing the binary value: first comparing the length, and then comparing using `memcmp`. This comparison will work in DocDB as well.

Adding a value to an HLL requires first unpacking it into the internal C type `multiset_t`, adding the value (which may promote the HLL type), and then packing it back into the bytearray, so it's guaranteed that two HLLs with the same counts are equivalent when comparing the binary. (And HLL has thousands of test cases to verify this).

== How does `set_hll_defaults` work on a multi-node system? ==

It works the same as it does on a single-node system.

From the HLL docs:
> The defaults can be changed (per connection) with:
> ```
> lang=sql
> SELECT hll_set_defaults(log2m, regwidth, expthresh, sparseon);
> ```

The values are stored as static ints in the extension code and not persisted anywhere else. This means that even in Postgres, two simultaneous connections to the same database can have different HLL defaults.

When we create an HLL, we can specify the parameters or rely on the default values.

```
lang=sql
hll_empty([log2m[, regwidth[, expthresh[, sparseon]]]])
hll_add_agg(hll_hashval, [log2m[, regwidth[, expthresh[, sparseon]]]])
```

Whatever parameters are not specified will use the default values, whatever is set by `set_hll_defaults`. The parameters are stored in the HLL object at creation. Changing the default values has no impact on an existing HLL, because we know what properties it was created with.

For example, we can have multiple HLLs and their parameters depend only on what the parameters were set to at creation:
```
lang=sql
yugabyte=# CREATE EXTENSION hll;
CREATE EXTENSION
yugabyte=# CREATE TABLE test(h hll);
CREATE TABLE
yugabyte=# INSERT INTO test VALUES (hll_empty()); -- created with default (11,5,-1,1)
INSERT 0 1
yugabyte=# SELECT hll_set_defaults(10, 6, -1, 1);
 hll_set_defaults
------------------
 (11,5,-1,1)
(1 row)

yugabyte=# INSERT INTO test VALUES (hll_empty()); -- created with default (10,6,-1,1)
INSERT 0 1
yugabyte=# INSERT INTO test VALUES (hll_empty(9,7)); -- created with log2m=9, regwidth=7
INSERT 0 1
yugabyte=# SELECT hll_log2m(h), hll_regwidth(h) FROM test;
 hll_log2m | hll_regwidth
-----------+--------------
         9 |            7
        11 |            5
        10 |            6
(3 rows)
```

No other session can affect the values used by the current session.

Test Plan: Existing Yugabyte test suite - a follow up diff (D23122) will integrate the HLL test suite into the Yugabyte regression test framework.

Reviewers: amartsinchyk, ssong

Reviewed By: ssong

Subscribers: mbautin, smishra, yql

Differential Revision: https://phabricator.dev.yugabyte.com/D23121
timothy-e added a commit that referenced this issue Mar 21, 2023
Summary:
= Changes =
* Included HLL in `third-party-extensions/Makefile`
* Updated `build_postgres.py` to remove `-std=c11` flag. This flag caused some issues compiling the extension, because this extension contains some C++ code. We don't need to set the flag explicitly because Postgres uses the c99 standard and modern compilers use a default value for -std higher than c99.
* Create a new regression test `TestPgRegressThirdPartyExtensionsHll` that runs the HLL tests in their directory.

Test Plan:
```
ybd --java-test org.yb.pgsql.TestPgRegressThirdPartyExtensionsHll
```

Reviewers: yql, ssong

Reviewed By: ssong

Subscribers: jason, mbautin, ssong, amartsinchyk

Differential Revision: https://phabricator.dev.yugabyte.com/D23122
premkumr pushed a commit to premkumr/yugabyte-db that referenced this issue Apr 10, 2023
…abyte

Summary:
= Changes =
* Included HLL in `third-party-extensions/Makefile`
* Updated `build_postgres.py` to remove `-std=c11` flag. This flag caused some issues compiling the extension, because this extension contains some C++ code. We don't need to set the flag explicitly because Postgres uses the c99 standard and modern compilers use a default value for -std higher than c99.
* Create a new regression test `TestPgRegressThirdPartyExtensionsHll` that runs the HLL tests in their directory.

Test Plan:
```
ybd --java-test org.yb.pgsql.TestPgRegressThirdPartyExtensionsHll
```

Reviewers: yql, ssong

Reviewed By: ssong

Subscribers: jason, mbautin, ssong, amartsinchyk

Differential Revision: https://phabricator.dev.yugabyte.com/D23122
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue
Projects
None yet
Development

No branches or pull requests

2 participants