Summary:
**Background**
Three fields in the PgsqlReadRequestPB RPC message define conditions on the index key columns:
- partition_column_values are constants matching the hash columns; if specified, the list includes all the hash columns in the key
- range_column_values are constants matching the range columns
- condition_expr supports a variety of operations, including equality, inequality, IN, and logical operators.
The first two are more efficient, but the condition_expr is more versatile.
Initially partition_column_values only allowed single values, since DocDB needs to calculate the hash code to build the key prefix.
But at some point we added support for IN conditions as well. The feature was implemented by PgGate transforming the template request
with partition_column_values holding tuples and single values into multiple requests, each holding one permutation.
To prevent the system from running out of memory, the ysql_request_limit flag was introduced. PgGate flushed the requests when the limit was reached.
The feature received a number of enhancements over time. One was the batching of multiple permutations into a single message.
We built a condition like `ROW(hash_code, h1, ..., hn) IN ((tuple1), ..., (tuplem))`, where each permutation made one tuple, one condition per tablet.
Then we added said conditions to the condition_expr of the DocDB requests instead of partition_column_values.
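As a rough illustration of the batching described above, the sketch below groups permutations by owning tablet and prepends the hash code to each tuple, so that each tablet ends up with one tuple list for a `ROW(hash_code, h1, ..., hn) IN (...)` condition. Everything in it is hypothetical: `BatchByTablet`, `FindPartition`, the string-valued `Tuple`, the caller-supplied `hash_fn`, and the representation of partitions as sorted start hash codes are assumptions made for illustration, not the PgGate code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Placeholder for one tuple of values; real code would hold DocDB values.
using Tuple = std::vector<std::string>;

// Assumption: partition i owns hash codes in [starts[i], starts[i + 1]),
// the last partition is open-ended, and starts[0] == 0.
size_t FindPartition(const std::vector<uint16_t>& starts, uint16_t hash_code) {
  assert(!starts.empty() && starts.front() == 0);
  // First start strictly greater than hash_code; the owner is the one before it.
  auto it = std::upper_bound(starts.begin(), starts.end(), hash_code);
  return static_cast<size_t>(it - starts.begin()) - 1;
}

// Groups permutations by partition. Each stored tuple is (hash_code, h1, ..., hn),
// i.e. one element of the RHS of a per-tablet ROW IN condition.
std::map<size_t, std::vector<Tuple>> BatchByTablet(
    const std::vector<Tuple>& permutations,
    const std::vector<uint16_t>& partition_starts,
    const std::function<uint16_t(const Tuple&)>& hash_fn) {
  std::map<size_t, std::vector<Tuple>> batches;
  for (const auto& perm : permutations) {
    const uint16_t hash_code = hash_fn(perm);
    Tuple row;
    row.reserve(perm.size() + 1);
    row.push_back(std::to_string(hash_code));
    row.insert(row.end(), perm.begin(), perm.end());
    batches[FindPartition(partition_starts, hash_code)].push_back(std::move(row));
  }
  return batches;
}
```

Only tablets that own at least one permutation get an entry in the map, which corresponds to building one condition per tablet that actually needs a request.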
While batched requests took less memory, running out of memory because of too many requests was still possible, and the ysql_request_limit flag no longer worked.
Hence hash_key_arena_ was introduced. Allocations for the batched conditions were made on that arena, and when it reached a certain size, the batches were added to the request, flushed, and the arena was reset.
This improved the memory usage problem but did not fully solve it. While batched conditions were removed from the requests between flushes, that did not recover the memory.
Another notable enhancement was made for batched nested loop support.
Batched nested loop transforms a simple equality condition into an IN condition, and if the tables are joined on a hash column, that condition is subject to hash permutations.
The case that required enhancements was a join on multiple columns. The IN condition in this case was a `ROW` of the joined columns `IN` some tuples.
That required special handling for three reasons: the columns in the `ROW` might include both hash and range columns; whole tuples, not individual elements, must participate in permutations; and `ROW IN` conditions may be mixed with regular IN and equality conditions.
**Problem**
Now bucketized index support requires another enhancement.
We define a "bucket" as a permutation of multiple equality and IN conditions on bucketized columns, and we want to reuse the existing code to build those permutations.
The main difference is that a hash is no longer a requirement. A bucketized index may be split by range.
**Solution**
The goal of this diff is to refactor the permutation-generating code into a separate, reusable component and to fix the known issues.
The new class InPermutationState encapsulates the permutation generator.
It keeps the permuted expressions, the list of participating columns, and the current position, and it generates permutations one at a time, each as a vector of values.
The InExpressionWrapper provides unified access to one participating expression, whether it is a single value, `IN`, or `ROW IN` condition.
The InPermutationBuilder is a helper class to build and validate an InPermutationState instance.
The permutation state is validated against the target index. InPermutationState does not require the target index to be a hash index, but if it is, all hash columns must participate in permutations. Only key columns are allowed to participate in permutations.
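The generator described above can be pictured with the following minimal sketch. It is not the actual InPermutationState or InExpressionWrapper code: `ExprSketch`, `PermutationSketch`, and the string-valued `Tuple` are hypothetical names, and values are plain strings. The idea it shows is that a single value, an `IN` list, and a `ROW IN` list can all be treated as "a set of target columns plus a list of tuples", so whole tuples take part in the permutation and are scattered into their column positions.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using Tuple = std::vector<std::string>;  // placeholder for a vector of values

// Hypothetical unified view of one participating expression.
// Single value: one column, one choice. IN: one column, many choices.
// ROW IN: several columns, many choices of whole tuples.
struct ExprSketch {
  std::vector<size_t> columns;  // positions in the generated permutation
  std::vector<Tuple> choices;   // each tuple has columns.size() elements
};

class PermutationSketch {
 public:
  PermutationSketch(std::vector<ExprSketch> exprs, size_t num_columns)
      : exprs_(std::move(exprs)), pos_(exprs_.size(), 0), num_columns_(num_columns) {
    for (const auto& e : exprs_) {
      if (e.choices.empty()) done_ = true;  // an empty IN list yields nothing
      for (const auto& t : e.choices) assert(t.size() == e.columns.size());
      for (size_t c : e.columns) assert(c < num_columns_);
    }
  }

  // Returns the next permutation, or std::nullopt when all were generated.
  std::optional<Tuple> Next() {
    if (done_) return std::nullopt;
    Tuple out(num_columns_);
    for (size_t i = 0; i < exprs_.size(); ++i) {
      const Tuple& chosen = exprs_[i].choices[pos_[i]];
      for (size_t j = 0; j < chosen.size(); ++j) {
        out[exprs_[i].columns[j]] = chosen[j];
      }
    }
    Advance();
    return out;
  }

 private:
  // Odometer-style step: the last expression varies fastest.
  void Advance() {
    for (size_t i = exprs_.size(); i-- > 0;) {
      if (++pos_[i] < exprs_[i].choices.size()) return;
      pos_[i] = 0;
    }
    done_ = true;
  }

  std::vector<ExprSketch> exprs_;
  std::vector<size_t> pos_;
  size_t num_columns_;
  bool done_ = false;
};
```

Generating one permutation per call keeps only the current position in memory, rather than materializing the full cross product up front.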
To address the memory usage problem stated above we use the pgsql_op_arena_ for permutation requests.
We have been using the pgsql_op_arena_ to prevent bleeding memory when fetching rows by ybctids from a secondary index, another case where we may make too many requests.
The pgsql_op_arena_ is reset when the current batch of requests becomes inactive. The requests are recreated after that.
The new usage required a change in the PgDocResultStream. In the case of hash permutations, new requests are created immediately after cleanup, when responses from the previous batch may still be around. A new reset function in PgDocResultStream detaches inactive operations from the responses before cleanup and prevents use-after-free errors.
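The memory behavior behind this choice can be demonstrated with a generic arena. The sketch below uses the standard `std::pmr::monotonic_buffer_resource` as a stand-in; it is not the YugabyteDB arena, and `CountingResource` and `BuildAndDropConditions` are hypothetical helpers. It shows that dropping objects allocated on an arena does not give memory back, while resetting the arena does, which is also why nothing may keep pointing into the arena across a reset.

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <string>
#include <vector>

// Upstream resource that tracks how many bytes the arena currently holds.
class CountingResource : public std::pmr::memory_resource {
 public:
  size_t outstanding() const { return outstanding_; }

 private:
  void* do_allocate(size_t bytes, size_t alignment) override {
    outstanding_ += bytes;
    return std::pmr::new_delete_resource()->allocate(bytes, alignment);
  }
  void do_deallocate(void* p, size_t bytes, size_t alignment) override {
    outstanding_ -= bytes;
    std::pmr::new_delete_resource()->deallocate(p, bytes, alignment);
  }
  bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
    return this == &other;
  }

  size_t outstanding_ = 0;
};

// Builds n placeholder conditions on the arena and then drops them again.
// With a monotonic arena, dropping them frees nothing.
void BuildAndDropConditions(std::pmr::memory_resource* arena, int n) {
  std::pmr::vector<std::pmr::string> conds(arena);
  for (int i = 0; i < n; ++i) {
    // Long enough that the string has to allocate from the arena.
    conds.emplace_back("placeholder for a batched ROW IN condition, long enough to allocate");
  }
  conds.clear();
}
```

In this sketch, after `BuildAndDropConditions` returns the upstream counter is still above zero, and it drops back to zero only after `release()` is called on the arena.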
Because the permutation requests are no longer recycled, we don't need to clean up permutations from the condition_expr.
In batch mode we no longer apply ysql_request_limit. We create one request per tablet, like in other cases when we query the tablets in parallel.
Hence we don't need to build the `ROW IN` condition with permutations separately. Now hash_in_conds_ holds pointers to the RHS of the `ROW IN` condition in each respective request, where we add the permutations.
Because of that, some expected test results change where `ysql_request_limit=1` is used.
Jira: DB-18478
Test Plan:
Run regression tests.
There are two minor changes in the batched hash permutations implementation that affect the expected test results.
One is that ysql_request_limit no longer affects the batched hash permutations. Tests that set ysql_request_limit to 1 may send fewer RPCs.
The other is that the read requests to tablets are created as needed, basically in the order of elements in the IN clause, not in the order of partitions. The order of the result rows follows the order of the requests, so the row order may change while remaining deterministic.
Reviewers: jason
Reviewed By: jason
Subscribers: sanketh, pjain, smishra, yql
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D47384