[YSQL] Provide user with option to avoid kReadRestart error with extra cost even if statement's output exceeds ysql_output_buffer_size #20336
Labels
- 2024.1 Backport Required
- 2024.1.1_blocker
- area/ysql (Yugabyte SQL (YSQL))
- kind/enhancement (enhancement of an existing feature)
- priority/medium (medium priority issue)
Jira Link: DB-9323
Description
Provide the user with the ability to ensure that no kReadRestart errors are thrown in a READ COMMITTED transaction, even if a statement's output exceeds ysql_output_buffer_size (a gflag with a default of 256 KB).
The distributed nature of YugabyteDB means that clock skew can be present between nodes. This clock skew can sometimes result in an unresolvable ambiguity about whether a version of data should or should not be part of a read in snapshot-based transaction isolation levels (i.e., repeatable read and read committed). The database works around this ambiguity by retrying the read with a newer snapshot whenever possible. Such retries are present both in the tserver and in the query layer. However, there are situations where a retry can't be done (for example, if part of the query's response has already been sent to the client while the query is still reading more data from DocDB); in such cases we can't transparently move the read's snapshot (i.e., the read time) ahead, and YugabyteDB throws a kReadRestart error to the external client (see more details in https://docs.yugabyte.com/preview/architecture/transactions/read-committed/#read-restart-errors).
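The ambiguity window can be made concrete with a toy check (hybrid times simplified to plain integers; this is a sketch of the rule described above, not YugabyteDB code):

```python
def needs_read_restart(commit_ht, read_time_ht, global_limit_ht):
    """Toy version of the read-restart ambiguity check.

    - commit_ht <= read_time_ht: definitely visible to this read.
    - commit_ht >  global_limit_ht (read time + max clock skew):
      definitely not visible.
    - in between: the commit may have happened before the read started
      on a node with a faster clock, so the read must be retried with a
      newer snapshot, or, if results were already shipped to the
      client, surfaced as a kReadRestart error.
    """
    return read_time_ht < commit_ht <= global_limit_ht

print(needs_read_restart(5, 10, 15))   # False: committed before the read
print(needs_read_restart(12, 10, 15))  # True: inside the ambiguity window
print(needs_read_restart(20, 10, 15))  # False: committed clearly after
```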
Some customers use read committed isolation and have reads whose response size exceeds the output buffer between the query layer and the client. This leads to situations where an internal retry is not possible for the database. See #11572 as well, which is similar to this issue.
Solution
We should provide the user with a syntactic option to pay some extra cost in exchange for a guarantee that no kReadRestart error occurs.
There are various ways to pay this extra cost. We are going forward with providing a GUC that relaxes the guarantee (option 1 below); this GitHub issue will track that work.
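For illustration only, a session using such a GUC might look like the following sketch (the GUC name `yb_relax_read_restart_guarantee` is hypothetical; the issue does not name the final GUC):

```sql
-- Hypothetical GUC name, for illustration only.
SET yb_relax_read_restart_guarantee = on;

BEGIN ISOLATION LEVEL READ COMMITTED;
-- Output may exceed ysql_output_buffer_size; with the relaxed
-- guarantee, no kReadRestart error is surfaced to the client.
SELECT * FROM some_large_table;
COMMIT;
```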
Extra enhancements: ways to pay the penalty without relaxing the guarantee:
1. Pick the read time as the maximum of the current hybrid time across all nodes. This requires the query layer to issue one round of RPCs to all nodes.
Counterpoint: this solution is similar to having a timestamp oracle. Coordinating all nodes in the cluster leads to unacceptably high latency.
2. Pick the global limit as the current hybrid time + max_clock_skew_usec, then sleep until the current hybrid time crosses the global limit, and set the read time to this global limit. We already do something like this for SERIALIZABLE READ ONLY DEFERRABLE.
Counterpoint: the maximum clock skew is quite high at the moment. Doing this incurs a high latency, which is again unacceptable.
2a. To avoid waiting out the full max_clock_skew_usec, we can track the clock skew in the cluster in real time and compute a better upper bound on the skew at any instant by assuming a safe, user-configurable max_clock_drift_usec per time unit. Call this upper bound current_clock_skew_upper_bound. Then use the same logic as in option 2: pick the global limit as the current hybrid time + current_clock_skew_upper_bound, and wait out this interval before picking the read time.
Counterpoint: can be done in a later stage; this solution requires plenty more experimentation.
GH issue: #21962
2b. For AWS clusters that use the AWS Time Sync Service and achieve microsecond-level clock skew, we can set the gflag max_clock_skew_usec to a much smaller number and wait out the clock skew.
GH issue: #21963
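The wait-out-the-skew idea behind options 2 and 2a can be sketched as follows, with the clock and sleep function injected so the logic can be exercised without real waiting (all names here are illustrative, not YugabyteDB internals):

```python
def pick_read_time(now_usec, skew_bound_usec, sleep):
    """Sketch of options 2/2a: wait out the clock-skew bound.

    global_limit = now + skew_bound (max_clock_skew_usec in option 2, or
    the tighter current_clock_skew_upper_bound in option 2a).  Once the
    local clock has passed global_limit, no node's clock can still be
    below it, so no commit with hybrid time <= global_limit can appear
    later; using global_limit as the read time avoids kReadRestart.
    """
    global_limit = now_usec() + skew_bound_usec
    while now_usec() < global_limit:
        sleep((global_limit - now_usec()) / 1_000_000)
    return global_limit

# Usage with a fake clock: each sleep() advances simulated time.
class FakeClock:
    def __init__(self):
        self.t = 1_000_000  # simulated time, in microseconds
    def now_usec(self):
        return self.t
    def sleep(self, seconds):
        self.t += int(seconds * 1_000_000)

clock = FakeClock()
rt = pick_read_time(clock.now_usec, 500_000, clock.sleep)
print(rt)                      # 1500000: read time equals the global limit
print(clock.now_usec() >= rt)  # True: the skew bound was waited out
```

The smaller the skew bound (option 2a, or option 2b's reduced max_clock_skew_usec), the shorter the wait, which is why tightening the bound is the main lever here.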
Issue Type
kind/enhancement