[YSQL] Provide user with option to avoid kReadRestart error with extra cost even if statement's output exceeds ysql_output_buffer_size #20336

pkj415 · 2023-12-18T02:09:17Z

Jira Link: DB-9323

Description

Provide the user with the ability to ensure no kReadRestart errors are thrown in a READ COMMITTED txn even if statement's output exceeds ysql_output_buffer_size (gflag with default of 256KB).

The distributed nature of YugabyteDB means that clock skew can be present between nodes. This clock skew can sometimes result in an unresolvable ambiguity of whether a version of data should be/ or not be part of a read in snapshot-based transaction isolations (i.e., repeatable read & read committed). The database takes the approach of retrying the read with a newer snapshot to workaround this ambiguity whenever permissible/ possible. Such retries are present both in the tserver and the query layer. However, there are situations where the retries can’t be done (say if part of the query’s response has already been sent to the user while the query is still reading more data from docdb) – in this case we can’t transparently move the read’s snapshot (i.e., the read time) ahead. In such cases, YugabteDB will throw a kReadRestart error to the external client (see more details in https://docs.yugabyte.com/preview/architecture/transactions/read-committed/#read-restart-errors)

Some customers use read committed isolation and have reads whose response size exceeds the query layer to client output buffer size. This leads to situations where an internal retry might not be possible by the db. See #11572 as well which is similar to this issue.

Solution

We should provide the user with a syntactic option to pay some extra cost but ensure that kReadRestart error doesn't occur.

There are various ways to pay the extra cost (we are going forward with providing a GUC to relax the guarantee (option 1 below). This github issue will track this work):

Relax the guarantee that read restart ambiguity checking provides: a read will be able to see all data that was committed before it as per global wall call time (or true time). Relaxing this guarantee means we could have the following situation: user X make a post on social media app, makes a phone call to user Y to inform about the post and user Y tries to read the post within 500 ms. If clock skew is large, user Y might not be able to see user X's post.
Don't relax the guarantee, but pay an extra latency penalty (see the 3 options in the alternative solutions below)

Extra enhancements:

Don't relax the guarantee for small reads that fit within the ysql_output_buffer_size. Only if they exceed the buffer, we start to relax the guarantee. ([YSQL] Add an option to clamp the read uncertainty window only for long reads #21725)
Pick the ceiling of the uncertainty window (i.e., the global limit) as soon as a query enter YSQL. This can help reduce the probability of hit read restart ambiguity windows by a slight amount ([YSQL] Pick global limit as soon as the query arrives at the YSQL backend process. #21961).

Ways to pay the penalty without relaxing the guarantee:

Pick the read time as the maximum value of the current hybrid time across all nodes. This will require the query layer to issue one round of rpcs to all nodes.
Counter point: This solution is similar to having a timestamp oracle. Coordinating all nodes in the cluster leads to unacceptably high latency.
Pick the global limit as current hybrid time + max_clock_skew_usec, then sleep until current hybrid time crosses global limit and set the read time as this global limit. We already do something like this for SERIALIZABLE READ ONLY DEFERRABLE.
Counter point: Maximum clock skew is quite high at the moment. Doing this incurs a high latency which is again unacceptable.

2a. To avoid waiting for max_clock_skew_usec, we can keep real time track of the clock skew in the cluster and calculate a better upper bound on the clock skew at any instant by assuming a safe user-configurable max_clock_drift_usec per time unit. Let us call this upper bound on clock skew as current_clock_skew_upper_bound. Then we can use the same logic as in option 2, and pick global limit as current hybrid time + current_clock_skew_upper_bound, and then wait out this time until we pick the read time.
Counter point: can be done in a later stage. This solution requires plenty more experimentation.
GH Issue: #21962

2b. Perhaps, for AWS clusters that have AWS Time sync service and provide microsecond clock skews, we can set the gflag max_clock_skew_usec to a much smaller number and wait out the clock skew.
GH Issue: #21963

Issue Type

kind/enhancement

Warning: Please confirm that this issue does not contain any sensitive information

I confirm this issue does not contain any sensitive information.

The text was updated successfully, but these errors were encountered:

mbautin · 2024-01-18T18:24:16Z

Pick the global limit as current hybrid time + max_clock_skew_usec, then sleep until current hybrid time crosses global limit and set the read time as this global limit. We already do something like this for SERIALIZABLE READ ONLY DEFERRABLE.

Typically we don't have to wait until current hybrid time crosses that threshold -- we could issue the reads and the waiting will happen as part of waiting for a particular safe time on tablet servers. This way we can overlap the wait for read time with sending RPCs.

basavaraj29 · 2024-02-14T17:18:08Z

proposed implementation approach - while picking the read time (ReadHybridTime), if the global_limit is set to the read of the picked ReadHybridTime, then the underlying db code wouldn't throw a read restart error (since the ambiguity interval gets shrinked to 0).

rthallamko3 · 2024-02-26T16:21:07Z

Assigning to @sushantrmishra to identify the assignment as the recent approach that was brainstormed doesn't have any DocDB work.

pkj415 added area/ysql Yugabyte SQL (YSQL) status/awaiting-triage Issue awaiting triage labels Dec 18, 2023

yugabyte-ci added kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue labels Dec 18, 2023

robertsami assigned basavaraj29 Dec 20, 2023

yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Jan 3, 2024

rthallamko3 assigned sushantrmishra and unassigned basavaraj29 Feb 26, 2024

sushantrmishra assigned pao214 and unassigned sushantrmishra Mar 11, 2024

sushantrmishra mentioned this issue Mar 11, 2024

Provide user with option to avoid kReadRestart error when statement's output exceeds ysql_output_buffer_size #20767

Closed

rthallamko3 added 2024.1 Backport Required 2.20 Backport Required and removed 2.20 Backport Required labels Mar 18, 2024

pao214 mentioned this issue Mar 28, 2024

[YSQL] Add an option to clamp the read uncertainty window only for long reads #21725

Open

1 task

yugabyte-ci added the 2024.1.1_blocker label Apr 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[YSQL] Provide user with option to avoid kReadRestart error with extra cost even if statement's output exceeds ysql_output_buffer_size #20336

[YSQL] Provide user with option to avoid kReadRestart error with extra cost even if statement's output exceeds ysql_output_buffer_size #20336

pkj415 commented Dec 18, 2023 •

edited by pao214

mbautin commented Jan 18, 2024 •

edited

basavaraj29 commented Feb 14, 2024

rthallamko3 commented Feb 26, 2024

[YSQL] Provide user with option to avoid kReadRestart error with extra cost even if statement's output exceeds ysql_output_buffer_size #20336

[YSQL] Provide user with option to avoid kReadRestart error with extra cost even if statement's output exceeds ysql_output_buffer_size #20336

Comments

pkj415 commented Dec 18, 2023 • edited by pao214

Description

Solution

Ways to pay the penalty without relaxing the guarantee:

Issue Type

Warning: Please confirm that this issue does not contain any sensitive information

mbautin commented Jan 18, 2024 • edited

basavaraj29 commented Feb 14, 2024

rthallamko3 commented Feb 26, 2024

pkj415 commented Dec 18, 2023 •

edited by pao214

mbautin commented Jan 18, 2024 •

edited