[YSQL] Provide user with option to avoid kReadRestart error with extra cost even if statement's output exceeds ysql_output_buffer_size #20336

Open
1 task done
pkj415 opened this issue Dec 18, 2023 · 3 comments
Assignees
Labels
2024.1 Backport Required 2024.1.1_blocker area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue

Comments

@pkj415
Contributor

pkj415 commented Dec 18, 2023

Jira Link: DB-9323

Description

Provide the user with the ability to ensure that no kReadRestart errors are thrown in a READ COMMITTED txn even if a statement's output exceeds ysql_output_buffer_size (gflag with a default of 256KB).

The distributed nature of YugabyteDB means that clock skew can be present between nodes. This clock skew can sometimes result in an unresolvable ambiguity about whether a version of data should or should not be part of a read in snapshot-based transaction isolation levels (i.e., repeatable read and read committed). The database works around this ambiguity by retrying the read with a newer snapshot whenever possible. Such retries are present both in the tserver and in the query layer. However, there are situations where the retries can't be done (say, if part of the query's response has already been sent to the user while the query is still reading more data from DocDB). In this case we can't transparently move the read's snapshot (i.e., the read time) ahead, and YugabyteDB throws a kReadRestart error to the external client (see more details in https://docs.yugabyte.com/preview/architecture/transactions/read-committed/#read-restart-errors).

Some customers use read committed isolation and have reads whose response size exceeds the query layer to client output buffer size. This leads to situations where an internal retry might not be possible by the db. See #11572 as well which is similar to this issue.
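
The retry behaviour described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: timestamps are plain integers rather than hybrid times, the uncertainty window is taken to be (read_time, read_time + max_clock_skew], and the names `run_read` and `ReadRestartError` are hypothetical, not YugabyteDB APIs.

```python
class ReadRestartError(Exception):
    """Toy stand-in for the kReadRestart error surfaced to the client."""


def run_read(commit_times, read_time, max_clock_skew, response_already_sent):
    """Return the read time the statement finally uses.

    A record whose commit timestamp falls in (read_time, global_limit] is
    ambiguous: because of clock skew it may have been committed before the
    read started in real time.
    """
    global_limit = read_time + max_clock_skew
    while True:
        ambiguous = [t for t in commit_times if read_time < t <= global_limit]
        if not ambiguous:
            return read_time
        if response_already_sent:
            # Rows from the old snapshot already reached the client, so the
            # snapshot cannot be moved ahead transparently.
            raise ReadRestartError(f"ambiguous commit times: {ambiguous}")
        # Retry with a newer snapshot that includes every ambiguous record.
        read_time = max(ambiguous)
```

In this model, `run_read([100, 450, 900], 400, 100, False)` retries once and settles on read time 450, while the same call with `response_already_sent=True` raises the error, which mirrors the large-output case this issue is about.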

Solution

We should provide the user with a syntactic option to pay some extra cost but ensure that kReadRestart error doesn't occur.

There are various ways to pay the extra cost. We are going forward with option 1 below, i.e., providing a GUC to relax the guarantee, and this GitHub issue will track that work:

  1. Relax the guarantee that read restart ambiguity checking provides: a read will be able to see all data that was committed before it as per global wall clock time (or true time). Relaxing this guarantee means we could have the following situation: user X makes a post on a social media app and makes a phone call to user Y to inform them about the post, and user Y tries to read the post within 500 ms. If clock skew is large, user Y might not be able to see user X's post.
  2. Don't relax the guarantee, but pay an extra latency penalty (see the 3 options in the alternative solutions below)

Extra enhancements:

  1. Don't relax the guarantee for small reads that fit within ysql_output_buffer_size. Start relaxing the guarantee only once a read exceeds the buffer. ([YSQL] Add an option to clamp the read uncertainty window only for long reads #21725)
  2. Pick the ceiling of the uncertainty window (i.e., the global limit) as soon as a query enters YSQL. This can slightly reduce the probability of hitting read restart ambiguity windows. ([YSQL] Pick global limit as soon as the query arrives at the YSQL backend process. #21961)
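
A rough sketch of how the two enhancements could combine is given below. It is an illustration only: the function name and parameters are hypothetical, timestamps are plain microsecond integers, and it assumes the read time is picked somewhat later than the moment the query arrives, which is what makes an early global limit narrow the window.

```python
DEFAULT_OUTPUT_BUFFER_BYTES = 256 * 1024  # default of ysql_output_buffer_size per the description


def uncertainty_window_us(arrival_us, read_time_us, max_clock_skew_us,
                          output_bytes, buffer_bytes=DEFAULT_OUTPUT_BUFFER_BYTES):
    """Return (read_time, global_limit) for a statement in this toy model."""
    if output_bytes > buffer_bytes:
        # Long read: clamp the window to zero width so no restart is needed.
        return (read_time_us, read_time_us)
    # Short read: keep the guarantee, but anchor the global limit at query
    # arrival instead of at the (later) moment the read time is picked.
    global_limit_us = max(read_time_us, arrival_us + max_clock_skew_us)
    return (read_time_us, global_limit_us)
```

For example, with arrival at 1000, read time 1200, and a skew bound of 500, a small read gets the window (1200, 1500), which is 300 wide instead of 500, while a read that has produced more than 256KB gets (1200, 1200).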

Ways to pay the penalty without relaxing the guarantee:

  1. Pick the read time as the maximum value of the current hybrid time across all nodes. This will require the query layer to issue one round of rpcs to all nodes.
    Counter point: This solution is similar to having a timestamp oracle. Coordinating all nodes in the cluster leads to unacceptably high latency.

  2. Pick the global limit as current hybrid time + max_clock_skew_usec, then sleep until current hybrid time crosses global limit and set the read time as this global limit. We already do something like this for SERIALIZABLE READ ONLY DEFERRABLE.
    Counter point: Maximum clock skew is quite high at the moment. Doing this incurs a high latency which is again unacceptable.

2a. To avoid waiting for max_clock_skew_usec, we can track the clock skew in the cluster in real time and calculate a tighter upper bound on the clock skew at any instant by assuming a safe user-configurable max_clock_drift_usec per time unit. Let us call this upper bound current_clock_skew_upper_bound. Then we can use the same logic as in option 2: pick the global limit as current hybrid time + current_clock_skew_upper_bound, and then wait out this time before picking the read time.
Counter point: This can be done at a later stage. This solution requires a lot more experimentation.
GH Issue: #21962

2b. Perhaps, for AWS clusters that use the AWS Time Sync service and get microsecond-level clock skew, we can set the gflag max_clock_skew_usec to a much smaller number and wait out the clock skew.
GH Issue: #21963
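
Options 2 and 2a can be sketched as follows. This is a minimal illustration, not the actual implementation: the helper names are hypothetical, the wall clock stands in for the hybrid clock, and the formula for the tighter bound (last measured skew plus worst-case drift accumulated since the measurement) is an assumption about how current_clock_skew_upper_bound might be computed.

```python
import time


def wall_clock_us():
    """Current wall-clock time in microseconds (stand-in for hybrid time)."""
    return time.time_ns() // 1_000


def sleep_us(duration_us):
    time.sleep(duration_us / 1_000_000)


def skew_upper_bound_us(measured_skew_us, elapsed_since_measurement_us,
                        max_clock_drift_us_per_sec):
    """Assumed bound for option 2a: measured skew plus worst-case drift."""
    # Ceiling division so the bound never underestimates the drift.
    drift_us = -(-elapsed_since_measurement_us * max_clock_drift_us_per_sec
                 // 1_000_000)
    return measured_skew_us + drift_us


def pick_read_time_by_waiting(skew_bound_us, now_us=wall_clock_us, sleep=sleep_us):
    """Option 2: fix the global limit, wait until the clock crosses it,
    then use the global limit itself as the read time."""
    global_limit_us = now_us() + skew_bound_us
    while (now := now_us()) <= global_limit_us:
        sleep(global_limit_us - now + 1)
    return global_limit_us
```

Passing `max_clock_skew_usec` as `skew_bound_us` corresponds to option 2, and passing the result of `skew_upper_bound_us(...)` corresponds to option 2a. The clock and sleep functions are injectable so the wait can be exercised with a fake clock; the added latency is always at least the skew bound, which is the cost the counter points refer to.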

Issue Type

kind/enhancement

@pkj415 pkj415 added area/ysql Yugabyte SQL (YSQL) status/awaiting-triage Issue awaiting triage labels Dec 18, 2023
@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue labels Dec 18, 2023
@pkj415 pkj415 changed the title [YSQL] Provide user with option to avoid kReadRestart error with extra cost even if statement's output exceeds ysql_output_buffer_size (gflag with default of 256KB). [YSQL] Provide user with option to avoid kReadRestart error with extra cost even if statement's output exceeds ysql_output_buffer_size Dec 18, 2023
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Jan 3, 2024
@mbautin
Collaborator

mbautin commented Jan 18, 2024

Pick the global limit as current hybrid time + max_clock_skew_usec, then sleep until current hybrid time crosses global limit and set the read time as this global limit. We already do something like this for SERIALIZABLE READ ONLY DEFERRABLE.

Typically we don't have to wait until current hybrid time crosses that threshold -- we could issue the reads and the waiting will happen as part of waiting for a particular safe time on tablet servers. This way we can overlap the wait for read time with sending RPCs.

@basavaraj29
Contributor

Proposed implementation approach: while picking the read time (ReadHybridTime), if the global_limit is set to the read value of the picked ReadHybridTime, then the underlying DB code wouldn't throw a read restart error (since the ambiguity interval shrinks to 0).
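
The effect of this approach can be shown with a small model. The `ReadHybridTime` class below is a simplified stand-in with only two integer fields, not the real structure, and `pick_read_time`, `needs_restart`, and the `clamp_uncertainty_window` parameter are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadHybridTime:
    """Simplified stand-in: only the fields needed for this illustration."""
    read: int
    global_limit: int


def pick_read_time(now, max_clock_skew, clamp_uncertainty_window):
    # With clamping, the ambiguity interval (read, global_limit] is empty.
    limit = now if clamp_uncertainty_window else now + max_clock_skew
    return ReadHybridTime(read=now, global_limit=limit)


def needs_restart(commit_time, rht):
    """True when a record's commit time falls in the ambiguity interval."""
    return rht.read < commit_time <= rht.global_limit
```

With `pick_read_time(1000, 500, True)` no commit time can trigger a restart, whereas the unclamped variant restarts for a record committed at 1200. The trade-off is the one described in option 1: under clamping, that record at 1200 is simply treated as not visible, even if clock skew means it was committed before the read began in real time.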

@rthallamko3
Contributor

Assigning to @sushantrmishra to identify the assignee, as the recent approach that was brainstormed doesn't involve any DocDB work.

Projects
Status: In Review