
[DocDB] Include timeout value in the error message when timing out from wait-queue while trying to re-acquire shared in-memory locks #19859

Closed
basavaraj29 opened this issue Nov 6, 2023 · 1 comment
Labels: 2.20 Backport Required, area/docdb (YugabyteDB core features), kind/bug (This issue is a bug), priority/medium (Medium priority issue)

Comments


basavaraj29 commented Nov 6, 2023

Jira Link: DB-8803

Description

When a request requiring explicit locks arrives for a tablet, we try to acquire shared in-memory locks. If the request times out at this phase, we stream back an error message to the client of the form:

Failed to obtain locks until deadline: < CoarseTimePoint > ... Timeout: 601.999s...

If the request ends up in the wait-queue, we could also time out while trying to re-acquire the shared in-memory locks when resuming the waiter. In that case, we return an error containing only the deadline (a CoarseTimePoint) and not the timeout value (the delta), which could mislead the end user into thinking that the value shown is the timeout when it is in fact the absolute deadline:

Failed to obtain locks until deadline: 33135.898s

Correct the error message to explicitly include the timeout value as well.
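As a minimal sketch of the requested fix (not YugabyteDB's actual code), the change amounts to formatting both the absolute deadline and the relative timeout into the status message; `FormatLockTimeoutError` and its parameters are illustrative names:

```cpp
#include <sstream>
#include <string>

// Illustrative sketch only: build a lock-timeout error that reports both the
// absolute deadline (a time point) and the relative timeout (a delta), so the
// client cannot mistake one for the other.
std::string FormatLockTimeoutError(double deadline_secs, double timeout_secs) {
  std::ostringstream out;
  out << "Failed to obtain locks until deadline: " << deadline_secs
      << "s, timeout: " << timeout_secs << "s";
  return out.str();
}
```

With this shape, the misleading message from the report would instead read `Failed to obtain locks until deadline: 33135.9s, timeout: 602s`-style output, making the delta visible.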

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@basavaraj29 basavaraj29 added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Nov 6, 2023
@basavaraj29 basavaraj29 self-assigned this Nov 6, 2023
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue and removed status/awaiting-triage Issue awaiting triage labels Nov 6, 2023
@basavaraj29

Planning to address this as part of #19032; will keep the issue open until then.

basavaraj29 added a commit that referenced this issue Jan 16, 2024
… waiter resumption until request deadline

Summary:
In the existing implementation, while resuming waiter records from the wait-queue, we try re-acquiring the shared in-memory locks for 1 sec irrespective of the client/statement timeout. In contentious workloads, this could lead to requests failing at this stage even though the client/statement timeout hasn't passed.

This diff addresses the issue by implementing a retry mechanism at the wait-queue level for acquiring the shared in-memory locks during waiter resumption. If we fail to acquire the locks within 100ms (changed `wait_for_relock_unblocked_txn_keys_ms` to 100) on the first attempt, we schedule a retry on the tablet server's reactor thread pool, setting the lock acquisition deadline to the request's deadline. This way, the reactor thread blocks trying to acquire the shared in-memory locks until the request deadline, which applies back-pressure on incoming connections (since we know there is heavy contention). Pawning off the resumption of these contentious waiters to the reactor threads also avoids adding latency in-line for the thread that resumes other waiters serially (which could have been waiting on less contentious locks).
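The two-phase scheme described above can be sketched as follows. This is a minimal stand-in that uses `std::async` in place of the tablet server's reactor pool; `TryAcquire`, `ResumeWaiter`, and `kFirstAttemptBudget` are hypothetical names, not YugabyteDB APIs:

```cpp
#include <chrono>
#include <functional>
#include <future>
#include <thread>

using Clock = std::chrono::steady_clock;

// Poll the lock until it is acquired or the deadline passes.
bool TryAcquire(std::function<bool()> try_lock, Clock::time_point deadline) {
  while (Clock::now() < deadline) {
    if (try_lock()) return true;
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  return false;
}

bool ResumeWaiter(std::function<bool()> try_lock,
                  Clock::time_point request_deadline) {
  constexpr auto kFirstAttemptBudget = std::chrono::milliseconds(100);
  // Phase 1: short in-line attempt so uncontended waiters resume immediately
  // without blocking the thread that resumes other waiters serially.
  if (TryAcquire(try_lock, Clock::now() + kFirstAttemptBudget)) return true;
  // Phase 2: hand the contended waiter off to a background executor (standing
  // in for the reactor pool) that retries until the request's own deadline.
  auto fut = std::async(std::launch::async, [=] {
    return TryAcquire(try_lock, request_deadline);
  });
  return fut.get();
}
```

The key design point mirrored here is that the second phase uses the request's deadline, not a fixed 1-second budget, so a waiter only fails once its own timeout has genuinely elapsed.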
Jira: DB-7851, DB-8803

Test Plan:
./yb_build.sh tsan --cxx-test='TEST_F(PgWaitQueueContentionStressTest, TestResumeWaitersOnReactor) {' -n 10

Without the fix, i.e., with the patch below, the above test fails consistently in TSAN mode. A snippet of the failure is pasted below.
```
diff --git a/src/yb/docdb/wait_queue.cc b/src/yb/docdb/wait_queue.cc
index 1169707b1d..c77ba5c45b 100644
--- a/src/yb/docdb/wait_queue.cc
+++ b/src/yb/docdb/wait_queue.cc
@@ -332,13 +332,15 @@ struct WaiterData : public std::enable_shared_from_this<WaiterData> {
       } else {
         *locks = unlocked_copy->TryLock(locking_deadline);
         if (!locks->status().ok()) {
-          {
-            // Restore the state of 'unlocked_' enabling retry of waiter resumption.
-            UniqueLock l(mutex_);
-            unlocked_.swap(unlocked_copy);
-          }
-          return locks->status().CloneAndPrepend(
-            Format("$0Waiter failed to re-acquire in memory locks", LogPrefix()));
+          // {
+          //   // Restore the state of 'unlocked_' enabling retry of waiter resumption.
+          //   UniqueLock l(mutex_);
+          //   unlocked_.swap(unlocked_copy);
+          // }
+          // return locks->status().CloneAndPrepend(
+          //   Format("$0Waiter failed to re-acquire in memory locks", LogPrefix()));
+          status = locks->status().CloneAndPrepend(
+              Format("$0Waiter failed to re-acquire in memory locks", LogPrefix()));
         }
       }
     }
```
Failure observed when the retry is disabled:
```
2024-01-11 17:33:02.757 UTC [72937] ERROR:  TxnId: f09ad238-1cec-4e2a-8657-6cafa0e9c9f7 Waiter failed to re-acquire in memory locks: Failed to obtain locks until deadline: 1932.710s, batch: [{ key:  intent_types: [kWeakRead, kWeakWrite] }, { key: 4712104880000001 intent_types: [kStrongRead, kStrongWrite] }]
Bad status: Network error (yb/yql/pgwrapper/libpq_utils.cc:366): Fetch 'SELECT * FROM foo WHERE k=1 FOR UPDATE' failed: 7, message: ERROR:  TxnId: f09ad238-1cec-4e2a-8657-6cafa0e9c9f7 Waiter failed to re-acquire in memory locks: Failed to obtain locks until deadline: 1932.710s, batch: [{ key:  intent_types: [kWeakRead, kWeakWrite] }, { key: 4712104880000001 intent_types: [kStrongRead, kStrongWrite] }] (pgsql error XX000) (aux msg ERROR:  TxnId: f09ad238-1cec-4e2a-8657-6cafa0e9c9f7 Waiter failed to re-acquire in memory locks: Failed to obtain locks until deadline: 1932.710s, batch: [{ key:  intent_types: [kWeakRead, kWeakWrite] }, { key: 4712104880000001 intent_types: [kStrongRead, kStrongWrite] }])
```

Reviewers: rsami

Reviewed By: rsami

Subscribers: bogdan, ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D29796
basavaraj29 added a commit that referenced this issue Feb 2, 2024
… in-memory locks during waiter resumption until request deadline

Summary:
Original commit: 7d7a636 / D29796
and commit: bf593a8 / D31766

In the existing implementation, while resuming waiter records from the wait-queue, we try re-acquiring the shared in-memory locks for 1 sec irrespective of the client/statement timeout. In contentious workloads, this could lead to requests failing at this stage even though the client/statement timeout hasn't passed.

This diff addresses the issue by implementing a retry mechanism at the wait-queue level for acquiring the shared in-memory locks during waiter resumption. If we fail to acquire the locks within 100ms (changed `wait_for_relock_unblocked_txn_keys_ms` to 100) on the first attempt, we schedule a retry on the tablet server's strand (the same connection used to schedule incoming rpcs), setting the lock acquisition deadline to the request's deadline. This way, the thread blocks trying to acquire the shared in-memory locks until the request deadline, which applies back-pressure on incoming connections (since we know there is heavy contention). Pawning off the resumption of these contentious waiters to the underlying strand also avoids adding latency in-line for the thread that resumes other waiters serially (which could have been waiting on less contentious locks).

Jira: DB-9646, DB-7851, DB-8803

Test Plan:
./yb_build.sh tsan --cxx-test='TEST_F(PgWaitQueueContentionStressTest, TestResumeWaitersOnScheduler) {' -n 10

Without the fix, i.e., with the patch below, the above test fails consistently in TSAN mode. A snippet of the failure is pasted below.
```
diff --git a/src/yb/docdb/wait_queue.cc b/src/yb/docdb/wait_queue.cc
index 1169707b1d..c77ba5c45b 100644
--- a/src/yb/docdb/wait_queue.cc
+++ b/src/yb/docdb/wait_queue.cc
@@ -332,13 +332,15 @@ struct WaiterData : public std::enable_shared_from_this<WaiterData> {
       } else {
         *locks = unlocked_copy->TryLock(locking_deadline);
         if (!locks->status().ok()) {
-          {
-            // Restore the state of 'unlocked_' enabling retry of waiter resumption.
-            UniqueLock l(mutex_);
-            unlocked_.swap(unlocked_copy);
-          }
-          return locks->status().CloneAndPrepend(
-            Format("$0Waiter failed to re-acquire in memory locks", LogPrefix()));
+          // {
+          //   // Restore the state of 'unlocked_' enabling retry of waiter resumption.
+          //   UniqueLock l(mutex_);
+          //   unlocked_.swap(unlocked_copy);
+          // }
+          // return locks->status().CloneAndPrepend(
+          //   Format("$0Waiter failed to re-acquire in memory locks", LogPrefix()));
+          status = locks->status().CloneAndPrepend(
+              Format("$0Waiter failed to re-acquire in memory locks", LogPrefix()));
         }
       }
     }
```
Failure observed when the retry is disabled:
```
2024-01-11 17:33:02.757 UTC [72937] ERROR:  TxnId: f09ad238-1cec-4e2a-8657-6cafa0e9c9f7 Waiter failed to re-acquire in memory locks: Failed to obtain locks until deadline: 1932.710s, batch: [{ key:  intent_types: [kWeakRead, kWeakWrite] }, { key: 4712104880000001 intent_types: [kStrongRead, kStrongWrite] }]
Bad status: Network error (yb/yql/pgwrapper/libpq_utils.cc:366): Fetch 'SELECT * FROM foo WHERE k=1 FOR UPDATE' failed: 7, message: ERROR:  TxnId: f09ad238-1cec-4e2a-8657-6cafa0e9c9f7 Waiter failed to re-acquire in memory locks: Failed to obtain locks until deadline: 1932.710s, batch: [{ key:  intent_types: [kWeakRead, kWeakWrite] }, { key: 4712104880000001 intent_types: [kStrongRead, kStrongWrite] }] (pgsql error XX000) (aux msg ERROR:  TxnId: f09ad238-1cec-4e2a-8657-6cafa0e9c9f7 Waiter failed to re-acquire in memory locks: Failed to obtain locks until deadline: 1932.710s, batch: [{ key:  intent_types: [kWeakRead, kWeakWrite] }, { key: 4712104880000001 intent_types: [kStrongRead, kStrongWrite] }])
```

Reviewers: rsami

Reviewed By: rsami

Subscribers: yql, ybase, bogdan

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D32123
Status: Done