
[DocDB] Include timeout value in the error message when timing out from wait-queue while trying to re-acquire shared in-memory locks #19859

Closed
basavaraj29 opened this issue Nov 6, 2023 · 1 comment
Labels: 2.20 Backport Required, area/docdb (YugabyteDB core features), kind/bug (This issue is a bug), priority/medium (Medium priority issue)

Comments


basavaraj29 commented Nov 6, 2023

Jira Link: DB-8803

Description

When a request requiring explicit locks arrives for a tablet, we try to acquire shared in-memory locks. If the request times out at this phase, we stream back an error message to the client of the form:

Failed to obtain locks until deadline: < CoarseTimePoint > ... Timeout: 601.999s...

If the request ends up in the wait-queue, we could also time out while trying to re-acquire the shared in-memory locks when resuming the waiter. In that case, we return an error containing only the deadline (a CoarseTimePoint) and not the timeout value (the delta), which could mislead the end user into thinking that the value shown is the timeout when it is in fact the absolute deadline:

Failed to obtain locks until deadline: 33135.898s

Correct the error message to explicitly include the timeout value as well.
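As a minimal sketch of the requested fix (not YugabyteDB's actual code), the change amounts to formatting both the absolute deadline and the relative timeout into the status message; `FormatLockTimeoutError` and its parameters are illustrative names:

```cpp
#include <sstream>
#include <string>

// Illustrative sketch only: build a lock-timeout error that reports both the
// absolute deadline (a time point) and the relative timeout (a delta), so the
// client cannot mistake one for the other.
std::string FormatLockTimeoutError(double deadline_secs, double timeout_secs) {
  std::ostringstream out;
  out << "Failed to obtain locks until deadline: " << deadline_secs
      << "s, timeout: " << timeout_secs << "s";
  return out.str();
}
```

With this shape, the misleading message from the report would instead read `Failed to obtain locks until deadline: 33135.9s, timeout: 602s`-style output, making the delta visible.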

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@basavaraj29 basavaraj29 added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Nov 6, 2023
@basavaraj29 basavaraj29 self-assigned this Nov 6, 2023
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue and removed status/awaiting-triage Issue awaiting triage labels Nov 6, 2023
@basavaraj29

Planning to address this as part of #19032; will keep the issue open until then.

basavaraj29 added a commit that referenced this issue Jan 16, 2024
… waiter resumption until request deadline

Summary:
In the existing implementation, while resuming waiter records from the wait-queue, we try re-acquiring the shared in-memory locks for 1 sec irrespective of the client/statement timeout. In contentious workloads, this could lead to requests failing at this stage even though the client/statement timeout hasn't passed.

This diff addresses the issue by implementing a retry mechanism at the wait-queue level for acquiring the shared in-memory locks during waiter resumption. If we fail to acquire the locks within 100ms (changed `wait_for_relock_unblocked_txn_keys_ms` to 100) on the first attempt, we schedule a retry on the tablet server's reactor thread pool, setting the lock acquisition deadline to the request's deadline. This way, the reactor thread blocks trying to acquire the shared in-memory locks until the request deadline, which applies back-pressure on incoming connections (since we know there is heavy contention). Pawning off the resumption of these contentious waiters to the reactor threads also avoids adding latency in-line for the thread that resumes other waiters serially (which could have been waiting on less contentious locks).
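The two-phase scheme described above can be sketched as follows. This is a minimal stand-in that uses `std::async` in place of the tablet server's reactor pool; `TryAcquire`, `ResumeWaiter`, and `kFirstAttemptBudget` are hypothetical names, not YugabyteDB APIs:

```cpp
#include <chrono>
#include <functional>
#include <future>
#include <thread>

using Clock = std::chrono::steady_clock;

// Poll the lock until it is acquired or the deadline passes.
bool TryAcquire(std::function<bool()> try_lock, Clock::time_point deadline) {
  while (Clock::now() < deadline) {
    if (try_lock()) return true;
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  return false;
}

bool ResumeWaiter(std::function<bool()> try_lock,
                  Clock::time_point request_deadline) {
  constexpr auto kFirstAttemptBudget = std::chrono::milliseconds(100);
  // Phase 1: short in-line attempt so uncontended waiters resume immediately
  // without blocking the thread that resumes other waiters serially.
  if (TryAcquire(try_lock, Clock::now() + kFirstAttemptBudget)) return true;
  // Phase 2: hand the contended waiter off to a background executor (standing
  // in for the reactor pool) that retries until the request's own deadline.
  auto fut = std::async(std::launch::async, [=] {
    return TryAcquire(try_lock, request_deadline);
  });
  return fut.get();
}
```

The key design point mirrored here is that the second phase uses the request's deadline, not a fixed 1-second budget, so a waiter only fails once its own timeout has genuinely elapsed.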
Jira: DB-7851, DB-8803

Test Plan:
./yb_build.sh tsan --cxx-test='TEST_F(PgWaitQueueContentionStressTest, TestResumeWaitersOnReactor) {' -n 10

Without the fix, i.e., with the patch below, the above test fails consistently in TSAN mode. A snippet of the failure is pasted below.
```
diff --git a/src/yb/docdb/wait_queue.cc b/src/yb/docdb/wait_queue.cc
index 1169707b1d..c77ba5c45b 100644
--- a/src/yb/docdb/wait_queue.cc
+++ b/src/yb/docdb/wait_queue.cc
@@ -332,13 +332,15 @@ struct WaiterData : public std::enable_shared_from_this<WaiterData> {
       } else {
         *locks = unlocked_copy->TryLock(locking_deadline);
         if (!locks->status().ok()) {
-          {
-            // Restore the state of 'unlocked_' enabling retry of waiter resumption.
-            UniqueLock l(mutex_);
-            unlocked_.swap(unlocked_copy);
-          }
-          return locks->status().CloneAndPrepend(
-            Format("$0Waiter failed to re-acquire in memory locks", LogPrefix()));
+          // {
+          //   // Restore the state of 'unlocked_' enabling retry of waiter resumption.
+          //   UniqueLock l(mutex_);
+          //   unlocked_.swap(unlocked_copy);
+          // }
+          // return locks->status().CloneAndPrepend(
+          //   Format("$0Waiter failed to re-acquire in memory locks", LogPrefix()));
+          status = locks->status().CloneAndPrepend(
+              Format("$0Waiter failed to re-acquire in memory locks", LogPrefix()));
         }
       }
     }
```
Failure observed when the retry is disabled:
```
2024-01-11 17:33:02.757 UTC [72937] ERROR:  TxnId: f09ad238-1cec-4e2a-8657-6cafa0e9c9f7 Waiter failed to re-acquire in memory locks: Failed to obtain locks until deadline: 1932.710s, batch: [{ key:  intent_types: [kWeakRead, kWeakWrite] }, { key: 4712104880000001 intent_types: [kStrongRead, kStrongWrite] }]
Bad status: Network error (yb/yql/pgwrapper/libpq_utils.cc:366): Fetch 'SELECT * FROM foo WHERE k=1 FOR UPDATE' failed: 7, message: ERROR:  TxnId: f09ad238-1cec-4e2a-8657-6cafa0e9c9f7 Waiter failed to re-acquire in memory locks: Failed to obtain locks until deadline: 1932.710s, batch: [{ key:  intent_types: [kWeakRead, kWeakWrite] }, { key: 4712104880000001 intent_types: [kStrongRead, kStrongWrite] }] (pgsql error XX000) (aux msg ERROR:  TxnId: f09ad238-1cec-4e2a-8657-6cafa0e9c9f7 Waiter failed to re-acquire in memory locks: Failed to obtain locks until deadline: 1932.710s, batch: [{ key:  intent_types: [kWeakRead, kWeakWrite] }, { key: 4712104880000001 intent_types: [kStrongRead, kStrongWrite] }])
```

Reviewers: rsami

Reviewed By: rsami

Subscribers: bogdan, ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D29796
basavaraj29 added a commit that referenced this issue Feb 2, 2024
… in-memory locks during waiter resumption until request deadline

Summary:
Original commit: 7d7a636 / D29796
and commit: bf593a8 / D31766

In the existing implementation, while resuming waiter records from the wait-queue, we try re-acquiring the shared in-memory locks for 1 sec irrespective of the client/statement timeout. In contentious workloads, this could lead to requests failing at this stage even though the client/statement timeout hasn't passed.

This diff addresses the issue by implementing a retry mechanism at the wait-queue level for acquiring the shared in-memory locks during waiter resumption. If we fail to acquire the locks within 100ms (changed `wait_for_relock_unblocked_txn_keys_ms` to 100) on the first attempt, we schedule a retry on the tablet server's strand (the same connection used to schedule incoming rpcs), setting the lock acquisition deadline to the request's deadline. This way, the thread blocks trying to acquire the shared in-memory locks until the request deadline, which applies back-pressure on incoming connections (since we know there is heavy contention). Pawning off the resumption of these contentious waiters to the underlying strand also avoids adding latency in-line for the thread that resumes other waiters serially (which could have been waiting on less contentious locks).

Jira: DB-9646, DB-7851, DB-8803

Test Plan:
./yb_build.sh tsan --cxx-test='TEST_F(PgWaitQueueContentionStressTest, TestResumeWaitersOnScheduler) {' -n 10

Without the fix, i.e., with the patch below, the above test fails consistently in TSAN mode. A snippet of the failure is pasted below.
```
diff --git a/src/yb/docdb/wait_queue.cc b/src/yb/docdb/wait_queue.cc
index 1169707b1d..c77ba5c45b 100644
--- a/src/yb/docdb/wait_queue.cc
+++ b/src/yb/docdb/wait_queue.cc
@@ -332,13 +332,15 @@ struct WaiterData : public std::enable_shared_from_this<WaiterData> {
       } else {
         *locks = unlocked_copy->TryLock(locking_deadline);
         if (!locks->status().ok()) {
-          {
-            // Restore the state of 'unlocked_' enabling retry of waiter resumption.
-            UniqueLock l(mutex_);
-            unlocked_.swap(unlocked_copy);
-          }
-          return locks->status().CloneAndPrepend(
-            Format("$0Waiter failed to re-acquire in memory locks", LogPrefix()));
+          // {
+          //   // Restore the state of 'unlocked_' enabling retry of waiter resumption.
+          //   UniqueLock l(mutex_);
+          //   unlocked_.swap(unlocked_copy);
+          // }
+          // return locks->status().CloneAndPrepend(
+          //   Format("$0Waiter failed to re-acquire in memory locks", LogPrefix()));
+          status = locks->status().CloneAndPrepend(
+              Format("$0Waiter failed to re-acquire in memory locks", LogPrefix()));
         }
       }
     }
```
Failure observed when the retry is disabled:
```
2024-01-11 17:33:02.757 UTC [72937] ERROR:  TxnId: f09ad238-1cec-4e2a-8657-6cafa0e9c9f7 Waiter failed to re-acquire in memory locks: Failed to obtain locks until deadline: 1932.710s, batch: [{ key:  intent_types: [kWeakRead, kWeakWrite] }, { key: 4712104880000001 intent_types: [kStrongRead, kStrongWrite] }]
Bad status: Network error (yb/yql/pgwrapper/libpq_utils.cc:366): Fetch 'SELECT * FROM foo WHERE k=1 FOR UPDATE' failed: 7, message: ERROR:  TxnId: f09ad238-1cec-4e2a-8657-6cafa0e9c9f7 Waiter failed to re-acquire in memory locks: Failed to obtain locks until deadline: 1932.710s, batch: [{ key:  intent_types: [kWeakRead, kWeakWrite] }, { key: 4712104880000001 intent_types: [kStrongRead, kStrongWrite] }] (pgsql error XX000) (aux msg ERROR:  TxnId: f09ad238-1cec-4e2a-8657-6cafa0e9c9f7 Waiter failed to re-acquire in memory locks: Failed to obtain locks until deadline: 1932.710s, batch: [{ key:  intent_types: [kWeakRead, kWeakWrite] }, { key: 4712104880000001 intent_types: [kStrongRead, kStrongWrite] }])
```

Reviewers: rsami

Reviewed By: rsami

Subscribers: yql, ybase, bogdan

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D32123
Status: Done