[DocDB] ThreadRestriction error in WaitOnConflictResolver::WaitingDone #20651
Labels: 2.20 Backport Required, area/docdb (YugabyteDB core features), kind/bug (This issue is a bug), priority/medium (Medium priority issue)
Comments
amitanandaiyer
added
area/docdb
YugabyteDB core features
status/awaiting-triage
Issue awaiting triage
labels
Jan 17, 2024
yugabyte-ci added the kind/bug (This issue is a bug) and priority/medium (Medium priority issue) labels on Jan 17, 2024
basavaraj29 added a commit that referenced this issue on Jan 18, 2024:
…instead of reactor threads

Summary: In 7d7a636, we introduced a codepath in which we defer resumption of contentious waiters and schedule them on reactor threads. This led to a fatal error, since IO operations (done as part of conflict resolution) aren't supposed to be executed on reactor threads. In the existing implementation, we use the rpc threads and a custom `wait-queue` threadpool to run conflict resolution. This diff addresses the issue by scheduling contentious waiter resumption on the same underlying `Scheduler::Impl::strand_` that is used for executing incoming rpc calls.

Jira: DB-9646

Test Plan: Jenkins: urgent

1. Manually executed the test put up on the github issue.
2. Since `ResumeContentiousWaiter` has a `DCHECK` asserting that IO is allowed, ran `./yb_build.sh tsan --cxx-test='TEST_F(PgWaitQueueContentionStressTest, TestResumeWaitersOnScheduler) {' --test-args --vmodule=wait_queue=1`, grepped for `wait_queue.*vlog.*on Scheduler`, and confirmed that the waiter is resumed on `Scheduler::Impl::strand_` and that the assertion does not fail.

Reviewers: rsami
Reviewed By: rsami
Subscribers: yql, ybase
Differential Revision: https://phorge.dev.yugabyte.com/D31766
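The pattern behind this fix is worth spelling out. Below is a minimal, self-contained C++ sketch (not YugabyteDB code: `Strand`, `AssertIOAllowed`, and `io_allowed` are hypothetical stand-ins for `Scheduler::Impl::strand_` and the thread-restriction check mentioned above) of deferring work from an IO-restricted reactor thread onto a serial strand where IO is permitted:

```
// Minimal sketch (hypothetical names, not YugabyteDB code) of deferring
// IO-bearing work off a reactor thread onto a serial strand.
#include <condition_variable>
#include <cstdlib>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

thread_local bool io_allowed = true;  // stand-in for a ThreadRestrictions flag

// Analogous to the DCHECK in ResumeContentiousWaiter: crash if a thread that
// forbids IO tries to run IO-bearing work.
void AssertIOAllowed() {
  if (!io_allowed) std::abort();
}

// A strand: tasks posted from any thread execute serially on one worker
// thread, so posted work never runs on the caller's (reactor) thread.
class Strand {
 public:
  Strand() : worker_([this] { Run(); }) {}
  ~Strand() {
    {
      std::lock_guard<std::mutex> l(mu_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }
  void Post(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> l(mu_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> l(mu_);
        cv_.wait(l, [this] { return done_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // done_ was set and the queue is drained
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // runs with io_allowed == true on the worker thread
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool done_ = false;
  std::thread worker_;
};

// Waiter resumption does IO (conflict resolution), so it asserts IO is allowed.
void ResumeWaiter() {
  AssertIOAllowed();
  std::cout << "waiter resumed on a thread where IO is allowed\n";
}

int main() {
  Strand strand;
  std::thread reactor([&] {
    io_allowed = false;          // reactor threads must not perform blocking IO
    strand.Post(ResumeWaiter);   // defer resumption instead of running it inline
  });
  reactor.join();
}
```

The essential invariant is that blocking IO only ever happens on the strand's worker thread, never on the reactor thread that found the waiter resumable.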
basavaraj29 added a commit that referenced this issue on Feb 2, 2024:
… in-memory locks during waiter resumption until request deadline

Summary: Original commit: 7d7a636 / D29796 and commit: bf593a8 / D31766

In the existing implementation, while resuming waiter records from the wait-queue, we try re-acquiring the shared in-memory locks for 1 second, irrespective of the client/statement timeout. In contentious workloads, this could lead to requests failing at this stage even though the client/statement timeout hasn't elapsed.

This diff addresses the issue by implementing a retry mechanism at the wait-queue level for acquiring the shared in-memory locks during waiter resumption. If we fail to acquire the locks within 100ms (changed `wait_for_relock_unblocked_txn_keys_ms` to 100) on the first attempt, we schedule a retry on the tablet server's strand (the same strand used for executing incoming rpc calls), setting the lock acquisition deadline to the request's deadline. This way, we block the thread trying to acquire the shared in-memory locks until the request deadline, applying back-pressure on the incoming connections (since we know there is heavy contention). Offloading the resumption of these contentious waiters to the underlying strand also avoids adding latency for the thread that serially resumes other waiters (which could have been waiting on less contentious locks).

Jira: DB-9646, DB-7851, DB-8803

Test Plan: ./yb_build.sh tsan --cxx-test='TEST_F(PgWaitQueueContentionStressTest, TestResumeWaitersOnScheduler) {' -n 10

Without the fix, i.e. with the patch below, the above test fails consistently in tsan mode; a snippet of the failure is pasted beneath the patch.

```
diff --git a/src/yb/docdb/wait_queue.cc b/src/yb/docdb/wait_queue.cc
index 1169707b1d..c77ba5c45b 100644
--- a/src/yb/docdb/wait_queue.cc
+++ b/src/yb/docdb/wait_queue.cc
@@ -332,13 +332,15 @@ struct WaiterData : public std::enable_shared_from_this<WaiterData> {
     } else {
       *locks = unlocked_copy->TryLock(locking_deadline);
       if (!locks->status().ok()) {
-        {
-          // Restore the state of 'unlocked_' enabling retry of waiter resumption.
-          UniqueLock l(mutex_);
-          unlocked_.swap(unlocked_copy);
-        }
-        return locks->status().CloneAndPrepend(
-            Format("$0Waiter failed to re-acquire in memory locks", LogPrefix()));
+        // {
+        //   // Restore the state of 'unlocked_' enabling retry of waiter resumption.
+        //   UniqueLock l(mutex_);
+        //   unlocked_.swap(unlocked_copy);
+        // }
+        // return locks->status().CloneAndPrepend(
+        //     Format("$0Waiter failed to re-acquire in memory locks", LogPrefix()));
+        status = locks->status().CloneAndPrepend(
+            Format("$0Waiter failed to re-acquire in memory locks", LogPrefix()));
       }
     }
   }
```

Failure observed when the retry is disabled:

```
2024-01-11 17:33:02.757 UTC [72937] ERROR: TxnId: f09ad238-1cec-4e2a-8657-6cafa0e9c9f7 Waiter failed to re-acquire in memory locks: Failed to obtain locks until deadline: 1932.710s, batch: [{ key: intent_types: [kWeakRead, kWeakWrite] }, { key: 4712104880000001 intent_types: [kStrongRead, kStrongWrite] }]
Bad status: Network error (yb/yql/pgwrapper/libpq_utils.cc:366): Fetch 'SELECT * FROM foo WHERE k=1 FOR UPDATE' failed: 7, message: ERROR: TxnId: f09ad238-1cec-4e2a-8657-6cafa0e9c9f7 Waiter failed to re-acquire in memory locks: Failed to obtain locks until deadline: 1932.710s, batch: [{ key: intent_types: [kWeakRead, kWeakWrite] }, { key: 4712104880000001 intent_types: [kStrongRead, kStrongWrite] }] (pgsql error XX000) (aux msg ERROR: TxnId: f09ad238-1cec-4e2a-8657-6cafa0e9c9f7 Waiter failed to re-acquire in memory locks: Failed to obtain locks until deadline: 1932.710s, batch: [{ key: intent_types: [kWeakRead, kWeakWrite] }, { key: 4712104880000001 intent_types: [kStrongRead, kStrongWrite] }])
```

Reviewers: rsami
Reviewed By: rsami
Subscribers: yql, ybase, bogdan
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D32123
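To make the two-phase acquisition concrete, here is a minimal, self-contained sketch (not the actual wait_queue.cc code: `std::timed_mutex` stands in for the shared in-memory lock batch, the 100ms first attempt mirrors `wait_for_relock_unblocked_txn_keys_ms`, and the reschedule-on-strand step is modeled as a plain blocking retry until the request deadline):

```
// Minimal sketch (hypothetical, not the real wait_queue.cc) of the two-phase
// lock re-acquisition: a short first attempt, then a retry bounded only by
// the request's own deadline.
#include <algorithm>
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

using Clock = std::chrono::steady_clock;

std::timed_mutex shared_in_memory_locks;  // stand-in for the shared lock batch

bool ResumeWaiter(Clock::time_point request_deadline) {
  // First attempt: a short fixed budget (mirroring
  // wait_for_relock_unblocked_txn_keys_ms = 100) so one contentious waiter
  // does not stall the thread that serially resumes other waiters.
  auto first_attempt = Clock::now() + std::chrono::milliseconds(100);
  if (shared_in_memory_locks.try_lock_until(std::min(first_attempt, request_deadline))) {
    return true;
  }
  // Retry: in the real fix this is rescheduled onto the tablet server's
  // strand; here it is modeled as blocking until the request deadline, which
  // back-pressures the caller under heavy contention.
  return shared_in_memory_locks.try_lock_until(request_deadline);
}

int main() {
  auto deadline = Clock::now() + std::chrono::seconds(2);
  std::thread contender([] {
    // Hold the lock long enough to push the waiter onto the retry path.
    std::lock_guard<std::timed_mutex> g(shared_in_memory_locks);
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
  });
  std::this_thread::sleep_for(std::chrono::milliseconds(50));  // let contender lock first
  if (ResumeWaiter(deadline)) {
    std::cout << "re-acquired locks before the request deadline\n";
    shared_in_memory_locks.unlock();
  } else {
    std::cout << "request deadline elapsed\n";
  }
  contender.join();
}
```

The key property is that the waiter only fails once its own deadline has elapsed, rather than after an arbitrary fixed 1-second budget.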
Jira Link: DB-9646
Description
Seen on commit 587dec6 with the attached patch:
ash_master.diff.txt
WaitStateITest__AshTestVerifyOccurrence_VerifyWaitStateEntered__kLockedBatchEntry_Lock.log.txt
Issue Type
kind/bug