[DocDB] ThreadRestriction error in WaitOnConflictResolver::WaitingDone #20651

Closed
amitanandaiyer opened this issue Jan 17, 2024 · 0 comments
Labels: 2.20 Backport Required, area/docdb, kind/bug, priority/medium

Comments


amitanandaiyer commented Jan 17, 2024

Jira Link: DB-9646

Description

```
[ts-1] F0117 14:25:58.060429 1853059072 thread_restrictions.cc:71] Check failed: LoadTLS()->io_allowed Function marked as IO-only was called from a thread that disallows IO!  If this thread really should be allowed to make IO calls, adjust the call to yb::ThreadRestrictions::SetIOAllowed() in this thread's startup. Thread 10864058 (name: "TabletServer_reactor", category: "TabletServer_reactor")
 Fatal failure details written to ${BUILD_ROOT}/yb-test-logs/tests-integration-tests__wait_states-itest/WaitStateITest__AshTestVerifyOccurrence_VerifyWaitStateEntered__kLockedBatchEntry_Lock.fatal_failure_details.2024-01-17T14_25_58.pid59951.txt
 F20240117 14:25:58 ../../src/yb/util/thread_restrictions.cc:71] Check failed: LoadTLS()->io_allowed Function marked as IO-only was called from a thread that disallows IO!  If this thread really should be allowed to make IO calls, adjust the call to yb::ThreadRestrictions::SetIOAllowed() in this thread's startup. Thread 10864058 (name: "TabletServer_reactor", category: "TabletServer_reactor")
     @        0x127b09eac  yb::LogFatalHandlerSink::send()
     @        0x109226f84  google::LogDestination::LogToSinks()
     @        0x1092260e0  google::LogMessage::SendToLog()
     @        0x109226a84  google::LogMessage::Flush()
     @        0x10922aa5c  google::LogMessageFatal::~LogMessageFatal()
     @        0x1092279c4  google::LogMessageFatal::~LogMessageFatal()
     @        0x127dbc070  yb::ThreadRestrictions::AssertIOAllowed()
     @        0x127a7d4c8  yb::PosixRandomAccessFile::Read()
     @        0x127a774d4  yb::RandomAccessFile::Read()
     @        0x127a773d0  yb::RandomAccessFile::ReadAndValidate()
     @        0x11af07480  rocksdb::RandomAccessFileReader::ReadAndValidate()
     @        0x11ae3d860  rocksdb::(anonymous namespace)::ReadBlock()
     @        0x11ae3d294  rocksdb::ReadBlockContents()
     @        0x11ae09344  rocksdb::BlockBasedTable::ReadFilterBlock()
     @        0x11ae06354  rocksdb::BlockBasedTable::GetFilter()
     @        0x11ae05d1c  rocksdb::BloomFilterAwareFileFilter::Filter()
     @        0x11ad4df8c  rocksdb::Version::AddLevel0Iterators<>()
     @        0x11ad4dc00  rocksdb::Version::AddIterators()
     @        0x11abdcfa4  rocksdb::DBImpl::NewInternalIterator()
     @        0x11abe9398  rocksdb::DBImpl::NewIterator()
     @        0x11ab5372c  rocksdb::DB::NewIterator()
     @        0x117d9ad20  yb::docdb::BoundedRocksDbIterator::BoundedRocksDbIterator()
     @        0x117d9b27c  yb::docdb::BoundedRocksDbIterator::BoundedRocksDbIterator()
     @        0x117f89920  yb::docdb::CreateRocksDBIterator()
     @        0x117dc6b10  yb::docdb::(anonymous namespace)::StrongConflictChecker::Check()
     @        0x117dc4ea0  yb::docdb::(anonymous namespace)::TransactionConflictResolverContext::ReadConflicts()
     @        0x117da7ef4  yb::docdb::(anonymous namespace)::ConflictResolver::Resolve()
     @        0x117dad99c  yb::docdb::(anonymous namespace)::WaitOnConflictResolver::WaitingDone()
     @        0x117db162c  _ZNSt3__1L8__invokeIRMN2yb5docdb12_GLOBAL__N_122WaitOnConflictResolverEFvRKNS1_6StatusENS1_10HybridTimeEERNS_10shared_ptrIS4_EEJS7_S8_EvEEDTcldsdescT0_fp0_fp_spscT1_fp1_EEOT_OSF_DpOSG_
    @        0x117db155c  _ZNSt3__1L15__apply_functorIMN2yb5docdb12_GLOBAL__N_122WaitOnConflictResolverEFvRKNS1_6StatusENS1_10HybridTimeEENS_5tupleIJNS_10shared_ptrIS4_EENS_12placeholders4__phILi1EEENSF_ILi2EEEEEEJLm0ELm1ELm2EENSB_IJS7_OS8_EEEEENS_13__bind_returnIT_T0_T2_Xsr22__is_valid_bind_returnISM_SN_SO_EE5valueEE4typeERSM_RSN_NS_15__tuple_indicesIJXspT1_EEEEOSO_
    @        0x117db14e4  _ZNSt3__16__bindIMN2yb5docdb12_GLOBAL__N_122WaitOnConflictResolverEFvRKNS1_6StatusENS1_10HybridTimeEEJNS_10shared_ptrIS4_EERKNS_12placeholders4__phILi1EEERKNSE_ILi2EEEEEclIJS7_S8_EEENS_13__bind_returnISA_NS_5tupleIJSC_SF_SI_EEENSO_IJDpOT_EEEXsr22__is_valid_bind_returnISA_SP_ST_EE5valueEE4typeESS_
     @        0x117db1468  _ZNSt3__1L8__invokeIRNS_6__bindIMN2yb5docdb12_GLOBAL__N_122WaitOnConflictResolverEFvRKNS2_6StatusENS2_10HybridTimeEEJNS_10shared_ptrIS5_EERKNS_12placeholders4__phILi1EEERKNSF_ILi2EEEEEEJS8_S9_EEEDTclscT_fp_spscT0_fp0_EEOSO_DpOSP_

 *** Check failure stack trace: ***
```

Seen on commit 587dec6 with the attached patch.

ash_master.diff.txt
WaitStateITest__AshTestVerifyOccurrence_VerifyWaitStateEntered__kLockedBatchEntry_Lock.log.txt
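
The fatal above comes from the thread-restriction check: reactor threads are marked as disallowing IO at startup, and IO entry points (such as `PosixRandomAccessFile::Read()` in the stack) assert on a thread-local flag. Below is a minimal, self-contained sketch of that pattern; the names mirror the ones in the log, but this is illustrative only, not YugabyteDB's actual `ThreadRestrictions` implementation.

```cpp
#include <cassert>
#include <cstdio>

namespace sketch {

// Each thread carries a flag saying whether it may perform IO.
thread_local bool io_allowed = true;

// Called at thread startup; e.g. reactor threads set this to false.
void SetIOAllowed(bool allowed) { io_allowed = allowed; }

// Called at the top of IO-only functions; the real code emits a fatal CHECK,
// assert() stands in here.
void AssertIOAllowed() {
  assert(io_allowed && "IO-only function called from a thread that disallows IO");
}

// Stand-in for an IO entry point such as a file read.
void ReadBlockFromDisk() {
  AssertIOAllowed();
  std::puts("reading block");
}

}  // namespace sketch

int main() {
  sketch::SetIOAllowed(true);
  sketch::ReadBlockFromDisk();    // fine: this thread allows IO

  sketch::SetIOAllowed(false);    // e.g. a reactor thread's startup
  // sketch::ReadBlockFromDisk(); // would trip the assertion, as in the stack above
  return 0;
}
```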

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@amitanandaiyer added the area/docdb and status/awaiting-triage labels Jan 17, 2024
@yugabyte-ci added the kind/bug and priority/medium labels Jan 17, 2024
@basavaraj29 self-assigned this Jan 17, 2024
basavaraj29 added a commit that referenced this issue Jan 18, 2024
…instead of reactor threads

Summary:
In 7d7a636, we introduced a codepath in which we defer resumption of contentious waiters and schedule them on reactor threads. This led to a fatal error, since IO operations (done as part of conflict resolution) aren't supposed to be executed on reactor threads.

In the existing implementation, we use the rpc threads and a custom `wait-queue` threadpool to run conflict resolution. This diff addresses the above issue by scheduling contentious waiter resumption on the same underlying `Scheduler::Impl::strand_` that is used for executing incoming rpc calls.
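
For illustration, here is a minimal sketch of the scheduling idea, assuming boost::asio (this is not the actual `yb::rpc::Scheduler`): handlers posted to a strand are serialized and run on worker threads that may perform IO, instead of executing inline on a reactor thread.

```cpp
#include <boost/asio.hpp>

#include <iostream>
#include <thread>
#include <vector>

int main() {
  boost::asio::io_context io;
  // The strand serializes handlers, much like scheduling contentious waiter
  // resumption on a single underlying strand instead of a reactor thread.
  auto strand = boost::asio::make_strand(io);

  // "Resume" a few contentious waiters; the strand guarantees they never run
  // concurrently, even though the worker pool has multiple threads.
  for (int waiter = 1; waiter <= 3; ++waiter) {
    boost::asio::post(strand, [waiter] {
      std::cout << "resuming waiter " << waiter << " on a thread that allows IO\n";
    });
  }

  // A small worker pool plays the role of threads that are permitted to do IO.
  std::vector<std::thread> pool;
  for (int i = 0; i < 2; ++i) {
    pool.emplace_back([&io] { io.run(); });
  }
  for (auto& t : pool) {
    t.join();
  }
  return 0;
}
```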
Jira: DB-9646

Test Plan:
Jenkins: urgent

1. Manually executed the test put up on the GitHub issue.

2. Since `ResumeContentiousWaiter` has a `DCHECK` asserting that IO is allowed, ran the test below:
`./yb_build.sh tsan --cxx-test='TEST_F(PgWaitQueueContentionStressTest, TestResumeWaitersOnScheduler) {' --test-args --vmodule=wait_queue=1`.

Grepping for `wait_queue.*vlog.*on Scheduler` shows that the waiter is resumed on `Scheduler::Impl::strand_` and that the assertion does not fail.

Reviewers: rsami

Reviewed By: rsami

Subscribers: yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D31766
@yugabyte-ci removed the status/awaiting-triage label Jan 19, 2024
basavaraj29 added a commit that referenced this issue Feb 2, 2024
… in-memory locks during waiter resumption until request deadline

Summary:
Original commit: 7d7a636 / D29796
and commit: bf593a8 / D31766

In the existing implementation, while resuming waiter records from the wait-queue, we try re-acquiring the shared in-memory locks for 1 sec irrespective of the client/statement timeout. In contentious workloads, this could lead to requests failing at this stage even though the client/statement timeout hasn't elapsed.

This diff addresses the issue by implementing a retry mechanism at the wait-queue level for acquiring the shared in-memory locks during waiter resumption. If the first attempt fails to acquire the locks within 100ms (changed `wait_for_relock_unblocked_txn_keys_ms` to 100), we schedule a retry on the tablet server's strand (the same strand used to schedule incoming rpcs), setting the lock acquisition deadline to the request's deadline. This way, the thread trying to acquire the shared in-memory locks blocks until the request deadline, which applies back-pressure on the incoming connections (since we know there is heavy contention). Also, pawning off the resumption of these contentious waiters to the underlying strand avoids adding latency in line for the thread that resumes other waiters serially (which could have been waiting on less contentious locks).
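
A minimal sketch of the retry-until-deadline idea described above; `LockBatch`, `AcquireUntilDeadline`, and `try_lock` are hypothetical stand-ins for the real wait-queue types and calls.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <optional>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for the shared in-memory lock batch.
struct LockBatch {};

// First try to acquire the locks with a short budget; on failure, retry with
// the deadline pushed out to the request deadline instead of a fixed 1s.
// 'try_lock' is a placeholder for the real acquisition call.
std::optional<LockBatch> AcquireUntilDeadline(
    const std::function<std::optional<LockBatch>(Clock::time_point)>& try_lock,
    Clock::time_point request_deadline,
    std::chrono::milliseconds first_attempt_budget = std::chrono::milliseconds(100)) {
  // Short first attempt, mirroring wait_for_relock_unblocked_txn_keys_ms = 100.
  auto first_deadline = std::min(Clock::now() + first_attempt_budget, request_deadline);
  if (auto locks = try_lock(first_deadline)) {
    return locks;
  }
  // Retry until the request deadline. In the actual change this retry is
  // scheduled on the scheduler strand rather than the thread resuming waiters.
  if (Clock::now() < request_deadline) {
    return try_lock(request_deadline);
  }
  return std::nullopt;
}

int main() {
  // Usage with a try_lock that always fails, exercising the retry path.
  auto deadline = Clock::now() + std::chrono::milliseconds(250);
  auto locks = AcquireUntilDeadline(
      [](Clock::time_point) -> std::optional<LockBatch> { return std::nullopt; },
      deadline);
  return locks.has_value() ? 0 : 1;
}
```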

Jira: DB-9646, DB-7851, DB-8803

Test Plan:
./yb_build.sh tsan --cxx-test='TEST_F(PgWaitQueueContentionStressTest, TestResumeWaitersOnScheduler) {' -n 10

Without the fix, i.e., with the patch below, the above test fails consistently in tsan mode. A snippet of the failure is pasted below.
```
diff --git a/src/yb/docdb/wait_queue.cc b/src/yb/docdb/wait_queue.cc
index 1169707b1d..c77ba5c45b 100644
--- a/src/yb/docdb/wait_queue.cc
+++ b/src/yb/docdb/wait_queue.cc
@@ -332,13 +332,15 @@ struct WaiterData : public std::enable_shared_from_this<WaiterData> {
       } else {
         *locks = unlocked_copy->TryLock(locking_deadline);
         if (!locks->status().ok()) {
-          {
-            // Restore the state of 'unlocked_' enabling retry of waiter resumption.
-            UniqueLock l(mutex_);
-            unlocked_.swap(unlocked_copy);
-          }
-          return locks->status().CloneAndPrepend(
-            Format("$0Waiter failed to re-acquire in memory locks", LogPrefix()));
+          // {
+          //   // Restore the state of 'unlocked_' enabling retry of waiter resumption.
+          //   UniqueLock l(mutex_);
+          //   unlocked_.swap(unlocked_copy);
+          // }
+          // return locks->status().CloneAndPrepend(
+          //   Format("$0Waiter failed to re-acquire in memory locks", LogPrefix()));
+          status = locks->status().CloneAndPrepend(
+              Format("$0Waiter failed to re-acquire in memory locks", LogPrefix()));
         }
       }
     }
```
Failure observed when the retry is disabled:
```
2024-01-11 17:33:02.757 UTC [72937] ERROR:  TxnId: f09ad238-1cec-4e2a-8657-6cafa0e9c9f7 Waiter failed to re-acquire in memory locks: Failed to obtain locks until deadline: 1932.710s, batch: [{ key:  intent_types: [kWeakRead, kWeakWrite] }, { key: 4712104880000001 intent_types: [kStrongRead, kStrongWrite] }]
Bad status: Network error (yb/yql/pgwrapper/libpq_utils.cc:366): Fetch 'SELECT * FROM foo WHERE k=1 FOR UPDATE' failed: 7, message: ERROR:  TxnId: f09ad238-1cec-4e2a-8657-6cafa0e9c9f7 Waiter failed to re-acquire in memory locks: Failed to obtain locks until deadline: 1932.710s, batch: [{ key:  intent_types: [kWeakRead, kWeakWrite] }, { key: 4712104880000001 intent_types: [kStrongRead, kStrongWrite] }] (pgsql error XX000) (aux msg ERROR:  TxnId: f09ad238-1cec-4e2a-8657-6cafa0e9c9f7 Waiter failed to re-acquire in memory locks: Failed to obtain locks until deadline: 1932.710s, batch: [{ key:  intent_types: [kWeakRead, kWeakWrite] }, { key: 4712104880000001 intent_types: [kStrongRead, kStrongWrite] }])
```

Reviewers: rsami

Reviewed By: rsami

Subscribers: yql, ybase, bogdan

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D32123