Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[docdb] TSAN Race on Log Close after shutdown during remote bootstrap deletes tablet data #3012

Closed
bmatican opened this issue Nov 23, 2019 · 2 comments
Assignees
Labels
area/docdb YugabyteDB core features kind/failing-test Tests and testing infra

Comments

@bmatican
Copy link
Contributor

F20191106 04:56:48 ../../src/yb/consensus/log_util.cc:874] Check failed: !IsFooterWritten()

Saw this as effect of #3010

@bmatican bmatican added kind/failing-test Tests and testing infra area/docdb YugabyteDB core features labels Nov 23, 2019
@bmatican bmatican self-assigned this Nov 23, 2019
@bmatican bmatican added this to To Do in YBase features via automation Nov 23, 2019
bmatican added a commit that referenced this issue Nov 27, 2019
Summary:
Started digging into some TSAN failures in QLTransactionTest.RemoteBootstrap.
Found and fixed a number of issues:

- [#3002] Data race on `yb::log::Log::active_segment_sequence_number()`
Seems this field is protected by a read lock for reads, but was not protected on writes. Turned it
into an atomic.

- [#3007] Race condition between TabletPeer `Init` and `Shutdown`
> std::__1::shared_ptr<yb::tablet::enterprise::Tablet>::get()
> td::__1::shared_ptr<yb::consensus::RaftConsensus>::get()
> src/yb/tablet/tablet_peer.cc:385:17 in yb::tablet::TabletPeer::StartShutdown()
Seems like `Shutdown` did not take the appropriate locks to access either `tablet_` or `consensus_`.

- [#3008] Race condition in thread pool `Worker` Shutdown path:
> #12 yb::rpc::ThreadPool::Impl::Shutdown() /n/users/bogdan/code/yugabyte-db/build/tsan-clang-dynamic-ninja/../../src/yb/rpc/thread_pool.cc:224 (libyrpc.so+0x20c73f)
> #3 yb::rpc::(anonymous namespace)::Worker::Notify() /n/users/bogdan/code/yugabyte-db/build/tsan-clang-dynamic-ninja/../../src/yb/rpc/thread_pool.cc:75 (libyrpc.so+0x20ad6e)
Essentially, we're destroying the vector of workers, but it's possible we still end up trying to notify
them afterwards. Moved some of the code around and expoxed an explicit `Join`. Logic should stay
basically the same. Also moved to shared_ptr instead of raw pointers.

- [#3009] Race condition in Master async RPC task vs CatalogManager reading the task description
> #0 yb::master::PickLeaderReplica::PickReplica(yb::master::TSDescriptor**) /n/users/bogdan/code/yugabyte-db/build/tsan-clang-dynamic-ninja/../../src/yb/master/async_rpc_tasks.cc:95:12 (libmaster.so+0x2450b8)
> #2 yb::master::CatalogManager::SendAddServerRequest(scoped_refptr<yb::master::TabletInfo> const&, yb::consensus::RaftPeerPB_MemberType, yb::consensus::ConsensusStatePB const&, std::__1::basic_string<char, std::__1::char_traits      <char>, std::__1::allocator<char> > const&) /n/users/bogdan/code/yugabyte-db/build/tsan-clang-dynamic-ninja/../../src/yb/master/catalog_manager.cc:5046:54
Just removed the log line..

There are a couple more issues I am still seeing:
- [#3010] Another `Long wait for safe op id`, but one seems like a bootstrap bug
Currently, doing a remote bootstrap triggers an inline OpenTablet in TSTabletManager, unlike the
normal ones, which are scheduled through a thread pool. That causes issues, because on shutdown,
we wait for the threadpool tasks to finish / get aborted. However, when done inline, this exposes
race conditions between Init and Shutdown paths for TabletPeer, RaftConsensus, Log, etc.

- [#3011] SEGV during `DisableFailureDetector`, during raft shutdown
Caused by the same race between Start (which creates the timer) and shutdown, which aborts it.

- [#3012] Log Close failures not flipping the state to closed
> F20191106 04:56:48 ../../src/yb/consensus/log_util.cc:874] Check failed: !IsFooterWritten()
Caused by the same race above. If TSTabletManager starts a remote bootstrap, we open a log. If we
shutdown the tablet manager, before finishing a bootstrap, we wipe the data, but then when we
close the log, we error out as files do not exist anymore.

- [#3013] Race condition in Master async RPC tasks state transitions
> F20191106 05:10:05 ../../src/yb/master/async_rpc_tasks.cc:126] Check failed: task_state == MonitoredTaskState::kWaiting State: kScheduling
Seems like there was a race between scheduling the task to run on the reactor thread and only AFTER
flipping the state from kScheduling. This can be a standalone investigation.

Test Plan: `ybd tsan --cxx-test client_ql-transaction-test --gtest_filter QLTransactionTest.RemoteBootstrap -n 100 --tp 4`

Reviewers: mikhail, sergei

Reviewed By: sergei

Subscribers: hector, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D7529
@bmatican
Copy link
Contributor Author

Found a bunch more examples of this across test failures cc @mbautin

/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/541/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_path_handlers-itest/MasterPathHandlersItest_TestDeadTServers.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/542/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/547/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_path_handlers-itest/MasterPathHandlersItest_TestDeadTServers.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/548/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_path_handlers-itest/MasterPathHandlersItest_TestDeadTServers.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/555/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_path_handlers-itest/MasterPathHandlersItest_TestDeadTServers.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/555/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/556/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_path_handlers-itest/MasterPathHandlersItest_TestDeadTServers.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/557/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/561/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_path_handlers-itest/MasterPathHandlersItest_TestDeadTServers.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/562/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-client__ql-transaction-test/QLTransactionTest_RemoteBootstrap.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/564/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_path_handlers-itest/MasterPathHandlersItest_TestDeadTServers.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/565/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-debug/builds/537/archive/build/debug-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_path_handlers-itest/MasterPathHandlersItest_TestDeadTServers.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-debug/builds/543/archive/build/debug-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-debug/builds/546/archive/build/debug-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_path_handlers-itest/MasterPathHandlersItest_TestDeadTServers.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-debug/builds/548/archive/build/debug-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-debug/builds/553/archive/build/debug-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-gcc-debug/builds/547/archive/build/debug-gcc-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-gcc-debug/builds/551/archive/build/debug-gcc-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-gcc-debug/builds/559/archive/build/debug-gcc-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-gcc-debug/builds/560/archive/build/debug-gcc-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log

Example message when RBS happens during shutdown, which I believe is the root cause:

W0108 14:10:59.177556 15321 tablet_service.cc:1895] Start remote bootstrap failed: Invalid argument (yb/tablet/tablet_peer.cc:1068): Incorrect state to start TabletPeer, Expected state: BOOTSTRAPPING, got: QUIESCING

@bmatican
Copy link
Contributor Author

Some more recent examples:

/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/623/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/625/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_path_handlers-itest/MasterPathHandlersItest_TestDeadTServers.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/626/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_path_handlers-itest/MasterPathHandlersItest_TestDeadTServers.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/627/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/639/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_path_handlers-itest/MasterPathHandlersItest_TestDeadTServers.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/640/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/646/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_path_handlers-itest/MasterPathHandlersItest_TestDeadTServers.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/648/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_path_handlers-itest/MasterPathHandlersItest_TestDeadTServers.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/656/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/658/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/670/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-client__snapshot-txn-test/SnapshotTxnTest_RemoteBootstrapOnStart.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/686/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/689/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-asan/builds/692/archive/build/asan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-debug/builds/606/archive/build/debug-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-debug/builds/625/archive/build/debug-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-debug/builds/630/archive/build/debug-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/614/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/617/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/625/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/632/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/659/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-client__snapshot-txn-test/SnapshotTxnTest_RemoteBootstrapOnStart.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/666/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/669/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-client__snapshot-txn-test/SnapshotTxnTest_RemoteBootstrapOnStart.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/674/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/688/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-gcc-debug/builds/604/archive/build/debug-gcc-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-gcc-debug/builds/606/archive/build/debug-gcc-dynamic-ninja/yb-test-logs/tests-integration-tests__master_replication-itest/MasterReplicationTest_TestCycleThroughAllMasters.log```

bmatican added a commit that referenced this issue Mar 19, 2020
Summary:
Currently, there is a race in `TSTabletManager` between `StartRemoteBootstrap` and `Shutdown`.

In particular, Shutdown does not wait for in-progress remote bootstraps and can proceed to shutdown the tablet peers, while the RBS is still acting on them. This causes TabletPeer/Tablet/RaftConsensus/Log races between their respective Init/Start methods and Shutdown/destructor.

To fix this, I introduced an atomic counter for in progress remote bootstrap calls.
When starting a remote bootstrap, we already check if the TSTabletManager is closing and early exit. If we are not closing, then we increment the counter and make sure to decrement at scope exit (in case of any failures).

During shutdown, we first set the state to QUIESCING then we wait on in progress RBS to reach 0, which should eventually finish, unless we have some other deadlocks style situations...Even so, if we fail during shutdown, it's decently safe to crash, since the server must have been shutting down anyway.

Extra fixes:
- race condition on TSDescriptor when clearing metrics (missing lock)
- race condition on RaftConsensus if calling Shutdown before Start (null FD)
- race condition on TSTabletManager not waiting for TabletPeer shutdown in `ShutdownAndTombstoneTabletPeerNotOk` leading to potentially deleting the files on disk for a tablet, while some peer could be accessing them (ie: shutdown happening in a different thread)

Test Plan:
This used to fail about 17/100, now passed 500/500:
`ybd tsan --cxx-test integration-tests_master_path_handlers-itest --gtest_filter MasterPathHandlersItest.TestDeadTServers -n 500 --tp 4`

Reviewers: mikhail, hector, rahuldesirazu, sergei

Reviewed By: sergei

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8111
YBase features automation moved this from To Do to Done Mar 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/docdb YugabyteDB core features kind/failing-test Tests and testing infra
Projects
Development

No branches or pull requests

1 participant