-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DocDB] Data race in accessing queue_state_.current_term leading to failures in tsan runs #13850
Comments
@yusong-yan , Basava had made changes to the remote bootstrap logic recently. Can you check the failures and sync with Amit on whether Basava's changes have caused these, and address the test failure. |
Looks like the test was added by Basava by his commit - 739ea8f . @yusong-yan , Consult @amitanandaiyer , if you have any questions on the intent of the test etc. The failures are only in centos-clang12-tsan and alma8-gcc11-fastdebug build types, so it looks like a race condition in the test. |
It does seem to be a race, that we even catch in TSAN: https://jenkins.dev.yugabyte.com/job/github-yugabyte-db-centos-master-clang12-tsan/903/artifact/build/tsan-clang12-dynamic-ninja/yb-test-logs/tests-integration-tests__remote_bootstrap-itest/RemoteBootstrapITest_TestLongRemoteBootstrapsAcrossServers.log
|
…rossServer Summary: 1. This test fails on both TSAN and Fastdebug is because of new added server got removed after **follower_unavailable_considered_failed_sec** timeout. It might because TSAN and Fastdebug have slower operation which cause faulty removal. From checking the log record, I notice it took those new added server more than 10 second to complete the pending commit. After change **follower_unavailable_considered_failed_sec** from 10 to 20, it give those server enough time to complete the commit. Now, the test passed both on TSAN and Fastdebug. 2. Fixed race condition warning being noticed from TSAN, which is because reading **queue_state_.current_term** instance without holding the lock. 3. Marked queue_state_ with GUARDED_BY(queue_lock_). And add REQUIRE(queue_lock_) to associated functions. Meanwhile, I notice another potential race condition similar to the one above which is reading **queue_state_.current_term , queue_state.active_config** instances without holding the lock. The way to fix this race condition is also same as the previous one, which is cacheing those values inside the lock scope, then use the cached values outside of the scope. Test Plan: Tested on both TSAN and Fastdebug for 100 times Reviewers: rthallam, amitanand Reviewed By: rthallam, amitanand Subscribers: bogdan, ybase Differential Revision: https://phabricator.dev.yugabyte.com/D19506
Reopening for backports. |
…oteBootstrapsAcrossServer Summary: Original commit:82bcb2e159106d72d7725777d171fcf93449bd11/D19506. 1. Fixed race condition warning being noticed from TSAN, which is because reading **queue_state_.current_term** instance without holding the lock. 2. Marked queue_state_ with GUARDED_BY(queue_lock_). And add REQUIRE(queue_lock_) to associated functions. Meanwhile, I notice another potential race condition similar to the one above which is reading **queue_state_.current_term , queue_state.active_config** instances without holding the lock. The way to fix this race condition is also same as the previous one, which is cacheing those values inside the lock scope, then use the cached values outside of the scope. Test Plan: Tested on both TSAN and Fastdebug for 100 times Reviewers: amitanand, rthallam, qhu Reviewed By: rthallam, qhu Subscribers: ybase, bogdan Differential Revision: https://phabricator.dev.yugabyte.com/D23478
Jira Link: DB-3349
Description
https://detective-gcp.dev.yugabyte.com/stability/test?branch=master&build_type=all&class=RemoteBootstrapITest&fail_tag=all&name=TestLongRemoteBootstrapsAcrossServers&platform=linux
The text was updated successfully, but these errors were encountered: