-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Shutting down database hangs in dtest-debug #7331
Comments
@avikivity this might be related to your sstables manager changes |
I reproduced the hang locally with the following additional debug printouts: diff --git a/database.cc b/database.cc
index c03bd3dca4..35a8f45be7 100644
--- a/database.cc
+++ b/database.cc
@@ -1799,7 +1799,9 @@ database::stop() {
// try to ensure that CL has done disk flushing
future<> maybe_shutdown_commitlog = _commitlog != nullptr ? _commitlog->shutdown() : make_ready_future<>();
+ dblog.debug("Shutting down commitlog");
return maybe_shutdown_commitlog.then([this] {
+ dblog.debug("Waiting for view updates");
return _view_update_concurrency_sem.wait(max_memory_pending_view_updates());
}).then([this] {
if (_commitlog != nullptr) {
@@ -1807,16 +1809,22 @@ database::stop() {
}
return make_ready_future<>();
}).then([this] {
+ dblog.debug("Shutting down system dirty memory manager");
return _system_dirty_memory_manager.shutdown();
}).then([this] {
+ dblog.debug("Shutting down dirty memory manager");
return _dirty_memory_manager.shutdown();
}).then([this] {
+ dblog.debug("Shutting down streaming dirty memory manager");
return _streaming_dirty_memory_manager.shutdown();
}).then([this] {
+ dblog.debug("Shutting down memtable controller");
return _memtable_controller.shutdown();
}).then([this] {
+ dblog.debug("Shutting down user sstables manager");
return _user_sstables_manager->close();
}).then([this] {
+ dblog.debug("Shutting down system sstables manager");
return _system_sstables_manager->close();
});
}
diff --git a/db/commitlog/commitlog.cc b/db/commitlog/commitlog.cc
index c2a34db182..8a6e995e83 100644
--- a/db/commitlog/commitlog.cc
+++ b/db/commitlog/commitlog.cc
@@ -1565,11 +1565,15 @@ future<> db::commitlog::segment_manager::shutdown() {
discard_unused_segments();
// Now that the gate is closed and requests completed we are sure nobody else will pop()
return clear_reserve_segments().finally([this] {
+ clogger.debug("Waiting on reserve_replenisher");
return std::move(_reserve_replenisher).then_wrapped([this] (auto f) {
// Could be cleaner with proper seastar support
if (f.failed()) {
- _shutdown_promise->set_exception(f.get_exception());
+ auto ep = f.get_exception();
+ clogger.debug("Shutdown failed: {}", ep);
+ _shutdown_promise->set_exception(std::move(ep));
} else {
+ clogger.debug("Shutdown completed successfully");
_shutdown_promise->set_value();
}
}); dtest command line:
I'm seeing the following pattern in the logs:
Here is the full log: |
Cc @xemul |
Similarly, in database_test
|
…tion()' from Avi Kivity Fix a race condition in on_compaction_completion() that can prevent shutdown, as well as an exception handling error. See individual patches for details. Fixes #7331. Closes #7334 * github.com:scylladb/scylla: table: fix mishandled _sstable_deleted_gate exception in on_compaction_completion table: fix on_compaction_completion corrupting _sstables_compacted_but_not_deleted during self-race
…tion()' from Avi Kivity Fix a race condition in on_compaction_completion() that can prevent shutdown, as well as an exception handling error. See individual patches for details. Fixes #7331. Closes #7334 * github.com:scylladb/scylla: table: fix mishandled _sstable_deleted_gate exception in on_compaction_completion table: fix on_compaction_completion corrupting _sstables_compacted_but_not_deleted during self-race
This fixes a bug that wasn't in any released branch (broken in a536988), so no need to backport. |
"Before a536988, this was a benign race (only resulting in turns out this is not a benign race. #8038 if a deleted sstable remains around, it may potentially prevent expired sstables from being purged. i'll prepare a backport of it to 4.2.1 |
…t_not_deleted during self-race on_compaction_completion() updates _sstables_compacted_but_not_deleted through a temporary to avoid an exception causing a partial update: 1. copy _sstables_compacted_but_not_deleted to a temporary 2. update temporary 3. do dangerous stuff 4. move temporary to _sstables_compacted_but_not_deleted This is racy when we have parallel compactions, since step 3 yields. We can have two invocations running in parallel, taking snapshots of the same _sstables_compacted_but_not_deleted in step 1, each modifying it in different ways, and only one of them winning the race and assigning in step 4. With the right timing we can end with extra sstables in _sstables_compacted_but_not_deleted. Before a536988, this was a benign race (only resulting in deleted file space not being reclaimed until the service is shut down), but afterwards, extra sstable references result in the service refusing to shut down. This was observed in database_test in debug mode, where the race more or less reliably happens for system.truncated. Fix by using a different method to protect _sstables_compacted_but_not_deleted. We unconditionally update it, and also unconditionally fix it up (on success or failure) using seastar::defer(). The fixup includes a call to rebuild_statistics() which must happen every time we touch the sstable list. Ref #7331. Fixes #8038. BACKPORT NOTES: - Turns out this race prevented deletion of expired sstables because the leaked deleted sstables would be accounted when checking if an expired sstable can be purged. - Switch to unordered_set<>::count() as it's not supported by older compilers. (cherry picked from commit a43d507) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210212203832.45846-1-raphaelsc@scylladb.com>
Backported to 4.2 too, in light of the extra negative effects of the bug. |
…t_not_deleted during self-race on_compaction_completion() updates _sstables_compacted_but_not_deleted through a temporary to avoid an exception causing a partial update: 1. copy _sstables_compacted_but_not_deleted to a temporary 2. update temporary 3. do dangerous stuff 4. move temporary to _sstables_compacted_but_not_deleted This is racy when we have parallel compactions, since step 3 yields. We can have two invocations running in parallel, taking snapshots of the same _sstables_compacted_but_not_deleted in step 1, each modifying it in different ways, and only one of them winning the race and assigning in step 4. With the right timing we can end with extra sstables in _sstables_compacted_but_not_deleted. Before a536988, this was a benign race (only resulting in deleted file space not being reclaimed until the service is shut down), but afterwards, extra sstable references result in the service refusing to shut down. This was observed in database_test in debug mode, where the race more or less reliably happens for system.truncated. Fix by using a different method to protect _sstables_compacted_but_not_deleted. We unconditionally update it, and also unconditionally fix it up (on success or failure) using seastar::defer(). The fixup includes a call to rebuild_statistics() which must happen every time we touch the sstable list. Ref scylladb#7331. Fixes scylladb#8038. BACKPORT NOTES: - Turns out this race prevented deletion of expired sstables because the leaked deleted sstables would be accounted when checking if an expired sstable can be purged. - Switch to unordered_set<>::count() as it's not supported by older compilers. (cherry picked from commit a43d507) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210212203832.45846-1-raphaelsc@scylladb.com>
Seen consistently since https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/604/testReport
Scylla version f1fcf4f
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/604/artifact/logs-debug.2/1601137898758_repair_additional_test.RepairAdditionalTest.repair_option_pr_dc_host_test/node1.log
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/604/artifact/logs-debug.2/1601139448436_repair_additional_test.RepairAdditionalTest.repair_option_pr_multi_dc_test/node1.log
But not only in those dtests, for example:
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/610/artifact/logs-debug.2/1601850222686_bootstrap_test.TestBootstrap.start_stop_test/node1.log
Note that this looks similar, but different than #7183
The text was updated successfully, but these errors were encountered: