Shutting down auth service may hang #13545
Observed in another run
Reproducer: kbr-scylla@1ae2f86

There's a query in auth service which we perform after we start a node. This query uses infinite timeout:

    future<bool> default_role_row_satisfies(
            cql3::query_processor& qp,
            std::function<bool(const cql3::untyped_result_set_row&)> p) {
        static const sstring query = format("SELECT * FROM {} WHERE {} = ?",
                meta::roles_table::qualified_name,
                meta::roles_table::role_col_name);
        return do_with(std::move(p), [&qp](const auto& p) {
            return qp.execute_internal(
                    query,
                    db::consistency_level::ONE,
                    {meta::DEFAULT_SUPERUSER_NAME},
                    cql3::query_processor::cache_internal::yes).then([&qp, &p](::shared_ptr<cql3::untyped_result_set> results) {

With the right conditions and timing (we kill a node which serves this data before another node attempts the query), shutdown will hang since it's waiting on this query:

    _stopped = auth::do_after_system_ready(_as, [this] {
        return seastar::async([this] {
            _migration_manager.wait_for_schema_agreement(_qp.db().real_database(), db::timeout_clock::time_point::max(), &_as).get0();
            if (any_nondefault_role_row_satisfies(_qp, &has_can_login).get0()) {
                if (this->legacy_metadata_exists()) {
                    log.warn("Ignoring legacy user metadata since nondefault roles already exist.");
                }
                return;
            }
            if (this->legacy_metadata_exists()) {
                this->migrate_legacy_metadata().get0();
                return;
            }
            create_default_role_if_missing().get0();
        });
    });

(it's done in

It looks like an oversight because in the past there was a commit that was supposed to remove infinite timeouts from distributed table queries: 620e950, issue: #3603.

The fix is simple - use a timeout.
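The hang described above can be illustrated without Seastar or Scylla. The following is a minimal sketch in standard C++, not the actual fix: `fake_replica`, `fake_background_query` and `run_and_stop` are hypothetical names invented for this illustration. It assumes only that a background task waits for a reply and that shutdown waits for that task. With a bounded deadline the wait returns even when the replica never answers, so `stop()` completes; with an unbounded wait it would block forever.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <future>
#include <mutex>

// Hypothetical stand-in for the replica that should answer the query.
class fake_replica {
    std::mutex _m;
    std::condition_variable _cv;
    bool _replied = false;
public:
    // Simulates the replica answering the query.
    void reply() {
        {
            std::lock_guard<std::mutex> lk(_m);
            _replied = true;
        }
        _cv.notify_all();
    }
    // Waits for a reply until `deadline`; returns false if the deadline passed first.
    bool wait_for_reply(std::chrono::steady_clock::time_point deadline) {
        std::unique_lock<std::mutex> lk(_m);
        return _cv.wait_until(lk, deadline, [this] { return _replied; });
    }
};

// Hypothetical stand-in for a service that runs one query in the background
// and must wait for it when stopping.
class fake_background_query {
    fake_replica& _replica;
    std::chrono::milliseconds _timeout;
    std::future<bool> _stopped;
public:
    fake_background_query(fake_replica& replica, std::chrono::milliseconds timeout)
        : _replica(replica), _timeout(timeout) {}

    // Starts the background query with a bounded deadline.
    void start() {
        _stopped = std::async(std::launch::async, [this] {
            auto deadline = std::chrono::steady_clock::now() + _timeout;
            return _replica.wait_for_reply(deadline);
        });
    }

    // Waits for the background query; this is where an unbounded wait would hang.
    // Returns true if the query was answered, false if it timed out.
    bool stop() {
        assert(_stopped.valid());
        return _stopped.get();
    }
};

// Runs one start/stop cycle against a replica that is either alive or dead.
bool run_and_stop(bool replica_alive, std::chrono::milliseconds timeout) {
    fake_replica replica;
    if (replica_alive) {
        replica.reply();
    }
    fake_background_query query(replica, timeout);
    query.start();
    return query.stop();
}
```

Calling `run_and_stop(false, std::chrono::milliseconds(50))` models the killed node: the query is never answered, yet `stop()` returns once the deadline expires instead of blocking shutdown.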
A long long time ago there was an issue about removing infinite timeouts from distributed queries: scylladb#3603. There was also a fix: 620e950. But apparently some queries escaped the fix, like the one in `default_role_row_satisfies`. With the right conditions and timing this query may cause a node to hang indefinitely on shutdown. A node tries to perform this query after it starts. If we kill another node which is required to serve this query right before that moment, the query will hang; when we try to shut down the querying node, it will wait for the query to finish (it's a background task in auth service), which it never does due to the infinite timeout. Use the same timeout configuration as other queries in this module do. Fixes scylladb#13545.
A long long time ago there was an issue about removing infinite timeouts from distributed queries: #3603. There was also a fix: 620e950. But apparently some queries escaped the fix, like the one in `default_role_row_satisfies`. With the right conditions and timing this query may cause a node to hang indefinitely on shutdown. A node tries to perform this query after it starts. If we kill another node which is required to serve this query right before that moment, the query will hang; when we try to shut down the querying node, it will wait for the query to finish (it's a background task in auth service), which it never does due to the infinite timeout. Use the same timeout configuration as other queries in this module do. Fixes #13545. Closes #14134 (cherry picked from commit f51312e)
Backported to 5.1, 5.2, 5.3.
Seen in https://jenkins.scylladb.com/job/scylla-master/job/build/1423/artifact/testlog/aarch64/debug/topology_raft_disabled.test_raft_upgrade.1.log
topology_raft_disabled.test_raft_upgrade.1.log:
https://jenkins.scylladb.com/job/scylla-master/job/build/1423/artifact/testlog/aarch64/debug/scylla-211.log shows it's hung on shutdown. The last printout is
Shutting down auth service
and then suppressed backtraces (reactor stalls?)
@xemul does that look familiar?
Forked off of #8079