Scylla crashes when multiple clients try to create the same schema #9774
Judging from a gdb session, this "def" definition that fires an assertion is bogus:

```
(gdb) p def
$15 = (column_definition &) @0x60000398eed0: {_name = {u = {external = {str = 0x69006300454e4f65 <error: Cannot access memory at address 0x69006300454e4f65>, size = 1694525294, pad = 121 'y'}, internal = {
      str = "eONE\000c\000ing\000ey\000\000\004\200\267\343\003\000`\000\000\342\260\375\003\000\000", size = 1 '\001'}}, static npos = <optimized out>}, _dropped_at = -9223372036854775808, _is_atomic = true,
  _is_counter = false, _is_view_virtual = column_view_virtual::no, _computation = std::unique_ptr<column_computation> = {get() = {<No data fields>}}, _thrift_bits = {is_on_all_components = 0 '\000'}, type = {
    _b = 0x6000001725f0, _p = 0x6000001725f0}, id = 1, ordinal_id = 0, kind = column_kind::partition_key, column_specification = {_p = 0x6000045ea0a0}}
```

Its name is a garbage string, so perhaps we somehow ended up with a use-after-free? |
@denesb what I've done is not nearly as precise as proper bisecting, but I followed git log and just checked out a Scylla version without any of your recent patches, and Scylla no longer crashes. Namely, I got rid of these:
Could you take a look, maybe something rings a bell? Perhaps some lw_shared_ptrs went out of scope too soon? |
It might not be hard to write a cql-pytest reproducing this bug; we already have a similar test. By the way, regardless of this Scylla bug, your driver tests probably shouldn't all try to create a table with the same name. They should either rely on the table already existing and use unique keys inside it, or create tables with unique names. This is what we did in the cql-pytest and alternator tests. |
It's quite hard - I did a few simple attempts, but none of them seem to crash Scylla, as opposed to the Rust driver tests run in parallel, which cause a crash more or less every time. I partially blame Python's "threads" and the GIL, which effectively limit the parallelism in interesting places, but I'm also not quite sure which particular combination causes this failure. It might be easier to start by investigating the root cause with gdb and bisecting, after which it should be easier to figure out a concise reproducer. |
Nothing rings a bell off the top of my head. Can you post the full backtrace, please? I think proper bisecting would be best if we have a reliable reproducer. |
I'll try bisecting to a specific commit later |
Hm, I have 1 suspect, let me try to check my hypothesis |
Unfortunately, this reproducer does not expose the bug every time, so I resorted to testing it in a tight loop. Now I'm reasonably sure that the code is broken even before the schema-related patches I listed above. I'll keep looking, but I might not succeed in bisecting by today, since every recompilation and retest takes some time, and I need to broaden the range. |
I'm leaving the investigation at 80fe158 - which is from September '21, one commit before the coroutinization of one of the functions that appears in the backtrace. Still, the problem reproduces, so I need to dig deeper tomorrow. |
I did one final check, but 3089558 still fails - and that's the commit right before the big coroutinization series of the schema tables. |
I started the bisect with coarser granularity and no recompilations, by checking Docker images. So far my conclusions are as follows: 2.3.1 seems not to be affected, but 4.3.6 already is, so the regression looks fairly old. |
I think bisecting will stop being useful at this point. Any patch it finds has likely accumulated so many changes on top that the bug is probably in some completely different place in master. |
@denesb it might be in a different place, but it definitely survived. What I hope for now is to bisect it down to a specific minor release, which could be helpful - e.g. if I find out that the change was introduced precisely in version x.y.z, we can take a look at which patches were applied between that one and x.y.(z-1). Right now I see that the failure didn't reproduce on 4.0.11 but did on 4.3, so the window got a little narrower. |
(unless I was unlucky and I simply didn't try enough times on 4.0.11, which is also possible) |
Sigh, of course. Empirical evidence shows that the regression appears in 4.3.0, while 4.2.4 proves fine. I was crossing my fingers for a regression in one of the micro releases, in order to browse a small set of backported changes, but still, it's something. Also, I noticed that one test case that notoriously fails (but only when running in parallel with other tests) is
|
I was curious whether the raft series fixes the issue, but it doesn't - the same assertion seems to be hit:
|
It does look related to UDTs, though - the test that previously failed also used a UDT, and here I see a |
I see the following problem - at some point, schema tables get corrupted and one of the tables claims that it has the following partition key parts:
Point (2) is bogus and most likely comes from some other table with the same name; first of all, the definition for this table's key is different. I'm not sure yet why exactly it happened (or why the regression started with the 4.3 release). |
On Wed, Dec 15, 2021 at 12:59:03AM -0800, Piotr Sarna wrote:
> I was curious whether the raft series fixes the issue, but it doesn't - the same assertion seems to be hit:

To get any benefit from it you need to enable the experimental raft feature. It will not yet linearize anything, but it should prevent simultaneous schema application.

--
Gleb.
|
Well, it looks like this race is entirely possible when 2 or more clients alter the schema concurrently. |
It sounds like a very, very bad bug, because a single client is able to crash Scylla by sending 2 legal statements in parallel.
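To make the race concrete, here is a minimal standalone C++ sketch (invented names, nothing Scylla-specific) of the check-then-act pattern under discussion: both writers validate that the entry is missing before either of them applies its change, so both insertions go through:

```cpp
#include <cstdio>
#include <map>
#include <mutex>
#include <string>
#include <thread>

// Toy "schema table": column name -> position within the partition key.
std::map<std::string, int> schema;
std::mutex merge_lock;

void create_pk_column(const std::string& name) {
    bool missing;
    {
        std::lock_guard<std::mutex> g(merge_lock);
        missing = schema.empty();      // validation: "no such column yet"...
    }                                  // ...but the lock is dropped here
    std::this_thread::yield();         // window in which the other writer runs
    std::lock_guard<std::mutex> g(merge_lock);
    if (missing) {
        schema.emplace(name, 0);       // act: both writers insert position 0
    }
}

int main() {
    std::thread a(create_pk_column, "pk_a");
    std::thread b(create_pk_column, "pk_b");
    a.join();
    b.join();
    // With an unlucky interleaving, two columns claim position 0 - the
    // kind of malformed entry that later fires the assertion.
    for (const auto& [name, pos] : schema) {
        std::printf("%s -> partition key position %d\n", name, pos);
    }
}
```
|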
Well, I can sadly confirm that the quick patch below prevents Scylla from crashing:

```diff
diff --git a/cql3/statements/schema_altering_statement.cc b/cql3/statements/schema_altering_statement.cc
index e15690cad..96ab04c2d 100644
--- a/cql3/statements/schema_altering_statement.cc
+++ b/cql3/statements/schema_altering_statement.cc
@@ -104,6 +104,7 @@ schema_altering_statement::execute0(query_processor& qp, service::query_state& s
     }

     co_await mm.schema_read_barrier();
+    auto guard = co_await get_units(mm.announce_sem(), 1);

     auto [ret, m] = co_await prepare_schema_mutations(qp);

diff --git a/service/migration_manager.hh b/service/migration_manager.hh
index 5d046c5ed..9b7453186 100644
--- a/service/migration_manager.hh
+++ b/service/migration_manager.hh
@@ -92,9 +92,12 @@ class migration_manager : public seastar::async_sharded_service<migration_manage
     service::raft_group_registry& _raft_gr;
     serialized_action _schema_push;
     utils::UUID _schema_version_to_publish;
+    named_semaphore _announce_sem{1, named_semaphore_exception_factory{"announce"}};

 public:
     migration_manager(migration_notifier&, gms::feature_service&, netw::messaging_service& ms, gms::gossiper& gossiper, service::raft_group_registry& raft_gr);

+    named_semaphore& announce_sem() { return _announce_sem; }
+
     migration_notifier& get_notifier() { return _notifier; }
     const migration_notifier& get_notifier() const { return _notifier; }
```
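For illustration, a minimal standalone sketch of the mechanism (plain Seastar, not Scylla code; `ddl_sem` plays the role of `announce_sem()` from the patch above) showing how holding units of a one-unit semaphore serializes concurrent coroutines:

```cpp
#include <chrono>
#include <cstdio>
#include <seastar/core/app-template.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/core/sleep.hh>

// One unit means at most one "DDL statement" inside the critical
// section at a time (single shard assumed, as in the patch).
seastar::semaphore ddl_sem{1};

seastar::future<> apply_ddl(int id) {
    // get_units() suspends until a unit is available; the unit is
    // returned automatically when `guard` goes out of scope.
    auto guard = co_await seastar::get_units(ddl_sem, 1);
    std::printf("statement %d entered the critical section\n", id);
    // The read-check-write against the schema tables would happen here,
    // with no other DDL statement interleaving.
    co_await seastar::sleep(std::chrono::milliseconds(10));
}

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] () -> seastar::future<> {
        // Start both concurrently; their bodies still never interleave.
        auto f1 = apply_ddl(1);
        auto f2 = apply_ddl(2);
        co_await std::move(f1);
        co_await std::move(f2);
    });
}
```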
Unfortunately, I bet that the same issue applies to creating keyspaces, types, updating them, and so on, pretty much everything that uses system_schema.tables. One fix is to add a bunch of semaphores for each of these operations, or even a single "announcement" semaphore, since DDL statements are considered rare anyway. Opinions? /cc @avikivity @tgrabiec |
On Wed, Dec 15, 2021 at 03:32:52AM -0800, Piotr Sarna wrote:
> Unfortunately, I bet that the same issue applies to creating keyspaces, types, updating them, and so on, pretty much everything that uses system_schema.tables.

You do not need the semaphore in announce_new_column_family(), since it is no longer used, and the semaphore in execute0() covers all schema-altering statements. To fully linearise schema application we will have to take "the_merge_lock" anyway. No need for a new lock.

> One fix is to add a bunch of semaphores for each of these operations, or even a single "announcement" semaphore, since DDL statements are considered rare anyway. Opinions?

How does just adding a semaphore here fix anything? Two nodes may make the change, and they will not share the lock.

--
Gleb.
|
I don't care about multi-node scenarios here, since I'm aware it's possible to end up with conflicting schemas on multiple nodes, and we're waiting for raft to fix it. This particular workaround prevents a crash of a single node after it tries to apply a mix of two or more schema updates into a single system_schema entry. |
But you're right that the guard for announce_new_column_family is superfluous; I'll simplify the patch in place. |
On Wed, Dec 15, 2021 at 03:47:42AM -0800, Piotr Sarna wrote:
> I don't care about multi-node scenarios here, since I'm aware it's possible to end up with conflicting schemas on multiple nodes, and we're waiting for raft to fix it. This particular workaround prevents a crash of a single node after it tries to apply a mix of two or more schema updates into a single system_schema entry.

I do not understand how the same may not happen if two nodes change the schema. In the end all mutations are distributed to all nodes and they all apply them.

--
Gleb.
|
Yes, but if node A pulls schema from node B and node B knows a single definition for a table named T, it's not possible for A to receive multiple mutations for a table named T but with different column specifications. If node A already knows a table named T, it will also have its version and column information stored, so it will be able to compare it to whatever node B sends. And if node A is not aware of any table named T, then there's no conflict in the first place. Btw - I'm not saying that a crash is impossible with multiple nodes, but I can definitely say that I never experienced such a crash during tests, while it's trivial to make one node crash with a few statements. That's why I think that a local semaphore is much better than nothing. |
On Wed, Dec 15, 2021 at 03:51:33AM -0800, Piotr Sarna wrote:
> But you're right that the guard for announce_new_column_family is superfluous; I'll simplify the patch in place.

What about using "the_merge_lock"? It exists only on shard zero, but looking at your patch, two statements running on different shards can still run in parallel...

--
Gleb.
|
Oh, I wasn't aware that we already have `the_merge_lock`; it sounds much better. I was testing on 1 shard and hence didn't experience any issues. We should indeed avoid races between shards too. |
On Wed, Dec 15, 2021 at 04:10:31AM -0800, Piotr Sarna wrote:
> Oh, I wasn't aware that we already have `the_merge_lock`; it sounds much better. I was testing on 1 shard and hence didn't experience any issues. We should indeed avoid races between shards too.

Then you need to enable bouncing of schema-altering statements even if raft is disabled. The code is just above the read barrier.

--
Gleb.
|
On Wed, Dec 15, 2021 at 04:07:05AM -0800, Piotr Sarna wrote:
> Btw - I'm not saying that a crash is impossible with multiple nodes, but I can definitely say that I never experienced such a crash during tests, while it's *trivial* to make one node crash with a few statements. That's why I think that a local semaphore is **much** better than nothing.

But two nodes may push (announce) two conflicting table versions to the same node simultaneously. I think the merge lock is supposed to prevent them from being applied in parallel, but then it should be the same for local modifications. Or do we not take the lock for local merges?

--
Gleb.
|
We take the lock for local merges, but by that time it's too late, because we already validated (a few times...) that such a table does not exist, so it's deemed safe to produce mutations for it. Then each set of mutations for a table with a single name is applied in a serialized manner, and it produces an incorrect entry in the system_schema tables - e.g. two partition key columns declared to be at position 0 at the same time. I assumed that schema is pulled only from one node at a time; is that not true? If it isn't, then maybe we can end up with a similar crash, just much less likely - since multiple nodes would have to send conflicting mutations at precisely the right time.
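As a purely hypothetical illustration (invented keyspace, table, and column names) of the corrupted entry described above, system_schema.columns would end up describing two partition-key columns that both claim position 0:

```
 keyspace_name | table_name | column_name | kind          | position
---------------+------------+-------------+---------------+----------
 ks            | t          | pk_a        | partition_key |        0
 ks            | t          | pk_b        | partition_key |        0
```
|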
Also - using

In any case, the original patch I posted would actually work fine if we also unconditionally turn on bouncing to shard 0 for schema-altering statements. In fact, the semaphore could even be local to schema_altering_statement, since it's the only user. |
Second prototype:

```diff
diff --git a/cql3/statements/schema_altering_statement.cc b/cql3/statements/schema_altering_statement.cc
index e15690cad..6c1419582 100644
--- a/cql3/statements/schema_altering_statement.cc
+++ b/cql3/statements/schema_altering_statement.cc
@@ -94,16 +94,19 @@ void schema_altering_statement::prepare_keyspace(const service::client_state& st

 future<::shared_ptr<messages::result_message>>
 schema_altering_statement::execute0(query_processor& qp, service::query_state& state, const query_options& options) const {
+    static named_semaphore schema_altering_sem{1, named_semaphore_exception_factory{"schema_altering"}};
+
     auto& mm = qp.get_migration_manager();
     ::shared_ptr<cql_transport::event::schema_change> ce;

-    if (mm.is_raft_enabled() && this_shard_id() != 0) {
+    if (this_shard_id() != 0) {
         // execute all schema altering statements on a shard zero since this is where raft group 0 is
         co_return ::make_shared<cql_transport::messages::result_message::bounce_to_shard>(0,
                 std::move(const_cast<cql3::query_options&>(options).take_cached_pk_function_calls()));
     }

     co_await mm.schema_read_barrier();
+    auto guard = co_await get_units(schema_altering_sem, 1);

     auto [ret, m] = co_await prepare_schema_mutations(qp);
```
|
On Wed, Dec 15, 2021 at 04:25:25AM -0800, Piotr Sarna wrote:
> We take the lock for local merges, but by that time it's too late, because we already validated (a few times...) that such a table does not exist, so it's deemed safe to produce mutations for it. Then each set of mutations for a table with a single name is applied in a serialized manner, and it produces an incorrect entry in the system_schema tables - e.g. two partition key columns declared to be at position 0 at the same time.

Then I completely see how it may happen through different coordinators.

> I assumed that schema is pulled only from one node at a time; is that not true? If it isn't, then maybe we can end up with a similar crash, just much less likely - since multiple nodes would have to send conflicting mutations at precisely the right time.

I am not talking about schema pull. When DDL is executed, the resulting mutations are _pushed_ to all nodes. So imagine two different coordinators both check that table T does not exist, each creates an (incompatible) mutation for it, and pushes it to all other nodes. A third node gets them and tries to apply both, one after another.

This is exactly the scenario we try to fix with raft.

--
Gleb.
|
Ok, I wasn't aware of that. In this case, we're probably screwed for multiple nodes too, so we should find a way not to crash when such a conflict is discovered. Perhaps solving this race at least for a single node still makes sense, but in any case we should be able to detect such malformed mutations and gracefully refuse to apply them instead of crashing.
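A minimal sketch (invented types, not Scylla's actual schema-merge code) of the "detect and refuse" idea: validate the merged partition-key columns before applying them, and throw instead of tripping an assertion later:

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

struct column_def {
    std::string name;
    int position; // position within the partition key
};

// Reject a mutation set that would leave two columns claiming the same
// partition-key position, instead of letting it corrupt the schema tables.
void validate_partition_key(const std::vector<column_def>& pk_columns) {
    std::map<int, std::string> seen;
    for (const auto& c : pk_columns) {
        auto [it, inserted] = seen.emplace(c.position, c.name);
        if (!inserted) {
            throw std::runtime_error(
                "conflicting schema mutations: columns '" + it->second +
                "' and '" + c.name + "' both claim partition-key position " +
                std::to_string(c.position));
        }
    }
}
```
|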
I think this is a good idea even with raft. We do not want a potential linearizability bug to become a crash. |
As pointed out by @kbr-, this is fixed in current master thanks to the raft group0 guard. Verified with |
(note: only with raft experimental enabled, for now!) |
Cool! By the way, it is also not reproducible on current scylla-rust-driver master (since 0.4.1) due to scylladb/scylla-rust-driver@df41ab0. Regardless, good job! |
I can repro the crash with scylladb/scylla-rust-driver@9a2b97b (0.4.0). @psarna, this issue should be re-opened, right? |
Sad, but it's good that we have a reproducer. Reopened |
Scylla does not crash with the Raft group0 guard. I was enabling experimental raft from the command line instead of the yaml config. I double-checked: the tests fail, but there is no abort, and a single-threaded run works fine afterwards. Thanks to @kbr- |
Why is starting with experimental raft from the command line wrong? |
The bad behavior was discovered when accidentally running tests from scylladb/scylla-rust-driver in parallel instead of sequentially (which breaks their invariants and is not expected to succeed, but that's beside the point here).
Unfortunately, when running these tests, Scylla very consistently crashes on an assertion:
Steps to reproduce:

```sh
cd <repo>
cargo test session_test
```
The same behavior is not reproducible when running the tests sequentially:
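Presumably via libtest's standard flag for limiting parallelism (assumed invocation; the original command is elided):

```sh
# run the tests on a single thread, i.e. sequentially (assumed)
cargo test session_test -- --test-threads=1
```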
I'm aware that concurrent schema changes are not considered good practice, but we definitely don't want to react by crashing the server.