
Scylla crashes when multiple clients try to create the same schema #9774

Closed
psarna opened this issue Dec 9, 2021 · 43 comments

@psarna
Contributor

psarna commented Dec 9, 2021

The bad behavior was discovered when accidentally running tests from scylladb/scylla-rust-driver in parallel instead of sequentially (which breaks their invariants and is not expected to succeed, but that's beside the point here).

Unfortunately, when these tests are run in parallel, Scylla consistently crashes on the same assertion:

scylla: schema.cc:375: schema::schema(schema::private_tag, const schema::raw_schema &, std::optional<raw_view_info>): Assertion `!def.id || def.id == id - column_offset(def.kind)' failed.

Steps to reproduce:

  1. Clone https://github.com/scylladb/scylla-rust-driver
  2. cd <repo>
  3. Run a local Scylla instance (one shard is enough; started with ccm or manually, it doesn't matter)
  4. Run cargo test session_test
  5. Observe the local Scylla crash

The same behavior is not reproducible when running the tests sequentially:

cargo test session_test -- --test-threads 1

I'm aware that concurrent schema changes are not considered good practice, but we definitely don't want to react by crashing the server.

@psarna
Contributor Author

psarna commented Dec 9, 2021

Judging from a gdb session, the "def" column definition that trips the assertion is bogus:

(gdb) p def
$15 = (column_definition &) @0x60000398eed0: {_name = {u = {external = {str = 0x69006300454e4f65 <error: Cannot access memory at address 0x69006300454e4f65>, size = 1694525294, pad = 121 'y'}, internal = {
        str = "eONE\000c\000ing\000ey\000\000\004\200\267\343\003\000`\000\000\342\260\375\003\000\000", size = 1 '\001'}}, static npos = <optimized out>}, _dropped_at = -9223372036854775808, _is_atomic = true, 
  _is_counter = false, _is_view_virtual = column_view_virtual::no, _computation = std::unique_ptr<column_computation> = {get() = {<No data fields>}}, _thrift_bits = {is_on_all_components = 0 '\000'}, type = {
    _b = 0x6000001725f0, _p = 0x6000001725f0}, id = 1, ordinal_id = 0, kind = column_kind::partition_key, column_specification = {_p = 0x6000045ea0a0}}

Its name is a garbage string, so perhaps we somehow ended up with a use-after-free?

@psarna
Contributor Author

psarna commented Dec 9, 2021

@denesb what I've done is not nearly as precise as proper bisecting, but I followed git log and just checked out a Scylla version without any of your recent patches, and Scylla no longer crashes. Namely, I got rid of these:

247f2b69d Merge "system tables: create the schema more efficiently" from Botond
e99160491 schema: make private constructor invokable via make_lw_shared
4dea339e0 schema_builder: add a constructor providing make_shared_schema semantics
476f49c69 schema_builder: without_column(): don't assume column_specification exists
d3833c597 schema: add static variant of column_name_type()

Could you take a look, maybe something rings a bell? Perhaps some lw_shared_ptrs went out of scope too soon?

@nyh
Contributor

nyh commented Dec 9, 2021

It might not be hard to write a cql-pytest test reproducing this bug. We already have a test, test_keyspace.py::test_concurrent_create_and_drop_keyspace, which concurrently creates and drops a keyspace from two threads (and it reproduces a bug in that case, although not a crash).

By the way, regardless of this Scylla bug, your driver tests probably shouldn't all try to create a table with the same name. They should either rely on an existing table and use unique keys inside it, or create tables with unique names. This is what we did in the cql-pytest and Alternator tests.
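
For illustration only, here is a minimal sketch of the kind of per-test unique naming described above (plain C++; the helper name and naming scheme are invented here and are not taken from cql-pytest, the Alternator tests, or the Rust driver):

#include <atomic>
#include <chrono>
#include <string>

// Hypothetical helper: every test derives its own table name from a run
// timestamp plus a per-process counter, so tests running in parallel never
// issue conflicting CREATE TABLE statements for the same name.
inline std::string unique_table_name(const std::string& prefix) {
    static const auto run_id =
        std::chrono::system_clock::now().time_since_epoch().count();
    static std::atomic<unsigned> counter{0};
    return prefix + "_" + std::to_string(run_id) + "_" + std::to_string(counter.fetch_add(1));
}

With something like this, parallel tests never race on the same CREATE TABLE, regardless of how the server handles concurrent schema changes.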

@psarna
Contributor Author

psarna commented Dec 13, 2021

It's quite hard - I made a few simple attempts, but none of them seems to crash Scylla, as opposed to the Rust driver tests run in parallel, which cause a crash more or less every time. I partially blame Python's "threads" and the GIL, which effectively limit parallelism in the interesting places, but I'm also not quite sure which particular combination causes the failure. It might be easier to start by investigating the root cause with gdb and bisecting, and then it should be easier to come up with a concise reproducer.

@denesb
Contributor

denesb commented Dec 13, 2021

> @denesb what I've done is not nearly as precise as proper bisecting, but I followed git log and just checked out a Scylla version without any of your recent patches, and Scylla no longer crashes. Namely, I got rid of these:
>
> 247f2b69d Merge "system tables: create the schema more efficiently" from Botond
> e99160491 schema: make private constructor invokable via make_lw_shared
> 4dea339e0 schema_builder: add a constructor providing make_shared_schema semantics
> 476f49c69 schema_builder: without_column(): don't assume column_specification exists
> d3833c597 schema: add static variant of column_name_type()
>
> Could you take a look, maybe something rings a bell? Perhaps some lw_shared_ptrs went out of scope too soon?

Nothing rings a bell off the top of my head. Can you post the full backtrace please?

I think a proper bisecting would be best if we have a reliable reproducer.

@psarna
Contributor Author

psarna commented Dec 13, 2021

@denesb

#0  0x00007fedf975e2a2 in raise () from /lib64/libc.so.6
#1  0x00007fedf97478a4 in abort () from /lib64/libc.so.6
#2  0x00007fedf9747789 in __assert_fail_base.cold () from /lib64/libc.so.6
#3  0x00007fedf9756a16 in __assert_fail () from /lib64/libc.so.6
#4  0x00000000012f2b08 in schema::schema (this=this@entry=0x600003dfd880, raw=..., raw_view_info=std::optional<raw_view_info> [no contained value]) at schema.cc:375
#5  0x00000000012f7c6e in seastar::lw_shared_ptr<schema>::make<schema::private_tag, schema::raw_schema&, std::optional<raw_view_info>&> (a=..., a=..., a=...)
    at ./seastar/include/seastar/core/shared_ptr.hh:267
#6  seastar::make_lw_shared<schema, schema::private_tag, schema::raw_schema&, std::optional<raw_view_info>&> (a=..., a=..., a=...)
    at ./seastar/include/seastar/core/shared_ptr.hh:417
#7  schema_builder::build (this=0x7ffd8ff2b0b0) at schema.cc:1257
#8  0x00000000026250d9 in db::schema_tables::create_table_from_mutations (ctxt=..., sm=..., version=std::optional<utils::UUID> [no contained value])
    at db/schema_tables.cc:2653
#9  0x000000000267e2a0 in db::schema_tables::merge_tables_and_views(seastar::sharded<service::storage_proxy>&, std::map<utils::UUID, schema_mutations, std::less<utils::UUID>, std::allocator<std::pair<utils::UUID const, schema_mutations> > >&&, std::map<utils::UUID, schema_mutations, std::less<utils::UUID>, std::allocator<std::pair<utils::UUID const, schema_mutations> > >&&, std::map<utils::UUID, schema_mutations, std::less<utils::UUID>, std::allocator<std::pair<utils::UUID const, schema_mutations> > >&&, std::map<utils::UUID, schema_mutations, std::less<utils::UUID>, std::allocator<std::pair<utils::UUID const, schema_mutations> > >&&)::$_47::operator()(schema_mutations, db::schema_tables::schema_diff_side) const (this=<optimized out>, sm=...) at db/schema_tables.cc:1223
#10 seastar::noncopyable_function<seastar::lw_shared_ptr<schema const> (schema_mutations, db::schema_tables::schema_diff_side)>::direct_vtable_for<db::schema_tables::merge_tables_and_views(seastar::sharded<service::storage_proxy>&, std::map<utils::UUID, schema_mutations, std::less<utils::UUID>, std::allocator<std::pair<utils::UUID const, schema_mutations> > >&&, std::map<utils::UUID, schema_mutations, std::less<utils::UUID>, std::allocator<std::pair<utils::UUID const, schema_mutations> > >&&, std::map<utils::UUID, schema_mutations, std::less<utils::UUID>, std::allocator<std::pair<utils::UUID const, schema_mutations> > >&&, std::map<utils::UUID, schema_mutations, std::less<utils::UUID>, std::allocator<std::pair<utils::UUID const, schema_mutations> > >&&)::$_47>::call(seastar::noncopyable_function<seastar::lw_shared_ptr<schema const> (schema_mutations, db::schema_tables::schema_diff_side)> const*, schema_mutations, db::schema_tables::schema_diff_side) (func=<optimized out>, 
    args=<optimized out>, args=<optimized out>) at ./seastar/include/seastar/util/noncopyable_function.hh:124
#11 0x000000000267d781 in seastar::noncopyable_function<seastar::lw_shared_ptr<schema const> (schema_mutations, db::schema_tables::schema_diff_side)>::operator()(schema_mutations, db::schema_tables::schema_diff_side) const (this=0x7ffd8ff2a1d0, this@entry=0x6000004507d0, args=(unknown: 0xf975e2a2), 
    args@entry=db::schema_tables::schema_diff_side::right, args=(unknown: 0xf975e2a2), args@entry=db::schema_tables::schema_diff_side::right) at ./seastar/include/seastar/util/noncopyable_function.hh:209
#12 0x0000000002679df5 in db::schema_tables::diff_table_or_view(seastar::sharded<service::storage_proxy>&, std::map<utils::UUID, schema_mutations, std::less<utils::UUID>, std::allocator<std::pair<utils::UUID const, schema_mutations> > >&&, std::map<utils::UUID, schema_mutations, std::less<utils::UUID>, std::allocator<std::pair<utils::UUID const, schema_mutations> > >&&, seastar::noncopyable_function<seastar::lw_shared_ptr<schema const> (schema_mutations, db::schema_tables::schema_diff_side)>) (proxy=..., before=..., after=..., create_schema=...) at db/schema_tables.cc:1198
#13 0x0000000002667bd4 in db::schema_tables::merge_tables_and_views (proxy=..., tables_before=..., tables_after=..., views_before=..., views_after=...) at db/schema_tables.cc:1222
#14 0x00000000026bfb6d in db::schema_tables::do_merge_schema (proxy=..., mutations=..., do_flush=<optimized out>) at db/schema_tables.cc:1092
#15 0x0000000004021d15 in seastar::reactor::run_tasks (this=0x60000021e000, tq=...) at ./seastar/src/core/reactor.cc:2354
#16 seastar::reactor::run_some_tasks (this=this@entry=0x60000021e000) at ./seastar/src/core/reactor.cc:2763
#17 0x0000000004023108 in seastar::reactor::do_run (this=0x60000021e000) at ./seastar/src/core/reactor.cc:2932
#18 0x000000000402234d in seastar::reactor::run (this=0x2) at ./seastar/src/core/reactor.cc:2815
#19 0x0000000003fc1148 in seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) (this=this@entry=0x7ffd8ff2e898, ac=<optimized out>, ac@entry=28, av=<optimized out>, av@entry=0x7ffd8ff30688, func=...) at ./seastar/src/core/app-template.cc:265
#20 0x0000000003fc05e7 in seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) (this=0x7ffd8ff2e898, ac=28, av=0x7ffd8ff30688, func=...) at ./seastar/src/core/app-template.cc:156
#21 0x0000000000f993af in main (ac=28, av=0x7ffd8ff30688) at main.cc:495

I'll try bisecting to a specific commit later

@psarna
Contributor Author

psarna commented Dec 13, 2021

Hm, I have one suspect, let me try to check my hypothesis.

@psarna
Contributor Author

psarna commented Dec 13, 2021

Unfortunately, this reproducer does not expose the bug every time, so I resorted to testing it in a tight loop. Now I'm reasonably sure that the code was broken even before the schema-related patches I listed above. I'll keep looking, but I might not manage to finish the bisect today, since every recompilation and retest takes some time, and I need to broaden the range.

@psarna
Contributor Author

psarna commented Dec 13, 2021

I'm leaving the investigation at 80fe158 - which is from September '21, before the coroutinization of one of the functions that appears in the backtrace. Still, the problem reproduces, so I need to dig deeper tomorrow.

@psarna
Contributor Author

psarna commented Dec 13, 2021

I did one final check, but 3089558 still fails - and that's the commit right before the big coroutinization series for the schema tables.

@psarna
Contributor Author

psarna commented Dec 14, 2021

I restarted the bisect at release granularity, with no recompilations, by checking Docker images. So far my conclusions are as follows: 2.3.1 seems not to be affected, but 4.3.6 already is, so the regression looks fairly old.

@denesb
Contributor

denesb commented Dec 14, 2021

I think bisecting stops being useful at this point. Whatever patch it finds has likely accumulated so many changes on top that the bug is probably in a completely different place in master by now.

@psarna
Contributor Author

psarna commented Dec 14, 2021

@denesb it might be in a different place, but it definitely survived. What I hope for now is that I'll bisect it down to a specific minor release, which could be helpful - e.g. if I find out that the change was introduced precisely in version x.y.z, we can take a look at which patches were applied between that one and x.y.(z-1). Right now I see that the failure didn't reproduce on 4.0.11, but did on 4.3, so the window got a little narrower.

@psarna
Contributor Author

psarna commented Dec 14, 2021

(unless I was unlucky and I simply didn't try enough times on 4.0.11, which is also possible)

@psarna
Contributor Author

psarna commented Dec 14, 2021

Sigh, of course. Empirical evidence shows that the regression appears in 4.3.0, while 4.2.4 proves fine. I was crossing my fingers for a regression in one of the micro releases, so that I could browse a small set of backported changes, but still, it's something.

Also, I noticed that one test case that notoriously fails (but only when running in parallel with other tests) is

transport::session_test::test_turning_off_schema_fetching ... FAILED

I don't know why (or if) it's relevant, but in particular it operates on custom types. Here's the code for reference:
edit: scratch that, I removed this test case and Scylla can still crash when running multiple tests in parallel, so it was just a fluke

@psarna
Contributor Author

psarna commented Dec 15, 2021

I was curious whether the raft series fixes the issue, but it doesn't - the same assertion seems to be hit:

INFO  2021-12-15 09:57:22,010 [shard 0] schema_tables - Creating test_metadata_ks.t id=03e0cb20-5d85-11ec-a5a0-4f4616a25f93 version=8667b9be-1560-3dd8-8897-2a434e59e93b
INFO  2021-12-15 09:57:22,010 [shard 0] schema_tables - Creating test_metadata_ks.t_scylla_cdc_log id=03e0cb22-5d85-11ec-a5a0-4f4616a25f93 version=a4060e4e-03c3-34ef-a13d-42109e3afe61
INFO  2021-12-15 09:57:22,015 [shard 0] schema_tables - Schema version changed to 11552205-905f-3307-8c5a-37c5a3755994
INFO  2021-12-15 09:57:22,027 [shard 0] schema_tables - Schema version changed to fa305fc9-2071-3e81-a629-d09f3209229a
INFO  2021-12-15 09:57:22,028 [shard 0] migration_manager - Prepare Create new User Type: type_b
INFO  2021-12-15 09:57:22,038 [shard 0] schema_tables - Dropping test_metadata_ks.t id=03e0cb20-5d85-11ec-a5a0-4f4616a25f93 version=8667b9be-1560-3dd8-8897-2a434e59e93b
scylla: schema.cc:375: schema::schema(schema::private_tag, const schema::raw_schema &, std::optional<raw_view_info>): Assertion `!def.id || def.id == id - column_offset(def.kind)' failed.

@psarna
Contributor Author

psarna commented Dec 15, 2021

It does look related to UDTs though - the test that previously failed also used a UDT, and here I see "migration_manager - Prepare Create new User Type: type_b" right before the crash.

@psarna
Contributor Author

psarna commented Dec 15, 2021

I see the following problem - at some point, schema tables get corrupted and one of the tables claims that it has the following partition key parts:

  1. c at position 0
  2. pk at position 0
  3. e at position 1

Point 2 is bogus and most likely comes from some other table with the same name: first of all, the definition of this table's key is ((c,e), b, a), and secondly, c and pk obviously cannot both have position 0.

I'm not sure yet why exactly it happened (or why the regression started with the 4.3 release).

@gleb-cloudius
Contributor

gleb-cloudius commented Dec 15, 2021 via email

@psarna
Contributor Author

psarna commented Dec 15, 2021

Well, it looks like this race is entirely possible when 2 or more CREATE TABLE statements are sent with conflicting schemas:

  1. Request A validates that table t does not exist yet
  2. Request B validates that table t does not exist yet
  3. A creates mutations for table t
  4. B creates mutations for table t
  5. Mutations from A & B are both applied, which results in a bogus table with conflicting system_schema information

It sounds like a very, very bad bug, because a single client is able to crash Scylla by sending 2 legal statements in parallel.
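
To make the window concrete, here is a small, self-contained model of that interleaving (plain C++; the structures and names are illustrative only and are not Scylla's actual code): both requests pass the existence check before either one applies its mutations, and the merged entry ends up self-contradictory.

#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of the check-then-act race. "system_schema" stands in
// for the schema tables; a schema_row records a column's claimed position
// inside the partition key.
struct schema_row {
    std::string column;
    int position;
};

int main() {
    std::map<std::string, std::vector<schema_row>> system_schema;

    auto table_exists = [&](const std::string& t) { return system_schema.count(t) != 0; };

    // Steps 1 and 2: both requests validate before either one applies anything.
    bool a_may_create = !table_exists("t");   // request A: CREATE TABLE t (pk ...)
    bool b_may_create = !table_exists("t");   // request B: CREATE TABLE t ((c, e), ...)
    assert(a_may_create && b_may_create);

    // Steps 3 and 4: each request builds mutations for its own notion of "t".
    std::vector<schema_row> from_a = {{"pk", 0}};
    std::vector<schema_row> from_b = {{"c", 0}, {"e", 1}};

    // Step 5: both sets of mutations are applied. The merged entry now has two
    // partition-key columns claiming position 0 -- the corrupted state the
    // schema.cc assertion later trips over when rebuilding the schema.
    for (const auto& r : from_a) system_schema["t"].push_back(r);
    for (const auto& r : from_b) system_schema["t"].push_back(r);

    int claiming_position_zero = 0;
    for (const auto& r : system_schema["t"]) {
        if (r.position == 0) {
            ++claiming_position_zero;
        }
    }
    assert(claiming_position_zero == 2);   // bogus: only one column can be at position 0
    return 0;
}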

@psarna
Contributor Author

psarna commented Dec 15, 2021

Well, I can sadly confirm it - the quick patch below prevents Scylla from crashing:

diff --git a/cql3/statements/schema_altering_statement.cc b/cql3/statements/schema_altering_statement.cc
index e15690cad..96ab04c2d 100644
--- a/cql3/statements/schema_altering_statement.cc
+++ b/cql3/statements/schema_altering_statement.cc
@@ -104,6 +104,7 @@ schema_altering_statement::execute0(query_processor& qp, service::query_state& s
     }
 
     co_await mm.schema_read_barrier();
+    auto guard = co_await get_units(mm.announce_sem(), 1);
 
     auto [ret, m] = co_await prepare_schema_mutations(qp);
 
diff --git a/service/migration_manager.hh b/service/migration_manager.hh
index 5d046c5ed..9b7453186 100644
--- a/service/migration_manager.hh
+++ b/service/migration_manager.hh
@@ -92,9 +92,12 @@ class migration_manager : public seastar::async_sharded_service<migration_manage
     service::raft_group_registry& _raft_gr;
     serialized_action _schema_push;
     utils::UUID _schema_version_to_publish;
+    named_semaphore _announce_sem{1, named_semaphore_exception_factory{"announce"}};
 public:
     migration_manager(migration_notifier&, gms::feature_service&, netw::messaging_service& ms, gms::gossiper& gossiper, service::raft_group_registry& raft_gr);
 
+    named_semaphore& announce_sem() { return _announce_sem; }
+
     migration_notifier& get_notifier() { return _notifier; }
     const migration_notifier& get_notifier() const { return _notifier; }
 

Unfortunately, I bet that the same issue applies to creating keyspaces, types, updating them, and so on, pretty much everything that uses system_schema.tables.

One fix is to add a bunch of semaphores for each of these operations, or even a single "announcement" semaphore, since DDL statements are considered rare anyway. Opinions?

/cc @avikivity @tgrabiec
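
For what it's worth, here is a minimal Seastar-based sketch of why a single count-1 semaphore closes the window on one shard (this assumes Seastar's semaphore/get_units API; the table set, the sleep standing in for the preemption point, and all names are made up - it is not the actual patch):

#include <chrono>
#include <iostream>
#include <set>
#include <string>
#include <seastar/core/app-template.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/core/sleep.hh>

// Hypothetical model: the semaphore serializes the whole validate-then-announce
// sequence, so the second fiber cannot even start its existence check until the
// first one has finished applying its "mutations".
seastar::semaphore ddl_sem{1};
std::set<std::string> tables;

seastar::future<> create_table(std::string name) {
    auto units = co_await seastar::get_units(ddl_sem, 1);
    bool exists = tables.count(name) != 0;
    // Stand-in for the preemption point between validation and announcement
    // (building and merging schema mutations in the real code path).
    co_await seastar::sleep(std::chrono::milliseconds(10));
    if (exists) {
        std::cout << name << " already exists, refusing to create it again\n";
    } else {
        tables.insert(name);
        std::cout << "created " << name << "\n";
    }
}

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] () -> seastar::future<> {
        // Two concurrent "CREATE TABLE t" requests on the same shard.
        auto a = create_table("t");
        auto b = create_table("t");
        co_await std::move(a);
        co_await std::move(b);
    });
}

The second fiber blocks on get_units() until the first one releases its units, so it observes the table created by the first request and can refuse cleanly instead of producing conflicting mutations.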

@psarna psarna added the bug label Dec 15, 2021
@gleb-cloudius
Contributor

gleb-cloudius commented Dec 15, 2021 via email

@psarna
Contributor Author

psarna commented Dec 15, 2021

> On Wed, Dec 15, 2021 at 03:32:52AM -0800, Piotr Sarna wrote:
> > Unfortunately, I bet that the same issue applies to creating keyspaces, types, updating them, and so on, pretty much everything that uses system_schema.tables.
> You do not need the semaphore in announce_new_column_family() since it is no longer used and the semaphore in execute0() covers all schema altering statements. To fully linearise schema application we will have to take "the_merge_lock" anyway. No need for a new lock.
> > One fix is to add a bunch of semaphores for each of these operations, or even a single "announcement" semaphore, since DDL statements are considered rare anyway. Opinions?
> How does just adding a semaphore here fix anything? Two nodes may do the change and they will not share the lock.

I don't care about multi-node scenarios here, since I'm aware it's possible to end up with conflicting schemas on multiple nodes, and we're waiting for raft to fix that. This particular workaround prevents a crash of a single node after it tries to apply a mix of two or more schema updates into a single system_schema entry.
It's acceptable for two nodes to have conflicting schemas for some tables, but I find it definitely unacceptable for a single node to crash, because it mixed up multiple schema updates into a single, incorrect entry in its own local tables.

@psarna
Contributor Author

psarna commented Dec 15, 2021

But you're right that the guard for announce_new_column_family is superfluous, I'll simplify the patch in-place

@gleb-cloudius
Contributor

gleb-cloudius commented Dec 15, 2021 via email

@psarna
Contributor Author

psarna commented Dec 15, 2021

> I do not understand how the same may not happen if two nodes change the schema. In the end all mutations are distributed to all nodes and they all apply them.

Yes, but if node A pulls schema from node B and node B knows a single definition for a table named T, it's not possible for A to receive multiple mutations for a table named T with different column specifications. If node A already knows a table named T, it will also have its version and column information stored, so it will be able to compare it to whatever node B sends. And if node A is not aware of any table named T, then there's no conflict in the first place. Btw - I'm not saying that a crash is impossible with multiple nodes, but I can definitely say that I never experienced such a crash during tests, while it's trivial to make one node crash with a few statements. That's why I think that a local semaphore is much better than nothing.

@gleb-cloudius
Contributor

gleb-cloudius commented Dec 15, 2021 via email

@psarna
Contributor Author

psarna commented Dec 15, 2021

> On Wed, Dec 15, 2021 at 03:51:33AM -0800, Piotr Sarna wrote:
> > But you're right that the guard for announce_new_column_family is superfluous, I'll simplify the patch in-place
> What about using "the_merge_lock"? It exists only on shard zero, but looking at your patch two statements running on different shards can still run in parallel...
>
> -- Gleb.

Oh, I wasn't aware that we already have the_merge_lock, it sounds much better. I was testing on 1 shard and hence didn't experience any issues. We should indeed avoid races between shards too.

@gleb-cloudius
Contributor

gleb-cloudius commented Dec 15, 2021 via email

@gleb-cloudius
Contributor

gleb-cloudius commented Dec 15, 2021 via email

@psarna
Contributor Author

psarna commented Dec 15, 2021

> But two nodes may push (announce) two conflicting table versions to the same node simultaneously. I think the merge lock is supposed to prevent them from being applied in parallel, but then it should be the same for local modifications. Or do we not take the lock for local merges?

We take the lock for local merges, but by that time it's too late, because we have already validated (a few times...) that such a table does not exist and concluded that it's safe to produce mutations for it. Then each set of mutations for the same table name is applied in a serialized manner, and together they produce an incorrect entry in the system_schema tables - e.g. two partition key columns both declared to be at position 0.

I assumed that schema is pulled from only one node at a time - is that not true? If it isn't, then maybe we can end up with a similar crash, just much less likely, since multiple nodes would have to send conflicting mutations at precisely the right time.

@psarna
Contributor Author

psarna commented Dec 15, 2021

Also - using the_merge_lock complicates the patch quite a lot, since the same lock is also taken in merge_schema, and trying to lock it twice in a single path creates a deadlock. At the same time, multiple paths use merge_schema, and some of them should assume that the lock is already taken while others should not.

In any case, the original patch I posted would actually work fine if we also unconditionally turn on bouncing to shard 0 for schema altering statements. In fact, the semaphore could even be local to this schema_altering_statement, since it's the only user.

@psarna
Contributor Author

psarna commented Dec 15, 2021

Second prototype:

diff --git a/cql3/statements/schema_altering_statement.cc b/cql3/statements/schema_altering_statement.cc
index e15690cad..6c1419582 100644
--- a/cql3/statements/schema_altering_statement.cc
+++ b/cql3/statements/schema_altering_statement.cc
@@ -94,16 +94,19 @@ void schema_altering_statement::prepare_keyspace(const service::client_state& st
 
 future<::shared_ptr<messages::result_message>>
 schema_altering_statement::execute0(query_processor& qp, service::query_state& state, const query_options& options) const {
+    static named_semaphore schema_altering_sem{1, named_semaphore_exception_factory{"schema_altering"}};
+
     auto& mm = qp.get_migration_manager();
     ::shared_ptr<cql_transport::event::schema_change> ce;
 
-    if (mm.is_raft_enabled() && this_shard_id() != 0) {
+    if (this_shard_id() != 0) {
         // execute all schema altering statements on a shard zero since this is where raft group 0 is
         co_return ::make_shared<cql_transport::messages::result_message::bounce_to_shard>(0,
                     std::move(const_cast<cql3::query_options&>(options).take_cached_pk_function_calls()));
     }
 
     co_await mm.schema_read_barrier();
+    auto guard = co_await get_units(schema_altering_sem, 1);
 
     auto [ret, m] = co_await prepare_schema_mutations(qp);

@gleb-cloudius
Contributor

gleb-cloudius commented Dec 15, 2021 via email

@psarna
Contributor Author

psarna commented Dec 15, 2021

> I am not talking about schema pull. When DDL is executed, the resulting mutations are pushed to all nodes. So imagine two different coordinators check that table T does not exist, create (incompatible) mutations for it, and push them to all other nodes. A third node gets them and tries to apply both of them, one after another. This is exactly the scenario we try to fix with raft.

Ok, I wasn't aware of that. In this case, we're probably screwed for multiple nodes too.

Then, we should try to find out a way to not crash if such a conflict is discovered. Perhaps solving this race at least for a single node still makes sense, but in any case we should at least be able to detect such malformed mutations and gracefully refuse to apply them, instead of crashing.
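
As a purely illustrative sketch of what refusing gracefully could look like (the types, names, and error path here are invented, not taken from Scylla): validate the merged rows and fail the offending merge with an exception instead of assert()-ing.

#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical validation: before building a schema from merged system_schema
// rows, check that key positions within each kind form a consistent sequence,
// and throw instead of asserting, so a malformed merge is rejected without
// taking the node down.
struct column_row {
    std::string name;
    int kind;       // e.g. 0 = partition key, 1 = clustering key
    int position;   // claimed position within its kind
};

void validate_key_positions(const std::vector<column_row>& rows, int kind) {
    std::vector<bool> seen;
    for (const auto& r : rows) {
        if (r.kind != kind) {
            continue;
        }
        if (r.position < 0) {
            throw std::runtime_error("negative key position for column " + r.name);
        }
        if (static_cast<size_t>(r.position) >= seen.size()) {
            seen.resize(r.position + 1, false);
        }
        if (seen[r.position]) {
            // Two columns claim the same position -- the corruption observed
            // in this issue; refuse the merge instead of crashing later.
            throw std::runtime_error("conflicting key position " + std::to_string(r.position) +
                                     " claimed by column " + r.name);
        }
        seen[r.position] = true;
    }
}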

@gleb-cloudius
Contributor

> Then, we should try to find out a way to not crash if such a conflict is discovered.

I think this is a good idea even with raft. We do not want a potential linearizability bug to become a crash.

@alecco
Contributor

alecco commented Mar 23, 2022

As pointed out by @kbr-, this is fixed in current master thanks to the raft group0 guard. Verified with smp=4 and smp=1.

@alecco alecco closed this as completed Mar 23, 2022
@kbr-
Contributor

kbr- commented Mar 23, 2022

(note: only with raft experimental enabled, for now!)

@psarna
Contributor Author

psarna commented Mar 23, 2022

Cool! By the way, it will also not be reproducible on current scylla-rust-driver master since 0.4.1, due to scylladb/scylla-rust-driver@df41ab0. Regardless, good job!

@alecco
Contributor

alecco commented Mar 23, 2022

> By the way, it will also not be reproducible on current scylla-rust-driver master since 0.4.1, due to scylladb/scylla-rust-driver@df41ab0

I can repro the crash with scylladb/scylla-rust-driver@9a2b97b (0.4.0) and --experimental-features raft for latest master.

@psarna this issue should be re-opened, right?

@psarna psarna reopened this Mar 23, 2022
@psarna
Contributor Author

psarna commented Mar 23, 2022

Sad, but it's good that we have a reproducer. Reopened

@alecco
Contributor

alecco commented Mar 23, 2022

Scylla does not crash with Raft group0 guard.

I was starting experimental raft from the command line instead of the yaml. I double-checked, and the tests fail but there is no abort. And a single-threaded run works fine afterwards.

Thanks to @kbr-

@alecco alecco closed this as completed Mar 23, 2022
@nyh
Contributor

nyh commented Mar 23, 2022

> Scylla does not crash with Raft group0 guard.
>
> I was starting experimental raft from the command line instead of the yaml.

Why is starting with experimental raft from the command line wrong?
This is what I use for cql-pytest and Alternator tests, and it seems to be working well.
