Store raft mutations in schema commit log. #12642
I can make something like this here,
but it looks weird: something dedicated to schema is being abused for raft. BTW, |
@tgrabiec could you please give your opinion? |
I don't understand this. A write that has ended up in commitlog can get applied before or after a restart. @tgrabiec @gleb-cloudius could you please provide an example scenario which shows why we need to synchronize topology before replaying commitlog? |
It is. Otherwise it will be fenced.
Which is the latest one unless fenced.
To use the latest topology. Because there is no fencing when we apply mutations from the commitlog. So with tablets we will flush only some memtables and not all of them (that's according to the tablets spec). It means that after we flushed a memtable and did a cleanup, a write may still sit in an active commitlog segment and be replayed after restart. Another reason is the object store. If/when group0 holds info about data placement for each node, you need to know it before you start replay. |
Let's assume a write which ended up in the commitlog, meaning that it didn't get fenced the moment it was done.
Since the write was saved in the commitlog, it didn't get fenced when it arrived at the node, hence the topology stored by the node is correct for this particular write. If applying it was safe at the moment the write arrived, then it is also safe to apply it after the restart using that same topology. The restart scenario should be no different from a long-pause scenario, i.e., restart should restore the node to the same state it was in before the crash/shutdown. Since the write ended up in the memtable before the crash, we apply it again from the commitlog and it again ends up in the memtable. We arrive at the exact same state we were in before the node crashed/stopped. So the restart, for all intents and purposes, acts as if the node simply hanged for this duration. Without the restart the write would not have been rejected, so there is no reason to reject it with the restart. |
No, because it may have been already removed from this node by a topology change and a following clean up.
And again no. The restart scenario is very different from a long pause, since no replay of the commit log is going on during the latter, so they are simply two incomparable events. |
So the assumption is that the node did learn of the topology change that invalidated this write before it crashed/stopped? And it removed it from the memtable, but not from the commitlog? If so then it's a different problem. The node was supposed to invalidate/remove an entry but it didn't really. It left garbage in the commitlog. It looks like the problem here is not related to topology changes or fencing. The problem is basically that we resurrect data which we were supposed to have "cleaned up". The clean up procedure should make sure that entries in the commitlog which are no longer valid should not be applied. So the clean up procedure should persist information about invalidated entries, so when we restart we consult this information and skip whatever entries were invalidated. Or replay them but then immediately remove because there are tombstones overriding them.
The purpose of commitlog replay is to bring the node to the state it was in before it crashed, i.e., to basically make it indistinguishable from a long pause, as I understand it. |
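The bookkeeping proposed here could be sketched roughly as follows. This is a minimal illustration, assuming the cleanup procedure persists the token ranges it invalidated and that replay consults that record; all type and function names are illustrative, not Scylla's real ones:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch: cleanup persists the token ranges it removed, and commitlog
// replay consults that record so it does not resurrect cleaned-up data.
struct token_range {
    int64_t start; // exclusive
    int64_t end;   // inclusive
};

struct cleanup_record {
    std::vector<token_range> invalidated; // ranges removed by past cleanups
};

inline bool contains(const token_range& r, int64_t token) {
    return token > r.start && token <= r.end;
}

// Replay filter: skip any mutation whose token was already cleaned up.
inline bool should_replay(const cleanup_record& rec, int64_t token) {
    for (const auto& r : rec.invalidated) {
        if (contains(r, token)) {
            return false; // replaying would undo the cleanup
        }
    }
    return true;
}
```

A real implementation would also need a way to trim this record, a concern that comes up later in the thread.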
In this scenario, yes. In the object store case, not necessarily.
With tablets it will not. It is not garbage. It is a past write like any other in the commitlog.
I do not see how it is unrelated to topology changes or fencing. The job of fencing is to make sure old writes will not reach the previous owner after a topology change and cleanup. The same needs to be done on the commitlog replay path. Without topology changes there wouldn't be a need for it.
There are no tombstones either. The data was just dropped by the cleanup. We need to make sure old writes will not get into the node. While the node is alive, fencing does it. On reboot, commitlog replay should handle it.
There is no commitlog replay executed after a long pause, which makes the two scenarios absolutely different. If the only way a node could become unresponsive were a long pause, you could drop the entire commitlog code, which makes the discussion of what it is supposed to do in this scenario kind of moot. |
We create a schema commit log on shard 0 only because it's sufficient (schema is owned only by shard 0) and because we want to save on pre-allocated commitlog segments. We don't need the latest topology from cluster-wide group0 before replaying mutations; we only need the local node's latest view on topology. Meaning that we shouldn't see topology going backwards in time after restart. We need group0 before initializing user tables in case we reshard tablets, which needs to execute a group0 command. For fencing off cleaned-up mutations from the commitlog we indeed need some extra mechanism to recognize that the mutations were cleaned up. Knowing the latest topology is not enough, since the node could have regained ownership of the tablet. |
Which supports my view that this has nothing to do with topology. The act of cleanup should be "persisted". If we delete some data on this node and then the node crashes, that data should not be resurrected on this node after restart. This is purely local information and problem of local data loss / resurrection. |
So we don't need to fetch the latest topology, we already have this information. It's a simple argument by case:
|
Yes. Currently, this is done by deleting all commitlog segments (implicitly) when flushing all tables. We can start with that, and later add something more fine-grained. |
We need to persist its version in addition to knowing the latest topology. |
The data is sitting in the commitlog. The cleanup removed data from memtables and sstables. Something is needed to prevent the commitlog replay from writing the data again. Today it cannot happen because all memtables are flushed, and that releases all commitlog segments.
A pause does not replay the commitlog. I do not see any similarities. |
I'm saying that the node already has that "something", it doesn't need to fetch the latest topology, it won't learn anything new regarding this data. It can already fence off these commitlog writes. |
What is this something? |
We cannot reject writes in the commitlog based on the version, since the replica already accepted those writes. The coordinator could have ACKed the request based on this. The topology may change later, and it may change not due to cleanup of this tablet. |
Whatever caused the cleanup to be triggered in the first place. The topology perhaps? Whatever it was, we can persist this information before doing the cleanup, so it will be known on restart; the node can use it to not undo the cleanup during commitlog replay. |
Yes, true. Just knowing a global topology is not enough, and having a topology version per tablet is probably not an option. |
That doesn't answer the question. What exactly is this something? Cleaned-up ranges with the timestamp of the cleanup in a local table? Maybe this will work, but we may have a lot of cleanups with dynamic rebalancing, and the table may grow large. We need a way to trim it. |
If I understand correctly, the ranges/tokens/keys that must be removed from this node can be calculated directly from the topology - after all it's the topology change (range movement) that triggers the cleanup. So just persisting the topology - which we already do anyway - should be enough. On restart the node would look at persisted topology, and on replay skip keys that do not belong to this node according to that persisted topology. Although this is not 100% correct, because the commitlog might contain data that was cleaned up even before the last persisted topology. E.g. there was topology A which removed some keys from this node, then topology B which added them back, and our last known topology is B. But commitlog might still contain data that was written before topology A arrived. Perhaps we could fix this by ensuring that tables are flushed between topology changes (and commitlog is trimmed on flush). Then it would be enough to know only the last topology. |
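The A-then-B counterexample can be made concrete with a small model (illustrative names only, not Scylla code): a write made before topology A passes a filter that checks only the last persisted topology, because topology B re-added the range.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative model of the counterexample: filtering replay only by the
// last persisted topology resurrects data that an earlier topology change
// had already cleaned up.
struct token_range { int64_t start; int64_t end; }; // (start, end]

struct topology {
    int version;
    std::vector<token_range> owned; // ranges this node owns
};

inline bool owns(const topology& t, int64_t token) {
    for (const auto& r : t.owned) {
        if (token > r.start && token <= r.end) {
            return true;
        }
    }
    return false;
}

// Replay filter that consults only the last persisted topology.
inline bool replay_keeps(const topology& last, int64_t token) {
    return owns(last, token);
}
```

With topology A (node lost the range, cleanup ran) followed by topology B (node regained it), a stale write at token 50 from before A passes the filter, which is exactly the resurrection problem described above.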
But flushing all tables between each topology change is exactly what we are trying to avoid. Otherwise there is no problem.
We can have a new commitlog entry, "cleanup", containing the ranges that should be cleaned up. A cleanup will commit such an entry into the commitlog before proceeding, and replay will do the cleanup as well.
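A sketch of that proposal, assuming single-threaded replay in log order (illustrative types; a `std::map` stands in for the memtable):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <variant>
#include <vector>

// Sketch of the "cleanup entry in the commitlog" idea: replay applies
// writes and cleanup entries in log order, so a cleanup committed before
// the crash is re-applied and the cleaned-up data stays gone.
struct token_range { int64_t start; int64_t end; }; // (start, end]
struct write_entry { int64_t token; int value; };
struct cleanup_entry { token_range range; };
using log_entry = std::variant<write_entry, cleanup_entry>;

std::map<int64_t, int> replay(const std::vector<log_entry>& log) {
    std::map<int64_t, int> state; // stands in for the memtable
    for (const auto& e : log) {
        if (const auto* w = std::get_if<write_entry>(&e)) {
            state[w->token] = w->value;
        } else {
            const auto& c = std::get<cleanup_entry>(e);
            for (auto it = state.begin(); it != state.end();) {
                if (it->first > c.range.start && it->first <= c.range.end) {
                    it = state.erase(it); // re-apply the cleanup on replay
                } else {
                    ++it;
                }
            }
        }
    }
    return state;
}
```

This only works if the segment holding the cleanup entry outlives the segments holding the writes it masks, which is exactly the ordering question raised in the next comment.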
But if the write and the "cleanup" entry are in different segments how do we make sure the one with "cleanup" entry is not deleted earlier? |
We already have a mechanism which prevents old writes from the commitlog from reaching a table after it was truncated. See database::truncate(). We save truncation records in system.truncated. We store a minimum accepted replay position. We could make the mapping more fine grained and make it per token range, allowing us to truncate particular token ranges. Edit: I don't know how that works with resharding, but if it doesn't currently, then we need to fix it. \cc @elcallio |
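The replay-position check can be sketched like this. Heavily simplified, with illustrative names: real truncation records in system.truncated are per table, and the per-token-range refinement suggested above is hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Simplified model of a commitlog replay position, in the spirit of
// database::truncate(): entries written before the persisted minimum
// accepted position are skipped on replay.
struct replay_position {
    uint64_t segment_id;
    uint32_t offset;
};

inline bool operator<(const replay_position& a, const replay_position& b) {
    return std::tie(a.segment_id, a.offset) < std::tie(b.segment_id, b.offset);
}

// Accept an entry only if it was written at or after the minimum accepted
// position recorded at truncation time.
inline bool accept_on_replay(const replay_position& min_accepted,
                             const replay_position& entry) {
    return !(entry < min_accepted);
}
```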
This is the same as what I proposed here: #12642 (comment), except for using a |
|
questions:
Any ideas how to do it right? |
I think the fact of using the schema commitlog does not have to be a property of schema_ptr; we can assign it statically on the replica, in
We should probably change all tables managed by group0 to also go to shard 0 only, like with the schema tables. We need that for the same reason as we did so for the schema tables: so that we can have multi-partition mutations which are applied atomically. If they lived on different shards, we couldn't guarantee an atomic write to the commitlog. |
We can require schema commitlog to be enabled when enabling raft, and fail to boot otherwise. |
It would work like this. Since version X, we have code which writes group0 mutations to the schema commit log. When the user upgrades from an older version and starts a node in version X, the node will fail to boot if the schema commit log is not enabled (based on the cluster feature). At this point they can roll back, make sure the schema commit log is enabled by modifying scylla.yaml and doing a rolling restart, and then proceed with the upgrade to X.
This will require some users to upgrade to a version < X (e.g. 5.1) first. |
We should require the schema commit log to enable Raft, yes. But that would automatically mean the schema commit log is enabled for all new installs (which is a good thing). |
We need this so that we can have multi-partition mutations which are applied atomically. If they live on different shards, we can't guarantee atomic write to the commitlog. Fixes: scylladb#12642
* Kamil Braun ***@***.***> [23/03/10 15:48]:
> We can require schema commitlog to be enabled when enabling raft, and fail to boot otherwise.
How will upgrade work?
Just the same way it works today when you enable both raft and
schema commit log simultaneously.
…--
Konstantin Osipov, Moscow, Russia
|
Today Raft doesn't use schema commit log, that's why it works. If Raft depended on schema commit log and you "enabled" them simultaneously, Raft would error out on the first node start after upgrade, because for schema commit log to actually become enabled after enabling it in the config file you need to restart the node after upgrade at least once. That is, you'd need to enable schema commit log first in configuration, then rolling restart, then rolling restart again, then enable Raft. Also of course we cannot do this in 5.2:
Because 5.2 is already in the release cycle. |
I don't understand this part. Why will it fail to boot? I'm guessing we'd have to do something like: if |
* Kamil Braun ***@***.***> [23/03/10 15:58]:
> Just the same way it works today when you enable both raft and
schema commit log simultaneously.
Today Raft doesn't use schema commit log, that's why it works.
If Raft depended on schema commit log and you "enabled" them simultaneously, Raft would error out on the first node start after upgrade, because for schema commit log to actually become enabled after enabling it in the config file you need to restart the node after upgrade at least once.
That is, you'd need to enable schema commit log first in configuration, then rolling restart, then rolling restart again, then enable Raft.
Also of course we cannot do this in 5.2:
> We should require schema commit log to enable Raft, yes.
Because 5.2 is already in the release cycle.
Well I guess then things are easy, since the only supported path
is 5.2 -> 5.3, and we can require that Raft is enabled first.
To sum up, I see two ways of handling this:
- binding schema commit log and raft features together in 5.2
- deliver raft over schema commit log in 5.3 and require raft
upgrade to happen first to enable it.
…--
Konstantin Osipov, Moscow, Russia
|
User only needs two rolling restarts. Suppose he uses an old Scylla version, prior to 5.1.
If user set |
Our goal (scylladb#12642) is to mark raft tables to use schema commitlog. There are two similar cases in code right now - with_null_sharder and set_wait_for_sync_to_commitlog schema_builder methods. The problem is that if we need to mark some new schema with one of these methods we need to do this twice - first in a method describing the schema (e.g. system_keyspace::raft()) and second in the function create_table_from_mutations, which is not obvious and easy to forget. create_table_from_mutations is called when schema object is reconstructed from mutations, with_null_sharder and set_wait_for_sync_to_commitlog must be called from it since the schema properties they describe are not included in the mutation representation of the schema. This patch proposes to distinguish between the schema properties that get into mutations and those that do not. The former are described with schema_builder, while for the latter we introduce schema_static_props struct and the schema_builder::register_static_configurator method. This way we can formulate a rule once in the code about which schemas should have a null sharder, and it will be enforced in all cases.
…l properties into it' from Gusev Petr Our end goal (#12642) is to mark raft tables to use schema commitlog. There are two similar cases in code right now - `with_null_sharder` and `set_wait_for_sync_to_commitlog` `schema_builder` methods. The problem is that if we need to mark some new schema with one of these methods we need to do this twice - first in a method describing the schema (e.g. `system_keyspace::raft()`) and second in the function `create_table_from_mutations`, which is not obvious and easy to forget. `create_table_from_mutations` is called when schema object is reconstructed from mutations, `with_null_sharder` and `set_wait_for_sync_to_commitlog` must be called from it since the schema properties they describe are not included in the mutation representation of the schema. This series proposes to distinguish between the schema properties that get into mutations and those that do not. The former are described with `schema_builder`, while for the latter we introduce `schema_static_props` struct and the `schema_builder::register_static_configurator` method. This way we can formulate a rule once in the code about which schemas should have a null sharder/be synced, and it will be enforced in all cases. Closes #13170 * github.com:scylladb/scylladb: schema.hh: choose schema_commitlog based on schema_static_props flag schema.hh: use schema_static_props for wait_for_sync_to_commitlog schema.hh: introduce schema_static_props, use it for null_sharder database.cc: drop ensure_populated and mark_as_populated
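The configurator mechanism described in this series can be sketched as follows. A heavily simplified illustration: the real `schema_builder` carries much more state, and the flags here only mirror the ones named above:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Sketch of schema_static_props and
// schema_builder::register_static_configurator: static properties are not
// part of the schema's mutation representation, so they are derived from
// registered rules on every construction path, including the one used by
// create_table_from_mutations.
struct schema_static_props {
    bool use_null_sharder = false;
    bool wait_for_sync_to_commitlog = false;
    bool use_schema_commitlog = false;
};

class schema_builder {
public:
    using static_configurator =
        std::function<void(const std::string& ks, const std::string& cf,
                           schema_static_props&)>;

    static void register_static_configurator(static_configurator c) {
        configurators().push_back(std::move(c));
    }

    schema_builder(std::string ks, std::string cf)
            : _ks(std::move(ks)), _cf(std::move(cf)) {
        // The rule is applied here, once, so no individual call site can
        // forget to set the static properties.
        for (const auto& c : configurators()) {
            c(_ks, _cf, _props);
        }
    }

    const schema_static_props& static_props() const { return _props; }

private:
    static std::vector<static_configurator>& configurators() {
        static std::vector<static_configurator> v;
        return v;
    }
    std::string _ks, _cf;
    schema_static_props _props;
};
```

The rule (e.g. "raft tables use the schema commitlog") is registered once, and every schema built for that table picks it up regardless of which code path constructed it.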
We aim (scylladb#12642) to use the schema commit log for raft tables. Now they are loaded at the first call to init_system_keyspace in main.cc, but the schema commitlog is only initialized shortly before the second call. This is important, since the schema commitlog initialization (database::before_schema_keyspace_init) needs to access schema commitlog feature, which is loaded from system.scylla_local and therefore is only available after the first init_system_keyspace call. So the idea is to defer the loading of the raft tables until the second call to init_system_keyspace, just as it works for schema tables. For this we need a tool to mark which tables should be loaded in the first or second phase. To do this, in this patch we introduce system_table_load_phase enum. It's set in the schema_static_props for schema tables. It replaces the system_keyspace::table_selector in the signature of init_system_keyspace.
We aim (scylladb#12642) to use the schema commit log for raft tables. Now they are loaded at the first call to init_system_keyspace in main.cc, but the schema commitlog is only initialized shortly before the second call. This is important, since the schema commitlog initialization (database::before_schema_keyspace_init) needs to access schema commitlog feature, which is loaded from system.scylla_local and therefore is only available after the first init_system_keyspace call. So the idea is to defer the loading of the raft tables until the second call to init_system_keyspace, just as it works for schema tables. For this we need a tool to mark which tables should be loaded in the first or second phase. To do this, in this patch we introduce system_table_load_phase enum. It's set in the schema_static_props for schema tables. It replaces the system_keyspace::table_selector in the signature of init_system_keyspace. The call site for populate_keyspace in init_system_keyspace was changed, table_selector.contains_keyspace was replaced with db.local().has_keyspace. This check prevents calling populate_keyspace(system_schema) on phase1, but allows for populate_keyspace(system) on phase2 (to init raft tables). On this second call some tables from system keyspace (e.g. system.local) may have already been populated on phase1. This check protects from double-populating them, since every populated cf is marked as ready_for_writes.
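The two-phase loading could be expressed with something like the system_table_load_phase enum mentioned above. A simplified sketch with illustrative table names; the real patch hangs the phase off schema_static_props:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch of the load-phase idea: each system table is tagged with the phase
// in which it may be loaded, and init_system_keyspace filters by phase
// instead of taking a table_selector.
enum class system_table_load_phase {
    phase1, // before the schema commitlog is initialized
    phase2, // after the schema commitlog is initialized
};

struct table_desc {
    std::string name;
    system_table_load_phase phase;
};

std::vector<std::string> tables_for_phase(const std::vector<table_desc>& all,
                                          system_table_load_phase phase) {
    std::vector<std::string> out;
    for (const auto& t : all) {
        if (t.phase == phase) {
            out.push_back(t.name);
        }
    }
    return out;
}
```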
We need this so that we can have multi-partition mutations which are applied atomically. If they live on different shards, we can't guarantee atomic write to the commitlog. Fixes: #12642 Closes #13134 * github.com:scylladb/scylladb: test_raft_upgrade: add a test for schema commit log feature scylla_cluster.py: add start flag to server_add ServerInfo: drop host_id scylla_cluster.py: add config to server_add scylla_cluster.py: add expected_error to server_start scylla_cluster.py: ScyllaServer.start, refactor error reporting scylla_cluster.py: fix ScyllaServer.start, reset cmd if start failed raft: check if schema commitlog is initialized Refuse to boot if neither the schema commitlog feature nor force_schema_commit_log is set. For the upgrade procedure the user should wait until the schema commitlog feature is enabled before enabling consistent_cluster_management. raft: move raft initialization after init_system_keyspace database: rename before_schema_keyspace_init->maybe_init_schema_commitlog raft: use schema commitlog for raft tables init_system_keyspace: refactoring towards explicit load phases
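The boot-time check described in the series ("Refuse to boot if neither the schema commitlog feature nor force_schema_commit_log is set") reduces to something like this sketch; the flag names follow the thread, the function is illustrative:

```cpp
#include <cassert>

// Sketch of the boot refusal logic: with Raft (consistent cluster
// management) enabled, boot proceeds only if the schema commitlog cluster
// feature is already enabled, or the operator forces schema commitlog on.
bool may_boot(bool consistent_cluster_management,
              bool schema_commitlog_feature_enabled,
              bool force_schema_commit_log) {
    if (!consistent_cluster_management) {
        return true; // Raft not in use, no dependency on schema commitlog
    }
    return schema_commitlog_feature_enabled || force_schema_commit_log;
}
```

This is what forces the upgrade ordering discussed above: the feature must be cluster-enabled (one rolling restart) before consistent_cluster_management can be turned on.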
Installation details
Scylla version (or git commit hash): ba26770
The topology management is moving to raft. Raft will manage tablets one day. In the tablet future we will want to know the correct topology while replaying the commitlog. For that we need to be able to start raft before replay starts, but today that is impossible because raft stores its data in local tables and uses the commitlog as well. The solution is to move the raft tables to a separate commitlog which is replayed earlier. Such a commitlog already exists for schema, so we need to make the raft tables use it as well.