
Store raft mutations in schema commit log. #12642

Closed
gleb-cloudius opened this issue Jan 26, 2023 · 40 comments · Fixed by #13134

Comments

@gleb-cloudius
Contributor

Installation details
Scylla version (or git commit hash): ba26770

The topology management is moving to raft. Raft will manage tablets one day. In the tablet future we will want to know the correct topology while replaying the commitlog. For that we need to be able to start raft before the replay starts, but today that is impossible because raft stores its data in local tables and uses the commitlog as well. The solution is to move the raft tables to a separate commitlog which is replayed earlier. Such a commitlog already exists for schema, so we need to make the raft tables use it as well.

@gusev-p gusev-p self-assigned this Jan 30, 2023
@gusev-p

gusev-p commented Jan 30, 2023

I can make something like this here

db::commitlog& cl = db::system_keyspace::is_raft_table(schema) || (schema->ks_name() == db::schema_tables::NAME && _uses_schema_commitlog)
                ? *_schema_commitlog
                : *_commitlog;

but it looks weird: something dedicated to schema is being abused for raft.

BTW, add_column_family is called on all shards, and _schema_commitlog is used only on shard zero, since _uses_schema_commitlog is set only there. Is that what we need?

@gusev-p

gusev-p commented Jan 30, 2023

@gleb-cloudius @kbr- @kostja

@kbr-scylla
Contributor

@tgrabiec could you please give your opinion?

@kbr-scylla
Contributor

In the tablet future we will want to know correct topology while replaying the commitlog. For that we need to be able to start raft before replay starts.

I don't understand this.

A write that has ended up in commitlog can get applied before or after a restart.
If it gets applied before a restart, it's not guaranteed to use the "freshest" topology. It will use the topology that the node currently knows.
But if it gets applied after a restart, it must synchronize topology before getting applied?

@tgrabiec @gleb-cloudius could you please provide an example scenario which shows why we need to synchronize topology before replaying commitlog?

@gleb-cloudius
Contributor Author

In the tablet future we will want to know correct topology while replaying the commitlog. For that we need to be able to start raft before replay starts.

I don't understand this.

A write that has ended up in commitlog can get applied before or after a restart. If it gets applied before a restart, it's not guaranteed to use the "freshest" topology.

It is. Otherwise it will be fenced.

It will use the topology that the node currently knows.

Which is the latest one unless fenced.

But if it gets applied after a restart, it must synchronize topology before getting applied?

To use the latest topology. Because there is no fencing when we apply mutations from the commitlog.

So with tablets we will flush only some memtables and not all of them (that's according to the tablets spec). It means that after we flushed a memtable and did a cleanup, a write may still sit in an active commitlog segment and be replayed after restart.

Another reason is the object store. If/when group0 holds info about data placement for each node, you need to know it before you start the replay.

@kbr-scylla
Contributor

It is. Otherwise it will be fenced.

Let's assume a write which ended up in the commitlog. Meaning that it didn't get fenced the moment it was done.

To use the latest topology. Because there is no fencing when we apply mutations from the commitlog.

Since the write was saved in the commitlog, it didn't get fenced when it arrived at the node, hence the topology stored by the node is correct for this particular write. If applying it was safe at the moment the write arrived, then it is also safe to apply it after the restart using that same topology.

The restart scenario should be no different from a long pause scenario. I.e. restart should restore the node to the same state it was in before the crash / shutdown. Since the write ended up in the memtable before the crash, we apply it again from the commitlog and it again ends up in the memtable. We arrive at the exact same state we were in before the node crashed / stopped. So the restart for all intents and purposes acts as if the node simply hanged for this duration. Without the restart the write would not have been rejected, so there is no reason to reject it with the restart.

@gleb-cloudius
Contributor Author

It is. Otherwise it will be fenced.

Let's assume a write which ended up in the commitlog. Meaning that it didn't get fenced the moment it was done.

To use the latest topology. Because there is no fencing when we apply mutations from the commitlog.

Since the write was saved in the commitlog, it didn't get fenced when it arrived at the node, hence the topology stored by the node is correct for this particular write. If applying it was safe at the moment the write arrived, then it is also safe to apply it after the restart using that same topology.

No, because it may have been already removed from this node by a topology change and a following clean up.

The restart scenario should be no different from a long pause scenario. I.e. restart should restore the node to the same state it was in before the crash / shutdown. Since the write ended up in the memtable before the crash, we apply it again from the commitlog and it again ends up in the memtable. We arrive at the exact same state we were in before the node crashed / stopped. So the restart for all intents and purposes acts as if the node simply hanged for this duration. Without the restart the write would not have been rejected, so there is no reason to reject it with the restart.

And again no. The restart scenario is very different from a long pause since no replay of the commit log is going on during the latter, so they are simply two incomparable events.

@kbr-scylla
Contributor

No, because it may have been already removed from this node by a topology change and a following clean up.

So the assumption is that the node did learn of the topology change that invalidated this write before it crashed/stopped? And it removed it from the memtable, but not from the commitlog?

If so then it's a different problem. The node was supposed to invalidate/remove an entry but it didn't really. It left garbage in the commitlog.

It looks like the problem here is not related to topology changes or fencing. The problem is basically that we resurrect data which we were supposed to have "cleaned up". The clean up procedure should make sure that entries in the commitlog which are no longer valid should not be applied. So the clean up procedure should persist information about invalidated entries, so when we restart we consult this information and skip whatever entries were invalidated. Or replay them but then immediately remove because there are tombstones overriding them.

And again no. The restart scenario is very different from a long pause since no replay of the commit log is going on during the latter, so they are simply two incomparable events.

The purpose of commitlog replay is to bring the node to the state it was in before it crashed, i.e., to basically make it indistinguishable from a long pause, as I understand it.

@gleb-cloudius
Contributor Author

No, because it may have been already removed from this node by a topology change and a following clean up.

So the assumption is that the node did learn of the topology change that invalidated this write before it crashed/stopped? And it removed it from the memtable, but not from the commitlog?

In this scenario yes. In the object store case, not necessarily.

If so then it's a different problem. The node was supposed to invalidate/remove an entry but it didn't really. It left garbage in the commitlog.

With tablets it will not. It is not garbage. It is a past write like any other in the commitlog.

It looks like the problem here is not related to topology changes or fencing.

I do not see how it is unrelated to topology changes or fencing. The job of the fencing is to make sure old writes will not get to the previous owner after a topology change and cleanup. Same needs to be done on commitlog replay path. Without topology change there wouldn't be a need for it.

The problem is basically that we resurrect data which we were supposed to have "cleaned up". The clean up procedure should make sure that entries in the commitlog which are no longer valid should not be applied. So the clean up procedure should persist information about invalidated entries, so when we restart we consult this information and skip whatever entries were invalidated. Or replay them but then immediately remove because there are tombstones overriding them.

There are no tombstones either. The data was just dropped by the cleanup. We need to make sure old writes will not get back into the node. While the node is alive, fencing does it. On reboot, commitlog replay should handle it.

And again no. The restart scenario is very different from a long pause since no replay of the commit log is going on during the latter, so they are simply two incomparable events.

The purpose of commitlog replay is to bring the node to the state it was in before it crashed, i.e., to basically make it indistinguishable from a long pause, as I understand it.

There is no commitlog replay executed after a long pause, which makes the two scenarios absolutely different. If the only way a node could become unresponsive were a long pause, you could drop the entire commitlog code, which makes the discussion of what it is supposed to do in this scenario kind of moot.

@mykaul mykaul added this to the 5.3 milestone Jan 31, 2023
@tgrabiec
Contributor

tgrabiec commented Feb 6, 2023

We create a schema commit log on shard 0 only because it's sufficient (schema is owned only by shard 0) and because we want to save on pre-allocated commitlog segments.

We don't need the latest topology from cluster-wide group0 before replaying mutations, we only need local node's latest view on topology. Meaning, that we shouldn't see topology going backwards in time after restart.

We need group0 before initializing user tables in case we reshard tablets, which needs to execute a group0 command.

For fencing off cleaned-up mutations from commitlog we indeed need some extra mechanism to recognize that the mutations were cleaned up. Knowing the latest topology is not enough, since the node could have regained ownership of the tablet.

@kbr-scylla
Contributor

For fencing off cleaned-up mutations from commitlog we indeed need some extra mechanism to recognize that the mutations were cleaned up. Knowing the latest topology is not enough, since the node could have regained ownership of the tablet.

Which supports my view that this has nothing to do with topology. The act of cleanup should be "persisted". If we delete some data on this node and then the node crashes, that data should not be resurrected on this node after restart. This is purely local information and problem of local data loss / resurrection.

@kbr-scylla
Contributor

No, because it may have been already removed from this node by a topology change and a following clean up.

So the assumption is that the node did learn of the topology change that invalidated this write before it crashed/stopped? And it removed it from the memtable, but not from the commitlog?

In this scenario yes.

So we don't need to fetch the latest topology, we already have this information.

It's a simple argument by case:

  • if this node has performed clean up before restart, then it already knows that this data should be deleted. No need to sync with the cluster.
  • if this node has not performed cleanup before restart, then it is correct to not perform cleanup after restart. Indeed, if you replace the restart in your scenario with a pause, in the pause scenario no cleanup would be performed at this moment - nothing forces the node to stop handling queries and do a read barrier after a pause.

@tgrabiec
Contributor

tgrabiec commented Feb 6, 2023

Which supports my view that this has nothing to do with topology. The act of cleanup should be "persisted". If we delete some data on this node and then the node crashes, that data should not be resurrected on this node after restart. This is purely local information and problem of local data loss / resurrection.

Yes. Currently, this is done by deleting all commitlog segments (implicitly) when flushing all tables. We can start with that, and later add something more fine-grained.

@gleb-cloudius
Contributor Author

For fencing off cleaned-up mutations from commitlog we indeed need some extra mechanism to recognize that the mutations were cleaned up. Knowing the latest topology is not enough, since the node could have regained ownership of the tablet.

We need to persist its version in addition to knowing the latest topology.

@gleb-cloudius
Contributor Author

No, because it may have been already removed from this node by a topology change and a following clean up.

So the assumption is that the node did learn of the topology change that invalidated this write before it crashed/stopped? And it removed it from the memtable, but not from the commitlog?

In this scenario yes.

So we don't need to fetch the latest topology, we already have this information.

It's a simple argument by case:

* if this node has performed clean up before restart, then it _already knows_ that this data should be deleted. No need to sync with the cluster.

The data is sitting in the commitlog. The cleanup removed the data from memtables and sstables. Something is needed to prevent the commitlog replay from writing the data again. Today it cannot happen because all memtables are flushed and that releases all commitlog segments.

* if this node has not performed cleanup before restart, then it is correct to not perform cleanup after restart. Indeed, if you replace the restart in your scenario with a pause, in the pause scenario no cleanup would be performed at this moment - nothing forces the node to stop handling queries and do a read barrier after a pause.

A pause does not replay the commitlog. I do not see any similarities.

@kbr-scylla
Contributor

Something is needed to prevent the commitlog replay from writing the data again.

I'm saying that the node already has that "something", it doesn't need to fetch the latest topology, it won't learn anything new regarding this data. It can already fence off these commitlog writes.

@gleb-cloudius
Contributor Author

Something is needed to prevent the commitlog replay from writing the data again.

I'm saying that the node already has that "something", it doesn't need to fetch the latest topology, it won't learn anything new regarding this data. It can already fence off these commitlog writes.

What is this something?

@tgrabiec
Contributor

tgrabiec commented Feb 6, 2023

We need to persist its version in addition to knowing the latest topology.

We cannot reject writes in the commitlog based on the version, since the replica already accepted those writes. The coordinator could have ACKed the request based on this. The topology may change later, and it may change not due to cleanup of this tablet.

@kbr-scylla
Contributor

What is this something?

Whatever caused the cleanup to be triggered in the first place. The topology perhaps? Whatever it was, we can persist this information before doing the cleanup, so it will be known on restart, the node can use it to not undo the cleanup during commitlog replay.

@gleb-cloudius
Contributor Author

We need to persist its version in addition to knowing the latest topology.

We cannot reject writes in the commitlog based on the version, since the replica already accepted those writes. The coordinator could have ACKed the request based on this. The topology may change later, and it may change not due to cleanup of this tablet.

Yes, true. Just knowing the global topology is not enough, and having a topology version per tablet is probably not an option.

@gleb-cloudius
Contributor Author

What is this something?

Whatever caused the cleanup to be triggered in the first place. The topology perhaps? Whatever it was, we can persist this information before doing the cleanup, so it will be known on restart, the node can use it to not undo the cleanup during commitlog replay.

That does not answer the question. What exactly is this something? Cleaned-up ranges with the timestamp of the cleanup in a local table? Maybe this will work, but we may have a lot of cleanups with dynamic rebalancing and the table may grow large. We need a way to trim it.

@kbr-scylla
Contributor

kbr-scylla commented Feb 6, 2023

If I understand correctly, the ranges/tokens/keys that must be removed from this node can be calculated directly from the topology - after all it's the topology change (range movement) that triggers the cleanup. So just persisting the topology - which we already do anyway - should be enough. On restart the node would look at persisted topology, and on replay skip keys that do not belong to this node according to that persisted topology.

Although this is not 100% correct, because the commitlog might contain data that was cleaned up even before the last persisted topology. E.g. there was topology A which removed some keys from this node, then topology B which added them back, and our last known topology is B. But commitlog might still contain data that was written before topology A arrived. Perhaps we could fix this by ensuring that tables are flushed between topology changes (and commitlog is trimmed on flush). Then it would be enough to know only the last topology.
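
To make the idea concrete, here is a rough stand-alone sketch of such a replay-time filter. The types are illustrative stand-ins, not the real Scylla classes, and as noted above the filter alone is not sufficient without the flush-between-topology-changes guarantee.

#include <cstdint>
#include <vector>

// Illustrative stand-ins only; the real Scylla token/topology types differ.
using token = int64_t;

struct token_range {
    token start; // exclusive
    token end;   // inclusive
    bool contains(token t) const { return t > start && t <= end; }
};

// The node's last persisted view of the ranges it owns.
struct persisted_topology {
    std::vector<token_range> owned_ranges;
    bool owns(token t) const {
        for (const auto& r : owned_ranges) {
            if (r.contains(t)) {
                return true;
            }
        }
        return false;
    }
};

// During commitlog replay, skip mutations whose token does not belong to this
// node according to the topology persisted before the restart.
bool should_apply_on_replay(const persisted_topology& topo, token mutation_token) {
    return topo.owns(mutation_token);
}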

@gleb-cloudius
Contributor Author

But flushing all tables between topology changes is exactly what we are trying to avoid. Otherwise there is no problem.

@gleb-cloudius
Contributor Author

We can have a new commitlog entry "cleanup" containing the ranges that should be cleaned up. A cleanup will commit such an entry into the commitlog before proceeding, and the replay will do the cleanup as well.

@gleb-cloudius
Contributor Author

But if the write and the "cleanup" entry are in different segments, how do we make sure the segment with the "cleanup" entry is not deleted earlier?

@tgrabiec
Contributor

tgrabiec commented Feb 6, 2023

We already have a mechanism which prevents old writes from the commitlog from reaching the table after it was truncated. See database::truncate(). We save truncation records in system.truncated. We store a minimum accepted replay_position. This position is monotonic within a commitlog manager.

We could make the mapping more fine grained and make it per token range, allowing us to truncate particular token ranges.

Edit: I don't know how that works with resharding, but if it doesn't currently, then we need to fix it. \cc @elcallio
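
For readers unfamiliar with that mechanism, a simplified stand-alone sketch of how a per-table minimum accepted replay_position can filter replayed writes; the types are stand-ins, not the actual Scylla definitions.

#include <cstdint>
#include <unordered_map>

// Simplified stand-in for db::replay_position: ordered by (segment id, offset).
struct replay_position {
    uint64_t segment_id = 0;
    uint32_t position = 0;
    bool operator<(const replay_position& o) const {
        return segment_id != o.segment_id ? segment_id < o.segment_id
                                          : position < o.position;
    }
};

// Simplified stand-in for the system.truncated records: per-table minimum
// accepted position. Anything logged before it is skipped on replay.
using table_id = uint64_t;
using truncation_records = std::unordered_map<table_id, replay_position>;

bool should_apply_on_replay(const truncation_records& truncated,
                            table_id table,
                            const replay_position& mutation_pos) {
    auto it = truncated.find(table);
    if (it == truncated.end()) {
        return true;                      // table never truncated: apply
    }
    return !(mutation_pos < it->second);  // apply only at or after the truncation point
}

Making the mapping more fine-grained, as suggested above, would amount to keying the records by (table, token range) instead of by table alone.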

@gleb-cloudius
Contributor Author

We already have a mechanism which prevents old writes from the commitlog from reaching the table after it was truncated. See database::truncate(). We save truncation records in system.truncated. We store a minimum accepted replay_position. This position is monotonic within a commitlog manager.

This is the same as what I proposed here: #12642 (comment), except it uses a replay_position instead of a timestamp. A timestamp will probably not work with user-provided timestamps if we plan to support those.

@gusev-p

gusev-p commented Mar 7, 2023

force_schema_commit_log is already published in scylla 5.1

@gusev-p

gusev-p commented Mar 7, 2023

questions:

  • We can't rebrand the schema commit log since it's already published. Is it ok to use the thing called 'schema commit log' for raft, or should we invent something here?
  • We can't just add a bool flag use_schema_commit_log to schema_builder, we must somehow pass it through frozen_schema. Now I added the following hack:
diff --git a/schema/schema_registry.cc b/schema/schema_registry.cc
index 0c2ce9043..eef60ecf0 100644
--- a/schema/schema_registry.cc
+++ b/schema/schema_registry.cc
@@ -65,6 +65,11 @@ schema_ptr schema_registry::learn(const schema_ptr& s) {
     slogger.debug("Learning about version {} of {}.{}", s->version(), s->ks_name(), s->cf_name());
     auto e_ptr = make_lw_shared<schema_registry_entry>(s->version(), *this);
     auto loaded_s = e_ptr->load(frozen_schema(s));
+
+    // temp hack
+    // how to do it right?
+    loaded_s->_raw._use_schema_commitlog = s->use_schema_commit_log();
+
     _entries.emplace(s->version(), e_ptr);
     return loaded_s;
 }

Any ideas how to do it right?

@kostja @tgrabiec @kbr-scylla @gleb-cloudius

@tgrabiec
Contributor

tgrabiec commented Mar 7, 2023

I think the fact of using the schema commitlog does not have to be a property of schema_ptr, we can assign it statically on the replica, in database::add_column_family(), like we do for the schema tables. It's then not a property of a particular schema version, but a property of a table on a given replica.

We should probably change all tables managed by group0 to also go to shard 0 only, like with the schema tables. We need that for the same reason as we did for the schema tables - so that we can have multi-partition mutations which are applied atomically. If they lived on different shards, we couldn't guarantee an atomic write to the commitlog.
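
A minimal sketch of that static assignment, using hypothetical member and helper names rather than the real database::add_column_family code: the choice of commitlog is made once per table on the replica instead of being carried inside schema_ptr.

// Hypothetical, simplified stand-ins for illustration only.
struct commitlog {};

struct table_static_props {
    bool is_schema_table = false;  // system_schema.* tables
    bool is_group0_table = false;  // raft / group0 state tables
};

struct replica_state {
    commitlog* data_commitlog = nullptr;
    commitlog* schema_commitlog = nullptr;  // created on shard 0 only
    bool uses_schema_commitlog = false;
};

// The decision described above, made when the replica adds the column family,
// based on a static property of the table rather than on a schema version.
commitlog* pick_commitlog(const replica_state& r, const table_static_props& p) {
    if (r.uses_schema_commitlog && (p.is_schema_table || p.is_group0_table)) {
        return r.schema_commitlog;
    }
    return r.data_commitlog;
}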

@tgrabiec
Contributor

tgrabiec commented Mar 7, 2023

We can require schema commitlog to be enabled when enabling raft, and fail to boot otherwise.

@tgrabiec
Contributor

tgrabiec commented Mar 7, 2023

It would work like this. Since version X, we have code which writes group0 mutations to the schema commit log. When a user upgrades from an older version and starts a node in version X, the node will fail to boot if the schema commit log is not enabled (based on the cluster feature). At this point he can roll back, and then make sure the schema commit log is enabled by modifying scylla.yaml and doing a rolling restart. Then he can proceed with the upgrade to X.
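
A sketch of the boot-time check being described, with simplified stand-in names for the config options and cluster feature (not the actual Scylla code):

#include <stdexcept>

// Illustrative stand-ins for the relevant configuration and cluster feature.
struct boot_config {
    bool consistent_cluster_management = false;     // enables raft / group0
    bool force_schema_commit_log = false;
    bool schema_commitlog_feature_enabled = false;   // cluster-wide feature
};

// Refuse to boot if raft is requested but the schema commitlog is not yet
// usable; the operator can roll back, enable the schema commitlog via a
// rolling restart, and retry the upgrade.
void check_raft_prerequisites(const boot_config& cfg) {
    if (!cfg.consistent_cluster_management) {
        return;
    }
    if (cfg.schema_commitlog_feature_enabled || cfg.force_schema_commit_log) {
        return;
    }
    throw std::runtime_error(
        "consistent_cluster_management requires the schema commitlog; "
        "enable it cluster-wide (or set force_schema_commit_log) and restart "
        "before enabling raft");
}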

@tgrabiec
Contributor

tgrabiec commented Mar 7, 2023

This will require some users to upgrade to a version < X (e.g. 5.1) first.

@kostja
Contributor

kostja commented Mar 7, 2023

We should require the schema commit log to enable Raft, yes. But that would automatically mean the schema commit log is enabled for all new installs (which is a good thing).

gusev-p pushed a commit to gusev-p/scylla that referenced this issue Mar 9, 2023
We need this so that we can have multi-partition
mutations which are applied atomically.
If they live on different shards, we can't
guarantee atomic write to the commitlog.

Fixes: scylladb#12642
@kostja
Contributor

kostja commented Mar 10, 2023 via email

@kbr-scylla
Contributor

Just the same way it works today when you enable both raft and
schema commit log simultaneously.

Today Raft doesn't use schema commit log, that's why it works.

If Raft depended on schema commit log and you "enabled" them simultaneously, Raft would error out on the first node start after upgrade, because for schema commit log to actually become enabled after enabling it in the config file you need to restart the node after upgrade at least once.

That is, you'd need to enable schema commit log first in configuration, then rolling restart, then rolling restart again, then enable Raft.

Also of course we cannot do this in 5.2:

We should require schema commit log to enable Raft, yes.

Because 5.2 is already in the release cycle.

@kbr-scylla
Contributor

It would work like this. Since version X, we have code which writes group0 mutations to schema commit log. When user upgrades from an older version and starts node in version X, node will fail to boot if schema commit log is not enabled (based on cluster feature).

I don't understand this part. Why will it fail to boot? I'm guessing we'd have to do something like: if consistent_cluster_management == true but schema commit log cluster feature disabled, abort boot.

@kostja
Contributor

kostja commented Mar 10, 2023 via email

@gusev-p

gusev-p commented Mar 13, 2023

That is, you'd need to enable schema commit log first in configuration, then rolling restart, then rolling restart again, then enable Raft.

The user only needs two rolling restarts. Suppose he uses an old Scylla version, prior to 5.1.

  • He takes the latest Scylla binary and runs the upgrade on all nodes with the current scylla.yaml. Neither raft nor the schema commit log is enabled at this moment.
  • After some time all nodes in the cluster will be upgraded, and the WARN message "All nodes can now switch to use the schema commit log. Restart is needed for this to take effect." will be written to the log.
  • The user sets consistent_cluster_management: true in scylla.yaml and runs another rolling restart. Now the schema_commitlog feature is enabled and we can use it to start raft.

If the user sets consistent_cluster_management: true without waiting for the schema_commitlog feature to settle, we'll fail the boot with an appropriate message.

gusev-p pushed a commit to gusev-p/scylla that referenced this issue Mar 14, 2023
Our goal (scylladb#12642) is to mark raft tables to use
schema commitlog. There are two similar
cases in code right now - with_null_sharder
and set_wait_for_sync_to_commitlog schema_builder
methods. The problem is that if we need to
mark some new schema with one of these methods
we need to do this twice - first in
a method describing the schema
(e.g. system_keyspace::raft()) and second in the
function create_table_from_mutations, which is not
obvious and easy to forget.

create_table_from_mutations is called when schema object
is reconstructed from mutations, with_null_sharder
and set_wait_for_sync_to_commitlog must be called from it
since the schema properties they describe are
not included in the mutation representation of the schema.

This patch proposes to distinguish between the schema
properties that get into mutations and those that do not.
The former are described with schema_builder, while for
the latter we introduce schema_static_props struct and
the schema_builder::register_static_configurator method.
This way we can formulate a rule once in the code about
which schemas should have a null sharder, and it will
be enforced in all cases.
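
A condensed sketch of the mechanism the commit message describes; the struct and function shapes below are simplified illustrations, not the exact Scylla API.

#include <functional>
#include <string>
#include <vector>

// Properties that are NOT part of the mutation representation of a schema.
struct schema_static_props {
    bool use_null_sharder = false;
    bool wait_for_sync_to_commitlog = false;
    bool use_schema_commitlog = false;
};

struct schema_desc {
    std::string ks_name;
    std::string cf_name;
};

// Configurators are registered once; every place a schema is built (including
// create_table_from_mutations) applies them, so the rule lives in one spot.
using static_configurator = std::function<void(const schema_desc&, schema_static_props&)>;

std::vector<static_configurator>& configurators() {
    static std::vector<static_configurator> v;
    return v;
}

void register_static_configurator(static_configurator f) {
    configurators().push_back(std::move(f));
}

schema_static_props compute_static_props(const schema_desc& d) {
    schema_static_props props;
    for (const auto& f : configurators()) {
        f(d, props);
    }
    return props;
}

// Example rule (hypothetical): raft tables get a null sharder and the schema
// commitlog, no matter how their schema object is constructed.
// register_static_configurator([](const schema_desc& d, schema_static_props& p) {
//     if (d.ks_name == "system" && d.cf_name == "raft") {
//         p.use_null_sharder = true;
//         p.use_schema_commitlog = true;
//     }
// });
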
kbr-scylla added a commit that referenced this issue Mar 15, 2023
…l properties into it' from Gusev Petr

Our end goal (#12642) is to mark raft tables to use
schema commitlog. There are two similar
cases in code right now - `with_null_sharder`
and `set_wait_for_sync_to_commitlog` `schema_builder`
methods. The problem is that if we need to
mark some new schema with one of these methods
we need to do this twice - first in
a method describing the schema
(e.g. `system_keyspace::raft()`) and second in the
function `create_table_from_mutations`, which is not
obvious and easy to forget.

`create_table_from_mutations` is called when schema object
is reconstructed from mutations, `with_null_sharder`
and `set_wait_for_sync_to_commitlog` must be called from it
since the schema properties they describe are
not included in the mutation representation of the schema.

This series proposes to distinguish between the schema
properties that get into mutations and those that do not.
The former are described with `schema_builder`, while for
the latter we introduce `schema_static_props` struct and
the `schema_builder::register_static_configurator` method.
This way we can formulate a rule once in the code about
which schemas should have a null sharder/be synced, and it will
be enforced in all cases.

Closes #13170

* github.com:scylladb/scylladb:
  schema.hh: choose schema_commitlog based on schema_static_props flag
  schema.hh: use schema_static_props for wait_for_sync_to_commitlog
  schema.hh: introduce schema_static_props, use it for null_sharder
  database.cc: drop ensure_populated and mark_as_populated
wmitros pushed a commit to wmitros/scylla that referenced this issue Mar 16, 2023
gusev-p pushed a commit to gusev-p/scylla that referenced this issue Mar 17, 2023
We aim (scylladb#12642) to use the schema commit log
for raft tables. Now they are loaded at
the first call to init_system_keyspace in
main.cc, but the schema commitlog is only
initialized shortly before the second
call. This is important, since the schema
commitlog initialization
(database::before_schema_keyspace_init)
needs to access schema commitlog feature,
which is loaded from system.scylla_local
and therefore is only available after the
first init_system_keyspace call.

So the idea is to defer the loading of the raft tables
until the second call to init_system_keyspace,
just as it works for schema tables.
For this we need a tool to mark which tables
should be loaded in the first or second phase.

To do this, in this patch we introduce system_table_load_phase
enum. It's set in the schema_static_props for schema tables.
It replaces the system_keyspace::table_selector in the
signature of init_system_keyspace.
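
A simplified sketch of the two-phase loading described above; the enum name follows the commit message, everything else is an illustrative stand-in.

#include <vector>

enum class system_table_load_phase {
    phase1,  // tables needed before the schema commitlog can be initialized
    phase2,  // schema tables and, with this series, raft/group0 tables
};

struct system_table_def {
    system_table_load_phase phase = system_table_load_phase::phase1;
    // ... schema and population logic elided ...
};

// init_system_keyspace is called twice during boot; each call only loads the
// tables of the requested phase, so the raft tables can wait until the schema
// commitlog has been set up.
void init_system_keyspace(std::vector<system_table_def>& tables, system_table_load_phase phase) {
    for (auto& t : tables) {
        if (t.phase == phase) {
            // load / populate the table
        }
    }
}
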
gusev-p pushed a commit to gusev-p/scylla that referenced this issue Mar 17, 2023
gusev-p pushed a commit to gusev-p/scylla that referenced this issue Mar 20, 2023
gusev-p pushed a commit to gusev-p/scylla that referenced this issue Mar 20, 2023
gusev-p pushed a commit to gusev-p/scylla that referenced this issue Mar 21, 2023
gusev-p pushed a commit to gusev-p/scylla that referenced this issue Mar 21, 2023
gusev-p pushed a commit to gusev-p/scylla that referenced this issue Mar 22, 2023
The call site for populate_keyspace in init_system_keyspace
was changed, table_selector.contains_keyspace was replaced with
db.local().has_keyspace. This check prevents calling
populate_keyspace(system_schema) on phase1, but allows for
populate_keyspace(system) on phase2 (to init raft tables).
On this second call some tables from system keyspace
(e.g. system.local) may have already been populated on phase1.
This check protects from double-populating them, since every
populated cf is marked as ready_for_writes.
gusev-p pushed a commit to gusev-p/scylla that referenced this issue Mar 22, 2023
gusev-p pushed a commit to gusev-p/scylla that referenced this issue Mar 24, 2023
kbr-scylla added a commit that referenced this issue Mar 27, 2023
We need this so that we can have multi-partition mutations which are applied atomically. If they live on different shards, we can't guarantee atomic write to the commitlog.

Fixes: #12642

Closes #13134

* github.com:scylladb/scylladb:
  test_raft_upgrade: add a test for schema commit log feature
  scylla_cluster.py: add start flag to server_add
  ServerInfo: drop host_id
  scylla_cluster.py: add config to server_add
  scylla_cluster.py: add expected_error to server_start
  scylla_cluster.py: ScyllaServer.start, refactor error reporting
  scylla_cluster.py: fix ScyllaServer.start, reset cmd if start failed
  raft: check if schema commitlog is initialized. Refuse to boot if neither the schema commitlog feature nor force_schema_commit_log is set. For the upgrade procedure the user should wait until the schema commitlog feature is enabled before enabling consistent_cluster_management.
  raft: move raft initialization after init_system_keyspace
  database: rename before_schema_keyspace_init->maybe_init_schema_commitlog
  raft: use schema commitlog for raft tables
  init_system_keyspace: refactoring towards explicit load phases