Store raft mutations in schema commit log. #12642
I can make something like this here,
but it looks weird: something dedicated to schema is being abused for raft. BTW, |
@tgrabiec could you please give your opinion? |
I don't understand this. A write that has ended up in commitlog can get applied before or after a restart. @tgrabiec @gleb-cloudius could you please provide an example scenario which shows why we need to synchronize topology before replaying commitlog? |
It is. Otherwise it will be fenced.
Which is the latest one unless fenced.
To use the latest topology. Because there is no fencing when we apply mutations from the commitlog. So with tablets we will flush only some memtables and not all of them (that's according to the tablets spec). It means that after we flushed a memtable and did a cleanup, a write may still sit in an active commitlog segment and be replayed after restart. Another reason is the object store. If/when group0 holds info about data placement for each node, you need to know it before you start replay. |
Let's assume a write which ended up in the commitlog, meaning that it didn't get fenced the moment it was done.
Since the write was saved in the commitlog, it didn't get fenced when it arrived at the node, hence the topology stored by the node is correct for this particular write. If applying it was safe at the moment the write arrived, then it is also safe to apply it after the restart using that same topology. The restart scenario should be no different from a long-pause scenario, i.e., restart should restore the node to the same state it was in before the crash/shutdown. Since the write ended up in the memtable before the crash, we apply it again from the commitlog and it again ends up in the memtable. We arrive at the exact same state we were in before the node crashed/stopped. So the restart, for all intents and purposes, acts as if the node simply hanged for this duration. Without the restart the write would not have been rejected, so there is no reason to reject it with the restart. |
No, because it may have been already removed from this node by a topology change and a following clean up.
And again no. The restart scenario is very different from a long pause, since no replay of the commit log is going on during the latter, so they are simply two incomparable events. |
So the assumption is that the node did learn of the topology change that invalidated this write before it crashed/stopped? And it removed it from the memtable, but not from the commitlog? If so then it's a different problem. The node was supposed to invalidate/remove an entry but it didn't really. It left garbage in the commitlog. It looks like the problem here is not related to topology changes or fencing. The problem is basically that we resurrect data which we were supposed to have "cleaned up". The clean up procedure should make sure that entries in the commitlog which are no longer valid should not be applied. So the clean up procedure should persist information about invalidated entries, so when we restart we consult this information and skip whatever entries were invalidated. Or replay them but then immediately remove because there are tombstones overriding them.
The purpose of commitlog replay is to bring the node to the state it was in before it crashed, i.e., to basically make it indistinguishable from a long pause, as I understand it. |
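The bookkeeping proposed here could be sketched roughly as follows. This is a minimal illustration, assuming the cleanup procedure persists the token ranges it invalidated and that replay consults that record; all type and function names are illustrative, not Scylla's real ones:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch: cleanup persists the token ranges it removed, and commitlog
// replay consults that record so it does not resurrect cleaned-up data.
struct token_range {
    int64_t start; // exclusive
    int64_t end;   // inclusive
};

struct cleanup_record {
    std::vector<token_range> invalidated; // ranges removed by past cleanups
};

inline bool contains(const token_range& r, int64_t token) {
    return token > r.start && token <= r.end;
}

// Replay filter: skip any mutation whose token was already cleaned up.
inline bool should_replay(const cleanup_record& rec, int64_t token) {
    for (const auto& r : rec.invalidated) {
        if (contains(r, token)) {
            return false; // replaying would undo the cleanup
        }
    }
    return true;
}
```

A real implementation would also need a way to trim this record, a concern that comes up later in the thread.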
In this scenario, yes. In the object store case, not necessarily.
With tablets it will not. It is not garbage. It is a past write like any other in the commitlog.
I do not see how it is unrelated to topology changes or fencing. The job of fencing is to make sure old writes will not reach the previous owner after a topology change and cleanup. The same needs to be done on the commitlog replay path. Without topology changes there wouldn't be a need for it.
There are no tombstones either. The data was just dropped by the cleanup. We need to make sure old writes will not get into the node. While the node is alive, fencing does it. On reboot, commitlog replay should handle it.
There is no commitlog replay executed after a long pause, which makes the two scenarios absolutely different. If the only way a node could become unresponsive were a long pause, you could drop the entire commitlog code, which makes the discussion of what it is supposed to do in this scenario kind of moot. |
We create a schema commit log on shard 0 only because it's sufficient (schema is owned only by shard 0) and because we want to save on pre-allocated commitlog segments. We don't need the latest topology from cluster-wide group0 before replaying mutations; we only need the local node's latest view on topology. Meaning that we shouldn't see topology going backwards in time after restart. We need group0 before initializing user tables in case we reshard tablets, which needs to execute a group0 command. For fencing off cleaned-up mutations from the commitlog we indeed need some extra mechanism to recognize that the mutations were cleaned up. Knowing the latest topology is not enough, since the node could have regained ownership of the tablet. |
Which supports my view that this has nothing to do with topology. The act of cleanup should be "persisted". If we delete some data on this node and then the node crashes, that data should not be resurrected on this node after restart. This is purely local information and problem of local data loss / resurrection. |
So we don't need to fetch the latest topology, we already have this information. It's a simple argument by case:
|
Yes. Currently, this is done by deleting all commitlog segments (implicitly) when flushing all tables. We can start with that, and later add something more fine-grained. |
We need to persist its version in addition to knowing the latest topology. |
The data is sitting in the commitlog. The cleanup removed data from memtables and sstables. Something is needed to prevent the commitlog replay from writing the data again. Today it cannot happen because all memtables are flushed, and that releases all commitlog segments.
A pause does not replay the commitlog. I do not see any similarities. |
I'm saying that the node already has that "something", it doesn't need to fetch the latest topology, it won't learn anything new regarding this data. It can already fence off these commitlog writes. |
What is this something? |
We cannot reject writes in the commitlog based on the version, since the replica already accepted those writes. The coordinator could have ACKed the request based on this. The topology may change later, and it may change not due to cleanup of this tablet. |
Whatever caused the cleanup to be triggered in the first place. The topology perhaps? Whatever it was, we can persist this information before doing the cleanup, so it will be known on restart; the node can use it to not undo the cleanup during commitlog replay. |
Yes, true. Just knowing a global topology is not enough, and having a topology version per tablet is probably not an option. |
That doesn't answer the question. What exactly is this something? Cleaned-up ranges with the timestamp of the cleanup in a local table? Maybe this will work, but we may have a lot of cleanups with dynamic rebalancing, and the table may grow large. We need a way to trim it. |
If I understand correctly, the ranges/tokens/keys that must be removed from this node can be calculated directly from the topology - after all it's the topology change (range movement) that triggers the cleanup. So just persisting the topology - which we already do anyway - should be enough. On restart the node would look at persisted topology, and on replay skip keys that do not belong to this node according to that persisted topology. Although this is not 100% correct, because the commitlog might contain data that was cleaned up even before the last persisted topology. E.g. there was topology A which removed some keys from this node, then topology B which added them back, and our last known topology is B. But commitlog might still contain data that was written before topology A arrived. Perhaps we could fix this by ensuring that tables are flushed between topology changes (and commitlog is trimmed on flush). Then it would be enough to know only the last topology. |
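The A-then-B counterexample can be made concrete with a small model (illustrative names only, not Scylla code): a write made before topology A passes a filter that checks only the last persisted topology, because topology B re-added the range.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative model of the counterexample: filtering replay only by the
// last persisted topology resurrects data that an earlier topology change
// had already cleaned up.
struct token_range { int64_t start; int64_t end; }; // (start, end]

struct topology {
    int version;
    std::vector<token_range> owned; // ranges this node owns
};

inline bool owns(const topology& t, int64_t token) {
    for (const auto& r : t.owned) {
        if (token > r.start && token <= r.end) {
            return true;
        }
    }
    return false;
}

// Replay filter that consults only the last persisted topology.
inline bool replay_keeps(const topology& last, int64_t token) {
    return owns(last, token);
}
```

With topology A (node lost the range, cleanup ran) followed by topology B (node regained it), a stale write at token 50 from before A passes the filter, which is exactly the resurrection problem described above.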
But flushing all tables between each topology change is exactly what we are trying to avoid. Otherwise there is no problem.
We can have a new commitlog entry, "cleanup", containing the ranges that should be cleaned up. A cleanup will commit such an entry into the commitlog before proceeding, and replay will do the cleanup as well.
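A sketch of that proposal, assuming single-threaded replay in log order (illustrative types; a `std::map` stands in for the memtable):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <variant>
#include <vector>

// Sketch of the "cleanup entry in the commitlog" idea: replay applies
// writes and cleanup entries in log order, so a cleanup committed before
// the crash is re-applied and the cleaned-up data stays gone.
struct token_range { int64_t start; int64_t end; }; // (start, end]
struct write_entry { int64_t token; int value; };
struct cleanup_entry { token_range range; };
using log_entry = std::variant<write_entry, cleanup_entry>;

std::map<int64_t, int> replay(const std::vector<log_entry>& log) {
    std::map<int64_t, int> state; // stands in for the memtable
    for (const auto& e : log) {
        if (const auto* w = std::get_if<write_entry>(&e)) {
            state[w->token] = w->value;
        } else {
            const auto& c = std::get<cleanup_entry>(e);
            for (auto it = state.begin(); it != state.end();) {
                if (it->first > c.range.start && it->first <= c.range.end) {
                    it = state.erase(it); // re-apply the cleanup on replay
                } else {
                    ++it;
                }
            }
        }
    }
    return state;
}
```

This only works if the segment holding the cleanup entry outlives the segments holding the writes it masks, which is exactly the ordering question raised in the next comment.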
But if the write and the "cleanup" entry are in different segments how do we make sure the one with "cleanup" entry is not deleted earlier? |
We already have a mechanism which prevents old writes from the commitlog from reaching a table after it was truncated. See database::truncate(). We save truncation records in system.truncated. We store a minimum accepted replay position. We could make the mapping more fine grained and make it per token range, allowing us to truncate particular token ranges. Edit: I don't know how that works with resharding, but if it doesn't currently, then we need to fix it. \cc @elcallio |
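The replay-position check can be sketched like this. Heavily simplified, with illustrative names: real truncation records in system.truncated are per table, and the per-token-range refinement suggested above is hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Simplified model of a commitlog replay position, in the spirit of
// database::truncate(): entries written before the persisted minimum
// accepted position are skipped on replay.
struct replay_position {
    uint64_t segment_id;
    uint32_t offset;
};

inline bool operator<(const replay_position& a, const replay_position& b) {
    return std::tie(a.segment_id, a.offset) < std::tie(b.segment_id, b.offset);
}

// Accept an entry only if it was written at or after the minimum accepted
// position recorded at truncation time.
inline bool accept_on_replay(const replay_position& min_accepted,
                             const replay_position& entry) {
    return !(entry < min_accepted);
}
```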
This is the same as what I proposed here: #12642 (comment), except for using a |
|
questions:
Any ideas how to do it right? |
I think the fact of using the schema commitlog does not have to be a property of schema_ptr; we can assign it statically on the replica, in
We should probably change all tables managed by group0 to also go to shard 0 only, like with the schema tables. We need that for the same reason as we did so for the schema tables: so that we can have multi-partition mutations which are applied atomically. If they lived on different shards, we couldn't guarantee an atomic write to the commitlog. |
We can require schema commitlog to be enabled when enabling raft, and fail to boot otherwise. |
It would work like this. Since version X, we have code which writes group0 mutations to the schema commit log. When the user upgrades from an older version and starts a node in version X, the node will fail to boot if the schema commit log is not enabled (based on the cluster feature). At this point they can roll back, make sure the schema commit log is enabled by modifying scylla.yaml and doing a rolling restart, and then proceed with the upgrade to X.
This will require some users to upgrade to a version < X (e.g. 5.1) first. |
We should require the schema commit log to enable Raft, yes. But that would automatically mean the schema commit log is enabled for all new installs (which is a good thing). |
We need this so that we can have multi-partition mutations which are applied atomically. If they live on different shards, we can't guarantee atomic write to the commitlog. Fixes: scylladb#12642
* Kamil Braun ***@***.***> [23/03/10 15:48]:
> We can require schema commitlog to be enabled when enabling raft, and fail to boot otherwise.
How will upgrade work?
Just the same way it works today when you enable both raft and
schema commit log simultaneously.
…--
Konstantin Osipov, Moscow, Russia
|
Today Raft doesn't use schema commit log, that's why it works. If Raft depended on schema commit log and you "enabled" them simultaneously, Raft would error out on the first node start after upgrade, because for schema commit log to actually become enabled after enabling it in the config file you need to restart the node after upgrade at least once. That is, you'd need to enable schema commit log first in configuration, then rolling restart, then rolling restart again, then enable Raft. Also of course we cannot do this in 5.2:
Because 5.2 is already in the release cycle. |
I don't understand this part. Why will it fail to boot? I'm guessing we'd have to do something like: if |
* Kamil Braun ***@***.***> [23/03/10 15:58]:
> Just the same way it works today when you enable both raft and
schema commit log simultaneously.
Today Raft doesn't use schema commit log, that's why it works.
If Raft depended on schema commit log and you "enabled" them simultaneously, Raft would error out on the first node start after upgrade, because for schema commit log to actually become enabled after enabling it in the config file you need to restart the node after upgrade at least once.
That is, you'd need to enable schema commit log first in configuration, then rolling restart, then rolling restart again, then enable Raft.
Also of course we cannot do this in 5.2:
> We should require schema commit log to enable Raft, yes.
Because 5.2 is already in the release cycle.
Well I guess then things are easy, since the only supported path
is 5.2 -> 5.3, and we can require that Raft is enabled first.
To sum up, I see two ways of handling this:
- binding schema commit log and raft features together in 5.2
- deliver raft over schema commit log in 5.3 and require raft
upgrade to happen first to enable it.
…--
Konstantin Osipov, Moscow, Russia
|
User only needs two rolling restarts. Suppose he uses an old Scylla version, prior to 5.1.
If user set |
Our goal (scylladb#12642) is to mark raft tables to use schema commitlog. There are two similar cases in code right now - with_null_sharder and set_wait_for_sync_to_commitlog schema_builder methods. The problem is that if we need to mark some new schema with one of these methods we need to do this twice - first in a method describing the schema (e.g. system_keyspace::raft()) and second in the function create_table_from_mutations, which is not obvious and easy to forget. create_table_from_mutations is called when schema object is reconstructed from mutations, with_null_sharder and set_wait_for_sync_to_commitlog must be called from it since the schema properties they describe are not included in the mutation representation of the schema. This patch proposes to distinguish between the schema properties that get into mutations and those that do not. The former are described with schema_builder, while for the latter we introduce schema_static_props struct and the schema_builder::register_static_configurator method. This way we can formulate a rule once in the code about which schemas should have a null sharder, and it will be enforced in all cases.
…l properties into it' from Gusev Petr Our end goal (#12642) is to mark raft tables to use schema commitlog. There are two similar cases in code right now - `with_null_sharder` and `set_wait_for_sync_to_commitlog` `schema_builder` methods. The problem is that if we need to mark some new schema with one of these methods we need to do this twice - first in a method describing the schema (e.g. `system_keyspace::raft()`) and second in the function `create_table_from_mutations`, which is not obvious and easy to forget. `create_table_from_mutations` is called when schema object is reconstructed from mutations, `with_null_sharder` and `set_wait_for_sync_to_commitlog` must be called from it since the schema properties they describe are not included in the mutation representation of the schema. This series proposes to distinguish between the schema properties that get into mutations and those that do not. The former are described with `schema_builder`, while for the latter we introduce `schema_static_props` struct and the `schema_builder::register_static_configurator` method. This way we can formulate a rule once in the code about which schemas should have a null sharder/be synced, and it will be enforced in all cases. Closes #13170 * github.com:scylladb/scylladb: schema.hh: choose schema_commitlog based on schema_static_props flag schema.hh: use schema_static_props for wait_for_sync_to_commitlog schema.hh: introduce schema_static_props, use it for null_sharder database.cc: drop ensure_populated and mark_as_populated
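The configurator mechanism described in this series can be sketched as follows. A heavily simplified illustration: the real `schema_builder` carries much more state, and the flags here only mirror the ones named above:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Sketch of schema_static_props and
// schema_builder::register_static_configurator: static properties are not
// part of the schema's mutation representation, so they are derived from
// registered rules on every construction path, including the one used by
// create_table_from_mutations.
struct schema_static_props {
    bool use_null_sharder = false;
    bool wait_for_sync_to_commitlog = false;
    bool use_schema_commitlog = false;
};

class schema_builder {
public:
    using static_configurator =
        std::function<void(const std::string& ks, const std::string& cf,
                           schema_static_props&)>;

    static void register_static_configurator(static_configurator c) {
        configurators().push_back(std::move(c));
    }

    schema_builder(std::string ks, std::string cf)
            : _ks(std::move(ks)), _cf(std::move(cf)) {
        // The rule is applied here, once, so no individual call site can
        // forget to set the static properties.
        for (const auto& c : configurators()) {
            c(_ks, _cf, _props);
        }
    }

    const schema_static_props& static_props() const { return _props; }

private:
    static std::vector<static_configurator>& configurators() {
        static std::vector<static_configurator> v;
        return v;
    }
    std::string _ks, _cf;
    schema_static_props _props;
};
```

The rule (e.g. "raft tables use the schema commitlog") is registered once, and every schema built for that table picks it up regardless of which code path constructed it.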
We aim (scylladb#12642) to use the schema commit log for raft tables. Now they are loaded at the first call to init_system_keyspace in main.cc, but the schema commitlog is only initialized shortly before the second call. This is important, since the schema commitlog initialization (database::before_schema_keyspace_init) needs to access schema commitlog feature, which is loaded from system.scylla_local and therefore is only available after the first init_system_keyspace call. So the idea is to defer the loading of the raft tables until the second call to init_system_keyspace, just as it works for schema tables. For this we need a tool to mark which tables should be loaded in the first or second phase. To do this, in this patch we introduce system_table_load_phase enum. It's set in the schema_static_props for schema tables. It replaces the system_keyspace::table_selector in the signature of init_system_keyspace.
We aim (scylladb#12642) to use the schema commit log for raft tables. Now they are loaded at the first call to init_system_keyspace in main.cc, but the schema commitlog is only initialized shortly before the second call. This is important, since the schema commitlog initialization (database::before_schema_keyspace_init) needs to access schema commitlog feature, which is loaded from system.scylla_local and therefore is only available after the first init_system_keyspace call. So the idea is to defer the loading of the raft tables until the second call to init_system_keyspace, just as it works for schema tables. For this we need a tool to mark which tables should be loaded in the first or second phase. To do this, in this patch we introduce system_table_load_phase enum. It's set in the schema_static_props for schema tables. It replaces the system_keyspace::table_selector in the signature of init_system_keyspace. The call site for populate_keyspace in init_system_keyspace was changed, table_selector.contains_keyspace was replaced with db.local().has_keyspace. This check prevents calling populate_keyspace(system_schema) on phase1, but allows for populate_keyspace(system) on phase2 (to init raft tables). On this second call some tables from system keyspace (e.g. system.local) may have already been populated on phase1. This check protects from double-populating them, since every populated cf is marked as ready_for_writes.
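The two-phase loading could be expressed with something like the system_table_load_phase enum mentioned above. A simplified sketch with illustrative table names; the real patch hangs the phase off schema_static_props:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch of the load-phase idea: each system table is tagged with the phase
// in which it may be loaded, and init_system_keyspace filters by phase
// instead of taking a table_selector.
enum class system_table_load_phase {
    phase1, // before the schema commitlog is initialized
    phase2, // after the schema commitlog is initialized
};

struct table_desc {
    std::string name;
    system_table_load_phase phase;
};

std::vector<std::string> tables_for_phase(const std::vector<table_desc>& all,
                                          system_table_load_phase phase) {
    std::vector<std::string> out;
    for (const auto& t : all) {
        if (t.phase == phase) {
            out.push_back(t.name);
        }
    }
    return out;
}
```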
We need this so that we can have multi-partition mutations which are applied atomically. If they live on different shards, we can't guarantee atomic write to the commitlog. Fixes: #12642 Closes #13134 * github.com:scylladb/scylladb: test_raft_upgrade: add a test for schema commit log feature scylla_cluster.py: add start flag to server_add ServerInfo: drop host_id scylla_cluster.py: add config to server_add scylla_cluster.py: add expected_error to server_start scylla_cluster.py: ScyllaServer.start, refactor error reporting scylla_cluster.py: fix ScyllaServer.start, reset cmd if start failed raft: check if schema commitlog is initialized Refuse to boot if neither the schema commitlog feature nor force_schema_commit_log is set. For the upgrade procedure the user should wait until the schema commitlog feature is enabled before enabling consistent_cluster_management. raft: move raft initialization after init_system_keyspace database: rename before_schema_keyspace_init->maybe_init_schema_commitlog raft: use schema commitlog for raft tables init_system_keyspace: refactoring towards explicit load phases
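The boot-time check described in the series ("Refuse to boot if neither the schema commitlog feature nor force_schema_commit_log is set") reduces to something like this sketch; the flag names follow the thread, the function is illustrative:

```cpp
#include <cassert>

// Sketch of the boot refusal logic: with Raft (consistent cluster
// management) enabled, boot proceeds only if the schema commitlog cluster
// feature is already enabled, or the operator forces schema commitlog on.
bool may_boot(bool consistent_cluster_management,
              bool schema_commitlog_feature_enabled,
              bool force_schema_commit_log) {
    if (!consistent_cluster_management) {
        return true; // Raft not in use, no dependency on schema commitlog
    }
    return schema_commitlog_feature_enabled || force_schema_commit_log;
}
```

This is what forces the upgrade ordering discussed above: the feature must be cluster-enabled (one rolling restart) before consistent_cluster_management can be turned on.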
Installation details
Scylla version (or git commit hash): ba26770
The topology management is moving to raft. Raft will manage tablets one day. In the tablet future we will want to know the correct topology while replaying the commitlog. For that we need to be able to start raft before replay starts, but today that is impossible because raft stores its data in local tables and uses the commitlog as well. The solution is to move the raft tables to a separate commitlog which is replayed earlier. Such a commitlog already exists for schema, so we need to make the raft tables use it as well.