
[tablets] Support RF changes using ALTER KEYSPACE #16129

Closed
Tracked by #17572
tzach opened this issue Nov 22, 2023 · 13 comments · Fixed by #16723

@tzach
Contributor

tzach commented Nov 22, 2023

Support updating the replication strategy (RF) with ALTER KEYSPACE under tablets.
Blocked by #16101

@tgrabiec
Contributor

Currently, RF is changed by altering keyspace options. It can only be changed safely by 1 at a time, in which case the old and new quorums must overlap. Afterwards, the admin should run repair to reduce the risk of data loss, since changing the RF doesn't replicate old writes automatically.

The plan is to start with something that resembles the current way things work, and then improve it to be safe against any RF change and to replicate automatically.

Unlike with vnodes, tablet replicas are explicitly stated for each tablet (token range). So to make the current procedure work with tablets, we should extend ALTER KEYSPACE execution to walk over the tablet metadata and change the replicas accordingly, allocating new replicas or dropping existing ones. This cannot be done on the spot as a group0 transaction if some affected tablet is currently migrating. To solve that, we make the tablet update a topology transaction executed by the topology change coordinator, which is mutually exclusive with tablet migration globally. We introduce a new kind of global_topology_request called keyspace_rf_change, and persist that request in system.topology. handle_global_request() will execute it by committing the new tablet metadata and the schema change to group0. The request takes arguments (keyspace name and new options), which should be persisted as request arguments similar to how we do it for new_cdc_generation_data_uuid.
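In rough pseudocode (Python, with invented names — the actual implementation would live in the C++ topology change coordinator), the metadata walk amounts to:

```python
# Illustrative sketch only: walk per-tablet replica lists and add or drop
# one replica per tablet so each tablet matches the new RF.
def apply_rf_change(tablet_replicas, new_rf, candidate_nodes):
    """tablet_replicas: {tablet_id: [node, ...]}; returns updated metadata."""
    updated = {}
    for tablet_id, replicas in tablet_replicas.items():
        replicas = list(replicas)
        if len(replicas) < new_rf:
            # Allocate new replicas from nodes not already holding this tablet.
            for node in candidate_nodes:
                if len(replicas) == new_rf:
                    break
                if node not in replicas:
                    replicas.append(node)
            if len(replicas) < new_rf:
                raise RuntimeError("cannot achieve RF %d" % new_rf)
        elif len(replicas) > new_rf:
            # Drop surplus replicas (the real code would pick the
            # most-loaded shard, see below).
            replicas = replicas[:new_rf]
        updated[tablet_id] = replicas
    return updated
```

The whole updated metadata would then be committed to group0 together with the schema change.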

The CQL statement should fail if there is already an ongoing request; the user will have to retry once the previous one has finished. The CQL request handler should wait for the request to complete before returning; this can be done using the virtual task API. If the CQL shell loses track of it, it will be available via the task manager API. So this request should integrate with the virtual task API (#16374).

Tablet replica selection should reuse load_sketch to allocate replicas; see network_topology_strategy::allocate_tablets_for_new_table(). It should respect replication strategy constraints. It's OK if we don't achieve perfect balance; that will be solved by tablet load balancing. When dropping replicas, we should pick the replica on the most-loaded shard, as tracked by load_sketch.
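A toy stand-in for the load_sketch-based selection described above (names and structure are assumptions; the real class lives in Scylla's C++ tree and tracks load per shard):

```python
# Minimal sketch: track per-node tablet counts, allocate on the
# least-loaded node, and drop the most-loaded current replica.
class LoadSketch:
    def __init__(self, nodes):
        self.load = {n: 0 for n in nodes}

    def allocate(self, exclude=()):
        # Least-loaded node not already holding the tablet.
        candidates = [n for n in self.load if n not in exclude]
        if not candidates:
            raise RuntimeError("RF cannot be achieved")
        node = min(candidates, key=lambda n: self.load[n])
        self.load[node] += 1
        return node

    def drop(self, replicas):
        # Most-loaded current replica.
        node = max(replicas, key=lambda n: self.load[n])
        self.load[node] -= 1
        return node
```

Perfect balance is not attempted here; as noted above, the tablet load balancer fixes up any imbalance later.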

It can happen that RF cannot be achieved. In this case the operation should fail.

We should also fail if RF is changed by more than 1, since the procedure is not safe in this case.
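These two failure conditions can be summarized as a pre-flight check (function and parameter names are illustrative, not Scylla's actual API):

```python
# Sketch of the proposed validation: reject RF changes larger than 1
# (the procedure isn't safe then) and RFs the DC cannot satisfy.
def validate_rf_change(old_rf, new_rf, nodes_in_dc):
    if abs(new_rf - old_rf) > 1:
        raise ValueError("RF can only be changed by 1 at a time")
    if new_rf > nodes_in_dc:
        raise ValueError("RF %d cannot be achieved with %d nodes"
                         % (new_rf, nodes_in_dc))
```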

When adding a new DC, it can happen that the keyspace metadata already has an RF for the DC, but no nodes are bootstrapped yet in that DC, so tablet allocation will fail. We should require users to first add nodes and only then set the replication factor for the DC, which is what our docs recommend: https://opensource.docs.scylladb.com/stable/operating-scylla/procedures/cluster-management/add-dc-to-existing-dc.html

When determining the list of DCs, one should look at tm->get_topology()->get_datacenters() and not keyspace options, since some keyspace options are not DCs (e.g. 'replication_factor').
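To illustrate the point about DC discovery (hypothetical Python with an invented helper name):

```python
# Derive per-DC RF from topology's DC list rather than from the keyspace
# options, whose keys also include non-DC entries like 'replication_factor'
# and 'class'.
def rf_per_dc(topology_dcs, keyspace_options):
    default = int(keyspace_options.get('replication_factor', 0))
    return {dc: int(keyspace_options.get(dc, default)) for dc in topology_dcs}
```

Note how 'dc2' below gets the default RF even though the options never name it, and 'class' is never mistaken for a DC.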

See docs/dev/topology-over-raft.md

@avikivity
Member

For tablets, the replication factor is stored in two places:

  1. the keyspace replication clause
  2. the tablets table

If we make storage_proxy look at the tablets table (via effective_replication_map), then the replication clause becomes a goal for the load balancer. It sees the discrepancy between the replication clause and the tablets table, and starts working to reconcile the discrepancy. Once it's done, the ALTER KEYSPACE statement completes.
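The reconciliation idea could be sketched roughly as follows (a minimal Python sketch with invented names; the real load balancer would work per tablet and per migration step):

```python
# Sketch: treat the replication clause's RF as a goal and converge the
# per-tablet replica lists toward it, one replica change per tablet per pass.
def reconcile(tablet_replicas, goal_rf, spare_nodes):
    changed = True
    while changed:
        changed = False
        for tid, replicas in tablet_replicas.items():
            if len(replicas) < goal_rf:
                # Grow: pick any node not already holding this tablet.
                replicas.append(next(n for n in spare_nodes
                                     if n not in replicas))
                changed = True
            elif len(replicas) > goal_rf:
                # Shrink: drop a surplus replica.
                replicas.pop()
                changed = True
    # Once the discrepancy is gone, the ALTER KEYSPACE statement completes.
    return tablet_replicas
```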

@tgrabiec
Contributor

> For tablets, the replication factor is stored in two places:
>
>   1. the keyspace replication clause
>   2. the tablets table
>
> If we make storage_proxy look at the tablets table (via effective_replication_map), then the replication clause becomes a goal for the load balancer. It sees the discrepancy between the replication clause and the tablets table, and starts working to reconcile the discrepancy. Once it's done, the ALTER KEYSPACE statement completes.

That's more complicated to implement, so I'd suggest we defer it.

@avikivity
Member

> For tablets, the replication factor is stored in two places:
>
>   1. the keyspace replication clause
>   2. the tablets table
>
> If we make storage_proxy look at the tablets table (via effective_replication_map), then the replication clause becomes a goal for the load balancer. It sees the discrepancy between the replication clause and the tablets table, and starts working to reconcile the discrepancy. Once it's done, the ALTER KEYSPACE statement completes.
>
> That's more complicated to implement, so I'd suggest we defer it.

Sure, we don't have to do everything in one day.

@avikivity
Member

IMO it's okay to allow any RF changes during the first phase. It's consistent with what we do with vnodes. The user is responsible for running repair if they want reads not to lose data, or they can alter the replication factor by 1 each time.
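The quorum-overlap argument behind the "by 1 each time" rule can be checked with a few lines of arithmetic (illustrative Python, not Scylla code):

```python
# Majority quorum size for a given replication factor.
def quorum(rf):
    return rf // 2 + 1

# When RF grows, the old replicas are a subset of the new ones, so the old
# write quorum and the new read quorum are both drawn from the larger
# replica set; they must intersect whenever their sizes sum to more than it.
def quorums_overlap(old_rf, new_rf):
    return quorum(old_rf) + quorum(new_rf) > max(old_rf, new_rf)

# A change of 1 always keeps quorums overlapping...
assert all(quorums_overlap(rf, rf + 1) for rf in range(1, 20))
# ...but a jump of 2 can break the guarantee: RF 3 -> 5 gives quorums of
# 2 and 3 among 5 nodes, and 2 + 3 = 5 does not exceed 5.
assert not quorums_overlap(3, 5)
```

This is why a user who changes RF by more than 1 must run repair before quorum reads are guaranteed to see old writes.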

@tgrabiec
Contributor

tgrabiec commented Jan 10, 2024

I realized there is one problem with simply changing the replica set. In order for the new tablet replica to accept requests, it must know the new tablet metadata (it creates compaction group for the tablet). There is a time window where some (storage_proxy) coordinator can already work with the new replica set, but new replica may still be at old metadata version. To prevent unnecessary request failures, we should go through a simplified tablet migration track in tablet's state machine, which has two stages, and doesn't do streaming. So request handler for RF change would initiate migrations and switch topology transition state to tablet migration track.

Later, we will do repair there, to automatically repair new replicas. But to do that for arbitrary RF changes we need an infrastructure in storage_proxy to work with more than 1 pending replica.

@bhalevy
Member

bhalevy commented Jan 19, 2024

> I realized there is one problem with simply changing the replica set. In order for the new tablet replica to accept requests, it must know the new tablet metadata (it creates compaction group for the tablet). There is a time window where some (storage_proxy) coordinator can already work with the new replica set, but new replica may still be at old metadata version. To prevent unnecessary request failures, we should go through a simplified tablet migration track in tablet's state machine, which has two stages, and doesn't do streaming. So request handler for RF change would initiate migrations and switch topology transition state to tablet migration track.
>
> Later, we will do repair there, to automatically repair new replicas. But to do that for arbitrary RF changes we need an infrastructure in storage_proxy to work with more than 1 pending replica.

@tgrabiec can the tablet scheduler make sure to rebuild just one replica at a time until we support multiple pending replicas in the storage proxy?

@tgrabiec
Contributor

> I realized there is one problem with simply changing the replica set. In order for the new tablet replica to accept requests, it must know the new tablet metadata (it creates compaction group for the tablet). There is a time window where some (storage_proxy) coordinator can already work with the new replica set, but new replica may still be at old metadata version. To prevent unnecessary request failures, we should go through a simplified tablet migration track in tablet's state machine, which has two stages, and doesn't do streaming. So request handler for RF change would initiate migrations and switch topology transition state to tablet migration track.
> Later, we will do repair there, to automatically repair new replicas. But to do that for arbitrary RF changes we need an infrastructure in storage_proxy to work with more than 1 pending replica.
>
> @tgrabiec can the tablet scheduler make sure to rebuild just one replica at a time until we support multiple pending replicas in the storage proxy?

It can, but we don't plan to do automatic rebuild on RF changes now.

@bhalevy
Member

bhalevy commented Mar 18, 2024

Refs #17846

@ptrsmrn
Contributor

ptrsmrn commented Mar 29, 2024

We have to eliminate the query timeout when ALTERing a keyspace. It probably doesn't matter whether the keyspace is tablet-enabled or not; both can have the timeout disabled, which simplifies the implementation. Where exactly to do it (cqlsh, the Python driver?) has yet to be decided.
This has to be reflected in the documentation.

@bhalevy
Member

bhalevy commented Mar 29, 2024

> We have to eliminate the query timeout when ALTERing KS. It probably doesn't matter if it's tablet-enabled KS or not, both can have the timeout disabled, which simplifies the implementation. Where to do it exactly (cqlsh, python driver?) has to be yet decided. This has to be reflected in the documentation.

I believe the client timeout can be set per query by the application, so this can be done in cqlsh.
I'm not sure what goes on on the server side and whether the query might time out there as well.
If so, we must prevent that for 6.0.
As for eliminating the client timeout for this and/or other types of queries on the driver side, that seems like a longer-term project that shouldn't block the release.
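As a sketch of the per-query client-side workaround (assuming cqlsh's --request-timeout option, which takes seconds; the keyspace name and RF are placeholders):

```shell
# Assumed workaround: raise cqlsh's client-side request timeout for a
# potentially long-running ALTER KEYSPACE on a tablets-enabled keyspace.
cqlsh --request-timeout=3600 -e \
  "ALTER KEYSPACE ks WITH replication =
   {'class': 'NetworkTopologyStrategy', 'dc1': 3}"
```

This only addresses the client side; whether the server-side timeout also needs handling is the open question above.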

@mykaul mykaul modified the milestones: 6.0, 6.1 May 20, 2024
@mykaul
Contributor

mykaul commented May 20, 2024

#16723 is moved to 6.1. I believe so should this one.

@bhalevy
Member

bhalevy commented May 20, 2024

> #16723 is moved to 6.1. I believe so should this one.

Makes sense

@mykaul mykaul added the P1 Urgent label May 20, 2024
@bhalevy bhalevy changed the title RFC: Integration of Tablets load balancer with RF changes [tablets] Support RF changes using ALTER KEYSPACE May 20, 2024
ptrsmrn added a commit to ptrsmrn/scylladb that referenced this issue May 22, 2024
Full support for ALTERing a tablets-enabled KEYSPACE is not yet
implemented, and we don't want to change only the schema without changing
any tablets, so the statement has to be explicitly rejected in cases
that won't work, i.e., whenever any replication option is provided.

Fixes: scylladb#18795
References: scylladb#16129
ptrsmrn added a commit to ptrsmrn/scylladb that referenced this issue May 22, 2024
ptrsmrn added a commit to ptrsmrn/scylladb that referenced this issue May 27, 2024
ptrsmrn added a commit to ptrsmrn/scylladb that referenced this issue May 28, 2024
ptrsmrn added a commit to ptrsmrn/scylladb that referenced this issue May 28, 2024
ptrsmrn added a commit to ptrsmrn/scylladb that referenced this issue May 28, 2024
ptrsmrn added a commit to ptrsmrn/scylladb that referenced this issue May 29, 2024