
[tablets] Support RF changes using ALTER KEYSPACE #16129

Closed
Tracked by #17572
tzach opened this issue Nov 22, 2023 · 13 comments · Fixed by #16723

@tzach
Contributor

tzach commented Nov 22, 2023

Support updating the replication strategy (RF) with ALTER KEYSPACE under tablets.
Blocked by #16101

@tgrabiec
Contributor

Currently, RF is changed by altering keyspace options. It can only be changed safely by 1 at a time, in which case the old and new quorums must overlap. Afterwards, the admin should run repair to reduce the risk of data loss, since changing the RF doesn't replicate old writes automatically.

The plan is to start with something that resembles the current way things work, and then improve it to be safe against any RF change and to replicate automatically.

Unlike with vnodes, tablet replicas are explicitly stated for each tablet (token range). So to make the current procedure work with tablets, we should extend ALTER KEYSPACE execution to walk over the tablet metadata and change the replicas accordingly, allocating new replicas or dropping existing ones. This cannot be done on the spot as a group0 transaction if some affected tablet is currently migrating. To solve that, we make the tablet update a topology transaction executed by the topology change coordinator, which is mutually exclusive with tablet migration globally. We introduce a new kind of global_topology_request called keyspace_rf_change, and persist that request in system.topology. handle_global_request() will execute it by committing the new tablet metadata and the schema change to group0. The request takes arguments (keyspace name and new options), which should be persisted as request arguments similar to how we do it for new_cdc_generation_data_uuid.
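In rough pseudocode (Python, with invented names — the actual implementation would live in the C++ topology change coordinator), the metadata walk amounts to:

```python
# Illustrative sketch only: walk per-tablet replica lists and add or drop
# one replica per tablet so each tablet matches the new RF.
def apply_rf_change(tablet_replicas, new_rf, candidate_nodes):
    """tablet_replicas: {tablet_id: [node, ...]}; returns updated metadata."""
    updated = {}
    for tablet_id, replicas in tablet_replicas.items():
        replicas = list(replicas)
        if len(replicas) < new_rf:
            # Allocate new replicas from nodes not already holding this tablet.
            for node in candidate_nodes:
                if len(replicas) == new_rf:
                    break
                if node not in replicas:
                    replicas.append(node)
            if len(replicas) < new_rf:
                raise RuntimeError("cannot achieve RF %d" % new_rf)
        elif len(replicas) > new_rf:
            # Drop surplus replicas (the real code would pick the
            # most-loaded shard, see below).
            replicas = replicas[:new_rf]
        updated[tablet_id] = replicas
    return updated
```

The whole updated metadata would then be committed to group0 together with the schema change.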

The CQL statement should fail if there is already an ongoing request; the user will have to retry once the previous one has finished. The CQL request handler should wait for the request to complete before returning; this can be done using the virtual task API. If the CQL shell loses track of it, it will be available via the task manager API. So this request should integrate with the virtual task API (#16374).

Tablet replica selection should reuse load_sketch to allocate replicas; see network_topology_strategy::allocate_tablets_for_new_table(). It should respect replication strategy constraints. It's OK if we don't achieve perfect balance; that will be solved by tablet load balancing. When dropping replicas, we should pick the replica on the most-loaded shard, as tracked by load_sketch.
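A toy stand-in for the load_sketch-based selection described above (names and structure are assumptions; the real class lives in Scylla's C++ tree and tracks load per shard):

```python
# Minimal sketch: track per-node tablet counts, allocate on the
# least-loaded node, and drop the most-loaded current replica.
class LoadSketch:
    def __init__(self, nodes):
        self.load = {n: 0 for n in nodes}

    def allocate(self, exclude=()):
        # Least-loaded node not already holding the tablet.
        candidates = [n for n in self.load if n not in exclude]
        if not candidates:
            raise RuntimeError("RF cannot be achieved")
        node = min(candidates, key=lambda n: self.load[n])
        self.load[node] += 1
        return node

    def drop(self, replicas):
        # Most-loaded current replica.
        node = max(replicas, key=lambda n: self.load[n])
        self.load[node] -= 1
        return node
```

Perfect balance is not attempted here; as noted above, the tablet load balancer fixes up any imbalance later.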

It can happen that RF cannot be achieved. In this case the operation should fail.

We should also fail if RF is changed by more than 1, since the procedure is not safe in this case.
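These two failure conditions can be summarized as a pre-flight check (function and parameter names are illustrative, not Scylla's actual API):

```python
# Sketch of the proposed validation: reject RF changes larger than 1
# (the procedure isn't safe then) and RFs the DC cannot satisfy.
def validate_rf_change(old_rf, new_rf, nodes_in_dc):
    if abs(new_rf - old_rf) > 1:
        raise ValueError("RF can only be changed by 1 at a time")
    if new_rf > nodes_in_dc:
        raise ValueError("RF %d cannot be achieved with %d nodes"
                         % (new_rf, nodes_in_dc))
```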

When adding a new DC, it can happen that the keyspace metadata already has an RF for the DC, but no nodes are bootstrapped yet in that DC, so tablet allocation will fail. We should require users to first add nodes and only then set the replication factor for the DC, which is what our docs recommend: https://opensource.docs.scylladb.com/stable/operating-scylla/procedures/cluster-management/add-dc-to-existing-dc.html

When determining the list of DCs, one should look at tm->get_topology()->get_datacenters() and not keyspace options, since some keyspace options are not DCs (e.g. 'replication_factor').
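To illustrate the point about DC discovery (hypothetical Python with an invented helper name):

```python
# Derive per-DC RF from topology's DC list rather than from the keyspace
# options, whose keys also include non-DC entries like 'replication_factor'
# and 'class'.
def rf_per_dc(topology_dcs, keyspace_options):
    default = int(keyspace_options.get('replication_factor', 0))
    return {dc: int(keyspace_options.get(dc, default)) for dc in topology_dcs}
```

Note how 'dc2' below gets the default RF even though the options never name it, and 'class' is never mistaken for a DC.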

See docs/dev/topology-over-raft.md

@avikivity
Member

For tablets, the replication factor is stored in two places:

  1. the keyspace replication clause
  2. the tablets table

If we make storage_proxy look at the tablets table (via effective_replication_map), then the replication clause becomes a goal for the load balancer. It sees the discrepancy between the replication clause and the tablets table, and starts working to reconcile the discrepancy. Once it's done, the ALTER KEYSPACE statement completes.
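The reconciliation idea could be sketched roughly as follows (a minimal Python sketch with invented names; the real load balancer would work per tablet and per migration step):

```python
# Sketch: treat the replication clause's RF as a goal and converge the
# per-tablet replica lists toward it, one replica change per tablet per pass.
def reconcile(tablet_replicas, goal_rf, spare_nodes):
    changed = True
    while changed:
        changed = False
        for tid, replicas in tablet_replicas.items():
            if len(replicas) < goal_rf:
                # Grow: pick any node not already holding this tablet.
                replicas.append(next(n for n in spare_nodes
                                     if n not in replicas))
                changed = True
            elif len(replicas) > goal_rf:
                # Shrink: drop a surplus replica.
                replicas.pop()
                changed = True
    # Once the discrepancy is gone, the ALTER KEYSPACE statement completes.
    return tablet_replicas
```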

@tgrabiec
Contributor

> For tablets, the replication factor is stored in two places:
>
>   1. the keyspace replication clause
>   2. the tablets table
>
> If we make storage_proxy look at the tablets table (via effective_replication_map), then the replication clause becomes a goal for the load balancer. It sees the discrepancy between the replication clause and the tablets table, and starts working to reconcile the discrepancy. Once it's done, the ALTER KEYSPACE statement completes.

That's more complicated to implement, so I'd suggest we defer it.

@avikivity
Member

> For tablets, the replication factor is stored in two places:
>
>   1. the keyspace replication clause
>   2. the tablets table
>
> If we make storage_proxy look at the tablets table (via effective_replication_map), then the replication clause becomes a goal for the load balancer. It sees the discrepancy between the replication clause and the tablets table, and starts working to reconcile the discrepancy. Once it's done, the ALTER KEYSPACE statement completes.
>
> That's more complicated to implement, so I'd suggest we defer it.

Sure, we don't have to do everything in one day.

@avikivity
Member

IMO it's okay to allow any RF changes during the first phase. It's consistent with what we do with vnodes. The user is responsible for running repair if they want reads not to lose data, or they can alter the replication factor by 1 each time.
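The quorum-overlap argument behind the "by 1 each time" rule can be checked with a few lines of arithmetic (illustrative Python, not Scylla code):

```python
# Majority quorum size for a given replication factor.
def quorum(rf):
    return rf // 2 + 1

# When RF grows, the old replicas are a subset of the new ones, so the old
# write quorum and the new read quorum are both drawn from the larger
# replica set; they must intersect whenever their sizes sum to more than it.
def quorums_overlap(old_rf, new_rf):
    return quorum(old_rf) + quorum(new_rf) > max(old_rf, new_rf)

# A change of 1 always keeps quorums overlapping...
assert all(quorums_overlap(rf, rf + 1) for rf in range(1, 20))
# ...but a jump of 2 can break the guarantee: RF 3 -> 5 gives quorums of
# 2 and 3 among 5 nodes, and 2 + 3 = 5 does not exceed 5.
assert not quorums_overlap(3, 5)
```

This is why a user who changes RF by more than 1 must run repair before quorum reads are guaranteed to see old writes.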

@tgrabiec
Contributor

tgrabiec commented Jan 10, 2024

I realized there is one problem with simply changing the replica set. In order for the new tablet replica to accept requests, it must know the new tablet metadata (it creates compaction group for the tablet). There is a time window where some (storage_proxy) coordinator can already work with the new replica set, but new replica may still be at old metadata version. To prevent unnecessary request failures, we should go through a simplified tablet migration track in tablet's state machine, which has two stages, and doesn't do streaming. So request handler for RF change would initiate migrations and switch topology transition state to tablet migration track.

Later, we will do repair there, to automatically repair new replicas. But to do that for arbitrary RF changes we need an infrastructure in storage_proxy to work with more than 1 pending replica.

@bhalevy
Member

bhalevy commented Jan 19, 2024

> I realized there is one problem with simply changing the replica set. In order for the new tablet replica to accept requests, it must know the new tablet metadata (it creates compaction group for the tablet). There is a time window where some (storage_proxy) coordinator can already work with the new replica set, but new replica may still be at old metadata version. To prevent unnecessary request failures, we should go through a simplified tablet migration track in tablet's state machine, which has two stages, and doesn't do streaming. So request handler for RF change would initiate migrations and switch topology transition state to tablet migration track.
>
> Later, we will do repair there, to automatically repair new replicas. But to do that for arbitrary RF changes we need an infrastructure in storage_proxy to work with more than 1 pending replica.

@tgrabiec can the tablet scheduler make sure to rebuild just one replica at a time until we support multiple pending replicas in the storage proxy?

@tgrabiec
Contributor

> I realized there is one problem with simply changing the replica set. In order for the new tablet replica to accept requests, it must know the new tablet metadata (it creates compaction group for the tablet). There is a time window where some (storage_proxy) coordinator can already work with the new replica set, but new replica may still be at old metadata version. To prevent unnecessary request failures, we should go through a simplified tablet migration track in tablet's state machine, which has two stages, and doesn't do streaming. So request handler for RF change would initiate migrations and switch topology transition state to tablet migration track.
> Later, we will do repair there, to automatically repair new replicas. But to do that for arbitrary RF changes we need an infrastructure in storage_proxy to work with more than 1 pending replica.
>
> @tgrabiec can the tablet scheduler make sure to rebuild just one replica at a time until we support multiple pending replicas in the storage proxy?

It can, but we don't plan to do automatic rebuild on RF changes now.

@bhalevy
Member

bhalevy commented Mar 18, 2024

Refs #17846

@ptrsmrn
Contributor

ptrsmrn commented Mar 29, 2024

We have to eliminate the query timeout when ALTERing a keyspace. It probably doesn't matter whether the keyspace is tablet-enabled or not; both can have the timeout disabled, which simplifies the implementation. Where exactly to do it (cqlsh, the Python driver?) has yet to be decided.
This has to be reflected in the documentation.

@bhalevy
Member

bhalevy commented Mar 29, 2024

> We have to eliminate the query timeout when ALTERing KS. It probably doesn't matter if it's tablet-enabled KS or not, both can have the timeout disabled, which simplifies the implementation. Where to do it exactly (cqlsh, python driver?) has to be yet decided. This has to be reflected in the documentation.

I believe the client timeout can be set per query by the application, so this can be done in cqlsh.
I'm not sure what goes on on the server side and whether the query might time out there as well.
If so, we must prevent that for 6.0.
As for eliminating the client timeout for this and/or other types of queries on the driver side, that seems like a longer-term project that shouldn't block the release.
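As a sketch of the per-query client-side workaround (assuming cqlsh's --request-timeout option, which takes seconds; the keyspace name and RF are placeholders):

```shell
# Assumed workaround: raise cqlsh's client-side request timeout for a
# potentially long-running ALTER KEYSPACE on a tablets-enabled keyspace.
cqlsh --request-timeout=3600 -e \
  "ALTER KEYSPACE ks WITH replication =
   {'class': 'NetworkTopologyStrategy', 'dc1': 3}"
```

This only addresses the client side; whether the server-side timeout also needs handling is the open question above.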

@mykaul mykaul modified the milestones: 6.0, 6.1 May 20, 2024
@mykaul
Contributor

mykaul commented May 20, 2024

#16723 is moved to 6.1. I believe so should this one.

@bhalevy
Member

bhalevy commented May 20, 2024

> #16723 is moved to 6.1. I believe so should this one.

Makes sense

@mykaul mykaul added the P1 Urgent label May 20, 2024
@bhalevy bhalevy changed the title RFC: Integration of Tablets load balancer with RF changes [tablets] Support RF changes using ALTER KEYSPACE May 20, 2024
ptrsmrn added a commit to ptrsmrn/scylladb that referenced this issue May 22, 2024
Full support for ALTERing a tablets-enabled KEYSPACE is not yet
implemented, and we don't want to change only the schema without changing
any tablets, so the statement has to be explicitly rejected in cases
that won't work, i.e., whenever any replication option is provided.

Fixes: scylladb#18795
References: scylladb#16129
ptrsmrn added a commit to ptrsmrn/scylladb that referenced this issue May 22, 2024
ptrsmrn added a commit to ptrsmrn/scylladb that referenced this issue May 27, 2024
ptrsmrn added a commit to ptrsmrn/scylladb that referenced this issue May 28, 2024
ptrsmrn added a commit to ptrsmrn/scylladb that referenced this issue May 28, 2024
ptrsmrn added a commit to ptrsmrn/scylladb that referenced this issue May 28, 2024
ptrsmrn added a commit to ptrsmrn/scylladb that referenced this issue May 29, 2024