[NEW] Server driven slot migration #587
Comments
This is what INFO is for; what exactly is the value of introducing a net-new command for this? We might also take the hint and add the context to CLUSTER SHARDS, since it's extensible enough to include ongoing slot migrations.
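For reference, a sketch of how the existing CLUSTER SHARDS reply might carry such context; the migrations field and its contents below are hypothetical, not part of any agreed design, and the existing per-node fields are abbreviated:

```
$ valkey-cli cluster shards
1) 1) "slots"
   2) 1) (integer) 0
      2) (integer) 5460
   3) "nodes"
   4) ... (existing per-node fields elided)
   5) "migrations"          <- hypothetical new field
   6) 1) 1) "slot"
         2) (integer) 1000
         3) "target-shard-id"
         4) "f7c5d8..."
         5) "state"
         6) "started"
```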
Do we need to have the optional slots argument? This requires an extra optional step of figuring out which slots are being migrated, and then forces them to cancel those. I would prefer just
What is the benefit of supporting queues? Since we have the concept of failed migrations (if the target is full), it seems to add complexity for the end user to queue up a bunch of slots. I guess my question about the queue is: can you do …
We need multiple properties for each slot being migrated. For instance, on the source side, I am thinking:
Makes sense.
I think the queuing proposal comes from the slot migration operation being async and non-blocking. This automatically opens the door to the client submitting multiple requests and the server picking its own strategy, processing them one at a time or in batches however it sees fit, which in turn means some slots might not get worked on right after the migration request is accepted. That is why differentiating …
There has been discussion about introducing a way for INFO output to be available in RESP; I should go find it and open another issue. My understanding is that clients already know how to parse INFO responses, so it should be okay to add the new field (sort of like how we show replicas as multiple items in INFO). I suppose part of my question also stems from whether we need queues. If we just have one "ongoing" batch, it's much easier to show the info.
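For comparison, this is roughly how the replication section of INFO already packs multiple sub-fields into one line per replica; a slot-migration entry could conceivably follow the same pattern (the migrating_slot_0 line below is purely hypothetical):

```
$ valkey-cli info replication
# Replication
role:master
connected_slaves:2
slave0:ip=10.0.0.11,port=6379,state=online,offset=123456,lag=0
slave1:ip=10.0.0.12,port=6379,state=online,offset=123390,lag=1

# A slot-migration entry could look similar (hypothetical, not an existing field):
# migrating_slot_0:slot=1000,target_shard_id=f7c5d8...,state=started,retries=0
```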
I suppose my view is that most applications would submit something more like one "batch" of slots to migrate from node A to node B, and they would poll for its completion. Once it's completed, they would send the second "batch". If you submit a queue of batches, you need to reason on the server side about what to do if one of them fails. If we take as a tenet that we would like to keep Valkey simple, I would like to understand deeply why we think building the queues adds to the end user experience.
That is all I need to go with
There is a secondary parsing step involved here to extract these "sub-fields". Without a proper RESP format, this essentially degenerates into
Agreed.
This puts more state tracking burden on the client side for someone who submits multiple batches of requests without waiting, which is your question below.
I think you have an (implicit) assumption here about the implementation, i.e., that the server will process ALL submitted slots at once. This might or might not be true; we discussed in the past that even with atomic slot migration, we might start by migrating one slot at a time. Unless all the slots are processed as a single batch, you will have to reason about the same question. As part of this proposal, I would like to include bounded exponential retries to handle this failure and include
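As a rough illustration of the "bounded exponential retries" idea (the base delay, cap, and retry limit below are placeholder values, not proposed ones):

```bash
#!/usr/bin/env bash
# Bounded exponential backoff: the delay doubles on each attempt, is capped at
# MAX_DELAY_MS, and the migration is marked failed after MAX_RETRIES attempts.
# All three constants are illustrative placeholders.
BASE_DELAY_MS=500
MAX_DELAY_MS=30000
MAX_RETRIES=6

for ((attempt = 0; attempt < MAX_RETRIES; attempt++)); do
  delay_ms=$((BASE_DELAY_MS * (2 ** attempt)))
  ((delay_ms > MAX_DELAY_MS)) && delay_ms=$MAX_DELAY_MS
  echo "attempt $((attempt + 1)): retry after ${delay_ms} ms"
done
```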
I agree with the simplicity tenet, but I think we need to look from an end-to-end perspective of "combined complexity". For instance, the current migration scheme is simple for the server but puts too much burden on the client side, to the point that the overall complexity is very high.
You could also be queueing up slots to multiple different nodes at the same time, which you also can't do atomically. On failure, do you drop the whole queue or do you move on to the next node? For atomic slot migration we could do
I don't think the server should own this. It feels like we're pushing a lot more state than is really necessary. If there is a bug here, you have to patch the engine to fix it. Putting it in tooling seems a lot safer.
I don't think this is true. I know a lot of developers who just use the built-in …
Maybe the thing that would make it easier to decide on is what we want the user experience to look like in the ideal state.
No, quite the opposite. I am thinking purely about the self-managed users here. The CLI tool can hide some complexity, but I don't think it can achieve things like auto-resuming a failed slot migration (due to failovers). It is also a hassle to constantly rerun the tool vs. fire-and-forget, for which I imagine the following contract: "once slots are submitted for migration, I no longer need to monitor the system constantly to either re-submit the failed request or fix the broken link myself; the system should either retry for me or park itself in a perpetually stable state so that availability is upheld (at the cost of performance in the failure case)."
That is wrong, though; it could resume a failed slot migration. It knows the topology and the intention, so it should be able to resume the operation. I don't understand the bias towards pushing more orchestration complexity into the engine. I would like the engine to be able to make sure it is in a stable state, but during failure modes it's not clear why the engine should continue to take action on its own.
You will have to come back and restart the operation using the cli.
Yes that is the first step.
It would require less human intervention, which is especially valuable in the self-managed case. That said, I think this is completely incremental work on top of the new command and the atomic slot migration, so I am fine with leaving it out of the first version and continuing the brainstorming.
Sorry, I just found this issue; it looks like Ping wants to add some features that allow the cluster to do the resharding automatically. I will think carefully about this issue. But I have one idea for the resharding rebalance: in real production, some keys are sometimes very hot for reading and writing, and if these kinds of hot keys are located in the same shard or the same slots, the cluster should have the ability to migrate these hot keys to a cold slot or shard, …
This assumes the CLI errors out instead of issuing the command again :) You are right that if the error is on the server where the CLI is running, you have to try to issue it again. The same would also be true of whatever system is polling for the failure.
Current management with the CLI is a hassle. I do want to have some self-healing built into the server/cluster itself. Leaving a migration halfway done is stable, but not good if it goes undetected. Resuming a slot migration becomes a non-issue, though, with atomic slot migration. I'm skeptical of the queuing mechanism. Why not make the command blocking? That seems to me like a good user experience. Then migrating slots one by one is just a pipeline. Or we could make the command async and push a message (RESP3) on error or completion. To migrate slots to multiple destinations at the same time, we could allow syntax like
Continuation of #245 (comment)
Today, slot migration is completely driven by an external process, essentially executing the steps below:
1. On the destination node: CLUSTER SETSLOT <slot> IMPORTING <source_node_id>
2. On the source node: CLUSTER SETSLOT <slot> MIGRATING <destination_node_id>
3. Get the keys and migrate them one by one: CLUSTER GETKEYSINSLOT <slot> <count>, then MIGRATE <destination_ip> <destination_port> <key> 0 <timeout> for each key
4. Set the slot to the destination node on all nodes: CLUSTER SETSLOT <slot> NODE <destination_node_id>
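For illustration, the externally driven flow for a single slot looks roughly like the sketch below; it is a simplified bash/valkey-cli rendering with placeholder hosts, and it omits error handling, ASKING redirects, and retries:

```bash
#!/usr/bin/env bash
# Simplified sketch of today's externally driven migration of a single slot.
SLOT=1000
SRC=10.0.0.1; DST=10.0.0.2; PORT=6379
SRC_ID=$(valkey-cli -h "$SRC" -p "$PORT" cluster myid)
DST_ID=$(valkey-cli -h "$DST" -p "$PORT" cluster myid)

# Steps 1 and 2: mark the slot importing on the destination, migrating on the source.
valkey-cli -h "$DST" -p "$PORT" cluster setslot "$SLOT" importing "$SRC_ID"
valkey-cli -h "$SRC" -p "$PORT" cluster setslot "$SLOT" migrating "$DST_ID"

# Step 3: move the keys in batches until the slot is empty.
while true; do
  keys=$(valkey-cli -h "$SRC" -p "$PORT" cluster getkeysinslot "$SLOT" 100)
  [ -z "$keys" ] && break
  for key in $keys; do
    valkey-cli -h "$SRC" -p "$PORT" migrate "$DST" "$PORT" "$key" 0 5000
  done
done

# Step 4: assign the slot to the destination node (this must reach every node;
# only the two involved nodes are shown here).
valkey-cli -h "$SRC" -p "$PORT" cluster setslot "$SLOT" node "$DST_ID"
valkey-cli -h "$DST" -p "$PORT" cluster setslot "$SLOT" node "$DST_ID"
```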
This is a heavy-handed process with many failure paths to handle. Even with the improvements introduced in #445, step 3 above is still error-prone.
The proposal here is to introduce a new command that allows the entire process to be executed on the migration source node in one shot. We can relatively easily perform all the steps above in the engine for now, but going forward, this change also serves as a stepping stone to the eventual atomic slot migration (#23).
On a high level, here is what the proposed workflow would look like:
1. CLUSTER MIGRATE QUEUE <SLOTS> <SHARD_ID>, where <SLOTS> is a comma-separated, unordered list of slot ranges or single slots, such as 3-6,7,10,1. Note that <SHARD_ID> is preferred as the target identifier over <NODE_ID>. This is to relieve the client of the hassle of tracking down the primary node, which is volatile state on its own and can change right after the client query. This command is also non-blocking, like CLUSTER FAILOVER.
2. Finding out whether the slots were migrated successfully can be achieved via any of the cluster topology query commands. However, regardless of how the slot migration is performed (atomic or not), errors can happen, and the client needs a way to get more information about any incomplete migration. The detailed implementation is not a concern at this point, but the user interface is key. Because the client will also need to cancel in-progress or pending slot migrations, it is desirable to be able to report per-slot migration results. For this reason, we could consider a command like the one below:
CLUSTER MIGRATE REPORT <SLOTS>, where <SLOTS> is optional. When <SLOTS> is not provided, this command reports all in-progress and pending-migration slots. The report is an array, with each element being a map containing one of the following two sets of fields:
a. On the source: slot_number, target_shard_id, state (started/pending/failed), num_retries, queued_time, start_time, update_time
b. On the target: slot_number, source_shard_id, start_time, update_time
3. CLUSTER MIGRATE CANCEL <SLOTS>, to cancel in-progress or pending slot migrations.
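Put together, a client session against the proposed interface might look like the sketch below; the command names are as proposed above, while the reply renderings, shard IDs, and timestamps are illustrative only:

```
# Queue slots 3-6, 7, and 10 for migration to shard f7c5d8... (fire-and-forget).
127.0.0.1:6379> CLUSTER MIGRATE QUEUE 3-6,7,10 f7c5d8...
OK

# Poll for progress; with no <SLOTS> argument this reports everything
# in-progress or pending.
127.0.0.1:6379> CLUSTER MIGRATE REPORT
1) 1) "slot_number"      3
   2) "target_shard_id"  "f7c5d8..."
   3) "state"            "started"
   4) "num_retries"      0
   5) "queued_time"      ...
   6) "start_time"       ...
   7) "update_time"      ...
2) ... (one map per remaining queued slot)

# Cancel whatever is still pending for slots 7 and 10.
127.0.0.1:6379> CLUSTER MIGRATE CANCEL 7,10
OK
```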
Note that this proposal allows the future atomic slot migration improvement to be introduced as a drop-in replacement of the existing migration scheme.