[NEW] Server driven slot migration #587
Comments
This is what INFO is for; what exactly is the value of introducing a net-new command for this? We might also take the hint and add the context to CLUSTER SHARDS, since it's extensible enough to include ongoing slot migrations.
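For reference, a sketch of how the existing CLUSTER SHARDS reply might carry such context; the migrations field and its contents below are hypothetical, not part of any agreed design, and the existing per-node fields are abbreviated:

```
$ valkey-cli cluster shards
1) 1) "slots"
   2) 1) (integer) 0
      2) (integer) 5460
   3) "nodes"
   4) ... (existing per-node fields elided)
   5) "migrations"          <- hypothetical new field
   6) 1) 1) "slot"
         2) (integer) 1000
         3) "target-shard-id"
         4) "f7c5d8..."
         5) "state"
         6) "started"
```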
Do we need to have the optional slots argument? This requires an extra optional step of figuring out which slots are being migrated, and then forces them to cancel those. I would prefer just
What is the benefit of supporting queues? Since we have the concept of failed migrations (if the target is full), it seems to add complexity for the end user to queue up a bunch of slots. I guess my question about the queue is: can you do …
We need multiple properties for each slot being migrated. For instance, on the source side, I am thinking:
Makes sense.
I think the queuing proposal comes from the slot migration operation being async and non-blocking. This automatically opens the door to the client submitting multiple requests and the server picking its own strategy, processing them one at a time or in batches however it sees fit, which in turn means some slots might not get worked on right after the migration request is accepted. That is why differentiating …
There has been discussion about introducing a way for INFO output to be available in RESP; I should go find it and open another issue. My understanding is that clients already know how to parse INFO responses, so it should be okay to add the new field (sort of like how we show replicas as multiple items in INFO). I suppose part of my question also stems from whether we need queues. If we just have one "ongoing" batch, it's much easier to show the info.
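For comparison, this is roughly how the replication section of INFO already packs multiple sub-fields into one line per replica; a slot-migration entry could conceivably follow the same pattern (the migrating_slot_0 line below is purely hypothetical):

```
$ valkey-cli info replication
# Replication
role:master
connected_slaves:2
slave0:ip=10.0.0.11,port=6379,state=online,offset=123456,lag=0
slave1:ip=10.0.0.12,port=6379,state=online,offset=123390,lag=1

# A slot-migration entry could look similar (hypothetical, not an existing field):
# migrating_slot_0:slot=1000,target_shard_id=f7c5d8...,state=started,retries=0
```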
I suppose my view is that most applications would submit something more like one "batch" of slots to migrate from node A to node B, and they would poll for its completion. Once it's completed, they would send the second "batch". If you submit a queue of batches, you need to reason on the server side about what to do if one of them fails. If we take as a tenet that we would like to keep Valkey simple, I would like to understand deeply why we think building the queues adds to the end user experience.
That is all I need to go with
There is a secondary parsing step involved here to extract these "sub-fields". Without a proper RESP format, this essentially degenerates into
Agreed.
This puts more state tracking burden on the client side for someone who submits multiple batches of requests without waiting, which is your question below.
I think you have an (implicit) assumption here about the implementation, i.e., that the server will process ALL submitted slots at once. This might or might not be true; we discussed in the past that even with atomic slot migration, we might start by migrating one slot at a time. Unless all the slots are processed as a single batch, you will have to reason about the same question. As part of this proposal, I would like to include bounded exponential retries to handle this failure and include
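As a rough illustration of the "bounded exponential retries" idea (the base delay, cap, and retry limit below are placeholder values, not proposed ones):

```bash
#!/usr/bin/env bash
# Bounded exponential backoff: the delay doubles on each attempt, is capped at
# MAX_DELAY_MS, and the migration is marked failed after MAX_RETRIES attempts.
# All three constants are illustrative placeholders.
BASE_DELAY_MS=500
MAX_DELAY_MS=30000
MAX_RETRIES=6

for ((attempt = 0; attempt < MAX_RETRIES; attempt++)); do
  delay_ms=$((BASE_DELAY_MS * (2 ** attempt)))
  ((delay_ms > MAX_DELAY_MS)) && delay_ms=$MAX_DELAY_MS
  echo "attempt $((attempt + 1)): retry after ${delay_ms} ms"
done
```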
I agree with the simplicity tenet, but I think we need to look from an end-to-end perspective of "combined complexity". For instance, the current migration scheme is simple for the server but puts too much burden on the client side, to the point that the overall complexity is very high.
You could also be queueing up slots to multiple different nodes at the same time, which you also can't do atomically. On failure, do you drop the whole queue or do you move on to the next node? For atomic slot migration we could do
I don't think the server should own this. It feels like we're pushing a lot more state than is really necessary. If there is a bug here, you have to patch the engine to fix it. Putting it in tooling seems a lot safer.
I don't think this is true. I know a lot of developers who just use the built-in …
Maybe the thing that would make it easier to decide on is what we want the user experience to look like in the ideal state.
No, quite the opposite. I am thinking purely about the self-managed users here. The CLI tool can hide some complexity, but I don't think it can achieve things like auto-resuming a failed slot migration (due to failovers). It is also a hassle to constantly rerun the tool vs. fire-and-forget, for which I imagine the following contract: "once slots are submitted for migration, I no longer need to monitor the system constantly to either re-submit the failed request or fix the broken link myself; the system should either retry for me or park itself in a perpetually stable state so that availability is upheld (at the cost of performance in the failure case)."
That is wrong, though; it could resume a failed slot migration. It knows the topology and the intention, so it should be able to resume the operation. I don't understand the bias towards pushing more orchestration complexity into the engine. I would like the engine to be able to make sure it is in a stable state, but during failure modes it's not clear why the engine should continue to take action on its own.
You will have to come back and restart the operation using the cli.
Yes that is the first step.
It would require less human intervention, which is especially valuable in the self-managed case. That said, I think this is completely incremental work on top of the new command and the atomic slot migration, so I am fine with leaving it out of the first version and continuing the brainstorming.
Sorry, I just found this issue; it looks like Ping wants to add some features that allow the cluster to do the resharding automatically. I will think carefully about this issue. But I have one idea for the resharding rebalance: in real production, some keys are sometimes very hot for reading and writing, and if these kinds of hot keys are located in the same shard or the same slots, the cluster should have the ability to migrate these hot keys to a cold slot or shard, …
This assumes the CLI errors out instead of issuing the command again :) You are right that if the error is on the server where the CLI is running, you have to try to issue it again. The same would also be true of whatever system is polling for the failure.
Current management with the CLI is a hassle. I do want to have some self-healing built into the server/cluster itself. Leaving a migration halfway done is stable, but not good if it goes undetected. Resuming a slot migration becomes a non-issue, though, with atomic slot migration. I'm skeptical of the queuing mechanism. Why not make the command blocking? That seems to me like a good user experience. Then migrating slots one by one is just a pipeline. Or we could make the command async and push a message (RESP3) on error or completion. To migrate slots to multiple destinations at the same time, we could allow syntax like
Continuation of #245 (comment)
Today, slot migration is completely driven by an external process, essentially executing the steps below:
1. On the destination node: CLUSTER SETSLOT <slot> IMPORTING <source_node_id>
2. On the source node: CLUSTER SETSLOT <slot> MIGRATING <destination_node_id>
3. Get the keys and migrate them one by one: CLUSTER GETKEYSINSLOT <slot> <count>, then MIGRATE <destination_ip> <destination_port> <key> 0 <timeout> for each key
4. Set the slot to the destination node on all nodes: CLUSTER SETSLOT <slot> NODE <destination_node_id>
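For illustration, the externally driven flow for a single slot looks roughly like the sketch below; it is a simplified bash/valkey-cli rendering with placeholder hosts, and it omits error handling, ASKING redirects, and retries:

```bash
#!/usr/bin/env bash
# Simplified sketch of today's externally driven migration of a single slot.
SLOT=1000
SRC=10.0.0.1; DST=10.0.0.2; PORT=6379
SRC_ID=$(valkey-cli -h "$SRC" -p "$PORT" cluster myid)
DST_ID=$(valkey-cli -h "$DST" -p "$PORT" cluster myid)

# Steps 1 and 2: mark the slot importing on the destination, migrating on the source.
valkey-cli -h "$DST" -p "$PORT" cluster setslot "$SLOT" importing "$SRC_ID"
valkey-cli -h "$SRC" -p "$PORT" cluster setslot "$SLOT" migrating "$DST_ID"

# Step 3: move the keys in batches until the slot is empty.
while true; do
  keys=$(valkey-cli -h "$SRC" -p "$PORT" cluster getkeysinslot "$SLOT" 100)
  [ -z "$keys" ] && break
  for key in $keys; do
    valkey-cli -h "$SRC" -p "$PORT" migrate "$DST" "$PORT" "$key" 0 5000
  done
done

# Step 4: assign the slot to the destination node (this must reach every node;
# only the two involved nodes are shown here).
valkey-cli -h "$SRC" -p "$PORT" cluster setslot "$SLOT" node "$DST_ID"
valkey-cli -h "$DST" -p "$PORT" cluster setslot "$SLOT" node "$DST_ID"
```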
This is a heavy-handed process with many failure paths to handle. Even with the improvements introduced in #445, step 3 above is still error-prone.
The proposal here is to introduce a new command that allows the entire process to be executed on the migration source node in one shot. We can relatively easily perform all the steps above in the engine for now, but going forward, this change also serves as a stepping stone to the eventual atomic slot migration (#23).
On a high level, here is what the proposed workflow would look like:
1. CLUSTER MIGRATE QUEUE <SLOTS> <SHARD_ID>, where <SLOTS> is a comma-separated, unordered list of slot ranges or single slots, such as 3-6,7,10,1. Note that <SHARD_ID> is preferred as the target identifier over <NODE_ID>. This is to relieve the client of the hassle of tracking down the primary node, which is volatile state on its own and can change right after the client query. This command is also non-blocking, like CLUSTER FAILOVER.
2. Finding out whether the slots were migrated successfully can be achieved via any of the cluster topology query commands. However, regardless of how the slot migration is performed (atomic or not), errors can happen, and the client needs a way to get more information about any incomplete migration. The detailed implementation is not a concern at this point, but the user interface is key. Because the client will also need to cancel in-progress or pending slot migrations, it is desirable to be able to report per-slot migration results. For this reason, we could consider a command like the one below:
CLUSTER MIGRATE REPORT <SLOTS>, where <SLOTS> is optional. When <SLOTS> is not provided, this command reports all in-progress and pending-migration slots. The report is an array, with each element being a map containing one of the following two sets of fields:
a. On the source: slot_number, target_shard_id, state (started/pending/failed), num_retries, queued_time, start_time, update_time
b. On the target: slot_number, source_shard_id, start_time, update_time
3. CLUSTER MIGRATE CANCEL <SLOTS>, to cancel in-progress or pending slot migrations.
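Put together, a client session against the proposed interface might look like the sketch below; the command names are as proposed above, while the reply renderings, shard IDs, and timestamps are illustrative only:

```
# Queue slots 3-6, 7, and 10 for migration to shard f7c5d8... (fire-and-forget).
127.0.0.1:6379> CLUSTER MIGRATE QUEUE 3-6,7,10 f7c5d8...
OK

# Poll for progress; with no <SLOTS> argument this reports everything
# in-progress or pending.
127.0.0.1:6379> CLUSTER MIGRATE REPORT
1) 1) "slot_number"      3
   2) "target_shard_id"  "f7c5d8..."
   3) "state"            "started"
   4) "num_retries"      0
   5) "queued_time"      ...
   6) "start_time"       ...
   7) "update_time"      ...
2) ... (one map per remaining queued slot)

# Cancel whatever is still pending for slots 7 and 10.
127.0.0.1:6379> CLUSTER MIGRATE CANCEL 7,10
OK
```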
Note that this proposal allows the future atomic slot migration improvement to be introduced as a drop-in replacement of the existing migration scheme.