Commit 2508b75: Blog post: Introducing Atomic Slot Migration (#404)

content/authors/murphyjacob4.md

---
title: Jacob Murphy
extra:
  photo: "/assets/media/authors/murphyjacob4.png"
  github: murphyjacob4
---

Jacob Murphy is a software engineer at Google Cloud based out of Kirkland, WA.
Jacob is passionate about high performance code, database engines, and
reliability. When not coding, Jacob's hobbies include DIY projects on his house,
cooking, and traveling.
+++
title = "Resharding, Reimagined: Introducing Atomic Slot Migration"
date = 2025-10-29 00:00:00
description = "Valkey 9.0 brings big changes and improvements to the underlying mechanisms of resharding. Find out how Atomic Slot Migration works, the benefits it brings, and the headaches it eliminates."
authors = ["murphyjacob4"]
[extra]
featured = true
featured_image = "/assets/media/featured/random-05.webp"
+++

Managing the topology of a distributed database is one of the most critical and
challenging tasks for any operator. For a high-performance system like Valkey,
moving data slots between nodes, a process known as resharding, needs to be
fast, reliable, and easy.

Clustered Valkey has historically supported resharding through a process known
as slot migration, where one or more of the 16,384 slots is moved from one
shard to another. That process has long been a source of operational
headaches. To address this, Valkey 9.0 introduces a powerful new feature that
fundamentally improves it: **Atomic Slot Migration**.

Atomic Slot Migration brings many benefits that make resharding painless,
including:

- a simpler, one-shot command interface supporting multiple slot ranges in a
  single migration,
- built-in cancellation support and automatic rollback on failure,
- improved large key handling,
- up to 9x faster slot migrations,
- greatly reduced client impact during migrations.

Let's dive into how it works and what it means for you.

## Background: The Legacy Slot Migration Process

Prior to Valkey 9.0, a slot migration from a source node to a target node was
performed through the following steps:

1. Send `<target>`: `CLUSTER SETSLOT <slot> IMPORTING <source>`
2. Send `<source>`: `CLUSTER SETSLOT <slot> MIGRATING <target>`
3. Send `<source>`: `CLUSTER GETKEYSINSLOT <slot> <count>`
4. Send `<source>`: `MIGRATE ...` for each key in the result of step 3
5. Repeat steps 3 and 4 until no keys are left in the slot on `<source>`
6. Send `<target>`: `CLUSTER SETSLOT <slot> NODE <target>`
7. Send `<source>`: `CLUSTER SETSLOT <slot> NODE <target>`
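The loop above can be sketched in a few lines. This is an illustrative simulation against in-memory stand-ins for the two nodes (the `FakeNode` class and its methods are hypothetical); a real operator tool such as `valkey-cli` issues the equivalent commands over the network:

```python
# Simulated legacy per-key slot migration. Comments mark the real
# cluster commands each step corresponds to.

class FakeNode:
    def __init__(self, name):
        self.name = name
        self.slots = {}       # slot -> {key: value}
        self.importing = {}   # slot -> source node name
        self.migrating = {}   # slot -> target node name

    def getkeysinslot(self, slot, count):
        # CLUSTER GETKEYSINSLOT <slot> <count>
        return list(self.slots.get(slot, {}))[:count]

def legacy_migrate_slot(source, target, slot, batch_size=10):
    target.importing[slot] = source.name  # CLUSTER SETSLOT <slot> IMPORTING <source>
    source.migrating[slot] = target.name  # CLUSTER SETSLOT <slot> MIGRATING <target>
    while True:
        keys = source.getkeysinslot(slot, batch_size)
        if not keys:
            break
        for key in keys:                  # MIGRATE ... for each key in the batch
            target.slots.setdefault(slot, {})[key] = source.slots[slot].pop(key)
    # CLUSTER SETSLOT <slot> NODE <target> on both nodes
    del target.importing[slot]
    del source.migrating[slot]

src, dst = FakeNode("src"), FakeNode("dst")
src.slots[42] = {f"key-{i}": "v" for i in range(25)}
legacy_migrate_slot(src, dst, 42)
print(len(dst.slots[42]), len(src.slots[42]))  # → 25 0
```

Even in this toy form, the per-batch round trips that dominate the real command flow are visible in the `while` loop.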

This approach suffered from the following problems:

- **Higher latency for client operations**: All reads and writes to keys in
  the migrating hash slot were subject to redirection through a special `-ASK`
  error response, which required re-executing the command on the target node.
  Redirections meant unexpected latency spikes during migrations.
- **Multi-key operation unavailability**: Commands like `MGET` and `MSET` that
  supply multiple keys could not always be served by a single node while slots
  were migrating. When this happened, clients received error responses and
  were expected to retry.
- **Problems with large keys/collections**: Since the migration was performed
  one key at a time, a large key (e.g. a collection with many elements) had to
  be sent as a single command. Serializing it required a large contiguous
  memory chunk on the source node, and importing the payload required a
  similarly large memory chunk and a large CPU burst on the target node. In
  some cases the memory consumption was enough to trigger out-of-memory
  conditions on either side, or the CPU burst was large enough to cause a
  failover on the target shard because health probes went unanswered.
- **Slot migration latency**: The overall latency of a slot migration was
  bounded by how quickly the operator could send the `CLUSTER GETKEYSINSLOT`
  and `MIGRATE` commands. Each batch of keys cost a full round trip between
  the operator's machine and the cluster, time that could otherwise have been
  spent migrating data.
- **Lack of resilience to failure**: If a failure condition was encountered
  (for example, the hash slot did not fit on the target node), undoing the
  slot migration was not well supported and required replaying the steps above
  in reverse. In some cases the hash slot had grown while the migration was
  underway and no longer fit on either the source or the target node.

## The Core Idea: Migration via Replication

At its heart, the new atomic slot migration process closely resembles the
concepts of replication and failover that serve as the backbone of high
availability in Valkey. When an atomic slot migration is requested, data is
asynchronously sent from the old owner (the source node) to the new owner (the
target node). Once all data is transferred and the target is completely caught
up, ownership is atomically transferred to the target node.

This breaks down logically into three phases:

1. **Snapshot Phase**: The source node first sends a point-in-time snapshot of
   all the data in the migrating slots to the target node. The snapshot is
   taken asynchronously by a child process, allowing the parent process to
   continue serving requests. The snapshot is formatted as a stream of
   commands that the target node and its replicas can consume verbatim.
2. **Streaming Phase**: While the snapshot is ongoing, the source node keeps
   track of all new mutations made to the migrating slots. Once the snapshot
   completes, the source node streams those incremental changes for the slots
   to the target.
3. **Finalization Phase**: Once the stream of changes has been sent to the
   target, the source node briefly pauses mutations. Only once the target has
   fully processed these changes does it take ownership of the migrating slots
   and broadcast that ownership to the cluster. When the source node discovers
   this, it knows it can delete the contents of those slots and redirect any
   paused clients to the new owner **atomically**.

![Diagram showing how atomic slot migration is broken into the three phases](atomic-slot-migration-phases.png)
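A toy simulation can make the three phases concrete. Everything here is an illustrative stand-in (the `Source` class, `migrate` function, and staging dict are hypothetical, not the engine's internals):

```python
# Toy model of snapshot -> streaming -> finalization for one migration.

class Source:
    def __init__(self, data):
        self.data = data      # key -> value for the migrating slots
        self.backlog = []     # mutations arriving during the snapshot
        self.paused = False

    def write(self, key, value):
        assert not self.paused, "writes are briefly paused at finalization"
        self.data[key] = value
        self.backlog.append((key, value))  # tracked for the streaming phase

def migrate(source, target_staging):
    # 1. Snapshot phase: point-in-time copy (a child process in Valkey)
    snapshot = dict(source.data)
    source.write("k2", "new")             # concurrent mutation during snapshot
    target_staging.update(snapshot)
    # 2. Streaming phase: replay mutations buffered since the snapshot
    for key, value in source.backlog:
        target_staging[key] = value
    # 3. Finalization: pause writes, hand over ownership, drop source copy
    source.paused = True
    source.data.clear()
    return target_staging

src = Source({"k1": "old", "k2": "old"})
staged = migrate(src, {})
print(staged)  # → {'k1': 'old', 'k2': 'new'}
```

Note how the mutation made mid-snapshot still reaches the target via the backlog: that is what lets the final pause stay brief.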

## Why is this better?

By replicating the slot contents and atomically transferring ownership, Atomic
Slot Migration provides many desirable properties over the previous mechanism:

- **Clients are unaware**: Since the entire hash slot is replicated before any
  cleanup happens on the source node, clients are completely unaware of the
  slot migration and no longer need to follow `ASK` redirections or retry
  errors for multi-key operations.
- **Keys no longer need to be moved atomically**: Collections are moved as
  chunks of elements that are replayed as commands, avoiding the reliability
  problems previously encountered when dumping and restoring a large
  collection in one piece.
- **A migration can easily be rolled back on cancellation or failure**: Since
  Valkey places the migrating hash slots in a staging area, they can easily be
  wiped independently of the rest of the database. Since this staging state is
  not broadcast to the cluster, ending the migration is a straightforward
  matter of cleaning up the staging area and marking the migration as
  cancelled. Many failures, like out-of-memory conditions, failovers, or
  network partitions, can be handled entirely by the engine.
- **Greatly lowered slot migration latency**: Valkey is highly optimized for
  replication. By batching the slot migrations and using this replication-like
  process, end-to-end migration latency drops by as much as 9x compared to
  legacy slot migration through `valkey-cli`.

## How to use Atomic Slot Migration

A new family of `CLUSTER` commands gives you full control over the migration
lifecycle.

- [`CLUSTER MIGRATESLOTS SLOTSRANGE start-slot end-slot NODE node-id`](/commands/cluster-migrateslots/)
  - This command kicks off the migration. You can specify one or more slot
    ranges and the target node ID to begin pushing data. Multiple migrations
    can be queued in one command by repeating the `SLOTSRANGE` and `NODE`
    arguments.
- [`CLUSTER CANCELSLOTMIGRATIONS`](/commands/cluster-cancelslotmigrations/)
  - Use this command to safely cancel all ongoing slot migrations originating
    from the node.
- [`CLUSTER GETSLOTMIGRATIONS`](/commands/cluster-getslotmigrations/)
  - This gives you an observable log of recent and active migrations, allowing
    you to monitor the status, duration, and outcome of each job. Slot
    migration jobs are stored in memory, allowing for simple programmatic
    access and error handling.

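For programmatic use, queueing several migrations in one call is a matter of repeating the argument pairs. A small hypothetical helper (not part of any client library; the node IDs below are placeholders) that builds the argument vector following the syntax above:

```python
# Build a single CLUSTER MIGRATESLOTS argument vector queueing several
# migrations, per the repeated SLOTSRANGE/NODE syntax documented above.

def migrateslots_args(migrations):
    """migrations: iterable of (start_slot, end_slot, target_node_id)."""
    args = ["CLUSTER", "MIGRATESLOTS"]
    for start, end, node_id in migrations:
        args += ["SLOTSRANGE", str(start), str(end), "NODE", node_id]
    return args

print(" ".join(migrateslots_args([(0, 999, "abc123"), (5000, 5461, "def456")])))
# → CLUSTER MIGRATESLOTS SLOTSRANGE 0 999 NODE abc123 SLOTSRANGE 5000 5461 NODE def456
```

The resulting list can be passed to a cluster client's generic command-execution method.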

## Legacy vs. Atomic: Head-to-Head Results

Head-to-head experiments show the improvement provided by atomic slot
migration.

### Test Setup

To make things reproducible, the test setup is outlined below:

- Valkey cluster nodes are `c4-standard-8` GCE VMs spread across GCP's
  us-central1 region running Valkey 9.0.0
- The client machine is a separate `c4-standard-8` GCE VM in us-central1-f
- Rebalancing is done with the `valkey-cli --cluster rebalance` command with
  all parameters defaulted. The only exception is during scale-in, where
  `--cluster-weight` is used to set the weights to allocate to only 3 shards.
- The cluster is filled with 40 GB of data consisting of keys with 16 KB
  string values

### Slot Migration Latency: Who's Faster?

The experiment had two tests: one with no load and one with heavy read/write
load. The heavy load is simulated using
[memtier_benchmark](https://github.com/RedisLabs/memtier_benchmark) with a
1:10 set/get ratio on the client machine specified above.

![Chart showing time to scale in and out when under load, from the table below](legacy-vs-atomic-latency.png)

<table>
<tr>
<th>Test Case</th>
<th>Legacy Slot Migration</th>
<th>Atomic Slot Migration</th>
<th>Speedup</th>
</tr>
<tr>
<td>No Load: 3 to 4 shards</td>
<td>1m42.089s</td>
<td>0m10.723s</td>
<td style="background-color:#228b22;">9.52x</td>
</tr>
<tr>
<td>No Load: 4 to 3 shards</td>
<td>1m20.270s</td>
<td>0m9.507s</td>
<td style="background-color:#36a336;">8.44x</td>
</tr>
<tr>
<td>Heavy Load: 3 to 4 shards</td>
<td>2m27.276s</td>
<td>0m30.995s</td>
<td style="background-color:#7fd37f;">4.75x</td>
</tr>
<tr>
<td>Heavy Load: 4 to 3 shards</td>
<td>2m5.328s</td>
<td>0m27.105s</td>
<td style="background-color:#86d686;">4.62x</td>
</tr>
</table>

The main culprit here is unnecessary network round trips (RTTs) in legacy slot
migration. Each slot requires:

- 2 RTTs to call `SETSLOT` and begin the migration
- Each batch of keys in the slot requires:
  - 1 RTT for `CLUSTER GETKEYSINSLOT`
  - 1 RTT for `MIGRATE`
  - 1 RTT for the actual migration of the key batch from source to target node
- 2 RTTs to call `SETSLOT` and end the migration

Those round trips add up. For this test case, where we have:

- 4096 slots to move
- 160 keys per slot
- `valkey-cli`'s default batch size of 10

we need:

![Picture of the following formula rendered in LaTeX: RTTs = SlotsCount * (3 * KeysPerSlot/KeysPerBatch + 4) = 4096 * (3 * 160/10 + 4) = 212992](legacy-rtt-formula.png)

Even with a 300 microsecond round trip time, legacy slot migration spends over
a minute just waiting for those 212,992 round trips.
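The arithmetic is easy to check:

```python
# Round-trip arithmetic for the legacy scheme, using the test parameters above.

slots = 4096            # slots to move in this rebalance
keys_per_slot = 160
keys_per_batch = 10     # valkey-cli's default batch size
rtt_seconds = 300e-6    # 300 microsecond round trip

# 2 RTTs of SETSLOT at each end of the migration, plus 3 RTTs per key batch
rtts = slots * (3 * keys_per_slot // keys_per_batch + 4)
print(rtts, round(rtts * rtt_seconds, 1))  # → 212992 63.9
```

At 300 microseconds per round trip, that is roughly 64 seconds of pure waiting, before any data transfer.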

By removing this overhead, atomic slot migration is bounded only by the speed
at which one node can push data to another, achieving much faster end-to-end
latency.

### Client Impact: How Would Applications Respond?

The experiment measured the throughput of a simulated
[valkey-py](https://github.com/valkey-io/valkey-py) workload with a 1:10
set/get ratio during each scaling event. Throughput averages over three trials
are shown below.

![Chart showing queries per second over time. Atomic slot migration quickly dips and recovers, while legacy slot migration incurs a sustained dip](legacy-vs-atomic-client-impact.png)

Although atomic slot migration causes a more acute throughput dip, the client
application recovers much faster thanks to far fewer topology changes and a
lower overall end-to-end latency. Each topology change must be handled by the
Valkey client, so the quicker the topology changes complete, the sooner the
impact ends. By collapsing the topology changes and performing an atomic
handover, atomic slot migration causes less overall client impact than legacy
slot migration.

## Under the Hood: State Machines and Control Commands

To coordinate the complex dance between the two nodes, a new internal command,
[`CLUSTER SYNCSLOTS`](/commands/cluster-syncslots/), is introduced. It
orchestrates the state machine with the following sub-commands:

- `CLUSTER SYNCSLOTS ESTABLISH SOURCE <source-node-id> NAME <unique-migration-name> SLOTSRANGE <start> <end> ...`
  - Informs the target node of an in-progress slot migration and begins
    tracking the current connection as a slot migration link.
- `CLUSTER SYNCSLOTS SNAPSHOT-EOF`
  - Used as a marker to inform the target that the full snapshot of the hash
    slot contents has been sent.
- `CLUSTER SYNCSLOTS REQUEST-PAUSE`
  - Informs the source node that the target has received the full snapshot and
    is ready to proceed.
- `CLUSTER SYNCSLOTS PAUSED`
  - Used as a marker to inform the target that no more mutations will occur,
    as the source has paused writes.
- `CLUSTER SYNCSLOTS REQUEST-FAILOVER`
  - Informs the source node that the target is fully caught up and ready to
    take over the hash slots.
- `CLUSTER SYNCSLOTS FAILOVER-GRANTED`
  - Informs the target node that the source node is still paused and takeover
    can be safely performed.
- `CLUSTER SYNCSLOTS FINISH`
  - Informs the replica of the target node that a migration has completed (or
    failed).
- `CLUSTER SYNCSLOTS CAPA`
  - Reserved command allowing capability negotiation.

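Put in order, the sub-commands alternate between the two nodes. The sketch below (illustrative only, not engine code) records who sends each sub-command on the happy path:

```python
# Happy-path CLUSTER SYNCSLOTS exchange, as (sender, sub-command) pairs
# in the order described above.

HANDSHAKE = [
    ("source", "ESTABLISH"),         # source registers the migration link with the target
    ("source", "SNAPSHOT-EOF"),      # snapshot of the slot contents fully sent
    ("target", "REQUEST-PAUSE"),     # target has consumed the snapshot
    ("source", "PAUSED"),            # source has paused mutations
    ("target", "REQUEST-FAILOVER"),  # target is fully caught up
    ("source", "FAILOVER-GRANTED"),  # source still paused; takeover is safe
    ("target", "FINISH"),            # target tells its replica the migration is done
]

def next_subcommand(current):
    """Return the sub-command expected after the given one, if any."""
    names = [cmd for _, cmd in HANDSHAKE]
    i = names.index(current)
    return names[i + 1] if i + 1 < len(names) else None

print(next_subcommand("PAUSED"))  # → REQUEST-FAILOVER
```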
The diagram below shows how `CLUSTER SYNCSLOTS` is used internally to drive a
slot migration from start to finish:

![Diagram showing how Valkey uses CLUSTER SYNCSLOTS to drive atomic slot migration](atomic-slot-migration-syncslots.png)

## Get Started Today!

Atomic Slot Migration is a massive step forward for Valkey cluster management.
It provides a faster, more reliable, and altogether easier mechanism for
resharding your data.

So, go [download Valkey 9.0](/download/) and try Atomic Slot Migration for
yourself! A huge thank you to everyone in the community who contributed to the
design and implementation.