+++
title= "Resharding, Reimagined: Introducing Atomic Slot Migration"
date= 2025-10-29 00:00:00
description= "Valkey 9.0 brings big changes and improvements to the underlying mechanisms of resharding. Find out how Atomic Slot Migration works, the benefits it brings, and the headaches it eliminates."
authors= ["murphyjacob4"]
[extra]
featured = true
featured_image = "/assets/media/featured/random-05.webp"
+++

Managing the topology of a distributed database is one of the most critical and
challenging tasks for any operator. For a high-performance system like Valkey,
moving data slots between nodes, a process known as resharding, needs to be
fast, reliable, and easy.

Clustered Valkey has historically supported resharding through a process known
as slot migration, where one or more of the 16,384 slots are moved from one
shard to another. That process has long been a source of operational headaches.
To address this, Valkey 9.0 introduced a powerful new feature that
fundamentally improves it: **Atomic Slot Migration**.

Atomic Slot Migration brings many benefits that make resharding painless,
including:

- a simpler, one-shot command interface supporting multiple slot ranges in a
  single migration,
- built-in cancellation support and automated rollback on failure,
- improved large key handling,
- up to 9x faster slot migrations,
- greatly reduced client impact during migrations.

Let's dive into how it works and what it means for you.

## Background: The Legacy Slot Migration Process

Prior to Valkey 9.0, a slot migration from a source node to a target node was
performed through the following steps:

1. Send `<target>`: `CLUSTER SETSLOT <slot> IMPORTING <source>`
2. Send `<source>`: `CLUSTER SETSLOT <slot> MIGRATING <target>`
3. Send `<source>`: `CLUSTER GETKEYSINSLOT <slot> <count>`
4. Send `<source>`: `MIGRATE ...` for each key in the result of step 3
5. Repeat 3 & 4 until no keys are left in the slot on `<source>`
6. Send `<target>`: `CLUSTER SETSLOT <slot> NODE <target>`
7. Send `<source>`: `CLUSTER SETSLOT <slot> NODE <target>`

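In practice these steps were driven by tooling rather than typed by hand. As a
rough sketch (not the actual tooling), migrating a single slot from a shell
might look like this, where the hosts, ports, and node IDs are placeholders:

```bash
# Sketch only: legacy migration of slot 1234, driven from an operator machine.
# SRC_HOST, DST_HOST, SRC_ID, and DST_ID are placeholders for your cluster.
valkey-cli -h "$DST_HOST" -p 6379 CLUSTER SETSLOT 1234 IMPORTING "$SRC_ID"
valkey-cli -h "$SRC_HOST" -p 6379 CLUSTER SETSLOT 1234 MIGRATING "$DST_ID"

# Move keys in batches of 10 until the slot is empty on the source.
# (Assumes key names contain no whitespace, so word splitting is safe.)
while keys=$(valkey-cli -h "$SRC_HOST" -p 6379 CLUSTER GETKEYSINSLOT 1234 10); [ -n "$keys" ]; do
  valkey-cli -h "$SRC_HOST" -p 6379 MIGRATE "$DST_HOST" 6379 "" 0 5000 KEYS $keys
done

# Hand ownership of the slot to the target, then tell the source about it
valkey-cli -h "$DST_HOST" -p 6379 CLUSTER SETSLOT 1234 NODE "$DST_ID"
valkey-cli -h "$SRC_HOST" -p 6379 CLUSTER SETSLOT 1234 NODE "$DST_ID"
```
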
This was subject to the following problems:

- **Higher latency for client operations**: All client reads and writes to keys
  in the migrating hash slot were subject to redirection through a special
  `-ASK` error response, which required re-execution of the command on the
  target node (see the example after this list). Redirected responses meant
  unexpected latency spikes during migrations.
- **Multi-key operation unavailability**: Commands like `MGET` and `MSET` which
  supply multiple keys could not always be served by a single node while slots
  were migrating. When this happened, clients would receive error responses and
  were expected to retry.
- **Problems with large keys/collections**: Since the migration was performed
  one key at a time, large keys (e.g. collections with many elements) had to be
  sent as a single command. Serializing a large key required a large contiguous
  memory chunk on the source node, and importing that payload required a
  similarly large memory chunk and a large CPU burst on the target node. In
  some cases, the memory consumption was enough to trigger out-of-memory
  conditions on either side, or the CPU burst was large enough to cause a
  failover on the target shard because health probes were not being served.
- **Slot migration latency**: The overall latency of the slot migration was
  bounded by how quickly the operator could send the `CLUSTER GETKEYSINSLOT`
  and `MIGRATE` commands. Each batch of keys required a full round trip between
  the operator's machine and the cluster, leaving a lot of time spent waiting
  that could have been spent moving data.
- **Lack of resilience to failure**: If a failure condition was encountered
  (for example, the hash slot would not fit on the target node), undoing the
  slot migration was not well supported and required replaying the listed steps
  in reverse. In some cases, the hash slot may have grown while the migration
  was underway and would then fit on neither the source nor the target node.

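For illustration, here is roughly the exchange a client went through when it
asked the source node for a key that had already moved (the slot number, key,
and address are made up):

```
GET user:1001
(error) ASK 1234 10.0.0.2:6379   # redirect: retry this command on 10.0.0.2:6379

ASKING                           # sent to 10.0.0.2:6379 before retrying
OK
GET user:1001
"hello"
```
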
## The Core Idea: Migration via Replication

At its heart, the new atomic slot migration process closely resembles the
replication and failover mechanisms that serve as the backbone for high
availability within Valkey. When an atomic slot migration is requested, data is
asynchronously sent from the old owner (the source node) to the new owner (the
target node). Once all data is transferred and the target is completely caught
up, ownership is atomically transferred to the target node.

Logically, this breaks down into three phases:

1. **Snapshot Phase**: The source node first sends a point-in-time snapshot of
   all the data in the migrating slots to the target node. The snapshot is
   taken asynchronously through a child process, allowing the parent process to
   continue serving requests. The snapshot is formatted as a stream of commands
   which the target node and its replica can consume verbatim (see the
   illustration after this list).
2. **Streaming Phase**: While the snapshot is ongoing, the source node keeps
   track of all new mutations made to the migrating slots. Once the snapshot
   completes, the source node streams those incremental changes for the
   migrating slots to the target.
3. **Finalization Phase**: Once the stream of changes has been sent to the
   target, the source node briefly pauses mutations. Only once the target has
   fully processed these changes does it take ownership of the migrating slots
   and broadcast that ownership to the cluster. When the source node discovers
   this, it knows it can delete the contents of those slots and redirect any
   paused clients to the new owner **atomically**.
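
The exact wire format is an internal detail, but conceptually a large
collection in the snapshot arrives as many ordinary write commands covering a
chunk of elements each, rather than as one huge serialized blob. Purely as an
invented illustration:

```
# Conceptual illustration only -- not the literal replication stream.
RPUSH orders:{4d2} item0001 item0002 item0003 item0004
RPUSH orders:{4d2} item0005 item0006 item0007 item0008
# ...and so on, until the whole list has been transferred
```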


## Why is this better?

By replicating the slot contents and atomically transferring ownership, Atomic
Slot Migration provides many desirable properties over the previous mechanism:

- **Clients are unaware**: Since the entire hash slot is replicated before any
  cleanup is done on the source node, clients are completely unaware of the
  slot migration, and no longer need to follow `ASK` redirections or retry
  failed multi-key operations.
- **Keys no longer need to be atomically moved**: Collections are moved as
  chunks of elements that are replayed as commands, preventing the reliability
  problems previously encountered when dumping and restoring a large collection.
- **A migration can easily be rolled back on cancellation or failure**: Since
  Valkey places the migrating hash slots in a staging area, they can easily be
  wiped independently of the rest of the database. Since this state is not
  broadcast to the cluster, ending the migration is a straightforward process
  of cleaning up the staging area and marking the migration as cancelled. Many
  failures, like out-of-memory, failover, or network partition, can be handled
  completely by the engine.
- **Greatly lowered slot migration latency**: Valkey is highly optimized for
  replication. By batching the slot migrations and using this replication-like
  process, end-to-end migration latency can drop by as much as 9x compared to
  legacy slot migration through `valkey-cli`.

## How to use Atomic Slot Migration

A new family of `CLUSTER` commands gives you full control over the migration
lifecycle.

- [`CLUSTER MIGRATESLOTS SLOTSRANGE start-slot end-slot NODE node-id`](/commands/cluster-migrateslots/)
  - This command kicks off the migration. You can specify one or more slot
    ranges and the target node ID to begin pushing data. Multiple migrations can
    be queued in one command by repeating the `SLOTSRANGE` and `NODE` arguments.
- [`CLUSTER CANCELSLOTMIGRATIONS`](/commands/cluster-cancelslotmigrations/)
  - Use this command to safely cancel all ongoing slot migrations originating
    from the node.
- [`CLUSTER GETSLOTMIGRATIONS`](/commands/cluster-getslotmigrations/)
  - This gives you an observable log of recent and active migrations, allowing
    you to monitor the status, duration, and outcome of each job. Slot migration
    jobs are stored in memory, allowing for simple programmatic access and error
    handling.
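
For example, kicking off, monitoring, and (if necessary) cancelling migrations
from the source node could look like this (the node IDs and slot ranges below
are placeholders):

```bash
# Push slots 0-1364 to one target shard and 5461-6826 to another, in one command
valkey-cli CLUSTER MIGRATESLOTS SLOTSRANGE 0 1364 NODE <target-node-id-1> \
                                SLOTSRANGE 5461 6826 NODE <target-node-id-2>

# Check on recent and in-flight migrations
valkey-cli CLUSTER GETSLOTMIGRATIONS

# Abort any migrations still in progress on this node
valkey-cli CLUSTER CANCELSLOTMIGRATIONS
```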

## Legacy vs. Atomic: Head-to-Head Results

Head-to-head experiments show the improvement provided by atomic slot migration.

### Test Setup

To make things reproducible, the test setup is outlined below:

- Valkey cluster nodes are `c4-standard-8` GCE VMs spread across GCP’s
  us-central1 region running Valkey 9.0.0
- Client machine is a separate `c4-standard-8` GCE VM in us-central1-f
- Rebalancing is accomplished with the `valkey-cli --cluster rebalance` command,
  with all parameters defaulted. The only exception is during scale in, where
  `--cluster-weight` is used so that slots are only allocated to 3 shards (see
  the example after this list).
- The cluster is filled with 40 GB of data consisting of keys with 16 KB string
  values
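
For reference, the rebalance invocations look roughly like the following (any
reachable cluster node address works; the node ID is a placeholder, and a zero
weight is one way to drain the shard being removed):

```bash
# Scale out / general rebalance, all parameters defaulted
valkey-cli --cluster rebalance <node-host>:6379

# Scale in: weight the leaving shard at 0 so its slots move to the other 3 shards
valkey-cli --cluster rebalance <node-host>:6379 --cluster-weight <leaving-node-id>=0
```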

### Slot Migration Latency: Who's Faster?

The experiment had two tests: one with no load and one with heavy read/write
load. The heavy load is simulated using
[memtier-benchmark](https://github.com/RedisLabs/memtier_benchmark) with a 1:10
set/get ratio running on the client machine specified above.
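
A memtier invocation along these lines generates that kind of load (the thread,
connection, key count, and duration settings here are illustrative rather than
the exact values used in the test):

```bash
memtier_benchmark -s <cluster-node-host> -p 6379 --cluster-mode \
  --ratio=1:10 --data-size=16384 --key-maximum=2500000 \
  --threads=8 --clients=16 --test-time=600
```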


<table>
  <tr>
    <th>Test Case</th>
    <th>Legacy Slot Migration</th>
    <th>Atomic Slot Migration</th>
    <th>Speedup</th>
  </tr>
  <tr>
    <td>No Load: 3 to 4 shards</td>
    <td>1m42.089s</td>
    <td>0m10.723s</td>
    <td style="background-color:#228b22;">9.52x</td>
  </tr>
  <tr>
    <td>No Load: 4 to 3 shards</td>
    <td>1m20.270s</td>
    <td>0m9.507s</td>
    <td style="background-color:#36a336;">8.44x</td>
  </tr>
  <tr>
    <td>Heavy Load: 3 to 4 shards</td>
    <td>2m27.276s</td>
    <td>0m30.995s</td>
    <td style="background-color:#7fd37f;">4.75x</td>
  </tr>
  <tr>
    <td>Heavy Load: 4 to 3 shards</td>
    <td>2m5.328s</td>
    <td>0m27.105s</td>
    <td style="background-color:#86d686;">4.62x</td>
  </tr>
</table>

The main culprit here is unnecessary network round trips (RTTs) in legacy slot
migration. Each slot requires:

- 2 RTTs to call `SETSLOT` and begin the migration
- Each batch of keys in a slot requires:
  - 1 RTT for `CLUSTER GETKEYSINSLOT`
  - 1 RTT for `MIGRATE`
  - 1 RTT for the actual transfer of the key batch from source to target node
- 2 RTTs to call `SETSLOT` and end the migration

Those round trip times add up. For this test case we have:

- 4096 slots to move
- 160 keys per slot
- `valkey-cli`'s default batch size of 10

We need:

- 160 / 10 = 16 batches per slot, at 3 RTTs each, plus the 4 `SETSLOT` RTTs,
  for 52 RTTs per slot
- 52 RTTs x 4096 slots = 212,992 round trips in total

Even with a 300 microsecond round trip time, legacy slot migration spends over a
minute just waiting for those 212,992 round trips.

By removing this overhead, atomic slot migration is bounded only by the speed
at which one node can push data to another, achieving much faster end-to-end
latency.

### Client Impact: How Would Applications Respond?

The experiment measured the throughput of a simulated
[valkey-py](https://github.com/valkey-io/valkey-py) workload with a 1:10
set/get ratio while each scaling event was underway. Throughput averages over
three trials are shown below.


Although atomic slot migration causes a sharper momentary dip in throughput, the
client application recovers much faster because there are far fewer topology
changes and the end-to-end migration finishes sooner. Each topology change must
be handled by the Valkey client, so the quicker the topology changes are made,
the sooner the impact ends. By collapsing the topology changes and performing an
atomic handover, atomic slot migration causes less client impact overall than
legacy slot migration.
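
Concretely, after the atomic handover a client that still sends a request to
the old owner receives a single standard `-MOVED` redirect, refreshes its cached
slot map, and carries on (slot number and address invented for illustration):

```
GET user:1001
(error) MOVED 1234 10.0.0.3:6379   # slot 1234 now lives on the new shard; the
                                   # client updates its slot map and retries there
```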

## Under the Hood: State Machines and Control Commands

To coordinate the complex dance between the two nodes, a new internal command,
[`CLUSTER SYNCSLOTS`](/commands/cluster-syncslots/), is introduced. This command
orchestrates the state machine with the following sub-commands:

- `CLUSTER SYNCSLOTS ESTABLISH SOURCE <source-node-id> NAME <unique-migration-name> SLOTSRANGE <start> <end> ...`
  - Informs the target node of an in-progress slot migration and begins tracking
    the current connection as a slot migration link.
- `CLUSTER SYNCSLOTS SNAPSHOT-EOF`
  - Used as a marker to inform the target that the full snapshot of the hash
    slot contents has been sent.
- `CLUSTER SYNCSLOTS REQUEST-PAUSE`
  - Informs the source node that the target has received all of the snapshot and
    is ready to proceed.
- `CLUSTER SYNCSLOTS PAUSED`
  - Used as a marker to inform the target that no more mutations will arrive,
    as the source has paused mutations.
- `CLUSTER SYNCSLOTS REQUEST-FAILOVER`
  - Informs the source node that the target is fully caught up and ready to take
    over the hash slots.
- `CLUSTER SYNCSLOTS FAILOVER-GRANTED`
  - Informs the target node that the source node is still paused and takeover
    can be safely performed.
- `CLUSTER SYNCSLOTS FINISH`
  - Informs the replica of the target node that a migration has completed (or
    failed).
- `CLUSTER SYNCSLOTS CAPA`
  - Reserved sub-command allowing capability negotiation.
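
Putting the sub-commands together, a simplified sketch of the exchange over the
slot migration link looks like the following (capability negotiation and
replica propagation are omitted, and the exact interleaving of the incremental
stream is elided):

```
source -> target: CLUSTER SYNCSLOTS ESTABLISH SOURCE <source-node-id> NAME <name> SLOTSRANGE <start> <end>
source -> target: ...snapshot of the migrating slots, sent as a stream of commands...
source -> target: CLUSTER SYNCSLOTS SNAPSHOT-EOF
target -> source: CLUSTER SYNCSLOTS REQUEST-PAUSE
source -> target: ...remaining incremental changes; source pauses mutations...
source -> target: CLUSTER SYNCSLOTS PAUSED
target -> source: CLUSTER SYNCSLOTS REQUEST-FAILOVER
source -> target: CLUSTER SYNCSLOTS FAILOVER-GRANTED
# target takes ownership and broadcasts the new topology; the source then drops
# the migrated slots and redirects its paused clients to the new owner
```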

The diagram below shows how `CLUSTER SYNCSLOTS` is used internally to drive a
slot migration from start to finish:


## Get Started Today!

Atomic Slot Migration is a massive step forward for Valkey cluster management.
It provides a faster, more reliable, and overall easier mechanism for
resharding your data.

So, go [download Valkey 9.0](/download/) and try Atomic Slot Migration for
yourself! A huge thank you to everyone in the community who contributed to the
design and implementation.