
Voldemort Rebalancing


The Voldemort Rebalancing project aims to provide the capability to add or delete nodes and move data around in a running Voldemort cluster, without downtime and with minimal impact on online cluster performance. A few example scenarios where it can be used are:

  1. Dynamic addition of new nodes to the cluster.
  2. Deletion of nodes from the cluster.
  3. Load balancing of data inside a cluster.

Requirements

These are the requirements for the Voldemort rebalancing project.

  1. No downtime.
  2. No functional impact on clients.
  3. Maintain data consistency guarantees while rebalancing.
  4. Minimal and tunable performance impact on the cluster.
  5. Push-button user interface ( TODO ).

Assumptions

  1. Only one rebalancing process should be running on the cluster at any time.
  2. The partition-to-key mapping is not changed during rebalancing; instead, partition ownership (the node-to-partition mapping) is changed and entire partitions are migrated across nodes.
  3. The user needs to create the new target cluster metadata ( a tool is provided to help with this ).
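
For reference, the target cluster metadata is the usual cluster.xml with the new nodes added. Below is an illustrative example (host names, ports, and partition numbers are placeholders) where nodes 0 and 1 already exist and node 2 is being added with an empty partition list; the rebalance tool's generate step described later can then assign partitions to it.

```xml
<!-- Illustrative target cluster.xml: nodes 0 and 1 already exist, node 2 is the
     new node being added and starts with no partitions. Hosts, ports and
     partition numbers are placeholders. -->
<cluster>
  <name>example-cluster</name>
  <server>
    <id>0</id>
    <host>host0.example.com</host>
    <http-port>8081</http-port>
    <socket-port>6666</socket-port>
    <admin-port>6667</admin-port>
    <partitions>0, 2, 4, 6</partitions>
  </server>
  <server>
    <id>1</id>
    <host>host1.example.com</host>
    <http-port>8081</http-port>
    <socket-port>6666</socket-port>
    <admin-port>6667</admin-port>
    <partitions>1, 3, 5, 7</partitions>
  </server>
  <server>
    <id>2</id>
    <host>host2.example.com</host>
    <http-port>8081</http-port>
    <socket-port>6666</socket-port>
    <admin-port>6667</admin-port>
    <partitions></partitions>
  </server>
</cluster>
```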

Rebalancing Terminology

The rebalancing process has three participants:

  1. Rebalance Controller : Initiates rebalancing on the cluster by providing the target cluster metadata. It can be started on any machine ( preferably not one of the Voldemort nodes ) that can make socket calls to all nodes in the cluster.
  2. Stealer Nodes : The nodes which receive new partitions. These include
    • The new nodes being added
    • When doing data load balancing, some nodes from the cluster itself will act as stealer nodes
    • Other nodes which may need to steal data due to secondary ( replica ) movements
  3. Donor Nodes : The nodes from which the data is copied.

What are the actual steps performed during rebalancing?

The following are the steps we go through to rebalance successfully. The controller initiates the rebalancing and then waits for its completion.

  1. Input
    • Either (a) the current cluster.xml, current stores.xml, and target cluster.xml, or (b) a bootstrap url and the target cluster.xml
    • batch size : Number of primary partitions to move together. There is a trade-off: moving more primary partitions together means more redirections.
    • max parallel rebalancing : Number of units ( stealer + donor node tuples ) to move together
  2. Get the latest state from the cluster
    Compare the provided targetCluster.xml with the current cluster metadata and add all new nodes to currentCluster.xml
  3. Verify that
    • We are not already in a rebalancing state ( "./bin/voldemort-admin-tool.sh --get-metadata server.state --url [url]" returns NORMAL_SERVER for all nodes and "./bin/voldemort-admin-tool.sh --get-metadata rebalancing.steal.info.key --url [url]" returns "[]" for all nodes, i.e. no rebalancing state )
    • RO stores ( if they exist ) are all using the latest storage format ( "./bin/voldemort-admin-tool.sh --ro-metadata storage-format --url [url]" returns the "ro2" format for all stores )
  4. Get a list of every primary partition to be moved
  5. For every "batch" of primary partitions to move
    a. Create transition cluster metadata which contains the movement of "batch size" primary partitions
    b. Create a rebalancing plan based on this transition cluster.xml and the current state of the cluster. The plan generated is a map of stealer node to a list of donor node + partitions to be moved ( see the sketch after this list )
    c. State change step
      1. hasROStore AND hasRWStore => Change the rebalancing state [ with RO stores information ] on all stealer nodes
      2. hasROStore AND !hasRWStore => Change the rebalancing state on all stealer nodes
    d. Start multiple units of migration [ unit => stealer + donor node movement ] of all RO stores data [ no changes to the routing layer or clients ]. At the end of each migration delete the rebalancing state => done with parallelism ( max parallel rebalancing )
    e. State change step [ changes in the rebalancing state enable the redirecting stores, which start redirecting requests to the old ( donor ) nodes ]
      1. hasROStore AND hasRWStore => Change the cluster metadata + swap on all nodes AND change the rebalancing state [ with RW stores information ] on all stealer nodes
      2. hasROStore AND !hasRWStore => Change the cluster metadata + swap on all nodes
      3. !hasROStore AND hasRWStore => Change the cluster metadata on all nodes AND change the rebalancing state [ with RW stores information ] on all stealer nodes
    f. Start multiple units of migration of all RW stores data [ with redirection happening ]. At the end of each migration delete the rebalancing state => done with parallelism ( max parallel rebalancing )
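
To make the plan structure in step ( b ) concrete, here is a minimal, purely illustrative Java sketch of a plan as a map from stealer node id to its list of donor-node/partition moves. The class and field names are stand-ins, not the actual Voldemort classes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RebalancePlanSketch {

    // One unit of migration from the stealer's point of view: which donor to
    // copy from and which partitions to steal from it.
    static class PartitionMove {
        final int donorNodeId;
        final List<Integer> partitionIds;

        PartitionMove(int donorNodeId, List<Integer> partitionIds) {
            this.donorNodeId = donorNodeId;
            this.partitionIds = partitionIds;
        }

        @Override
        public String toString() {
            return "steal partitions " + partitionIds + " from node " + donorNodeId;
        }
    }

    public static void main(String[] args) {
        // stealer node id -> list of (donor node, partitions) moves for one batch
        Map<Integer, List<PartitionMove>> plan = new HashMap<>();
        plan.put(2, Arrays.asList(
                new PartitionMove(0, Arrays.asList(4, 12)),
                new PartitionMove(1, Arrays.asList(7))));

        // Each (stealer, donor) pair is one "unit of migration"; the controller
        // runs up to "max parallel rebalancing" of these units at the same time.
        plan.forEach((stealerId, moves) ->
                moves.forEach(move -> System.out.println("node " + stealerId + ": " + move)));
    }
}
```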

What about the failure scenarios?

Extra precaution has been taken, and every step ( 5 [ c, d, e, f ] ) has a rollback strategy.

Rollback strategy for 5 ( c, e ) => If any failure takes place during the 'State change step', the following rollback strategy is run on every node where the step had already completed successfully

| Swap RO | Change cluster metadata | Change rebalance state | Rollback steps ( in order ) |
|---------|-------------------------|------------------------|-----------------------------|
| F | T | T | remove the rebalance state change → change back cluster |
| F | F | T | remove from rebalance state change |
| T | T | F | change back cluster metadata → swap |
| T | T | T | remove from rebalance state change → change back cluster → swap |

Rollback strategy for 5 ( d, f ) => Similarly, if any failure takes place during the migration of partitions

| Has RO stores | Has RW stores | Finished RO stores | Rollback Action [ in the correct order ] |
|---------------|---------------|--------------------|------------------------------------------|
| T | T | T | rollback cluster change + swap |
| T | T | F | nothing to do since "rebalance state change" should have removed everything |
| T | F | T | won't be triggered since hasRW is false |
| T | F | F | nothing to do since "rebalance state change" should have removed everything |
| F | T | T | rollback cluster change |
| F | T | F | won't be triggered |
| F | F | T | won't be triggered |
| F | F | F | won't be triggered |
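
Expressed as code, the decision table above might look like the following illustrative Java method; the strings only describe the action to take and are not the real controller API.

```java
// Illustrative encoding of the rollback decision table for step 5 ( d, f ).
public class MigrationRollbackSketch {

    public static String rollbackAction(boolean hasROStores,
                                        boolean hasRWStores,
                                        boolean finishedROStores) {
        if (hasRWStores && finishedROStores) {
            // The cluster metadata was already changed in step 5 ( e ), so undo it;
            // the swap rollback only applies when read-only stores are present.
            return hasROStores ? "rollback cluster change + swap"
                               : "rollback cluster change";
        }
        // In every other case the "rebalance state change" rollback has already
        // removed everything, or the combination cannot be triggered at all.
        return "nothing to do";
    }

    public static void main(String[] args) {
        System.out.println(rollbackAction(true, true, true));   // rollback cluster change + swap
        System.out.println(rollbackAction(false, true, true));  // rollback cluster change
        System.out.println(rollbackAction(true, true, false));  // nothing to do
    }
}
```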

What happens on the stealer node side during 5 ( d, f )?

On receiving a "unit of migration" [ unit => single stealer + single donor node migration ], the stealer node does the following:

  • Check if the rebalancing state change was already done [ i.e. 5 ( c, e ) was successfully completed ]
  • Acquire a lock for the donor node [ Fail if donor node was already rebalancing ]
  • Start migration of the store partitions from the donor node, copying stores in parallel [ setMaxParallelStoresRebalancing ]. At the end of every store migration, remove that store from the rebalance state change list [ so as to stop redirecting requests for it ]. A sketch of this flow is shown below.
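
The following is a hypothetical Java sketch of that flow: check the rebalancing state, lock the donor, copy stores in parallel bounded by max.parallel.stores.rebalancing, and remove each store from the redirect list once it finishes. The interfaces and method names are stand-ins, not the actual server classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class StealerMigrationSketch {

    // Stand-in for the stealer node's rebalancing bookkeeping.
    interface RebalanceState {
        boolean isStateChangeDone(int donorNodeId);       // was step 5 ( c, e ) applied?
        boolean tryLockDonor(int donorNodeId);            // fail if the donor is already rebalancing
        void removeStoreFromRedirects(String storeName);  // stop redirecting requests for this store
    }

    // Stand-in for the per-store data copy from the donor node.
    interface StoreMigrator {
        void copyStoreFromDonor(String storeName, int donorNodeId) throws Exception;
    }

    /** Process one unit of migration: all stores of one donor, copied in parallel. */
    public static void migrateUnit(RebalanceState state, StoreMigrator migrator,
                                   int donorNodeId, List<String> storeNames,
                                   int maxParallelStores) throws Exception {
        if (!state.isStateChangeDone(donorNodeId))
            throw new IllegalStateException("rebalance state change missing for donor " + donorNodeId);
        if (!state.tryLockDonor(donorNodeId))
            throw new IllegalStateException("donor " + donorNodeId + " is already being rebalanced");

        ExecutorService pool = Executors.newFixedThreadPool(maxParallelStores);
        try {
            List<Future<?>> tasks = new ArrayList<>();
            for (String store : storeNames) {
                tasks.add(pool.submit(() -> {
                    migrator.copyStoreFromDonor(store, donorNodeId);
                    state.removeStoreFromRedirects(store);  // store is now served locally
                    return null;
                }));
            }
            for (Future<?> task : tasks)
                task.get();  // propagate any migration failure to the caller
        } finally {
            pool.shutdown();
        }
    }
}
```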

What about the donor side?

The donor node has no knowledge of the rebalancing at all and keeps behaving normally.

What about my data consistency during rebalancing?

The rebalancing process has to maintain data consistency guarantees while data is being moved; we do this through a proxy-based design. Once rebalancing starts, the stealer node is the new master for all rebalancing partitions, and all clients talk directly to the stealer node for requests on these partitions. The stealer node internally makes proxy calls to the original donor node so that it can return correct data to the client. The process steps are:

  1. The client asks the stealer node for key 'k1', which belongs to partition 'p1', currently being migrated/rebalanced.
  2. The stealer node looks at the key and sees that it belongs to a partition that is being rebalanced.
  3. The stealer node makes a proxy call to the donor node and gets the list of values as returned by the donor node.
  4. The stealer node does a local put for all ( key, value ) pairs, ignoring any ObsoleteVersionException.
  5. The stealer node now has all the versions from the original node and performs normal local get/put/getAll operations ( see the sketch below ).
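
As an illustration of steps 2–5, here is a simplified Java sketch of the stealer node's read path during migration; the Store interface and class names are stand-ins for the real storage and socket-store types.

```java
import java.util.List;

public class ProxyGetSketch {

    // Stand-in for Voldemort's ObsoleteVersionException ( thrown on a put of an older version ).
    static class ObsoleteVersionException extends RuntimeException {}

    // Minimal stand-in store interface; the strings represent versioned values.
    interface Store {
        List<String> get(String key);        // all versions currently held for the key
        void put(String key, String value);  // may throw ObsoleteVersionException
    }

    private final Store localStore;  // the stealer node's own storage engine
    private final Store donorProxy;  // a socket store pointing at the donor node

    public ProxyGetSketch(Store localStore, Store donorProxy) {
        this.localStore = localStore;
        this.donorProxy = donorProxy;
    }

    /** Serve a get() for a key whose partition is still being migrated. */
    public List<String> get(String key) {
        // Step 3: proxy the request to the donor node for the authoritative versions.
        List<String> donorValues = donorProxy.get(key);

        // Step 4: write those versions locally, ignoring ObsoleteVersionException so that
        // versions the stealer has already copied do not abort the call.
        for (String value : donorValues) {
            try {
                localStore.put(key, value);
            } catch (ObsoleteVersionException ignored) {
                // the local copy is already as new or newer
            }
        }

        // Step 5: the stealer now holds every version the donor had, so answer locally.
        return localStore.get(key);
    }
}
```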

And how do my clients know the changes?

Voldemort clients currently bootstrap from the bootstrap URL at start time and use the returned cluster/stores metadata for all subsequent operations. Rebalancing changes the cluster metadata, so we need a mechanism to tell clients that they should re-bootstrap if they have old metadata.

  1. Client-side routing bootstrapping: Since the client will be using old metadata during rebalancing, the server now throws an InvalidMetadataException if it sees a request for a partition which does not belong to it. On seeing this special exception the client re-bootstraps from the bootstrap URL and hence picks up the correct cluster metadata ( see the sketch after this list ).
  2. Server-side routing bootstrapping: The other method of routing, i.e. making calls to any server with the enable_routing flag set to true, with re-routing to the correct location taking place on the server side ( i.e. 2 hops ). In this approach we've added a RebootstrappingStore which picks up the new metadata in case of a change.
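
A minimal sketch of the client-side behaviour in (1), assuming a hypothetical client interface ( not the actual Voldemort client API ): catch the stale-metadata exception, re-bootstrap, and retry the call.

```java
public class RebootstrapSketch {

    // Stand-in for Voldemort's InvalidMetadataException, thrown by the server when a
    // request arrives for a partition it does not own.
    static class InvalidMetadataException extends RuntimeException {}

    // Minimal stand-in for a routed client; not the real Voldemort client API.
    interface RoutedClient {
        String get(String key);  // throws InvalidMetadataException when the client's metadata is stale
        void rebootstrap();      // fetch fresh cluster metadata from the bootstrap URL
    }

    /** Perform a get, re-bootstrapping once if the cluster metadata turned out to be stale. */
    public static String getWithRebootstrap(RoutedClient client, String key) {
        try {
            return client.get(key);
        } catch (InvalidMetadataException stale) {
            client.rebootstrap();    // pick up the post-rebalance cluster metadata
            return client.get(key);  // retry with the new routing information
        }
    }
}
```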

How to start rebalancing?

  1. Make sure the cluster is not already rebalancing.
    • The rebalancing steal info should be empty
      ./bin/voldemort-admin-tool.sh --get-metadata rebalancing.steal.info.key --url [ url ]
    • All servers should be in the NORMAL_SERVER state
      ./bin/voldemort-admin-tool.sh --get-metadata server.state --url [ url ]
  2. To check later whether the keys were moved correctly, save some keys now so that we can verify they have been migrated to their new locations
    ./bin/voldemort-rebalance.sh --current-cluster [ current_cluster ] --current-stores [ current_stores_path ] --entropy false --output-dir [ directory where we'll store the keys for later use. Keys are stored on a per store basis ]
  3. Generate the new cluster xml. This can be done by hand, or automatically with the following tool
    ./bin/voldemort-rebalance.sh --current-cluster [ current_cluster_path ] --target-cluster [ should be the same as current-cluster but with new nodes put in with empty partitions ] --current-stores [ current_stores_path ] --generate
  4. Run the new metadata through the key-distribution generator to get an idea of any skew. Make sure the standard deviation is close to 0.
    ./bin/run-class.sh voldemort.utils.KeyDistributionGenerator --cluster-xml [ new_cluster_xml_generated_above ] --stores-xml [ stores_metadata ]
  5. Use the new_cluster_xml and run the real rebalancing, BUT first with --show-plan ( to check what we're going to move )
    ./bin/voldemort-rebalance.sh --url [ url ] --target-cluster [ new_cluster_metadata_generated ] --show-plan
  6. Run the real deal
    ./bin/voldemort-rebalance.sh --url [ url ] --target-cluster [ new_cluster_metadata_generated ]
  7. Monitor by checking the async jobs
    ./bin/voldemort-admin-tool.sh --async get --url [ url ]

The following takes place when we run the real rebalancing: (i) pick a batch of partitions to move, (ii) generate the transition plan, (iii) execute the plan as a series of 'stealer-donor' node copying steps. By default we do only one 'stealer-donor' tuple movement at a time; you can increase this with the --parallelism option.

If anything fails we can roll back easily ( as long as --delete was not used while running voldemort-rebalance.sh ).

  • To stop an async process
    ./bin/voldemort-admin-tool.sh --async stop --async-id [ comma separated list of async jobs ] --url [ url ] --node [ node on which the async job is running ]
  • To clear the rebalancing information on a particular node
    ./bin/voldemort-admin-tool.sh --clear-rebalancing-metadata --url [ url ] --node [ node-id ]

The following are some of the server-side configuration parameters that will help you:

| Parameter on server | Default | What it does |
|---------------------|---------|--------------|
| enable.rebalancing | true | Should be true in order to run rebalancing |
| max.rebalancing.attempts | 3 | Once a stealer node receives the plan to copy from a donor node, it will attempt this many times to copy the data ( in case of failure ) |
| rebalancing.timeout.seconds | 10 * 24 * 60 * 60 | Time we give the server-side rebalancing to finish copying data from a donor node |
| max.parallel.stores.rebalancing | 3 | Number of stores to rebalance in parallel |
| rebalancing.optimization | true | Sometimes we have data stored without being partition-aware ( for example, BDB ). In this scenario we can run an optimization phase which skips copying data over if a replica already exists |
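
For convenience, these settings would typically go in the server's properties file ( e.g. config/server.properties ); the snippet below simply restates the defaults from the table above.

```properties
# Rebalancing-related server settings ( defaults restated from the table above )
enable.rebalancing=true
max.rebalancing.attempts=3
# 10 * 24 * 60 * 60 seconds = 10 days
rebalancing.timeout.seconds=864000
max.parallel.stores.rebalancing=3
rebalancing.optimization=true
```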

What is left to make this process better?

a) Task execution should be smarter and choose which tasks to run so as to avoid two disk sweeps happening on the same node at the same time.

b) Fix deletes! Make the delete run at the end instead of in the middle [ even though I'll never run this in production ].

c) Logging: Currently we propagate messages from the lowest level all the way to the top. Instead we should try to make a better progress bar ( "Number of stores completed – 5 / 10" ) and push that upwards.

d) Currently the stealer node goes into the REBALANCING_MASTER state and prevents any disk sweeps ( like the slop pusher job, etc. ) from taking place. But what about the poor donor node?