Skip to content

Improve Legion Lord election algorithm #100

Closed
unbit opened this Issue Dec 27, 2012 · 23 comments

2 participants

@unbit
Owner
unbit commented Dec 27, 2012

The first purpose of the legion subsystem was ip-takeover. Infact the protocol is very similar to the carp one.

During the "future of the uWSGI clustering" discussions/rfc emerged that a better master ('lord' in legion subsystem gergo) election system is needed.

Inspirations are the paxos algorithm and the redis-sentinel implementation. Objectives are reducing split-brain cases and increase reliability.

@unbit
Owner
unbit commented Dec 28, 2012

Fix situations where nodes/members have the same valor:

Initial legion subsystem specs leave the case of nodes with the same valor undefined.

The new implementation will set a UUID (if uuid support is available) too. When 2 (ore more) nodes with the same valor pop-up the one with the higher UUID (using uuid_compare()) wins.

@unbit
Owner
unbit commented Dec 28, 2012

Quorum:

it is a pretty decent way to reduce split brain problems.

Each legion will expect (by default) a quorum of 1, that means node will not vote for election of a lord. A quorum value higher than 1 will enable the vote mode. Each node will send an additional announce with the choosen lord. When a potential new lord receive at least Q (quorum) announces for itself it will became the new lord.

@unbit unbit was assigned Dec 28, 2012
@unbit
Owner
unbit commented Dec 28, 2012

Multiple lords ?

Are multiple lords useful in some specific context ?

@prymitive
Collaborator

Quorum You described means that I need to have:

N >= Quorum nodes with --quorum --lord=nodeA

??

There is one great feature in uWSGI that I use - statelessness - only FastRouter nodes are "static", all backends can be moved around, added and removed without touching single line of configuration (and it should be easy to use multicast for subscriptions so I could have those made more dynamic as well). So I am not a fan of putting --lord=nodeA anywhere in my config files, I would prefer just to run some number of nodes and let them pick a lord. I understand that some clustering tools require me to input all nodes, but current legion can talk over multicast just fine and I really like that.

Maybe quorum could be implemented as the minimal number of joined nodes (like with cman and other tools)? So that if I use --quorum 3 than legion would not elect lord until there are at least 3 nodes sending announces.

@prymitive
Collaborator

BTW - please add legion API so that plugins could interact with it easily. Right now if I don't want to reimplement (which is fancy term for copy&paste) all uwsgi_opt_legion(), uwsgi_opt_legion_hook() and uwsgi_opt_legion() logic I need to format a string with options for legion parameters.

@unbit
Owner
unbit commented Dec 28, 2012

Quorum means that a lord is not elected until Q nodes agree on it. This is required for avoiding (reducing) split brain, and it will be optional (by default no quorum is needed). Regarding stateless the relevant "new part" is the one allowing you to NOT specify the valor of a node. The valor will be somewhat randomic so you do not need to worry about that, and you will be sure only one node per-legion will run "things".

@unbit
Owner
unbit commented Dec 28, 2012

Regarding the api you only need to add the legion name in your options. In my mind a --legion-cron would be something like that:

--legion-cron foobar -1 -1 -1 -1 -1 my_command

that means, run the cron only if the node is the lord of the 'foobar' legion.

In your plugin you will have something like that:

run_cron(mycron) {
if (uwsgi_i_am_the_lord(mycron->legion)) {
uwsgi_run_command(mycron->command);
}
}

@prymitive
Collaborator

Great, everything is clear now.

run_cron(mycron) {
if (uwsgi_i_am_the_lord(mycron->legion)) {
uwsgi_run_command(mycron->command);
}
}

This is pretty much what I was toying with in https://github.com/prymitive/uwsgi/blob/clustered_crons/plugins/clustered_crons/clustered_crons.c

@unbit
Owner
unbit commented Dec 28, 2012

Yes, i think you will be able to extremely simplify it as soon as the api is ready

@prymitive
Collaborator

Could legion nodes also advertise ip:port of the uWSGI sockets they have to other nodes? This way new node would join legion cluster and once it's joined and the cluster is quorate, it could ask some other node (preferably master) for cache-sync. This would allow for cache syncing without the need to use --cache-sync with fixed node address.

@unbit
Owner
unbit commented Jan 27, 2013

this is part of the "legion scroll" system. You will find reference to it in the current code. Basically you can add a "blob"
to each node with custom values.

--legion-scroll = mylegion "foobar=test,youraddr=127.0.0.1:4040"

it is still incomplete but will be ready for sure in time for 1.5 as it is the base for the upcoming clustered-emperor support

@prymitive
Collaborator

Ok, scroll will be great addition.

Could we also consider allowing user/plugin to provide some code that will dynamically alter nodes valor?
Either every N second or/and when legion cluster is altered (node joins or leaves legion).
First one would allow master to jump around from time to time (when load on a node is too high we can move master to other, less loaded node)
Second one would allow to spread cron tasks around during startup.

Without the ability to alter the valor, we might end up with all tasks on single node. If those tasks are memory or cpu heavy than we might kill that node.
Minimum version would be using valor value computed by $(user provided valor) - $(number of legions I am master of).

@unbit
Owner
unbit commented Jan 29, 2013

altering the valor is pretty easy (all of the math is done every time from scratch in the legion subsystem so you can trick it in various way)

something like uwsgi_legion_set/inc/dec_valor(struct uwsgi_legion *legion, uint64_t new_valor);

should be good (and without the need of locking as we are changing a single numeric value)

@prymitive
Collaborator

Is "legion scroll" going to be in 1.9 or later? Once scroll is ready and there is a way to sync uWSGI cache from legion master node, than we will have basic legion crons plugin. I'm not rushing or anything, I currently use redis script for locking jobs and I won't move from it anytime soon, but others might find that useful and start if so, we will get few extra legion testers ;)

@unbit
Owner
unbit commented Feb 27, 2013

it should be (eventually will be in 1.9.1)

@prymitive
Collaborator

I don't think there are any issues with current legion implementation, per run uid feature is present and working, scroll is ready and working. Closing as resolved

@prymitive prymitive closed this Apr 13, 2013
@prymitive
Collaborator

There are 2 things that qualify for this issue that I would like to discuss, so I'll reopen it to have discussion in one place.

I'm thinking about replacing keepalived with legion, for virtual ip management on 2 of my fastrouter nodes. I was thinking about what could go wrong and as always split brain can occur (as with all clustering tools and only 2 nodes), so maybe we could add legion node(s) that can't become The Lord and use quorum=2 (or bigger, with multiple arbiters).

What this would give us - if one node can't communicate with the other currently there is no way to check if this is due to issue one local or remote node. But if local node can talk to arbiter nodes and quorum is reached, then we can make a decision who should become a lord.

Second thing that could help - health checks - run something, if it fails lower the valor of local node or prevent it from becoming The Lord, example:

```--legion-lord-check="ping router.ip"

if we can't ping router than network might be down on local node, don't become the lord.

--legion-valor-check="50 killall 0 redis-server

if redis-server is not running (and we use it) lower the valor value by 50 "points".

@prymitive prymitive reopened this Apr 17, 2013
@unbit
Owner
unbit commented Apr 17, 2013

+1 for the arbiter, i will work on this soon after 1.9.7

Regarding the dynamic valor, i am thinking about changing it directly from the apps, so you can have a mule making all the checks you need without spawning a process every time.

@unbit
Owner
unbit commented Apr 20, 2013

The support for arbiters has been added. Just add node with a valor of 0 (they will be reported in the logs as "arbiter" instead of "node")

@prymitive
Collaborator

Changing valor for mules is ok, one can add checks there with uwsgi timer or cron API, but I think there is use case for more direct manipulation of nodes valor.

Example:

I need to do some maintenance on lord node and I would prefer to move the lord to another node before stopping lords uWSGI instance (in mongodb I would call rs.stepDown() to do so), so it would be nice to have command line option for setting valor of any running legion instance.
uwsgi --legion-set-valor mylegion 30 -s localhost:uwsgi_port

I would need to talk to local instance, probably over uWSGI socket (?), it would then set valor to 30, that would trigger new election (or we need to force update in legion cluster).

This would allow me to more gracefully stop lord node.

@prymitive
Collaborator

We could also add something like stepDown() into uwsgi_reload(), when I need to stop the lord, I would like it to first pass its lord status to other node in as much graceful way as possible, so in case of virtual IP there is as little downtime as it can be.

I'm not sure what happens when uWSGI instance joined to legion dies/reloads, does it send any message to other nodes that it should be marked as dead? Or does it dies and other nodes thinks it's up until legion-tolerance is exceeded?

It looks like the second case, it has pros and cons:

we can quickly reload instance without other nodes in the legion noticing it (in case valor is different and runtime uuids are not used for lord election) - in such case lord title stays on the reloaded node

if node will not reload correctly (upgrade that went wrong) node will not get up and so if it was the lord we need to wait until other nodes will notice that it is down

I think it's just a matter of (optionally) sending announce with valor = 0 when we need to reload, we can make it an option so user will pick desired behavior.

(?)

@unbit
Owner
unbit commented Jun 2, 2013

Ok, i think it is time to deal with it. What to do when an instance is reloaded (or stopped) ? i suppose we need to allow users to choose what to do for each legion. In some case it could be good to leave asap, in others it would be better to wait for the whole cluster to choose the best lord

@unbit
Owner
unbit commented Aug 6, 2013

ok, i think we can close it. Latest commit send a "death announce" whenever the server exit or it is reloaded. There is no need to allow the user to choose it, as the uuid of the node is changed on startup, so effectively the old node is always lost

@unbit unbit closed this Aug 6, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.