InstanceConnectability

Failed connections

At the moment, if we can't connect to an instance, that connection simply fails. Subsequent connections may try the same instance again (possibly immediately).

The load balancer should be making local decisions about which instances are available and which are not. When it cannot connect to an instance, it should quarantine that instance and not use it for new connections. Meanwhile, it can try the instance from time to time to see whether it is working again.

I think we want to have a backoff: try again in a second, then four seconds, then 16 seconds, (then 32 seconds?).
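One way to read the schedule above is as a multiplicative backoff with a ceiling. The sketch below is only an illustration: the names `next_backoff` and `backoff_schedule` are hypothetical, and the factor of 4 and the 32-second cap are assumptions chosen to reproduce "1, 4, 16, (32?)". The first pass further down doubles instead, which corresponds to `factor=2.0`.

```python
from typing import List


def next_backoff(current: float, factor: float = 4.0, cap: float = 32.0) -> float:
    """Delay to use after one more failure (hypothetical helper).

    Assumption: the delay grows by `factor` and never exceeds `cap` seconds.
    """
    return min(current * factor, cap)


def backoff_schedule(initial: float = 1.0, steps: int = 5,
                     factor: float = 4.0, cap: float = 32.0) -> List[float]:
    """List the first `steps` delays, starting from `initial` seconds."""
    delays = []
    delay = initial
    for _ in range(steps):
        delays.append(delay)
        delay = next_backoff(delay, factor, cap)
    return delays


print(backoff_schedule())  # [1.0, 4.0, 16.0, 32.0, 32.0]
```

Whether a cap is wanted at all, and at what value, is still an open question in these notes.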

How to probe? It would be nice to use actual requests, mainly because:

it's conceptually simpler;
doing something else might put spurious errors in app logs or, worse, upset an app (a poorly written one, admittedly).

Connections may take some time to fail, and it may be difficult to tell whether a connection has failed or is just taking a long time. It is also not acceptable for a client connection to take a long time to fail unless there's no other option. (This suggests that instances in quarantine may be promoted if there's no other choice.)
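The "no other choice" fallback could look something like the sketch below. It is a guess at the shape, not a decision: `choose_instance` is a hypothetical name, picking at random among active instances is an assumption (these notes do not say how the balancer chooses), and quarantined instances are assumed to be represented as `(next_probe_at, instance)` pairs.

```python
import random
from typing import Optional, Sequence, Tuple


def choose_instance(active: Sequence[str],
                    quarantined: Sequence[Tuple[float, str]],
                    rng=random) -> Optional[str]:
    """Pick an instance for a new connection (hypothetical helper).

    Prefer any active instance. Only when none is active, fall back to the
    quarantined instance that is due for probing soonest, rather than
    failing the client outright.
    """
    if active:
        return rng.choice(list(active))
    if quarantined:
        # Smallest next_probe_at first; ties are broken by instance name.
        return min(quarantined)[1]
    return None


print(choose_instance([], [(5.0, "b"), (2.0, "a")]))  # a
```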

First pass

Keep two sets: active instances and quarantined instances. The latter are kept in a priority queue, ordered by when each is next due to be probed.

When the head of the quarantine queue becomes eligible for probing, promote it to the active instances, doubling its backoff.
When a connection to an instance fails, insert the instance in the quarantine queue.
When a connection to an instance succeeds, reset its backoff to 1 second.
When the instances are updated: if an instance is in active, update it there; if it is in quarantine, update it there (and reinstate if different? how would it be different?). If it is new, add it to active. Remove all instances in active or quarantine that are no longer present. (It may be best to create a diff to do this.)
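The four rules above can be sketched roughly as follows. Everything here is illustrative: the class name `InstancePool`, its method names, and the injectable `clock` are assumptions, instances are identified by plain strings, and the "reinstate if different?" question is left unanswered (an instance that is still present is simply left where it is).

```python
import heapq
import time
from typing import Callable, Dict, Iterable, List, Set, Tuple

INITIAL_BACKOFF = 1.0  # seconds


class InstancePool:
    """Active set plus a quarantine heap ordered by next probe time."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self.clock = clock
        self.active: Set[str] = set()
        self.backoff: Dict[str, float] = {}           # every known instance
        self.quarantine: List[Tuple[float, str]] = []  # (next_probe_at, instance)

    def promote_due(self) -> None:
        """Move instances whose probe time has arrived back to active."""
        now = self.clock()
        while self.quarantine and self.quarantine[0][0] <= now:
            _, inst = heapq.heappop(self.quarantine)
            self.active.add(inst)
            self.backoff[inst] *= 2  # a further failure waits twice as long

    def connection_failed(self, inst: str) -> None:
        if inst not in self.active:
            return  # unknown, or already quarantined
        self.active.discard(inst)
        heapq.heappush(self.quarantine, (self.clock() + self.backoff[inst], inst))

    def connection_succeeded(self, inst: str) -> None:
        if inst in self.backoff:
            self.backoff[inst] = INITIAL_BACKOFF

    def update(self, instances: Iterable[str]) -> None:
        """Apply a new instance list as a diff against what is known."""
        current = set(instances)
        known = set(self.backoff)
        for inst in current - known:   # new instances start out active
            self.backoff[inst] = INITIAL_BACKOFF
            self.active.add(inst)
        for inst in known - current:   # instances that have gone away
            del self.backoff[inst]
            self.active.discard(inst)
        self.quarantine = [(t, i) for t, i in self.quarantine if i in current]
        heapq.heapify(self.quarantine)
```

In this sketch `promote_due` would be called before each instance selection, so a promoted instance is probed by the next real request, in line with the preference above for probing with actual requests.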
