
HostHealth

Michael Bridgen edited this page Mar 21, 2016 · 1 revision

Host health

Problem: instance records tend to stick around after you stop an agent or host, because there's nothing to remove them.

Basically: run a timer in each agent which refreshes a key that has a TTL. Peers should regard as dubious any instance that belongs to a host that is not present; so there's an inner join on {instances, hosts}.
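A minimal sketch of the heartbeat and the join, assuming an in-memory stand-in for the store. The names `TTLStore`, `refresh`, `live_keys` and `trusted_instances` are hypothetical, as is the record layout (a dict with `name` and `host`); a real implementation would rely on the store's own key expiry.

```python
import time


class TTLStore:
    """In-memory stand-in for a store whose keys expire (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expiry = {}  # key -> time at which the key expires

    def refresh(self, key, ttl):
        """Heartbeat: (re)write the key so it lives for another ttl seconds."""
        self._expiry[key] = self._clock() + ttl

    def live_keys(self):
        """Return the keys that have not yet expired."""
        now = self._clock()
        return {key for key, expires in self._expiry.items() if expires > now}


def trusted_instances(instances, live_hosts):
    """Inner join of instances on hosts: keep instances whose host is present."""
    return [inst for inst in instances if inst["host"] in live_hosts]
```

Each agent would call `refresh` for its own host on a timer with a period comfortably shorter than the TTL; a peer would then pass `live_keys()` to `trusted_instances` and treat whatever falls out of the join as dubious rather than as definitely gone.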

Questions:

  1. Just because the agent isn't running doesn't mean the instances aren't running. How much should, e.g., the balancer take note of which hosts are live?

  2. If you restart an agent with a different identity (host IP, effectively), how do you make sure you don't get overlapping sets of instances?

  3. How do instance records get removed?

  4. How much does this change if we're not using etcd? (or: what is a good set of abstract data structures?)

Scenarios:

  • Two agents are started on the same host. How do they not stomp on each other's records?

  • All the agents are stopped, then started after a longish period. How do old records get collected?

Approach version one

Everyone watches the host records and will see when someone fails to heartbeat. If you notice a host go away, you wait for a few seconds (to avoid a thundering herd), then remove all its records.
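As a rough sketch of that reaction, assuming instance records are held in a dict of name to host: `removal_delay` and `remove_host_records` are hypothetical names, and the delay bounds are illustrative rather than taken from any implementation.

```python
import random


def removal_delay(min_seconds=1.0, max_seconds=5.0, rng=random):
    """Pick a random wait so that peers do not all start removing at once."""
    return rng.uniform(min_seconds, max_seconds)


def remove_host_records(instances, host):
    """Delete every instance record that belongs to the departed host.

    instances maps instance name -> host. Returns the names removed.
    """
    doomed = [name for name, h in instances.items() if h == host]
    for name in doomed:
        del instances[name]
    return doomed
```

A watcher that sees a host record expire would sleep for `removal_delay()` seconds and then call `remove_host_records`.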

Problems:

  • What if that agent comes back? You'll be removing records while the returned agent is trying to add them.

  • There's still nothing stopping everyone from piling in if the GC takes longer than the wait.

  • What if all the agents terminate, and are restarted after the host records have been removed? Now no-one sees a host go away, so no-one will know there were supposed to be host records (it's edge-triggered).

Approach version two

Agent records and heartbeats are distinct, and an agent record is removed only after its instances have been removed. So if all the agents go away and later come back, the remaining agent records tell them which agents still need their instances garbage collected; so long as one agent is running, progress is made.

Each run of an agent has an identifier. An instance is collected only if its identifier belongs to an expired run (the implementation should use a conditional delete, so that it doesn't wipe out a record that has been updated in the meantime).
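The conditional delete might look something like the following sketch. It assumes each instance record stores the identifier of the agent run that wrote it; `InstanceStore`, `delete_if_run` and `collect_expired` are hypothetical names, and the dict stands in for whatever compare-and-delete primitive the real store offers.

```python
class InstanceStore:
    """In-memory stand-in for the instance records (hypothetical)."""

    def __init__(self):
        self._records = {}  # instance name -> run id of the agent that wrote it

    def put(self, name, run_id):
        self._records[name] = run_id

    def snapshot(self):
        """Return a copy of the records as the collector last saw them."""
        return dict(self._records)

    def delete_if_run(self, name, expected_run_id):
        """Conditional delete: remove the record only if it still carries
        expected_run_id. Returns True if the record was removed."""
        if self._records.get(name) == expected_run_id:
            del self._records[name]
            return True
        return False


def collect_expired(store, snapshot, expired_runs):
    """Remove records from snapshot that belong to an expired run, skipping
    any record that has been rewritten since the snapshot was taken."""
    removed = []
    for name, run_id in snapshot.items():
        if run_id in expired_runs and store.delete_if_run(name, run_id):
            removed.append(name)
    return removed
```

If a restarted agent rewrites a record under its new run identifier between the snapshot and the delete, `delete_if_run` fails for that record and the fresh entry survives.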

A freshly-started agent will still need to reconcile instance records with the current state. So, it should remove instance records for its host.
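One way to picture the reconciliation, as a sketch only: the record layout (name mapped to a `(host, run_id)` pair) and the `reconcile` function are hypothetical, and re-adding the instances the agent can currently observe is an assumption about what happens after the removal.

```python
def reconcile(records, host, run_id, running):
    """Bring the records for one host in line with what is running there.

    records: dict of instance name -> (host, run_id), shared by all agents.
    running: names of the instances the agent observes on its host right now.
    """
    # Remove every record previously written for this host, whichever run wrote it.
    stale = [name for name, (h, _) in records.items() if h == host]
    for name in stale:
        del records[name]
    # Assumption: the agent then re-registers what it sees, under its new run id.
    for name in running:
        records[name] = (host, run_id)
```

Records for other hosts are left alone; those are the garbage collector's concern.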

Incidentally, this need not be edge-triggered; the GC can poll and tell what to do from the state.
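A polling collector could be sketched as below. It assumes the state is visible as plain sets and dicts (agent records and live heartbeats keyed by run identifier, instance records mapping name to run identifier); `gc_pass` is a hypothetical name, and a real implementation would use conditional deletes against the store rather than mutating local data.

```python
def gc_pass(agents, heartbeats, instances):
    """One level-triggered GC pass, decided purely from the current state.

    agents:     set of run ids that still have an agent record
    heartbeats: set of run ids whose heartbeat key is still live
    instances:  dict of instance name -> run id that wrote it
    Returns the run ids whose agent records were removed in this pass.
    """
    collected = []
    # An agent record with no live heartbeat marks a run that needs collecting.
    for run_id in sorted(agents - heartbeats):
        for name in [n for n, r in instances.items() if r == run_id]:
            del instances[name]
        # The agent record goes last, only once its instances are gone.
        agents.discard(run_id)
        collected.append(run_id)
    return collected
```

Because the pass looks only at what is currently stored, it still works after every agent has been stopped for a long time: the first agent to come back finds the old agent records, sees no heartbeats for them, and collects their instances.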
