
IP Allocation further ideas


This page contains ideas and observations that were originally part of the design, but were moved here to make it easier to read.

Open questions

  • not persisting anything will lead to issuing the same IPs on either side of a split brain, if one side reboots

    • persist the CRDT to disk
  • the entropy of the CRDT only increases with time (merging is not always possible), and therefore there exist workloads which could cause it to degrade into gossiping reservations of size 1.

    • adding a minimum reservation size does not help; a minimum reservation size of n means worst-case utilisation could be 1/n and the maximum number of hosts would be N/n
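
For illustration (numbers assumed, not from the design): with a /16 range of N = 65536 addresses and a minimum reservation size of n = 256, worst-case utilisation is 1/256 (a host might use only one address out of its 256-address reservation), and at most 65536/256 = 256 hosts can hold a reservation at all.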

TODO

After we get through code review:

  • introduce mechanisms for reducing ring size, to prevent us always reaching maximum entropy
    • automatically tombstone any consecutive ring entries, to merge a series of small ranges
      • alternatively, just have every node merge consecutive ring entries and discard the second token (see the sketch after this list)
    • periodically release any empty ranges you might own
  • Potentially make claim work for ranges you don't own
  • Remove the free list on space
  • Make a better master election algorithm
  • Multiple Subnets support
    • Currently there is no fall-back range if you don't specify one on the command-line; it would be better to pick from a set of possibilities (checking the range is currently unused before we start using it).
    • How to use IPAM for WeaveDNS? It needs its own special subnet.
  • Consider using IPAM for reverse DNS lookups?
  • We get a bit of noise in the weave logs from containers going down, now that the weave script is calling ethtool and curl via containers.
  • Interrogate Docker to check container exists and is alive at the point we assign an IP
  • It would be good to move PeerName out of package router into a more central package so both ipam and router can depend on it.
  • There is no specific logic to time out requests such as "give me some space": the logic will be re-run and the request re-sent next time something else happens (e.g. it receives another request, or some periodic gossip). This means we may send out requests more frequently than required, but this is innocuous and it keeps things simple. It is possible for nothing to happen, e.g. if everyone else has disconnected. We could have a timer to reconsider things.
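
A minimal sketch of the ring-entry merging idea from the list above (illustrative types only; real ring entries carry more state, so merging would not be quite this simple):

```go
// Hypothetical types: Entry marks the start of a range on the ring, and the
// range extends to the start of the next entry (wrapping around the ring).
package main

import "fmt"

type PeerName string

type Entry struct {
	Start uint32   // first address of the range
	Peer  PeerName // peer that owns the range
}

// mergeConsecutive drops every entry whose owner matches the previous
// entry's owner, so a run of small ranges held by one peer collapses
// into a single larger range (the "discard the second token" idea).
func mergeConsecutive(ring []Entry) []Entry {
	if len(ring) == 0 {
		return ring
	}
	merged := []Entry{ring[0]}
	for _, e := range ring[1:] {
		if e.Peer != merged[len(merged)-1].Peer {
			merged = append(merged, e)
		}
	}
	return merged
}

func main() {
	ring := []Entry{{0, "a"}, {16, "a"}, {32, "b"}, {48, "b"}, {64, "a"}}
	fmt.Println(mergeConsecutive(ring)) // [{0 a} {32 b} {64 a}]
}
```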

OLD

CAP

We have to choose amongst:

  • Consistency - no two containers are given the same IP
  • Availability - we can always ask for an IP
  • Partition tolerance - we can carry on allocating and releasing when some nodes cannot communicate

We take Consistency and Partition-tolerance as paramount, so we weaken Availability to the extent that we will refuse to allocate an IP when we cannot reach the peer that reserved it, even though that peer may not have allocated it.

reservations

There are other suitable representations for reservations.

For a possibly more compact representation, which also happens to neatly align with sub-nets and CIDR notation, think of the IP space as a binary tree, with reservations being represented as leaves labelled by the node holding the reservation for all IPs below it. Unlabelled leaves then represent lost IP addresses. Non-leaf nodes get the label of the left-most labelled leaf of their children. The heir of a lost reservation is the node identified by the label on the sibling of the unlabelled leaf corresponding to the lost reservation.
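
A minimal sketch of this tree representation and the heir rule (illustrative Go with assumed names, not weave code):

```go
// Leaves own CIDR-aligned blocks; an empty Owner marks a lost block, and
// the heir of a lost leaf is the label of its sibling subtree.
package main

import "fmt"

type PeerName string

// node is a leaf when Left and Right are nil; children split the block in half.
type node struct {
	Owner       PeerName
	Left, Right *node
}

func isLeaf(n *node) bool { return n.Left == nil && n.Right == nil }

// label is the label of the left-most labelled leaf under n, or "" if none.
func label(n *node) PeerName {
	if n == nil {
		return ""
	}
	if isLeaf(n) {
		return n.Owner
	}
	if l := label(n.Left); l != "" {
		return l
	}
	return label(n.Right)
}

// heirs records, for each lost (unlabelled) leaf, the label of its sibling.
func heirs(n *node, out map[*node]PeerName) {
	if n == nil || isLeaf(n) {
		return
	}
	if isLeaf(n.Left) && n.Left.Owner == "" {
		out[n.Left] = label(n.Right)
	}
	if isLeaf(n.Right) && n.Right.Owner == "" {
		out[n.Right] = label(n.Left)
	}
	heirs(n.Left, out)
	heirs(n.Right, out)
}

func main() {
	// A block split in two: the left half is owned by "a", the right half is lost.
	lost := &node{}
	tree := &node{Left: &node{Owner: "a"}, Right: lost}
	out := map[*node]PeerName{}
	heirs(tree, out)
	fmt.Println(out[lost]) // a
}
```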

initialisation

TODO deal with nodes being given different ranges and multiple ranges; the effect should be that all ranges specified anywhere are used for reservations everywhere.

How long to wait before electing a leader:

This time can be quite short; it needs to be just long enough for any live nodes to gossip their state - unless we want to cater for situations where a partition has occurred and all nodes on our side were subsequently stopped, or where we are trying to add a new node to a network of existing nodes which is presently unreachable. We could conceivably have two start-up modes for nodes: one which permits the range grabbing described above, and another where we simply wait for a non-empty map. This mode selection could happen based on whether any IP ranges were specified on start-up. However, one issue here is that weave backs off connection re-establishment quite aggressively, so a node that starts up and cannot establish any outbound connections due to firewalls may have to wait a very long time for inbound connections to occur. TODO work this out.

[2] Ordinarily nodes will always hold some reservations, i.e. once a node has managed to get hold of some reservations it will never transfer them all away. Therefore the situation of all node entries being empty only arises on formation of a new network.

Failures:

  • repeated appearance and sudden death (after sending map) of nodes with lower ids - perhaps make the MSB of the id a time-stamp? Then the max delay is bounded by the clock difference.
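
A rough illustration of the timestamp-as-MSB idea (the packing and the newID name are assumptions, not the actual id format):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// newID puts the Unix-seconds timestamp in the top 32 bits and a random
// value in the bottom 32 bits, so ids minted later compare higher and the
// error in that ordering is bounded by clock skew between hosts.
func newID(now time.Time) uint64 {
	return uint64(now.Unix())<<32 | uint64(rand.Uint32())
}

func main() {
	fmt.Printf("%016x\n", newID(time.Now()))
}
```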

reservation

The criteria for selecting a node to ask for reservations can include:

  • proximity
  • number of available IPs
  • mergeability of donations with our existing reservations, so that we can reduce fragmentation

TODO figure out what selection criteria we should use. This also has an impact on what additional data about nodes we need to include in the CRDT, and the update frequency of that.
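
One possible shape for such a selection, purely as an illustrative sketch (the Candidate fields and the weights are assumptions, not decided criteria):

```go
package main

import (
	"fmt"
	"sort"
)

type Candidate struct {
	Name         string
	HopCount     int  // proximity: fewer hops is better
	FreeIPs      int  // number of available IPs
	AdjacentOurs bool // a donation would merge with a range we already hold
}

// score ranks candidates: strongly prefer mergeable donations (to reduce
// fragmentation), then more free IPs, then closer peers.
func score(c Candidate) float64 {
	s := float64(c.FreeIPs) - 10*float64(c.HopCount)
	if c.AdjacentOurs {
		s += 1000
	}
	return s
}

func pickDonor(cands []Candidate) Candidate {
	sort.Slice(cands, func(i, j int) bool { return score(cands[i]) > score(cands[j]) })
	return cands[0]
}

func main() {
	cands := []Candidate{
		{"a", 1, 50, false},
		{"b", 3, 500, false},
		{"c", 2, 40, true},
	}
	fmt.Println(pickDonor(cands).Name) // c
}
```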

TODO improve handling case where we run out of "next best nodes". Probably we should just sleep and try again, essentially waiting for either better luck next time (i.e. less message loss/delay), some node recovery / partition-healing, or a map update.

TODO might it be worth considering proactive donation, perhaps based on gossiped "time until I run out of IPs based on current allocation rate"?
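
That gossiped figure could be as simple as the number of free addresses divided by the recent allocation rate; a rough illustration (assumed names):

```go
package main

import (
	"fmt"
	"time"
)

// timeUntilExhaustion estimates how long the free pool will last at the
// current allocation rate; a zero or negative rate means "no pressure".
func timeUntilExhaustion(freeIPs int, allocsPerSecond float64) (time.Duration, bool) {
	if allocsPerSecond <= 0 {
		return 0, false
	}
	return time.Duration(float64(freeIPs) / allocsPerSecond * float64(time.Second)), true
}

func main() {
	d, ok := timeUntilExhaustion(200, 0.5) // 200 free IPs, one allocation every 2s
	fmt.Println(d, ok)                     // 6m40s true
}
```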

claiming a specific IP

TODO what if the IP is already allocated? In the protocol, the node we are asking could tell us that. The question is what we do about it. We could use the same process as for dealing with "reservations reclaimed too early" below. This raises the question of whether, instead of all the protocol above, the node should simply assert that it holds reservations for the IPs it wants to claim, and then we let the "reservations reclaimed too early" logic sort out any clashes. This has one other advantage: claiming an IP would no longer require synchronous interactions with other nodes, delays due to waiting for map updates, etc.

reclaiming leaked IPs

TODO Can we end up in a situation where a reservation moves rapidly between nodes, with none of the corresponding map updates ever making it to the node which is the potential heir of that reservation? It would result in the reservation being reclaimed in error. Surely unlikely, but can we quantify this somehow?

Can we do better?...

It would be nice if we were able to identify a single node to perform this failure marking, so we don't needlessly perform the same cleanup operation on multiple nodes. Various options for identifying a single node:

  1. node with the lowest id overall - this is rather unbalanced
  2. node with the "next lowest" id compared to dead node - this can still be rather unbalanced, especially if ids have the time-stamp as the MSB.

Here's another idea... when marking a node as dead, we move its reservations into another part of the CRDT, which collects reservations from all dead nodes. Alive nodes can then remove reservations from that data structure for which they are the heir. It should be possible to construct a convergent data structure for this.
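
A minimal sketch of one such convergent structure (assumed names and shapes; concurrent claims by different heirs would still need a deterministic tie-break): a grow-only set of orphaned ranges plus a grow-only set of heir claims, merged by set union.

```go
package main

import "fmt"

type Range struct{ Start, End uint32 }

type DeadReservations struct {
	Orphaned map[Range]string // range -> peer that died holding it
	Claimed  map[Range]string // range -> heir that took it over
}

func New() *DeadReservations {
	return &DeadReservations{Orphaned: map[Range]string{}, Claimed: map[Range]string{}}
}

// Merge takes the union of both sets; entries are only ever added, so any
// two replicas converge regardless of the order updates are delivered in.
func (d *DeadReservations) Merge(other *DeadReservations) {
	for r, p := range other.Orphaned {
		d.Orphaned[r] = p
	}
	for r, p := range other.Claimed {
		d.Claimed[r] = p
	}
}

// Unclaimed lists ranges still waiting for an heir to take them over.
func (d *DeadReservations) Unclaimed() []Range {
	var out []Range
	for r := range d.Orphaned {
		if _, ok := d.Claimed[r]; !ok {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	a, b := New(), New()
	a.Orphaned[Range{0, 255}] = "dead-node"
	b.Claimed[Range{0, 255}] = "heir-node"
	a.Merge(b)
	fmt.Println(a.Unclaimed()) // []
}
```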