architecture.txt
Router:
- Starts Routes
- Starts Connection Maker
- Starts TCP Listener
- Starts UDP Listener
- Sniffs traffic
- Starts the local peer
- UDP Listener
This is a process that reads off the UDP socket, decodes the frames
and does either relaying to a connected peer (or peers), or
injection to the local interface (or both) (router.listenUDP /
router.handleUDPPacketFunc)
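The relay-or-inject decision can be sketched as follows. This is a simplified illustration, not the actual router.handleUDPPacketFunc logic; the function and parameter names are made up. A decoded frame carries the destination peer name (here "" stands for a broadcast), and the routing tables supply the next hops:

```go
package main

import "fmt"

// dispatch decides what to do with a frame decoded off the UDP
// socket: inject it into the local interface, relay it to one or
// more connected peers, or both. unicast maps destination peer ->
// next hop; broadcast lists the peers we must pass a broadcast on to.
func dispatch(local, dst string, unicast map[string]string, broadcast []string) (inject bool, relayTo []string) {
	switch dst {
	case local:
		return true, nil // addressed to us: inject locally only
	case "":
		return true, broadcast // broadcast: inject locally AND pass on
	default:
		if hop, ok := unicast[dst]; ok {
			return false, []string{hop} // relay towards the destination
		}
		return false, nil // unknown destination: drop
	}
}

func main() {
	uni := map[string]string{"C": "B"}
	inject, relay := dispatch("A", "", uni, []string{"B", "D"})
	fmt.Println(inject, relay) // true [B D]
}
```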
- Traffic Sniffer
This is a process that captures packets with pcap, decodes the
frames and forwards them to a local peer (or peers) (router.Sniff /
router.handleCapturedPacket)
- Routes
Maintains unicast and broadcast routing tables which can be read
directly from any thread by just obtaining a read lock on Routes.
- Unicast answers the question "If I want to send a packet to peer
X, which of the peers I'm connected to should I send the packet
to?"
- Broadcast answers the question "If peer X broadcasts a packet, by
the time it reaches me, which of my connected peers is X expecting
me to pass the packet to?"
Also spawns a thread which runs an actor loop, responding to
requests to rebuild the routing tables.
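The shape of Routes can be sketched like this; the field and method names are illustrative, not the actual weave API, but the pattern of direct reads under a read lock is the one described above:

```go
package main

import (
	"fmt"
	"sync"
)

// Routes holds the unicast and broadcast routing tables, guarded by
// a read-write lock so any thread can consult them directly.
type Routes struct {
	sync.RWMutex
	unicast   map[string]string   // destination peer -> next hop
	broadcast map[string][]string // originating peer -> next hops to relay to
}

// UnicastHop answers: to send a packet to peer dst, which of the
// peers I'm connected to should I send it to?
func (r *Routes) UnicastHop(dst string) (string, bool) {
	r.RLock()
	defer r.RUnlock()
	hop, ok := r.unicast[dst]
	return hop, ok
}

// BroadcastHops answers: for a broadcast originated by origin, which
// of my connected peers is origin expecting me to pass the packet to?
func (r *Routes) BroadcastHops(origin string) []string {
	r.RLock()
	defer r.RUnlock()
	return r.broadcast[origin]
}

func main() {
	r := &Routes{
		unicast:   map[string]string{"C": "B"},
		broadcast: map[string][]string{"A": {"B", "D"}},
	}
	hop, _ := r.UnicastHop("C")
	fmt.Println(hop)                  // B
	fmt.Println(r.BroadcastHops("A")) // [B D]
}
```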
- Connection Maker
Spawns a thread which runs an actor loop. This actor loop is passed
the known locations of peers as they're discovered and is informed
when connections we have made die. It proactively tries to create
connections to remote peers we're not connected to using a random
exponential backoff to regulate the period between connection
attempts.
- Local Peer
This is the representation of the local peer. The local peer is only
peer that is "active" in any running weave. Inactive peers simply
get given some state such their name, version, UID and
connections. These are set when a network update is received from a
remote peer. This state can be read directly whilst holding a read
lock regardless of whether the peer is local or not.
The local active peer spawns an actor thread which is mainly used by
the local connections. The actor thread manages state changes
relating to the local connections: identifying duplicate
connections, and broadcasting network updates when its set of
connections (and hence version) changes.
The local peer is directly called by the router traffic sniffer and
router udp listener processes to send traffic to neighbouring
peers. This is done directly with read locks. In these methods, we
inspect the unicast and broadcast routing tables as generated by
Routes.
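The per-peer state and its locking discipline can be sketched as follows. All names are illustrative; the point is that every peer, local or not, carries the same readable state, while changes to the local peer are serialised through its actor thread:

```go
package main

import (
	"fmt"
	"sync"
)

// Peer is the state every peer carries: name, UID, version and
// connection set, readable from any thread under a read lock.
type Peer struct {
	sync.RWMutex
	Name        string
	UID         uint64
	version     uint64
	connections map[string]bool // names of directly connected peers
}

// Version reads the peer's version under a read lock.
func (p *Peer) Version() uint64 {
	p.RLock()
	defer p.RUnlock()
	return p.version
}

// addConnection is the kind of change that, on the local peer, is
// driven by the actor thread; adding a connection bumps the version.
func (p *Peer) addConnection(name string) {
	p.Lock()
	defer p.Unlock()
	p.connections[name] = true
	p.version++
}

func main() {
	p := &Peer{Name: "A", UID: 42, connections: map[string]bool{}}
	p.addConnection("B")
	fmt.Println(p.Version()) // 1
}
```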
- Connections
The connection life cycle is as follows:
1. Either a TCP connection is created to, or received from, the remote peer
2. We spawn a thread which will eventually become the connection actor loop
3. This thread starts by doing the TCP handshake directly
4. Assuming this succeeds, a TCP receiver thread is spawned. This
simply responds to traffic received from the remote peer via TCP
5. We register this connection with the local peer
6a. If we initiated this connection then we now start sending fast
    heartbeats to the remote peer so that the remote peer can
    determine what address/port it should use to send UDP back to
    us. To do this, we spawn off a "forwarder" thread to send
    heartbeats, monitor incoming heartbeats, and perform some other
    auxiliary duties. It also consumes frames to be encapsulated and
    sent via UDP from two channels, for the DF and non-DF cases. In
    the non-DF case, it can just send the packets out of the UDP
    Listener socket. In the DF case, it needs its own socket so that
    it can do PMTU discovery easily. To do this, it uses a raw IP
    socket (IP has no ports, so there's no collision issue with the
    UDP Listener socket), and so it must add the UDP headers itself.
6b. If we did not initiate this connection then the UDP Listener
    should start receiving fast heartbeats from the remote peer. From
    those it should be able to identify the local connection via the
    local peer. It will tell the local connection (by communicating
    with the actor thread) about the UDP address of the remote
    peer. The local connection will then start its forwarder thread
    as described in 6a, and start sending fast heartbeats. We send to
    the remote peer via TCP a ConnectionEstablished message. The
    remote peer receives this (on the TCP receiver process) and tells
    the connection actor process, which then replaces the fast
    heartbeater with a slow heartbeater and marks the connection as
    established (which means it is included in network updates
    broadcast to our peers).
6c. When the connection initiator receives the fast heartbeat from
    the remote peer, it sends to the remote peer via TCP a
    ConnectionEstablished message. This is handled by the remote peer
    as described in 6b.
7. Whenever a connection is established or terminated, the local
   peer's version is incremented. Whenever this happens, the peer
   generates a network update message which is broadcast to its
   directly connected neighbours via TCP. This network update message
   contains the relevant changes to the network topology due to the
   connection change. When such a message is received by a TCP
   receiver thread, it applies the update to the local model of the
   network topology. This may fail for a number of reasons (for
   example the update may contain references to peers of which we
   have no prior knowledge; in this case, we ignore the update and
   send back to the peer from whom we received it a request for the
   complete network topology), or it may apply and elicit some
   changes to our model. If it does elicit some changes then we send
   an updated update message to all our peers. In this way changes
   are passed quickly from peer to peer along the established
   connections, and stop being sent once a received update causes no
   changes to a peer's topology model. Changes are additive and care
   is taken to ensure that no loops can occur.
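The propagation rule in step 7 can be sketched as below. This is a deliberately simplified model (it silently accepts previously unknown peers, whereas the real code ignores such updates and requests the full topology); the types and names are illustrative. The key behaviour is that a merge which changes nothing returns false, which is what stops re-broadcasting and hence the peer-to-peer propagation:

```go
package main

import "fmt"

// PeerInfo is one entry in a network update: a peer's version and
// its connections at that version.
type PeerInfo struct {
	Version     uint64
	Connections []string
}

// applyUpdate merges update into topo, keeping whichever entry has
// the higher version. It reports whether anything changed, i.e.
// whether the caller should re-broadcast the update to its peers.
func applyUpdate(topo, update map[string]PeerInfo) (changed bool) {
	for name, info := range update {
		if cur, ok := topo[name]; !ok || info.Version > cur.Version {
			topo[name] = info
			changed = true
		}
	}
	return changed
}

func main() {
	topo := map[string]PeerInfo{"A": {Version: 1, Connections: []string{"B"}}}
	upd := map[string]PeerInfo{"A": {Version: 2, Connections: []string{"B", "C"}}}
	fmt.Println(applyUpdate(topo, upd)) // true: re-broadcast to neighbours
	fmt.Println(applyUpdate(topo, upd)) // false: propagation stops here
}
```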
Routing
In general, we build up knowledge by looking only at layer 2
(ethernet) MAC addresses. Packets that we sniff from the local
interface must have the MAC address of a local interface, so we
associate such MAC addresses with ourselves. Packets that we receive
via UDP from other peers have the association embedded in the UDP
traffic, so we can associate the source MAC with the originating
peer. Even when a packet is relayed by an intermediate peer, we
preserve the information as to who originally sniffed the packet, so
that all peers can build up the same set of associations.
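The learning described above amounts to a MAC-to-peer cache; a minimal sketch (illustrative types only, not weave's actual structures):

```go
package main

import "fmt"

// MacCache maps a MAC address to the peer that owns it.
type MacCache map[string]string

// LearnLocal records a MAC sniffed from the local interface: it
// belongs to us.
func (c MacCache) LearnLocal(mac, localPeer string) { c[mac] = localPeer }

// LearnRemote records a MAC seen in encapsulated UDP traffic,
// attributed to the peer named in the packet as the original
// sniffer, even if the packet arrived via an intermediate relay.
func (c MacCache) LearnRemote(mac, originPeer string) { c[mac] = originPeer }

// Lookup answers which peer a destination MAC is associated with.
func (c MacCache) Lookup(mac string) (string, bool) {
	p, ok := c[mac]
	return p, ok
}

func main() {
	cache := MacCache{}
	cache.LearnLocal("aa:bb:cc:dd:ee:01", "A")
	// Relayed packet: we record the original sniffer, not the relayer.
	cache.LearnRemote("aa:bb:cc:dd:ee:02", "C")
	fmt.Println(cache.Lookup("aa:bb:cc:dd:ee:02")) // C true
}
```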
If we were implementing a hub, it would be legal to just broadcast all
packets to everyone. That would work, but it would be wasteful. Since
we are implementing something closer to a switch, if we know the
destination of a packet then we form a packet which includes the
sniffed frame, our own identity as the original sniffer of the
packet, and the destination peer identity (so that we don't rely on
intermediate peers having the same knowledge as us as to which MACs
are where). We then consult the routing tables to determine which of
our connections we should use in order to get the packet to its
ultimate destination peer. The packet we form does not include any
information as to the route we expect it to take - we merely
determine the next hop and entrust it with the onward routing. Any
intermediate peer that receives the packet can identify the
destination peer and then similarly consult its own routing tables to
determine the next hop. The intermediate peer does not need to know
in advance the association between the destination MAC and the
destination peer. Intermediate peers do however decode the frame
sufficiently to record the association between the source peer and
the source MAC.
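The packet formed above can be sketched as a simple header in front of the sniffed frame. The wire layout here is made up for illustration (weave's actual encoding differs); what matters is that only the original sniffer and the destination peer are carried, never a route:

```go
package main

import (
	"bytes"
	"fmt"
)

// encode prefixes the sniffed frame with the identity of the
// original sniffer and the destination peer, as length-prefixed
// strings. No route information is included: each hop determines
// the next hop itself.
func encode(origin, dst string, frame []byte) []byte {
	var b bytes.Buffer
	b.WriteByte(byte(len(origin)))
	b.WriteString(origin)
	b.WriteByte(byte(len(dst)))
	b.WriteString(dst)
	b.Write(frame)
	return b.Bytes()
}

// decode recovers the original sniffer, destination peer and frame;
// an intermediate peer needs exactly this much to relay the packet
// and to record the origin-peer/source-MAC association.
func decode(p []byte) (origin, dst string, frame []byte) {
	n := int(p[0])
	origin = string(p[1 : 1+n])
	p = p[1+n:]
	m := int(p[0])
	dst = string(p[1 : 1+m])
	return origin, dst, p[1+m:]
}

func main() {
	pkt := encode("A", "C", []byte{0xde, 0xad})
	origin, dst, frame := decode(pkt)
	fmt.Println(origin, dst, frame) // A C [222 173]
}
```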
PMTU
The PMTU is the lowest MTU of any hop on the path between two
nodes. In general, it is beneficial to know the PMTU so that you can
perform any necessary fragmentation of packets at the endpoints of
the path, avoiding refragmentation en route. If you allow
refragmentation to occur then you can end up with many small packets
and network performance can suffer. To discover the PMTU, there is an
IP flag called "Don't Fragment" (DF). With this set, if a node
receives a packet that is bigger than the next hop's MTU, it is
required to drop the packet and send back an ICMP 3,4 packet which
informs the sending node of the next-hop MTU (RFC 1191). In theory,
these ICMP packets should have all the reverse NAT and so forth
applied, so that they make it all the way back to the sender. In
practice this often works, but in some networks all ICMP is blocked,
typically by firewalls configured without regard for the fact that
ICMP is really the error channel of all network traffic.
Because of the encapsulation overheads, it is important that weave
respects PMTU. If it sniffs a packet of X bytes, the weave-to-weave
traffic will be some N bytes bigger than X. If the packet happens to
be an IP packet with DF set then we should set DF on the larger
packet we send between weaves. If X+N exceeds the PMTU between weave
peers then our send will error. We will then hopefully be able to
query what the actual PMTU is, subtract N from it, and send a
resulting ICMP 3,4 packet back to the original sender.
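The arithmetic above can be made concrete; the overhead constant here is a hypothetical placeholder, not weave's actual value:

```go
package main

import "fmt"

// overheadN is a hypothetical encapsulation overhead: the number of
// bytes weave-to-weave traffic adds to a sniffed packet.
const overheadN = 48

// needsICMP reports whether a DF-marked sniffed packet of x bytes
// cannot be sent: it can't when the encapsulated size x+N would
// exceed the PMTU between weave peers.
func needsICMP(x, interPeerPMTU int) bool {
	return x+overheadN > interPeerPMTU
}

// mtuToReport is the next-hop MTU to put in the ICMP 3,4 sent back
// to the original sender: the inter-peer PMTU minus our overhead N.
func mtuToReport(interPeerPMTU int) int {
	return interPeerPMTU - overheadN
}

func main() {
	fmt.Println(needsICMP(1500, 1500)) // true: 1548 > 1500
	fmt.Println(mtuToReport(1500))     // 1452
}
```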
It is frequently the case that large UDP packets without DF set
(either the sending side chose not to set DF, or the packet is
greater than the PMTU and so cannot have DF set) get dropped.
Ideally, large packets without DF set would just get transparently
fragmented and reassembled with no packet loss, but in reality this
often doesn't happen. Therefore weave tests whether or not
fragmentation of large UDP packets between peers is reliable, and
retests from time to time.
If weave determines that fragmentation is reliable then when weave
sniffs large packets (i.e. packets larger than the MTU, which thus
cannot have DF set), weave will encapsulate these packets as
necessary and send them out as-is, without DF set, trusting the
network to do all necessary fragmentation and reassembly.
However, if weave determines that fragmentation is not reliable
between any two peers then it will manually fragment larger packets
correctly according to the IP spec, and will then send them between
weave peers with the DF flag set (i.e. the fragmentation will ensure
that the encapsulated traffic will not be greater than the PMTU
between weave peers). Because the fragmentation is done according to
the IP spec, we don't need to do reassembly ourselves - on the
receiving weave, we just inject all the fragments and rely on the
stack to do reassembly.
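The fragment sizing this requires can be sketched as follows. Per the IP spec, fragment offsets are expressed in 8-octet units, so every fragment except the last must carry a payload that is a multiple of 8 bytes; the header size here assumes a plain 20-byte IPv4 header with no options:

```go
package main

import "fmt"

const ipHeader = 20 // IPv4 header without options

// fragmentSizes returns the payload length of each fragment of a
// packet whose IP payload is total bytes, such that every fragment
// fits within pmtu and all but the last are multiples of 8 bytes
// (so fragment offsets remain valid and the receiving stack can
// reassemble).
func fragmentSizes(total, pmtu int) []int {
	max := (pmtu - ipHeader) &^ 7 // round down to a multiple of 8
	var sizes []int
	for total > 0 {
		n := max
		if total < n {
			n = total
		}
		sizes = append(sizes, n)
		total -= n
	}
	return sizes
}

func main() {
	fmt.Println(fragmentSizes(3000, 1500)) // [1480 1480 40]
}
```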
Sometimes PMTU discovery doesn't work, as ICMP packets may be
dropped by firewalls. Whenever weave sends between peers with DF set
and gets an error informing it of a new PMTU, it will attempt to
verify that PMTU by sending packets of exactly that size to the
remote peer. If it gets no indication from the remote weave that
these packets have been received within a timeout period, it will
conduct a binary search - sending packets of different sizes - to
determine the PMTU.
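The binary search can be sketched as below. The probe function is a stand-in for sending a packet of the given size and awaiting acknowledgement from the remote weave within a timeout; the search finds the largest size that gets through:

```go
package main

import "fmt"

// searchPMTU binary-searches [lo, hi] for the largest packet size
// that probe reports as delivered.
func searchPMTU(lo, hi int, probe func(size int) bool) int {
	for lo < hi {
		mid := (lo + hi + 1) / 2
		if probe(mid) {
			lo = mid // mid got through: the PMTU is at least mid
		} else {
			hi = mid - 1 // mid was dropped: the PMTU is smaller
		}
	}
	return lo
}

func main() {
	// Simulated network whose actual PMTU is 1400.
	probe := func(size int) bool { return size <= 1400 }
	fmt.Println(searchPMTU(576, 1500, probe)) // 1400
}
```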