architecture.txt
Router:
- Starts Routes
- Starts Connection Maker
- Starts TCP Listener
- Starts UDP Listener
- Sniffs traffic
- Starts the local peer
- UDP Listener
This is a process that reads off the UDP socket, decodes the frames
and does either relaying to a connected peer (or peers), or
injection to the local interface (or both) (router.listenUDP /
router.handleUDPPacketFunc)
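The relay-or-inject decision can be sketched as follows. This is a simplified illustration, not the actual router.handleUDPPacketFunc logic; the function and parameter names are made up. A decoded frame carries the destination peer name (here "" stands for a broadcast), and the routing tables supply the next hops:

```go
package main

import "fmt"

// dispatch decides what to do with a frame decoded off the UDP
// socket: inject it into the local interface, relay it to one or
// more connected peers, or both. unicast maps destination peer ->
// next hop; broadcast lists the peers we must pass a broadcast on to.
func dispatch(local, dst string, unicast map[string]string, broadcast []string) (inject bool, relayTo []string) {
	switch dst {
	case local:
		return true, nil // addressed to us: inject locally only
	case "":
		return true, broadcast // broadcast: inject locally AND pass on
	default:
		if hop, ok := unicast[dst]; ok {
			return false, []string{hop} // relay towards the destination
		}
		return false, nil // unknown destination: drop
	}
}

func main() {
	uni := map[string]string{"C": "B"}
	inject, relay := dispatch("A", "", uni, []string{"B", "D"})
	fmt.Println(inject, relay) // true [B D]
}
```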
- Traffic Sniffer
This is a process that captures packets with pcap, decodes the
frames and forwards them to a local peer (or peers) (router.Sniff /
router.handleCapturedPacket)
- Routes
Maintains unicast and broadcast routing tables which can be read
directly from any thread by just obtaining a read lock on Routes.
- Unicast answers the question "If I want to send a packet to peer
X, which of the peers I'm connected to should I send the packet
to?"
- Broadcast answers the question "If peer X broadcasts a packet, by
the time it reaches me, which of my connected peers is X expecting
me to pass the packet to?"
Also spawns a thread which runs an actor loop, responding to
requests to rebuild the routing tables.
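The shape of Routes can be sketched like this; the field and method names are illustrative, not the actual weave API, but the pattern of direct reads under a read lock is the one described above:

```go
package main

import (
	"fmt"
	"sync"
)

// Routes holds the unicast and broadcast routing tables, guarded by
// a read-write lock so any thread can consult them directly.
type Routes struct {
	sync.RWMutex
	unicast   map[string]string   // destination peer -> next hop
	broadcast map[string][]string // originating peer -> next hops to relay to
}

// UnicastHop answers: to send a packet to peer dst, which of the
// peers I'm connected to should I send it to?
func (r *Routes) UnicastHop(dst string) (string, bool) {
	r.RLock()
	defer r.RUnlock()
	hop, ok := r.unicast[dst]
	return hop, ok
}

// BroadcastHops answers: for a broadcast originated by origin, which
// of my connected peers is origin expecting me to pass the packet to?
func (r *Routes) BroadcastHops(origin string) []string {
	r.RLock()
	defer r.RUnlock()
	return r.broadcast[origin]
}

func main() {
	r := &Routes{
		unicast:   map[string]string{"C": "B"},
		broadcast: map[string][]string{"A": {"B", "D"}},
	}
	hop, _ := r.UnicastHop("C")
	fmt.Println(hop)                  // B
	fmt.Println(r.BroadcastHops("A")) // [B D]
}
```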
- Connection Maker
Spawns a thread which runs an actor loop. This actor loop is passed
the known locations of peers as they're discovered and is informed
when connections we have made die. It proactively tries to create
connections to remote peers we're not connected to using a random
exponential backoff to regulate the period between connection
attempts.
- Local Peer
This is the representation of the local peer. The local peer is only
peer that is "active" in any running weave. Inactive peers simply
get given some state such their name, version, UID and
connections. These are set when a network update is received from a
remote peer. This state can be read directly whilst holding a read
lock regardless of whether the peer is local or not.
The local active peer spawns an actor thread which is mainly used by
the local connections. The actor thread manages state changes
relating to the local connections: identifying duplicate
connections, and broadcasting network updates when its set of
connections (and hence version) changes.
The local peer is directly called by the router traffic sniffer and
router udp listener processes to send traffic to neighbouring
peers. This is done directly with read locks. In these methods, we
inspect the unicast and broadcast routing tables as generated by
Routes.
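The per-peer state and its locking discipline can be sketched as follows. All names are illustrative; the point is that every peer, local or not, carries the same readable state, while changes to the local peer are serialised through its actor thread:

```go
package main

import (
	"fmt"
	"sync"
)

// Peer is the state every peer carries: name, UID, version and
// connection set, readable from any thread under a read lock.
type Peer struct {
	sync.RWMutex
	Name        string
	UID         uint64
	version     uint64
	connections map[string]bool // names of directly connected peers
}

// Version reads the peer's version under a read lock.
func (p *Peer) Version() uint64 {
	p.RLock()
	defer p.RUnlock()
	return p.version
}

// addConnection is the kind of change that, on the local peer, is
// driven by the actor thread; adding a connection bumps the version.
func (p *Peer) addConnection(name string) {
	p.Lock()
	defer p.Unlock()
	p.connections[name] = true
	p.version++
}

func main() {
	p := &Peer{Name: "A", UID: 42, connections: map[string]bool{}}
	p.addConnection("B")
	fmt.Println(p.Version()) // 1
}
```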
- Connections
The connection life cycle is as follows:
1. Either a TCP connection is created to, or received from, the remote peer
2. We spawn a thread which will eventually become the connection actor loop
3. This thread starts by doing the TCP handshake directly
4. Assuming this succeeds, a TCP receiver thread is spawned. This
simply responds to traffic received from the remote peer via TCP
5. We register this connection with the local peer
6a. If we initiated this connection then we now start sending fast
    heartbeats to the remote peer so that the remote peer can
    determine what address/port it should use to send UDP back to
    us. To do this, we spawn off a "forwarder" thread to send
    heartbeats, monitor incoming heartbeats, and perform some other
    auxiliary duties. It also consumes frames to be encapsulated and
    sent via UDP from two channels, for the DF and non-DF cases. In
    the non-DF case, it can just send the packets out of the UDP
    Listener socket. In the DF case, it needs its own socket so that
    it can do PMTU discovery easily. To do this, it uses a raw IP
    socket (IP has no ports, so there's no collision issue with the
    UDP Listener socket), and so it must add the UDP headers itself.
6b. If we did not initiate this connection then the UDP Listener
    should start receiving fast heartbeats from the remote peer. From
    those it should be able to identify the local connection via the
    local peer. It will tell the local connection (by communicating
    with the actor thread) about the UDP address of the remote
    peer. The local connection will then start its forwarder thread
    as described in 6a, and start sending fast heartbeats. We send to
    the remote peer via TCP a ConnectionEstablished message. The
    remote peer receives this (on the TCP receiver process) and tells
    the connection actor process, which then replaces the fast
    heartbeater with a slow heartbeater and marks the connection as
    established (which means it is included in network updates
    broadcast to our peers).
6c. When the connection initiator receives the fast heartbeat from
    the remote peer, it sends to the remote peer via TCP a
    ConnectionEstablished message. This is handled by the remote peer
    as described in 6b.
7. Whenever a connection is established or terminated, the local
   peer's version is incremented. Whenever this happens, the peer
   generates a network update message which is broadcast to its
   directly connected neighbours via TCP. This network update message
   contains the relevant changes to the network topology due to the
   connection change. When such a message is received by a TCP
   receiver thread, it applies the update to the local model of the
   network topology. This may fail for a number of reasons (for
   example the update may contain references to peers of which we
   have no prior knowledge; in this case, we ignore the update and
   send back to the peer from whom we received it a request for the
   complete network topology), or it may apply and elicit some
   changes to our model. If it does elicit some changes then we send
   an updated update message to all our peers. In this way changes
   are passed quickly from peer to peer along the established
   connections, and stop being sent once a received update causes no
   changes to a peer's topology model. Changes are additive and care
   is taken to ensure that no loops can occur.
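The propagation rule in step 7 can be sketched as below. This is a deliberately simplified model (it silently accepts previously unknown peers, whereas the real code ignores such updates and requests the full topology); the types and names are illustrative. The key behaviour is that a merge which changes nothing returns false, which is what stops re-broadcasting and hence the peer-to-peer propagation:

```go
package main

import "fmt"

// PeerInfo is one entry in a network update: a peer's version and
// its connections at that version.
type PeerInfo struct {
	Version     uint64
	Connections []string
}

// applyUpdate merges update into topo, keeping whichever entry has
// the higher version. It reports whether anything changed, i.e.
// whether the caller should re-broadcast the update to its peers.
func applyUpdate(topo, update map[string]PeerInfo) (changed bool) {
	for name, info := range update {
		if cur, ok := topo[name]; !ok || info.Version > cur.Version {
			topo[name] = info
			changed = true
		}
	}
	return changed
}

func main() {
	topo := map[string]PeerInfo{"A": {Version: 1, Connections: []string{"B"}}}
	upd := map[string]PeerInfo{"A": {Version: 2, Connections: []string{"B", "C"}}}
	fmt.Println(applyUpdate(topo, upd)) // true: re-broadcast to neighbours
	fmt.Println(applyUpdate(topo, upd)) // false: propagation stops here
}
```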
Routing
In general, we build up knowledge by looking only at layer 2
(ethernet) MAC addresses. Packets that we sniff from the local
interface must have the MAC address of a local interface, so we
associate such MAC addresses with ourselves. Packets that we receive
via UDP from other peers have the association embedded in the UDP
traffic, so we can associate the source MAC with the originating
peer. Even when a packet is relayed by an intermediate peer, we
preserve the information as to who originally sniffed the packet, so
that all peers can build up the same set of associations.
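The learning described above amounts to a MAC-to-peer cache; a minimal sketch (illustrative types only, not weave's actual structures):

```go
package main

import "fmt"

// MacCache maps a MAC address to the peer that owns it.
type MacCache map[string]string

// LearnLocal records a MAC sniffed from the local interface: it
// belongs to us.
func (c MacCache) LearnLocal(mac, localPeer string) { c[mac] = localPeer }

// LearnRemote records a MAC seen in encapsulated UDP traffic,
// attributed to the peer named in the packet as the original
// sniffer, even if the packet arrived via an intermediate relay.
func (c MacCache) LearnRemote(mac, originPeer string) { c[mac] = originPeer }

// Lookup answers which peer a destination MAC is associated with.
func (c MacCache) Lookup(mac string) (string, bool) {
	p, ok := c[mac]
	return p, ok
}

func main() {
	cache := MacCache{}
	cache.LearnLocal("aa:bb:cc:dd:ee:01", "A")
	// Relayed packet: we record the original sniffer, not the relayer.
	cache.LearnRemote("aa:bb:cc:dd:ee:02", "C")
	fmt.Println(cache.Lookup("aa:bb:cc:dd:ee:02")) // C true
}
```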
If we were implementing a hub, it would be legal to just broadcast all
packets to everyone. That would work, but it would be wasteful. Since
we are implementing something closer to a switch, if we know the
destination of a packet then we form a packet which includes the
sniffed frame, our own identity as the original sniffer of the
packet, and the destination peer identity (so that we don't rely on
intermediate peers having the same knowledge as us as to which MACs
are where). We then consult the routing tables to determine which of
our connections we should use in order to get the packet to its
ultimate destination peer. The packet we form does not include any
information as to the route we expect it to take - we merely
determine the next hop and entrust it with the onward routing. Any
intermediate peer that receives the packet can identify the
destination peer and then similarly consult its own routing tables to
determine the next hop. The intermediate peer does not need to know
in advance the association between the destination MAC and the
destination peer. Intermediate peers do however decode the frame
sufficiently to record the association between the source peer and
the source MAC.
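The packet formed above can be sketched as a simple header in front of the sniffed frame. The wire layout here is made up for illustration (weave's actual encoding differs); what matters is that only the original sniffer and the destination peer are carried, never a route:

```go
package main

import (
	"bytes"
	"fmt"
)

// encode prefixes the sniffed frame with the identity of the
// original sniffer and the destination peer, as length-prefixed
// strings. No route information is included: each hop determines
// the next hop itself.
func encode(origin, dst string, frame []byte) []byte {
	var b bytes.Buffer
	b.WriteByte(byte(len(origin)))
	b.WriteString(origin)
	b.WriteByte(byte(len(dst)))
	b.WriteString(dst)
	b.Write(frame)
	return b.Bytes()
}

// decode recovers the original sniffer, destination peer and frame;
// an intermediate peer needs exactly this much to relay the packet
// and to record the origin-peer/source-MAC association.
func decode(p []byte) (origin, dst string, frame []byte) {
	n := int(p[0])
	origin = string(p[1 : 1+n])
	p = p[1+n:]
	m := int(p[0])
	dst = string(p[1 : 1+m])
	return origin, dst, p[1+m:]
}

func main() {
	pkt := encode("A", "C", []byte{0xde, 0xad})
	origin, dst, frame := decode(pkt)
	fmt.Println(origin, dst, frame) // A C [222 173]
}
```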
PMTU
The PMTU is the lowest MTU of any hop on the path between two
nodes. In general, it is beneficial to know the PMTU so that you can
perform any necessary fragmentation of packets at the endpoints of
the path, avoiding refragmentation en route. If you allow
refragmentation to occur then you can end up with many small packets
and network performance can suffer. To discover the PMTU, there is an
IP flag called "Don't Fragment" (DF). With this set, if a node
receives a packet that is bigger than the next hop's MTU, it is
required to drop the packet and send back an ICMP 3,4 packet which
informs the sending node of the next-hop MTU (RFC 1191). In theory,
these ICMP packets should have all the reverse NAT and so forth
applied, so that they make it all the way back to the sender. In
practice this often works, but in some networks all ICMP is blocked,
typically by firewalls configured without regard for the fact that
ICMP is really the error channel of all network traffic.
Because of the encapsulation overheads, it is important that weave
respects PMTU. If it sniffs a packet of X bytes, the weave-to-weave
traffic will be some N bytes bigger than X. If the packet happens to
be an IP packet with DF set then we should set DF on the larger
packet we send between weaves. If X+N exceeds the PMTU between weave
peers then our send will error. We will then hopefully be able to
query what the actual PMTU is, subtract N from it, and send a
resulting ICMP 3,4 packet back to the original sender.
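The arithmetic above can be made concrete; the overhead constant here is a hypothetical placeholder, not weave's actual value:

```go
package main

import "fmt"

// overheadN is a hypothetical encapsulation overhead: the number of
// bytes weave-to-weave traffic adds to a sniffed packet.
const overheadN = 48

// needsICMP reports whether a DF-marked sniffed packet of x bytes
// cannot be sent: it can't when the encapsulated size x+N would
// exceed the PMTU between weave peers.
func needsICMP(x, interPeerPMTU int) bool {
	return x+overheadN > interPeerPMTU
}

// mtuToReport is the next-hop MTU to put in the ICMP 3,4 sent back
// to the original sender: the inter-peer PMTU minus our overhead N.
func mtuToReport(interPeerPMTU int) int {
	return interPeerPMTU - overheadN
}

func main() {
	fmt.Println(needsICMP(1500, 1500)) // true: 1548 > 1500
	fmt.Println(mtuToReport(1500))     // 1452
}
```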
It is frequently the case that large UDP packets without DF set
(either the sending side chose not to set DF, or the packet is
greater than the PMTU and so cannot have DF set) get dropped.
Ideally, large packets without DF set would just get transparently
fragmented and reassembled with no packet loss, but in reality this
often doesn't happen. Therefore weave tests whether or not
fragmentation of large UDP packets between peers is reliable, and
retests from time to time.
If weave determines that fragmentation is reliable then when weave
sniffs large packets (i.e. packets larger than the MTU, which thus
cannot have DF set), weave will encapsulate these packets as
necessary and send them out as-is, without DF set, trusting the
network to do all necessary fragmentation and reassembly.
However, if weave determines that fragmentation is not reliable
between any two peers then it will manually fragment larger packets
correctly according to the IP spec, and will then send them between
weave peers with the DF flag set (i.e. the fragmentation will ensure
that the encapsulated traffic will not be greater than the PMTU
between weave peers). Because the fragmentation is done according to
the IP spec, we don't need to do reassembly ourselves - on the
receiving weave, we just inject all the fragments and rely on the
stack to do reassembly.
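The fragment sizing this requires can be sketched as follows. Per the IP spec, fragment offsets are expressed in 8-octet units, so every fragment except the last must carry a payload that is a multiple of 8 bytes; the header size here assumes a plain 20-byte IPv4 header with no options:

```go
package main

import "fmt"

const ipHeader = 20 // IPv4 header without options

// fragmentSizes returns the payload length of each fragment of a
// packet whose IP payload is total bytes, such that every fragment
// fits within pmtu and all but the last are multiples of 8 bytes
// (so fragment offsets remain valid and the receiving stack can
// reassemble).
func fragmentSizes(total, pmtu int) []int {
	max := (pmtu - ipHeader) &^ 7 // round down to a multiple of 8
	var sizes []int
	for total > 0 {
		n := max
		if total < n {
			n = total
		}
		sizes = append(sizes, n)
		total -= n
	}
	return sizes
}

func main() {
	fmt.Println(fragmentSizes(3000, 1500)) // [1480 1480 40]
}
```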
Sometimes PMTU discovery doesn't work, as ICMP packets may be
dropped by firewalls. Whenever weave sends between peers with DF set
and gets an error informing it of a new PMTU, it will attempt to
verify that PMTU by sending packets of exactly that size to the
remote peer. If it gets no indication from the remote weave that
these packets have been received within a timeout period, it will
conduct a binary search - sending packets of different sizes - to
determine the PMTU.
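The binary search can be sketched as below. The probe function is a stand-in for sending a packet of the given size and awaiting acknowledgement from the remote weave within a timeout; the search finds the largest size that gets through:

```go
package main

import "fmt"

// searchPMTU binary-searches [lo, hi] for the largest packet size
// that probe reports as delivered.
func searchPMTU(lo, hi int, probe func(size int) bool) int {
	for lo < hi {
		mid := (lo + hi + 1) / 2
		if probe(mid) {
			lo = mid // mid got through: the PMTU is at least mid
		} else {
			hi = mid - 1 // mid was dropped: the PMTU is smaller
		}
	}
	return lo
}

func main() {
	// Simulated network whose actual PMTU is 1400.
	probe := func(size int) bool { return size <= 1400 }
	fmt.Println(searchPMTU(576, 1500, probe)) // 1400
}
```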