
Overlay Method Selection

Weave automatically selects the best working overlay implementation when connecting to a new peer. It does this by starting both forwarders in parallel and settling on the most preferred forwarder that becomes established, at which point the other forwarder is shut down if it has not already failed. If the remaining forwarder fails at a later time (e.g. because of a heartbeat timeout), the control plane TCP connection is recycled, allowing the selection algorithm to run again. By default, fast datapath is preferred over Sleeve.
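The selection rule above can be sketched as follows. This is a minimal illustration rather than the router's actual code: the `Forwarder` class, its fields and `select_forwarder` are hypothetical names, and the sketch assumes that a more preferred forwarder which is still pending is left running while less preferred ones are stopped.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Forwarder:
    """Hypothetical stand-in for an overlay forwarder (e.g. fastdp or sleeve)."""
    name: str
    established: bool = False
    failed: bool = False
    stopped: bool = False


def select_forwarder(forwarders: Sequence[Forwarder]) -> Optional[Forwarder]:
    """Pick the most preferred established forwarder.

    `forwarders` is ordered from most to least preferred. Less preferred
    forwarders that have not already failed are marked as stopped.
    Returns None while nothing is established.
    """
    for i, candidate in enumerate(forwarders):
        if candidate.established and not candidate.failed:
            for other in forwarders[i + 1:]:
                if not other.failed:
                    other.stopped = True
            return candidate
    return None
```

Calling `select_forwarder` again after a state change (a forwarder becoming established or failing) re-evaluates the choice under the same preference order.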

Local Bridging

Unlike the weave bridge netdev used by Sleeve, the OVS datapath has no inherent bridging capability. Consequently fastdp implements an ethernet bridge in addition to the vxlan overlay, maintaining its own forwarding database by learning MAC addresses and dispatching broadcast, unicast & multicast traffic accordingly.
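As an illustration of the bridging behaviour described above, here is a minimal learning-bridge sketch. It is not the fastdp implementation; `LearningBridge` and its methods are hypothetical names, ports are plain strings, and the forwarding database never ages entries out.

```python
from typing import Dict, List, Set


def is_group_address(mac: str) -> bool:
    """True for broadcast and multicast MACs (low bit of the first octet set)."""
    return bool(int(mac.split(":")[0], 16) & 1)


class LearningBridge:
    def __init__(self) -> None:
        self.ports: Set[str] = set()
        self.fdb: Dict[str, str] = {}  # MAC address -> port it was last seen on

    def add_port(self, port: str) -> None:
        self.ports.add(port)

    def forward(self, in_port: str, src_mac: str, dst_mac: str) -> List[str]:
        """Learn the source MAC, then return the ports to send the frame to."""
        self.fdb[src_mac] = in_port
        if not is_group_address(dst_mac):
            out_port = self.fdb.get(dst_mac)
            if out_port is not None:
                # Known unicast destination: a single port, never the ingress.
                return [] if out_port == in_port else [out_port]
        # Broadcast, multicast or unknown unicast: flood to all other ports.
        return sorted(p for p in self.ports if p != in_port)
```

The first frame towards an unknown MAC is flooded; once the destination replies, its MAC has been learned and later frames go to one port only.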

Short Peer IDs

The router relies on having access to the source and destination peer when deciding how to forward packets. When the Sleeve overlay is in use this information is conveyed directly within the encapsulation by including the names of the peers in question, a solution not accommodated directly by the vxlan wire format. Fortunately vxlan does have a twenty four bit segment ID field in the header that we can use to encode this data. Since both the source and the destination peer must fit into those twenty four bits, the challenge is to identify each peer uniquely with a twelve bit identifier instead of the seventeen bytes used by Sleeve. In practice this is achieved by each peer adopting a random 'short ID' and resolving ownership collisions via the existing gossip mechanism.
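The packing of two twelve bit short IDs into the twenty four bit segment ID can be sketched like this. The bit layout (source in the high twelve bits, destination in the low twelve) is an assumption made for illustration, as are the function names; the text above only states that both IDs share the field.

```python
SHORT_ID_BITS = 12
SHORT_ID_MASK = (1 << SHORT_ID_BITS) - 1  # 0xFFF, i.e. 4096 possible short IDs


def encode_vni(src_short_id: int, dst_short_id: int) -> int:
    """Pack two short IDs into a 24-bit vxlan segment ID (assumed layout)."""
    for short_id in (src_short_id, dst_short_id):
        if not 0 <= short_id <= SHORT_ID_MASK:
            raise ValueError(f"short ID out of range: {short_id}")
    return (src_short_id << SHORT_ID_BITS) | dst_short_id


def decode_vni(vni: int) -> tuple:
    """Recover (source, destination) short IDs from a segment ID."""
    return (vni >> SHORT_ID_BITS) & SHORT_ID_MASK, vni & SHORT_ID_MASK
```

With only 4096 values available, two peers can pick the same random short ID, which is why collision resolution via gossip is needed.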

Heartbeats via vxlan

Heartbeats are implemented by sending an encapsulated ethernet frame with source and destination MAC addresses of 00:00:00:00:00:00 and a payload consisting of the connection UID and total ethernet frame length. The vxlan vport miss handler on the receiving side detects the all-zero addresses and acknowledges the heartbeat via the TCP control channel after validating the connection UID and frame length against their expected values.
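A heartbeat frame of this shape could be built and validated roughly as below. The field widths (an eight byte connection UID followed by a two byte frame length), the zero ethertype and the zero padding are assumptions for the sketch; the text above specifies only the all-zero MAC addresses and the two payload values.

```python
import struct

ETH_HEADER_LEN = 14                # dst MAC + src MAC + ethertype
_PAYLOAD = struct.Struct("!QH")    # assumed: uint64 connection UID, uint16 frame length


def build_heartbeat(conn_uid: int, frame_len: int) -> bytes:
    """Build an ethernet frame of exactly `frame_len` bytes with all-zero MACs."""
    if not ETH_HEADER_LEN + _PAYLOAD.size <= frame_len <= 0xFFFF:
        raise ValueError(f"unsupported frame length: {frame_len}")
    header = bytes(ETH_HEADER_LEN)  # zero MACs, zero ethertype (assumption)
    return (header + _PAYLOAD.pack(conn_uid, frame_len)).ljust(frame_len, b"\x00")


def is_heartbeat(frame: bytes) -> bool:
    """Detect the all-zero source and destination MAC addresses."""
    return len(frame) >= 12 and frame[:12] == bytes(12)


def validate_heartbeat(frame: bytes, expected_uid: int, expected_len: int) -> bool:
    """Check connection UID and frame length before acknowledging."""
    if not is_heartbeat(frame) or len(frame) < ETH_HEADER_LEN + _PAYLOAD.size:
        return False
    uid, length = _PAYLOAD.unpack_from(frame, ETH_HEADER_LEN)
    return uid == expected_uid and length == expected_len == len(frame)
```

A frame truncated in transit fails the length check, so it is never acknowledged.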

PMTU Discovery

The length of the heartbeat ethernet frame is set to the MTU of the datapath interface. In the event the vxlan packet is dropped or truncated, the heartbeat will not be acknowledged; this lack of acknowledgement will cause the peers to fall back to the Sleeve overlay, which has a more sophisticated dynamic mechanism for coping with low path MTUs.

To avoid triggering this fallback in typical deployments, the datapath interface is statically configured with an MTU of 1376 bytes allowing it to work with most underlay network provider MTUs, including GCE at 1460 bytes (the eighty four byte difference accommodates the encrypted vxlan overhead). This value can be overridden by setting WEAVE_MTU at launch if necessary.
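The arithmetic behind the default can be checked with a small sketch. The constants come from the text above; the helper names are hypothetical.

```python
GCE_MTU = 1460
ENCRYPTED_VXLAN_OVERHEAD = 84   # bytes, per the figure quoted above
DEFAULT_DATAPATH_MTU = 1376


def max_datapath_mtu(underlay_mtu: int, overhead: int = ENCRYPTED_VXLAN_OVERHEAD) -> int:
    """Largest datapath MTU whose encapsulated packets still fit the underlay."""
    return underlay_mtu - overhead


def fits_underlay(datapath_mtu: int, underlay_mtu: int,
                  overhead: int = ENCRYPTED_VXLAN_OVERHEAD) -> bool:
    """True if a full-size datapath packet plus overhead fits the underlay MTU."""
    return datapath_mtu + overhead <= underlay_mtu
```

On an underlay with a smaller MTU than GCE's, `fits_underlay` returns false for the default, which is the situation in which a lower WEAVE_MTU would be set.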

Virtual Ports

There are three kinds of virtual port associated with the weave datapath:

  • internal - one of these is created automatically, named after the datapath. It corresponds to the network interface of the same name that appears on the host when the datapath is created; it is the ingress/egress port for weave expose
  • netdev - one for each application container, corresponding to the host end of the veth pair
  • vxlan - one for each UDP port on which we are listening for vxlan encapsulated packets. Typically there is only one of these, but there may be more in a network in which peers do not all use the same port configuration

Misses and Flow Creation

As mentioned above, the datapath has no inherent behaviour - any ingressing packet which does not match a flow is passed to a userspace miss handler. The miss handler is then responsible for a) instructing the datapath to take actions for the packet that caused the miss (for example by copying it to one or more ports) and optionally b) installing flow rules which will allow the datapath to act on similar packets in future without invoking the miss handler.

At a high level the fast datapath implementation can be viewed as a set of miss handlers that determine what actions the OVS datapath should take based on router topology information, together with some additional machinery that manages the expiry of resulting flows when that state changes.

When a miss handler is invoked, it has two pieces of context: a byte array containing the packet that triggered the miss, and a set of 'flow keys' that have been extracted from the packet by the kernel module. The following flow keys are of interest to weave:

  • InPort - the identifier of the ingress virtual port
  • Ethernet - the source and destination MAC addresses of the ethernet frame
  • Tunnel - tunnel identifier (see section on short peer IDs above), and source/dest IPv4 addresses. Only present for ingress via a vxlan vport

Crucially, the router must use this information alone to determine which actions to take - this allows the specification of a flow which matches these keys and instructs the kernel to take action automatically in future without further userspace involvement.

Two actions are of interest:

  • Output - output the packet to the specified vport
  • SetTunnel - update the effective vxlan tunnel parameters. Only required prior to output via a vxlan vport

It is possible to have multiple actions in a flow, so the router can for example create a single rule that copies broadcast traffic to all 'local' vports (except the ingress vport) as well as to vxlan vports for onward peers as dictated by the routing topology.

Every flow specified by the router has the following characteristics:

  • An InPort key matching the ingress vport
  • A Tunnel key if the ingress vport is of type vxlan
  • An Ethernet key matching the source and destination MAC
  • A list of Output (for internal and netdev vports) and SetTunnel+Output (for vxlan vports) actions
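The shape of such a flow can be sketched with a few data classes. These types and `build_flow` are hypothetical and do not mirror the kernel module's or the router's actual structures; port identifiers are plain integers for simplicity.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple, Union


@dataclass(frozen=True)
class TunnelKey:
    tunnel_id: int
    src_ip: str
    dst_ip: str


@dataclass(frozen=True)
class FlowKeys:
    in_port: int
    eth_src: str
    eth_dst: str
    tunnel: Optional[TunnelKey] = None  # only set for ingress via a vxlan vport


@dataclass(frozen=True)
class Output:
    port: int


@dataclass(frozen=True)
class SetTunnel:
    tunnel_id: int
    src_ip: str
    dst_ip: str


Action = Union[Output, SetTunnel]


@dataclass(frozen=True)
class Flow:
    keys: FlowKeys
    actions: Tuple[Action, ...]


def build_flow(keys: FlowKeys, local_ports: Iterable[int],
               remote_targets: Iterable[Tuple[int, SetTunnel]]) -> Flow:
    """Output to local vports (never the ingress), SetTunnel+Output for vxlan."""
    actions = [Output(p) for p in local_ports if p != keys.in_port]
    for vxlan_port, tunnel in remote_targets:
        actions.append(tunnel)              # tunnel parameters first...
        actions.append(Output(vxlan_port))  # ...then output via the vxlan vport
    return Flow(keys, tuple(actions))
```

For a broadcast frame arriving on local port 1, with local ports 1 to 3 and one remote peer behind vxlan port 9, the resulting actions are Output(2), Output(3), SetTunnel(...), Output(9).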

Under certain circumstances the miss handler will instruct the datapath to execute a set of actions for a packet without creating a corresponding flow:

  • When broadcasting packets to discover the path to unknown destination MACs. This condition is transient in the overwhelming majority of cases and so the benefit of creating a flow is outweighed by the need to invalidate it once the MAC is learned
  • When the packet also needs to be forwarded to one or more peers via Sleeve. In this case the router needs to handle all subsequent matching packets as there is no OVS flow action to do it without userspace involvement
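The decision of whether to install a flow can be summarised in a short sketch. `MissResult` and `handle_miss` are hypothetical names, and the two boolean inputs stand in for state the router would derive from its forwarding database and topology.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MissResult:
    actions: List[str]   # actions executed for the packet that caused the miss
    install_flow: bool   # whether a matching flow is also installed


def handle_miss(dst_mac_known: bool, needs_sleeve: bool,
                actions: Sequence[str]) -> MissResult:
    """Always execute the actions; install a flow only when it is safe to do so.

    No flow is installed when the destination MAC is still unknown (the packet
    is being broadcast for discovery) or when any onward peer is reached via
    Sleeve (userspace must keep handling matching packets).
    """
    install = dst_mac_known and not needs_sleeve
    return MissResult(list(actions), install)
```

In both exceptional cases the packet that missed is still forwarded; only the shortcut for future packets is withheld.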

Flow Invalidation

Once a flow has been created for a particular combination of keys, the miss handler will never again be invoked for matching packets. It is therefore extremely important that we detect events which invalidate existing flow actions:

  • Addition of netdev vports to the datapath
  • Route invalidation (topology change)
  • Short peer ID collision

The response in each case is the same - delete all flows from the datapath, allowing them to be recreated taking into account the updated state.

In addition to these event based invalidations there is an expiry process that executes every five minutes. This process enumerates all flows in the datapath, removing any which have not been used since the last check; this cleans up:

  • Flows referring to netdev vports which have been removed
  • Flows created by forwarders which have been stopped
  • Flows related to MAC addresses which have not communicated recently

Flows keyed on tunnel IPv4 address do not need to be cleared when a peer appears to change IP address due to NAT; this will cause a miss resulting in a new flow, and the old flow will expire naturally via the timer mechanism.
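Both mechanisms, event based invalidation and periodic expiry, can be sketched against a simple in-memory flow table. `FlowTable` is a hypothetical stand-in for the datapath; in this sketch a newly installed flow counts as unused until a packet hits it, which is a simplifying assumption.

```python
from typing import Dict, Hashable, List


class FlowTable:
    def __init__(self) -> None:
        # flow key -> whether the flow was used since the last expiry check
        self._used: Dict[Hashable, bool] = {}

    def install(self, key: Hashable) -> None:
        self._used[key] = False

    def hit(self, key: Hashable) -> bool:
        """Record a packet matching `key`; False means a miss (no such flow)."""
        if key not in self._used:
            return False
        self._used[key] = True
        return True

    def keys(self) -> List[Hashable]:
        return list(self._used)

    def invalidate_all(self) -> None:
        """Event based invalidation: delete every flow; misses recreate them."""
        self._used.clear()

    def expire(self) -> List[Hashable]:
        """Periodic check: remove flows unused since the previous check."""
        removed = [k for k, used in self._used.items() if not used]
        for k in removed:
            del self._used[k]
        for k in self._used:
            self._used[k] = False  # start a fresh observation interval
        return removed
```

A flow keyed on a stale tunnel address simply stops being hit after the peer's address changes, so it is removed by a later call to `expire` without any explicit invalidation.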