This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

IPAM self-preservation prevents convergence #2084

Open
bboreham opened this issue Mar 22, 2016 · 2 comments

Comments

@bboreham
Contributor

Without persistence, if a peer gives away some space, then dies, restarts, and reconnects to a peer that has not yet heard about the donation, it will object and drop the connection when it later receives the real information.

Repro: in the context of weaveworks/weave/test (same script as #2083):

#! /bin/bash

. ./config.sh

start_suite "IPAM panic"

weave_on $HOST1 launch-router --no-discovery
weave_on $HOST1 expose
weave_on $HOST2 launch-router --no-discovery $HOST1
weave_on $HOST3 launch-router --no-discovery $HOST2
weave_on $HOST3 expose
weave_on $HOST1 stop-router
weave_on $HOST1 launch-router --no-discovery $HOST2

Wait a couple of minutes, then look at HOST1's weave logs:

INFO: 2016/03/22 11:50:25.219476 Command line options: map[nickname:host1 port:6783 http-addr:127.0.0.1:6784 ipalloc-range:10.32.0.0/12 dns-listen-address:172.17.0.1:53 name:46:98:1c:22:6d:b7 no-discovery:true datapath:datapath dns-effective-listen-address:172.17.0.1]
INFO: 2016/03/22 11:50:25.220158 Communication between peers is unencrypted.
INFO: 2016/03/22 11:50:25.222141 Our name is 46:98:1c:22:6d:b7(host1)
INFO: 2016/03/22 11:50:25.222186 Initial set of peers: [192.168.48.12]
INFO: 2016/03/22 11:50:25.223175 Docker API on unix:///var/run/docker.sock: &[GoVersion=go1.5.3 Os=linux Arch=amd64 KernelVersion=4.2.0-34-generic BuildTime=2016-03-10T15:59:07.784447681+00:00 Version=1.10.3 ApiVersion=1.22 GitCommit=20f81dd]
INFO: 2016/03/22 11:50:25.223201 Assuming quorum size of 2
INFO: 2016/03/22 11:50:25.223241 [allocator 46:98:1c:22:6d:b7] Initialising via deferred consensus
INFO: 2016/03/22 11:50:25.224131 Listening for DNS queries on 172.17.0.1
INFO: 2016/03/22 11:50:25.224162 Sniffing traffic on datapath (via ODP)
INFO: 2016/03/22 11:50:25.225853 Listening for HTTP control messages on 127.0.0.1:6784
2016/03/22 11:50:25 ->[192.168.48.12:6783] attempting connection
2016/03/22 11:50:25 ->[192.168.48.12:6783|5a:e6:43:04:ae:40(host2)]: connection ready; using protocol version 2
INFO: 2016/03/22 11:50:25.227115 overlay_switch ->[5a:e6:43:04:ae:40(host2)] using fastdp
2016/03/22 11:50:25 ->[192.168.48.12:6783|5a:e6:43:04:ae:40(host2)]: connection added (new peer)
2016/03/22 11:50:25 ->[192.168.48.12:6783|5a:e6:43:04:ae:40(host2)]: connection fully established
INFO: 2016/03/22 11:50:25.322053 Discovered remote MAC d2:0f:c2:f4:76:ba at 5a:e6:43:04:ae:40(host2)
INFO: 2016/03/22 11:50:25.604876 Discovered remote MAC 36:b2:7b:11:9e:2e at 8a:d3:d4:27:06:c9(host3)
INFO: 2016/03/22 11:50:25.729498 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2016/03/22 11:50:25.730394 sleeve ->[192.168.48.12:6783|5a:e6:43:04:ae:40(host2)]: Effective MTU verified at 1438
INFO: 2016/03/22 11:50:25.738291 Discovered remote MAC 5a:e6:43:04:ae:40 at 5a:e6:43:04:ae:40(host2)
INFO: 2016/03/22 11:50:25.744480 Discovered remote MAC 12:f3:8f:e5:07:19 at 8a:d3:d4:27:06:c9(host3)
INFO: 2016/03/22 11:50:25.758252 Discovered remote MAC 4a:a3:fc:f4:9d:60 at 5a:e6:43:04:ae:40(host2)
INFO: 2016/03/22 11:50:27.337294 Discovered local MAC da:71:8a:ef:ec:39
INFO: 2016/03/22 11:50:27.841293 Discovered local MAC 1a:6e:e3:1d:6d:1d
INFO: 2016/03/22 11:50:27.969342 Discovered local MAC 46:98:1c:22:6d:b7
INFO: 2016/03/22 11:50:28.996807 Discovered remote MAC 8a:d3:d4:27:06:c9 at 8a:d3:d4:27:06:c9(host3)
2016/03/22 11:50:55 ->[192.168.48.12:6783|5a:e6:43:04:ae:40(host2)]: connection shutting down due to error: read tcp4 192.168.48.11:35365->192.168.48.12:6783: read: connection reset by peer
2016/03/22 11:50:55 ->[192.168.48.12:6783|5a:e6:43:04:ae:40(host2)]: connection deleted
2016/03/22 11:50:55 Removed unreachable peer 5a:e6:43:04:ae:40(host2)
2016/03/22 11:50:55 Removed unreachable peer 8a:d3:d4:27:06:c9(host3)
INFO: 2016/03/22 11:50:55.234674 [nameserver 46:98:1c:22:6d:b7] peer 5a:e6:43:04:ae:40 gone
INFO: 2016/03/22 11:50:55.234700 [nameserver 46:98:1c:22:6d:b7] peer 8a:d3:d4:27:06:c9 gone
2016/03/22 11:50:55 ->[192.168.48.12:58020] connection accepted
2016/03/22 11:50:55 ->[192.168.48.12:58020|5a:e6:43:04:ae:40(host2)]: connection ready; using protocol version 2
INFO: 2016/03/22 11:50:55.590996 overlay_switch ->[5a:e6:43:04:ae:40(host2)] using fastdp
2016/03/22 11:50:55 ->[192.168.48.12:58020|5a:e6:43:04:ae:40(host2)]: connection added (new peer)
2016/03/22 11:50:55 ->[192.168.48.12:58020|5a:e6:43:04:ae:40(host2)]: connection shutting down due to error: Peer 8a:d3:d4:27:06:c9 says it owns the IP range from 10.40.0.1, which I think I own
2016/03/22 11:50:55 ->[192.168.48.12:58020|5a:e6:43:04:ae:40(host2)]: connection deleted
...

(note the first connection reset by peer is due to #2083)

@bboreham bboreham added the bug label Mar 22, 2016
@rade
Member

rade commented Apr 13, 2016

iirc one idea we had here is to only barf when the gossip we receive indicates that we have an actual allocation in a range supposedly owned by another peer. Alas, the ring code does not (and should not have to) know about allocations. So perhaps we need to make merging higher-order, taking some suitable predicate.
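
To make the "higher-order merge" idea concrete, here is a minimal sketch in Go. It is not the weave ring code: every name in it (`Range`, `Claim`, `ConflictPredicate`, `Merge`, `AlwaysConflict`, `HasAllocationIn`) is hypothetical, ranges are plain integer intervals matched exactly rather than ring tokens, and the view is a simple map. It only illustrates how the merge could stay ignorant of allocations while the caller supplies the predicate that decides whether a disagreement is fatal.

```go
package main

import "fmt"

// Range is a half-open interval [Start, End) of addresses, held as integers
// for simplicity. Hypothetical type, not the weave ring representation.
type Range struct {
	Start, End uint32
}

// Claim is one gossiped ownership entry: Owner says it owns Range.
type Claim struct {
	Owner string
	Range Range
}

// ConflictPredicate decides whether a remote claim over a range we believe
// we own should be treated as a fatal inconsistency.
type ConflictPredicate func(r Range) bool

// AlwaysConflict models the behaviour described in this issue: any
// disagreement about a range we think we own is fatal.
func AlwaysConflict(Range) bool { return true }

// HasAllocationIn builds a predicate that objects only when one of our
// allocated addresses falls inside the disputed range.
func HasAllocationIn(allocated []uint32) ConflictPredicate {
	return func(r Range) bool {
		for _, a := range allocated {
			if a >= r.Start && a < r.End {
				return true
			}
		}
		return false
	}
}

// Merge applies remote claims to our view (range -> owner). It knows nothing
// about allocations; the caller passes that knowledge in via conflict.
// The view is left unchanged if any claim is rejected.
func Merge(self string, view map[Range]string, remote []Claim, conflict ConflictPredicate) error {
	// First pass: look for claims that contradict our own ownership.
	for _, c := range remote {
		owner, known := view[c.Range]
		if known && owner == self && c.Owner != self && conflict(c.Range) {
			return fmt.Errorf("peer %s says it owns the range from %d, which I think I own",
				c.Owner, c.Range.Start)
		}
	}
	// Second pass: no fatal conflict, so accept the remote view.
	for _, c := range remote {
		view[c.Range] = c.Owner
	}
	return nil
}
```

Under these assumptions, a restarted peer with no allocations would call `Merge` with `HasAllocationIn(nil)` and quietly accept that the range now belongs to someone else, whereas `AlwaysConflict` reproduces the connection drop seen in the logs above.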

@bboreham
Contributor Author

bboreham commented May 5, 2016

If that is the solution then this is covered by #1962
