What you expected to happen?
I would expect the node to join the cluster normally when its Weave pod is restarted. Instead, not only did it fail to connect to the rest of the cluster properly, it also stopped responding to /status and the other HTTP endpoints.
What happened?
The node is in a fairly large 100+ node cluster, and was recently restarted. It is a part of a Kubernetes cluster and Weave is running in a pod. Version 2.6.0. The rest of the cluster is working, as was this node before the restart.
At first the weave status seemed OK, but the router was unable to form connections; all of them were stuck in a pending state. Oddly, the connection count is 112 even though the cluster is only configured with 105 targets:
curl http://localhost:6784/status
Version: 2.6.0 (up to date; next check at 2020/01/28 18:07:25)
Service: router
Protocol: weave 1..2
Name: fe:e7:41:bb:ba:b0(node37)
Encryption: enabled
PeerDiscovery: enabled
Targets: 105
Connections: 112 (111 pending, 1 failed)
Peers: 105 (with 10696 established, 209 pending connections)
TrustedSubnets: none
Service: ipam
Status: waiting for IP(s) to become available
Range: 10.2.0.0/16
DefaultSubnet: 10.2.0.0/16
But after a few minutes the Weave pod stopped responding to HTTP requests to /status and /status/connections.
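Since the endpoint later stopped answering entirely, a bounded curl loop with a timeout can watch the pending count after a restart. This is only a sketch, assuming the router's HTTP API on localhost:6784 as shown above:

```shell
# pending_of: extract the pending-connection count from a line like
#   Connections: 112 (111 pending, 1 failed)
pending_of() {
  sed -n 's/.*(\([0-9][0-9]*\) pending.*/\1/p'
}

# Poll a few times; --max-time guards against the hang described above,
# where the endpoint stops answering at all.
for attempt in 1 2 3; do
  out=$(curl -s --max-time 5 http://localhost:6784/status) || {
    echo "status endpoint unresponsive"
    break
  }
  printf '%s\n' "$out" | grep 'Connections:' | pending_of
  sleep 2
done
```

A healthy node should drive the pending count toward zero; a node in the state described here stays at ~111 pending and then stops replying.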
I was trying to reproduce this issue to understand it better and to verify that #3763 helps fix it. I tried on a 150-node cluster with encryption enabled and restarted random weave-net pods several times, but I am unable to reproduce the issue.
I ran kill -SIGQUIT on the weave process and got the following dump:
https://gist.github.com/tstm/ef3d6615dcd189f48c5970e991a11c92
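One quick way to gauge a dump like that is to count the goroutines in it. A sketch; the kubectl commands in the comments are illustrative only (the pod name is a placeholder, "weave" is the router container in the weave-net DaemonSet, and kill -QUIT 1 assumes weaver is PID 1 in that container):

```shell
# goroutine_count: count goroutines in a Go SIGQUIT traceback fed on
# stdin; a large pile-up of blocked goroutines suggests a deadlock.
goroutine_count() { grep -c '^goroutine '; }

# Obtaining such a dump from a running pod (placeholder pod name):
#   kubectl -n kube-system exec weave-net-abcde -c weave -- kill -QUIT 1
#   kubectl -n kube-system logs weave-net-abcde -c weave
```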
How to reproduce it?
In a large cluster, restart a weave pod without deleting the xfrm policies or the weave database file.
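A sketch of that restart step, assuming the stock weave-net DaemonSet in kube-system with its name=weave-net label. Deleting the pod lets the DaemonSet recreate it while the host's xfrm policies and Weave's persisted peer database stay in place, which is the condition described above:

```shell
# pick_pod: take `kubectl get pods` output on stdin and print the first
# pod name (skipping the header row).
pick_pod() {
  awk 'NR > 1 { print $1; exit }'
}

# Guarded so the sketch is a no-op without a cluster at hand.
if command -v kubectl >/dev/null 2>&1; then
  pod=$(kubectl -n kube-system get pods -l name=weave-net 2>/dev/null | pick_pod)
  if [ -n "$pod" ]; then
    # Restart via the DaemonSet; host xfrm state (ip xfrm policy) and
    # the on-disk peer database are intentionally left untouched.
    kubectl -n kube-system delete pod "$pod"
  fi
fi
```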
Anything else we need to know?
The cluster is running on bare metal. UFW is in use on the host nodes.
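Since UFW is in play, it is worth confirming it allows Weave's peer traffic (TCP 6783 and UDP 6783-6784, plus ESP when encrypted fast datapath is in use). A rough check, with has_port as a trivial helper:

```shell
# has_port: succeed if the given port pattern appears in rule output on
# stdin (assumption: ufw lists one rule per line).
has_port() { grep -q "$1"; }

# Guarded so this is a no-op on machines without ufw (or without root,
# in which case `ufw status` produces no rules).
if command -v ufw >/dev/null 2>&1; then
  if ufw status 2>/dev/null | has_port 6783; then
    echo "UFW has a rule mentioning 6783"
  else
    echo "no UFW rule for 6783; Weave peer traffic may be blocked"
  fi
fi
```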
Versions:
Logs:
The logs do not include the SIGQUIT dump above, and the IP addresses have been anonymized.
https://gist.github.com/tstm/57623d0944a4c2dd4a819457bc4e1234
Network:
The network has been checked and works fine; just a few minutes before this, Weave was able to connect and operate properly.