Weave not responding to /status or /status/connections a few minutes after restart #3762

Open
tstm opened this issue Jan 28, 2020 · 2 comments


tstm commented Jan 28, 2020

What you expected to happen?

I would expect a node to rejoin the cluster normally when its Weave pod is restarted. Instead, not only did it fail to connect to the rest of the cluster properly, it also stopped responding to /status and the other status endpoints.

What happened?

The node is part of a fairly large 100+ node Kubernetes cluster and was recently restarted; Weave 2.6.0 is running in a pod. The rest of the cluster is working, as was this node before the restart.

At first the weave status output seemed OK, but the node was unable to form connections; all of them were in a pending state. Oddly, the Connections count is 112 even though the cluster is only configured to have 105 nodes:

curl http://localhost:6784/status
        Version: 2.6.0 (up to date; next check at 2020/01/28 18:07:25)
        Service: router
       Protocol: weave 1..2
           Name: fe:e7:41:bb:ba:b0(node37)
     Encryption: enabled
  PeerDiscovery: enabled
        Targets: 105
    Connections: 112 (111 pending, 1 failed)
          Peers: 105 (with 10696 established, 209 pending connections)
 TrustedSubnets: none
        Service: ipam
         Status: waiting for IP(s) to become available
          Range: 10.2.0.0/16
  DefaultSubnet: 10.2.0.0/16

But after a few minutes the Weave pod stopped responding to HTTP requests to /status and /status/connections.

I ran a kill -SIGQUIT on the weave process and got the following dump:
https://gist.github.com/tstm/ef3d6615dcd189f48c5970e991a11c92
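
For reference, here is a minimal Go sketch (not Weave's code) of how a Go process can produce the same kind of goroutine dump on demand without being killed, which is handy when SIGQUIT's default behaviour of exiting is undesirable. The use of SIGUSR1 and writing to stderr are assumptions for illustration only.

package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

func main() {
	// Hypothetical example: dump all goroutine stacks (debug level 2, a
	// format similar to the SIGQUIT trace) whenever SIGUSR1 is received,
	// while keeping the process running.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)

	go func() {
		for range sigs {
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()

	select {} // stand-in for the program's real work
}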

How to reproduce it?

In a large cluster, restart a weave pod without deleting the xfrm policies or the weave database file.

Anything else we need to know?

The cluster is running on bare metal. UFW is in use on the host nodes.

Versions:

$ weave version
Version: 2.6.0
$ docker version
Client:
 Version:           18.09.5
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        e8ff056
 Built:             Thu Apr 11 04:43:57 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.5
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       e8ff056
  Built:            Thu Apr 11 04:10:53 2019
  OS/Arch:          linux/amd64
  Experimental:     false

$ uname -a
Linux node37 4.15.0-62-generic #69-Ubuntu SMP Wed Sep 4 20:55:53 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.3", GitCommit:"b3cbbae08ec52a7fc73d334838e18d17e8512749", GitTreeState:"clean", BuildDate:"2019-11-14T04:24:29Z", GoVersion:"go1.12.13", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.8", GitCommit:"211047e9a1922595eaa3a1127ed365e9299a6c23", GitTreeState:"clean", BuildDate:"2019-10-15T12:02:12Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}

Logs:

The logs do not include the SIGQUIT dump shown above, and the IP addresses have been anonymized.

$ kubectl logs -n kube-system <weave-net-pod> weave

https://gist.github.com/tstm/57623d0944a4c2dd4a819457bc4e1234

Network:

The network has been verified to work fine, and just a few minutes earlier Weave on this node was able to connect and work properly.

@bboreham
Contributor

Thanks for the report. In the goroutine dump I can see a lot of goroutines are blocked on the fastDatapathForwarder lock, like this:

goroutine 25 [semacquire, 67 minutes]:
sync.(*RWMutex).Lock(0xc000455b0c)
	/usr/local/go/src/sync/rwmutex.go:98 +0x97
github.com/weaveworks/weave/router.(*fastDatapathForwarder).handleVxlanSpecialPacket(0xc000455ad0, 0xc00346b5bc, 0x56e, 0x575, 0xc004caee40)
	/go/src/github.com/weaveworks/weave/router/fastdp.go:799 +0x51

That lock is held by a goroutine that is blocked waiting on an encryptedTCPSender lock, which in turn is held by a goroutine blocked on a TCP send.

In both cases it is questionable whether a lock should be held over blocking I/O, and whether that I/O should be allowed to block forever.
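
To make the shape of that hang concrete, here is a hedged Go sketch (illustrative names, not Weave's actual code) of a lock held across blocking I/O: a forwarder holds its mutex while calling a sender, the sender holds its own mutex across a blocking TCP write, and the HTTP status handler needs the forwarder mutex, so it hangs too.

package main

import (
	"net"
	"net/http"
	"sync"
)

type sender struct {
	mu   sync.Mutex
	conn net.Conn
}

func (s *sender) send(frame []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	// If the peer stops reading and the socket buffers fill up, this Write
	// can block indefinitely while s.mu is held. A write deadline, e.g.
	// s.conn.SetWriteDeadline(...), would bound the wait.
	_, err := s.conn.Write(frame)
	return err
}

type forwarder struct {
	mu sync.RWMutex
	s  *sender
}

func (f *forwarder) handlePacket(frame []byte) {
	f.mu.Lock()
	defer f.mu.Unlock()
	// Blocking I/O performed while the forwarder lock is held: every other
	// goroutine that needs f.mu now has to wait for the TCP write to finish.
	_ = f.s.send(frame)
}

func (f *forwarder) statusHandler(w http.ResponseWriter, r *http.Request) {
	// Once handlePacket is stuck inside send(), this RLock blocks forever,
	// which is consistent with /status no longer answering.
	f.mu.RLock()
	defer f.mu.RUnlock()
	w.Write([]byte("ok\n"))
}

func main() {
	// Wiring (listeners, peers) omitted; the point is the lock ordering above.
}

In this shape a single stalled peer freezes the forwarder and, transitively, the status endpoint; releasing the lock before the blocking write, or bounding the write with a deadline, breaks the chain.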

@murali-reddy
Contributor

I was trying to reproduce this issue to understand it better and also to verify that #3763 helps to fix it. I tried on a 150-node cluster with encryption enabled and restarted random weave-net pods several times, but I am unable to reproduce this issue.

kubectl exec -it weave-net-n8whr  -n kube-system -c weave -- /home/weave/weave --local status

        Version: 2.6.0 (up to date; next check at 2020/02/05 00:17:28)

        Service: router
       Protocol: weave 1..2
           Name: 5e:0d:fa:f8:77:98(ip-172-20-48-69.us-west-2.compute.internal)
     Encryption: enabled
  PeerDiscovery: enabled
        Targets: 150
    Connections: 150 (150 established)
          Peers: 151 (with 22650 established connections)
 TrustedSubnets: none

        Service: ipam
         Status: ready
          Range: 10.32.0.0/12
  DefaultSubnet: 10.32.0.0/12
