Weave not responding to /status or /status/connections a few minutes after restart #3762

Open
tstm opened this issue Jan 28, 2020 · 2 comments


tstm commented Jan 28, 2020

What you expected to happen?

I would expect a node to rejoin the cluster normally when its Weave pod is restarted. Instead, not only did it fail to connect to the rest of the cluster properly, it also stopped responding to /status and the other status endpoints.

What happened?

The node is part of a fairly large 100+ node Kubernetes cluster and was recently restarted; Weave 2.6.0 is running in a pod. The rest of the cluster is working, as was this node before the restart.

At first the weave status output seemed OK, but the node was unable to form connections; all of them were in a pending state. Oddly, the Connections count is 112 even though the cluster is only configured to have 105 nodes:

curl http://localhost:6784/status
        Version: 2.6.0 (up to date; next check at 2020/01/28 18:07:25)
        Service: router
       Protocol: weave 1..2
           Name: fe:e7:41:bb:ba:b0(node37)
     Encryption: enabled
  PeerDiscovery: enabled
        Targets: 105
    Connections: 112 (111 pending, 1 failed)
          Peers: 105 (with 10696 established, 209 pending connections)
 TrustedSubnets: none
        Service: ipam
         Status: waiting for IP(s) to become available
          Range: 10.2.0.0/16
  DefaultSubnet: 10.2.0.0/16

But after a few minutes the Weave pod stopped responding to HTTP requests to /status and /status/connections.

I ran a kill -SIGQUIT on the weave process and got the following dump:
https://gist.github.com/tstm/ef3d6615dcd189f48c5970e991a11c92
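
For reference, here is a minimal Go sketch (not Weave's code) of how a Go process can produce the same kind of goroutine dump on demand without being killed, which is handy when SIGQUIT's default behaviour of exiting is undesirable. The use of SIGUSR1 and writing to stderr are assumptions for illustration only.

package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

func main() {
	// Hypothetical example: dump all goroutine stacks (debug level 2, a
	// format similar to the SIGQUIT trace) whenever SIGUSR1 is received,
	// while keeping the process running.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)

	go func() {
		for range sigs {
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()

	select {} // stand-in for the program's real work
}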

How to reproduce it?

In a large cluster, restart a weave pod without deleting the xfrm policies or the weave database file.

Anything else we need to know?

The cluster is running on bare metal. UFW is in use on the host nodes.

Versions:

$ weave version
Version: 2.6.0
$ docker version
Client:
 Version:           18.09.5
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        e8ff056
 Built:             Thu Apr 11 04:43:57 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.5
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       e8ff056
  Built:            Thu Apr 11 04:10:53 2019
  OS/Arch:          linux/amd64
  Experimental:     false

$ uname -a
Linux node37 4.15.0-62-generic #69-Ubuntu SMP Wed Sep 4 20:55:53 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.3", GitCommit:"b3cbbae08ec52a7fc73d334838e18d17e8512749", GitTreeState:"clean", BuildDate:"2019-11-14T04:24:29Z", GoVersion:"go1.12.13", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.8", GitCommit:"211047e9a1922595eaa3a1127ed365e9299a6c23", GitTreeState:"clean", BuildDate:"2019-10-15T12:02:12Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}

Logs:

The logs do not include the SIGQUIT dump shown above, and the IP addresses have been anonymized.

$ kubectl logs -n kube-system <weave-net-pod> weave

https://gist.github.com/tstm/57623d0944a4c2dd4a819457bc4e1234

Network:

The network has been verified to work fine, and just a few minutes earlier Weave on this node was able to connect and work properly.

@bboreham
Contributor

Thanks for the report. In the goroutine dump I can see a lot of goroutines are blocked on the fastDatapathForwarder lock, like this:

goroutine 25 [semacquire, 67 minutes]:
sync.(*RWMutex).Lock(0xc000455b0c)
	/usr/local/go/src/sync/rwmutex.go:98 +0x97
github.com/weaveworks/weave/router.(*fastDatapathForwarder).handleVxlanSpecialPacket(0xc000455ad0, 0xc00346b5bc, 0x56e, 0x575, 0xc004caee40)
	/go/src/github.com/weaveworks/weave/router/fastdp.go:799 +0x51

That lock is held by a goroutine that is blocked waiting on an encryptedTCPSender lock, which in turn is held by a goroutine blocked on a TCP send.

In both cases it is questionable whether a lock should be held over blocking I/O, and whether that I/O should be allowed to block forever.
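
To make the shape of that hang concrete, here is a hedged Go sketch (illustrative names, not Weave's actual code) of a lock held across blocking I/O: a forwarder holds its mutex while calling a sender, the sender holds its own mutex across a blocking TCP write, and the HTTP status handler needs the forwarder mutex, so it hangs too.

package main

import (
	"net"
	"net/http"
	"sync"
)

type sender struct {
	mu   sync.Mutex
	conn net.Conn
}

func (s *sender) send(frame []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	// If the peer stops reading and the socket buffers fill up, this Write
	// can block indefinitely while s.mu is held. A write deadline, e.g.
	// s.conn.SetWriteDeadline(...), would bound the wait.
	_, err := s.conn.Write(frame)
	return err
}

type forwarder struct {
	mu sync.RWMutex
	s  *sender
}

func (f *forwarder) handlePacket(frame []byte) {
	f.mu.Lock()
	defer f.mu.Unlock()
	// Blocking I/O performed while the forwarder lock is held: every other
	// goroutine that needs f.mu now has to wait for the TCP write to finish.
	_ = f.s.send(frame)
}

func (f *forwarder) statusHandler(w http.ResponseWriter, r *http.Request) {
	// Once handlePacket is stuck inside send(), this RLock blocks forever,
	// which is consistent with /status no longer answering.
	f.mu.RLock()
	defer f.mu.RUnlock()
	w.Write([]byte("ok\n"))
}

func main() {
	// Wiring (listeners, peers) omitted; the point is the lock ordering above.
}

In this shape a single stalled peer freezes the forwarder and, transitively, the status endpoint; releasing the lock before the blocking write, or bounding the write with a deadline, breaks the chain.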

@murali-reddy
Contributor

I was trying to reproduce this issue to understand it better and also to verify that #3763 helps to fix it. I tried on a 150-node cluster with encryption enabled and restarted random weave-net pods several times, but I am unable to reproduce this issue.

kubectl exec -it weave-net-n8whr  -n kube-system -c weave -- /home/weave/weave --local status

        Version: 2.6.0 (up to date; next check at 2020/02/05 00:17:28)

        Service: router
       Protocol: weave 1..2
           Name: 5e:0d:fa:f8:77:98(ip-172-20-48-69.us-west-2.compute.internal)
     Encryption: enabled
  PeerDiscovery: enabled
        Targets: 150
    Connections: 150 (150 established)
          Peers: 151 (with 22650 established connections)
 TrustedSubnets: none

        Service: ipam
         Status: ready
          Range: 10.32.0.0/12
  DefaultSubnet: 10.32.0.0/12
