This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

Zombie apocalypse with kubernetes and weave-kube #2836

Closed
Bregor opened this issue Mar 9, 2017 · 2 comments

@Bregor
Contributor

Bregor commented Mar 9, 2017

Kubernetes:

```
$ kubectl version --short
Client Version: v1.5.4
Server Version: v1.5.4
```

Weave:

```
$ kubectl get ds -n kube-system weave-net -o jsonpath='{.spec.template.spec.containers[*].image}'
weaveworks/weave-kube:1.9.3 weaveworks/weave-npc:1.9.3
```

Zombies:

```
$ knife ssh "role:kubernetes_node" "sudo ps aux|grep [l]aunch.sh"|sort
kube01 root      2230  0.0  0.0      0     0 ?        Z    Mar07   0:00 [launch.sh] <defunct>
kube02 root      7411  0.0  0.0      0     0 ?        Z    Mar07   0:00 [launch.sh] <defunct>
kube03 root      5007  0.0  0.0      0     0 ?        Z    Mar07   0:00 [launch.sh] <defunct>
kube04 root      6274  0.0  0.0      0     0 ?        Z    19:19   0:00 [launch.sh] <defunct>
...
```

uname -a:

```
$ knife ssh "role:kubernetes_node" "uname -a"|sort
kube01 Linux kube01 4.4.0-51-generic #72-Ubuntu SMP Thu Nov 24 18:29:54 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
kube02 Linux kube02 4.4.0-51-generic #72-Ubuntu SMP Thu Nov 24 18:29:54 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
kube03 Linux kube03 4.4.0-51-generic #72-Ubuntu SMP Thu Nov 24 18:29:54 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
kube04 Linux kube04 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
```

This may be related to the following Kubernetes issue: kubernetes/kubernetes#39334
We see this on every node in the cluster.
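To see which parent is failing to reap a zombie, it helps to print the parent PID next to each defunct process. The following is a minimal sketch, not something from the original report: the helper name `zombies_with_parent` is made up for illustration, and it assumes a `ps` that accepts `-eo pid,ppid,stat,args` (as procps does).

```shell
# zombies_with_parent: read `ps -eo pid,ppid,stat,args` output on stdin and
# print "PID PPID COMMAND" for every process whose state starts with Z.
# (Hypothetical helper name, used only for this illustration.)
zombies_with_parent() {
    awk 'NR > 1 && $3 ~ /^Z/ {
        # Rebuild the command from field 4 onwards, since it may contain spaces.
        cmd = $4
        for (i = 5; i <= NF; i++) cmd = cmd " " $i
        print $1, $2, cmd
    }'
}

# Usage on a live node:
#   ps -eo pid,ppid,stat,args | zombies_with_parent
```

The PPID column then points at the process that should have called `wait()` on the zombie.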

@Bregor
Contributor Author

Bregor commented Mar 9, 2017

I found it!

Watch node web05:
I downgraded Weave one version at a time on this host.

weave-1.9.2:

```
$ knife ssh "role:kubernetes_node" "sudo ps aux|grep [l]aunch.sh"|sort
web01 root       639  0.0  0.0      0     0 ?        Z    Mar07   0:00 [launch.sh] <defunct>
web02 root      2863  0.0  0.0      0     0 ?        Z    Mar07   0:00 [launch.sh] <defunct>
web03 root     27055  0.0  0.0      0     0 ?        Z    Mar07   0:00 [launch.sh] <defunct>
web05 root     17381  0.0  0.0      0     0 ?        Z    21:49   0:00 [launch.sh] <defunct>
```

weave-1.9.1:

```
$ knife ssh "role:kubernetes_node" "sudo ps aux|grep [l]aunch.sh"|sort
web01 root       639  0.0  0.0      0     0 ?        Z    Mar07   0:00 [launch.sh] <defunct>
web02 root      2863  0.0  0.0      0     0 ?        Z    Mar07   0:00 [launch.sh] <defunct>
web03 root     27055  0.0  0.0      0     0 ?        Z    Mar07   0:00 [launch.sh] <defunct>
web05 root     22508  0.0  0.0      0     0 ?        Z    21:54   0:00 [launch.sh] <defunct>
```

weave-1.9.0:

```
$ knife ssh "role:kubernetes_node" "sudo ps aux|grep [l]aunch.sh"|sort
web01 root       639  0.0  0.0      0     0 ?        Z    Mar07   0:00 [launch.sh] <defunct>
web02 root      2863  0.0  0.0      0     0 ?        Z    Mar07   0:00 [launch.sh] <defunct>
web03 root     27055  0.0  0.0      0     0 ?        Z    Mar07   0:00 [launch.sh] <defunct>
web05 root     26266  0.0  0.0      0     0 ?        Z    21:57   0:00 [launch.sh] <defunct>
```

weave-1.8.2:

```
$ knife ssh "role:kubernetes_node" "sudo ps aux|grep [l]aunch.sh"|sort
web01 root       639  0.0  0.0      0     0 ?        Z    Mar07   0:00 [launch.sh] <defunct>
web02 root      2863  0.0  0.0      0     0 ?        Z    Mar07   0:00 [launch.sh] <defunct>
web03 root     27055  0.0  0.0      0     0 ?        Z    Mar07   0:00 [launch.sh] <defunct>
web05 root      4712  0.0  0.0   1524  1016 ?        Ss   22:01   0:00 /bin/sh /home/weave/launch.sh --host=10.83.8.203 --status-addr=10.83.8.203:6782
```

Tadaaa!
No zombies with 1.8.2.

@marccarre
Contributor

Many thanks for reporting this @Bregor!
I confirm:

  1. I can systematically see defunct processes for launch.sh;
  2. this behaviour seems to have been introduced by this change;
  3. although it is not a proper fix, applying a patch that removes exec to the 1.9.3 branch removes the symptoms;
  4. we're internally discussing a proper fix/refactoring to resolve this issue.
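The link between `exec` and the zombies can be reproduced outside Weave. The sketch below is an illustration under assumptions, not launch.sh itself: a shell starts a background child and then `exec`s into a long-running program that never calls `wait()`, with `sleep` standing in for both the background work and weaver. It assumes a `ps` that accepts `-eo ppid,stat`.

```shell
# A shell starts a short-lived background child, then exec's into a program
# (here `sleep`, standing in for a daemon) that never calls wait(). The exec'd
# program keeps the shell's PID, so it inherits the child but never reaps it.
sh -c 'sleep 1 & exec sleep 5' &
parent=$!

sleep 2   # give the background child time to exit

# Look up the state of the process whose parent is $parent.
child_state=$(ps -eo ppid,stat | awk -v p="$parent" '$1 == p { print $2 }')
echo "child state: $child_state"

# Clean up: once the parent is gone, the zombie is re-parented and reaped.
kill "$parent" 2>/dev/null || true
wait "$parent" 2>/dev/null || true
```

If the state printed starts with `Z`, the child has exited but nobody has collected its exit status, which is the same situation as the `[launch.sh] <defunct>` entries reported above. Dropping the `exec` leaves the shell in place as the parent, which is consistent with the symptoms disappearing in point 3.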

marccarre added a commit that referenced this issue Mar 13, 2017
Reproduces issue #2836.
Sample output:
```
test #5 "run_on mct-0.us-central1-a.weave-net ps aux | grep 'launch.sh' | grep defunct | wc -l" failed:
	expected "0"
	got "1"
test #6 "run_on mct-1.us-central1-a.weave-net ps aux | grep 'launch.sh' | grep defunct | wc -l" failed:
	expected "0"
	got "1"
test #7 "run_on mct-2.us-central1-a.weave-net ps aux | grep 'launch.sh' | grep defunct | wc -l" failed:
	expected "0"
	got "1"
3 of 8 tests failed in 61.780s.
```
marccarre added a commit that referenced this issue Mar 13, 2017
This:

1. prevents defunct (a.k.a. zombie) processes from being generated, and therefore fixes #2836; but also
2. reintroduces running more than one process and the issues related to signal forwarding, effectively reverting #2688 / reopening #2684.

A proper fix would leverage something like Tini.
See also:

- github.com/krallin/tini
- github.com/krallin/tini/issues/8
- github.com/docker-library/official-images#init
marccarre added a commit that referenced this issue Mar 13, 2017
This prevents defunct (a.k.a. zombie) processes from being generated, and therefore fixes #2836 without reopening #2684.
marccarre added a commit that referenced this issue Mar 13, 2017
Fixes #2836, i.e. prevents defunct (a.k.a. zombie) processes from being generated, and does so without reopening #2684.
In Docker, ENTRYPOINT is PID 1 and therefore has the responsibility of reaping processes and forwarding signals to child processes, which launch.sh does not do.
This change leverages tini to add such behaviour.
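To make the two PID 1 duties concrete, here is a rough shell sketch of a wrapper that forwards SIGTERM/SIGINT to its child and then waits for it, so the child's exit status is collected. It is not how tini works internally (tini is a separate small init program and also reaps orphaned processes that get re-parented to PID 1); the function name `run_with_forwarding` is an assumption made for this illustration.

```shell
# run_with_forwarding CMD [ARGS...]
# Start CMD in the background, forward TERM/INT to it, and reap it with wait.
# (Hypothetical helper; a sketch of the idea only, not a replacement for tini.)
run_with_forwarding() {
    "$@" &
    child=$!
    got_signal=0

    # On TERM or INT, remember that we were signalled and pass TERM to the child.
    trap 'got_signal=1; kill -TERM "$child" 2>/dev/null' TERM INT

    # wait reaps the child; it returns early if a trapped signal arrives.
    wait "$child"
    child_rc=$?

    # If wait was interrupted by a signal, wait again for the child's real status.
    if [ "$got_signal" -eq 1 ]; then
        wait "$child"
        child_rc=$?
    fi

    trap - TERM INT
    return "$child_rc"
}

# Usage: run_with_forwarding some-daemon --flag
```

A plain `/bin/sh` running a script does none of this by default for signals, which is why running launch.sh as PID 1 without `exec` brought back the signal-forwarding problem from #2684.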
marccarre added a commit that referenced this issue Mar 13, 2017
Fixes #2836, i.e. prevents defunct (a.k.a. zombie) launch.sh processes from being generated, and does so without reopening #2684.
Why: In Docker, ENTRYPOINT is PID 1 and therefore has the responsibility of reaping processes and forwarding signals to child processes, which launch.sh does not do.
This change leverages tini to bake such behaviour in, as recommended by Docker: https://github.com/docker-library/official-images#init
marccarre added a commit that referenced this issue Mar 13, 2017
Fixes #2836, i.e. prevents defunct (a.k.a. zombie) launch.sh processes from being generated, and does so without reopening #2684.
Why: In Docker, ENTRYPOINT is PID 1 and therefore has the responsibility of reaping processes and forwarding signals to child processes, which launch.sh does not do.
This change leverages tini to bake such behaviour in, as recommended by Docker: https://github.com/docker-library/official-images#init
See also: github.com/krallin/tini/issues/8
marccarre added a commit that referenced this issue Mar 14, 2017
Fixes #2836, i.e. prevents defunct (a.k.a. zombie) launch.sh processes from being generated, and also propagates signals from Docker to our processes (i.e. does not reopen #2684).
Why: In Docker, ENTRYPOINT is PID 1 and therefore has the responsibility of reaping processes and forwarding signals to child processes, which launch.sh currently does not do.
This change leverages tini to bake such behaviour in, as recommended by Docker. See also:
- github.com/docker-library/official-images#init
- github.com/krallin/tini/issues/8
marccarre added a commit that referenced this issue Mar 14, 2017
Fixes #2836, i.e. prevents defunct (a.k.a. zombie) launch.sh processes from being generated, and also propagates signals from Docker to our processes (i.e. does not reopen #2684).
Why: In Docker, ENTRYPOINT is PID 1 and therefore has the responsibility of reaping processes and forwarding signals to child processes, which launch.sh currently does not do.
This change leverages tini to bake such behaviour in, as recommended by Docker.

See also:
- github.com/docker-library/official-images#init
- github.com/krallin/tini/issues/8

Sample output:

- During initialisation:

```
$ ps auxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
[...]
root      1380  0.5  1.9 879660 74864 ?        Ssl  Mar13   5:42 /usr/bin/docker daemon -H fd:// -H unix:///
root      1664  0.0  0.4 502116 18644 ?        Ssl  Mar13   0:05  \_ docker-containerd -l /var/run/docker/li
root      9716  0.0  0.1 134960  5412 ?        Sl   11:00   0:00      \_ docker-containerd-shim 4946a0467c5a
root      9734  0.0  0.0    736     4 ?        Ss   11:00   0:00      |   \_ /sbin/tini -s -- /home/weave/la
root      9738  6.0  1.5 483756 59948 ?        Sl   11:00   0:00      |       \_ /home/weave/weaver --port=6
root     10020  0.0  0.0   1524    64 ?        S    11:00   0:00      |       \_ /bin/sh /home/weave/launch.
root     10110  0.0  0.0   1772  1264 ?        S    11:00   0:00      |           \_ /bin/sh /home/weave/wea
root     10135  0.0  0.0   1772   324 ?        S    11:00   0:00      |               \_ /bin/sh /home/weave
root     10136  0.0  0.0  14656  2912 ?        S    11:00   0:00      |                   \_ curl -o /tmp/we
[...]
```

- Once initialised successfully:

```
$ ps auxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
[...]
root      1380  0.5  1.9 879660 74864 ?        Ssl  Mar13   5:42 /usr/bin/docker daemon -H fd:// -H unix:///
root      1664  0.0  0.4 502116 18644 ?        Ssl  Mar13   0:05  \_ docker-containerd -l /var/run/docker/li
root      9716  0.0  0.1 134960  5412 ?        Sl   11:00   0:00      \_ docker-containerd-shim 4946a0467c5a
root      9734  0.0  0.0    736     4 ?        Ss   11:00   0:00      |   \_ /sbin/tini -s -- /home/weave/la
root      9738  4.2  1.5 491952 59948 ?        Sl   11:00   0:00      |       \_ /home/weave/weaver --port=6
[...]
```
marccarre added this to the 1.9.4 milestone Mar 14, 2017
marccarre added a commit that referenced this issue Mar 14, 2017
Reproduces issue #2836.
Sample output:
```
test #5 "run_on mct-0.us-central1-a.weave-net ps aux | grep -c '[d]efunct'" failed:
	expected "0"
	got "1"
test #6 "run_on mct-1.us-central1-a.weave-net ps aux | grep -c '[d]efunct'" failed:
	expected "0"
	got "1"
test #7 "run_on mct-2.us-central1-a.weave-net ps aux | grep -c '[d]efunct'" failed:
	expected "0"
	got "1"
3 of 8 tests failed in 61.780s.
```
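The `[d]efunct` pattern in this test (like `[l]aunch.sh` in the original report) is a common idiom for keeping `grep` from counting its own entry in the `ps` listing. A small sketch of why it works follows; the function name `count_defunct` and both fixtures are made up for illustration and are not taken from the Weave test suite.

```shell
# Count lines containing "defunct" on stdin. The bracket expression [d] matches
# a literal "d", so the regex still matches "defunct", but the pattern text
# "[d]efunct" as it appears in grep's own command line does not match itself.
count_defunct() {
    grep -c '[d]efunct'
}

# Made-up fixtures: one zombie plus the grep process that ps would also list.
with_plain_grep='root 12910 0.0 0.0 0 0 ? Z 14:38 0:00 [launch.sh] <defunct>
root 13100 0.0 0.0 1524 64 ? S 14:39 0:00 grep -c defunct'
with_bracket_grep='root 12910 0.0 0.0 0 0 ? Z 14:38 0:00 [launch.sh] <defunct>
root 13100 0.0 0.0 1524 64 ? S 14:39 0:00 grep -c [d]efunct'

# A plain pattern also counts the grep process line; the bracket form does not.
plain_count=$(printf '%s\n' "$with_plain_grep" | grep -c 'defunct')
bracket_count=$(printf '%s\n' "$with_bracket_grep" | count_defunct)
echo "plain: $plain_count, bracket: $bracket_count"
```

Note that `grep -c` prints `0` but exits with status 1 when nothing matches, which matters if the surrounding script runs under `set -e`.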
marccarre added a commit that referenced this issue Mar 14, 2017
Reproduces issue #2836, i.e.:

- during initialisation:

```
$ ps auxf
[...]
root      1380  0.4  1.6 879660 60732 ?        Ssl  Mar13   5:54 /usr/bin/docker daemon -H fd:// -H unix:///
root      1664  0.0  0.4 502116 18540 ?        Ssl  Mar13   0:06  \_ docker-containerd -l /var/run/docker/li
root     12615  0.0  0.1 200496  5424 ?        Sl   14:38   0:00      \_ docker-containerd-shim c5637e5bbdcb
root     12629  6.2  1.6 296400 62312 ?        Ssl  14:38   0:00      |   \_ /home/weave/weaver --port=6783
root     12910  0.0  0.0   1524    68 ?        S    14:38   0:00      |       \_ /bin/sh /home/weave/launch.
root     13002  0.0  0.0   1772  1232 ?        S    14:38   0:00      |           \_ /bin/sh /home/weave/wea
root     13027  0.0  0.0   1772   320 ?        S    14:38   0:00      |               \_ /bin/sh /home/weave
root     13028  0.0  0.0  14656  2708 ?        S    14:38   0:00      |                   \_ curl -o /tmp/we
[...]
```

- after initialisation:

```
$ ps auxf
[...]
root      1380  0.4  1.6 879660 60732 ?        Ssl  Mar13   5:54 /usr/bin/docker daemon -H fd:// -H unix:///
root      1664  0.0  0.4 502116 18540 ?        Ssl  Mar13   0:06  \_ docker-containerd -l /var/run/docker/li
root     12615  0.0  0.1 200496  5424 ?        Sl   14:38   0:00      \_ docker-containerd-shim c5637e5bbdcb
root     12629  3.5  1.6 297460 63340 ?        Ssl  14:38   0:00      |   \_ /home/weave/weaver --port=6783
root     12910  0.0  0.0      0     0 ?        Z    14:38   0:00      |       \_ [launch.sh] <defunct>
[...]
```

Sample output:

```
test #5 "run_on mct-0.us-central1-a.weave-net ps aux | grep -c '[d]efunct'" failed:
	expected "0"
	got "1"
test #6 "run_on mct-1.us-central1-a.weave-net ps aux | grep -c '[d]efunct'" failed:
	expected "0"
	got "1"
test #7 "run_on mct-2.us-central1-a.weave-net ps aux | grep -c '[d]efunct'" failed:
	expected "0"
	got "1"
3 of 8 tests failed in 62.012s.
```
marccarre added a commit that referenced this issue Mar 14, 2017
This:

1. prevents defunct (a.k.a. zombie) processes from being generated, and therefore fixes #2836; but also
2. reintroduces #2684/#2688, as shells do not forward signals and we are still running more than one process.

A proper fix would leverage something like tini.
See also:

- github.com/krallin/tini
- github.com/krallin/tini/issues/8
- github.com/docker-library/official-images#init

Sample output:

- during initialisation:

```
$ ps auxf
[...]
root      1380  0.4  1.9 879660 74660 ?        Ssl  Mar13   6:02 /usr/bin/docker daemon -H fd:// -H unix:///
root      1664  0.0  0.4 502116 18528 ?        Ssl  Mar13   0:06  \_ docker-containerd -l /var/run/docker/li
root     15482  0.0  0.0 200496  3364 ?        Sl   15:04   0:00      \_ docker-containerd-shim 3d1f5eb6e090
root     15496  0.0  0.0   1524   996 ?        Ss   15:04   0:00      |   \_ /bin/sh /home/weave/launch.sh
root     15780  0.0  0.0   1524    64 ?        S    15:04   0:00      |       \_ /bin/sh /home/weave/launch.
root     15872  0.0  0.0   1772  1272 ?        S    15:04   0:00      |       |   \_ /bin/sh /home/weave/wea
root     15897  0.0  0.0   1772   324 ?        S    15:04   0:00      |       |       \_ /bin/sh /home/weave
root     15898  0.0  0.0  14656  2948 ?        S    15:04   0:00      |       |           \_ curl -o /tmp/we
root     15781 10.0  1.5 484556 60296 ?        Sl   15:04   0:00      |       \_ /home/weave/weaver --port=6
[...]
```

- after initialisation:

```
$ ps auxf
[...]
root      1380  0.4  1.9 879660 74660 ?        Ssl  Mar13   6:02 /usr/bin/docker daemon -H fd:// -H unix:///
root      1664  0.0  0.4 502116 18592 ?        Ssl  Mar13   0:06  \_ docker-containerd -l /var/run/docker/li
root     15482  0.0  0.0 200496  3364 ?        Sl   15:04   0:00      \_ docker-containerd-shim 3d1f5eb6e090
root     15496  0.0  0.0   1524   996 ?        Ss   15:04   0:00      |   \_ /bin/sh /home/weave/launch.sh
root     15781  2.3  1.6 503064 64320 ?        Sl   15:04   0:00      |       \_ /home/weave/weaver --port=6
[...]
```
bboreham added a commit that referenced this issue Mar 14, 2017