All pods in NS crashed with Tap #12

Open
kferrone opened this issue Oct 9, 2020 · 2 comments
Labels: bug (Something isn't working), stale (This issue or pull request is stale)

kferrone commented Oct 9, 2020

Description

All pods in the namespace of the pod I tapped started misbehaving. Some time after I ran the command to tap my pod, random pods in the same namespace started failing and restarting. It didn't happen right away; it started an hour or so after I left the tap on, i.e. after I was done sniffing some headers I never ran kubectl tap off my-service. Not only did pods start failing, entire nodes started getting tainted with NoSchedule, which in turn caused the cluster autoscaler to overwork itself replacing failed nodes over and over.

Kubectl commands to create reproducible environment / deployment

First off, when I ran the initialize command, it would always complain that the tap took too long, and it didn't immediately port-forward on its own.
Here is what I ran:

kubectl tap on -n my-ns -p 4000 my-service --port-forward

Then, because the port-forward never started due to the timeout, I ran:

kubectl port-forward svc/my-service 2244:2244 -n my-ns

Then I did my sniffing and killed the port-forward, but did not turn off the tap.
Leaving that extra container in one pod seemed to cause all hell to break loose in the namespace.
As soon as I turned the tap off, everything went back to normal.
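
Putting the steps together, the full sequence looked roughly like this (same service, namespace, and ports as above; the final `kubectl tap off` is the step I initially skipped):

```sh
# Tap the service and ask kubetap to port-forward (this hit the timeout for me)
kubectl tap on -n my-ns -p 4000 my-service --port-forward

# Manual port-forward after the built-in one timed out
kubectl port-forward svc/my-service 2244:2244 -n my-ns

# ... sniff traffic, then Ctrl-C the port-forward ...

# The step I did not run until much later, which is when everything recovered
kubectl tap off my-service -n my-ns
```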

Screenshots or other information

Kubernetes client version: 1.17
Kubernetes server version: 1.17
Cloud: AWS EKS

One thing to note is we have Appmesh Auto-Inject active on the namespace. Not all pods in the NS are injected with Appmesh, however the pod I injected with tap was also injected with Appmesh. This means the pod had an X-Ray sidecar and an Envoy sidecar already present when I injected the tap. Maybe this was part of the issue?
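
In case it helps with reproducing, the sidecar situation can be confirmed by listing the containers in the tapped pod (the pod name below is a placeholder):

```sh
# List the containers in the tapped pod; with Appmesh auto-inject plus the tap,
# this should show the app container, the Envoy and X-Ray sidecars, and the
# kubetap/mitmproxy sidecar
kubectl get pod <my-service-pod> -n my-ns -o jsonpath='{.spec.containers[*].name}'
```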

Eriner (Contributor) commented Oct 12, 2020

Hi @kferrone, sorry you encountered this issue. Have you been able to reproduce the issue, by chance? Do you have a set of manifests I could apply to a local cluster to reproduce on my end?

It's possible that kubetap's interaction with the other sidecars is causing the problem. Kubetap deploys the mitmproxy sidecar and then essentially sed's the Service port, replacing the target port with the mitmproxy sidecar port. The mitmproxy sidecar then forwards the traffic to the original port. It stores the original port value as an annotation. It is therefore very possible that there is an unfavorable interaction with X-Ray/Envoy.
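
As a rough way to see that rewiring on a live cluster (the exact annotation key is omitted here; tap.go has the real name), you can inspect the Service after tapping:

```sh
# After tapping, the Service's targetPort should point at the mitmproxy sidecar,
# and the original target port should be preserved in an annotation
kubectl get svc my-service -n my-ns -o jsonpath='{.spec.ports}'
kubectl get svc my-service -n my-ns -o jsonpath='{.metadata.annotations}'
```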

If you could provide instructions to reproduce this issue, I'd be happy to take a look.

If you're interested in debugging this on your own, I suggest looking at the tap.go file here.

Thanks for filing this issue!

Eriner added the bug (Something isn't working) label on Oct 12, 2020
Eriner (Contributor) commented Oct 12, 2020

> First off, when I ran the initialize command, it would always complain the tap took too long and didn't immediately port-forward on its own.

Just to comment on this: the timeout can occur if the Deployment is taking a while to initialize. That is to say, if the node needs to pull the image and spin up the container, a large image can sometimes cause the timeout to be reached.
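
If that timeout comes up again, a rough workaround (assuming the Deployment name matches the Service name, which may not hold in your setup) is to wait for the rollout to finish and then port-forward manually, as you did:

```sh
# Wait for the re-deployed pods (now carrying the mitmproxy sidecar) to become ready
kubectl rollout status deployment/my-service -n my-ns

# Then port-forward manually (same ports as in the original report)
kubectl port-forward svc/my-service 2244:2244 -n my-ns
```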

Eriner added the stale (This issue or pull request is stale) label on Nov 16, 2020