1.1.0 kubernetes panic: send on closed channel #877

Closed

jonaz opened this issue Nov 22, 2016 · 20 comments
jonaz (Contributor) commented Nov 22, 2016

This has happened twice in the last hour since I upgraded to 1.1.0 in our development cluster.

The last log message before the panic was:

time="2016-11-22T14:43:00Z" level=error msg="Kubernetes connection error failed to decode watch event: GET \"https://127.0.0.1:9443/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 639.617334ms" 
panic: send on closed channel

goroutine 4420 [running]:
panic(0xd292c0, 0xc4203f96a0)
	/usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/containous/traefik/provider/k8s.(*clientImpl).watch.func1.1(0xc42050e900, 0xc420141260, 0xc420221080, 0x26, 0xc420141200)
	/go/src/github.com/containous/traefik/provider/k8s/client.go:280 +0x33f
created by github.com/containous/traefik/provider/k8s.(*clientImpl).watch.func1
	/go/src/github.com/containous/traefik/provider/k8s/client.go:286 +0xbf

I run Traefik like this:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: traefik-ingress-controller
  namespace: kube-system
  labels:
    app: traefik-ingress-lb
spec:
  template:
    metadata:
      labels:
        k8s-app: traefik-ingress-lb
        name: traefik-ingress-lb
    spec:
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      containers:
      - image: traefik:v1.1.0
        name: traefik-ingress-lb
        imagePullPolicy: Always
        resources:
          limits:
            cpu: 1000m
            memory: 500Mi
          requests:
            cpu: 200m
            memory: 100Mi
        ports:
        - containerPort: 80
          hostPort: 80
        - containerPort: 8081
        args:
        - --retry.attempts=8
        - --retry
        - --web
        - --web.address=:8081
        - --kubernetes
        - --kubernetes.endpoint=https://127.0.0.1:9443
jonaz (Contributor, Author) commented Nov 23, 2016

It has happened a lot since I updated to 1.1.0 yesterday:

Time	Message	Pod name
November 23rd 2016, 08:23:35.033	panic: send on closed channel traefik-ingress-controller-jff4x 
November 23rd 2016, 08:20:15.166	panic: send on closed channel traefik-ingress-controller-jff4x 
November 23rd 2016, 08:18:11.736	panic: send on closed channel traefik-ingress-controller-jff4x 
November 23rd 2016, 08:05:05.438	panic: send on closed channel traefik-ingress-controller-d0w7j 
November 23rd 2016, 07:21:34.433	panic: send on closed channel traefik-ingress-controller-jff4x
November 23rd 2016, 06:55:16.236	panic: send on closed channel traefik-ingress-controller-jff4x
November 23rd 2016, 06:12:36.462	panic: send on closed channel traefik-ingress-controller-jff4x
November 23rd 2016, 06:04:28.268	panic: send on closed channel traefik-ingress-controller-jff4x
November 23rd 2016, 05:44:58.866	panic: send on closed channel traefik-ingress-controller-jff4x
November 23rd 2016, 05:36:52.834	panic: send on closed channel traefik-ingress-controller-jff4x
November 23rd 2016, 05:19:39.035	panic: send on closed channel traefik-ingress-controller-jff4x
November 23rd 2016, 04:38:08.234	panic: send on closed channel traefik-ingress-controller-jff4x
November 23rd 2016, 03:36:38.334	panic: send on closed channel traefik-ingress-controller-jff4x
November 23rd 2016, 03:35:08.460	panic: send on closed channel traefik-ingress-controller-jff4x
November 23rd 2016, 03:29:51.734	panic: send on closed channel traefik-ingress-controller-jff4x
November 23rd 2016, 03:26:47.672	panic: send on closed channel traefik-ingress-controller-jff4x
November 23rd 2016, 03:10:05.876	panic: send on closed channel traefik-ingress-controller-66z9z
November 23rd 2016, 01:55:17.676	panic: send on closed channel traefik-ingress-controller-66z9z
November 23rd 2016, 01:18:26.260	panic: send on closed channel traefik-ingress-controller-jff4x
November 23rd 2016, 00:34:45.336	panic: send on closed channel traefik-ingress-controller-jff4x
November 23rd 2016, 00:33:42.360	panic: send on closed channel traefik-ingress-controller-jff4x
November 23rd 2016, 00:06:09.633	panic: send on closed channel traefik-ingress-controller-jff4x
November 23rd 2016, 00:03:05.363	panic: send on closed channel traefik-ingress-controller-jff4x
November 22nd 2016, 23:49:57.108	panic: send on closed channel traefik-ingress-controller-66z9z
November 22nd 2016, 23:06:29.933	panic: send on closed channel traefik-ingress-controller-jff4x
November 22nd 2016, 22:57:24.639	panic: send on closed channel traefik-ingress-controller-zndmv
November 22nd 2016, 22:28:07.783	panic: send on closed channel traefik-ingress-controller-66z9z
November 22nd 2016, 22:23:02.135	panic: send on closed channel traefik-ingress-controller-jff4x
November 22nd 2016, 21:58:46.216	panic: send on closed channel traefik-ingress-controller-d0w7j
November 22nd 2016, 21:35:28.234	panic: send on closed channel traefik-ingress-controller-jff4x
November 22nd 2016, 20:40:51.269	panic: send on closed channel traefik-ingress-controller-jff4x
November 22nd 2016, 19:17:43.383	panic: send on closed channel traefik-ingress-controller-jff4x
November 22nd 2016, 19:12:38.535	panic: send on closed channel traefik-ingress-controller-jff4x
November 22nd 2016, 19:02:29.935	panic: send on closed channel traefik-ingress-controller-jff4x
November 22nd 2016, 17:58:36.634	panic: send on closed channel traefik-ingress-controller-jff4x
November 22nd 2016, 17:49:28.355	panic: send on closed channel traefik-ingress-controller-jff4x
November 22nd 2016, 17:06:51.316	panic: send on closed channel traefik-ingress-controller-d0w7j
November 22nd 2016, 17:04:56.631	panic: send on closed channel traefik-ingress-controller-jff4x
November 22nd 2016, 16:57:43.408	panic: send on closed channel traefik-ingress-controller-zndmv
November 22nd 2016, 16:44:00.741	panic: send on closed channel traefik-ingress-controller-uuwxw
November 22nd 2016, 16:19:41.436	panic: send on closed channel traefik-ingress-controller-uuwxw
November 22nd 2016, 15:49:07.337	panic: send on closed channel traefik-ingress-controller-uuwxw
November 22nd 2016, 15:43:00.936	panic: send on closed channel traefik-ingress-controller-uuwxw

emilevauge (Member) commented:
@jonaz I made a fix and pushed a Docker image for testing: containous/traefik:k8s. If you could test it, that would help a lot :)

thaume commented Nov 24, 2016

Hi @emilevauge, I had the same problem. I just updated my Traefik controller (it went smoothly) and will let you know how it goes for me. Thanks!

jonaz (Contributor, Author) commented Nov 24, 2016

@emilevauge I will try tomorrow at work.

Is there a commit with a diff I can look at to complement the testing? (code review FTW)

emilevauge (Member) commented:
> Is there a commit with a diff I can look at to complement the testing? (code review FTW)

Here is the pull request #900

jonaz (Contributor, Author) commented Nov 25, 2016

I have deployed containous/traefik:k8s into our development cluster and will give it a few hours now.

thaume commented Nov 25, 2016

As a heads up, it has been 15 hours and not a single restart.

jonaz (Contributor, Author) commented Nov 25, 2016

There is a problem with the fix.

containous/traefik:k8s seems to contain the fix from #874:

queryParams := map[string]string{"watch": "true", "resourceVersion": resourceVersion}

This means it will never try to write to errCh ("failed to decode watch event: GET"), so it will not panic.

So in order to test this fix, we need a Docker image without the fix from #874.

Since this really is a bug in the error handling, we need to keep getting the errors that #874 solved :)

emilevauge (Member) commented:
@jonaz Yeah, #874 had already been merged. I just pushed containous/traefik:k8s with that fix disabled.

jonaz (Contributor, Author) commented Nov 25, 2016

@emilevauge Thanks, I'll redeploy that.

jonaz (Contributor, Author) commented Nov 25, 2016

I have not had any panics the last 5 hours. I will leave it running over the weekend.

emilevauge added the priority/P0 (needs hot fix) label Nov 26, 2016
jonaz (Contributor, Author) commented Nov 28, 2016

It has not crashed yet, so I would consider it fixed.

emilevauge (Member) commented:
Fixed by #900
Thanks @jonaz @thaume for your help :)

mthird commented Nov 28, 2016

I've switched to using the containous/traefik:k8s build and still see the decoding issues:

2016-11-28T19:57:06.631945394Z time="2016-11-28T19:57:06Z" level=error msg="Kubernetes connection error failed to decode watch event: GET \"https://10.200.0.1:443/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 300.839227ms" 
2016-11-28T19:58:07.047522776Z time="2016-11-28T19:58:07Z" level=error msg="Kubernetes connection error failed to decode watch event: GET \"https://10.200.0.1:443/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 316.754354ms" 
2016-11-28T19:59:07.447124736Z time="2016-11-28T19:59:07Z" level=error msg="Kubernetes connection error failed to decode watch event: GET \"https://10.200.0.1:443/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 535.680986ms" 
2016-11-28T20:00:08.084936902Z time="2016-11-28T20:00:08Z" level=error msg="Kubernetes connection error failed to decode watch event: GET \"https://10.200.0.1:443/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 511.724881ms" 

My traefik config:

apiVersion: v1
kind: Service
metadata:
  name: traefik
  labels:
    k8s-app: traefik-ingress-lb
spec:
  selector:
    k8s-app: traefik-ingress-lb
  ports:
    - port: 80
      name: http
    - port: 443
      name: https
---
apiVersion: v1
kind: Service
metadata:
  name: traefik-console
  labels:
    k8s-app: traefik-ingress-lb
spec:
  selector:
    k8s-app: traefik-ingress-lb
  ports:
    - port: 8080
      name: webui
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: traefik-ingress-controller
  namespace: kube-system
  labels:
    k8s-app: traefik-ingress-lb
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: traefik-ingress-lb
  template:
    metadata:
      labels:
        k8s-app: traefik-ingress-lb
        name: traefik-ingress-lb
        version: v1.1.0-rc4
      annotations:
        scheduler.alpha.kubernetes.io/affinity: >
          {
            "nodeAffinity": {
              "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                  {
                    "matchExpressions": [
                      {
                        "key": "stackpoint.io/role",
                        "operator": "In",
                        "values": ["master"]
                      }
                    ]
                  }
                ]
              }
            }
          }
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - image: containous/traefik:k8s
        name: traefik-ingress-lb
        resources:
          limits:
            cpu: 200m
            memory: 30Mi
          requests:
            cpu: 100m
            memory: 20Mi
        ports:
          - containerPort: 80
            hostPort: 80
          - containerPort: 8080
            hostPort: 8081
        args:
        - --web
        - --kubernetes

jonaz (Contributor, Author) commented Nov 28, 2016

Yeah, if you read my comment you will see why:
#877 (comment)

mthird commented Nov 28, 2016

The comment from emilevauge indicates the k8s image was updated 3 days ago without the fix. The image on Docker Hub also indicates it was updated 3 days ago.

emilevauge (Member) commented Nov 28, 2016

@mthird the image on Docker Hub, containous/traefik:k8s, only contains a fix for the panic issue. There is nothing wrong if you still get unexpected EOF logs; those were fixed in #874.

jonaz (Contributor, Author) commented Nov 28, 2016

Yeah, that's exactly the point: containous/traefik:k8s must NOT contain the fix from #874 (the fix for the unexpected EOF). Otherwise we cannot know whether #877 is fixed, because, as I said, this is a bug in the error handling.

This issue is only about fixing "panic: send on closed channel", which was an error in the handling of the unexpected EOF.

emilevauge (Member) commented Nov 28, 2016

@jonaz race condition ;)

mthird commented Nov 28, 2016

Makes sense now. Will ignore the log entries :)

Thanks!

ldez added the kind/bug/confirmed (a confirmed bug, reproducible) label Apr 29, 2017
traefik locked and limited conversation to collaborators Sep 1, 2019