Add long jobs in exponential backoff providers #626

Closed
yvespp opened this issue Aug 18, 2016 · 8 comments
@yvespp
Contributor

yvespp commented Aug 18, 2016

Traefik dies if it can't reach the Kubernetes API server for some reason (network outage, API server down). This is very unfortunate because it causes an outage even though the backends in the cluster are still up and reachable.

It would be nice if Traefik could be more resilient and not depend on the API server being available all the time.

Tested with 1.0.0 and 1.0.1.

Log:

time="2016-08-18T07:18:56+02:00" level=debug msg="Skipping event from kubernetes map[type:MODIFIED object:map[kind:Endpoints apiVersion:v1 metadata:map[name:vvn-baustein namespace:poz-uat selfLink:/api/v1/namespaces/poz-uat/endpoints/vvn-baustein uid:2f76c98c-483b-11e6-a5a8-005056b207ba resourceVersion:14087443 creationTimestamp:2016-07-12T14:16:11Z labels:map[appId:252696]] subsets:[map[notReadyAddresses:[map[ip:172.30.192.25 targetRef:map[name:aps-vvn-baustein-126968-uulp8 uid:2fa62cf5-604d-11e6-8cfc-005056b207ba resourceVersion:14087340 kind:Pod namespace:poz-uat]] map[ip:172.31.32.16 targetRef:map[namespace:poz-uat name:aps-vvn-baustein-126970-58j3o uid:861e2fe9-5f8f-11e6-8cfc-005056b207ba resourceVersion:14087442 kind:Pod]]] ports:[map[protocol:TCP name:http port:8080]]]]]]"
time="2016-08-18T07:19:51+02:00" level=error msg="Error watching kubernetes events: failed to create watch: failed to do version request: GET \"https://172.23.0.1:443/apis/extensions/v1beta1/ingresses\" : failed to create request: GET \"https://172.23.0.1:443/apis/extensions/v1beta1/ingresses\" : [Get https://172.23.0.1:443/apis/extensions/v1beta1/ingresses: dial tcp 172.23.0.1:443: getsockopt: connection refused]"
time="2016-08-18T07:19:52+02:00" level=fatal msg="Cannot connect to Kubernetes server failed to create watch: failed to do version request: GET \"https://172.23.0.1:443/apis/extensions/v1beta1/ingresses\" : failed to create request: GET \"https://172.23.0.1:443/apis/extensions/v1beta1/ingresses\" : [Get https://172.23.0.1:443/apis/extensions/v1beta1/ingresses: dial tcp 172.23.0.1:443: getsockopt: connection refused]"
@emilevauge
Member

@yvespp I may know where it comes from. Can you confirm that you got a lot of "Error watching kubernetes events" logs before it dies?

@yvespp
Contributor Author

yvespp commented Aug 18, 2016

No, I only see what I posted above.
The log.Fatalf in kubernetes.go#L157 causes Traefik to exit (see logger.go#L160). So it's by design.

@emilevauge
Member

@yvespp indeed, it should not be a Fatalf; it should only end up there after having retried multiple times with exponential backoff. Can you give us all your logs?

@yvespp
Contributor Author

yvespp commented Aug 18, 2016

Ah, ok. Here is the log: https://gist.githubusercontent.com/yvespp/461d4fed9decc697f6c5e502d63b8042/raw/ddb85d4c274d5c68902b08c4d7b7b27b5a878421/traefik.log
It's just from the last few hours, I have to see if I can get the whole log...

Here is the config:

    logLevel = "INFO"

    defaultEntryPoints = ["http", "https"]
    accessLogsFile = "/proc/1/fd/1"

    [entryPoints]
      [entryPoints.http]
      address = ":80"
        [entryPoints.http.redirect]
          entryPoint = "https"
      [entryPoints.https]
      address = ":443"
        [entryPoints.https.tls]
          [[entryPoints.https.tls.certificates]]
          CertFile = "/etc/ssl/tls.crt"
          KeyFile = "/etc/ssl/tls.key"

    [web]
    address = ":8080"
    ReadOnly = true

    [kubernetes]

@yvespp
Contributor Author

yvespp commented Aug 18, 2016

Here is the whole log from a test run. I stopped the API Server at 18:05:58 and Traefik died: https://gist.githubusercontent.com/yvespp/3b4278c7c99e2e47711659a50bd26ff4/raw/da3e0798a3e181d60ad6b3ce84e3b4e0c395a89e/traefik.log2

@yvespp
Contributor Author

yvespp commented Aug 18, 2016

I noticed that a fresh instance of Traefik, that was just started, doesn't die when I stop the API Server: https://gist.githubusercontent.com/yvespp/9fb879812f1b815574975c8ba71b5982/raw/919087d47ab99b787d4f2e0e2412eb21ca7db648/traefik.log3

In this log you can see Traefik surviving several API Server restarts. I then waited an hour and restarted the API Server again, and Traefik died immediately: https://gist.githubusercontent.com/yvespp/3e7aac2990d1124a0bd631326eaa2ff7/raw/a10c28cd0639434ab9b30dec84a7583525d6dc64/traefik.log4
There were two instances (Pods) of Traefik running and both died at the same time.

@emilevauge
Member

OK, thanks. To be clear, if you restart Traefik, it works again, right?

@yvespp
Contributor Author

yvespp commented Aug 18, 2016

Yes, if the API Server is up again.
As long as the API Server is down Kubernetes can't restart the Pod.

@emilevauge emilevauge self-assigned this Aug 18, 2016
@emilevauge emilevauge changed the title Kubernets: Traefik dies when it can't connect to the api server Add long jobs in exponential backoff providers Aug 18, 2016
This was referenced Aug 19, 2016
@ldez ldez added the kind/bug/confirmed a confirmed bug (reproducible). label Apr 29, 2017
@traefik traefik locked and limited conversation to collaborators Sep 1, 2019

4 participants