Traefik (1.6.5 and 1.7.1) stops watching file after it detects Marathon is down #3974

Closed
daltonmatos opened this issue Oct 3, 2018 · 7 comments

@daltonmatos

Do you want to request a feature or report a bug?

Bug

What did you do?

I'm trying to use the file.watch feature. The idea is to be able to reload SSL certificates without restarting Traefik.

I have the TLS config in a separate file, and here is the relevant part of my main config file:

[file]
watch = true
fileName = "/etc/ssl.toml"

I have two copies of "/etc/ssl.toml", each one with a different certificate, like this:

ssl1.toml:

[[tls]]
  entryPoints = ["https"]
  [tls.certificate]
      certFile = "/etc/cert1/fullchain.pem"
      keyFile = "/etc/cert1/privkey.pem"

and ssl2.toml:

[[tls]]
  entryPoints = ["https"]
  [tls.certificate]
      certFile = "/etc/cert2/fullchain.pem"
      keyFile = "/etc/cert2/privkey.pem"

And then what I do is just a simple cp, like this:

cd /etc
cp ssl2.toml ssl.toml
cp ssl1.toml ssl.toml

And check the certificates were reloaded with a simple curl to localhost.
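
For a check that is easier to compare between runs than raw curl output, one option is to print a fingerprint of whatever certificate the server presents. This is only a minimal sketch, assuming Traefik is reachable on localhost:443; `fingerprint` and `servedFingerprint` are hypothetical helper names written for this illustration, not part of Traefik.

```go
package main

import (
	"crypto/sha256"
	"crypto/tls"
	"encoding/hex"
	"fmt"
	"os"
)

// fingerprint returns the hex-encoded SHA-256 digest of a DER-encoded certificate.
func fingerprint(der []byte) string {
	sum := sha256.Sum256(der)
	return hex.EncodeToString(sum[:])
}

// servedFingerprint connects to addr over TLS and returns the fingerprint
// of the leaf certificate presented by the server.
func servedFingerprint(addr string) (string, error) {
	// Verification is skipped on purpose: the goal is only to see which
	// certificate is served, not to decide whether it is trusted.
	conn, err := tls.Dial("tcp", addr, &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		return "", err
	}
	defer conn.Close()

	certs := conn.ConnectionState().PeerCertificates
	if len(certs) == 0 {
		return "", fmt.Errorf("no certificate presented by %s", addr)
	}
	return fingerprint(certs[0].Raw), nil
}

func main() {
	// Assumed address; adjust to wherever the https entry point listens.
	fp, err := servedFingerprint("localhost:443")
	if err != nil {
		fmt.Fprintln(os.Stderr, "check failed:", err)
		return
	}
	fmt.Println(fp)
}
```

Running it once after each cp should print a different fingerprint whenever the reload actually happened, and the same one when it did not.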

What did you expect to see?

I expected to see traefik change the certificates every time I did the cp.

What did you see instead?

Traefik changes the certificates correctly, until this message appears in the debug log:

ERRO[2018-10-03T09:45:23-03:00] Failed to retrieve Marathon applications: all the Marathon hosts are presently down

As soon as this message appears, traefik does not detect the file change anymore and then the certificates stop being reloaded.

Output of traefik version: (What version of Traefik are you using?)

Version:      v1.6.5
Codename:     tetedemoine
Go version:   go1.10.3
Built:        2018-07-10_03:54:03PM
OS/Arch:      linux/amd64

The version here is 1.6.5 but this same behavior happens with 1.7.1.

What is your environment & configuration (arguments, toml, provider, platform, ...)?

{
 "LifeCycle": {
  "RequestAcceptGraceTimeout": 0,
  "GraceTimeOut": 0
 },
 "GraceTimeOut": 0,
 "Debug": false,
 "CheckNewVersion": true,
 "SendAnonymousUsage": false,
 "AccessLogsFile": "",
 "AccessLog": null,
 "TraefikLogsFile": "",
 "TraefikLog": null,
 "Tracing": null,
 "LogLevel": "",
 "EntryPoints": {},
 "Cluster": null,
 "Constraints": [],
 "ACME": null,
 "DefaultEntryPoints": [
  "http"
 ],
 "ProvidersThrottleDuration": 2000000000,
 "MaxIdleConnsPerHost": 200,
 "IdleTimeout": 0,
 "InsecureSkipVerify": false,
 "RootCAs": null,
 "Retry": null,
 "HealthCheck": {
  "Interval": 30000000000
 },
 "RespondingTimeouts": null,
 "ForwardingTimeouts": null,
 "AllowMinWeightZero": false,
 "Web": null,
 "Docker": null,
 "File": null,
 "Marathon": null,
 "Consul": null,
 "ConsulCatalog": null,
 "Etcd": null,
 "Zookeeper": null,
 "Boltdb": null,
 "Kubernetes": null,
 "Mesos": null,
 "Eureka": null,
 "ECS": null,
 "Rancher": null,
 "DynamoDB": null,
 "ServiceFabric": null,
 "Rest": null,
 "API": null,
 "Metrics": null,
 "Ping": null,
 "ConfigFile": ""
}

If applicable, please paste the log output at DEBUG level (--logLevel=DEBUG switch)

INFO[2018-10-03T09:55:35-03:00] Starting provider *file.Provider {"Watch":true,"Filename":"/etc/ssl.toml","Constraints":null,"Trace":false,"TemplateVersion":0,"DebugLogGeneratedTemplate":false,"Directory":"","TraefikFile":"/home/daltonmatos/sievegroup/sieve/docker-traefik/config.toml"}
WARN[2018-10-03T09:55:35-03:00] clientTLS is nil
DEBU[2018-10-03T09:55:35-03:00] Configuration received from provider file: {"tls":[{"EntryPoints":["https"],"Certificate":{"CertFile":"/etc/cert1/fullchain.pem","KeyFile":"/etc/cert1/privkey.pem"}}]}
DEBU[2018-10-03T09:55:35-03:00] Add certificate for domains *.<redacted>.com.br
INFO[2018-10-03T09:55:35-03:00] Server configuration reloaded on :8082
INFO[2018-10-03T09:55:35-03:00] Server configuration reloaded on :80
INFO[2018-10-03T09:55:35-03:00] Server configuration reloaded on :443
DEBU[2018-10-03T09:55:50-03:00] Configuration received from provider file: {"tls":[{"EntryPoints":["https"],"Certificate":{"CertFile":"/etc/cert2/fullchain.pem","KeyFile":"/etc/cert2/privkey.pem"}}]}
DEBU[2018-10-03T09:55:50-03:00] Add certificate for domains *.<redacted>.com.br
INFO[2018-10-03T09:55:50-03:00] Server configuration reloaded on :80
INFO[2018-10-03T09:55:50-03:00] Server configuration reloaded on :443
INFO[2018-10-03T09:55:50-03:00] Server configuration reloaded on :8082
DEBU[2018-10-03T09:55:54-03:00] Configuration received from provider file: {"tls":[{"EntryPoints":["https"],"Certificate":{"CertFile":"/etc/cert1/fullchain.pem","KeyFile":"/etc/cert1/privkey.pem"}}]}
DEBU[2018-10-03T09:55:56-03:00] Add certificate for domains *.<redacted>.com.br
INFO[2018-10-03T09:55:56-03:00] Server configuration reloaded on :80
INFO[2018-10-03T09:55:56-03:00] Server configuration reloaded on :443
INFO[2018-10-03T09:55:56-03:00] Server configuration reloaded on :8082
ERRO[2018-10-03T09:57:35-03:00] Failed to retrieve Marathon applications: all the Marathon hosts are presently down

From now on, even when I change the contents of /etc/ssl.toml, Traefik does not detect the change and does not reload the certificates.

Important note: I was only able to reproduce this when I started Traefik with Marathon turned off. When I start Traefik with Marathon already running, everything works, even after I remove every app from Marathon and then stop it. But if I later start Marathon, Traefik does not resume detecting the file changes, although it does receive new updates from Marathon:

ERRO[2018-10-03T10:19:02-03:00] Failed to retrieve Marathon applications: all the Marathon hosts are presently down 
DEBU[2018-10-03T10:19:45-03:00] Received provider event type: deployment_info, event: &{deployment_info %!s(*marathon.StepActions=&{[{ScaleApplication  /sleep}]}) 2018-10-03T13:19:45.729Z %!s(*marathon.DeploymentPlan=&{8ebc3a31-cf6c-4c71-8a1e-b081de2af445 2018-10-03T13:19:45.599Z 0xc4200205a0 0xc420020600 [0xc4204e6120]})} 
DEBU[2018-10-03T10:19:45-03:00] originLabelsmap[]                            
DEBU[2018-10-03T10:19:45-03:00] allLabelsmap[:map[]]                         
DEBU[2018-10-03T10:19:45-03:00] originLabelsmap[]                            
DEBU[2018-10-03T10:19:45-03:00] allLabelsmap[:map[]]

Let me know if you need any more information and I will provide it.

@jbdoumenjou jbdoumenjou added kind/bug/possible a possible bug that needs analysis before it is confirmed or fixed. area/provider/marathon area/provider/file and removed status/0-needs-triage labels Oct 3, 2018
@jbdoumenjou
Member

Hi @daltonmatos,
could you provide us your complete logs? They seem truncated.

@daltonmatos
Author

Sure @jbdoumenjou, I will re-run the tests and post it here. Thanks.

@daltonmatos
Author

Here is the raw log, just copied from the terminal.

All I changed was to redact the domain name.

sudo ./traefi-1.6.5 -c config.toml
INFO[2018-10-03T12:32:47-03:00] Using TOML configuration file /home/daltonmatos/sievegroup/sieve/docker-traefik/config.toml 
INFO[2018-10-03T12:32:47-03:00] Traefik version v1.6.5 built on 2018-07-10_03:54:03PM 
INFO[2018-10-03T12:32:47-03:00] 
Stats collection is disabled.
Help us improve Traefik by turning this feature on :)
More details on: https://docs.traefik.io/basics/#collected-data
 
DEBU[2018-10-03T12:32:47-03:00] Global configuration loaded {"LifeCycle":{"RequestAcceptGraceTimeout":0,"GraceTimeOut":10000000000},"GraceTimeOut":0,"Debug":false,"CheckNewVersion":true,"SendAnonymousUsage":false,"AccessLogsFile":"","AccessLog":null,"TraefikLogsFile":"","TraefikLog":null,"Tracing":null,"LogLevel":"debug","EntryPoints":{"http":{"Address":":80","TLS":null,"Redirect":null,"Auth":null,"WhitelistSourceRange":null,"WhiteList":null,"Compress":false,"ProxyProtocol":null,"ForwardedHeaders":{"Insecure":true,"TrustedIPs":null}},"https":{"Address":":443","TLS":{"MinVersion":"","CipherSuites":null,"Certificates":null,"ClientCAFiles":null,"ClientCA":{"Files":null,"Optional":false}},"Redirect":null,"Auth":null,"WhitelistSourceRange":null,"WhiteList":null,"Compress":false,"ProxyProtocol":null,"ForwardedHeaders":{"Insecure":true,"TrustedIPs":null}},"traefik":{"Address":":8082","TLS":null,"Redirect":null,"Auth":null,"WhitelistSourceRange":null,"WhiteList":null,"Compress":false,"ProxyProtocol":null,"ForwardedHeaders":{"Insecure":true,"TrustedIPs":null}}},"Cluster":null,"Constraints":[],"ACME":null,"DefaultEntryPoints":["http"],"ProvidersThrottleDuration":2000000000,"MaxIdleConnsPerHost":10000,"IdleTimeout":0,"InsecureSkipVerify":false,"RootCAs":null,"Retry":null,"HealthCheck":{"Interval":30000000000},"RespondingTimeouts":null,"ForwardingTimeouts":null,"AllowMinWeightZero":false,"Web":null,"Docker":null,"File":{"Watch":true,"Filename":"/etc/ssl.toml","Constraints":null,"Trace":false,"TemplateVersion":0,"DebugLogGeneratedTemplate":false,"Directory":"","TraefikFile":"/home/daltonmatos/sievegroup/sieve/docker-traefik/config.toml"},"Marathon":{"Watch":true,"Filename":"","Constraints":[],"Trace":false,"TemplateVersion":2,"DebugLogGeneratedTemplate":false,"Endpoint":"http://172.18.0.31:8080","Domain":"marathon","ExposedByDefault":true,"GroupsAsSubDomains":true,"DCOSToken":"","MarathonLBCompatibility":false,"FilterMarathonConstraints":false,"TLS":null,"DialerTimeout":60000000000,"KeepAlive":10000000000,"ForceTaskHostname":false,"Basic":null,"RespectReadinessChecks":false},"Consul":null,"ConsulCatalog":null,"Etcd":null,"Zookeeper":null,"Boltdb":null,"Kubernetes":null,"Mesos":null,"Eureka":null,"ECS":null,"Rancher":null,"DynamoDB":null,"ServiceFabric":null,"Rest":null,"API":{"EntryPoint":"traefik","Dashboard":true,"Debug":false,"CurrentConfigurations":null,"Statistics":null},"Metrics":{"Prometheus":{"Buckets":[0.1,0.3,1.2,5],"EntryPoint":"traefik"},"Datadog":null,"StatsD":null,"InfluxDB":null},"Ping":null} 
DEBU[2018-10-03T12:32:47-03:00] Configured Prometheus metrics                
INFO[2018-10-03T12:32:47-03:00] Preparing server https &{Address::443 TLS:0xc4204e4080 Redirect:<nil> Auth:<nil> WhitelistSourceRange:[] WhiteList:<nil> Compress:false ProxyProtocol:<nil> ForwardedHeaders:0xc4200356c0} with readTimeout=0s writeTimeout=0s idleTimeout=3m0s 
INFO[2018-10-03T12:32:47-03:00] Preparing server traefik &{Address::8082 TLS:<nil> Redirect:<nil> Auth:<nil> WhitelistSourceRange:[] WhiteList:<nil> Compress:false ProxyProtocol:<nil> ForwardedHeaders:0xc420035700} with readTimeout=0s writeTimeout=0s idleTimeout=3m0s 
INFO[2018-10-03T12:32:47-03:00] Starting server on :443                      
INFO[2018-10-03T12:32:47-03:00] Preparing server http &{Address::80 TLS:<nil> Redirect:<nil> Auth:<nil> WhitelistSourceRange:[] WhiteList:<nil> Compress:false ProxyProtocol:<nil> ForwardedHeaders:0xc420035680} with readTimeout=0s writeTimeout=0s idleTimeout=3m0s 
INFO[2018-10-03T12:32:47-03:00] Starting server on :80                       
INFO[2018-10-03T12:32:47-03:00] Starting provider configuration.providerAggregator {} 
INFO[2018-10-03T12:32:47-03:00] Starting server on :8082                     
INFO[2018-10-03T12:32:47-03:00] Starting provider *marathon.Provider {"Watch":true,"Filename":"","Constraints":[],"Trace":false,"TemplateVersion":2,"DebugLogGeneratedTemplate":false,"Endpoint":"http://172.18.0.31:8080","Domain":"marathon","ExposedByDefault":true,"GroupsAsSubDomains":true,"DCOSToken":"","MarathonLBCompatibility":false,"FilterMarathonConstraints":false,"TLS":null,"DialerTimeout":60000000000,"KeepAlive":10000000000,"ForceTaskHostname":false,"Basic":null,"RespectReadinessChecks":false} 
INFO[2018-10-03T12:32:47-03:00] Starting provider *file.Provider {"Watch":true,"Filename":"/etc/ssl.toml","Constraints":null,"Trace":false,"TemplateVersion":0,"DebugLogGeneratedTemplate":false,"Directory":"","TraefikFile":"/home/daltonmatos/sievegroup/sieve/docker-traefik/config.toml"} 
WARN[2018-10-03T12:32:47-03:00] clientTLS is nil                             
DEBU[2018-10-03T12:32:47-03:00] Configuration received from provider file: {"tls":[{"EntryPoints":["https"],"Certificate":{"CertFile":"/etc/letsencrypt-second-cert-2018-10-02/live/<redacted>/fullchain.pem","KeyFile":"/etc/letsencrypt-second-cert-2018-10-02/live/<redacted>/privkey.pem"}}]} 
DEBU[2018-10-03T12:32:47-03:00] Add certificate for domains *.<redacted>.com.br 
INFO[2018-10-03T12:32:47-03:00] Server configuration reloaded on :8082       
INFO[2018-10-03T12:32:47-03:00] Server configuration reloaded on :80         
INFO[2018-10-03T12:32:47-03:00] Server configuration reloaded on :443        
DEBU[2018-10-03T12:32:49-03:00] Configuration received from provider file: {"tls":[{"EntryPoints":["https"],"Certificate":{"CertFile":"/etc/letsencrypt-first-cert-2018-10-01/live/<redacted>/fullchain.pem","KeyFile":"/etc/letsencrypt-first-cert-2018-10-01/live/<redacted>/privkey.pem"}}]} 
DEBU[2018-10-03T12:32:49-03:00] Configuration received from provider file: {"tls":[{"EntryPoints":["https"],"Certificate":{"CertFile":"/etc/letsencrypt-first-cert-2018-10-01/live/<redacted>/fullchain.pem","KeyFile":"/etc/letsencrypt-first-cert-2018-10-01/live/<redacted>/privkey.pem"}}]} 
DEBU[2018-10-03T12:32:49-03:00] Add certificate for domains *.<redacted>.com.br 
INFO[2018-10-03T12:32:49-03:00] Server configuration reloaded on :443        
INFO[2018-10-03T12:32:49-03:00] Server configuration reloaded on :8082       
INFO[2018-10-03T12:32:49-03:00] Server configuration reloaded on :80         
ERRO[2018-10-03T12:32:50-03:00] Failed to retrieve Marathon applications: all the Marathon hosts are presently down 

Every cp I did after this (to change the contents of /etc/ssl.toml) didn't trigger the reload.

@ldez
Member

ldez commented Oct 3, 2018

@daltonmatos could you provide the full content (redacted) of your config.toml file?

@daltonmatos
Author

daltonmatos commented Oct 3, 2018

Hello @ldez,

Here is the config:

MaxIdleConnsPerHost = 10000
logLevel = "debug"

[lifeCycle]
graceTimeOut = 10

defaultEntryPoints = ["http"]
[entryPoints]
  [entryPoints.http]
  address = ":80"

  [entryPoints.https]
  address = ":443"
  [entryPoints.https.tls]

  [entryPoints.traefik]
  address = ":8082"

[file]
watch = true
fileName = "/etc/ssl.toml"

[api]
  dashboard = true

[metrics]
  [metrics.prometheus]
    entrypoint = "traefik"
    buckets = [0.1,0.3,1.2,5.0]

[marathon]
endpoint = "http://172.18.0.31:8080"
watch = true
domain = "marathon"
groupsAsSubDomains = true

@skwair skwair added the priority/P2 need to be fixed in the future label Oct 4, 2018
@diegooliveira
Contributor

It's a problem in the logic; I debugged and reproduced the issue. If the first configuration loaded in [1] is nil, then the only way to reload the configuration is for the watcher loop in [2] to be fired by a Marathon change, such as a new app being deployed.

The fix might be to add a loop that retries loading the first configuration from time to time if the initial one was nil and watch is on.

I'll try to make a PR for this change.

[1]
https://github.com/containous/traefik/blob/d3ae88f108f927201fffc310c61cd31cdf5e8ea8/provider/marathon/marathon.go#L153-L157

[2]
https://github.com/containous/traefik/blob/d3ae88f108f927201fffc310c61cd31cdf5e8ea8/provider/marathon/marathon.go#L134-L149

@geraldcroes geraldcroes added kind/bug/confirmed a confirmed bug (reproducible). and removed kind/bug/possible a possible bug that needs analysis before it is confirmed or fixed. labels Oct 31, 2018
diegooliveira pushed a commit to diegooliveira/traefik that referenced this issue Nov 1, 2018
    If Traefik can't load the configuration when starting it will give
up doing so until a new Marathon event gets fired. If no new event is
fired then no configuration will ever be loaded, no matter if the watch
is true or not.

fix traefik#3974
@traefiker traefiker added this to the 1.7 milestone Nov 26, 2018
@traefiker
Contributor

Closed by #4230.

@traefik traefik locked and limited conversation to collaborators Sep 1, 2019