
Traefik not updating config #42
Closed
adamgraves-choices opened this issue Mar 27, 2017 · 35 comments

@adamgraves-choices

Hi,

We've got an intermittent issue where traefik isn't updating the frontend and backend configuration in our Rancher environment.

New stacks and changes to stacks sometimes don't get reflected in the config. Sometimes it resolves itself within approx. 10-60 minutes, but on other occasions we have to restart the Traefik stack. Sometimes that doesn't help, and we have ended up destroying the environment and rebuilding it from scratch to resolve the issue.

Last time it occurred I tested the rancher-metadata service to ensure that was working, and everything looked fine from there.

Anyone else encountering this?

@joshuacox

I am indeed noticing this behavior. I have noticed that I have some containers set with really long health checks, and when those are in play I think it tends to exacerbate the problem.

@ghost

ghost commented Apr 21, 2017

I have the same problem. When I upgrade a server and the IP address changes, it does not get reflected in the traefik config. Is there a way to manually regenerate the rules & traefik.toml? Currently I restart the traefik Docker container and the config is correct again, but this is not suitable for production.

@joshuacox

@rawmind0 any recommendations on how to fix this in situ? I have tried restarting either rancher-traefik or alpine-traefik, or both, with curious results, one of which was being banned from letsencrypt by rate limiting :(

I'd like to know if there is a better method, perhaps a command I can run inside one of the containers to force it to reload its configuration without dropping all the certs.

Another thought is that maybe we could have a version of this that keeps all its configs in a convoy-nfs mount.

I know all of this might be moot as well once traefik begins to natively support rancher.

@rawmind0
Owner

Hi guys,

Sorry about the issues you have been suffering. Could you please provide some more details?

BTW, inside the alpine-traefik container, you can restart traefik or confd without needing to restart the container:

monit restart traefik
#or
monit restart confd
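
If you are on the Docker host rather than inside the container, a minimal sketch of the same thing (the container name is a placeholder for your actual alpine-traefik container):

# run monit from the host; <alpine-traefik-container> is a placeholder
docker exec -it <alpine-traefik-container> monit restart confd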

@ghost

ghost commented Apr 27, 2017

At the beginning everything worked fine, but after some time rancher-traefik did not update to a new IP after an upgrade of a container (and the resulting IP change). It still had the old IP address for the backend. I am not sure, but it could be related to an update to Rancher version 1.5.3. Currently I am testing the new native Traefik Rancher backend and it looks promising.

@snahelou

snahelou commented May 2, 2017

Hello

I have the same problem with Rancher 1.5.3 and Traefik rawmind/alpine-traefik:1.2.3-1

EDIT:

Maybe it's because confd does not refresh the metadata:

bash-4.3$ curl http://rancher-metadata
curl: (6) Couldn't resolve host 'rancher-metadata'

BTW, DNS is working on other containers and metadata works.

Due to rancher/rancher#5041

I tried to add a search domain in the Rancher UI, and after the upgrade DNS is now working, but the confd-generated config is still empty :(

@rawmind0
Owner

rawmind0 commented May 2, 2017

Hi @snahelou ...

This is not the cause of the problem... confd is able to resolve the Rancher URI and connect. This problem is with Alpine's curl, not confd. If you do curl http://rancher-metadata.rancher.internal it should work.

Please publish the confd logs... /opt/tools/confd/log/confd.log inside the alpine-traefik containers.

Do your services have health checks configured?
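
For anyone checking the same thing from inside the alpine-traefik container, a quick sketch using the URL and log path mentioned above (an informal check, not an official procedure):

curl http://rancher-metadata.rancher.internal
tail -n 50 /opt/tools/confd/log/confd.log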

@snahelou

snahelou commented May 2, 2017

Hello

Yes, sorry, DNS was not the problem.

I had the following error

2017-05-02T12:42:28Z traefik-traefik-1 /opt/tools/confd/bin/confd[159]: ERROR template: rules.toml.tmpl:41:34: executing "rules.toml.tmpl" at <getv (printf "/stack...>: error calling getv: key does not exist
              {{- $back_status := getv (printf "/stacks/%s/services/%s/containers/%s/health_state" $stack_name $service_name $container) -}}

I removed 2 stacks and the service came back available. It's strange because the stacks were green.

@rawmind0
Owner

rawmind0 commented May 2, 2017

It seems you didn't have health checks configured... Health checks are mandatory; only healthy backends are added to traefik.
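
For reference, a minimal health check sketch in rancher-compose.yml (assuming the Rancher 1.x compose v2 format; the service name and values are only an illustration):

version: '2'
services:
  nginx:
    health_check:
      # simple HTTP check on the container port; tune values to your service
      port: 80
      request_line: GET / HTTP/1.0
      interval: 2000
      response_timeout: 2000
      healthy_threshold: 2
      unhealthy_threshold: 3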

@snahelou

snahelou commented May 2, 2017

OK, strange; health checks were configured because I use a Jenkins multibranch pipeline and the other branches work well.

Thanks for your support.

Regards

@jjscarafia

Hi!
I've got an intermittent issue very similar to this one where traefik isn't updating the frontend and backend configuration in our Rancher environment on every host (some hosts are updated).

New stacks and changes to stacks sometimes don't get reflected in every host config.

About our configuration:

  • I've two hosts running traefik (http://34.201.12.10:8000 and http://54.210.1.168:8000/)
  • One has the configuration updated (traefik-1) and the other doesn't (traefik-2)
  • Doing "monit restart confd" solves the issue, but it happens again later if we add new stacks
  • I'm using traefik on "nginx" services inside stacks (check the nginx service labels on the image attached)
  • I'm using Rancher 1.5.10 with the traefik catalog "1.2.3-1" (latest version)
  • I've tested running "curl http://rancher-metadata.rancher.internal" and it works; it returns something on both hosts
  • Find attached log files from the two traefiks and also from /opt/tools/confd/log/confd.log
  • Find attached also the health checks configured on nginx; all stacks and services are green
  • We are using RancherOS on the hosts (deployed with AWS EC2)

One note: the confd log of traefik-1 shows the error "executing "rules.toml.tmpl" at <getv (printf "/stack...>: error calling getv: key does not exist", but traefik-1 is the one configured OK; traefik-2 is the one that is not configured OK (not refreshed). I've also checked every traefik label on the servers and they are exactly the same as the ones attached.

Anyone else with the same? Thanks!
Juan

healthchecks (screenshot)

traefik 2 dashboard where test-portal1-14-06 service is not discovered (traefik-2-dashboard screenshot)

traefik 1 dashboard where test-portal1-14-06 service is discovered (traefik-1-dashboard screenshot)

nginx labels (nginx-labels screenshot)

traefik-1-confd.txt
traefik-2.txt
traefik-1.txt
traefik-2-confd.txt

@jjscarafia

Some more information: I've checked the file /opt/traefik/etc/rules.toml on traefik-1 and traefik-2, and on both of them the "test-portal1-14-06" service configuration is present. I don't know why traefik does not reload; perhaps it's related to this?

@jjscarafia

@rawmind0 any help on this? Any suggestions? Can you please check my post in this issue?

@snahelou

Check if all of your stacks are green, even those without traefik labels.
When I have errors on a stack, it makes my confd unstable. In your case, it's very strange that one server works and the other doesn't.

Regards

@dbsanfte

dbsanfte commented Jun 22, 2017

When a container crashes and restarts itself, Traefik correctly removes the container from the pool but doesn't re-add it once it has restarted. I have to manually scale the stack up and down to get Traefik to pick it up. Any ideas?

Considering abandoning this image and going for the native Rancher support in Traefik 1.3 to see if that resolves it.

@jjscarafia

jjscarafia commented Jun 22, 2017

@dbsanfte, no idea; I've tried evacuating a host and traefik updates correctly when the new containers are created on other hosts.
@snahelou thanks for the response! I have all stacks on green.

Some tests I've done; I'm not sure they are the ones that make it work now (just in case it helps someone):

  1. Using Ubuntu 16.04 for the hosts (docker 1.12.6) instead of RancherOS v1.0.2 (docker 17.03.1-ce) seems to work better, but it is not a conclusion yet
  2. As @snahelou suggested here, it seems that if I stop a stack and, while the stack is stopped (red), create new stacks, confd gets confused and the traefik config is not refreshed
  3. Before, I was adding the label "traefik.alias.fqdn" with an empty value to every service where I was using traefik, and with a value only on the services where I wanted one. I've deleted this label and kept it only where it was necessary (see the label sketch below)

With no more red stacks and using Ubuntu 16.04, traefik seems to be working OK for at least 24 hours.
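
For reference, the kind of per-service labels involved, in docker-compose.yml form (label names are the ones mentioned in this thread; the service name and values are placeholders):

services:
  nginx:
    labels:
      traefik.enable: 'true'
      traefik.port: '80'
      # per point 3 above, set traefik.alias.fqdn only where you actually need a value
      traefik.alias.fqdn: 'portal.example.com'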

@rawmind0
Owner

@jjscarafia, your case is really strange...

In your confd log files, the last update should set the rules.toml file to the same content... It's very strange that it works on just one server... Are the infrastructure services working well on both?
traefik-2-confd.txt

2017-06-14T12:43:59Z traefik-traefik-2 /opt/tools/confd/bin/confd[143]: INFO /opt/traefik/etc/rules.toml has md5sum bf6b2298be0acf958ad37fac08f7180d should be 73983e979b367f06346659a41726824f
2017-06-14T12:43:59Z traefik-traefik-2 /opt/tools/confd/bin/confd[143]: INFO Target config /opt/traefik/etc/rules.toml out of sync
2017-06-14T12:43:59Z traefik-traefik-2 /opt/tools/confd/bin/confd[143]: INFO Target config /opt/traefik/etc/rules.toml has been updated

traefik-1-confd.txt

2017-06-14T12:44:09Z traefik-traefik-1 /opt/tools/confd/bin/confd[24]: INFO /opt/traefik/etc/rules.toml has md5sum bf6b2298be0acf958ad37fac08f7180d should be 73983e979b367f06346659a41726824f
2017-06-14T12:44:09Z traefik-traefik-1 /opt/tools/confd/bin/confd[24]: INFO Target config /opt/traefik/etc/rules.toml out of sync
2017-06-14T12:44:09Z traefik-traefik-1 /opt/tools/confd/bin/confd[24]: INFO Target config /opt/traefik/etc/rules.toml has been updated

Is it working well with Ubuntu and docker 1.12.6?

@jjscarafia

Hi @rawmind0 and thanks for the comments!

  1. I've just updated all infrastructure services (they showed an available upgrade).
  2. Yes, it seems that with Ubuntu 16.04 (docker 1.12.6) it is working OK, but I will give RancherOS another chance and will share the results.
  3. The only "red" container I have is "rancher-agent-bootstrap", which is only visible on hosts (image attached). Could this be interfering in any way?

@rawmind0 just in case you are available and willing, I can give you access to the Rancher instance; just send me an email at jjs@adhoc.com.ar

seleccion_055 (screenshot)

@rawmind0
Owner

rawmind0 commented Jun 22, 2017

Hi @jjscarafia ...

  1. The strangest thing is that it works on one server and not on the other. Please upgrade the infrastructure services to the latest version.
  2. More than RancherOS vs. Ubuntu, the problem could be with docker version 1.12.6 vs 17.03.1... Thanks for testing and sharing the results, I really appreciate it...
  3. The only "red" containers that could affect traefik confd would be in stacks with traefik.enable=true; those are the only ones confd looks for.

Best regards....

@jjscarafia

I've been playing for a while and I can see that:

  1. I could reproduce the error of the traefik conf not updating by stopping stacks (they become red) and creating new stacks with traefik labels.
  2. During that period the log looks like:
 "/stack...>: error calling getv: key does not exist
2017-06-22T21:05:46Z adhoc-traefik-traefik-3 /opt/tools/confd/bin/confd[23]: ERR
OR template: rules.toml.tmpl:41:34: executing "rules.toml.tmpl" at <getv (printf
 "/stack...>: error calling getv: key does not exist
2017-06-22T21:06:01Z adhoc-traefik-traefik-3 /opt/tools/confd/bin/confd[23]: ERR
OR template: rules.toml.tmpl:41:34: executing "rules.toml.tmpl" at <getv (printf
 "/stack...>: error calling getv: key does not exist
2017-06-22T21:06:16Z adhoc-traefik-traefik-3 /opt/tools/confd/bin/confd[23]: ERR
OR template: rules.toml.tmpl:41:34: executing "rules.toml.tmpl" at <getv (printf
 "/stack...>: error calling getv: key does not exist
2017-06-22T21:06:31Z adhoc-traefik-traefik-3 /opt/tools/confd/bin/confd[23]: ERR
OR template: rules.toml.tmpl:41:34: executing "rules.toml.tmpl" at <getv (printf
 "/stack...>: error calling getv: key does not exist
  3. After restarting the stopped stacks, the traefik conf was updated automatically again.
  4. This didn't always happen; sometimes I can stop a stack and new stacks are still auto-discovered (I guess it was related to sorting, stack names or something like that).
  5. I couldn't yet replicate the error where one traefik conf was updated and the others were not.

@dbsanfte

Moving over to the native Traefik Rancher support resolved my issue with my crashed/auto-restarted Node.js containers not being picked up by this image.

@jjscarafia

@dbsanfte good to know, and thanks for sharing. Are you also using ACME support with the native Rancher support?

@dbsanfte

No we're just defining a plain old SSL cert/key, no ACME.

@lasley
Contributor

lasley commented Aug 3, 2017

I just hit this one too. In my case, a host went down which caused some stacks to migrate to another host.

There were some other stacks that were simply stopped because I didn't want them alive at the moment. Traefik did not start updating until I started those stacks as well, which I could then stop at my leisure.

@jjscarafia

@lasley moving to the native traefik support for Rancher made it work OK for me.
If it helps, this is my very ugly rancher-catalog template

@adamgraves-choices
Author

@jjscarafia I've built something similar using the native rancher templates: https://github.com/nhsuk/traefik-rancher

Unfortunately I've come across a critical bug which stops us using Traefik for now: traefik/traefik#1927

@jjscarafia

@adamgraves-choices thanks for the feedback. It seems that was the issue I faced yesterday...

@lasley
Contributor

lasley commented Aug 9, 2017

Honestly I thought I was just screwing up somehow so I wasn't even going to say anything 😆

@percosys

I am having a similar issue. I was able to get past the error in the log message by setting an environment variable, CONF_PREFIX, to /latest, which seems to have triggered confd to look at the latest route in the Rancher metadata service instead of the default /2015-12-19. However, I am still having an issue with the correct rules being written.
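
For reference, a sketch of how that variable can be set on the traefik service in docker-compose.yml (the service name is a placeholder; CONF_PREFIX is the variable described above):

services:
  traefik:
    environment:
      # point confd at the latest metadata route, as described above
      CONF_PREFIX: /latest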

When confd completes its interval I do in fact see a new /opt/traefik/etc/rules.toml file, but it is missing the URL and backends params shown in the template.

I believe it is skipping over the following block in the template because rancher-metadata has not yet registered that the container is healthy by the time confd finishes writing the new rules.toml.

{{- if eq $back_status "healthy" }}
    [backends.{{$service_name}}__{{$stack_name}}.servers.{{getv (printf "/stacks/%s/services/%s/containers/%s/name" $stack_name $service_name $container)}}]
                {{- if eq $traefik_protocol "https"}}
      url = "{{$traefik_protocol}}://{{getv (printf "/stacks/%s/services/%s/containers/%s/primary_ip" $stack_name $service_name $container) -}}:
                {{- else}}
      url = "http://{{getv (printf "/stacks/%s/services/%s/containers/%s/primary_ip" $stack_name $service_name $container) -}}:
                {{- end -}}
                {{- if exists (printf "/stacks/%s/services/%s/labels/traefik.port" $stack_name $service_name) -}}
                    {{getv (printf "/stacks/%s/services/%s/labels/traefik.port" $stack_name $service_name)}}
                {{- else -}}
                80
                {{- end}}"
      weight = 0
              {{- end -}}
            {{- end -}}

It seems that when confd is triggered to run, it detects a change in the number of stacks in "latest", but if the container is not "healthy" by the time it writes the new rules file, it will skip over that part of the template.

My suspicion is that, since the number of stacks doesn't change by the next interval, the rules.toml doesn't get updated until the number of stacks changes in Rancher, which could be a long time or even never.

If my suspicion is correct, then is there a better methodology for updating the rules.toml other than counting the number of stacks in Rancher?

I do have health checks configured on all my stacks so I am not sure how to move forward.

Once again, assuming that confd is only looking for a change in the number of stacks in the environment, I see 3 possible solutions:

  1. Somehow hold back the confd process from completing before all services are healthy. This might not be desirable, as some service in the environment could be unhealthy during an execution, causing the run to never complete.
  2. Have a second "nested" key in the rules.toml.tmpl file that somehow dynamically checks the individual health of each container before executing rules.toml.tmpl. This also seems like it could break down, similar to option one, if some containers in the environment are never healthy.
  3. Rewrite the rules.toml on an interval regardless of changes to the stacks, so that on a predictable timeline the rules.toml is updated with any healthy containers regardless of changes to the stacks (see the sketch after this list).
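
As a sketch for option 3 only: confd has a -onetime flag to force a single template render and an -interval flag for periodic polling. I have not verified how this image actually launches confd, so everything except those two flags is an assumption:

# force a one-off render (or run it from cron on an interval);
# the -confdir/-backend/-prefix values are assumptions about this image's layout,
# not its verified launch arguments
/opt/tools/confd/bin/confd -onetime -confdir /opt/tools/confd/etc -backend rancher -prefix /2015-12-19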

@lasley
Contributor

lasley commented Aug 11, 2017

@alexisaperez - Regarding confd - I think that it's a dumb implementation & simply rewrites the rules every X units of time.

The reasoning behind this assertion is that when I make the comma change in #51, it's just a few seconds until the rule is updated in Traefik. I'm definitely no confd expert though, so it's possible it's noticing the change in the rules file itself and triggering the update.

@percosys

@lasley I thought that at first as well, but in my testing it seems that the rules.toml only gets updated when the number of stacks in the environment changes. I am also not an expert in confd; it is just what I observed. I think one way that might solve the issue, for my environment at least, would be to change the key in rules.toml.tmpl from /stacks to /containers, but I will have to report back on whether that's feasible.

@adepretis

I'm also having the same problem with frontends/backends not getting updated, although everything is green and healthy. The confd.log shows plenty of:

2017-10-12T12:55:53Z traefik-traefik-1 /opt/tools/confd/bin/confd[24]: ERROR template: traefik.crt.tmpl:1:20: executing "traefik.crt.tmpl" at <getv "/traefik/ssl_c...>: error calling getv: key does not exist

@rawmind0
Owner

rawmind0 commented Nov 7, 2017

Hi all,

From alpine-traefik release 1.4.0-3, traefik's built-in Rancher integration is supported, both metadata and API. Also, the community-catalog is already updated. Now 3 Rancher integrations are available: metadata, API (traefik built-in) or external (rancher-traefik).

Take into account that labels are different with the traefik built-in integration: https://docs.traefik.io/configuration/backends/rancher/#labels-overriding-default-behaviour
Metadata with long polling is the preferred integration; it's working really well. :)

Also, I made a PR, already merged, that will be included in the next traefik release with a refactor of the Rancher integration: traefik/traefik#2291
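
For anyone switching, a rough traefik.toml sketch of the built-in Rancher provider (key names vary between Traefik versions, so treat this as an assumption and check the linked docs rather than copying it verbatim):

[rancher]
domain = "rancher.localhost"
exposedByDefault = false

[rancher.metadata]
# long polling against the metadata service; the prefix value is an assumption
intervalPoll = false
prefix = "/latest"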

Best regards...

@jjscarafia

Great news, great work! Thanks for the update!

@rawmind0
Owner

rawmind0 commented Dec 9, 2017

Hi all,

rancher-traefik has been updated to use rancher-template instead of confd, to get immediate updates from metadata. The Traefik external integration uses it.

Best regards...
