
Traefik not updating config #42
Closed
adamgraves-choices opened this issue Mar 27, 2017 · 35 comments

@adamgraves-choices

Hi,

We've got an intermittent issue where traefik isn't updating the frontend and backend configuration in our Rancher environment.

New stacks and changes to stacks sometimes don't get reflected in the config. Sometimes it resolves itself within approx. 10-60 minutes, but on other occasions we have to restart the Traefik stack. Sometimes that doesn't help, and we have ended up destroying the environment and rebuilding it from scratch to resolve the issue.

Last time it occurred I tested the rancher-metadata service to ensure that was working, and everything looked fine from there.

Anyone else encountering this?

@joshuacox

I am indeed noticing this behavior. I have noticed that I have some containers set with really long health checks, and when those are in play I think it tends to exacerbate the problem.

@ghost

ghost commented Apr 21, 2017

I have the same problem. When I upgrade a server and the IP address changes, it does not get reflected in the traefik config. Is there a way to manually regenerate the rules & traefik.toml? Currently I restart the traefik Docker container and the config is correct again, but this is not suitable for production.

@joshuacox

@rawmind0 any recommendations on how to fix this in situ? I have tried restarting either rancher-traefik or alpine-traefik, or both, with curious results, one of which was being banned from letsencrypt by rate limiting :(

I'd like to know if there is a better method, perhaps a command I can run inside one of the containers to force it to reload its configuration without dropping all the certs.

Another thought is that maybe we could have a version of this that keeps all its configs in a convoy-nfs mount.

I know all of this might be moot as well once traefik begins to natively support rancher.

@rawmind0
Owner

Hi guys,

Sorry about the issues you have been suffering. Could you please provide some more details?

BTW, inside the alpine-traefik container, you can restart traefik or confd without needing to restart the container:

monit restart traefik
#or
monit restart confd
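
If you are on the Docker host rather than inside the container, a minimal sketch of the same thing (the container name is a placeholder for your actual alpine-traefik container):

# run monit from the host; <alpine-traefik-container> is a placeholder
docker exec -it <alpine-traefik-container> monit restart confd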

@ghost

ghost commented Apr 27, 2017

At the beginning everything worked fine, but after some time rancher-traefik did not update to a new IP after an upgrade of a container (and the resulting IP change). It still had the old IP address for the backend. I am not sure, but it could be related to an update to Rancher version 1.5.3. Currently I am testing the new native Traefik Rancher backend and it looks promising.

@snahelou

snahelou commented May 2, 2017

Hello

I have the same problem with Rancher 1.5.3 and Traefik rawmind/alpine-traefik:1.2.3-1

EDIT:

Maybe it's because confd does not refresh the metadata:

bash-4.3$ curl http://rancher-metadata
curl: (6) Couldn't resolve host 'rancher-metadata'

BTW, DNS is working on other containers and metadata works.

Due to rancher/rancher#5041

I tried to add a search domain in the Rancher UI, and after the upgrade DNS is now working, but the confd-generated config is still empty :(

@rawmind0
Owner

rawmind0 commented May 2, 2017

Hi @snahelou ...

This is not the cause of the problem... confd is able to resolve the Rancher URI and connect. This problem is with Alpine's curl, not confd. If you do curl http://rancher-metadata.rancher.internal it should work.

Please publish the confd logs... /opt/tools/confd/log/confd.log inside the alpine-traefik containers.

Do your services have health checks configured?
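
For anyone checking the same thing from inside the alpine-traefik container, a quick sketch using the URL and log path mentioned above (an informal check, not an official procedure):

curl http://rancher-metadata.rancher.internal
tail -n 50 /opt/tools/confd/log/confd.log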

@snahelou

snahelou commented May 2, 2017

Hello

Yes, sorry, DNS was not the problem.

I had the following error

2017-05-02T12:42:28Z traefik-traefik-1 /opt/tools/confd/bin/confd[159]: ERROR template: rules.toml.tmpl:41:34: executing "rules.toml.tmpl" at <getv (printf "/stack...>: error calling getv: key does not exist
              {{- $back_status := getv (printf "/stacks/%s/services/%s/containers/%s/health_state" $stack_name $service_name $container) -}}

I removed 2 stacks and the service came back available. It's strange because the stacks were green.

@rawmind0
Owner

rawmind0 commented May 2, 2017

It seems you didn't have health checks configured... Health checks are mandatory; only healthy backends are added to traefik.
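
For reference, a minimal health check sketch in rancher-compose.yml (assuming the Rancher 1.x compose v2 format; the service name and values are only an illustration):

version: '2'
services:
  nginx:
    health_check:
      # simple HTTP check on the container port; tune values to your service
      port: 80
      request_line: GET / HTTP/1.0
      interval: 2000
      response_timeout: 2000
      healthy_threshold: 2
      unhealthy_threshold: 3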

@snahelou

snahelou commented May 2, 2017

OK, strange; health checks were configured because I use a Jenkins multibranch pipeline and the other branches work well.

Thanks for your support.

Regards

@jjscarafia

Hi!
I've got an intermittent issue very similar to this one where traefik isn't updating the frontend and backend configuration in our Rancher environment on every host (some hosts are updated).

New stacks and changes to stacks sometimes don't get reflected in every host config.

About our configuration:

  • I've two hosts running traefik (http://34.201.12.10:8000 and http://54.210.1.168:8000/)
  • One has the configuration updated (traefik-1) and the other doesn't (traefik-2)
  • Doing "monit restart confd" solves the issue, but it happens again later if we add new stacks
  • I'm using traefik on "nginx" services inside stacks (check the nginx service labels on the image attached)
  • I'm using Rancher 1.5.10 with the traefik catalog "1.2.3-1" (latest version)
  • I've tested running "curl http://rancher-metadata.rancher.internal" and it works; it returns something on both hosts
  • Find attached log files from the two traefiks and also from /opt/tools/confd/log/confd.log
  • Find attached also the health checks configured on nginx; all stacks and services are green
  • We are using RancherOS on the hosts (deployed with AWS EC2)

One note: the confd log of traefik-1 shows the error "executing "rules.toml.tmpl" at <getv (printf "/stack...>: error calling getv: key does not exist", but traefik-1 is the one configured OK; traefik-2 is the one that is not configured OK (not refreshed). I've also checked every traefik label on the servers and they are exactly the same as the ones attached.

Anyone else with the same? Thanks!
Juan

healthchecks (screenshot)

traefik 2 dashboard where test-portal1-14-06 service is not discovered (traefik-2-dashboard screenshot)

traefik 1 dashboard where test-portal1-14-06 service is discovered (traefik-1-dashboard screenshot)

nginx labels (nginx-labels screenshot)

traefik-1-confd.txt
traefik-2.txt
traefik-1.txt
traefik-2-confd.txt

@jjscarafia

Some more information: I've checked the file /opt/traefik/etc/rules.toml on traefik-1 and traefik-2, and on both of them the "test-portal1-14-06" service configuration is present. I don't know why traefik does not reload; perhaps it's related to this?

@jjscarafia

@rawmind0 any help on this? Any suggestions? Can you please check my post in this issue?

@snahelou

Check if all of your stacks are green, even those without traefik labels.
When I have errors on a stack, it makes my confd unstable. In your case, it's very strange that one server works and the other doesn't.

Regards

@dbsanfte

dbsanfte commented Jun 22, 2017

When a container crashes and restarts itself, Traefik correctly removes the container from the pool but doesn't re-add it once it has restarted. I have to manually scale the stack up and down to get Traefik to pick it up. Any ideas?

Considering abandoning this image and going for the native Rancher support in Traefik 1.3 to see if that resolves it.

@jjscarafia

jjscarafia commented Jun 22, 2017

@dbsanfte, no idea; I've tried evacuating a host and traefik updates correctly when the new containers are created on other hosts.
@snahelou thanks for the response! I have all stacks on green.

Some tests I've done; I'm not sure they are the ones that make it work now (just in case it helps someone):

  1. Using Ubuntu 16.04 for the hosts (docker 1.12.6) instead of RancherOS v1.0.2 (docker 17.03.1-ce) seems to work better, but it is not a conclusion yet
  2. As @snahelou suggested here, it seems that if I stop a stack and, while the stack is stopped (red), create new stacks, confd gets confused and the traefik config is not refreshed
  3. Before, I was adding the label "traefik.alias.fqdn" with an empty value to every service where I was using traefik, and with a value only on the services where I wanted one. I've deleted this label and kept it only where it was necessary (see the label sketch below)

With no more red stacks and using Ubuntu 16.04, traefik seems to be working OK for at least 24 hours.
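
For reference, the kind of per-service labels involved, in docker-compose.yml form (label names are the ones mentioned in this thread; the service name and values are placeholders):

services:
  nginx:
    labels:
      traefik.enable: 'true'
      traefik.port: '80'
      # per point 3 above, set traefik.alias.fqdn only where you actually need a value
      traefik.alias.fqdn: 'portal.example.com'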

@rawmind0
Owner

@jjscarafia, your case is really strange...

In your confd log files, the last update should set the rules.toml file to the same content... It's very strange that it works on just one server... Are the infrastructure services working well on both?
traefik-2-confd.txt

2017-06-14T12:43:59Z traefik-traefik-2 /opt/tools/confd/bin/confd[143]: INFO /opt/traefik/etc/rules.toml has md5sum bf6b2298be0acf958ad37fac08f7180d should be 73983e979b367f06346659a41726824f
2017-06-14T12:43:59Z traefik-traefik-2 /opt/tools/confd/bin/confd[143]: INFO Target config /opt/traefik/etc/rules.toml out of sync
2017-06-14T12:43:59Z traefik-traefik-2 /opt/tools/confd/bin/confd[143]: INFO Target config /opt/traefik/etc/rules.toml has been updated

traefik-1-confd.txt

2017-06-14T12:44:09Z traefik-traefik-1 /opt/tools/confd/bin/confd[24]: INFO /opt/traefik/etc/rules.toml has md5sum bf6b2298be0acf958ad37fac08f7180d should be 73983e979b367f06346659a41726824f
2017-06-14T12:44:09Z traefik-traefik-1 /opt/tools/confd/bin/confd[24]: INFO Target config /opt/traefik/etc/rules.toml out of sync
2017-06-14T12:44:09Z traefik-traefik-1 /opt/tools/confd/bin/confd[24]: INFO Target config /opt/traefik/etc/rules.toml has been updated

Is it working well with Ubuntu and docker 1.12.6?

@jjscarafia

Hi @rawmind0 and thanks for the comments!

  1. I've just updated all infrastructure services (they showed an available upgrade).
  2. Yes, it seems that with Ubuntu 16.04 (docker 1.12.6) it is working OK, but I will give RancherOS another chance and will share the results.
  3. The only "red" container I have is "rancher-agent-bootstrap", which is only visible on hosts (image attached). Could this be interfering in any way?

@rawmind0 just in case you are available and willing, I can give you access to the Rancher instance; just send me an email at jjs@adhoc.com.ar

seleccion_055 (screenshot)

@rawmind0
Owner

rawmind0 commented Jun 22, 2017

Hi @jjscarafia ...

  1. The strangest thing is that it works on one server and not on the other. Please upgrade the infrastructure services to the latest version.
  2. More than RancherOS vs. Ubuntu, the problem could be with docker version 1.12.6 vs 17.03.1... Thanks for testing and sharing the results, I really appreciate it...
  3. The only "red" containers that could affect traefik confd would be in stacks with traefik.enable=true; those are the only ones confd looks for.

Best regards....

@jjscarafia

I've been playing for a while and I can see that:

  1. I could reproduce the error of the traefik conf not updating by stopping stacks (they become red) and creating new stacks with traefik labels.
  2. During that period the log looks like:
 "/stack...>: error calling getv: key does not exist
2017-06-22T21:05:46Z adhoc-traefik-traefik-3 /opt/tools/confd/bin/confd[23]: ERR
OR template: rules.toml.tmpl:41:34: executing "rules.toml.tmpl" at <getv (printf
 "/stack...>: error calling getv: key does not exist
2017-06-22T21:06:01Z adhoc-traefik-traefik-3 /opt/tools/confd/bin/confd[23]: ERR
OR template: rules.toml.tmpl:41:34: executing "rules.toml.tmpl" at <getv (printf
 "/stack...>: error calling getv: key does not exist
2017-06-22T21:06:16Z adhoc-traefik-traefik-3 /opt/tools/confd/bin/confd[23]: ERR
OR template: rules.toml.tmpl:41:34: executing "rules.toml.tmpl" at <getv (printf
 "/stack...>: error calling getv: key does not exist
2017-06-22T21:06:31Z adhoc-traefik-traefik-3 /opt/tools/confd/bin/confd[23]: ERR
OR template: rules.toml.tmpl:41:34: executing "rules.toml.tmpl" at <getv (printf
 "/stack...>: error calling getv: key does not exist
  3. After restarting the stopped stacks, the traefik conf was updated automatically again.
  4. This didn't always happen; sometimes I can stop a stack and new stacks are still auto-discovered (I guess it was related to sorting, stack names or something like that).
  5. I couldn't yet replicate the error where one traefik conf was updated and the others were not.

@dbsanfte

Moving over to the native Traefik Rancher support resolved my issue with my crashed/auto-restarted Node.js containers not being picked up by this image.

@jjscarafia

@dbsanfte good to know, and thanks for sharing. Are you also using ACME support with the native Rancher support?

@dbsanfte

No we're just defining a plain old SSL cert/key, no ACME.

@lasley
Contributor

lasley commented Aug 3, 2017

I just hit this one too. In my case, a host went down which caused some stacks to migrate to another host.

There were some other stacks that were simply stopped because I didn't want them alive at the moment. Traefik did not start updating until I started those stacks as well, which I could then stop at my leisure.

@jjscarafia

@lasley moving to the native traefik support for Rancher made it work OK for me.
If it helps, this is my very ugly rancher-catalog template

@adamgraves-choices
Author

@jjscarafia I've built something similar using the native rancher templates: https://github.com/nhsuk/traefik-rancher

Unfortunately I've come across a critical bug which stops us using Traefik for now: traefik/traefik#1927

@jjscarafia

@adamgraves-choices thanks for the feedback. It seems that was the issue I faced yesterday...

@lasley
Contributor

lasley commented Aug 9, 2017

Honestly I thought I was just screwing up somehow so I wasn't even going to say anything 😆

@percosys

I am having a similar issue. I was able to get past the error in the log message by setting an environment variable, CONF_PREFIX, to /latest, which seems to have triggered confd to look at the latest route in the Rancher metadata service instead of the default /2015-12-19. However, I am still having an issue with the correct rules being written.
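
For reference, a sketch of how that variable can be set on the traefik service in docker-compose.yml (the service name is a placeholder; CONF_PREFIX is the variable described above):

services:
  traefik:
    environment:
      # point confd at the latest metadata route, as described above
      CONF_PREFIX: /latest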

When confd completes its interval I do in fact see a new /opt/traefik/etc/rules.toml file, but it is missing the URL and backends params shown in the template.

I believe it is skipping over the following block in the template because rancher-metadata has not yet registered that the container is healthy by the time confd finishes writing the new rules.toml.

{{- if eq $back_status "healthy" }}
    [backends.{{$service_name}}__{{$stack_name}}.servers.{{getv (printf "/stacks/%s/services/%s/containers/%s/name" $stack_name $service_name $container)}}]
                {{- if eq $traefik_protocol "https"}}
      url = "{{$traefik_protocol}}://{{getv (printf "/stacks/%s/services/%s/containers/%s/primary_ip" $stack_name $service_name $container) -}}:
                {{- else}}
      url = "http://{{getv (printf "/stacks/%s/services/%s/containers/%s/primary_ip" $stack_name $service_name $container) -}}:
                {{- end -}}
                {{- if exists (printf "/stacks/%s/services/%s/labels/traefik.port" $stack_name $service_name) -}}
                    {{getv (printf "/stacks/%s/services/%s/labels/traefik.port" $stack_name $service_name)}}
                {{- else -}}
                80
                {{- end}}"
      weight = 0
              {{- end -}}
            {{- end -}}

It seems that when confd is triggered to run, it detects a change in the number of stacks in "latest", but if the container is not "healthy" by the time it writes the new rules file, it will skip over that part of the template.

My suspicion is that, since the number of stacks doesn't change by the next interval, the rules.toml doesn't get updated until the number of stacks changes in Rancher, which could be a long time or even never.

If my suspicion is correct, then is there a better methodology for updating the rules.toml other than counting the number of stacks in Rancher?

I do have health checks configured on all my stacks so I am not sure how to move forward.

Once again, assuming that confd is only looking for a change in the number of stacks in the environment, I see 3 possible solutions:

  1. Somehow hold back the confd process from completing before all services are healthy. This might not be desirable, as some service in the environment could be unhealthy during an execution, causing the run to never complete.
  2. Have a second "nested" key in the rules.toml.tmpl file that somehow dynamically checks the individual health of each container before executing rules.toml.tmpl. This also seems like it could break down, similar to option one, if some containers in the environment are never healthy.
  3. Rewrite the rules.toml on an interval regardless of changes to the stacks, so that on a predictable timeline the rules.toml is updated with any healthy containers regardless of changes to the stacks (see the sketch after this list).
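
As a sketch for option 3 only: confd has a -onetime flag to force a single template render and an -interval flag for periodic polling. I have not verified how this image actually launches confd, so everything except those two flags is an assumption:

# force a one-off render (or run it from cron on an interval);
# the -confdir/-backend/-prefix values are assumptions about this image's layout,
# not its verified launch arguments
/opt/tools/confd/bin/confd -onetime -confdir /opt/tools/confd/etc -backend rancher -prefix /2015-12-19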

@lasley
Contributor

lasley commented Aug 11, 2017

@alexisaperez - Regarding confd - I think that it's a dumb implementation & simply rewrites the rules every X units of time.

The reasoning behind this assertion is that when I make the comma change in #51, it's just a few seconds until the rule is updated in Traefik. I'm definitely no confd expert though, so it's possible it's noticing the change in the rules file itself and triggering the update.

@percosys

@lasley I thought that at first as well, but in my testing it seems that the rules.toml only gets updated when the number of stacks in the environment changes. I am also not an expert in confd; it is just what I observed. I think one way that might solve the issue, for my environment at least, would be to change the key in rules.toml.tmpl from /stacks to /containers, but I will have to report back on whether that's feasible.

@adepretis

I'm also having the same problem with frontends/backends not getting updated, although everything is green and healthy. The confd.log shows plenty of:

2017-10-12T12:55:53Z traefik-traefik-1 /opt/tools/confd/bin/confd[24]: ERROR template: traefik.crt.tmpl:1:20: executing "traefik.crt.tmpl" at <getv "/traefik/ssl_c...>: error calling getv: key does not exist

@rawmind0
Owner

rawmind0 commented Nov 7, 2017

Hi all,

From alpine-traefik release 1.4.0-3, traefik's built-in Rancher integration is supported, both metadata and API. Also, the community-catalog is already updated. Now 3 Rancher integrations are available: metadata, API (traefik built-in) or external (rancher-traefik).

Take into account that labels are different with the traefik built-in integration: https://docs.traefik.io/configuration/backends/rancher/#labels-overriding-default-behaviour
Metadata with long polling is the preferred integration; it's working really well. :)

Also, I made a PR, already merged, that will be included in the next traefik release with a refactor of the Rancher integration: traefik/traefik#2291
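
For anyone switching, a rough traefik.toml sketch of the built-in Rancher provider (key names vary between Traefik versions, so treat this as an assumption and check the linked docs rather than copying it verbatim):

[rancher]
domain = "rancher.localhost"
exposedByDefault = false

[rancher.metadata]
# long polling against the metadata service; the prefix value is an assumption
intervalPoll = false
prefix = "/latest"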

Best regards...

@jjscarafia

Great news, great work! Thanks for the update!

@rawmind0
Owner

rawmind0 commented Dec 9, 2017

Hi all,

rancher-traefik has been updated to use rancher-template instead of confd, to get immediate updates from metadata. The Traefik external integration uses it.

Best regards...
