
Gateway Timeout when rolling update a scaled service in Docker swarm mode #1480

Closed
hostingnuggets opened this issue Apr 21, 2017 · 21 comments

@hostingnuggets

What version of Traefik are you using (traefik version)?

v1.2.3

What is your environment & configuration (arguments, toml...)?

docker service create \
    --name traefik \
    --constraint=node.role==manager \
    --publish 80:80 --publish 8080:8080 \
    --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock \
    --network traefik-net \
    traefik \
    --docker \
    --docker.swarmmode \
    --docker.domain=traefik \
    --docker.watch \
    --web

What did you do?

I am following the swarm mode user guide (https://docs.traefik.io/user-guide/swarm-mode/) directly on my Docker manager node. I have set up the whoami0 service and scaled it up to 2 tasks in order to have redundancy. I then wanted to test a rolling update of the service using docker service update --force --update-delay=10s whoami0 and noticed that during the rolling update Traefik returns a Gateway Timeout twice. As far as I understand rolling updates in swarm mode, there should be no downtime, because only one container/task gets stopped at a time, so there is always one container running while the other one gets restarted.
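For reference, the whoami0 service was created and scaled roughly as in the guide (the emilevauge/whoami image and the traefik.port label below are assumed from that guide, not copied verbatim from my setup):

docker service create \
    --name whoami0 \
    --label traefik.port=80 \
    --network traefik-net \
    emilevauge/whoami

docker service scale whoami0=2

# the rolling update that produced the Gateway Timeouts
docker service update --force --update-delay=10s whoami0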

What did you expect to see?

Zero downtime

What did you see instead?

Gateway Timeout

If applicable, please paste the log output in debug mode (--debug switch)

time="2017-04-21T21:47:22Z" level=warning msg="Error forwarding to http://10.0.0.5:80, err: dial tcp 10.0.0.5:80: i/o timeout" 
@timoreimann
Contributor

With other providers (such as Marathon and Kubernetes), rolling upgrades do not release you from the extra effort needed to achieve lossless handovers. I'm not familiar with Docker Swarm -- does it support something like request draining out of the box to get this extra level of convenience?

@hostingnuggets
Author

@timoreimann I hope you are not asking me this question ;-) because I have no clue what request draining is. How can I help in order to find this out?

@timoreimann
Contributor

@hostingnuggets I was asking generally -- you or anyone else you may know. :-) Maybe @vdemeester?

Request draining means that a proxy stops sending requests to a particular backend once it knows that the backend is going to be stopped/replaced soon. It's a way to achieve a graceful handover without involvement from the backend application.

@pascalandy
Contributor

pascalandy commented Apr 24, 2017

The assumption is wrong.

Docker Swarm does NOT support real zero-downtime deployments with rolling upgrades at the moment. moby/moby#30321

IMHO, it's not related to the Traefik project.

@timoreimann
Contributor

Thanks for chiming in and sharing the moby/docker link, @pascalandy.

So, as I understand it, the situation in Docker Swarm is similar to the other providers I'm familiar with: you'll need to teach your apps to handle shutdowns gracefully if you don't want to lose or time out on requests.

Note there's already a long-standing feature request for draining backends: #41.

@hostingnuggets I'm going to close the issue as it doesn't seem like there's anything in particular this issue can or should track. Feel free to post again if you think otherwise.

@hostingnuggets
Author

@timoreimann thanks for the research. I hope a workaround or solution for this issue can be found one day, whether on the Traefik side or the Docker Swarm side.

@timoreimann
Contributor

@hostingnuggets I forgot to mention that a workaround is already available: you can enable retries so that requests against failing/terminating backends are retried against other backends. It's a probabilistic rather than a deterministic approach, but if you set the number of retries to a reasonably high value and don't mind the occasional increase in latency, you should be okay.
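Concretely, with the flag-based setup from the issue description, this amounts to adding the retry options to the Traefik arguments. A minimal sketch, assuming the --retry / --retry.attempts flags that correspond to the [retry] section of the 1.x static configuration (the attempts value is just an example):

# same Traefik service as in the issue description, plus retries enabled
docker service create \
    --name traefik \
    --constraint=node.role==manager \
    --publish 80:80 --publish 8080:8080 \
    --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock \
    --network traefik-net \
    traefik \
    --docker \
    --docker.swarmmode \
    --docker.domain=traefik \
    --docker.watch \
    --web \
    --retry \
    --retry.attempts=10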

@hostingnuggets
Author

@timoreimann thanks for the suggestion. I have now set retries to 30, and indeed, while updating my service with a rolling update I did not get any timeouts from Traefik, just, as you mentioned, some added latency/delay. I would have expected this delay to be around 10 seconds, but it is around 35 seconds, which is IMHO quite long because the containers restart really fast. Maybe I need to tune a timeout parameter somewhere?

@mvdstam
Contributor

mvdstam commented Apr 27, 2017

@hostingnuggets This is probably still a problem in Docker. As demonstrated here, major packet loss can still occur when updating services. All Traefik does is route traffic through the VIP that is provided by the Docker engine.

@hostingnuggets
Author

@mvdstam That's exactly the behavior I experience in my tests. The worst part is that you get these timeouts for each service task (= replica) that needs to be restarted. So in my case, with 2 replicas, I get two time windows where the whole service does not answer HTTP requests, even though the other container is running and ready to serve requests. I hope the Docker folks can find a solution or workaround for this.

@ldez modified the milestone: 1.3 on Apr 29, 2017
@Yshayy
Contributor

Yshayy commented May 5, 2017

@mvdstam As far as I know, from version 1.2+ Traefik by default does not use the Swarm service VIP and instead uses the IPs of the individual tasks/containers (that way it can provide Traefik features like stickiness).

Having encountered a similar issue in the past, I noticed that using the Swarm service VIP load balancer (traefik.backend.loadbalancer.swarm=true) produces better results regarding availability during updates.

If I remember correctly, combining it with image/service health checks actually achieved zero-downtime updates, but the load balancing itself didn't work so well (requests were not distributed equally between containers).
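For anyone who wants to try this, the label goes on the application service, not on Traefik. A minimal sketch (service name, image and port are placeholders):

# route through the Swarm service VIP instead of the individual task IPs
docker service create \
    --name whoami0 \
    --network traefik-net \
    --label traefik.port=80 \
    --label traefik.backend.loadbalancer.swarm=true \
    emilevauge/whoami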

In order to achieve zero-downtime availability, we are currently using two services and changing the priority label between them to do rolling updates (blue-green-style deployments), but the retry approach sounds interesting as well.
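A rough sketch of that blue-green label flip (service/image names and the Host rule are placeholders; this assumes Traefik routes to the frontend with the higher traefik.frontend.priority when two services share the same rule):

# "blue" is live and currently wins the frontend rule
docker service create \
    --name app-blue \
    --network traefik-net \
    --label traefik.port=80 \
    --label traefik.frontend.rule=Host:app.example.com \
    --label traefik.frontend.priority=20 \
    myorg/app:v1

# deploy "green" with the new version at a lower priority and verify it ...
docker service create \
    --name app-green \
    --network traefik-net \
    --label traefik.port=80 \
    --label traefik.frontend.rule=Host:app.example.com \
    --label traefik.frontend.priority=10 \
    myorg/app:v2

# ... then flip the priority so "green" takes the traffic, and remove "blue"
docker service update --label-add traefik.frontend.priority=30 app-green
docker service rm app-blue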

@hostingnuggets I think the reason for the long delays lies in the implementation of Traefik's Docker provider: the backend endpoints are based on the Swarm task list (which also contains each task's status/desired status), and Traefik polls this list quite slowly:
https://github.com/containous/traefik/blob/v1.3/provider/docker/docker.go
SwarmDefaultWatchTime = 15 * time.Second
This means it can take up to 15 seconds to get the current list of tasks and their statuses; for example, it might take several seconds to notice that a task is shutting down, that a new task was created, or that it became healthy.

When using traefik.backend.loadbalancer.swarm=true, it can work instantly because the backend endpoint does not change at all (it's the service VIP) and Swarm itself is responsible for managing the backends.

@hostingnuggets
Author

@Yshayy Thanks for your detailed input, very interesting! I can confirm that running a web app container in swarm mode with two replicas and the label traefik.backend.loadbalancer.swarm=true does not generate any noticeable interruptions/timeouts.

The downside, as you say, is that the load balancing among the replicas/tasks does not work at all. For instance, I am running a GET every second against a web app that displays its hostname, and after 30 minutes it is still querying the same container. Any ideas how to fix that behaviour? I do not care about session stickiness...

@Yshayy
Contributor

Yshayy commented May 5, 2017

@hostingnuggets Unfortunately, we didn't find a suitable solution for the load-balancing behavior (we didn't care about stickiness either), but the load-balancing issue was critical for us. We noticed that when load-testing without HTTP keep-alive in the client, the load balancing worked better (we load-tested with JMeter).
I think we also tested the Swarm VIP endpoint directly (without Traefik) and that worked OK, but it was some time ago, so I'm not sure.
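A quick way to repeat that direct-VIP test is to hit the service VIP from inside the overlay network with one connection per request, so the Swarm VIP (IPVS balances per connection) can pick a different task each time. A sketch, assuming the whoami0 service from the guide and that traefik-net was created as an attachable overlay network (otherwise run the loop from an existing container on that network):

# each wget opens a fresh TCP connection to the service VIP, bypassing Traefik
docker run --rm --network traefik-net alpine \
    sh -c 'for i in $(seq 1 10); do wget -qO- http://whoami0/ | grep Hostname; done'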

In the end, we decided not to use the Swarm internal load balancer and instead run two services and switch the priority label between them; although this solution is pretty stable, it has the disadvantage of maintaining an additional service.

Still, after seeing the retry trick, I think retrying failed requests might be the most convenient solution if it's possible to decrease the swarm polling interval to 1-2 seconds.

@hostingnuggets
Author

@Yshayy any idea whether someone from the Docker team reads these issues and could take a look into why the VIP does not load-balance nicely when used by Traefik?

Regarding changing the Swarm polling interval from 15 to 1-2 seconds: I can't find the parameter in the traefik.toml file for the Docker backend. It looks like most of the backends have a RefreshSeconds parameter but not the Docker backend, or it is not mentioned in the documentation (http://docs.traefik.io/toml/#docker-backend).

@timoreimann
Contributor

@hostingnuggets looking at the code, the interval seems to be static. It should be fairly easy though to make it configurable.

@vdemeester is there anything that speaks against exposing the parameter to the configuration space?
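For reference, a configurable interval could look something like this. This is hypothetical on 1.2/1.3, where SwarmDefaultWatchTime is hard-coded; later 1.x releases exposed a swarmModeRefreshSeconds option on the Docker provider, and the flag below assumes the CLI name mirrors that option:

# hypothetical on 1.2/1.3; flag name assumed to mirror the later
# swarmModeRefreshSeconds Docker provider option
docker service update \
    --args "--docker --docker.swarmmode --docker.domain=traefik --docker.watch --web --docker.swarmModeRefreshSeconds=2" \
    traefik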

@Yshayy
Contributor

Yshayy commented May 6, 2017

@hostingnuggets I'm less familiar with the work being done in docker/moby/swarmkit, and I haven't seen any issue regarding this behavior. Frankly, I'm not sure whether it's an issue with Swarm, Docker's overlay network driver, Traefik, or something else entirely.

@timoreimann, I think it could be useful, although I have no idea what the recommended threshold should be or how long a refresh request actually takes; maybe setting it too low could cause problems?
There's also this todo in the code:

// TODO: This need to be change. Linked to Swarm events docker/docker#23827

An event-based approach certainly sounds more efficient and more reliable.
It seems there's some progress on exposing service events in the moby/swarmkit projects:
moby/swarmkit#2034
moby/moby#32421

@hostingnuggets
Author

Thanks to both of you for your input and comments. As an idea, it might be interesting for you to have a look at how the Docker Flow Proxy (http://proxy.dockerflow.com/) manages zero downtime when doing rolling updates of Docker Swarm services. AFAIK they use HAProxy inside their container, so there might be a trick there, or further hints for Traefik on how to also achieve zero downtime (no matter on which side the issue really lies).

@pascalandy
Contributor

pascalandy commented May 12, 2017

EDIT: It does not resolve this issue. It looks fancy, though.

$ docker service update -d=false --image abiosoft/caddy:0.10.2 caddy_webapp-caddy-a
caddy_webapp-caddy-a
overall progress: 1 out of 1 tasks
1/1: running   [==================================================>]
verify: Waiting 1 seconds to verify that tasks are stable...

It still takes about 30 seconds to get our service back online.

+++

There is a new flag, -d=false. We can now do a synchronous service scale with the new synchronous service create/update feature.

BEFORE:
docker service scale http_http=5

NOW:
docker service update -d=false --replicas 10 http_http

Cheers!
Pascal

@pascalandy
Contributor

Follow this issue as well - moby/moby#30321 (comment)

@Yshayy
Contributor

Yshayy commented May 29, 2017

@pascalandy I think lowering the polling interval in Traefik, or making it customizable (instead of a fixed 15 seconds), could make some difference.
In the near future, the Docker events API will probably provide an opportunity for a better solution.

@pascalandy
Contributor

I'll leave this to the pros :-)
