
Rebalancing service containers across hosts #2558

Closed
gordontyler opened this issue Nov 5, 2015 · 26 comments
Labels
area/scheduler kind/feature Issues that represent larger new pieces of functionality, not enhancements to existing functionality

Comments

@gordontyler

Say for example that I have N hosts and a number of containers running on those hosts. Initially, the load was fine, but it has subsequently increased and I'm facing resource constraints on these hosts. I need to add more hosts to my system and reassign existing containers to these new hosts.

In the case of stateless containers, this should be fairly easy -- destroy existing containers and recreate them on the new hosts.

It's harder in the case of stateful containers, but a stop, export, remove, load, start sequence would probably work, although I'm not sure how volumes would factor into that.

It would be super awesome if Rancher could handle at least the stateless container case for me. Something like an action on a service or a stack, maybe, to "rebalance" the containers across available hosts.

@ndelitski

I suppose disabling a host, plus some combination of scale -1 then +1, would work? And maybe when you disable a host, Rancher could automatically start rebalancing? For stateful containers I am using a combination of Convoy + EBS, but not in volume-driver mode :D So in that case I have to add unmount/mount and detach/attach steps.

@gordontyler
Author

So far I've done it by deleting and recreating the services from compose files. This is only possible because they're stateless.

When disabling a host, Rancher already seems to recreate its containers on other hosts.

@deniseschannon deniseschannon added kind/enhancement Issues that improve or augment existing functionality area/scheduler labels Nov 5, 2015
@deniseschannon

From what I'm reading, you want to be able to click on something to rebalance a service so that it distributes containers across the additional hosts you've added into your environment. This would be for a service with a specific scale, as opposed to a global service, since a global service would obviously start more containers on any additional hosts that match its scheduling rules.

@deniseschannon deniseschannon added kind/feature Issues that represent larger new pieces of functionality, not enhancements to existing functionality and removed kind/enhancement Issues that improve or augment existing functionality labels Nov 5, 2015
@gordontyler
Author

Correct.

@demarant

demarant commented Jan 8, 2016

+1 this is a requirement for us as well

@will-chan as discussed via email, I found this ticket, which I think is closely related.

We must be able to guarantee that a Stack runs with at least N containers spread across at least two hosts. This is a basic requirement for HA: you don't want your entire Stack (or a large part of it) running on a single host. That host becomes a single point of failure, and even though Rancher will restore the scale, this can be undesirable for containers that take a long time to start.

Rebalancing is not possible with the current Rancher scheduling rules; it only happens if you manually intervene by deleting hosts, as @gordontyler already mentioned.

The current, simple Rancher scheduling algorithm, which places containers on the host with the fewest containers, has an undesired effect: it only works the first time you deploy onto hosts with an equal number of containers. In the longer term, as hosts fail or get replaced, you end up with very undesirable stack distributions, e.g. an entire stack (or a large part of it) on a single host.

Imagine the following (simplified). You start from scratch with Rancher, create a new environment "production", and put 3 new hosts there (A, B, C). Then you deploy your first Stack A, which has one service (with a health check) at a scale of 6 (no affinity rule, just to keep the example simple and let it distribute across all hosts). You will get 2 containers on each host. All fine here, nice redundancy across hosts.

Now simulate a disruption of service, e.g. take down host A. Rancher will detect the 2 containers as "unhealthy" and place them on the other hosts, one on B and one on C, so each now has 3 containers. Now simulate that host A comes back online. What do you see? Host A is free; the initial containers are not rescheduled there. The other hosts stay packed with more containers, and the cluster/environment does not rebalance. So as time goes on, more hosts end up empty and fewer hosts end up increasingly packed, unless you manually intervene now and then, which is exactly what we do not want to do; we want Rancher to take care of that for us in a graceful way.

Now, if you launch a new Stack B with, say, one service at a scale of 2, both containers will be launched on host A. So now we have Stack B on only one node, even though there are 3 nodes available. We (the operations team) could establish a policy that every time we deploy a new Stack we add at least two empty hosts, but that would be weird and would over-allocate hosts over time.

I have tried many different things, using hard and soft anti-affinity rules such as io.rancher.scheduler.affinity:container_label_soft_ne: io.rancher.stack_service.name=stack_name/service_name, as suggested.

It still does not guarantee that a stack runs on at least two hosts, nor does it prevent the undesirable long-term distribution described above.

So basically, every time we have maintenance on a host, we are forced to put a new host in place and just destroy the old one, so that the containers are placed on the new host directly. But this does not protect us from unexpected host/network failures: Rancher will not rebalance, and you end up with empty hosts where new stacks will possibly be deployed entirely.

We expect Rancher to rebalance and to make a multi-host setup, at least for ephemeral services, very easy to achieve for the GA release. Rancher is meant for production deployments.

The quick-and-dirty idea we have now is to either duplicate identical services within a Stack or deploy duplicate copies of the same Stack.

For example, imagine I have a stack "mystack":

docker-compose.yml:

myapp1:
  image: myorg/myapp:mytag
  labels:
    io.rancher.scheduler.affinity:container_label_soft_ne: io.rancher.stack_service.name=mystack/myapp1
    io.rancher.scheduler.affinity:container_label_ne: io.rancher.stack_service.name=mystack/myapp2
myapp2:
  image: myorg/myapp:mytag
  labels:
    io.rancher.scheduler.affinity:container_label_soft_ne: io.rancher.stack_service.name=mystack/myapp2
    io.rancher.scheduler.affinity:container_label_ne: io.rancher.stack_service.name=mystack/myapp1

rancher-compose.yml:

myapp1:
  scale: 1
  health_check:
    port: 3000
    interval: 2000
    unhealthy_threshold: 3
    response_timeout: 2000
    healthy_threshold: 2
myapp2:
  scale: 1
  health_check:
    port: 3000
    interval: 2000
    unhealthy_threshold: 3
    response_timeout: 2000
    healthy_threshold: 2

The above, I hope, ensures that at least 2 containers of the same application (I use the same image for app1 and app2) are not running on the same host. Being separate services, I can scale them independently, and the soft anti-affinity still allows them to land on the same host if the others are too packed. It is a kind of dirty trick, as I don't like duplicating identical services, but we will experiment with it. It can also be done by deploying two identical Stacks. Maybe Rancher should add scale to the Stack as well?

@roynasser

+1 I generally agree with most of what's here ;) One thing which isn't really acceptable is the delete-host approach... at least not for me... Let's say a degraded number of hosts has led our stack to concentrate X services on one or more hosts. If I then get the number of hosts back up, the last thing I'd want to do is delete a host in order to have containers balance out... If this were a simple/small 2-host deployment it would be even more critical, as deleting the only host left = no service at all.... It's a case of "two wrongs to make a right" :p I'm sure in the long run, one of the more involved strategies is the way to go...

Implementing container migration would, IMO, be the way out... I'm not sure if it's 100% production ready, but there are some demos of Docker containers being moved almost seamlessly... For stateful containers I would assume you would need to be using a "cluster-wide" FS such as Convoy with Gluster or NFS? (Otherwise you will always have an issue.... the container may be stateful, but can you always count 100% on the host? If the host is a SPOF then you won't be delivering HA.)

@demarant

@RVN-BR exactly, we want Rancher to automatically rebalance services and make sure a Stack is on at least two hosts. This could be done by allowing one to say in docker-compose, via specific Rancher labels, "run this service with a total scale of X, with at least Y instances on each host that meets the scheduling rules". That would solve my HA requirements.
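
Purely for illustration, a hypothetical docker-compose fragment of what such labels could look like (these label names do not exist in Rancher; they are invented here only to make the idea concrete):

myservice:
  image: myorg/myapp:mytag
  labels:
    # hypothetical: total number of containers for the service
    io.rancher.scheduler.spread.scale: '6'
    # hypothetical: at least this many instances on each host that meets the scheduling rules
    io.rancher.scheduler.spread.min_per_host: '2'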

@gmehta3

gmehta3 commented May 27, 2016

+1

@courcelm

+1

@ghost

ghost commented Jun 24, 2016

+1

@kvaes

kvaes commented Aug 15, 2016

+1

@blackside

+1

@mccricardo

+1

@marcbachmann

Please stop posting +1; there are GitHub reactions for that.

@OlivierCuyp

We are in the testing phase of Rancher, coming from DockerCloud.
DockerCloud proposes 3 deployment strategies:

  • Emptiest node: containers are deployed on the node that has the least containers running
  • Every node: 1 container will be deployed on each node (no scaling possible)
  • HA: containers will be spread equally on each node

All of our application containers are currently deployed with the HA strategy.
This is maybe one of the biggest features we miss in Rancher.

@Napsty

Napsty commented Feb 1, 2017

We recently upgraded to 1.3.3, and I just realized that this can be achieved with the option "Always run one instance of this container on every host" when adding a service.
This takes care of running one instance per host, effectively giving you HA. However, it doesn't allow scaling up afterwards (scale is set to "Global") if additional containers are wanted or needed.
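
For completeness, the compose-file counterpart of that UI option is, as far as I know, the global scheduling label, roughly:

myservice:
  image: myorg/myapp:mytag
  labels:
    # "Always run one instance of this container on every host"
    io.rancher.scheduler.global: 'true'

The limitation above applies either way: a global service has no scale of its own, so you cannot ask for more than one instance per host.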

@mheiniger

Looks like #7253 could solve that problem by

schedule containers evenly across those pool of hosts

@candlerb

candlerb commented Jun 23, 2017

You can get some way towards this by:

  1. Defining a soft anti-affinity scheduling rule for each multi-instance service in the stack (the compose-label equivalent is shown after this list). [^1]

    # Under Scheduling tab
    The host [should not] have a [service with the name] [Test-app/web]
    
  2. Do an "upgrade" on a service, to automatically destroy and re-create the containers.
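
For anyone applying step 1 via compose labels rather than the Scheduling tab, the same rule can be expressed with the soft anti-affinity label already shown earlier in this thread (Test-app/web is just the example stack/service name):

web:
  image: myorg/web:mytag
  labels:
    # soft rule: prefer hosts that are not already running Test-app/web
    io.rancher.scheduler.affinity:container_label_soft_ne: io.rancher.stack_service.name=Test-app/web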

Unfortunately, if there is a new (empty) host in the cluster, and your service is already running across all hosts, you might end up with all your containers created on the empty host.

A simple solution would be that any container which is stopped (or to-be-stopped) during an upgrade cycle is not counted towards the anti-affinity rule. This might be the case already - I have not checked.

Another option is for the anti-affinity rule to be weighted by the number of matching instances - that is, it would prefer the host with the fewest number of running instances, rather than only matching a host with zero instances.

The problem with using an anti-affinity rule like this is that it will try to force the service to run on as many hosts as possible - taking priority over resource concerns, until one instance is running on every node. However in practice, the user probably only wants to be sure that it's running on 2 or 3 nodes for redundancy.

A better approach could be to weight the constraint inversely proportional to the number of nodes where the service is already running. For example: if the service is only running on one node, then aggressively choose a different node. If it is running on two nodes then prefer to run it on a third. If it is running on three nodes then weakly prefer it to run on a fourth. At this point, balancing of other resources is probably more important, since you have good redundancy.

I would argue that this sort of anti-affinity should be part of the default scheduler behaviour, since this is probably what people expect. That is, if they request more than one instance of the same service then it's likely to be for redundancy purposes, not just for spreading load over multiple cores.

Also, I would like to see the default scheduler behaviour explicitly documented. In particular, does it take into account any of the following, and if so how?

  • Total server RAM size and number of CPU cores / CPU performance
  • Point-in-time actual RAM and CPU usage
  • Reserved RAM and CPU
  • anything else...?

[^1] Aside: clicking Edit on a running service doesn't give you this option, but once you select 'Upgrade' you can modify labels and scheduling rules.

Rather stupidly, I used the UI to paste in a label io.rancher.scheduler.affinity:container_label_soft_ne which is very awkward. I completely overlooked the separate "Scheduling" tab sitting right there!

@edgarbjorntvedt

I agree with candlerb's suggestions. ".. the user probably only wants to be sure that it's running on 2 or 3 nodes for redundancy." Correct, that is what we want.

His approach to weight the constraint inversely proportional to the number of nodes where the service is already running, would solve the one thing in Rancher that does not work satisfactorily for us.

I also agree that this should be part of the default scheduler behaviour, since this is what we originally expected from Rancher.

@firestar

Still hoping for a HA cluster strategy for Rancher!

@cwrau

cwrau commented Apr 14, 2018

+1

@micw

micw commented Apr 19, 2018

+1 if that helps, but I fear that we'll need to wait and migrate to Rancher 2.0/RKE.

@vincent99
Contributor

vincent99 commented Apr 19, 2018

This is one of those things where everybody thinks they want "feature X", but when you start talking they all have a different and incompatible idea of what X means... I want them spread across just a few hosts. No, all the hosts. Or spread them according to the value of this label so they're in different zones. But don't reschedule them if one dies, because they have storage over here and I want to reuse it. And if a host is too full then it's OK to colocate temporarily... but rebalance if a new host comes in. But not too many at a time or I'll lose quorum... Etc.

It's (clearly) not going to change for 1.x anymore after 2 years, and in 2.0 you can do whatever k8s supports.

@wrossmann

I have a solution for this that is very much in the vein of "if it's stupid, but it works, it's not stupid".

  1. Create yourself a container image like the one below:

     FROM alpine:latest
     CMD sh -c 'while true; do sleep 5; done'

  2. Build and tag it, e.g. registry.company.com/noop:latest.
  3. Spin up a stack from it and lock it to your overloaded host (see the compose sketch after this list).
  4. Scale it up so that the container count on the problem host is X higher than the rest. [do math]
  5. Hit "upgrade" on a problem service to trigger a rolling restart.
  6. Watch as something approximating actual balance happens.
  7. If not satisfied, GOTO 5.
  8. ???
  9. Profit!
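
A rough compose sketch for step 3, assuming you have given the overloaded host a host label such as noop=target (the host label value and the image tag are placeholders):

noop:
  image: registry.company.com/noop:latest
  labels:
    # hard-pin the filler containers to the overloaded host
    io.rancher.scheduler.affinity:host_label: noop=target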

@arwineap

arwineap commented Jun 6, 2019

Eventually you have to spin down the "noop" stack, and your production containers still end up stacked on the same box.

@wrossmann

wrossmann commented Jun 6, 2019

@arwineap it's an ongoing process until you get to a "good" state, but the noop containers use virtually no resources. The image is all of 5.5MB, and I've got 14 of them currently running, which consumes a grand total of 2.1MB RAM and ~0.3% CPU.

Earlier in the week I had to evacuate a host, and the first thing that happened after that was a restart of the 3 most RAM-intensive services in our stacks, all of whose containers landed on that newly-empty host and nearly blew out the RAM. The noop stack let me rebalance those, and now I'm just going to periodically decrease its scale to a couple of containers below the other hosts so that the other services naturally start to balance as well.

This is by no means a perfect solution, but at least it will keep me somewhat sane while we work towards our Rancher2/K8s migration.
