
Rebalancing service containers across hosts #2558

Closed
gordontyler opened this issue Nov 5, 2015 · 26 comments
Labels
area/scheduler kind/feature Issues that represent larger new pieces of functionality, not enhancements to existing functionality

Comments

@gordontyler

Say for example that I have N hosts and a number of containers running on those hosts. Initially, the load was fine, but it has subsequently increased and I'm facing resource constraints on these hosts. I need to add more hosts to my system and reassign existing containers to these new hosts.

In the case of stateless containers, this should be fairly easy -- destroy existing containers and recreate them on the new hosts.

It's harder in the case of stateful containers, but a stop, export, remove, load, start sequence would probably work, although I'm not sure how volumes would factor into that.

It would be super awesome if Rancher could handle at least the stateless container case for me. Something like an action on a service or a stack, maybe, to "rebalance" the containers across available hosts.

@ndelitski

I suppose disabling a host, plus some combination of scale -1 then +1, would work? And maybe when you disable a host, Rancher could automatically start rebalancing? For stateful containers I am using a combination of Convoy + EBS, but not in volume-driver mode :D So in that case I have to add unmount/mount and detach/attach steps.

@gordontyler
Author

So far I've done it by deleting and recreating the services from compose files. This is only possible because they're stateless.

When disabling a host, Rancher already seems to recreate its containers on other hosts.

@deniseschannon deniseschannon added kind/enhancement Issues that improve or augment existing functionality area/scheduler labels Nov 5, 2015
@deniseschannon

From what I'm reading, you want to be able to click on something to rebalance a service so that it distributes containers across the additional hosts you've added into your environment. This would be for a service with a specific scale, as opposed to a global service, since a global service would obviously start more containers on any additional hosts that match its scheduling rules.

@deniseschannon deniseschannon added kind/feature Issues that represent larger new pieces of functionality, not enhancements to existing functionality and removed kind/enhancement Issues that improve or augment existing functionality labels Nov 5, 2015
@gordontyler
Author

Correct.

@demarant

demarant commented Jan 8, 2016

+1 this is a requirement for us as well

@will-chan as discussed via email, I found this ticket, which I think is closely related.

We must be able to guarantee that a Stack runs with at least N containers spread across at least two hosts. This is a basic requirement for HA: you don't want your entire Stack (or a large part of it) running on a single host. That host becomes a single point of failure, and even though Rancher will restore the scale, this can be undesirable for containers that take a long time to start.

Rebalancing is not possible with the current Rancher scheduling rules; it only happens if you manually intervene by deleting hosts, as @gordontyler already mentioned.

The current, simple Rancher scheduling algorithm, which places containers on the host with the fewest containers, has an undesired effect: it only works the first time you deploy onto hosts with an equal number of containers. In the longer term, as hosts fail or get replaced, you end up with very undesirable stack distributions, e.g. an entire stack (or a large part of it) on a single host.

Imagine the following (simplified). You start from scratch with Rancher, create a new environment "production", and put 3 new hosts there (A, B, C). Then you deploy your first Stack A, which has one service (with a health check) at a scale of 6 (no affinity rule, just to keep the example simple and let it distribute across all hosts). You will get 2 containers on each host. All fine here, nice redundancy across hosts.

Now simulate a disruption of service, e.g. take down host A. Rancher will detect the 2 containers as "unhealthy" and place them on the other hosts, one on B and one on C, so each now has 3 containers. Now simulate that host A comes back online. What do you see? Host A is free; the initial containers are not rescheduled there. The other hosts stay packed with more containers, and the cluster/environment does not rebalance. So as time goes on, more hosts end up empty and fewer hosts end up increasingly packed, unless you manually intervene now and then, which is exactly what we do not want to do; we want Rancher to take care of that for us in a graceful way.

Now, if you launch a new Stack B with, say, one service at a scale of 2, both containers will be launched on host A. So now we have Stack B on only one node, even though there are 3 nodes available. We (the operations team) could establish a policy that every time we deploy a new Stack we add at least two empty hosts, but that would be weird and would over-allocate hosts over time.

I have tried many different things, using hard and soft anti-affinity rules such as io.rancher.scheduler.affinity:container_label_soft_ne: io.rancher.stack_service.name=stack_name/service_name, as suggested.

It still does not guarantee that a stack runs on at least two hosts, nor does it prevent the undesirable long-term distribution described above.

So basically, every time we have maintenance on a host, we are forced to put a new host in place and just destroy the old one, so that the containers are placed on the new host directly. But this does not protect us from unexpected host/network failures: Rancher will not rebalance, and you end up with empty hosts where new stacks will possibly be deployed entirely.

We expect Rancher to rebalance and to make a multi-host setup, at least for ephemeral services, very easy to achieve for the GA release. Rancher is meant for production deployments.

The quick-and-dirty idea we have now is to either duplicate identical services within a Stack or deploy duplicate copies of the same Stack.

For example, imagine I have a stack "mystack":

docker-compose.yml:

myapp1:
  image: myorg/myapp:mytag
  labels:
    io.rancher.scheduler.affinity:container_label_soft_ne: io.rancher.stack_service.name=mystack/myapp1
    io.rancher.scheduler.affinity:container_label_ne: io.rancher.stack_service.name=mystack/myapp2
myapp2:
  image: myorg/myapp:mytag
  labels:
    io.rancher.scheduler.affinity:container_label_soft_ne: io.rancher.stack_service.name=mystack/myapp2
    io.rancher.scheduler.affinity:container_label_ne: io.rancher.stack_service.name=mystack/myapp1

rancher-compose.yml:

myapp1:
  scale: 1
  health_check:
    port: 3000
    interval: 2000
    unhealthy_threshold: 3
    response_timeout: 2000
    healthy_threshold: 2
myapp2:
  scale: 1
  health_check:
    port: 3000
    interval: 2000
    unhealthy_threshold: 3
    response_timeout: 2000
    healthy_threshold: 2

The above, I hope, ensures that at least 2 containers of the same application (I use the same image for app1 and app2) are not running on the same host. Being separate services, I can scale them independently, and the soft anti-affinity still allows them to land on the same host if the others are too packed. It is a kind of dirty trick, as I don't like duplicating identical services, but we will experiment with it. It can also be done by deploying two identical Stacks. Maybe Rancher should add scale to the Stack as well?

@roynasser

+1 I generally agree with most of what's here ;) One thing which isn't really acceptable is the delete-host approach... at least not for me... Let's say a degraded number of hosts has led our stack to concentrate X services on one or more hosts. If I then get the number of hosts back up, the last thing I'd want to do is delete a host in order to have containers balance out... If this were a simple/small 2-host deployment it would be even more critical, as deleting the only host left = no service at all.... It's a case of "two wrongs to make a right" :p I'm sure in the long run, one of the more involved strategies is the way to go...

Implementing container migration would, IMO, be the way out... I'm not sure if it's 100% production ready, but there are some demos of Docker containers being moved almost seamlessly... For stateful containers I would assume you would need to be using a "cluster-wide" FS such as Convoy with Gluster or NFS? (Otherwise you will always have an issue.... the container may be stateful, but can you always count 100% on the host? If the host is a SPOF then you won't be delivering HA.)

@demarant

@RVN-BR exactly, we want Rancher to automatically rebalance services and make sure a Stack is on at least two hosts. This could be done by allowing one to say in docker-compose, via specific Rancher labels, "run this service with a total scale of X, with at least Y instances on each host that meets the scheduling rules". That would solve my HA requirements.
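
Purely for illustration, a hypothetical docker-compose fragment of what such labels could look like (these label names do not exist in Rancher; they are invented here only to make the idea concrete):

myservice:
  image: myorg/myapp:mytag
  labels:
    # hypothetical: total number of containers for the service
    io.rancher.scheduler.spread.scale: '6'
    # hypothetical: at least this many instances on each host that meets the scheduling rules
    io.rancher.scheduler.spread.min_per_host: '2'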

@gmehta3

gmehta3 commented May 27, 2016

+1

@courcelm

+1

@ghost

ghost commented Jun 24, 2016

+1

@kvaes

kvaes commented Aug 15, 2016

+1

@blackside

+1

@mccricardo

+1

@marcbachmann

Please stop posting +1; there are GitHub reactions for that.

@OlivierCuyp

We are in the testing phase of Rancher, coming from DockerCloud.
DockerCloud proposes 3 deployment strategies:

  • Emptiest node: containers are deployed on the node that has the least containers running
  • Every node: 1 container will be deployed on each node (no scaling possible)
  • HA: containers will be spread equally on each node

All of our application containers are currently deployed with the HA strategy.
This is maybe one of the biggest features we miss in Rancher.

@Napsty

Napsty commented Feb 1, 2017

We recently upgraded to 1.3.3, and I just realized that this can be achieved with the option "Always run one instance of this container on every host" when adding a service.
This takes care of running one instance per host, effectively giving you HA. However, it doesn't allow scaling up afterwards (scale is set to "Global") if additional containers are wanted or needed.
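
For completeness, the compose-file counterpart of that UI option is, as far as I know, the global scheduling label, roughly:

myservice:
  image: myorg/myapp:mytag
  labels:
    # "Always run one instance of this container on every host"
    io.rancher.scheduler.global: 'true'

The limitation above applies either way: a global service has no scale of its own, so you cannot ask for more than one instance per host.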

@mheiniger

Looks like #7253 could solve that problem by

schedule containers evenly across those pool of hosts

@candlerb

candlerb commented Jun 23, 2017

You can get some way towards this by:

  1. Defining a soft anti-affinity scheduling rule for each multi-instance service in the stack (the compose-label equivalent is shown after this list). [^1]

    # Under Scheduling tab
    The host [should not] have a [service with the name] [Test-app/web]
    
  2. Do an "upgrade" on a service, to automatically destroy and re-create the containers.
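
For anyone applying step 1 via compose labels rather than the Scheduling tab, the same rule can be expressed with the soft anti-affinity label already shown earlier in this thread (Test-app/web is just the example stack/service name):

web:
  image: myorg/web:mytag
  labels:
    # soft rule: prefer hosts that are not already running Test-app/web
    io.rancher.scheduler.affinity:container_label_soft_ne: io.rancher.stack_service.name=Test-app/web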

Unfortunately, if there is a new (empty) host in the cluster, and your service is already running across all hosts, you might end up with all your containers created on the empty host.

A simple solution would be that any container which is stopped (or to-be-stopped) during an upgrade cycle is not counted towards the anti-affinity rule. This might be the case already - I have not checked.

Another option is for the anti-affinity rule to be weighted by the number of matching instances - that is, it would prefer the host with the fewest number of running instances, rather than only matching a host with zero instances.

The problem with using an anti-affinity rule like this is that it will try to force the service to run on as many hosts as possible - taking priority over resource concerns, until one instance is running on every node. However in practice, the user probably only wants to be sure that it's running on 2 or 3 nodes for redundancy.

A better approach could be to weight the constraint inversely proportional to the number of nodes where the service is already running. For example: if the service is only running on one node, then aggressively choose a different node. If it is running on two nodes then prefer to run it on a third. If it is running on three nodes then weakly prefer it to run on a fourth. At this point, balancing of other resources is probably more important, since you have good redundancy.

I would argue that this sort of anti-affinity should be part of the default scheduler behaviour, since this is probably what people expect. That is, if they request more than one instance of the same service then it's likely to be for redundancy purposes, not just for spreading load over multiple cores.

Also, I would like to see the default scheduler behaviour explicitly documented. In particular, does it take into account any of the following, and if so how?

  • Total server RAM size and number of CPU cores / CPU performance
  • Point-in-time actual RAM and CPU usage
  • Reserved RAM and CPU
  • anything else...?

[^1] Aside: clicking Edit on a running service doesn't give you this option, but once you select 'Upgrade' you can modify labels and scheduling rules.

Rather stupidly, I used the UI to paste in a label io.rancher.scheduler.affinity:container_label_soft_ne which is very awkward. I completely overlooked the separate "Scheduling" tab sitting right there!

@edgarbjorntvedt

I agree with candlerb's suggestions. ".. the user probably only wants to be sure that it's running on 2 or 3 nodes for redundancy." Correct, that is what we want.

His approach to weight the constraint inversely proportional to the number of nodes where the service is already running, would solve the one thing in Rancher that does not work satisfactorily for us.

I also agree that this should be part of the default scheduler behaviour, since this is what we originally expected from Rancher.

@firestar

Still hoping for a HA cluster strategy for Rancher!

@cwrau

cwrau commented Apr 14, 2018

+1

@micw

micw commented Apr 19, 2018

+1 if that helps, but I fear that we'll need to wait and migrate to Rancher 2.0/RKE.

@vincent99
Contributor

vincent99 commented Apr 19, 2018

This is one of those things where everybody thinks they want "feature X", but when you start talking they all have a different and incompatible idea of what X means... I want them spread across just a few hosts. No, all the hosts. Or spread them according to the value of this label so they're in different zones. But don't reschedule them if one dies, because they have storage over here and I want to reuse it. And if a host is too full then it's OK to colocate temporarily... but rebalance if a new host comes in. But not too many at a time or I'll lose quorum... Etc.

It's (clearly) not going to change for 1.x anymore after 2 years, and in 2.0 you can do whatever k8s supports.

@wrossmann

I have a solution for this that is very much in the vein of "if it's stupid, but it works, it's not stupid".

  1. Create yourself a container image like the one below:

     FROM alpine:latest
     CMD sh -c 'while true; do sleep 5; done'

  2. Build and tag it, e.g. registry.company.com/noop:latest.
  3. Spin up a stack from it and lock it to your overloaded host (see the compose sketch after this list).
  4. Scale it up so that the container count on the problem host is X higher than the rest. [do math]
  5. Hit "upgrade" on a problem service to trigger a rolling restart.
  6. Watch as something approximating actual balance happens.
  7. If not satisfied, GOTO 5.
  8. ???
  9. Profit!
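
A rough compose sketch for step 3, assuming you have given the overloaded host a host label such as noop=target (the host label value and the image tag are placeholders):

noop:
  image: registry.company.com/noop:latest
  labels:
    # hard-pin the filler containers to the overloaded host
    io.rancher.scheduler.affinity:host_label: noop=target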

@arwineap

arwineap commented Jun 6, 2019

Eventually you have to spin down the "noop" stack, and your production containers still end up stacked on the same box.

@wrossmann

wrossmann commented Jun 6, 2019

@arwineap it's an ongoing process until you get to a "good" state, but the noop containers use virtually no resources. The image is all of 5.5MB, and I've got 14 of them currently running, which consumes a grand total of 2.1MB RAM and ~0.3% CPU.

Earlier in the week I had to evacuate a host, and the first thing that happened after that was a restart of the 3 most RAM-intensive services in our stacks, all of whose containers landed on that newly-empty host and nearly blew out the RAM. The noop stack let me rebalance those, and now I'm just going to periodically decrease its scale to a couple of containers below the other hosts so that the other services naturally start to balance as well.

This is by no means a perfect solution, but at least it will keep me somewhat sane while we work towards our Rancher2/K8s migration.
