
README: Add note about cluster-autoscaler not supporting multiple AZs #647

Merged
merged 2 commits into weaveworks:master from mgalgs:readme-autoscaler-single-az on Mar 25, 2019

Conversation

@mgalgs (Contributor) commented Mar 19, 2019

Description

As discussed on Slack, cluster-autoscaler doesn't support ASGs that span multiple AZs. I've added a few clarifying notes to the README to that effect.

Checklist

  • Added/modified documentation as required (such as the README.md, and examples directory)
  • Added yourself to the humans.txt file
@whereisaaron commented Mar 20, 2019

This is often said but not entirely true. We use cluster-autoscaler with multi-AZ ASGs all the time and it works perfectly. This is because we don't have any AZ-specific dependencies in our workloads, e.g. all of our PVC volume types can be mounted in any AZ. Failure-zone Pod anti-affinity could also be an issue, but we generally only have soft/preferred anti-affinity rules.

The mechanism/'issue' is just as explained. The cluster-autoscaler takes a random node from the ASG and assesses whether another node like it would enable the 'Pending' Pod to be scheduled. If it would, it asks AWS to make that ASG larger. Of course, the ASG could add that node in any AZ (favoring balance). But if your workload doesn't care about the AZ, there is simply no problem with this mechanism, and cluster-autoscaler works perfectly with multi-AZ ASGs.

Because overall our ASGs and workloads are very AZ-balanced, even our soft Pod anti-affinity is almost always satisfied.

If your workloads all use single-AZ PVCs and hard anti-affinity requirements (e.g. etcd or other quorum hosting), then the advice to have single-AZ node pools is of course completely valid.
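
To make the soft-versus-hard distinction concrete, here is a minimal sketch (not from this PR) of the kind of 'preferred' zone anti-affinity described above. The Deployment name, labels, and image are placeholders, and the zone topology key shown is the failure-domain.beta.kubernetes.io/zone label in use at the time:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                        # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          # "Soft" rule: the scheduler prefers to spread replicas across zones,
          # but still schedules a Pod when the preference cannot be met, so
          # cluster-autoscaler can satisfy it with a new node in any AZ.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: web
                topologyKey: failure-domain.beta.kubernetes.io/zone
      containers:
        - name: web
          image: nginx             # placeholder image

Swapping preferredDuringSchedulingIgnoredDuringExecution for requiredDuringSchedulingIgnoredDuringExecution turns this into the 'hard' case where a single multi-AZ ASG can leave Pods stuck Pending.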

@mumoshu (Collaborator) commented Mar 20, 2019

@mgalgs Hey! Thanks for your contribution.

Yep, I believe @whereisaaron's explanation is valid, too. You may already have read it, but for more context, I'm sharing the original discussion regarding this CA gotcha: kubernetes-retired/contrib#1552 (comment)

Maybe we'd better add a dedicated section in the README for this?

I'm not a good writer, but I'd propose something like the following as a foundation:


Ensure that you have a separate nodegroup per availability zone when your workload is zone-aware!

cluster-autoscaler is unable to reliably add necessary nodes when you have a nodegroup that spans multiple AZs, by design.

To create a separate nodegroup per AZ, just replicate your nodegroup config for each AZ.

BEFORE:

nodeGroups:
  - name: ng1-public
    instanceType: m5.xlarge
    # availabilityZones: ["eu-west-2a", "eu-west-2b"]

AFTER:

nodeGroups:
  - name: ng1-public-2a
    instanceType: m5.xlarge
    availabilityZones: ["eu-west-2a"]
  - name: ng1-public-2b
    instanceType: m5.xlarge
    availabilityZones: ["eu-west-2b"]
@errordeveloper (Member) commented Mar 20, 2019

Yes, it sounds like it should be up to the user whether to use a single AZ or not; we should just make a note of how to do it, in case they think they must.

@mgalgs mgalgs force-pushed the mgalgs:readme-autoscaler-single-az branch from 617ce43 to e9c5a9e Mar 23, 2019

@mgalgs (Contributor, Author) commented Mar 23, 2019

Thanks for the feedback! I totally agree that we should inform the user about the constraints and let them make a decision. I've revised my PR based on @mumoshu's draft.

@mgalgs mgalgs force-pushed the mgalgs:readme-autoscaler-single-az branch from e9c5a9e to 2e3e607 Mar 23, 2019

@whereisaaron commented Mar 24, 2019

Cheers @mgalgs. Suggestions and explanations:

  • The AZRebalance scaling process is [suspended]

There is no need to do this. If your workload is not AZ-specific, then by definition it doesn't mind being re-balanced. This setting would be a workaround if you have (unbalanced) AZ-specific requests that drive unbalanced ASGs and you don't want a re-balance undoing that. But in that case you should be using per-AZ ASGs anyway, as your other criteria recommend.

  • No required podAffinity with topology other than host
  • No required nodeAffinity on zone label
  • No nodeSelector on a zone label

'Soft' affinity requirements that use preferredDuringSchedulingIgnoredDuringExecution do not prevent scheduling even when they are not satisfied, so again they are not a problem in a multi-AZ ASG. It is a problem to use 'hard' affinity requirements that use requiredDuringSchedulingIgnoredDuringExecution. A nodeSelector is also a form of 'hard' affinity.
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity

  • Never scale any zone to 0

You can certainly scale to and from zero nodes with a multi-AZ ASG, on AWS at least. This is because you can expose the labels your node selectors / affinity rules need as AWS tags on the ASG. The cluster-autoscaler will use those tags to determine whether making that ASG larger would enable the pending pod to be scheduled (in place of inspecting a random ASG instance, since there are none). Thus, so long as your node selector / affinity is not requesting a particular failure domain (AZ), you are still sweet. I've done and tested this with multi-AZ ASGs and the cluster-autoscaler (see the sketch below).
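
For illustration only, here is a minimal sketch (not part of this PR) of how such tags could be declared through eksctl, assuming the nodeGroups tags field is propagated to the ASG. The role: worker label is a hypothetical, non-zone-specific label a workload might select on, and the k8s.io/cluster-autoscaler/node-template/label/... tag prefix is the one documented by cluster-autoscaler's AWS provider for scale-from-zero:

nodeGroups:
  - name: ng1-public
    instanceType: m5.xlarge
    minSize: 0
    desiredCapacity: 0
    maxSize: 10
    labels:
      role: worker                 # hypothetical label used by a nodeSelector/affinity rule
    tags:
      # With zero instances running, cluster-autoscaler reads this tag instead of
      # inspecting a live node to decide whether enlarging the ASG would let the
      # pending Pod schedule.
      k8s.io/cluster-autoscaler/node-template/label/role: worker

Note that the tag only advertises a plain role label, not a zone, which matches the constraint above: as long as nothing requests a specific AZ, scaling from zero still works with a multi-AZ group.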

@mgalgs mgalgs force-pushed the mgalgs:readme-autoscaler-single-az branch from 2e3e607 to df9a1d0 Mar 24, 2019

@mgalgs (Contributor, Author) commented Mar 24, 2019

@whereisaaron thanks for the feedback, I've incorporated your suggestions.

You can certainly scale to and from zero nodes with a multi-AZ ASG, on AWS at least.

Got it. I was pretty much blindly transcribing the comments I got from a CA contributor (here), but that makes sense.

@whereisaaron commented Mar 24, 2019

Yep, I read it @mgalgs. I think that is sensible advice for how you might possibly get it working if you do have AZ-specific workloads/resources. But I'd say don't even try that; just use per-AZ ASGs in that situation.

@errordeveloper (Member) left a comment

LGTM!

@errordeveloper (Member) commented Mar 25, 2019

Thanks a lot @mgalgs for contributing this, and thanks @whereisaaron and @mumoshu for the review!

@errordeveloper errordeveloper merged commit 6e0136a into weaveworks:master Mar 25, 2019

1 check passed

ci/circleci: make-eksctl-image (Your tests passed on CircleCI!)

@mgalgs mgalgs deleted the mgalgs:readme-autoscaler-single-az branch Mar 25, 2019
