
NAT only provisioned in one AZ, even for multi-AZ node groups #392

Closed
whereisaaron opened this issue Jan 4, 2019 · 26 comments · Fixed by #861

Comments

@whereisaaron commented Jan 4, 2019

What happened?

I deployed a cluster with eksctl create --node-private-networking. A multi-AZ k8s control plane was created, and nodes were created in three AZs. However, a NAT Gateway was created in only one AZ, and all subnets were routed through it. Because of that, the loss of that one AZ compromises the whole cluster; at the very least, the cluster can't pull images any more.

What you expected to happen?

I expected that either NAT gateways or NAT instances would be created in a public subnet for each AZ that the node group uses, and that the default route in all public/private subnets would go to the NAT gateway/instance in the same AZ, thus maintaining high availability for the cluster as a whole.
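
For illustration, the expected per-AZ pairing can be written as a tiny routing plan (a hypothetical sketch, not eksctl code; the subnet and gateway names are made up):

```python
# Hypothetical sketch: pair each private subnet's default route with the
# NAT gateway in the same AZ, so a zone failure only affects that zone.

def plan_default_routes(azs):
    """Map each AZ's private subnet to a same-AZ NAT gateway."""
    return {f"private-{az}": f"nat-{az}" for az in azs}

routes = plan_default_routes(["us-east-2a", "us-east-2b", "us-east-2c"])
assert routes["private-us-east-2b"] == "nat-us-east-2b"
```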

How to reproduce it?

eksctl create

Anything else we need to know?

  • I believe a workaround already exists, in that I can create my own VPC and subnets with per-AZ NAT Gateways and then use --vpc-private-subnets to deploy to those subnets?

  • It would be nice to have the option of deploying NAT instances rather than NAT gateways. Per-AZ AWS NAT Gateways are expensive, especially given the additional traffic charges (e.g. 3 AZs and 2 TB of traffic ≈ $250/month in Sydney), versus per-AZ t3.micro NAT instances (e.g. 3 t3.micro instances and 2 TB of traffic ≈ $15/month in Sydney). So NAT gateways are roughly 15 times more expensive to run for small amounts of (mostly image-pull) traffic, and it gets worse the more traffic you use.

  • Per-AZ NAT gateways/instances also save a little money and latency on cross-AZ traffic charges.
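
The arithmetic behind that comparison can be sketched as below; the hourly and per-GB rates are illustrative assumptions (check current AWS pricing), not quoted figures:

```python
HOURS_PER_MONTH = 730  # approximate hours in a month

def nat_gateway_monthly(num_azs, traffic_gb, hourly=0.059, per_gb=0.059):
    """Per-AZ NAT Gateways: hourly charge plus per-GB processing charge."""
    return num_azs * HOURS_PER_MONTH * hourly + traffic_gb * per_gb

def nat_instance_monthly(num_azs, hourly=0.0066):
    """Per-AZ NAT instances: instance-hours only, no NAT per-GB charge."""
    return num_azs * HOURS_PER_MONTH * hourly

gateways = nat_gateway_monthly(3, 2000)   # roughly $250/month
instances = nat_instance_monthly(3)       # roughly $15/month
assert gateways / instances > 15          # order-of-magnitude difference
```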

Versions

$ eksctl version
[ℹ]  version.Info{BuiltAt:"", GitCommit:"", GitTag:"0.1.16"}
$ uname -a
Linux 4.4.0-17134-Microsoft #345-Microsoft Wed Sep 19 17:47:00 PST 2018 x86_64 x86_64 x86_64 GNU/Linux
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.4", GitCommit:"f49fa022dbe63faafd0da106ef7e05a29721d3f1", GitTreeState:"clean", BuildDate:"2018-12-14T07:10:00Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.5-eks-6bad6d", GitCommit:"6bad6d9c768dc0864dab48a11653aa53b5a47043", GitTreeState:"clean", BuildDate:"2018-12-06T23:13:14Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

Also include your version of heptio-authenticator-aws --> N/A

Logs

[ℹ]  using region us-east-2
[ℹ]  setting availability zones to [us-east-2c us-east-2b us-east-2a]
[ℹ]  subnets for us-east-2c - public:10.201.0.0/19 private:10.201.96.0/19
[ℹ]  subnets for us-east-2b - public:10.201.32.0/19 private:10.201.128.0/19
[ℹ]  subnets for us-east-2a - public:10.201.64.0/19 private:10.201.160.0/19
[ℹ]  using "ami-053cbe66e0033ebcf" for nodes
[ℹ]  creating EKS cluster "foo" in "us-east-2" region
[ℹ]  will create 2 separate CloudFormation stacks for cluster itself and the initial nodegroup
[ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-east-2 --name=foo'
[ℹ]  creating cluster stack "eksctl-foo-cluster"
[ℹ]  creating nodegroup stack "eksctl-foo-nodegroup-ng-1d7b8a83"
[✔]  all EKS cluster resource for "foo" had been created
[✔]  saved kubeconfig as "/home/user/.kube/config"
[ℹ]  nodegroup "ng-1d7b8a83" has 0 nodes
[ℹ]  waiting for at least 2 nodes to become ready
[ℹ]  nodegroup "ng-1d7b8a83" has 2 nodes
[ℹ]  node "ip-10-201-120-84.us-east-2.compute.internal" is ready
[ℹ]  node "ip-10-201-181-84.us-east-2.compute.internal" is ready
[ℹ]  kubectl command should work with "/home/user/.kube/config", try 'kubectl get nodes'
[✔]  EKS cluster "foo" in "us-east-2" region is ready
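
As an aside, the /19 subnet layout in the log above can be reproduced with Python's stdlib `ipaddress` module; whether eksctl derives the layout exactly this way is an assumption, but the blocks line up:

```python
import ipaddress

vpc = ipaddress.ip_network("10.201.0.0/16")
blocks = list(vpc.subnets(new_prefix=19))  # eight /19 blocks; six are used

# Matches the log: three public then three private subnets.
assert str(blocks[0]) == "10.201.0.0/19"    # public  us-east-2c
assert str(blocks[3]) == "10.201.96.0/19"   # private us-east-2c
assert str(blocks[5]) == "10.201.160.0/19"  # private us-east-2a
```
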
@errordeveloper (Member) commented Jan 4, 2019

Hi @whereisaaron, thanks for reporting this!

Somehow I wasn't aware of NAT instances until now, the AWS docs that I was reading seemed to mention only NAT gateways when I was working on the feature. It is entirely doable to provide an option, but if there is no benefit from using NAT gateways (which looks like a legacy product), I'm all for switching to NAT instances.

Please let me know if you are keen to help here.

@milkowski commented Jan 4, 2019

It is worth considering whether even a free-tier t2.micro would be enough as a NAT instance in most cases.

@whereisaaron (Author) commented Jan 4, 2019

Thanks @errordeveloper,

I don't mind whether NAT Gateways or NAT Instances are deployed, but we absolutely need one per AZ with per-AZ routing tables, otherwise our multi-AZ control plane and multi-AZ node groups are pointless 😄

NAT Instances are technically the old way, where you run your own NAT, but AWS provides an AMI that is basically zero-config. The trade-off is that it is up to you to update the AMIs and scale the instance type up if necessary. But for most small clusters without hundreds of nodes, NAT Instances are more than adequate and vastly cheaper. This is because the base price is based on the instance type you choose, and there are no additional traffic charges.

NAT Gateways are the newer, fully-managed option. AWS maintains them for you and they scale automatically, which is great. The problem is they are still per-AZ just like NAT instances, and AWS charges for every subnet and/or AZ you launch one in. In contrast, Google Cloud Platform has the same service for the same price (suspiciously the same 😄), but a single Cloud NAT 'instance' is highly available across all subnets and zones in a region. So there is no overhead for being highly available with Google, whereas AWS NAT Gateway pricing punishes you for it 😢 And both AWS and Google charge you a traffic premium on top of regular traffic charges, which is a per-GB cost you don't need to pay with NAT Instances.

So the vital change is, some form of NAT for every AZ. If sticking with NAT Gateways is easiest, that's fine.

The nice-to-have change would be the option to deploy NAT instances instead of NAT Gateways. I say 'option' because people with unlimited budgets will probably prefer NAT Gateways.

@milkowski yes T2.micro or T2.nano works fine. But I think T3 instances have better network performance than T2, plus multi-core, so preferable for NAT if available.

@errordeveloper (Member) commented Jan 5, 2019

@whereisaaron (Author) commented Jan 5, 2019

For NAT instances, the CF template would include, for each AZ:

  1. An instance using an AWS NAT AMI in the public subnet, with an attached EIP and source/destination checking disabled.
  2. A routing table that routes to the NAT instance for that AZ.
  3. For all the public/private subnets, an association with the routing table for the AZ the subnet is in.

Here is a CF example:
https://github.com/rcrelia/aws-mojo/blob/master/cloudformation/vpc-scenario-2-reference/aws-vpc-nat-instances.README.md

The NAT 'daemon' is the Linux kernel itself, so you couldn't easily do this in-cluster. Maybe it is possible somehow, but it is unlikely to be efficient and low-latency.
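
The three per-AZ steps above could be generated as CloudFormation resource stubs roughly like this (a minimal sketch, not eksctl's template code; the logical resource names and AMI are placeholders, and most required properties are omitted for brevity):

```python
def nat_stack_resources(azs, ami="ami-EXAMPLE"):
    """Per-AZ NAT instance plus a default route targeting it (abridged)."""
    resources = {}
    for i, az in enumerate(azs):
        resources[f"NATInstance{i}"] = {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": ami,
                "SourceDestCheck": False,            # required for NAT
                "SubnetId": {"Ref": f"PublicSubnet{i}"},
                "AvailabilityZone": az,
            },
        }
        resources[f"PrivateRoute{i}"] = {
            "Type": "AWS::EC2::Route",
            "Properties": {
                "RouteTableId": {"Ref": f"PrivateRouteTable{i}"},
                "DestinationCidrBlock": "0.0.0.0/0",
                "InstanceId": {"Ref": f"NATInstance{i}"},
            },
        }
    return resources

resources = nat_stack_resources(["us-east-2a", "us-east-2b", "us-east-2c"])
assert len(resources) == 6  # one instance and one route per AZ
```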

@errordeveloper (Member) commented Jan 5, 2019

@whereisaaron (Author) commented Feb 6, 2019

Hi @errordeveloper, as we discussed above, while eksctl can create a VPC for you, it can't create a high-availability private cluster because the single-AZ NAT is vulnerable to zone failure.

I created the following CloudFormation template (based mostly on someone else's template) to create what is required: 3 public subnets, each with a NAT Gateway, and 3 private subnets, each with its own routing table routing to the NAT Gateway for that AZ.

https://gist.github.com/whereisaaron/7eb907d17d7a3bc4d50b9ab279107492

This is working great with eksctl and the --private-subnets option. This VPC template is not any more special than what eksctl already does; it just makes sure the NAT isn't a single point of failure for the cluster. Nothing else is required: whenever a new node is launched, it will use the routing table in its AZ and thus also the NAT Gateway in its AZ.

This also reduces latency and cost from cross-AZ traffic, while ensuring a zone failure won't bring the cluster down.

@errordeveloper (Member) commented Feb 6, 2019

@IPyandy (Contributor) commented Feb 20, 2019

This is a tricky one due to the default limit of 5 Elastic IPs and also the cost of NAT Gateways. These things are not cheap to maintain, and maybe adding a --num-nat-gateways flag would be better than simply defaulting to one per AZ. When used in development, one NAT Gateway is much more efficient and cost-effective than multiple ones.

When running in production, I completely agree that redundancy should be there. A --num-nat-gateways flag, or a --production / --development flag, would probably be a good direction.

I like the --num-nat-gateways or similar option.

Thanks

@whereisaaron (Author) commented Feb 20, 2019

@IPyandy, if you don't have multi-AZ NAT then you don't have a high-availability cluster anyway, so you can just deploy the whole cluster to one subnet in one AZ if you only want one NAT.

But if people do want the option, it might be easier to offer a --single-az-nat-gateway switch, so that either one subnet or all per-AZ subnets get NAT gateways, and likewise the routing tables all point either to the one NAT subnet or to their same-AZ subnet. That might be easier to implement.
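
That switch could boil down to logic like this (a sketch only; --single-az-nat-gateway is the suggestion above, not an existing eksctl flag):

```python
def route_targets(azs, single_az_nat=False):
    """Choose each AZ's NAT target: one shared gateway, or one per AZ."""
    if single_az_nat:
        shared = f"nat-{azs[0]}"            # everything routes to one AZ
        return {az: shared for az in azs}
    return {az: f"nat-{az}" for az in azs}  # same-AZ routing (HA)

multi = route_targets(["us-east-2a", "us-east-2b", "us-east-2c"])
single = route_targets(["us-east-2a", "us-east-2b", "us-east-2c"],
                       single_az_nat=True)
assert len(set(multi.values())) == 3
assert set(single.values()) == {"nat-us-east-2a"}
```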

You are correct that NAT Gateways are expensive, and small NAT instances are often more than sufficient (see my costing example above). So it would be good to have the option of either.

You don't need EIPs for NAT instances; you can just launch them with auto-assigned public IPs. Also, you can simply explain what you are doing and request a higher EIP limit. After all, NAT reduces IP usage compared to public clusters.

Lastly, you only need one (set of) NAT gateways per VPC, and you can deploy large numbers of eksctl clusters into the same or different sets of private subnets in the same VPC, all routing to the same NAT gateway(s) in the public network. So your dev/preprod clusters can use the same NAT as your production ones. You still get hit by the NAT Gateway per-GB charges though 😢

@IPyandy (Contributor) commented Feb 21, 2019

@whereisaaron I do agree that with a single NAT there is no redundancy, though in testing and dev, sometimes testing east-west traffic is more important than having inbound redundancy. By that time your prod environment should already be configured with redundancy built in at the edge (NAT gateways, Internet Gateways, and so forth).

Sometimes testing the viability of node failure across AZs, among other use cases, is still a good option with a single NAT Gateway; this is not uncommon. If the purpose is to test redundancy at all points, then yes, of course, build in the multiple gateways.

Whether it's an optional switch to add or reduce gateways doesn't really concern me; I mostly just want to make sure it is an option. Though from the perspective of engineers/devs turning clusters on and off, there's the usability aspect as well.

I'm not really a big fan of NAT instances, and it would be unnecessary code, as it's just not something anyone should be dealing with unless you have ops teams taking care of them. No idea how long AWS plans to support the native AMIs either.

I think we do agree there should be an option, though I'll be curious what the maintainers think from a design perspective. I wouldn't mind tackling the code.

@errordeveloper Would another issue be productive to discuss the actual design case?

@whereisaaron (Author) commented Feb 21, 2019

Yeah I can understand that people might want it, even if I'm not a fan 😄

I'm not using eksctl for VPC creation anyway, primarily because of the lack of multi-AZ NAT. And I must say eksctl works just great for deploying into existing VPCs. 👏

NAT instances are simple things and really trouble-free. The hassle is that you have to update the images regularly, and that means taking them out of service for a restart. It's not too bad: re-route an AZ to another AZ's NAT, update the first AZ's image, then change the routing back. There are also more complex schemes involving reassigning ENIs.
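
The update rotation described here amounts to a simple ordered plan per AZ (illustrative step names only, no AWS calls):

```python
def rotation_plan(azs):
    """For each AZ: fail over to a neighbour AZ's NAT, update, fail back."""
    steps = []
    for i, az in enumerate(azs):
        neighbour = azs[(i + 1) % len(azs)]
        steps += [
            f"route {az} traffic via NAT in {neighbour}",
            f"update NAT image in {az} and restart",
            f"route {az} traffic back via NAT in {az}",
        ]
    return steps

plan = rotation_plan(["us-east-2a", "us-east-2b"])
assert plan[0] == "route us-east-2a traffic via NAT in us-east-2b"
```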

NAT Gateways are, of course, no effort at all, but the data processing charges seem overpriced to me. They're not even vaguely competitive with doing it yourself using AWS's own NAT images.

@errordeveloper (Member) commented Feb 21, 2019

It does sound like we will need a flag that lets you control multiple vs single NAT gateways. I also wonder if we could find a way to easily add NAT gateways on demand. Perhaps it should also be possible to avoid creating private subnets at all. We now have a way to modify the cluster stack, so adding resources to it after creation should be easier now than it was some time ago.

To begin with, let's add multi-AZ NAT gateways, and keep that disabled by default behind a flag. We can discuss details in the context of a PR. I am not sure what the defaults should be, but whatever we decide to call the flag, it should be expressed as multi-AZ vs single (not a count). I would rather avoid resorting to NAT instances for now, as one can always implement that in a custom VPC if they strongly desire to save money.

I am concerned about the default EIP limit and the extra cost for those not needing private subnets, so we might want to discuss either an on-demand approach (no NAT gateways initially, unless a private nodegroup is created right away), or using only a single NAT gateway by default and allowing users to opt in initially or at a later time. But let's begin by adding the multi-AZ capability, as EKS is multi-AZ to begin with and it's a shame NAT was overlooked.

@errordeveloper (Member) commented Feb 21, 2019

@errordeveloper Would another issue be productive to discuss the actual design case?

There are much easier ones, and if you don't have a personal need, I wouldn't worry about trying to tackle this as your first issue.

@IPyandy (Contributor) commented Feb 21, 2019

@whereisaaron true, I'm mostly using Terraform for production VPC environments and eksctl for the cluster setups (production-ready ones) for now. Once configuration files become more of a first-class citizen, maybe that would be more integrated.

But for quick demos, dev, or testing something quickly (being on the consulting side), eksctl right now is pretty great.

@whereisaaron (Author) commented Feb 21, 2019

@errordeveloper: "We now have a way for someone to modify cluster stack"
Cool, what is that? Do you mean you can provide your own VPC, or is this a new feature?

@errordeveloper (Member) commented Feb 22, 2019

@whereisaaron (Author) commented Feb 22, 2019

Oh, from the --help I had thought that command was a way to upgrade control-plane versions, like 1.10 -> 1.11, since the k8s version is the only parameter you can specify. How would I use it to append to the stack, though? There are no options to supply CloudFormation, e.g. for an extra SG. It says 'update based on latest configuration' but I can't supply any configuration options? Confused 🤷‍♀️

eksctl utils update-cluster-stack --help
Update cluster stack based on latest configuration (append-only)

Usage: eksctl utils update-cluster-stack [flags]

General flags:
  -n, --name string      EKS cluster name (required)
  -r, --region string    AWS region
      --version string   Kubernetes version (valid options: 1.10, 1.11) (default "1.11")
      --dry-run          do not apply any change, only show what resources would be added (default true)

AWS client flags:
  -p, --profile string     AWS credentials profile to use (overrides the AWS_PROFILE environment variable)
      --timeout duration   max wait time in any polling operations (default 20m0s)

Common flags:
  -C, --color string   toggle colorized logs (true,false,fabulous) (default "true")
  -h, --help           help for this command
  -v, --verbose int    set log level, use 0 to silence, 4 for debugging and 5 for debugging with AWS debug logging (default 3)

Use 'eksctl utils update-cluster-stack [command] --help' for more information about a command.
@mmuth commented Mar 26, 2019

I would also really appreciate the deployment of multi-AZ NAT Gateways (in my opinion this should even be the default)... This currently prevents the use of eksctl in our company, and we will stick to kops for now.

However, eksctl makes a nice impression and we will be watching it closely.

@foxylion commented Mar 27, 2019

I second this feature request. Without configuring all components to be HA, this won't be a production-grade solution for many users.

@webgig commented Mar 29, 2019

Started using eksctl recently, only to realise the NAT Gateway is not multi-AZ. eksctl is still a great tool, good for dev environment setup without multi-AZ NAT.

My workaround will be to create a VPC with multi-AZ NAT using CloudFormation and set up the cluster with eksctl.

@mcfedr (Contributor) commented Apr 2, 2019

I know it's the opposite of this bug, but maybe someone can tell me: how do I create a cluster without a NAT gateway?

@IPyandy (Contributor) commented Apr 3, 2019

I like how aws-cdk handles this when creating a Vpc: if private subnets are created, then natGateways == numAZs, unless natGateways is specified, in which case it deploys that number of gateways.

This could be made into optional flags here.
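
The aws-cdk behaviour described above can be expressed in a few lines (a sketch of the described behaviour, not cdk's actual code):

```python
def nat_gateway_count(num_azs, has_private_subnets, nat_gateways=None):
    """Default to one NAT gateway per AZ when private subnets exist,
    unless an explicit count is requested."""
    if not has_private_subnets:
        return 0
    return num_azs if nat_gateways is None else nat_gateways

assert nat_gateway_count(3, True) == 3                  # HA default
assert nat_gateway_count(3, True, nat_gateways=1) == 1  # explicit override
assert nat_gateway_count(3, False) == 0                 # no private subnets
```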

@cristian-radu (Contributor) commented Jun 4, 2019

@errordeveloper I would like to have a go at this if it's up for grabs. :)

@cristian-radu (Contributor) commented Jun 5, 2019

Thanks @whereisaaron for the CloudFormation gist. I found this diagram as a visual reference as well.

I like the idea proposed by @IPyandy

If private subnets are created, then natgateways == numAZs unless natGateways is specified, then it deploys that number of gateways.

I feel like HA NAT by default would be best, in order to prevent someone from accidentally creating a non-HA production cluster. Accidentally creating HA staging/dev clusters will incur more cost, but at least you don't risk an outage, and you should be able to change it later once you've realized where all your money went :)

@cristian-radu (Contributor) commented Jun 10, 2019

#861
