Release 18+ Upgrade Guide Breaks Existing Deployments #1744
Comments
Doesn't work if I run it from a bastion host, which is also granted access to the cluster. Error:
Same server showing access:
So it doesn't matter whether the server is configured to access the cluster or not; it fails the same way.
Ran it the first few times with the Terraform kubernetes provider configured as it had been when it was working, and have re-run it with that provider completely removed. I've also run it with and without the null provider stuff.
Are there any requirements to upgrade an existing cluster from 17 to 18? I'm not getting 'variable not expected here' warnings, so I think I have everything renamed, but I'm not making any progress on the aws-auth configmap. Does it need to be removed from the state?
Also, not sure if this matters, but in the current working deployment we have
Same.
I believe your error message states the issue: by default, this port is not open on the security groups created by this module - https://github.com/terraform-aws-modules/terraform-aws-eks#security-groups
It looks more like a wrong URL error than a security group issue, no? I expect this call to be made from my laptop, not from a remote server. But I may be wrong.
The error message I was referring to was the one provided by @jseiser further up. The error message you have provided does not provide enough detail - the module does not construct any URLs, so I would suspect it's also a security group access issue you are facing as well, but that's just a hunch based on what's provided.
Hello, I'm having the exact same issue when upgrading from v17 to v18. It happens during the state refresh. The problem disappears if I manually remove module.eks-cluster.local_file.kubeconfig[0] and module.eks-cluster.kubernetes_config_map.aws_auth[0] from the v17 state file, but I'm not exactly sure if there are any consequences in doing this.
I guess I don't follow. This is a cluster/TF deployment created using the <18.0 module version. We are trying to upgrade this environment to the latest module version. I'm not able to run a plan because of the error. The security groups have not changed.
I believe that would be the appropriate change (remove those from your state) - v18.x removes native support for kubeconfig and the aws-auth configmap: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/UPGRADE-18.0.md#list-of-backwards-incompatible-changes
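For reference, the state surgery itself is small: run `terraform state rm` against the two addresses mentioned above, or, on Terraform 1.7+, express the same intent declaratively. A minimal sketch, assuming the module is instantiated as `module.eks` (the comment above uses `module.eks-cluster`, so adjust the addresses to match your configuration):

```hcl
# Tell Terraform to forget these resources without destroying the real objects.
# Requires Terraform >= 1.7; on older versions the equivalent is:
#   terraform state rm 'module.eks.kubernetes_config_map.aws_auth[0]'
#   terraform state rm 'module.eks.local_file.kubeconfig[0]'
removed {
  from = module.eks.kubernetes_config_map.aws_auth

  lifecycle {
    destroy = false
  }
}

removed {
  from = module.eks.local_file.kubeconfig

  lifecycle {
    destroy = false
  }
}
```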
How should we handle config-map roles now? For example, I had something like this:
Moving forward, what would be the best way to add those types of roles into the aws-auth?
@PadillaBraulio this is left up to users to decide what suits them best (Terraform, Helm, some flavor of GitOps, bash scripts, etc.)
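For anyone who wants to stay purely in Terraform, one possible shape (not something the module itself provides) is to manage the ConfigMap data directly with the kubernetes provider. A hedged sketch, assuming a kubernetes provider recent enough to have `kubernetes_config_map_v1_data` (roughly 2.10+) and already configured against the cluster; the role ARN and groups below are placeholders:

```hcl
locals {
  aws_auth_roles = [
    {
      rolearn  = "arn:aws:iam::111122223333:role/my-node-role" # placeholder
      username = "system:node:{{EC2PrivateDNSName}}"
      groups   = ["system:bootstrappers", "system:nodes"]
    },
  ]
}

# Take ownership of the data in the aws-auth ConfigMap that EKS creates,
# without deleting and recreating the ConfigMap itself.
resource "kubernetes_config_map_v1_data" "aws_auth" {
  metadata {
    name      = "aws-auth"
    namespace = "kube-system"
  }

  data = {
    mapRoles = yamlencode(local.aws_auth_roles)
  }

  force = true # overwrite fields currently managed by EKS
}
```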
These errors also occur on certain changes when your provider config depends on outputs of the eks module. Example:
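The collapsed example isn't visible above, but the pattern being described generally looks like the sketch below, mirroring the exec-based provider wiring used in the module's v18 examples (it assumes the AWS CLI v2 is available wherever Terraform runs):

```hcl
# Kubernetes provider wired to EKS module outputs. When those outputs are
# unknown during a plan (e.g. the cluster is being replaced), provider calls
# can fall back to localhost and fail with "dial tcp 127.0.0.1:80".
provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)

  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", module.eks.cluster_id]
  }
}
```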
Since I only had a test cluster, I went ahead and approved the plan to replace the cluster. This is probably a provider issue, but the execution failed because it tried to create a new cluster before tearing down the old one. But it couldn't create the new cluster because an existing cluster, the old one, had the same name.
I get the error even when removing the k8s provider; they also have the data sources in their example.
Yes, I normally have it like in the example, but if I encounter these kinds of errors: dial tcp 127.0.0.1:80: connect: connection refused, I think it is caused by the cluster IAM role ARN changing or the cluster security group changing, which triggers the replacement of the cluster. In my case, if I adjust the config of the 18.0.4 module so it keeps using the same cluster IAM role and cluster security group and the cluster IAM role ARN
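If the goal is to keep the existing cluster IAM role and security group so nothing forces a replacement, v18 has inputs to opt out of creating them and pass in existing ones instead. A hedged sketch (the ARN and security group ID are placeholders, and the remaining module arguments are elided):

```hcl
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  # ... existing cluster_name, cluster_version, vpc_id, subnet_ids, etc. ...

  # Reuse the IAM role and security group created under v17 instead of
  # letting the module create new ones, which would replace the cluster.
  create_iam_role = false
  iam_role_arn    = "arn:aws:iam::111122223333:role/existing-cluster-role" # placeholder

  create_cluster_security_group = false
  cluster_security_group_id     = "sg-0123456789abcdef0" # placeholder
}
```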
Have you checked the security groups, as already suggested? You mentioned the Jenkins pod runs in the same cluster, or at least inside the same VPC as I understood. If you have the private API server endpoint enabled, it could be that the Jenkins pod will try to connect to the private endpoint and require a security group rule that is not provided by the eks module by default. I actually ran into this same problem.
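If that is the case, the missing rule can be added through the module itself rather than out of band. A hedged sketch using the v18 `cluster_security_group_additional_rules` input (the rule key is arbitrary and the CIDR is a placeholder for your VPC range):

```hcl
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  # ... rest of the cluster configuration ...

  # Allow workloads inside the VPC (e.g. a Jenkins pod or bastion host) to
  # reach the private API server endpoint, which is not open by default.
  cluster_security_group_additional_rules = {
    ingress_vpc_https = {
      description = "API server access from within the VPC"
      protocol    = "tcp"
      from_port   = 443
      to_port     = 443
      type        = "ingress"
      cidr_blocks = ["10.0.0.0/16"] # placeholder: your VPC CIDR
    }
  }
}
```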
It's definitely not that. Nothing has changed yet, since this is a working 17.x deployment and it only fails when trying to run the plan for the 18.x upgrade. I also showed above that it errors out on an external server as well. Thanks.
I'll have to give this a try on Monday.
Hoping this will save time for others dealing with the aws-auth configmap management change. This worked for us, modifying the example code from here.
Users and roles follow the syntax from the 17.x version.
I'm also still looking for a good solution for the aws-auth config map; I have copied the aws-auth.tf from 17.24 for now, which still works but needs that forked http provider.
kubernetes_patch would have been useful to solve the
When you guys were testing upgrades pre-merge, how were you all handling this situation? I'm going to spin up a test env and try to walk through some of the suggestions above, but I wanted to hear about your experience. If I get something working, I have no issue creating a pull request to update the documentation.
Again, it is a BREAKING change - if there was a clean and straightforward path to upgrade without change/disruption, then it would not be a breaking change.

This module had grown quite quickly and was carrying a lot of pre-0.12 syntax which was severely holding it back (extensive lists of lists and index lookups, etc.), and the changes added (most notably due to the numerous changes of EKS itself) led to a patchwork of changes that built up over the years. I can't stress enough that this change was EXTENSIVE, and I am sorry we cannot provide the copious amounts of detail and upgrade steps to make the process smooth and seamless - the module is complex, EKS is complex, and the changes were substantial.

That said, this is how we generally test modules in this org:
I understand the motive to introduce breaking changes in order to refactor this module to rid it of historic cruft, but I too am hesitant to upgrade our existing clusters from the latest 17.x version, as we are making use of the managed

IMO it would be greatly appreciated if, instead of telling users that it's now up to them to figure out how to re-implement functionality that was removed in the interest of tidying up, explicit examples were provided that satisfy the same set of design constraints satisfied by the previous version, i.e., the ability to provision an accessible cluster exclusively with Terraform. Several users have already noted that it is not feasible to rely on

Perhaps I missed the part of the discussion that led up to the removal of managed

Is it safe to rely on the forked HTTP provider and the pure Terraform implementation used in 17.x if users choose to do so?
There is no need for me to go into the depths of aws-auth issues when we can just look at the history: https://github.com/terraform-aws-modules/terraform-aws-eks/issues?q=is%3Aissue+sort%3Aupdated-desc+aws-auth+is%3Aclosed

Again, a clear boundary line was created with this change and I understand it's very controversial - this module provisions AWS infrastructure resources via the AWS API (via the Terraform AWS provider), and any internal cluster provisioning and management is left up to users.

As for the forked http provider, I do not know what its fate is. Most likely what will end up happening is that it gets archived in its current state so users can continue to utilize it - if we're lucky, HashiCorp incorporates the change upstream and the fork can still be archived, but users can move off the fork and onto the official provider.
Fair enough. This is a complex problem due to the automatic creation of the
Point taken, but I would venture to suggest that part of provisioning stateful resources such as Kubernetes clusters, EC2 instances, etc. includes ensuring access control is properly configured. I don't expect this module to install a monitoring and logging workload, for example, but I do expect it to provision my resources in such a way that I can connect to them. I respect your decision to remove this functionality, but I'm just trying to determine the best course of action to avoid headaches going forward with the upgrade. I appreciate the work that went into refactoring and the
Re: your second point, I'm not holding my breath. https://www.hashicorp.com/blog/terraform-community-contributions |
My solution when I had problems like you're running into (this was back on eks module v13) was to split things up: have one Terraform run build and deploy the cluster itself, and a separate Terraform run in a separate directory build and deploy all the Helm charts and Kubernetes resources onto it.
@jcam, I am all in support of that, but since it's in production right now, I am approaching it as: first upgrade the EKS module, then eventually break out the pieces and move those resources outside of it, keeping the EKS module independent.
I would separate it first and upgrade second. That way there's no chance the EKS cluster upgrade Terraform run could impact all your deployed applications. I don't use Terraform Cloud, but with my backend I simply did a terraform state pull, made the new folder for all the app components and did a terraform init there, did a terraform state push, then did a terraform state rm for all the app components in the cluster folder, and a terraform state rm for all the cluster components in the app folder.
Were there no conflicts with the resource names? For example, logging for the ALB load balancer is done in an S3 bucket, and that bucket is also created alongside the EKS cluster and is inside
I just needed to split things so they were in one place or the other and not both. In your case, I would put the logging bucket in the app deploy stage, or I would keep it in the cluster stage and use a data object in the app stage instead of a resource object.
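In other words, exactly one stage owns the bucket and the other only reads it. A minimal sketch of that split (the bucket name is a placeholder):

```hcl
# Cluster stage: owns and manages the bucket.
resource "aws_s3_bucket" "alb_logs" {
  bucket = "example-alb-logs" # placeholder
}

# App stage (a separate root module / state): reads the same bucket without
# managing it, e.g. to reference its ARN or name in load balancer config.
data "aws_s3_bucket" "alb_logs" {
  bucket = "example-alb-logs" # placeholder
}
```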
Do you have any links or directions I can follow for the steps you mentioned below? I am kind of confused, to be honest, about the steps you mentioned.
This issue has been resolved in version 18.19.0 🎉
BTW if someone needs to solve this via a PR without direct access to
With v18 I am unable to configure a cluster so that pods have network access. Here's a simple cluster in v17:

```hcl
provider "aws" {
region = "eu-central-1"
}
data "aws_eks_cluster_auth" "cluster" {
name = module.eks.cluster_id
}
provider "kubernetes" {
host = module.eks.cluster_endpoint
token = data.aws_eks_cluster_auth.cluster.token
cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
}
data "aws_availability_zones" "available" {}
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "3.2.0"
name = "test-cluster-vpc"
cidr = "10.0.0.0/16"
azs = data.aws_availability_zones.available.names
private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
enable_nat_gateway = true
single_nat_gateway = true
enable_dns_hostnames = true
}
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "17.24.0"
cluster_name = "test-cluster"
cluster_version = "1.22"
subnets = module.vpc.private_subnets
vpc_id = module.vpc.vpc_id
node_groups = [
{
instance_type = "t2.small"
capacity_type = "SPOT"
}
]
}
```

If I apply this, I can then run a pod on it and ping an internet host:
If I adapt the terraform file to v18:

```hcl
[...]
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 18"
cluster_name = "test-cluster"
cluster_version = "1.22"
subnet_ids = module.vpc.private_subnets
vpc_id = module.vpc.vpc_id
eks_managed_node_groups = {
  main = {
    instance_types = ["t2.small"]
    capacity_type  = "SPOT"
  }
}
}
```

This doesn't work anymore:
Ah, there it is. The link in the migration guide is broken. |
So, I think I finally worked out what security groups there are for a simple cluster with an
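For anyone hitting the same pod-networking regression with v18, the commonly shared workaround is to re-open node-to-node and egress traffic on the node security group the module now creates. A hedged sketch using the `node_security_group_additional_rules` input (rule keys are arbitrary; the rest of the module arguments are elided):

```hcl
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  # ... rest of the cluster configuration ...

  node_security_group_additional_rules = {
    ingress_self_all = {
      description = "Node-to-node traffic on all ports/protocols"
      protocol    = "-1"
      from_port   = 0
      to_port     = 0
      type        = "ingress"
      self        = true
    }
    egress_all = {
      description      = "Allow all egress from nodes"
      protocol         = "-1"
      from_port        = 0
      to_port          = 0
      type             = "egress"
      cidr_blocks      = ["0.0.0.0/0"]
      ipv6_cidr_blocks = ["::/0"]
    }
  }
}
```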
Hi, I have tried to update from 17.24.0 to 18.x; however, Terraform wants to destroy my cluster and recreate a new one. I have added all these variables as mentioned above but without any success.
For testing I did not include workers.

17.24.0 config:
18.24.1 config:
This forces the replacement:
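The workaround that circulated in this thread (and later landed in the upgrade notes) is to pin the names that v17 generated so the plan no longer sees a change on the cluster IAM role or cluster security group. A hedged sketch, assuming a module version that exposes `prefix_separator` (per the comment above, apparently 18.19.0 or later); substitute your actual cluster name:

```hcl
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.19"

  cluster_name    = "my-cluster" # placeholder: your existing cluster name
  cluster_version = "1.21"

  # Keep the resource names that v17 generated so the cluster is not replaced.
  prefix_separator                   = ""
  iam_role_name                      = "my-cluster"
  cluster_security_group_name        = "my-cluster"
  cluster_security_group_description = "EKS cluster security group."

  # ... vpc_id, subnet_ids, node groups, etc. ...
}
```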
Thank you @bryantbiggs, that worked for me.
Thanks, @ArchiFleKs, for the steps! One additional step I had to define because I used:
If this is not defined, the nodes, and all deployments on them, will become unreachable.
This is very important! I did this for one environment and it worked well; I was able to gradually drain and terminate the old nodes. I forgot this step for another environment, and right after
@qlikcoe @dusansusic it was mentioned later by a couple of others as well; GitHub has collapsed the majority of that discussion. For example, this was my experience: #1744 (comment)
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
Description
Attempted to follow the upgrade guide to get to 18+. Our Terraform deployments generally run from a Jenkins worker pod that exists ON the same cluster that we are upgrading. The pod has a service account on it, using the IRSA setup, which grants it access to the cluster.
This all works/worked before the upgrade.
Reproduction
Attempt to follow the upgrade guide for 18.
Code Snippet to Reproduce
locals
Expected behavior
Module will run to completion
Actual behavior
Current aws-auth
The SA on the pod that Terraform is running from.
The error terraform returns
Additional context
I do not doubt that I'm missing something, but that something does not appear to be covered in the documentation that I can find.