
self_managed_node_group creates invalid launch template when using network_interfaces #1880

Closed
mark-nimbix opened this issue Feb 15, 2022 · 16 comments · Fixed by #1980

Comments

@mark-nimbix

Description

When using the EKS module to create a self-managed node group, defining a network_interface in the self_managed_node_group sub-module produces an invalid launch template, per issue hashicorp/terraform-provider-aws#4570. It looks like there needs to be a conditional on the vpc_security_group_ids value that sets it to null when a network_interface is provided. I was able to get past the error and add the EFA interface_type I wanted by adding the following.

git diff
diff --git a/modules/self-managed-node-group/main.tf b/modules/self-managed-node-group/main.tf
index 1b7ca66..52bcb81 100644
--- a/modules/self-managed-node-group/main.tf
+++ b/modules/self-managed-node-group/main.tf
@@ -57,7 +57,7 @@ resource "aws_launch_template" "this" {
   key_name      = var.key_name
   user_data     = module.user_data.user_data

-  vpc_security_group_ids = compact(concat([try(aws_security_group.this[0].id, "")], var.vpc_security_group_ids))
+  vpc_security_group_ids = length(var.network_interfaces) > 0 ? null :  compact(concat([try(aws_security_group.this[0].id, "")], var.vpc_security_group_ids))

   default_version                      = var.launch_template_default_version
   update_default_version               = var.update_launch_template_default_version
@@ -197,6 +197,7 @@ resource "aws_launch_template" "this" {
     for_each = var.network_interfaces
     content {
       associate_carrier_ip_address = lookup(network_interfaces.value, "associate_carrier_ip_address", null)
+      interface_type               = lookup(network_interfaces.value, "interface_type", null)
       associate_public_ip_address  = lookup(network_interfaces.value, "associate_public_ip_address", null)
       delete_on_termination        = lookup(network_interfaces.value, "delete_on_termination", null)
       description                  = lookup(network_interfaces.value, "description", null)

Although I then ran into a new error, I was able to get past it by setting the variable that disables creation of the CloudWatch log group. I am not sure how the change above is related to this new error.

│ Error: Creating CloudWatch Log Group failed: ResourceAlreadyExistsException: The specified log group already exists: The CloudWatch Log Group '/aws/eks/tf-west-test-vpc/cluster' already exists.

Versions

  • Terraform: 1.1.5
  • Provider(s): AWS 3.74.2
  • Module: EKS 18.2.7

Reproduction

Steps to reproduce the behavior:
Create an EKS cluster using the EKS module with a self_managed_node_group that has a network_interface defined. This produces a clean plan, but on apply the autoscaling group fails to create due to the issue listed above (when a network interface is defined for a launch template, the security groups must be defined in the network_interfaces block and not with vpc_security_group_ids at the top level of the launch template).

Code Snippet to Reproduce


self_managed_node_groups = {
  system = {
    name                 = "system"
    instance_type        = var.node_type
    ami_id               = var.ami.id
    desired_size         = 2
    min_size             = 1
    max_size             = 4
    key_name             = ""
    bootstrap_extra_args = "--kubelet-extra-args '--node-labels=node-role.jarvice.io/jarvice-system=true,node-pool.jarvice.io/jarvice-system=jxesystem --register-with-taints=node-role.jarvice.io/jarvice-system=true:NoSchedule'"
    network_interfaces = [
      {
        device_index                = 0
        associate_public_ip_address = true
        interface_type              = "efa"
        security_groups             = [aws_security_group.ssh.id]
      }
    ]
    enable_bootstrap_user_data = true
    pre_bootstrap_user_data    = <<EOF
# pre_userdata (executed before kubelet bootstrap and cluster join)
# Add authorized ssh key
echo "${var.ssh_public_key}" >>/home/ec2-user/.ssh/authorized_keys
EOF
  }
}

Expected behavior

Successful creation of a self-managed node group with an EFA interface.

Actual behavior

Error on terraform apply.
│ Error: Error creating Auto Scaling Group: InvalidQueryParameter: Invalid launch template: When a network interface is provided, the security groups must be a part of it.

Terminal Output Screenshot(s)

Additional context

@bryantbiggs
Member

I don't think there is anything for us to do here, no? The error message from the API is pretty clear - a security group must be provided when creating a custom ENI - what solution are you looking for?

@mark-nimbix
Author

mark-nimbix commented Feb 15, 2022

I don't think there is anything for us to do here, no? The error message from the API is pretty clear - a security group must be provided when creating a custom ENI - what solution are you looking for?

The EKS module is creating the bad launch template. I can not get around that using the EKS module as it stands currently.

@bryantbiggs
Member

The EKS module is creating the bad launch template. I can not get around that using the EKS module.

I don't follow

@bryantbiggs
Member

From the upstream issue you linked - hashicorp/terraform-provider-aws#4570 (comment) - this is the intended behavior from the AWS API.

@mark-nimbix
Author

mark-nimbix commented Feb 15, 2022

The EKS module is creating the bad launch template. I can not get around that using the EKS module.

I don't follow

If you read the issue I linked, you will see the error is misleading. To create a valid template you do need to define the security groups in the network_interfaces block, which I have. But if you also set the vpc_security_group_ids value in the launch_template you will get the error in the linked issue. So the problem is that the EKS sub-module always defines the vpc_security_group_ids value, which creates an invalid template should you define any network interface.

hashicorp/terraform-provider-aws#4570 (comment)
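
To make the constraint concrete, here is a minimal standalone sketch (not code from the module; aws_security_group.ssh is a placeholder). The launch template itself is created either way; the failure only surfaces when the Auto Scaling group consumes a template that sets top-level security groups alongside a network interface.

# Fails at ASG creation time: top-level security groups combined with a network interface.
resource "aws_launch_template" "invalid_example" {
  name_prefix            = "invalid-"
  vpc_security_group_ids = [aws_security_group.ssh.id] # conflicts with network_interfaces.security_groups

  network_interfaces {
    device_index    = 0
    security_groups = [aws_security_group.ssh.id]
  }
}

# Works: security groups are declared only inside the network interface.
resource "aws_launch_template" "valid_example" {
  name_prefix = "valid-"

  network_interfaces {
    device_index    = 0
    interface_type  = "efa"
    security_groups = [aws_security_group.ssh.id]
  }
}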

@bryantbiggs
Member

Ya, I don't see this as an issue with the module here. To get around this it's quite simple: create your own launch template and pass it into the module.
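
For anyone who wants to go that route, a rough sketch of what it could look like (untested; it assumes the 18.x sub-module honors create_launch_template = false together with launch_template_name, and aws_security_group.ssh plus the var.* references are placeholders):

resource "aws_launch_template" "efa" {
  name_prefix = "efa-nodes-"
  image_id    = var.ami_id    # placeholder
  user_data   = var.user_data # placeholder; must include the EKS bootstrap

  network_interfaces {
    device_index                = 0
    interface_type              = "efa"
    associate_public_ip_address = true
    security_groups             = [aws_security_group.ssh.id]
  }
}

# In the eks module call:
self_managed_node_groups = {
  system = {
    create_launch_template = false
    launch_template_name   = aws_launch_template.efa.name
    # ...
  }
}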

@mark-nimbix
Author

By that logic you might as well remove the network interfaces block from the help docs in the EKS module, because per the issue we have been discussing you need to define a security group for a network interface, and doing so requires that the vpc_security_group_ids value be null in the launch_template. So from what I can tell this module cannot use network interfaces in any way with the self_managed_node_groups.
The reason I did not create a launch template is that I am using bootstrap_extra_args, which from what I can tell in the user_data module is dependent on the EKS cluster creation.

@bryantbiggs
Member

By that logic you might as well remove the network interfaces block from the help docs in the EKS module, because per the issue we have been discussing you need to define a security group for a network interface, and doing so requires that the vpc_security_group_ids value be null in the launch_template. So from what I can tell this module cannot use network interfaces in any way with the self_managed_node_groups. The reason I did not create a launch template is that I am using bootstrap_extra_args, which from what I can tell in the user_data module is dependent on the EKS cluster creation.

Correct - it's dependent on *cluster* creation, not node creation (it needs the endpoint, certificate, etc.). You can use the sub-modules separately from the root module though; this is how the root module is composed.
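
A rough sketch of that composition (untested; the sub-module input names below are taken from how the root module wires it together in 18.x and should be treated as assumptions):

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"
  # ... cluster configuration ...
}

module "efa_nodes" {
  source  = "terraform-aws-modules/eks/aws//modules/self-managed-node-group"
  version = "~> 18.0"

  name         = "efa"
  cluster_name = module.eks.cluster_id

  # user_data rendering depends on outputs that only exist after cluster creation
  cluster_endpoint    = module.eks.cluster_endpoint
  cluster_auth_base64 = module.eks.cluster_certificate_authority_data

  # ... subnet_ids, ami_id, instance_type, network_interfaces, etc. ...
}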

@mark-nimbix
Author

By that logic you might as well remove the network interfaces block from the help docs in the EKS module, because per the issue we have been discussing you need to define a security group for a network interface, and doing so requires that the vpc_security_group_ids value be null in the launch_template. So from what I can tell this module cannot use network interfaces in any way with the self_managed_node_groups. The reason I did not create a launch template is that I am using bootstrap_extra_args, which from what I can tell in the user_data module is dependent on the EKS cluster creation.

Correct - it's dependent on *cluster* creation, not node creation (it needs the endpoint, certificate, etc.). You can use the sub-modules separately from the root module though; this is how the root module is composed.

So there is no way to use the network_interfaces in a self_managed_node_group with the current EKS module?

@bryantbiggs
Member

I don't know the answer to that question - I haven't tried it myself. I do find it hard to believe since we have this module as well as the autoscaling module https://github.com/bryantbiggs/terraform-aws-autoscaling/blob/master/main.tf#L172-L191 and this is the first time I am hearing about this. I am sure there is a way to make it work, you'll just have to get creative and figure out how

@mark-nimbix
Author

I don't know the answer to that question - I haven't tried it myself. I do find it hard to believe since we have this module as well as the autoscaling module https://github.com/bryantbiggs/terraform-aws-autoscaling/blob/master/main.tf#L172-L191 and this is the first time I am hearing about this. I am sure there is a way to make it work, you'll just have to get creative and figure out how

Perhaps you are confused since I included EFA. If you remove the need for EFA the problem still exists. The network_interfaces block is unusable in the self_managed_node_groups.

@mark-nimbix
Author

mark-nimbix commented Feb 15, 2022

In an effort to make a simplified example I cloned the EKS repo and used the self_managed_node_group example.
Steps to reproduce:
git clone https://github.com/terraform-aws-modules/terraform-aws-eks.git
cd terraform-aws-eks/examples/self_managed_node_group
Replace main.tf with the following version. The only things I changed were removing the complete example, adding a network_interfaces block to the bottlerocket example, and setting the region to us-west-2.
terraform init
terraform apply

network_interfaces = [
  {
    device_index                = 0
    associate_public_ip_address = true
    #interface_type             = "efa"
    security_groups             = [aws_security_group.additional.id]
  }
]

provider "aws" {
  region = local.region
}

locals {
  name            = "ex-${replace(basename(path.cwd), "_", "-")}"
  cluster_version = "1.21"
  region          = "us-west-2"

  tags = {
    Example    = local.name
    GithubRepo = "terraform-aws-eks"
    GithubOrg  = "terraform-aws-modules"
  }
}

data "aws_caller_identity" "current" {}

################################################################################
# EKS Module
################################################################################

module "eks" {
  source = "../.."

  cluster_name                    = local.name
  cluster_version                 = local.cluster_version
  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true

  cluster_addons = {
    coredns = {
      resolve_conflicts = "OVERWRITE"
    }
    kube-proxy = {}
    vpc-cni = {
      resolve_conflicts = "OVERWRITE"
    }
  }

  cluster_encryption_config = [{
    provider_key_arn = aws_kms_key.eks.arn
    resources        = ["secrets"]
  }]

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  # Extend cluster security group rules
  cluster_security_group_additional_rules = {
    egress_nodes_ephemeral_ports_tcp = {
      description                = "To node 1025-65535"
      protocol                   = "tcp"
      from_port                  = 1025
      to_port                    = 65535
      type                       = "egress"
      source_node_security_group = true
    }
  }

  # Extend node-to-node security group rules
  node_security_group_additional_rules = {
    ingress_self_all = {
      description = "Node to node all ports/protocols"
      protocol    = "-1"
      from_port   = 0
      to_port     = 0
      type        = "ingress"
      self        = true
    }
    egress_all = {
      description      = "Node all egress"
      protocol         = "-1"
      from_port        = 0
      to_port          = 0
      type             = "egress"
      cidr_blocks      = ["0.0.0.0/0"]
      ipv6_cidr_blocks = ["::/0"]
    }
  }

  self_managed_node_group_defaults = {
    disk_size = 50
  }

  self_managed_node_groups = {
    # Default node group - as provisioned by the module defaults
    default_node_group = {}

    # Bottlerocket node group
    bottlerocket = {
      name = "bottlerocket-self-mng"

      platform      = "bottlerocket"
      ami_id        = data.aws_ami.eks_default_bottlerocket.id
      instance_type = "m5.large"
      desired_size  = 2
      key_name      = aws_key_pair.this.key_name

      iam_role_additional_policies = ["arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"]
      network_interfaces = [
        {
          device_index                = 0
          associate_public_ip_address = true
          #interface_type             = "efa"
          security_groups             = [aws_security_group.additional.id]
        }
      ]
      bootstrap_extra_args = <<-EOT
      # The admin host container provides SSH access and runs with "superpowers".
      # It is disabled by default, but can be disabled explicitly.
      [settings.host-containers.admin]
      enabled = false

      # The control host container provides out-of-band access via SSM.
      # It is enabled by default, and can be disabled if you do not expect to use SSM.
      # This could leave you with no way to access the API and change settings on an existing node!
      [settings.host-containers.control]
      enabled = true
      [settings.kubernetes.node-labels]
      ingress = "allowed"
      EOT
    }

  }

  tags = local.tags
}

################################################################################
# aws-auth configmap
# Only EKS managed node groups automatically add roles to aws-auth configmap
# so we need to ensure fargate profiles and self-managed node roles are added
################################################################################

data "aws_eks_cluster_auth" "this" {
  name = module.eks.cluster_id
}

locals {
  kubeconfig = yamlencode({
    apiVersion      = "v1"
    kind            = "Config"
    current-context = "terraform"
    clusters = [{
      name = module.eks.cluster_id
      cluster = {
        certificate-authority-data = module.eks.cluster_certificate_authority_data
        server                     = module.eks.cluster_endpoint
      }
    }]
    contexts = [{
      name = "terraform"
      context = {
        cluster = module.eks.cluster_id
        user    = "terraform"
      }
    }]
    users = [{
      name = "terraform"
      user = {
        token = data.aws_eks_cluster_auth.this.token
      }
    }]
  })
}

resource "null_resource" "apply" {
  triggers = {
    kubeconfig = base64encode(local.kubeconfig)
    cmd_patch  = <<-EOT
      kubectl create configmap aws-auth -n kube-system --kubeconfig <(echo $KUBECONFIG | base64 --decode)
      kubectl patch configmap/aws-auth --patch "${module.eks.aws_auth_configmap_yaml}" -n kube-system --kubeconfig <(echo $KUBECONFIG | base64 --decode)
    EOT
  }

  provisioner "local-exec" {
    interpreter = ["/bin/bash", "-c"]
    environment = {
      KUBECONFIG = self.triggers.kubeconfig
    }
    command = self.triggers.cmd_patch
  }
}

################################################################################
# Supporting Resources
################################################################################

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 3.0"

  name = local.name
  cidr = "10.0.0.0/16"

  azs             = ["${local.region}a", "${local.region}b", "${local.region}c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.4.0/24", "10.0.5.0/24", "10.0.6.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true

  enable_flow_log                      = true
  create_flow_log_cloudwatch_iam_role  = true
  create_flow_log_cloudwatch_log_group = true

  public_subnet_tags = {
    "kubernetes.io/cluster/${local.name}" = "shared"
    "kubernetes.io/role/elb"              = 1
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/${local.name}" = "shared"
    "kubernetes.io/role/internal-elb"     = 1
  }

  tags = local.tags
}

resource "aws_security_group" "additional" {
  name_prefix = "${local.name}-additional"
  vpc_id      = module.vpc.vpc_id

  ingress {
    from_port = 22
    to_port   = 22
    protocol  = "tcp"
    cidr_blocks = [
      "10.0.0.0/8",
      "172.16.0.0/12",
      "192.168.0.0/16",
    ]
  }

  tags = local.tags
}

resource "aws_kms_key" "eks" {
  description             = "EKS Secret Encryption Key"
  deletion_window_in_days = 7
  enable_key_rotation     = true

  tags = local.tags
}

data "aws_ami" "eks_default" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amazon-eks-node-${local.cluster_version}-v*"]
  }
}

data "aws_ami" "eks_default_bottlerocket" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["bottlerocket-aws-k8s-${local.cluster_version}-x86_64-*"]
  }
}

resource "tls_private_key" "this" {
  algorithm = "RSA"
}

resource "aws_key_pair" "this" {
  key_name   = local.name
  public_key = tls_private_key.this.public_key_openssh
}

resource "aws_kms_key" "ebs" {
  description             = "Customer managed key to encrypt self managed node group volumes"
  deletion_window_in_days = 7
  policy                  = data.aws_iam_policy_document.ebs.json
}

# This policy is required for the KMS key used for EKS root volumes, so the cluster is allowed to enc/dec/attach encrypted EBS volumes
data "aws_iam_policy_document" "ebs" {
  # Copy of default KMS policy that lets you manage it
  statement {
    sid       = "Enable IAM User Permissions"
    actions   = ["kms:*"]
    resources = ["*"]

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"]
    }
  }

  # Required for EKS
  statement {
    sid = "Allow service-linked role use of the CMK"
    actions = [
      "kms:Encrypt",
      "kms:Decrypt",
      "kms:ReEncrypt*",
      "kms:GenerateDataKey*",
      "kms:DescribeKey"
    ]
    resources = ["*"]

    principals {
      type = "AWS"
      identifiers = [
        "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling", # required for the ASG to manage encrypted volumes for nodes
        module.eks.cluster_iam_role_arn,                                                                                                            # required for the cluster / persistentvolume-controller to create encrypted PVCs
      ]
    }
  }

  statement {
    sid       = "Allow attachment of persistent resources"
    actions   = ["kms:CreateGrant"]
    resources = ["*"]

    principals {
      type = "AWS"
      identifiers = [
        "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling", # required for the ASG to manage encrypted volumes for nodes
        module.eks.cluster_iam_role_arn,                                                                                                            # required for the cluster / persistentvolume-controller to create encrypted PVCs
      ]
    }

    condition {
      test     = "Bool"
      variable = "kms:GrantIsForAWSResource"
      values   = ["true"]
    }
  }
}

The problem reproduces with the example. I do not see a way the network_interfaces block can be used at all with self_managed_node_groups in the current code. You will need to run terraform apply to see the error.

│ Error: Error creating Auto Scaling Group: InvalidQueryParameter: Invalid launch template: When a network interface is provided, the security groups must be a part of it.
│ 	status code: 400, request id: 45beb34b-81ad-4370-935b-3100e2527861
│
│   with module.eks.module.self_managed_node_group["bottlerocket"].aws_autoscaling_group.this[0],
│   on ../../modules/self-managed-node-group/main.tf line 260, in resource "aws_autoscaling_group" "this":
│  260: resource "aws_autoscaling_group" "this" {
│
╵ 

@chuajiesheng

It seems to be because create_security_group defaults to true.

Which results in self-managed-node-group.aws_security_group.this being created.

And when self-managed-node-group.aws_launch_template.this is run, vpc_security_group_ids ends up populated with aws_security_group.this.

And from the launch template doc here:

vpc_security_group_ids - A list of security group IDs to associate with. Conflicts with network_interfaces.security_groups

You can fix this by manually going to the UI, creating a new launch template version that moves the common security group out, and updating the ASG.

@mark-nimbix
Author

It seems to be because create_security_group defaults to true.

Which results in self-managed-node-group.aws_security_group.this being created.

And when self-managed-node-group.aws_launch_template.this is run, vpc_security_group_ids ends up populated with aws_security_group.this.

And from the launch template doc here:

vpc_security_group_ids - A list of security group IDs to associate with. Conflicts with network_interfaces.security_groups

You can fix this by manually going to the UI, creating a new launch template version that moves the common security group out, and updating the ASG.

Thanks for the explanation. We ended up using a fork with a variation of the code fix I mentioned above. That ended up being a cleaner solution for us.

@antonbabenko
Member

This issue has been resolved in version 18.18.0 🎉
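
For anyone upgrading, a sketch of what a node group definition might look like on 18.18.0+ (assuming the released fix follows the approach discussed above, i.e. security groups go inside the network_interfaces block; the security group and instance type are placeholders):

self_managed_node_groups = {
  efa = {
    instance_type = "c5n.18xlarge" # an EFA-capable instance type
    network_interfaces = [
      {
        device_index                = 0
        associate_public_ip_address = true
        interface_type              = "efa"
        security_groups             = [aws_security_group.additional.id]
      }
    ]
  }
}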

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 12, 2022