
NodeCreationFailure: Instances failed to join the kubernetes cluster. This is happening on a fresh cluster. #2149

Closed
arnav13081994 opened this issue Jul 6, 2022 · 15 comments

@arnav13081994

Description

I followed the docs and have exhausted all the resources online, but I am still not able to create an EKS cluster with EKS managed nodes. I always get the following error:

│ Error: error waiting for EKS Node Group (eks-dev-eks-cluster:default_node_group-2022070609081040940000000f) to create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 2 errors occurred:
│       * eks-default_node_group-2022070609081040940000000f-ecc0e962-29c0-7802-83d8-213eca9d1cd7: AsgInstanceLaunchFailures: You've reached your quota for maximum Fleet Requests for this account. Launching EC2 instance failed.
│       * DUMMY_04f2c42f-98d6-428c-aed2-95deada02ad2, DUMMY_46fee16c-8052-4fc7-a170-522943edc191, DUMMY_4ff890be-596d-4370-85eb-56146cc1b5ea, DUMMY_c94e1a5e-9bce-42c2-bc7b-7b24db9216f5, DUMMY_d36b7c25-3716-4b50-92e7-ac48c400e33a, DUMMY_fa1ffb54-1a20-4a9e-b302-e31db512548c: NodeCreationFailure: Instances failed to join the kubernetes cluster

Versions

  • Terraform version: ~> 1.2.3
  • Provider version(s):
aws = {
  version = "~> 4.21.0"
}
kubernetes = {
  version = "~>2.12.0"
}

Reproduction Code [Required]

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.21.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~>2.12.0"
    }
  }
  required_version = "~> 1.2.3"
}


provider "aws" {
  profile = "..."
  region  = local.region

  default_tags {
    tags = {
      Environment = "Staging"
      Terraform   = "True"
    }
  }
}

#
# Housekeeping
#

locals {
  project_name    = "eks-dev"
  cluster_name    = "${local.project_name}-eks-cluster"
  cluster_version = "1.21"
  region          = "us-west-1"
}


/*
The following 2 data resources are used to get around the fact that we have to wait
for the EKS cluster to be initialised before we can attempt to authenticate.
*/

data "aws_eks_cluster" "default" {
  name = module.eks.cluster_id
}

data "aws_eks_cluster_auth" "default" {
  name = module.eks.cluster_id
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.default.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.default.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.default.token
}
#############################################################################################
#############################################################################################

# Create EKS Cluster
#############################################################################################
#############################################################################################
# Create VPC for EKS Cluster
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 3.0"

  name = local.cluster_name
  cidr = "10.0.0.0/16"

  azs             = ["${local.region}a", "${local.region}b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.3.0/24", "10.0.4.0/24"]


  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true

  enable_flow_log                      = true
  create_flow_log_cloudwatch_iam_role  = true
  create_flow_log_cloudwatch_log_group = true

  public_subnet_tags = {
    "kubernetes.io/cluster/${local.cluster_name}" = "shared"
    "kubernetes.io/role/elb"                      = "1"
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/${local.cluster_name}" = "shared"
    "kubernetes.io/role/internal-elb"             = "1"
  }
}


resource "aws_security_group" "additional" {
  name_prefix = "${local.cluster_name}-additional"
  vpc_id      = module.vpc.vpc_id

  ingress {
    from_port = 22
    to_port   = 22
    protocol  = "tcp"
    cidr_blocks = [
      "10.0.0.0/8",
      "172.16.0.0/12",
      "192.168.0.0/16",
    ]
  }
}




module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "18.17.0"

  cluster_name    = local.cluster_name
  cluster_version = local.cluster_version

  cluster_endpoint_public_access = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets


  eks_managed_node_group_defaults = {
    ami_type                              = "AL2_x86_64"
    disk_size                             = 50
    attach_cluster_primary_security_group = true
    vpc_security_group_ids                = [aws_security_group.additional.id]
  }
  eks_managed_node_groups = {
    first = {
      desired_size = 1
      max_size     = 1
      min_size     = 1
    }
  }
}


Steps to reproduce the behavior:

Just run terraform apply --auto-approve and, after waiting about 20 minutes, you will see the aforementioned error.

Expected behavior

The EKS cluster with one EKS managed node group is created.

Actual behavior

The following error is thrown:

│ Error: error waiting for EKS Node Group (eks-dev-eks-cluster:default_node_group-2022070609081040940000000f) to create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 2 errors occurred:
│       * eks-default_node_group-2022070609081040940000000f-ecc0e962-29c0-7802-83d8-213eca9d1cd7: AsgInstanceLaunchFailures: You've reached your quota for maximum Fleet Requests for this account. Launching EC2 instance failed.
│       * DUMMY_04f2c42f-98d6-428c-aed2-95deada02ad2, DUMMY_46fee16c-8052-4fc7-a170-522943edc191, DUMMY_4ff890be-596d-4370-85eb-56146cc1b5ea, DUMMY_c94e1a5e-9bce-42c2-bc7b-7b24db9216f5, DUMMY_d36b7c25-3716-4b50-92e7-ac48c400e33a, DUMMY_fa1ffb54-1a20-4a9e-b302-e31db512548c: NodeCreationFailure: Instances failed to join the kubernetes cluster

Additional context

I have read other similar issues and have experimented with iam_role_attach_cni_policy = true, but I still get the same error. Any help would be greatly appreciated; this has been extremely frustrating.
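
For reference, a minimal sketch of where that flag was set, assuming the v18 module inputs from the reproduction code above (iam_role_attach_cni_policy is passed through eks_managed_node_group_defaults):

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "18.17.0"

  cluster_name    = local.cluster_name
  cluster_version = local.cluster_version
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnets

  eks_managed_node_group_defaults = {
    ami_type                              = "AL2_x86_64"
    disk_size                             = 50
    attach_cluster_primary_security_group = true

    # Attach AmazonEKS_CNI_Policy directly to the node IAM role instead of
    # relying on IRSA for the VPC CNI add-on.
    iam_role_attach_cni_policy = true
  }

  eks_managed_node_groups = {
    first = {
      min_size     = 1
      max_size     = 1
      desired_size = 1
    }
  }
}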

@arnav13081994 arnav13081994 changed the title NodeCreationFailure: Instances failed to join the kubernetes cluster NodeCreationFailure: Instances failed to join the kubernetes cluster. This is happening on a fresh cluster. Jul 7, 2022
@tanvp112

AsgInstanceLaunchFailures: You've reached your quota for maximum Fleet Requests for this account.

Maybe you need to raise the fleet quota.
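
For reference, the quota mentioned in the error is the EC2 Fleet request limit under Service Quotas. If it really is exhausted, an increase can also be requested from Terraform; a minimal sketch using the AWS provider's aws_servicequotas_service_quota resource (the quota_code below is a placeholder; look up the exact code for the Fleet-related quota under the ec2 service in the Service Quotas console):

# Sketch only: request a higher value for an EC2 fleet-related service quota.
resource "aws_servicequotas_service_quota" "ec2_fleet_requests" {
  service_code = "ec2"
  quota_code   = "L-XXXXXXXX" # placeholder quota code
  value        = 1000         # desired new quota value
}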

@arnav13081994
Author

@tanvp112 I'm not sure this is about any quota increase, as I'm creating just 1 node.

Have you faced the same issue?

@sebastianmacarescu

I have the same issue. Does anybody know why?

@arnav13081994
Author

@sebastianmacarescu

The following config worked for me. I still don't know why it worked, though; there seems to be some race condition.

terraform {
  required_version = "~> 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0"
    }
  }
}


provider "aws" {
  region  = "us-east-1"
  profile = "ADD NAME OF AWS PROFILE OR SET CREDS EXPLICITLY"
}

data "aws_eks_cluster" "default" {
  name = module.eks_default.cluster_id
  depends_on = [
    module.eks_default.cluster_id,
  ]
}

data "aws_eks_cluster_auth" "default" {
  name = module.eks_default.cluster_id
  depends_on = [
    module.eks_default.cluster_id,
  ]
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.default.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.default.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.default.token
}

provider "helm" {
  kubernetes {
    host                   = data.aws_eks_cluster.default.endpoint
    cluster_ca_certificate = base64decode(data.aws_eks_cluster.default.certificate_authority[0].data)
    token                  = data.aws_eks_cluster_auth.default.token
  }
}

################################################################################
# Common Locals
################################################################################

locals {
  # Used to determine correct partition (i.e. - `aws`, `aws-gov`, `aws-cn`, etc.)
  partition = data.aws_partition.current.partition
}

################################################################################
# Common Data
################################################################################

data "aws_partition" "current" {}
data "aws_caller_identity" "current" {}

################################################################################
# Common Modules
################################################################################

module "tags" {
  # tflint-ignore: terraform_module_pinned_source
  source = "github.com/clowdhaus/terraform-tags"

  application = "someclustername"
  environment = "nonprod"
  repository  = "https://github.com/clowdhaus/eks-reference-architecture"
}


################################################################################
# EKS Modules
################################################################################

module "vpc" {
  # https://registry.terraform.io/modules/terraform-aws-modules/vpc/aws/latest
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 3.12"

  name = "someclustername"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway     = true
  single_nat_gateway     = true
  one_nat_gateway_per_az = false

  enable_dns_hostnames = true

  manage_default_network_acl    = true
  default_network_acl_tags      = { Name = "someclustername-default" }
  manage_default_route_table    = true
  default_route_table_tags      = { Name = "someclustername-default" }
  manage_default_security_group = true
  default_security_group_tags   = { Name = "someclustername-default" }

  public_subnet_tags = {
    "kubernetes.io/cluster/someclustername-default" = "shared"
    "kubernetes.io/role/elb"                    = 1
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/someclustername-default" = "shared"
    "kubernetes.io/role/internal-elb"           = 1
  }

  tags = module.tags.tags
}




module "eks_default" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.26"

  cluster_name    = "someclustername-default"
  cluster_version = "1.22"

  # EKS Addons
  cluster_addons = {
    coredns = {
      resolve_conflicts = "OVERWRITE"
    }
    kube-proxy = {}
    vpc-cni = {
      resolve_conflicts = "OVERWRITE"
    }
  }

  # Encryption key
  create_kms_key = true
  cluster_encryption_config = [{
    resources = ["secrets"]
  }]

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    default = {
      # By default, the module creates a launch template to ensure tags are propagated to instances, etc.,
      # so we need to disable it to use the default template provided by the AWS EKS managed node group service
      create_launch_template = false
      launch_template_name   = ""

      # list of pods per instance type: https://github.com/awslabs/amazon-eks-ami/blob/master/files/eni-max-pods.txt
      # or run: kubectl get node -o yaml | grep pods
      instance_types = ["t2.xlarge"]
      disk_size      = 50

      # Is deprecated and will be removed in v19.x
      create_security_group = false

      min_size     = 1
      max_size     = 3
      desired_size = 1

      update_config = {
        max_unavailable_percentage = 33
      }
    }
  }

  tags = module.tags.tags
}

@AmitKulkarni9

@arnav13081994
I am getting the same error.
Error: error waiting for EKS Node Group (devopsthehardway-cluster:devopsthehardway-workernodes) to create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 2 errors occurred:
│ * eks-devopsthehardway-workernodes-74c19ba3-f519-395b-e417-a16c178036c0: AsgInstanceLaunchFailures: You've reached your quota for maximum Fleet Requests for this account. Launching EC2 instance failed.
│ * DUMMY_085e351a-269f-4d54-b838-916f649a9cce, DUMMY_187d8572-0d47-4f5b-8986-4bfd680b3b93, DUMMY_2dc892c6-fce4-4c83-a29b-9b1f714e5adf, DUMMY_a4f7ff66-b607-4c59-9585-a3be5dd0cdf5, DUMMY_a53dbd59-1a80-4207-af6a-ab72e6421fe1: NodeCreationFailure: Instances failed to join the kubernetes cluster

Below is my code
terraform {
  backend "s3" {
    bucket = "terraform-state-amtoyadevopsthehardway"
    key    = "eks-terraform-workernodes.tfstate"
    region = "ap-southeast-2"
  }
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}

# IAM Role for EKS to have access to the appropriate resources
resource "aws_iam_role" "eks-iam-role" {
  name = "devopsthehardway-eks-iam-role"

  path = "/"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "eks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

# Attach the IAM policy to the IAM role
resource "aws_iam_role_policy_attachment" "AmazonEKSClusterPolicy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
  role       = aws_iam_role.eks-iam-role.name
}

resource "aws_iam_role_policy_attachment" "AmazonEC2ContainerRegistryReadOnly-EKS" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
  role       = aws_iam_role.eks-iam-role.name
}

# Create the EKS cluster
resource "aws_eks_cluster" "devopsthehardway-eks" {
  name     = "devopsthehardway-cluster"
  role_arn = aws_iam_role.eks-iam-role.arn

  vpc_config {
    subnet_ids = [var.subnet_id_1, var.subnet_id_2]
  }

  depends_on = [
    aws_iam_role.eks-iam-role,
  ]
}

# Worker Nodes
resource "aws_iam_role" "workernodes" {
  name = "eks-node-group-example"

  assume_role_policy = jsonencode({
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "ec2.amazonaws.com"
      }
    }]
    Version = "2012-10-17"
  })
}

resource "aws_iam_role_policy_attachment" "AmazonEKSWorkerNodePolicy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
  role       = aws_iam_role.workernodes.name
}

resource "aws_iam_role_policy_attachment" "AmazonEKS_CNI_Policy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
  role       = aws_iam_role.workernodes.name
}

resource "aws_iam_role_policy_attachment" "EC2InstanceProfileForImageBuilderECRContainerBuilds" {
  policy_arn = "arn:aws:iam::aws:policy/EC2InstanceProfileForImageBuilderECRContainerBuilds"
  role       = aws_iam_role.workernodes.name
}

resource "aws_iam_role_policy_attachment" "AmazonEC2ContainerRegistryReadOnly" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
  role       = aws_iam_role.workernodes.name
}

resource "aws_eks_node_group" "worker-node-group" {
  cluster_name    = aws_eks_cluster.devopsthehardway-eks.name
  node_group_name = "devopsthehardway-workernodes"
  node_role_arn   = aws_iam_role.workernodes.arn
  subnet_ids      = [var.subnet_id_1, var.subnet_id_2]
  instance_types  = ["t3.xlarge"]

  scaling_config {
    desired_size = 1
    max_size     = 1
    min_size     = 1
  }

  depends_on = [
    aws_iam_role_policy_attachment.AmazonEKSWorkerNodePolicy,
    aws_iam_role_policy_attachment.AmazonEKS_CNI_Policy,
    # aws_iam_role_policy_attachment.AmazonEC2ContainerRegistryReadOnly,
  ]
}

@Chakki1301

Same error. It's a new AWS account with very few EC2 instances. Something else goes wrong when this is done via Terraform automation or eksctl.

unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 2 errors occurred:
│ * eks-managed-ondemand-20220916000118777000000009-90c1a1cc-c40a-4d69-60fa-d40ad3479549: AsgInstanceLaunchFailures: You've reached your quota for maximum Fleet Requests for this account. Launching EC2 instance failed.

@lauren-themis

Same error - tried on 4.24.0 and 4.31.0. Why is this closed?

@Jaysins

Jaysins commented Oct 11, 2022

Anyone figured this out?

@chinchalinchin

chinchalinchin commented Oct 23, 2022

I am receiving this error as well. In the CloudTrail logs for the RunInstances API call that EKS makes when provisioning new nodes, it appears this is related to how the EC2 instance profile is attached to the node:

{
  "errorCode": "Client.InvalidParameterValue",
  "errorMessage": "Value (eks-xxxx) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name"
}

A possible workaround is creating your own EC2 launch template and then using that in the node_group definition; however, you would need to replicate the launch template EKS uses by default: https://docs.aws.amazon.com/eks/latest/userguide/launch-templates.html

I have not yet been able to do this.
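
For anyone attempting this, a rough sketch of the shape such a workaround could take, using the plain aws_launch_template and aws_eks_node_group resources (the cluster, role, and subnet references are borrowed from the config posted above and are only illustrative; EKS merges a supplied launch template with its own defaults, so only the overrides need to be declared):

# Sketch only: a custom launch template wired into a managed node group.
resource "aws_launch_template" "eks_nodes" {
  name_prefix = "custom-eks-nodes-"

  # Override the root volume here instead of using the node group disk_size
  # argument, which is not allowed together with a launch template.
  block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
      volume_size = 50
      volume_type = "gp3"
    }
  }

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name = "custom-eks-node"
    }
  }
}

resource "aws_eks_node_group" "custom" {
  cluster_name    = aws_eks_cluster.devopsthehardway-eks.name # illustrative reference
  node_group_name = "custom-lt-nodes"
  node_role_arn   = aws_iam_role.workernodes.arn              # illustrative reference
  subnet_ids      = [var.subnet_id_1, var.subnet_id_2]
  instance_types  = ["t3.xlarge"]

  # Point the node group at the custom template instead of the EKS default.
  launch_template {
    id      = aws_launch_template.eks_nodes.id
    version = aws_launch_template.eks_nodes.latest_version
  }

  scaling_config {
    desired_size = 1
    max_size     = 1
    min_size     = 1
  }
}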

@danvau7

danvau7 commented Nov 10, 2022

Getting the same error as well today. Currently looking into it.

@Jaysins

Jaysins commented Nov 10, 2022

Be sure you're not creating the nodes in a private subnet; that was the issue for me.
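
For what it's worth, nodes in private subnets can join as long as those subnets have outbound access (a NAT gateway or the relevant VPC interface endpoints) so the instances can pull images and reach the cluster endpoint. A minimal sketch of the relevant flags, mirroring the terraform-aws-modules/vpc usage earlier in this thread (the name is illustrative):

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 3.0"

  name = "example-eks-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-west-1a", "us-west-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.3.0/24", "10.0.4.0/24"]

  # Without NAT (or VPC endpoints for ECR, EC2, STS, and S3), instances in the
  # private subnets cannot bootstrap and never join the cluster.
  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true
}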

@esoxjem

esoxjem commented Nov 14, 2022

[FIXED] Run the automated runbook to see the actual issue:
https://docs.aws.amazon.com/systems-manager-automation-runbooks/latest/userguide/automation-awssupport-troubleshooteksworkernode.html

In our case, it was an issue with the security group and the user data script.

@danvau7

danvau7 commented Nov 15, 2022

Getting the same error as well today. Currently looking into it.

The issue was that I had restricted cluster_endpoint_public_access_cidrs to a specific subnet, which limited the ability of the nodes to talk to the API endpoint. Allowing them to reach the API endpoint via their local IPs solved the issue, so I just needed to add the following setting to make this error go away:

cluster_endpoint_private_access = true
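
For reference, a minimal sketch of how those endpoint settings fit into the eks module block from the original report (the CIDR below is a placeholder for an operator network):

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "18.17.0"

  cluster_name    = local.cluster_name
  cluster_version = local.cluster_version
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnets

  # Keep the public endpoint for operators (optionally restricted to known
  # CIDRs) and enable the private endpoint so worker nodes inside the VPC
  # can always reach the control plane.
  cluster_endpoint_public_access       = true
  cluster_endpoint_public_access_cidrs = ["203.0.113.0/24"] # placeholder
  cluster_endpoint_private_access      = true
}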

@swananddhole

@danvau7 I'm getting the error even after setting cluster_endpoint_private_access to true. Can anyone help out here? It's really frustrating.

@github-actions

github-actions bot commented Jan 8, 2023

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 8, 2023