
node autoscaling not scaling #324

Closed

pkelleratwork opened this issue Mar 27, 2019 · 11 comments

@pkelleratwork commented Mar 27, 2019

I have issues

I'm submitting a...

  • [x] bug report
  • [ ] feature request
  • [ ] support request
  • [ ] kudos, thank you, warm fuzzy

What is the current behavior?

node autoscaling does not scale any nodes

If this is a bug, how to reproduce? Please include a code sample if relevant.

  1. ran this module:
module "create-cluster" {
  source  = "terraform-aws-modules/eks/aws"
  version = "2.2.1"

  cluster_name              = "demothis"
  cluster_version           = "1.11"
  kubeconfig_name           = "demothis"
  manage_aws_auth           = "false"
  subnets                   = "3-public-subnets-here"
  vpc_id                    = "vpc-id-here"

  # worker node configurations
  # https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/local.tf

  workers_group_defaults = {
      asg_max_size          = "50"
      autoscaling_enabled   = "true"
      instance_type         = "t3.medium"
    }

  # tags to add to all resources
  tags = {
    cluster                 = "demothis"
    environment             = "dev"
  }
}
  2. created a clusterrolebinding:
    kubectl create clusterrolebinding add-on-cluster-admin --clusterrole=cluster-admin --serviceaccount=kube-system:default

  3. installed cluster-autoscaler via Helm:

resource "helm_release" "cluster-autoscaling" {
    name        = "cluster-autoscaler"
    namespace   = "kube-system"
    chart       = "stable/cluster-autoscaler"

    set {
        name    = "autoDiscovery.clusterName"
        value   = "demothis"
    }

    set {
        name    = "autoDiscovery.enabled"
        value   = "true"
    }

    set {
        name    = "cloudProvider"
        value   = "aws"
    }

    set {
        name    = "awsRegion"
        value   = "us-east-1"
    }

    set {
        name    = "sslCertPath"
        value   = "/etc/kubernetes/pki/ca.crt"
    }

    set {
        name    = "rbac.create"
        value   = "true"
    }
}
  4. verified cluster-autoscaler is installed in kube-system:
(⎈ demothis:)➜  projects ✗ kgp -n kube-system
NAME                                                        READY   STATUS    RESTARTS   AGE
aws-node-p4sww                                              1/1     Running   0          18h
cluster-autoscaler-aws-cluster-autoscaler-676f48b86-lzhtv   1/1     Running   1          18h
coredns-7bcbfc4774-6hnr4                                    1/1     Running   0          18h
coredns-7bcbfc4774-zkf64                                    1/1     Running   0          18h
kube-proxy-4btkw                                            1/1     Running   0          18h
kubernetes-dashboard-5478c45897-bqfhh                       1/1     Running   0          18h
metrics-server-5f64dbfb9d-qlvm7                             1/1     Running   0          18h
tiller-deploy-6fb466b55b-5m9b7                              1/1     Running   0          18h
  5. loaded apps; after a number of pods were created, one went Pending:
(⎈ demothis:)➜  projects ✗ kgp
NAME                                READY   STATUS    RESTARTS   AGE
pine-android-bff-9b4bbc55f-mhdzl    0/1     Pending   0          1m
pine-api-7d9776d589-99fpq           1/1     Running   0          1m
pine-api-7d9776d589-dr7rw           1/1     Running   0          1m
pine-auth-service-9b9f5dc66-r7j9z   1/1     Running   0          35m
pine-web-6bf54bf695-brv4w           1/1     Running   0          35m
pine-web-bff-7b49cd44c6-8qcj6       1/1     Running   0          35m

cluster-autoscaler logs:

I0327 19:52:01.520925       1 static_autoscaler.go:128] Starting main loop
I0327 19:52:01.697210       1 auto_scaling_groups.go:320] Regenerating instance to ASG map for ASGs: []
I0327 19:52:01.697232       1 aws_manager.go:152] Refreshed ASG list, next refresh after 2019-03-27 19:52:11.697229478 +0000 UTC m=+47783.645256945
I0327 19:52:01.697302       1 utils.go:526] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0327 19:52:01.697313       1 static_autoscaler.go:261] Filtering out schedulables
I0327 19:52:01.697367       1 static_autoscaler.go:271] No schedulable pods
I0327 19:52:01.697380       1 scale_up.go:262] Pod default/pine-android-bff-6bcb794bd8-7pl2f is unschedulable
I0327 19:52:01.697406       1 scale_up.go:304] Upcoming 0 nodes
I0327 19:52:01.697415       1 scale_up.go:420] No expansion options
I0327 19:52:01.697447       1 static_autoscaler.go:333] Calculating unneeded nodes
I0327 19:52:01.697457       1 utils.go:474] Skipping ip-192-168-174-2.ec2.internal - no node group config
I0327 19:52:01.697507       1 static_autoscaler.go:360] Scale down status: unneededOnly=false lastScaleUpTime=2019-03-27 06:36:26.055084128 +0000 UTC m=+38.003111545 lastScaleDownDeleteTime=2019-03-27 06:36:26.055084241 +0000 UTC m=+38.003111657 lastScaleDownFailTime=2019-03-27 06:36:26.055084352 +0000 UTC m=+38.003111770 scaleDownForbidden=false isDeleteInProgress=false
I0327 19:52:01.697522       1 static_autoscaler.go:370] Starting scale down
I0327 19:52:01.697547       1 scale_down.go:659] No candidates for scale down
I0327 19:52:01.698128       1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"pine-android-bff-6bcb794bd8-7pl2f", UID:"90d94a29-50c9-11e9-989f-0e074c530082", APIVersion:"v1", ResourceVersion:"133307", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added):

What's the expected behavior?

autoscaling should work - https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/autoscaling.md

Are you able to fix this problem and submit a PR? Link here if you have already. nope.

Environment details

  • Affected module version: terraform-aws-modules/eks/aws 2.2.1
  • OS: aws eks ec2 worker node
  • Terraform version: Terraform v0.11.13

Any other relevant info

I noticed there is no k8s.io/cluster-autoscaler/enabled tag created on the EC2 worker nodes. I tried adding it manually and restarting the cluster-autoscaler pod - it did not work.
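
For context, cluster-autoscaler auto-discovery only registers an ASG when the ASG itself (not its EC2 instances) carries the tags it searches for, which is why the log above shows Regenerating instance to ASG map for ASGs: []. As a rough sketch, the worker ASG created by the module would need tag stanzas along these lines (illustrative only; key names taken from the autoscaler flags and the Terraform plan output later in this thread):

  # inside the module-managed aws_autoscaling_group for the workers
  tag {
    key                 = "k8s.io/cluster-autoscaler/enabled"   # the key auto-discovery matches on
    value               = "true"
    propagate_at_launch = false
  }

  tag {
    key                 = "kubernetes.io/cluster/demothis"      # cluster ownership tag, already present
    value               = "owned"
    propagate_at_launch = true
  }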

@max-rocket-internet (Contributor)

What is the autoscaler logging on startup? It should look something like this:

I0328 08:44:11.028188       1 main.go:333] Cluster Autoscaler 1.13.1
I0328 08:44:11.065134       1 leaderelection.go:205] attempting to acquire leader lease  kube-system/cluster-autoscaler...
I0328 08:44:28.481337       1 leaderelection.go:214] successfully acquired lease kube-system/cluster-autoscaler
I0328 08:44:28.513396       1 predicates.go:122] Using predicate PodFitsResources
I0328 08:44:28.513418       1 predicates.go:122] Using predicate GeneralPredicates
I0328 08:44:28.513425       1 predicates.go:122] Using predicate PodToleratesNodeTaints
I0328 08:44:28.513432       1 predicates.go:122] Using predicate CheckVolumeBinding
I0328 08:44:28.513439       1 predicates.go:122] Using predicate MaxAzureDiskVolumeCount
I0328 08:44:28.513446       1 predicates.go:122] Using predicate MaxEBSVolumeCount
I0328 08:44:28.513453       1 predicates.go:122] Using predicate NoVolumeZoneConflict
I0328 08:44:28.513460       1 predicates.go:122] Using predicate ready
I0328 08:44:28.513467       1 predicates.go:122] Using predicate CheckNodeUnschedulable
I0328 08:44:28.513474       1 predicates.go:122] Using predicate MatchInterPodAffinity
I0328 08:44:28.513531       1 predicates.go:122] Using predicate MaxCSIVolumeCountPred
I0328 08:44:28.513538       1 predicates.go:122] Using predicate MaxGCEPDVolumeCount
I0328 08:44:28.513545       1 predicates.go:122] Using predicate NoDiskConflict
I0328 08:44:28.513552       1 cloud_provider_builder.go:29] Building aws cloud provider.
I0328 08:44:28.793414       1 auto_scaling_groups.go:124] Registering ASG xx01-xxx-xxxxxxxxxxxxxxxxxxx

The last line there shows that it has found an ASG to use.

Also, the message pod didn't trigger scale-up (it wouldn't fit if a new node is added) could mean that the pod is requesting more resources than a whole node has, i.e. adding a node to the cluster won't help.
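
In that second case the fix is a larger instance type rather than more nodes. A minimal sketch of what that could look like in the worker group config (the m5.xlarge value is an illustrative assumption, not taken from this issue):

  worker_groups = [
    {
      # A t3.medium offers roughly 2 vCPU / 4 GiB; a pod requesting more than
      # that can never fit, so scale up the instance size, not the node count.
      instance_type       = "m5.xlarge"
      asg_max_size        = 50
      autoscaling_enabled = true
    },
  ]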

@pkelleratwork (Author) commented Mar 28, 2019

@max-rocket-internet looks like Registering ASG never happens. I get as far as Building aws cloud provider, but after that I only receive repeated
I0328 14:11:30.578927 1 auto_scaling_groups.go:320] Regenerating instance to ASG map for ASGs: []

I'll keep trying to figure it out - thx

@pkelleratwork (Author) commented Mar 28, 2019

@max-rocket-internet I'm very stuck - out of ideas. The CA is not adding nodes, so pending pods never get scheduled. I noticed that after redeploying the cluster from scratch with autoscaling_enabled = "true", the ASG tag was k8s.io/cluster-autoscaler/disabled. I changed that to k8s.io/cluster-autoscaler/enabled and checked Yes to tag new instances.
Screen Shot 2019-03-28 at 10 05 18 AM

I increased the min number of nodes to 2 and was able to successfully install all apps. I repeated this until the nodes filled up with pods, but the ASG never scaled.

The CA logs show the following:

(⎈ demothis:)➜  projects ✗ kl cluster-autoscaler-aws-cluster-autoscaler-676f48b86-4ltwz -n kube-system
I0328 16:00:32.464407       1 flags.go:52] FLAG: --address=":8085"
I0328 16:00:32.464427       1 flags.go:52] FLAG: --alsologtostderr="false"
I0328 16:00:32.464431       1 flags.go:52] FLAG: --balance-similar-node-groups="false"
I0328 16:00:32.464435       1 flags.go:52] FLAG: --cloud-config=""
I0328 16:00:32.464439       1 flags.go:52] FLAG: --cloud-provider="aws"
I0328 16:00:32.464445       1 flags.go:52] FLAG: --cloud-provider-gce-lb-src-cidrs="130.211.0.0/22,209.85.152.0/22,209.85.204.0/22,35.191.0.0/16"
I0328 16:00:32.464451       1 flags.go:52] FLAG: --cluster-name=""
I0328 16:00:32.464454       1 flags.go:52] FLAG: --cores-total="0:320000"
I0328 16:00:32.464553       1 flags.go:52] FLAG: --estimator="binpacking"
I0328 16:00:32.464560       1 flags.go:52] FLAG: --expander="random"
I0328 16:00:32.464564       1 flags.go:52] FLAG: --expendable-pods-priority-cutoff="-10"
I0328 16:00:32.464569       1 flags.go:52] FLAG: --gke-api-endpoint=""
I0328 16:00:32.464573       1 flags.go:52] FLAG: --gpu-total="[]"
I0328 16:00:32.464577       1 flags.go:52] FLAG: --httptest.serve=""
I0328 16:00:32.464581       1 flags.go:52] FLAG: --ignore-daemonsets-utilization="false"
I0328 16:00:32.464587       1 flags.go:52] FLAG: --ignore-mirror-pods-utilization="false"
I0328 16:00:32.464591       1 flags.go:52] FLAG: --kubeconfig=""
I0328 16:00:32.464595       1 flags.go:52] FLAG: --kubernetes=""
I0328 16:00:32.464607       1 flags.go:52] FLAG: --leader-elect="true"
I0328 16:00:32.464614       1 flags.go:52] FLAG: --leader-elect-lease-duration="15s"
I0328 16:00:32.464620       1 flags.go:52] FLAG: --leader-elect-renew-deadline="10s"
I0328 16:00:32.464625       1 flags.go:52] FLAG: --leader-elect-resource-lock="endpoints"
I0328 16:00:32.464630       1 flags.go:52] FLAG: --leader-elect-retry-period="2s"
I0328 16:00:32.464635       1 flags.go:52] FLAG: --log-backtrace-at=":0"
I0328 16:00:32.464745       1 flags.go:52] FLAG: --log-dir=""
I0328 16:00:32.464753       1 flags.go:52] FLAG: --log-file=""
I0328 16:00:32.464757       1 flags.go:52] FLAG: --logtostderr="true"
I0328 16:00:32.464761       1 flags.go:52] FLAG: --max-autoprovisioned-node-group-count="15"
I0328 16:00:32.464766       1 flags.go:52] FLAG: --max-empty-bulk-delete="10"
I0328 16:00:32.464770       1 flags.go:52] FLAG: --max-failing-time="15m0s"
I0328 16:00:32.464775       1 flags.go:52] FLAG: --max-graceful-termination-sec="600"
I0328 16:00:32.464779       1 flags.go:52] FLAG: --max-inactivity="10m0s"
I0328 16:00:32.464783       1 flags.go:52] FLAG: --max-node-provision-time="15m0s"
I0328 16:00:32.464787       1 flags.go:52] FLAG: --max-nodes-total="0"
I0328 16:00:32.464791       1 flags.go:52] FLAG: --max-total-unready-percentage="45"
I0328 16:00:32.464796       1 flags.go:52] FLAG: --memory-total="0:6400000"
I0328 16:00:32.464800       1 flags.go:52] FLAG: --min-replica-count="0"
I0328 16:00:32.464805       1 flags.go:52] FLAG: --namespace="kube-system"
I0328 16:00:32.464809       1 flags.go:52] FLAG: --new-pod-scale-up-delay="0s"
I0328 16:00:32.464813       1 flags.go:52] FLAG: --node-autoprovisioning-enabled="false"
I0328 16:00:32.464817       1 flags.go:52] FLAG: --node-group-auto-discovery="[asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/demo]"
I0328 16:00:32.464832       1 flags.go:52] FLAG: --nodes="[]"
I0328 16:00:32.464836       1 flags.go:52] FLAG: --ok-total-unready-count="3"
I0328 16:00:32.464850       1 flags.go:52] FLAG: --regional="false"
I0328 16:00:32.464855       1 flags.go:52] FLAG: --scale-down-candidates-pool-min-count="50"
I0328 16:00:32.464859       1 flags.go:52] FLAG: --scale-down-candidates-pool-ratio="0.1"
I0328 16:00:32.464864       1 flags.go:52] FLAG: --scale-down-delay-after-add="10m0s"
I0328 16:00:32.464868       1 flags.go:52] FLAG: --scale-down-delay-after-delete="10s"
I0328 16:00:32.464873       1 flags.go:52] FLAG: --scale-down-delay-after-failure="3m0s"
I0328 16:00:32.464877       1 flags.go:52] FLAG: --scale-down-enabled="true"
I0328 16:00:32.464882       1 flags.go:52] FLAG: --scale-down-non-empty-candidates-count="30"
I0328 16:00:32.464886       1 flags.go:52] FLAG: --scale-down-unneeded-time="10m0s"
I0328 16:00:32.464890       1 flags.go:52] FLAG: --scale-down-unready-time="20m0s"
I0328 16:00:32.464895       1 flags.go:52] FLAG: --scale-down-utilization-threshold="0.5"
I0328 16:00:32.464899       1 flags.go:52] FLAG: --scan-interval="10s"
I0328 16:00:32.464904       1 flags.go:52] FLAG: --skip-headers="false"
I0328 16:00:32.464908       1 flags.go:52] FLAG: --skip-nodes-with-local-storage="true"
I0328 16:00:32.464912       1 flags.go:52] FLAG: --skip-nodes-with-system-pods="true"
I0328 16:00:32.464917       1 flags.go:52] FLAG: --stderrthreshold="0"
I0328 16:00:32.464921       1 flags.go:52] FLAG: --test.bench=""
I0328 16:00:32.464925       1 flags.go:52] FLAG: --test.benchmem="false"
I0328 16:00:32.465021       1 flags.go:52] FLAG: --test.benchtime="1s"
I0328 16:00:32.465025       1 flags.go:52] FLAG: --test.blockprofile=""
I0328 16:00:32.465029       1 flags.go:52] FLAG: --test.blockprofilerate="1"
I0328 16:00:32.465033       1 flags.go:52] FLAG: --test.count="1"
I0328 16:00:32.465037       1 flags.go:52] FLAG: --test.coverprofile=""
I0328 16:00:32.465041       1 flags.go:52] FLAG: --test.cpu=""
I0328 16:00:32.465045       1 flags.go:52] FLAG: --test.cpuprofile=""
I0328 16:00:32.465049       1 flags.go:52] FLAG: --test.failfast="false"
I0328 16:00:32.465055       1 flags.go:52] FLAG: --test.list=""
I0328 16:00:32.465058       1 flags.go:52] FLAG: --test.memprofile=""
I0328 16:00:32.465062       1 flags.go:52] FLAG: --test.memprofilerate="0"
I0328 16:00:32.465074       1 flags.go:52] FLAG: --test.mutexprofile=""
I0328 16:00:32.465078       1 flags.go:52] FLAG: --test.mutexprofilefraction="1"
I0328 16:00:32.465083       1 flags.go:52] FLAG: --test.outputdir=""
I0328 16:00:32.465087       1 flags.go:52] FLAG: --test.parallel="2"
I0328 16:00:32.465091       1 flags.go:52] FLAG: --test.run=""
I0328 16:00:32.465095       1 flags.go:52] FLAG: --test.short="false"
I0328 16:00:32.465099       1 flags.go:52] FLAG: --test.testlogfile=""
I0328 16:00:32.465104       1 flags.go:52] FLAG: --test.timeout="0s"
I0328 16:00:32.465108       1 flags.go:52] FLAG: --test.trace=""
I0328 16:00:32.465173       1 flags.go:52] FLAG: --test.v="false"
I0328 16:00:32.465181       1 flags.go:52] FLAG: --unremovable-node-recheck-timeout="5m0s"
I0328 16:00:32.465186       1 flags.go:52] FLAG: --v="4"
I0328 16:00:32.465190       1 flags.go:52] FLAG: --vmodule=""
I0328 16:00:32.465194       1 flags.go:52] FLAG: --write-status-configmap="true"
I0328 16:00:32.465202       1 main.go:333] Cluster Autoscaler 1.13.1
I0328 16:00:32.492416       1 leaderelection.go:205] attempting to acquire leader lease  kube-system/cluster-autoscaler...
I0328 16:00:32.504813       1 leaderelection.go:289] lock is held by cluster-autoscaler-aws-cluster-autoscaler-676f48b86-nwbcc and has not yet expired
I0328 16:00:32.504841       1 leaderelection.go:210] failed to acquire lease kube-system/cluster-autoscaler
I0328 16:00:35.960675       1 leaderelection.go:289] lock is held by cluster-autoscaler-aws-cluster-autoscaler-676f48b86-nwbcc and has not yet expired
I0328 16:00:35.960698       1 leaderelection.go:210] failed to acquire lease kube-system/cluster-autoscaler
I0328 16:00:40.225708       1 leaderelection.go:289] lock is held by cluster-autoscaler-aws-cluster-autoscaler-676f48b86-nwbcc and has not yet expired
I0328 16:00:40.225734       1 leaderelection.go:210] failed to acquire lease kube-system/cluster-autoscaler
I0328 16:00:43.831262       1 leaderelection.go:289] lock is held by cluster-autoscaler-aws-cluster-autoscaler-676f48b86-nwbcc and has not yet expired
I0328 16:00:43.831284       1 leaderelection.go:210] failed to acquire lease kube-system/cluster-autoscaler
I0328 16:00:46.886418       1 leaderelection.go:289] lock is held by cluster-autoscaler-aws-cluster-autoscaler-676f48b86-nwbcc and has not yet expired
I0328 16:00:46.886441       1 leaderelection.go:210] failed to acquire lease kube-system/cluster-autoscaler
I0328 16:00:49.917517       1 leaderelection.go:214] successfully acquired lease kube-system/cluster-autoscaler
I0328 16:00:49.917811       1 factory.go:33] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"cluster-autoscaler", UID:"883ca038-516c-11e9-8f7a-0eb365398f8c", APIVersion:"v1", ResourceVersion:"8912", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-autoscaler-aws-cluster-autoscaler-676f48b86-4ltwz became leader
I0328 16:00:49.919829       1 reflector.go:131] Starting reflector *v1.Node (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:239
I0328 16:00:49.919853       1 reflector.go:169] Listing and watching *v1.Node from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:239
I0328 16:00:49.919862       1 reflector.go:131] Starting reflector *v1beta1.PodDisruptionBudget (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:266
I0328 16:00:49.919870       1 reflector.go:169] Listing and watching *v1beta1.PodDisruptionBudget from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:266
I0328 16:00:49.919957       1 reflector.go:131] Starting reflector *v1beta1.DaemonSet (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:293
I0328 16:00:49.919964       1 reflector.go:169] Listing and watching *v1beta1.DaemonSet from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:293
I0328 16:00:49.920027       1 reflector.go:131] Starting reflector *v1.Pod (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:174
I0328 16:00:49.920033       1 reflector.go:169] Listing and watching *v1.Pod from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:174
I0328 16:00:49.920123       1 reflector.go:131] Starting reflector *v1.Pod (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:149
I0328 16:00:49.920131       1 reflector.go:169] Listing and watching *v1.Pod from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:149
I0328 16:00:49.920211       1 reflector.go:131] Starting reflector *v1.Node (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:212
I0328 16:00:49.920218       1 reflector.go:169] Listing and watching *v1.Node from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:212
I0328 16:00:49.945978       1 predicates.go:122] Using predicate PodFitsResources
I0328 16:00:49.946002       1 predicates.go:122] Using predicate GeneralPredicates
I0328 16:00:49.946007       1 predicates.go:122] Using predicate PodToleratesNodeTaints
I0328 16:00:49.946011       1 predicates.go:122] Using predicate ready
I0328 16:00:49.946015       1 predicates.go:122] Using predicate MaxEBSVolumeCount
I0328 16:00:49.946037       1 predicates.go:122] Using predicate NoDiskConflict
I0328 16:00:49.946065       1 predicates.go:122] Using predicate NoVolumeZoneConflict
I0328 16:00:49.946082       1 predicates.go:122] Using predicate MatchInterPodAffinity
I0328 16:00:49.946118       1 predicates.go:122] Using predicate MaxAzureDiskVolumeCount
I0328 16:00:49.946157       1 predicates.go:122] Using predicate MaxCSIVolumeCountPred
I0328 16:00:49.946195       1 predicates.go:122] Using predicate MaxGCEPDVolumeCount
I0328 16:00:49.946222       1 predicates.go:122] Using predicate CheckNodeUnschedulable
I0328 16:00:49.946227       1 predicates.go:122] Using predicate CheckVolumeBinding
I0328 16:00:49.946242       1 cloud_provider_builder.go:29] Building aws cloud provider.
I0328 16:00:49.948777       1 reflector.go:131] Starting reflector *v1.PersistentVolumeClaim (0s) from k8s.io/client-go/informers/factory.go:132
I0328 16:00:49.948797       1 reflector.go:169] Listing and watching *v1.PersistentVolumeClaim from k8s.io/client-go/informers/factory.go:132
I0328 16:00:49.949188       1 reflector.go:131] Starting reflector *v1.Service (0s) from k8s.io/client-go/informers/factory.go:132
I0328 16:00:49.949204       1 reflector.go:169] Listing and watching *v1.Service from k8s.io/client-go/informers/factory.go:132
I0328 16:00:49.949815       1 reflector.go:131] Starting reflector *v1.ReplicationController (0s) from k8s.io/client-go/informers/factory.go:132
I0328 16:00:49.949833       1 reflector.go:169] Listing and watching *v1.ReplicationController from k8s.io/client-go/informers/factory.go:132
I0328 16:00:49.950762       1 reflector.go:131] Starting reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:132
I0328 16:00:49.950786       1 reflector.go:169] Listing and watching *v1.Node from k8s.io/client-go/informers/factory.go:132
I0328 16:00:49.951226       1 reflector.go:131] Starting reflector *v1.PersistentVolume (0s) from k8s.io/client-go/informers/factory.go:132
I0328 16:00:49.951241       1 reflector.go:169] Listing and watching *v1.PersistentVolume from k8s.io/client-go/informers/factory.go:132
I0328 16:00:49.951559       1 reflector.go:131] Starting reflector *v1.ReplicaSet (0s) from k8s.io/client-go/informers/factory.go:132
I0328 16:00:49.951579       1 reflector.go:169] Listing and watching *v1.ReplicaSet from k8s.io/client-go/informers/factory.go:132
I0328 16:00:49.952286       1 reflector.go:131] Starting reflector *v1beta1.PodDisruptionBudget (0s) from k8s.io/client-go/informers/factory.go:132
I0328 16:00:49.952303       1 reflector.go:169] Listing and watching *v1beta1.PodDisruptionBudget from k8s.io/client-go/informers/factory.go:132
I0328 16:00:49.957846       1 reflector.go:131] Starting reflector *v1.StatefulSet (0s) from k8s.io/client-go/informers/factory.go:132
I0328 16:00:49.957927       1 reflector.go:169] Listing and watching *v1.StatefulSet from k8s.io/client-go/informers/factory.go:132
I0328 16:00:49.958240       1 reflector.go:131] Starting reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:132
I0328 16:00:49.958335       1 reflector.go:169] Listing and watching *v1.Pod from k8s.io/client-go/informers/factory.go:132
I0328 16:00:49.958620       1 reflector.go:131] Starting reflector *v1.StorageClass (0s) from k8s.io/client-go/informers/factory.go:132
I0328 16:00:49.958694       1 reflector.go:169] Listing and watching *v1.StorageClass from k8s.io/client-go/informers/factory.go:132
I0328 16:00:50.119839       1 request.go:530] Throttling request took 168.496976ms, request: GET:https://10.100.0.1:443/api/v1/persistentvolumes?limit=500&resourceVersion=0
I0328 16:00:50.182236       1 auto_scaling_groups.go:320] Regenerating instance to ASG map for ASGs: []
I0328 16:00:50.182394       1 aws_manager.go:152] Refreshed ASG list, next refresh after 2019-03-28 16:01:00.182387611 +0000 UTC m=+27.740626393
I0328 16:00:50.182812       1 main.go:252] Registered cleanup signal handler
I0328 16:00:50.319811       1 request.go:530] Throttling request took 361.319819ms, request: GET:https://10.100.0.1:443/api/v1/pods?limit=500&resourceVersion=0
I0328 16:01:00.200777       1 static_autoscaler.go:128] Starting main loop
I0328 16:01:00.398826       1 auto_scaling_groups.go:320] Regenerating instance to ASG map for ASGs: []
I0328 16:01:00.398849       1 aws_manager.go:152] Refreshed ASG list, next refresh after 2019-03-28 16:01:10.398846028 +0000 UTC m=+37.957084784
I0328 16:01:00.398948       1 utils.go:526] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0328 16:01:00.398957       1 static_autoscaler.go:261] Filtering out schedulables
I0328 16:01:00.399023       1 static_autoscaler.go:271] No schedulable pods
I0328 16:01:00.399031       1 static_autoscaler.go:279] No unschedulable pods
I0328 16:01:00.399041       1 static_autoscaler.go:333] Calculating unneeded nodes
I0328 16:01:00.399056       1 utils.go:474] Skipping ip-192-168-211-186.ec2.internal - no node group config
I0328 16:01:00.399065       1 utils.go:474] Skipping ip-192-168-74-229.ec2.internal - no node group config
I0328 16:01:00.399157       1 static_autoscaler.go:360] Scale down status: unneededOnly=true lastScaleUpTime=2019-03-28 16:00:50.182644653 +0000 UTC m=+17.740883396 lastScaleDownDeleteTime=2019-03-28 16:00:50.182644743 +0000 UTC m=+17.740883492 lastScaleDownFailTime=2019-03-28 16:00:50.182644847 +0000 UTC m=+17.740883587 scaleDownForbidden=false isDeleteInProgress=false
I0328 16:11:01.100587       1 static_autoscaler.go:370] Starting scale down
I0328 16:11:01.100612       1 scale_down.go:659] No candidates for scale down
I0328 16:11:01.100745       1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"oak-api-664d8fcb9f-cpm6x", UID:"f543956c-5173-11e9-b66f-027f9ed717b8", APIVersion:"v1", ResourceVersion:"10117", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added):
I0328 16:11:01.100767       1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"oak-api-664d8fcb9f-77llx", UID:"f54463dc-5173-11e9-b66f-027f9ed717b8", APIVersion:"v1", ResourceVersion:"10122", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added):
I0328 16:11:11.112297       1 static_autoscaler.go:128] Starting main loop
I0328 16:11:11.273326       1 auto_scaling_groups.go:320] Regenerating instance to ASG map for ASGs: []
I0328 16:11:11.273350       1 aws_manager.go:152] Refreshed ASG list, next refresh after 2019-03-28 16:11:21.273346679 +0000 UTC m=+648.831585439
I0328 16:11:11.273444       1 utils.go:526] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0328 16:11:11.273485       1 static_autoscaler.go:261] Filtering out schedulables
I0328 16:11:11.273870       1 static_autoscaler.go:271] No schedulable pods
I0328 16:11:11.273940       1 scale_up.go:262] Pod default/oak-api-664d8fcb9f-cpm6x is unschedulable
I0328 16:11:11.274176       1 scale_up.go:262] Pod default/oak-api-664d8fcb9f-77llx is unschedulable
I0328 16:11:11.274261       1 scale_up.go:304] Upcoming 0 nodes
I0328 16:11:11.274297       1 scale_up.go:420] No expansion options
I0328 16:11:11.274357       1 static_autoscaler.go:333] Calculating unneeded nodes
I0328 16:11:11.274379       1 utils.go:474] Skipping ip-192-168-211-186.ec2.internal - no node group config
I0328 16:11:11.274415       1 utils.go:474] Skipping ip-192-168-74-229.ec2.internal - no node group config
I0328 16:11:11.274539       1 static_autoscaler.go:360] Scale down status: unneededOnly=false lastScaleUpTime=2019-03-28 16:00:50.182644653 +0000 UTC m=+17.740883396 lastScaleDownDeleteTime=2019-03-28 16:00:50.182644743 +0000 UTC m=+17.740883492 lastScaleDownFailTime=2019-03-28 16:00:50.182644847 +0000 UTC m=+17.740883587 scaleDownForbidden=false isDeleteInProgress=false
I0328 16:11:11.274577       1 static_autoscaler.go:370] Starting scale down
I0328 16:11:11.274617       1 scale_down.go:659] No candidates for scale down
I0328 16:11:11.275259       1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"oak-api-664d8fcb9f-77llx", UID:"f54463dc-5173-11e9-b66f-027f9ed717b8", APIVersion:"v1", ResourceVersion:"10122", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added):
I0328 16:11:11.275319       1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"oak-api-664d8fcb9f-cpm6x", UID:"f543956c-5173-11e9-b66f-027f9ed717b8", APIVersion:"v1", ResourceVersion:"10117", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added):
(⎈ demothis:)➜  projects ✗

@pkelleratwork (Author)

GOT IT! I think I found a bug!! The tag is created wrong in the ASG:

  1. I ran my cluster module (above)
  2. I manually changed the ASG tag from k8s.io/cluster-autoscaler/disabled to k8s.io/cluster-autoscaler/enabled and checked Yes to tag new instances.
    Screen Shot 2019-03-28 at 2 56 42 PM
  3. ran the cluster-autoscaler Helm install
  4. verified in the logs that the CA Registered the ASG!!!

@max-rocket-internet (Contributor)

OK cool, but how did your tag get set like that? I am using module version v2.3.0 and this is the worker group config:

  worker_groups = [
    {
      instance_type       = "m4.xlarge"
      asg_max_size        = 40
      autoscaling_enabled = true
      additional_userdata = "${xxxxx}"
      kubelet_extra_args  = "--node-labels=xxx=xxxx"
    },
  ]

And the tag on the ASG is k8s.io/cluster-autoscaler/enabled=true

@pkelleratwork (Author)

I'm using terraform-aws-modules/eks/aws v2.2.1 - didn't realize there was a newer version. I'll try 2.3.1 now.

@pkelleratwork (Author) commented Apr 3, 2019

v2.3.1 is not working either. Here is the module I ran, the plan output, and an AWS console screenshot.

module

module "create-cluster" {
  source  = "terraform-aws-modules/eks/aws"
  version = "2.3.1"

  cluster_name              = "demothis"
  cluster_version           = "1.11"
  kubeconfig_name           = "demothis"
  manage_aws_auth           = "false"
  subnets                   = "[x,yz]"
  vpc_id                    = "vpc-123"

  # worker node configurations
  worker_groups = [
    {
      asg_desired_capacity  = "2"
      asg_max_size          = "50"
      autoscaling_enabled   = "true"
      instance_type         = "t3.medium"
    }
  ]

  # tags to add to all resources
  tags = {
    cluster                 = "demothis"
    environment             = "dev"
  }
}

auto-scaling group plan output - tags.2.key is wrong

module.demothis-cluster.module.create-cluster.aws_autoscaling_group.workers: Creating...
  arn:                            "" => "<computed>"
  default_cooldown:               "" => "<computed>"
  desired_capacity:               "" => "2"
  force_delete:                   "" => "false"
  health_check_grace_period:      "" => "300"
  health_check_type:              "" => "<computed>"
  launch_configuration:           "" => "demothis-02019040319275202730000000d"
  load_balancers.#:               "" => "<computed>"
  max_size:                       "" => "50"
  metrics_granularity:            "" => "1Minute"
  min_size:                       "" => "1"
  name:                           "" => "<computed>"
  name_prefix:                    "" => "demothis-0"
  protect_from_scale_in:          "" => "false"
  service_linked_role_arn:        "" => "<computed>"
  tags.#:                         "" => "7"
  tags.0.%:                       "" => "3"
  tags.0.key:                     "" => "Name"
  tags.0.propagate_at_launch:     "" => "1"
  tags.0.value:                   "" => "demothis-0-eks_asg"
  tags.1.%:                       "" => "3"
  tags.1.key:                     "" => "kubernetes.io/cluster/demothis"
  tags.1.propagate_at_launch:     "" => "1"
  tags.1.value:                   "" => "owned"
  tags.2.%:                       "" => "3"
  tags.2.key:                     "" => "k8s.io/cluster-autoscaler/disabled"
  tags.2.propagate_at_launch:     "" => "0"
  tags.2.value:                   "" => "true"
  tags.3.%:                       "" => "3"
  tags.3.key:                     "" => "k8s.io/cluster-autoscaler/demothis"
  tags.3.propagate_at_launch:     "" => "0"
  tags.3.value:                   "" => ""
  tags.4.%:                       "" => "3"
  tags.4.key:                     "" => "k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage"
  tags.4.propagate_at_launch:     "" => "0"
  tags.4.value:                   "" => "100Gi"
  tags.5.%:                       "" => "3"
  tags.5.key:                     "" => "cluster"
  tags.5.propagate_at_launch:     "" => "1"
  tags.5.value:                   "" => "demothis"
  tags.6.%:                       "" => "3"
  tags.6.key:                     "" => "environment"
  tags.6.propagate_at_launch:     "" => "1"
  tags.6.value:                   "" => "demo"
  vpc_zone_identifier.#:          "" => "3"
  vpc_zone_identifier.1666119713: "" => "subnet-x"
  vpc_zone_identifier.3712104169: "" => "subnet-y"
  vpc_zone_identifier.3798380006: "" => "subnet-z"
  wait_for_capacity_timeout:      "" => "10m"
module.demothis-cluster.create-cluster.aws_autoscaling_group.workers: Still creating... (10s elapsed)
module.demothis-cluster.create-cluster.aws_autoscaling_group.workers: Still creating... (20s elapsed)
module.demothis-cluster.create-cluster.aws_autoscaling_group.workers: Still creating... (30s elapsed)
module.demothis-cluster.create-cluster.aws_autoscaling_group.workers: Still creating... (40s elapsed)
module.demothis-cluster.module.create-cluster.aws_autoscaling_group.workers: Creation complete after 50s (ID: demothis-02019040319280139330000000e)

AWS console showing the tag as disabled
Screen Shot 2019-04-03 at 2 37 56 PM

@dpiddockcmp (Contributor)

You are passing the string "true"; try passing the boolean value true. The problem is in the interpolation the module uses, which checks against an integer:
${lookup(var.worker_groups[count.index], "autoscaling_enabled", local.workers_group_defaults["autoscaling_enabled"]) == 1 ? "enabled" : "disabled" }
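
As far as I can tell, Terraform 0.11 interpolates a bare boolean as "1", so it satisfies the == 1 comparison and yields "enabled", while the literal string "true" does not match and falls through to "disabled". A minimal sketch of the corrected worker group entry (values copied from the module call above, only the boolean changed):

  worker_groups = [
    {
      asg_desired_capacity = "2"
      asg_max_size         = "50"
      autoscaling_enabled  = true        # boolean, not the string "true"
      instance_type        = "t3.medium"
    }
  ]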

@pkelleratwork (Author)

Siiiigh... ugh.

Thanks @dpiddockcmp - that was my problem. I failed to catch that. I went back to the documentation and it clearly states that.

I really appreciate all your help.

@max-rocket-internet (Contributor)

We've all made these mistakes. Glad you got it sorted 🙂

@github-actions bot commented Dec 1, 2022

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 1, 2022