Enabling monitoring didn't complete as Jobs get stuck when all nodes have custom taints #27253

Closed
ansilh opened this issue May 26, 2020 · 2 comments
Labels: area/server-chart, internal, kind/bug

Comments


ansilh commented May 26, 2020

What kind of request is this (question/bug/enhancement/feature request): Bug

Steps to reproduce (least amount of steps as possible):

  • Set a custom taint on all nodes in the cluster (e.g. zone=private); see the sketch after this list
  • Enable monitoring on the cluster from Rancher UI
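
For reference, a minimal sketch of what such a taint looks like on each node (the zone=private key/value comes from the example above; the NoSchedule effect is an assumption):

# Excerpt of a node spec carrying the custom taint (effect assumed for illustration)
spec:
  taints:
  - key: zone
    value: private
    effect: NoSchedule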

Result:

The Jobs below never get scheduled.

operator-init-cluster-monitoring
operator-init-monitoring-operator

Other details that may be helpful:

Environment information

  • Rancher version : 2.4.3
  • Installation option : HA
  • System Chart branch: release-v2.4

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Custom
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): VM (4 vCPU / 2 GB mem) | 2 control plane | 3 etcd | 3 worker
  • Kubernetes version: v1.17.5
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.8", GitCommit:"ec6eb119b81be488b030e849b9e64fda4caaf33c", GitTreeState:"clean", BuildDate:"2020-03-12T21:00:06Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.5", GitCommit:"e0fccafd69541e3750d460ba0f9743b90336f24f", GitTreeState:"clean", BuildDate:"2020-04-16T11:35:47Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version (use docker version):
Client: Docker Engine - Community
 Version:           19.03.9
 API version:       1.40
 Go version:        go1.13.10
 Git commit:        9d988398e7
 Built:             Fri May 15 00:25:27 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.2
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.8
  Git commit:       6a30dfc
  Built:            Thu Aug 29 05:27:34 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

The changes below may fix the issue.

Change (1)
Modify the job-install-crds.yaml template to include custom tolerations.

system-charts/charts/rancher-monitoring/v0.1.0/charts/operator-init/templates/job-install-crds.yaml

      tolerations:
{{- include "linux-node-tolerations" . | nindent 8 }}
{{- if .Values.tolerations }}
{{ toYaml .Values.tolerations | indent 8 }}
{{- end }}
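
For illustration, if a matching toleration is passed in via .Values.tolerations (the values here are examples for the zone=private taint above, not chart defaults), the rendered Job pod spec would end up with something like:

      tolerations:
      # ...tolerations rendered by the "linux-node-tolerations" include...
      - key: zone            # example custom toleration supplied via .Values.tolerations
        operator: Equal
        value: private
        effect: NoSchedule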

Change (2)
Adding answers from the advanced options while enabling monitoring fixes operator-init-cluster-monitoring, but operator-init-monitoring-operator still does not get scheduled.
I suspect the code below does not consider operator-init.* keys and values, so it may need a change along with the chart template modification.

pkg/controllers/user/monitoring/operatorHandler.go

        // take operator answers from overwrite answers
        answers, version := monitoring.GetOverwroteAppAnswersAndVersion(cluster.Annotations)
        for ansKey, ansVal := range answers {
                if strings.HasPrefix(ansKey, "operator.") { // <-- strings.HasPrefix(ansKey, "operator.") || strings.HasPrefix(ansKey, "operator-init.") ?
                        appAnswers[ansKey] = ansVal
                }
        }

gzrancher/rancher#10292

@aiyengar2 (Contributor) commented:

Thanks @ansilh for opening the ticket and suggesting the fixes!

Note for QA:

Once the backend changes have been merged in, we should check whether clusters in which every node has 1 or more taints are able to deploy Monitoring V1 by providing the following under Advanced Options:

operator.tolerations[0].operator=Exists
operator-init.tolerations[0].operator=Exists
prometheus.tolerations[0].operator=Exists
grafana.tolerations[0].operator=Exists
exporter-kube-state.tolerations[0].operator=Exists
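
Each of these answer keys should translate into a blanket toleration on the corresponding workload's pod spec, roughly like this rendered sketch:

tolerations:
- operator: Exists   # with no key specified, this tolerates any taint on the node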

The UI issue to expose these fields in a user-friendly way is tracked in #29479.

@jiaqiluo (Member) commented:

This bug fix is validated on rancher:v2.4-3950-head and rancher:v2.5-c32b5a14a8bc90a932c67735c3c42c44e48183e7-head single installs, with monitoring v1 0.1.4.

Steps:

  • provision 3 clusters with k8s versions v1.17, v1.18, and v1.19
  • add taints to all nodes

[screenshot: taints applied to all nodes]

  • set the branch of system-library to dev-v2.4

[screenshot: system-library branch set to dev-v2.4]

  • enable the cluster monitoring v0.1.4
  • wait and see that the pods fail to be assigned to any node

[screenshot: unschedulable monitoring pods]

  • disable cluster monitoring and re-enable it with the following additional answers

[screenshot: additional toleration answers]

  • wait and see that all workloads are deployed and running, and monitoring works as expected

The same tests are done for project monitoring.
