Enabling monitoring didn't complete as Jobs get stuck when all nodes have custom taints #27253

Closed
ansilh opened this issue May 26, 2020 · 2 comments
Labels: area/server-chart, internal, kind/bug

Comments


ansilh commented May 26, 2020

What kind of request is this (question/bug/enhancement/feature request): Bug

Steps to reproduce (least amount of steps as possible):

  • Set a custom taint on all nodes in the cluster (e.g. zone=private); see the sketch after this list
  • Enable monitoring on the cluster from Rancher UI
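
For reference, a minimal sketch of what such a taint looks like on each node (the zone=private key/value comes from the example above; the NoSchedule effect is an assumption):

# Excerpt of a node spec carrying the custom taint (effect assumed for illustration)
spec:
  taints:
  - key: zone
    value: private
    effect: NoSchedule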

Result:

The Jobs below never get scheduled.

operator-init-cluster-monitoring
operator-init-monitoring-operator

Other details that may be helpful:

Environment information

  • Rancher version : 2.4.3
  • Installation option : HA
  • System Chart branch: release-v2.4

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Custom
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): VM (4 vCPU / 2 GB mem) | 2 control plane | 3 etcd | 3 worker
  • Kubernetes version: v1.17.5
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.8", GitCommit:"ec6eb119b81be488b030e849b9e64fda4caaf33c", GitTreeState:"clean", BuildDate:"2020-03-12T21:00:06Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.5", GitCommit:"e0fccafd69541e3750d460ba0f9743b90336f24f", GitTreeState:"clean", BuildDate:"2020-04-16T11:35:47Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version (use docker version):
Client: Docker Engine - Community
 Version:           19.03.9
 API version:       1.40
 Go version:        go1.13.10
 Git commit:        9d988398e7
 Built:             Fri May 15 00:25:27 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.2
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.8
  Git commit:       6a30dfc
  Built:            Thu Aug 29 05:27:34 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

The changes below may fix the issue.

Change (1)
Modify the job-install-crds.yaml template to include custom tolerations.

system-charts/charts/rancher-monitoring/v0.1.0/charts/operator-init/templates/job-install-crds.yaml

      tolerations:
{{- include "linux-node-tolerations" . | nindent 8 }}
{{- if .Values.tolerations }}
{{ toYaml .Values.tolerations | indent 8 }}
{{- end }}
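
For illustration, if a matching toleration is passed in via .Values.tolerations (the values here are examples for the zone=private taint above, not chart defaults), the rendered Job pod spec would end up with something like:

      tolerations:
      # ...tolerations rendered by the "linux-node-tolerations" include...
      - key: zone            # example custom toleration supplied via .Values.tolerations
        operator: Equal
        value: private
        effect: NoSchedule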

Change (2)
Adding answers from the advanced options while enabling monitoring fixes operator-init-cluster-monitoring, but operator-init-monitoring-operator still does not get scheduled.
I suspect the code below does not consider operator-init.* keys and values, so it may need a change along with the chart template modification.

pkg/controllers/user/monitoring/operatorHandler.go

        // take operator answers from overwrite answers
        answers, version := monitoring.GetOverwroteAppAnswersAndVersion(cluster.Annotations)
        for ansKey, ansVal := range answers {
                if strings.HasPrefix(ansKey, "operator.") { // <-- strings.HasPrefix(ansKey, "operator.") || strings.HasPrefix(ansKey, "operator-init.") ?
                        appAnswers[ansKey] = ansVal
                }
        }

gzrancher/rancher#10292

@aiyengar2 (Contributor) commented:

Thanks @ansilh for opening the ticket and suggesting the fixes!

Note for QA:

Once the backend changes have been merged in, we should check whether clusters in which every node has 1 or more taints are able to deploy Monitoring V1 by providing the following under Advanced Options:

operator.tolerations[0].operator=Exists
operator-init.tolerations[0].operator=Exists
prometheus.tolerations[0].operator=Exists
grafana.tolerations[0].operator=Exists
exporter-kube-state.tolerations[0].operator=Exists
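
Each of these answer keys should translate into a blanket toleration on the corresponding workload's pod spec, roughly like this rendered sketch:

tolerations:
- operator: Exists   # with no key specified, this tolerates any taint on the node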

The UI issue to expose these fields in a user-friendly way is tracked in #29479.

@jiaqiluo (Member) commented:

This bug fix is validated on rancher:v2.4-3950-head and rancher:v2.5-c32b5a14a8bc90a932c67735c3c42c44e48183e7-head single installs, with monitoring v1 0.1.4.

Steps:

  • provision 3 clusters with k8s versions v1.17, v1.18, and v1.19
  • add taints to all nodes

[screenshot: taints applied to all nodes]

  • set the branch of system-library to dev-v2.4

[screenshot: system-library branch set to dev-v2.4]

  • enable the cluster monitoring v0.1.4
  • wait and see that the pods fail to be assigned to any node

[screenshot: unschedulable monitoring pods]

  • disable cluster monitoring and re-enable it with the following additional answers

[screenshot: additional toleration answers]

  • wait and see that all workloads are deployed and running, and monitoring works as expected

The same tests are done for project monitoring.
