V2 Monitoring Fails to upgrade when crd is in a failed state #35744

Closed
Auston-Ivison-Suse opened this issue Dec 3, 2021 · 5 comments
Labels: area/monitoring, kind/bug-qa, release-note

Auston-Ivison-Suse commented Dec 3, 2021

Rancher Server Setup

  • Rancher version: v2.6-head(46eb9d4)
  • Installation option (Docker install/Helm Chart):Docker
  • Proxy/Cert Details: self-signed

Information about the Cluster

  • Kubernetes version:v1.21.6-rancher1-1
  • Cluster Type (Local/Downstream):Downstream aws/ec2
  • 4 worker nodes, 1 etcd nodes, 1 control plane node
    • Each worker node should have a taint added to it during the configuration steps. For example: qa=test:NoSchedule
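
For anyone reproducing this setup, a taint like the one above can be applied from the command line. This is a minimal sketch, not part of the original report: `taint_nodes` is a hypothetical helper, the node names are placeholders, and kubectl expects the effect spelled `NoSchedule`.

```shell
# Hypothetical helper: apply the qa=test:NoSchedule taint to every node
# named on the command line. --overwrite lets the command be re-run
# against a node that already carries the taint.
taint_nodes() {
  for node in "$@"; do
    kubectl taint nodes "$node" qa=test:NoSchedule --overwrite
  done
}

# Example (node names are placeholders):
#   taint_nodes worker-1 worker-2 worker-3 worker-4
```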

Describe the bug
After installing v2 Monitoring (100.1.0+up19.0.3) onto a cluster with taints on every node, we have to add tolerations (matching the node taints) to "rancher-monitoring-crd" by editing its YAML while it is in a failed state.

To Reproduce

  1. Go to Apps and Marketplace and install Monitoring at the version mentioned above.
  2. While the install runs, check on the "Installed Apps" page whether the pods are being deployed and whether the "rancher-monitoring-crd" app is working.
  3. Once it has failed, use the kebab menu of the CRD app to select "Edit/Upgrade".
  4. On the yaml page add tolerations equivalent to the previously created taints. For example:
tolerations:
- key: "qa"
  operator: "Equal"
  value: "test"
  effect: "NoSchedule"
  
  5. Click Save and return to the Installed Apps page.
  6. View the YAML of the CRD app from the kebab menu and notice that the tolerations added earlier are not listed.

Result

The rancher-monitoring-crd app remains in a failed state. The workaround is to install the application with a helm install.
The following error appears when attempting to upgrade the already installed monitoring app:

Fri, Dec 3 2021 4:32:50 pm | helm upgrade --install=true --namespace=cattle-monitoring-system --timeout=10m0s --values=/home/shell/helm/values-rancher-monitoring-crd-100.1.0-up19.0.3.yaml --version=100.1.0+up19.0.3 --wait=true rancher-monitoring-crd /home/shell/helm/rancher-monitoring-crd-100.1.0-up19.0.3.tgz
Fri, Dec 3 2021 4:32:50 pm | Error: UPGRADE FAILED: "rancher-monitoring-crd" has no deployed releases

Expected Result

The application should upgrade successfully with the tolerations applied.

Additional context
This is a blocker of this issue: #34439

Auston-Ivison-Suse added the kind/bug-qa, area/monitoring, and status/release-blocker labels Dec 3, 2021
Auston-Ivison-Suse added this to the v2.6.3 milestone Dec 3, 2021
jiaqiluo (Member) commented Dec 6, 2021

In my test, I could not reproduce the error mentioned in the issue. The error "rancher-monitoring-crd" has no deployed releases indicates that the installation process did not start properly. We can use the helm list -A command to check whether there is a release. Here is the output from my setup, which shows the app rancher-monitoring-crd in a failed state.

> helm list -A
NAME                  	NAMESPACE               	REVISION	UPDATED                                	STATUS  	CHART                                                                                     	APP VERSION
fleet-agent-c-xp2px   	cattle-fleet-system     	2       	2021-12-06 16:54:29.056504086 +0000 UTC	deployed	fleet-agent-c-xp2px-v0.0.0+s-ab7adbbc8a2a4ffc2b8734b929cbae38d95ef9ce7e271dca92b4334d7c458
rancher-monitoring-crd	cattle-monitoring-system	3       	2021-12-06 20:15:15.286786805 +0000 UTC	failed  	rancher-monitoring-crd-100.1.0+up19.0.3
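
To pull just the failed releases out of output like the above, something along these lines should work. It is a sketch only: `failed_releases` is a hypothetical helper, and it assumes the tab-separated table that `helm list` prints by default, with STATUS in the fifth column.

```shell
# Hypothetical helper: read `helm list -A` table output on stdin and print
# "namespace/name" for every release whose STATUS column is "failed".
failed_releases() {
  awk -F'\t' '
    # Strip the space padding helm adds around each column value.
    function trim(s) { gsub(/^ +| +$/, "", s); return s }
    # Skip the header row (NR == 1); column 5 is STATUS.
    NR > 1 && trim($5) == "failed" { print trim($2) "/" trim($1) }
  '
}

# Example:
#   helm list -A | failed_releases
```

`helm list` also accepts a `--failed` flag that filters on the Helm side; the helper above is mainly useful when working from saved output.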

Now if I edit the app in the Rancher UI to set the toleration, an upgrade is triggered and the app installs successfully.

[Screenshot: rancher-monitoring-crd app page in the Rancher dashboard, captured 2021-12-06]

However, I found two other issues that block us from installing the monitoring chart from the UI:

  1. Even if we add the toleration to the crd chart and get it installed, the rancher-monitoring app will not be installed. In other words, the installation process stops at the first failure on the crd chart.
  2. Every time we edit the rancher-monitoring app in the UI, it triggers an upgrade of the crd chart. Since the UI does not include the toleration in the values it passes to Helm, the upgrade always fails and we have to edit the app manually, which brings us back to issue 1.

Because of these two issues, although we have added support for tolerations in the crd chart, the missing UI component means we cannot install rancher-monitoring via the UI on a cluster whose nodes carry custom taints.

As always, we have a workaround for this issue: instead of using Rancher UI, use Helm to install the rancher-monitoring chart into the cluster directly.
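
A rough sketch of that workaround for the crd chart, assuming Helm 3, a local copy of the chart, and no existing release named rancher-monitoring-crd in the namespace. Both function names and the file paths are hypothetical; the toleration matches the qa=test:NoSchedule taint from this issue.

```shell
# Hypothetical helper: write a values override ($1) holding the toleration
# for the qa=test:NoSchedule taint used in this issue.
write_tolerations_values() {
  cat > "$1" <<'EOF'
tolerations:
- key: "qa"
  operator: "Equal"
  value: "test"
  effect: "NoSchedule"
EOF
}

# Hypothetical helper: install the crd chart from a local chart directory
# or .tgz ($1), layering the override file ($2) on top of the chart defaults.
install_crd_chart() {
  helm install rancher-monitoring-crd "$1" \
    --namespace cattle-monitoring-system --create-namespace \
    --values "$2"
}

# Example (paths are placeholders):
#   write_tolerations_values tolerations.yaml
#   install_crd_chart ./rancher-monitoring-crd tolerations.yaml
```

Keeping the tolerations in a separate override file leaves the chart's own values.yaml untouched, which may be easier to repeat across clusters.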

Note to QA:
Can you try again to see if you can get the crd app installed successfully? The tricky part is that, before you edit the app, you need to wait for it to reach the failed state in the Rancher UI (about 10 minutes, until the operation times out) and to show up in the output of the helm list -A command.
Can you also check whether the crd app installs successfully via the helm install command directly?
At this moment, what we can do is confirm that the fix in the chart works. The UI side is tracked in a separate UI issue, and we will keep working on it there.

Update 1:
The same issue is reproduced in Rancher 2.6.2, which means that even in 2.6.2 we cannot get monitoring installed properly via the monitoring custom UI in the dashboard.
As a result, to close this issue, QA needs to make sure the monitoring-crd app can be installed successfully via the command line.

zube bot removed the [zube]: Working label Dec 6, 2021
jiaqiluo added the release-note label Dec 7, 2021
jiaqiluo (Member) commented Dec 7, 2021

We need to add the following to the release note:
To set nodeSelector or tolerations on the rancher-monitoring-crd chart, you need to install the rancher-monitoring-crd and rancher-monitoring charts with Helm from the command line. Support in the Rancher UI will be added soon.

gaktive (Member) commented Dec 8, 2021

@jiaqiluo let me know if rancher/dashboard#4737 covers the UI need correctly or if the description needs updating

Auston-Ivison-Suse (Author) commented Dec 10, 2021

Rancher Setup Reproduction

  • Rancher version: v2.6-head(46eb9d4)
  • Installation option (Docker install/Helm Chart):Docker
  • Proxy/Cert Details: self-signed

Information about the Cluster

  • Kubernetes version:v1.21.6-rancher1-1
  • Cluster Type (Local/Downstream):Downstream aws/ec2
  • 4 worker nodes, 1 etcd nodes, 1 control plane node
    • Each worker node should have a taint added to it during the configuration steps. For example: qa=test:NoSchedule

To Reproduce

  1. Go to Apps and Marketplace and install Monitoring at the version mentioned above.
  2. While the install runs, check on the "Installed Apps" page whether the pods are being deployed and whether the "rancher-monitoring-crd" app is working.
  3. Once it has failed, use the kebab menu of the CRD app to select "Edit/Upgrade".
  4. On the yaml page add tolerations equivalent to the previously created taints. For example:
tolerations:
- key: "qa"
  operator: "Equal"
  value: "test"
  effect: "NoSchedule"
  
  5. Click Save and return to the Installed Apps page.
  6. View the YAML of the CRD app from the kebab menu and notice that the tolerations added earlier are not listed.

Setup For Validation

  • Rancher version: v2.6-head(0d14421)
  • Installation option (Docker install/Helm Chart):Docker
  • Proxy/Cert Details: self-signed

Information about the Cluster

  • Kubernetes version:v1.22.4
  • Cluster Type (Local/Downstream):Downstream aws/ec2
  • 4 worker nodes, 1 etcd nodes, 1 control plane node
    • Each worker node should have a taint added to it during the configuration steps. For example: qa=test:NoSchedule

Steps For Validation

Test case 1:

  1. In the downstream cluster with the configuration above, try to install monitoring v100.1.0+up19.0.3.
  2. Let rancher-monitoring-crd reach a failed state by waiting approximately 10 minutes.
  3. Once the app is in a failed state, edit the YAML to add the appropriate tolerations; the app should then install. Tolerations are listed below:
tolerations:
- key: "qa"
  operator: "Equal"
  value: "test"
  effect: "NoSchedule"
  

Test case 2:

  1. Download the kubeconfig files for the local cluster and the downstream cluster described above.
  2. Download the charts-dev-v2.6 repository from the rancher repository.
  3. In your file explorer, go to charts-dev-v2.6-->charts-->rancher-monitoring-crd-->100.1.0+up19.0.3
  4. In this directory, edit "values.yaml" to add the tolerations. Example below:
tolerations:
- key: "qa"
  operator: "Equal"
  value: "test"
  effect: "NoSchedule"
  5. Now copy the kubeconfig for your downstream cluster into the same directory: charts-dev-v2.6-->charts-->rancher-monitoring-crd-->100.1.0+up19.0.3/<downstreamcluster.yaml>
  6. Now run the command:

export KUBECONFIG=<localcluster.yaml>
  7. With that set, run the following to install rancher-monitoring-crd via a helm install. Example below:

helm install rancher-monitoring-crd . -n cattle-monitoring-system --create-namespace --kubeconfig <downstreamcluster.yaml>

Results

The rancher-monitoring-crd chart installs successfully following either test case.

nbarnes22 commented

Confirmed that in Rancher UI v2.6.3 removing the old version (100.1.0+up19.0.3) and installing the latest (v100.1.2+up19.0.3) from the UI succeeded.

zube bot removed the [zube]: Done label Apr 29, 2022