
windows-exporter pods keep restarting until wins-upgrader is fully deployed on new 2.5.7 win node #31842

Closed
thehejik opened this issue Mar 30, 2021 · 3 comments
Assignees
Labels
area/windows kind/bug-qa Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement kind/enhancement Issues that improve or augment existing functionality QA/XS
Milestone

Comments

@thehejik

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (least amount of steps as possible):

  • Use rancher 2.5.7 (with v1 monitoring disabled)
  • Deploy a windows cluster - 3 linux nodes - 1 etcd, 1 control and 1 worker nodes, 3 windows nodes (eg. windows core 1909 ami-008ec03d9035c8ee7)
  • In cluster explorer point the feature charts to dev-v2.5 branch
  • Deploy rancher-wins-upgrader:0.0.100-rc00 with the following override:
masquerade:
  enabled: true
  as: c:\etc\wmi-exporter\wmi-exporter.exe
  • When the upgrade of wins has finished, deploy Monitoring v2 rancher-monitoring*:9.4.204-rc07 on the cluster into System Project:
    • select cluster.provider.rke from Cluster Type dropdown in General tab
    • enable the windows exporter via Edit as Yaml -> override windowsExporter.enabled=true
  • Wait until all pods from all DaemonSets in cattle-monitoring-system namespace are ready
  • add one or more windows worker nodes into the cluster and observe pods from rancher-monitoring-windows-exporter DS on the new node.
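For reference, the Edit-as-Yaml override from the steps above boils down to this values fragment (a sketch; the key name is taken from the step above):

windowsExporter:
  enabled: true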

Result:
The rancher-monitoring-windows-exporter* pods keep restarting, probably until wins-upgrader is fully installed on the new node(s). I added 3 new nodes and each of them showed the same symptoms:

[screenshot: win-exporter]

After a while all exporter pods are running and everything works as expected.

Other details that may be helpful:

  • logs from rancher-monitoring-windows-exporter-* pod and exporter-node container from rancher-monitoring-windows-exporter DS (new node):
time="2021-03-30T12:26:28Z" level=info msg="Starting windows_exporter (version=0.15.0, branch=master, revision=cdbb27d0b4ea9810dc35035fad281fe6468b7dd1)" source="exporter.go:412"
time="2021-03-30T12:26:28Z" level=info msg="Build context (go=go1.15.3, user=appveyor-vm\\appveyor@appveyor-vm, date=20201107-08:23:37)" source="exporter.go:413"
time="2021-03-30T12:26:28Z" level=info msg="Starting server on :9796" source="exporter.go:416"
  • wins-upgrader-default* pod and noop container from wins-upgrader-default DS started at 12:25:57 - taken from webui (new node)

This is most likely caused by a race condition between the wins-upgrader and windows-exporter deployments, at least according to the timestamps: in my case the exporter attempted to start before wins-upgrader was ready. I have no evidence beyond the timestamps, but if the default pod restart period is ~30s, it would make sense.

Environment information

  • Rancher version (rancher/rancher/rancher/server image tag or shown bottom left in the UI): 2.5.7
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): AWS
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): AWS VM t3.xlarge
  • Kubernetes version (use kubectl version): v1.20.4
@thehejik thehejik added area/windows kind/bug-qa Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement labels Mar 30, 2021
@sowmyav27 sowmyav27 added this to the v2.5.8 milestone Mar 30, 2021
@aiyengar2
Contributor

@thehejik if the exporter pods eventually start running, I think the bug described here is the expected behavior as of today.

However, I do agree that having Pods restart several times when there is no actual error seems like a bug; we should avoid pod restarts when possible.

To resolve this, I'll add an initContainer called check-wins-version to the windows-exporter deployment that runs a PowerShell script on a 5s interval to check whether wins has been updated on the host to at least v0.1.0. This would prevent container restarts, as the Pod will instead stay in Pending until the initContainer completes.
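A minimal sketch of what such an initContainer could look like on the windows-exporter pod spec (the image, the wins.exe path, and the version-check command below are all placeholders for illustration, not the actual implementation):

initContainers:
  - name: check-wins-version
    # Placeholder image; any Windows image with PowerShell would do.
    image: mcr.microsoft.com/windows/servercore:1909
    command: ["powershell", "-Command"]
    args:
      - >-
        while ($true) {
          # Placeholder path: read the file version of wins.exe on the host.
          $v = (Get-Item 'c:\Windows\wins.exe' -ErrorAction SilentlyContinue).VersionInfo.ProductVersion;
          if ($v -and ([version]$v -ge [version]'0.1.0')) { exit 0 };
          Start-Sleep -Seconds 5
        }

With an initContainer like this, a failing precondition shows up as the Pod waiting in Init rather than as repeated container restarts.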

@aiyengar2
Contributor

Note: the fix for this issue will not prevent windows-exporter from failing entirely, but rather mitigates the repetitive failures caused by a wins version older than v0.1.0 by checking for it in an initContainer.

We still expect some restarts to happen when upgrading wins on the host, since an upgrade of wins by definition must trigger wins to restart until it can establish a connection.

As part of testing this, we should also make sure that a clean 2.5.7 cluster without wins upgraded can deploy Monitoring, but should just see the windowsExporter pods stuck on pending in the initContainer.

@thehejik
Author

Test report:
Validated against rancher:2.5.7 following the reproducer from the description, but with recent chart versions:

  • rancher-wins-upgrader-0.0.100-rc01
  • rancher-monitoring-14.5.100-rc07

First test:
It works much better now: after adding a windows node to an existing windows cluster (wins and v2 monitoring deployed), the wins-upgrader pods (masquerade ON) are deployed to the new windows node together with the other resources from the monitoring chart, such as windows-exporter.

There are no visible restarts of the windows-exporter pods on the new node (the initContainer stays in a while loop until wins 0.1.0 is installed): [screenshot: win-exporter-fixed]

Second test:
When installing v2 monitoring on a fresh 2.5.7 cluster without wins-upgrader, the deployment fails as expected with a timeout while waiting for the windows-exporter pods, but that appears to be the only missing resource for monitoring (grafana, prometheus, and the linux exporters/metrics are up):

...
DaemonSet is not ready: cattle-monitoring-system/rancher-monitoring-windows-exporter. 0 out of 2 expected pods are ready
Error: timed out waiting for the condition

The following alert also shows on the Monitoring/Overview page:

warning | TargetDown | 100% of the windows-exporter/rancher-monitoring-windows-exporter targets in cattle-monitoring-system namespace are down.

@shpwrck shpwrck added the kind/enhancement Issues that improve or augment existing functionality label May 4, 2021