
windows-exporter pods keep restarting until wins-upgrader is fully deployed on new 2.5.7 win node #31842

Closed
thehejik opened this issue Mar 30, 2021 · 3 comments
Assignees
Labels
area/windows kind/bug-qa Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement kind/enhancement Issues that improve or augment existing functionality QA/XS
Milestone

Comments

@thehejik

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (least amount of steps as possible):

  • Use rancher 2.5.7 (with v1 monitoring disabled)
  • Deploy a windows cluster - 3 linux nodes - 1 etcd, 1 control and 1 worker nodes, 3 windows nodes (eg. windows core 1909 ami-008ec03d9035c8ee7)
  • In cluster explorer point the feature charts to dev-v2.5 branch
  • Deploy rancher-wins-upgrader:0.0.100-rc00 with the following override:
masquerade:
  enabled: true
  as: c:\etc\wmi-exporter\wmi-exporter.exe
  • When the upgrade of wins has finished, deploy Monitoring v2 rancher-monitoring*:9.4.204-rc07 on the cluster into System Project:
    • select cluster.provider.rke from Cluster Type dropdown in General tab
    • enable the windows exporter via Edit as Yaml -> override windowsExporter.enabled=true
  • Wait until all pods from all DaemonSets in cattle-monitoring-system namespace are ready
  • add one or more windows worker nodes into the cluster and observe pods from rancher-monitoring-windows-exporter DS on the new node.
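For reference, the Edit-as-Yaml override from the steps above boils down to this values fragment (a sketch; the key name is taken from the step above):

windowsExporter:
  enabled: true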

Result:
The rancher-monitoring-windows-exporter* pods keep restarting, probably until wins-upgrader is fully installed on the new node(s). I added 3 new nodes and each of them showed the same symptoms:

[screenshot: win-exporter]

After a while all exporter pods are running and everything works as expected.

Other details that may be helpful:

  • logs from rancher-monitoring-windows-exporter-* pod and exporter-node container from rancher-monitoring-windows-exporter DS (new node):
time="2021-03-30T12:26:28Z" level=info msg="Starting windows_exporter (version=0.15.0, branch=master, revision=cdbb27d0b4ea9810dc35035fad281fe6468b7dd1)" source="exporter.go:412"
time="2021-03-30T12:26:28Z" level=info msg="Build context (go=go1.15.3, user=appveyor-vm\\appveyor@appveyor-vm, date=20201107-08:23:37)" source="exporter.go:413"
time="2021-03-30T12:26:28Z" level=info msg="Starting server on :9796" source="exporter.go:416"
  • wins-upgrader-default* pod and noop container from wins-upgrader-default DS started at 12:25:57 - taken from webui (new node)

This is most likely caused by a race condition between the wins-upgrader and windows-exporter deployments, at least according to the timestamps: in my case the exporter attempted to start before wins-upgrader was ready. I have no evidence beyond the timestamps, but if the default pod restart period is ~30s, it would make sense.

Environment information

  • Rancher version (rancher/rancher/rancher/server image tag or shown bottom left in the UI): 2.5.7
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): AWS
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): AWS VM t3.xlarge
  • Kubernetes version (use kubectl version): v1.20.4
@thehejik thehejik added area/windows kind/bug-qa Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement labels Mar 30, 2021
@sowmyav27 sowmyav27 added this to the v2.5.8 milestone Mar 30, 2021
@aiyengar2
Contributor

@thehejik if the exporter pods eventually start running, I think the bug described here is the expected behavior as of today.

However, I do agree that having Pods restart several times when there is no actual error seems like a bug; we should avoid pod restarts when possible.

To resolve this, I'll add an initContainer called check-wins-version to the windows-exporter deployment that runs a PowerShell script on a 5s interval to check whether wins has been updated on the host to at least v0.1.0. This would prevent container restarts, as the Pod will instead stay in Pending until the initContainer completes.
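A minimal sketch of what such an initContainer could look like on the windows-exporter pod spec (the image, the wins.exe path, and the version-check command below are all placeholders for illustration, not the actual implementation):

initContainers:
  - name: check-wins-version
    # Placeholder image; any Windows image with PowerShell would do.
    image: mcr.microsoft.com/windows/servercore:1909
    command: ["powershell", "-Command"]
    args:
      - >-
        while ($true) {
          # Placeholder path: read the file version of wins.exe on the host.
          $v = (Get-Item 'c:\Windows\wins.exe' -ErrorAction SilentlyContinue).VersionInfo.ProductVersion;
          if ($v -and ([version]$v -ge [version]'0.1.0')) { exit 0 };
          Start-Sleep -Seconds 5
        }

With an initContainer like this, a failing precondition shows up as the Pod waiting in Init rather than as repeated container restarts.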

@aiyengar2
Contributor

Note: the fix for this issue will not prevent windows-exporter from failing entirely, but rather mitigates the repetitive failures caused by a wins version older than v0.1.0 by checking for it in an initContainer.

We still expect some restarts to happen when upgrading wins on the host, since an upgrade of wins by definition must trigger wins to restart until it can establish a connection.

As part of testing this, we should also make sure that a clean 2.5.7 cluster without wins upgraded can deploy Monitoring, but should just see the windowsExporter pods stuck on pending in the initContainer.

@thehejik
Author

Test report:
Validated against rancher:2.5.7 following the reproducer from the description, but with recent chart versions:

  • rancher-wins-upgrader-0.0.100-rc01
  • rancher-monitoring-14.5.100-rc07

First test:
It works much better now: after adding a windows node to an existing windows cluster (wins and v2 monitoring deployed), the wins-upgrader pods (masquerade ON) are deployed to the new windows node together with the other resources from the monitoring chart, such as windows-exporter.

There are no visible restarts of the windows-exporter pods on the new node (the initContainer stays in a while loop until wins 0.1.0 is installed): [screenshot: win-exporter-fixed]

Second test:
When installing v2 monitoring on a fresh 2.5.7 cluster without wins-upgrader, the deployment fails as expected with a timeout while waiting for the windows-exporter pods, but that appears to be the only missing resource for monitoring (grafana, prometheus, and the linux exporters/metrics are up):

...
DaemonSet is not ready: cattle-monitoring-system/rancher-monitoring-windows-exporter. 0 out of 2 expected pods are ready
Error: timed out waiting for the condition

The following alert also shows on the Monitoring/Overview page:

warning | TargetDown | 100% of the windows-exporter/rancher-monitoring-windows-exporter targets in cattle-monitoring-system namespace are down.

@shpwrck shpwrck added the kind/enhancement Issues that improve or augment existing functionality label May 4, 2021