[Monitoring V2] no metrics from windows nodes available in Grafana when win_prefix_path is set on a windows cluster #79

Open
sowmyav27 opened this issue May 4, 2021 · 6 comments
Labels: area/windows, bug (Something isn't working), team/area4

Comments

@sowmyav27

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (least amount of steps as possible):
On 2.5.8-rc18:

  • Chart repo commit:
    status:
      branch: dev-v2.5
      commit: 766d0ea8d73bb6a727a72c4a89fa4ce730479239
  • Deploy a Windows cluster with win_prefix_path set to 'c:\host\opt' (1 etcd+control plane node, 3 Linux worker nodes, and 3 Windows worker nodes).
  • When the cluster comes up Active, deploy the Monitoring V2 chart on the cluster.
  • All the workloads come up Active, but no metrics from Windows nodes are seen in Grafana, and the Prometheus targets show the Windows nodes as Down.
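For anyone retesting, the "targets show Windows nodes as Down" check can be scripted rather than read off the UI. The following is a minimal sketch, assuming the cluster's Prometheus is reachable at some base URL (for example through a port-forward) and using the standard Prometheus HTTP API endpoint `/api/v1/targets`. The sample payload, including the job names, IP addresses, and error text, is hypothetical and only mirrors the shape of that API's response; port 9796 is taken from the exporter logs in this issue.

```python
import json
import urllib.request


def fetch_targets(base_url):
    """Fetch the target list from the Prometheus HTTP API (/api/v1/targets).

    base_url is an assumption, e.g. a locally port-forwarded Prometheus.
    """
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


def down_targets(payload):
    """Return (instance, job, lastError) for every active target whose health is not 'up'."""
    rows = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") != "up":
            labels = target.get("labels", {})
            rows.append(
                (
                    labels.get("instance", "?"),
                    labels.get("job", "?"),
                    target.get("lastError", ""),
                )
            )
    return rows


# Hypothetical payload in the shape returned by /api/v1/targets.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"instance": "10.0.0.11:9796", "job": "node-exporter"},
                "health": "up",
                "lastError": "",
            },
            {
                "labels": {"instance": "10.0.0.21:9796", "job": "windows-exporter"},
                "health": "down",
                "lastError": "context deadline exceeded",
            },
        ]
    },
}

for instance, job, err in down_targets(sample):
    print(f"DOWN {job} {instance}: {err}")
```

Against a live cluster, `down_targets(fetch_targets(base_url))` would list the same information that the Prometheus Targets page shows for unhealthy targets.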

Expected Result:
Metrics from windows nodes should be available in Grafana.

Other details that may be helpful:

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): 2.5.8-rc18
  • Installation option (single install/HA): HA
@aiyengar2
Contributor

aiyengar2 commented May 5, 2021

Reproduced in the latest Monitoring chart. Screenshots and logs attached below for investigation.

Seems like an issue with the wins cli proxy command on the wins server side. Filing a related issue with rancher/wins to track.

Redeploying the wins client with windows-exporter does not seem to resolve the issue; it just produces two more of the exact same log lines:

Handling backend connection request [rancher-monitoring-windows-exporter-5q4nz]
error in remotedialer server [500]: connect not allowed

Screen Shot 2021-05-04 at 6 41 03 PM

Screen Shot 2021-05-04 at 6 43 30 PM

check-wins-version logs

Detected wins version on host is v0.1.0, which is >v0.1.0. Continuing with installation...

exporter-node logs

time="2021-05-05T01:39:04Z" level=warning msg="No where-clause specified for service collector. This will generate a very large number of metrics!" source="service.go:41"
time="2021-05-05T01:39:04Z" level=error msg="Failed to start service: The service process could not connect to the service controller." source="exporter.go:350"
time="2021-05-05T01:39:04Z" level=info msg="Enabled collectors: system, cpu, net, os, logical_disk, tcp, container, service, cs, memory" source="exporter.go:360"
time="2021-05-05T01:39:04Z" level=info msg="Starting windows_exporter (version=0.15.0, branch=master, revision=cdbb27d0b4ea9810dc35035fad281fe6468b7dd1)" source="exporter.go:412"
time="2021-05-05T01:39:04Z" level=info msg="Build context (go=go1.15.3, user=appveyor-vm\\appveyor@appveyor-vm, date=20201107-08:23:37)" source="exporter.go:413"
time="2021-05-05T01:39:04Z" level=info msg="Starting server on :9796" source="exporter.go:416"

exporter-node-proxy logs

INFO[2021-05-05T01:40:36Z] Connecting to proxy                           url="ws://rancher_wins_proxy"

wins service logs (host)

PS C:\Users\Administrator> Get-EventLog -LogName Application -Source rancher-wins -ErrorAction Ignore | Sort-Object Index | %{ $_.Message }
Stackdump - waiting signal at Global\stackdump-3592
Listening on \\.\pipe\rancher_wins_proxy
Listening on \\.\pipe\rancher_wins
currentVersion.Major > versionRange.MaxVersion.Major: 11, 9
currentVersion.Major > versionRange.MaxVersion.Major: 11, 9
currentVersion.Major < versionRange.MinVersion.Major: 11, 12
currentVersion.Major > versionRange.MaxVersion.Major: 11, 10
currentVersion.Major < versionRange.MinVersion.Major: 11, 12
currentVersion.Major < versionRange.MinVersion.Major: 11, 13
currentVersion.Major > versionRange.MaxVersion.Major: 11, 9
currentVersion.Major > versionRange.MaxVersion.Major: 11, 10
currentVersion.Minor < versionRange.MinVersion.Major: 10, 11
currentVersion.Major < versionRange.MinVersion.Major: 11, 12
currentVersion.Major < versionRange.MinVersion.Major: 11, 13
currentVersion.Major < versionRange.MinVersion.Major: 11, 13
currentVersion.Major > versionRange.MaxVersion.Major: 11, 9
currentVersion.Major > versionRange.MaxVersion.Major: 11, 9
currentVersion.Major < versionRange.MinVersion.Major: 11, 12
currentVersion.Major > versionRange.MaxVersion.Major: 11, 10
currentVersion.Major < versionRange.MinVersion.Major: 11, 12
currentVersion.Major < versionRange.MinVersion.Major: 11, 13
currentVersion.Major > versionRange.MaxVersion.Major: 11, 9
currentVersion.Major > versionRange.MaxVersion.Major: 11, 10
currentVersion.Minor < versionRange.MinVersion.Major: 10, 11
currentVersion.Major < versionRange.MinVersion.Major: 11, 12
currentVersion.Major < versionRange.MinVersion.Major: 11, 13
currentVersion.Major < versionRange.MinVersion.Major: 11, 13
could not get checksum for "c:\\etc\\rancher\\wins\\wins.exe": open c:\etc\rancher\wins\wins.exe: The process cannot access the file because it is being used by another process.
could not get checksum for "c:\\etc\\rancher\\wins\\wins.exe": open c:\etc\rancher\wins\wins.exe: The process cannot access the file because it is being used by another process.
Handling backend connection request [rancher-monitoring-windows-exporter-ldck9]
error in remotedialer server [500]: connect not allowed
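To triage longer event-log dumps like the one above, the rejected proxy connections can be pulled out mechanically. This is only a sketch: the two regular expressions are derived from the log lines quoted in this issue, and it assumes that an "error in remotedialer server" line belongs to the "Handling backend connection request" line directly before it, which the logs suggest but do not prove.

```python
import re

# Patterns derived from the log lines quoted in this issue.
REQUEST_RE = re.compile(r"^Handling backend connection request \[(?P<client>[^\]]+)\]$")
ERROR_RE = re.compile(r"^error in remotedialer server \[(?P<code>\d+)\]: (?P<reason>.+)$")


def rejected_proxy_requests(lines):
    """Return (client, status_code, reason) for each request line directly followed by an error line.

    Assumption: an error line refers to the request line immediately before it.
    """
    rejected = []
    pending = None  # client name from the previous line, if it was a request line
    for raw in lines:
        line = raw.strip()
        req = REQUEST_RE.match(line)
        if req:
            pending = req.group("client")
            continue
        err = ERROR_RE.match(line)
        if err and pending is not None:
            rejected.append((pending, int(err.group("code")), err.group("reason")))
        pending = None
    return rejected


# Excerpt of the wins service log from this issue.
LOG = r"""Listening on \\.\pipe\rancher_wins_proxy
Listening on \\.\pipe\rancher_wins
Handling backend connection request [rancher-monitoring-windows-exporter-ldck9]
error in remotedialer server [500]: connect not allowed
"""

for client, code, reason in rejected_proxy_requests(LOG.splitlines()):
    print(f"{client}: [{code}] {reason}")
```

The input would be the output of the `Get-EventLog` command above, saved to a file or piped in as text.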

named pipes (host)

PS C:\Users\Administrator> (get-childitem \\.\pipe\).FullName
... (omitted) ...
\\.\pipe\rancher_wins
\\.\pipe\rancher_wins_proxy
... (omitted) ...
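A quick way to confirm the same thing on other nodes is to compare the pipe listing against the two pipe names seen above. This is a small sketch that works on the text output of the `get-childitem` command; the helper name `missing_pipes` is made up for illustration.

```python
# Pipe names taken from the wins service log and the listing in this issue.
EXPECTED_PIPES = (r"\\.\pipe\rancher_wins", r"\\.\pipe\rancher_wins_proxy")


def missing_pipes(listing, expected=EXPECTED_PIPES):
    """Return the expected pipe names that do not appear in the listing text.

    Comparison is case-insensitive and ignores surrounding whitespace.
    """
    present = {line.strip().lower() for line in listing.splitlines() if line.strip()}
    return [pipe for pipe in expected if pipe.lower() not in present]


# Listing as shown in this issue (omitted entries left out).
listing = r"""\\.\pipe\rancher_wins
\\.\pipe\rancher_wins_proxy"""

print(missing_pipes(listing))
```

Note that on the affected node both pipes are present, so an empty result here does not rule out this bug; it only shows that the pipes themselves were created.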

@aiyengar2
Contributor

aiyengar2 commented May 5, 2021

Possible Workaround

Just deploying rancher-wins-upgrader (i.e. re-initializing the wins service) seems to be an effective workaround for this issue.

I'm not sure whether this is because the fix in wins v0.1.1 somehow resolves this bug (doubtful) or whether the re-initialization of wins is what fixes the issue, since that would cause the named pipe, gRPC server, and network configuration of the host to be re-initialized.

@sowmyav27 once rc19 is cut with wins v0.1.1, can you retest this issue to see if that resolves it?


Screen Shot 2021-05-04 at 7 12 40 PM

@Jono-SUSE-Rancher

@sowmyav27 & @aiyengar2 - We are doing some triage right now of issues in 2.6. Would you be able to give us more information about this? Is this fixed in the latest RC? And Arvind, how does the workaround look as a viable option? (You mentioned it was a possible workaround).

@aiyengar2
Contributor

@Jono-SUSE-Rancher I don't believe this is fixed in the latest RC.

The core problem here seems to be that a Windows cluster without rancher-wins-upgrader deployed that mounts resources on a prefixPath (e.g. c:\host\opt; this is specified as part of the RKE1 config) does not seem to be able to accept proxy connections via the Named Pipe mounted at \\.\pipe\rancher_wins_proxy.

This issue appears to be resolved when the wins service is restarted and/or the wins config is refreshed, which is exactly what happens when you deploy rancher-wins-upgrader.

I'm not sure why this restart is required so this needs to be investigated. The problem could be with the way we do bootstrapping on Windows nodes (e.g. how we set up the config + service) or could require cutting a new wins release. Either way, this would be a Windows issue that is not particular to Monitoring (cc: @sirredbeard ).

Currently, only Monitoring is impacted since only Monitoring uses wins cli proxy, but I believe there are conversations about using that feature in other Windows components (cc: @rosskirkpat), so this does need to be prioritized eventually.

However, if we cannot prioritize this in 2.6, the workaround of expecting rancher-wins-upgrader to be deployed onto Windows clusters with prefixPath enabled sounds like a viable option to me. I think we should encourage customers to start using it anyway so that they can have declarative wins configs (i.e. an expectation that the upgrader chart exists would allow us to more easily cut wins releases in the future, if we need to add security fixes, golang bumps, or new features). @luthermonson @sirredbeard any thoughts here?

Either way, if we prioritize the workaround, I think we should test it rigorously so that we don't miss anything before suggesting it as the official solution to this issue.

@deniseschannon

@sowmyav27 @aiyengar2 Could this be related to the fact that no metrics are available in Grafana for k8s 1.21?

rancher/rancher#33465

@sirredbeard transferred this issue from rancher/rancher on Aug 18, 2021
@sirredbeard added this to Proposed in Windows Team 2.6.3 on Aug 18, 2021
@aiyengar2 changed the title from "no metrics from windows nodes available in Grafana when win_prefix_path is set on a windows cluster" to "[Monitoring V2] no metrics from windows nodes available in Grafana when win_prefix_path is set on a windows cluster" on Aug 18, 2021
@aiyengar2
Contributor

@deniseschannon that should be unrelated. rancher/rancher#33465 is Monitoring V1; this is Monitoring V2.


7 participants