Must restart node-exporter pod for a node's metrics to be exported #376
Comments
Nothing has changed with our versions for
Nothing has changed for
Removing the ephemeral cluster for now 💸. https://werft.gitpod-io-dev.com/job/ops-workspace-cluster-delete-aledbf-lvm.1
@ArthurSens found that Prometheus is unable to scrape metrics (context deadline exceeded). Upon checking a
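As a hedged aside, a manual check like the one below can tell whether a node's node-exporter endpoint answers within the scrape deadline; the `monitoring-satellite` namespace, the label selector, and port 9100 are assumptions about a typical install, not confirmed in this thread.

```bash
# Assumed namespace, label selector, and port; adjust to the actual install.
# Find the node-exporter pod running on the affected node:
kubectl -n monitoring-satellite get pods -o wide \
  -l app.kubernetes.io/name=node-exporter

# Hit its metrics endpoint with a timeout similar to the scrape deadline;
# a hang here reproduces the "context deadline exceeded" symptom:
kubectl -n monitoring-satellite run curl-check --rm -i --restart=Never \
  --image=curlimages/curl -- \
  curl -sS --max-time 10 http://<NODE_IP>:9100/metrics | head -n 20
```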
@ArthurSens restarting

@corneliusludmann could we ask for your help in scheduling this issue? We could use support. Not having node-level metrics is risky. 🙏
👋 @ArthurSens @liam-j-bennett as a workaround, we removed this line from the
Is that a change you'd be willing to absorb? At this point, the current state is:
I can take a look at making the relevant change in Satellite if it fixes the main issue. It's difficult to tell from the debug logs, but was anything upgraded on the kernel between these gens? Most TLS issues come from changes within the kernel libraries (for example, a security update invalidating old ciphers).
No, we haven't changed the kernel in the last few generations @liam-j-bennett, because shiftfs is broken in later versions. @ArthurSens @liam-j-bennett can one of you help us with the normalized load average not working on the Gitpod Overview Dashboard for
@ArthurSens does gitpod-io/ops#377 also solve the issue with the normalized load average not showing for
Reopening, as the original issue (normalized load average is not reported for gen75) still exists.
We've fixed the

However, what we're seeing is that Prometheus is timing out when scraping metrics for
We saw a variety of errors when observing Prometheus, where it had trouble connecting to node-exporter on workspace and headless nodes. Most fall into these two buckets:

We also saw the following, but they were less frequent:
We were able to quickly

However, when using
Yap, works like a champ now in

@Furisto for awareness, as you're deploying
I'll take a look at this tomorrow - I'll timebox some investigation and then give an update by EoD.
So I've done some initial investigation: https://www.notion.so/gitpod/Node-Exporter-metrics-not-being-pulled-58ee9aa326f8447d8a480a7fca7c9543

From my investigation, the exporter is working now. @kylos101 could you confirm?
Hey @liam-j-bennett, could you peek at this comment? https://www.notion.so/Node-Exporter-metrics-not-being-pulled-58ee9aa326f8447d8a480a7fca7c9543?d=baa1a38e41954437bb5d620e24552a56#99a14a6a343240b5a345ab18aa3e1d0c I see the metrics are working now, but am unclear why they initially did not work, and what changed to cause them to work again.
@kylos101 @liam-j-bennett is there any work to be done on this ticket? From the thread, it looks like this issue is resolved.
Unfortunately, this has happened again with
The
@kylos101 @ArthurSens can we try increasing the CPU limit in
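A minimal sketch of what that bump could look like, assuming the DaemonSet and container are named `node-exporter` in a `monitoring-satellite` namespace and that a CPU limit already exists at the patched path (all assumptions, not taken from this thread):

```bash
# Hypothetical: raise the CPU limit on the node-exporter container.
# Namespace, DaemonSet/container position, and the 500m value are assumptions.
kubectl -n monitoring-satellite patch daemonset node-exporter --type=json -p='[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/resources/limits/cpu",
   "value": "500m"}
]'
```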
@ArthurSens there are still many In the
Which led me to this issue, prometheus/node_exporter#2387, which is now closed, but it mentioned prometheus/node_exporter#2500, which indicates this was a kernel-level issue. We cannot upgrade our kernel due to shiftfs being broken in recent versions. We haven't changed our kernel version in quite a while; therefore, I'm inclined to try upgrading node-exporter to 1.5 (we are presently on 1.3.1). 1.5 has a new default for GOMAXPROCS, and a feature we can try opting into. wdyt?

I moved this back to in-progress, as the issue is still happening with gen80.

cc: @gitpod-io/engineering-delivery-operations-experience and @gitpod-io/engineering-workspace for awareness
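A rough sketch of that upgrade, assuming the DaemonSet and container are both named `node-exporter` in a `monitoring-satellite` namespace (names are assumptions; verify the GOMAXPROCS behavior against the 1.5.0 release notes):

```bash
# Hypothetical image bump to node-exporter 1.5.x; DaemonSet, container, and
# namespace names are assumptions about the monitoring-satellite install.
kubectl -n monitoring-satellite set image daemonset/node-exporter \
  node-exporter=quay.io/prometheus/node-exporter:v1.5.0

# Watch the rollout and confirm new pods come up on workspace/headless nodes.
kubectl -n monitoring-satellite rollout status daemonset/node-exporter
```

For context, node-exporter 1.5 changed the default GOMAXPROCS to 1, which is the new default referenced above; the exact flag for overriding it should be checked against the 1.5.0 changelog.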
Nice exploration Kyle! I've raised #416 as a potential fix. When using jsonnet we used to have automated upgrades for all the containers we used in the stack; we lost that when transitioning to the Go-based installer 😅
@ArthurSens sorry bud, it still seems like an issue with

cc: @Furisto
Bug description
Hi there, we're testing gen75 to be deployed tomorrow (Tuesday), and noticed that the normalized load average for workspace clusters does not work (for example on the Gitpod Overview dashboard) until the `node-exporter` pod on a corresponding node is restarted. This recording rule appears to not work for a node until the `node-exporter` pod is deleted / restarted. Here is a loom demonstrating the problem.

edit: Upon further examination, we found new nodes like `workspace-ws-ephemeral-101-internal-xl-pool-lsgb` do not export metrics for the node until `node-exporter` is restarted. For example, we do not see the new node that is almost 10 minutes old.

This problem only exists for `node-exporter` running on `headless` and `workspace` nodes. It does not impact `server` or `services` nodes.
Steps to reproduce
I've built an ephemeral cluster (`ephemeral-101`).

1. Add the `new-workspace-cluster` role to your user via Gitpod Admin.
2. `kubectl cordon` the internal-xl nodes.
3. `sudo apt install -y stress-ng && stress-ng -k -c 8 -l 100 -q`
Expected behavior
The recording rule should work without restarting the pod, on existing nodes and on future (new) nodes that we scale up.
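As a hedged way to check this expectation against Prometheus directly, queries like the following could be used; `PROM_URL`, the `job` label, and the instance regex are placeholders and assumptions, not taken from the actual configuration.

```bash
# Placeholder Prometheus URL; the job label and instance regex are assumptions.
PROM_URL=http://localhost:9090

# Are any node-exporter targets down?
curl -sG "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=up{job="node-exporter"} == 0'

# Is node_load1 being reported for the new internal-xl nodes without a restart?
curl -sG "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=node_load1{instance=~".*internal-xl.*"}'
```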
Example repository
No response
Anything else?
No response