Bug description
Hi there, we're testing gen75 to be deployed tomorrow (Tuesday), and noticed that the normalized load average for workspace clusters (as shown on the Gitpod Overview dashboard) does not work until the `node-exporter` pod on the corresponding node is restarted. The recording rule appears not to work for a node until the `node-exporter` pod is deleted / restarted. Here is a loom demonstrating the problem.

edit: Upon further examination, we found that new nodes such as `workspace-ws-ephemeral-101-internal-xl-pool-lsgb` do not export any node metrics until `node-exporter` is restarted. For example, we do not see the new node that is almost 10 minutes old:

This problem only exists for `node-exporter` running on `headless` and `workspace` nodes. It does not impact `server` or `services` nodes.
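For reference, this is roughly how we check that the raw node-exporter series (and not just the recording rule output) are missing for a new node. The Prometheus endpoint and the label carrying the node name below are placeholders / assumptions, not the cluster's actual values, and the normalized-load expression is only the usual pattern for this kind of rule:

```sh
# Placeholders -- adjust the Prometheus URL and the label that carries the
# node name (could be `node` or `instance` depending on relabeling).
PROM=http://prometheus.example.internal:9090
NODE=workspace-ws-ephemeral-101-internal-xl-pool-lsgb

# Raw node-exporter series for the node. An empty result means the exporter's
# metrics are missing entirely, not just the recording rule output.
curl -sG "$PROM/api/v1/query" \
  --data-urlencode "query=node_load1{node=\"$NODE\"}" | jq '.data.result'

# Normalized load average, i.e. 1-minute load divided by the node's CPU count.
curl -sG "$PROM/api/v1/query" \
  --data-urlencode "query=node_load1{node=\"$NODE\"} / count by (node) (node_cpu_seconds_total{mode=\"idle\",node=\"$NODE\"})" \
  | jq '.data.result'
```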
Steps to reproduce
I've built an ephemeral cluster (`ephemeral-101`); a consolidated command sketch follows the list.

- Add the `new-workspace-cluster` role to your user via Gitpod Admin.
- Set the large workspace class in your user settings.
- `kubectl cordon` the internal-xl nodes.
- Create a workspace; it'll trigger a scale-up.
- Wait for the workspace to start, then stress it with `sudo apt install -y stress-ng && stress-ng -k -c 8 -l 100 -q`.
- Inspect the overview dashboard for the new node that was created by the autoscaler; you will not see load average until restarting `node-exporter` on that node.
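Roughly, the commands involved. The monitoring namespace name is an assumption (adjust to wherever `node-exporter` runs in the cluster):

```sh
# Cordon every internal-xl node so a new workspace forces a scale-up.
for n in $(kubectl get nodes -o name | grep internal-xl); do kubectl cordon "$n"; done

# From inside the newly started workspace, generate CPU load:
sudo apt install -y stress-ng && stress-ng -k -c 8 -l 100 -q

# Find the node-exporter pod on the freshly scaled-up node. Its metrics only
# show up on the overview dashboard after this pod is deleted (recreated).
NODE=workspace-ws-ephemeral-101-internal-xl-pool-lsgb
NS=monitoring-satellite   # assumption: namespace where node-exporter runs
POD=$(kubectl -n "$NS" get pods -o name --field-selector spec.nodeName="$NODE" | grep node-exporter)
kubectl -n "$NS" delete "$POD"
```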
Expected behavior
The recording rule should work, without restarting the pod, on existing nodes and on future (new) nodes added by scale-up.
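A simple way we'd verify this, using the same placeholder Prometheus endpoint and label assumptions as above: every node Kubernetes knows about should appear in `node_load1` without touching any pods.

```sh
PROM=http://prometheus.example.internal:9090   # placeholder endpoint

# Nodes Kubernetes knows about.
kubectl get nodes -o name | sed 's|^node/||' | sort > /tmp/k8s-nodes

# Nodes that currently have node-exporter series (label name is an assumption).
curl -sG "$PROM/api/v1/query" --data-urlencode 'query=node_load1' \
  | jq -r '.data.result[].metric.node' | sort -u > /tmp/prom-nodes

# Any output here is a node with no node-exporter metrics.
comm -23 /tmp/k8s-nodes /tmp/prom-nodes
```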
Example repository
No response
Anything else?
No response