Must restart node-exporter pod for a node's metrics to be exported #376

Closed
@kylos101

Bug description

Hi there, we're testing gen75, which is due to be deployed tomorrow (Tuesday), and noticed that the normalized load average for workspace clusters (as shown on the Gitpod Overview dashboard) is not reported until the node-exporter pod on the corresponding node is restarted.

This recording rule does not appear to work for a node until the node-exporter pod is deleted or restarted. Here is a Loom demonstrating the problem.

Edit: On further examination, we found that new nodes such as workspace-ws-ephemeral-101-internal-xl-pool-lsgb do not export node metrics until node-exporter is restarted. For example, a new node that is almost 10 minutes old is still not visible.

This problem only exists for node-exporter running on headless and workspace nodes. It does not impact server or services nodes.
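One way to spot affected nodes is to compare the cluster's node list with the set of nodes that actually have node metrics. The following is a minimal sketch, not the method used in this report: `missing_nodes` is a hypothetical helper, and the file names are placeholders. How the second list is obtained depends on your Prometheus setup and labels, so it is left out here.

```shell
# Print the node names that are listed in the first file but not in the second.
# $1: file with one Kubernetes node name per line (all nodes)
# $2: file with one node name per line (nodes for which metrics are present)
missing_nodes() {
  # -v invert match, -x whole-line match, -F fixed strings, -f read patterns from a file
  grep -vxF -f "$2" "$1"
}

# Example usage (assumes kubectl access; file names are placeholders):
#   kubectl get nodes -o name | sed 's|^node/||' > all-nodes.txt
#   missing_nodes all-nodes.txt nodes-with-metrics.txt
```

Any node printed by the helper is a candidate for the behavior described above.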

Steps to reproduce

I've built an ephemeral cluster (ephemeral-101). Add the new-workspace-cluster role to your user via Gitpod Admin.

  1. Set the large workspace class in your user settings.
  2. kubectl cordon the internal-xl nodes.
  3. Create a workspace; this triggers a scale-up.
  4. Wait for the workspace to start, then stress it with sudo apt install -y stress-ng && stress-ng -k -c 8 -l 100 -q
  5. Inspect the Overview dashboard for the new node that was created by the autoscaler. You will not see its load average until you restart node-exporter on that node.
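The restart mentioned in step 5 can be sketched as deleting the node-exporter pod scheduled on the affected node. This sketch assumes node-exporter is managed by a controller (typically a DaemonSet) that recreates the pod. `restart_node_exporter` is a hypothetical helper, and the default namespace (`monitoring`) and label selector (`app.kubernetes.io/name=node-exporter`) are assumptions that may not match this cluster.

```shell
# Delete the node-exporter pod running on one node so that its controller recreates it.
# $1: node name
# $2: namespace (assumed default: monitoring)
# $3: pod label selector (assumed default: app.kubernetes.io/name=node-exporter)
restart_node_exporter() {
  node="$1"
  namespace="${2:-monitoring}"
  selector="${3:-app.kubernetes.io/name=node-exporter}"
  # KUBECTL can be overridden (for example with "echo") to preview the command.
  ${KUBECTL:-kubectl} delete pod -n "$namespace" -l "$selector" \
    --field-selector "spec.nodeName=$node"
}

# Example usage:
#   restart_node_exporter workspace-ws-ephemeral-101-internal-xl-pool-lsgb
```

Setting `KUBECTL=echo` prints the command instead of running it, which is a cheap way to check the namespace and selector before deleting anything.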

Expected behavior

The recording rule should work on existing nodes and on new nodes added by scale-up, without restarting the node-exporter pod.

Example repository

No response

Anything else?

No response

Labels

bug: Something isn't working
