
Must restart node-exporter pod for a node's metrics to be exported #376

Closed
kylos101 opened this issue Nov 7, 2022 · 38 comments · Fixed by #377 or #420

kylos101 (Contributor) commented Nov 7, 2022

Bug description

Hi there, we're testing gen75 to be deployed tomorrow (Tuesday), and noticed that the normalized load average for workspace clusters does not work (for example, on the Gitpod Overview dashboard) until the node-exporter pod on the corresponding node is restarted.

This recording rule appears not to work for a node until the node-exporter pod is deleted/restarted. Here is a Loom demonstrating the problem.

edit: Upon further examination, we found that new nodes like workspace-ws-ephemeral-101-internal-xl-pool-lsgb do not export node metrics until node-exporter is restarted. For example, we do not see a new node that is almost 10 minutes old.

This problem only exists for node-exporter running on headless and workspace nodes. It does not impact server or services nodes.
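
For reference, here is roughly how we check whether a new node's metrics are reaching Prometheus at all. This is a minimal sketch; the prometheus-k8s service name and the node-exporter job label are assumptions based on our monitoring-satellite setup:

kubectl -n monitoring-satellite port-forward svc/prometheus-k8s 9090 &
# node-exporter targets that are being scraped but are not healthy
curl -s http://localhost:9090/api/v1/query --data-urlencode 'query=up{job="node-exporter"} == 0'
# does the raw load metric exist yet for the new node?
curl -s http://localhost:9090/api/v1/query --data-urlencode 'query=node_load1{instance=~".*internal-xl.*"}'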

Steps to reproduce

I've built an ephemeral cluster (ephemeral-101). Add the new-workspace-cluster role to your user via Gitpod Admin.

  1. Set the large workspace class in your user settings.
  2. kubectl cordon the internal-xl nodes.
  3. Create a workspace; it'll trigger a scale-up.
  4. Wait for the workspace to start, then stress it with sudo apt install -y stress-ng && stress-ng -k -c 8 -l 100 -q
  5. Inspect the Overview dashboard for the new node created by the autoscaler; you will not see the load average until you restart node-exporter on that node.

Expected behavior

The recording rule should work without restarting the pod on existing and future (new) nodes that we scale-up.

Example repository

No response

Anything else?

No response

kylos101 added the bug label Nov 7, 2022
kylos101 changed the title from "Must restart node-exporter pod on nodes (existing and new from scaleup) for this recording rule to work" to "Must restart node-exporter pod for recording rule to work" Nov 7, 2022
kylos101 commented Nov 7, 2022

Nothing has changed with our versions for node-exporter.

gitpod /workspace/gitpod (main) $ kubectx -
Switched to context "ephemeral-101".
gitpod /workspace/gitpod (main) $ kubectl describe ds node-exporter -n monitoring-satellite | grep -i Image
    Image:      quay.io/prometheus/node-exporter:v1.3.1
    Image:      quay.io/brancz/kube-rbac-proxy:v0.13.0
gitpod /workspace/gitpod (main) $ kubectx -
Switched to context "us74".
gitpod /workspace/gitpod (main) $ kubectl describe ds node-exporter -n monitoring-satellite | grep -i Image
    Image:      quay.io/prometheus/node-exporter:v1.3.1
    Image:      quay.io/brancz/kube-rbac-proxy:v0.13.0

kylos101 commented Nov 7, 2022

Nothing has changed for the prometheus-operator deployment:

gitpod /workspace/gitpod (main) $ kubectx -
Switched to context "us74".
gitpod /workspace/gitpod (main) $ kubectl describe deployment prometheus-operator -n monitoring-satellite | grep -i Image
    Image:      quay.io/prometheus-operator/prometheus-operator:v0.58.0
    Image:      quay.io/brancz/kube-rbac-proxy:v0.13.0
gitpod /workspace/gitpod (main) $ kubectx -
Switched to context "ephemeral-101".
gitpod /workspace/gitpod (main) $ kubectl describe deployment prometheus-operator -n monitoring-satellite | grep -i Image
    Image:      quay.io/prometheus-operator/prometheus-operator:v0.58.0
    Image:      quay.io/brancz/kube-rbac-proxy:v0.13.0

kylos101 changed the title from "Must restart node-exporter pod for recording rule to work" to "Must restart node-exporter pod for a node's metrics to be exported" Nov 7, 2022
kylos101 commented Nov 7, 2022

The node is 11m old, and the workspace is running:

gitpod /workspace/gitpod (main) $ kubectl get nodes | grep workspace-ws-ephemeral-101-internal-xl-pool-4kfs
workspace-ws-ephemeral-101-internal-xl-pool-4kfs   Ready                      <none>                      11m     v1.23.13+k3s1

gitpodio-empty-usni7l8lnpw      Running ws-a5aa2c9a-31ce-4976-8405-62e2d499ac27         workspace-ws-ephemeral-101-internal-xl-pool-4kfs kylos101@gmail.com      code    https://github.com/gitpod-io/empty

And from my workspace, I am applying stress to use all the cores:

gitpod /workspace/empty (main) $ gp top
  Workspace class  : Large: Up to 8 vCPU, 16GB memory, 50GB disk  
  CPU (millicores) : 7964m/8000m (99%)                            
  Memory (bytes)   : 484Mi/16384Mi (2%)                           
gitpod /workspace/empty (main) $ gp info
  Workspace ID    : gitpodio-empty-usni7l8lnpw                                     
  Instance ID     : a5aa2c9a-31ce-4976-8405-62e2d499ac27                           
  Workspace class : Large: Up to 8 vCPU, 16GB memory, 50GB disk                    
  Workspace URL   : https://gitpodio-empty-usni7l8lnpw.ws-ephemeral-101.gitpod.io  
  Cluster host    : ws-ephemeral-101.gitpod.io   

But, Grafana doesn't show the node's normalized load average:

kylos101 commented Nov 7, 2022

Removing the ephemeral cluster for now 💸 .

https://werft.gitpod-io-dev.com/job/ops-workspace-cluster-delete-aledbf-lvm.1

kylos101 commented Nov 8, 2022

@ArthurSens found that Prometheus is unable to scrape metrics (context deadline exceeded).

Upon checking a node-exporter pod, I see:

gitpod /workspace/gitpod (main) $ kubectl logs node-exporter-qgtgb -c kube-rbac-proxy -n monitoring-satellite
I1108 15:24:03.606908   12638 main.go:186] Valid token audiences: 
I1108 15:24:16.507740   12638 main.go:316] Generating self signed cert as no cert is provided
I1108 15:25:33.306888   12638 main.go:366] Starting TCP socket on [10.10.0.43]:9100
I1108 15:25:35.205945   12638 main.go:373] Listening securely on [10.10.0.43]:9100
2022/11/08 15:26:08 http: TLS handshake error from 10.10.0.6:58222: EOF
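
The same scrape errors show up on the Prometheus targets API, which is a quicker way to spot affected nodes than tailing each pod. A rough sketch (the service name and the jq filter are assumptions for our setup):

kubectl -n monitoring-satellite port-forward svc/prometheus-k8s 9090 &
# show health and the last scrape error per node-exporter target
curl -s http://localhost:9090/api/v1/targets | jq '
  .data.activeTargets[]
  | select(.labels.job == "node-exporter")
  | {instance: .labels.instance, health, lastError}'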

kylos101 commented Nov 8, 2022

@ArthurSens restarting node-exporter is no longer a viable option. 😞

@corneliusludmann could we ask for your help in scheduling this issue? We could use support. Not having node level metrics is risky. 🙏

ArthurSens self-assigned this Nov 8, 2022
ArthurSens (Contributor) commented
Logs from kube-rbac-proxy are really spamming TLS handshake errors.

I'm far from an expert in networking, so pardon if my comments seem 'too obvious', but some research (googling 😛) has shown that TLS handshake errors occur when the client and server are unable to establish a secure connection.

kylos101 commented Nov 8, 2022

👋 @ArthurSens @liam-j-bennett as a workaround, we removed this line from the node-exporter daemonset in us75:

"--tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305",

Is that a change you'd be willing to absorb?
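
For the record, this is roughly how we applied the workaround by hand. The container and argument indices below are assumptions and need to be verified against the live daemonset before patching:

# list each container's args to find the index of the cipher-suites flag
kubectl -n monitoring-satellite get ds node-exporter \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.args}{"\n"}{end}'
# then drop it, e.g. if it is args[3] of the kube-rbac-proxy container (index 1)
kubectl -n monitoring-satellite patch ds node-exporter --type=json \
  -p='[{"op":"remove","path":"/spec/template/spec/containers/1/args/3"}]'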

At this point, the current state is:

  1. prometheus can scrape metrics from node-exporter pods again (the TLS errors @ArthurSens observed are gone), and we can see node level metrics for workspace and headless pools again 🎉 Refer to this report for an example.
  2. the normalized load average on the Gitpod Dashboard page is still not working; we tried restarting the prometheus-k8s statefulset, but that did not help

liam-j-bennett (Contributor) commented
I can take a look at making the relevant change in Satellite if it fixes the main issue.

It's difficult to tell from the debug logs, but was anything in the kernel upgraded between these gens? Most TLS issues come from changes within the kernel libraries (for example, a security update invalidating old ciphers).

kylos101 commented Nov 9, 2022

No, we haven't changed the kernel in the last few generations @liam-j-bennett, because shiftfs is broken in later versions.

@ArthurSens @liam-j-bennett can one of you help us with the normalized load average not working on the Gitpod Overview Dashboard for us75?

kylos101 commented Nov 9, 2022

@ArthurSens does gitpod-io/ops#377 also solve the issue with normalized load average not showing for us75? That was the original reason I created this issue.

kylos101 commented Nov 9, 2022

Here is a picture of the normalized load average; as you can see, it is missing for gen75.


kylos101 reopened this Nov 9, 2022
Repository owner moved this from ✨Done to 📓Scheduled in 🚚 Security, Infrastructure, and Delivery Team (SID) Nov 9, 2022
kylos101 commented Nov 9, 2022

Reopening, as the original issue (normalized load average is not reported for gen75) still exists.

kylos101 commented Nov 9, 2022

We've fixed the node-exporter daemonset to avoid the --tls-cipher-suites flag for now (using the defaults).

However, what we're seeing is that Prometheus is timing out when scraping metrics from node-exporter. The default timeout is 10s; we're increasing it to 15s to see if that helps. As a side effect, we're also increasing the interval at which Prometheus scrapes from 15s to 30s.
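
A sketch of that change via the ServiceMonitor, assuming it is named node-exporter and has a single endpoint (use "add" instead of "replace" if the fields are not set yet):

kubectl -n monitoring-satellite patch servicemonitor node-exporter --type=json -p='[
  {"op": "replace", "path": "/spec/endpoints/0/interval", "value": "30s"},
  {"op": "replace", "path": "/spec/endpoints/0/scrapeTimeout", "value": "15s"}
]'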

kylos101 commented Nov 9, 2022

While observing Prometheus, we saw a variety of errors where it had trouble connecting to node-exporter on workspace and headless nodes.

Most fall into these two buckets:

  1. Context deadline exceeded; this generally happens when we exceed the scrapeTimeout, which is now 15s.
  2. TLS handshake timeout; this generally happens after about 10.1s.

We also saw the following, but they were less frequent:

  1. Unexpected EOF at 9s once
  2. Get "https://10.10.0.140:9100/metrics": read tcp 192.168.48.137:60266->10.10.0.140:9100: read: connection reset by peer

kylos101 commented Nov 9, 2022

We were able to quickly telnet from Prometheus to node-exporter on port 9100, so it's not that we're unable to make a TCP connection.

However, when using wget from Prometheus to node-exporter, it took a while before we got a 401 error back. This suggests that setting up the TLS connection is expensive, perhaps more so now than it was before.
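
A rough sketch of that check, run from the Prometheus pod (the pod name and node-exporter IP are placeholders, and we assume wget is available in the Prometheus container, as it was in ours):

kubectl -n monitoring-satellite exec -it prometheus-k8s-0 -c prometheus -- \
  sh -c 'time wget --no-check-certificate -O- https://10.10.0.43:9100/metrics'
# without a bearer token, kube-rbac-proxy should answer 401 almost immediately;
# a multi-second wait before that 401 points at the TLS handshake itself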

adrienthebo moved this from 📓Scheduled to ⚒In Progress in 🚚 Security, Infrastructure, and Delivery Team (SID) Nov 9, 2022
kylos101 moved this to In Progress in 🌌 Workspace Team Nov 9, 2022
kylos101 self-assigned this Nov 9, 2022
kylos101 commented
Yep, works like a champ now in us76. Closing for now; will reopen if the issue continues.

@Furisto for awareness, as you're deploying eu76 tomorrow.

Repository owner moved this from In Progress to Awaiting Deployment in 🌌 Workspace Team Nov 17, 2022
Repository owner moved this from ⚒In Progress to ✨Done in 🚚 Security, Infrastructure, and Delivery Team (SID) Nov 17, 2022
kylos101 reopened this Nov 20, 2022
Repository owner moved this from ✨Done to 📓Scheduled in 🚚 Security, Infrastructure, and Delivery Team (SID) Nov 20, 2022
kylos101 removed the status in 🌌 Workspace Team Nov 20, 2022
kylos101 commented Nov 20, 2022

This issue is not present with us77, but it is with eu77. Adding back to the Delivery Team's inbox to request help.


liam-j-bennett commented
I'll take a look at this tomorrow - I'll timebox some investigation and then give an update by EoD

liam-j-bennett commented
So I've done some initial investigation: https://www.notion.so/gitpod/Node-Exporter-metrics-not-being-pulled-58ee9aa326f8447d8a480a7fca7c9543

From my investigation, the export is working now. @kylos101, could you confirm?

kylos101 commented
Hey @liam-j-bennett , could you peek at this comment? https://www.notion.so/Node-Exporter-metrics-not-being-pulled-58ee9aa326f8447d8a480a7fca7c9543?d=baa1a38e41954437bb5d620e24552a56#99a14a6a343240b5a345ab18aa3e1d0c

I see the metrics are working now, but am unclear why they initially did not work and what changed to make them work again.

mrsimonemms commented
@kylos101 @liam-j-bennett is there any work to be done on this ticket? From the thread, it looks like this issue is resolved

mrsimonemms moved this from 📓Scheduled to 🕶In Review / Measuring in 🚚 Security, Infrastructure, and Delivery Team (SID) Dec 8, 2022
ArthurSens commented
Unfortunately, this has happened again with gen78; I believe we still need to find the culprit here.

iQQBot (Contributor) commented Dec 13, 2022

The kube-rbac-proxy is configured with a CPU limit of only 0.02, and for some reason, when it starts, it uses the full CPU and never comes back down. Every request is slow and may take 2 minutes to process.
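
For reference, a quick way to confirm the container is pegged and being CFS-throttled (the label selector is an assumption; adjust for your daemonset):

kubectl -n monitoring-satellite top pod -l app.kubernetes.io/name=node-exporter --containers
# or via cAdvisor metrics in Prometheus:
#   rate(container_cpu_cfs_throttled_periods_total{namespace="monitoring-satellite",container="kube-rbac-proxy"}[5m])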

iQQBot commented Dec 13, 2022

@kylos101 @ArthurSens can we try increasing the CPU limit on kube-rbac-proxy first?
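
A sketch of that change (the container index and the 100m value are assumptions to tune, not final numbers):

kubectl -n monitoring-satellite patch ds node-exporter --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/1/resources/limits/cpu", "value": "100m"}
]'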

ArthurSens commented
The kube-rbac-proxy is configured with a CPU limit of only 0.02, and for some reason, when it starts, it uses the full CPU and never comes back down. Every request is slow and may take 2 minutes to process.

Nice finding! I'll open a PR with an increase

kylos101 commented Dec 23, 2022

@ArthurSens there are still many context deadline exceeded errors, preventing Prometheus from communicating with node-exporter.


In the node-exporter logs I see:

kubectx -c
us80

kubectl logs node-exporter-w9sn8 -n monitoring-satellite

ts=2022-12-23T09:32:50.878Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp 127.0.0.1:9100" msg="->127.0.0.1:54994: write: broken pipe"
ts=2022-12-23T09:32:50.878Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp 127.0.0.1:9100" msg="->127.0.0.1:54994: write: broken pipe"
ts=2022-12-23T09:32:50.878Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp 127.0.0.1:9100" msg="->127.0.0.1:54994: write: broken pipe"
ts=2022-12-23T09:32:50.878Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp 127.0.0.1:9100" msg="->127.0.0.1:54994: write: broken pipe"
ts=2022-12-23T09:32:50.878Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp 127.0.0.1:9100" msg="->127.0.0.1:54994: write: broken pipe"

This led me to prometheus/node_exporter#2387, which is now closed, but it mentions prometheus/node_exporter#2500, which indicates this was a kernel-level issue. We cannot upgrade our kernel because shiftfs is broken in recent versions.

We haven't changed our kernel version in quite a while, so I'm inclined to try upgrading node-exporter to 1.5 (we are presently on 1.3.1). 1.5 has a new default for GOMAXPROCS and a feature we can try opting into. wdyt?
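
If we go that route, a sketch of the change (the image tag and the GOMAXPROCS value are the knobs we'd tune; --runtime.gomaxprocs is the flag node_exporter 1.5 introduces for this, per its changelog):

kubectl -n monitoring-satellite set image ds/node-exporter \
  node-exporter=quay.io/prometheus/node-exporter:v1.5.0
# and optionally add e.g. --runtime.gomaxprocs=2 to the node-exporter container args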

I moved this back to in-progress, as the issue is still happening with gen80.

cc: @gitpod-io/engineering-delivery-operations-experience and @gitpod-io/engineering-workspace for awareness

ArthurSens commented
Nice exploration Kyle! I've raised #416 as a potential fix.

When using jsonnet, we used to have automated upgrades for all the containers in the stack; we lost that when transitioning to the Go-based installer 😅

kylos101 commented Jan 4, 2023

@ArthurSens sorry bud, it still seems to be an issue with us81. I haven't checked the logs or metrics to see if us81 is different from us80, but wanted to give you a heads up. Do you have bandwidth to continue helping on this one?

cc: @Furisto
