PrometheusAlerts are not consistent across recent workspace-cluster generations #384

Closed
kylos101 opened this issue Nov 17, 2022 · 2 comments
Labels
bug Something isn't working

kylos101 commented Nov 17, 2022

Bug description

The PrometheusRule sets differ between our workspace-clusters, as the listings below show. This might be a non-issue that results from the work to centralize alerts.

gitpod /workspace/gitpod (aledbf/wswait) $ kubectl get prometheusrule -A
NAMESPACE              NAME                                     AGE
monitoring-satellite   alertmanager-monitoring-rules            14d
monitoring-satellite   cardinality-analysis-monitoring-rules    14d
monitoring-satellite   cluster-autoscaler                       14d
monitoring-satellite   kube-prometheus-rules                    14d
monitoring-satellite   kube-state-metrics-monitoring-rules      14d
monitoring-satellite   kubernetes-monitoring-rules              14d
monitoring-satellite   node-monitoring-rules                    14d
monitoring-satellite   openvsx-proxy-monitoring-rules           14d
monitoring-satellite   prometheus-monitoring-rules              14d
monitoring-satellite   prometheus-operator-monitoring-rules     14d
monitoring-satellite   ssh-gateway-monitoring-rules             14d
monitoring-satellite   workspace-failure-slo-monitoring-rules   14d
monitoring-satellite   workspace-monitoring-rules               14d
monitoring-satellite   workspace-nodes-monitoring-rules         14d
monitoring-satellite   ws-daemon-monitoring-rules               14d
monitoring-satellite   ws-manager-monitoring-rules              14d
gitpod /workspace/gitpod (aledbf/wswait) $ kubectx us75
Switched to context "us75".
gitpod /workspace/gitpod (aledbf/wswait) $ kubectl get prometheusrule -A
NAMESPACE              NAME                                     AGE
default                workspace-failure-slo-monitoring-rules   9d
default                workspace-monitoring-rules               9d
default                workspace-nodes-monitoring-rules         9d
default                ws-daemon-monitoring-rules               9d
default                ws-manager-monitoring-rules              9d
monitoring-satellite   alertmanager-monitoring-rules            9d
monitoring-satellite   cardinality-analysis-monitoring-rules    9d
monitoring-satellite   cluster-autoscaler                       9d
monitoring-satellite   kube-prometheus-rules                    9d
monitoring-satellite   kube-state-metrics-monitoring-rules      9d
monitoring-satellite   kubernetes-monitoring-rules              9d
monitoring-satellite   node-monitoring-rules                    9d
monitoring-satellite   openvsx-proxy-monitoring-rules           9d
monitoring-satellite   prometheus-monitoring-rules              9d
monitoring-satellite   prometheus-operator-monitoring-rules     9d
monitoring-satellite   ssh-gateway-monitoring-rules             9d
monitoring-satellite   workspace-failure-slo-monitoring-rules   8d
monitoring-satellite   workspace-monitoring-rules               8d
monitoring-satellite   workspace-nodes-monitoring-rules         8d
monitoring-satellite   ws-daemon-monitoring-rules               8d
monitoring-satellite   ws-manager-monitoring-rules              8d
werft                  workspace-failure-slo-monitoring-rules   9d
werft                  workspace-monitoring-rules               9d
werft                  workspace-nodes-monitoring-rules         9d
werft                  ws-daemon-monitoring-rules               9d
werft                  ws-manager-monitoring-rules              9d
gitpod /workspace/gitpod (aledbf/wswait) $ kubectx us76
Switched to context "us76".
gitpod /workspace/gitpod (aledbf/wswait) $ kubectl get prometheusrule -A
NAMESPACE              NAME                                    AGE
monitoring-satellite   cardinality-analysis-monitoring-rules   5h29m
monitoring-satellite   cluster-autoscaler                      5h29m
monitoring-satellite   kube-prometheus-rules                   5h29m
monitoring-satellite   workspace-monitoring-rules              5h29m
monitoring-satellite   ws-daemon-monitoring-rules              5h29m

What rules are we expecting to get deployed to workspace-clusters? I suspect some of these differences may have to do with centralized alerting. @ArthurSens can you help me triage?

I assume we want these two:

https://github.com/gitpod-io/gitpod/blob/6c5f908e1e1098b6538a280e8863cac4ad54d446/operations/observability/mixins/workspace/rules/satellite/workspaces.yaml#L11

https://github.com/gitpod-io/gitpod/blob/6c5f908e1e1098b6538a280e8863cac4ad54d446/operations/observability/mixins/workspace/rules/satellite/ws-daemon.yaml#L11

And that the other 3 are either expected or come from manifests. Are any missing in us76?
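One way to answer the "are any missing" question mechanically is to diff the rule names between two clusters. The following is a minimal sketch, assuming the plain-text output of `kubectl get prometheusrule -A` has been captured per cluster; the helper names `parse_rules` and `missing_rules` are hypothetical, and the sample data is abbreviated from the listings above. It only reports differences between clusters; it does not know which rules are actually expected.

```python
def parse_rules(output: str) -> set[tuple[str, str]]:
    """Parse `kubectl get prometheusrule -A` text into (namespace, name) pairs."""
    rules = set()
    for line in output.strip().splitlines():
        fields = line.split()
        # Keep only three-column NAMESPACE/NAME/AGE rows; skip the header,
        # shell prompts, and kubectx messages.
        if len(fields) != 3 or fields[0] == "NAMESPACE":
            continue
        rules.add((fields[0], fields[1]))
    return rules


def missing_rules(reference: str, candidate: str,
                  namespace: str = "monitoring-satellite") -> list[str]:
    """Return rule names present in `reference` but absent from `candidate`
    within a single namespace."""
    ref = {name for ns, name in parse_rules(reference) if ns == namespace}
    cand = {name for ns, name in parse_rules(candidate) if ns == namespace}
    return sorted(ref - cand)


# Abbreviated samples taken from the us75 and us76 listings above.
US75 = """
NAMESPACE              NAME                                     AGE
monitoring-satellite   cluster-autoscaler                       9d
monitoring-satellite   node-monitoring-rules                    9d
monitoring-satellite   ws-daemon-monitoring-rules               8d
werft                  ws-daemon-monitoring-rules               9d
"""
US76 = """
NAMESPACE              NAME                                    AGE
monitoring-satellite   cluster-autoscaler                      5h29m
monitoring-satellite   ws-daemon-monitoring-rules              5h29m
"""

if __name__ == "__main__":
    # Rules that us75 has in monitoring-satellite but us76 does not.
    print(missing_rules(US75, US76))
```

Restricting the comparison to one namespace keeps the duplicate copies in `default` and `werft` on us75 from showing up as false differences.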

Steps to reproduce

n/a

Expected behavior

No response

Example repository

No response

Anything else?

This impacts the ability for on-call to support workspace-clusters through PagerDuty.

I noticed this when trying to push updated alerts to us75 via https://werft.gitpod-io-dev.com/job/ops-workspace-cluster-enable-alerts-main.103/raw (which failed). As a test, I pushed alerts to us76, and it worked, which is good because it will allow us to push alert changes to existing clusters.

@ArthurSens

> I suspect some of these differences may have to do with centralized alerting. @ArthurSens can you help me triage?

You're correct! Most of the alerts are no longer deployed to the local clusters; they now go to our monitoring-central cluster. That's where most of our alerting rules are evaluated and triggered.

> What rules are we expecting to get deployed to workspace-clusters?

You can expect all rules that were added to your imports


kylos101 commented Dec 7, 2022

Thanks, @ArthurSens !

@kylos101 kylos101 closed this as completed Dec 7, 2022
Repository owner moved this from 🧊Backlog to ✨Done in 🚚 Security, Infrastructure, and Delivery Team (SID) Dec 7, 2022