monitoring: cadvisor recording rules seem broken #17069

Closed
bobheadxi opened this issue Jan 6, 2021 · 7 comments · Fixed by #17096, sourcegraph/deploy-sourcegraph#1572 or sourcegraph/deploy-sourcegraph#1644
Assignees
Labels
bug (An error, flaw or fault that produces an incorrect or unexpected result, or behavior.), estimate/3d, p1

Comments

@bobheadxi
Member

No data available anywhere. The dashboard queries themselves appear unchanged, and cadvisor doesn't seem to be down. The recording rule is reporting wonky names, so this might be some change in behaviour in cadvisor or Prometheus:

image

image

bobheadxi self-assigned this Jan 6, 2021
bobheadxi added the bug label Jan 6, 2021
bobheadxi added this to the Dist: 2020.12.28 milestone Jan 6, 2021
@bobheadxi
Member Author

Might be a permissions issue: google/cadvisor#2270

@bobheadxi
Member Author

bobheadxi commented Jan 6, 2021

Could be related to the container runtime change to OCI (#15194)? Though this seems to be an issue in docker-compose deployments too (e.g. demo.sourcegraph.com seems to have incomplete cadvisor names); that is a separate issue: #17072

@bobheadxi
Member Author

bobheadxi commented Jan 6, 2021

Also confirmed that the services are being picked up; they just don't have a name. https://github.com/sourcegraph/deploy-sourcegraph-dogfood-k8s-2/pull/702 experiments with setting it ourselves.

Update: this doesn't seem to be setting the name correctly
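
For anyone wanting to reproduce the "picked up, but no name" check, here is a minimal sketch (not the tooling actually used in this issue). It runs an instant query against the Prometheus HTTP API (`/api/v1/query`) and splits the returned series by whether they carry a non-empty `name` label. The Prometheus address and the choice of `container_cpu_usage_seconds_total` as the metric to inspect are assumptions; the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def fetch_series(prom_url, query):
    """Run an instant query via the Prometheus HTTP API and return the result list."""
    url = prom_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]


def split_by_name_label(series, label="name"):
    """Split series into those with a non-empty `label` and those without one."""
    named, unnamed = [], []
    for s in series:
        # A missing label and an empty-string label are both treated as "unnamed".
        (named if s.get("metric", {}).get(label) else unnamed).append(s)
    return named, unnamed


if __name__ == "__main__":
    # Assumed address; point this at whatever Prometheus instance scrapes cadvisor.
    result = fetch_series("http://localhost:9090", "container_cpu_usage_seconds_total")
    named, unnamed = split_by_name_label(result)
    print(f"{len(named)} series with a name, {len(unnamed)} without")
```

If nearly every series lands in the unnamed group, any dashboard query or recording rule that filters on `name` would plausibly come back empty, which matches the symptom described above.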

@bobheadxi
Member Author

bobheadxi commented Jan 7, 2021

Status update

As of yesterday, the following patches provided a preliminary fix for this issue:

As of now, most dashboards seem to be back to normal in k8s.sgdev.org and Cloud.

There is one last problem, described and partially fixed in sourcegraph/deploy-sourcegraph#1578, that still needs to be addressed. There is also #17072, which should already be fixed; I will verify and close that issue soon. These two issues are the only ones that might impact customers at the moment (unless they too have switched to OCI, in which case they will need this set of fixes).

A few dashboards are still broken, most notably the fs-inode dashboards. It seems there's something different about the labelling on those, which will require more investigation.
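
One way to start that investigation is to diff the label names on a metric behind a working dashboard against those on a metric behind a broken one. The sketch below is only an illustration: it takes the already-parsed `result` lists from a Prometheus `/api/v1/query` response, and the helper names are hypothetical. It does not claim to identify the actual cause here.

```python
def label_keys(series):
    """Return the set of label names that appear on any series in the list."""
    keys = set()
    for s in series:
        keys.update(s.get("metric", {}).keys())
    # The metric name itself always differs, so it is not interesting here.
    keys.discard("__name__")
    return keys


def label_diff(working, broken):
    """Return (labels only on the working metric, labels only on the broken one)."""
    w, b = label_keys(working), label_keys(broken)
    return sorted(w - b), sorted(b - w)
```

Labels that show up only on the working side are the first candidates to check against whatever the broken dashboards' queries filter or group on.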

@bobheadxi
Member Author

bobheadxi commented Jan 10, 2021

Some potentially related issues: google/cadvisor#2666, google/cadvisor#2215 - it seems that containerd support is not very fleshed out yet in cAdvisor, cc @daxmc99

@daxmc99
Contributor

daxmc99 commented Jan 13, 2021

Hmm, from the linked issue google/cadvisor#2215 it looks like the preference is to query containerd directly. Not sure of the best way to expose that yet.

bobheadxi added a commit that referenced this issue Jan 15, 2021
Adds admin and changelog documentation on some recent monitoring changes:

- #17072
- #17069
- #17158
- #17070

Dev docs also show up before admin docs if you search for "grafana" or "prometheus", so this adds notices that point back to the admin docs.