Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing monitoring to notify admins of mass recloning events #9789

Closed
slimsag opened this issue Apr 13, 2020 · 0 comments
Closed

Missing monitoring to notify admins of mass recloning events #9789

slimsag opened this issue Apr 13, 2020 · 0 comments

Comments

@slimsag
Copy link
Member

slimsag commented Apr 13, 2020

We have a metric in gitserver, src_gitserver_repos_recloned, which would warn us if Sourcegraph is trying to perform a mass-reclone of all repositories.

We should add this and alerts to the gitserver dashboard / monitoring stack, as this has caused major issues in the past for customers and gone unnoticed.

@slimsag slimsag added this to the 3.16 milestone Apr 13, 2020
@slimsag slimsag self-assigned this Apr 13, 2020
slimsag added a commit that referenced this issue Apr 13, 2020
…board

Before:

![image](https://user-images.githubusercontent.com/3173176/79102596-e2c15280-7d1f-11ea-9b87-340c7a0be0ef.png)

- `Deadline Exceeded Rate` MAY be a useful metric, but right now is far too high and random to give any useful information. Will revisit later: #9786
- `Clone Queue Size`: covered by new monitoring already.
- `Available Disk Space`: covered by new monitoring already, but only in percentages. Filed #9787
- `Echo Duration Seconds`: covered by new monitoring already.
- `Latencies for commands (seconds)`: this panel is currently misleading and broken because it operating on the rate of a bucket metric (???); filed #9788
- `Commands Running Concurrently`: covered by new monitoring already.
- `Num repos waiting on git ls-remote`: This is a heavily misleading title (makes it sound like this blocks repo actions, which it does not); covered by new monitoring already.
- `Repos Cloned Rate`: Covered more effectively by clone queue size in new monitoring.
- `Repos Recloned Rate`: Not yet covered, filed #9789
- `Repos Removed Rate`: Not yet covered, filed #9790
slimsag added a commit that referenced this issue Apr 13, 2020
* remove the deprecated "Sourcegraph Internal" > "Gitserver" dashboard

Before:

![image](https://user-images.githubusercontent.com/3173176/79105299-3da97880-7d25-11ea-94fc-7c3a902a5bd2.png)

- `P.90 by command`: A by-command breakdown is not generally useful because gitserver command latency is mostly determined by other commands running on the system at the same time (so "latency per command" is not useful except for debugging Git's performance itself). #9788 exists to track monitoring latency of all commands.
- `QPS by command` could be useful for debugging regressions or flaws that lead us to execute too many of the same command, but Jaeger is probably better suited for this and the query could be ran manually anyway. "Running git commands" already exists in new monitoring.
- `Disk space remaining`: covered by new monitoring already, but only in percentages. Filed #9787
- `Total running commands`: covered already in new monitoring.

* remove the deprecated "Sourcegraph Internal" > "Gitserver Rev 2" dashboard

Before:

![image](https://user-images.githubusercontent.com/3173176/79102596-e2c15280-7d1f-11ea-9b87-340c7a0be0ef.png)

- `Deadline Exceeded Rate` MAY be a useful metric, but right now is far too high and random to give any useful information. Will revisit later: #9786
- `Clone Queue Size`: covered by new monitoring already.
- `Available Disk Space`: covered by new monitoring already, but only in percentages. Filed #9787
- `Echo Duration Seconds`: covered by new monitoring already.
- `Latencies for commands (seconds)`: this panel is currently misleading and broken because it operating on the rate of a bucket metric (???); filed #9788
- `Commands Running Concurrently`: covered by new monitoring already.
- `Num repos waiting on git ls-remote`: This is a heavily misleading title (makes it sound like this blocks repo actions, which it does not); covered by new monitoring already.
- `Repos Cloned Rate`: Covered more effectively by clone queue size in new monitoring.
- `Repos Recloned Rate`: Not yet covered, filed #9789
- `Repos Removed Rate`: Not yet covered, filed #9790

* remove the deprecated "Sourcegraph Internal" > "Overview" dashboard

Before:

![image](https://user-images.githubusercontent.com/3173176/79107328-0b017f00-7d29-11ea-9191-3cdf42133c48.png)

- `QPS by Status Code`: too high level to be useful in any way, also very misleading because it doesn't account for the fact that 90% of our requests are GraphQL where an error is a 200.
- `P.90 request duration`: Too much cardinality to be useful
- `Memory usage [RSS]`: Arguably the most useful on this dashboard, but still too high cardinality to be useful; superseded by #9791
- `Container CPU [all cores]`: broken; superseded by #9791

* remove useless "Sourcegraph Internal" > "Zoekt Indexserver" dashboard

Honestly, there is no way this dashboard could ever be useful to anyone. How would
you interpret these numbers?

![image](https://user-images.githubusercontent.com/3173176/79107978-3fc20600-7d2a-11ea-9fd6-a7a0ed4e15f7.png)

New monitoring already has in place alerting for revision resolution, and #9783
exists to track indexing failures (not covered by this dashboard).

I am not convinced "Latencies for indexing a repo (seconds)" is a useful metric, currently, either.

* remove deprecated "Sourcegraph internal" > "Searcher" dashboard

Before:

![image](https://user-images.githubusercontent.com/3173176/79108810-f70b4c80-7d2b-11ea-8ae6-56b3d26a6d55.png)

- `Searches by status ...`: covered by "Unindexed search request errors" in new monitoring.
- `Total running requests`: not needed/useful
- `Total requests served`: not needed/useful
- `Errors [rate-10m]`: covered by "Unindexed search request errors" in new monitoring.
- `Requests [rate-10m]`: not needed/useful
- `Archive Cache ...`: Filed #9796

* remove obsolete "Sourcegraph Internal" > "HTTP" dashboard

Before:

![image](https://user-images.githubusercontent.com/3173176/79109262-d98ab280-7d2c-11ea-9b83-e3c29307e6d7.png)

- `QPS by Status Code`: Not useful, most of our requests even GraphQL are 200-with-error.
- `QPS by Route`: Not useful, as shown by screenshot.
- `P.90 request duration`: Will be superseded by #9797 & currently not useful due to verbosity.
- `P.90 Overall Route Duration`: Confusing with above; Will be superseded by #9797 & currently not useful due to verbosity.
@slimsag slimsag mentioned this issue Apr 14, 2020
13 tasks
@slimsag slimsag removed this from the 3.16 milestone Apr 24, 2020
@slimsag slimsag moved this from Needs triage to To do in OLD - Distribution - use "Distribution 🚢" Jun 3, 2020
@slimsag slimsag changed the title observability: gitserver: alert on src_gitserver_repos_recloned Missing monitoring to notify admins of mass recloning events Jun 23, 2020
@slimsag slimsag moved this from To do to Medium priority (ordered) in OLD - Distribution - use "Distribution 🚢" Jun 30, 2020
@slimsag slimsag added this to the Backlog milestone Jul 1, 2020
OLD - Distribution - use "Distribution 🚢" automation moved this from Medium priority to Done May 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Development

No branches or pull requests

2 participants