Missing monitoring to notify admins of mass recloning events #9789

slimsag · 2020-04-13T08:18:48Z

We have a metric in gitserver, src_gitserver_repos_recloned, which would warn us if Sourcegraph is trying to perform a mass-reclone of all repositories.

We should add this and alerts to the gitserver dashboard / monitoring stack, as this has caused major issues in the past for customers and gone unnoticed.

The text was updated successfully, but these errors were encountered:

…board Before: ![image](https://user-images.githubusercontent.com/3173176/79102596-e2c15280-7d1f-11ea-9b87-340c7a0be0ef.png) - `Deadline Exceeded Rate` MAY be a useful metric, but right now is far too high and random to give any useful information. Will revisit later: #9786 - `Clone Queue Size`: covered by new monitoring already. - `Available Disk Space`: covered by new monitoring already, but only in percentages. Filed #9787 - `Echo Duration Seconds`: covered by new monitoring already. - `Latencies for commands (seconds)`: this panel is currently misleading and broken because it operating on the rate of a bucket metric (???); filed #9788 - `Commands Running Concurrently`: covered by new monitoring already. - `Num repos waiting on git ls-remote`: This is a heavily misleading title (makes it sound like this blocks repo actions, which it does not); covered by new monitoring already. - `Repos Cloned Rate`: Covered more effectively by clone queue size in new monitoring. - `Repos Recloned Rate`: Not yet covered, filed #9789 - `Repos Removed Rate`: Not yet covered, filed #9790

* remove the deprecated "Sourcegraph Internal" > "Gitserver" dashboard Before: ![image](https://user-images.githubusercontent.com/3173176/79105299-3da97880-7d25-11ea-94fc-7c3a902a5bd2.png) - `P.90 by command`: A by-command breakdown is not generally useful because gitserver command latency is mostly determined by other commands running on the system at the same time (so "latency per command" is not useful except for debugging Git's performance itself). #9788 exists to track monitoring latency of all commands. - `QPS by command` could be useful for debugging regressions or flaws that lead us to execute too many of the same command, but Jaeger is probably better suited for this and the query could be ran manually anyway. "Running git commands" already exists in new monitoring. - `Disk space remaining`: covered by new monitoring already, but only in percentages. Filed #9787 - `Total running commands`: covered already in new monitoring. * remove the deprecated "Sourcegraph Internal" > "Gitserver Rev 2" dashboard Before: ![image](https://user-images.githubusercontent.com/3173176/79102596-e2c15280-7d1f-11ea-9b87-340c7a0be0ef.png) - `Deadline Exceeded Rate` MAY be a useful metric, but right now is far too high and random to give any useful information. Will revisit later: #9786 - `Clone Queue Size`: covered by new monitoring already. - `Available Disk Space`: covered by new monitoring already, but only in percentages. Filed #9787 - `Echo Duration Seconds`: covered by new monitoring already. - `Latencies for commands (seconds)`: this panel is currently misleading and broken because it operating on the rate of a bucket metric (???); filed #9788 - `Commands Running Concurrently`: covered by new monitoring already. - `Num repos waiting on git ls-remote`: This is a heavily misleading title (makes it sound like this blocks repo actions, which it does not); covered by new monitoring already. - `Repos Cloned Rate`: Covered more effectively by clone queue size in new monitoring. - `Repos Recloned Rate`: Not yet covered, filed #9789 - `Repos Removed Rate`: Not yet covered, filed #9790 * remove the deprecated "Sourcegraph Internal" > "Overview" dashboard Before: ![image](https://user-images.githubusercontent.com/3173176/79107328-0b017f00-7d29-11ea-9191-3cdf42133c48.png) - `QPS by Status Code`: too high level to be useful in any way, also very misleading because it doesn't account for the fact that 90% of our requests are GraphQL where an error is a 200. - `P.90 request duration`: Too much cardinality to be useful - `Memory usage [RSS]`: Arguably the most useful on this dashboard, but still too high cardinality to be useful; superseded by #9791 - `Container CPU [all cores]`: broken; superseded by #9791 * remove useless "Sourcegraph Internal" > "Zoekt Indexserver" dashboard Honestly, there is no way this dashboard could ever be useful to anyone. How would you interpret these numbers? ![image](https://user-images.githubusercontent.com/3173176/79107978-3fc20600-7d2a-11ea-9fd6-a7a0ed4e15f7.png) New monitoring already has in place alerting for revision resolution, and #9783 exists to track indexing failures (not covered by this dashboard). I am not convinced "Latencies for indexing a repo (seconds)" is a useful metric, currently, either. * remove deprecated "Sourcegraph internal" > "Searcher" dashboard Before: ![image](https://user-images.githubusercontent.com/3173176/79108810-f70b4c80-7d2b-11ea-8ae6-56b3d26a6d55.png) - `Searches by status ...`: covered by "Unindexed search request errors" in new monitoring. - `Total running requests`: not needed/useful - `Total requests served`: not needed/useful - `Errors [rate-10m]`: covered by "Unindexed search request errors" in new monitoring. - `Requests [rate-10m]`: not needed/useful - `Archive Cache ...`: Filed #9796 * remove obsolete "Sourcegraph Internal" > "HTTP" dashboard Before: ![image](https://user-images.githubusercontent.com/3173176/79109262-d98ab280-7d2c-11ea-9b83-e3c29307e6d7.png) - `QPS by Status Code`: Not useful, most of our requests even GraphQL are 200-with-error. - `QPS by Route`: Not useful, as shown by screenshot. - `P.90 request duration`: Will be superseded by #9797 & currently not useful due to verbosity. - `P.90 Overall Route Duration`: Confusing with above; Will be superseded by #9797 & currently not useful due to verbosity.

slimsag added team/distribution monitoring labels Apr 13, 2020

slimsag added this to the 3.16 milestone Apr 13, 2020

slimsag self-assigned this Apr 13, 2020

slimsag mentioned this issue Apr 14, 2020

Missing prometheus metrics #6428

Closed

13 tasks

slimsag added the gitserver label Apr 17, 2020

beyang mentioned this issue Apr 21, 2020

distribution: 3.16 tracking issue #10069

Closed

47 tasks

slimsag added estimate/0.5d monitoring-long-tail labels Apr 22, 2020

slimsag removed this from the 3.16 milestone Apr 24, 2020

slimsag mentioned this issue Apr 24, 2020

Tracking Issue: Long tail of monitoring metric improvements #10164

Closed

22 tasks

slimsag added this to Needs triage in OLD - Distribution - use "Distribution 🚢" Jun 3, 2020

slimsag moved this from Needs triage to To do in OLD - Distribution - use "Distribution 🚢" Jun 3, 2020

slimsag changed the title ~~observability: gitserver: alert on src_gitserver_repos_recloned~~ Missing monitoring to notify admins of mass recloning events Jun 23, 2020

slimsag moved this from To do to Medium priority (ordered) in OLD - Distribution - use "Distribution 🚢" Jun 30, 2020

slimsag added this to the Backlog milestone Jul 1, 2020

slimsag mentioned this issue Oct 22, 2020

zoekt: improve grafana dashboard #14858

Closed

dan-mckean closed this as completed May 26, 2021

OLD - Distribution - use "Distribution 🚢" automation moved this from Medium priority to Done May 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Missing monitoring to notify admins of mass recloning events #9789

Missing monitoring to notify admins of mass recloning events #9789

slimsag commented Apr 13, 2020 •

edited

Missing monitoring to notify admins of mass recloning events #9789

Missing monitoring to notify admins of mass recloning events #9789

Comments

slimsag commented Apr 13, 2020 • edited

slimsag commented Apr 13, 2020 •

edited