monitoring: encourage silencing, render entry for alerts w/o solutions #12731
Conversation
…n't have solutions
**Possible solutions:**

- **Get details on the exact queries that are slow** by configuring `"observability.logSlowSearches": 20,` in the site configuration and looking for `frontend` warning logs prefixed with `slow search request` for additional details.
- **Check that most repositories are indexed** by visiting https://sourcegraph.example.com/site-admin/repositories?filter=needs-index (it should show few or no results).
- **Kubernetes:** Check CPU usage of zoekt-webserver in the indexed-search pod, consider increasing CPU limits in the `indexed-search.Deployment.yaml` if regularly hitting max CPU utilization.
- **Docker Compose:** Check CPU usage on the Zoekt Web Server dashboard, consider increasing `cpus:` of the zoekt-webserver container in `docker-compose.yml` if regularly hitting max CPU utilization.

# frontend: 90th_percentile_search_request_duration

**Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration:
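For context, both `observability.*` settings referenced in this entry are plain site-configuration keys. A minimal sketch of how they might sit together in a site config, assuming the alert name used in the discussion below (the rest of the configuration is omitted):

```json
{
  "observability.logSlowSearches": 20,
  "observability.silenceAlerts": [
    "warning_frontend_99th_percentile_search_request_duration"
  ]
}
```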
Instead of this:

frontend: 99th_percentile_search_request_duration

Descriptions:

- frontend: 20s+ 99th percentile successful search request duration over 5m

Possible solutions: ...

- Docker Compose: Check CPU usage on the Zoekt Web Server dashboard, consider increasing `cpus:` of the zoekt-webserver container in `docker-compose.yml` if regularly hitting max CPU utilization.

Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration:

```json
{ "observability.silenceAlerts": [ "warning_frontend_99th_percentile_search_request_duration" ] }
```
How about this?

frontend: 99th_percentile_search_request_duration

Descriptions:

- frontend: 20s+ 99th percentile successful search request duration over 5m

Possible solutions: ...

- Docker Compose: Check CPU usage on the Zoekt Web Server dashboard, consider increasing `cpus:` of the zoekt-webserver container in `docker-compose.yml` if regularly hitting max CPU utilization.
- Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration:

  ```json
  "observability.silenceAlerts": [ "warning_frontend_99th_percentile_search_request_duration" ]
  ```
Ah, i.e. treat it as just another solution? That works 💯

For alerts that don't have solutions, do we want silencing to be the only solution, or should we prompt admins to open a ticket or contact us?
Silencing being the only solution is fine. I'm not worried about getting admins to contact us - they will.
Also note I removed the surrounding `{}` brackets, which makes it shorter, and indented the code block to match the bulleted list.
> indented the code block to match the bulleted list

docsite doesn't support this, it seems :(((
LGTM once you address my feedback (no need for re-review).
Codecov Report
```
@@           Coverage Diff           @@
##             main   #12731   +/-   ##
=======================================
  Coverage   50.82%   50.83%
  Files        1443     1443
  Lines       81281    81281
  Branches     6614     6560      -54
=======================================
+ Hits        41315    41319       +4
+ Misses      36403    36400       -3
+ Partials     3563     3562       -1
```

Flags with carried forward coverage won't be shown.
I think it would be simpler to have a header that says "silencing alerts" and tells you how to do it in a generic fashion, and we can reference/link that instead.

Silence an alert: If you are aware of an alert and want to silence notifications for it, add the following to your site configuration:

```json
{
  "observability.silenceAlerts": [
    "ALERT_NAME"
  ]
}
```

You can find the …
@pecigonzalo we do have a page for that - I think the concern is that you are unlikely to move away from the section once you've landed on it via an alert notification's solutions link. This gives admins a copy-paste solution that makes it clear we do have this capability, at very little cost (just a larger static page).
@bobheadxi I don't share that concern; if they have a link or a clear relationship to how to silence alerts, I think it's actually more likely that someone will search for "silence alerts sourcegraph" than wait for an alert to pop up. In general, I would actually not encourage silencing without an expiry, as it's likely the silence will remain there after the issue is fixed.
I'm not sure - I don't think there is a lot of awareness within our own team about this functionality, and I'd rather have this up front and center and focus on opening issues to improve alerts like @slimsag has done, rather than have us keep trying to remove alerts or have customers disable notifications altogether
I think there's a possible solution for this - when customers upgrade, we can show a dismissible prompt reminding admins that they have silenced alerts and should reassess whether they should stay. I'll come up with some wording to better promote making sure silences are followed up on (d317534), but from discussions with Stephen I think this is valuable to have at probably not a lot of cost.
Another idea here: we can simply change the name of the alert if there is a significant change in how it works. This might leave a lot of silences targeting alerts that no longer exist, but I feel like there is probably a programmatic way we can warn about this in the future - see the sketch below.
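A minimal sketch of what such a check could look like in Go, assuming plain-JSON site config and a hardcoded set of known alert names; in reality the set would come from the monitoring generator's alert definitions, and the type and helper names here are hypothetical:

```go
// Illustrative sketch: warn about entries in "observability.silenceAlerts"
// that no longer match any generated alert name.
package main

import (
	"encoding/json"
	"fmt"
)

// siteConfig models only the field this sketch cares about.
type siteConfig struct {
	SilenceAlerts []string `json:"observability.silenceAlerts"`
}

func main() {
	// Assumption: in the real system this set would be derived from the
	// monitoring generator's alert definitions, not hardcoded.
	knownAlerts := map[string]bool{
		"warning_frontend_99th_percentile_search_request_duration": true,
	}

	// Hypothetical site config with one valid and one stale silence.
	raw := []byte(`{
		"observability.silenceAlerts": [
			"warning_frontend_99th_percentile_search_request_duration",
			"warning_frontend_renamed_or_removed_alert"
		]
	}`)

	var cfg siteConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		fmt.Println("invalid site config:", err)
		return
	}

	// Flag any silence that no longer targets a known alert.
	for _, name := range cfg.SilenceAlerts {
		if !knownAlerts[name] {
			fmt.Printf("warning: silence %q targets an alert that no longer exists\n", name)
		}
	}
}
```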
closes #12236
Also adds a recommendation for how to silence each alert - see https://sourcegraph.slack.com/archives/CJX299FGE/p1596505757462700