
monitoring: encourage silencing, render entry for alerts w/o solutions #12731

Merged: 9 commits into main from monitoring/more-silencing, Aug 6, 2020

Conversation

@bobheadxi (Member):

Closes #12236.

Also adds a recommendation for how to silence each alert; see https://sourcegraph.slack.com/archives/CJX299FGE/p1596505757462700

@bobheadxi bobheadxi requested a review from slimsag as a code owner August 5, 2020 03:59
@bobheadxi bobheadxi requested a review from a team August 5, 2020 03:59
**Possible solutions:**

- **Get details on the exact queries that are slow** by configuring `"observability.logSlowSearches": 20,` in the site configuration (a minimal example follows this list) and looking for `frontend` warning logs prefixed with `slow search request` for additional details.
- **Check that most repositories are indexed** by visiting https://sourcegraph.example.com/site-admin/repositories?filter=needs-index (it should show few or no results).
- **Kubernetes:** Check CPU usage of zoekt-webserver in the indexed-search pod, consider increasing CPU limits in the `indexed-search.Deployment.yaml` if regularly hitting max CPU utilization.
- **Docker Compose:** Check CPU usage on the Zoekt Web Server dashboard, consider increasing `cpus:` of the zoekt-webserver container in `docker-compose.yml` if regularly hitting max CPU utilization.
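
For reference, a minimal site-configuration snippet with that key set, assuming no other observability settings alongside it:

```json
{
  "observability.logSlowSearches": 20
}
```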

# frontend: 90th_percentile_search_request_duration
**Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration:
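
Following the naming pattern shown in the review comment below, the elided configuration block would plausibly read as follows; the exact alert name is an assumption:

```json
{
  "observability.silenceAlerts": [
    "warning_frontend_90th_percentile_search_request_duration"
  ]
}
```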
@slimsag (Member) commented Aug 5, 2020:

Instead of this:

frontend: 99th_percentile_search_request_duration

Descriptions:

  • frontend: 20s+ 99th percentile successful search request duration over 5m
    Possible solutions:

  • ...

  • Docker Compose: Check CPU usage on the Zoekt Web Server dashboard, consider increasing cpus: of the zoekt-webserver container in docker-compose.yml if regularly hitting max CPU utilization.

Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration:

{
  "observability.silenceAlerts": [
    "warning_frontend_99th_percentile_search_request_duration"
  ]
}

How about this?

frontend: 99th_percentile_search_request_duration

Descriptions:

  • frontend: 20s+ 99th percentile successful search request duration over 5m
    Possible solutions:

  • ...

  • Docker Compose: Check CPU usage on the Zoekt Web Server dashboard, consider increasing cpus: of the zoekt-webserver container in docker-compose.yml if regularly hitting max CPU utilization.

  • Silence this alert: If you are aware of this alert and want to silence notifications for it, add the following to your site configuration:

      "observability.silenceAlerts": [
        "warning_frontend_99th_percentile_search_request_duration"
      ]

@bobheadxi (Member, Author) replied:

Ah, i.e. treat it as just another solution? That works 💯

For alerts that don't have solutions, do we want silencing to be the only solution, or should we prompt admins to open a ticket or contact us?

@slimsag (Member) replied:

Silencing being the only solution is fine. I'm not worried about getting admins to contact us - they will.

@slimsag (Member) replied:

Also note I removed the surrounding {} brackets, which makes it shorter, and indented the code block to match the bulleted list.

@bobheadxi (Member, Author) replied:

> indented the code block to match the bulleted list

docsite doesn't support this, it seems :(((

@slimsag (Member) left a review:

LGTM once you address my feedback (no need for re-review).

@codecov bot commented Aug 5, 2020:

Codecov Report

Merging #12731 into main will increase coverage by 0.00%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main   #12731   +/-   ##
=======================================
  Coverage   50.82%   50.83%           
=======================================
  Files        1443     1443           
  Lines       81281    81281           
  Branches     6614     6560   -54     
=======================================
+ Hits        41315    41319    +4     
+ Misses      36403    36400    -3     
+ Partials     3563     3562    -1     
| Flag | Coverage Δ |
|---|---|
| #go | 52.20% <ø> (+<0.01%) ⬆️ |
| #integration | 24.20% <ø> (+0.01%) ⬆️ |
| #storybook | 14.11% <ø> (ø) |
| #typescript | 47.07% <ø> (+<0.01%) ⬆️ |
| #unit | 47.51% <ø> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| .../internal/codeintel/resolvers/graphql/locations.go | 85.45% <0.00%> (ø) |
| ...odeintel/bundles/persistence/cache/reader_cache.go | 93.75% <0.00%> (+2.08%) ⬆️ |
| web/src/settings/SettingsPage.tsx | 76.47% <0.00%> (+11.76%) ⬆️ |

@keegancsmith keegancsmith changed the base branch from master to main August 5, 2020 07:49
@pecigonzalo (Contributor) commented Aug 5, 2020:

I think it would be simpler to have a header that says "silencing alerts" and tells you how to do it in a generic fashion, and we can reference/link that instead.


Silence an alert

If you are aware of an alert and want to silence notifications for it, add the following to your site configuration:

{
  "observability.silenceAlerts": [
    "ALERT_NAME"
  ]
}

You can find the ALERT_NAME on lorem ipsum

@bobheadxi (Member, Author) commented Aug 5, 2020:

@pecigonzalo we do have a page for that - I think the concern is you are unlikely to move away from the section once you've landed on it via an alert notification's solutions link. This gives admins a copy-paste solution that makes it clear we do have this capability, at very little cost (just a larger static page)

@pecigonzalo (Contributor) replied:

@bobheadxi I don't share that concern; if they have a link or a clear relationship to how to silence alerts, I think it's actually more likely that someone will search for "silence alerts sourcegraph" than wait for an alert to pop up.
If we want to point them in the right direction, we could link to the "how to silence" page/section from the alert.
If it's hard to find the sections and understand our documentation on how to monitor and manage alerts, we should fix that instead.

In general, I would actually not encourage silencing without an expiry, as it's likely the silence will remain there after the issue is fixed.

@bobheadxi (Member, Author) commented Aug 6, 2020:

> I think it's actually more likely that someone will search for "silence alerts sourcegraph" than wait for an alert to pop up.

I'm not sure - I don't think there is a lot of awareness of this functionality within our own team, and I'd rather have this front and center and focus on opening issues to improve alerts like @slimsag has done, rather than have us keep trying to remove alerts or have customers disable notifications altogether.

> In general, I would actually not encourage silencing without an expiry, as it's likely the silence will remain there after the issue is fixed.

I think there's a possible solution for this - when customers upgrade, we can show a dismissible prompt reminding admins that they have silenced alerts and should reassess whether those silences should stay.

I'll come up with some wording to better promote making sure silences are followed up on (d317534), but from discussions with Stephen I think this is valuable to have at probably not a lot of cost.
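
A minimal Go sketch of that upgrade-time reminder idea; the function and type names here are assumptions for illustration, not Sourcegraph's actual internals:

```go
// Hypothetical sketch: after an upgrade, surface a dismissible notice to
// site admins if any alerts are still silenced in the site configuration.
package main

import "fmt"

type notice struct {
	Message     string
	Dismissible bool
}

// silencedAlertsReminder returns a notice when the version changed while
// alerts are silenced, prompting admins to reassess each silence.
func silencedAlertsReminder(previousVersion, currentVersion string, silenced []string) *notice {
	if previousVersion == currentVersion || len(silenced) == 0 {
		return nil
	}
	return &notice{
		Message: fmt.Sprintf(
			"Sourcegraph was upgraded from %s to %s with %d silenced alert(s) in \"observability.silenceAlerts\" - please reassess whether each silence should stay.",
			previousVersion, currentVersion, len(silenced)),
		Dismissible: true,
	}
}

func main() {
	silenced := []string{"warning_frontend_99th_percentile_search_request_duration"}
	if n := silencedAlertsReminder("3.18.0", "3.19.0", silenced); n != nil {
		fmt.Println(n.Message)
	}
}
```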

@bobheadxi (Member, Author) commented:

> In general, I would actually not encourage silencing without an expiry, as it's likely the silence will remain there after the issue is fixed.

Another idea here - we can simply change the name of the alert if there is a significant change in how it works. This might leave a lot of silences targeting alerts that no longer exist, but I feel like there is probably a programmatic way we can warn about this in the future.
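
A minimal Go sketch of such a programmatic warning, assuming the monitoring generator can enumerate the current alert names; all identifiers here are hypothetical:

```go
// Hypothetical sketch: flag "observability.silenceAlerts" entries that no
// longer match any alert defined in the current release, so stale silences
// (e.g. targeting renamed or removed alerts) can be cleaned up.
package main

import "fmt"

// checkStaleSilences returns a warning for each configured silence that
// does not correspond to a known alert name.
func checkStaleSilences(silenced []string, knownAlerts map[string]bool) []string {
	var warnings []string
	for _, name := range silenced {
		if !knownAlerts[name] {
			warnings = append(warnings, fmt.Sprintf(
				"observability.silenceAlerts entry %q does not match any current alert - it may refer to a renamed or removed alert and can likely be dropped", name))
		}
	}
	return warnings
}

func main() {
	known := map[string]bool{
		"warning_frontend_99th_percentile_search_request_duration": true,
	}
	silenced := []string{
		"warning_frontend_99th_percentile_search_request_duration",
		"warning_frontend_some_removed_alert",
	}
	for _, w := range checkStaleSilences(silenced, known) {
		fmt.Println(w)
	}
}
```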

@bobheadxi bobheadxi merged commit cdd1b77 into main Aug 6, 2020
@bobheadxi bobheadxi deleted the monitoring/more-silencing branch August 6, 2020 08:20
Development

Successfully merging this pull request may close these issues.

monitoring: better advice for alerts that do not have an entry in alert_solutions