Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distinguish API errors in Sentry #3864

Open
gnprice opened this issue Feb 3, 2020 · 9 comments · May be fixed by #4096
Open

Distinguish API errors in Sentry #3864

gnprice opened this issue Feb 3, 2020 · 9 comments · May be fixed by #4096
Labels

Comments

@gnprice
Copy link
Member

gnprice commented Feb 3, 2020

In our Sentry logs, the most frequent "issue" by far is:

Error ApiError(src/api/apiErrors)

Flipping through the events that make it up, they turn out to be a wide variety of different errors:

Confusingly, the main dashboard, which shows one big row per "issue", includes the error message from the one most recent event, making it look like that one error is generating that many events, when in fact it's all these errors combined.

I'm not sure exactly how Sentry is deciding to aggregate these. In addition to the "Error ApiError(src/api/apiErrors)" part which they all have in common, they also all have the same stack trace, so that may be a factor; it's

Error: Cannot send to multiple streams
  at construct(app:///[native code]:0:0)
  at _construct(node_modules/@babel/runtime/helpers/construct.js:30:21)
  at Wrapper(node_modules/@babel/runtime/helpers/wrapNativeSuper.js:26:14)
  at ApiError(src/api/apiErrors.js:14:5)
  at makeErrorFromApi(src/api/apiErrors.js:36:20)
  at call(src/api/apiFetch.js:76:11)
  at tryCatch(node_modules/@babel/runtime/node_modules/regenerator-runtime/runtime.js:45:40)
[... then a bunch more framework frames, which don't vary either]

where only the message in the first line varies.

We have a lot of useful metadata about these errors, so we should be able to do pretty well at disambiguating. A few items that probably always mean a different error when they differ are:

  • the route, i.e. the path portion of the URL (like /api/v1/typing) or better yet the part after /api/v1/
  • the HTTP method (POST, GET, etc.)
  • the code value in the response (typically just BAD_REQUEST, but when it does vary it'll be informative)

Probably the route is the single most valuable piece of data to distinguish by -- that alone should distinguish most of the different errors here.

I expect resolving this will involve some reading up on the Sentry API; I think we'll likely need to use some Sentry features we don't currently use.

@gnprice gnprice added the a-tools label Feb 3, 2020
@gnprice
Copy link
Member Author

gnprice commented Feb 3, 2020

Meanwhile! I just discovered an alternate view within Sentry, which can mostly disaggregate events. It's called "Discover", and this UI is apparently Discover v2, which they just launched. Here's a view that's roughly the most recent individual error events. (There's still a little grouping, which I don't yet understand.)

That's helpful for scanning through what the most frequent types of API errors are, so it's a partial workaround for this issue. The big downside of this view, though, is that you only see on the screen the last 50 or so events, covering a short window of time; so it's highly subject to bursts, and also it's hard to see beyond the three or four most-frequent errors which dominate the view. So the distinctions called for in this issue would still be great to have, to make the Issues dashboard (which solves those problems by aggregating across time) more informative.

@rk-for-zulip
Copy link
Contributor

Oof. This is new – the error message used to be a major disaggregator. (Hence my earlier push (#3733) to keep nonessential varying data out of the error message and in the "extras" block.)

(For reference, here is the current Sentry documentation on event aggregation.)

@gnprice
Copy link
Member Author

gnprice commented Feb 4, 2020

Hmm, yeah. From those docs:

When Sentry detects a stack trace in the event data (either directly or as part of an exception), the grouping is effectively based entirely on the stack trace.

which matches what we're seeing now, but not what we were seeing when you made that earlier push. (At that time, we'd get tons of separate "issues" from the exact same stack trace when the message varied.)

They say this about changes:

Each time the default behavior is modified, Sentry releases it as a new version so it does not affect how existing issues are grouped. When you create a Sentry project, the latest and greatest version of the grouping algorithm is automatically selected. This means that the grouping behavior is consistent within a project. If you want to upgrade an existing project to a new grouping algorithm version, you can do so in the project settings. When upgrading you will very likely see new groups being created.

I don't think we've ever made such a change in the project settings, though!

It looks like this would be the UI for doing so.

@chrisbobbe
Copy link
Contributor

If I'm reading the Sentry logs correctly, it looks like all of these are only happening on Android? None of the errors would seem to be platform-specific, which is weird.

Looking for causes of #4033, which is on Android; looks like there's been a fresh report of something similar (also on Android), here.

@rk-for-zulip
Copy link
Contributor

I'm not sure offhand about the other two, but this one I can explain: it's because the user ID is embedded in the error message. This means that the error message will be unique for almost every user of this API, and so the various error reports for #3732 are not being grouped together.

Looking for other flavors of that error message shows that some do occur on iOS: for example, this one, with Bad value for 'to': [13].

@gnprice
Copy link
Member Author

gnprice commented Apr 18, 2020

If I'm reading the Sentry logs correctly, it looks like all of these are only happening on Android? None of the errors would seem to be platform-specific, which is weird.

Often (maybe always) our errors seem not to get aggregated in Sentry between Android and iOS.

In particular that applies to the omnibus "API error" Sentry error that this issue thread is about. Looks like this is the iOS version:
https://sentry.io/organizations/zulip/issues/1562540614/
while this is the Android version:
https://sentry.io/organizations/zulip/issues/1275118205/

All three of these event links:

were to different events that got attached to the Android version of the error, I guess because that must have been the error I scanned through to find them. Because they're all the same "error", they all have the same right sidebar, including the aggregates which are across different events on that error.

@rk-for-zulip
Copy link
Contributor

I expect resolving this will involve some reading up on the Sentry API; I think we'll likely need to use some Sentry features we don't currently use.

In particular, I think what we want here are explicit fingerprinting and a custom integration.

The former will do the disaggregation by whatever strings we define. However, the usual method of doing that involves a scope created at the exception-capture site, which isn't appropriate for us – our ApiErrors are handled by the uncaught-exception handler, which is exactly how we like it.

What we want is to (potentially) process every exception Sentry reports, regardless of the context in which it was caught, and give it a custom fingerprint – which is the sort of thing integrations are for.

For reference, since the source is probably the best documentation we'll get on the topic:

  • here is what Scope does with event.fingerprint;
  • here is the definition of class Integration, from which all integrations should be derived;
  • here is the source of the @sentry/javascript default integrations; and
  • here is the source of the @sentry/react-native Javascript-side default integrations.

@gnprice
Copy link
Member Author

gnprice commented Apr 23, 2020

Cool, thanks for that research! That "fingerprinting" doc looks like it's directly on point.

Down midway through the page, it offers an example using beforeSend, which looks like a bit simpler of an API for doing the same thing as a custom integration. I think either way can probably work out fine. Bit more discussion here:
https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/distinguishing.20API.20errors.20in.20Sentry/near/861545

  • the definition of class Integration, from which all integrations should be derived;

(It looks like that's actually a TS "interface", which is a structural type -- so no class derivation is necessary, just an object with properties name and setupOnce, the latter with the right signature.)

rk-for-zulip added a commit to rk-for-zulip/zulip-mobile that referenced this issue May 3, 2020
Tell the Sentry service that, when bucketing events, it should (also)
consider the error message, the error code, and the HTTP status of any
caught `ApiError`s to be distinguishing characteristics.

Fixes zulip#3864.
rk-for-zulip added a commit to rk-for-zulip/zulip-mobile that referenced this issue May 3, 2020
Tell the Sentry service that, when bucketing events, it should (also)
consider the error message, the error code, and the HTTP status of any
caught `ApiError`s to be distinguishing characteristics.

Fixes zulip#3864.
rk-for-zulip added a commit to rk-for-zulip/zulip-mobile that referenced this issue May 3, 2020
Tell the Sentry service that, when bucketing events, it should (also)
consider the error message, the error code, and the HTTP status of any
caught `ApiError`s to be important distinguishing characteristics.

Fixes zulip#3864.
rk-for-zulip added a commit to rk-for-zulip/zulip-mobile that referenced this issue May 3, 2020
Tell the Sentry service that, when bucketing events, it should (also)
consider the error message, the error code, and the HTTP status of any
caught `ApiError`s to be important distinguishing characteristics.

Fixes zulip#3864.
rk-for-zulip added a commit to rk-for-zulip/zulip-mobile that referenced this issue May 4, 2020
Tell the Sentry service that, when bucketing events, it should (also)
consider the error message, the error code, and the HTTP status of any
caught `ApiError`s to be important distinguishing characteristics.

Fixes zulip#3864.
rk-for-zulip added a commit to rk-for-zulip/zulip-mobile that referenced this issue May 5, 2020
Tell the Sentry service that, when bucketing events, it should (also)
consider the error code and HTTP status of any caught `ApiError`s to
be important distinguishing characteristics.

Partially addresses zulip#3864.

Note that, although this is a response to a change made by Sentry to
no longer use error messages as aggregator-hints (at least for custom
error types?), we don't add `.message` to the event fingerprint. The
contents of the `msg` field are mostly intended for humans, and often
contain instance-specific data [1]; when the error message _was_ used
as part of the fingerprint, we had the opposite problem as today --
namely, dozens of different Sentry issues corresponding to the same
bug.

[1] And in the future, it may even be localized: see issue zulip#3692.
rk-for-zulip added a commit to rk-for-zulip/zulip-mobile that referenced this issue May 7, 2020
Tell the Sentry service that, when bucketing events, it should (also)
consider the error code and HTTP status of any caught `ApiError`s to
be important distinguishing characteristics.

Partially addresses zulip#3864.

Note that, although this is a response to a change made by Sentry to
no longer use error messages as aggregator-hints (at least for custom
error types?), we don't add `.message` to the event fingerprint. The
contents of the `msg` field are mostly intended for humans, and often
contain instance-specific data [1]; when the error message _was_ used
as part of the fingerprint, we had the opposite problem as today --
namely, dozens of different Sentry issues corresponding to the same
bug.

[1] And in the future, it may even be localized: see issue zulip#3692.
rk-for-zulip added a commit to rk-for-zulip/zulip-mobile that referenced this issue May 7, 2020
Tell the Sentry service that, when bucketing events, it should (also)
consider the error code and HTTP status of any caught `ApiError`s to
be important distinguishing characteristics.

Partially addresses zulip#3864.

Note that, although this is a response to a change made by Sentry to
no longer use error messages as aggregator-hints (at least for custom
error types?), we don't add `.message` to the event fingerprint. The
contents of the `msg` field are mostly intended for humans, and often
contain instance-specific data [1]; when the error message _was_ used
as part of the fingerprint, we had the opposite problem as today --
namely, dozens of different Sentry issues corresponding to the same
bug.

[1] And in the future, it may even be localized: see issue zulip#3692.
rk-for-zulip added a commit to rk-for-zulip/zulip-mobile that referenced this issue May 8, 2020
Tell the Sentry service that, when bucketing events, it should (also)
consider the error code and HTTP status of any caught `ApiError`s to
be important distinguishing characteristics.

Partially addresses zulip#3864.

Note that, although this is a response to a change made by Sentry to
no longer use error messages as aggregator-hints (at least for custom
error types?), we don't add `.message` to the event fingerprint. The
contents of the `msg` field are mostly intended for humans, and often
contain instance-specific data [1]; when the error message _was_ used
as part of the fingerprint, we had the opposite problem as today --
namely, dozens of different Sentry issues corresponding to the same
bug.

[1] And in the future, it may even be localized: see issue zulip#3692.
rk-for-zulip added a commit to rk-for-zulip/zulip-mobile that referenced this issue May 8, 2020
Tell the Sentry service that, when bucketing events, it should (also)
consider the error code and HTTP status of any caught `ApiError`s to
be important distinguishing characteristics.

Partially addresses zulip#3864.

Note that, although this is a response to a change made by Sentry to
no longer use error messages as aggregator-hints (at least for custom
error types?), we don't add `.message` to the event fingerprint. The
contents of the `msg` field are mostly intended for humans, and often
contain instance-specific data [1]; when the error message _was_ used
as part of the fingerprint, we had the opposite problem as today --
namely, dozens of different Sentry issues corresponding to the same
bug.

[1] And in the future, it may even be localized: see issue zulip#3692.
@rk-for-zulip rk-for-zulip linked a pull request May 8, 2020 that will close this issue
@gnprice
Copy link
Member Author

gnprice commented Jan 30, 2021

Update on this:

  • API errors are still by far our most widespread "issue" -- "widespread" meaning they affect the greatest number of users.
  • The current omnibus "issue" they get grouped under is here. In the dashboard it says "construct([native code])"; this appears to be just an unhelpful way of referring to where our ApiError class's constructor calls the constructor for its base class Error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants