Distinguish API errors in Sentry #3864

gnprice · 2020-02-03T19:37:09Z

In our Sentry logs, the most frequent "issue" by far is:

Error ApiError(src/api/apiErrors)

Flipping through the events that make it up, they turn out to be a wide variety of different errors:

Bad value for 'to': [30] (on POST /typing -- so Typing notifications are mistyped, resulting in rejection (for servers < 2.0) #3732; event)
Topic can't be empty (on POST /messages, with e.g. subject=%20 -- so Messages with whitespace-only topics should not be sent #3743)
Zulip server auth failure: f2f0[... a UUID] is not registered (on POST /users/me/android_gcm_reg_id; event)
Cannot send to multiple streams (on POST /messages; event; looks to be Can't send messages to streams with commas in the name #3729)
... that covers the last few dozen of these events, but there are probably other errors that are just a bit less frequent.

Confusingly, the main dashboard, which shows one big row per "issue", includes the error message from the one most recent event, making it look like that one error is generating that many events, when in fact it's all these errors combined.

I'm not sure exactly how Sentry is deciding to aggregate these. In addition to the "Error ApiError(src/api/apiErrors)" part which they all have in common, they also all have the same stack trace, so that may be a factor; it's

Error: Cannot send to multiple streams
  at construct(app:///[native code]:0:0)
  at _construct(node_modules/@babel/runtime/helpers/construct.js:30:21)
  at Wrapper(node_modules/@babel/runtime/helpers/wrapNativeSuper.js:26:14)
  at ApiError(src/api/apiErrors.js:14:5)
  at makeErrorFromApi(src/api/apiErrors.js:36:20)
  at call(src/api/apiFetch.js:76:11)
  at tryCatch(node_modules/@babel/runtime/node_modules/regenerator-runtime/runtime.js:45:40)
[... then a bunch more framework frames, which don't vary either]

where only the message in the first line varies.

We have a lot of useful metadata about these errors, so we should be able to do pretty well at disambiguating. A few items that probably always mean a different error when they differ are:

the route, i.e. the path portion of the URL (like /api/v1/typing) or better yet the part after /api/v1/
the HTTP method (POST, GET, etc.)
the code value in the response (typically just BAD_REQUEST, but when it does vary it'll be informative)

Probably the route is the single most valuable piece of data to distinguish by -- that alone should distinguish most of the different errors here.

I expect resolving this will involve some reading up on the Sentry API; I think we'll likely need to use some Sentry features we don't currently use.

The text was updated successfully, but these errors were encountered:

gnprice · 2020-02-03T21:03:59Z

Meanwhile! I just discovered an alternate view within Sentry, which can mostly disaggregate events. It's called "Discover", and this UI is apparently Discover v2, which they just launched. Here's a view that's roughly the most recent individual error events. (There's still a little grouping, which I don't yet understand.)

That's helpful for scanning through what the most frequent types of API errors are, so it's a partial workaround for this issue. The big downside of this view, though, is that you only see on the screen the last 50 or so events, covering a short window of time; so it's highly subject to bursts, and also it's hard to see beyond the three or four most-frequent errors which dominate the view. So the distinctions called for in this issue would still be great to have, to make the Issues dashboard (which solves those problems by aggregating across time) more informative.

rk-for-zulip · 2020-02-03T23:57:44Z

Oof. This is new – the error message used to be a major disaggregator. (Hence my earlier push (#3733) to keep nonessential varying data out of the error message and in the "extras" block.)

(For reference, here is the current Sentry documentation on event aggregation.)

gnprice · 2020-02-04T00:22:50Z

Hmm, yeah. From those docs:

When Sentry detects a stack trace in the event data (either directly or as part of an exception), the grouping is effectively based entirely on the stack trace.

which matches what we're seeing now, but not what we were seeing when you made that earlier push. (At that time, we'd get tons of separate "issues" from the exact same stack trace when the message varied.)

They say this about changes:

Each time the default behavior is modified, Sentry releases it as a new version so it does not affect how existing issues are grouped. When you create a Sentry project, the latest and greatest version of the grouping algorithm is automatically selected. This means that the grouping behavior is consistent within a project. If you want to upgrade an existing project to a new grouping algorithm version, you can do so in the project settings. When upgrading you will very likely see new groups being created.

I don't think we've ever made such a change in the project settings, though!

It looks like this would be the UI for doing so.

chrisbobbe · 2020-04-18T00:57:10Z

Bad value for 'to': [30] (on POST /typing -- so Typing notifications are mistyped, resulting in rejection (for servers < 2.0) #3732; event)

Zulip server auth failure: f2f0[... a UUID] is not registered (on POST /users/me/android_gcm_reg_id; event)

Cannot send to multiple streams (on POST /messages; event; looks to be Can't send messages to streams with commas in the name #3729)

If I'm reading the Sentry logs correctly, it looks like all of these are only happening on Android? None of the errors would seem to be platform-specific, which is weird.

Looking for causes of #4033, which is on Android; looks like there's been a fresh report of something similar (also on Android), here.

rk-for-zulip · 2020-04-18T01:05:11Z

Bad value for 'to': [30] (on POST /typing -- so Typing notifications are mistyped, resulting in rejection (for servers < 2.0) #3732; event)

I'm not sure offhand about the other two, but this one I can explain: it's because the user ID is embedded in the error message. This means that the error message will be unique for almost every user of this API, and so the various error reports for #3732 are not being grouped together.

Looking for other flavors of that error message shows that some do occur on iOS: for example, this one, with Bad value for 'to': [13].

gnprice · 2020-04-18T02:12:54Z

If I'm reading the Sentry logs correctly, it looks like all of these are only happening on Android? None of the errors would seem to be platform-specific, which is weird.

Often (maybe always) our errors seem not to get aggregated in Sentry between Android and iOS.

In particular that applies to the omnibus "API error" Sentry error that this issue thread is about. Looks like this is the iOS version:
https://sentry.io/organizations/zulip/issues/1562540614/
while this is the Android version:
https://sentry.io/organizations/zulip/issues/1275118205/

All three of these event links:

Bad value for 'to': [30] (on POST /typing -- so Typing notifications are mistyped, resulting in rejection (for servers < 2.0) #3732; event)

Zulip server auth failure: f2f0[... a UUID] is not registered (on POST /users/me/android_gcm_reg_id; event)

Cannot send to multiple streams (on POST /messages; event; looks to be Can't send messages to streams with commas in the name #3729)

were to different events that got attached to the Android version of the error, I guess because that must have been the error I scanned through to find them. Because they're all the same "error", they all have the same right sidebar, including the aggregates which are across different events on that error.

rk-for-zulip · 2020-04-23T18:06:03Z

I expect resolving this will involve some reading up on the Sentry API; I think we'll likely need to use some Sentry features we don't currently use.

In particular, I think what we want here are explicit fingerprinting and a custom integration.

The former will do the disaggregation by whatever strings we define. However, the usual method of doing that involves a scope created at the exception-capture site, which isn't appropriate for us – our ApiErrors are handled by the uncaught-exception handler, which is exactly how we like it.

What we want is to (potentially) process every exception Sentry reports, regardless of the context in which it was caught, and give it a custom fingerprint – which is the sort of thing integrations are for.

For reference, since the source is probably the best documentation we'll get on the topic:

here is what Scope does with event.fingerprint;
here is the definition of class Integration, from which all integrations should be derived;
here is the source of the @sentry/javascript default integrations; and
here is the source of the @sentry/react-native Javascript-side default integrations.

gnprice · 2020-04-23T21:41:07Z

Cool, thanks for that research! That "fingerprinting" doc looks like it's directly on point.

Down midway through the page, it offers an example using beforeSend, which looks like a bit simpler of an API for doing the same thing as a custom integration. I think either way can probably work out fine. Bit more discussion here:
https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/distinguishing.20API.20errors.20in.20Sentry/near/861545

the definition of class Integration, from which all integrations should be derived;

(It looks like that's actually a TS "interface", which is a structural type -- so no class derivation is necessary, just an object with properties name and setupOnce, the latter with the right signature.)

Tell the Sentry service that, when bucketing events, it should (also) consider the error message, the error code, and the HTTP status of any caught `ApiError`s to be distinguishing characteristics. Fixes zulip#3864.

Tell the Sentry service that, when bucketing events, it should (also) consider the error message, the error code, and the HTTP status of any caught `ApiError`s to be important distinguishing characteristics. Fixes zulip#3864.

Tell the Sentry service that, when bucketing events, it should (also) consider the error code and HTTP status of any caught `ApiError`s to be important distinguishing characteristics. Partially addresses zulip#3864. Note that, although this is a response to a change made by Sentry to no longer use error messages as aggregator-hints (at least for custom error types?), we don't add `.message` to the event fingerprint. The contents of the `msg` field are mostly intended for humans, and often contain instance-specific data [1]; when the error message _was_ used as part of the fingerprint, we had the opposite problem as today -- namely, dozens of different Sentry issues corresponding to the same bug. [1] And in the future, it may even be localized: see issue zulip#3692.

gnprice · 2021-01-30T04:58:39Z

Update on this:

API errors are still by far our most widespread "issue" -- "widespread" meaning they affect the greatest number of users.
The current omnibus "issue" they get grouped under is here. In the dashboard it says "construct([native code])"; this appears to be just an unhelpful way of referring to where our ApiError class's constructor calls the constructor for its base class Error.

gnprice added the a-tools label Feb 3, 2020

chrisbobbe mentioned this issue Apr 18, 2020

Long, sometimes endless loading in message list #4033

Open

gnprice assigned rk-for-zulip May 1, 2020

rk-for-zulip mentioned this issue May 3, 2020

Tag API errors to distinguish them in Sentry #4086

Closed

rk-for-zulip linked a pull request May 8, 2020 that will close this issue

[NEEDS iOS TESTING] API errors in Sentry, v2 #4096

Open

gnprice mentioned this issue Jan 30, 2021

Tried to use NavigationService before appContainerRef was set #4453

Closed

chrisbobbe unassigned rk-for-zulip Jul 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Distinguish API errors in Sentry #3864

Distinguish API errors in Sentry #3864

gnprice commented Feb 3, 2020 •

edited

Loading

gnprice commented Feb 3, 2020

rk-for-zulip commented Feb 3, 2020

gnprice commented Feb 4, 2020

chrisbobbe commented Apr 18, 2020

rk-for-zulip commented Apr 18, 2020

gnprice commented Apr 18, 2020

rk-for-zulip commented Apr 23, 2020

gnprice commented Apr 23, 2020

gnprice commented Jan 30, 2021

Distinguish API errors in Sentry #3864

Distinguish API errors in Sentry #3864

Comments

gnprice commented Feb 3, 2020 • edited Loading

gnprice commented Feb 3, 2020

rk-for-zulip commented Feb 3, 2020

gnprice commented Feb 4, 2020

chrisbobbe commented Apr 18, 2020

rk-for-zulip commented Apr 18, 2020

gnprice commented Apr 18, 2020

rk-for-zulip commented Apr 23, 2020

gnprice commented Apr 23, 2020

gnprice commented Jan 30, 2021

gnprice commented Feb 3, 2020 •

edited

Loading