Distinguish API errors in Sentry #3864
Meanwhile! I just discovered an alternate view within Sentry, which can mostly disaggregate events. It's called "Discover", and this UI is apparently Discover v2, which they just launched. Here's a view that's roughly the most recent individual error events. (There's still a little grouping, which I don't yet understand.) That's helpful for scanning through what the most frequent types of API errors are, so it's a partial workaround for this issue. The big downside of this view, though, is that you only see on screen the last 50 or so events, covering a short window of time; so it's highly subject to bursts, and it's also hard to see beyond the three or four most-frequent errors that dominate the view. So the distinctions called for in this issue would still be great to have, to make the Issues dashboard (which solves those problems by aggregating across time) more informative.
Hmm, yeah. From those docs:
which matches what we're seeing now, but not what we were seeing when you made that earlier push. (At that time, we'd get tons of separate "issues" from the exact same stack trace when the message varied.) They say this about changes:
I don't think we've ever made such a change in the project settings, though! It looks like this would be the UI for doing so.
If I'm reading the Sentry logs correctly, it looks like all of these are happening only on Android? None of the errors would seem to be platform-specific, which is weird. Looking for causes of #4033, which is on Android; looks like there's been a fresh report of something similar (also on Android), here.
I'm not sure offhand about the other two, but this one I can explain: it's because the user ID is embedded in the error message. This means that the error message will be unique for almost every user of this API, and so the various error reports for #3732 are not being grouped together. Looking for other flavors of that error message shows that some do occur on iOS: for example, this one.
Often (maybe always) our errors seem not to get aggregated in Sentry between Android and iOS. In particular that applies to the omnibus "API error" Sentry error that this issue thread is about. Looks like this is the iOS version.

All three of these event links were to different events that got attached to the Android version of the error -- I guess because that must have been the error I scanned through to find them. Because they're all the same "error", they all have the same right sidebar, including the aggregates, which are across different events on that error.
In particular, I think what we want here are explicit fingerprinting and a custom integration. The former will do the disaggregation by whatever strings we define. However, the usual method of doing that involves a scope created at the exception-capture site, which isn't appropriate for us. What we want is to (potentially) process every exception Sentry reports, regardless of the context in which it was caught, and give it a custom fingerprint -- which is the sort of thing integrations are for. For reference, since the source is probably the best documentation we'll get on the topic:
Cool, thanks for that research! That "fingerprinting" doc looks like it's directly on point. Down midway through the page, it offers an example using
(It looks like that's actually a TS "interface", which is a structural type -- so no class derivation is necessary, just an object with the right properties.)
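For what it's worth, here's a minimal TypeScript sketch of how a hook of that shape might set a custom fingerprint. The `SentryEvent`/`EventHint` types and the `ApiError` fields below are stand-in assumptions, not the real SDK definitions; `'{{ default }}'` is Sentry's marker meaning "keep the normal grouping inputs and append ours":

```typescript
// Minimal stand-ins for the SDK's event/hint shapes (assumptions,
// not the actual @sentry/* type definitions).
type SentryEvent = { fingerprint?: string[] };
type EventHint = { originalException?: unknown };

// Hypothetical ApiError shape, per the fields discussed in this thread.
type ApiError = Error & { code: string; httpStatus: number };

function isApiError(e: unknown): e is ApiError {
  return e instanceof Error && 'code' in e && 'httpStatus' in e;
}

// A beforeSend-style hook: for ApiErrors, extend the default
// fingerprint with our distinguishing fields.
function beforeSend(event: SentryEvent, hint: EventHint): SentryEvent {
  const e = hint.originalException;
  if (isApiError(e)) {
    event.fingerprint = ['{{ default }}', e.code, String(e.httpStatus)];
  }
  return event;
}
```

Since it only reads fields off the exception and returns the event, a hook like this can run on every report regardless of where the exception was caught, which is the property we want.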
Tell the Sentry service that, when bucketing events, it should (also) consider the error message, the error code, and the HTTP status of any caught `ApiError`s to be distinguishing characteristics. Fixes zulip#3864.
Tell the Sentry service that, when bucketing events, it should (also) consider the error code and HTTP status of any caught `ApiError`s to be important distinguishing characteristics. Partially addresses zulip#3864.

Note that, although this is a response to a change made by Sentry to no longer use error messages as aggregator-hints (at least for custom error types?), we don't add `.message` to the event fingerprint. The contents of the `msg` field are mostly intended for humans, and often contain instance-specific data [1]; when the error message _was_ used as part of the fingerprint, we had the opposite problem from today's -- namely, dozens of different Sentry issues corresponding to the same bug.

[1] And in the future, it may even be localized: see issue zulip#3692.
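As a small illustration of that design choice (the `ApiErrorLike` shape here is a hypothetical stand-in for the fields discussed in this thread, not the actual implementation): two reports that differ only in instance-specific message data map to the same grouping key, because the key deliberately ignores `.message`.

```typescript
// Sketch: a grouping key built only from code + HTTP status.
// Messages embed instance-specific data (e.g. user IDs), so keying
// on them would explode one bug into many Sentry issues.
type ApiErrorLike = { message: string; code: string; httpStatus: number };

function fingerprintOf(e: ApiErrorLike): string[] {
  return ['{{ default }}', e.code, String(e.httpStatus)];
}

// Two reports of the same bug, differing only in an embedded user ID:
const a = { message: 'Invalid user ID: 1234', code: 'BAD_REQUEST', httpStatus: 400 };
const b = { message: 'Invalid user ID: 9876', code: 'BAD_REQUEST', httpStatus: 400 };
// fingerprintOf(a) and fingerprintOf(b) are identical, so both land
// in the same Sentry bucket.
```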
Update on this:
In our Sentry logs, the most frequent "issue" by far is:
Flipping through the events that make it up, they turn out to be a wide variety of different errors:

- `/typing` -- so "Typing notifications are mistyped, resulting in rejection (for servers < 2.0)" #3732 (event)
- `/messages`, with e.g. `subject=%20` -- so "Messages with whitespace-only topics should not be sent" #3743
- `/users/me/android_gcm_reg_id` (event)
- `/messages` (event; looks to be "Can't send messages to streams with commas in the name" #3729)

Confusingly, the main dashboard, which shows one big row per "issue", includes the error message from the one most recent event, making it look like that one error is generating that many events, when in fact it's all these errors combined.
I'm not sure exactly how Sentry is deciding to aggregate these. In addition to the "Error ApiError(src/api/apiErrors)" part which they all have in common, they also all have the same stack trace, so that may be a factor; only the message in the first line varies.
We have a lot of useful metadata about these errors, so we should be able to do pretty well at disambiguating. A few items that probably always mean a different error when they differ are:
- the route (e.g. `/api/v1/typing`), or better yet the part after `/api/v1/`
- the `code` value in the response (typically just `BAD_REQUEST`, but when it does vary it'll be informative)

Probably the route is the single most valuable piece of data to distinguish by -- that alone should distinguish most of the different errors here.
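To illustrate the first item, here's a minimal sketch under the assumption that we have the raw request path in hand (`routeKey` is a hypothetical helper, not existing code):

```typescript
// Sketch: reduce a request path to the part after `/api/v1/`, so
// that e.g. `/api/v1/typing` and `/api/v1/messages` become distinct,
// compact keys to distinguish by. The `/api/v1/` prefix is an
// assumption based on the routes quoted above.
function routeKey(path: string): string {
  const prefix = '/api/v1/';
  return path.startsWith(prefix) ? path.slice(prefix.length) : path;
}
```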
I expect resolving this will involve some reading up on the Sentry API; I think we'll likely need to use some Sentry features we don't currently use.