Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: enable Datadog APM error tracking with a tracing layer #1626

Merged
merged 18 commits into from Feb 19, 2024

Conversation

oddgrd
Copy link
Contributor

@oddgrd oddgrd commented Feb 14, 2024

Description of change

Add an error layer to the tracing subscriber to capture any dyn std::error::Error, and set the fields required for Datadog APM error tracking in the otel data extension on the current span.

This also fixes the issue where the error message field was being overwritten by the http.status_code field, by setting the status code in the otel data extension rather than directly on the span.

Todo

  • Go over our errors and set a &(dyn std::error::Error) field where possible. (Done in fa8d1b5)
  • Remove some error events that shouldn't be error events (e.g. 400s). (Done in fa8d1b5)

More remains for both points above, but I have tried to remove all obvious "errors that shouldn't be errors", and I've migrated the errors in all error events to &dyn std::error::Error where it was straight forward. I think the clean-up can continue in future PRs.

Caveats

  • This does not enable logs error tracking, for that we would need to set these fields on all events, which is a bit messy without valuable support.
  • For APM error tracking, in addition to the fields requirement, it's also a requirement that the span is a service entry span.
  • We have no good way of getting the error type from a dyn Error, but the errors will still be distinguished in error tracking by error.message and error.stack (Datadog generates a fingerprint for each error based on error.message, error.type and error.stack).

Alternatives considered

  • Using valuable for something like the above. We currently expose our tracing dependencies to users in the shuttle-runtime, so using an unstable feature isn't viable. We can reconsider it when the common features that leak the valuable dependency are removed from the runtime, and they will be when provisioning is extracted from the runtime. Or if valuable has a stable release.
  • Using a utility function to emit an error event with the correct fields, and also set the fields on the otel data extension in the span. This works, and we'll have both APM and Logs error tracking, but we will need to modify all our existing error events. Some of which also do not impl Error, and some of them are exposed to users via our deployment log layer. Regardless, this is a viable option, but it's my opinion that it's better to start with this simple layer, and we can iterate on it when we get a clearer picture of the error tracking feature and what we are missing. The work on this option has been started here: https://github.com/shuttle-hq/shuttle/tree/feat/datadog-error-tracking

How has this been tested? (if applicable)

Tested with a local environment against Datadog.

@iulianbarbu
Copy link
Contributor

it's better to start with this simple layer, and we can iterate on it when we get a clearer picture of the error-tracking feature and what we are missing

I agree with this. How I see it is that at some point when we clarify how to use Valuable internally, we'll release implementations for each error (different than dyn std:error::Error) and then we use the same layer you created to expand a pre-existing error field to error message, stack & type fields, now possible for more than just &dyn std:error::Error.

For APM error tracking, in addition to the fields requirement, it's also a requirement that the span is a service entry span.

Why is this the case?

@oddgrd
Copy link
Contributor Author

oddgrd commented Feb 14, 2024

For APM error tracking, in addition to the fields requirement, it's also a requirement that the span is a service entry span.

Why is this the case?

I'm not sure, but it's listed in the requirements here.

@oddgrd oddgrd marked this pull request as ready for review February 15, 2024 20:13
Copy link
Contributor

@iulianbarbu iulianbarbu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super excited by what this enables! Thanks @oddgrd ! 😍

common/src/backends/otlp_tracing_bridge.rs Show resolved Hide resolved
common/src/backends/otlp_tracing_bridge.rs Outdated Show resolved Hide resolved
common/src/backends/trace.rs Show resolved Hide resolved
Copy link
Member

@chesedo chesedo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some notes. I think this PR warrants a demo to all the engineers as we should use the dyn StdError whenever we are logging an error

common/src/backends/otlp_tracing_bridge.rs Outdated Show resolved Hide resolved
common/src/backends/otlp_tracing_bridge.rs Show resolved Hide resolved
common/src/backends/otlp_tracing_bridge.rs Show resolved Hide resolved
common/src/backends/otlp_tracing_bridge.rs Outdated Show resolved Hide resolved
common/src/models/error.rs Outdated Show resolved Hide resolved
gateway/src/proxy.rs Show resolved Hide resolved
@oddgrd
Copy link
Contributor Author

oddgrd commented Feb 16, 2024

Left some notes. I think this PR warrants a demo to all the engineers as we should use the dyn StdError whenever we are logging an error

Sure, I'm happy to do a demo! I'm also planning to document how we should format errors in the M&O Wiki.

Copy link
Member

@Kazy Kazy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job 🎉 I agree with your comments regarding Valuable and that we should work iteratively. If we get the errors from the logs there, it will already be highly valuable (no pun intended).

@chesedo
Copy link
Member

chesedo commented Feb 19, 2024

Sure, I'm happy to do a demo! I'm also planning to document how we should format errors in the M&O Wiki.

I'm hoping a demo will make it clear as to why this is valuable - the way we should be doing it 😄

@oddgrd oddgrd merged commit c5f2caf into main Feb 19, 2024
31 checks passed
@oddgrd oddgrd deleted the feat/alternative-datadog-error-tracking branch February 19, 2024 17:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants