All Sourcegraph components can easily raise runtime errors for debugging #33240

Closed
11 of 17 tasks
bobheadxi opened this issue Mar 30, 2022 · 6 comments

@bobheadxi
Member

bobheadxi commented Mar 30, 2022

Problem to solve

  • Errors are not actionable:
    • Without a lot of familiarity with the code that produced an error, it's hard to tell whether it is a serious problem or not. Transient errors are mixed in with serious ones.
  • Noise makes the task difficult: transient errors generate a lot of volume, which makes it harder to diagnose the root cause.
  • Errors are potentially missing or not being reported: in some cases, an error is reported by a customer, but we don't see any relevant errors in the logs.
  • Getting access to the errors is not always easy: depending on the instance type where the error happened, it can be difficult to retrieve the logs.
  • Error reporting is ad hoc across services, leading to implementation fragmentation that makes the problem hard to tackle as a whole.

Overall, we have a low signal-to-noise ratio that makes debugging tedious. As a secondary objective, we can take this opportunity to improve our sensitive-data redaction in error reporting.

Related problem: due to the lack of a unified reporting mechanism, diagnosis creates a lot of communication overhead, which can take days in some cases when those communications are subject to strict internal policies at certain companies.

Measure of success

  • The handbook documents a set of conventions on how to write errors that has been collectively approved by both error-producing and error-consuming teams.
  • At least one service is fully migrated to the new error conventions, serving as an example for later migrating all other services.
  • Error rates are provided with enough metadata to identify which component they come from, making them monitorable.
  • Error reporting works across all instance types.
  • A mechanism is in place to seamlessly access the errors on a given instance.
    • TODO: Get in touch with the Cloud DevOps team, as they may have related work on this matter.

Solution summary

We will propose an RFC with a set of conventions for runtime errors, actively seeking feedback from the biggest error consumers. We're likely to include documentation tasks and content-production work to improve onboarding on the general topic of errors.

We will audit the codebase to better understand the current state and identify possible sources of debt, and we will propose changes to the error-related packages to make it easier to follow the above conventions and to pass additional data.

Artifacts:

  • TODO: GitHub discussion to brainstorm on this topic.

What specific customers are we iterating on the problem and solution with?

Internal Sourcegraph developers and SREs.

Impact on use cases

This effort contributes to the company-wide effort to improve Observability.

Delivery plan

  • ➡️ Investigate existing proposals, issues, and problem space @jhchabran
  • Prepare a draft RFC with a standardized proposal for conventions.
  • Partner with a given team to improve the errors in a specific domain and evaluate the results with SREs.
  • Raise the RFC as a formal standardization proposal.

Tracked issues

@bobheadxi: 3.00d

Completed: 3.00d

@jhchabran

Completed

Legend

  • 👩 Customer issue
  • 🐛 Bug
  • 🧶 Technical debt
  • 🎩 Quality of life
  • 🛠️ Roadmap
  • 🕵️ Spike
  • 🔒 Security issue
  • 🙆 Stretch goal
@bobheadxi bobheadxi added the roadmap, tracking, and devx/errors labels Mar 30, 2022
@bobheadxi bobheadxi changed the title DevX: application logging roadmap All Sourcegraph services can easily raise runtime errors for debugging Mar 30, 2022
@bobheadxi bobheadxi changed the title All Sourcegraph services can easily raise runtime errors for debugging All Sourcegraph components can easily raise runtime errors for debugging Mar 30, 2022
@taylorsperry taylorsperry self-assigned this Apr 8, 2022
@bobheadxi
Member Author

bobheadxi commented May 10, 2022

There's currently some Sentry work in the code related to this, but the last time I asked, nobody was using Sentry for backend errors: https://sourcegraph.slack.com/archives/C02UC4WUX1Q/p1651506189865219

@bobheadxi
Member Author

bobheadxi commented May 10, 2022

Thought: how are errors handled today? Do we even need a tool like Sentry? Can't error aggregation and stack-trace dumping be done with logging, i.e. by treating all error-level logs as serious events? (The new logging library dumps lib/errors stack traces by default in production.)

It would be nice to not need a tool like Sentry, since that simplifies the story for customers and seems aligned with how errors are handled today for backend services (a log-first and/or trace-oriented approach).

cc @jhchabran

@bobheadxi
Member Author

bobheadxi commented May 12, 2022

Joe mentioned this previous initiative around Sentry: https://github.com/orgs/sourcegraph/projects/139

@jhchabran
Member

It would be nice to not need a tool like Sentry, since that simplifies the story for customers and seems aligned with how errors are handled today for backend services (a log-first and/or trace-oriented approach).

I entirely agree: Sentry should be a system added on top of the existing code as a way to visualize collected errors, nothing else. Thanks for bringing up Joe's board; I wasn't aware of it.

@jhchabran
Member

jhchabran commented May 17, 2022

Did a first exploration this afternoon and brought @burmudar along. I'll keep going tomorrow :)


First principles

  • Surfacing an error means logging it.
  • A surfaced error always comes with contextual metadata.
  • Errors are a signal.
  • Errors work the same across all instance types.

Surfacing an error

	logger.With(log.Error(err)).Error("description") // log an error
	logger.With(log.Error(err)).Warn("description")  // log a transient error; can be ignored by Sentry, for example, but not by the metrics
	logger.With(log.Error(err)).Info("description")  // log but do not instrument the error
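
Tying this to the principle that a surfaced error always comes with contextual metadata, a call site could look like the minimal sketch below. It assumes the sourcegraph/log API from the snippet above; the function shape, field names, and runGitClone are hypothetical.

	// Minimal sketch; the function, field names, and runGitClone are hypothetical.
	func cloneRepo(logger log.Logger, repoName string) {
		if err := runGitClone(repoName); err != nil {
			logger.With(
				log.String("repo", repoName), // contextual metadata travels with the error
				log.Error(err),
			).Error("failed to clone repository")
		}
	}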

Instrumenting error rates

Following the aforementioned first principle that errors are a signal, we typically want to be able to monitor the error rate of a particular group of errors. This is tangential to the issue, but worth mentioning, I believe.

As an example, suppose a service has an HTTP API and you want to monitor the error rate of that API. You add a middleware on your handlers (like httptrace.go) where you log the errors with "http_handlers" as your bucket name.

	logger.With(log.Error(err), log.String("error_bucket", "my-bucket")).Error("description")

This does not prevent you from logging the errors earlier in the code, with a different bucket or no bucket at all (in which case the error is aggregated into the total error rate).

This lets users mix explicit error reporting with a root error handler that catches errors bubbling up. It solves the problem of a root error handler that is not aware of context-specific metadata available at the original call site, while still computing error rates properly. We don't want users to have to think about whether their error also gets logged further down the line.
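
As an illustration, such a root handler could look roughly like the sketch below. It assumes the sourcegraph/log package and net/http; the instrumentErrors helper and the error-returning handler shape are hypothetical, in the spirit of httptrace.go rather than its actual code.

	// Sketch only: a root handler that tags bubbled-up errors with a stable bucket.
	// Assumes the sourcegraph/log package and net/http; names are hypothetical.
	func instrumentErrors(logger log.Logger, next func(http.ResponseWriter, *http.Request) error) http.HandlerFunc {
		return func(w http.ResponseWriter, r *http.Request) {
			if err := next(w, r); err != nil {
				logger.With(
					log.Error(err),
					log.String("error_bucket", "http_handlers"), // bucket used for rate monitoring
				).Error("request failed")
				http.Error(w, "internal server error", http.StatusInternalServerError)
			}
		}
	}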

Capturing the surfaced errors

On Cloud we can use Sentry, but we can't really do that on other instance types. By plugging an error-handling middleware into the logging pipeline, we can send the errors to an external system such as Sentry, or take a different approach such as storing them somewhere so we can display them in the UI, for example (purely hypothetical; mentioning it because it was brought up that getting customers to send us logs can sometimes take longer due to internal processes on their side).

	syncLogs := log.Init(log.Resource{
		Name:       "wjh",
		Version:    version.Version(),
		InstanceID: hostname.Get(),
	}, mySentryErrorMiddleware, myMetricsErrorMiddleware)

So we could pass additional options to log.Init to wire up the error handler(s), which would guarantee that they are not changed dynamically at runtime, since that would be quite surprising behavior.
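
As an illustration, one of those handlers could be as small as forwarding surfaced errors to Sentry. This is only a sketch: the middleware signature is an assumption about what such a log.Init option could accept (it is not part of the current API), and only sentry.CaptureException from github.com/getsentry/sentry-go is an existing call.

	// Hypothetical middleware matching the log.Init proposal above; the hook shape
	// is assumed, and only sentry.CaptureException (github.com/getsentry/sentry-go)
	// is an existing API.
	func mySentryErrorMiddleware(err error, fields ...log.Field) {
		if err == nil {
			return
		}
		// Forward the surfaced error to Sentry; fields could be attached as extra context.
		sentry.CaptureException(err)
	}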

API approach

At this stage, it makes more sense to just do a first iteration with the primitives provided by the logging API rather than creating new helpers such as logger.CaptureError.

About the observation package

None of the above is incompatible with the observation package; using them together could simply result in passing additional metadata.

Discarded considerations

Annotating errors

In the context of having a root error handler such as the one in httptrace.go, it may sound appealing to add helpers that annotate errors with metadata, which the logger could then parse and log as attributes. This way, you can annotate your errors and just pass them up to get a nicely annotated log line.

But that introduces complexity in the error types we use and in how to reconcile annotations when errors are wrapped. I believe it's much simpler to avoid overloading the errors themselves; if we really want some form of annotation, we could expose the necessary helpers from cockroachdb/errors.

At this stage, we're better off keeping it simple and learning from the usage patterns we see emerging.
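
For context, the discarded approach would have looked roughly like the sketch below, assuming we exposed the WithDetail/GetAllDetails helpers from cockroachdb/errors; the surrounding functions and field names are purely illustrative.

	// Rough illustration of the discarded annotation approach, not a proposal.
	// Assumes cockroachdb/errors' WithDetail/GetAllDetails were exposed; runGitClone
	// and the field names are illustrative.
	func doWork(repoName string) error {
		if err := runGitClone(repoName); err != nil {
			// Annotate at the call site that has the context...
			return errors.WithDetail(err, "repo="+repoName)
		}
		return nil
	}

	// ...and have a root handler turn annotations back into log attributes. Reconciling
	// these details across wrapped errors is the complexity called out above.
	func handleError(logger log.Logger, err error) {
		logger.With(
			log.Error(err),
			log.Strings("details", errors.GetAllDetails(err)),
		).Error("operation failed")
	}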

Using the error message as the bucket name

It's really easy to change an error message, and messing up the error metrics just because you fixed a typo is dangerous. Having an explicit bucket name is more robust and clearer in its intent.
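
To make the contrast concrete, an illustrative snippet using the same logging calls as above:

	// Fragile: deriving the bucket from the message ties metrics to the exact wording,
	// so editing the message (even to fix a typo) silently starts a new error-rate series.
	logger.With(log.Error(err), log.String("error_bucket", err.Error())).Error("failed to resolve revision")

	// Robust: an explicit, stable bucket name is independent of the message wording.
	logger.With(log.Error(err), log.String("error_bucket", "revision_resolution")).Error("failed to resolve revision")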

@jhchabran
Member

We're done. At this point, it's just a matter of continuously polishing this to unlock more actionability.

I think it'd be better to close the current issue and wrap the remaining work into a bet framed in terms of business impact, or to just finish those tasks while we're on cooldown. I'll give this some thought and circle back.
