feat: expose error types as a dimension and in expressions#2655
feat: expose error types as a dimension and in expressions#2655
Conversation
WalkthroughAdds structured error-type classification throughout the router: new errorType constants and String(), errorCategory on httpGraphqlError, propagation into expression context, metrics/telemetry tagging, documentation, and extensive tests validating classifications. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
📝 Coding Plan
Comment |
Router-nonroot image scan passed✅ No security vulnerabilities found in image: |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
docs-website/router/configuration/error-types.mdx (1)
7-30: Add one quick usage example and a direct metrics-doc cross-link.The page is clear, but a tiny “how to use” snippet here would reduce context switching.
As per coding guidelines: "Include code examples in documentation where appropriate to illustrate concepts" and "Link between related documentation pages using internal references".📘 Suggested doc enhancement
When an error occurs during request processing, the router classifies it into one of the following types. This classification is available via: - **Expressions**: `request.errorType` and `subgraph.request.errorType` (see [Template Expressions](/router/configuration/template-expressions)) -- **Metrics**: the `wg.error.type` OTEL attribute (or `wg_error_type` in Prometheus) +- **Metrics**: the `wg.error.type` OTEL attribute (or `wg_error_type` in Prometheus). See [Metrics & Monitoring](/router/metrics-and-monitoring#error-identification) +### Quick usage examples + +```bash +request.errorType == 'validation_error' +``` + +```bash +subgraph.request.errorType == 'context_timeout' +```🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs-website/router/configuration/error-types.mdx` around lines 7 - 30, Add a short "how to use" example showing expression checks for request.errorType and subgraph.request.errorType (e.g., matching 'validation_error' and 'context_timeout') and include a direct internal link to the metrics docs for wg.error.type (and Prometheus wg_error_type) so readers can jump to metric semantics; update the docs-website/router/configuration/error-types.mdx content to append a small code block demonstrating usage of request.errorType/subgraph.request.errorType and add a cross-reference link text pointing to the metrics page.router/core/http_graphql_error.go (1)
20-28: Inconsistent error categorization reduces observability—consider a category-aware constructor or consistent literal initialization.The
NewHttpGraphqlErrorconstructor (lines 23–28) cannot seterrorCategory, leaving it zero-initialized on all constructor-based calls (e.g.,errors.go:408). Meanwhile, many struct literals ingraphql_prehandler.go,operation_processor.go,parse_multipart.go, andbatch.goalso omiterrorCategory, causing them to fall back toerrorTypeUnknownin metrics and observability (seegetErrorTypeinerrors.go:153–155). While the zero-value fallback is intentional, this inconsistency means errors explicitly categorized in some paths (e.g., input errors, blocked operations) lose their classification during prehandler processing. Recommend either extending the constructor to accept and seterrorCategory, or auditing and explicitly setting the field in literals where it should be classified.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@router/core/http_graphql_error.go` around lines 20 - 28, The NewHttpGraphqlError constructor currently leaves errorCategory zero-valued causing inconsistent classification; change NewHttpGraphqlError to accept an errorCategory (type errorType) parameter and assign it to the httpGraphqlError.errorCategory field, then update all call sites (e.g., errors.go usages) to pass the appropriate category; alternatively, if you prefer not to change the constructor, audit and explicitly set errorCategory in the struct literals in graphql_prehandler.go, operation_processor.go, parse_multipart.go, and batch.go so they no longer rely on the zero-value (see getErrorType/errorTypeUnknown for how metrics will interpret the field).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@router-tests/telemetry/telemetry_test.go`:
- Line 8466: Replace the weak "unknown" expectation for the deterministic
subgraph-failure path with the explicit subgraph error token: update the
occurrences of otel.WgErrorType.String("unknown") in telemetry_test.go (the
assertions at the locations referenced) to assert
otel.WgErrorType.String("subgraph_error") instead so the test verifies the
subgraph fetch failure classification.
In `@router/core/engine_loader_hooks.go`:
- Around line 176-177: The code currently sets
exprCtx.Subgraph.Request.ErrorType unconditionally using
getErrorType(responseInfo.Err).String(), which writes "unknown" when
responseInfo.Err is nil; change the logic in the block handling responseInfo to
only set exprCtx.Subgraph.Request.ErrorType when responseInfo.Err != nil (and
leave it as the empty string otherwise). Specifically, wrap the
getErrorType(...) assignment in a conditional check on responseInfo.Err (use
responseInfo.Err != nil) alongside the existing exprCtx.Subgraph.Request.Error =
WrapExprError(responseInfo.Err) handling, and ensure addExpressions(...)
receives an expression context that has an empty ErrorType for successful
fetches.
In `@router/core/operation_processor.go`:
- Around line 328-331: Several httpGraphqlError return sites in
operation_processor.go that represent user request parsing/validation are
missing an errorCategory, causing getErrorType() to report unknown; locate every
constructor/return of httpGraphqlError in request parsing and prehandler paths
(the same pattern used around the JSON/variables unmarshal cases) and add the
appropriate category field (e.g., errorCategory: errorTypeInputError) so
parsing/setup errors are annotated consistently; update each return that
represents a client input/validation failure to include errorCategory:
errorTypeInputError (and reuse trimmedError(err) for the message where
applicable) so getErrorType() can classify them.
---
Nitpick comments:
In `@docs-website/router/configuration/error-types.mdx`:
- Around line 7-30: Add a short "how to use" example showing expression checks
for request.errorType and subgraph.request.errorType (e.g., matching
'validation_error' and 'context_timeout') and include a direct internal link to
the metrics docs for wg.error.type (and Prometheus wg_error_type) so readers can
jump to metric semantics; update the
docs-website/router/configuration/error-types.mdx content to append a small code
block demonstrating usage of request.errorType/subgraph.request.errorType and
add a cross-reference link text pointing to the metrics page.
In `@router/core/http_graphql_error.go`:
- Around line 20-28: The NewHttpGraphqlError constructor currently leaves
errorCategory zero-valued causing inconsistent classification; change
NewHttpGraphqlError to accept an errorCategory (type errorType) parameter and
assign it to the httpGraphqlError.errorCategory field, then update all call
sites (e.g., errors.go usages) to pass the appropriate category; alternatively,
if you prefer not to change the constructor, audit and explicitly set
errorCategory in the struct literals in graphql_prehandler.go,
operation_processor.go, parse_multipart.go, and batch.go so they no longer rely
on the zero-value (see getErrorType/errorTypeUnknown for how metrics will
interpret the field).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4859c5c8-486d-4a0e-82d7-5a6de1a8e748
📒 Files selected for processing (15)
docs-website/docs.jsondocs-website/router/configuration/error-types.mdxdocs-website/router/configuration/template-expressions.mdxdocs-website/router/metrics-and-monitoring.mdxdocs-website/router/metrics-and-monitoring/prometheus-metric-reference.mdxrouter-tests/telemetry/telemetry_test.gorouter/core/context.gorouter/core/engine_loader_hooks.gorouter/core/errors.gorouter/core/errors_test.gorouter/core/graphql_prehandler.gorouter/core/http_graphql_error.gorouter/core/operation_metrics.gorouter/core/operation_processor.gorouter/internal/expr/expr.go
| attribute.StringSlice("services", []string{"employees", "products"}), | ||
| attribute.StringSlice("error_services", []string{"products"}), | ||
| otel.WgRequestError.Bool(true), | ||
| otel.WgErrorType.String("unknown"), |
There was a problem hiding this comment.
Assert subgraph_error instead of unknown for this deterministic subgraph-failure path.
At Line 8466, Line 8541, Line 8862, Line 9046, Line 9121, and Line 9442, the test expects wg.error.type="unknown" even though this scenario is an explicit subgraph fetch failure. That weakens coverage for the new error taxonomy and can hide regressions in subgraph classification.
Suggested expectation fix
- otel.WgErrorType.String("unknown"),
+ otel.WgErrorType.String("subgraph_error"),Also applies to: 8541-8541, 8862-8862, 9046-9046, 9121-9121, 9442-9442
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@router-tests/telemetry/telemetry_test.go` at line 8466, Replace the weak
"unknown" expectation for the deterministic subgraph-failure path with the
explicit subgraph error token: update the occurrences of
otel.WgErrorType.String("unknown") in telemetry_test.go (the assertions at the
locations referenced) to assert otel.WgErrorType.String("subgraph_error")
instead so the test verifies the subgraph fetch failure classification.
| exprCtx.Subgraph.Request.Error = WrapExprError(responseInfo.Err) | ||
| exprCtx.Subgraph.Request.ErrorType = getErrorType(responseInfo.Err).String() |
There was a problem hiding this comment.
Don't populate subgraph.request.errorType on successful fetches.
When responseInfo.Err is nil, this writes "unknown" into the expression context. Because exprCtx is fed into addExpressions(...) immediately below, successful subgraph calls can now look like uncategorized failures in templates, dashboards, and custom telemetry. Leave the field empty unless there is an actual error.
Proposed fix
exprCtx.Subgraph.Id = ds.ID
exprCtx.Subgraph.Name = ds.Name
exprCtx.Subgraph.Request.Error = WrapExprError(responseInfo.Err)
- exprCtx.Subgraph.Request.ErrorType = getErrorType(responseInfo.Err).String()
+ if responseInfo.Err != nil {
+ exprCtx.Subgraph.Request.ErrorType = getErrorType(responseInfo.Err).String()
+ } else {
+ exprCtx.Subgraph.Request.ErrorType = ""
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| exprCtx.Subgraph.Request.Error = WrapExprError(responseInfo.Err) | |
| exprCtx.Subgraph.Request.ErrorType = getErrorType(responseInfo.Err).String() | |
| exprCtx.Subgraph.Id = ds.ID | |
| exprCtx.Subgraph.Name = ds.Name | |
| exprCtx.Subgraph.Request.Error = WrapExprError(responseInfo.Err) | |
| if responseInfo.Err != nil { | |
| exprCtx.Subgraph.Request.ErrorType = getErrorType(responseInfo.Err).String() | |
| } else { | |
| exprCtx.Subgraph.Request.ErrorType = "" | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@router/core/engine_loader_hooks.go` around lines 176 - 177, The code
currently sets exprCtx.Subgraph.Request.ErrorType unconditionally using
getErrorType(responseInfo.Err).String(), which writes "unknown" when
responseInfo.Err is nil; change the logic in the block handling responseInfo to
only set exprCtx.Subgraph.Request.ErrorType when responseInfo.Err != nil (and
leave it as the empty string otherwise). Specifically, wrap the
getErrorType(...) assignment in a conditional check on responseInfo.Err (use
responseInfo.Err != nil) alongside the existing exprCtx.Subgraph.Request.Error =
WrapExprError(responseInfo.Err) handling, and ensure addExpressions(...)
receives an expression context that has an empty ErrorType for successful
fetches.
| return &httpGraphqlError{ | ||
| message: fmt.Sprintf("error parsing extensions: %s", trimmedError(err)), | ||
| statusCode: http.StatusBadRequest, | ||
| message: fmt.Sprintf("error parsing extensions: %s", trimmedError(err)), | ||
| statusCode: http.StatusBadRequest, | ||
| errorCategory: errorTypeInputError, |
There was a problem hiding this comment.
Common request failures still bypass the new type buckets.
This only annotates the JSON/variables unmarshal cases. Other user-facing httpGraphqlError returns in the same file still omit errorCategory, so getErrorType() will keep reporting unknown for cases like Line 593, Line 675, Line 683, Line 1291, Line 1385, and Line 1550. Please propagate categories consistently across the remaining request/validation constructors here as well. Based on learnings, request parsing/setup errors in Cosmo flow through prehandler httpGraphqlError paths.
Also applies to: 337-340, 347-350, 360-363, 375-378
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@router/core/operation_processor.go` around lines 328 - 331, Several
httpGraphqlError return sites in operation_processor.go that represent user
request parsing/validation are missing an errorCategory, causing getErrorType()
to report unknown; locate every constructor/return of httpGraphqlError in
request parsing and prehandler paths (the same pattern used around the
JSON/variables unmarshal cases) and add the appropriate category field (e.g.,
errorCategory: errorTypeInputError) so parsing/setup errors are annotated
consistently; update each return that represents a client input/validation
failure to include errorCategory: errorTypeInputError (and reuse
trimmedError(err) for the message where applicable) so getErrorType() can
classify them.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
router-tests/observability/error_type_test.go (1)
29-30:metricReaderis created but never used for assertions.The
metricReader(OTEL SDK ManualReader) is passed to the test config but no assertions are made against it. OnlypromRegistryis used inassertErrorTypeInMetrics. This pattern repeats in all subtests.Consider either:
- Adding OTEL metric assertions using
metricReader.Collect()to verify thewg.error.typeattribute- Removing
metricReaderif only Prometheus coverage is intended🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@router-tests/observability/error_type_test.go` around lines 29 - 30, The test creates metricReader via metric.NewManualReader() but never uses it for assertions; either add OTEL metric assertions using metricReader (call metricReader.Collect() and inspect the resulting metrics for the "wg.error.type" attribute) or remove metricReader to avoid dead setup; update each subtest where metricReader, promRegistry and assertErrorTypeInMetrics are used (look for metricReader, metric.NewManualReader(), assertErrorTypeInMetrics, promRegistry, and the "wg.error.type" attribute) to either (A) call metricReader.Collect(), parse the metric data and assert the error type, or (B) delete metricReader and related config so only promRegistry assertions remain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@router-tests/observability/error_type_test.go`:
- Line 100: The test currently ignores the error returned by ConfigureAuth;
update the call to capture both returns (e.g., authenticators, err :=
testutils.ConfigureAuth(t)) and fail the test immediately if err != nil (for
example using t.Fatalf or require.NoError(t, err)) so setup failures are
surfaced early; ensure subsequent uses still reference the authenticators
variable.
---
Nitpick comments:
In `@router-tests/observability/error_type_test.go`:
- Around line 29-30: The test creates metricReader via metric.NewManualReader()
but never uses it for assertions; either add OTEL metric assertions using
metricReader (call metricReader.Collect() and inspect the resulting metrics for
the "wg.error.type" attribute) or remove metricReader to avoid dead setup;
update each subtest where metricReader, promRegistry and
assertErrorTypeInMetrics are used (look for metricReader,
metric.NewManualReader(), assertErrorTypeInMetrics, promRegistry, and the
"wg.error.type" attribute) to either (A) call metricReader.Collect(), parse the
metric data and assert the error type, or (B) delete metricReader and related
config so only promRegistry assertions remain.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 53a8f04d-e3dd-4717-afb0-da5830bc540b
📒 Files selected for processing (1)
router-tests/observability/error_type_test.go
| metricReader := metric.NewManualReader() | ||
| promRegistry := prometheus.NewRegistry() | ||
|
|
||
| authenticators, _ := testutils.ConfigureAuth(t) |
There was a problem hiding this comment.
Ignored error from ConfigureAuth could mask test setup failures.
The second return value (error) is discarded. If configuration fails, this could lead to confusing test failures later.
🛡️ Proposed fix
- authenticators, _ := testutils.ConfigureAuth(t)
+ authenticators, authErr := testutils.ConfigureAuth(t)
+ require.NoError(t, authErr)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| authenticators, _ := testutils.ConfigureAuth(t) | |
| authenticators, authErr := testutils.ConfigureAuth(t) | |
| require.NoError(t, authErr) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@router-tests/observability/error_type_test.go` at line 100, The test
currently ignores the error returned by ConfigureAuth; update the call to
capture both returns (e.g., authenticators, err := testutils.ConfigureAuth(t))
and fail the test immediately if err != nil (for example using t.Fatalf or
require.NoError(t, err)) so setup failures are surfaced early; ensure subsequent
uses still reference the authenticators variable.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2655 +/- ##
===========================================
+ Coverage 46.42% 60.36% +13.93%
===========================================
Files 1046 244 -802
Lines 141488 25901 -115587
Branches 9679 0 -9679
===========================================
- Hits 65680 15634 -50046
+ Misses 74095 8937 -65158
+ Partials 1713 1330 -383
🚀 New features to boost your workflow:
|
Adds 5 new error types to the router's error classification system, reducing errors that fall into unknown. The new types cover
common non-subscription error paths: operation_blocked, persisted_operation_not_found, validation_error, input_error, and subgraph_error.
These error types are exposed via the wg.error.type OTel attribute (and wg_error_type in Prometheus) for metrics, and via request.errorType / subgraph.request.errorType in template expressions — enabling users to filter
dashboards, set up alerts, and write conditional logic based on specific error categories.
TODO: review error types, if any more should be added
Summary by CodeRabbit
Documentation
New Features
Bug Fixes
Tests
Checklist