Add caller-side Nexus Operation metrics.#10026
Add caller-side Nexus Operation metrics.#10026S15 wants to merge 18 commits intotemporalio:feature/nexus-standalonefrom
Conversation
Add API boilerplate for standalone Nexus Operations. - [x] built - [ ] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s)
Add Nexus Standalone feature flag. Tests will be added to respective API impl.
Add Nexus Standalone Describe and Start handlers. - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [x] added new unit test(s) - [x] added new functional test(s) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Nexus Standalone List and Count handlers. - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [x] added new functional test(s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## What changed? Wire up Nexus Standalone with CHASM. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Roey Berman <roey@temporal.io>
## What changed? Add identity to `TerminatedFailureInfo`. PS: `CanceledFailureInfo` will be a separate PR. ## Why? Allow clients to access identity via API. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [x] added new unit test(s) - [x] added new functional test(s)
|
The currently failing tests are shared with the |
bergundy
left a comment
There was a problem hiding this comment.
Overall looks great. I think we should be done in another round.
| } | ||
| }, | ||
| GoCtx: context.WithValue(context.Background(), nexusoperation.OperationContextKey, &nexusoperation.OperationContext{ | ||
| NamespaceRegistry: nsRegistry, |
There was a problem hiding this comment.
You don't need the namespace registry in the context, the current namespace entry is already on the CHASM context,.
| var NexusOperationSuccessCount = metrics.NewCounterDef( | ||
| "nexus_operation_success_count", | ||
| metrics.WithDescription("Nexus Operations successfully completed."), | ||
| ) | ||
| var NexusOperationFailedCount = metrics.NewCounterDef( | ||
| "nexus_operation_failed_count", | ||
| metrics.WithDescription("Nexus Operations failures."), | ||
| ) | ||
| var NexusOperationCancelCount = metrics.NewCounterDef( | ||
| "nexus_operation_cancel_count", | ||
| metrics.WithDescription("Nexus Operations cancellations."), | ||
| ) | ||
| var NexusOperationTerminateCount = metrics.NewCounterDef( | ||
| "nexus_operation_terminate_count", | ||
| metrics.WithDescription("Nexus Operations that were terminated before completion."), | ||
| ) | ||
| var NexusOperationTimeoutCount = metrics.NewCounterDef( | ||
| "nexus_operation_timeout_count", | ||
| metrics.WithDescription("Nexus Operations that timed out before completion."), | ||
| ) |
There was a problem hiding this comment.
On the fence with the naming here. On the one hand it's consistent with activities, OTOH, I would prefer to use the statuses and use the same tense for all outcomes (failed is past tense and inconsistent with the rest)
| if store, ok := o.Store.TryGet(ctx); ok { | ||
| return store.OnNexusOperationCompleted(ctx, o, result, links) | ||
| } | ||
| metricsHandler, err := o.EnrichMetricsHandler(ctx) |
There was a problem hiding this comment.
Instead of passing in a metrics handler, you can obtain it in the transition function.
| func (o *Operation) EnrichMetricsHandler(ctx chasm.Context) (metrics.Handler, error) { | ||
| // nolint:revive // unchecked-type-assertion: intentional panic on missing context value | ||
| opCtx := ctx.Value(OperationContextKey).(*OperationContext) | ||
| namespaceName, err := opCtx.NamespaceRegistry.GetNamespaceName(namespace.ID(ctx.ExecutionKey().NamespaceID)) |
There was a problem hiding this comment.
ctx.NamespaceEntry().Name().String()
| metrics.NexusEndpointTag(o.GetEndpoint()), | ||
| metrics.WorkflowTypeTag(WorkflowTypeTag), | ||
| } | ||
| if opCtx.MetricTagConfig != nil { |
There was a problem hiding this comment.
Let's just ensure that this is never nil. It shouldn't be.
| } | ||
|
|
||
| func (o *Operation) emitOnFailedMetrics(handler metrics.Handler, closeTime time.Time) { | ||
| outcomeTag := metrics.OutcomeTag(nexusoperationpb.OPERATION_STATUS_FAILED.String()) |
| startedTime := o.GetStartedTime() | ||
| if startedTime != nil { | ||
| // Async operation that was started. | ||
| NexusOperationScheduleToStartLatency.With(handler).Record(startedTime.AsTime().Sub(scheduledTime), outcomeTag) |
There was a problem hiding this comment.
Emit this on transition to start instead of only emitting it when the operation completes.
| tags := []metrics.Tag{ | ||
| metrics.NamespaceTag(namespaceName.String()), | ||
| metrics.NexusEndpointTag(o.GetEndpoint()), | ||
| metrics.WorkflowTypeTag(WorkflowTypeTag), |
There was a problem hiding this comment.
If you're in a workflow, you'll need to emit the workflow's type.
There was a problem hiding this comment.
You can add a method to the Store interface: WorkflowTypeTag() string.
| func newComponentOnlyLibrary(namespaceRegistry namespace.Registry, dc *dynamicconfig.Collection) *componentOnlyLibrary { | ||
| return &componentOnlyLibrary{ | ||
| namespaceRegistry: namespaceRegistry, | ||
| metricTagConfig: MetricTagConfiguration.Get(dc), |
There was a problem hiding this comment.
Hmmm... it's okay to reuse this since I figure users will want to turn it on for both callers and handler but this config struct was meant for the frontend (handler side) originally. You'll want to document how this config is used now that it has other purposes and mention that HeaderTagMappings is not relevant on the caller side.
38923fc to
340f40a
Compare
ce242b7 to
397eca3
Compare
397eca3 to
7ae0421
Compare
378e20c to
80b781d
Compare
What changed?
New metrics:
Counters
nexus_operation_success_countnexus_operation_failed_countnexus_operation_cancel_countnexus_operation_terminate_countnexus_operation_timeout_countHistograms
nexus_operation_schedule_to_close_latencynexus_operation_schedule_to_start_latencynexus_operation_start_to_close_latencyLabels
All metrics use the following labels:
namespacenexus_endpointnexus_servicenexus_operationworkflowTypeLatency metrics also include:
outcomeTimeout Count also includes:
timeout_typeHow did you test it?
Potential risks
Metrics emitting isn't nil safe, and enrich metrics could panic if misconfigured.