Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Nexus async completion endpoint #5735

Merged
merged 4 commits into from
Apr 19, 2024

Conversation

bergundy
Copy link
Member

What changed?

  • Implemented the async Nexus operation completion endpoint with functional tests
  • Refactored frontend fx to provide a mux.Router instead of collecting handlers so I can inject routes in the component (plugin) code
  • Refactored task_executor.go to statemachine_environment.go and adapted it to support handling API requests
  • Added a StateMachineEnvironment() to history Engine. In the future we'll want to have the environment on the shard controller instead of the engine but because I'm waiting until we drop support for shard level workflow cache first to make that change
  • Refactored some helpers from nexus_handler.go into common/nexus for reusability

@bergundy bergundy requested a review from a team as a code owner April 16, 2024 21:00
@bergundy bergundy force-pushed the bergundy/async-operation-completion branch from 4a1ba13 to 4bb9cb4 Compare April 16, 2024 21:27
@bergundy bergundy force-pushed the bergundy/async-operation-completion branch from 4bb9cb4 to 7c8d04c Compare April 16, 2024 23:06
cmd/tools/rpcwrappers/main.go Outdated Show resolved Hide resolved
service/history/consts/const.go Show resolved Hide resolved
service/matching/forwarder.go Outdated Show resolved Hide resolved
// The task is stale and is safe to be dropped.
// Even though ErrStaleReference is castable to serviceerror.NotFound, we give this error special treatment
// because we're interested in the metric.
metrics.TaskSkipped.With(e.taggedMetricsHandler).Record(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should change the other place where TaskSkipped is emitted to return the StaleStateError

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the semantics are there.
When would the task event ID be >= than the mutable state next event ID?

It does seem like we want to put the task in the DLQ in that case instead of dropping it. Do you want to change that behavior in this PR that goes into a feature branch?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding it that it should never happen and basically the same as the StaleStateError.

goes into a feature branch

I am fine either way. That dropping logic is there for a long time so not an urgent thing to fix.
I commented here because it might impact the interpretation of the TaskSkipped metric.

service/frontend/fx.go Show resolved Hide resolved
plugins/nexusoperations/executors.go Show resolved Hide resolved
Config *Config
CallbackTokenGenerator *commonnexus.CallbackTokenGenerator
HistoryClient resource.HistoryClient
NamespaceValidationInterceptor *interceptor.NamespaceValidatorInterceptor
Copy link
Member

@yycptt yycptt Apr 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't read the code very carefully. I thought those interceptors are taken care of by the existing NexusHTTPHandler in frontend? I guess I am confused by all those handlers now 🤦

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was debating whether to put this functionality in the frontend package.
I'm trying to see how far I can take these components.

There's definitely some duplication but I'll take care of that later. I have it tracked.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm also a little confused why there is a separate Nexus frontend handler just for completions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could move the nexus http handler into this package as part of later refactoring work.

@bergundy bergundy force-pushed the bergundy/async-operation-completion branch from f2367fd to 4d20fd8 Compare April 19, 2024 00:15
## What changed?

Added pending nexus operations to describe response.

I was debating if I want to allow components to register a describe hook
but decided to just inline this for now.
@bergundy
Copy link
Member Author

Note to reviewers, I squashed the stacked PR #5749 on top of this. You can ignore that commit.

panic(fmt.Sprintf("expected at least one namespace ID field in request with nesting of 2 in %s", t))
} else {
// There's more than one, assume there's a top level one (e.g.
// historyservice.GetWorkflowExecutionHistoryRequest)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only see one NamespaceId field in historyservice.GetWorkflowExecutionHistoryRequest.

I think ideally we should panic here as well since it's ambiguous in this case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry it's GetWorkflowExecutionRawHistoryRequest.

// The task is stale and is safe to be dropped.
// Even though ErrStaleReference is castable to serviceerror.NotFound, we give this error special treatment
// because we're interested in the metric.
metrics.TaskSkipped.With(e.taggedMetricsHandler).Record(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding it that it should never happen and basically the same as the StaleStateError.

goes into a feature branch

I am fine either way. That dropping logic is there for a long time so not an urgent thing to fix.
I commented here because it might impact the interpretation of the TaskSkipped metric.

@bergundy bergundy merged commit 8083c39 into nexus Apr 19, 2024
5 checks passed
@bergundy bergundy deleted the bergundy/async-operation-completion branch April 19, 2024 18:11
"nexus_completion_requests",
WithDescription("The number of Nexus completion (callback) requests received by the service."),
)
NexusCompletionLatencyHistogram = NewCounterDef(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you want a NewTimerDef instead? Ditto L686 above.

Suggested change
NexusCompletionLatencyHistogram = NewCounterDef(
NexusCompletionLatency = NewTimerDef(

}

completion := &tokenspb.NexusOperationCompletion{}
return completion, proto.Unmarshal(plaintext, completion)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanna double check, proto.Unmarshal works with completion instead of &completion?

Can you add unit tests for this function and DecodeCallbackToken below?

@@ -69,3 +71,26 @@ func (g *CallbackTokenGenerator) Tokenize(completion *tokenspb.NexusOperationCom
}
return string(b), nil
}

// DecodeCompletion decodes a callback token unwrapping the contained NexusOperationCompletion proto struct.
func (g *CallbackTokenGenerator) DecodeCompletion(token *CallbackToken) (*tokenspb.NexusOperationCompletion, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function doesn't seem to need be a method of CallbackTokenGenerator

Comment on lines +180 to +183
if !exposeDetails {
return nexus.HandlerErrorf(nexus.HandlerErrorTypeInternal, "internal error")
}
return err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this to the default case inside the switch?
Also, don't you need to wrap the last return in nexus.HandlerError?

Comment on lines +41 to +43
func handleSuccessfulOperationResult(node *hsm.Node, operation Operation,
result *commonpb.Payload,
attemptTime *time.Time,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
func handleSuccessfulOperationResult(node *hsm.Node, operation Operation,
result *commonpb.Payload,
attemptTime *time.Time,
func handleSuccessfulOperationResult(
node *hsm.Node,
operation Operation,
result *commonpb.Payload,
attemptTime *time.Time,

tag.WorkflowRunID(completion.GetRunId()),
)
if completion.GetNamespaceId() != ns.ID().String() {
logger.Error("namespace ID in token doesn't match the token", tag.WorkflowNamespaceID(ns.ID().String()), tag.Error(err), tag.NewStringTag("completion-namespace-id", completion.GetNamespaceId()))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: avoid very long lines

Suggested change
logger.Error("namespace ID in token doesn't match the token", tag.WorkflowNamespaceID(ns.ID().String()), tag.Error(err), tag.NewStringTag("completion-namespace-id", completion.GetNamespaceId()))
logger.Error(
"namespace ID in token doesn't match the token",
tag.WorkflowNamespaceID(ns.ID().String()),
tag.Error(err),
tag.NewStringTag("completion-namespace-id", completion.GetNamespaceId()),
)

Comment on lines +97 to +98
if !h.Config.Enabled() {
h.MetricsHandler.Counter(metrics.NexusCompletionRequestPreProcessErrors.Name()).Record(1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if !h.Config.Enabled() {
h.MetricsHandler.Counter(metrics.NexusCompletionRequestPreProcessErrors.Name()).Record(1)
reqPreProcessErrorsCounter := h.MetricsHandler.Counter(metrics.NexusCompletionRequestPreProcessErrors.Name())
if !h.Config.Enabled() {
reqPreProcessErrorsCounter.Record(1)

metrics.OperationTag(apiName),
metrics.NamespaceTag(nsName),
),
requestStartTime: time.Now(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be more precise if you recorded the start time at the top of this function?

Comment on lines +235 to +246
} else {
if c.outcomeTag != nil {
c.metricsHandler = c.metricsHandler.WithTags(c.outcomeTag)
} else {
var he *nexus.HandlerError
if errors.As(*errPtr, &he) {
c.metricsHandler = c.metricsHandler.WithTags(metrics.NexusOutcomeTag("error_" + strings.ToLower(string(he.Type))))
} else {
c.metricsHandler = c.metricsHandler.WithTags(metrics.NexusOutcomeTag("error_internal"))
}
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else {
if c.outcomeTag != nil {
c.metricsHandler = c.metricsHandler.WithTags(c.outcomeTag)
} else {
var he *nexus.HandlerError
if errors.As(*errPtr, &he) {
c.metricsHandler = c.metricsHandler.WithTags(metrics.NexusOutcomeTag("error_" + strings.ToLower(string(he.Type))))
} else {
c.metricsHandler = c.metricsHandler.WithTags(metrics.NexusOutcomeTag("error_internal"))
}
}
}
} else if c.outcomeTag != nil {
c.metricsHandler = c.metricsHandler.WithTags(c.outcomeTag)
} else {
var he *nexus.HandlerError
if errors.As(*errPtr, &he) {
c.metricsHandler = c.metricsHandler.WithTags(metrics.NexusOutcomeTag("error_" + strings.ToLower(string(he.Type))))
} else {
c.metricsHandler = c.metricsHandler.WithTags(metrics.NexusOutcomeTag("error_internal"))
}
}

// not referencing a stale state or that the task itself is not stale. For example, if the state has a history of
// `[{v: 1, t: 3}, {v: 2, t: 5}]`, task A `{v: 2, t: 4}` **is not** referencing stale state because for version `2`
// When a task or API request is being processed, the history is compared with the imprinted state reference to verify
// that the state is not stale or that the task or request itself is not stale. For example, if the state has a history
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
// that the state is not stale or that the task or request itself is not stale. For example, if the state has a history
// that the state is not stale or that the task/request itself is not stale. For example, if the state has a history

bergundy added a commit that referenced this pull request Apr 23, 2024
## What changed?

[Operation state
machine](#5545)
[Nexus command
processing](#5546)
[Implement Nexus start operation task
executors](#5686)
[Implement Nexus async completion
endpoint](#5735)

Upgrade API to support full E2E flow
temporalio/api#363
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants