Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rest of Nexus API definitions #363

Merged
merged 21 commits into from
Apr 23, 2024
Merged

Conversation

bergundy
Copy link
Member

@bergundy bergundy commented Mar 6, 2024

Note for reviewers

Server support is implemented in the nexus feature branch.
We should wait until we're ready to merge nexus into main.

Even after we merge this, there'll like be very small changes as we work on the Go SDK, but hopefully only additions, I tried my best not to require any more breaking changes.

@bergundy bergundy requested review from a team as code owners March 6, 2024 23:24
@bergundy bergundy marked this pull request as draft March 7, 2024 01:01
@bergundy
Copy link
Member Author

bergundy commented Mar 7, 2024

Changed to draft, I already see some gaps.

@@ -234,6 +234,27 @@ message ProtocolMessageCommandAttributes {
string message_id = 1;
}

message ScheduleNexusOperationCommandAttributes {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I set headers like I can via Nexus calls directly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet, but I suspect we'll need this later for tracing. Might as well just add it now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -49,4 +49,6 @@ enum CommandType {
COMMAND_TYPE_UPSERT_WORKFLOW_SEARCH_ATTRIBUTES = 13;
COMMAND_TYPE_PROTOCOL_MESSAGE = 14;
COMMAND_TYPE_MODIFY_WORKFLOW_PROPERTIES = 16;
COMMAND_TYPE_SCHEDULE_NEXUS_OPERATION = 17;
COMMAND_TYPE_CANCEL_NEXUS_OPERATION = 18;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
COMMAND_TYPE_CANCEL_NEXUS_OPERATION = 18;
COMMAND_TYPE_REQUEST_CANCEL_NEXUS_OPERATION = 18;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

// Operation completed as canceled (may have not ever been delivered).
NEXUS_OPERATION_STATE_CANCELED = 7;
// Operation timed out.
NEXUS_OPERATION_STATE_TIMEDOUT = 8;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
NEXUS_OPERATION_STATE_TIMEDOUT = 8;
NEXUS_OPERATION_STATE_TIMED_OUT = 8;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks.

temporal/api/enums/v1/common.proto Outdated Show resolved Hide resolved
Comment on lines 78 to 80
// Operation has just been initialized, waiting to be scheduled.
// This is a transient state which may never be visible to a client.
NEXUS_OPERATION_STATE_INITIALIZED = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To confirm though, this is a state it may be in? I would think that any Nexus operation goes straight to "scheduled".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today it's not a state an operation will ever be in because it immediately transitions to the scheduled state.
In the future we may block operations from being scheduled if the queue per source namespace and destination service is backed up.

I was debating whether this should added here or not, still not sure if I want to initialize the state machine on the server in the UNSPECIFIED state or have this explicit state before transitioning to scheduled.

I'll have to get deeper into the implementation to decide.

@@ -861,6 +861,7 @@ message DescribeWorkflowExecutionResponse {
repeated temporal.api.workflow.v1.PendingChildExecutionInfo pending_children = 4;
temporal.api.workflow.v1.PendingWorkflowTaskInfo pending_workflow_task = 5;
repeated temporal.api.workflow.v1.CallbackInfo callbacks = 6;
repeated temporal.api.workflow.v1.NexusOperationInfo pending_nexus_operations = 7;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I note this doesn't match the other "pending" whose info objects are specific to pending and don't include state enumerates for completion. Are you sure that you shouldn't call the object PendingNexusOperationInfo and the enum PendingNexusOperationState and get rid of the terminal states? If you think you may want a more robust info, I think you should go ahead and map out the RPC that uses it. Otherwise, you have multiple unused enum values from an API POV.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe these states should be internally defined in the server repo.
I'll probably want to delete the operation state machines from mutable state after completion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 To me from an API POV a pending-specific info with a pending-specific state enum matches what we have and doesn't cause any problems with attempted inadvertent reuse of this info for anything else. (for sure actual server model may differ)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove the non-pending states.

// A Nexus operation completed as canceled.
EVENT_TYPE_NEXUS_OPERATION_CANCELED = 52;
// A Nexus operation timed out.
EVENT_TYPE_NEXUS_OPERATION_TIMEDOUT = 53;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
EVENT_TYPE_NEXUS_OPERATION_TIMEDOUT = 53;
EVENT_TYPE_NEXUS_OPERATION_TIMED_OUT = 53;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

NexusOperationStartedEventAttributes nexus_operation_started_event_attributes = 54;
NexusOperationCompletedEventAttributes nexus_operation_completed_event_attributes = 55;
NexusOperationFailedEventAttributes nexus_operation_failed_event_attributes = 56;
NexusOperationCanceledEventAttributes nexus_operation_canceld_event_attributes = 57;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
NexusOperationCanceledEventAttributes nexus_operation_canceld_event_attributes = 57;
NexusOperationCanceledEventAttributes nexus_operation_canceled_event_attributes = 57;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😬

NexusOperationCompletedEventAttributes nexus_operation_completed_event_attributes = 55;
NexusOperationFailedEventAttributes nexus_operation_failed_event_attributes = 56;
NexusOperationCanceledEventAttributes nexus_operation_canceld_event_attributes = 57;
NexusOperationTimedOutEventAttributes nexus_operation_timedout_event_attributes = 58;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
NexusOperationTimedOutEventAttributes nexus_operation_timedout_event_attributes = 58;
NexusOperationTimedOutEventAttributes nexus_operation_timed_out_event_attributes = 58;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -183,3 +183,28 @@ message CallbackInfo {
// The time when the next attempt is scheduled.
google.protobuf.Timestamp next_attempt_schedule_time = 8;
}

message NexusOperationInfo {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if we should have a concept of worker identity reported upon poll with Nexus task handlers. It is useful information for activities and such, so it's probably useful here too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is information for the caller. The handler may not even be a worker and I'm not sure the handler side will want to expose worker identities.

I think I'll probably add the operation ID at minimum here.

Copy link
Member

@cretz cretz Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I agree due to the needed separation, we should not report this worker information to the caller. However it does bring up the question, as a caller I can see outstanding Nexus operations, but can I on the incoming/impl side? Maybe it's not a big deal for now since it's usually 1:1 with workflows.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big deal now, it's 1:1 with workflows and you'll be able to see that there are nexus callbacks attached to your workflow.

Later on we'll want better tracing abilities but I haven't gotten to that yet.

Copy link
Member Author

@bergundy bergundy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cretz for the great feedback.

@@ -234,6 +234,27 @@ message ProtocolMessageCommandAttributes {
string message_id = 1;
}

message ScheduleNexusOperationCommandAttributes {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet, but I suspect we'll need this later for tracing. Might as well just add it now.

@@ -49,4 +49,6 @@ enum CommandType {
COMMAND_TYPE_UPSERT_WORKFLOW_SEARCH_ATTRIBUTES = 13;
COMMAND_TYPE_PROTOCOL_MESSAGE = 14;
COMMAND_TYPE_MODIFY_WORKFLOW_PROPERTIES = 16;
COMMAND_TYPE_SCHEDULE_NEXUS_OPERATION = 17;
COMMAND_TYPE_CANCEL_NEXUS_OPERATION = 18;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

Comment on lines 78 to 80
// Operation has just been initialized, waiting to be scheduled.
// This is a transient state which may never be visible to a client.
NEXUS_OPERATION_STATE_INITIALIZED = 1;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today it's not a state an operation will ever be in because it immediately transitions to the scheduled state.
In the future we may block operations from being scheduled if the queue per source namespace and destination service is backed up.

I was debating whether this should added here or not, still not sure if I want to initialize the state machine on the server in the UNSPECIFIED state or have this explicit state before transitioning to scheduled.

I'll have to get deeper into the implementation to decide.

temporal/api/enums/v1/common.proto Outdated Show resolved Hide resolved
// Operation completed as canceled (may have not ever been delivered).
NEXUS_OPERATION_STATE_CANCELED = 7;
// Operation timed out.
NEXUS_OPERATION_STATE_TIMEDOUT = 8;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks.

// A Nexus operation completed as canceled.
EVENT_TYPE_NEXUS_OPERATION_CANCELED = 52;
// A Nexus operation timed out.
EVENT_TYPE_NEXUS_OPERATION_TIMEDOUT = 53;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

NexusOperationStartedEventAttributes nexus_operation_started_event_attributes = 54;
NexusOperationCompletedEventAttributes nexus_operation_completed_event_attributes = 55;
NexusOperationFailedEventAttributes nexus_operation_failed_event_attributes = 56;
NexusOperationCanceledEventAttributes nexus_operation_canceld_event_attributes = 57;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😬

NexusOperationCompletedEventAttributes nexus_operation_completed_event_attributes = 55;
NexusOperationFailedEventAttributes nexus_operation_failed_event_attributes = 56;
NexusOperationCanceledEventAttributes nexus_operation_canceld_event_attributes = 57;
NexusOperationTimedOutEventAttributes nexus_operation_timedout_event_attributes = 58;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -183,3 +183,28 @@ message CallbackInfo {
// The time when the next attempt is scheduled.
google.protobuf.Timestamp next_attempt_schedule_time = 8;
}

message NexusOperationInfo {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is information for the caller. The handler may not even be a worker and I'm not sure the handler side will want to expose worker identities.

I think I'll probably add the operation ID at minimum here.

@@ -861,6 +861,7 @@ message DescribeWorkflowExecutionResponse {
repeated temporal.api.workflow.v1.PendingChildExecutionInfo pending_children = 4;
temporal.api.workflow.v1.PendingWorkflowTaskInfo pending_workflow_task = 5;
repeated temporal.api.workflow.v1.CallbackInfo callbacks = 6;
repeated temporal.api.workflow.v1.NexusOperationInfo pending_nexus_operations = 7;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe these states should be internally defined in the server repo.
I'll probably want to delete the operation state machines from mutable state after completion.

Copy link
Member

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me (not marking a draft "approved" though)

@@ -39,7 +38,7 @@ import "temporal/api/common/v1/message.proto";
message Failure {
string message = 1;
map<string, string> metadata = 2;
google.protobuf.Value details = 3;
bytes details = 3;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this is a backwards incompatible change but I realized it's simpler to just use JSON serialized to bytes here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this set by Nexus handlers and the Nexus HTTP protocol?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Go there's a nexus.Failure object with a details field that's json.RawMessage.

https://github.com/nexus-rpc/sdk-go/blob/506896b352dacee3b5d7d62010a87bc08b2b8fd1/nexus/api.go#L48-L55

Here's how it's defined in the HTTP spec: https://github.com/nexus-rpc/api/blob/main/SPEC.md#failure

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I agree that bytes is better than structured value here. May want to document that it is likely JSON based on SDK usage.

@bergundy bergundy marked this pull request as ready for review April 19, 2024 17:57
Copy link
Contributor

@rodrigozhou rodrigozhou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, please address @pdoerner comments.

Copy link
Member

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How far along is the implementation? Is it possible to not put these history-affecting protos into master until the implementation is basically complete server side?

}

// State of a Nexus operation cancelation.
enum NexusOperationCancelationState {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we are a bit inconsistent on "cancelled" vs "canceled", we are consistently using Cancellation with two Ls in our protos

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I knew we were inconsistent but never noticed we we writing Cancellation with two Ls.
I'm totally okay changing for "consistency".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Yeah we're consistent on cancellation, it's canceled we're inconsistent on

// The time when the next attempt is scheduled.
google.protobuf.Timestamp next_attempt_schedule_time = 10;

NexusOperationCancelationInfo cancelation_info = 11;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't seem to report this information for activities today, are we sure we have to put it on Nexus operations at this time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

Comment on lines +259 to +264
// The time when the last attempt completed.
google.protobuf.Timestamp last_attempt_complete_time = 8;
// The last attempt's failure, if any.
temporal.api.failure.v1.Failure last_attempt_failure = 9;
// The time when the next attempt is scheduled.
google.protobuf.Timestamp next_attempt_schedule_time = 10;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't seem to report these 3 values for activities today, are we sure we have to put it on Nexus operations at this time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is important to be able to understand how your operation is progressing.
Activities are run by your workers, operations requests are made by the server, there's no other means of inspecting the status.

Copy link
Member

@cretz cretz Apr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguably it's important for activities too and I can give obvious reasons why, but we don't wait on whether I think it's important, we wait on whether users do so we don't prematurely make the API surface too large. Ideally Nexus operations and activities can have similar visibility if/until more is needed in general. We are working too hard to make Nexus special in some of these cases based on these assumptions of need.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Activity has last started and last failure.
Operations don't record attempt starts, only completions, so it's basically exposing the same amount of information but in different format.

// This is useful propagating tracing information.
// Note that it is not using temporal.api.common.v1.Header because the Nexus protocol only supports string header
// values.
map<string, string> header = 5;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that it is not using temporal.api.common.v1.Header because the Nexus protocol only supports string header values

We could only accept string payloads, but payloads are important because they imply encrypt-ability and plugin to every SDKs encryption tooling. Doing this as just strings means callers have to use separate encryption tooling which is a lot to ask inside of a workflow.

I wonder if we can say: "json/plain" strings are unwrapped and used as strings, all other "json/*" values are just bytes converted to strings, and all other values are just base64'd bytes.

While it may be worth thinking the SDK side can just accept strings and go through the data converter which will encrypt them, but then turn them back to strings here, several users have proxies that walk protos and encrypt all payloads.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can say: "json/plain" strings are unwrapped and used as strings, all other "json/*" values are just bytes converted to strings, and all other values are just base64'd bytes.

I'm open to this idea but it would require additional metadata to decode and would complicate the integration between Temporal Nexus and "plain" Nexus.

several users have proxies that walk protos and encrypt all payloads.

I'm aware, I didn't expect users to put sensitive information in here.
AFAIR, our SDK headers don't go through codecs today and it doesn't seem like it's been an issue yet although I anticipate it may become an issue if they propagate baggage with their traces.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a concept of Temporal headers that people expect to be able to put context values including sensitive ones. To make Nexus's use of Temporal headers the one special one that has different rules is a bit off. I understand it complicates integration, but it is important to think about the user here.

How would my Go context propagator work for Nexus operations for instance? (which people absolutely put sensitive information in)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced it should just work by default, it all depends on how much context you want to share with the party you're calling, which may or may not be the same org and using the same encryption keys.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Header propagation does not work by default, users must write it. No keys are auto-forwarded.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After off-PR discussion, Temporal headers are not going to be Nexus headers at this time unfortunately. But we should consider that in the future.

@@ -39,7 +38,7 @@ import "temporal/api/common/v1/message.proto";
message Failure {
string message = 1;
map<string, string> metadata = 2;
google.protobuf.Value details = 3;
bytes details = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this set by Nexus handlers and the Nexus HTTP protocol?

@@ -184,4 +183,10 @@ message OutgoingService {
message OutgoingServiceSpec {
// URL to invoke requests.
string url = 1;
// Callback URL that is attached to every StartOperation request for this service.
// The URL should be accessable to the handler of the StartOperation requests.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// The URL should be accessable to the handler of the StartOperation requests.
// The URL should be accessible to the handler of the StartOperation requests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@@ -184,4 +183,10 @@ message OutgoingService {
message OutgoingServiceSpec {
// URL to invoke requests.
string url = 1;
// Callback URL that is attached to every StartOperation request for this service.
// The URL should be accessable to the handler of the StartOperation requests.
// The service handles callbacks in the /api/v1/namespaces/{namespace_name}/nexus/callback route. Note that if the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is discussion of removing the /api/v1 prefix from our HTTP API annotations (it is a bit redundant/unnecessary). This decision should happen soon, don't bake too many Nexus expectations on this prefix yet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can update the comment when the decision is made but I think it's valuable to help users set this field properly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Just a heads up so there are no implementation-side expectations inside server dev

@bergundy
Copy link
Member Author

How far along is the implementation? Is it possible to not put these history-affecting protos into master until the implementation is basically complete server side?

Implementation is pretty much done but not yet production ready.
I have a feature branch in the server that I want to merge pending this PR.

Copy link
Member

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I do think it's unfortunate that Temporal headers can't be used.

temporal/api/command/v1/message.proto Outdated Show resolved Hide resolved
temporal/api/command/v1/message.proto Outdated Show resolved Hide resolved
temporal/api/history/v1/message.proto Outdated Show resolved Hide resolved
@bergundy
Copy link
Member Author

LGTM. I do think it's unfortunate that Temporal headers can't be used.

We can add that later, we left room/time for that.

bergundy added a commit to temporalio/temporal that referenced this pull request Apr 23, 2024
## What changed?

[Operation state
machine](#5545)
[Nexus command
processing](#5546)
[Implement Nexus start operation task
executors](#5686)
[Implement Nexus async completion
endpoint](#5735)

Upgrade API to support full E2E flow
temporalio/api#363
@bergundy bergundy merged commit 2952cc5 into temporalio:master Apr 23, 2024
3 checks passed
@bergundy bergundy deleted the nexus-e2e branch April 23, 2024 17:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants