
Add ExecuteMultiOperation #367

Merged: 18 commits, Apr 13, 2024
Conversation

stephanos (Contributor) commented Mar 12, 2024:

What changed?

Adding a new "ExecuteMultiOperation" API that will support sending a combined Start + Update.

The first version of the Server implementation can be found here: temporalio/temporal#5577

Note that while the API has a working implementation, it might still change. It is therefore annotated with "NOTE: Experimental API".
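For orientation, a simplified sketch of the request/response shapes under discussion (a sketch only: message names, field names, and numbering are illustrative and may differ from the merged API; imports are omitted):

```proto
syntax = "proto3";

// Illustrative sketch only -- approximates the shapes discussed in this
// PR, not the merged definitions.
message ExecuteMultiOperationRequest {
  string namespace = 1;

  // Operations are executed as a unit; see the review discussion for the
  // exact transactional guarantees.
  repeated Operation operations = 2;

  message Operation {
    oneof operation {
      StartWorkflowExecutionRequest start_workflow = 1;
      UpdateWorkflowExecutionRequest update_workflow = 2;
    }
  }
}

message ExecuteMultiOperationResponse {
  // One response per operation, in request order.
  repeated Response responses = 1;

  message Response {
    oneof response {
      StartWorkflowExecutionResponse start_workflow = 1;
      UpdateWorkflowExecutionResponse update_workflow = 2;
    }
  }
}
```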

Why?

New API for sending operations to the Server atomically.

Breaking changes

stephanos force-pushed the update-with-start2 branch 2 times, most recently from d89ae69 to ce65a44 (March 12, 2024)
stephanos force-pushed the update-with-start2 branch 3 times, most recently from a0bfae2 to da6a6cb (March 19, 2024)

Resolved review threads: google/rpc/code.proto, google/rpc/status.proto

stephanos force-pushed the update-with-start2 branch 4 times, most recently from 3747c49 to c8e14b5 (March 21, 2024)
stephanos force-pushed the update-with-start2 branch 3 times, most recently from 1ea586e to 23d2d6e (March 22, 2024)
stephanos marked this pull request as ready for review (March 22, 2024)
stephanos requested review from a team as code owners (March 22, 2024)

message WorkflowOperationResult {
oneof result {
google.rpc.Status error = 1;
Member commented:
I think multi-operation failure should be a gRPC error.

I can't imagine a continue-on-error future (or even a non-transactional future) where there is value in reporting this as a successful gRPC result with a failure embedded. I think for now we should consider multi-operation transactional (i.e. all fails or all succeeds), as that's where it has the most value, and use gRPC errors to report failure. If we need to give the user more detail about the error, we can add error details, which could include partial successes if we ever have that (I hope not; it's confusing and has limited value). Even if we did have the concept of partial completion at some point in the future, I think it'd be up to request options whether an error should be ignored, but it's important IMO to let gRPC failures be gRPC failures.

stephanos (Contributor, Author) replied Mar 25, 2024:
I've alluded to it in the other comment; the issue here is that once the Update is admitted, the Workflow needs to start or the wait stage accepted/completed would never be reached. And once the Workflow has started, it cannot be undone (as of now).

We could send a gRPC failure back in that case. That would mean that the client cannot see the partial success anymore.

cretz (Member) replied Mar 25, 2024:
Update failure/rejection outcome is already on the successful update response. A gRPC failure would be failure to admit an update which should also fail to start the workflow (same as failure to admit a signal on signal with start).

Also, I don't believe that multi-operation update should wait until update reaches a certain state, or that a worker even needs to run. I believe update "durably admitted" (which I heard we're adding support for so update-with-start works) is when it should return.

stephanos (Contributor, Author) replied Mar 29, 2024:
> Update failure/rejection outcome is already on the successful update response. A gRPC failure would be failure to admit an update which should also fail to start the workflow (same as failure to admit a signal on signal with start).

You're right, I've come to see it the same way 👍

> Also, I don't believe that multi-operation update should wait until update reaches a certain state, or that a worker even needs to run. I believe update "durably admitted" (which I heard we're adding support for so update-with-start works) is when it should return.

We are considering "durably admitted" now (approval pending) indeed.

It's my understanding that being able to wait for accepted/completed allows the client to receive the response in a single gRPC call instead of having to initiate a poll. Once we allow the user to pick wait stage "Admitted", they can choose whether to unblock early and poll or receive everything in one go.

Or are there other concerns for you?

Member replied:

> It's my understanding that being able to wait for accepted/completed allows the client to receive the response in a single gRPC call instead of having to initiate a poll.

This is my understanding too, I just think a supported wait-stage should be admitted with this call. One of the great aspects of Temporal is that when I submit something durably I know it will run, I don't have to wait for it to run. We may not be offering this for update, but since we're offering it for update-with-start, we should allow users to return without a worker running (i.e. basically immediately once made durable), same as signal-with-start. But I also support the ability to wait longer for update response for sure.
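Concretely, the wait stage being described already exists as an enum value in the update protos; a caller opting into early return might supply something like the following (a textproto sketch: `UPDATE_WORKFLOW_EXECUTION_LIFECYCLE_STAGE_ADMITTED` is a real value in `temporal.api.enums.v1.UpdateWorkflowExecutionLifecycleStage`, but the surrounding request shape here is illustrative):

```proto
# Textproto sketch (illustrative): an update request that returns once the
# update is durably admitted, without requiring a worker to be running,
# rather than waiting for the Accepted or Completed stage.
wait_policy {
  lifecycle_stage: UPDATE_WORKFLOW_EXECUTION_LIFECYCLE_STAGE_ADMITTED
}
```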

@@ -107,6 +107,16 @@ service WorkflowService {
};
}

// ExecuteMultiOperation executes multiple operations.
Member commented:

Can we specify the transactional expectations of this call? Basically guarantee it all fails or all succeeds atomically/transactionally?

stephanos (Contributor, Author) replied:

Okay, I took a stab at it. I'm not quite sure yet how to express the subtleties in error behavior: when the Update fails to be admitted, there actually won't be a new Workflow (akin to a transaction). However, if the Update doesn't reach Accepted/Completed, only the Update operation will fail but not the Start. So it actually has limited transactional guarantees.

cretz (Member) replied Mar 25, 2024:

> However, if the Update doesn't reach Accepted/Completed, only the Update operation will fail but not the Start. So it actually has limited transactional guarantees.

This depends on how you define fail. It won't fail from a gRPC perspective. I don't consider an update being rejected as an update "failing" from the POV of this API. If an update doesn't reach accepted/completed, the update response is capable of representing that. Still, if the start and update were accepted by the server and the update reached a state of durability, then the gRPC call succeeded, even if a single update had an application-level failure (not to be confused with a server/gRPC failure).

Also, I don't believe that multi-operation update should wait until update reaches a certain state, or that a worker even needs to run. I believe update "durably admitted" (which I heard we're adding support for so update-with-start works) is when it should return.
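Taken together, the semantics this thread converges on could be captured in the RPC's doc comment roughly as follows (a sketch of possible wording, not the merged text):

```proto
// Sketch of a doc comment reflecting the discussion above; not the
// merged wording.
//
// ExecuteMultiOperation executes multiple operations within a single
// workflow. Operations are validated and admitted together: if any
// operation fails validation or admission, none take effect and the call
// fails with a gRPC error. Once admitted, per-operation outcomes (e.g. an
// update that is rejected by the workflow) are reported inside the
// successful response, not as gRPC errors.
rpc ExecuteMultiOperation (ExecuteMultiOperationRequest)
    returns (ExecuteMultiOperationResponse) {}
```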

Resolved review threads: temporal/api/workflowservice/v1/request_response.proto, temporal/api/workflowservice/v1/service.proto
stephanos force-pushed the update-with-start2 branch 2 times, most recently from 29ef7b9 to 49c91ce (April 3, 2024)
stephanos marked this pull request as draft (April 3, 2024)

Resolved review threads: temporal/api/workflowservice/v1/service.proto (2 threads)

stephanos marked this pull request as ready for review (April 5, 2024)
stephanos requested a review from cretz (April 5, 2024)
cretz (Member) left a comment:

Looks good. May want to clarify idempotency behavior (i.e. that users should set the start workflow request ID, since they may repeatedly invoke this with fail-if-existing), but it's not strictly necessary.

(also, I disagree that multi operation should ever need to be speculative/lossy as that is very confusing for users, but this was already decided internally)

stephanos (Contributor, Author) commented Apr 8, 2024:

> May want to clarify idempotency behavior (i.e. that users should set start workflow request ID since they may repeatedly invoke this w/ fail-if-existing)

@cretz I'll think about this more ... I wonder if we should require a request ID 🤔 If we're not, I'll add a follow-up comment in a separate PR.

cretz (Member) commented Apr 8, 2024:

> I wonder if we should require a request ID

We should not require it IMO. It is a choice from the API caller whether they will ever re-invoke this, and request ID is only for that. Think about the user invoking this from the HTTP API. We really need to care about that developer's experience too (which is why lossy updates are rough on them).
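As a sketch of what is being suggested: `request_id` is an existing field on `StartWorkflowExecutionRequest`, and a caller that may re-invoke the call can set it to a client-generated UUID so retries are deduplicated server-side. All values below are hypothetical, and the enclosing `operations` shape is illustrative:

```proto
# Textproto sketch (hypothetical values): the embedded start request
# carries a client-generated request_id so that re-invoking
# ExecuteMultiOperation with the same payload is idempotent.
operations {
  start_workflow {
    workflow_id: "order-123"
    workflow_type { name: "OrderWorkflow" }
    task_queue { name: "orders" }
    request_id: "4a1c9f2e-0000-4000-8000-000000000000"  # client-side UUID
  }
}
```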

stephanos merged commit aa101ea into temporalio:master on Apr 13, 2024 (3 checks passed)
stephanos deleted the update-with-start2 branch (April 13, 2024)
stephanos added a commit to temporalio/temporal that referenced this pull request Apr 13, 2024
## What changed?

Happy-path implementation for temporalio/api#367

This is just the first version; it'll need more work (e.g. more
validation and tests) before it can be production-ready.

It is behind a feature flag.

## Why?

Putting up the review now so the [API change](temporalio/api#367) can be merged and no feature branch is necessary anymore.

## How did you test it?

Net new API behind feature flag, needs more testing.

## Potential risks

It's behind a feature flag.


---------

Co-authored-by: Alex Shtin <alex@shtin.com>

6 participants