Async Update Polling #1088
Conversation
internal/internal_workflow_client.go
Outdated
WaitPolicy: &wait,
Identity:   w.client.identity,
}
resp, err := w.client.workflowService.PollWorkflowExecutionUpdate(grpcCtx, &pollReq)
This needs to be in a loop and the server needs to wait no longer than 60s before sending back a "not yet, try again". This helps with our philosophy of promising to proxy creators and others that we have no call lasting longer than like 80s (we time out poll calls on our side after 70s, which we should do here probably too IMO).
Forgot about this aspect so overall I agree.
Curious about the specific timeout values though - the server's default long-poll timeout cap is 3 minutes so I assume the 60s limit comes from the client side?
> the server's default long-poll timeout cap is 3 minutes
Right, but what I believe the server does on task poll is to preempt that with an empty return after 60 or 70s IIRC. We also want to do that here. We don't want timeout to be reached, we want a successful-yet-empty-meaning-try-again response.
Server is just going to return a GRPC timeout status here, it won't (and IMO shouldn't) do the early-return trick. But I can detect that and retry if the parent context still has some life left in it.
I think we should do the early return "trick". I don't think it's a failing state to have a poll say "nothing yet" after a period of time. This is different than a timeout which the caller cannot differentiate between a real timeout (client or server side based on context deadline) and this one.
Whatever we choose to do, I think it needs to never take longer than a worker poll. And I believe we are telling people with proxies to set their max RPC timeouts at 80s. And the client should retry forever (or as long as their caller is willing) in the scenario where it isn't done yet. I think when you write the if statement to have the client retry on gRPC code timeout, it'll become clear that reusing timeout here is not the best.
Fixed up to loop until the parent context .Err() returns non-nil. The server is definitely returning a GRPC timeout status right now (and that was an intentional design choice) but I can verify that we're happy with the approach again.
What error status does the server return here, DEADLINE_EXCEEDED?
Yes. Well, and some other stuff too, like NOT_FOUND for workflow-not-found. But I assume you're asking in the context of timeouts.
Yes, I am basically trying to make sure these expected timeouts won't count against our SLAs
Yep, confirmed with the server team who requested this style in the first place.
if uh.err != nil || valuePtr == nil {
return uh.err
func (luh *lazyUpdateHandle) Get(ctx context.Context, valuePtr interface{}) error {
I remember discussions around being able to wait for a stage instead of completion, is that a future PR? Also, is there any way for me to just check whether completed in a non-poll/non-blocking way?
That would be subsequent to this, yes. This is just enough to get e2e working with the update being async and blocking on ACCEPTED.
As for checking Ready() (or similar) - by "non-blocking" would you be OK with a (short-ish) RPC? Because that would be necessary with the current impl (i.e. no Future to check)
Spoke w/Chad - we both see TryGet as a likely near-term request. Less clear is some form of "describe" API to observe execution state. We can wait for some use cases for the latter.
client/client.go
Outdated
// WorkflowUpdateRef reflects the fields needed to refer to a workflow
// update.
// NOTE: Experimental
Probably worth explaining why/how this is distinguished from a handle
Ended up getting rid of this interface in favor of a struct which is only a struct for future extensibility purposes.
internal/internal_workflow_client.go
Outdated
WaitPolicy: &wait,
Identity:   w.client.identity,
}
resp, err := w.client.workflowService.PollWorkflowExecutionUpdate(grpcCtx, &pollReq)
I would say since the retrying is hidden from the user, whether or not the early-return thing is used isn't relevant except to make it easy for us to detect and retry (and possibly the proxy thing, but I don't know if we actually advise users that).
The early return does feel slightly lame, though I can't really pinpoint exactly why. An actual timeout feels more semantically appropriate to me, and we can distinguish which one is which by including a bit of metadata.
As much as it's not my favorite argument, I can kinda get sticking with the early return to keep things consistent w/ other long pollers
@@ -63,6 +66,8 @@ const (
	defaultGetHistoryTimeout = 65 * time.Second
	getSystemInfoTimeout     = 5 * time.Second
	pollUpdateTimeout        = 60 * time.Second
For other polls, we make sure the server returns 10 seconds before this I believe. Can you confirm that will happen here?
As it stands, that will not happen. Server prefers to return DeadlineExceeded here.
> Server prefers to return DeadlineExceeded here.
I don't think that's true on normal long polls. I think it returns an empty success and I think it should here. Helps with error code tracking and such. I don't think it's an error case for your update to not be done yet.
> I don't think that's true on normal long polls
This is correct. It was an intentional decision to use DEADLINE_EXCEEDED here and understood to be different than the queue polling calls.
internal/internal_workflow_client.go
Outdated
)
resp, err := w.client.workflowService.PollWorkflowExecutionUpdate(ctx, &pollReq)
cancel()
if err == context.DeadlineExceeded || status.Code(err) == codes.DeadlineExceeded {
How do you know this isn't a legitimate deadline-exceeded from the server? Or how do you know this isn't a user-defined context deadline that we should bubble out? I think we need an empty response from the server here before SDK timeout. We don't need error codes appearing on normal operation. SDK side timeout IMO should only occur on actual error.
I think I'm missing your intent here ... all DeadlineExceededs from the server are legitimate?
If it's the parent context that failed, we won't loop through again (see guard on for loop) - I added a test for the same.
> all DeadlineExceededs from the server are legitimate?

Is there no way to tell the difference between the server not responding and the update not being complete yet?
Saw the discussion above
If you're fetching the result of an update, I don't think there exists a meaningful difference.
> SDK side timeout IMO should only occur on actual error.
Is your suggestion then that when the caller-supplied context ends (timeout, cancel, whatever) and the outcome is not yet available then Handle.Get(ctx, valuePtr) should not return an error and also should not set the valuePtr?
(a) can we run the this-code-will-not-respect-the-passed-context argument to ground before shifting? I really don't see in the code where that's the case and am asking you to make that clear
(b) re: the failure metric - I have no idea and will look into it. But again, the server's intention is to use a DEADLINE_EXCEEDED here so we may need to make allowances like we've done with longPoll.
> can we run the this-code-will-not-respect-the-passed-context argument to ground before shifting?
I think your context check in the loop is enough. So we can shift now. User context is no longer a concern. It just popped in my head as the first of many concerns about reusing errors for flow here.
> But again, the server's intention is to use a DEADLINE_EXCEEDED here so we may need to make allowances like we've done with longPoll.
Can we also run the "server's intention" argument into the ground before shifting? Why is that the intention? Why can that not change? Do they do that on other long polls?
It can absolutely change but you're talking about something that was discussed and settled a few weeks ago so it's not the sort of thing that's going to happen in the context of this PR.
👍 I vote for changing it. If it can't change now, ok, but I think it should. I think we'll be happier for it. In the meantime, the error-based control flow is probably ok enough. My fear is if we don't change soon we never will.
If it helps, the code change to treat a nil response.Outcome as a signal to retry (subject to the parent context, natch) means that we can change the server without breaking field SDKs.
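That forward-compatible check could be as simple as the following sketch (`pollResponse` and its `Outcome` field only loosely mirror the real response shape; the names are illustrative):

```go
package main

import "fmt"

// pollResponse loosely mimics PollWorkflowExecutionUpdateResponse; Outcome
// is the field the client inspects (names are illustrative).
type pollResponse struct {
	Outcome *string
}

// shouldRetry treats a successful response with a nil Outcome as
// "not done yet, poll again". This lets the server later switch from
// DEADLINE_EXCEEDED to an empty success without breaking deployed SDKs.
func shouldRetry(resp *pollResponse) bool {
	return resp == nil || resp.Outcome == nil
}

func main() {
	done := "completed"
	fmt.Println(shouldRetry(&pollResponse{}), shouldRetry(&pollResponse{Outcome: &done}))
}
```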
Approving with the known caveat that we will change the server to not use error codes to signify that the update is not completed yet.
The codegen referenced here has been removed from this repository.
This consists mainly of exposing the WaitPolicy to users of the workflow client and implementing a lazy version of WorkflowUpdateHandle.
Loop inside the PollWorkflowClient stub function, calling up to the server with a context built from the long-poll timeout.
Also refactors tests to use an init() func since gomock attaches its validation to t.Cleanup() so it's wrong to re-use gomock controller instances across sub-tests.
Will now retry in 3 cases:
1. local context deadline exceeded
2. gRPC error with DEADLINE_EXCEEDED
3. no error, but the gRPC response does not contain an outcome
What was changed
Adds support for async updates
Why?
Planned feature
Checklist
Closes
How was this tested:
Unit tests here and a succeeding feature test that is waiting on this PR and a server PR