Worker client replacement #721

Merged
merged 4 commits into from
Apr 22, 2024
Conversation

@cretz cretz commented Apr 18, 2024

What was changed

Added Worker::replace_client. Notes:

  • Replacement unregisters itself with old client and re-registers with new client for eager workflow start
  • Uses lock of arc, so no currently operating poll on existing client will be interrupted (this has been tested)
  • This is as close as we can reasonably get to dynamic client endpoints without working through Tonic's connectivity
    • This is actually much cleaner for SDKs since they only really care about this for workers and can't interrupt anything running anyways
  • Had to change a few internal things but nothing drastic
  • Wrote test that starts two servers and swaps between (had to lower polling timeout on them)
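The "lock of arc" note can be pictured with a minimal sketch. It assumes a hypothetical `Client` type and a hypothetical `ReplaceableClient` wrapper, neither of which is the SDK's actual code. The lock guards only the `Arc` pointer, so a poll that already cloned the `Arc` keeps using the old client while new lookups see the replacement.

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for a connected client; not the SDK's real type.
struct Client {
    endpoint: String,
}

// Hypothetical wrapper: the lock guards only the Arc pointer,
// not the calls made through the client.
struct ReplaceableClient {
    inner: RwLock<Arc<Client>>,
}

impl ReplaceableClient {
    fn new(client: Client) -> Self {
        Self { inner: RwLock::new(Arc::new(client)) }
    }

    // Clone the Arc under a short read lock; the caller then works without the lock.
    fn current(&self) -> Arc<Client> {
        self.inner.read().unwrap().clone()
    }

    // Swap the pointer and hand back the previous client.
    fn replace(&self, client: Client) -> Arc<Client> {
        let mut guard = self.inner.write().unwrap();
        std::mem::replace(&mut *guard, Arc::new(client))
    }
}

fn main() {
    let holder = ReplaceableClient::new(Client { endpoint: "server-a".into() });
    // An in-flight poll would hold this Arc for its whole duration.
    let in_flight = holder.current();
    holder.replace(Client { endpoint: "server-b".into() });
    // The old handle is still valid; only new lookups see the replacement.
    println!("in flight: {}, next: {}", in_flight.endpoint, holder.current().endpoint);
}
```

The old client is freed once the last in-flight holder drops its `Arc`.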

Checklist

  1. Closes [Feature Request] Dynamic client for workers #477

@cretz cretz requested a review from a team as a code owner April 18, 2024 15:49
@@ -296,7 +296,7 @@ pub struct ConfiguredClient<C> {
     options: Arc<ClientOptions>,
     headers: Arc<RwLock<ClientHeaders>>,
     /// Capabilities as read from the `get_system_info` RPC call made on client connection
-    capabilities: Option<get_system_info_response::Capabilities>,
+    capabilities: Arc<Option<get_system_info_response::Capabilities>>,
Member

Capabilities is just a bunch of bools. No need to Arc it. In fact, you could change the proto build.rs to add a derive Copy to it

Member Author

@cretz cretz Apr 19, 2024

I remember the problem now. The issue is that WorkerClient::capabilities() wants to return a reference, but the capabilities now live on a replaceable client, so I can't return a reference. Should I clone/copy on every WorkerClient::capabilities() call and change it to return a copy?

Member

Yep, that's totally fine.

Member Author

👍 Will make this call return a clone of the proto on each invocation and change back everywhere else to continue doing the same.
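A rough illustration of the reference problem discussed in this thread, using hypothetical `Capabilities`, `Client`, and `WorkerClientBag` types rather than the real proto or SDK structs: a reference to the capabilities would borrow from the lock guard, which is released when the accessor returns, so the accessor returns a copy instead.

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for the capabilities proto: a handful of bools.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct Capabilities {
    eager_workflow_start: bool,
    sdk_metadata: bool,
}

// Hypothetical client holding the capabilities read at connection time.
struct Client {
    capabilities: Option<Capabilities>,
}

// Hypothetical holder for a client that may be swapped at any time.
struct WorkerClientBag {
    client: RwLock<Arc<Client>>,
}

impl WorkerClientBag {
    // Returning `Option<&Capabilities>` would borrow from the read guard,
    // which is dropped before the caller could use it, so return by value.
    fn capabilities(&self) -> Option<Capabilities> {
        self.client.read().unwrap().capabilities
    }
}

fn main() {
    let caps = Capabilities { eager_workflow_start: true, sdk_metadata: false };
    let bag = WorkerClientBag {
        client: RwLock::new(Arc::new(Client { capabilities: Some(caps) })),
    };
    let snapshot = bag.capabilities();
    // Replacing the client afterwards does not affect the copied snapshot.
    *bag.client.write().unwrap() = Arc::new(Client { capabilities: None });
    println!("snapshot: {:?}, current: {:?}", snapshot, bag.capabilities());
}
```

Copying a few bools on each call is cheap, which is why returning by value is acceptable here.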

     );
-    let worker_key = client.workers().register(Box::new(provider));
+    let worker_key = Mutex::new(client.workers().register(Box::new(slot_provider.clone())));
Member

The provider shouldn't need to be cloned?

Member Author

Currently it is a non-cloneable struct that is moved into this "register" call. But I need to call "register" with it again on client replacement, so I now need multiple references to this slot provider. That is why I made it cloneable and stored it on the worker. I can wrap it in an Arc and update everywhere it's used, or I can clone it upon use here (or do more advanced stuff w/ references and lifetimes).

Member

Ah, yeah, I see since it is getting put into Self below, I missed that initially.

I think the alternative here would be to not instantiate the worker key here any longer, since you can create it from within self now that it owns the provider. It probably ends up being the same issue though.

Imo it probably does make sense to wrap the provider in an arc rather than the channel inside it.

Member Author

I was hoping to get away with as few changes to the existing worker registration/eager code as I could, but I fear we're basically saying I may need to do a bit more to it.

Member

It's not that much change, and it's nice to maintain some good stewardship where the structure makes sense rather than making the minimum possible change. It's not like this is in a huge rush or anything.

Member Author

Switched from Box to Arc, you were correct, not a big change (417b709).
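The shape of the Box-to-Arc change might look roughly like the following. `SlotProvider`, `WorkerRegistry`, and `Worker` here are simplified, hypothetical stand-ins, and `replace_registry` is an illustrative name, not the method in the PR. With shared ownership the worker keeps a handle and can register the same provider again with a new client's registry.

```rust
use std::sync::Arc;

// Hypothetical stand-ins; the real registry and provider types differ.
struct SlotProvider {
    task_queue: String,
}

#[derive(Default)]
struct WorkerRegistry {
    providers: Vec<Arc<SlotProvider>>,
}

impl WorkerRegistry {
    // Takes shared ownership, so the caller can keep its own handle.
    fn register(&mut self, provider: Arc<SlotProvider>) -> usize {
        self.providers.push(provider);
        self.providers.len() - 1
    }
}

struct Worker {
    slot_provider: Arc<SlotProvider>,
    worker_key: usize,
}

impl Worker {
    fn new(registry: &mut WorkerRegistry, task_queue: &str) -> Self {
        let slot_provider = Arc::new(SlotProvider { task_queue: task_queue.to_string() });
        let worker_key = registry.register(Arc::clone(&slot_provider));
        Self { slot_provider, worker_key }
    }

    // Re-register the same provider with a different client's registry.
    fn replace_registry(&mut self, new_registry: &mut WorkerRegistry) {
        self.worker_key = new_registry.register(Arc::clone(&self.slot_provider));
    }
}

fn main() {
    let mut old_registry = WorkerRegistry::default();
    let mut new_registry = WorkerRegistry::default();
    let mut worker = Worker::new(&mut old_registry, "orders");
    worker.replace_registry(&mut new_registry);
    println!("{} registered under key {}", worker.slot_provider.task_queue, worker.worker_key);
}
```

This sketch omits unregistering from the old registry, which the PR description says the real replacement does.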

@cretz cretz merged commit 9f7e87d into temporalio:master Apr 22, 2024
6 checks passed
@cretz cretz deleted the replace-client branch April 22, 2024 18:30
Comment on lines +82 to 83
slot_provider: Arc<SlotProvider>,
/// Registration key to enable eager workflow start for this worker
Contributor

@cretz Sorry for being so late in the game (already merged...). I wonder whether holding a reference to slot_provider will hang worker shutdown in some cases. slot_provider contains a reference to external_wft_tx, and I think we need to drop all refs to it so that the rx part closes, which unblocks worker shutdown...
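The concern can be reproduced in miniature. This sketch uses `std::sync::mpsc` as a stand-in for whatever channel backs external_wft_tx (an assumption), and a hypothetical `SlotProvider` that owns the sender. The receiver only observes a closed channel once every handle to the provider, and therefore the sender, has been dropped.

```rust
use std::sync::mpsc::{self, Receiver, TryRecvError};
use std::sync::Arc;

// Hypothetical provider that owns the sending half of the channel.
struct SlotProvider {
    tx: mpsc::Sender<u32>,
}

// Drain pending items, then report whether all senders are gone.
fn receiver_is_closed(rx: &Receiver<u32>) -> bool {
    loop {
        match rx.try_recv() {
            Ok(_) => continue,
            Err(TryRecvError::Empty) => return false,
            Err(TryRecvError::Disconnected) => return true,
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel::<u32>();
    let registered = Arc::new(SlotProvider { tx });
    // The worker keeps a second handle to the same provider.
    let held_by_worker = Arc::clone(&registered);
    held_by_worker.tx.send(7).unwrap();

    // Dropping the registry's handle is not enough: the sender is still alive.
    drop(registered);
    println!("closed after unregister: {}", receiver_is_closed(&rx));

    // Only when the worker's handle goes away does the receiver see the close.
    drop(held_by_worker);
    println!("closed after worker drop: {}", receiver_is_closed(&rx));
}
```

Any shutdown path that waits on the receiving side closing would therefore stall for as long as the worker holds its handle.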

Member Author

@cretz cretz Apr 23, 2024

Wouldn't this have failed tests that rely on worker shutdown to complete (or are there none or would worker shutdown only trigger in certain ways)? Is the SlotProvider drop needed in time for shutdown or finalize_shutdown? The latter consumes this worker and therefore this slot provider should be dropped.

Can we write a test for this? Do you have alternative suggestions on how to handle this so I can still use the same slot provider upon client replacement? Maybe we can make unregister() return the slot provider it removed?
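The last suggestion, having unregister() return the provider it removed, could look something like this. The `Registry` type, its methods, and `move_worker` are hypothetical illustrations and do not reflect what any follow-up change actually implemented. Ownership moves from the old registry to the new one, so the worker never holds an extra handle.

```rust
use std::collections::HashMap;

// Hypothetical provider; the real one carries channels and permits.
struct SlotProvider {
    task_queue: String,
}

#[derive(Default)]
struct Registry {
    next_key: u64,
    workers: HashMap<u64, Box<SlotProvider>>,
}

impl Registry {
    fn register(&mut self, provider: Box<SlotProvider>) -> u64 {
        let key = self.next_key;
        self.next_key += 1;
        self.workers.insert(key, provider);
        key
    }

    // Hand ownership of the removed provider back to the caller.
    fn unregister(&mut self, key: u64) -> Option<Box<SlotProvider>> {
        self.workers.remove(&key)
    }
}

// Move a provider between registries without keeping a second reference.
fn move_worker(old: &mut Registry, new: &mut Registry, key: u64) -> Option<u64> {
    let provider = old.unregister(key)?;
    Some(new.register(provider))
}

fn main() {
    let mut old = Registry::default();
    let mut new = Registry::default();
    let key = old.register(Box::new(SlotProvider { task_queue: "orders".into() }));
    let new_key = move_worker(&mut old, &mut new, key).expect("worker was registered");
    println!("moved {} to key {}", new.workers[&new_key].task_queue, new_key);
}
```

Because only one owner exists at a time, unregistering during shutdown drops the provider and anything it holds.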

Member Author

Ok, I have opened #725. I think that is a cleaner approach anyway, and I wish I had thought of it before.
