Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCS-Based actor management implementation #6763

Merged
merged 15 commits into from Apr 13, 2020

Conversation

wumuzi520
Copy link
Contributor

@wumuzi520 wumuzi520 commented Jan 10, 2020

Why are these changes needed?

Pls see the <Design Document> first.

This PR implements the creation and reconstruction of actors based on gcs server.

Changes on gcs server side

Several important classes are added: GcsActor, GcsActorManager, GcsActorScheduler.

  • GcsActor: An abstraction of actor at GcsServer side, which wrapper the ActorTableData and provides some simple interface to access the field inside ActorTableData.
  • GcsActorManager: It is responsible for managing the lifecycle of all registered actors.
  • GcsActorScheduler: It is responsible for scheduling actors registered to GcsActorManager, it also contains a inner class called GcsLeasedWorker which is an abstraction of remote leased worker in raylet.

In addition, this PR has also made some changes to GcsNodeManager, it is responsible for monitoring and manage nodes.

Changes on raylet side

  • In the old actor management scheme, raylet will be responsible for updating ActorTableData, while in the new GCS-Based actor management scheme, we expect that GCS will be responsible for updating all ActorTableData. So, you will see that all logic about updating ActorTableData will be get ride off.
  • Besides, the raylet should cache the relationship of actor and leased worker because that the raylet should fast reply gcs server without lease anything when gcs server rebuild actors after restart. Pls see the <Design Document>.

Chages on worker side

  • invoke the gcs_rpc_client.CreateActor on the callback of ResolveDependencies.
  • Fast reply the gcs server without create anyting if it is already bound with an actor when gcs server rebuild actors rebuild actors after restart.

Related issue number

Checks

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20564/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20583/
Test FAILed.

@ericl ericl self-assigned this Jan 12, 2020
@zhijunfu zhijunfu self-requested a review January 15, 2020 11:40
@wumuzi520 wumuzi520 force-pushed the gcs_actor_manager branch 2 times, most recently from 9e82612 to 979cec7 Compare March 5, 2020 08:50
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22763/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22764/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22771/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22797/
Test PASSed.

@wumuzi520 wumuzi520 force-pushed the gcs_actor_manager branch 2 times, most recently from 1ccf692 to 4642605 Compare March 6, 2020 05:36
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22800/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22801/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22806/
Test PASSed.

@wumuzi520 wumuzi520 force-pushed the gcs_actor_manager branch 3 times, most recently from c3b1100 to 289feb7 Compare March 6, 2020 10:53
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22810/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22811/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22812/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22808/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22837/
Test FAILed.

@ericl ericl removed their assignment Mar 7, 2020
@ericl
Copy link
Contributor

ericl commented Mar 7, 2020

(please feel free to re-request once not WIP)

@stephanie-wang
Copy link
Contributor

Do not depend on io_context_ in the unit tests. This is important because then we will have fine-grained control over the order of execution and can test complicated concurrency scenarios. Instead, collect any callbacks in the test fixture, choose the order that you flush them in, and add assertions where necessary to check that the right callbacks are getting scheduled. I know that the code uses the io_context_ right now to retry requests. Instead, you can pass in a retry callback that abstracts away the io_context_. The callback can take in the method to retry as an argument and the period to wait before retrying. This will let us have control over when retries are executed and also check that we are retrying under the correct scenarios.

@stephanie-wang I agree with you that it's helpful to make tests more reliable and deterministic. However, I'm concerned there are too many callbacks, which will hurt code readability. Also, adding interfaces like WorkerLeaseInterface only for testing makes the code over abstracted.
Instead of using callbacks, I think we can just add a virtual function (e.g. RequestWorkerLease) in GcsActorScheduler, and then override this function in MockedGcsActorScheduler. This should have the same effect, without over abstracting non-test code.

@stephanie-wang
I also think there are too many input callback parameters, and code readability may be affected. Because only the logic of retry in the bottom layer uses io_context, I extract the retry function separately and define it as a virtual function. When writing unit tests, it only need to mock the virtual function out, and then there will be no logic about io_context, and only the input parameters of the constructor still contains io_context, but it has no impact on the unit test.

Sure, I'm not attached to the testing method that I suggested, and thanks for correcting me on the "bugs" I thought were there :) The main thing I care about is making sure that the tests are reliable and deterministic, as you say. So I think your suggestions about having the retry function be virtual would work as well. Could you add this to the current PR and remove the dependency on the io_context? I think this is important enough that we should have it in the initial commit for actor management. Thanks!

@stephanie-wang
Copy link
Contributor

Okay, after some discussion offline, let's get this PR merged without the changes to the unit tests, but the next focus should be on removing the io_context dependency from the unit tests.

By the way, the Travis failures on OSX and RLlib seem like they're related to this PR. Let me know if I can help with debugging.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/24517/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/24522/
Test PASSed.

OneNodeTest() : GcsActorSchedulerTest(1) {}
};
TEST_F(GcsActorSchedulerTest, TestScheduleFailedWithZeroNode) {
ASSERT_EQ(0, gcs_node_manager_->GetAllAliveNodes().size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually the actual value goes first and the expected value goes second. Not a big deal, but it makes the output nicer if the values do not match.

Copy link
Contributor

@stephanie-wang stephanie-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks great! I just left two last comments, about notifying the TaskManager when the creation task finishes and about whether we need to flush the actor's leased worker address.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/24550/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/24549/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/24553/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/24555/
Test PASSed.

@wumuzi520 wumuzi520 closed this Apr 12, 2020
@wumuzi520 wumuzi520 reopened this Apr 12, 2020
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/24581/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/24583/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/24584/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/24608/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/24615/
Test PASSed.

@stephanie-wang stephanie-wang merged commit 4a81793 into ray-project:master Apr 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants