
Ray cluster CRD and example CR + multi-ray-cluster operator #12098

Merged: 26 commits merged into ray-project:master on Dec 14, 2020

Conversation

@DmitriGekhtman (Contributor) commented Nov 18, 2020

Why are these changes needed?

Following up on #11929, this PR adds a Kubernetes CRD describing a Ray cluster configuration, along with an example Ray cluster CR. A Ray cluster CR is essentially a reformatted version of one of the current Ray cluster configs.

This PR also extends the operator so that it can manage multiple Ray clusters.
Using kubectl, users can create, update, and delete clusters and check monitoring logs.
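
For context, here is a minimal sketch (not the PR's actual implementation) of how an operator process can watch RayCluster custom resources with the official kubernetes Python client. The API group/version/plural names and the two handler functions are assumptions for illustration only.

from kubernetes import client, config, watch

def run_operator(namespace="ray"):
    # Assumes the operator runs in a pod with a suitable service account.
    config.load_incluster_config()
    api = client.CustomObjectsApi()
    stream = watch.Watch().stream(
        api.list_namespaced_custom_object,
        group="cluster.ray.io", version="v1",
        namespace=namespace, plural="rayclusters")
    for event in stream:
        cluster_cr = event["object"]
        name = cluster_cr["metadata"]["name"]
        if event["type"] in ("ADDED", "MODIFIED"):
            create_or_update_cluster(name, cluster_cr)  # hypothetical handler
        elif event["type"] == "DELETED":
            clean_up_cluster(name)  # hypothetical handler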


Related issue number

#11929 #11545

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Create, update, delete, logging working as expected for me locally.

@ericl removed their assignment on Nov 18, 2020
@DmitriGekhtman force-pushed the dmitri/k8s-operator-crd branch 3 times, most recently from 6827805 to 14c8749 on November 23, 2020 16:07
@DmitriGekhtman (Contributor, Author) commented Nov 30, 2020

Currently debugging the following error, which occurs after starting a cluster, shutting it down, and then trying to start a new cluster (i.e., applying, deleting, and re-applying a RayCluster custom resource). The error occurs immediately after the monitor initializes a StandardAutoscaler.

edit: It takes place in Monitor.update_raylet_map().

edit: This probably happens because we can only support one GlobalState per interpreter session.

edit: Running Monitors in subprocesses instead of threads could solve this -- going to try that.

[2020-11-30 08:22:38,240 C 27 87] service_based_gcs_client.cc:207: Couldn't reconnect to GCS server. The last attempted GCS server address was :0
*** StackTrace Information ***
    @     0x7f74211423e5  google::GetStackTraceToString()
    @     0x7f74210b704e  ray::GetCallTrace()
    @     0x7f74210dc454  ray::RayLog::~RayLog()
    @     0x7f7420d2952a  ray::gcs::ServiceBasedGcsClient::ReconnectGcsServer()
    @     0x7f7420d295ad  ray::gcs::ServiceBasedGcsClient::GcsServiceFailureDetected()
    @     0x7f7420d3228f  _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc19GetAllNodeInfoReplyEEZNS4_12GcsRpcClient14GetAllNodeInfoERKNS4_21GetAllNodeInfoRequestERKSt8functionIS8_EEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
    @     0x7f7420d37115  ray::rpc::ClientCallImpl<>::OnReplyReceived()
    @     0x7f7420c4cb12  _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
    @     0x7f74211afa51  boost::asio::detail::scheduler::do_run_one()
    @     0x7f74211b06a9  boost::asio::detail::scheduler::run()
    @     0x7f74211b2a07  boost::asio::io_context::run()
    @     0x7f7420c15e44  _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN3ray3gcs19GlobalStateAccessorC4ERKSsS7_bEUlvE_EEEEE6_M_runEv
    @     0x7f7421455120  execute_native_thread_routine
    @     0x7f7422194609  start_thread
    @     0x7f74220bb293  clone
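
(A rough sketch of the "run each Monitor in its own subprocess" idea mentioned above, assuming a hypothetical run_monitor(cluster_name) entry point; the PR's actual operator code differs in its details.)

import multiprocessing as mp

monitor_processes = {}  # cluster name -> monitor subprocess

def start_monitor(cluster_name):
    # Each monitor gets its own interpreter, so each one gets its own
    # GlobalState / GCS connection instead of sharing one per process.
    proc = mp.Process(name=cluster_name, target=run_monitor, args=(cluster_name,))
    proc.start()
    monitor_processes[cluster_name] = proc

def stop_monitor(cluster_name):
    proc = monitor_processes.pop(cluster_name)
    proc.terminate()
    proc.join()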

@DmitriGekhtman changed the title from "Ray cluster CRD and example CR" to "Ray cluster CRD and example CR + multi-ray-cluster operator" on Dec 1, 2020
@DmitriGekhtman (Contributor, Author) commented:

Running the monitor in a subprocess fixed the previous issue -- the code is now functional!
To-do list in PR description.

# Excerpt from the operator code under review; assumes `import multiprocessing as mp`
# above, and that `f` is the per-cluster monitor entry point.
self.subprocess.terminate()
self.subprocess.join()
# Reinstantiate process with f as target and start.
self.subprocess = mp.Process(name=self.name, target=f)
@yiranwang52 (Contributor) commented Dec 1, 2020:

is it possible for the subprocesses to be leaked if operator.py is killed unexpectedly?

@DmitriGekhtman (Contributor, Author) replied:

I think currently if operator.py is killed unexpectedly, the operator pod will shut down.
Which reminds me -- I was going to make all of the ray-clusters managed by the operator fate-share with the operator.
So then everything would go down.
That's of course not optimal behavior.

Let me also experiment to see what happens if operator.py is killed unexpectedly but the pod doesn't go down.

@yiranwang52 (Contributor) replied:

I don't think you want to kill all the clusters when the operator goes down.
It just needs to know about them when it starts.

@DmitriGekhtman (Contributor, Author) replied Dec 2, 2020:

Ah, right. I think that should work with the code as it is. (When the operator restarts it should create_or_update on each cluster.)
Will test.
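
(A hedged sketch of the restart behavior described above: on operator startup, list the existing RayCluster CRs and run create_or_update on each one. The group/version/plural names and the create_or_update_cluster helper are assumptions, not the PR's exact code.)

from kubernetes import client, config

def reconcile_existing_clusters(namespace="ray"):
    config.load_incluster_config()
    api = client.CustomObjectsApi()
    existing = api.list_namespaced_custom_object(
        group="cluster.ray.io", version="v1",
        namespace=namespace, plural="rayclusters")
    for cluster_cr in existing["items"]:
        # create_or_update_cluster is a hypothetical stand-in for the
        # operator's per-cluster handler.
        create_or_update_cluster(cluster_cr["metadata"]["name"], cluster_cr)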

@DmitriGekhtman (Contributor, Author) replied:

I think Python does an OK job of cleaning up processes spawned by the multiprocessing module?
To check, I ran a test script that uses multiprocessing to spawn a dummy process that runs forever. After doing a Ctrl-C keyboard interrupt, the pid of the process is no longer present in the output of ps -ef.

I'll check that the monitor processes behave in the same way.

Let me know if there's something that can be done to ensure that the processes are cleaned correctly.
(besides rewriting everything so that each cluster's autoscaler .update() runs in a single for loop, which is probably a good idea to implement sooner rather than later as a replacement for this subprocess logic)
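
(Roughly the kind of test script described above: spawn a dummy child that runs forever, print its pid, hit Ctrl-C, then check ps -ef for that pid. As noted further down in the thread, the child also receives the KeyboardInterrupt in this setup, which is why this test looked cleaner than the SIGTERM case turned out to be.)

import multiprocessing as mp
import time

def run_forever():
    while True:
        time.sleep(1)

if __name__ == "__main__":
    child = mp.Process(target=run_forever)
    child.start()
    print("child pid:", child.pid)
    child.join()  # block until Ctrl-C, then check `ps -ef` for the child pid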

@yiranwang52 (Contributor) replied:

Your test sounds good enough.

@DmitriGekhtman (Contributor, Author) replied Dec 2, 2020:

Actually, you were right -- terminating operator.py did leak the monitor process. (In my test script, the child process was receiving the keyboard interrupt.)

I've now set the monitor subprocess to be a daemon, and that works -- when I run kill -SIGTERM <operator_pid> it stops the monitor subprocess too.
(If you do kill -SIGKILL <operator_pid> then the daemon monitor subprocess still leaks.)
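
(A minimal sketch of the daemon-flag fix described above, with run_monitor standing in hypothetically for the monitor entry point.)

import multiprocessing as mp

# As observed in this thread: with daemon=True, `kill -SIGTERM <operator_pid>`
# also stops the monitor subprocess; `kill -SIGKILL <operator_pid>` still leaks it.
monitor = mp.Process(name="monitor", target=run_monitor, daemon=True)
monitor.start()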

@yiranwang52 (Contributor) replied:

Okay, then this is still a problem we should fix in the near future.

@DmitriGekhtman force-pushed the dmitri/k8s-operator-crd branch 2 times, most recently from 8909a0d to 5fe741e on December 3, 2020 03:55
@edoakes added the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) on Dec 11, 2020
@DmitriGekhtman force-pushed the dmitri/k8s-operator-crd branch 3 times, most recently from b2ac1f7 to 87c99aa on December 14, 2020 03:48
@DmitriGekhtman added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) and removed the @author-action-required label on Dec 14, 2020
@edoakes merged commit 11ce1dc into ray-project:master on Dec 14, 2020
@DmitriGekhtman deleted the dmitri/k8s-operator-crd branch on January 1, 2021 09:46
Labels: tests-ok (The tagger certifies test failures are unrelated and assumes personal liability.)

4 participants