
Ray cluster CRD and example CR + multi-ray-cluster operator #12098

Merged: 26 commits merged into ray-project:master on Dec 14, 2020

Conversation

@DmitriGekhtman (Contributor) commented Nov 18, 2020

Why are these changes needed?

Following up on #11929, this PR adds a Kubernetes CRD describing a Ray cluster configuration, along with an example Ray cluster CR. A Ray cluster CR is essentially a reformatted version of one of the current Ray cluster configs.

This PR also extends the operator so that it can manage multiple Ray clusters.
Using kubectl, users can create, update, and delete clusters and check monitoring logs.
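
For context, here is a minimal sketch (not the PR's actual implementation) of how an operator process can watch RayCluster custom resources with the official kubernetes Python client. The API group/version/plural names and the two handler functions are assumptions for illustration only.

from kubernetes import client, config, watch

def run_operator(namespace="ray"):
    # Assumes the operator runs in a pod with a suitable service account.
    config.load_incluster_config()
    api = client.CustomObjectsApi()
    stream = watch.Watch().stream(
        api.list_namespaced_custom_object,
        group="cluster.ray.io", version="v1",
        namespace=namespace, plural="rayclusters")
    for event in stream:
        cluster_cr = event["object"]
        name = cluster_cr["metadata"]["name"]
        if event["type"] in ("ADDED", "MODIFIED"):
            create_or_update_cluster(name, cluster_cr)  # hypothetical handler
        elif event["type"] == "DELETED":
            clean_up_cluster(name)  # hypothetical handler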


Related issue number

#11929 #11545

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Create, update, delete, logging working as expected for me locally.

@ericl removed their assignment on Nov 18, 2020
@DmitriGekhtman force-pushed the dmitri/k8s-operator-crd branch 3 times, most recently from 6827805 to 14c8749 on November 23, 2020 16:07
@DmitriGekhtman (Contributor, Author) commented Nov 30, 2020

Currently debugging the following error, which occurs after starting a cluster, shutting it down, and then trying to start a new cluster (i.e., applying, deleting, and re-applying a RayCluster custom resource). The error occurs immediately after the monitor initializes a StandardAutoscaler.

edit: It takes place in Monitor.update_raylet_map().

edit: This probably happens because we can only support one GlobalState per interpreter session.

edit: Running Monitors in subprocesses instead of threads could solve this -- going to try that.

[2020-11-30 08:22:38,240 C 27 87] service_based_gcs_client.cc:207: Couldn't reconnect to GCS server. The last attempted GCS server address was :0
*** StackTrace Information ***
    @     0x7f74211423e5  google::GetStackTraceToString()
    @     0x7f74210b704e  ray::GetCallTrace()
    @     0x7f74210dc454  ray::RayLog::~RayLog()
    @     0x7f7420d2952a  ray::gcs::ServiceBasedGcsClient::ReconnectGcsServer()
    @     0x7f7420d295ad  ray::gcs::ServiceBasedGcsClient::GcsServiceFailureDetected()
    @     0x7f7420d3228f  _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc19GetAllNodeInfoReplyEEZNS4_12GcsRpcClient14GetAllNodeInfoERKNS4_21GetAllNodeInfoRequestERKSt8functionIS8_EEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
    @     0x7f7420d37115  ray::rpc::ClientCallImpl<>::OnReplyReceived()
    @     0x7f7420c4cb12  _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
    @     0x7f74211afa51  boost::asio::detail::scheduler::do_run_one()
    @     0x7f74211b06a9  boost::asio::detail::scheduler::run()
    @     0x7f74211b2a07  boost::asio::io_context::run()
    @     0x7f7420c15e44  _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN3ray3gcs19GlobalStateAccessorC4ERKSsS7_bEUlvE_EEEEE6_M_runEv
    @     0x7f7421455120  execute_native_thread_routine
    @     0x7f7422194609  start_thread
    @     0x7f74220bb293  clone
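
(A rough sketch of the "run each Monitor in its own subprocess" idea mentioned above, assuming a hypothetical run_monitor(cluster_name) entry point; the PR's actual operator code differs in its details.)

import multiprocessing as mp

monitor_processes = {}  # cluster name -> monitor subprocess

def start_monitor(cluster_name):
    # Each monitor gets its own interpreter, so each one gets its own
    # GlobalState / GCS connection instead of sharing one per process.
    proc = mp.Process(name=cluster_name, target=run_monitor, args=(cluster_name,))
    proc.start()
    monitor_processes[cluster_name] = proc

def stop_monitor(cluster_name):
    proc = monitor_processes.pop(cluster_name)
    proc.terminate()
    proc.join()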

@DmitriGekhtman changed the title from "Ray cluster CRD and example CR" to "Ray cluster CRD and example CR + multi-ray-cluster operator" on Dec 1, 2020
@DmitriGekhtman (Contributor, Author) commented:

Running the monitor in a subprocess fixed the previous issue -- the code is now functional!
To-do list in PR description.

# Excerpt from the operator code under review; assumes `import multiprocessing as mp`
# above, and that `f` is the per-cluster monitor entry point.
self.subprocess.terminate()
self.subprocess.join()
# Reinstantiate process with f as target and start.
self.subprocess = mp.Process(name=self.name, target=f)
@yiranwang52 (Contributor) commented Dec 1, 2020:

is it possible for the subprocesses to be leaked if operator.py is killed unexpectedly?

@DmitriGekhtman (Contributor, Author) replied:

I think currently if operator.py is killed unexpectedly, the operator pod will shut down.
Which reminds me -- I was going to make all of the ray-clusters managed by the operator fate-share with the operator.
So then everything would go down.
That's of course not optimal behavior.

Let me also experiment to see what happens if operator.py is killed unexpectedly but the pod doesn't go down.

@yiranwang52 (Contributor) replied:

I don't think you want to kill all the clusters when the operator goes down.
It just needs to know about them when it starts.

@DmitriGekhtman (Contributor, Author) replied Dec 2, 2020:

Ah, right. I think that should work with the code as it is. (When the operator restarts it should create_or_update on each cluster.)
Will test.
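
(A hedged sketch of the restart behavior described above: on operator startup, list the existing RayCluster CRs and run create_or_update on each one. The group/version/plural names and the create_or_update_cluster helper are assumptions, not the PR's exact code.)

from kubernetes import client, config

def reconcile_existing_clusters(namespace="ray"):
    config.load_incluster_config()
    api = client.CustomObjectsApi()
    existing = api.list_namespaced_custom_object(
        group="cluster.ray.io", version="v1",
        namespace=namespace, plural="rayclusters")
    for cluster_cr in existing["items"]:
        # create_or_update_cluster is a hypothetical stand-in for the
        # operator's per-cluster handler.
        create_or_update_cluster(cluster_cr["metadata"]["name"], cluster_cr)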

@DmitriGekhtman (Contributor, Author) replied:

I think Python does an OK job of cleaning up processes spawned by the multiprocessing module?
To check, I ran a test script that uses multiprocessing to spawn a dummy process that runs forever. After doing a Ctrl-C keyboard interrupt, the pid of the process is no longer present in the output of ps -ef.

I'll check that the monitor processes behave in the same way.

Let me know if there's something that can be done to ensure that the processes are cleaned correctly.
(besides rewriting everything so that each cluster's autoscaler .update() runs in a single for loop, which is probably a good idea to implement sooner rather than later as a replacement for this subprocess logic)
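
(Roughly the kind of test script described above: spawn a dummy child that runs forever, print its pid, hit Ctrl-C, then check ps -ef for that pid. As noted further down in the thread, the child also receives the KeyboardInterrupt in this setup, which is why this test looked cleaner than the SIGTERM case turned out to be.)

import multiprocessing as mp
import time

def run_forever():
    while True:
        time.sleep(1)

if __name__ == "__main__":
    child = mp.Process(target=run_forever)
    child.start()
    print("child pid:", child.pid)
    child.join()  # block until Ctrl-C, then check `ps -ef` for the child pid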

@yiranwang52 (Contributor) replied:

Your test sounds good enough.

@DmitriGekhtman (Contributor, Author) replied Dec 2, 2020:

Actually, you were right -- terminating operator.py did leak the monitor process. (In my test script, the child process was receiving the keyboard interrupt.)

I've now set the monitor subprocess to be a daemon, and that works -- when I run kill -SIGTERM <operator_pid> it stops the monitor subprocess too.
(If you do kill -SIGKILL <operator_pid> then the daemon monitor subprocess still leaks.)
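
(A minimal sketch of the daemon-flag fix described above, with run_monitor standing in hypothetically for the monitor entry point.)

import multiprocessing as mp

# As observed in this thread: with daemon=True, `kill -SIGTERM <operator_pid>`
# also stops the monitor subprocess; `kill -SIGKILL <operator_pid>` still leaks it.
monitor = mp.Process(name="monitor", target=run_monitor, daemon=True)
monitor.start()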

@yiranwang52 (Contributor) replied:

Okay, then this is still a problem we should fix in the near future.

@DmitriGekhtman force-pushed the dmitri/k8s-operator-crd branch 2 times, most recently from 8909a0d to 5fe741e on December 3, 2020 03:55
@edoakes added the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) on Dec 11, 2020
@DmitriGekhtman force-pushed the dmitri/k8s-operator-crd branch 3 times, most recently from b2ac1f7 to 87c99aa on December 14, 2020 03:48
@DmitriGekhtman added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) and removed the @author-action-required label on Dec 14, 2020
@edoakes merged commit 11ce1dc into ray-project:master on Dec 14, 2020
@DmitriGekhtman deleted the dmitri/k8s-operator-crd branch on January 1, 2021 09:46
Labels: tests-ok (The tagger certifies test failures are unrelated and assumes personal liability.)

4 participants