[tune/placement group] dist. training placement group support #11934
Conversation
@oliverhu were you actually able to test this on multiple nodes?
Hey! Thanks a bunch for getting this together! Main comment is re: use_gpu.
Tested! After the latest rebase, I see this warning after Tune training finishes, but it doesn't seem to be related to this change.
@oliverhu Probably some race condition happening from the placement group (processes belonging to placement groups are deleted when the job is done, and maybe at the same time, processes are terminated because the job is done)? Not 100% sure though. We should have a closer look at this.
@rkooo567 shall we keep that tracked in another issue? I don't think we want to combine that with this PR/issue.
That sounds good to me! Can you also make sure this error didn't occur when placement groups are not used?
The same error is still there even if I don't use placement groups 😢 @richardliaw
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
+ # Define the checkpoint directory to store the checkpoints
+ checkpoint_dir = "./training_checkpoints"
+ # Name of the checkpoint files
+ checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
  multi_worker_model.fit(
      multi_worker_dataset,
      epochs=2,
      steps_per_epoch=70,
      callbacks=[
-         TuneReportCheckpointCallback(
-             {
-                 "mean_accuracy": "accuracy"
-             }, filename="checkpoint")
+         tf.keras.callbacks.ModelCheckpoint(
+             filepath=checkpoint_prefix, save_weights_only=True),
+         TuneReportCallback({
+             "mean_accuracy": "accuracy"
+         })
@oliverhu can you explain what you're doing with this change? This won't trigger the Tune checkpointing mechanism (which requires a call to tune.checkpoint_dir).
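For context, a minimal sketch of the function-API checkpointing pattern being referred to here; names such as train_fn and step are illustrative and not from this PR:

```python
import os

from ray import tune


def train_fn(config, checkpoint_dir=None):
    for step in range(10):
        # ... one unit of training work per iteration ...
        # Writing a file inside tune.checkpoint_dir() is what registers a
        # checkpoint with Tune so it can be tracked, synced, and restored.
        with tune.checkpoint_dir(step=step) as ckpt_dir:
            path = os.path.join(ckpt_dir, "checkpoint")
            with open(path, "w") as f:
                f.write(str(step))  # stand-in for model.save_weights(path)
        tune.report(mean_accuracy=0.9)


tune.run(train_fn, num_samples=1)
```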
🤦 TuneReportCheckpointCallback doesn't work under distributed training (it is not always on worker 0). Apparently switching to the ModelCheckpoint callback out of the box is not compatible with Tune. Let me update this.
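A rough sketch of the kind of work-around this implies, where only the chief worker registers the Tune checkpointing callback and every other worker only reports metrics; _is_chief is a hypothetical helper based on TF_CONFIG and is not part of this PR:

```python
import json
import os

from ray.tune.integration.keras import (TuneReportCallback,
                                        TuneReportCheckpointCallback)


def _is_chief():
    # Hypothetical helper: under MultiWorkerMirroredStrategy the chief is the
    # "chief" task, or worker index 0 when no explicit chief is configured.
    task = json.loads(os.environ.get("TF_CONFIG", "{}")).get("task", {})
    return task.get("type", "chief") == "chief" or task.get("index", 0) == 0


callbacks = ([TuneReportCheckpointCallback({"mean_accuracy": "accuracy"},
                                           filename="checkpoint")]
             if _is_chief() else
             [TuneReportCallback({"mean_accuracy": "accuracy"})])
```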
Found the issue.. seems to be a file name escaping problem.
Actually not, distributed torch has the same issue.. @richardliaw did you see this before?
2020-11-15 07:09:52,540 ERROR trial_runner.py:712 -- Trial WrappedDistributedTorchTrainable_88060_00003: Error handling checkpoint /home/ray1/ray_results/WrappedDistributedTorchTrainable_2020-11-15_07-09-41/WrappedDistributedTorchTrainable_88060_00003_3_2020-11-15_07-09-41/checkpoint_10/./
Traceback (most recent call last):
File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 707, in _process_trial_save
checkpoint=trial.saving_to)
File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/callback.py", line 204, in on_checkpoint
callback.on_checkpoint(**info)
File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/syncer.py", line 450, in on_checkpoint
self._sync_trial_checkpoint(trial, checkpoint)
File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/syncer.py", line 426, in _sync_trial_checkpoint
trial, checkpoint.value))
ray.tune.error.TuneError: Trial WrappedDistributedTorchTrainable_88060_00003: Checkpoint path /home/ray1/ray_results/WrappedDistributedTorchTrainable_2020-11-15_07-09-41/WrappedDistributedTorchTrainable_88060_00003_3_2020-11-15_07-09-41/checkpoint_10/./ not found after successful sync down.
It happens when the driver is on a remote host.
ok, it is 100% reproducible.. should be easy to fix
Seems like a regression in the checkpoint logic.. even single-node checkpointing doesn't work.
2020-11-15 07:39:25,887 ERROR trial_runner.py:712 -- Trial TrainMNIST_9b13f_00019: Error handling checkpoint /home/ray1/ray_results/TrainMNIST_2020-11-15_07-38-51/TrainMNIST_9b13f_00019_19_lr=0.06339,momentum=0.2118_2020-11-15_07-39-19/checkpoint_16/model.pth
Traceback (most recent call last):
File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 707, in _process_trial_save
checkpoint=trial.saving_to)
File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/callback.py", line 204, in on_checkpoint
callback.on_checkpoint(**info)
File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/syncer.py", line 450, in on_checkpoint
self._sync_trial_checkpoint(trial, checkpoint)
File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/syncer.py", line 426, in _sync_trial_checkpoint
trial, checkpoint.value))
ray.tune.error.TuneError: Trial TrainMNIST_9b13f_00019: Checkpoint path /home/ray1/ray_results/TrainMNIST_2020-11-15_07-38-51/TrainMNIST_9b13f_00019_19_lr=0.06339,momentum=0.2118_2020-11-15_07-39-19/checkpoint_16/model.pth not found after successful sync down.
Discussed with Richard offline; it is actually caused by there being no SSH access between the VMs. Checkpoints won't be synced if an SSH channel is not set up between the hosts.
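For reference, a hedged sketch of one way around the SSH requirement when the nodes already share a filesystem; tune.SyncConfig and its sync_to_driver flag are assumed to be available in the Ray version used here:

```python
from ray import tune


def train_fn(config):
    tune.report(mean_accuracy=0.9)


# Assumption: with a shared filesystem (e.g. NFS) mounted at the same path on
# every node, checkpoint syncing to the driver over SSH/rsync can be disabled.
tune.run(
    train_fn,
    local_dir="/shared/ray_results",  # hypothetical shared mount
    sync_config=tune.SyncConfig(sync_to_driver=False))
```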
nice :)
Yay! It is really exciting that this has been merged!!
Thanks for the review and help, @richardliaw and @rkooo567!!
Why are these changes needed?
Follow-up on #9919 to add a placement group config (X workers per host) to the distributed trainable creator. Adapted from https://github.com/ray-project/ray/pull/11061/files.
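As an illustration of the intended usage, a minimal sketch; the parameter name num_workers_per_host follows this PR's description and should be treated as an assumption, and train_fn is a placeholder:

```python
from ray import tune
from ray.tune.integration.torch import DistributedTrainableCreator


def train_fn(config, checkpoint_dir=None):
    # distributed torch training loop goes here; results are reported to Tune
    tune.report(mean_loss=0.1)


# Placement group support added here: spread 4 workers across the cluster,
# with at most 2 workers packed onto any single host.
trainable = DistributedTrainableCreator(
    train_fn,
    num_workers=4,
    num_workers_per_host=2,  # assumption: the new per-host setting from this PR
    use_gpu=False)

tune.run(trainable, num_samples=1)
```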
Related issue number
#9919
Checks
I've run scripts/format.sh to lint the changes in this PR.