
[tune/placement group] dist. training placement group support #11934

Merged: 22 commits into ray-project:master on Nov 16, 2020

Conversation

@oliverhu (Member) commented Nov 11, 2020

Why are these changes needed?

Follow-up on #9919 to add a placement group config (X workers per host) to the distributed trainable creator. Adapted from https://github.com/ray-project/ray/pull/11061/files.
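
For illustration, a rough sketch of how the per-host setting could be used with the torch trainable creator. This is my own example, not code from this PR; in particular the `num_workers_per_host` argument name and the values shown are assumptions about the option being added.

```python
# Hypothetical usage sketch; num_workers_per_host is an assumed name for the
# new "X workers per host" option and may not match the merged API exactly.
from ray import tune
from ray.tune.integration.torch import DistributedTrainableCreator


def train_func(config, checkpoint_dir=None):
    # A regular Tune training function; the creator wraps it so that each
    # trial launches a group of distributed torch workers.
    pass


trainable = DistributedTrainableCreator(
    train_func,
    num_workers=4,           # total distributed workers per trial
    num_workers_per_host=2,  # assumed knob: place exactly 2 workers on each node
)
tune.run(trainable, num_samples=2)
```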

Related issue number

#9919

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@richardliaw (Contributor)

@oliverhu were you actually able to test this on multiple nodes?

@richardliaw (Contributor) left a comment

Hey! Thanks a bunch for getting this together! Main comment is re: use_gpu.

Review threads:
  • python/ray/tune/integration/tensorflow.py (4 threads, outdated)
  • python/ray/tune/integration/torch.py (5 threads, 3 outdated)
@oliverhu (Member, Author)

> @oliverhu were you actually able to test this on multiple nodes?

Tested! After the latest rebase I see this warning after Tune training finishes; it doesn't seem to be related to this change, though.

(pid=raylet, ip=10.0.0.8) [2020-11-13 10:26:17,925 E 64000 64000] process.cc:498: Failed to kill process 64572 with error system:3: No such process
(pid=raylet, ip=10.0.0.8) [2020-11-13 10:26:17,925 E 64000 64000] process.cc:498: Failed to kill process 64565 with error system:3: No such process

@rkooo567 (Contributor) commented Nov 13, 2020

@oliverhu Probably a race condition coming from the placement group: processes belonging to placement groups are deleted when the job is done, and at the same time processes may also be terminated because the job is done. Not 100% sure though; we should take a closer look at this.

@oliverhu (Member, Author)

@rkooo567 shall we track that in a separate issue? I don't think we want to combine it with this PR/issue.

@rkooo567 (Contributor) commented Nov 13, 2020

That sounds good to me! Can you also make sure this error doesn't occur when placement groups are not used?

@oliverhu
Copy link
Member Author

the same error is still there even if I don't use placementgroup😢 @richardliaw

Comment on lines 54 to 68

    # Define the checkpoint directory to store the checkpoints
    checkpoint_dir = "./training_checkpoints"
    # Name of the checkpoint files
    checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
    multi_worker_model.fit(
        multi_worker_dataset,
        epochs=2,
        steps_per_epoch=70,
        callbacks=[
            TuneReportCheckpointCallback(
                {"mean_accuracy": "accuracy"}, filename="checkpoint")
            tf.keras.callbacks.ModelCheckpoint(
                filepath=checkpoint_prefix, save_weights_only=True),
            TuneReportCallback({"mean_accuracy": "accuracy"})
Contributor:
@oliverhu can you explain what you're doing with this change?

This won't trigger the Tune checkpointing mechanism (which requires a call to tune.checkpoint_dir).
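
For reference, the checkpointing mechanism referred to here is the function-API pattern where checkpoints are written inside a `tune.checkpoint_dir` block. A minimal sketch (my own, not code from this PR; the model and metric values are placeholders):

```python
import os

import torch
from ray import tune


def train_func(config, checkpoint_dir=None):
    model = torch.nn.Linear(1, 1)  # placeholder model
    for step in range(10):
        # ... training step would go here ...
        # Tune only registers checkpoints saved inside tune.checkpoint_dir().
        with tune.checkpoint_dir(step=step) as ckpt_dir:
            torch.save(model.state_dict(), os.path.join(ckpt_dir, "checkpoint"))
        tune.report(mean_accuracy=0.9)  # placeholder metric
```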

@oliverhu (Member, Author)

🤦 TuneReportCheckpointCallback doesn't work under distributed training (the process it runs in is not always worker 0). Apparently switching to the ModelCheckpoint callback out of the box is not compatible with Tune either. Let me update this.
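
For context on "not always worker 0": under MultiWorkerMirroredStrategy the chief is usually determined from the TF_CONFIG environment variable, so a callback cannot assume the process it runs in is worker 0. A minimal sketch of the standard check (my own illustration, not code from this PR):

```python
import json
import os


def is_chief_worker() -> bool:
    """Return True if this process should write checkpoints/reports."""
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    task = tf_config.get("task", {})
    task_type = task.get("type", "chief")  # no TF_CONFIG => single process => chief
    task_index = task.get("index", 0)
    return task_type == "chief" or (task_type == "worker" and task_index == 0)
```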

@oliverhu (Member, Author)

Found the issue; it seems to be a file name escaping problem.

@oliverhu (Member, Author)

Actually not; distributed torch has the same issue. @richardliaw have you seen this before?

2020-11-15 07:09:52,540	ERROR trial_runner.py:712 -- Trial WrappedDistributedTorchTrainable_88060_00003: Error handling checkpoint /home/ray1/ray_results/WrappedDistributedTorchTrainable_2020-11-15_07-09-41/WrappedDistributedTorchTrainable_88060_00003_3_2020-11-15_07-09-41/checkpoint_10/./
Traceback (most recent call last):
  File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 707, in _process_trial_save
    checkpoint=trial.saving_to)
  File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/callback.py", line 204, in on_checkpoint
    callback.on_checkpoint(**info)
  File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/syncer.py", line 450, in on_checkpoint
    self._sync_trial_checkpoint(trial, checkpoint)
  File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/syncer.py", line 426, in _sync_trial_checkpoint
    trial, checkpoint.value))
ray.tune.error.TuneError: Trial WrappedDistributedTorchTrainable_88060_00003: Checkpoint path /home/ray1/ray_results/WrappedDistributedTorchTrainable_2020-11-15_07-09-41/WrappedDistributedTorchTrainable_88060_00003_3_2020-11-15_07-09-41/checkpoint_10/./ not found after successful sync down.

@oliverhu (Member, Author)

It happens when the driver is on a remote host.

@oliverhu (Member, Author)

OK, it is 100% reproducible; should be easy to fix.

@oliverhu (Member, Author)

Seems like a regression in the checkpoint logic; even single-node checkpointing doesn't work.

2020-11-15 07:39:25,887	ERROR trial_runner.py:712 -- Trial TrainMNIST_9b13f_00019: Error handling checkpoint /home/ray1/ray_results/TrainMNIST_2020-11-15_07-38-51/TrainMNIST_9b13f_00019_19_lr=0.06339,momentum=0.2118_2020-11-15_07-39-19/checkpoint_16/model.pth
Traceback (most recent call last):
  File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 707, in _process_trial_save
    checkpoint=trial.saving_to)
  File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/callback.py", line 204, in on_checkpoint
    callback.on_checkpoint(**info)
  File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/syncer.py", line 450, in on_checkpoint
    self._sync_trial_checkpoint(trial, checkpoint)
  File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/syncer.py", line 426, in _sync_trial_checkpoint
    trial, checkpoint.value))
ray.tune.error.TuneError: Trial TrainMNIST_9b13f_00019: Checkpoint path /home/ray1/ray_results/TrainMNIST_2020-11-15_07-38-51/TrainMNIST_9b13f_00019_19_lr=0.06339,momentum=0.2118_2020-11-15_07-39-19/checkpoint_16/model.pth not found after successful sync down.

@oliverhu (Member, Author)

Discussed with Richard offline; this is actually caused by the lack of SSH access between the VMs. Checkpoints won't be synced if an SSH channel is not set up between the hosts.
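
As a side note, an alternative to node-to-node syncing over SSH is to sync trial checkpoints through cloud storage with `tune.SyncConfig`. This is a sketch only, not something configured in this PR; it assumes a Ray/Tune version that exposes `tune.SyncConfig`, and the bucket URL is a placeholder.

```python
from ray import tune


def train_func(config):
    tune.report(mean_accuracy=0.9)  # placeholder training function


tune.run(
    train_func,
    # Sync results/checkpoints via cloud storage instead of rsync-over-SSH
    # between nodes; "s3://my-bucket/tune-results" is a placeholder URL.
    sync_config=tune.SyncConfig(upload_dir="s3://my-bucket/tune-results"),
)
```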

Contributor:

nice :)

@richardliaw changed the title from "[Placement Group] Ray Tune + distributed training support #9919" to "[tune/placement group] dist. training placement group support" on Nov 16, 2020.
@richardliaw merged commit a501280 into ray-project:master on Nov 16, 2020.
@rkooo567 (Contributor)

Yay! It is really exciting that this has been merged!!

@oliverhu (Member, Author)

Thanks for the review and help, @richardliaw and @rkooo567!!
