[tune/placement group] dist. training placement group support #11934
Conversation
@oliverhu were you actually able to test this on multiple nodes?
Hey! Thanks a bunch for getting this together! Main comment is re: use_gpu.
Tested! After the latest rebase, I see this warning after Tune training finishes, but it doesn't seem to be related to this change.
@oliverhu Probably some race condition happening from the placement group (processes belonging to placement groups are deleted when the job is done, and maybe at the same time, processes are terminated because the job is done)? Not 100% sure though. We should have a closer look at this.
@rkooo567 shall we keep that tracked in another issue? I don't think we want to combine that with this PR/issue.
That sounds good to me! Can you also make sure this error didn't occur when placement groups are not used?
The same error is still there even if I don't use placement groups 😢 @richardliaw
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
+ # Define the checkpoint directory to store the checkpoints
+ checkpoint_dir = "./training_checkpoints"
+ # Name of the checkpoint files
+ checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
  multi_worker_model.fit(
      multi_worker_dataset,
      epochs=2,
      steps_per_epoch=70,
      callbacks=[
-         TuneReportCheckpointCallback(
-             {
-                 "mean_accuracy": "accuracy"
-             }, filename="checkpoint")
+         tf.keras.callbacks.ModelCheckpoint(
+             filepath=checkpoint_prefix, save_weights_only=True),
+         TuneReportCallback({
+             "mean_accuracy": "accuracy"
+         })
@oliverhu can you explain what you're doing with this change? This won't trigger the Tune checkpointing mechanism (which requires a call to tune.checkpoint_dir).
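For context, a minimal sketch of the function-API checkpointing pattern being referred to here; names such as train_fn and step are illustrative and not from this PR:

```python
import os

from ray import tune


def train_fn(config, checkpoint_dir=None):
    for step in range(10):
        # ... one unit of training work per iteration ...
        # Writing a file inside tune.checkpoint_dir() is what registers a
        # checkpoint with Tune so it can be tracked, synced, and restored.
        with tune.checkpoint_dir(step=step) as ckpt_dir:
            path = os.path.join(ckpt_dir, "checkpoint")
            with open(path, "w") as f:
                f.write(str(step))  # stand-in for model.save_weights(path)
        tune.report(mean_accuracy=0.9)


tune.run(train_fn, num_samples=1)
```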
🤦 TuneReportCheckpointCallback doesn't work under distributed training (it is not always on worker 0). Apparently switching to the ModelCheckpoint callback out of the box is not compatible with Tune. Let me update this.
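A rough sketch of the kind of work-around this implies, where only the chief worker registers the Tune checkpointing callback and every other worker only reports metrics; _is_chief is a hypothetical helper based on TF_CONFIG and is not part of this PR:

```python
import json
import os

from ray.tune.integration.keras import (TuneReportCallback,
                                        TuneReportCheckpointCallback)


def _is_chief():
    # Hypothetical helper: under MultiWorkerMirroredStrategy the chief is the
    # "chief" task, or worker index 0 when no explicit chief is configured.
    task = json.loads(os.environ.get("TF_CONFIG", "{}")).get("task", {})
    return task.get("type", "chief") == "chief" or task.get("index", 0) == 0


callbacks = ([TuneReportCheckpointCallback({"mean_accuracy": "accuracy"},
                                           filename="checkpoint")]
             if _is_chief() else
             [TuneReportCallback({"mean_accuracy": "accuracy"})])
```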
Found the issue.. seems to be a file name escaping problem.
Actually not, distributed torch has the same issue.. @richardliaw did you see this before?
2020-11-15 07:09:52,540 ERROR trial_runner.py:712 -- Trial WrappedDistributedTorchTrainable_88060_00003: Error handling checkpoint /home/ray1/ray_results/WrappedDistributedTorchTrainable_2020-11-15_07-09-41/WrappedDistributedTorchTrainable_88060_00003_3_2020-11-15_07-09-41/checkpoint_10/./
Traceback (most recent call last):
File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 707, in _process_trial_save
checkpoint=trial.saving_to)
File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/callback.py", line 204, in on_checkpoint
callback.on_checkpoint(**info)
File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/syncer.py", line 450, in on_checkpoint
self._sync_trial_checkpoint(trial, checkpoint)
File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/syncer.py", line 426, in _sync_trial_checkpoint
trial, checkpoint.value))
ray.tune.error.TuneError: Trial WrappedDistributedTorchTrainable_88060_00003: Checkpoint path /home/ray1/ray_results/WrappedDistributedTorchTrainable_2020-11-15_07-09-41/WrappedDistributedTorchTrainable_88060_00003_3_2020-11-15_07-09-41/checkpoint_10/./ not found after successful sync down.
It happens when the driver is on a remote host.
ok, it is 100% reproducible.. should be easy to fix
Seems like a regression in the checkpoint logic.. even single-node checkpointing doesn't work.
2020-11-15 07:39:25,887 ERROR trial_runner.py:712 -- Trial TrainMNIST_9b13f_00019: Error handling checkpoint /home/ray1/ray_results/TrainMNIST_2020-11-15_07-38-51/TrainMNIST_9b13f_00019_19_lr=0.06339,momentum=0.2118_2020-11-15_07-39-19/checkpoint_16/model.pth
Traceback (most recent call last):
File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 707, in _process_trial_save
checkpoint=trial.saving_to)
File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/callback.py", line 204, in on_checkpoint
callback.on_checkpoint(**info)
File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/syncer.py", line 450, in on_checkpoint
self._sync_trial_checkpoint(trial, checkpoint)
File "/home/ray1/.local/share/virtualenvs/ray1-OvGpSAQ1/lib/python3.6/site-packages/ray/tune/syncer.py", line 426, in _sync_trial_checkpoint
trial, checkpoint.value))
ray.tune.error.TuneError: Trial TrainMNIST_9b13f_00019: Checkpoint path /home/ray1/ray_results/TrainMNIST_2020-11-15_07-38-51/TrainMNIST_9b13f_00019_19_lr=0.06339,momentum=0.2118_2020-11-15_07-39-19/checkpoint_16/model.pth not found after successful sync down.
Discussed with Richard offline; it is actually caused by there being no SSH access between the VMs. Checkpoints won't be synced if an SSH channel is not set up between the hosts.
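For reference, a hedged sketch of one way around the SSH requirement when the nodes already share a filesystem; tune.SyncConfig and its sync_to_driver flag are assumed to be available in the Ray version used here:

```python
from ray import tune


def train_fn(config):
    tune.report(mean_accuracy=0.9)


# Assumption: with a shared filesystem (e.g. NFS) mounted at the same path on
# every node, checkpoint syncing to the driver over SSH/rsync can be disabled.
tune.run(
    train_fn,
    local_dir="/shared/ray_results",  # hypothetical shared mount
    sync_config=tune.SyncConfig(sync_to_driver=False))
```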
nice :)
Yay! It is really exciting that this has been merged!!
Thanks for the review and help, @richardliaw and @rkooo567!!
Why are these changes needed?
Follow-up on #9919 to add a placement group config (X workers per host) to the distributed trainable creator. Adapted from https://github.com/ray-project/ray/pull/11061/files.
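As an illustration of the intended usage, a minimal sketch; the parameter name num_workers_per_host follows this PR's description and should be treated as an assumption, and train_fn is a placeholder:

```python
from ray import tune
from ray.tune.integration.torch import DistributedTrainableCreator


def train_fn(config, checkpoint_dir=None):
    # distributed torch training loop goes here; results are reported to Tune
    tune.report(mean_loss=0.1)


# Placement group support added here: spread 4 workers across the cluster,
# with at most 2 workers packed onto any single host.
trainable = DistributedTrainableCreator(
    train_fn,
    num_workers=4,
    num_workers_per_host=2,  # assumption: the new per-host setting from this PR
    use_gpu=False)

tune.run(trainable, num_samples=1)
```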
Related issue number
#9919
Checks
I've run scripts/format.sh to lint the changes in this PR.