Train loop may crash during checkpointing #2397
Comments
Hey @kokamido, thanks for letting us know! Could you please show us your save directory after the crash? Ping @pplantinga, I think this issue is for you ;)
I ran the repro with a clean save directory. After the crash it looks like this:
root@sbx-60283d040ccf4433b126ad86e96ba6ac-5ff484847d-kcvm5:~/speechbraindebugexample# ls experiments/ddp_crash_repro/save/
CKPT+2024-02-08+12-30-07+01  CKPT+2024-02-08+12-30-08+01  CKPT+2024-02-08+12-30-09+01  CKPT+2024-02-08+12-30-10+01  CKPT+2024-02-08+12-30-11+00
CKPT+2024-02-08+12-30-07+02  CKPT+2024-02-08+12-30-08+02  CKPT+2024-02-08+12-30-09+02  CKPT+2024-02-08+12-30-10+02
The error message for this run is:
Root Cause (first observed failure):
[0]:
time : 2024-02-08_12:30:11
host : sbx-60283d040ccf4433b126ad86e96ba6ac-5ff484847d-kcvm5
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 143902)
error_file: /tmp/torchelastic_9faem1ym/none_hfidn2p7/attempt_0/1/error.json
traceback : Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/speechbraindebugexample/repro.py", line 48, in fit
super(TestBrain, self).fit(epoch_counter, train_set, valid_set, progressbar, train_loader_kwargs, valid_loader_kwargs)
File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1366, in fit
self._fit_train(train_set=train_set, epoch=epoch, enable=enable)
File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1212, in _fit_train
self._save_intra_epoch_ckpt()
File "/usr/local/lib/python3.10/dist-packages/speechbrain/core.py", line 1386, in _save_intra_epoch_ckpt
self.checkpointer.save_and_keep_only(
File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 685, in save_and_keep_only
self.delete_checkpoints(
File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 988, in delete_checkpoints
self.find_checkpoints(
File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 825, in find_checkpoints
ckpts = self.list_checkpoints()
File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 914, in list_checkpoints
return self._construct_checkpoint_objects(self._list_checkpoint_dirs())
File "/usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py", line 1061, in _construct_checkpoint_objects
with open(ckpt_dir / METAFNAME) as fi:
FileNotFoundError: [Errno 2] No such file or directory: 'experiments/ddp_crash_repro/save/CKPT+2024-02-08+12-30-10+00/CKPT.yaml'
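For context, this traceback is consistent with a race between ranks rotating checkpoints: one rank deletes a CKPT+* directory in the window between another rank listing the save directory and opening its CKPT.yaml. The snippet below is only an illustration of that failure mode and of a listing routine that tolerates it; it is hypothetical code, not SpeechBrain's actual implementation.

from pathlib import Path

METAFNAME = "CKPT.yaml"  # meta file name, matching the traceback above

def list_checkpoint_metas(save_dir):
    # Collect checkpoint meta files, skipping directories that another
    # process deleted between glob() and open() instead of crashing.
    metas = {}
    for ckpt_dir in sorted(Path(save_dir).glob("CKPT+*")):
        try:
            with open(ckpt_dir / METAFNAME) as fi:
                metas[ckpt_dir.name] = fi.read()
        except FileNotFoundError:
            continue
    return metas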
Hey, could you please fetch the latest speechbrain version available through git clone and let us know if the issue is still there? Thanks!
It seems to be fixed as of b8a3ee3.
Good! Thanks for getting back to us @kokamido. We solved some DDP / checkpointing issues in the develop branch, and we are planning to merge it into the main branch very soon. Since this issue is solved, I will proceed with closing it. Feel free to reopen it if you require more in-depth help. Thanks again for opening the issue! :)
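For readers who cannot upgrade right away, a common way to avoid this class of race in DDP training is to let only the main process write and rotate checkpoints while the other ranks wait at a barrier. The sketch below shows that general pattern only; it assumes Checkpointer.save_and_keep_only accepts meta and num_to_keep keyword arguments, and it is not necessarily what the develop branch changed.

import torch.distributed as dist

def save_and_rotate(checkpointer, meta, num_to_keep=5):
    # Only rank 0 writes the new checkpoint and deletes stale ones, so no
    # other rank can observe a half-deleted CKPT+* directory.
    if not dist.is_initialized() or dist.get_rank() == 0:
        checkpointer.save_and_keep_only(meta=meta, num_to_keep=num_to_keep)
    # Every rank waits here before continuing into the next training step.
    if dist.is_initialized():
        dist.barrier()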
Describe the bug
Hi! I found something that looks like a race condition during checkpointing in DDP mode. In my setup it takes a random number of epochs to crash, usually 10-20; an epoch in my repro setup takes ~1 second.
Expected behaviour
Behavior should be at least deterministic.
To Reproduce
This behavior can be reproduced with the command
torchrun --nnodes=1 --nproc-per-node=2 repro.py repro.yaml
using the following code and config. Setting ckpt_interval_minutes: 0.01 is essential: it has to be quite small in order to reduce the time you have to wait for the crash. It's better to clean the experiments folder before running.
repro.py
repro.yaml
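The repro.py and repro.yaml attachments are collapsed above and are not reproduced here. As a rough, hypothetical sketch of the shape such a repro script might take (TestBrain, the dummy model and data, and the hyperparameter values are illustrative; the exact DDP run options and device handling expected by your SpeechBrain version may differ, and repro.yaml is not actually loaded here):

import sys
import torch
from speechbrain.core import Brain, parse_arguments
from speechbrain.utils.checkpoints import Checkpointer
from speechbrain.utils.distributed import ddp_init_group
from speechbrain.utils.epoch_loop import EpochCounter

class TestBrain(Brain):
    def compute_forward(self, batch, stage):
        # batch is a stacked tensor of random features in this toy setup
        return self.modules.model(batch.to(self.device))

    def compute_objectives(self, predictions, batch, stage):
        # Dummy loss: we only need backprop to run, not to learn anything
        return predictions.pow(2).mean()

if __name__ == "__main__":
    _, run_opts, _ = parse_arguments(sys.argv[1:])
    ddp_init_group(run_opts)

    model = torch.nn.Linear(10, 10)
    checkpointer = Checkpointer(
        "experiments/ddp_crash_repro/save", recoverables={"model": model}
    )
    epoch_counter = EpochCounter(limit=1000)
    checkpointer.add_recoverable("counter", epoch_counter)

    # Mirrors ckpt_interval_minutes: 0.01 from repro.yaml, so intra-epoch
    # checkpoints are written (and old ones rotated out) almost every epoch.
    run_opts["ckpt_interval_minutes"] = 0.01

    brain = TestBrain(
        modules={"model": model},
        opt_class=lambda params: torch.optim.SGD(params, lr=0.01),
        run_opts=run_opts,
        checkpointer=checkpointer,
    )
    train_data = [torch.randn(10) for _ in range(512)]
    brain.fit(epoch_counter, train_data, train_loader_kwargs={"batch_size": 8})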
Environment Details
GPU: 2xV100
OS: Ubuntu 22.04.3 LTS
Python: 3.10.12
CUDA: Cuda compilation tools, release 12.1, V12.1.105, Build cuda_12.1.r12.1/compiler.32688072_0
torch.cuda.nccl.version(): (2, 18, 1)
Dependencies:
Relevant Log Output
Additional Context
No response