File "/home/a/anaconda3/envs/e2ec/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/aa/anaconda3/envs/e2ec/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/a/anaconda3/envs/e2ec/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/home/a/anaconda3/envs/e2ec/lib/python3.7/site-packages/torch/distributed/launch.py", line 253, in main
process.wait()
File "/home/a/anaconda3/envs/e2ec/lib/python3.7/subprocess.py", line 1019, in wait
return self._wait(timeout=timeout)
File "/home/a/anaconda3/envs/e2ec/lib/python3.7/subprocess.py", line 1653, in _wait
(pid, sts) = self._try_wait(0)
File "/home/a/anaconda3/envs/e2ec/lib/python3.7/subprocess.py", line 1611, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt```
Hello, I think this bug is caused by allocating too many workers to the dataloader. When training with multiple GPUs, `--bs` is the batch size of a single GPU, so the actual batch size of your command above is 24*4 = 96. For convenience, I directly set the number of dataloader workers equal to the batch size, and 96 workers is probably too many.
You can try a smaller batch size, such as `--bs 6` when using 4 GPUs. Or you can modify `num_workers` in the function make_ddp_train_loader in dataset/data_loader.py.
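For reference, a minimal sketch of the kind of change meant here, assuming a standard PyTorch DataLoader built around a DistributedSampler. The real make_ddp_train_loader in dataset/data_loader.py has its own arguments, collate function, and sampler setup; `max_workers` below is a hypothetical parameter added only to illustrate capping the worker count instead of tying it to the batch size.

```python
# Sketch only -- not the repository's actual implementation.
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def make_ddp_train_loader(dataset, batch_size, max_workers=4):
    """Build a DDP train loader whose worker count no longer tracks batch_size."""
    sampler = DistributedSampler(dataset, shuffle=True)
    # Cap the workers rather than using num_workers=batch_size.
    num_workers = min(batch_size, max_workers)
    return DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=sampler,
        num_workers=num_workers,
        pin_memory=True,
        drop_last=True,
    )
```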
After setting train.batch_size = 6 in configs/coco.py, I tried num_workers=4, 2, and 0 for make_ddp_train_loader in dataset/data_loader.py, and the same issue always happened.
Perhaps this bug is not caused by the dataloader workers.
Hi, I tried the multi-GPU training code, but the program always gets stuck after a few iterations.
Environment:
Reproduce the bug:
```
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 train_net_ddp.py --config_file coco --gpus 4
```
Output: