Multi-GPU training code gets stuck after a few iterations #13

Closed
SSSHZ opened this issue Jun 23, 2022 · 3 comments

SSSHZ commented Jun 23, 2022

Hi, I tried the multi-GPU training code, but the program always got stuck after a few iterations.

Environment:

  • PyTorch 1.7.1
  • CUDA 10.2
  • GCC 7.5.0
  • Ubuntu 18.04.3 LTS

Reproduce the bug:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 train_net_ddp.py --config_file coco --gpus 4
```

Output:

  File "/home/a/anaconda3/envs/e2ec/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/aa/anaconda3/envs/e2ec/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/a/anaconda3/envs/e2ec/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/a/anaconda3/envs/e2ec/lib/python3.7/site-packages/torch/distributed/launch.py", line 253, in main
    process.wait()
  File "/home/a/anaconda3/envs/e2ec/lib/python3.7/subprocess.py", line 1019, in wait
    return self._wait(timeout=timeout)
  File "/home/a/anaconda3/envs/e2ec/lib/python3.7/subprocess.py", line 1653, in _wait
    (pid, sts) = self._try_wait(0)
  File "/home/a/anaconda3/envs/e2ec/lib/python3.7/subprocess.py", line 1611, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
```
zhang-tao-whu (Owner) commented

Hello, I think this bug is caused by allocating too many workers for the dataloader. When training with multiple GPUs, --bs is the batch size per GPU, so the effective batch size of your command above is 24*4=96. For convenience, I set the number of dataloader workers equal to the batch size, and 96 workers is probably too many.

You can try a smaller batch size, such as --bs 6 when using 4 GPUs, or you can modify num_workers in the function make_ddp_train_loader in dataset/data_loader.py.
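
For illustration, here is a minimal sketch of a DDP train loader that takes num_workers as an explicit argument instead of tying it to the batch size. The signature and defaults are assumptions; the actual make_ddp_train_loader in dataset/data_loader.py may differ.

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def make_ddp_train_loader(dataset, batch_size, num_workers=4):
    # Shard the dataset across DDP ranks so each process sees a unique slice.
    sampler = DistributedSampler(dataset, shuffle=True)
    # Keep num_workers independent of batch_size; a few workers per GPU is
    # usually enough and avoids spawning dozens of loader processes.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=sampler,
        num_workers=num_workers,
        pin_memory=True,
        drop_last=True,
    )
```

With a DistributedSampler, remember to call sampler.set_epoch(epoch) at the start of each epoch so shuffling differs across epochs.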


SSSHZ commented Jun 28, 2022

After setting train.batch_size = 6 in configs/coco.py, I tried num_workers=4, 2, and 0 for make_ddp_train_loader in dataset/data_loader.py, and the same issue still happened.

Perhaps this bug is not caused by the dataloader workers.


SSSHZ commented Jul 9, 2022

The problem might be caused by the combination of PyTorch 1.7.1, CUDA 10.2, and NCCL 2.7.8.
The easiest solution for me was to switch to PyTorch 1.7.0.
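
For reference, a downgrade along these lines should pull in PyTorch 1.7.0 built against CUDA 10.2; this is only a sketch, so check pytorch.org/get-started/previous-versions for the exact matching torchvision/torchaudio pins.

```
conda install pytorch==1.7.0 cudatoolkit=10.2 -c pytorch
```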

SSSHZ closed this as completed Jul 9, 2022