Multi-GPU training code gets stuck after a few iterations #13

Closed
SSSHZ opened this issue Jun 23, 2022 · 3 comments

SSSHZ commented Jun 23, 2022

Hi, I tried the multi-GPU training code, but the program always got stuck after a few iterations.

Environment:

  • PyTorch 1.7.1
  • CUDA 10.2
  • GCC 7.5.0
  • Ubuntu 18.04.3 LTS

Reproduce the bug:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 train_net_ddp.py --config_file coco --gpus 4
```

Output:

  File "/home/a/anaconda3/envs/e2ec/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/aa/anaconda3/envs/e2ec/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/a/anaconda3/envs/e2ec/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/a/anaconda3/envs/e2ec/lib/python3.7/site-packages/torch/distributed/launch.py", line 253, in main
    process.wait()
  File "/home/a/anaconda3/envs/e2ec/lib/python3.7/subprocess.py", line 1019, in wait
    return self._wait(timeout=timeout)
  File "/home/a/anaconda3/envs/e2ec/lib/python3.7/subprocess.py", line 1653, in _wait
    (pid, sts) = self._try_wait(0)
  File "/home/a/anaconda3/envs/e2ec/lib/python3.7/subprocess.py", line 1611, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
```
zhang-tao-whu (Owner) commented

Hello, I think this bug is caused by allocating too many workers for the dataloader. When training with multiple GPUs, --bs is the batch size per GPU, so the effective batch size of your command above is 24*4=96. For convenience, I set the number of dataloader workers equal to the batch size, and 96 workers is probably too many.

You can try a smaller batch size, such as --bs 6 when using 4 GPUs, or you can modify num_workers in the function make_ddp_train_loader in dataset/data_loader.py.
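
For illustration, here is a minimal sketch of a DDP train loader that takes num_workers as an explicit argument instead of tying it to the batch size. The signature and defaults are assumptions; the actual make_ddp_train_loader in dataset/data_loader.py may differ.

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def make_ddp_train_loader(dataset, batch_size, num_workers=4):
    # Shard the dataset across DDP ranks so each process sees a unique slice.
    sampler = DistributedSampler(dataset, shuffle=True)
    # Keep num_workers independent of batch_size; a few workers per GPU is
    # usually enough and avoids spawning dozens of loader processes.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=sampler,
        num_workers=num_workers,
        pin_memory=True,
        drop_last=True,
    )
```

With a DistributedSampler, remember to call sampler.set_epoch(epoch) at the start of each epoch so shuffling differs across epochs.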


SSSHZ commented Jun 28, 2022

After setting train.batch_size = 6 in configs/coco.py, I tried num_workers=4, 2, and 0 for make_ddp_train_loader in dataset/data_loader.py, and the same issue still happened.

Perhaps this bug is not caused by the dataloader workers.


SSSHZ commented Jul 9, 2022

The problem might be caused by the combination of PyTorch 1.7.1, CUDA 10.2, and NCCL 2.7.8.
The easiest solution for me was to switch to PyTorch 1.7.0.
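
For reference, a downgrade along these lines should pull in PyTorch 1.7.0 built against CUDA 10.2; this is only a sketch, so check pytorch.org/get-started/previous-versions for the exact matching torchvision/torchaudio pins.

```
conda install pytorch==1.7.0 cudatoolkit=10.2 -c pytorch
```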

SSSHZ closed this as completed Jul 9, 2022