
Reproducing issues: broken pipe & CUDA out of memory errors #1

Closed
velocityCavalry opened this issue Jun 26, 2021 · 4 comments

@velocityCavalry

velocityCavalry commented Jun 26, 2021

Hi,

I was trying to train BPR by running

python train_biencoder.py --gpus=7 --distributed_backend=ddp --train_file=nq-train.json \
--eval_file=nq-dev.json --gradient_clip_val=2.0 --max_epochs=40 --binary --train_batch_size=4 --eval_batch_size=4

However, I ran into a number of errors. For example, after the validation sanity check, a broken pipe error is raised from multiprocessing/connection.py; the traceback is below:

Traceback (most recent call last):
  File "/miniconda3/envs/bpr/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
  File "/miniconda3/envs/bpr/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/miniconda3/envs/bpr/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/miniconda3/envs/bpr/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

Furthermore, I encountered CUDA out-of-memory errors. A trimmed traceback is attached below (each line appears three times because 3 of the 7 GPUs I am using hit the OOM error):

Traceback (most recent call last):
  File "bpr/train_biencoder.py", line 53, in <module>
    trainer.fit(model)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1046, in fit
    self.accelerator_backend.train(model)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 57, in train
    self.ddp_train(process_idx=self.task_idx, mp_queue=None, model=model)
   File "/gscratch/cse/xyu530/miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", lin
e 224, in ddp_train
    results = self.trainer.run_pretrain_routine(model)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
    self.train()
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 394, in train
    self.run_training_epoch()
File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 516, in run_training_epoch
    self.run_evaluation(test_mode=False)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line
 582, in run_evaluation
     eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 396, in _evaluate
    eval_results = self.__run_eval_epoch_end(test_mode, outputs, dataloaders, using_eval_result)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 490, in __run_eval_epoch_end
    eval_results = model.validation_epoch_end(eval_results)
 File "bpr/bpr/biencoder.py", line 246, in validation_epoch_end
    dist.all_gather(passage_repr_list, passage_repr)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1185, in all_gather
    work = _default_pg.allgather([tensor_list], [tensor])
RuntimeError: CUDA out of memory. Tried to allocate 1.14 GiB (GPU X; 10.76 GiB total capacity; 8.14 GiB already allocated; 526.56 MiB free; 9.29 GiB reserved in total by PyTorch)
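
For reference, the failing line (bpr/bpr/biencoder.py:246) is a dist.all_gather over the computed passage representations. A minimal sketch of that pattern, assuming an initialized DDP process group (the helper name and buffer setup here are illustrative, not the actual BPR code; only the dist.all_gather call appears in the traceback):

import torch
import torch.distributed as dist

def gather_passage_reprs(passage_repr: torch.Tensor) -> torch.Tensor:
    # Each rank allocates world_size buffers shaped like passage_repr before
    # the collective runs, so peak memory for this step grows with the number
    # of GPUs participating in validation.
    world_size = dist.get_world_size()
    passage_repr_list = [torch.empty_like(passage_repr) for _ in range(world_size)]
    dist.all_gather(passage_repr_list, passage_repr)  # the call that raises the OOM above
    return torch.cat(passage_repr_list, dim=0)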

Sorry for putting all these outputs here!

I installed BPR with pip install -r requirements.txt and built the passage database successfully. The GPUs I am using are 7 GeForce RTX 2080 Ti cards.

Thanks for any help!

@ikuyamada
Member

Hi @velocityCavalry,

I am not sure why the broken pipe error happens. Our code does not directly control multiprocessing, so it may be due to an issue in PyTorch or PyTorch Lightning.

The OOM issue may be due to a lack of GPU memory. We conducted our experiments on 8 Tesla V100 GPUs with 16 GB of memory each. You can reduce GPU memory usage with the --train_batch_size and --eval_batch_size options.
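
For example, halving both values relative to the command in the original report (illustrative values, not a tested configuration) would look like:

python train_biencoder.py --gpus=7 --distributed_backend=ddp --train_file=nq-train.json \
    --eval_file=nq-dev.json --gradient_clip_val=2.0 --max_epochs=40 --binary \
    --train_batch_size=2 --eval_batch_size=2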

@velocityCavalry
Author

Thanks for the reply! I do specify --train_batch_size=4 --eval_batch_size=4, and even with such a small batch size I still hit the OOM issue. What would be a reasonable batch size, then?

@ikuyamada
Member

ikuyamada commented Jun 26, 2021

It seems that the OOM error happens in the validation step, which copies the computed passage representations across GPUs using the dist.all_gather function; this step can consume a lot of GPU memory. It can be disabled by specifying --eval_rank_local_gpu, so could you add that option and see whether the error goes away?
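
For reference, that is the original command with the suggested flag appended (assuming no other changes to the run):

python train_biencoder.py --gpus=7 --distributed_backend=ddp --train_file=nq-train.json \
    --eval_file=nq-dev.json --gradient_clip_val=2.0 --max_epochs=40 --binary \
    --train_batch_size=4 --eval_batch_size=4 --eval_rank_local_gpu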

@ikuyamada
Member

I am closing this issue as there has been no activity.
