
Reproducing issues: broken pipe & CUDA out of memory errors #1

Closed
velocityCavalry opened this issue Jun 26, 2021 · 4 comments

@velocityCavalry

velocityCavalry commented Jun 26, 2021

Hi,

I was trying to train BPR by running

python train_biencoder.py --gpus=7 --distributed_backend=ddp --train_file=nq-train.json \
--eval_file=nq-dev.json --gradient_clip_val=2.0 --max_epochs=40 --binary --train_batch_size=4 --eval_batch_size=4

However, I ran into a number of errors. For example, after the validation sanity check, a broken pipe error is raised from multiprocessing/connection.py; the traceback is below:

Traceback (most recent call last):
  File "/miniconda3/envs/bpr/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
  File "/miniconda3/envs/bpr/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/miniconda3/envs/bpr/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/miniconda3/envs/bpr/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

Furthermore, I encountered CUDA out-of-memory errors. A trimmed traceback is attached below (each line appears three times because 3 of the 7 GPUs I am using hit the OOM error):

Traceback (most recent call last):
  File "bpr/train_biencoder.py", line 53, in <module>
    trainer.fit(model)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1046, in fit
    self.accelerator_backend.train(model)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 57, in train
    self.ddp_train(process_idx=self.task_idx, mp_queue=None, model=model)
   File "/gscratch/cse/xyu530/miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", lin
e 224, in ddp_train
    results = self.trainer.run_pretrain_routine(model)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
    self.train()
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 394, in train
    self.run_training_epoch()
File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 516, in run_training_epoch
    self.run_evaluation(test_mode=False)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line
 582, in run_evaluation
     eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 396, in _evaluate
    eval_results = self.__run_eval_epoch_end(test_mode, outputs, dataloaders, using_eval_result)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 490, in __run_eval_epoch_end
    eval_results = model.validation_epoch_end(eval_results)
 File "bpr/bpr/biencoder.py", line 246, in validation_epoch_end
    dist.all_gather(passage_repr_list, passage_repr)
  File "miniconda3/envs/bpr/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1185, in all_gather
    work = _default_pg.allgather([tensor_list], [tensor])
RuntimeError: CUDA out of memory. Tried to allocate 1.14 GiB (GPU X; 10.76 GiB total capacity; 8.14 GiB already allocated; 526.56 MiB free; 9.29 GiB reserved in total by PyTorch)
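
For reference, the failing line (bpr/bpr/biencoder.py:246) is a dist.all_gather over the computed passage representations. A minimal sketch of that pattern, assuming an initialized DDP process group (the helper name and buffer setup here are illustrative, not the actual BPR code; only the dist.all_gather call appears in the traceback):

import torch
import torch.distributed as dist

def gather_passage_reprs(passage_repr: torch.Tensor) -> torch.Tensor:
    # Each rank allocates world_size buffers shaped like passage_repr before
    # the collective runs, so peak memory for this step grows with the number
    # of GPUs participating in validation.
    world_size = dist.get_world_size()
    passage_repr_list = [torch.empty_like(passage_repr) for _ in range(world_size)]
    dist.all_gather(passage_repr_list, passage_repr)  # the call that raises the OOM above
    return torch.cat(passage_repr_list, dim=0)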

Sorry for putting all these outputs here!

I installed BPR with pip install -r requirements.txt and built the passage database successfully. The GPUs I am using are 7 GeForce RTX 2080 Ti cards.

Thanks for any help!

@ikuyamada
Member

Hi @velocityCavalry,

I am not sure why the broken pipe error happens. Our code does not directly control multiprocessing, so it may be due to an issue in PyTorch or PyTorch Lightning.

The OOM issue may be due to a lack of GPU memory. We conducted our experiments on 8 Tesla V100 GPUs with 16 GB of memory each. You can reduce GPU memory usage with the --train_batch_size and --eval_batch_size options.
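
For example, halving both values relative to the command in the original report (illustrative values, not a tested configuration) would look like:

python train_biencoder.py --gpus=7 --distributed_backend=ddp --train_file=nq-train.json \
    --eval_file=nq-dev.json --gradient_clip_val=2.0 --max_epochs=40 --binary \
    --train_batch_size=2 --eval_batch_size=2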

@velocityCavalry
Author

Thanks for the reply! I do specify --train_batch_size=4 --eval_batch_size=4, and even with such a small batch size I still hit the OOM issue. What would be a reasonable batch size, then?

@ikuyamada
Member

ikuyamada commented Jun 26, 2021

It seems that the OOM error happens in the validation step, which copies the computed passage representations across GPUs using the dist.all_gather function; this step can consume a lot of GPU memory. It can be disabled by specifying --eval_rank_local_gpu, so could you add that option and see whether the error goes away?
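
For reference, that is the original command with the suggested flag appended (assuming no other changes to the run):

python train_biencoder.py --gpus=7 --distributed_backend=ddp --train_file=nq-train.json \
    --eval_file=nq-dev.json --gradient_clip_val=2.0 --max_epochs=40 --binary \
    --train_batch_size=4 --eval_batch_size=4 --eval_rank_local_gpu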

@ikuyamada
Member

I am closing this issue as there has been no activity.
