trying distributed multigpu #7

Closed
mnskim opened this issue Oct 19, 2020 · 2 comments

mnskim commented Oct 19, 2020

Hi, thank you for a great code release,

I've been trying to train with 2 GPUs in the new v2.0 code, but I'm having trouble with PyTorch distributed parallel.

I used the command:
CUDA_VISIBLE_DEVICES=0,1 python3.7 -m torch.distributed.launch \
    --nproc_per_node=${WORLD_SIZE} colbert/train.py \
    --triples $TRIPLES_PATH \
    --local_rank 2 \
    --accum 1

but I'm not sure if it is correct. At the moment the training doesn't seem to run (the model loads on one GPU, but the training loop hangs).

I'm wondering, is distributed parallel training meant to work for multi-GPU in the new code? If so, will we be able to speed up training with multiple GPUs?
Thank you!

okhat (Collaborator) commented Oct 19, 2020

Hi Minsoo!

This is a sample command for v0.2 multi-GPU training with 2 GPUs:

CUDA_VISIBLE_DEVICES="0,1" python -m torch.distributed.launch --nproc_per_node=2 -m colbert.train \
  --amp --doc_maxlen 180 --mask-punctuation --bsize 32 --accum 1 \
  --triples /path/to/MSMARCO/triples.train.small.tsv \
  --root /root/to/experiments/ --experiment MSMARCO-psg \
  --similarity l2 --run msmarco.psg.l2

It seems like the code was hanging with your command due to --local_rank. This is passed automatically by the launcher; don't worry about setting it yourself.
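
For context, torch.distributed.launch appends --local_rank to each spawned worker's argument list, so hard-coding a value (e.g. --local_rank 2) can leave processes waiting on ranks that never exist. A minimal sketch of the usual pattern (generic, not ColBERT's actual train.py) looks like this:

    # Minimal sketch, assuming a standard argparse-based entry point;
    # this is NOT ColBERT's actual argument parser.
    import argparse
    import torch
    import torch.distributed as dist

    parser = argparse.ArgumentParser()
    # torch.distributed.launch fills this in per process; never set it by hand.
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)   # bind this process to its GPU
    dist.init_process_group(backend="nccl")  # MASTER_ADDR/PORT, RANK, WORLD_SIZE come from the launcher's env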

Happy to help with any other issues you may face. Will keep this issue open unless you'd like to close it.

mnskim (Author) commented Oct 19, 2020

Hi Omar,
The command works well without any hang, thank you!!

mnskim closed this as completed Oct 19, 2020