trying distributed multigpu #7

Closed
mnskim opened this issue Oct 19, 2020 · 2 comments

mnskim commented Oct 19, 2020

Hi, thank you for a great code release,

I've been trying to train with 2 GPUs in the new v2.0 code, but I'm having trouble with PyTorch distributed parallel.

I used the command:
CUDA_VISIBLE_DEVICES=0,1 python3.7 -m torch.distributed.launch \
    --nproc_per_node=${WORLD_SIZE} colbert/train.py \
    --triples $TRIPLES_PATH \
    --local_rank 2 \
    --accum 1

but I'm not sure if it is correct. At the moment the training doesn't seem to run (the model loads on one GPU, but the training loop hangs).

I'm wondering, is distributed parallel training meant to work for multi-GPU in the new code? If so, will we be able to speed up training with multiple GPUs?
Thank you!

okhat (Collaborator) commented Oct 19, 2020

Hi Minsoo!

This is a sample command for v0.2 multi-GPU training with 2 GPUs:

CUDA_VISIBLE_DEVICES="0,1" python -m torch.distributed.launch --nproc_per_node=2 -m colbert.train \
  --amp --doc_maxlen 180 --mask-punctuation --bsize 32 --accum 1 \
  --triples /path/to/MSMARCO/triples.train.small.tsv \
  --root /root/to/experiments/ --experiment MSMARCO-psg \
  --similarity l2 --run msmarco.psg.l2

It seems like the code was hanging with your command due to --local_rank. This is passed automatically by the launcher; don't worry about setting it yourself.
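
For context, torch.distributed.launch appends --local_rank to each spawned worker's argument list, so hard-coding a value (e.g. --local_rank 2) can leave processes waiting on ranks that never exist. A minimal sketch of the usual pattern (generic, not ColBERT's actual train.py) looks like this:

    # Minimal sketch, assuming a standard argparse-based entry point;
    # this is NOT ColBERT's actual argument parser.
    import argparse
    import torch
    import torch.distributed as dist

    parser = argparse.ArgumentParser()
    # torch.distributed.launch fills this in per process; never set it by hand.
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)   # bind this process to its GPU
    dist.init_process_group(backend="nccl")  # MASTER_ADDR/PORT, RANK, WORLD_SIZE come from the launcher's env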

Happy to help with any other issues you may face. Will keep this issue open unless you'd like to close it.

mnskim (Author) commented Oct 19, 2020

Hi Omar,
The command works well without any hang, thank you!!

mnskim closed this as completed Oct 19, 2020