Hi, thank you for the great code release!
I've been trying to train with 2 GPUs on the new v2.0 code, but I'm having trouble with PyTorch distributed data parallel.
I used the command:
CUDA_VISIBLE_DEVICES=0,1 python3.7 -m torch.distributed.launch \
    --nproc_per_node=${WORLD_SIZE} colbert/train.py \
    --triples $TRIPLES_PATH \
    --local_rank 2 \
    --accum 1
I'm not sure the command is correct. At the moment, training doesn't seem to run: the model loads on one GPU, but the training loop hangs.
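For reference, my understanding is that torch.distributed.launch spawns one process per GPU and passes --local_rank to each process itself, so I would have expected an invocation roughly like the following (just a sketch assuming WORLD_SIZE=2 and the same flags as above, with the manual --local_rank dropped):

export WORLD_SIZE=2    # one process per visible GPU (my assumption)
CUDA_VISIBLE_DEVICES=0,1 python3.7 -m torch.distributed.launch \
    --nproc_per_node=${WORLD_SIZE} colbert/train.py \
    --triples $TRIPLES_PATH \
    --accum 1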
I'm wondering: is distributed data parallel meant to support multi-GPU training in the new code? If so, can we expect a speedup from training with multiple GPUs?
Thank you!