How to get rid of the "Duplicate GPU detected : rank 0 and rank 1 both on CUDA device ca000" error while training the ColBERTv1.9 model? #331

Open
Aritra02091998 opened this issue Mar 29, 2024 · 1 comment

Comments

@Aritra02091998

I am trying to fine-tune ColBERT v1.9 on my specific dataset for retrieval, but training fails with the following error:

torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1702400431970/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.18.6
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device ca000

I suspect this is an issue with the torch.distributed settings. How can I resolve it?

My specifications are:

Single NVIDIA A40 GPU
Conda Package Manager
Python 3.8
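
A note on the likely cause, offered as a hedged guess rather than a confirmed diagnosis: NCCL reports "Duplicate GPU detected" when two ranks end up on the same CUDA device, which can happen when more processes are launched than there are visible GPUs (here, rank 0 and rank 1 on a single A40). The sketch below is not taken from the ColBERT codebase. `clamp_nranks` and `count_visible_gpus` are hypothetical helper names, and it assumes the number of ranks is something you set yourself (in ColBERT this appears to be the `nranks` argument of `RunConfig`, but please verify against your installed version).

```python
def count_visible_gpus() -> int:
    """Count CUDA devices visible to this process (0 if PyTorch is not installed)."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def clamp_nranks(requested: int, visible_gpus: int) -> int:
    """Return a rank count that never exceeds the number of visible GPUs.

    Hypothetical helper: one rank per GPU is the assumption here, since
    NCCL rejects two ranks that share the same CUDA device.
    """
    if requested < 1:
        raise ValueError("requested must be at least 1")
    if visible_gpus < 1:
        raise ValueError("no visible CUDA devices")
    return min(requested, visible_gpus)


if __name__ == "__main__":
    gpus = count_visible_gpus()
    if gpus > 0:
        # Asking for 2 ranks on a single-GPU machine yields 1.
        print("nranks to use:", clamp_nranks(2, gpus))
    else:
        print("No CUDA devices visible to this process.")
```

On a single-GPU machine this suggests running with one rank. It is also worth checking that `CUDA_VISIBLE_DEVICES` lists only the devices you intend to use before launching training.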

@4entertainment

Hello,

I don't have a solution for the problem you are experiencing, but I wish you good luck. I do have a question: could you share the code you used for the "ColBERT v1.9 on my specific dataset for retrieval" operation?

Thank you for your interest.
