
ProcessGroupNCCL NCCL lib version mismatch #47291

Open
amrragab8080 opened this issue Nov 3, 2020 · 7 comments
Labels
module: binaries (anything related to official binaries that we release to users) · oncall: distributed (add this issue/PR to distributed oncall triage queue) · triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments


amrragab8080 commented Nov 3, 2020

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  File "./dlrm.py", line 644, in SparseDataDist
    self.backendFuncs.all_to_allv(self.collectiveArgs)
  File "/home/ubuntu/param/train/comms/pt/pytorch_nccl_backend.py", line 124, in all_to_allv
    async_op=collectiveArgs.asyncOp,
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 1827, in all_to_all_single
    work = group.alltoall_base(output, input, output_split_sizes, input_split_sizes, opts)
RuntimeError: ProcessGroupNCCL only supports alltoall* for NCCL lib version >= 2.7.0
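For reference, a minimal sketch that exercises the same all_to_all_single code path (an illustrative snippet, not the original dlrm.py / param benchmark code; it assumes one process per GPU launched with python -m torch.distributed.launch or a similar launcher that sets the usual MASTER_ADDR/RANK environment variables):

import torch
import torch.distributed as dist

# Minimal, hypothetical reproduction sketch -- not the original benchmark code.
# Exercises the same alltoall_base path that raises the RuntimeError above.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

inp = torch.full((world_size,), float(rank), device="cuda")
out = torch.empty_like(inp)
# Fails with "ProcessGroupNCCL only supports alltoall* for NCCL lib version >= 2.7.0"
# when the NCCL that PyTorch itself was built against is older than 2.7.0.
dist.all_to_all_single(out, inp)
print(rank, out.tolist())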

I have NCCL 2.7.8 installed

ip-172-31-76-46:66763:66763 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.7.8+cuda11.0
ip-172-31-76-46:66769:66769 [6] NCCL INFO Bootstrap : Using [0]ens33:172.31.76.46<0> [1]ens66:172.31.77.107<0> [2]ens131
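Note that the runtime check looks at the NCCL version the PyTorch binary was built/linked against, which can differ from the system NCCL shown in the log above. A quick diagnostic sketch to print what the installed wheel reports (on releases around 1.7, torch.cuda.nccl.version() returns an encoded integer such as 2708, i.e. 2.7.8; newer releases return a tuple):

import torch

# Diagnostic sketch: print the NCCL version the installed PyTorch wheel reports.
# This is what the ProcessGroupNCCL check inspects, not the system libnccl.
print("torch:", torch.__version__)
print("CUDA :", torch.version.cuda)
print("NCCL :", torch.cuda.nccl.version())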

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
  • PyTorch Version (e.g., 1.0): 1.7
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.6
  • CUDA/cuDNN version: CUDA 11/ cuDNN 8
  • GPU models and configuration: AWS p4d A100
  • Any other relevant information:

Additional context

cc @ezyang @seemethere @malfet @walterddr @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski

malfet added the module: binaries and oncall: distributed labels Nov 3, 2020
Contributor

malfet commented Nov 3, 2020

This has been fixed by #45900, but that change was not picked into the release branch.

malfet added this to the 1.7.1 milestone Nov 3, 2020
malfet added the triaged label Nov 3, 2020
Author

@malfet Checking with the released PyTorch 1.7.1, I still get the same error.

>>> torch.__version__
'1.7.1+cu110'

error

RuntimeError: ProcessGroupNCCL only supports alltoall* for NCCL lib version >= 2.7.0

Contributor

malfet commented Dec 14, 2020

@amrragab8080 This change wasn't picked into the 1.7.1 branch, as it is not a regression from 1.6.0.

Author

amrragab8080 commented Dec 15, 2020

@malfet Can you specify which version it will be picked up for? I was testing it because it was tagged for the 1.7.1 milestone:
https://github.com/pytorch/pytorch/milestone/19

Member

seemethere commented Dec 15, 2020

This will most likely get picked up for 1.8.0. I'm going to go ahead and remove this from the 1.7.1 milestone, since 1.7.1 has already been released without this change.

Author

amrragab8080 commented Jan 7, 2021

@seemethere Has this been picked up in 1.8.0? Using the nightly build, it seems I am still able to reproduce the error. I didn't see this issue tagged for the 1.8.0 milestone.

Contributor

malfet commented Jan 8, 2021

@amrragab8080 Are you saying you can still reproduce the problem using a nightly build of PyTorch?
If so, please run python3 -m torch.utils.collect_env and post the output here.
