
ProcessGroupNCCL NCCL lib version mismatch #47291

Open
amrragab8080 opened this issue Nov 3, 2020 · 7 comments
Labels
module: binaries (anything related to official binaries that we release to users) · oncall: distributed (add this issue/PR to distributed oncall triage queue) · triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments


amrragab8080 commented Nov 3, 2020

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  File "./dlrm.py", line 644, in SparseDataDist
    self.backendFuncs.all_to_allv(self.collectiveArgs)
  File "/home/ubuntu/param/train/comms/pt/pytorch_nccl_backend.py", line 124, in all_to_allv
    async_op=collectiveArgs.asyncOp,
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 1827, in all_to_all_single
    work = group.alltoall_base(output, input, output_split_sizes, input_split_sizes, opts)
RuntimeError: ProcessGroupNCCL only supports alltoall* for NCCL lib version >= 2.7.0
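For reference, a minimal sketch that exercises the same all_to_all_single code path (an illustrative snippet, not the original dlrm.py / param benchmark code; it assumes one process per GPU launched with python -m torch.distributed.launch or a similar launcher that sets the usual MASTER_ADDR/RANK environment variables):

import torch
import torch.distributed as dist

# Minimal, hypothetical reproduction sketch -- not the original benchmark code.
# Exercises the same alltoall_base path that raises the RuntimeError above.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

inp = torch.full((world_size,), float(rank), device="cuda")
out = torch.empty_like(inp)
# Fails with "ProcessGroupNCCL only supports alltoall* for NCCL lib version >= 2.7.0"
# when the NCCL that PyTorch itself was built against is older than 2.7.0.
dist.all_to_all_single(out, inp)
print(rank, out.tolist())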

I have NCCL 2.7.8 installed

ip-172-31-76-46:66763:66763 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.7.8+cuda11.0
ip-172-31-76-46:66769:66769 [6] NCCL INFO Bootstrap : Using [0]ens33:172.31.76.46<0> [1]ens66:172.31.77.107<0> [2]ens131
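Note that the runtime check looks at the NCCL version the PyTorch binary was built/linked against, which can differ from the system NCCL shown in the log above. A quick diagnostic sketch to print what the installed wheel reports (on releases around 1.7, torch.cuda.nccl.version() returns an encoded integer such as 2708, i.e. 2.7.8; newer releases return a tuple):

import torch

# Diagnostic sketch: print the NCCL version the installed PyTorch wheel reports.
# This is what the ProcessGroupNCCL check inspects, not the system libnccl.
print("torch:", torch.__version__)
print("CUDA :", torch.version.cuda)
print("NCCL :", torch.cuda.nccl.version())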

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
  • PyTorch Version (e.g., 1.0): 1.7
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.6
  • CUDA/cuDNN version: CUDA 11/ cuDNN 8
  • GPU models and configuration: AWS p4d A100
  • Any other relevant information:

Additional context

cc @ezyang @seemethere @malfet @walterddr @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski

malfet added the module: binaries and oncall: distributed labels Nov 3, 2020
Contributor

malfet commented Nov 3, 2020

This has been fixed by #45900, but that change was not picked into the release branch.

malfet added this to the 1.7.1 milestone Nov 3, 2020
malfet added the triaged label Nov 3, 2020
Author

@malfet Checking with the released PyTorch 1.7.1, I still get the same error.

>>> torch.__version__
'1.7.1+cu110'

error

RuntimeError: ProcessGroupNCCL only supports alltoall* for NCCL lib version >= 2.7.0

Contributor

malfet commented Dec 14, 2020

@amrragab8080 This change wasn't picked into the 1.7.1 branch, as it is not a regression from 1.6.0.

Author

amrragab8080 commented Dec 15, 2020

@malfet Can you specify which version it will be picked up for? I was testing it because it was tagged for the 1.7.1 milestone:
https://github.com/pytorch/pytorch/milestone/19

Member

seemethere commented Dec 15, 2020

This will most likely get picked up for 1.8.0. I'm going to go ahead and remove this from the 1.7.1 milestone, since 1.7.1 has already been released without this change.

Author

amrragab8080 commented Jan 7, 2021

@seemethere Has this been picked up in 1.8.0? Using the nightly build, it seems I am still able to reproduce the error. I didn't see this issue tagged for the 1.8.0 milestone.

Contributor

malfet commented Jan 8, 2021

@amrragab8080 Are you saying you can still reproduce the problem using a nightly build of PyTorch?
If so, please run python3 -m torch.utils.collect_env and post the output here.
