NCCL version upgrade for PyTorch #35363

Open
YingleiZhang opened this issue Mar 25, 2020 · 3 comments
Labels
module: build Build system issues triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Comments

@YingleiZhang

🐛 Bug

Building PyTorch from source fails on some old Linux releases.
This issue is described in NVIDIA/nccl#244, and NVIDIA has fixed it (NVIDIA/nccl@7f2b337). However, that change has not yet been picked up in the PyTorch source tree, so building PyTorch from source on an old Linux system still fails with the same problem.

The NCCL version in the latest PyTorch release is still 2.4.8; I think it can be upgraded to 2.5.6 now.

I can see that there is another open ticket for the same issue: #29093

To Reproduce

Steps to reproduce the behavior:

  1. Check out the PyTorch source code.
  2. Build on a Linux system with a kernel older than 3.9 (the release that introduced SO_REUSEPORT).
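
To tell whether a given machine falls into the affected range, a minimal sketch like the following may help. It only compares the running kernel release against 3.9, the version named in step 2; the helper names `parse_kernel_version` and `kernel_has_reuseport` are hypothetical and not part of PyTorch or NCCL. It assumes a Linux-style release string such as `3.10.0-1160.el7.x86_64`.

```python
import platform
import re
import socket


def parse_kernel_version(release):
    """Return (major, minor) parsed from a release string such as '3.10.0-1160.el7'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_has_reuseport(release):
    """True if the kernel is 3.9 or newer, the release that introduced SO_REUSEPORT."""
    return parse_kernel_version(release) >= (3, 9)


if __name__ == "__main__":
    release = platform.release()
    print("kernel release:", release)
    print("kernel >= 3.9:", kernel_has_reuseport(release))
    # This reflects the headers the Python interpreter was built against,
    # which may differ from the headers used for the PyTorch build.
    print("socket.SO_REUSEPORT defined:", hasattr(socket, "SO_REUSEPORT"))
```

If the second line prints `False`, the machine is likely in the range where this build failure was reported.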

Expected behavior

The build should succeed on Linux systems with a kernel older than 3.9.


 - PyTorch Version (e.g., 1.0): 1.4 master
 - OS (e.g., Linux): Linux (Below 3.9)
 - How you installed PyTorch (`conda`, `pip`, source): source
 - Build command you used (if compiling from source): python setup.py develop
 - Python version: 3.6
 - CUDA/cuDNN version: irrelevant
 - GPU models and configuration: irrelevant
 - Any other relevant information: No

Additional context

No

@ngimel ngimel added module: build Build system issues triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module labels Mar 25, 2020
@ngimel
Collaborator

ngimel commented Mar 25, 2020

Thank you for reporting! We should update the NCCL submodule. cc @seemethere to route to the appropriate person.

@mattip
Collaborator

mattip commented Jul 26, 2020

This was fixed in gh-41608, which upgraded NCCL to 2.7.3.

@zarzen

zarzen commented Oct 15, 2020

Is there a way to choose a different version of NCCL at runtime?
I have a modified NCCL that I would like to use with the PyTorch distributed API. How can I do that?
