
Enabling Infiniband support for Gloo data channel with auto IB detection #4795

Merged
merged 5 commits into pytorch:master from gloo-enable-ib on Jan 24, 2018

Conversation

teng-li (Contributor) commented on Jan 23, 2018

This PR enables proper and easy use of InfiniBand (IB) support for the Gloo backend of distributed training.

Now simply building PyTorch with

python ./setup.py install

takes care of everything by automatically detecting IB devices on the system.

The Gloo helper function ::gloo::transport::ibverbs::getDeviceNames, which I added earlier, automatically finds all IB interfaces in the system.

For the Gloo data channel and cache, we now use a vector to store all the devices (only one is used currently, but this makes it easy to extend to multiple IB devices in the future).

Also fixed the formatting error in the bcast GPUDirect check.

Tested with both TCP and IB; both work fine.
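
For context, the training script itself does not change once PyTorch is built with IB support; the data channel picks the transport automatically. A minimal usage sketch (the address, world size, and rank below are placeholders):

import torch
import torch.distributed as dist

# Initialize the Gloo backend as usual; if PyTorch was built with IB support
# and an IB device is present, the data channel uses the ibverbs transport,
# otherwise it falls back to TCP.
dist.init_process_group(backend='gloo',
                        init_method='tcp://10.1.1.20:23456',  # placeholder address
                        world_size=2,
                        rank=0)

t = torch.zeros(10)
dist.all_reduce(t)  # runs over IB when available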

Snippet of build logs:

-- Found NCCL: /public/apps/NCCL/2.1.2-1/include
-- Determining NCCL version from the header file: /public/apps/NCCL/2.1.2-1/include/nccl.h
-- NCCL_MAJOR_VERSION: 2
-- Found NCCL (include: /public/apps/NCCL/2.1.2-1/include, library: /public/apps/NCCL/2.1.2-1/lib/libnccl.so)
-- Found MPI_C: /public/apps/openmpi/2.1.1/gcc.4.8.4/lib/libmpi.so
-- Found MPI_CXX: /public/apps/openmpi/2.1.1/gcc.4.8.4/lib/libmpi.so
-- Found Gloo: TRUE
-- Found CUDA: /public/apps/cuda/9.0 (found suitable version "9.0", minimum required is "7.5")
-- MPI_LIBRARIES: /public/apps/openmpi/2.1.1/gcc.4.8.4/lib/libmpi.so
-- Found Gloo, will compile with Gloo distributed backend
-- Building the gloo backend with both TCP and infiniband support
-- NCCL_LIBRARIES: /public/apps/NCCL/2.1.2-1/lib/libnccl.so
-- NCCL Version 2 or higher found, will compile with NCCL distributed backend

PLUS

Added a helper script that automatically detects IB devices in the system and enables the IB build by default. The user also has the option to force the IB build with

WITH_GLOO_IBVERBS=1 python ./setup.py install

IB detected

running install
running build_deps
-- IB_detect: 4 IB devices detected, compiling with IB support.
-- Autodetected CUDA architecture(s): 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
-- Found CUDA with FP16 support, compiling with torch.CudaHalfTensor
-- Removing -DNDEBUG from compile flags

No IB detected

running install
running build_deps
-- IB_detect: no IB device detected, compiling with no IB support by default unless overridden by WITH_GLOO_IBVERBS
-- Autodetected CUDA architecture(s): 6.0 6.0
-- Found CUDA with FP16 support, compiling with torch.CudaHalfTensor
-- Removing -DNDEBUG from compile flags

No IB tool found

running install
running build_deps
-- IB_detect: unable to detect IB devices, compiling with no IB support by default unless overridden by WITH_GLOO_IBVERBS
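
The messages above come from a small detection helper that shells out to ibv_devinfo and counts the reported devices. A rough sketch of that idea, assuming ibv_devinfo --list ends its output with a line such as "4 HCAs found" (the function below is illustrative, not the exact code in setup.py):

import re
import subprocess

IB_DEVINFO_CMD = "ibv_devinfo"

def get_ib_device_count():
    # Return the number of IB devices reported by `ibv_devinfo --list`,
    # or 0 when the tool is missing or its output cannot be parsed.
    try:
        out = subprocess.check_output([IB_DEVINFO_CMD, "--list"],
                                      stderr=subprocess.STDOUT).decode()
    except (OSError, subprocess.CalledProcessError):
        return 0
    match = re.search(r"(\d+)\s+HCAs? found", out)
    return int(match.group(1)) if match else 0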

// This helper function automatically detects the IB device in the system
auto ibDeviceNames = ::gloo::transport::ibverbs::getDeviceNames();

if (!ibDeviceNames.size()) {


* We make it a vector for the purpose of future extension to support multiple
* network devices.
*/
std::vector<std::shared_ptr<::gloo::transport::Device>> _deviceList;


0,
std::vector<cudaStream_t>{stream});
}
#endif


setup.py Outdated
@@ -34,6 +34,7 @@
 WITH_DISTRIBUTED = not check_env_flag('NO_DISTRIBUTED') and not IS_WINDOWS
 WITH_DISTRIBUTED_MW = WITH_DISTRIBUTED and check_env_flag('WITH_DISTRIBUTED_MW')
+WITH_GLOO_IBVERBS = check_env_flag('WITH_GLOO_IBVERBS')


setup.py Outdated
@@ -34,7 +34,7 @@
 WITH_DISTRIBUTED = not check_env_flag('NO_DISTRIBUTED') and not IS_WINDOWS
 WITH_DISTRIBUTED_MW = WITH_DISTRIBUTED and check_env_flag('WITH_DISTRIBUTED_MW')
-WITH_GLOO_IBVERBS = check_env_flag('WITH_GLOO_IBVERBS')
+WITH_GLOO_IBVERBS = WITH_DISTRIBUTED and not check_env_flag('NO_GLOO_IBVERBS')


teng-li (Contributor, Author) commented on Jan 24, 2018

@apaszke Now we auto-detect IB and enable the IB build when it is detected; the user can also force the IB build with
WITH_GLOO_IBVERBS=1 python ./setup.py install

@teng-li changed the title from "Enabling Infiniband support for Gloo data channel" to "Enabling Infiniband support for Gloo data channel with auto IB detection" on Jan 24, 2018


@teng-li force-pushed the gloo-enable-ib branch 5 times, most recently from 3c6939b to a67ea5a, on January 24, 2018 at 03:20
@teng-li force-pushed the gloo-enable-ib branch 2 times, most recently from a82e11e to fb4eb7f, on January 24, 2018 at 03:45
teng-li (Contributor, Author) commented on Jan 24, 2018

CI failure is unrelated, fixed in #4826

teng-li (Contributor, Author) commented on Jan 24, 2018

@pytorchbot retest this please

IB_DEVINFO_CMD = "ibv_devinfo"


def get_command_path(command):


if len(res) != 1:
    raise Exception("-- IB_detect: unexpected parsing error while "
                    "trying to find the number of available devices.")
return int(res[0])


setup.py Outdated
@@ -138,6 +139,10 @@ def build_libs(libs):
my_env["CUDNN_LIBRARY"] = CUDNN_LIBRARY
my_env["CUDNN_INCLUDE_DIR"] = CUDNN_INCLUDE_DIR

if WITH_DISTRIBUTED and (WITH_IB_DEVICES or


    WITH_IB_DEVICES = True

else:
    print("-- IB_detect: no IB device detected, compiling with no IB support "


@apaszke merged commit 1b3d6ab into pytorch:master on Jan 24, 2018