Enabling Infiniband support for Gloo data channel with auto IB detection #4795
04a3ab9 to 8647463
// This helper function automatically detects the IB device in the system
auto ibDeviceNames = ::gloo::transport::ibverbs::getDeviceNames();

if (!ibDeviceNames.size()) {
 * We make it a vector for the purpose of future extension to support multiple
 * network devices.
 */
std::vector<std::shared_ptr<::gloo::transport::Device>> _deviceList;
0,
std::vector<cudaStream_t>{stream});
}
#endif
setup.py
Outdated
@@ -34,6 +34,7 @@

 WITH_DISTRIBUTED = not check_env_flag('NO_DISTRIBUTED') and not IS_WINDOWS
 WITH_DISTRIBUTED_MW = WITH_DISTRIBUTED and check_env_flag('WITH_DISTRIBUTED_MW')
+WITH_GLOO_IBVERBS = check_env_flag('WITH_GLOO_IBVERBS')
setup.py
Outdated
@@ -34,7 +34,7 @@

 WITH_DISTRIBUTED = not check_env_flag('NO_DISTRIBUTED') and not IS_WINDOWS
 WITH_DISTRIBUTED_MW = WITH_DISTRIBUTED and check_env_flag('WITH_DISTRIBUTED_MW')
-WITH_GLOO_IBVERBS = check_env_flag('WITH_GLOO_IBVERBS')
+WITH_GLOO_IBVERBS = WITH_DISTRIBUTED and not check_env_flag('NO_GLOO_IBVERBS')
30788a8 to 97b49a6
@apaszke We now auto-detect IB and enable the IB build when a device is detected. The user can also force the IB build by setting an environment variable at build time.
97b49a6 to e2c4fd2

#endif

{
3c6939b to a67ea5a
a82e11e to fb4eb7f
The CI failure is unrelated; it was fixed in #4826.
fb4eb7f to 9a143f7
9a143f7 to b5a4201
@pytorchbot retest this please
tools/setup_helpers/ib_detect.py
Outdated
IB_DEVINFO_CMD = "ibv_devinfo"


def get_command_path(command):
tools/setup_helpers/ib_detect.py
Outdated
if len(res) != 1:
    raise Exception("-- IB_detect: unexpected parsing error while "
                    "trying to find the number of available devices.")
return int(res[0])
setup.py
Outdated
@@ -138,6 +139,10 @@ def build_libs(libs):
 my_env["CUDNN_LIBRARY"] = CUDNN_LIBRARY
 my_env["CUDNN_INCLUDE_DIR"] = CUDNN_INCLUDE_DIR

+if WITH_DISTRIBUTED and (WITH_IB_DEVICES or
tools/setup_helpers/ib_detect.py
Outdated
    WITH_IB_DEVICES = True

else:
    print("-- IB_detect: no IB device detected, compiling with no IB support "
454ae08 to 1ef007c
1ef007c to b45de3a
This PR makes Infiniband support easy to use with the Gloo backend for distributed training.
Simply building PyTorch with
python ./setup.py install
now takes care of everything by automatically detecting IB devices on the system.
The Gloo helper function ::gloo::transport::ibverbs::getDeviceNames, which I added earlier, is used to automatically find all IB interfaces in the system.
The Gloo data channel and cache now use a vector to store all the devices. Multiple devices are not used currently, but this makes it easy to extend to multiple IB devices in the future.
Also fixed the formatting error in the bcast GPU Direct check.
Tested with both TCP and IB; both work fine.
Snippet of build logs:
In addition:
Added a helper script that automatically detects IB devices in the system and enables the IB build by default. The user can also force the IB build using
USE_GLOO_IBVERBS python ./setup.py install
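The build decision described above (IB support is compiled in when devices are detected, or when the user forces it) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `env_flag_set` and `should_build_ibverbs` are hypothetical names, the set of accepted truthy values is an assumption, and the flag name is passed in as a parameter because it appears under more than one name in this thread.

```python
import os


def env_flag_set(name, environ=None):
    """Return True if the environment variable `name` holds a truthy value.

    Hypothetical helper; the accepted values below are an assumption.
    """
    environ = os.environ if environ is None else environ
    return environ.get(name, "").upper() in ("1", "ON", "YES", "TRUE", "Y")


def should_build_ibverbs(with_distributed, ib_devices_detected, force_flag_set):
    """Decide whether to compile Gloo with ibverbs support.

    IB only matters when the distributed package is built at all. Within
    that, either auto-detection or an explicit user override enables it.
    """
    return bool(with_distributed and (ib_devices_detected or force_flag_set))


if __name__ == "__main__":
    # Example: the flag name here is taken from the setup.py diff above.
    forced = env_flag_set("WITH_GLOO_IBVERBS")
    print("build with IB:", should_build_ibverbs(True, False, forced))
```

Keeping the decision in one small function makes the three inputs explicit and easy to test without touching the real environment.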
IB detected
No IB detected
No IB tool found
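The three outcomes listed above could be produced by a detection helper along these lines. This is a sketch under stated assumptions rather than the script from this PR: it assumes `ibv_devinfo` prints one `hca_id:` line per device, and the names `count_ib_devices` and `detect_ib` are hypothetical.

```python
import re
import shutil
import subprocess

IB_DEVINFO_CMD = "ibv_devinfo"


def count_ib_devices(devinfo_output):
    """Count devices by counting `hca_id:` lines (assumed output format)."""
    return len(re.findall(r"^\s*hca_id:", devinfo_output, flags=re.MULTILINE))


def detect_ib(command=IB_DEVINFO_CMD):
    """Return "IB detected", "No IB detected", or "No IB tool found"."""
    path = shutil.which(command)
    if path is None:
        # The query tool is not installed, so nothing can be probed.
        return "No IB tool found"
    result = subprocess.run(
        [path],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    if count_ib_devices(result.stdout) == 0:
        return "No IB detected"
    return "IB detected"


if __name__ == "__main__":
    print("-- IB_detect:", detect_ib())
```

Separating the parsing step from the subprocess call lets the parser be checked against captured output on machines that have no IB hardware.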