Skip to content
This repository was archived by the owner on Aug 15, 2025. It is now read-only.

Conversation

pbelevich
Copy link
Contributor

@pbelevich pbelevich commented May 26, 2021

@pbelevich pbelevich requested a review from malfet May 26, 2021 19:00
Copy link

@zhouzhuojie zhouzhuojie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, btw, is pytorch/pytorch#58996 going into 1.9?

@pbelevich
Copy link
Contributor Author

LGTM, btw, is pytorch/pytorch#58996 going into 1.9?

It seems that it's already included

Copy link
Member

@seemethere seemethere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the wheel building script also have the USE_GLOO_WITH_OPENSLL flag as well?

seemethere pushed a commit to pytorch/pytorch that referenced this pull request Jun 2, 2021
Needed for pytorch/builder#779

Co-authored-by: Your Name <driazati@users.noreply.github.com>
Copy link
Member

@seemethere seemethere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think to be safe with this it'd be good to have a test PR upstream with pytorch/pytorch to verify this does not introduce a regression in the nightlies

pbelevich added a commit to pytorch/pytorch that referenced this pull request Jun 2, 2021
@seemethere
Copy link
Member

seemethere commented Jun 2, 2021

I think to be safe with this it'd be good to have a test PR upstream with pytorch/pytorch to verify this does not introduce a regression in the nightlies

Testing PR can be found here: pytorch/pytorch#59306

Looks like in spite of the environment variable being present for the conda build the test still fails:

+ [[ Linux == \L\i\n\u\x ]]
+ export USE_GLOO_WITH_OPENSSL=1
+ USE_GLOO_WITH_OPENSSL=1
Jun 02 16:25:42 + grep 'unsupported gloo device'
Jun 02 16:25:42 + GLOO_DEVICE_TRANSPORT=TCP_TLS
Jun 02 16:25:42 + MASTER_ADDR=localhost
Jun 02 16:25:42 + MASTER_PORT=63945
Jun 02 16:25:42 + python -c 'import torch; import torch.distributed as dist; print(torch.__version__); dist.init_process_group('\''gloo'\'', rank=0, world_size=1)'
Jun 02 16:25:43 Traceback (most recent call last):
Jun 02 16:25:43   File "<string>", line 1, in <module>
Jun 02 16:25:43   File "/opt/conda/envs/testenv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 563, in init_process_group
Jun 02 16:25:43     default_pg = _new_process_group_helper(
Jun 02 16:25:43   File "/opt/conda/envs/testenv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 664, in _new_process_group_helper
Jun 02 16:25:43     pg = ProcessGroupGloo(
Jun 02 16:25:43 RuntimeError: makeDeviceForHostname(): unsupported gloo device

Doesn't appear as though USE_GLOO_WITH_OPENSSL was included in the Summary for the cmake as well which makes it hard to know if it was actually enabled or not, Created a PR to do this: pytorch/pytorch#59321

malfet pushed a commit to malfet/pytorch that referenced this pull request Jun 2, 2021
Needed for pytorch/builder#779

Co-authored-by: Your Name <driazati@users.noreply.github.com>
Comment on lines 364 to 375
###############################################################################
# Check PyTorch supports TCP_TLS gloo transport
###############################################################################

if [[ "$(uname)" == 'Linux' ]]; then
GLOO_DEVICE_TRANSPORT=TCP_TLS MASTER_ADDR=localhost MASTER_PORT=63945 python -c "import torch; import torch.distributed as dist; print(torch.__version__); dist.init_process_group('gloo', rank=0, world_size=1)" | grep "unsupported gloo device" &> /dev/null
RESULT=$?
if [ $RESULT -eq 0 ]; then
echo "PyTorch doesn't support TLS_TCP transport, please set USE_GLOO_WITH_OPENSSL=1"
exit 1
fi
fi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
###############################################################################
# Check PyTorch supports TCP_TLS gloo transport
###############################################################################
if [[ "$(uname)" == 'Linux' ]]; then
GLOO_DEVICE_TRANSPORT=TCP_TLS MASTER_ADDR=localhost MASTER_PORT=63945 python -c "import torch; import torch.distributed as dist; print(torch.__version__); dist.init_process_group('gloo', rank=0, world_size=1)" | grep "unsupported gloo device" &> /dev/null
RESULT=$?
if [ $RESULT -eq 0 ]; then
echo "PyTorch doesn't support TLS_TCP transport, please set USE_GLOO_WITH_OPENSSL=1"
exit 1
fi
fi
###############################################################################
# Check PyTorch supports TCP_TLS gloo transport
###############################################################################
function built_without_gloo_tls() {
GLOO_DEVICE_TRANSPORT=TCP_TLS MASTER_ADDR=localhost MASTER_PORT=63945 python -c "import torch; import torch.distributed as dist; print(torch.__version__); dist.init_process_group('gloo', rank=0, world_size=1)" | grep "unsupported gloo device" &> /dev/null
}
if [[ "$(uname)" == 'Linux' ]]; then
if built_without_gloo_tls; then
echo "PyTorch doesn't support TLS_TCP transport, please set USE_GLOO_WITH_OPENSSL=1"
exit 1
fi
fi

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should work, but I also wrote this suggestion in the github UI so it'd be good to double check that it checks out locally as well

pbelevich and others added 2 commits June 2, 2021 19:27
Co-authored-by: Nikita Shulga <nshulga@fb.com>
pbelevich added a commit to pytorch/pytorch that referenced this pull request Jun 3, 2021
@seemethere
Copy link
Member

Looks like tests are passing in the tester PR for pytorch/pytorch, going to go ahead and merge!

@seemethere seemethere merged commit 44b5f6c into master Jun 3, 2021
@seemethere seemethere deleted the pbelevich-patch-1 branch June 3, 2021 16:02
malfet added a commit that referenced this pull request Jun 5, 2021
Co-authored-by: Nikita Shulga <nshulga@fb.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants