Enable ROCm multi-gpu with Gloo #18640
936a5ed to fb98e12
c513064 to bbfefae
@pytorchbot retest this please
@iotamudelta I'm running into this memory allocation issue when trying to enable multi-GPU support on ROCm:
I suspect this is related to HIP initialization in a forked subprocess. Could you help take a look?
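If the suspicion about forked subprocesses is right, one general way to sidestep inherited runtime state is to start workers with the `spawn` start method, so each child begins in a fresh interpreter and initializes the GPU runtime itself. The following is only a minimal sketch of that idea, not what this PR does: `worker` and `run_workers` are hypothetical names, and the GPU initialization is left as a comment.

```python
import multiprocessing as mp


def worker(rank, queue):
    # Hypothetical worker: in a real test, any HIP/GPU initialization would
    # happen here, inside the child, rather than in the parent before forking.
    queue.put(rank)


def run_workers(world_size):
    # "spawn" starts each child in a fresh interpreter instead of fork(),
    # so the child does not inherit runtime state from the parent.
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    procs = [
        ctx.Process(target=worker, args=(rank, queue))
        for rank in range(world_size)
    ]
    for proc in procs:
        proc.start()
    # Drain the queue before joining so no child blocks on a full pipe.
    ranks = sorted(queue.get() for _ in procs)
    for proc in procs:
        proc.join()
    return ranks
```

With `spawn`, `run_workers(2)` has to be called from a script under an `if __name__ == "__main__":` guard, because each child re-imports the main module.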
d66603a to 448a54f
@jithunnair-amd Can you please review the failure and loop in the relevant experts? Thanks.
@bddppq has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
I'm able to use this to do ROCm distributed training. The unit test failures are related to re-initializing ROCm after fork. Let's land this first without enabling the tests.
@bddppq We have a potential fix for the tests failing with "HIP out of memory" error. It just involves setting the "HCC_LAZYINIT=ON" environment variable for ROCm builds. It works locally, so I'd like to test it on the CI. Should I submit that change in this PR, or wait for this PR to be closed, so I can submit a different PR that builds on top of this one?
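As a rough illustration of the proposal above, the variable can be set in the environment handed to the test process. The name `HCC_LAZYINIT` and the value `ON` come from the comment; the helper names `rocm_test_env` and `run_with_lazy_init` are hypothetical, and the child command here is just a stand-in that echoes the variable. Whether this setting actually resolves the CI failures is not established in this thread.

```python
import os
import subprocess
import sys


def rocm_test_env(base_env=None):
    """Return a copy of the environment with HCC_LAZYINIT=ON added."""
    env = dict(os.environ if base_env is None else base_env)
    env["HCC_LAZYINIT"] = "ON"  # variable name and value taken from the comment above
    return env


def run_with_lazy_init(cmd):
    """Run cmd in a child process whose environment has HCC_LAZYINIT=ON."""
    return subprocess.run(
        cmd, env=rocm_test_env(), capture_output=True, text=True, check=True
    )


if __name__ == "__main__":
    # Stand-in for a real test command: the child only prints the variable.
    child = [sys.executable, "-c", "import os; print(os.environ['HCC_LAZYINIT'])"]
    print(run_with_lazy_init(child).stdout.strip())
```

Setting the variable on the child's environment, rather than mutating `os.environ` in the parent, keeps the change scoped to the processes that need it.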
LGTM. Where does the hipification of ProcessGroupGloo happen? We have a bunch of CUDA specifics in there that should be eligible for hipification, but I don't see its path listed in this commit.
@pietern I believe the hipification of ProcessGroupGloo happens here: https://github.com/pytorch/pytorch/blob/master/tools/amd_build/pyHIPIFY/hipify_python.py#L785 I see it hipified in my local build.
@jithunnair-amd
I looked at the logs, but couldn't really make out where the SIGSEGV is coming from. All I see is this:
Is there a way to get more info about where the SIGSEGV occurred?
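One general option for a crash like this, assuming the failing test is launched as a Python command, is the standard library's `faulthandler`, which dumps the Python traceback to stderr when the process receives SIGSEGV. The sketch below reruns a command with `-X faulthandler`; `run_with_faulthandler` is a hypothetical helper, and the child command is a harmless stand-in that only confirms the handler is active.

```python
import subprocess
import sys


def run_with_faulthandler(args):
    """Run `python -X faulthandler <args>` and return the CompletedProcess."""
    cmd = [sys.executable, "-X", "faulthandler", *args]
    # stderr is captured so that any traceback dumped on a crash can be inspected.
    return subprocess.run(cmd, capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in for the real failing test: just report that the handler is on.
    check = ["-c", "import faulthandler; print(faulthandler.is_enabled())"]
    result = run_with_faulthandler(check)
    print(result.stdout.strip())
    print(result.stderr)
```

The same effect is available by setting `PYTHONFAULTHANDLER=1` or calling `faulthandler.enable()` at the top of the test. This only shows Python-level frames; locating the fault inside native code would still need a debugger such as gdb.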
@jithunnair-amd Let me try to create a repro for you to debug.
Summary: Pull Request resolved: pytorch/pytorch#18640
Differential Revision: D15185822
Pulled By: bddppq
fbshipit-source-id: 1b49ab3fb0f251cfc7ef3ddd62033ae0065a4ec3