[ROCm] Added support for pytorch extensions to use HIP #32669
Conversation
This fixes runtime errors caused by _join_rocm_home() not being defined before it is used.
@zou3519 Do you want to review this? You made major changes to cpp extensions recently.
torch/utils/cpp_extension.py
Outdated
CUDA_HOME = _find_cuda_home()
CUDNN_HOME = os.environ.get('CUDNN_HOME') or os.environ.get('CUDNN_PATH')
ROCM_HOME = _find_rocm_home()
CUDA_HOME = (ROCM_HOME if ROCM_HOME else _find_cuda_home())
Can we please use torch.version.hip to check if we want to use ROCm?
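For reference, a minimal sketch of the kind of check being suggested, assuming the _find_rocm_home()/_find_cuda_home() helpers from the diff above; the IS_HIP_EXTENSION name is only illustrative, not necessarily what the PR ends up using:

import torch

# torch.version.hip is a version string on ROCm builds of PyTorch and None on
# CUDA builds, so it signals at import time which toolchain the extension
# should target, instead of probing the filesystem first.
IS_HIP_EXTENSION = torch.version.hip is not None  # illustrative name

if IS_HIP_EXTENSION:
    ROCM_HOME = _find_rocm_home()   # helper from the diff above
    CUDA_HOME = ROCM_HOME           # keep downstream CUDA_HOME users working
else:
    ROCM_HOME = None
    CUDA_HOME = _find_cuda_home()   # existing helper in cpp_extension.py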
Awesome! I think the detection of when to use HIP instead of CUDA needs fixing, but other than that it looks reasonable at first sight.
We should try to add some tests if possible
torch/utils/cpp_extension.py
Outdated
"'-fPIC'"] + cflags + _get_cuda_arch_flags(cflags) | ||
elif ROCM_HOME: |
If possible we should enable some tests in test_cpp_extension to run in our ROCM CI (I assume they don't run right now).
@zou3519 - I just removed test_cpp_extensions_aot_no_ninja from ROCM_BLACKLIST. However, the CI still skipped it. Is there anywhere else I should change as well?
I don't know how the mechanism works. @iotamudelta do you know how we can trigger the rocm tests on this PR?
This is a deficiency (well, probably on purpose, but not good here) in the definition of exclude_test that causes prefixes in the blacklist to match:
Line 404 in 44af8ee
if test.startswith(exclude_test):
A quick fix could be to either move the blacklist to regular expressions or manually allow "foo$" to mean "== 'foo'" instead of "startswith(foo)".
Or one could rename the tests: make the no_ninja variant the plain name and give the ninja one a suffix...
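To make the prefix issue concrete, a rough sketch (not the actual run_test.py logic) of how a blacklist entry swallows the _no_ninja variant under startswith matching, versus exact matching:

ROCM_BLACKLIST = ['test_cpp_extensions_aot']

def excluded_by_prefix(test, blacklist):
    # Current behaviour (roughly): the entry above also excludes
    # 'test_cpp_extensions_aot_no_ninja', because it shares the prefix.
    return any(test.startswith(entry) for entry in blacklist)

def excluded_exactly(test, blacklist):
    # Exact matching would only exclude the test that is literally listed.
    return test in blacklist

print(excluded_by_prefix('test_cpp_extensions_aot_no_ninja', ROCM_BLACKLIST))  # True
print(excluded_exactly('test_cpp_extensions_aot_no_ninja', ROCM_BLACKLIST))    # False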
@t-vi, that would be a change with implications throughout the PyTorch test suite. Would you recommend not touching it in this pull request and looking at it separately?
Upon further inspection, I'm not so sure whether the prefix matching even is the intended behaviour (e.g. non-x64 Windows blacklists both cpp_extension_aot and cpp_extension_aot_no_ninja).
I think that:
- we really do want the test,
- renaming the ninja-enabled test with a _ninja suffix, so that the plain name is the prefix, is probably best, and I would say it's legitimate to do it in this PR. Changing the overall behaviour, less so.
Note that renaming the test also requires updating the Windows blacklist, etc.
Oh, so that's the problem. I agree with @t-vi that renaming the ninja-enabled test is the easiest thing to do for now. Long-term we should file an issue and figure out whether the prefix matching behavior is intended or not; afaict it isn't documented in the test runner.
Thank you for the direction. Renamed the tests; the CI run should now have the cpp_extensions test enabled for ROCm.
I'm not done reading through this yet, but I think it might be easier if we merge #32495 (which is still WIP by me) in first, because that does a refactor of how cpp extensions do building. I'm happy to help fix up this PR to match the style of the refactor after that.
I see that @t-vi has kindly already answered some of the questions. In general, the difference between CUDA and HIP/ROCm is kept as small as possible: the frontend is shared, and most of the backend is shared as well. Concerning testing: ROCm test targets are accessible via pytorchbot.
NOTE: run_test.py's test_cpp_extensions_aot_no_ninja target
also runs this test case, but with ninja disabled. If you are debugging
NOTE: run_test.py's test_cpp_extensions_aot_ninja target
also runs this test case, but with ninja enabled. If you are debugging
Ditto, this should either read neutrally or read as if "test_cpp_extensions_aot (with ninja)" is the default: developers of features for cpp extensions should run test_cpp_extensions_aot (with ninja) because it is faster.
torch/utils/cpp_extension.py
Outdated
cflags = COMMON_HIPCC_FLAGS + cflags + _get_rocm_arch_flags(cflags)
else:
cflags = unix_cuda_flags(cflags)
elif is_hip_extension:
nit: I think this is slightly easier to read if we have the following:
if _is_cuda_file(src):
    ...
elif isinstance(cflags, dict):
    cflags = cflags['cxx']
if is_hip_extension:
    cflags = COMMON_HIPCC_FLAGS + cflags
so that we don't have to duplicate the isinstance(cflags, dict) check logic.
Simplified the condition
The contents of this look good to me. I think we should rename the test files so that there is no "bias" toward which test (test_cpp_extension_aot vs test_cpp_extension_aot_ninja) is the "default" one.
One last thing, otherwise lgtm
@@ -259,8 +308,8 @@ def build_extensions(self):
self._define_torch_extension_name(extension)
self._add_gnu_cpp_abi_flag(extension)

# Register .cu and .cuh as valid source extensions.
self.compiler.src_extensions += ['.cu', '.cuh']
# Register .cu, .cuh and .hip as valid source extensions.
(Putting this here because GitHub won't let me put it elsewhere.)
We should toss a check for use_ninja into BuildExtension.__init__. If use_ninja is True and we are building a ROCm extension, we should error out gracefully so that ROCm users aren't confused about the extension not building. (I am not sure when we'll be able to build ROCm extensions with ninja; I know that's next on your list, but this check is just in case we don't get to that by the next release.)
Thank you for the catch. Just added a fallback in the __init__, similar to the behavior when ninja is not present on the system.
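For illustration, a sketch of such a fallback, assuming a torch.version.hip-based flag and the use_ninja keyword discussed above; this is not the exact code added by the PR:

import warnings
import torch
from setuptools.command.build_ext import build_ext

IS_HIP_EXTENSION = torch.version.hip is not None  # assumed ROCm-build flag

class BuildExtension(build_ext):
    def __init__(self, *args, **kwargs):
        # Pop our custom keyword before handing the rest to setuptools.
        self.use_ninja = kwargs.pop('use_ninja', True)
        super().__init__(*args, **kwargs)
        if self.use_ninja and IS_HIP_EXTENSION:
            # Mirror the "ninja not installed" behaviour: warn and fall back
            # to the plain distutils backend instead of failing mid-build.
            warnings.warn('ninja is not yet supported for ROCm extensions; '
                          'falling back to the default distutils backend.')
            self.use_ninja = False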
@zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: This pull request has changes for: 1. Enabling a torch module with HIP code to be compiled by cpp_extensions.py 2. Fixes for hipify module to be able to be used by a torch extension cc: ezyang iotamudelta jeffdaily Pull Request resolved: pytorch#32669 Differential Revision: D20033893 Pulled By: zou3519 fbshipit-source-id: fd6ddc8cdcd3930f41008636bb2bc9dd26cdb008
This pull request has changes for:
1. Enabling a torch module with HIP code to be compiled by cpp_extensions.py
2. Fixes for the hipify module so that it can be used by a torch extension

cc: @ezyang @iotamudelta @jeffdaily