[ROCm] Added support for pytorch extensions to use HIP #32669


Closed
wants to merge 33 commits

Conversation

ashishfarmer

This pull request has changes for:

  1. Enabling a torch module with HIP code to be compiled by cpp_extensions.py
  2. Fixes to the hipify module so that it can be used by a torch extension

cc: @ezyang @iotamudelta @jeffdaily
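
For orientation, here is a minimal sketch of what building such an extension could look like from a user's setup.py. The package name, the source files, and the note about ROCm routing are illustrative assumptions based on this PR's description, not code taken from it:

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

# Hypothetical extension: on a ROCm build of PyTorch, this PR lets the build
# go through HIP tooling instead of nvcc while the setup.py stays the same.
setup(
    name='my_hip_extension',                                # illustrative name
    ext_modules=[
        CUDAExtension(
            name='my_hip_extension',
            sources=['my_extension.cpp', 'my_kernels.cu'],  # hypothetical sources
        ),
    ],
    cmdclass={'build_ext': BuildExtension},
)
```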

@kostmo
Member

kostmo commented Jan 27, 2020

💊 CircleCI build failures summary and remediations

As of commit c994da2:

Commit c994da2 was recently pushed. Waiting for builds...


This comment was automatically generated by Dr. CI.

@ezyang
Contributor

ezyang commented Jan 28, 2020

@zou3519 Do you want to review this? You made major changes to cpp extensions recently.

CUDA_HOME = _find_cuda_home()
CUDNN_HOME = os.environ.get('CUDNN_HOME') or os.environ.get('CUDNN_PATH')
ROCM_HOME = _find_rocm_home()
CUDA_HOME = (ROCM_HOME if ROCM_HOME else _find_cuda_home())
Collaborator

Can we please use torch.version.hip to check if we want to use ROCm?
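
For context, torch.version.hip is None on CUDA/CPU builds of PyTorch and holds a version string on ROCm builds, so it can serve as the signal being asked for here. A minimal sketch of such a check (illustration only, not the exact code in this PR):

```python
import torch

# torch.version.hip is None on CUDA/CPU builds and a version string on ROCm
# builds, so it distinguishes "torch was built for ROCm" from "a ROCm install
# merely exists on the machine".
is_hip_extension = torch.version.hip is not None

if is_hip_extension:
    print("Building extensions with HIP, torch.version.hip =", torch.version.hip)
else:
    print("Building extensions with the CUDA/CPU toolchain")
```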

@t-vi
Collaborator

t-vi commented Jan 28, 2020

Awesome! I think the detection of when to use HIP instead of CUDA needs fixing, but other than that it looks reasonable at first sight.

@zou3519
Contributor

zou3519 commented Jan 28, 2020

> @zou3519 Do you want to review this? You made major changes to cpp extensions recently.

Sure, I can review this.

At a glance I think this change is probably going to conflict a lot with #32495

@zou3519 zou3519 (Contributor) left a comment

We should try to add some tests if possible

"'-fPIC'"] + cflags + _get_cuda_arch_flags(cflags)
elif ROCM_HOME:
Contributor

If possible we should enable some tests in test_cpp_extension to run in our ROCM CI (I assume they don't run right now).

Author

@zou3519 - I just removed test_cpp_extensions_aot_no_ninja from ROCM_BLACKLIST. However, the CI still skipped it. Is there anywhere else I should change as well?

Contributor

I don't know how the mechanism works. @iotamudelta do you know how we can trigger the rocm tests on this PR?

Collaborator

This is a deficiency (well, probably on purpose, but not good here) in the definition of exclude_test that causes prefixes in the blacklist to match:

if test.startswith(exclude_test):

A quick fix could be to either move the blacklist to regular expressions or to allow "foo$" to manually mean "== 'foo'" instead of "startswith(foo)".
Or one could rename the tests: make the no_ninja one the normal one and suffix the ninja one...
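
A minimal sketch of what the "foo$" exact-match variant could look like (should_exclude, exclude_list, and the '$' convention are illustrative here, not the actual run_test.py code):

```python
def should_exclude(test, exclude_list):
    """Return True if `test` matches an entry in the exclusion list.

    An entry ending in '$' requires an exact name match; any other entry
    keeps the current prefix-matching behaviour.
    """
    for entry in exclude_list:
        if entry.endswith('$'):
            if test == entry[:-1]:
                return True
        elif test.startswith(entry):
            return True
    return False


# With prefix matching, excluding 'test_cpp_extensions_aot' also skips
# 'test_cpp_extensions_aot_no_ninja'; the '$' form would not.
print(should_exclude('test_cpp_extensions_aot_no_ninja', ['test_cpp_extensions_aot']))   # True
print(should_exclude('test_cpp_extensions_aot_no_ninja', ['test_cpp_extensions_aot$']))  # False
```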

Author

@t-vi, that would be a change with implications throughout the pytorch test suite. Would you recommend not touching it for this pull request and looking at it separately?

Collaborator

Upon further inspection, I'm not so sure whether the prefix matching is even the intended behaviour (e.g. non-x64 Windows blacklists both cpp_extension_aot and cpp_extension_aot_no_ninja).
I think that:

  • we really do want the test,
  • renaming the ninja-enabled test (the one whose name is the prefix) to a _ninja-suffixed name is probably best, and I would say it's legitimate to do it in this PR. Changing the overall behaviour, less so.

Collaborator

Note that renaming the test also requires updating the Windows blacklist etc.

Contributor

Oh, so that's the problem. I agree with @t-vi that renaming the ninja-enabled test is the easiest thing to do for now. Long-term we should file an issue and figure out if the prefix matching behavior is intended behavior or not; afaict it isn't documented in the test runner.

Author

Thank you for the direction. I renamed the tests; the CI run should now have the cpp_extensions test enabled for ROCm.

@zou3519
Contributor

zou3519 commented Jan 28, 2020

I'm not done reading through this yet, but I think it might be easier if we merge #32495 (which is still WIP on my end) first, because it refactors how cpp extensions do their building. I'm happy to help fix up this PR to match the style of the refactor after that.

@yf225 yf225 added the module: rocm (AMD GPU support for Pytorch) and triaged labels Jan 29, 2020
@iotamudelta
Contributor

I see that @t-vi has kindly already answered some of the questions. In general, the difference between CUDA and HIP/ROCm is kept as small as possible. This means the frontend is shared, and most of the backend is shared as well.

Concerning testing: ROCm test targets are accessible via pytorchbot.

NOTE: run_test.py's test_cpp_extensions_aot_no_ninja target
also runs this test case, but with ninja disabled. If you are debugging
NOTE: run_test.py's test_cpp_extensions_aot_ninja target
also runs this test case, but with ninja enabled. If you are debugging
Contributor

Ditto, this should either read neutral or read as if "test_cpp_extensions_aot (with ninja)" is the default. Developers of features for cpp extensions should run test_cpp_extensions_aot (with ninja) because it is faster.

cflags = COMMON_HIPCC_FLAGS + cflags + _get_rocm_arch_flags(cflags)
else:
cflags = unix_cuda_flags(cflags)
elif is_hip_extension:
Contributor

nit: I think this is slightly easier to read if we have the following:

if _is_cuda_file(src):
    ...
elif isinstance(cflags, dict):
    cflags = cflags['cxx']
if is_hip_extension:
    cflags = COMMON_HIPCC_FLAGS + cflags

so that we don't have to duplicate the isinstance(cflags, dict) check logic

Author

Simplified the condition

@zou3519 zou3519 (Contributor) left a comment

The contents of this look good to me. I think we should rename the test files so that there is no "bias" toward which test (test_cpp_extension_aot vs test_cpp_extension_aot_ninja) is the "default" one.

@zou3519 zou3519 (Contributor) left a comment

One last thing, otherwise lgtm

@@ -259,8 +308,8 @@ def build_extensions(self):
self._define_torch_extension_name(extension)
self._add_gnu_cpp_abi_flag(extension)

# Register .cu and .cuh as valid source extensions.
self.compiler.src_extensions += ['.cu', '.cuh']
# Register .cu, .cuh and .hip as valid source extensions.
Contributor

(putting this here because github won't let me put it elsewhere)

We should toss a check for use_ninja into BuildExtension.__init__. If use_ninja is True and we are building a ROCm extension, we should error out gracefully so that ROCm users aren't confused about the extension not building. (I am not sure when we'll be able to build ROCm extensions with ninja; I know that's next on your list, but this check is just in case we don't get to that by the next release.)

Author

Thank you for the catch. I just added a fallback in __init__, similar to the behavior when ninja is not present on the system.
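
For illustration, a minimal sketch of such a fallback. The IS_HIP_EXTENSION flag, the warning text, and the exact placement inside __init__ are assumptions for the sketch, not necessarily what this PR does:

```python
import warnings

import torch
from setuptools.command.build_ext import build_ext

# Hypothetical module-level flag; see the earlier torch.version.hip discussion.
IS_HIP_EXTENSION = torch.version.hip is not None


class BuildExtension(build_ext):
    """Sketch only: fall back from ninja, with a warning, for HIP extensions."""

    def __init__(self, *args, **kwargs):
        use_ninja = kwargs.pop('use_ninja', True)
        if use_ninja and IS_HIP_EXTENSION:
            warnings.warn("ninja is not yet supported for ROCm extensions; "
                          "falling back to the non-ninja build path.")
            use_ninja = False
        self.use_ninja = use_ninja
        super().__init__(*args, **kwargs)
```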

@facebook-github-bot facebook-github-bot (Contributor) left a comment

@zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@zou3519 merged this pull request in 616beb1.

ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
Summary:
This pull request has changes for:
1. Enabling a torch module with HIP code to be compiled by cpp_extensions.py
2. Fixes for hipify module to be able to be used by a torch extension

cc: ezyang iotamudelta jeffdaily
Pull Request resolved: pytorch#32669

Differential Revision: D20033893

Pulled By: zou3519

fbshipit-source-id: fd6ddc8cdcd3930f41008636bb2bc9dd26cdb008
Labels: Merged, module: rocm (AMD GPU support for Pytorch), open source, triaged