
reland fast TypeMeta/ScalarType conversion #45544

Closed
wants to merge 12 commits

Conversation

@bhosmer (Contributor) commented Sep 30, 2020

Stack from ghstack:

This is a reland of #44965, which was reverted because it triggered a latent issue in the ROCm integration, since fixed by @jeffdaily as noted in comments below (thanks Jeff!). Diff is identical except for rebasing and removal of a duplicate definition that Jeff spotted during troubleshooting. See #44965 for description and comments.

Differential Revision: D24006482

dr-ci bot commented Sep 30, 2020

💊 CI failures summary and remediations

As of commit 38ebd0e (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch:

Since your merge base is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.


This comment was automatically generated by Dr. CI.

@bhosmer (Contributor, Author) commented Oct 22, 2020

Hi @jeffdaily - I'm hoping you can advise me on next steps on a mysterious ROCm error this PR is triggering. (@ngimel tells me you're the person to ask in such situations :)

You can see the failure in this log. Here's the relevant excerpt (the "HEY" messages are me bisecting - the position of the initial MIOpen error oddly precedes any actual op invocation, although the python output might just be out of sync with what's getting thrown out of MIOpen):

05:06:25 test_autocast_torch_fp16 (__main__.TestCuda) ... MIOpen Error: /root/driver/MLOpen/src/gemm_v2.cpp:523: rocBlas error encountered
05:06:26 HEY common_utils.TestCase.setUp top
05:06:26 HEY common_utils.TestCase.setUp bot
05:06:26 HEY CudaMemoryLeakCheck.__enter__ top
05:06:26 HEY CudaMemoryLeakCheck.__enter__ bot
05:06:26 HEY test_autocast_torch_fp16 top
05:06:26 HEY test_autocast_torch_fp16 before loop
05:06:26 HEY test_autocast_torch_fp16 in loop op {op}
05:06:26 HEY _run_autocast_outofplace top op _convolution
05:06:26 HEY _run_autocast_outofplace inside with op _convolution
05:06:26 HEY before module getattr <module 'torch' from '/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/__init__.py'>
05:06:26 HEY before module op call <module 'torch' from '/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/__init__.py'>
05:06:26 ERROR
05:06:26 test_autocast_torch_fp32 (__main__.TestCuda) ... Memory exception on virtual address 0x0, node id 2 : Page not present
05:06:26 Address does not belong to a known buffer
05:06:26 Memory access fault by GPU node-2 (Agent handle: 0x55b789a56f60) on address (nil). Reason: Page not present or supervisor privilege.
05:06:27 Traceback (most recent call last):
05:06:27   File "test/run_test.py", line 788, in <module>
05:06:27     main()
05:06:27   File "test/run_test.py", line 771, in main
05:06:27     raise RuntimeError(err_message)
05:06:27 RuntimeError: test_cuda failed! Received signal: SIGIOT

Hoping there are some obvious next things to try - I'm new to troubleshooting ROCm errors, so any guidance would be super appreciated (and happy to provide any details on the PR that might be useful). Thanks!

@jeffdaily (Collaborator):

Hi @bhosmer. I'll take a look at this today. And yes, myself or @sunway513 are the best points of contact for ROCm issues.

@bhosmer (Contributor, Author) commented Oct 22, 2020

> Hi @bhosmer. I'll take a look at this today. And yes, myself or @sunway513 are the best points of contact for ROCm issues.

Thank you Jeff! If there's any info I can supply to help troubleshoot, of course just LMK :)

@jeffdaily (Collaborator):

@bhosmer A few observations while I'm still digging.

Looking over the diff for this PR, I'm not completely familiar with the need (or lack) of various C10 macros such as C10_EXPORT, C10_TYPENAME_CONSTEXPR, but I do see between the original and new code some inconsistency in their use.

Perhaps we're getting some different behavior between the host compiler (gcc) and our device compiler hipcc (hip-clang). Looking at the top of c10/util/TypeIndex.h,

gcc will take the __GNUC__ && __GNUC__ < 9 path, setting
#define C10_TYPENAME_SUPPORTS_CONSTEXPR 0
#define C10_TYPENAME_CONSTEXPR

and our hipcc will take the __clang__ && ! defined(__CUDACC__) path, setting
#define C10_TYPENAME_SUPPORTS_CONSTEXPR 1
#define C10_TYPENAME_CONSTEXPR constexpr
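
For reference, a simplified sketch of that dispatch, reconstructed from the two paths described above (the real header covers additional compilers and versions):

// Simplified sketch of the compiler dispatch at the top of c10/util/TypeIndex.h.
#if defined(__clang__) && !defined(__CUDACC__)
// hip-clang (the ROCm device compiler) takes this path: constexpr type names enabled.
#define C10_TYPENAME_SUPPORTS_CONSTEXPR 1
#define C10_TYPENAME_CONSTEXPR constexpr
#elif defined(__GNUC__) && __GNUC__ < 9
// the gcc host compiler takes this path: the macro expands to nothing,
// so the affected functions are plain runtime functions.
#define C10_TYPENAME_SUPPORTS_CONSTEXPR 0
#define C10_TYPENAME_CONSTEXPR
#endif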

@bhosmer (Contributor, Author) commented Oct 22, 2020

> @bhosmer A few observations while I'm still digging.
>
> Looking over the diff for this PR, I'm not completely familiar with the need (or lack) of various C10 macros such as C10_EXPORT, C10_TYPENAME_CONSTEXPR, but I do see between the original and new code some inconsistency in their use.
>
> Perhaps we're getting some different behavior between the host compiler (gcc) and our device compiler hipcc (hip-clang). Looking at the top of c10/util/TypeIndex.h,
>
> gcc will take the __GNUC__ && __GNUC__ < 9 path, setting
> #define C10_TYPENAME_SUPPORTS_CONSTEXPR 0
> #define C10_TYPENAME_CONSTEXPR
>
> and our hipcc will take the __clang__ && ! defined(__CUDACC__) path, setting
> #define C10_TYPENAME_SUPPORTS_CONSTEXPR 1
> #define C10_TYPENAME_CONSTEXPR constexpr

@jeffdaily yes! I totally agree that using constexpr conditionally based on compiler version seems fragile - that's why the current PR should only be eliminating use of the C10_*_CONSTEXPR macros (the effect would be that all compilers now agree that the definitions are non-constexpr).

On C10_EXPORT, I think the only change in this PR is a removal from a definition that became (unconditionally) constexpr - that macro defines visibility attributes, so I think it's now unnecessary. OTOH re-adding it shouldn't hurt anything, happy to try adding it back in if you think it's worth eliminating it as a source of jitter.
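
To make the two patterns concrete, here is a hypothetical before/after sketch (the identifiers below are invented for illustration; they are not the actual declarations touched by this diff):

#include <cstdint>
#include <string>

// Pattern 1: dropping the conditional-constexpr macro.
// Before (constexpr only where C10_TYPENAME_SUPPORTS_CONSTEXPR is 1):
//   C10_EXPORT C10_TYPENAME_CONSTEXPR std::string name_of_type();
// After: the macro is gone, so every compiler sees a plain (non-constexpr) function.
std::string name_of_type();

// Pattern 2: a definition that became unconditionally constexpr, where the
// C10_EXPORT visibility attribute was dropped because (per the comment above)
// it should no longer be needed on a constexpr header-only definition.
constexpr uint16_t index_of_type() { return 0; }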

@jeffdaily (Collaborator):

Trying to understand this PR. You've changed the size of TensorImpl by changing the caffe2::TypeMeta data_type_ member's size, where TypeMeta originally stored a pointer and is now a uint16_t. Am I right, or did I miss something? Then at the end of TensorImpl.h, the static_assert went from sizeof(TensorImpl) == sizeof(int64_t) * 31 to 30, but the savings were 48 bits, not 64?

@bhosmer (Contributor, Author) commented Oct 23, 2020

> Trying to understand this PR. You've changed the size of TensorImpl by changing the caffe2::TypeMeta data_type_ member's size, where TypeMeta originally stored a pointer and is now a uint16_t. Am I right, or did I miss something? Then at the end of TensorImpl.h, the static_assert went from sizeof(TensorImpl) == sizeof(int64_t) * 31 to 30, but the savings were 48 bits, not 64?

Yup, you're exactly right. IIRC the reason TensorImpl went down a full 64 bits is because data_type_ now fits next to device_opt_ in a single word. [Edit: the other thing in play for these TensorImpl reductions (#45263 shrinks it as well) is the final word not being completely filled and a member size reduction crossing the word boundary. Don't remember for sure which caused the reduction in this PR though.]
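
A self-contained sketch of that packing effect, using hypothetical structs rather than the real TensorImpl (the asserts assume a typical 64-bit ABI with 8-byte pointers):

#include <cstdint>

struct OldLayout {
  void* first_word;        // some 8-byte-aligned member
  const void* data_type;   // old TypeMeta: a pointer (64 bits)
  int16_t device_like;     // small trailing member, padded out to a full word
};

struct NewLayout {
  void* first_word;
  uint16_t data_type;      // new TypeMeta: a 16-bit index
  int16_t device_like;     // now shares the same word as data_type
};

// The member itself only shrank by 48 bits, but because the small trailing
// member now packs into the same word, the struct loses a full 64-bit word.
// (Exact padding is implementation-defined, but this holds on typical LP64 ABIs.)
static_assert(sizeof(OldLayout) == sizeof(int64_t) * 3, "old layout: 3 words");
static_assert(sizeof(NewLayout) == sizeof(int64_t) * 2, "new layout: 2 words");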

@jeffdaily (Collaborator):

Your PR has somehow tickled a latent bug (almost 2 years old) in our miopen integration. I'm testing a fix in our fork before submitting an upstream PR.

ROCm#760

@bhosmer (Contributor, Author) commented Oct 23, 2020

> Your PR has somehow tickled a latent bug (almost 2 years old) in our miopen integration. I'm testing a fix in our fork before submitting an upstream PR.
>
> ROCmSoftwarePlatform#760

Oh man, I can't thank you enough for tracking this down so quickly!

@bhosmer (Contributor, Author) commented Oct 25, 2020

@jeffdaily btw, I'm happy to do an upstream PR for the fix if that's easier!

@jeffdaily (Collaborator):

PR #46852 ready for review to unblock this PR.

@bhosmer requested a review from ezyang on October 27, 2020 01:27
@ezyang (Contributor) left a comment:


In the end, no changes here, right? Just the upstream rocm bugfix.

@jeffdaily (Collaborator):

@ezyang yes, just needed the upstream rocm bugfix.

@facebook-github-bot (Contributor):

@bhosmer merged this pull request in 377a09c.

@facebook-github-bot deleted the gh/bhosmer/51/head branch November 2, 2020 15:17
Labels: Merged, oncall: jit (add this issue/PR to JIT oncall triage queue)
4 participants