
reland fast TypeMeta/ScalarType conversion #45544

Closed
wants to merge 12 commits

Conversation

@bhosmer (Contributor) commented Sep 30, 2020

Stack from ghstack:

This is a reland of #44965, which was reverted because it triggered a latent issue in the ROCm integration, since fixed by @jeffdaily as noted in comments below (thanks Jeff!). Diff is identical except for rebasing and removal of a duplicate definition that Jeff spotted during troubleshooting. See #44965 for description and comments.

Differential Revision: D24006482

dr-ci bot commented Sep 30, 2020

💊 CI failures summary and remediations

As of commit 38ebd0e (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch:

Since your merge base is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.


This comment was automatically generated by Dr. CI.

@bhosmer (Contributor, Author) commented Oct 22, 2020

Hi @jeffdaily - I'm hoping you can advise me on next steps on a mysterious ROCm error this PR is triggering. (@ngimel tells me you're the person to ask in such situations :)

You can see the failure in this log. Here's the relevant excerpt (the "HEY" messages are me bisecting - the position of the initial MIOpen error oddly precedes any actual op invocation, although the python output might just be out of sync with what's getting thrown out of MIOpen):

05:06:25 test_autocast_torch_fp16 (__main__.TestCuda) ... MIOpen Error: /root/driver/MLOpen/src/gemm_v2.cpp:523: rocBlas error encountered
05:06:26 HEY common_utils.TestCase.setUp top
05:06:26 HEY common_utils.TestCase.setUp bot
05:06:26 HEY CudaMemoryLeakCheck.__enter__ top
05:06:26 HEY CudaMemoryLeakCheck.__enter__ bot
05:06:26 HEY test_autocast_torch_fp16 top
05:06:26 HEY test_autocast_torch_fp16 before loop
05:06:26 HEY test_autocast_torch_fp16 in loop op {op}
05:06:26 HEY _run_autocast_outofplace top op _convolution
05:06:26 HEY _run_autocast_outofplace inside with op _convolution
05:06:26 HEY before module getattr <module 'torch' from '/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/__init__.py'>
05:06:26 HEY before module op call <module 'torch' from '/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/__init__.py'>
05:06:26 ERROR
05:06:26 test_autocast_torch_fp32 (__main__.TestCuda) ... Memory exception on virtual address 0x0, node id 2 : Page not present
05:06:26 Address does not belong to a known buffer
05:06:26 Memory access fault by GPU node-2 (Agent handle: 0x55b789a56f60) on address (nil). Reason: Page not present or supervisor privilege.
05:06:27 Traceback (most recent call last):
05:06:27   File "test/run_test.py", line 788, in <module>
05:06:27     main()
05:06:27   File "test/run_test.py", line 771, in main
05:06:27     raise RuntimeError(err_message)
05:06:27 RuntimeError: test_cuda failed! Received signal: SIGIOT

Hoping there are some obvious next things to try - I'm new to troubleshooting ROCm errors, so any guidance would be super appreciated (and happy to provide any details on the PR that might be useful). Thanks!

@jeffdaily (Collaborator):

Hi @bhosmer. I'll take a look at this today. And yes, myself or @sunway513 are the best points of contact for ROCm issues.

@bhosmer (Contributor, Author) commented Oct 22, 2020

> Hi @bhosmer. I'll take a look at this today. And yes, myself or @sunway513 are the best points of contact for ROCm issues.

Thank you Jeff! If there's any info I can supply to help troubleshoot, of course just LMK :)

@jeffdaily (Collaborator):

@bhosmer A few observations while I'm still digging.

Looking over the diff for this PR, I'm not completely familiar with the need (or lack) of various C10 macros such as C10_EXPORT, C10_TYPENAME_CONSTEXPR, but I do see between the original and new code some inconsistency in their use.

Perhaps we're getting some different behavior between the host compiler (gcc) and our device compiler hipcc (hip-clang). Looking at the top of c10/util/TypeIndex.h,

gcc will take the __GNUC__ && __GNUC__ < 9 path, setting
#define C10_TYPENAME_SUPPORTS_CONSTEXPR 0
#define C10_TYPENAME_CONSTEXPR

and our hipcc will take the __clang__ && ! defined(__CUDACC__) path, setting
#define C10_TYPENAME_SUPPORTS_CONSTEXPR 1
#define C10_TYPENAME_CONSTEXPR constexpr
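
For reference, a simplified sketch of that dispatch, reconstructed from the two paths described above (the real header covers additional compilers and versions):

// Simplified sketch of the compiler dispatch at the top of c10/util/TypeIndex.h.
#if defined(__clang__) && !defined(__CUDACC__)
// hip-clang (the ROCm device compiler) takes this path: constexpr type names enabled.
#define C10_TYPENAME_SUPPORTS_CONSTEXPR 1
#define C10_TYPENAME_CONSTEXPR constexpr
#elif defined(__GNUC__) && __GNUC__ < 9
// the gcc host compiler takes this path: the macro expands to nothing,
// so the affected functions are plain runtime functions.
#define C10_TYPENAME_SUPPORTS_CONSTEXPR 0
#define C10_TYPENAME_CONSTEXPR
#endif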

@bhosmer (Contributor, Author) commented Oct 22, 2020

> @bhosmer A few observations while I'm still digging.
>
> Looking over the diff for this PR, I'm not completely familiar with the need (or lack) of various C10 macros such as C10_EXPORT, C10_TYPENAME_CONSTEXPR, but I do see between the original and new code some inconsistency in their use.
>
> Perhaps we're getting some different behavior between the host compiler (gcc) and our device compiler hipcc (hip-clang). Looking at the top of c10/util/TypeIndex.h,
>
> gcc will take the __GNUC__ && __GNUC__ < 9 path, setting
> #define C10_TYPENAME_SUPPORTS_CONSTEXPR 0
> #define C10_TYPENAME_CONSTEXPR
>
> and our hipcc will take the __clang__ && ! defined(__CUDACC__) path, setting
> #define C10_TYPENAME_SUPPORTS_CONSTEXPR 1
> #define C10_TYPENAME_CONSTEXPR constexpr

@jeffdaily yes! I totally agree that using constexpr conditionally based on compiler version seems fragile - that's why the current PR should only be eliminating use of the C10_*_CONSTEXPR macros (the effect would be that all compilers now agree that the definitions are non-constexpr).

On C10_EXPORT, I think the only change in this PR is a removal from a definition that became (unconditionally) constexpr - that macro defines visibility attributes, so I think it's now unnecessary. OTOH re-adding it shouldn't hurt anything, happy to try adding it back in if you think it's worth eliminating it as a source of jitter.
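
To make the two patterns concrete, here is a hypothetical before/after sketch (the identifiers below are invented for illustration; they are not the actual declarations touched by this diff):

#include <cstdint>
#include <string>

// Pattern 1: dropping the conditional-constexpr macro.
// Before (constexpr only where C10_TYPENAME_SUPPORTS_CONSTEXPR is 1):
//   C10_EXPORT C10_TYPENAME_CONSTEXPR std::string name_of_type();
// After: the macro is gone, so every compiler sees a plain (non-constexpr) function.
std::string name_of_type();

// Pattern 2: a definition that became unconditionally constexpr, where the
// C10_EXPORT visibility attribute was dropped because (per the comment above)
// it should no longer be needed on a constexpr header-only definition.
constexpr uint16_t index_of_type() { return 0; }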

@jeffdaily (Collaborator):

Trying to understand this PR. You've changed the size of TensorImpl by changing the caffe2::TypeMeta data_type_ member's size, where TypeMeta originally stored a pointer and is now a uint16_t. Am I right, or did I miss something? Then at the end of TensorImpl.h, the static_assert went from sizeof(TensorImpl) == sizeof(int64_t) * 31 to 30, but the savings were 48 bits, not 64?

@bhosmer (Contributor, Author) commented Oct 23, 2020

> Trying to understand this PR. You've changed the size of TensorImpl by changing the caffe2::TypeMeta data_type_ member's size, where TypeMeta originally stored a pointer and is now a uint16_t. Am I right, or did I miss something? Then at the end of TensorImpl.h, the static_assert went from sizeof(TensorImpl) == sizeof(int64_t) * 31 to 30, but the savings were 48 bits, not 64?

Yup, you're exactly right. IIRC the reason TensorImpl went down a full 64 bits is because data_type_ now fits next to device_opt_ in a single word. [Edit: the other thing in play for these TensorImpl reductions (#45263 shrinks it as well) is the final word not being completely filled and a member size reduction crossing the word boundary. Don't remember for sure which caused the reduction in this PR though.]
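
A self-contained sketch of that packing effect, using hypothetical structs rather than the real TensorImpl (the asserts assume a typical 64-bit ABI with 8-byte pointers):

#include <cstdint>

struct OldLayout {
  void* first_word;        // some 8-byte-aligned member
  const void* data_type;   // old TypeMeta: a pointer (64 bits)
  int16_t device_like;     // small trailing member, padded out to a full word
};

struct NewLayout {
  void* first_word;
  uint16_t data_type;      // new TypeMeta: a 16-bit index
  int16_t device_like;     // now shares the same word as data_type
};

// The member itself only shrank by 48 bits, but because the small trailing
// member now packs into the same word, the struct loses a full 64-bit word.
// (Exact padding is implementation-defined, but this holds on typical LP64 ABIs.)
static_assert(sizeof(OldLayout) == sizeof(int64_t) * 3, "old layout: 3 words");
static_assert(sizeof(NewLayout) == sizeof(int64_t) * 2, "new layout: 2 words");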

@jeffdaily (Collaborator):

Your PR has somehow tickled a latent bug (almost 2 years old) in our miopen integration. I'm testing a fix in our fork before submitting an upstream PR.

ROCm#760

@bhosmer (Contributor, Author) commented Oct 23, 2020

> Your PR has somehow tickled a latent bug (almost 2 years old) in our miopen integration. I'm testing a fix in our fork before submitting an upstream PR.
>
> ROCmSoftwarePlatform#760

Oh man, I can't thank you enough for tracking this down so quickly!

@bhosmer (Contributor, Author) commented Oct 25, 2020

@jeffdaily btw, I'm happy to do an upstream PR for the fix if that's easier!

@jeffdaily (Collaborator):

PR #46852 ready for review to unblock this PR.

@bhosmer requested a review from ezyang on October 27, 2020 01:27
@ezyang (Contributor) left a comment:


In the end, no changes here, right? Just the upstream rocm bugfix.

@jeffdaily (Collaborator):

@ezyang yes, just needed the upstream rocm bugfix.

@facebook-github-bot (Contributor):

@bhosmer merged this pull request in 377a09c.

@facebook-github-bot deleted the gh/bhosmer/51/head branch November 2, 2020 15:17
Labels: Merged, oncall: jit (add this issue/PR to JIT oncall triage queue)
4 participants