
avoid CPU std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 for CUDA #51834

Closed
t-vi wants to merge 1 commit

Conversation

t-vi
Collaborator

@t-vi t-vi commented Feb 6, 2021

It seems that the std::copysign code introduced in #51706 is too much for gcc 7.5 / 8 when compiling on arm64 (e.g. on a Jetson with the latest Jetpack) and causes the compiler to produce an internal compiler error with a segfault during compilation. This PR avoids the compiler bug by not using std::copysign.

A very kind person sent a Jetson Xavier NX 🎁 thank you ❤️.

After #51900 fixed this for CPU-only arm64 (e.g. Raspberry Pi), this fixes it for CUDA-using arm64 (e.g. Jetson). CUDA device lambdas must also be present as host functions for technical reasons, but the host versions are never called, so we just assert in the CPU variant instead of actually doing the operation.
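
To make the pattern concrete, here is a minimal sketch of the idea, not the actual diff (the real change lives in c10/cuda/CUDAMathCompat.h, and the function name here is made up for illustration):

// Minimal sketch (hypothetical name copysign_wrapper); needs
// #include <c10/util/Exception.h> for TORCH_INTERNAL_ASSERT.
__host__ __device__ inline float copysign_wrapper(float x, float y) {
#if defined(__CUDA_ARCH__)
  // Device compilation: use the CUDA builtin rather than std::copysign.
  return ::copysignf(x, y);
#else
  // Host compilation: this branch exists only to satisfy the
  // __host__ __device__ requirement and is never called, so assert
  // instead of touching std::copysign (which ICEs gcc 7.5/8 on arm64).
  TORCH_INTERNAL_ASSERT(false, "copysign_wrapper must not be called on the host");
  return x;
#endif
}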

@t-vi
Collaborator Author

t-vi commented Feb 6, 2021

@mruberry are you terribly attached to the copysign?

@facebook-github-bot
Contributor

facebook-github-bot commented Feb 6, 2021

💊 CI failures summary and remediations

As of commit 804c0c6 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-scanned failure(s)

ci.pytorch.org: 1 failed



@t-vi
Collaborator Author

t-vi commented Feb 6, 2021

Turns out there are more of these in the CUDA bits.

@t-vi t-vi changed the title avoid std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 WIP: avoid std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 Feb 6, 2021
@t-vi
Collaborator Author

t-vi commented Feb 6, 2021

So I found that the CUDA copysign also causes ICEs; it might be related to https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=315fdae8f965045f86e966953f3c010a61072729 , but I'm still investigating.

@t-vi t-vi changed the title WIP: avoid std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 avoid std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 Feb 6, 2021
@t-vi
Collaborator Author

t-vi commented Feb 6, 2021

Seems that MSVC doesn't like the ifdefs...

@t-vi
Collaborator Author

t-vi commented Feb 7, 2021

So it turns out that signbit(x) ? -abs(x) : abs(x) isn't working well with negative NaNs; I'll try bit operations next.
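
For reference, a sketch of the bit-level idea for the float case (the double case would use uint64_t; the name and helper here are made up for illustration):

#include <cstdint>
#include <cstring>

// Combine the magnitude bits of a with the sign bit of b. Unlike
// signbit(b) ? -std::abs(a) : std::abs(a), this also transfers the sign
// onto NaN values, matching what numpy's copysign does.
inline float copysign_bits(float a, float b) {
  std::uint32_t ua, ub;
  std::memcpy(&ua, &a, sizeof(ua));
  std::memcpy(&ub, &b, sizeof(ub));
  const std::uint32_t ur = (ua & 0x7fffffffu) | (ub & 0x80000000u);
  float r;
  std::memcpy(&r, &ur, sizeof(r));
  return r;
}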

@t-vi t-vi changed the title avoid std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 WIP: avoid std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 Feb 7, 2021
@mruberry mruberry added the triaged label (this issue has been looked at by a team member and prioritized into an appropriate module) Feb 8, 2021
@mruberry mruberry requested a review from ngimel February 8, 2021 07:36
@mruberry
Collaborator

mruberry commented Feb 8, 2021

Hey @t-vi, sorry to hear the PR caused an issue, and thank you for working on a fix.

@mruberry are you terribly attached to the copysign?

Not if it's preventing PyTorch from building on a supported platform.

We just cut the 1.8 release branch. cc @gchanan to consider this for cherrypicking.

At first look the fix seems reasonable to me; I also added @ngimel and @peterbell10 to take a look. Ping when it's ready for review.

Why do other uses of std::copysign not affect this build, by the way? And what do you think is the best way to avoid these issues in the future?

@ptrblck
Collaborator

ptrblck commented Feb 8, 2021

Thanks for working on this, @t-vi! :)

CC @dusty-nv, @shmsong, @csarofeen as this PR seems to be needed for the Jetson wheels for the 1.8 release.

@mruberry mruberry requested a review from ptrblck February 8, 2021 08:13
@t-vi
Collaborator Author

t-vi commented Feb 8, 2021

Why do other uses of std::copysign not affect this build, by the way? And what do you think is the best way to avoid these issues in the future?

  • So I think the circumstances in which the gcc bug is triggered are rather special in terms of which registers are involved, etc. I must admit I didn't try hard to avoid lambdas or the like to see if that works. It seems that (unsurprisingly) arm64 isn't that well tested on compilers from 2018 (gcc 7, 8.3). Given the growth of arm64, I would expect this type of bug to be much rarer in 2020 compilers.
  • The obvious remedy is to compile with a newer gcc, but this depends on Jetpack/CUDA. In particular, Jetpack seems to ship Ubuntu 18.04 with CUDA 10.2. In this release gcc-7 and gcc-8 both have the bug, and CUDA declares gcc-8 to be the highest supported version. I initially tried to go that route and compiled with a gcc-7 rebuilt from Ubuntu 20.04 sources, but that still has the bug, and I didn't try building a newer gcc-8. Obviously, if the compiler issue linked above is the one causing it, a targeted fix in Jetpack could also help.
  • I reverted the title to WIP because it turned out that copysign tests fail due to failures involving the sign of NaN (there are positive and negative NaNs, and numpy distinguishes between them for copysign purposes). I now have a fix that does bit manipulation after reinterpreting as uint64_t/uint32_t, but my process of compiling on the Jetson and then moving things over isn't terribly fast, so I need a few hours before I submit it.
  • Incidentally, there are some other test failures in binary ops, too, which seem to be unrelated.
  • There are not-yet-upstreamed patches for the thread layout of kernels, so I'm not sure how likely people are to compile unpatched PyTorch from source.

The other part is that Raspberry Pi OS also has gcc 8.3 as the default; I don't know whether it has the fix or not, but I can upgrade my packages there, too, to check. PyTorch did build there in January, but I didn't systematically run tests.

@mruberry mruberry requested a review from malfet February 8, 2021 08:27
@mruberry
Collaborator

mruberry commented Feb 8, 2021

Thanks for elaborating.

  • Incidentally, there are some other test failures in binary ops, too, which seem to be unrelated.

Would you elaborate on this? You mean there are other tests failing on Jetson?

@t-vi
Collaborator Author

t-vi commented Feb 8, 2021

There are a number of tests that just require newer numpy, so arguably this isn't a problem (AttributeError: module 'numpy.random' has no attribute 'default_rng' is by far the most common). I have no idea whether adapting to older numpy is worthwhile.

In the binary ufuncs test script, I'm seeing these (but this is on my branch, so I might have screwed up something):

FAIL: test_cpow_cpu_float32 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 22 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 3.881516218185425 (6.514925956726074 vs. 2.6334097385406494), which occurred at index (1, 0).
FAIL: test_cremainder_cpu_float32 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 29 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.9989410638809204 (0.36731672286987305 vs. 1.3662577867507935), which occurred at index (2, 9).
FAIL: test_div_rounding_modes_cpu_bfloat16 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0.016 and atol=1e-05, found 28 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 8.53125 (0.46875 vs. 9.0), which occurred at index 57.
FAIL: test_div_rounding_modes_cpu_float16 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0.001 and atol=1e-05, found 25 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 7.453125 (-1.265625 vs. -8.71875), which occurred at index 13.
FAIL: test_div_rounding_modes_cpu_float32 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 25 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 6.30060338973999 (-0.44316625595092773 vs. -6.743769645690918), which occurred at index 95.
FAIL: test_fmod_remainder_cpu_float32 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 6 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.5688362121582031 (1.573869228363037 vs. 0.005033016204833984), which occurred at index (5, 7).
FAIL: test_fmod_remainder_cpu_int16 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 9 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 5.577301025390625 (-6.0 vs. -0.422698974609375), which occurred at index (8, 9).
FAIL: test_fmod_remainder_cpu_int32 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 9 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 5.577301025390625 (-6.0 vs. -0.422698974609375), which occurred at index (8, 9).
FAIL: test_fmod_remainder_cpu_int64 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 9 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 5.577301025390625 (-6.0 vs. -0.422698974609375), which occurred at index (8, 9).
FAIL: test_fmod_remainder_cpu_int8 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 9 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 5.577301025390625 (-6.0 vs. -0.422698974609375), which occurred at index (8, 9).
FAIL: test_fmod_remainder_cpu_uint8 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 6 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.9990959167480469 (2.0 vs. 0.000904083251953125), which occurred at index (4, 7).
FAIL: test_hypot_cpu_float32 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Scalars failed to compare as equal! Comparing 0.4336802363395691 and 0.28148531913757324 gives a difference of 0.15219491720199585, but the allowed difference with rtol=1.3e-06 and atol=1e-05 is only 1.0365930914878845e-05!
FAIL: test_logaddexp2_cpu_float32 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Scalars failed to compare as equal! Comparing -0.019721556454896927 and nan gives a difference of nan, but the allowed difference with rtol=1.3e-06 and atol=1e-05 is only nan!
FAIL: test_logaddexp_cpu_float32 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Scalars failed to compare as equal! Comparing -0.32649093866348267 and -1.0432298183441162 gives a difference of 0.7167388796806335, but the allowed difference with rtol=1.3e-06 and atol=1e-05 is only 1.1356198763847352e-05!
FAIL: test_nextafter_cpu_float32 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Scalars failed to compare as equal! Comparing 0.3206480145454407 and 0.3206479549407959 gives a difference of 5.960464477539063e-08, but the allowed difference with rtol=0 and atol=0 is only 0.0!
FAIL: test_pow_cpu (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 42 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 2.9337480068206787 (4.66788387298584 vs. 1.7341358661651611), which occurred at index 78.

@t-vi t-vi closed this Feb 8, 2021
@t-vi t-vi reopened this Feb 8, 2021
#if (!defined(__aarch64__)) || defined(__clang__) || \
(__GNUC__ > 8 || (__GNUC__ == 8 && __GNUC_MINOR__ > 3))
// std::copysign gets ICE/Segfaults with gcc 7.5/8 on arm64
// (e.g. Jetson), see PyTorch PR #51834
return std::copysign(a, b);
Collaborator

Seemed weird to me that copysign was already being used without issue. Looking at the blame, this template comes from #47413, which says it "avoids internal compiler error exceptions on aarch64 platforms".

So I'm guessing this doesn't need to be changed, and instead it could be moved somewhere to share it with the CUDA code. Worth a shot at least?

Collaborator Author

Well, the ICE is only triggered in specific circumstances, so the comment here is probably misleading.
(TBH I wouldn't expect gcc to ship completely broken builtins, and it seems much more likely that there are corner cases.)
I changed this template because I noticed that tests didn't pass on arm64, and after some wild editing it seems better now.
I do think we could take the code I currently put in CUDAMathCompat.h and put it somewhere shared, but I don't really know where (and I'm only doing this off-and-on, and iterating on things involving headers has a rather long turnaround time when working on an embedded device without cross-compiling).

Collaborator

The linked PR suggests it's related to calling std::copysign with Half and BFloat16 types. If that's true, then there's no need to avoid std::copysign completely.
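
Roughly, a dedicated copysign helper along those lines could look like this (a sketch only; the actual templates live in PyTorch's headers under the c10 namespace and may differ in detail):

#include <cmath>
#include <c10/util/Half.h>
#include <c10/util/BFloat16.h>

namespace c10 {
// Generic case: forward to std::copysign for the regular floating types.
template <typename T>
inline T copysign(T a, T b) {
  return std::copysign(a, b);
}
// Half and BFloat16 have no std::copysign overload; compute in float.
inline c10::Half copysign(c10::Half a, c10::Half b) {
  return c10::Half(std::copysign(static_cast<float>(a), static_cast<float>(b)));
}
inline c10::BFloat16 copysign(c10::BFloat16 a, c10::BFloat16 b) {
  return c10::BFloat16(std::copysign(static_cast<float>(a), static_cast<float>(b)));
}
} // namespace c10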

Contributor

Yeah, moving the copysign templates solves the issue in the .cpp file for both gcc-7 and gcc-8; let me do that in a separate PR (as well as file a separate issue/gcc bug for it).

Collaborator Author

Ha, that will be much more elegant.

Contributor

@malfet malfet Feb 8, 2021

Here is the PR that fixes it on CPU (for both gcc-7 and 8): #51900
I wonder if making the lambdas GPU-only would solve the problem for CUDA.

Collaborator Author

So I don't know how making the lambdas GPU-only would work; GPU_LAMBDA makes them __host__ __device__ as far as I can tell.

Collaborator

Yep, GPU_LAMBDA is __host__ __device__, and there was a valid reason for that (to work around other compiler bugs, IIRC) which may or may not remain valid. Replacing the lambdas with functors that have __device__-only operator() functions may work.
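
A rough sketch of that functor idea (hypothetical; as noted further down in the thread, TensorIterator's gpu_kernel currently expects a __host__ __device__ callable, so this may not work as-is):

// A functor whose operator() is __device__-only, so the host compiler
// never has to compile a std::copysign body and cannot hit the arm64 ICE.
// Assumes a float/double scalar_t; half types would need casting.
template <typename scalar_t>
struct CopysignFunctor {
  __device__ scalar_t operator()(scalar_t a, scalar_t b) const {
    return ::copysign(a, b);
  }
};

// Usage would replace the GPU_LAMBDA, e.g.:
//   gpu_kernel_with_scalars(iter, CopysignFunctor<scalar_t>());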

Contributor

I think many of the reasons for __host__ __device__ are specific to the Windows compiler. I wonder whether a repo-wide effort to replace __host__ __device__ with __device__ only on Linux would speed up compilation and fix this sort of problem.

@mruberry mruberry added the module: jetson label (Related to the Jetson builds by NVIDIA) Feb 8, 2021
@t-vi
Collaborator Author

t-vi commented Feb 8, 2021

@malfet Yes, it also happens with 18.04's gcc-8 and with gcc-7 rebuilt from 20.04 on 18.04. So I'm relatively sure that if we want to support compilation on these platforms, we need the BinaryOpsKernel.cpp patch (or something similar) and for CUDA on these platforms we need the other two bits.

I'm sure there are better ways to implement it, but I tried with less and it didn't work...

@Lissanro

Lissanro commented Mar 3, 2021

I tried to use this patch to compile the latest PyTorch on a Jetson Nano with GCC 8, but it did not work at first. I had to replace the following in four places:

(__GNUC__ > 8 || (__GNUC__ == 8 && __GNUC_MINOR__ > 3))

With this:

(__GNUC__ > 8)

To make it work. I guess somebody thought the bug would be fixed in GCC versions above 8.3, but even with 8.4.0 it still crashes. Here is an updated patch that can be applied to current PyTorch: http://Dragon.Studio/2021/03/51834.diff
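
With that change, the relaxed guard reads (sketch, mirroring the snippet quoted earlier in the review):

#if (!defined(__aarch64__)) || defined(__clang__) || (__GNUC__ > 8)
// std::copysign still ICEs gcc 8.4 on arm64, so only gcc > 8 takes this path
return std::copysign(a, b);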

@t-vi
Collaborator Author

t-vi commented Mar 3, 2021

Thank you! Yes, I took the "known bad" approach here: I knew gcc 8.3 was bad, rather than knowing that anything > 8.3 would be good. Thank you for reporting back.

Did anyone check whether removing __host__ from the lambdas works on Linux?
If it works, I'd consider that the best approach and give it a try.

@t-vi
Collaborator Author

t-vi commented Mar 3, 2021

So to update this: GPU_LAMBDA needs to be __host__ __device__ in order for TensorIterator to do its magic.

@t-vi
Collaborator Author

t-vi commented Apr 6, 2021

@shmsong I wonder if this could be a more straightforward alternative to your PR #53790 (where to me the typing is very hard to understand).

@t-vi t-vi changed the title WIP: avoid std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 avoid CPU std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 for CUDA Apr 7, 2021
@t-vi
Collaborator Author

t-vi commented Apr 7, 2021

@malfet @ngimel @shmsong I think this is ready for review now. The ROCm failure seems unrelated to this PR.

@facebook-github-bot
Contributor

@malfet has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@malfet
Contributor

malfet commented Apr 7, 2021

@t-vi thank you very much for the targeted fix!

@facebook-github-bot
Contributor

@malfet merged this pull request in 3bb1f59.

@facebook-github-bot
Contributor

This pull request has been reverted by b39eeb0.

@ngimel
Collaborator

ngimel commented Apr 8, 2021

Sorry, had to revert this; internal builds are failing with

caffe2/c10/cuda/CUDAMathCompat.h(61): error: identifier "TORCH_INTERNAL_ASSERT" is undefined

@t-vi
Collaborator Author

t-vi commented Apr 8, 2021

Oh, sorry. Is it OK to just #include <c10/util/Exception.h> or does it need more elaborate treatment?

malfet pushed a commit to malfet/pytorch that referenced this pull request Apr 8, 2021
… / 8 for CUDA (pytorch#51834)

Summary:
It seems that the std::copysign code introduced in pytorch#51706 is too much for gcc 7.5 / 8 when compiling on arm64 (e.g. on a Jetson with the latest Jetpack) and causes the compiler to produce an internal compiler error with a segfault during compilation. This avoids the compiler bug by not using std::copysign.

A very kind person sent a Jetson Xavier NX 🎁 thank you ❤️.

After pytorch#51900 fixed this for CPU-only arm64 (e.g. Raspberry Pi), this fixes it for CUDA-using arm64 (e.g. Jetson). CUDA device lambdas must also be present as host functions for technical reasons, but the host versions are never called, so we just assert in the CPU variant instead of actually doing the operation.

Pull Request resolved: pytorch#51834

Reviewed By: mrshenli

Differential Revision: D27622277

Pulled By: malfet

fbshipit-source-id: a1dc4c3a67f925019782e24b796919e17339749f
@malfet
Contributor

malfet commented Apr 8, 2021

@t-vi re-landing in #55608

@t-vi
Collaborator Author

t-vi commented Apr 8, 2021

Thank you, @malfet!

facebook-github-bot pushed a commit that referenced this pull request Apr 8, 2021
…5608)

Summary:
Re-land of #51834

Pull Request resolved: #55608

Reviewed By: ngimel

Differential Revision: D27649077

Pulled By: malfet

fbshipit-source-id: 1a21611fb12106f75fe50e8f9f14796ab6ab9464
Labels
cla signed, Merged, module: arm (Related to ARM architecture builds of PyTorch, including Apple M1), module: jetson (Related to the Jetson builds by NVIDIA), open source, Reverted, triaged (looked at by a team member and prioritized into an appropriate module)

9 participants