
avoid CPU std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 for CUDA #51834

Closed
t-vi wants to merge 1 commit

Conversation

t-vi
Collaborator

@t-vi t-vi commented Feb 6, 2021

It seems that the std::copysign code introduced in #51706 is too much for gcc 7.5 / 8 when compiling on arm64 (e.g. on a Jetson with the latest Jetpack) and causes the compiler to produce an internal compiler error with a segfault during compilation. This PR avoids the compiler bug by not using std::copysign.

A very kind person sent a Jetson Xavier NX 🎁 thank you ❤️.

After #51900 fixed this for CPU-only arm64 (e.g. Raspberry Pi), this fixes it for CUDA-using arm64 (e.g. Jetson). CUDA device lambdas must also be present as host functions for technical reasons, but the host versions are never called, so we just assert in the CPU variant instead of actually doing the operation.
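
To make the pattern concrete, here is a minimal sketch of the idea, not the actual diff (the real change lives in c10/cuda/CUDAMathCompat.h, and the function name here is made up for illustration):

// Minimal sketch (hypothetical name copysign_wrapper); needs
// #include <c10/util/Exception.h> for TORCH_INTERNAL_ASSERT.
__host__ __device__ inline float copysign_wrapper(float x, float y) {
#if defined(__CUDA_ARCH__)
  // Device compilation: use the CUDA builtin rather than std::copysign.
  return ::copysignf(x, y);
#else
  // Host compilation: this branch exists only to satisfy the
  // __host__ __device__ requirement and is never called, so assert
  // instead of touching std::copysign (which ICEs gcc 7.5/8 on arm64).
  TORCH_INTERNAL_ASSERT(false, "copysign_wrapper must not be called on the host");
  return x;
#endif
}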

@t-vi
Collaborator Author

t-vi commented Feb 6, 2021

@mruberry are you terribly attached to the copysign?

@facebook-github-bot
Contributor

facebook-github-bot commented Feb 6, 2021

💊 CI failures summary and remediations

As of commit 804c0c6 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-scanned failure(s)

ci.pytorch.org: 1 failed



@t-vi
Collaborator Author

t-vi commented Feb 6, 2021

Turns out there are more of these in the CUDA bits.

@t-vi t-vi changed the title avoid std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 WIP: avoid std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 Feb 6, 2021
@t-vi
Collaborator Author

t-vi commented Feb 6, 2021

So I found that the CUDA copysign also causes ICEs; it might be related to https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=315fdae8f965045f86e966953f3c010a61072729 , but I'm still investigating.

@t-vi t-vi changed the title WIP: avoid std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 avoid std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 Feb 6, 2021
@t-vi
Collaborator Author

t-vi commented Feb 6, 2021

Seems that MSVC doesn't like the ifdefs...

@t-vi
Collaborator Author

t-vi commented Feb 7, 2021

So it turns out that signbit(x) ? -abs(x) : abs(x) isn't working well with negative NaNs; I'll try bit operations next.
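
For reference, a sketch of the bit-level idea for the float case (the double case would use uint64_t; the name and helper here are made up for illustration):

#include <cstdint>
#include <cstring>

// Combine the magnitude bits of a with the sign bit of b. Unlike
// signbit(b) ? -std::abs(a) : std::abs(a), this also transfers the sign
// onto NaN values, matching what numpy's copysign does.
inline float copysign_bits(float a, float b) {
  std::uint32_t ua, ub;
  std::memcpy(&ua, &a, sizeof(ua));
  std::memcpy(&ub, &b, sizeof(ub));
  const std::uint32_t ur = (ua & 0x7fffffffu) | (ub & 0x80000000u);
  float r;
  std::memcpy(&r, &ur, sizeof(r));
  return r;
}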

@t-vi t-vi changed the title avoid std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 WIP: avoid std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 Feb 7, 2021
@mruberry mruberry added the triaged label (this issue has been looked at by a team member and prioritized into an appropriate module) Feb 8, 2021
@mruberry mruberry requested a review from ngimel February 8, 2021 07:36
@mruberry
Collaborator

mruberry commented Feb 8, 2021

Hey @t-vi, sorry to hear the PR caused an issue, and thank you for working on a fix.

@mruberry are you terribly attached to the copysign?

Not if it's preventing PyTorch from building on a supported platform.

We just cut the 1.8 release branch. cc @gchanan to consider this for cherrypicking.

At first look the fix seems reasonable to me; I also added @ngimel and @peterbell10 to take a look. Ping when it's ready for review.

Why do other uses of std::copysign not affect this build, by the way? And what do you think is the best way to avoid these issues in the future?

@ptrblck
Collaborator

ptrblck commented Feb 8, 2021

Thanks for working on this, @t-vi! :)

CC @dusty-nv, @shmsong, @csarofeen as this PR seems to be needed for the Jetson wheels for the 1.8 release.

@mruberry mruberry requested a review from ptrblck February 8, 2021 08:13
@t-vi
Collaborator Author

t-vi commented Feb 8, 2021

Why do other uses of std::copysign not affect this build, by the way? And what do you think is the best way to avoid these issues in the future?

  • So I think the circumstances in which the gcc bug is triggered are rather special in terms of which registers are involved, etc. I must admit I didn't try hard to avoid lambdas or the like to see if that works. It seems that (unsurprisingly) arm64 isn't that well tested on compilers from 2018 (gcc 7, 8.3). Given the growth of arm64, I would expect this type of bug to be much rarer in 2020 compilers.
  • The obvious remedy is to compile with a newer gcc, but this depends on Jetpack/CUDA. In particular, Jetpack seems to ship Ubuntu 18.04 with CUDA 10.2. In this release gcc-7 and gcc-8 both have the bug, and CUDA declares gcc-8 to be the highest supported version. I initially tried to go that route and compiled with a gcc-7 rebuilt from Ubuntu 20.04 sources, but that still has the bug, and I didn't try building a newer gcc-8. Obviously, if the compiler issue linked above is the one causing it, a targeted fix in Jetpack could also help.
  • I reverted the title to WIP because it turned out that copysign tests fail due to failures involving the sign of NaN (there are positive and negative NaNs, and numpy distinguishes between them for copysign purposes). I now have a fix that does bit manipulation after reinterpreting as uint64_t/uint32_t, but my process of compiling on the Jetson and then moving things over isn't terribly fast, so I need a few hours before I submit it.
  • Incidentally, there are some other test failures in binary ops, too, which seem to be unrelated.
  • There are not-yet-upstreamed patches for the thread layout of kernels, so I'm not sure how likely people are to compile unpatched PyTorch from source.

The other part is that Raspberry Pi OS also has gcc 8.3 as the default; I don't know whether it has the fix or not, but I can upgrade my packages there, too, to check. PyTorch did build there in January, but I didn't systematically run tests.

@mruberry mruberry requested a review from malfet February 8, 2021 08:27
@mruberry
Collaborator

mruberry commented Feb 8, 2021

Thanks for elaborating.

  • Incidentally, there are some other test failures in binary ops, too, which seem to be unrelated.

Would you elaborate on this? You mean there are other tests failing on Jetson?

@t-vi
Collaborator Author

t-vi commented Feb 8, 2021

There are a number of tests that just require newer numpy, so arguably this isn't a problem (AttributeError: module 'numpy.random' has no attribute 'default_rng' is by far the most common). I have no idea whether adapting to older numpy is worthwhile.

In the binary ufuncs test script, I'm seeing these (but this is on my branch, so I might have screwed up something):

FAIL: test_cpow_cpu_float32 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 22 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 3.881516218185425 (6.514925956726074 vs. 2.6334097385406494), which occurred at index (1, 0).
FAIL: test_cremainder_cpu_float32 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 29 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.9989410638809204 (0.36731672286987305 vs. 1.3662577867507935), which occurred at index (2, 9).
FAIL: test_div_rounding_modes_cpu_bfloat16 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0.016 and atol=1e-05, found 28 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 8.53125 (0.46875 vs. 9.0), which occurred at index 57.
FAIL: test_div_rounding_modes_cpu_float16 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0.001 and atol=1e-05, found 25 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 7.453125 (-1.265625 vs. -8.71875), which occurred at index 13.
FAIL: test_div_rounding_modes_cpu_float32 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 25 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 6.30060338973999 (-0.44316625595092773 vs. -6.743769645690918), which occurred at index 95.
FAIL: test_fmod_remainder_cpu_float32 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 6 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.5688362121582031 (1.573869228363037 vs. 0.005033016204833984), which occurred at index (5, 7).
FAIL: test_fmod_remainder_cpu_int16 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 9 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 5.577301025390625 (-6.0 vs. -0.422698974609375), which occurred at index (8, 9).
FAIL: test_fmod_remainder_cpu_int32 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 9 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 5.577301025390625 (-6.0 vs. -0.422698974609375), which occurred at index (8, 9).
FAIL: test_fmod_remainder_cpu_int64 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 9 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 5.577301025390625 (-6.0 vs. -0.422698974609375), which occurred at index (8, 9).
FAIL: test_fmod_remainder_cpu_int8 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 9 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 5.577301025390625 (-6.0 vs. -0.422698974609375), which occurred at index (8, 9).
FAIL: test_fmod_remainder_cpu_uint8 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 6 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.9990959167480469 (2.0 vs. 0.000904083251953125), which occurred at index (4, 7).
FAIL: test_hypot_cpu_float32 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Scalars failed to compare as equal! Comparing 0.4336802363395691 and 0.28148531913757324 gives a difference of 0.15219491720199585, but the allowed difference with rtol=1.3e-06 and atol=1e-05 is only 1.0365930914878845e-05!
FAIL: test_logaddexp2_cpu_float32 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Scalars failed to compare as equal! Comparing -0.019721556454896927 and nan gives a difference of nan, but the allowed difference with rtol=1.3e-06 and atol=1e-05 is only nan!
FAIL: test_logaddexp_cpu_float32 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Scalars failed to compare as equal! Comparing -0.32649093866348267 and -1.0432298183441162 gives a difference of 0.7167388796806335, but the allowed difference with rtol=1.3e-06 and atol=1e-05 is only 1.1356198763847352e-05!
FAIL: test_nextafter_cpu_float32 (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Scalars failed to compare as equal! Comparing 0.3206480145454407 and 0.3206479549407959 gives a difference of 5.960464477539063e-08, but the allowed difference with rtol=0 and atol=0 is only 0.0!
FAIL: test_pow_cpu (__main__.TestBinaryUfuncsCPU)
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 42 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 2.9337480068206787 (4.66788387298584 vs. 1.7341358661651611), which occurred at index 78.

@t-vi t-vi closed this Feb 8, 2021
@t-vi t-vi reopened this Feb 8, 2021
#if (!defined(__aarch64__)) || defined(__clang__) || \
(__GNUC__ > 8 || (__GNUC__ == 8 && __GNUC_MINOR__ > 3))
// std::copysign gets ICE/Segfaults with gcc 7.5/8 on arm64
// (e.g. Jetson), see PyTorch PR #51834
return std::copysign(a, b);
Collaborator

Seemed weird to me that copysign was already being used without issue. Looking at the blame, this template comes from #47413, which says it "avoids internal compiler error exceptions on aarch64 platforms".

So I'm guessing this doesn't need to be changed, and instead it could be moved somewhere to share it with the CUDA code. Worth a shot at least?

Collaborator Author

Well, the ICE is only triggered in specific circumstances, so the comment here is probably misleading.
(TBH I wouldn't expect gcc to ship completely broken builtins, and it seems much more likely that there are corner cases.)
I changed this template because I noticed that tests didn't pass on arm64, and after some wild editing it seems better now.
I do think we could take the code I currently put in CUDAMathCompat.h and put it somewhere shared, but I don't really know where (and I'm only doing this off-and-on, and iterating on things involving headers has a rather long turnaround time when working on an embedded device without cross-compiling).

Collaborator

The linked PR suggests it's related to calling std::copysign with Half and BFloat16 types. If that's true, then there's no need to avoid std::copysign completely.
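
Roughly, a dedicated copysign helper along those lines could look like this (a sketch only; the actual templates live in PyTorch's headers under the c10 namespace and may differ in detail):

#include <cmath>
#include <c10/util/Half.h>
#include <c10/util/BFloat16.h>

namespace c10 {
// Generic case: forward to std::copysign for the regular floating types.
template <typename T>
inline T copysign(T a, T b) {
  return std::copysign(a, b);
}
// Half and BFloat16 have no std::copysign overload; compute in float.
inline c10::Half copysign(c10::Half a, c10::Half b) {
  return c10::Half(std::copysign(static_cast<float>(a), static_cast<float>(b)));
}
inline c10::BFloat16 copysign(c10::BFloat16 a, c10::BFloat16 b) {
  return c10::BFloat16(std::copysign(static_cast<float>(a), static_cast<float>(b)));
}
} // namespace c10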

Contributor

Yeah, moving the copysign templates solves the issue in the .cpp file for both gcc-7 and gcc-8; let me do that in a separate PR (as well as file a separate issue/gcc bug for it).

Collaborator Author

Ha, that will be much more elegant.

Contributor

@malfet malfet Feb 8, 2021

Here is the PR that fixes it on CPU (for both gcc-7 and 8): #51900
I wonder if making the lambdas GPU-only would solve the problem for CUDA.

Collaborator Author

So I don't know how making the lambdas GPU-only would work; GPU_LAMBDA makes them __host__ __device__ as far as I can tell.

Collaborator

Yep, GPU_LAMBDA is __host__ __device__, and there was a valid reason for that (to work around other compiler bugs, IIRC) which may or may not remain valid. Replacing the lambdas with functors that have __device__-only operator() functions may work.
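
A rough sketch of that functor idea (hypothetical; as noted further down in the thread, TensorIterator's gpu_kernel currently expects a __host__ __device__ callable, so this may not work as-is):

// A functor whose operator() is __device__-only, so the host compiler
// never has to compile a std::copysign body and cannot hit the arm64 ICE.
// Assumes a float/double scalar_t; half types would need casting.
template <typename scalar_t>
struct CopysignFunctor {
  __device__ scalar_t operator()(scalar_t a, scalar_t b) const {
    return ::copysign(a, b);
  }
};

// Usage would replace the GPU_LAMBDA, e.g.:
//   gpu_kernel_with_scalars(iter, CopysignFunctor<scalar_t>());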

Contributor

I think many of the reasons for __host__ __device__ are specific to the Windows compiler. I wonder whether a repo-wide effort to replace __host__ __device__ with __device__ only on Linux would speed up compilation and fix this sort of problem.

@mruberry mruberry added the module: jetson label (Related to the Jetson builds by NVIDIA) Feb 8, 2021
@t-vi
Collaborator Author

t-vi commented Feb 8, 2021

@malfet Yes, it also happens with 18.04's gcc-8 and with gcc-7 rebuilt from 20.04 on 18.04. So I'm relatively sure that if we want to support compilation on these platforms, we need the BinaryOpsKernel.cpp patch (or something similar) and for CUDA on these platforms we need the other two bits.

I'm sure there are better ways to implement it, but I tried with less and it didn't work...

@Lissanro

Lissanro commented Mar 3, 2021

I tried to use this patch to compile the latest PyTorch on a Jetson Nano with GCC 8, but it did not work at first. I had to replace the following in four places:

(__GNUC__ > 8 || (__GNUC__ == 8 && __GNUC_MINOR__ > 3))

With this:

(__GNUC__ > 8)

To make it work. I guess somebody thought the bug would be fixed in GCC versions above 8.3, but even with 8.4.0 it still crashes. Here is an updated patch that can be applied to current PyTorch: http://Dragon.Studio/2021/03/51834.diff
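
With that change, the relaxed guard reads (sketch, mirroring the snippet quoted earlier in the review):

#if (!defined(__aarch64__)) || defined(__clang__) || (__GNUC__ > 8)
// std::copysign still ICEs gcc 8.4 on arm64, so only gcc > 8 takes this path
return std::copysign(a, b);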

@t-vi
Collaborator Author

t-vi commented Mar 3, 2021

Thank you! Yes, I took the "known bad" approach here: I knew gcc 8.3 was bad, rather than knowing that anything > 8.3 would be good. Thank you for reporting back.

Did anyone check whether removing __host__ from the lambdas works on Linux?
If it works, I'd consider that the best approach and give it a try.

@t-vi
Collaborator Author

t-vi commented Mar 3, 2021

So to update this: GPU_LAMBDA needs to be __host__ __device__ in order for TensorIterator to do its magic.

@t-vi
Collaborator Author

t-vi commented Apr 6, 2021

@shmsong I wonder if this could be a more straightforward alternative to your PR #53790 (where to me the typing is very hard to understand).

@t-vi t-vi changed the title WIP: avoid std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 avoid CPU std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 for CUDA Apr 7, 2021
@t-vi
Collaborator Author

t-vi commented Apr 7, 2021

@malfet @ngimel @shmsong I think this is ready for review now. The ROCm failure seems unrelated to this PR.

@facebook-github-bot
Contributor

@malfet has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@malfet
Contributor

malfet commented Apr 7, 2021

@t-vi thank you very much for the targeted fix!

@facebook-github-bot
Contributor

@malfet merged this pull request in 3bb1f59.

@facebook-github-bot
Contributor

This pull request has been reverted by b39eeb0.

@ngimel
Collaborator

ngimel commented Apr 8, 2021

Sorry, had to revert this; internal builds are failing with

caffe2/c10/cuda/CUDAMathCompat.h(61): error: identifier "TORCH_INTERNAL_ASSERT" is undefined

@t-vi
Collaborator Author

t-vi commented Apr 8, 2021

Oh, sorry. Is it OK to just #include <c10/util/Exception.h> or does it need more elaborate treatment?

malfet pushed a commit to malfet/pytorch that referenced this pull request Apr 8, 2021
… / 8 for CUDA (pytorch#51834)

Summary:
It seems that the std::copysign code introduced in pytorch#51706 is too much for gcc 7.5 / 8 when compiling on arm64 (e.g. on a Jetson with the latest Jetpack) and causes the compiler to produce an internal compiler error with a segfault during compilation. This avoids the compiler bug by not using std::copysign.

A very kind person sent a Jetson Xavier NX 🎁 thank you ❤️.

After pytorch#51900 fixed this for CPU-only arm64 (e.g. Raspberry Pi), this fixes it for CUDA-using arm64 (e.g. Jetson). CUDA device lambdas must also be present as host functions for technical reasons, but the host versions are never called, so we just assert in the CPU variant instead of actually doing the operation.

Pull Request resolved: pytorch#51834

Reviewed By: mrshenli

Differential Revision: D27622277

Pulled By: malfet

fbshipit-source-id: a1dc4c3a67f925019782e24b796919e17339749f
@malfet
Contributor

malfet commented Apr 8, 2021

@t-vi re-landing in #55608

@t-vi
Collaborator Author

t-vi commented Apr 8, 2021

Thank you, @malfet!

facebook-github-bot pushed a commit that referenced this pull request Apr 8, 2021
…5608)

Summary:
Re-land of #51834

Pull Request resolved: #55608

Reviewed By: ngimel

Differential Revision: D27649077

Pulled By: malfet

fbshipit-source-id: 1a21611fb12106f75fe50e8f9f14796ab6ab9464
Labels
cla signed, Merged, module: arm (Related to ARM architecture builds of PyTorch, including Apple M1), module: jetson (Related to the Jetson builds by NVIDIA), open source, Reverted, triaged (looked at by a team member and prioritized into an appropriate module)

9 participants