avoid CPU std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 for CUDA #51834
Conversation
@mruberry are you terribly attached to the copysign?
💊 CI failures summary as of commit 804c0c6: ci.pytorch.org: 1 failed.
Turns out there are more of these in the cuda bits.

So I found that also the cuda copysign causes ICEs. It might be related to https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=315fdae8f965045f86e966953f3c010a61072729, but I'm still investigating.
Seems that msvc doesn't like the ifdefs...
So it turns out that
Hey @t-vi, sorry to hear the PR caused an issue, and thank you for working on a fix.
Not if it's preventing PyTorch from building on a supported platform. We just cut the 1.8 release branch. cc @gchanan to consider this for cherrypicking. A first look at the fix seems reasonable to me; I also added @ngimel and @peterbell10 to take a look. Ping when it's ready for review. Why do other uses of std::copysign not affect this build, by the way? And what do you think is the best way to avoid these issues in the future?

Thanks for working on this, @t-vi! :) CC @dusty-nv, @shmsong, @csarofeen as this PR seems to be needed for the Jetson wheels for the 1.8 release.
The other part is that Raspberry Pi OS also has gcc 8.3 as default. I don't know if it has the fix or not, but I can upgrade my packages there, too, to check. PyTorch did build in January, but I didn't systematically run tests.
Thanks for elaborating.
Would you elaborate on this? You mean there are other tests failing on jetson?
There are a number of tests that just require newer numpy, so arguably this isn't a problem. In the binary ufuncs test script, I'm seeing these (but this is on my branch, so I might have screwed up something):
```cpp
#if (!defined(__aarch64__)) || defined(__clang__) || \
    (__GNUC__ > 8 || (__GNUC__ == 8 && __GNUC_MINOR__ > 3))
  // std::copysign gets ICE/Segfaults with gcc 7.5/8 on arm64
  // (e.g. Jetson), see PyTorch PR #51834
  return std::copysign(a, b);
```
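For the branch where std::copysign cannot be used, a bit-manipulation fallback is one way to avoid the library call entirely. The sketch below is illustrative only (the helper name `copysign_bits` is made up, and this is not the code the PR actually uses): it copies the sign bit of one float onto the magnitude of another.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Illustrative fallback: combine the magnitude of a with the sign of b
// without calling std::copysign. Hypothetical helper, not PyTorch code.
inline float copysign_bits(float a, float b) {
  std::uint32_t ua, ub;
  std::memcpy(&ua, &a, sizeof(ua));  // type-pun safely via memcpy
  std::memcpy(&ub, &b, sizeof(ub));
  // keep the low 31 bits (magnitude) of a, take the top bit (sign) of b
  ua = (ua & 0x7FFFFFFFu) | (ub & 0x80000000u);
  std::memcpy(&a, &ua, sizeof(a));
  return a;
}
```

This handles signed zeros correctly, which a naive `b < 0 ? -std::abs(a) : std::abs(a)` would not.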
Seemed weird to me that copysign was already being used without issue. Looking at the blame, this template comes from #47413 which says it: "avoids internal compiler error exceptions on aarch64 platforms".
So, I'm guessing this doesn't need to be changed, and instead it could be moved somewhere to share it with the CUDA code. Worth a shot at least?
Well, so the ICE would only be triggered in specific circumstances, and so the comment here is probably misleading.
(TBH I wouldn't expect gcc to include completely broken builtins and it seems much more likely that there are corner cases.)
I changed this template because I noticed that tests didn't pass on arm64 and after wild editing it seems better now.
I do think we could take the code that I currently put in CUDAMathCompat.h and put it somewhere shared, but I don't really know where (and I'm only doing this off-and-on and iterating things involving headers has a rather long turnaround time when working on embedded without crosscompiling).
The linked PR suggests it's related to calling std::copysign with half and BFloat16 types. If that's true, then there's no need to avoid std::copysign completely.
Yeah, moving the copysign templates solves the issue in the .cpp file for both gcc-7 and gcc-8, let me do it in a separate PR. (As well as file a separate issue/gcc bug for that.)
Ha, that will be much more elegant.
Here is the PR that fixes it on CPU (for both gcc-7 and 8): #51900
I wonder if making lambdas GPU only would solve the problem for CUDA
So I don't know how making lambdas GPU-only would work; GPU_LAMBDA makes it `__host__ __device__` as far as I can tell.
Yep, GPU_LAMBDA is `__host__ __device__`, and there was a valid reason for that (to work around other compiler bugs, iirc), which may or may not remain valid. Replacing lambdas with functors with `__device__`-only `operator()` functions may work.
I think lots of the reasons for `__host__ __device__` are specific to the Windows compiler. I wonder if a repo-wide effort to replace `__host__ __device__` with `__device__` only on Linux would speed up compilation and fix these sorts of problems.
@malfet Yes, it also happens with 18.04's gcc-8 and with gcc-7 rebuilt from 20.04 on 18.04. So I'm relatively sure that if we want to support compilation on these platforms, we need the BinaryOpsKernel.cpp patch (or something similar), and for CUDA on these platforms we need the other two bits. I'm sure there are better ways to implement it, but I tried with less and it didn't work...
I tried to use this patch to compile the latest PyTorch on a Jetson Nano with GCC 8, but it did not work at first. I had to replace the following in 4 places:

With this:

to make it work. I guess somebody thought the bug would be fixed in GCC versions higher than 8.3, but even with 8.4.0 it still crashes. Here is an updated patch which can be applied to current PyTorch: http://Dragon.Studio/2021/03/51834.diff
Thank you! Yes, I took the "known bad" approach here: I knew gcc 8.3 was bad, more than I knew that versions above 8.3 would be sure to be good. Thank you for reporting back. Did anyone work on checking whether removing the host from the lambdas works on Linux?
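Since gcc 8.4.0 reportedly still crashes, one plausible shape for the guard is to treat every gcc 7.x/8.x on aarch64 as affected rather than pinning an exact fixed minor version. The macro name below is made up, and this is only a guess at the idea, not the actual replacement from the linked patch:

```cpp
#include <cassert>

// Illustrative guard: require gcc 9+ on aarch64 instead of trusting any
// particular 8.x minor release. Hypothetical macro name, not PyTorch code.
#if !defined(__aarch64__) || defined(__clang__) || \
    (defined(__GNUC__) && __GNUC__ > 8)
#define HAS_WORKING_STD_COPYSIGN 1
#else
#define HAS_WORKING_STD_COPYSIGN 0
#endif
```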
So to update this: GPU_LAMBDA needs to be
@malfet has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@t-vi thank you very much for the targeted fix!

This pull request has been reverted by b39eeb0.
Sorry, had to revert this, internal builds are failing with
Oh, sorry. Is it OK to just
… / 8 for CUDA (pytorch#51834)

Summary: It seems that the std::copysign code introduced in pytorch#51706 is too much for gcc 7.5 / 8 when compiled on arm64 (e.g. on Jetson with the latest Jetpack) and causes it to produce an internal compiler error with segfault during compilation. This avoids the compiler bug by not using std::copysign.

A very kind person sent a Jetson Xavier NX 🎁 thank you ❤️.

After pytorch#51900 fixed this for CPU-only arm64 (e.g. Raspberry Pi), this fixes it for CUDA-using arm64 (e.g. Jetson). CUDA device lambdas must also be present as host functions for technical reasons but they are never used, so we just assert in the CPU variant instead of actually doing the operation.

Pull Request resolved: pytorch#51834
Reviewed By: mrshenli
Differential Revision: D27622277
Pulled By: malfet
fbshipit-source-id: a1dc4c3a67f925019782e24b796919e17339749f
Thank you, @malfet!
It seems that the std::copysign code introduced in #51706 is too much for gcc 7.5 / 8 when compiled on arm64 (e.g. on Jetson with the latest Jetpack) and causes it to produce an internal compiler error with segfault during compilation. This avoids the compiler bug by not using std::copysign.
A very kind person sent a Jetson Xavier NX 🎁 thank you ❤️.
After #51900 fixed this for CPU-only arm64 (e.g. Raspberry Pi), this fixes it for CUDA-using arm64 (e.g. Jetson). CUDA device lambdas must also be present as host functions for technical reasons but they are never used, so we just assert in the CPU variant instead of actually doing the operation.
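The "assert in the CPU variant" pattern described above might look roughly like this sketch. The CUDA qualifiers are stubbed out so the illustration builds as ordinary C++; the function name and shape are hypothetical, not the PR's actual code.

```cpp
#include <cassert>
#include <cmath>

// The lambda body must exist for host compilation, but only the device path
// is ever executed, so the host branch simply asserts. Qualifiers stubbed
// for host-side compilation of this example; names are illustrative.
#if !defined(__CUDACC__)
#define __host__
#define __device__
#endif

__host__ __device__ inline float copysign_or_trap(float a, float b) {
#if defined(__CUDA_ARCH__)
  return ::copysignf(a, b);  // real work happens only in device code
#else
  assert(false && "host variant of a device-only operation was called");
  return 0.0f;  // never reached in practice
#endif
}
```

The host instantiation exists only to satisfy the compiler when the lambda is marked `__host__ __device__`; asserting makes any accidental host-side call loud instead of silently wrong.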