Build with CUDA support fails with GCC >= 10.3 #48890

Closed
pwuertz opened this issue May 3, 2021 · 17 comments
Labels

  • stale: This label marks the issue/pr stale - to be closed automatically if no activity
  • stat:awaiting response: Status - Awaiting response from author
  • stat:awaiting tensorflower: Status - Awaiting response from tensorflower
  • subtype: ubuntu/linux: Ubuntu/Linux Build/Installation Issues
  • type:build/install: Build and install issues

Comments

@pwuertz

pwuertz commented May 3, 2021

System information

  • OS Platform and Distribution: Ubuntu Linux 21.04
  • TensorFlow installed from (source or binary): source
  • TensorFlow version: v2.5.0-rc2
  • Python version: 3.9
  • Bazel version (if compiling from source): 3.7.2
  • GCC/Compiler version (if compiling from source): 10.3
  • CUDA/cuDNN version: 11.2 / 8.2

Describe the problem

Building TensorFlow with CUDA support using GCC 10.3 fails with the following error:

/usr/include/c++/10/chrono:428:27: internal compiler error: Segmentation fault
  428 |  _S_gcd(intmax_t __m, intmax_t __n) noexcept
      |                           ^~~~~~
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-10/README.Bugs> for instructions.
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.

Apparently, this is a regression starting with GCC 10.3 (the default compiler on Ubuntu 21.04) that appears when GCC is used in conjunction with nvcc. Here is the upstream bug report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100102

Installing gcc-9 and selecting it as the NVCC host compiler in configure still works.

@pwuertz pwuertz added the type:build/install Build and install issues label May 3, 2021
@bhack
Contributor

bhack commented May 3, 2021

NVIDIA/nccl#494

@Saduf2019
Contributor

@pwuertz
Could you downgrade to CUDA 11.0 and the stable, tested TensorFlow version 2.4.1, and let us know if you face any issues?

@Saduf2019 Saduf2019 added stat:awaiting response Status - Awaiting response from author subtype: ubuntu/linux Ubuntu/Linux Build/Installation Issues labels May 4, 2021
@pwuertz
Author

pwuertz commented May 4, 2021

> Could you downgrade to CUDA 11.0 and the stable, tested TensorFlow version 2.4.1, and let us know if you face any issues?

Not really, sorry. I'm building everything at a non-isolated system level and require Python 3.9 and CUDA 11.2 for other reasons. And as said before, building the configuration above with GCC 9 as the NVCC host compiler works fine. I just wanted to make anyone searching for this error aware of it and point to the upstream bug report at GCC.

The problem seems to be pretty much identified at this point, so I suppose we'll have to see and wait for an upstream fix...

@bhack
Contributor

bhack commented May 4, 2021

@pwuertz In the meantime can you build in our docker devel container tensorflow/tensorflow:devel or does it not fit your use case?

@pwuertz
Author

pwuertz commented May 4, 2021

@bhack I'm fine with my GCC-9 based build, thanks.

@Saduf2019 Saduf2019 removed the stat:awaiting response Status - Awaiting response from author label May 4, 2021
@Saduf2019 Saduf2019 assigned ymodak and unassigned Saduf2019 May 4, 2021
@ymodak ymodak added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label May 4, 2021
@ymodak ymodak assigned sanjoy and unassigned ymodak May 4, 2021
@Medoalmasry

I know this isn't strictly how bugs should be fixed. However, if you're desperate and under a deadline like me, comment out this block of code in the header named in the error message (/usr/include/c++/10/chrono):

        /*
        static constexpr intmax_t
        _S_gcd(intmax_t __m, intmax_t __n) noexcept
        {
          // Duration only allows positive periods so we don't need to
          // support negative values here (unlike __static_gcd and std::gcd).
          return (__m == 0) ? __n : (__n == 0) ? __m : _S_gcd(__n, __m % __n);
        }
        */

The segmentation fault then disappears.
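
For readers unsure what the commented-out member computes, here is a minimal freestanding sketch of the same recursive Euclidean algorithm. The name `gcd_sketch` is hypothetical and this is not the libstdc++ source; it only illustrates the calculation and shows that it can be evaluated at compile time.

```cpp
#include <cstdint>

// Hypothetical freestanding helper, NOT the libstdc++ member itself.
// Same recursive Euclidean algorithm as the quoted _S_gcd; like the
// original, it is only meant for non-negative inputs.
constexpr std::intmax_t gcd_sketch(std::intmax_t m, std::intmax_t n) noexcept
{
  return (m == 0) ? n : (n == 0) ? m : gcd_sketch(n, m % n);
}

// Checked at compile time, as with any constexpr function.
static_assert(gcd_sketch(1000, 60) == 20, "gcd(1000, 60) should be 20");
static_assert(gcd_sketch(0, 7) == 7, "a zero argument yields the other one");
static_assert(gcd_sketch(7, 0) == 7, "a zero argument yields the other one");
```

Keep in mind that editing a system header changes every build on the machine that includes it, so the edit is best reverted once a fixed compiler is available.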

archlinux-github pushed a commit to archlinux/svntogit-community that referenced this issue May 18, 2021
See also tensorflow/tensorflow#48890 and
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100102

git-svn-id: file:///srv/repos/svn-community/svn@936348 9fca08f4-af9d-4005-b8df-a31f2cc04f65
@sanjoy
Contributor

sanjoy commented May 20, 2021

@pwuertz, this is purely a GCC issue, right? Or are you suggesting that there is some workaround possible in the TF source code?

@pwuertz
Author

pwuertz commented May 20, 2021

@sanjoy Yes, probably a pure GCC issue. I have no suggestions for handling this on the TensorFlow end other than monitoring what's happening upstream. A warning emitted by the TensorFlow build for known-bad compiler versions would be nice, but I don't know how much work that is.
It could be worthwhile, though, since there is no telling when we'll get a fix in GCC or when that patch will be applied in linux-distribution-of-your-choice (if at all).
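
A warning for known-bad compiler versions, as suggested above, could be sketched in C++ roughly as follows. This is only an illustration under stated assumptions: `is_known_bad_gcc` is a hypothetical helper, nothing in this thread indicates that TensorFlow contains such a check, and the version list is a guess. GCC 10.3 comes from this report; GCC 11.1 is inferred from the upstream bug listing fixes for 10.4, 11.2 and 12, and has not been verified here.

```cpp
// Hypothetical helper: returns true for GCC releases assumed to be affected
// by GCC bug 100102 when used as the nvcc host compiler.
// Assumption: 10.3.x (from this report) and 11.1.x (inferred, unverified).
constexpr bool is_known_bad_gcc(int major, int minor) noexcept
{
  return (major == 10 && minor == 3) || (major == 11 && minor == 1);
}

static_assert(is_known_bad_gcc(10, 3), "10.3 is the release reported here");
static_assert(!is_known_bad_gcc(10, 4), "10.4 is listed upstream as fixed");
static_assert(!is_known_bad_gcc(9, 4), "GCC 9 worked as host compiler here");

// Emit a diagnostic only when nvcc drives a GCC host compiler.
// __CUDACC__ is defined by nvcc; __GNUC__ / __GNUC_MINOR__ describe the host GCC.
#if defined(__CUDACC__) && defined(__GNUC__) && !defined(__clang__)
#if (__GNUC__ == 10 && __GNUC_MINOR__ == 3) || (__GNUC__ == 11 && __GNUC_MINOR__ == 1)
#warning "This GCC release may crash as nvcc host compiler (GCC bug 100102); consider gcc-9."
#endif
#endif
```

A version check like this is only a heuristic, since a distribution may ship the fix without changing the version number, so a warning seems a safer choice than a hard error.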

@mertkiray

@pwuertz How do you change the NVCC host compiler to gcc-9? I installed gcc-9, but I think nvcc still uses GCC 10. Thanks.

@pwuertz
Author

pwuertz commented May 22, 2021

> @pwuertz How do you change the NVCC host compiler to gcc-9? I installed gcc-9, but I think nvcc still uses GCC 10. Thanks.

When you run ./configure for your TensorFlow build, one of the questions asked will be "Please specify which gcc should be used by nvcc as the host compiler" or similar. Enter /usr/bin/gcc-9 there and you should be good to go.

@edrozenberg

The GCC project has committed a patch:

https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=5357ab75dedef403b0eebf9277d61d1cbeb5898f
(in response to the problem report https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100102)

@zipy124

zipy124 commented Sep 3, 2021

This should be fixed for newer versions of GCC, as per the above bug report. "Fixed for GCC 10.4, 11.2 and 12."

@sushreebarsa sushreebarsa self-assigned this Oct 11, 2021
@sushreebarsa
Contributor

@pwuertz Could you please try installing from source using the latest version, TF 2.6.0, and let us know if it helps? Thank you!

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Oct 11, 2021
@google-ml-butler

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Oct 18, 2021
@google-ml-butler

Closing as stale. Please reopen if you'd like to work on this further.

@pwuertz
Author

pwuertz commented Nov 10, 2021

> @pwuertz Could you please try installing from source using the latest version, TF 2.6.0, and let us know if it helps? Thank you!

Got a successful build with the following environment:

  • TensorFlow 2.7.0
  • CUDA 11.3
  • GCC 10.3.0 as NVCC host compiler (GCC 11 not supported by CUDA 11.3)
  • Ubuntu 21.10
