Remove GCC_HOST_COMPILER_PREFIX as it may be out of sync with GCC_HOST_COMPILER_PATH #39263

Open
Flamefire opened this issue May 7, 2020 · 8 comments
Labels
stat:awaiting tensorflower Status - Awaiting response from tensorflower subtype: ubuntu/linux Ubuntu/Linux Build/Installation Issues TF 2.2 Issues related to TF 2.2 type:build/install Build and install issues

Comments

@Flamefire
Contributor

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): RHEL 7.5
  • TensorFlow installed from (source or binary): source
  • TensorFlow version: 2.2.0
  • Python version: 3.7.4
  • Bazel version (if compiling from source): 2.0.0
  • GCC/Compiler version (if compiling from source): 7.3.0

Describe the problem

The file https://github.com/tensorflow/tensorflow/blob/1588f45ee56860d247a1c26ea228cb3721b4bf1b/third_party/gpus/cuda_configure.bzl has a documented environment variable, GCC_HOST_COMPILER_PATH, to set the path (or name) of a GCC host compiler. However, it also uses an undocumented variable, GCC_HOST_COMPILER_PREFIX, which (judging by the name) holds the folder where the gcc binary resides and defaults to /usr/bin if it isn't set:

host_compiler_prefix = get_host_environ(repository_ctx, _GCC_HOST_COMPILER_PREFIX)
if not host_compiler_prefix:
    host_compiler_prefix = "/usr/bin"

I have 2 problems with that:

  • It is undocumented and hence hard to set right if you don't know for sure what it is
  • The default is wrong when the (documented!) GCC_HOST_COMPILER_PATH is used

This leads to issues such as

 external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -o bazel-out/k8-opt/bin/external/protobuf_archive/js_embed -Wl,-no-as-needed -pie -Wl,-z,relro,-z,now '-Wl,--build-id=md5' '-Wl,--hash-style=gnu' -no-canonical-prefixes -B/usr/bin -Wl,--gc-sections -Wl,@bazel-out/k8-opt/bin/external/protobuf_archive/js_embed-2.params)
/usr/bin/ld.gold: error: /software/software/GCCcore/7.3.0/lib/gcc/x86_64-pc-linux-gnu/7.3.0/crtbeginS.o: unsupported reloc 42 against global symbol _ITM_deregisterTMCloneTable
/usr/bin/ld.gold: error: /software/software/GCCcore/7.3.0/lib/gcc/x86_64-pc-linux-gnu/7.3.0/crtbeginS.o: unsupported reloc 42 against global symbol _ITM_registerTMCloneTable
/usr/bin/ld.gold: error: bazel-out/k8-opt/bin/external/protobuf_archive/_objs/js_embed/embed.o: unsupported reloc 42 against global symbol std::ios_base::Init::~Init()
/software/software/GCCcore/7.3.0/lib/gcc/x86_64-pc-linux-gnu/7.3.0/crtbeginS.o(.text+0x1a): error: unsupported reloc 42
/software/software/GCCcore/7.3.0/lib/gcc/x86_64-pc-linux-gnu/7.3.0/crtbeginS.o(.text+0x6b): error: unsupported reloc 42
bazel-out/k8-opt/bin/external/protobuf_archive/_objs/js_embed/embed.o:embed.cc:function _GLOBAL__sub_I_main: error: unsupported reloc 42
collect2: error: ld returned 1 exit status

As reported at easybuilders/easybuild-easyconfigs#7800 (comment)

I hence propose either to remove that variable completely in favor of deriving its value from GCC_HOST_COMPILER_PATH, or to document it properly and give it a better default.

It does not seem to be required at all so it's likely best to just remove it. This should have been done by #34218 but for some reason that merge was reverted with a very misleading commit title: f057199

@gunan @mihaimaruseac please take a look at what went wrong

@Flamefire Flamefire added the type:build/install Build and install issues label May 7, 2020
@amahendrakar amahendrakar added subtype: ubuntu/linux Ubuntu/Linux Build/Installation Issues TF 2.2 Issues related to TF 2.2 labels May 7, 2020
@amahendrakar amahendrakar assigned ymodak and unassigned amahendrakar May 7, 2020
@r4nt

r4nt commented May 7, 2020

Before that flag was introduced we hard-coded -B/usr/bin, which is arguably significantly worse :)
-B/usr/bin was needed because currently the toolchain is not good at figuring out where other binutils are when they are not in the same path as the compiler, which is a not uncommon setup for people building their own compilers.

@r4nt

r4nt commented May 7, 2020

The right solution is to actually discover all binutils via PATH, like other build systems would.
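As a rough illustration of what PATH-based discovery could look like, here is a minimal sketch in plain Python (not Starlark, and not code from cuda_configure.bzl). The function name `discover_binutils` and the default tool list are assumptions made up for this example; only `shutil.which` is a real standard-library call.

```python
import shutil


def discover_binutils(tools=("ld", "as", "ar"), search_path=None):
    """Map each tool name to the first executable found on the search path.

    Hypothetical helper for illustration only.
    search_path is a PATH-style string; None means "use $PATH".
    Tools that cannot be found map to None.
    """
    return {tool: shutil.which(tool, path=search_path) for tool in tools}
```

Calling `discover_binutils()` with no arguments searches the current PATH; a build system would then have to decide what to do about tools that come back as None.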

@Flamefire
Contributor Author

Flamefire commented May 7, 2020

From the comment:

 # TODO: when bazel stops adding '-B/usr/bin' by default, remove this
 #       flag from the CROSSTOOL completely (see
 #       https://github.com/bazelbuild/bazel/issues/563

And as explained in #34202:

As the underlying Bazel issue bazelbuild/bazel#5634 is resolved, this code can (and should) go now

So I don't see why setting this is still required. And I also disagree with "Before that flag was introduced": That flag was never documented as far as I can tell. So are people supposed to use/rely on it?

For another data point: in EasyBuild (an install automation tool) we have always patched that file to set this to empty, because adding /usr/bin leads to the problems mentioned above, and we haven't had any issues.

Additionally, the removal was already accepted and merged (twice!) but was then reverted again. I guess this was a mistake, as no reason is given anywhere and the commit message of f057199 is clearly wrong.

@r4nt

r4nt commented May 12, 2020

We're currently using this in our cross-compilation setup. We use a cross-compiler gcc for manylinux, but the host system's binutils. Removing this will get rolled back until somebody has time to hunt down regressions, root-cause them and then figure out how we'll need to adapt our setup. I think you're right that it can be removed, but it's a bit of work to change & test everything that relies on it.

Re: the commit message: what's wrong about it?

@r4nt

r4nt commented May 12, 2020

It also looks like there are still open issues in bazel around this:
bazelbuild/bazel#6834

@Flamefire
Contributor Author

Flamefire commented May 12, 2020

We're currently using this in our cross-compilation setup.

Can you clarify what exactly you mean by "this"?
In particular, did you try with #34218 applied (i.e. cuda_defines["%{linker_bin_path}"] = "")? I'd expect that not setting the path would make it pick the binutils from PATH.

OK, I just read the linked Bazel issue, and it seems PATH is discarded by Bazel.

Then the comment quoted above (when bazel stops adding '-B/usr/bin' by default, remove this) is clearly wrong. This results in my second alternative of the proposal: "properly documenting it with a better default."

So at the very least it should be documented because as the linked commit from bazelbuild/bazel#6834 states:

and [adding /usr/bin] was done to get Tensor Flow building on some (but failing on other?) versions of RedHat

So there are known failures with that added as well as without it. I'm wondering why it is now only done for the CUDA toolchain though and not for anything else.

Re: the commit message: what's wrong about it?

The commit message is "Merge pull request #34218 from Flamefire:fix_missing_linker_path", but the commit actually reverts that merge. Also interesting: the merge commit has a green CI tick while the revert commit has a failed CI tick 🤷

So, in summary:

  • Check if that linker_bin_path setting is still required and remove if not. Otherwise:
  • Document GCC_HOST_COMPILER_PREFIX like the other variables
  • Optional (but recommended): default to the path of the GCC_HOST_COMPILER, maybe checking whether binutils can be found there. IMO this is the right thing because having binutils separate is the exception rather than the norm.

Edit: Oh and maybe don't name it GCC_HOST_COMPILER_PREFIX if it is the HOST_BINUTILS_PATH
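The optional default from the summary could be sketched roughly as follows. This is plain Python rather than Starlark and is not the actual cuda_configure.bzl logic; the function name `default_binutils_dir`, its parameters, and the choice of `ld` as the tool to probe for are all assumptions for illustration.

```python
import os
import shutil


def default_binutils_dir(gcc_host_compiler_path, explicit_prefix=None,
                         fallback="/usr/bin", probe_tool="ld"):
    """Pick the directory to hand to the compiler via -B.

    Hypothetical helper for illustration only, not TensorFlow code.
    """
    # An explicitly set GCC_HOST_COMPILER_PREFIX always wins.
    if explicit_prefix:
        return explicit_prefix

    # GCC_HOST_COMPILER_PATH may be a bare name or a full path.
    # shutil.which handles both and returns None if nothing executable is found.
    compiler = shutil.which(gcc_host_compiler_path)
    if compiler:
        candidate = os.path.dirname(os.path.abspath(compiler))
        tool = os.path.join(candidate, probe_tool)
        # Only use the compiler's directory if binutils actually live there.
        if os.path.isfile(tool) and os.access(tool, os.X_OK):
            return candidate

    # Otherwise keep the current behaviour (assumed here to be /usr/bin).
    return fallback
```

With this shape, setups that keep binutils next to the compiler would get a matching linker without extra configuration, while setups that use the host system's binutils would still fall back to /usr/bin or could set the prefix explicitly.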

@kumariko

kumariko commented Aug 27, 2021

@Flamefire Could you please let us know if this issue still persists? If it is resolved, then please feel free to close this issue. Thanks!

@kumariko kumariko added the stat:awaiting response Status - Awaiting response from author label Aug 27, 2021
@Flamefire
Contributor Author

Yes, the comment from #39263 (comment) still applies as-is with the latest master. See the summary there.

@kumariko kumariko added stat:awaiting tensorflower Status - Awaiting response from tensorflower and removed stat:awaiting response Status - Awaiting response from author labels Aug 30, 2021
@kumariko kumariko removed their assignment Sep 2, 2021