
Source compilation fails on Musl system on multiple places #45446

Closed
PureTryOut opened this issue Dec 7, 2020 · 42 comments
Labels: stale, stat:awaiting response, stat:awaiting tensorflower, subtype:ubuntu/linux, TF 2.3, type:build/install

Comments

@PureTryOut
Contributor

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Alpine Linux edge
  • TensorFlow installed from (source or binary): source
  • TensorFlow version: 2.3.1
  • Python version: 3.8.6
  • Installed using virtualenv? pip? conda?: source
  • Bazel version (if compiling from source): 3.5.0
  • GCC/Compiler version (if compiling from source): 10.2.1_pre1

Describe the problem

I'm trying to compile TensorFlow from source on Alpine Linux, which is a musl-based system. However, it fails at several points:

ERROR: /builds/PureTryOut/aports/testing/tensorflow/src/tensorflow-2.3.1/tensorflow/core/platform/default/BUILD:75:11: Couldn't build file tensorflow/core/platform/default/_objs/env/0/env.pic.o: C++ compilation of rule '//tensorflow/core/platform/default:env' failed (Exit 1)
tensorflow/core/platform/default/env.cc: In member function 'virtual bool tensorflow::{anonymous}::PosixEnv::GetCurrentThreadName(std::string*)':
tensorflow/core/platform/default/env.cc:160:15: error: 'pthread_getname_np' was not declared in this scope; did you mean 'pthread_setname_np'?
  160 |     int res = pthread_getname_np(pthread_self(), buf, static_cast<size_t>(100));
      |               ^~~~~~~~~~~~~~~~~~
      |               pthread_setname_np
ERROR: /builds/PureTryOut/aports/testing/tensorflow/src/tensorflow-2.3.1/tensorflow/core/platform/s3/BUILD:44:11: Couldn't build file tensorflow/core/platform/s3/_objs/aws_crypto/aws_crypto.pic.o: C++ compilation of rule '//tensorflow/core/platform/s3:aws_crypto' failed (Exit 1)
tensorflow/core/platform/s3/aws_crypto.cc: In member function 'virtual Aws::Utils::Crypto::HashResult tensorflow::AWSSha256HMACOpenSSLImpl::Calculate(const ByteBuffer&, const ByteBuffer&)':
tensorflow/core/platform/s3/aws_crypto.cc:38:14: error: aggregate 'HMAC_CTX ctx' has incomplete type and cannot be defined
   38 |     HMAC_CTX ctx;
      |              ^~~
tensorflow/core/platform/s3/aws_crypto.cc:39:5: error: 'HMAC_CTX_init' was not declared in this scope; did you mean 'HMAC_CTX_new'?
   39 |     HMAC_CTX_init(&ctx);
      |     ^~~~~~~~~~~~~
      |     HMAC_CTX_new
tensorflow/core/platform/s3/aws_crypto.cc:45:5: error: 'HMAC_CTX_cleanup' was not declared in this scope
   45 |     HMAC_CTX_cleanup(&ctx);
      |     ^~~~~~~~~~~~~~~~

Provide the exact sequence of commands / steps that you executed before running into the problem
The build script can be found at https://gitlab.alpinelinux.org/alpine/aports/-/raw/4cf626b10d2f4700cc5e5e9e7536061137c8c6a1/testing/tensorflow/APKBUILD. Note that prepare() there is called before build(), and the environment variables set there carry over.
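For context on the first error: pthread_getname_np is a nonstandard extension that musl, unlike glibc, did not provide at the time. A minimal compile probe (illustrative /tmp paths, not part of the build) can confirm what the local libc declares:

```shell
# Illustrative probe: does this libc declare pthread_getname_np?
# On musl systems of this era the compile fails, matching the error above.
cat > /tmp/probe_getname.c <<'EOF'
#define _GNU_SOURCE
#include <pthread.h>
int main(void) {
    char buf[16];
    return pthread_getname_np(pthread_self(), buf, sizeof buf);
}
EOF
if cc -pthread /tmp/probe_getname.c -o /tmp/probe_getname 2>/dev/null; then
    echo "pthread_getname_np: declared"
else
    echo "pthread_getname_np: missing"
fi
```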

@PureTryOut PureTryOut added the type:build/install Build and install issues label Dec 7, 2020
@amahendrakar
Contributor

@PureTryOut,
Could you please install the pip package dependencies as mentioned in the guide and check if you are still facing the same issue? Thanks!

@amahendrakar amahendrakar added stat:awaiting response Status - Awaiting response from author subtype: ubuntu/linux Ubuntu/Linux Build/Installation Issues TF 2.3 Issues related to TF 2.3 labels Dec 8, 2020
@google-ml-butler

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Dec 15, 2020
@PureTryOut
Contributor Author

Well, I get different errors that way.

ERROR: /home/bart/.cache/bazel/_bazel_bart/f5cee94bd62e1be0d824a974250af519/external/llvm-project/llvm/BUILD:3766:11: C++ compilation of rule '@llvm-project//llvm:Support' failed (Exit 1)
In file included from external/llvm-project/llvm/lib/Support/Process.cpp:101:
external/llvm-project/llvm/lib/Support/Unix/Process.inc: In static member function 'static size_t llvm::sys::Process::GetMallocUsage()':
external/llvm-project/llvm/lib/Support/Unix/Process.inc:93:19: error: aggregate 'llvm::sys::Process::GetMallocUsage()::mallinfo mi' has incomplete type and cannot be defined
   93 |   struct mallinfo mi;
      |                   ^~
external/llvm-project/llvm/lib/Support/Unix/Process.inc:94:10: error: '::mallinfo' has not been declared
   94 |   mi = ::mallinfo();
      |          ^~~~~~~~
Target //tensorflow/tools/pip_package:build_pip_package failed to build

@google-ml-butler google-ml-butler bot removed the stale This label marks the issue/pr stale - to be closed automatically if no activity label Dec 15, 2020
@amahendrakar amahendrakar removed the stat:awaiting response Status - Awaiting response from author label Dec 15, 2020
@amahendrakar amahendrakar assigned ymodak and unassigned amahendrakar Dec 15, 2020
@ymodak ymodak assigned mihaimaruseac and unassigned ymodak Dec 18, 2020
@ymodak ymodak added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Dec 18, 2020
@mihaimaruseac
Collaborator

This looks like some C/C++/system dependencies are not installed/found.

It seems the issues come from LLVM. Are you able to compile LLVM on the system?

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Dec 29, 2020
@PureTryOut
Contributor Author

Well, right now it doesn't even want to compile that, as it keeps trying to exec python when it should be using python3 instead.

INFO: Analyzed target //tensorflow/tools/pip_package:build_pip_package (405 packages loaded, 29314 targets configured).
INFO: Found 1 target...
ERROR: /home/bart/Documents/Git/alpine/aports/testing/tensorflow/src/tensorflow-2.4.0/tensorflow/core/util/BUILD:370:24: error executing shell command: '/bin/bash -c bazel-out/host/bin/tensorflow/tools/git/gen_git_source --generate "$@" --git_tag_override=${GIT_TAG_OVERRIDE:-}  external/local_config_git/gen/spec.json external/local_config_git/gen/h...' failed (Exit 127): bash failed: error executing command /bin/bash -c 'bazel-out/host/bin/tensorflow/tools/git/gen_git_source --generate "$@" --git_tag_override=${GIT_TAG_OVERRIDE:-}' '' external/local_config_git/gen/spec.json ... (remaining 3 argument(s) skipped)
env: can't execute 'python': No such file or directory

I have PYTHON_BIN_PATH=/usr/bin/python3 set, but it seems that is only used by ./configure and not by Bazel?
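For what it's worth, ./configure normally records that variable in .tf_configure.bazelrc. A sketch of forcing it through to Bazel's build actions directly (assuming that file sits in the source root and is read by Bazel, which is how TF's configure script wires it up):

```shell
# Sketch: propagate the interpreter path into every Bazel action's
# environment via --action_env, instead of relying on ./configure alone.
export PYTHON_BIN_PATH=/usr/bin/python3
echo "build --action_env=PYTHON_BIN_PATH=${PYTHON_BIN_PATH}" >> .tf_configure.bazelrc
tail -n 1 .tf_configure.bazelrc
```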

@mihaimaruseac
Collaborator

That again is a Bazel bug.

@PureTryOut
Contributor Author

Found a workaround luckily.

So, installing from source with the instructions from https://www.tensorflow.org/install/source#install_python_and_the_tensorflow_package_dependencies, I now get a different error with TF 2.4.0, but that's fixed by #46138.

Now I'm back to the mallinfo error. LLVM is actually available on this distribution; is it not possible to use the system variant for TF?

@mihaimaruseac
Collaborator

Unfortunately, not really. TF does a lot of JIT, and since LLVM does not provide a truly stable ABI, using the system variant might result in broken code.

@PureTryOut
Contributor Author

PureTryOut commented Jan 6, 2021

Well with hacks I get further and further.

To work around the Python interpreter problem mentioned in #15618 (comment):

ln /usr/bin/python3 /usr/bin/python

It seems I got rid of the LLVM problem by adding the deps used by the distribution's LLVM packages.

However, I'm now back to one of the original problems:

ERROR: /home/bart/Documents/Git/alpine/aports/testing/tensorflow/src/tensorflow-2.4.0/tensorflow/core/platform/s3/BUILD:44:11: C++ compilation of rule '//tensorflow/core/platform/s3:aws_crypto' failed (Exit 1): gcc failed: error executing command /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 75 argument(s) skipped)
tensorflow/core/platform/s3/aws_crypto.cc: In member function 'virtual Aws::Utils::Crypto::HashResult tensorflow::AWSSha256HMACOpenSSLImpl::Calculate(const ByteBuffer&, const ByteBuffer&)':
tensorflow/core/platform/s3/aws_crypto.cc:38:14: error: aggregate 'HMAC_CTX ctx' has incomplete type and cannot be defined
   38 |     HMAC_CTX ctx;
      |              ^~~
tensorflow/core/platform/s3/aws_crypto.cc:39:5: error: 'HMAC_CTX_init' was not declared in this scope; did you mean 'HMAC_CTX_new'?
   39 |     HMAC_CTX_init(&ctx);
      |     ^~~~~~~~~~~~~
      |     HMAC_CTX_new
tensorflow/core/platform/s3/aws_crypto.cc:45:5: error: 'HMAC_CTX_cleanup' was not declared in this scope
   45 |     HMAC_CTX_cleanup(&ctx);
      |     ^~~~~~~~~~~~~~~~
ERROR: /home/bart/Documents/Git/alpine/aports/testing/tensorflow/src/tensorflow-2.4.0/tensorflow/tools/pip_package/BUILD:69:10 C++ compilation of rule '//tensorflow/core/platform/s3:aws_crypto' failed (Exit 1): gcc failed: error executing command /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 75 argument(s) skipped)

It seems those failing files are part of S3 support, which I actually tried to disable:

export TF_NEED_AWS=0
export TF_NEED_S3=0

It still tries to compile those files, however.

@mihaimaruseac
Collaborator

After setting those exports you have to run ./configure (or better, python configure.py). I think this might solve it.

@PureTryOut
Contributor Author

That is exactly what I'm doing and what I have always done, yes. Doesn't make a difference.

Please check my build script; there should be nothing wrong: https://gitlab.alpinelinux.org/PureTryOut/aports/-/raw/mycroft-precise/testing/tensorflow/APKBUILD

@PureTryOut
Contributor Author

PureTryOut commented Jan 7, 2021

I took a change from Gentoo; it seems echo "build --config=noaws" >> .bazelrc does the trick as well to stop building AWS support.
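The same trick extends to the other optional cloud-filesystem backends; a sketch, using config names defined in TensorFlow's own top-level .bazelrc:

```shell
# Sketch: disable the optional cloud-filesystem backends via Bazel config
# flags. noaws/nohdfs/nogcp are configs defined by TensorFlow's .bazelrc.
echo 'build --config=noaws'  >> .bazelrc
echo 'build --config=nohdfs' >> .bazelrc
echo 'build --config=nogcp'  >> .bazelrc
grep 'config=no' .bazelrc
```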

So I'm just stuck on the LLVM issue now. I really don't understand why: the failing code is guarded by the same conditional statement as the one used to include the required header file. See https://github.com/llvm/llvm-project/blob/main/llvm/lib/Support/Unix/Process.inc#L92 and https://github.com/llvm/llvm-project/blob/main/llvm/lib/Support/Unix/Process.inc#L34.

There really is no reason this should fail...

EDIT: So musl does not have support for mallinfo: https://www.openwall.com/lists/musl/2018/01/17/2

However, that shouldn't be a problem, as support for it is checked in https://github.com/llvm/llvm-project/blob/main/llvm/cmake/config-ix.cmake#L234, which should just return false. However, for some reason, in the case of TensorFlow this check reports that the system does support it?
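That CMake line is a try-compile style probe, and roughly the same check can be reproduced by hand (illustrative /tmp paths). Under Bazel no such probe runs; as noted later in the thread, the Bazel overlay supplies HAVE_MALLINFO statically, which would explain the mismatch:

```shell
# Hand-rolled equivalent of LLVM's HAVE_MALLINFO CMake probe: try to
# compile a call to mallinfo(). On musl this fails; on glibc it succeeds.
cat > /tmp/check_mallinfo.c <<'EOF'
#include <malloc.h>
int main(void) { struct mallinfo mi = mallinfo(); (void)mi; return 0; }
EOF
if cc /tmp/check_mallinfo.c -o /tmp/check_mallinfo 2>/dev/null; then
    echo "HAVE_MALLINFO=1"
else
    echo "HAVE_MALLINFO=0"
fi
```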

@mihaimaruseac
Collaborator

IREE uses LLVM and Bazel in (almost) the same way as TF. Can you try compiling that? This should give us an indication of whether the issue is in TF or in LLVM.

Regarding having to manually write to .bazelrc to disable AWS: I think that is a bug in the ./configure process.

@PureTryOut
Contributor Author

PureTryOut commented Jan 9, 2021

Following these instructions, https://google.github.io/iree/get-started/getting-started-linux-bazel, LLVM fails the same way, yes.

Interestingly enough, though, it also fails on pthread_getname_np there with LLVM, which it (so far) doesn't for TensorFlow's LLVM.

@PureTryOut
Contributor Author

Does Bazel/TensorFlow call CMake for LLVM differently somehow? The thing it's currently failing on is guarded properly by CMake, and it works for the distribution packages in Alpine Linux. So what is different in TensorFlow that makes it pass a condition it shouldn't?

@mihaimaruseac
Collaborator

mihaimaruseac commented Jan 11, 2021

Bazel doesn't use CMake to configure the build.

It seems the issue comes from the BUILD files that LLVM uses. A copy of them is at https://github.com/google/llvm-bazel/

@mihaimaruseac
Collaborator

mihaimaruseac commented Jan 11, 2021

Thanks @GMNGeoffrey for the additional context and the help.

@PureTryOut I think the new error comes from an invalid BUILD file edit. Do you have more lines of context around that error?

@PureTryOut
Contributor Author

Well, there is some output, but it doesn't seem related.

INFO: Reading rc options for 'build' from /home/bart/Documents/Git/alpine/aports/testing/tensorflow/src/tensorflow-2.4.0/.bazelrc:
  'build' options: --jobs=8 --compilation_mode=opt --host_compilation_mode=opt --repository_cache=/home/bart/Documents/Git/alpine/aports/testing/tensorflow/src/tensorflow-2.4.0/bazel-cache/ --distdir=/home/bart/Documents/Git/alpine/aports/testing/tensorflow/src/tensorflow-2.4.0/bazel-distdir/ --define=PREFIX=/home/bart/Documents/Git/alpine/aports/testing/tensorflow/pkg/tensorflow/usr --config=noaws --config=nohdfs --config=mkl
INFO: Found applicable config definition build:short_logs in file /home/bart/Documents/Git/alpine/aports/testing/tensorflow/src/tensorflow-2.4.0/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
INFO: Found applicable config definition build:v2 in file /home/bart/Documents/Git/alpine/aports/testing/tensorflow/src/tensorflow-2.4.0/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
INFO: Found applicable config definition build:xla in file /home/bart/Documents/Git/alpine/aports/testing/tensorflow/src/tensorflow-2.4.0/.bazelrc: --define=with_xla_support=true
INFO: Found applicable config definition build:noaws in file /home/bart/Documents/Git/alpine/aports/testing/tensorflow/src/tensorflow-2.4.0/.bazelrc: --define=no_aws_support=true
INFO: Found applicable config definition build:nohdfs in file /home/bart/Documents/Git/alpine/aports/testing/tensorflow/src/tensorflow-2.4.0/.bazelrc: --define=no_hdfs_support=true
INFO: Found applicable config definition build:mkl in file /home/bart/Documents/Git/alpine/aports/testing/tensorflow/src/tensorflow-2.4.0/.bazelrc: --define=build_with_mkl=true --define=enable_mkl=true --define=tensorflow_mkldnn_contraction_kernel=0 --define=build_with_openmp=true -c opt
INFO: Found applicable config definition build:linux in file /home/bart/Documents/Git/alpine/aports/testing/tensorflow/src/tensorflow-2.4.0/.bazelrc: --copt=-w --host_copt=-w --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++14 --host_cxxopt=-std=c++14 --config=dynamic_kernels
INFO: Found applicable config definition build:dynamic_kernels in file /home/bart/Documents/Git/alpine/aports/testing/tensorflow/src/tensorflow-2.4.0/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS

@mihaimaruseac
Collaborator

There should be some lines that print the path to a malformed BUILD target.

Alternatively, you can use git commands to see where ' ' gets inserted, in case this was from an accidental edit.

@PureTryOut
Contributor Author

It seems I found the cause of that particular issue. I did the following:

bazel build \
    //tensorflow:libtensorflow_framework.so \
    //tensorflow:libtensorflow.so \
    //tensorflow:libtensorflow_cc.so \
    //tensorflow/tools/pip_package:build_pip_package

However, it seems Bazel doesn't like the newlines and whitespace. I put it all on one line instead, and the error was gone. Annoying 🤷

New error:

ERROR: /home/bart/.cache/bazel/_bazel_bart/fa04af0b72c1747dcd0d716042f56ed5/external/com_github_grpc_grpc/src/compiler/BUILD:66:8: declared output 'external/com_github_grpc_grpc/src/compiler/grpc_python_plugin.bin' was not created by genrule. This is probably because the genrule actually didn't create this output, or because the output was a directory and the genrule was run remotely (note that only the contents of declared file outputs are copied from genrules run remotely)
ERROR: /home/bart/.cache/bazel/_bazel_bart/fa04af0b72c1747dcd0d716042f56ed5/external/com_github_grpc_grpc/src/compiler/BUILD:66:8: not all outputs were created or valid
ERROR: /home/bart/Documents/Git/alpine/aports/testing/tensorflow/src/tensorflow-2.4.0/tensorflow/python/tools/BUILD:143:10 not all outputs were created or valid

@mihaimaruseac
Collaborator

Hmm, now this is a gRPC issue.

They also use Bazel; can you file an issue at https://github.com/grpc/grpc, please?

@PureTryOut
Contributor Author

Sorry it took a while. I filed an issue: grpc/grpc#25188

@foopub

foopub commented May 14, 2021

It seems I got rid of the LLVM problem by adding deps as used in the LLVM packages from the distribution.

Can you please elaborate on that?
I managed to compile 2.5 with Bazel 3.7.2 and Python 3.9.5 by running:

bazel build --config=noaws --config=nogcp --config=nonccl --config=mkl //tensorflow/tools/pip_package:build_pip_package

The only problem I had with TF itself was that I had to disable stacktrace using the method from this patch:
https://github.com/ribalda/meta-tensorflow/blob/master/recipes-framework/tensorflow/files/0001-support-musl.patch

LLVM refused to compile for me even with all the build dependencies installed. I took a similar approach for mallinfo by adding a __GLIBC__ condition in
~/.cache/bazel/ ... /external/llvm-project/llvm/lib/Support/Unix/Process.inc

#elif defined(HAVE_MALLINFO) && defined(__GLIBC__)
  struct mallinfo mi;
  mi = ::mallinfo();

and had to remove backtrace in
~/.cache/bazel/ ... /external/llvm-project/llvm/include/llvm/Config/config.h.cmake

/* Define to 1 if you have the `backtrace' function. */
/* #define HAVE_BACKTRACE 0 */

/* #define BACKTRACE_HEADER <${BACKTRACE_HEADER}> */

You can actually keep ENABLE_BACKTRACE defined to 1... but just setting HAVE_BACKTRACE to 0 doesn't work, because the troublesome header uses an #ifdef... which seems like a mistake on LLVM's part.

There's probably a more elegant solution; it's a matter of musl lacking execinfo.h, discussed here:
https://www.openwall.com/lists/musl/2015/04/09/3
As mentioned there, a viable solution is finding an alternative backtrace library, if you care about that.

@sushreebarsa
Contributor

@PureTryOut Could you please let us know if this issue still persists? Thanks!

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Dec 28, 2021
@PureTryOut
Contributor Author

Oof, I haven't tried it in a while; I kind of lost interest because of all the issues. I'll give it another shot early in the new year.

@sushreebarsa sushreebarsa added stat:awaiting tensorflower Status - Awaiting response from tensorflower and removed stat:awaiting response Status - Awaiting response from author labels Dec 28, 2021
@GMNGeoffrey
Contributor

Well the good news is that there are fewer copies of these build files now. They've been upstreamed at https://github.com/llvm/llvm-project/tree/main/utils/bazel. Additionally, we've started defining things based on C preprocessor macros in config.h, which are generally way easier to use than Bazel platform selects (with the limitation that we can't execute arbitrary code like try-compile), so if you want to move HAVE_MALLINFO and the like out of config.bzl and into the config headers themselves, then go for it.

@grebaza

grebaza commented Mar 18, 2022

Hey there. I successfully built TF 2.8.0 on Alpine Linux with these changes: grebaza@39381a1. I included the backtrace library (libexecinfo) and hardcoded POSIX defines. Hope this helps.

@PureTryOut
Contributor Author

That is awesome! Any chance you could upstream that to the main TensorFlow repo (this one)?

@grebaza

grebaza commented Mar 25, 2022

Sure thing. I will also upload the changes to tensorflow-io's repo (as TensorFlow requires tensorflow-io during its installation).

@mihaimaruseac mihaimaruseac removed their assignment Nov 29, 2022
@SuryanarayanaY
Collaborator

@grebaza, could you please confirm whether you have had time to raise a PR for the changes mentioned in the above comment?
CC - @PureTryOut

@SuryanarayanaY SuryanarayanaY self-assigned this Mar 6, 2023
@PureTryOut
Contributor Author

Note that libexecinfo has since been removed from Alpine Linux. They had technical reasons, but I don't recall them exactly; it's been a while since I looked at this.

@grebaza

grebaza commented Mar 7, 2023

Hello there. I can remove the dependency on libexecinfo, thus enabling compilation on Alpine Linux >= 3.17 (in the same manner as done here). After that I will raise a PR.

@SuryanarayanaY
Collaborator

Hi @PureTryOut , @grebaza ,

The TensorFlow team officially maintains only the Ubuntu instructions, likely due to limited resources. If you still believe this would be useful for the larger community and are willing to contribute, please feel free to raise a PR.

Thanks!

@SuryanarayanaY SuryanarayanaY added the stat:awaiting response Status - Awaiting response from author label May 15, 2023
@github-actions

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label May 23, 2023
@github-actions

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

