C++ compilation of rule '@jemalloc//:jemalloc' failed: #7268

Closed
Montmorency opened this issue Feb 5, 2017 · 12 comments

@Montmorency

I'm trying to install TensorFlow from source on a cluster. I have installed bazel
[bazel release 0.4.4- (@non-git)], and I am using Python 2.7.13 with pyenv. When I try to build the TensorFlow pip wheel, I get a compilation error:

ERROR: ~/.cache/bazel/_bazel/e924d9c3ba75314415252c6f4f93bb86/external/jemalloc/BUILD:10:1: C++ compilation of rule '@jemalloc//:jemalloc' failed: gcc failed: error executing command /opt/apps/compilers/gcc/4.8.2/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -B/opt/apps/compilers/gcc/4.8.2/bin -B/usr/bin -Wunused-but-set-parameter -Wno-free-nonheap-object ... (remaining 38 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.

Has anyone experienced this? Is this a consequence of the warning I get from bazel for being on an NFS:
WARNING: Output base '~/.cache/bazel/_bazel/e924d9c3ba75314415252c6f4f93bb86' is on NFS. This may lead to surprising failures and undetermined behaviour.

My gcc version is:
gcc (GCC) 4.8.2
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

@Montmorency
Author

Montmorency commented Feb 5, 2017

Ultimately I disabled jemalloc when prompted during configuration by the tensorflow/configure file. TensorFlow seems to have compiled fine from there on.

@yaroslavvb
Contributor

cc @jhseu who added jemalloc. @Montmorency can you specify your linux/bazel versions?

@byronyi
Contributor

byronyi commented Feb 6, 2017

@Montmorency Possibly unrelated, but I've run into random problems when running bazel build from an NFS-hosted home directory.

export TEST_TMPDIR="/tmp" # or some local directory that is not hosted on NFS

This might help.
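
To confirm whether a given directory really lives on NFS before changing anything, one option is to look up the filesystem type of the mount point that contains it. The sketch below works on text in the Linux /proc/mounts format; the helper name `fs_type_for` and the sample mount table are made up for illustration and are not part of bazel or TensorFlow.

```python
import os


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the mount point that contains `path`.

    `mounts_text` is expected in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    path = os.path.normpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix):
            # Prefer the longest matching mount point; on ties the later entry wins.
            if len(mount_point) >= len(best_mount):
                best_mount, best_type = mount_point, fs_type
    return best_type


# Hypothetical mount table, for illustration only.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw 0 0
tmpfs /tmp tmpfs rw 0 0
filer:/export/home /users nfs rw 0 0
"""

print(fs_type_for("/users/me/.cache/bazel", SAMPLE_MOUNTS))
print(fs_type_for("/tmp/bazel", SAMPLE_MOUNTS))
```

On a real machine the second argument would come from reading /proc/mounts. If the answer is an NFS type, a local directory such as the TEST_TMPDIR setting above is worth trying.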

@girving
Contributor

girving commented Feb 6, 2017

@jhseu Can you comment?

@girving girving added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Feb 6, 2017
@jhseu jhseu self-assigned this Feb 6, 2017
@jhseu
Contributor

jhseu commented Feb 6, 2017

@Montmorency I tested against gcc 4.8.4 and it works for me.

Can you copy the compilation error that you get before that error message? Also, can you try removing NFS as a potential issue by doing what @byronyi mentioned?

@girving girving added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Feb 6, 2017
@Montmorency
Author

Thanks for the response!
Bazel

 [bazel release 0.4.4- (@non-git)]

Distro

$ cat /etc/issue
  Scientific Linux release 6.6 (Carbon)
  Kernel \r on an \m
$ uname -r
  2.6.32-431.17.1.el6.x86_64
$ uname -i
  x86_64
$ gcc --version
gcc (GCC) 6.2.0
Copyright (C) 2016 Free Software Foundation, Inc.

Update:
I had a closer look and started using a more recent compiler. NFS is not the issue. Closer inspection of the output pointed to a failure related to JEMALLOC_THP and an undefined constant (MADV_NOHUGEPAGE). I think the cluster has an older Linux kernel. I manually removed the following entries from the jemalloc/BUILD file and compilation completed without any further issues:
deleted:
"#undef JEMALLOC_THP": "#define JEMALLOC_THP",
"#undef JEMALLOC_HAVE_SECURE_GETENV": "#define JEMALLOC_HAVE_SECURE_GETENV",

I removed the second flag when the build failed at the linking stage. I think this is related to an older version of openssl on the cluster which configure didn't seem to detect (getenv rather than secure_getenv). Training the models is fine with this build, but I still get a strange glibc error when running model.predict_proba():

*** glibc detected *** python: double free or corruption (!prev): 0x0000000001375da0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3168e75e66]
/lib64/libc.so.6[0x3168e789b3]
/lib64/ld-linux-x86-64.so.2(_dl_deallocate_tls+0x67)[0x31686112f7]
/lib64/libpthread.so.0[0x316920675d]
/lib64/libpthread.so.0[0x31692078ea]
/lib64/libpthread.so.0(pthread_join+0xd4)[0x31692081f4]
/opt/apps/compilers/gcc/6.2.0/lib64/libstdc++.so.6(_ZNSt6thread4joinEv+0x27)[0x7fbc4ea24a77]
/users/k1511981/.pyenv/versions/t2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(+0x248c310)[0x7fbc51186310]
/users/k1511981/.pyenv/versions/t2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(_ZN10tensorflow6thread10ThreadPool4ImplD0Ev+0x144)[0x7fbc51160594]
/users/k1511981/.pyenv/versions/t2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(_ZN10tensorflow10FileSystem16GetMatchingPathsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPSt6vectorIS6_SaIS6_EE+0x5ce)[0x7fbc511830de]
/users/k1511981/.pyenv/versions/t2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(_ZN10tensorflow3Env16GetMatchingPathsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPSt6vectorIS6_SaIS6_EE+0xab)[0x7fbc5117f7eb]
/users/k1511981/.pyenv/versions/t2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(+0x986bfb)[0x7fbc4f680bfb]
/users/k1511981/.pyenv/versions/t2/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so(+0x9877c4)[0x7fbc4f6817c4]

@jhseu
Contributor

jhseu commented Feb 7, 2017

Ah yeah, that kernel is really old (from 2009!). I'm not sure we should try to support jemalloc with that kernel when there's a reasonable workaround.

The double free issue is unrelated because it happens even when disabling jemalloc (from your other comment on #6968). Also, it's in code that's unaffected by jemalloc. Looking at the stack trace, it's crashing in pthread_join while deallocating thread-local storage. It seems unlikely to be a TensorFlow issue (possibly also related to the old Linux kernel?).

Closing the issue out as intended behavior.

@jhseu jhseu closed this as completed Feb 7, 2017
@jhseu
Contributor

jhseu commented Feb 7, 2017

I haven't tested, but some searching indicates that jemalloc should build as long as the Linux kernel version is >= 2.6.38, otherwise it needs to be disabled.
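
The version threshold above can be checked up front with a few lines of Python. This is only a sketch: `kernel_at_least` is a hypothetical helper, and the version number is a rough proxy, since the reported failures are about constants missing at compile time.

```python
import re


def kernel_at_least(release, minimum=(2, 6, 38)):
    """Compare the leading numeric part of a `uname -r` string to `minimum`."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    # A missing third component (e.g. "4.4") is treated as 0.
    version = tuple(int(part or 0) for part in match.groups())
    return version >= minimum


print(kernel_at_least("2.6.32-431.17.1.el6.x86_64"))  # False
print(kernel_at_least("2.6.38"))                       # True
```

For the running machine, `platform.release()` from the standard library returns the same string as `uname -r`.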

@edi-bice

edi-bice commented May 4, 2017

I removed the bazel cache, ran bazel clean, disabled jemalloc via configure, and still get the following error (and yes, my kernel is 2.6.32):

ERROR: /home/ebice/.cache/bazel/bazel_ebice/975e0509e630426b34ea61d02aa8b898/external/jemalloc/BUILD:10:1: C++ compilation of rule '@jemalloc//:jemalloc' failed: gcc failed: error executing command /opt/rh/devtoolset-6/root/usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -B/opt/rh/devtoolset-6/root/usr/bin -B/usr/bin -Wunused-but-set-parameter -Wno-free-nonheap-object ... (remaining 38 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
external/jemalloc/src/pages.c: In function 'je_pages_huge':
external/jemalloc/src/pages.c:203:30: error: 'MADV_HUGEPAGE' undeclared (first use in this function)
return (madvise(addr, size, MADV_HUGEPAGE) != 0);
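
An "undeclared" error like this means the system headers seen by gcc do not define the constant. A rough way to confirm that is to scan the relevant header text for a `#define` of the name. The sketch below is an illustration only: `defines_macro` is a hypothetical helper, the header excerpt is made up, and which header file to read on a given system is left as an assumption.

```python
import re


def defines_macro(header_text, name):
    """Return True if `header_text` contains a '#define <name>' directive."""
    # Allow optional whitespace before and after '#', as in '# define FOO 1'.
    pattern = r"^[ \t]*#[ \t]*define[ \t]+%s\b" % re.escape(name)
    return re.search(pattern, header_text, flags=re.MULTILINE) is not None


# Made-up header excerpt standing in for an older system's mman header.
OLD_HEADER = """\
#define MADV_NORMAL 0
# define MADV_DONTNEED 4
"""

print(defines_macro(OLD_HEADER, "MADV_HUGEPAGE"))  # False
print(defines_macro(OLD_HEADER, "MADV_DONTNEED"))  # True
```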

@jhseu
Contributor

jhseu commented May 4, 2017

@edi-bice Unfortunately, I don't have access to any machine with such old Linux kernels to test. That shouldn't happen if jemalloc is really disabled as far as I can tell.

@edi-bice

edi-bice commented May 4, 2017

"Really disabled" is the key phrase. Apparently bazel clean did not really clean everything. In addition to .bazelrc, the file .tf_configure.bazelrc remained even after a bazel clean, and inside it jemalloc=true was still set, despite the configure output stating "jemalloc disabled".

@jhseu
Contributor

jhseu commented May 4, 2017

Did you rerun ./configure? That file is deleted and updated again here:
https://github.com/tensorflow/tensorflow/blob/master/configure#L167
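
After rerunning ./configure, a quick way to see what actually ended up in .tf_configure.bazelrc is to list every non-comment line that mentions jemalloc. This is a minimal sketch; `jemalloc_settings` is a hypothetical helper and the sample contents are made up, so the real option lines may be spelled differently.

```python
def jemalloc_settings(bazelrc_text):
    """Return non-comment lines of a bazelrc-style text that mention jemalloc."""
    hits = []
    for line in bazelrc_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # ignore blank lines and comments
        if "jemalloc" in stripped.lower():
            hits.append(stripped)
    return hits


# Made-up contents; the real option lines may be spelled differently.
SAMPLE_RC = """\
# generated by configure
build --define example_option=true
build --define jemalloc=true
"""

print(jemalloc_settings(SAMPLE_RC))
```

An empty list would suggest jemalloc is no longer being requested through that file; any remaining line is the stale setting described above.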
