Tests for tensorflow-sys fail #66

Closed
nbigaouette-eai opened this issue Mar 7, 2017 · 11 comments
@nbigaouette-eai
Contributor

Running cargo test in the tensorflow-sys directory fails (but tests pass for the main crate).

Here's the output:

 -> cd ~/tensorflow_rust.git/tensorflow-sys
 -> cargo test
    Finished debug [unoptimized + debuginfo] target(s) in 0.0 secs
     Running target/debug/deps/tensorflow-9f5d1dac7e952430

running 12 tests
test buffer::tests::basic ... ok
test session::tests::smoke ... ok
test session::tests::test_close ... ok
test graph::tests::smoke ... ok
test tests::smoke ... ok
test tests::test_close ... ok
test tests::test_extend_graph ... ok
test session::tests::test_run ... ok
test tests::test_set_config ... ok
test tests::test_set_target ... ok
test tests::test_tensor ... ok
test tests::test_run ... ok

test result: ok. 12 passed; 0 failed; 0 ignored; 0 measured

   Doc-tests tensorflow

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured

 -> cd tensorflow-sys/
 -> cargo test --verbose
       Fresh lazy_static v0.2.4
       Fresh regex-syntax v0.3.9
       Fresh pkg-config v0.3.9
       Fresh utf8-ranges v0.1.3
       Fresh libc v0.2.21
       Fresh winapi v0.2.8
       Fresh winapi-build v0.1.1
       Fresh memchr v0.1.11
       Fresh aho-corasick v0.5.3
       Fresh kernel32-sys v0.2.2
       Fresh thread-id v2.0.0
       Fresh thread_local v0.2.7
       Fresh regex v0.1.80
       Fresh semver-parser v0.6.2
       Fresh semver v0.5.1
       Fresh tensorflow-sys v0.7.0 (file:///home/nbigaouette/tensorflow_rust.git/tensorflow-sys)
    Finished debug [unoptimized + debuginfo] target(s) in 0.0 secs
     Running `/home/nbigaouette/tensorflow_rust.git/target/debug/deps/lib-a517463cab98ea9f`

running 1 test
test linkage ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured

     Running `/home/nbigaouette/tensorflow_rust.git/target/debug/deps/tensorflow_sys-b18d6c19e08d67bd`
terminate called without an active exception
error: process didn't exit successfully: `/home/nbigaouette/tensorflow_rust.git/target/debug/deps/tensorflow_sys-b18d6c19e08d67bd` (signal: 6, SIGABRT: process abort signal)

Caused by:
  process didn't exit successfully: `/home/nbigaouette/tensorflow_rust.git/target/debug/deps/tensorflow_sys-b18d6c19e08d67bd` (signal: 6, SIGABRT: process abort signal)

Here's a backtrace from running /home/nbigaouette/tensorflow_rust.git/target/debug/deps/tensorflow_sys-b18d6c19e08d67bd through gdb:

Program received signal SIGABRT, Aborted.
0x00007ffff32f904f in raise () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007ffff32f904f in raise () from /usr/lib/libc.so.6
#1  0x00007ffff32fa47a in abort () from /usr/lib/libc.so.6
#2  0x00007ffff2ccb4ed in __gnu_cxx::__verbose_terminate_handler () at /build/gcc/src/gcc/libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00007ffff2cc92a6 in __cxxabiv1::__terminate (handler=<optimized out>) at /build/gcc/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:47
#4  0x00007ffff2cc92f1 in std::terminate () at /build/gcc/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:57
#5  0x00007ffff2cc8062 in __cxxabiv1::__cxa_allocate_exception (thrown_size=136, thrown_size@entry=8) at /build/gcc/src/gcc/libstdc++-v3/libsupc++/eh_alloc.cc:278
#6  0x00007ffff2cc9a98 in operator new (sz=32) at /build/gcc/src/gcc/libstdc++-v3/libsupc++/new_op.cc:54
#7  0x00007ffff57ab4c8 in tensorflow::monitoring::Counter<2>* tensorflow::monitoring::Counter<2>::New<char const (&) [46], char const (&) [58], char const (&) [11], char const (&) [7]>(char const (&) [46], char const (&) [58], char const (&) [11], char const (&) [7]) () from /usr/lib/libtensorflow.so
#8  0x00007ffff571a50b in ?? () from /usr/lib/libtensorflow.so
#9  0x00007ffff7de94fa in call_init.part () from /lib64/ld-linux-x86-64.so.2
#10 0x00007ffff7de960b in _dl_init () from /lib64/ld-linux-x86-64.so.2
#11 0x00007ffff7ddadaa in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#12 0x0000000000000001 in ?? ()
#13 0x00007fffffffe846 in ?? ()
#14 0x0000000000000000 in ?? ()
@adamcrume adamcrume added the bug label Mar 8, 2017
@adamcrume
Contributor

We'll probably have to rope in a TensorFlow developer, because that stack trace looks like 100% C++.

@nbigaouette-eai
Contributor Author

This looks very weird, as not much is executed. I suspect it's a Rust-related setup issue...

If I comment out line 558 of tensorflow-sys/src/lib.rs, the tests pass!

Here's the failure:

(~/tensorflow_rust.git/tensorflow-sys)
 -> cargo test --verbose --jobs 1 --no-fail-fast -- --nocapture
       Fresh winapi v0.2.8
       Fresh pkg-config v0.3.9
       Fresh winapi-build v0.1.1
       Fresh utf8-ranges v0.1.3
       Fresh libc v0.2.21
       Fresh memchr v0.1.11
       Fresh aho-corasick v0.5.3
       Fresh kernel32-sys v0.2.2
       Fresh thread-id v2.0.0
       Fresh thread_local v0.2.7
       Fresh regex-syntax v0.3.9
       Fresh regex v0.1.80
       Fresh lazy_static v0.2.4
       Fresh semver-parser v0.6.2
       Fresh semver v0.5.1
       Fresh tensorflow-sys v0.7.0 (file:///home/nbigaouette/tensorflow_rust.git/tensorflow-sys)
    Finished debug [unoptimized + debuginfo] target(s) in 0.0 secs
     Running `/home/nbigaouette/tensorflow_rust.git/target/debug/deps/lib-a517463cab98ea9f --nocapture`

running 1 test
test linkage ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured

     Running `/home/nbigaouette/tensorflow_rust.git/target/debug/deps/tensorflow_sys-b18d6c19e08d67bd --nocapture`
terminate called without an active exception
   Doc-tests tensorflow-sys
     Running `rustdoc --test /home/nbigaouette/tensorflow_rust.git/tensorflow-sys/src/lib.rs --crate-name tensorflow_sys -L dependency=/home/nbigaouette/tensorflow_rust.git/target$
debug/deps -L native=/usr/lib --test-args --nocapture --extern tensorflow_sys=/home/nbigaouette/tensorflow_rust.git/target/debug/deps/libtensorflow_sys-78389605804932ce.rlib --extern libc=/home/n$
igaouette/tensorflow_rust.git/target/debug/deps/liblibc-6451aa7d8103c93e.rlib`

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured

error: process didn't exit successfully: `/home/nbigaouette/tensorflow_rust.git/target/debug/deps/tensorflow_sys-b18d6c19e08d67bd --nocapture` (signal: 6, SIGABRT: process abort signal)

Caused by:
  process didn't exit successfully: `/home/nbigaouette/tensorflow_rust.git/target/debug/deps/tensorflow_sys-b18d6c19e08d67bd --nocapture` (signal: 6, SIGABRT: process abort signal)

And here's the successful run after commenting out line 558:

(~/tensorflow_rust.git/tensorflow-sys)
 -> cargo test --verbose --jobs 1 --no-fail-fast -- --nocapture
       Fresh regex-syntax v0.3.9
       Fresh utf8-ranges v0.1.3
       Fresh winapi v0.2.8
       Fresh libc v0.2.21
       Fresh memchr v0.1.11
       Fresh lazy_static v0.2.4
       Fresh pkg-config v0.3.9
       Fresh aho-corasick v0.5.3
       Fresh winapi-build v0.1.1
       Fresh kernel32-sys v0.2.2
       Fresh thread-id v2.0.0
       Fresh thread_local v0.2.7
       Fresh regex v0.1.80
       Fresh semver-parser v0.6.2
       Fresh semver v0.5.1
       Fresh tensorflow-sys v0.7.0 (file:///home/nbigaouette/tensorflow_rust.git/tensorflow-sys)
    Finished debug [unoptimized + debuginfo] target(s) in 0.0 secs
     Running `/home/nbigaouette/tensorflow_rust.git/target/debug/deps/lib-a517463cab98ea9f --nocapture`

running 1 test
test linkage ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured

     Running `/home/nbigaouette/tensorflow_rust.git/target/debug/deps/tensorflow_sys-b18d6c19e08d67bd --nocapture`

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured

   Doc-tests tensorflow-sys
     Running `rustdoc --test /home/nbigaouette/tensorflow_rust.git/tensorflow-sys/src/lib.rs --crate-name tensorflow_sys -L dependency=/home/nbigaouette/tensorflow_rust.git/target/debug/deps -L native=/usr/lib --test-args --nocapture --extern libc=/home/nbigaouette/tensorflow_rust.git/target/debug/deps/liblibc-6451aa7d8103c93e.rlib --extern tensorflow_sys=/home/nbigaouette/tensorflow_rust.git/target/debug/deps/libtensorflow_sys-78389605804932ce.rlib`

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured

@nbigaouette-eai
Contributor Author

There's something going on when linking TF_SetAttrValueProto()...

While the --lib test fails in debug mode:

(~/tensorflow_rust.git/tensorflow-sys)
 -> cargo test --verbose --lib -- --nocapture 
       Fresh libc v0.2.21
       Fresh utf8-ranges v0.1.3
       Fresh regex-syntax v0.3.9
       Fresh lazy_static v0.2.4
       Fresh memchr v0.1.11
       Fresh winapi v0.2.8
       Fresh pkg-config v0.3.9
       Fresh winapi-build v0.1.1
       Fresh aho-corasick v0.5.3
       Fresh kernel32-sys v0.2.2
       Fresh thread-id v2.0.0
       Fresh thread_local v0.2.7
       Fresh regex v0.1.80
       Fresh semver-parser v0.6.2
       Fresh semver v0.5.1
       Fresh tensorflow-sys v0.7.0 (file:///home/nbigaouette/tensorflow_rust.git/tensorflow-sys)
    Finished debug [unoptimized + debuginfo] target(s) in 0.0 secs
     Running `/home/nbigaouette/tensorflow_rust.git/target/debug/deps/tensorflow_sys-b18d6c19e08d67bd --nocapture`
terminate called without an active exception
error: process didn't exit successfully: `/home/nbigaouette/tensorflow_rust.git/target/debug/deps/tensorflow_sys-b18d6c19e08d67bd --nocapture` (signal: 6, SIGABRT: process abort signal)

Caused by:
  process didn't exit successfully: `/home/nbigaouette/tensorflow_rust.git/target/debug/deps/tensorflow_sys-b18d6c19e08d67bd --nocapture` (signal: 6, SIGABRT: process abort signal)

it passes in release mode:

(~/tensorflow_rust.git/tensorflow-sys)
 -> cargo test --verbose --lib --release -- --nocapture 
       Fresh lazy_static v0.2.4
       Fresh utf8-ranges v0.1.3
       Fresh pkg-config v0.3.9
       Fresh libc v0.2.21
       Fresh winapi v0.2.8
       Fresh winapi-build v0.1.1
       Fresh regex-syntax v0.3.9
       Fresh memchr v0.1.11
       Fresh aho-corasick v0.5.3
       Fresh kernel32-sys v0.2.2
       Fresh thread-id v2.0.0
       Fresh thread_local v0.2.7
       Fresh regex v0.1.80
       Fresh semver-parser v0.6.2
       Fresh semver v0.5.1
       Fresh tensorflow-sys v0.7.0 (file:///home/nbigaouette/tensorflow_rust.git/tensorflow-sys)
    Finished release [optimized] target(s) in 0.0 secs
     Running `/home/nbigaouette/tensorflow_rust.git/target/release/deps/tensorflow_sys-51d55d892be4a403 --nocapture`

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured

Probably because of dead code elimination or something like that...

@nbigaouette-eai
Contributor Author

Running through valgrind reveals more information:

 -> valgrind --leak-check=full /home/nbigaouette/tensorflow_rust.git/target/debug/deps/tensorflow_sys-b18d6c19e08d67bd
==2292== Memcheck, a memory error detector
==2292== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==2292== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==2292== Command: /home/nbigaouette/tensorflow_rust.git/target/debug/deps/tensorflow_sys-b18d6c19e08d67bd
==2292== 
==2292== Invalid read of size 1
==2292==    at 0x80EBBDA: je_base_alloc (in /usr/lib/libtensorflow.so)
==2292==    by 0x168E66: run_quantize_init (arena.c:3567)
==2292==    by 0x168E66: je_arena_boot (arena.c:3639)
==2292==    by 0x15F46A: malloc_init_hard_a0_locked (jemalloc.c:1243)
==2292==    by 0x15F542: malloc_init_hard (jemalloc.c:1373)
==2292==    by 0x1753DC: __libc_csu_init (in /home/nbigaouette/tensorflow_rust.git/target/debug/deps/tensorflow_sys-b18d6c19e08d67bd)
==2292==    by 0x95CD21F: (below main) (in /usr/lib/libc-2.24.so)
==2292==  Address 0x20000000088e0a7f is not stack'd, malloc'd or (recently) free'd
==2292== 
==2292== 
==2292== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==2292==  General Protection Fault
==2292==    at 0x80EBBDA: je_base_alloc (in /usr/lib/libtensorflow.so)
==2292==    by 0x168E66: run_quantize_init (arena.c:3567)
==2292==    by 0x168E66: je_arena_boot (arena.c:3639)
==2292==    by 0x15F46A: malloc_init_hard_a0_locked (jemalloc.c:1243)
==2292==    by 0x15F542: malloc_init_hard (jemalloc.c:1373)
==2292==    by 0x1753DC: __libc_csu_init (in /home/nbigaouette/tensorflow_rust.git/target/debug/deps/tensorflow_sys-b18d6c19e08d67bd)
==2292==    by 0x95CD21F: (below main) (in /usr/lib/libc-2.24.so)
==2292== 
==2292== HEAP SUMMARY:
==2292==     in use at exit: 3,595,128 bytes in 48,718 blocks
==2292==   total heap usage: 193,160 allocs, 144,442 frees, 13,614,953 bytes allocated
==2292== 
==2292== LEAK SUMMARY:
==2292==    definitely lost: 0 bytes in 0 blocks
==2292==    indirectly lost: 0 bytes in 0 blocks
==2292==      possibly lost: 0 bytes in 0 blocks
==2292==    still reachable: 3,595,128 bytes in 48,718 blocks
==2292==         suppressed: 0 bytes in 0 blocks
==2292== Reachable blocks (those to which a pointer was found) are not shown.
==2292== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==2292== 
==2292== For counts of detected and suppressed errors, rerun with: -v
==2292== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)

@nbigaouette-eai
Contributor Author

Could it be a conflict with jemalloc?

tensorflow can be compiled with jemalloc 4.4.0 (see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/workspace.bzl#L447-L457).
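
The valgrind trace above shows `je_base_alloc` resolving into /usr/lib/libtensorflow.so while its caller, `je_arena_boot`, lives in the test binary. One way to look into that is to list the `je_`-prefixed symbols the shared library defines. This is only a minimal sketch: it assumes the text output of GNU `nm -D --defined-only /usr/lib/libtensorflow.so` has already been captured, the helper name `exported_jemalloc_symbols` is hypothetical, and the addresses in the sample are made up.

```rust
/// Return the names of symbols in `nm` output that start with `je_`.
/// Hypothetical helper; expects lines shaped like "<address> <type> <name>".
fn exported_jemalloc_symbols(nm_output: &str) -> Vec<&str> {
    nm_output
        .lines()
        // The symbol name is the last whitespace-separated field on each line.
        .filter_map(|line| line.split_whitespace().last())
        .filter(|name| name.starts_with("je_"))
        .collect()
}

fn main() {
    // Sample text in the shape GNU nm prints; the addresses are invented.
    let sample = "00000000000ebbda T je_base_alloc\n0000000000012340 T TF_NewSession\n";
    for name in exported_jemalloc_symbols(sample) {
        println!("possible allocator symbol clash: {}", name);
    }
}
```

If the library exports such symbols and the Rust binary also embeds jemalloc, the dynamic linker may bind calls from one copy to the other, which would be consistent with the crash during startup.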

As suggested on IRC, I compiled with a nightly toolchain and the system allocator (instead of jemalloc) by adding the following at the top of tensorflow-sys/src/lib.rs:

#![feature(alloc_system)]

extern crate alloc_system;

(see https://doc.rust-lang.org/book/custom-allocators.html). Running the "tests" (note that there are none) succeeds:

(~/tensorflow_rust.git/tensorflow-sys)
 -> rustup run nightly cargo test --verbose --lib -- --nocapture
       Fresh libc v0.2.21
       Fresh winapi v0.2.8
       Fresh utf8-ranges v0.1.3
       Fresh lazy_static v0.2.4
       Fresh memchr v0.1.11
       Fresh winapi-build v0.1.1
       Fresh pkg-config v0.3.9
       Fresh aho-corasick v0.5.3
       Fresh regex-syntax v0.3.9
       Fresh kernel32-sys v0.2.2
       Fresh thread-id v2.0.0
       Fresh thread_local v0.2.7
       Fresh regex v0.1.80
       Fresh semver-parser v0.6.2
       Fresh semver v0.5.1
   Compiling tensorflow-sys v0.7.0 (file:///home/nbigaouette/tensorflow_rust.git/tensorflow-sys)
     Running `/home/nbigaouette/tensorflow_rust.git/target/debug/build/tensorflow-sys-47d0b49bf215e55a/build-script-build`
     Running `rustc --crate-name tensorflow_sys src/lib.rs --emit=dep-info,link -C debuginfo=2 --test -C metadata=dde201bfdb2eb731 -C extra-filename=-dde201bfdb2eb731 --out-dir /home/nbigaouette/tensorflow_rust.git/target/debug/deps -L dependency=/home/nbigaouette/tensorflow_rust.git/target/debug/deps --extern libc=/home/nbigaouette/tensorflow_rust.git/target/debug/deps/liblibc-5dc7b85e748840b4.rlib -L native=/usr/lib -l tensorflow -l 'stdc++'`
    Finished dev [unoptimized + debuginfo] target(s) in 1.3 secs
     Running `/home/nbigaouette/tensorflow_rust.git/target/debug/deps/tensorflow_sys-dde201bfdb2eb731 --nocapture`

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured

Using default jemalloc:

(~/tensorflow_rust.git/tensorflow-sys)
 -> rustup run nightly cargo test --verbose --lib -- --nocapture
       Fresh libc v0.2.21
       Fresh pkg-config v0.3.9
       Fresh regex-syntax v0.3.9
       Fresh winapi-build v0.1.1
       Fresh lazy_static v0.2.4
       Fresh memchr v0.1.11
       Fresh winapi v0.2.8
       Fresh utf8-ranges v0.1.3
       Fresh aho-corasick v0.5.3
       Fresh kernel32-sys v0.2.2
       Fresh thread-id v2.0.0
       Fresh thread_local v0.2.7
       Fresh regex v0.1.80
       Fresh semver-parser v0.6.2
       Fresh semver v0.5.1
   Compiling tensorflow-sys v0.7.0 (file:///home/nbigaouette/tensorflow_rust.git/tensorflow-sys)
     Running `/home/nbigaouette/tensorflow_rust.git/target/debug/build/tensorflow-sys-47d0b49bf215e55a/build-script-build`
     Running `rustc --crate-name tensorflow_sys src/lib.rs --emit=dep-info,link -C debuginfo=2 --test -C metadata=dde201bfdb2eb731 -C extra-filename=-dde201bfdb2eb731 --out-dir /home/nbigaouette/tensorflow_rust.git/target/debug/deps -L dependency=/home/nbigaouette/tensorflow_rust.git/target/debug/deps --extern libc=/home/nbigaouette/tensorflow_rust.git/target/debug/deps/liblibc-5dc7b85e748840b4.rlib -L native=/usr/lib -l tensorflow -l 'stdc++'`
    Finished dev [unoptimized + debuginfo] target(s) in 1.8 secs
     Running `/home/nbigaouette/tensorflow_rust.git/target/debug/deps/tensorflow_sys-dde201bfdb2eb731 --nocapture`
terminate called without an active exception
error: process didn't exit successfully: `/home/nbigaouette/tensorflow_rust.git/target/debug/deps/tensorflow_sys-dde201bfdb2eb731 --nocapture` (signal: 6, SIGABRT: process abort signal)

Caused by:
  process didn't exit successfully: `/home/nbigaouette/tensorflow_rust.git/target/debug/deps/tensorflow_sys-dde201bfdb2eb731 --nocapture` (signal: 6, SIGABRT: process abort signal)

I might try to recompile tf with TF_NEED_JEMALLOC=0 instead of TF_NEED_JEMALLOC=1...

@nbigaouette-eai
Contributor Author

Recompiled tensorflow 1.0.1 with jemalloc disabled (TF_NEED_JEMALLOC=0); now all tests are passing!

@adamcrume
Contributor

Thanks for debugging; that's great detective work! And good idea using valgrind; we should add that as a test (#69).

Unfortunately, this is a bit of a pickle. I don't want to require building with a nightly release, so changing the allocator in Rust is probably not an option. On the other hand, I'm trying to add support for prebuilt TensorFlow binaries (#65), and they seem to be built with jemalloc enabled.
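
For readers on a later toolchain: the `#[global_allocator]` attribute and `std::alloc::System` have been available on stable Rust since 1.28, so selecting the system allocator no longer requires nightly. The following is a minimal sketch of that approach, not something this crate is confirmed to do; `GLOBAL` and `heap_sum` are illustrative names.

```rust
use std::alloc::System;

// Route every Rust heap allocation in this binary through the platform
// allocator (malloc/free on Linux) instead of a bundled allocator.
#[global_allocator]
static GLOBAL: System = System;

/// Illustrative function that forces a heap allocation via `Vec`.
fn heap_sum(n: u32) -> u32 {
    let values: Vec<u32> = (1..=n).collect();
    values.iter().sum()
}

fn main() {
    println!("sum = {}", heap_sum(4));
}
```

The attribute has to appear in the final binary (or test) crate to take effect, so a `-sys` library would typically leave that choice to its users.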

I'm also still lost on why the tests for the tensorflow crate are fine (except for a few leaked bytes, verified by valgrind), and the tests in tensorflow-sys/tests/lib.rs are fine, but the tests for the tensorflow_sys crate crash. Even if I add test functions in tensorflow-sys/src/lib.rs, it crashes before it runs my tests. (It seems to be crashing while loading libtensorflow.so.) What I would really like to know is whether users could be affected by this bug. If it means that we can't run tests in tensorflow-sys/src/lib.rs, then that's no big deal. If users are affected, then this has to be fixed somehow.

@lilianmoraru

I expect tensorflow-sys to be shaky, especially on Windows.
I wanted to see if there are docs for tensorflow-sys and noticed that it uses bindgen, but the bindings are not generated at build time.
bindgen uses the current environment to define some of the variables.
So even on Linux you could get different values when generating against glibc versus musl (or any other libc implementation), for example.

Throwing it out there that these issues might also appear because of this.

@nbigaouette-eai
Contributor Author

It would surprise me if TF's C API were platform dependent. Its header should be identical on all platforms... Or is it not?

It seemed more reliable to use bindgen to generate the Rust wrapper and commit the result rather than include a build.rs that would automatically generate that binding. Adding bindgen as a build dependency can add some overhead, which I think might not be necessary.

The script to generate the binding is included in the repository so that anybody can either update the binding or verify that the binding generated is the same as the committed one.

I invite anybody to run this script to verify the assumption. If I am wrong about the assumption then bindgen will have to be added as a build dependency.

I'll have to revisit this issue at some point; the build process has changed a little since I filed this.

@adamcrume
Contributor

It looks like running rustup run nightly cargo test --verbose --lib -- --nocapture and cargo test on a clean checkout succeed now. Can you verify?

@nbigaouette-eai
Contributor Author

I just tested 15a40b0 on both macOS and Linux. Both are using the pre-built TensorFlow downloaded by build.rs (there is no TF installed system-wide on either system).

I cannot reproduce the problem I had back in March; all tests (for both crates) are passing without issue.

I'm not sure what went wrong back then. Maybe there was something with the pre-built vs compiled integration, the build.rs changed, or maybe it was due to a problem with the bindgen generated files. Who knows...

At least now it works! Closing this.
