
Floating Point Exception/SIGFPE in tf.map_fn over an empty tensor #8554

Closed
Panaetius opened this issue Mar 20, 2017 · 4 comments
Labels
stat:awaiting tensorflower, type:bug

Comments

@Panaetius

I'm working on a custom implementation of Faster-RCNN. It runs fine for ~150-300 batches, but then I get a floating point error, apparently in ConcatGPUImpl.

Source is here: https://github.com/Panaetius/woipv (src/models/train_model.py and src/models/model.py; it's still a bit of a mess since it's a work in progress). Reproducible as of commit 6eb1e3c5e818919b64a0a981abb98c2f3bc3dea1.

Environment info

Operating System: Ubuntu 16.04

Installed version of CUDA and cuDNN: CUDA 8.0, cuDNN 5
(please attach the output of ls -l /path/to/cuda/lib/libcud*):

-rw-r--r-- 1 root root 558720 Oct 4 23:15 /usr/local/cuda/lib64/libcudadevrt.a
lrwxrwxrwx 1 root root 16 Oct 4 23:15 /usr/local/cuda/lib64/libcudart.so -> libcudart.so.8.0
lrwxrwxrwx 1 root root 19 Oct 4 23:15 /usr/local/cuda/lib64/libcudart.so.8.0 -> libcudart.so.8.0.44
-rwxr-xr-x 1 root root 415432 Oct 4 23:15 /usr/local/cuda/lib64/libcudart.so.8.0.44
-rw-r--r-- 1 root root 775162 Oct 4 23:15 /usr/local/cuda/lib64/libcudart_static.a
lrwxrwxrwx 1 root root 13 Oct 4 23:23 /usr/local/cuda/lib64/libcudnn.so -> libcudnn.so.5
lrwxrwxrwx 1 root root 17 Oct 4 23:23 /usr/local/cuda/lib64/libcudnn.so.5 -> libcudnn.so.5.1.5
-rwxr-xr-x 1 root root 79337624 Oct 4 23:23 /usr/local/cuda/lib64/libcudnn.so.5.1.5
-rw-r--r-- 1 root root 69756172 Oct 4 23:23 /usr/local/cuda/lib64/libcudnn_static.a

If installed from binary pip package, provide:

  1. A link to the pip package you installed: the official Python 3 tensorflow-gpu package
  2. The output from python -c "import tensorflow; print(tensorflow.__version__)":

I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
1.0.0

Logs or other output that would be helpful

gdb:

Thread 49 "python" received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0x7fff10bb8700 (LWP 8227)]
0x00007fffcaab91de in void tensorflow::ConcatGPUImpl<float, int>(Eigen::GpuDevice const&, tensorflow::CudaDeviceArrayStruct<float const*, 8> const&, tensorflow::CudaDeviceArrayStruct<int, 8> const&, bool, int, tensorflow::TTypes<float, 2, long>::Matrix*) ()
from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so

Backtrace:

#0 0x00007fffcaab91de in void tensorflow::ConcatGPUImpl<float, int>(Eigen::GpuDevice const&, tensorflow::CudaDeviceArrayStruct<float const*, 8> const&, tensorflow::CudaDeviceArrayStruct<int, 8> const&, bool, int, tensorflow::TTypes<float, 2, long>::Matrix*) () from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#1 0x00007fffcaaaf45c in void tensorflow::(anonymous namespace)::ConcatGPUCall<float, int>(tensorflow::OpKernelContext*, std::vector<std::unique_ptr<tensorflow::TTypes<float, 2, long>::ConstMatrix, std::default_delete<tensorflow::TTypes<float, 2, long>::ConstMatrix> >, std::allocator<std::unique_ptr<tensorflow::TTypes<float, 2, long>::ConstMatrix, std::default_delete<tensorflow::TTypes<float, 2, long>::ConstMatrix> > > > const&, tensorflow::TTypes<float, 2, long>::Tensor*) ()
from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#2 0x00007fffc9b14c16 in tensorflow::TensorArrayPackOrGatherOp<Eigen::GpuDevice, float, false>::Compute(tensorflow::OpKernelContext*) () from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#3 0x00007fffcad155b2 in tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#4 0x00007fffcad56183 in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) ()
from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#5 0x00007fffcad569fa in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(tensorflow::gtl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8> const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#6 0x00007fffcb09a960 in std::_Function_handler<void (), Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::NonBlockingThreadPoolTempl(int, tensorflow::thread::EigenEnvironment)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#7 0x00007fffcb099c10 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#8 0x00007fffc7f2c260 in ?? () from /home/zenon/anaconda3/envs/tensorflow/bin/../lib/libstdc++.so.6
#9 0x00007ffff76d16fa in start_thread (arg=0x7fff10bb8700) at pthread_create.c:333
#10 0x00007ffff6aefb5d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

@Panaetius
Author

Upon researching this issue some more (with the good old "sprinkle prints everywhere" method of debugging), it seems to happen when the input labels/regions are empty. The MSCOCO dataset apparently has images with no labels, so those would cause the issue. I'm reading the labels/regions with a tf.VarLenFeature, so the dimensions aren't known ahead of time.

It's somewhere inside one of the two tf.map_fn() applications, going by the gdb stack trace, though I have yet to narrow it down precisely enough to write a simple test case.
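
To illustrate the situation (a rough sketch in the TF 1.0 API, names are illustrative): a tf.VarLenFeature yields a SparseTensor, and for an example with no labels its dense form is a zero-element tensor.

    import tensorflow as tf

    # An empty SparseTensor, like what tf.VarLenFeature produces for an
    # example that has no labels.
    sparse = tf.SparseTensor(indices=tf.zeros([0, 1], dtype=tf.int64),
                             values=tf.zeros([0], dtype=tf.float32),
                             dense_shape=[0])
    dense = tf.sparse_tensor_to_dense(sparse)  # zero-element tensor, shape (0,)

    with tf.Session() as sess:
        print(sess.run(dense))  # -> []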

@aselle aselle added stat:awaiting response and type:bug labels and removed stat:awaiting response label Mar 21, 2017
@Panaetius
Author

Panaetius commented Mar 21, 2017

OK, I managed to reproduce it in a simple test case: https://gist.github.com/Panaetius/8da26064491ab4ad1890bca3dfd86eff

The test case could probably be made smaller still; for instance, the whole save-data -> load-data round trip with tf.VarLenFeature is probably not needed, and you could just construct an empty SparseTensor instead, though I couldn't get that to work when I gave it a quick try.

Basically, calling tf.map_fn with an empty tensor ([]) leads to the SIGFPE, most likely when x[0] is evaluated in the lambda.
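
Condensed, the trigger looks roughly like this (a rough approximation of what the gist boils down to; the shape and lambda are illustrative):

    import tensorflow as tf

    # A tensor with zero rows, like the dense form of an empty VarLenFeature.
    regions = tf.zeros([0, 4], dtype=tf.float32)

    # Mapping over the empty first dimension is what appears to trigger the
    # SIGFPE in ConcatGPUImpl on affected builds.
    mapped = tf.map_fn(lambda x: x[0], regions)

    with tf.Session() as sess:
        sess.run(mapped)  # crashes with SIGFPE instead of returning []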

I'll change the title of the issue to reflect that this is an issue with tf.map_fn.

Of course, I'll have to change my code to correctly handle empty/missing labels, so I doubt this will affect me in the future. Still, tf.map_fn trying to run over an empty tensor and then crashing outright seems like a bug that should be fixed.
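
For anyone hitting the same thing, one possible guard (a quick sketch, not thoroughly tested) is to wrap the map in tf.cond so it only runs when the tensor is non-empty:

    import tensorflow as tf

    regions = tf.zeros([0, 4], dtype=tf.float32)  # possibly-empty input

    def map_regions():
        return tf.map_fn(lambda x: x[0], regions)

    def empty_result():
        # An empty tensor with the shape the map would otherwise produce.
        return tf.zeros([0], dtype=tf.float32)

    result = tf.cond(tf.size(regions) > 0, map_regions, empty_result)

    with tf.Session() as sess:
        print(sess.run(result))  # -> []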

@Panaetius Panaetius changed the title Floating Point Exception/SIGFPE in ConcatGPUImpl Floating Point Exception/SIGFPE in tf.map_fn when an empty tensor is used Mar 21, 2017
@Panaetius Panaetius changed the title Floating Point Exception/SIGFPE in tf.map_fn when an empty tensor is used Floating Point Exception/SIGFPE in tf.map_fn over an empty tensor Mar 21, 2017
@aselle aselle removed the stat:awaiting response label Mar 21, 2017
@aselle
Contributor

aselle commented Mar 28, 2017

@yuanbyu, do you think there is a way we could make this failure case in tf.map_fn more intuitive to detect?

@aselle aselle added the stat:awaiting tensorflower label Mar 28, 2017
@gunan
Contributor

gunan commented Jun 16, 2017

Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!

@gunan gunan closed this as completed Jun 16, 2017