Floating Point Exception/SIGFPE in tf.map_fn over an empty tensor #8554
Comments
Upon researching this issue some more (with the good old "sprinkle prints everywhere" method of debugging), it seems to happen when the input labels/regions are empty. The MSCOCO dataset apparently has images with no labels, so those are what trigger the issue. I'm reading the labels/regions with a tf.VarLenFeature, so the dimensions aren't known ahead of time. Going by the gdb stack trace, the crash happens somewhere inside one of the two tf.map_fn() applications, though I have yet to narrow it down precisely enough to write a simple test case.
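For reference, the label-reading path looks roughly like this (a minimal sketch, not the actual model code; the feature name "bbox" and the [N, 4] box layout are made up for illustration):

import tensorflow as tf

# Hypothetical single serialized tf.Example; the real code reads MSCOCO records.
serialized_example = tf.placeholder(tf.string, shape=[])
features = tf.parse_single_example(
    serialized_example,
    features={"bbox": tf.VarLenFeature(tf.float32)})

# Densify the variable-length feature; an image with no labels yields shape [0, 4].
boxes = tf.reshape(tf.sparse_tensor_to_dense(features["bbox"]), [-1, 4])

# Map over the first dimension; with zero rows this is roughly where the SIGFPE shows up.
widths = tf.map_fn(lambda box: box[2] - box[0], boxes)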
Ok, I managed to reproduce it in a simple test case: https://gist.github.com/Panaetius/8da26064491ab4ad1890bca3dfd86eff The test case can probably be made smaller still; for instance, the whole "save data, then load it with tf.VarLenFeature" round trip is probably not needed, and you could just construct an empty SparseTensor directly, though I couldn't get that to work when I gave it a quick try. Basically, calling tf.map_fn with an empty tensor ([ ]) leads to the SIGFPE, most likely when x[0] is evaluated in the lambda. I'll change the title of the issue to reflect that this is an issue with tf.map_fn. Of course I'll have to change my code to handle empty/missing labels correctly, so I doubt this will affect me much in the future, but tf.map_fn trying to run over an empty tensor and then crashing outright seems like a bug that should be fixed.
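Stripped down even further, the failure mode should look roughly like this (a minimal sketch, not the exact gist; on an affected build the sess.run call kills the process with SIGFPE instead of raising a Python error):

import tensorflow as tf

# An empty float tensor of shape [0, 2]: zero elements to map over.
empty = tf.zeros([0, 2], dtype=tf.float32)

# map_fn over the (empty) first dimension; the x[0] indexing in the lambda
# is where the arithmetic exception appears to be triggered.
result = tf.map_fn(lambda x: x[0], empty)

with tf.Session() as sess:
    print(sess.run(result))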
@yuanbyu, do you think there is a way we could make this failure case in tf.map_fn more intuitive to detect?
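Until that happens, a guard along these lines seems like a reasonable workaround (a sketch only, using a hypothetical safe_map_fn helper; it assumes fn maps each element to a scalar and simply skips the map when the input is empty):

import tensorflow as tf

def safe_map_fn(fn, elems, dtype):
    # Hypothetical helper: only run map_fn when there is at least one element
    # along the first dimension, otherwise return an empty tensor of the
    # expected dtype (assumes fn returns a scalar per element).
    return tf.cond(
        tf.greater(tf.shape(elems)[0], 0),
        lambda: tf.map_fn(fn, elems, dtype=dtype),
        lambda: tf.zeros([0], dtype=dtype))

boxes = tf.zeros([0, 2], dtype=tf.float32)  # e.g. an image with no labels
widths = safe_map_fn(lambda box: box[1] - box[0], boxes, tf.float32)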
Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!
I'm working on a custom implementation of Faster-RCNN. It runs fine for ~150-300 batches, but then I get a floating point exception, apparently in ConcatGPUImpl.
Source is here: https://github.com/Panaetius/woipv (src/models/train_model.py and src/models/model.py; it's still a bit of a mess since it's a work in progress), reproducible as of commit 6eb1e3c5e818919b64a0a981abb98c2f3bc3dea1.
Environment info
Operating System: Ubuntu 16.04
Installed version of CUDA and cuDNN: CUDA 8.0, cuDNN 5
Output of ls -l /usr/local/cuda/lib64/libcud*:
-rw-r--r-- 1 root root 558720 Okt 4 23:15 /usr/local/cuda/lib64/libcudadevrt.a
lrwxrwxrwx 1 root root 16 Okt 4 23:15 /usr/local/cuda/lib64/libcudart.so -> libcudart.so.8.0
lrwxrwxrwx 1 root root 19 Okt 4 23:15 /usr/local/cuda/lib64/libcudart.so.8.0 -> libcudart.so.8.0.44
-rwxr-xr-x 1 root root 415432 Okt 4 23:15 /usr/local/cuda/lib64/libcudart.so.8.0.44
-rw-r--r-- 1 root root 775162 Okt 4 23:15 /usr/local/cuda/lib64/libcudart_static.a
lrwxrwxrwx 1 root root 13 Okt 4 23:23 /usr/local/cuda/lib64/libcudnn.so -> libcudnn.so.5
lrwxrwxrwx 1 root root 17 Okt 4 23:23 /usr/local/cuda/lib64/libcudnn.so.5 -> libcudnn.so.5.1.5
-rwxr-xr-x 1 root root 79337624 Okt 4 23:23 /usr/local/cuda/lib64/libcudnn.so.5.1.5
-rw-r--r-- 1 root root 69756172 Okt 4 23:23 /usr/local/cuda/lib64/libcudnn_static.a
Installed from binary pip package; output of python -c "import tensorflow; print(tensorflow.__version__)":
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
1.0.0
Logs or other output that would be helpful
gdb:
Thread 49 "python" received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0x7fff10bb8700 (LWP 8227)]
0x00007fffcaab91de in void tensorflow::ConcatGPUImpl<float, int>(Eigen::GpuDevice const&, tensorflow::CudaDeviceArrayStruct<float const*, 8> const&, tensorflow::CudaDeviceArrayStruct<int, 8> const&, bool, int, tensorflow::TTypes<float, 2, long>::Matrix*) ()
from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
Backtrace:
#0 0x00007fffcaab91de in void tensorflow::ConcatGPUImpl<float, int>(Eigen::GpuDevice const&, tensorflow::CudaDeviceArrayStruct<float const*, 8> const&, tensorflow::CudaDeviceArrayStruct<int, 8> const&, bool, int, tensorflow::TTypes<float, 2, long>::Matrix*) () from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#1 0x00007fffcaaaf45c in void tensorflow::(anonymous namespace)::ConcatGPUCall<float, int>(tensorflow::OpKernelContext*, std::vector<std::unique_ptr<tensorflow::TTypes<float, 2, long>::ConstMatrix, std::default_delete<tensorflow::TTypes<float, 2, long>::ConstMatrix> >, std::allocator<std::unique_ptr<tensorflow::TTypes<float, 2, long>::ConstMatrix, std::default_delete<tensorflow::TTypes<float, 2, long>::ConstMatrix> > > > const&, tensorflow::TTypes<float, 2, long>::Tensor*) ()
from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#2 0x00007fffc9b14c16 in tensorflow::TensorArrayPackOrGatherOp<Eigen::GpuDevice, float, false>::Compute(tensorflow::OpKernelContext*) () from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#3 0x00007fffcad155b2 in tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#4 0x00007fffcad56183 in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) ()
from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#5 0x00007fffcad569fa in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(tensorflow::gtl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8> const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#6 0x00007fffcb09a960 in std::_Function_handler<void (), Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::NonBlockingThreadPoolTempl(int, tensorflow::thread::EigenEnvironment)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#7 0x00007fffcb099c10 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#8 0x00007fffc7f2c260 in ?? () from /home/zenon/anaconda3/envs/tensorflow/bin/../lib/libstdc++.so.6
#9 0x00007ffff76d16fa in start_thread (arg=0x7fff10bb8700) at pthread_create.c:333
#10 0x00007ffff6aefb5d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109