Floating Point Exception/SIGFPE in tf.map_fn over an empty tensor #8554
Comments
Upon researching this issue some more (with the good old "sprinkle prints everywhere" method of debugging), it seems to happen when the input labels/regions are empty. The MSCOCO dataset apparently has images with no labels, so those are what trigger the issue. I'm reading the labels/regions with a tf.VarLenFeature, so the dimensions aren't known ahead of time. Going by the gdb stack trace, the crash happens somewhere inside one of the two tf.map_fn() applications, though I have yet to narrow it down precisely enough to write a simple test case.
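For reference, the label-reading path looks roughly like this (a minimal sketch, not the actual model code; the feature name "bbox" and the [N, 4] box layout are made up for illustration):

import tensorflow as tf

# Hypothetical single serialized tf.Example; the real code reads MSCOCO records.
serialized_example = tf.placeholder(tf.string, shape=[])
features = tf.parse_single_example(
    serialized_example,
    features={"bbox": tf.VarLenFeature(tf.float32)})

# Densify the variable-length feature; an image with no labels yields shape [0, 4].
boxes = tf.reshape(tf.sparse_tensor_to_dense(features["bbox"]), [-1, 4])

# Map over the first dimension; with zero rows this is roughly where the SIGFPE shows up.
widths = tf.map_fn(lambda box: box[2] - box[0], boxes)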
Ok, I managed to reproduce it in a simple test case: https://gist.github.com/Panaetius/8da26064491ab4ad1890bca3dfd86eff The test case can probably be made smaller still; for instance, the whole "save data, then load it with tf.VarLenFeature" round trip is probably not needed, and you could just construct an empty SparseTensor directly, though I couldn't get that to work when I gave it a quick try. Basically, calling tf.map_fn with an empty tensor ([ ]) leads to the SIGFPE, most likely when x[0] is evaluated in the lambda. I'll change the title of the issue to reflect that this is an issue with tf.map_fn. Of course I'll have to change my code to handle empty/missing labels correctly, so I doubt this will affect me much in the future, but tf.map_fn trying to run over an empty tensor and then crashing outright seems like a bug that should be fixed.
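Stripped down even further, the failure mode should look roughly like this (a minimal sketch, not the exact gist; on an affected build the sess.run call kills the process with SIGFPE instead of raising a Python error):

import tensorflow as tf

# An empty float tensor of shape [0, 2]: zero elements to map over.
empty = tf.zeros([0, 2], dtype=tf.float32)

# map_fn over the (empty) first dimension; the x[0] indexing in the lambda
# is where the arithmetic exception appears to be triggered.
result = tf.map_fn(lambda x: x[0], empty)

with tf.Session() as sess:
    print(sess.run(result))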
@yuanbyu, do you think there is a way we could make this failure case in tf.map_fn more intuitive to detect?
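Until that happens, a guard along these lines seems like a reasonable workaround (a sketch only, using a hypothetical safe_map_fn helper; it assumes fn maps each element to a scalar and simply skips the map when the input is empty):

import tensorflow as tf

def safe_map_fn(fn, elems, dtype):
    # Hypothetical helper: only run map_fn when there is at least one element
    # along the first dimension, otherwise return an empty tensor of the
    # expected dtype (assumes fn returns a scalar per element).
    return tf.cond(
        tf.greater(tf.shape(elems)[0], 0),
        lambda: tf.map_fn(fn, elems, dtype=dtype),
        lambda: tf.zeros([0], dtype=dtype))

boxes = tf.zeros([0, 2], dtype=tf.float32)  # e.g. an image with no labels
widths = safe_map_fn(lambda box: box[1] - box[0], boxes, tf.float32)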
Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!
I'm working on a custom implementation of Faster-RCNN. It runs fine for ~150-300 batches, but then I get a floating point exception, apparently in ConcatGPUImpl.
Source is here: https://github.com/Panaetius/woipv (src/models/train_model.py and src/models/model.py; it's still a bit of a mess since it's a work in progress), reproducible as of commit 6eb1e3c5e818919b64a0a981abb98c2f3bc3dea1.
Environment info
Operating System: Ubuntu 16.04
Installed version of CUDA and cuDNN: CUDA 8.0, cuDNN 5
Output of ls -l /usr/local/cuda/lib64/libcud*:
-rw-r--r-- 1 root root 558720 Okt 4 23:15 /usr/local/cuda/lib64/libcudadevrt.a
lrwxrwxrwx 1 root root 16 Okt 4 23:15 /usr/local/cuda/lib64/libcudart.so -> libcudart.so.8.0
lrwxrwxrwx 1 root root 19 Okt 4 23:15 /usr/local/cuda/lib64/libcudart.so.8.0 -> libcudart.so.8.0.44
-rwxr-xr-x 1 root root 415432 Okt 4 23:15 /usr/local/cuda/lib64/libcudart.so.8.0.44
-rw-r--r-- 1 root root 775162 Okt 4 23:15 /usr/local/cuda/lib64/libcudart_static.a
lrwxrwxrwx 1 root root 13 Okt 4 23:23 /usr/local/cuda/lib64/libcudnn.so -> libcudnn.so.5
lrwxrwxrwx 1 root root 17 Okt 4 23:23 /usr/local/cuda/lib64/libcudnn.so.5 -> libcudnn.so.5.1.5
-rwxr-xr-x 1 root root 79337624 Okt 4 23:23 /usr/local/cuda/lib64/libcudnn.so.5.1.5
-rw-r--r-- 1 root root 69756172 Okt 4 23:23 /usr/local/cuda/lib64/libcudnn_static.a
Installed from binary pip package; output of python -c "import tensorflow; print(tensorflow.__version__)":
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
1.0.0
Logs or other output that would be helpful
gdb:
Thread 49 "python" received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0x7fff10bb8700 (LWP 8227)]
0x00007fffcaab91de in void tensorflow::ConcatGPUImpl<float, int>(Eigen::GpuDevice const&, tensorflow::CudaDeviceArrayStruct<float const*, 8> const&, tensorflow::CudaDeviceArrayStruct<int, 8> const&, bool, int, tensorflow::TTypes<float, 2, long>::Matrix*) ()
from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
Backtrace:
#0 0x00007fffcaab91de in void tensorflow::ConcatGPUImpl<float, int>(Eigen::GpuDevice const&, tensorflow::CudaDeviceArrayStruct<float const*, 8> const&, tensorflow::CudaDeviceArrayStruct<int, 8> const&, bool, int, tensorflow::TTypes<float, 2, long>::Matrix*) () from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#1 0x00007fffcaaaf45c in void tensorflow::(anonymous namespace)::ConcatGPUCall<float, int>(tensorflow::OpKernelContext*, std::vector<std::unique_ptr<tensorflow::TTypes<float, 2, long>::ConstMatrix, std::default_delete<tensorflow::TTypes<float, 2, long>::ConstMatrix> >, std::allocator<std::unique_ptr<tensorflow::TTypes<float, 2, long>::ConstMatrix, std::default_delete<tensorflow::TTypes<float, 2, long>::ConstMatrix> > > > const&, tensorflow::TTypes<float, 2, long>::Tensor*) ()
from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#2 0x00007fffc9b14c16 in tensorflow::TensorArrayPackOrGatherOp<Eigen::GpuDevice, float, false>::Compute(tensorflow::OpKernelContext*) () from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#3 0x00007fffcad155b2 in tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#4 0x00007fffcad56183 in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) ()
from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#5 0x00007fffcad569fa in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(tensorflow::gtl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8> const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#6 0x00007fffcb09a960 in std::_Function_handler<void (), Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::NonBlockingThreadPoolTempl(int, tensorflow::thread::EigenEnvironment)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#7 0x00007fffcb099c10 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /home/zenon/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so
#8 0x00007fffc7f2c260 in ?? () from /home/zenon/anaconda3/envs/tensorflow/bin/../lib/libstdc++.so.6
#9 0x00007ffff76d16fa in start_thread (arg=0x7fff10bb8700) at pthread_create.c:333
#10 0x00007ffff6aefb5d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109