
Importing tensorflow after importing pytorch crashes inside tensorflow::port::TestCPUFeature #13615

Closed
yaroslavvb opened this Issue Oct 10, 2017 · 5 comments

@yaroslavvb (Contributor) commented Oct 10, 2017

TF version: 07bf1d3
PyTorch version: '0.2.0_4' (whatever was installed by default as of yesterday)

Crashing code

import torch
import tensorflow

Work-around

import tensorflow
import torch

Stacktrace

#0  0x00007f2d2f5e7577 in void std::__once_call_impl<std::_Bind_simple<void (*())()> >() ()
   from /home/yaroslav/anaconda3/envs/oct12/lib/python3.5/site-packages/torch/lib/libTHC.so.1
#1  0x00007f2d546d5a99 in __pthread_once_slow (
    once_control=0x7f2d0a470830 <tensorflow::port::(anonymous namespace)::cpuid_once_flag>, init_routine=0x7f2d2789e2a0 <__once_proxy>) at pthread_once.c:116
#2  0x00007f2d09ae7faa in void std::call_once<void (&)()>(std::once_flag&, void (&)()) ()
   from /home/yaroslav/anaconda3/envs/oct12/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so
#3  0x00007f2d09ae7fee in tensorflow::port::TestCPUFeature(tensorflow::port::CPUFeature) ()
   from /home/yaroslav/anaconda3/envs/oct12/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so
#4  0x00007f2d09942895 in _GLOBAL__sub_I_cpu_feature_guard.cc ()
   from /home/yaroslav/anaconda3/envs/oct12/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so
#5  0x00007f2d54de86ba in call_init (l=<optimized out>, argc=argc@entry=2, 
    argv=argv@entry=0x7ffe69b84308, env=env@entry=0x1ba0d00) at dl-init.c:72
#6  0x00007f2d54de87cb in call_init (env=0x1ba0d00, argv=0x7ffe69b84308, 
    argc=2, l=<optimized out>) at dl-init.c:30
#7  _dl_init (main_map=main_map@entry=0x284e6d0, argc=2, argv=0x7ffe69b84308, 
    env=0x1ba0d00) at dl-init.c:120
#8  0x00007f2d54ded8e2 in dl_open_worker (a=a@entry=0x7ffe69b7e310)
    at dl-open.c:575
#9  0x00007f2d54de8564 in _dl_catch_error (
    objname=objname@entry=0x7ffe69b7e300, 
    errstring=errstring@entry=0x7ffe69b7e308, 
    mallocedp=mallocedp@entry=0x7ffe69b7e2ff, 
    operate=operate@entry=0x7f2d54ded4d0 <dl_open_worker>, 
    args=args@entry=0x7ffe69b7e310) at dl-error.c:187
#10 0x00007f2d54decda9 in _dl_open (
    file=0x7f2d1ce72cc8 "/home/yaroslav/anaconda3/envs/oct12/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so", 
    mode=-2147483646, 
    caller_dlopen=0x7f2d54a68553 <_PyImport_FindSharedFuncptr+115>, nsid=-2, 
    argc=<optimized out>, argv=<optimized out>, env=0x1ba0d00)
    at dl-open.c:660
#11 0x00007f2d544c3f09 in dlopen_doit (a=a@entry=0x7ffe69b7e540)
    at dlopen.c:66
#12 0x00007f2d54de8564 in _dl_catch_error (objname=0x1b20de0, 
    errstring=0x1b20de8, mallocedp=0x1b20dd8, 
    operate=0x7f2d544c3eb0 <dlopen_doit>, args=0x7ffe69b7e540)
    at dl-error.c:187
#13 0x00007f2d544c4571 in _dlerror_run (
    operate=operate@entry=0x7f2d544c3eb0 <dlopen_doit>, 
    args=args@entry=0x7ffe69b7e540) at dlerror.c:163
#14 0x00007f2d544c3fa1 in __dlopen (file=<optimized out>, 
    mode=<optimized out>) at dlopen.c:87
#15 0x00007f2d54a68553 in _PyImport_FindSharedFuncptr (
    prefix=0x7f2d54aeceda "PyInit", 
    shortname=0x7f2d1c9d04d0 "_pywrap_tensorflow_internal", 
    pathname=0x7f2d1ce72cc8 "/home/yaroslav/anaconda3/envs/oct12/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so", fp=0x0)
    at ./Python/dynload_shlib.c:95
#16 0x00007f2d54a43ce7 in _PyImport_LoadDynamicModuleWithSpec (
    spec=0x7f2d1c9b3b00, fp=0x0) at ./Python/importdl.c:124
#17 0x00007f2d54a40aef in _imp_create_dynamic_impl (file=<optimized out>, 
    spec=0x7f2d1c9b3b00, module=<optimized out>) at Python/import.c:2031
#18 _imp_create_dynamic (module=<optimized out>, args=<optimized out>)
    at Python/clinic/import.c.h:282
#19 0x00007f2d549a1209 in PyCFunction_Call (func=0x7f2d54f77ee8, 
    args=0x7f2d1c9b3c88, kwds=<optimized out>) at Objects/methodobject.c:109
#20 0x00007f2d54a274fa in ext_do_call (nk=479935624, na=0, 
    flags=<optimized out>, pp_stack=0x7ffe69b7ea48, func=0x7f2d54f77ee8)
    at Python/ceval.c:5084
#21 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
    at Python/ceval.c:3328
#22 0x00007f2d54a2aa49 in _PyEval_EvalCodeWithName (_co=<optimized out>, 
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>, 
    argcount=2, kws=0x7f2d53040760, kwcount=0, defs=0x0, defcount=0, 
    kwdefs=0x0, closure=0x0, name=0x7f2d54f657b0, qualname=0x7f2d54f657b0)
    at Python/ceval.c:4071
#23 0x00007f2d54a2894c in fast_function (nk=<optimized out>, na=2, 
    n=<optimized out>, pp_stack=0x7ffe69b7ec68, func=0x7f2d54f830d0)
    at Python/ceval.c:4866
#24 call_function (oparg=<optimized out>, pp_stack=0x7ffe69b7ec68)
    at Python/ceval.c:4783
#25 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
    at Python/ceval.c:3289
#26 0x00007f2d54a28ccc in fast_function (nk=<optimized out>, na=2, 
    n=<optimized out>, pp_stack=0x7ffe69b7ede8, func=0x7f2d54f2a7b8)
    at Python/ceval.c:4856
#27 call_function (oparg=<optimized out>, pp_stack=0x7ffe69b7ede8)
    at Python/ceval.c:4783
#28 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
    at Python/ceval.c:3289
#29 0x00007f2d54a28ccc in fast_function (nk=<optimized out>, na=1, 
    n=<optimized out>, pp_stack=0x7ffe69b7ef68, func=0x7f2d54f83b70)
    at Python/ceval.c:4856
#30 call_function (oparg=<optimized out>, pp_stack=0x7ffe69b7ef68)
    at Python/ceval.c:4783
#31 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
    at Python/ceval.c:3289
#32 0x00007f2d54a28ccc in fast_function (nk=<optimized out>, na=1, 
    n=<optimized out>, pp_stack=0x7ffe69b7f0e8, func=0x7f2d54f83d90)
    at Python/ceval.c:4856
#33 call_function (oparg=<optimized out>, pp_stack=0x7ffe69b7f0e8)
    at Python/ceval.c:4783
#34 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
    at Python/ceval.c:3289
#35 0x00007f2d54a28ccc in fast_function (nk=<optimized out>, na=1, 
    n=<optimized out>, pp_stack=0x7ffe69b7f268, func=0x7f2d54f83e18)
    at Python/ceval.c:4856
#36 call_function (oparg=<optimized out>, pp_stack=0x7ffe69b7f268)

@allenlavoie (Member) commented Oct 10, 2017

Looks like PyTorch is exporting some pthread symbols via RTLD_GLOBAL. We don't use RTLD_GLOBAL anymore, but we're still exposed to those symbols.

Other than defensively using RTLD_DEEPBIND in pywrap_tensorflow.py (which does fix the crash for me, but would also break people who use LD_PRELOAD), I guess we just need to avoid using that symbol.
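
For reference, a minimal sketch of what defensively applying RTLD_DEEPBIND from the Python side could look like (this assumes a glibc system where os.RTLD_DEEPBIND is exposed; as noted above it would break LD_PRELOAD-based setups, so it is a band-aid rather than the fix that would actually land in pywrap_tensorflow.py):

import os
import sys

# Ask the dynamic loader to prefer a newly loaded module's own symbol
# definitions over symbols already made global (e.g. via RTLD_GLOBAL).
old_flags = sys.getdlopenflags()
sys.setdlopenflags(old_flags | os.RTLD_DEEPBIND)
try:
    import tensorflow as tf  # the TensorFlow C extension is dlopen'd with DEEPBIND
finally:
    sys.setdlopenflags(old_flags)  # restore the default flags for later imports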

@soumith commented Oct 11, 2017

Sorry for the issue; we'll fix it in the next PyTorch release with the solution described here: pytorch/pytorch#3059 (comment)

@drpngx (Member) commented Oct 16, 2017

Thanks @soumith !

@carlosgalvezp commented May 2, 2018

Hi @soumith !

I believe I am having similar issues. I'm not sure if I should open a new issue or continue here; please let me know!

My system is:

  • Ubuntu 16.04
  • TensorFlow 1.7.0 (pip)
  • PyTorch 0.4.0 (pip)
  • CUDA: 9.0.176
  • CuDNN: 7.0.5.15

I get SIGABRT when running my TensorFlow program if I import torch before importing tensorflow. It doesn't happen if I do it the other way around.

How to reproduce the problem:

  1. Download the MNIST example: https://github.com/tensorflow/models/blob/master/tutorials/image/mnist/convolutional.py

  2. Add import torch before import tensorflow as tf (see the sketch after these steps).

  3. Execute. The program will receive SIGABRT.
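
For clarity, a sketch of what the top of the modified convolutional.py looks like (only the added torch import and the tutorial's TensorFlow import are shown; the rest of the script and its other imports are unchanged):

import torch              # added: importing torch first is what triggers the SIGABRT
import tensorflow as tf   # the tutorial's original import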

Here's the backtrace:

(gdb) run convolutional.py 
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/python3 convolutional.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff3dc7700 (LWP 217)]
[New Thread 0x7ffff15c6700 (LWP 218)]
[New Thread 0x7fffeedc5700 (LWP 219)]
[New Thread 0x7fffec5c4700 (LWP 220)]
[New Thread 0x7fffe9dc3700 (LWP 221)]
[New Thread 0x7fffe95c2700 (LWP 222)]
[New Thread 0x7fffe6dc1700 (LWP 223)]
[Thread 0x7fffe6dc1700 (LWP 223) exited]
[Thread 0x7fffe95c2700 (LWP 222) exited]
[Thread 0x7fffe9dc3700 (LWP 221) exited]
[Thread 0x7fffec5c4700 (LWP 220) exited]
[Thread 0x7fffeedc5700 (LWP 219) exited]
[Thread 0x7ffff15c6700 (LWP 218) exited]
[Thread 0x7ffff3dc7700 (LWP 217) exited]
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
2018-05-02 08:06:25.663691: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[New Thread 0x7fffe6dc1700 (LWP 227)]
[New Thread 0x7fffe95c2700 (LWP 228)]
[New Thread 0x7fffe9dc3700 (LWP 229)]
[New Thread 0x7fffec5c4700 (LWP 230)]
[New Thread 0x7fff493ff700 (LWP 231)]
[New Thread 0x7fff48bfe700 (LWP 232)]
[New Thread 0x7fff483fd700 (LWP 233)]
[New Thread 0x7fff47bfc700 (LWP 234)]
[New Thread 0x7fff45400700 (LWP 235)]
[New Thread 0x7fff44bff700 (LWP 236)]
[New Thread 0x7fff3ffff700 (LWP 237)]
[New Thread 0x7fff3f7fe700 (LWP 238)]
2018-05-02 08:06:25.755908: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-05-02 08:06:25.756207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: 
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:01:00.0
totalMemory: 3.94GiB freeMemory: 3.26GiB
2018-05-02 08:06:25.756248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-05-02 08:06:26.002388: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-02 08:06:26.002425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2018-05-02 08:06:26.002434: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2018-05-02 08:06:26.002678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2966 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
[New Thread 0x7ffefdbb8700 (LWP 239)]
[New Thread 0x7ffef53b7700 (LWP 240)]
[New Thread 0x7ffefd3b7700 (LWP 241)]
[New Thread 0x7ffefcbb6700 (LWP 242)]
[New Thread 0x7ffef7fff700 (LWP 243)]
[New Thread 0x7ffef77fe700 (LWP 244)]
[New Thread 0x7ffef6ffd700 (LWP 245)]
[New Thread 0x7ffef67fc700 (LWP 246)]
[New Thread 0x7ffef5ffb700 (LWP 247)]
[New Thread 0x7ffef4bb6700 (LWP 248)]
Initialized!
2018-05-02 08:06:28.027051: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-05-02 08:06:28.031781: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-05-02 08:06:28.035610: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-05-02 08:06:28.039622: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-05-02 08:06:28.043484: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-05-02 08:06:28.046642: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-05-02 08:06:28.054981: E tensorflow/stream_executor/cuda/cuda_dnn.cc:403] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2018-05-02 08:06:28.055044: F tensorflow/core/kernels/conv_ops.cc:712] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms) 

Thread 30 "python3" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffef4bb6700 (LWP 248)]
0x00007ffff7825428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007ffff7825428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff782702a in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fff849b8c34 in tensorflow::internal::LogMessageFatal::~LogMessageFatal() () from /usr/local/lib/python3.5/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3  0x00007fff846b166a in tensorflow::LaunchConv2DOp<Eigen::GpuDevice, float>::operator()(tensorflow::OpKernelContext*, bool, bool, tensorflow::Tensor const&, tensorflow::Tensor const&, int, int, int, int, tensorflow::Padding const&, tensorflow::Tensor*, tensorflow::TensorFormat) () from /usr/local/lib/python3.5/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4  0x00007fff846b3ce3 in tensorflow::Conv2DOp<Eigen::GpuDevice, float>::Compute(tensorflow::OpKernelContext*) ()
   from /usr/local/lib/python3.5/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5  0x00007fff7fc23289 in tensorflow::BaseGPUDevice::ComputeHelper(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
   from /usr/local/lib/python3.5/dist-packages/tensorflow/python/../libtensorflow_framework.so
#6  0x00007fff7fc23750 in tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
   from /usr/local/lib/python3.5/dist-packages/tensorflow/python/../libtensorflow_framework.so
#7  0x00007fff7fc5d365 in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) ()
   from /usr/local/lib/python3.5/dist-packages/tensorflow/python/../libtensorflow_framework.so
#8  0x00007fff7fc5db7a in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(tensorflow::gtl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8> const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /usr/local/lib/python3.5/dist-packages/tensorflow/python/../libtensorflow_framework.so
#9  0x00007fff7f8ce8ba in Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()
   from /usr/local/lib/python3.5/dist-packages/tensorflow/python/../libtensorflow_framework.so
#10 0x00007fff7f8cd962 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /usr/local/lib/python3.5/dist-packages/tensorflow/python/../libtensorflow_framework.so
#11 0x00007fffee8e6c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#12 0x00007ffff7bc16ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007ffff78f741d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thanks!

@soumith commented May 2, 2018

@carlosgalvezp this might be expected. TF is basically resolving a mix of PyTorch's statically linked cuDNN symbols and the system libcudnn.so's symbols, which are probably different versions. We'll try to fix this in the next release.
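
In the meantime, a minimal diagnostic sketch for checking whether two different cuDNN builds are in play (this assumes Linux with a resolvable libcudnn.so; cudnnGetVersion is the standard cuDNN call for querying the runtime version):

import ctypes
import torch

# cuDNN version reported by PyTorch (its statically linked copy)
print("torch cuDNN:", torch.backends.cudnn.version())

# cuDNN version of the system library that TensorFlow resolves against
libcudnn = ctypes.CDLL("libcudnn.so")
libcudnn.cudnnGetVersion.restype = ctypes.c_size_t
print("system cuDNN:", libcudnn.cudnnGetVersion())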
