crash via tf_should_use format_stack #22770

Open
albertz opened this issue Oct 5, 2018 · 27 comments
Labels: comp:ops (OPs related issues), TF 2.9 (Issues found in the TF 2.9 release or RCs), type:performance (Performance Issue)

@albertz
Contributor

albertz commented Oct 5, 2018

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: -
  • TensorFlow installed from (source or binary): binary (pip)
  • TensorFlow version (use command below): v1.11.0-0-gc19e29306c 1.11.0
  • Python version: 3.6.3
  • Bazel version (if compiling from source): -
  • GCC/Compiler version (if compiling from source): -
  • CUDA/cuDNN version: 8.0
  • GPU model and memory: GTX 680 (will not be used)
  • Exact command to reproduce: -

Describe the problem

When __repr__ is called on some TF objects at the wrong time, this can lead to a crash (seg fault; see below). There are various ways this can happen, e.g. when a debugger shows the locals of all threads. My case was the following, though I don't think the specifics matter:

  • Via better_exchook, I extended the output of sys.excepthook and some traceback functions to print out some local vars and their __repr__ output. There is something similar for IPython.
  • I created some tf.TensorArray and called unstack and I did not use the result value. That unstack method is wrapped via should_use_result.
  • The Python GC called the _TFShouldUseHelper.__del__ function at some random point, and this triggered the stack formatting and then a call to the __repr__ of some TF objects.

Originally, this happened at exit, and I thought that probably it's just not safe at exit to touch any existing TF objects. So I fixed that case in better_exchook: It will not print any vars at exit. A test case to reproduce exactly that case is here.

However, now I get the same crash also not at exit but at another random point (see stack below). It will be hard to come up with a test case for this, as it is very non-deterministic when exactly the GC runs and calls the __del__ function.
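
The timing problem described above can be illustrated without TensorFlow. The following is a minimal sketch, assuming CPython; `WarnIfUnused` and `unrelated_work` are hypothetical names and not TensorFlow code. An object caught in a reference cycle is only finalized when the cyclic GC runs, so its `__del__` executes inside whatever unrelated code happens to trigger the collection:

```python
import gc
import traceback

events = []


class WarnIfUnused:
    """Hypothetical stand-in for a 'should use' helper (not TensorFlow's class)."""

    def __init__(self):
        # A reference cycle: refcounting alone never frees this object,
        # only the cyclic garbage collector does.
        self.cycle = self

    def __del__(self):
        # The last stack entry is this __del__ frame; the one before it is
        # whatever Python code was running when the collector fired.
        interrupted = traceback.extract_stack()[-2]
        events.append(interrupted.name)


def unrelated_work():
    # Stands in for any allocation that happens to trigger a collection.
    gc.collect()


gc.disable()  # keep the automatic collector from firing early in this demo
try:
    WarnIfUnused()          # result is discarded, but the cycle keeps it alive
    before = list(events)   # finalizer has not run yet
    unrelated_work()
    after = list(events)    # finalizer ran inside unrelated_work
finally:
    gc.enable()

print(before, after)
```

Here `before` is empty and `after` names the function that was running when the collector fired. That matches the Python stack below, where `__del__` appears directly on top of an unrelated `ops.py` frame.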

Source code / logs

Current thread 0x00007f14209e8700 (most recent call first):
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1897 in name
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 352 in name
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 614 in __repr__
  File "/u/zeyer/setups/librispeech/2018-02-26--att/returnn/tests/../better_exchook.py", line 250 in pretty_print
  File "/u/zeyer/setups/librispeech/2018-02-26--att/returnn/tests/../better_exchook.py", line 487 in format_py_obj
  File "/u/zeyer/setups/librispeech/2018-02-26--att/returnn/tests/../better_exchook.py", line 571 in <lambda>
  File "/u/zeyer/setups/librispeech/2018-02-26--att/returnn/tests/../better_exchook.py", line 522 in _trySet
  File "/u/zeyer/setups/librispeech/2018-02-26--att/returnn/tests/../better_exchook.py", line 571 in format_tb
  File "/u/zeyer/.linuxbrew/opt/python3/lib/python3.6/traceback.py", line 37 in format_list
  File "/u/zeyer/.linuxbrew/opt/python3/lib/python3.6/traceback.py", line 193 in format_stack
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/util/tf_should_use.py", line 60 in __del__
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 81 in __init__
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4181 in _add_device_to_stack
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4243 in device
  File "/u/zeyer/.linuxbrew/opt/python3/lib/python3.6/contextlib.py", line 81 in __enter__
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3366 in _GroupControlDeps
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3415 in group
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3486 in tuple
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 791 in _GradientsHelper
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 596 in gradients
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 517 in compute_gradients
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 401 in minimize
  File "tests/test_TFNetworkRecLayer.py", line 219 in test_rhn_nan
  File "tests/test_TFNetworkRecLayer.py", line 2175 in <module>

"ops.py", line 1897 in name, that is this code:

  @property
  def name(self):
    """The full name of this operation."""
    return c_api.TF_OperationName(self._c_op)

I often also see this just before the crash:

pure virtual method called

A Travis log with this crash can also be seen here, or here.

The C backtrace is this:

/lib/x86_64-linux-gnu/libpthread.so.0(raise+0x29)[0x7f7e8df1a269]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f7e8df1a390]
/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(TF_OperationName+0xa)[0x7f7e5ccc0eca]
/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x1982264)[0x7f7e5ca78264]
/u/zeyer/.linuxbrew/Cellar/python3/3.6.3/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallDict+0x209)[0x7f7e8e1f61c9]
...
albertz added a commit to albertz/py_better_exchook that referenced this issue Oct 7, 2018
@ymodak ymodak assigned alextp and unassigned ymodak Oct 15, 2018
@alextp alextp assigned ebrevdo and unassigned alextp Oct 15, 2018
@alextp
Contributor

alextp commented Oct 15, 2018

@ebrevdo any idea what could be causing this?

@albertz
Contributor Author

albertz commented Oct 15, 2018

My guess: Some Swig internals, which do not expect a thread change in certain contexts (which is triggered here by the Python GC calling __del__ in some unexpected context).

@ebrevdo
Contributor

ebrevdo commented Oct 15, 2018

@allenlavoie may have insight.

@allenlavoie
Member

Nothing jumps out to me as an obvious cause. Sounds like this needs debugging, and without a more concrete reproduction I'm not sure there's much to be done.

Is there a loop you can construct which eventually results in this bug being triggered?

@ebrevdo
Contributor

ebrevdo commented Nov 22, 2018 via email

@albertz
Contributor Author

albertz commented Nov 22, 2018 via email

@ebrevdo
Contributor

ebrevdo commented Nov 22, 2018

I removed better_exchook (removed the import, and the two commands in main) and am still not able to replicate in py3.5.

@albertz
Contributor Author

albertz commented Nov 22, 2018 via email

@ebrevdo
Contributor

ebrevdo commented Nov 22, 2018 via email

@ebrevdo
Contributor

ebrevdo commented Nov 22, 2018

ok i was able to replicate the issue. gonna see if i can run this under address sanitizer...

@ebrevdo
Contributor

ebrevdo commented Nov 23, 2018

OK; asan picked something up:

Colocations handled automatically by placer.
variables:
[<tf.Variable 'b:0' shape=(10, 1, 6) dtype=float32_ref>]
init vars
graph size: 8668
train
step 0, loss: 1.596843
EXCEPTION
Traceback (most recent call last):
  File "test-tf111-tfshoulduse-crash.py", line 75, in test
    line: raise Exception("foo")
    locals:
      Exception = <builtin> <class 'Exception'>
Exception: foo
Exit blah.
atexit handler
EXCEPTION
Traceback (most recent call last):
  File "test-tf111-tfshoulduse-crash.py", line 87, in at_exit_handler
    line: raise Exception("foo")
    locals:
      Exception = <builtin> <class 'Exception'>
Exception: foo
Dummy Goodbye
=================================================================
==229269==ERROR: AddressSanitizer: heap-use-after-free on address 0x62500077e338 at pc 0x55e64a6b258e bp 0x7ffffdcce0c0 sp 0x7ffffdcce0b8
READ of size 8 at 0x62500077e338 thread T0
    #0 0x55e64a6b258d in std::__shared_ptr<tensorflow::NodeProperties, (__gnu_cxx::_Lock_policy)2>::operator->() const crosstool/stable/toolchain/bin/../lib/gcc/x86_64-linux-gnu/version/../../../../x86_64-linux-gnu/include/c++/version/bits/shared_ptr_base.h:1046:9
    #1 0x55e64a6a58df in tensorflow::Node::name() const tensorflow/core/graph/graph.cc:159:43
    #2 0x55e63bd73458 in TF_OperationName tensorflow/c/c_api.cc:1418:21
    #3 0x7fc5b33dfa94 in _wrap_TF_OperationName(_object*, _object*) bazel-out/asan-py3-dbg/genfiles/tensorflow/python/_third__party_tensorflow_python_pywrap__tensorflow__internal.cc:12098:22
    #4 0x55e64d31f265 in _PyCFunction_FastCallDict python_runtime/v3_6/Objects/methodobject.c:234:22
    #5 0x55e64d3a407d in call_function python_runtime/v3_6/Python/ceval.c:4837:9
    #6 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #7 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
    #8 0x55e64d39bd63 in PyEval_EvalCodeEx python_runtime/v3_6/Python/ceval.c:4187:12
    #9 0x55e64d307a85 in function_call python_runtime/v3_6/Objects/funcobject.c:604:14
    #10 0x55e64d2d5e2a in PyObject_Call python_runtime/v3_6/Objects/abstract.c:2261:14
    #11 0x55e64d2f27d2 in property_descr_get python_runtime/v3_6/Objects/descrobject.c:1384:11
    #12 0x55e64d322ba8 in _PyObject_GenericGetAttrWithDict python_runtime/v3_6/Objects/object.c:1066:19
    #13 0x55e64d39ff59 in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:2872:29
    #14 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
    #15 0x55e64d39bd63 in PyEval_EvalCodeEx python_runtime/v3_6/Python/ceval.c:4187:12
    #16 0x55e64d307a85 in function_call python_runtime/v3_6/Objects/funcobject.c:604:14
    #17 0x55e64d2d5e2a in PyObject_Call python_runtime/v3_6/Objects/abstract.c:2261:14
    #18 0x55e64d2f27d2 in property_descr_get python_runtime/v3_6/Objects/descrobject.c:1384:11
    #19 0x55e64d322ba8 in _PyObject_GenericGetAttrWithDict python_runtime/v3_6/Objects/object.c:1066:19
    #20 0x55e64d39ff59 in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:2872:29
    #21 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
    #22 0x55e64d39bd63 in PyEval_EvalCodeEx python_runtime/v3_6/Python/ceval.c:4187:12
    #23 0x55e64d307a85 in function_call python_runtime/v3_6/Objects/funcobject.c:604:14
    #24 0x55e64d2d5e2a in PyObject_Call python_runtime/v3_6/Objects/abstract.c:2261:14
    #25 0x55e64d2f27d2 in property_descr_get python_runtime/v3_6/Objects/descrobject.c:1384:11
    #26 0x55e64d322ba8 in _PyObject_GenericGetAttrWithDict python_runtime/v3_6/Objects/object.c:1066:19
    #27 0x55e64d39ff59 in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:2872:29
    #28 0x55e64d3a5606 in _PyFunction_FastCall python_runtime/v3_6/Python/ceval.c:4919:14
    #29 0x55e64d2d60a1 in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2310:18
    #30 0x55e64d2d6295 in _PyObject_Call_Prepend python_runtime/v3_6/Objects/abstract.c:2373:14
    #31 0x55e64d2d605b in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2331:18
    #32 0x55e64d335e89 in slot_tp_repr python_runtime/v3_6/Objects/typeobject.c:6127:15
    #33 0x55e64d321d26 in PyObject_Repr python_runtime/v3_6/Objects/object.c:490:11
    #34 0x55e64d32edca in tuplerepr python_runtime/v3_6/Objects/tupleobject.c:303:13
    #35 0x55e64d321d26 in PyObject_Repr python_runtime/v3_6/Objects/object.c:490:11
    #36 0x55e64d31f3b2 in _PyCFunction_FastCallDict python_runtime/v3_6/Objects/methodobject.c
    #37 0x55e64d3a407d in call_function python_runtime/v3_6/Python/ceval.c:4837:9
    #38 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #39 0x55e64d3a5606 in _PyFunction_FastCall python_runtime/v3_6/Python/ceval.c:4919:14
    #40 0x55e64d3a405a in call_function python_runtime/v3_6/Python/ceval.c:4858:17
    #41 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #42 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
    #43 0x55e64d3a519b in fast_function python_runtime/v3_6/Python/ceval.c:4978:12
    #44 0x55e64d3a405a in call_function python_runtime/v3_6/Python/ceval.c:4858:17
    #45 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #46 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
    #47 0x55e64d3a519b in fast_function python_runtime/v3_6/Python/ceval.c:4978:12
    #48 0x55e64d3a405a in call_function python_runtime/v3_6/Python/ceval.c:4858:17
    #49 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #50 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
    #51 0x55e64d3a519b in fast_function python_runtime/v3_6/Python/ceval.c:4978:12
    #52 0x55e64d3a405a in call_function python_runtime/v3_6/Python/ceval.c:4858:17
    #53 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #54 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
    #55 0x55e64d3a519b in fast_function python_runtime/v3_6/Python/ceval.c:4978:12
    #56 0x55e64d3a405a in call_function python_runtime/v3_6/Python/ceval.c:4858:17
    #57 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #58 0x55e64d3a5606 in _PyFunction_FastCall python_runtime/v3_6/Python/ceval.c:4919:14
    #59 0x55e64d3a405a in call_function python_runtime/v3_6/Python/ceval.c:4858:17
    #60 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #61 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
    #62 0x55e64d3a519b in fast_function python_runtime/v3_6/Python/ceval.c:4978:12
    #63 0x55e64d3a405a in call_function python_runtime/v3_6/Python/ceval.c:4858:17
    #64 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #65 0x55e64d3a5606 in _PyFunction_FastCall python_runtime/v3_6/Python/ceval.c:4919:14
    #66 0x55e64d2d60a1 in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2310:18
    #67 0x55e64d2d6295 in _PyObject_Call_Prepend python_runtime/v3_6/Objects/abstract.c:2373:14
    #68 0x55e64d2d605b in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2331:18
    #69 0x55e64d336a4f in slot_tp_finalize python_runtime/v3_6/Objects/typeobject.c:6463:15
    #70 0x55e64cd7e172 in finalize_garbage python_runtime/v3_6/Modules/gcmodule.c:806:13
    #71 0x55e64cd7b1b7 in collect python_runtime/v3_6/Modules/gcmodule.c:1005:5
    #72 0x55e64cd7ad19 in collect_with_callback python_runtime/v3_6/Modules/gcmodule.c:1128:14
    #73 0x55e64cd7ab60 in PyGC_Collect python_runtime/v3_6/Modules/gcmodule.c:1594:13
    #74 0x55e64d3cb556 in Py_FinalizeEx python_runtime/v3_6/Python/pylifecycle.c:601:5

0x62500077e338 is located 2616 bytes inside of 8192-byte region [0x62500077d900,0x62500077f900)
freed by thread T0 here:
    #0 0x55e62bb55ac2 in __interceptor_free llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:124:3
    #1 0x55e64aa37578 in tensorflow::core::Arena::~Arena() tensorflow/core/lib/core/arena.cc:66:5
    #2 0x55e64a6a928b in tensorflow::Graph::~Graph() tensorflow/core/graph/graph.cc:372:1
    #3 0x55e63bd873e5 in TF_Graph::~TF_Graph() tensorflow/c/c_api_internal.h:75:8
    #4 0x55e63bd7fb9d in TF_DeleteSession tensorflow/c/c_api.cc:2588:14
    #5 0x7fc5b33f65dc in _wrap_TF_DeleteSession(_object*, _object*) bazel-out/asan-py3-dbg/genfiles/tensorflow/python/_third__party_tensorflow_python_pywrap__tensorflow__internal.cc:16303:5
    #6 0x55e64d31f265 in _PyCFunction_FastCallDict python_runtime/v3_6/Objects/methodobject.c:234:22
    #7 0x55e64d3a407d in call_function python_runtime/v3_6/Python/ceval.c:4837:9
    #8 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #9 0x55e64d3a5606 in _PyFunction_FastCall python_runtime/v3_6/Python/ceval.c:4919:14
    #10 0x55e64d2d60a1 in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2310:18
    #11 0x55e64d2d6295 in _PyObject_Call_Prepend python_runtime/v3_6/Objects/abstract.c:2373:14
    #12 0x55e64d2d605b in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2331:18
    #13 0x55e64d336a4f in slot_tp_finalize python_runtime/v3_6/Objects/typeobject.c:6463:15
    #14 0x55e64cd7e172 in finalize_garbage python_runtime/v3_6/Modules/gcmodule.c:806:13
    #15 0x55e64cd7b1b7 in collect python_runtime/v3_6/Modules/gcmodule.c:1005:5
    #16 0x55e64cd7ad19 in collect_with_callback python_runtime/v3_6/Modules/gcmodule.c:1128:14
    #17 0x55e64cd7ab60 in PyGC_Collect python_runtime/v3_6/Modules/gcmodule.c:1594:13
    #18 0x55e64d3cb556 in Py_FinalizeEx python_runtime/v3_6/Python/pylifecycle.c:601:5

previously allocated by thread T0 here:
    #0 0x55e62bb56ac9 in __interceptor_posix_memalign llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:219:3
    #1 0x55e62ebbd292 in aligned_malloc(unsigned long, unsigned long) base/port.h:897:7
    #2 0x55e64aa37200 in tensorflow::core::Arena::Arena(unsigned long) tensorflow/core/lib/core/arena.cc:54:31
    #3 0x55e64a6a7d4c in tensorflow::Graph::Graph(tensorflow::OpRegistryInterface const*) tensorflow/core/graph/graph.cc:323:7
    #4 0x55e63bd792bc in TF_Graph::TF_Graph() tensorflow/c/c_api.cc:1854:7
    #5 0x55e63bd7942a in TF_NewGraph tensorflow/c/c_api.cc:1860:38
    #6 0x7fc5b33d3f88 in _wrap_TF_NewGraph(_object*, _object*) bazel-out/asan-py3-dbg/genfiles/tensorflow/python/_third__party_tensorflow_python_pywrap__tensorflow__internal.cc:10165:26
    #7 0x55e64d31f265 in _PyCFunction_FastCallDict python_runtime/v3_6/Objects/methodobject.c:234:22
    #8 0x55e64d3a407d in call_function python_runtime/v3_6/Python/ceval.c:4837:9
    #9 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #10 0x55e64d3a5606 in _PyFunction_FastCall python_runtime/v3_6/Python/ceval.c:4919:14
    #11 0x55e64d2d60a1 in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2310:18
    #12 0x55e64d2d6295 in _PyObject_Call_Prepend python_runtime/v3_6/Objects/abstract.c:2373:14
    #13 0x55e64d2d5e2a in PyObject_Call python_runtime/v3_6/Objects/abstract.c:2261:14
    #14 0x55e64d336908 in slot_tp_init python_runtime/v3_6/Objects/typeobject.c:6420:11
    #15 0x55e64d331fc8 in type_call python_runtime/v3_6/Objects/typeobject.c:915:19
    #16 0x55e64d2d605b in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2331:18
    #17 0x55e64d3a4053 in call_function python_runtime/v3_6/Python/ceval.c:4861:17
    #18 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #19 0x55e64d3a5606 in _PyFunction_FastCall python_runtime/v3_6/Python/ceval.c:4919:14
    #20 0x55e64d2d60a1 in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2310:18
    #21 0x55e64d2d6295 in _PyObject_Call_Prepend python_runtime/v3_6/Objects/abstract.c:2373:14
    #22 0x55e64d2d5e2a in PyObject_Call python_runtime/v3_6/Objects/abstract.c:2261:14
    #23 0x55e64d336908 in slot_tp_init python_runtime/v3_6/Objects/typeobject.c:6420:11
    #24 0x55e64d331fc8 in type_call python_runtime/v3_6/Objects/typeobject.c:915:19
    #25 0x55e64d2d605b in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2331:18
    #26 0x55e64d3a4053 in call_function python_runtime/v3_6/Python/ceval.c:4861:17
    #27 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #28 0x55e64d308a38 in gen_send_ex python_runtime/v3_6/Objects/genobject.c:189:14
    #29 0x55e64d398cf8 in builtin_next python_runtime/v3_6/Python/bltinmodule.c:1330:11

SUMMARY: AddressSanitizer: heap-use-after-free crosstool/stable/toolchain/bin/../lib/gcc/x86_64-linux-gnu/version/../../../../x86_64-linux-gnu/include/c++/version/bits/shared_ptr_base.h:1046:9 in std::__shared_ptr<tensorflow::NodeProperties, (__gnu_cxx::_Lock_policy)2>::operator->() const
Shadow bytes around the buggy address:
  0x0c4a800e7c10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a800e7c20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a800e7c30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a800e7c40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a800e7c50: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c4a800e7c60: fa fa fa fa fa fa fa[fa]fa fa fa fa fa fa fa fa
  0x0c4a800e7c70: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a800e7c80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a800e7c90: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a800e7ca0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a800e7cb0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==229269==ABORTING

@ebrevdo
Contributor

ebrevdo commented Nov 23, 2018

@allenlavoie looks like in this case (test-tf111-tfshoulduse-crash.py in python3 with better_exchook == 20171121.105512) we attempt to access graph data after it has been deleted; presumably this is caused by an interaction with tf_should_use format_stack.

@ebrevdo
Contributor

ebrevdo commented Nov 23, 2018

Perhaps we can be more careful about when we call format_stack? We do this lazily to avoid the cost of formatting, but is there a way to check that the graph in the stack still exists?

@ebrevdo
Contributor

ebrevdo commented Nov 23, 2018

We could also consider sanitizing the stack before formatting.
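
One possible reading of "sanitizing the stack" is to record only plain strings at creation time, so that formatting later never needs to call `__repr__` on live objects. The sketch below uses only the standard `traceback` module; `CreationSite` and `build` are hypothetical names, this is not how `tf_should_use` is implemented, and whether it would have avoided the crash in this issue is not established here:

```python
import traceback


class CreationSite:
    """Hypothetical helper (not TensorFlow code): remembers where it was created."""

    def __init__(self):
        # extract_stack() yields FrameSummary entries: file name, line number,
        # function name and source text. No frame objects or locals are kept.
        self._frames = traceback.extract_stack()[:-1]  # drop this __init__ frame

    def format(self):
        # Formatting later only joins strings taken from the summaries.
        return "".join(traceback.format_list(self._frames))


def build():
    payload = object()  # a local that the formatted stack never touches
    return CreationSite()


site = build()
text = site.format()
print(text)
```

The trade-off is that the cheap `extract_stack()` call happens eagerly for every helper, while the string formatting stays lazy.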

@albertz
Contributor Author

albertz commented Nov 26, 2018

So, to make it clear: There is a Python object which corresponds to a graph in C++ which does not exist anymore, or has become invalid? How is this possible? This is via Swig, right? I thought that Swig does some sort of reference counting.

Or does the C++ graph object itself still exist, but accessing it becomes invalid? Is there a flag or so that marks that the object is invalid now? Maybe there should just be a check for this flag and if the object is invalid, any related functions should return some sane value (None or so) or throw a Python exception, instead of this crash?

I feel like cleaning/sanitizing the stack trace to try to avoid any possible access to such objects is just a workaround to the problem.
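
To make the flag idea concrete, here is a pure-Python sketch. `GraphHandle` and `Operation` are hypothetical classes invented for illustration, and the `deleted` attribute is an assumption; nothing in this thread confirms that TensorFlow has such a flag:

```python
class GraphHandle:
    """Hypothetical owner of the native graph; `deleted` is the validity flag."""

    def __init__(self):
        self.deleted = False

    def delete(self):
        # In a real binding this is where the native memory would be freed.
        self.deleted = True


class Operation:
    """Hypothetical wrapper that checks the flag before touching native data."""

    def __init__(self, graph, name):
        self._graph = graph
        self._name = name

    @property
    def name(self):
        if self._graph.deleted:
            # A Python exception instead of dereferencing freed memory.
            raise RuntimeError("graph has been deleted")
        return self._name

    def __repr__(self):
        # __repr__ must stay safe, since debuggers and tracebacks call it.
        try:
            return "<Operation %r>" % self.name
        except RuntimeError:
            return "<Operation (graph deleted)>"


graph = GraphHandle()
op = Operation(graph, "add")
repr_alive = repr(op)
graph.delete()
repr_deleted = repr(op)
print(repr_alive, repr_deleted)
```

With a check like this, a late `__repr__` call from a finalizer or a debugger would produce a placeholder string rather than a segfault.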

albertz added a commit to albertz/playground that referenced this issue Nov 26, 2018
@albertz
Contributor Author

albertz commented Nov 26, 2018

I tried to write a simpler test case. See the commit I just referenced. That code sometimes crashes in various different ways.

@allenlavoie
Member

Oh interesting, good find. So maybe we just need to set some Python properties to None when the destructor for the C Graph object runs?

c_api.TF_DeleteGraph(self.graph)
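
Clearing Python-side properties when the C graph is destroyed could look roughly like this. It is a pure-Python sketch with hypothetical `Graph` and `Op` classes; the dict stands in for a native pointer, and nothing here reflects how TensorFlow actually tracks its operations:

```python
import weakref


class Op:
    def __init__(self, name):
        self._c_op = {"name": name}  # stand-in for a native operation pointer

    @property
    def name(self):
        # After invalidation there is nothing native left to dereference.
        return None if self._c_op is None else self._c_op["name"]


class Graph:
    def __init__(self):
        # Weak references: the graph must not keep its operations alive.
        self._ops = weakref.WeakSet()

    def create_op(self, name):
        op = Op(name)
        self._ops.add(op)
        return op

    def close(self):
        # Null out every surviving wrapper before the native graph is freed.
        for op in list(self._ops):
            op._c_op = None


graph = Graph()
op = graph.create_op("add")
name_before = op.name
graph.close()
name_after = op.name
print(name_before, name_after)
```

The weak set is a design choice for the sketch: the owner can reach every live wrapper at teardown without extending any wrapper's lifetime.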

@rmothukuru rmothukuru added type:performance Performance Issue comp:runtime c++ runtime, performance issues (cpu) and removed comp:runtime c++ runtime, performance issues (cpu) labels May 25, 2021
@tilakrayal tilakrayal added comp:ops OPs related issues stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Jun 3, 2022
@mohantym mohantym self-assigned this Jul 19, 2022
@mohantym
Contributor

mohantym commented Jul 19, 2022

Hi @albertz !
It is not crashing in the 2.x version any more. Attached gist for reference. Shall we consider it resolved now?
Thank you!

@mohantym mohantym added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Jul 19, 2022
@albertz
Contributor Author

albertz commented Jul 19, 2022

@mohantym Is this also for the code in albertz/playground@114bcaf ?

What was done to resolve this?

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Jul 19, 2022
@mohantym
Contributor

@albertz !
I have used compatibility mode, as it was originally a 1.x codebase (updated comments in code). Yeah, I got the code from your GitHub repo only.
Thank you!

@albertz
Contributor Author

albertz commented Jul 19, 2022

Yeah, I got the code from your GitHub repo only.

In your gist, you had the initial code here in this issue, but I was referring to this simplified code: albertz/playground@114bcaf

@mohantym
Contributor

mohantym commented Aug 2, 2022

Hi @albertz !
I am facing an attribute error with the new code. Attached gist for reference. Could you share a Colab gist with the error from your side?
Thank you!

@mohantym mohantym added the stat:awaiting response Status - Awaiting response from author label Aug 2, 2022
@google-ml-butler

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Aug 9, 2022
albertz added a commit to albertz/playground that referenced this issue Aug 9, 2022
@albertz
Contributor Author

albertz commented Aug 9, 2022

@mohantym I updated the code for TF2. Please see here: https://github.com/albertz/playground/blob/master/tf-crash-use-after-delete-graph.py
The problem is still there. It still crashes.

@google-ml-butler google-ml-butler bot removed stat:awaiting response Status - Awaiting response from author stale This label marks the issue/pr stale - to be closed automatically if no activity labels Aug 9, 2022
@mohantym
Contributor

@albertz !
I am observing a different behaviour in 2.9 and nightly. Could you let us know what you see on your end?
Thank you!

@mohantym mohantym added the stat:awaiting response Status - Awaiting response from author label Aug 10, 2022
@ebrevdo
Contributor

ebrevdo commented Aug 10, 2022

I can replicate the error in tf-nightly in Python 3.10. Running in gdb, here's the stack trace at segfault:

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
__strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:74
74	../sysdeps/x86_64/multiarch/strlen-avx2.S: No such file or directory.
(gdb) bt
#0  __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:74
#1  0x00007fffbc50f975 in pybind11::detail::type_caster<char, void>::cast(char const*, pybind11::return_value_policy, pybind11::handle) ()
   from /home/ebrevdo/.local/lib/python3.10/site-packages/tensorflow/python/client/_pywrap_tf_session.so
#2  0x00007fffbc52d350 in pybind11::cpp_function::initialize<char const* (*&)(TF_Operation*), char const*, TF_Operation*, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release> >(char const* (*&)(TF_Operation*), char const* (*)(TF_Operation*), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) ()
   from /home/ebrevdo/.local/lib/python3.10/site-packages/tensorflow/python/client/_pywrap_tf_session.so
#3  0x00007fffbc5359a1 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) ()
   from /home/ebrevdo/.local/lib/python3.10/site-packages/tensorflow/python/client/_pywrap_tf_session.so
#4  0x00005555556dbfee in ?? ()
#5  0x00005555556d2c93 in _PyObject_MakeTpCall ()
#6  0x00005555556cc65d in _PyEval_EvalFrameDefault ()
#7  0x00005555556dc798 in _PyFunction_Vectorcall ()
#8  0x00005555556c6ee7 in _PyEval_EvalFrameDefault ()
#9  0x00005555556dc798 in _PyFunction_Vectorcall ()
#10 0x00005555556c6ee7 in _PyEval_EvalFrameDefault ()
#11 0x00005555557aaa22 in ?? ()
#12 0x00005555557aa962 in PyEval_EvalCode ()
#13 0x00005555557d1374 in ?? ()
#14 0x00005555557cbbdb in ?? ()
#15 0x00005555557d1121 in ?? ()
#16 0x00005555557d0754 in _PyRun_SimpleFileObject ()
#17 0x00005555557d04b3 in _PyRun_AnyFileObject ()
#18 0x00005555557c42ca in Py_RunMain ()
#19 0x000055555579ee19 in Py_BytesMain ()
#20 0x00007ffff7c337fd in __libc_start_main (main=0x55555579ede0, argc=2, argv=0x7fffffffde18, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffffffde08) at ../csu/libc-start.c:332
#21 0x000055555579ed1a in _start ()

@ebrevdo
Contributor

ebrevdo commented Aug 10, 2022

(that said, the new tf2 code calls tf.disable_v2_behavior(), which drops you back in TF1-mode. I don't know the support story for this anymore...)

@mohantym mohantym added TF 2.9 Issues found in the TF 2.9 release (or RCs) and removed stat:awaiting response Status - Awaiting response from author labels Aug 10, 2022
@mohantym mohantym removed their assignment Aug 10, 2022