crash via tf_should_use format_stack #22770
Comments
@ebrevdo any idea what could be causing this? |
My guess: some Swig internals, which do not expect a thread change in certain contexts (which is triggered here by the Python GC calling `__del__`). |
@allenlavoie may have insight. |
Nothing jumps out to me as an obvious cause. Sounds like this needs debugging, and without a more concrete reproduction I'm not sure there's much to be done. Is there a loop you can construct which eventually results in this bug being triggered? |
I'm having a hard time replicating the issue. I ran:
$ python3 --version
Python 3.5.3
$ python3 -c 'import tensorflow; print(tensorflow.__version__)'
1.13.0-dev20181121
$ python3 test-tf111-tfshoulduse-crash.py
2018-11-21 15:59:19.821636: I
tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports
instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
create graph
WARNING:tensorflow:From
/home/ebrevdo/.local/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py:263:
colocate_with (from tensorflow.python.framework.ops) is deprecated and will
be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
variables:
[<tf.Variable 'b:0' shape=(10, 1, 6) dtype=float32_ref>]
init vars
graph size: 8668
train
step 0, loss: 1.596843
EXCEPTION
Traceback (most recent call last):
File "test-tf111-tfshoulduse-crash.py", line 217, in test
line: raise Exception("foo")
locals:
Exception = <builtin> <class 'Exception'>
Exception: foo
Exit.
atexit handler
EXCEPTION
Traceback (most recent call last):
(Exclude vars because we are exiting.)
File "test-tf111-tfshoulduse-crash.py", line 229, in at_exit_handler
line: raise Exception("foo")
Exception: foo
Dummy Goodbye
ERROR:tensorflow:==================================
Object was never used (type <class
'tensorflow.python.ops.tensor_array_ops.TensorArray'>):
<tensorflow.python.ops.tensor_array_ops.TensorArray object at
0x7f2390b5b6d8>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
File "test-tf111-tfshoulduse-crash.py", line 240, in <module> line:
print("Exit.") File "test-tf111-tfshoulduse-crash.py", line 219, in test
line: sys.excepthook(*sys.exc_info()) File
"/home/ebrevdo/.local/lib/python3.5/site-packages/tensorflow/python/util/tf_should_use.py",
line 189, in wrapped line: return _add_should_use_warning(fn(*args,
**kwargs))
==================================
...
I had a similar successful run with TF nightly from september.
|
You used the better_exchook version which includes the workaround for this
case. Can you try an older version?
|
I removed better_exchook (removed the import, and the two commands in main) and still am not able to replicate in py3.5 |
See my earlier explanation. You can only trigger this crash with better_exchook.
|
oh i see; an older version of better_exchook. checking...
|
ok i was able to replicate the issue. gonna see if i can run this under address sanitizer... |
OK; asan picked something up:
|
@allenlavoie looks like in this case ( |
Perhaps we can be more careful about when we call format_stack? We do this lazily to avoid the cost of formatting, but is there a way to check that the graph in the stack still exists? |
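One way such a liveness check could look, sketched in plain Python (the `LazyStackFormatter` and `Graph` names here are hypothetical stand-ins, not TF's actual internals): hold only a weak reference to the graph, and have the lazy formatter bail out with a placeholder once the graph is gone.

```python
import traceback
import weakref

class Graph:
    """Stand-in for the graph object the captured stack may refer to."""

class LazyStackFormatter:
    """Capture a stack now; format it later only if the owner is still alive."""

    def __init__(self, owner):
        # A weak reference observes deletion without keeping `owner` alive.
        self._owner_ref = weakref.ref(owner)
        # FrameSummary objects hold only strings, never live frames.
        self._summary = traceback.extract_stack()

    def format(self):
        if self._owner_ref() is None:
            # The owner is gone; formatting anything that referenced it
            # would be unsafe, so return a placeholder instead.
            return "<stack omitted: owning graph was deleted>"
        return "".join(traceback.format_list(self._summary))

g = Graph()
formatter = LazyStackFormatter(g)
print(formatter.format().startswith("  File"))  # formats normally while g lives
del g  # CPython frees g immediately once its refcount hits zero
print(formatter.format())
```

The weak reference costs almost nothing at capture time, so the lazy-formatting optimization is preserved; only the format step pays the extra check.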
We could also consider sanitizing the stack before formatting. |
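A plain-Python illustration of why sanitizing before formatting helps (the `Fragile` class is a hypothetical stand-in for a TF object, not real TF code): raw frame objects pin every local in the call chain and leave their `__repr__` to be called later, while `traceback.extract_stack` keeps only strings.

```python
import sys
import traceback
import weakref

class Fragile:
    """Stand-in for a TF object whose __repr__ is unsafe after teardown."""

def capture_raw():
    # Returning the frame keeps the whole call chain (via f_back) and
    # all of its locals alive, including any Fragile objects.
    return sys._getframe()

def capture_sanitized():
    # FrameSummary entries are plain strings (file, line, source text);
    # there is nothing left to call __repr__ on later.
    return traceback.extract_stack()

def run(capture):
    obj = Fragile()
    ref = weakref.ref(obj)
    return ref, capture()

ref_raw, raw_frames = run(capture_raw)
ref_safe, summary = run(capture_sanitized)
print(ref_raw() is not None)  # True: the raw frame chain still pins obj
print(ref_safe() is None)     # True: sanitized capture let obj be freed
```

Sanitizing eagerly trades a little capture-time cost for the guarantee that the deferred formatting can never touch a dead object.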
So, to make it clear: there is a Python object which corresponds to a C++ graph that does not exist anymore, or has become invalid? How is this possible? This is via Swig, right? I thought that Swig does some sort of reference counting. Or does the C++ graph object itself still exist, but accessing it has become invalid? Is there a flag or so that marks the object as invalid now? Maybe there should just be a check for this flag, and if the object is invalid, any related functions should return some sane value (None or so) or throw a Python exception, instead of this crash? I feel like cleaning/sanitizing the stack trace to try to avoid any possible access to such objects is just a workaround for the problem. |
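A minimal sketch of the flag idea raised here (the `GraphHandle` class is hypothetical, not the actual Swig wrapper): every accessor checks a validity flag and raises an ordinary Python exception instead of dereferencing freed memory.

```python
class GraphHandle:
    """Hypothetical wrapper around a C++ object that may be destroyed
    while the Python handle still exists."""

    def __init__(self, name):
        self._name = name
        self._deleted = False

    def invalidate(self):
        # Would be called from the C++ side when the object is destroyed.
        self._deleted = True

    def _check_alive(self):
        if self._deleted:
            # A Python exception is recoverable; a use-after-free is not.
            raise RuntimeError("graph %r was already deleted" % self._name)

    @property
    def name(self):
        self._check_alive()
        return self._name

g = GraphHandle("train_graph")
print(g.name)
g.invalidate()
try:
    g.name
except RuntimeError as exc:
    print("caught:", exc)
```

The flag check turns a hard crash into a catchable error, at the cost of one branch per accessor.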
I tried to write some simpler test case. See the commit I just referenced. That code sometimes crashes in various different ways. |
Oh interesting, good find. So maybe we just need to set some Python properties to
|
@mohantym Is this also for the code in albertz/playground@114bcaf ? What was done to resolve this? |
@albertz ! |
In your gist, you had the initial code here in this issue, but I was referring to this simplified code: albertz/playground@114bcaf |
This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you. |
@mohantym I updated the code for TF2. Please see here: https://github.com/albertz/playground/blob/master/tf-crash-use-after-delete-graph.py |
@albertz ! |
I can replicate the error in tf-nightly, in Python 3.10. Running in gdb, here's the stack trace at segfault:
|
(that said, the new tf2 code calls |
System information
Describe the problem
When `__repr__` is called on some TF objects at the wrong time, this can lead to a crash (seg fault; see below). There can be various reasons why this can happen, e.g. when a debugger shows the locals of all threads. My case was this, but I think this doesn't matter:

- I use better_exchook, which overrides `sys.excepthook` and some `traceback` functions to print out some local vars and their `__repr__` output. There is something similar for IPython.
- I created a `tf.TensorArray` and called `unstack`, and I did not use the result value. That `unstack` method is wrapped via `should_use_result`.
- The Python GC called the `_TFShouldUseHelper.__del__` function at some random point, and this triggered the stack formatting and then the call of some `__repr__` of some TF objects.

Originally, this happened at exit, and I thought that probably it's just not safe at exit to touch any existing TF objects. So I fixed that case in better_exchook: it will not print any vars at exit. A test case to reproduce exactly that case is here.
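The timing aspect of this mechanism can be reproduced in pure Python, without TensorFlow (the `ShouldUseHelper` and `Fragile` classes below are illustrative stand-ins, not the real TF classes): an object stuck in a reference cycle only runs its `__del__` when the cycle collector fires, and whatever that `__del__` formats gets its `__repr__` called at that arbitrary point.

```python
import gc

gc.disable()  # make the demo deterministic: only our explicit collect runs
events = []

class Fragile:
    """Stand-in for a TF object whose __repr__ may be unsafe later."""
    def __repr__(self):
        events.append("repr called")
        return "<Fragile>"

class ShouldUseHelper:
    """Sketch of a helper like _TFShouldUseHelper: __del__ runs whenever
    the GC decides, and it formats a message that reprs captured objects."""
    def __init__(self, obj):
        self._obj = obj
    def __del__(self):
        # In the real crash, this is where the lazy stack formatting and
        # the TF objects' __repr__ calls happened.
        events.append("del ran: %r" % (self._obj,))

def leak_in_cycle():
    helper = ShouldUseHelper(Fragile())
    helper.self_ref = helper  # cycle: refcounting alone cannot free it

leak_in_cycle()
events.append("function returned")  # __del__ has not run yet
gc.collect()  # the "random point" where the GC finally runs __del__
print(events)
```

Because the `__del__` only fires at the collector's discretion, the `__repr__` call can land long after the surrounding context (here, the function) is gone, which is exactly what makes the real crash so hard to reproduce deterministically.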
However, now I get the same crash also not at exit but at another random point (see stack below). It will be hard to come up with a test case for this, as it is very non-deterministic when exactly the GC runs and calls the `__del__` function.

Source code / logs
"ops.py", line 1897 in name, that is this code:
I often also see this just before the crash:
A Travis log with this crash can also be seen here, or here.
The C backtrace is this: