
A very weird bug of tf_debug: Non-Ok-status env->NewWritableFile(file_path, &f) status: Resource exaustated #11628

Closed
ghost opened this issue Jul 20, 2017 · 4 comments
Labels
stat:awaiting response Status - Awaiting response from author type:support Support issues

Comments

@ghost

ghost commented Jul 20, 2017

System information

  • Have I written custom code: Yes
  • OS Platform and Distribution: CentOS
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version: 1.2
  • Python version: 3.4

When I run my seq2seq model, I get a NaN loss, so I am using tf_debug to find out where the problem occurs. I set up tf_debug like this:

sv = tf.train.Supervisor(logdir=FLAGS.log_root,
                         is_chief=True,
                         saver=saver,
                         summary_op=None,
                         save_summaries_secs=60,
                         save_model_secs=FLAGS.checkpoint_secs)
sess = sv.prepare_or_wait_for_session(config=tf.ConfigProto(
    allow_soft_placement=True))
# Wrap the session with the tfdbg CLI wrapper and register the NaN/Inf filter.
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)

But it produces the following log and exits:

tfdebug Non-OK-status: env->NewWritableFile(file_path, &f) status: Resource exaustated: /tmp/tfdbg_9_gcc3sc/gradients/output/Reshape_1_grad/Reshape_0_DebugIdentity_150023213
Aborted (core dumped)

I think this is an issue related to tf_debug, and it may be a new one, since I cannot find anything about this error when I google it.
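One possible workaround for the crash above (an assumption, not a confirmed fix): the dump path `/tmp/tfdbg_9_gcc3sc/...` suggests tfdbg creates its dump root via Python's `tempfile.mkdtemp`, which honors the `TMPDIR` environment variable. If `/tmp` is the constrained filesystem, pointing `TMPDIR` at a roomier one may avoid the failure:

```shell
# Assumption: tfdbg picks its dump root via Python's tempfile module,
# which honors TMPDIR. Point it at a filesystem with ample space/inodes.
export TMPDIR="$HOME/tfdbg_tmp"   # any directory on a filesystem with room
mkdir -p "$TMPDIR"
# then launch training as usual, e.g.:  python seq2seq_train.py
```

The training-script name above is hypothetical; substitute your own entry point.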

@ghost ghost changed the title twice on Jul 20, 2017, settling on A very weird bug of tf_debug: Non-Ok-status env->NewWritableFile(file_path, &f) status: Resource exaustated
@asimshankar
Contributor

This error occurs when writing to the filesystem fails: for example, if there is no space left on the device, if the disk quota is exceeded, or if there are too many open files (see errors.cc).

Unfortunately, the error message doesn't give more detail (I'll look into fixing that), but the underlying problem does appear to be that you're hitting some sort of space or user limit when writing to /tmp/tfdbg_9_gcc3sc/gradients/output/Reshape_1_grad/Reshape_0_DebugIdentity_150023213

Could you look into how much space is left on the device and whether you're running into disk limits?
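The checks suggested above can be run directly on the machine; note that tfdbg writes one dump file per watched tensor, so inode exhaustion is worth checking alongside raw free space:

```shell
# Check the filesystem backing /tmp, where tfdbg writes its dump files.
df -h /tmp     # free space
df -i /tmp     # free inodes: tfdbg writes one file per dumped tensor
ulimit -n      # per-process open-file limit
```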

@asimshankar asimshankar added stat:awaiting response Status - Awaiting response from author type:support Support issues labels Jul 21, 2017
@ghost
Author

ghost commented Jul 22, 2017

@asimshankar Thanks for your reply! But it is unlikely that there is no space left on the device, because I run the code on a server with far more than enough space. Do you think user permissions could be a possible reason? But I have no problem on writing under my path.

@aselle aselle removed the stat:awaiting response Status - Awaiting response from author label Jul 22, 2017
@asimshankar
Contributor

I didn't quite follow what you meant by "I have no problem on writing under my path". Could you elaborate on that?

But regardless, it seems to be the fopen call here that is failing, so the filesystem does appear to be rejecting the open call. Perhaps there are some ulimits in place?
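The ulimit hypothesis can be checked from Python itself; a small stdlib-only sketch (the dump directory below is an example path, not necessarily where your tfdbg run wrote its files):

```python
import os
import resource

# One "Resource exhausted" culprit besides disk space: the per-process
# open-file limit (ulimit -n). tfdbg writes one dump file per tensor,
# so a large graph can create many files in a single run.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limit: soft=%d, hard=%d" % (soft, hard))

# Count existing files under the dump location (example path).
dump_dir = "/tmp"
num_files = sum(len(files) for _, _, files in os.walk(dump_dir))
print("files under %s: %d" % (dump_dir, num_files))
```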

@asimshankar asimshankar added the stat:awaiting response Status - Awaiting response from author label Jul 22, 2017
vrv pushed a commit to vrv/tensorflow that referenced this issue Jul 25, 2017
errors: Avoid stripping error details when converting POSIX errors to Status

This change is made out of a desire to have additional information be reported
when there are filesystem errors (e.g. see
tensorflow#11628)

PiperOrigin-RevId: 163091773
vrv pushed a commit that referenced this issue Jul 26, 2017
* errors: Avoid stripping error details when converting POSIX errors to Status

This change is made out of a desire to have additional information be reported
when there are filesystem errors (e.g. see
#11628)

PiperOrigin-RevId: 163091773

@tensorflowbutler
Member

This issue is automatically closed due to lack of activity. Please re-open if this is still an issue for you. Thanks!

3 participants