
A very weird bug of tf_debug: Non-Ok-status env->NewWritableFile(file_path, &f) status: Resource exaustated #11628

Closed
ghost opened this issue Jul 20, 2017 · 4 comments
Labels
stat:awaiting response Status - Awaiting response from author type:support Support issues

Comments

@ghost

ghost commented Jul 20, 2017

System information

  • Have I written custom code: Yes
  • OS Platform and Distribution: CentOS
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version: 1.2
  • Python version: 3.4

When I run my seq2seq model, I get a NaN loss, so I am using tf_debug to find out where the problem occurs. I set up tf_debug like this:

sv = tf.train.Supervisor(logdir=FLAGS.log_root,
                         is_chief=True,
                         saver=saver,
                         summary_op=None,
                         save_summaries_secs=60,
                         save_model_secs=FLAGS.checkpoint_secs)
sess = sv.prepare_or_wait_for_session(config=tf.ConfigProto(
    allow_soft_placement=True))
# Wrap the session with the tfdbg CLI wrapper and register the NaN/Inf filter.
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)

But it produces the following log and exits:

tfdebug Non-OK-status: env->NewWritableFile(file_path, &f) status: Resource exaustated: /tmp/tfdbg_9_gcc3sc/gradients/output/Reshape_1_grad/Reshape_0_DebugIdentity_150023213
Aborted (core dumped)

I think this is an issue related to tf_debug, and it may be a new one, since I cannot find anything about this error when I google it.
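One possible workaround for the crash above (an assumption, not a confirmed fix): the dump path `/tmp/tfdbg_9_gcc3sc/...` suggests tfdbg creates its dump root via Python's `tempfile.mkdtemp`, which honors the `TMPDIR` environment variable. If `/tmp` is the constrained filesystem, pointing `TMPDIR` at a roomier one may avoid the failure:

```shell
# Assumption: tfdbg picks its dump root via Python's tempfile module,
# which honors TMPDIR. Point it at a filesystem with ample space/inodes.
export TMPDIR="$HOME/tfdbg_tmp"   # any directory on a filesystem with room
mkdir -p "$TMPDIR"
# then launch training as usual, e.g.:  python seq2seq_train.py
```

The training-script name above is hypothetical; substitute your own entry point.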

@ghost ghost changed the title twice on Jul 20, 2017, settling on A very weird bug of tf_debug: Non-Ok-status env->NewWritableFile(file_path, &f) status: Resource exaustated
@asimshankar
Contributor

This error occurs when writing to the filesystem fails: for example, if there is no space left on the device, if the disk quota is exceeded, or if there are too many open files (see errors.cc).

Unfortunately, the error message doesn't give more detail (I'll look into fixing that), but the underlying problem does appear to be that you're hitting some sort of space or user limit when writing to /tmp/tfdbg_9_gcc3sc/gradients/output/Reshape_1_grad/Reshape_0_DebugIdentity_150023213

Could you look into how much space is left on the device and whether you're running into disk limits?
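The checks suggested above can be run directly on the machine; note that tfdbg writes one dump file per watched tensor, so inode exhaustion is worth checking alongside raw free space:

```shell
# Check the filesystem backing /tmp, where tfdbg writes its dump files.
df -h /tmp     # free space
df -i /tmp     # free inodes: tfdbg writes one file per dumped tensor
ulimit -n      # per-process open-file limit
```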

@asimshankar asimshankar added stat:awaiting response Status - Awaiting response from author type:support Support issues labels Jul 21, 2017
@ghost
Author

ghost commented Jul 22, 2017

@asimshankar Thanks for your reply! But it is unlikely that there is no space left on the device, because I run the code on a server with far more than enough space. Do you think user permissions could be a possible reason? But I have no problem on writing under my path.

@aselle aselle removed the stat:awaiting response Status - Awaiting response from author label Jul 22, 2017
@asimshankar
Contributor

I didn't quite follow what you meant by "I have no problem on writing under my path". Could you elaborate on that?

But regardless, it seems to be the fopen call here that is failing, so the filesystem does appear to be rejecting the open call. Perhaps there are some ulimits in place?
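The ulimit hypothesis can be checked from Python itself; a small stdlib-only sketch (the dump directory below is an example path, not necessarily where your tfdbg run wrote its files):

```python
import os
import resource

# One "Resource exhausted" culprit besides disk space: the per-process
# open-file limit (ulimit -n). tfdbg writes one dump file per tensor,
# so a large graph can create many files in a single run.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limit: soft=%d, hard=%d" % (soft, hard))

# Count existing files under the dump location (example path).
dump_dir = "/tmp"
num_files = sum(len(files) for _, _, files in os.walk(dump_dir))
print("files under %s: %d" % (dump_dir, num_files))
```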

@asimshankar asimshankar added the stat:awaiting response Status - Awaiting response from author label Jul 22, 2017
vrv pushed a commit to vrv/tensorflow that referenced this issue Jul 25, 2017
errors: Avoid stripping error details when converting POSIX errors to Status

This change is made out of a desire to have additional information be reported
when there are filesystem errors (e.g. see
tensorflow#11628)

PiperOrigin-RevId: 163091773
vrv pushed a commit that referenced this issue Jul 26, 2017
* errors: Avoid stripping error details when converting POSIX errors to Status

This change is made out of a desire to have additional information be reported
when there are filesystem errors (e.g. see
#11628)

PiperOrigin-RevId: 163091773

@tensorflowbutler
Member

This issue is automatically closed due to lack of activity. Please re-open if this is still an issue for you. Thanks!

3 participants