
[XLA] ResourceExhaustedError when trying to define a Sequential model in Keras under jit_scope context manager #21638

Closed
AlexandruBurlacu opened this issue Aug 15, 2018 · 7 comments
Labels
stat:awaiting response Status - Awaiting response from author

Comments

@AlexandruBurlacu

Intel i5-2430M, 6 GB RAM
Linux Ubuntu 16.04 | Python 3.6.5 | Bazel 0.16.0 | GCC 5.4.0
TensorFlow 1.8 compiled from source with MKL and XLA support


My issue is that I am trying to use XLA for CPU via the Keras API embedded in TensorFlow 1.8, using tf.contrib.compiler.jit.experimental_jit_scope (for CPU this is the only way I know of to enable XLA; enabling it through ConfigProto does not work on CPU for me). For some strange reason a ResourceExhaustedError is raised while trying to allocate 0 bytes. It looks like something is wrong, either in TensorFlow or in Keras. Below are the code I use and the full trace.
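For reference, the ConfigProto route mentioned above is roughly the following sketch (the global_jit_level option); as noted, it does not seem to take effect for CPU in my case:

import tensorflow as tf

config = tf.ConfigProto()
# Turn on global XLA JIT compilation for the session.
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
tf.keras.backend.set_session(tf.Session(config=config))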

Code

import tensorflow as tf
from tensorflow.python.client import timeline

import numpy as np

JIT_SCOPE = tf.contrib.compiler.jit.experimental_jit_scope

# Collect full tracing metadata so a Chrome timeline can be written afterwards.
options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

# Load MNIST, scale pixels to [0, 1], and one-hot encode the labels.
(train_x, train_y), _ = tf.keras.datasets.mnist.load_data()

train_x = np.expand_dims(train_x, axis=-1) / 255.
train_y = tf.keras.utils.to_categorical(train_y)

with JIT_SCOPE():  # define and compile the model under the XLA JIT scope
    model = tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(16, (3, 3), activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPool2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax")
    ])

    model.compile("sgd", "categorical_crossentropy", options=options, run_metadata=run_metadata)

model.fit(train_x, train_y) # error happens at this moment

trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open("timeline.ctr.json", "w") as f:
    f.write(trace.generate_chrome_trace_format())

Traceback

Epoch 1/1
2018-08-15 20:28:54.784459: I tensorflow/compiler/xla/service/service.cc:159] XLA service 0x7f70ec071a30 executing computations on platform Host. Devices:
2018-08-15 20:28:54.784509: I tensorflow/compiler/xla/service/service.cc:167]   StreamExecutor device (0): <undefined>, <undefined>
2018-08-15 20:28:55.548381: E tensorflow/core/common_runtime/bfc_allocator.cc:246] tried to allocate 0 bytes
2018-08-15 20:28:55.548481: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2018-08-15 20:28:55.561315: E tensorflow/core/common_runtime/bfc_allocator.cc:246] tried to allocate 0 bytes
2018-08-15 20:28:55.561365: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1321     try:
-> 1322       return fn(*args)
   1323     except errors.OpError as e:

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
   1306       return self._call_tf_sessionrun(
-> 1307           options, feed_dict, fetch_list, target_list, run_metadata)
   1308 

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
   1408           self._session, options, feed_dict, fetch_list, target_list,
-> 1409           run_metadata)
   1410     else:

ResourceExhaustedError: Out of memory while trying to allocate 0 bytes.
	 [[Node: cluster_1/_4/_5 = _XlaLaunch[Nresources=0, Targs=[], Tconstants=[], Tresults=[DT_FLOAT], function=cluster_1[_XlaCompiledKernel=true, _XlaNumConstantArgs=0, _XlaNumResourceArgs=0], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


During handling of the above exception, another exception occurred:

ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input-46-dbab7a29ab1f> in <module>()
----> 1 model.fit(train_x[:1000], train_y[:1000], epochs=1)

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
   1214           initial_epoch=initial_epoch,
   1215           steps_per_epoch=steps_per_epoch,
-> 1216           validation_steps=validation_steps)
   1217 
   1218   def evaluate(self,

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/engine/training_arrays.py in fit_loop(model, inputs, targets, sample_weights, batch_size, epochs, verbose, callbacks, val_inputs, val_targets, val_sample_weights, shuffle, callback_metrics, initial_epoch, steps_per_epoch, validation_steps)
    243           ins_batch[i] = ins_batch[i].toarray()
    244 
--> 245         outs = f(ins_batch)
    246         if not isinstance(outs, list):
    247           outs = [outs]

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/backend.py in __call__(self, inputs)
   2797       feed_dict = {}
   2798 
-> 2799     session = get_session()
   2800     data_tensors_to_feed = []
   2801     for tensor, value in zip(self.inputs, inputs):

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/backend.py in get_session()
    440   if not _MANUAL_VAR_INIT:
    441     with session.graph.as_default():
--> 442       _initialize_variables(session)
    443   return session
    444 

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/backend.py in _initialize_variables(session)
    671       v._keras_initialized = True
    672     if uninitialized_vars:
--> 673       session.run(variables_module.variables_initializer(uninitialized_vars))
    674 
    675 

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    898     try:
    899       result = self._run(None, fetches, feed_dict, options_ptr,
--> 900                          run_metadata_ptr)
    901       if run_metadata:
    902         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1133     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1134       results = self._do_run(handle, final_targets, final_fetches,
-> 1135                              feed_dict_tensor, options, run_metadata)
   1136     else:
   1137       results = []

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1314     if handle is None:
   1315       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1316                            run_metadata)
   1317     else:
   1318       return self._do_call(_prun_fn, handle, feeds, fetches)

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1333         except KeyError:
   1334           pass
-> 1335       raise type(e)(node_def, op, message)
   1336 
   1337   def _extend_graph(self):

ResourceExhaustedError: Out of memory while trying to allocate 0 bytes.
	 [[Node: cluster_1/_4/_5 = _XlaLaunch[Nresources=0, Targs=[], Tconstants=[], Tresults=[DT_FLOAT], function=cluster_1[_XlaCompiledKernel=true, _XlaNumConstantArgs=0, _XlaNumResourceArgs=0], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
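As a side note, the hint in the error message can be followed by asking RunOptions for the allocation report; a minimal, untested variation of the options used above would be:

import tensorflow as tf

# Same tracing options as before, plus the allocation report suggested by the hint.
options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE,
                        report_tensor_allocations_upon_oom=True)
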
@tensorflowbutler tensorflowbutler added the stat:awaiting response Status - Awaiting response from author label Aug 16, 2018
@tensorflowbutler
Member

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
Have I written custom code
OS Platform and Distribution
TensorFlow installed from
TensorFlow version
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce
Mobile device

@tensorflowbutler
Member

It has been 30 days with no activity and the awaiting response label was assigned. Is this still an issue?

@ymodak
Contributor

ymodak commented Sep 28, 2018

This question is better asked on StackOverflow since it is not a bug or feature request. There is also a larger community that reads questions there.

If you think we've misinterpreted a bug, please comment again with a clear explanation, as well as all of the information requested in the issue template. Thanks!

@ymodak ymodak closed this as completed Sep 28, 2018
@AlexandruBurlacu
Author

How is this not a bug? The trace shows that it tries to allocate 0 bytes and fails, which is strange and surely not a problem in my code. The problem might be on the Keras side, though I am not sure.
All the information requested in the issue template is already in the issue, just formatted in an unorthodox way.

@ymodak
Contributor

ymodak commented Oct 1, 2018

Hi @AlexandruBurlacu
I was able to execute your code successfully in TensorFlow 1.11.
Can you please update your TensorFlow version and try it?
Meanwhile, I will create a virtual env for TF 1.8 and execute the script again.
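For example, a quick check from Python confirms which version is installed:

import tensorflow as tf
print(tf.__version__)  # should show the upgraded version, e.g. 1.11.0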

@ymodak
Contributor

ymodak commented Oct 1, 2018

Quick update: I was able to run the script successfully in TF 1.8 as well, so I don't think this is a problem on the Keras side either.

@AlexandruBurlacu
Author

Interesting, I will check it again and come back with the results on my side. Thank you anyway!
