
[XLA] ResourceExhaustedError when trying to define a Sequential model in Keras under jit_scope context manager #21638

Closed
AlexandruBurlacu opened this issue Aug 15, 2018 · 7 comments
Labels
stat:awaiting response Status - Awaiting response from author

Comments

@AlexandruBurlacu

Intel i5-2430M, 6 GB RAM
Linux Ubuntu 16.04 | Python 3.6.5 | Bazel 0.16.0 | GCC 5.4.0
TensorFlow 1.8 compiled from source with MKL and XLA support


My issue is that I am trying to use XLA for CPU via the Keras API embedded in TensorFlow 1.8, using tf.contrib.compiler.jit.experimental_jit_scope (for CPU this is the only way I know of to enable XLA; enabling it through ConfigProto does not work on CPU for me). For some strange reason a ResourceExhaustedError is raised while trying to allocate 0 bytes. It looks like something is wrong, either in TensorFlow or in Keras. Below are the code I use and the full trace.
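For reference, the ConfigProto route mentioned above is roughly the following sketch (the global_jit_level option); as noted, it does not seem to take effect for CPU in my case:

import tensorflow as tf

config = tf.ConfigProto()
# Turn on global XLA JIT compilation for the session.
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
tf.keras.backend.set_session(tf.Session(config=config))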

Code

import tensorflow as tf
from tensorflow.python.client import timeline

import numpy as np

JIT_SCOPE = tf.contrib.compiler.jit.experimental_jit_scope

# Collect full tracing metadata so a Chrome timeline can be written afterwards.
options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

# Load MNIST, scale pixels to [0, 1], and one-hot encode the labels.
(train_x, train_y), _ = tf.keras.datasets.mnist.load_data()

train_x = np.expand_dims(train_x, axis=-1) / 255.
train_y = tf.keras.utils.to_categorical(train_y)

with JIT_SCOPE():  # define and compile the model under the XLA JIT scope
    model = tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(16, (3, 3), activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPool2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax")
    ])

    model.compile("sgd", "categorical_crossentropy", options=options, run_metadata=run_metadata)

model.fit(train_x, train_y) # error happens at this moment

trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open("timeline.ctr.json", "w") as f:
    f.write(trace.generate_chrome_trace_format())

Traceback

Epoch 1/1
2018-08-15 20:28:54.784459: I tensorflow/compiler/xla/service/service.cc:159] XLA service 0x7f70ec071a30 executing computations on platform Host. Devices:
2018-08-15 20:28:54.784509: I tensorflow/compiler/xla/service/service.cc:167]   StreamExecutor device (0): <undefined>, <undefined>
2018-08-15 20:28:55.548381: E tensorflow/core/common_runtime/bfc_allocator.cc:246] tried to allocate 0 bytes
2018-08-15 20:28:55.548481: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2018-08-15 20:28:55.561315: E tensorflow/core/common_runtime/bfc_allocator.cc:246] tried to allocate 0 bytes
2018-08-15 20:28:55.561365: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1321     try:
-> 1322       return fn(*args)
   1323     except errors.OpError as e:

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
   1306       return self._call_tf_sessionrun(
-> 1307           options, feed_dict, fetch_list, target_list, run_metadata)
   1308 

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
   1408           self._session, options, feed_dict, fetch_list, target_list,
-> 1409           run_metadata)
   1410     else:

ResourceExhaustedError: Out of memory while trying to allocate 0 bytes.
	 [[Node: cluster_1/_4/_5 = _XlaLaunch[Nresources=0, Targs=[], Tconstants=[], Tresults=[DT_FLOAT], function=cluster_1[_XlaCompiledKernel=true, _XlaNumConstantArgs=0, _XlaNumResourceArgs=0], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


During handling of the above exception, another exception occurred:

ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input-46-dbab7a29ab1f> in <module>()
----> 1 model.fit(train_x[:1000], train_y[:1000], epochs=1)

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
   1214           initial_epoch=initial_epoch,
   1215           steps_per_epoch=steps_per_epoch,
-> 1216           validation_steps=validation_steps)
   1217 
   1218   def evaluate(self,

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/engine/training_arrays.py in fit_loop(model, inputs, targets, sample_weights, batch_size, epochs, verbose, callbacks, val_inputs, val_targets, val_sample_weights, shuffle, callback_metrics, initial_epoch, steps_per_epoch, validation_steps)
    243           ins_batch[i] = ins_batch[i].toarray()
    244 
--> 245         outs = f(ins_batch)
    246         if not isinstance(outs, list):
    247           outs = [outs]

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/backend.py in __call__(self, inputs)
   2797       feed_dict = {}
   2798 
-> 2799     session = get_session()
   2800     data_tensors_to_feed = []
   2801     for tensor, value in zip(self.inputs, inputs):

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/backend.py in get_session()
    440   if not _MANUAL_VAR_INIT:
    441     with session.graph.as_default():
--> 442       _initialize_variables(session)
    443   return session
    444 

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/backend.py in _initialize_variables(session)
    671       v._keras_initialized = True
    672     if uninitialized_vars:
--> 673       session.run(variables_module.variables_initializer(uninitialized_vars))
    674 
    675 

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    898     try:
    899       result = self._run(None, fetches, feed_dict, options_ptr,
--> 900                          run_metadata_ptr)
    901       if run_metadata:
    902         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1133     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1134       results = self._do_run(handle, final_targets, final_fetches,
-> 1135                              feed_dict_tensor, options, run_metadata)
   1136     else:
   1137       results = []

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1314     if handle is None:
   1315       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1316                            run_metadata)
   1317     else:
   1318       return self._do_call(_prun_fn, handle, feeds, fetches)

~/Work/2018_Summer_CERN/tf_v_tmva/tf_opt/.venv/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1333         except KeyError:
   1334           pass
-> 1335       raise type(e)(node_def, op, message)
   1336 
   1337   def _extend_graph(self):

ResourceExhaustedError: Out of memory while trying to allocate 0 bytes.
	 [[Node: cluster_1/_4/_5 = _XlaLaunch[Nresources=0, Targs=[], Tconstants=[], Tresults=[DT_FLOAT], function=cluster_1[_XlaCompiledKernel=true, _XlaNumConstantArgs=0, _XlaNumResourceArgs=0], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
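As a side note, the hint in the error message can be followed by asking RunOptions for the allocation report; a minimal, untested variation of the options used above would be:

import tensorflow as tf

# Same tracing options as before, plus the allocation report suggested by the hint.
options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE,
                        report_tensor_allocations_upon_oom=True)
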
@tensorflowbutler tensorflowbutler added the stat:awaiting response Status - Awaiting response from author label Aug 16, 2018
@tensorflowbutler
Member

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
Have I written custom code
OS Platform and Distribution
TensorFlow installed from
TensorFlow version
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce
Mobile device

@tensorflowbutler
Member

It has been 30 days with no activity and the awaiting response label was assigned. Is this still an issue?

@ymodak
Contributor

ymodak commented Sep 28, 2018

This question is better asked on StackOverflow since it is not a bug or feature request. There is also a larger community that reads questions there.

If you think we've misinterpreted a bug, please comment again with a clear explanation, as well as all of the information requested in the issue template. Thanks!

@ymodak ymodak closed this as completed Sep 28, 2018
@AlexandruBurlacu
Author

How is this not a bug? The trace shows that it tries to allocate 0 bytes and fails, which is strange and surely not a problem in my code. The problem might be on the Keras side, though I am not sure.
All the information requested in the issue template is already in the issue, just formatted in an unorthodox way.

@ymodak
Contributor

ymodak commented Oct 1, 2018

Hi @AlexandruBurlacu
I was able to execute your code successfully in TensorFlow 1.11.
Can you please update your TensorFlow version and try it?
Meanwhile, I will create a virtual env for TF 1.8 and execute the script again.
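For example, a quick check from Python confirms which version is installed:

import tensorflow as tf
print(tf.__version__)  # should show the upgraded version, e.g. 1.11.0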

@ymodak
Contributor

ymodak commented Oct 1, 2018

Quick update: I was able to run the script successfully in TF 1.8 as well, so I don't think this is a problem on the Keras side either.

@AlexandruBurlacu
Author

Interesting, I will check it again and come back with the results on my side. Thank you anyway!
