
Resource exhausted: OOM when allocating tensor with shape[2304,384] Traceback (most recent call last): #1993

Closed
ehfo0 opened this issue Jul 20, 2017 · 19 comments

Comments


ehfo0 commented Jul 20, 2017


I tried to run models/tutorials/image/cifar10/train.py and let it run for about a day on my PC (Windows 10, tensorflow-gpu 1.2). After
2017-07-20 13:58:20.441224: step 941580, loss = 0.14 (3076.2 examples/sec; 0.042 sec/batch)

I got this error:

2017-07-20 13:58:20.791379: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\framework\op_kernel.cc:1158] Resource exhausted: OOM when allocating tensor with shape[2304,384]
Traceback (most recent call last):
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1139, in _do_call
    return fn(*args)
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1121, in _run_fn
    status, run_metadata)
  File "D:\Anaconda3\lib\contextlib.py", line 66, in __exit__
    next(self.gen)
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2304,384]
	 [[Node: ExponentialMovingAverage/AssignMovingAvg_4/sub_1 = Sub[T=DT_FLOAT, _class=["loc:@local3/weights"], _device="/job:localhost/replica:0/task:0/cpu:0"](local3/weights/ExponentialMovingAverage/read, local3/weights/read)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/Hoda/Documents/GitHub/models/tutorials/image/cifar10/cifar10_train.py", line 127, in <module>
    tf.app.run()
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "C:/Users/Hoda/Documents/GitHub/models/tutorials/image/cifar10/cifar10_train.py", line 123, in main
    train()
  File "C:/Users/Hoda/Documents/GitHub/models/tutorials/image/cifar10/cifar10_train.py", line 115, in train
    mon_sess.run(train_op)
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 505, in run
    run_metadata=run_metadata)
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 842, in run
    run_metadata=run_metadata)
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 798, in run
    return self._sess.run(*args, **kwargs)
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 952, in run
    run_metadata=run_metadata)
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 798, in run
    return self._sess.run(*args, **kwargs)
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 789, in run
    run_metadata_ptr)
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2304,384]
	 [[Node: ExponentialMovingAverage/AssignMovingAvg_4/sub_1 = Sub[T=DT_FLOAT, _class=["loc:@local3/weights"], _device="/job:localhost/replica:0/task:0/cpu:0"](local3/weights/ExponentialMovingAverage/read, local3/weights/read)]]

Caused by op 'ExponentialMovingAverage/AssignMovingAvg_4/sub_1', defined at:
  File "C:/Users/Hoda/Documents/GitHub/models/tutorials/image/cifar10/cifar10_train.py", line 127, in <module>
    tf.app.run()
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "C:/Users/Hoda/Documents/GitHub/models/tutorials/image/cifar10/cifar10_train.py", line 123, in main
    train()
  File "C:/Users/Hoda/Documents/GitHub/models/tutorials/image/cifar10/cifar10_train.py", line 79, in train
    train_op = cifar10.train(loss, global_step)
  File "C:\Users\Hoda\Documents\GitHub\models\tutorials\image\cifar10\cifar10.py", line 373, in train
    variables_averages_op = variable_averages.apply(tf.trainable_variables())
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\training\moving_averages.py", line 392, in apply
    self._averages[var], var, decay, zero_debias=zero_debias))
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\training\moving_averages.py", line 72, in assign_moving_average
    update_delta = (variable - value) * decay
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\ops\variables.py", line 694, in _run_op
    return getattr(ops.Tensor, operator)(a._AsTensor(), *args)
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\ops\math_ops.py", line 838, in binary_op_wrapper
    return func(x, y, name=name)
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 2501, in _sub
    result = _op_def_lib.apply_op("Sub", x=x, y=y, name=name)
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 2510, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1273, in __init__
    self._traceback = _extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2304,384]
	 [[Node: ExponentialMovingAverage/AssignMovingAvg_4/sub_1 = Sub[T=DT_FLOAT, _class=["loc:@local3/weights"], _device="/job:localhost/replica:0/task:0/cpu:0"](local3/weights/ExponentialMovingAverage/read, local3/weights/read)]]

How can I fix it? And do I have to run it again from scratch, or is the previous result saved?


drpngx commented Jul 20, 2017

The most expedient way is probably to reduce the batch size. It'll run slower, but use less memory.
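For this tutorial specifically, the batch size is defined as a flag in cifar10.py; a minimal sketch of the change, assuming the flag definition still looks like it did in the r1.2 tutorial, is below (equivalently, pass --batch_size=64 to cifar10_train.py):

import tensorflow as tf  # TF 1.x

# Sketch only: the CIFAR-10 tutorial reads its batch size from this flag.
# Lowering the default from 128 to 64 roughly halves the per-step activation memory.
tf.app.flags.DEFINE_integer('batch_size', 64,
                            """Number of images to process in a batch.""")

cifar10_train.py reads the value through FLAGS.batch_size, so no other change should be needed.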


drpngx commented Jul 20, 2017

Are you saying that there is a memory leak?

drpngx added the stat:awaiting response (Waiting on input from the contributor) label on Jul 20, 2017

ehfo0 commented Jul 21, 2017

Thanks for the answer.
I changed the batch_size from 128 to 64 and now it's running! I'm running it on my PC, so it might take a while!
I don't know about a memory leak, but I guess the cache just ran out of memory. I have 16 GB of RAM and a GeForce GTX 970 graphics card.
Does it lose the previous training time, or does the network get better each time we run train.py?

aselle removed the stat:awaiting response (Waiting on input from the contributor) label on Jul 21, 2017

ehfo0 commented Jul 21, 2017

Thanks, I reduced the batch size and it worked!
I got a precision of:
2017-07-21 21:42:04.630874: precision @ 1 = 0.859


drpngx commented Jul 22, 2017

Yay!!

drpngx closed this as completed on Jul 22, 2017
@deepakmeena635

I'm having the same issue; you can find more details below:
tensorflow/tensorflow#4735 (comment)
Please look into it.

@ShubhamKanitkar

The most expedient way is probably to reduce the batch size. It'll run slower, but use less memory.

I am using the CPU to train the model.
I have already set the batch size to 1 and resized the images to 200 x 200.
It is still throwing a Resource Exhausted error.

Please help.


krw0320 commented Jul 2, 2019

I am having the same issue even with a reduced batch size


heizie commented Jul 8, 2019

Same here. The batch size was already 1, I've changed the fixed_shape_resizer to 500x500 (using Faster R-CNN models), and

  import tensorflow as tf  # TF 1.x

  # Cap the session at 30% of the GPU's memory.
  session_config = tf.ConfigProto()
  session_config.gpu_options.per_process_gpu_memory_fraction = 0.3
  config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, session_config=session_config)

is also set.

But it keeps showing this (with SSD ResNet, same error):

2019-07-08 18:37:10.194834: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[100,51150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
until it breaks automatically.
I can only train the SSD MobileNet. Very confusing.

I'm using a GTX 1060 6GB and an RTX 2070; both have the same error.
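A hedged note on the per_process_gpu_memory_fraction setting above: capping the session at 0.3 limits TensorFlow to roughly 30% of the card, which can itself trigger OOM on a 6 GB GPU. With the TF 1.x API an alternative is to let the allocator grow on demand (model_dir below is a placeholder):

import tensorflow as tf  # TF 1.x

# Sketch: let the GPU allocator grow as needed instead of pre-reserving a fixed fraction.
session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True
run_config = tf.estimator.RunConfig(model_dir='/tmp/model', session_config=session_config)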

@Alex-Naxitus

Hello, I'm facing the same problem when training with the kangaroo dataset.
Reducing the training batch from 16 to 4 has not changed anything.
CPU: 8 GB RAM + Intel(R) UHD Graphics 630
GPU: GeForce GTX 1050 3 GB, Windows 10 + Anaconda

From running

import tensorflow as tf
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

I get the response:
GPU:0 with 2131 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 18102239670215265869
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 2235275673
locality {
bus_id: 1
links {
}
}
incarnation: 6041356209009565047
physical_device_desc: "device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1"
]

Thanks for your help and advice.

@dlavrantonis

The issue is still there; I'm not sure why the ticket is closed. There must be a leak, since it goes away if you reboot the machine.


n-92 commented Jan 5, 2021

This issue should not be closed. Tsk.


n-92 commented Jan 5, 2021

The most expedient way is probably to reduce the batch size. It'll run slower, but use less memory.

Hi, which file do I modify?


aeon0 commented Jan 8, 2021

@o92 I would say this is not a TensorFlow issue to begin with. There are multiple things you can do:

  • Decrease the batch size (see the sketch below)
  • Decrease the model input size
  • Decrease other model properties such as filter size
  • Get better hardware
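A minimal, hypothetical tf.keras sketch of the first three knobs; every layer, size, and value here is illustrative rather than taken from any model in this thread:

import tensorflow as tf

INPUT_SIZE = 128   # smaller input resolution -> smaller activations
FILTERS = 32       # fewer/narrower filters -> smaller weights and activations
BATCH_SIZE = 8     # smaller batch -> less activation memory per step

# Toy model used only to show where each knob lives.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(FILTERS, 3, activation='relu',
                           input_shape=(INPUT_SIZE, INPUT_SIZE, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Random data just so the snippet runs end to end.
x = tf.random.uniform((64, INPUT_SIZE, INPUT_SIZE, 3))
y = tf.random.uniform((64,), maxval=10, dtype=tf.int32)
model.fit(x, y, batch_size=BATCH_SIZE, epochs=1)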

@sidharth1805

The main reason for these OOM allocation errors is that there isn't enough memory available, so these are the fixes you can try:

  1. Switch to another TF2 object detection model, such as MobileNet, with less processing time (you have to sacrifice some accuracy)
  2. Decrease the batch size
  3. Change the image resizer values

Try them one by one. If it still shows the same error, choose a lite model, reduce the batch size to 1, and try running it on Google Colab. For people with 8 GB of RAM it is very difficult to run this on a CPU, and Colab gives you 12 GB of RAM. These higher-level object detection models require a lot of computing power, so don't go with complex models unless necessary; otherwise you will surely need to upgrade the hardware.

Reducing the batch size didn't help me at first; then I did this and now it's working.

@ChuaCheowHuan

I encountered this issue when trying to do fine-tuning on Colab with EfficientNetB7. When I reduced the number of trainable layers during the fine-tuning process, everything worked fine. The batch size I'm using is 64.

# Fine tune from this layer onwards
#fine_tune_at = 100   # Error
fine_tune_at = 700   # OK

# Freeze all the layers before the `fine_tune_at` layer
for layer in base_model.layers[:fine_tune_at]:
  layer.trainable =  False

Number of layers in the base model (EfficientNetB7): 813
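For context, a hypothetical end-to-end sketch of the setup described above (the head, data shapes, and weights are placeholders; the point is that frozen layers keep no gradient tensors or optimizer slots, so freezing more of the 813 layers cuts GPU memory):

import tensorflow as tf

# Hypothetical base model; EfficientNetB7 is available in tf.keras.applications (TF 2.3+).
base_model = tf.keras.applications.EfficientNetB7(include_top=False, weights='imagenet',
                                                  input_shape=(600, 600, 3))

fine_tune_at = 700  # freezing up to a later layer keeps memory use manageable

# Freeze all the layers before the `fine_tune_at` layer.
for layer in base_model.layers[:fine_tune_at]:
    layer.trainable = False

model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # placeholder binary head
])
model.compile(optimizer='adam', loss='binary_crossentropy')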

@tinayzdzd

Hi all,
I have the same issue.
My data is text data and it contains 3097 rows.
I am using Google Colab with a GPU.
I use TensorFlow 1.15 and ELMo embeddings for a feed-forward network.
After checking different max sequences, with a max_sequence of 1024 and a batch size of 8, 2144/2355 training samples had trained and then I got this error message:

ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[8,1024,4849,44] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node lambda_3/module_4_apply_default/bilm/CNN/Conv2D_6}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[loss_3/mul/_523]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[8,1024,4849,44] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node lambda_3/module_4_apply_default/bilm/CNN/Conv2D_6}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.

0 derived errors ignored.

As you can see in the error, there is an OOM when allocating a tensor with shape [8, 1024, 4849, 44]. My questions are:

  1. Why is the shape 4-dimensional? As far as I know, the ELMo embedding has a 3-dimensional shape.
  2. 8 is the batch size and 1024 is the max sequence; I don't know what 4849 and 44 are here.
  3. Are there any recommendations for fixing this issue?

Thanks a lot in advance.
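On the hint printed in the log: report_tensor_allocations_upon_oom is a field of the TF 1.x RunOptions proto, which is passed to a session run. A minimal sketch, with sess and train_op standing in for your own session and training op:

import tensorflow as tf  # TF 1.x

# Ask TensorFlow to list the live tensor allocations when an OOM happens,
# which usually points at the op that dominates memory.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# sess.run(train_op, options=run_options)  # placeholders for your own session / op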

@WittmannF

Coming here because reducing the batch size didn't solve the problem in my case. If you are having this problem during inference, the following might help:

with tf.device("cpu:0"):
  prediction = model.predict(...)

The difference between CPU and GPU inference time is not that high, and you'll have way more memory available on the CPU.
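A self-contained sketch of the same idea; the model below is a hypothetical stand-in for whatever model is hitting the OOM:

import numpy as np
import tensorflow as tf

# Hypothetical small model; replace with your own.
model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(100,))])
x = np.random.rand(32, 100).astype('float32')

# Pin inference to the CPU: slower per batch, but it uses host RAM rather than
# the (usually much smaller) GPU memory.
with tf.device('/cpu:0'):
    prediction = model.predict(x)
print(prediction.shape)  # (32, 10)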

@oldmonkABA

with tf.device("cpu:0"):

God bless you, dude. I was struggling with this for 2 days.
