
ResourceExhaustedError after segmentation models update! #167

Closed
safak17 opened this issue Aug 10, 2019 · 18 comments

safak17 commented Aug 10, 2019

Hi! I was working with FPN and the 'resnext101' backbone on Google Colab. I had trained the model and done lots of experiments, and the results were very good. Today, after I updated segmentation models (actually, every time I use Google Colab I have to reinstall it), I got the error shown below. By the way, I tried Unet with a 'vgg16' backbone and everything went well. I wonder why FPN with the resnext101 backbone no longer fits in GPU memory when it fit two days ago.

Thank you very much @qubvel .

Edit1:
FPN with vgg16 backbone is OK.
FPN with vgg19 backbone is OK.
FPN with resnet34 backbone is OK.
FPN with resnet50 backbone is NOT OK (The same error is shown below).
FPN with resnet101 backbone is NOT OK (The same error is shown below).
FPN with resnext50 backbone is NOT OK (The same error is shown below).

Edit2:
The related StackOverflow question.

Epoch 1/100
---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input-22-1b2892f8cab2> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', 'history = model.fit_generator(\n    generator = zipped_train_generator,\n  validation_data=(X_validation, y_validation),\n    steps_per_epoch=len(X_train) // NUM_BATCH,\n    callbacks= callbacks_list,\n    verbose = 1,\n    epochs = NUM_EPOCH)')

9 frames
</usr/local/lib/python3.6/dist-packages/decorator.py:decorator-gen-60> in time(self, line, cell, local_ns)

<timed exec> in <module>()

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
   1456         ret = tf_session.TF_SessionRunCallable(self._session._session,
   1457                                                self._handle, args,
-> 1458                                                run_metadata_ptr)
   1459         if run_metadata:
   1460           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[32,128,112,112] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node training/RMSprop/gradients/zeros_21}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[loss/mul/_11081]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[32,128,112,112] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node training/RMSprop/gradients/zeros_21}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
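
As an aside, the hint in the traceback refers to TensorFlow's RunOptions. A minimal sketch of wiring it through Keras, assuming the standalone Keras + TF 1.x stack used in this Colab setup and a `model` built as later in this thread (the optimizer and loss here are placeholders):

import tensorflow as tf

# Ask TF to report the allocated tensors when an OOM occurs; with the TF
# backend, Keras forwards `options` from compile() down to tf.Session.run.
run_opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              options=run_opts)
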
qubvel (Owner) commented Aug 10, 2019

Hi, I have added one more skip connection (a small modification, but it now requires more memory).
Two possible ways:

  1. roll back
  2. use aggregation_mode = "sum" - this should reduce memory consumption

Maybe it was not a good decision to make this modification, but I have received feedback that it helps to improve quality.

qubvel (Owner) commented Aug 10, 2019

Maybe it would be good to make this modification optional and restore the original architecture.

safak17 (Author) commented Aug 10, 2019

Newbies like me will try to run these kinds of libraries on Google Colab first, and I don't think we'll easily understand and solve this model-vs-GPU-memory problem on our own. As you said, it would be better if it were optional. Could you please let me know when you update it?

  1. I would try to roll back, but I don't know how to see the versions of the library. Which version should I go back to, and how?
  2. I didn't understand what aggregation_mode = "sum" means. Where and how should I use it? Do you mean pyramid_aggregation = 'sum' in the FPN constructor? If so, it does not solve the problem:
model = sm.FPN(backbone_name='resnext101',
               classes=NUM_CLASSES,
               input_shape=INPUT_SHAPE,
               encoder_weights="imagenet",
               activation="softmax",
               encoder_freeze=True,
               pyramid_aggregation='sum')

Thank you very much! :-)

qubvel (Owner) commented Aug 10, 2019

I have mentioned this in the readme: you need version 0.2.1.
Yeah, I mean the pyramid aggregation parameter, so you have only one option - roll back)

qubvel (Owner) commented Aug 10, 2019

pip install -U segmentation-models==0.2.1

safak17 (Author) commented Aug 10, 2019

Still the same problem.

# !pip install -U segmentation-models==0.2.1
import segmentation_models as sm

# Load model
model = sm.FPN(backbone_name='resnext101',
               classes=NUM_CLASSES,
               input_shape=INPUT_SHAPE,
               encoder_weights="imagenet",
               activation="softmax",
               encoder_freeze=True)

model.compile(optimizer=optimizers.rmsprop(lr=0.00032),
              loss=categorical_focal_loss(),
              metrics=["accuracy", sm.metrics.iou_score])

NUM_BATCH = 32
NUM_EPOCH = 100

history = model.fit_generator(
    generator=zipped_train_generator,
    validation_data=(X_validation, y_validation),
    steps_per_epoch=len(X_train) // NUM_BATCH,
    callbacks=callbacks_list,
    verbose=1,
    epochs=NUM_EPOCH)
Epoch 1/100
---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input-17-d397d21e8395> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', 'history = model.fit_generator(\n    generator = zipped_train_generator,\n    validation_data=(X_validation, y_validation),\n    steps_per_epoch=len(X_train) // NUM_BATCH,\n    callbacks= callbacks_list,\n    verbose = 1,\n    epochs = NUM_EPOCH)')

9 frames
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
   2115             magic_arg_s = self.var_expand(line, stack_depth)
   2116             with self.builtin_trap:
-> 2117                 result = fn(magic_arg_s, cell)
   2118             return result
   2119 

</usr/local/lib/python3.6/dist-packages/decorator.py:decorator-gen-60> in time(self, line, cell, local_ns)

/usr/local/lib/python3.6/dist-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
    186     # but it's overkill for just that one bit of state.
    187     def magic_deco(arg):
--> 188         call = lambda f, *a, **k: f(*a, **k)
    189 
    190         if callable(arg):

/usr/local/lib/python3.6/dist-packages/IPython/core/magics/execution.py in time(self, line, cell, local_ns)
   1191         else:
   1192             st = clock2()
-> 1193             exec(code, glob, local_ns)
   1194             end = clock2()
   1195             out = None

<timed exec> in <module>()

/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
     89                 warnings.warn('Update your `' + object_name + '` call to the ' +
     90                               'Keras 2 API: ' + signature, stacklevel=2)
---> 91             return func(*args, **kwargs)
     92         wrapper._original_function = func
     93         return wrapper

/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
   1416             use_multiprocessing=use_multiprocessing,
   1417             shuffle=shuffle,
-> 1418             initial_epoch=initial_epoch)
   1419 
   1420     @interfaces.legacy_generator_methods_support

/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py in fit_generator(model, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
    215                 outs = model.train_on_batch(x, y,
    216                                             sample_weight=sample_weight,
--> 217                                             class_weight=class_weight)
    218 
    219                 outs = to_list(outs)

/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight)
   1215             ins = x + y + sample_weights
   1216         self._make_train_function()
-> 1217         outputs = self.train_function(ins)
   1218         return unpack_singleton(outputs)
   1219 

/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py in __call__(self, inputs)
   2713                 return self._legacy_call(inputs)
   2714 
-> 2715             return self._call(inputs)
   2716         else:
   2717             if py_any(is_tensor(x) for x in inputs):

/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py in _call(self, inputs)
   2673             fetched = self._callable_fn(*array_vals, run_metadata=self.run_metadata)
   2674         else:
-> 2675             fetched = self._callable_fn(*array_vals)
   2676         return fetched[:len(self.outputs)]
   2677 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
   1456         ret = tf_session.TF_SessionRunCallable(self._session._session,
   1457                                                self._handle, args,
-> 1458                                                run_metadata_ptr)
   1459         if run_metadata:
   1460           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

ResourceExhaustedError: OOM when allocating tensor with shape[32,512,56,56] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node training/RMSprop/gradients/conv2d_1057/convolution_grad/Conv2DBackpropInput}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

qubvel (Owner) commented Aug 10, 2019

Check the version:

import segmentation_models as sm
print(sm.__version__)

qubvel (Owner) commented Aug 10, 2019

You can also try a smaller batch_size, or reduce the image spatial size, so the model fits into memory.
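
For concreteness, a minimal sketch of those two knobs; the specific numbers are illustrative assumptions, not values from this thread:

import segmentation_models as sm

NUM_CLASSES = 12                 # example value; use your own
NUM_BATCH = 16                   # e.g. halve the batch size (the code above uses 32)
INPUT_SHAPE = (224, 224, 3)      # and/or shrink the spatial size of the inputs

model = sm.FPN(backbone_name='resnext101',
               classes=NUM_CLASSES,
               input_shape=INPUT_SHAPE,
               encoder_weights="imagenet")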

safak17 (Author) commented Aug 10, 2019

I checked the version and it's correct: 0.2.1.

I will try your suggestions. But it's weird: two days ago everything was working very well, and now I am suffering from GPU memory allocation errors with the same code and the same environment. :-)

qubvel (Owner) commented Aug 10, 2019

Actually, it does not look like a CUDA memory-exhausted error; usually that says something like "not enough memory for tensor with shape ...".
So, try to check your code again.

qubvel (Owner) commented Aug 10, 2019

The first traceback is correct, but the second looks different.

qubvel (Owner) commented Aug 10, 2019

Oh, it was just rendering incorrectly on my mobile - they are the same))

qubvel (Owner) commented Aug 10, 2019

Yeah, that's strange...

qubvel closed this as completed on Oct 15, 2019
mletombe commented

Hello,

With version 1.0.0, FPN(resnet34), FPN(resnet18), and FPN(inceptionv3) all OOM. They worked with 0.2.1, so maybe there's something else, don't you think?
My images are 1024x2048x3, my GPU is a 2080 Ti with 11 GB, and the batch size is 1.
Rolling back is OK.

Thanks for your work,
Mathieu.

qubvel (Owner) commented Oct 18, 2019

Hi, there is an option for how to aggregate the feature pyramid: sum or concat.
Use sum for less memory consumption (as it was in version 0.2).
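
A minimal sketch of that option, using the pyramid_aggregation parameter that appears earlier in this thread (the backbone and class count here are illustrative):

import segmentation_models as sm

NUM_CLASSES = 12                           # example value; use your own
# 'sum' adds the pyramid features instead of concatenating them, so the
# decoder tensors are smaller - this matches the pre-1.0 behaviour.
model = sm.FPN(backbone_name='resnet34',
               classes=NUM_CLASSES,
               pyramid_aggregation='sum')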

mletombe commented

I tried it; I still get OOM for FPN(resnet18) and FPN(inceptionv3)...

CChen89 commented Oct 20, 2019

Hi,

Does your model use multiple GPUs in training (e.g. keras.utils.multi_gpu_model)? I have the same problem when using Unet(efficientnetb4/b5/b6) with a batch_size > 12. My images are all 320 by 320. It's OK if I use a batch_size smaller than 10.

Thank you,

Chen
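
For reference, a minimal sketch of the keras.utils.multi_gpu_model wrapper CChen89 is asking about, assuming `model` is an already-built Keras model and two GPUs are visible (the optimizer and loss are placeholders):

from keras.utils import multi_gpu_model

# Replicates the model on each GPU and splits every batch across them,
# so each replica sees batch_size / gpus samples per step.
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer='rmsprop',
                       loss='categorical_crossentropy')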

mletombe commented

Hi,

No, my model doesn't use keras.utils.multi_gpu_model.

Mathieu.
