
ResourceExhaustedError after segmentation models update! #167

Closed
safak17 opened this issue Aug 10, 2019 · 18 comments

safak17 commented Aug 10, 2019

Hi! I was working with FPN and the 'resnext101' backbone on Google Colab. I had trained the model and done lots of experiments, and the results were very good. Today, after I updated segmentation models (actually, every time I use Google Colab I have to reinstall it), I got the error shown below. By the way, I tried Unet with a 'vgg16' backbone and everything went well. I wonder why FPN with the resnext101 backbone no longer fits in GPU memory when it fit two days ago.

Thank you very much @qubvel .

Edit1:
FPN with vgg16 backbone is OK.
FPN with vgg19 backbone is OK.
FPN with resnet34 backbone is OK.
FPN with resnet50 backbone is NOT OK (The same error is shown below).
FPN with resnet101 backbone is NOT OK (The same error is shown below).
FPN with resnext50 backbone is NOT OK (The same error is shown below).

Edit2:
The related StackOverflow question.

Epoch 1/100
---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input-22-1b2892f8cab2> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', 'history = model.fit_generator(\n    generator = zipped_train_generator,\n  validation_data=(X_validation, y_validation),\n    steps_per_epoch=len(X_train) // NUM_BATCH,\n    callbacks= callbacks_list,\n    verbose = 1,\n    epochs = NUM_EPOCH)')

9 frames
</usr/local/lib/python3.6/dist-packages/decorator.py:decorator-gen-60> in time(self, line, cell, local_ns)

<timed exec> in <module>()

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
   1456         ret = tf_session.TF_SessionRunCallable(self._session._session,
   1457                                                self._handle, args,
-> 1458                                                run_metadata_ptr)
   1459         if run_metadata:
   1460           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[32,128,112,112] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node training/RMSprop/gradients/zeros_21}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[loss/mul/_11081]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[32,128,112,112] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node training/RMSprop/gradients/zeros_21}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
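
As an aside, the hint in the traceback refers to TensorFlow's RunOptions. A minimal sketch of wiring it through Keras, assuming the standalone Keras + TF 1.x stack used in this Colab setup and a `model` built as later in this thread (the optimizer and loss here are placeholders):

import tensorflow as tf

# Ask TF to report the allocated tensors when an OOM occurs; with the TF
# backend, Keras forwards `options` from compile() down to tf.Session.run.
run_opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              options=run_opts)
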
qubvel (Owner) commented Aug 10, 2019

Hi, I have added one more skip connection (a small modification, but it now requires more memory).
Two possible ways:

  1. roll back
  2. use aggregation_mode = "sum" - this should reduce memory consumption

Maybe it was not a good decision to make this modification, but I have received feedback that it helps to improve quality.

qubvel (Owner) commented Aug 10, 2019

Maybe it would be good to make this modification optional and restore the original architecture.

safak17 (Author) commented Aug 10, 2019

Newbies like me will try to run these kinds of libraries on Google Colab first, and I don't think we'll easily understand and solve this model-vs-GPU-memory problem on our own. As you said, it would be better if it were optional. Could you please let me know when you update it?

  1. I would try to roll back, but I don't know how to see the versions of the library. Which version should I go back to, and how?
  2. I didn't understand what aggregation_mode = "sum" means. Where and how should I use it? Do you mean pyramid_aggregation = 'sum' in the FPN constructor? If so, it does not solve the problem:
model = sm.FPN(backbone_name='resnext101',
               classes=NUM_CLASSES,
               input_shape=INPUT_SHAPE,
               encoder_weights="imagenet",
               activation="softmax",
               encoder_freeze=True,
               pyramid_aggregation='sum')

Thank you very much! :-)

qubvel (Owner) commented Aug 10, 2019

I have mentioned this in the readme: you need version 0.2.1.
Yeah, I mean the pyramid aggregation parameter, so you have only one option - roll back)

qubvel (Owner) commented Aug 10, 2019

pip install -U segmentation-models==0.2.1

safak17 (Author) commented Aug 10, 2019

Still the same problem.

# !pip install -U segmentation-models==0.2.1
import segmentation_models as sm

# Load model
model = sm.FPN(backbone_name='resnext101',
               classes=NUM_CLASSES,
               input_shape=INPUT_SHAPE,
               encoder_weights="imagenet",
               activation="softmax",
               encoder_freeze=True)

model.compile(optimizer=optimizers.rmsprop(lr=0.00032),
              loss=categorical_focal_loss(),
              metrics=["accuracy", sm.metrics.iou_score])

NUM_BATCH = 32
NUM_EPOCH = 100

history = model.fit_generator(
    generator=zipped_train_generator,
    validation_data=(X_validation, y_validation),
    steps_per_epoch=len(X_train) // NUM_BATCH,
    callbacks=callbacks_list,
    verbose=1,
    epochs=NUM_EPOCH)
Epoch 1/100
---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input-17-d397d21e8395> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', 'history = model.fit_generator(\n    generator = zipped_train_generator,\n    validation_data=(X_validation, y_validation),\n    steps_per_epoch=len(X_train) // NUM_BATCH,\n    callbacks= callbacks_list,\n    verbose = 1,\n    epochs = NUM_EPOCH)')

9 frames
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
   2115             magic_arg_s = self.var_expand(line, stack_depth)
   2116             with self.builtin_trap:
-> 2117                 result = fn(magic_arg_s, cell)
   2118             return result
   2119 

</usr/local/lib/python3.6/dist-packages/decorator.py:decorator-gen-60> in time(self, line, cell, local_ns)

/usr/local/lib/python3.6/dist-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
    186     # but it's overkill for just that one bit of state.
    187     def magic_deco(arg):
--> 188         call = lambda f, *a, **k: f(*a, **k)
    189 
    190         if callable(arg):

/usr/local/lib/python3.6/dist-packages/IPython/core/magics/execution.py in time(self, line, cell, local_ns)
   1191         else:
   1192             st = clock2()
-> 1193             exec(code, glob, local_ns)
   1194             end = clock2()
   1195             out = None

<timed exec> in <module>()

/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
     89                 warnings.warn('Update your `' + object_name + '` call to the ' +
     90                               'Keras 2 API: ' + signature, stacklevel=2)
---> 91             return func(*args, **kwargs)
     92         wrapper._original_function = func
     93         return wrapper

/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
   1416             use_multiprocessing=use_multiprocessing,
   1417             shuffle=shuffle,
-> 1418             initial_epoch=initial_epoch)
   1419 
   1420     @interfaces.legacy_generator_methods_support

/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py in fit_generator(model, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
    215                 outs = model.train_on_batch(x, y,
    216                                             sample_weight=sample_weight,
--> 217                                             class_weight=class_weight)
    218 
    219                 outs = to_list(outs)

/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight)
   1215             ins = x + y + sample_weights
   1216         self._make_train_function()
-> 1217         outputs = self.train_function(ins)
   1218         return unpack_singleton(outputs)
   1219 

/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py in __call__(self, inputs)
   2713                 return self._legacy_call(inputs)
   2714 
-> 2715             return self._call(inputs)
   2716         else:
   2717             if py_any(is_tensor(x) for x in inputs):

/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py in _call(self, inputs)
   2673             fetched = self._callable_fn(*array_vals, run_metadata=self.run_metadata)
   2674         else:
-> 2675             fetched = self._callable_fn(*array_vals)
   2676         return fetched[:len(self.outputs)]
   2677 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
   1456         ret = tf_session.TF_SessionRunCallable(self._session._session,
   1457                                                self._handle, args,
-> 1458                                                run_metadata_ptr)
   1459         if run_metadata:
   1460           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

ResourceExhaustedError: OOM when allocating tensor with shape[32,512,56,56] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node training/RMSprop/gradients/conv2d_1057/convolution_grad/Conv2DBackpropInput}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

qubvel (Owner) commented Aug 10, 2019

Check the version:

import segmentation_models as sm
print(sm.__version__)

qubvel (Owner) commented Aug 10, 2019

You can also try a smaller batch_size, or reduce the image spatial size, so the model fits into memory.
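
For concreteness, a minimal sketch of those two knobs; the specific numbers are illustrative assumptions, not values from this thread:

import segmentation_models as sm

NUM_CLASSES = 12                 # example value; use your own
NUM_BATCH = 16                   # e.g. halve the batch size (the code above uses 32)
INPUT_SHAPE = (224, 224, 3)      # and/or shrink the spatial size of the inputs

model = sm.FPN(backbone_name='resnext101',
               classes=NUM_CLASSES,
               input_shape=INPUT_SHAPE,
               encoder_weights="imagenet")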

safak17 (Author) commented Aug 10, 2019

I checked the version and it's correct: 0.2.1.

I will try your suggestions. But it's weird: two days ago everything was working very well, and now I am suffering from GPU memory allocation errors with the same code and the same environment. :-)

qubvel (Owner) commented Aug 10, 2019

Actually, it does not look like a CUDA memory-exhausted error; usually that says something like "not enough memory for tensor with shape ...".
So, try to check your code again.

qubvel (Owner) commented Aug 10, 2019

The first traceback is correct, but the second looks different.

qubvel (Owner) commented Aug 10, 2019

Oh, it was just rendering incorrectly on my mobile - they are the same))

qubvel (Owner) commented Aug 10, 2019

Yeah, that's strange...

qubvel closed this as completed on Oct 15, 2019
mletombe commented

Hello,

With version 1.0.0, FPN(resnet34), FPN(resnet18), and FPN(inceptionv3) all OOM. They worked with 0.2.1, so maybe there's something else, don't you think?
My images are 1024x2048x3, my GPU is a 2080 Ti with 11 GB, and the batch size is 1.
Rolling back is OK.

Thanks for your work,
Mathieu.

qubvel (Owner) commented Oct 18, 2019

Hi, there is an option for how to aggregate the feature pyramid: sum or concat.
Use sum for less memory consumption (as it was in version 0.2).
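
A minimal sketch of that option, using the pyramid_aggregation parameter that appears earlier in this thread (the backbone and class count here are illustrative):

import segmentation_models as sm

NUM_CLASSES = 12                           # example value; use your own
# 'sum' adds the pyramid features instead of concatenating them, so the
# decoder tensors are smaller - this matches the pre-1.0 behaviour.
model = sm.FPN(backbone_name='resnet34',
               classes=NUM_CLASSES,
               pyramid_aggregation='sum')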

mletombe commented

I tried it; I still get OOM for FPN(resnet18) and FPN(inceptionv3)...

CChen89 commented Oct 20, 2019

Hi,

Does your model use multiple GPUs in training (e.g. keras.utils.multi_gpu_model)? I have the same problem when using Unet(efficientnetb4/b5/b6) with a batch_size > 12. My images are all 320 by 320. It's OK if I use a batch_size smaller than 10.

Thank you,

Chen
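
For reference, a minimal sketch of the keras.utils.multi_gpu_model wrapper CChen89 is asking about, assuming `model` is an already-built Keras model and two GPUs are visible (the optimizer and loss are placeholders):

from keras.utils import multi_gpu_model

# Replicates the model on each GPU and splits every batch across them,
# so each replica sees batch_size / gpus samples per step.
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer='rmsprop',
                       loss='categorical_crossentropy')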

mletombe commented

Hi,

No, my model doesn't use keras.utils.multi_gpu_model.

Mathieu.
