Which parameters to reduce to avoid ResourceExhaustedError #3

MounirB · 2018-10-10T13:16:31Z

Hello, I am trying to train the faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28 model by running the train.py script, but I get the following ResourceExhaustedError. Do you have any idea how to solve it? I tried changing many parameters in pipeline.config, but it doesn't change anything.

2018-10-10 14:54:05.313837: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Sum Total of in-use chunks: 1.25GiB
2018-10-10 14:54:05.313845: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats:
Limit: 1363345408
InUse: 1338755072
MaxInUse: 1350130944
NumAllocs: 3937
MaxAllocSize: 256131072

2018-10-10 14:54:05.313921: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ****************************************************************************************************
2018-10-10 14:54:05.313944: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at transpose_op.cc:199 : Resource exhausted: OOM when allocating tensor with shape[4,160,42,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
INFO:tensorflow:Error reported to Coordinator: OOM when allocating tensor with shape[4,160,42,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/SpaceToBatchND, PermConstNHWCToNCHW-LayoutOptimizer)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/strided_slice/_1871 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_11146...ided_slice", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Traceback (most recent call last):
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
return fn(*args)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,160,42,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/SpaceToBatchND, PermConstNHWCToNCHW-LayoutOptimizer)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/strided_slice/_1871 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_11146...ided_slice", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 495, in run
self.run_loop()
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 1035, in run_loop
self._sv.global_step])
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run
run_metadata_ptr)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run
feed_dict_tensor, options, run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,160,42,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/SpaceToBatchND, PermConstNHWCToNCHW-LayoutOptimizer)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/strided_slice/_1871 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_11146...ided_slice", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Traceback (most recent call last):
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
return fn(*args)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,384,72,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_1a_3x3/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_0b_3x3/Relu, FirstStageFeatureExtractor/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_1a_3x3/weights/read/_3137)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: gradients/SecondStageFeatureExtractor/InceptionResnetV2/Repeat/block8_9/Conv2d_1x1/Conv2D_grad/tuple/control_dependency_1/_5073 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_13509...pendency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 995, in managed_session
yield sess
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 770, in train
sess, train_op, global_step, train_step_kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run
run_metadata_ptr)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run
feed_dict_tensor, options, run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,384,72,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_1a_3x3/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_0b_3x3/Relu, FirstStageFeatureExtractor/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_1a_3x3/weights/read/_3137)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: gradients/SecondStageFeatureExtractor/InceptionResnetV2/Repeat/block8_9/Conv2d_1x1/Conv2D_grad/tuple/control_dependency_1/_5073 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_13509...pendency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_1a_3x3/Conv2D', defined at:
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/train.py", line 163, in <module>
tf.app.run()
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/train.py", line 159, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/trainer.py", line 228, in train
clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/deployment/model_deploy.py", line 193, in create_clones
outputs = model_fn(*args, **kwargs)
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/trainer.py", line 165, in _create_losses
prediction_dict = detection_model.predict(images)
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 531, in predict
image_shape) = self._extract_rpn_feature_maps(preprocessed_inputs)
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 685, in _extract_rpn_feature_maps
preprocessed_inputs, scope=self.first_stage_feature_extractor_scope)
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 134, in extract_proposal_features
return self._extract_proposal_features(preprocessed_inputs, scope)
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/models/faster_rcnn_inception_resnet_v2_feature_extractor.py", line 112, in _extract_proposal_features
align_feature_maps=True))
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/nets/inception_resnet_v2.py", line 232, in inception_resnet_v2_base
scope='Conv2d_1a_3x3')
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1154, in convolution2d
conv_dims=2)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1057, in convolution
outputs = layer.apply(inputs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 805, in apply
return self.__call__(inputs, *args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 362, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 736, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/layers/convolutional.py", line 186, in call
outputs = self._convolution_op(inputs, self.kernel)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 868, in __call__
return self.conv_op(inp, filter)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 520, in __call__
return self.call(inp, filter)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 204, in __call__
name=self.name)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 956, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
op_def=op_def)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in __init__
self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,384,72,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_1a_3x3/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_0b_3x3/Relu, FirstStageFeatureExtractor/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_1a_3x3/weights/read/_3137)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: gradients/SecondStageFeatureExtractor/InceptionResnetV2/Repeat/block8_9/Conv2d_1x1/Conv2D_grad/tuple/control_dependency_1/_5073 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_13509...pendency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/train.py", line 163, in <module>
tf.app.run()
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/train.py", line 159, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/trainer.py", line 332, in train
saver=saver)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 785, in train
ignore_live_threads=ignore_live_threads)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 1005, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 833, in stop
ignore_live_threads=ignore_live_threads)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 495, in run
self.run_loop()
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 1035, in run_loop
self._sv.global_step])
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run
run_metadata_ptr)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run
feed_dict_tensor, options, run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,160,42,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/SpaceToBatchND, PermConstNHWCToNCHW-LayoutOptimizer)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/strided_slice/_1871 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_11146...ided_slice", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Process finished with exit code 1
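
For scale, the figures in the log above can be converted to more readable units. The snippet below is only a rough sketch: it assumes that "type float" means 32-bit floats (4 bytes per element) and that the allocator stats are a snapshot taken around the time of the failure. The helper names `tensor_bytes` and `to_mib` are made up for illustration and are not part of TensorFlow.

```python
# Figures copied from the bfc_allocator stats and the OOM messages in the log.
LIMIT_BYTES = 1363345408    # "Limit"
IN_USE_BYTES = 1338755072   # "InUse"
FLOAT32_BYTES = 4           # assumption: "type float" is a 32-bit float


def tensor_bytes(shape, bytes_per_element=FLOAT32_BYTES):
    """Return the size in bytes of a dense tensor with the given shape."""
    size = bytes_per_element
    for dim in shape:
        size *= dim
    return size


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / 2**20


free_bytes = LIMIT_BYTES - IN_USE_BYTES

print("limit : %.2f MiB" % to_mib(LIMIT_BYTES))
print("in use: %.2f MiB" % to_mib(IN_USE_BYTES))
print("free  : %.2f MiB" % to_mib(free_bytes))

# The two tensor shapes that the OOM messages report as failed allocations.
for shape in ([4, 160, 42, 64], [1, 384, 72, 128]):
    print(shape, "needs %.2f MiB" % to_mib(tensor_bytes(shape)))
```

The limit works out to roughly 1.27 GiB, almost all of which is already in use when the allocation fails, so each further activation tensor of several MiB has very little room left.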

thatbrguy · 2018-10-15T03:45:08Z

Changing parameters other than the batch size won't help much if you're using pretrained models. The faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28 model is pretty large. I would suggest using a smaller model such as FasterRCNN_ResNet50 or SSD_MobileNet.

MounirB · 2018-10-18T14:19:23Z

The same problem occurs again, even with SSD_MobileNet :/
I have a P400 GPU.
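
As a rough illustration of why the batch size matters here: the tensor that fails to allocate in the log below has shape [24,128,75,75] in NCHW layout. Assuming the leading 24 is the `batch_size` from the `train_config` section of pipeline.config (the config itself is not shown in this thread, so this is a guess) and that each element is a 32-bit float, the size of this one activation scales linearly with the batch size. The function name `activation_bytes` is hypothetical.

```python
def activation_bytes(batch_size, channels, height, width, bytes_per_element=4):
    """Size in bytes of one dense NCHW activation tensor (float32 by default)."""
    return batch_size * channels * height * width * bytes_per_element


# Channels and spatial size taken from the failing shape [24,128,75,75].
CHANNELS, HEIGHT, WIDTH = 128, 75, 75

for batch_size in (24, 8, 4, 1):
    size = activation_bytes(batch_size, CHANNELS, HEIGHT, WIDTH)
    print("batch_size=%2d -> %.2f MiB" % (batch_size, size / 2**20))
```

This covers a single layer's output only. Training keeps many such activations alive for the backward pass, so the total footprint is much larger, but it shrinks in roughly the same proportion when the batch size is reduced.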

2018-10-18 16:09:49.781181: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ____**********___********************************************************************xxxxx
2018-10-18 16:09:49.781198: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at conv_ops.cc:693 : Resource exhausted: OOM when allocating tensor with shape[24,128,75,75] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.ResourceExhaustedError'>, OOM when allocating tensor with shape[24,128,75,75] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6, FeatureExtractor/MobilenetV1/Conv2d_3_pointwise/weights/read/_3593)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: Loss/Where_260/_6409 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14144_Loss/Where_260", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D', defined at:
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/train.py", line 165, in <module>
tf.app.run()
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/train.py", line 161, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/trainer.py", line 228, in train
clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/deployment/model_deploy.py", line 193, in create_clones
outputs = model_fn(*args, **kwargs)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/trainer.py", line 165, in _create_losses
prediction_dict = detection_model.predict(images)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/meta_architectures/ssd_meta_arch.py", line 264, in predict
preprocessed_inputs)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/models/ssd_mobilenet_v1_feature_extractor.py", line 106, in extract_features
scope=scope)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/nets/mobilenet_v1.py", line 258, in mobilenet_v1_base
scope=end_point)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1154, in convolution2d
conv_dims=2)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1057, in convolution
outputs = layer.apply(inputs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 805, in apply
return self.__call__(inputs, *args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 362, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 736, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/layers/convolutional.py", line 186, in call
outputs = self._convolution_op(inputs, self.kernel)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 868, in __call__
return self.conv_op(inp, filter)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 520, in __call__
return self.call(inp, filter)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 204, in __call__
name=self.name)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 956, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
op_def=op_def)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in __init__
self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[24,128,75,75] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6, FeatureExtractor/MobilenetV1/Conv2d_3_pointwise/weights/read/_3593)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: Loss/Where_260/_6409 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14144_Loss/Where_260", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Traceback (most recent call last):
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
return fn(*args)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[24,128,75,75] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6, FeatureExtractor/MobilenetV1/Conv2d_3_pointwise/weights/read/_3593)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: Loss/Where_260/_6409 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14144_Loss/Where_260", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/train.py", line 165, in <module>
tf.app.run()
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/train.py", line 161, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/trainer.py", line 332, in train
saver=saver)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 770, in train
sess, train_op, global_step, train_step_kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run
run_metadata_ptr)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run
feed_dict_tensor, options, run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[24,128,75,75] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6, FeatureExtractor/MobilenetV1/Conv2d_3_pointwise/weights/read/_3593)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: Loss/Where_260/_6409 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14144_Loss/Where_260", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D', defined at:
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/train.py", line 165, in <module>
tf.app.run()
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/train.py", line 161, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/trainer.py", line 228, in train
clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/deployment/model_deploy.py", line 193, in create_clones
outputs = model_fn(*args, **kwargs)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/trainer.py", line 165, in _create_losses
prediction_dict = detection_model.predict(images)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/meta_architectures/ssd_meta_arch.py", line 264, in predict
preprocessed_inputs)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/models/ssd_mobilenet_v1_feature_extractor.py", line 106, in extract_features
scope=scope)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/nets/mobilenet_v1.py", line 258, in mobilenet_v1_base
scope=end_point)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1154, in convolution2d
conv_dims=2)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1057, in convolution
outputs = layer.apply(inputs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 805, in apply
return self.__call__(inputs, *args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 362, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 736, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/layers/convolutional.py", line 186, in call
outputs = self._convolution_op(inputs, self.kernel)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 868, in __call__
return self.conv_op(inp, filter)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 520, in __call__
return self.call(inp, filter)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 204, in __call__
name=self.name)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 956, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
op_def=op_def)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in __init__
self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[24,128,75,75] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6, FeatureExtractor/MobilenetV1/Conv2d_3_pointwise/weights/read/_3593)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: Loss/Where_260/_6409 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14144_Loss/Where_260", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
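
For a rough sense of scale, the failing allocation in the log above can be sized by hand. The sketch below is illustrative only: `tensor_bytes` is a hypothetical helper, not part of TensorFlow or the Object Detection API. It assumes 4 bytes per element (the log reports type float, i.e. `DT_FLOAT`) and that the first dimension of the NCHW shape `[24,128,75,75]` is the batch size. It sizes a single activation tensor, not the whole training footprint.

```python
from functools import reduce
from operator import mul


def tensor_bytes(shape, bytes_per_element=4):
    """Return the size in bytes of a dense tensor with the given shape."""
    return reduce(mul, shape, 1) * bytes_per_element


# Shape reported in the OOM message (NCHW: batch, channels, height, width).
failing_shape = [24, 128, 75, 75]
size = tensor_bytes(failing_shape)
print(f"{size} bytes = {size / 2**20:.1f} MiB")

# Hypothetical smaller batch sizes: the tensor shrinks linearly with batch.
for batch in (12, 8, 4):
    smaller = tensor_bytes([batch] + failing_shape[1:])
    print(f"batch {batch}: {smaller / 2**20:.1f} MiB")
```

Under these assumptions the single tensor is about 66 MiB and scales linearly with its first dimension, which is why lowering `batch_size` in pipeline.config is usually the first parameter worth trying. Many such tensors are alive at once during training, so the total saving is larger than this one figure suggests.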

thatbrguy · 2018-10-20T09:13:16Z

Oh alright. You can try using Google Colab to train them then. You can fit FasterRCNN+ResNet-50 (and other models with similar param count) over there.

thatbrguy closed this as completed Oct 15, 2018

MounirB mentioned this issue on Nov 12, 2018: "Training on Colab, how to pass arguments to the .ipynb on colab" #6 (closed)
