Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Which parameters to reduce to avoid ResourceExhaustedError #3

Closed
MounirB opened this issue Oct 10, 2018 · 3 comments
Closed

Which parameters to reduce to avoid ResourceExhaustedError #3

MounirB opened this issue Oct 10, 2018 · 3 comments

Comments

@MounirB
Copy link

MounirB commented Oct 10, 2018

Hello, I try to train the faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28 model by launching the train.py script on it, but I get the following ResourceExhaustedError. Do you have any idea on how to solve it ? I tried to change many parameters in pipeline.config, but It doesn't change anything

2018-10-10 14:54:05.313837: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Sum Total of in-use chunks: 1.25GiB
2018-10-10 14:54:05.313845: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats:
Limit: 1363345408
InUse: 1338755072
MaxInUse: 1350130944
NumAllocs: 3937
MaxAllocSize: 256131072

2018-10-10 14:54:05.313921: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ****************************************************************************************************
2018-10-10 14:54:05.313944: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at transpose_op.cc:199 : Resource exhausted: OOM when allocating tensor with shape[4,160,42,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
INFO:tensorflow:Error reported to Coordinator: OOM when allocating tensor with shape[4,160,42,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/SpaceToBatchND, PermConstNHWCToNCHW-LayoutOptimizer)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/strided_slice/_1871 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_11146...ided_slice", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Traceback (most recent call last):
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
return fn(*args)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,160,42,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/SpaceToBatchND, PermConstNHWCToNCHW-LayoutOptimizer)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/strided_slice/_1871 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_11146...ided_slice", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 495, in run
self.run_loop()
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 1035, in run_loop
self._sv.global_step])
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run
run_metadata_ptr)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run
feed_dict_tensor, options, run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,160,42,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/SpaceToBatchND, PermConstNHWCToNCHW-LayoutOptimizer)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/strided_slice/_1871 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_11146...ided_slice", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Traceback (most recent call last):
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
return fn(*args)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,384,72,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_1a_3x3/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_0b_3x3/Relu, FirstStageFeatureExtractor/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_1a_3x3/weights/read/_3137)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: gradients/SecondStageFeatureExtractor/InceptionResnetV2/Repeat/block8_9/Conv2d_1x1/Conv2D_grad/tuple/control_dependency_1/_5073 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_13509...pendency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 995, in managed_session
yield sess
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 770, in train
sess, train_op, global_step, train_step_kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run
run_metadata_ptr)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run
feed_dict_tensor, options, run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,384,72,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_1a_3x3/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_0b_3x3/Relu, FirstStageFeatureExtractor/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_1a_3x3/weights/read/_3137)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: gradients/SecondStageFeatureExtractor/InceptionResnetV2/Repeat/block8_9/Conv2d_1x1/Conv2D_grad/tuple/control_dependency_1/_5073 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_13509...pendency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_1a_3x3/Conv2D', defined at:
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/train.py", line 163, in
tf.app.run()
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/train.py", line 159, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/trainer.py", line 228, in train
clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/deployment/model_deploy.py", line 193, in create_clones
outputs = model_fn(*args, **kwargs)
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/trainer.py", line 165, in _create_losses
prediction_dict = detection_model.predict(images)
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 531, in predict
image_shape) = self._extract_rpn_feature_maps(preprocessed_inputs)
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 685, in _extract_rpn_feature_maps
preprocessed_inputs, scope=self.first_stage_feature_extractor_scope)
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 134, in extract_proposal_features
return self._extract_proposal_features(preprocessed_inputs, scope)
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/models/faster_rcnn_inception_resnet_v2_feature_extractor.py", line 112, in _extract_proposal_features
align_feature_maps=True))
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/nets/inception_resnet_v2.py", line 232, in inception_resnet_v2_base
scope='Conv2d_1a_3x3')
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1154, in convolution2d
conv_dims=2)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1057, in convolution
outputs = layer.apply(inputs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 805, in apply
return self.call(inputs, *args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 362, in call
outputs = super(Layer, self).call(inputs, *args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 736, in call
outputs = self.call(inputs, *args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/layers/convolutional.py", line 186, in call
outputs = self._convolution_op(inputs, self.kernel)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 868, in call
return self.conv_op(inp, filter)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 520, in call
return self.call(inp, filter)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 204, in call
name=self.name)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 956, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
op_def=op_def)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in init
self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,384,72,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_1a_3x3/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_0b_3x3/Relu, FirstStageFeatureExtractor/InceptionResnetV2/Mixed_6a/Branch_1/Conv2d_1a_3x3/weights/read/_3137)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: gradients/SecondStageFeatureExtractor/InceptionResnetV2/Repeat/block8_9/Conv2d_1x1/Conv2D_grad/tuple/control_dependency_1/_5073 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_13509...pendency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/train.py", line 163, in
tf.app.run()
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/train.py", line 159, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/mounir/PycharmProjects/Pedestrian-Detection-master/object_detection/trainer.py", line 332, in train
saver=saver)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 785, in train
ignore_live_threads=ignore_live_threads)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/contextlib.py", line 99, in exit
self.gen.throw(type, value, traceback)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 1005, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 833, in stop
ignore_live_threads=ignore_live_threads)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 495, in run
self.run_loop()
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 1035, in run_loop
self._sv.global_step])
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run
run_metadata_ptr)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run
feed_dict_tensor, options, run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,160,42,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_2/Branch_1/Conv2d_0c_7x1/SpaceToBatchND, PermConstNHWCToNCHW-LayoutOptimizer)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/strided_slice/_1871 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_11146...ided_slice", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Process finished with exit code 1

@thatbrguy
Copy link
Owner

Changing parameters (besides batch size) won't help your case that much if you're using pretrained models. The model faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28 is pretty large. I would suggest using a smaller model such as FasterRCNN_ResNet50 or SSD_MobileNet.

@MounirB
Copy link
Author

MounirB commented Oct 18, 2018

Same problem occurring again, even with SSD_MobileNet :/
I have a P400 GPU

2018-10-18 16:09:49.781181: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ____**********___********************************************************************xxxxx
2018-10-18 16:09:49.781198: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at conv_ops.cc:693 : Resource exhausted: OOM when allocating tensor with shape[24,128,75,75] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.ResourceExhaustedError'>, OOM when allocating tensor with shape[24,128,75,75] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6, FeatureExtractor/MobilenetV1/Conv2d_3_pointwise/weights/read/_3593)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: Loss/Where_260/_6409 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14144_Loss/Where_260", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D', defined at:
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/train.py", line 165, in
tf.app.run()
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/train.py", line 161, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/trainer.py", line 228, in train
clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/deployment/model_deploy.py", line 193, in create_clones
outputs = model_fn(*args, **kwargs)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/trainer.py", line 165, in _create_losses
prediction_dict = detection_model.predict(images)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/meta_architectures/ssd_meta_arch.py", line 264, in predict
preprocessed_inputs)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/models/ssd_mobilenet_v1_feature_extractor.py", line 106, in extract_features
scope=scope)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/nets/mobilenet_v1.py", line 258, in mobilenet_v1_base
scope=end_point)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1154, in convolution2d
conv_dims=2)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1057, in convolution
outputs = layer.apply(inputs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 805, in apply
return self.call(inputs, *args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 362, in call
outputs = super(Layer, self).call(inputs, *args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 736, in call
outputs = self.call(inputs, *args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/layers/convolutional.py", line 186, in call
outputs = self._convolution_op(inputs, self.kernel)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 868, in call
return self.conv_op(inp, filter)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 520, in call
return self.call(inp, filter)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 204, in call
name=self.name)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 956, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
op_def=op_def)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in init
self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[24,128,75,75] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6, FeatureExtractor/MobilenetV1/Conv2d_3_pointwise/weights/read/_3593)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: Loss/Where_260/_6409 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14144_Loss/Where_260", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Traceback (most recent call last):
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
return fn(*args)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[24,128,75,75] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6, FeatureExtractor/MobilenetV1/Conv2d_3_pointwise/weights/read/_3593)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: Loss/Where_260/_6409 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14144_Loss/Where_260", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/train.py", line 165, in
tf.app.run()
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/train.py", line 161, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/trainer.py", line 332, in train
saver=saver)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 770, in train
sess, train_op, global_step, train_step_kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run
run_metadata_ptr)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run
feed_dict_tensor, options, run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
run_metadata)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[24,128,75,75] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6, FeatureExtractor/MobilenetV1/Conv2d_3_pointwise/weights/read/_3593)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: Loss/Where_260/_6409 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14144_Loss/Where_260", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D', defined at:
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/train.py", line 165, in
tf.app.run()
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/train.py", line 161, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/trainer.py", line 228, in train
clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/deployment/model_deploy.py", line 193, in create_clones
outputs = model_fn(*args, **kwargs)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/trainer.py", line 165, in _create_losses
prediction_dict = detection_model.predict(images)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/meta_architectures/ssd_meta_arch.py", line 264, in predict
preprocessed_inputs)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/object_detection/models/ssd_mobilenet_v1_feature_extractor.py", line 106, in extract_features
scope=scope)
File "/home/mounir/PycharmProjects/Pedestrian-detection-DL/nets/mobilenet_v1.py", line 258, in mobilenet_v1_base
scope=end_point)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1154, in convolution2d
conv_dims=2)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1057, in convolution
outputs = layer.apply(inputs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 805, in apply
return self.call(inputs, *args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 362, in call
outputs = super(Layer, self).call(inputs, *args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 736, in call
outputs = self.call(inputs, *args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/keras/layers/convolutional.py", line 186, in call
outputs = self._convolution_op(inputs, self.kernel)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 868, in call
return self.conv_op(inp, filter)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 520, in call
return self.call(inp, filter)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 204, in call
name=self.name)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 956, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
op_def=op_def)
File "/home/mounir/anaconda3/envs/tflow-gpu/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in init
self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[24,128,75,75] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6, FeatureExtractor/MobilenetV1/Conv2d_3_pointwise/weights/read/_3593)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[Node: Loss/Where_260/_6409 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14144_Loss/Where_260", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

@thatbrguy
Copy link
Owner

Oh alright. You can try using Google Colab to train them then. You can fit FasterRCNN+ResNet-50 (and other models with similar param count) over there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants