
Does anyone run this successfully on a single GTX 1080 GPU? I tried it and ran out of memory. #23

Closed
jiafeixiaoye opened this issue May 18, 2018 · 6 comments

jiafeixiaoye commented May 18, 2018

I added
tfconfig.gpu_options.per_process_gpu_memory_fraction = 0.05
to let it run, but I got error output like the following:
...
2018-05-18 19:14:25.380430: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 58.69MiB. Current allocation summary follows.
2018-05-18 19:14:25.380546: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (256): Total Chunks: 38, Chunks in use: 37. 9.5KiB allocated for chunks. 9.2KiB in use in bin. 7.6KiB client-requested in use in bin.
...
4] 1 Chunks of size 91656192 totalling 87.41MiB
2018-05-18 19:14:25.404137: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Sum Total of in-use chunks: 374.93MiB
2018-05-18 19:14:25.404163: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats:
Limit: 425407283
InUse: 393138944
MaxInUse: 393138944
NumAllocs: 1096
MaxAllocSize: 91656192

2018-05-18 19:14:25.404278: W tensorflow/core/common_runtime/bfc_allocator.cc:279] **********************************************************_____******************xxxxxxx
2018-05-18 19:14:25.404328: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at conv_ops.cc:672 : Resource exhausted: OOM when allocating tensor with shape[1,64,400,601] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "test.py", line 244, in <module>
    eval_all(args)
  File "test.py", line 137, in eval_all
    result_dict = inference(func, inputs, data_dict)
  File "test.py", line 69, in inference
    _, scores, pred_boxes, rois = val_func(feed_dict=feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1140, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,64,400,601] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: resnet_v1_101/conv1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 2, 2], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](resnet_v1_101/conv1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, resnet_v1_101/conv1/weights/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[Node: resnet_v1_101_5/concat_3/_1133 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2610_resnet_v1_101_5/concat_3", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
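
For context, the `per_process_gpu_memory_fraction` option mentioned above belongs to a `tf.ConfigProto` that is passed when the session is created. A minimal TF 1.x sketch; the session code around it is illustrative, not this repo's actual test.py:

```python
import tensorflow as tf

# Cap how much of the GPU's memory this process may claim.
# 0.05 of an 8 GB GTX 1080 is only about 400 MB, which is far too
# little for a ResNet-101 based model, hence the OOM above.
tfconfig = tf.ConfigProto()
tfconfig.gpu_options.per_process_gpu_memory_fraction = 0.05

with tf.Session(config=tfconfig) as sess:
    # build the graph and run inference here
    pass
```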

@nguyeho7

Well, if you try to fit the entire network into 0.05 * 8 GB (5% of the card's memory), it can't work. Why not 0.5, i.e. half of the GPU memory? I have run it successfully on a 1080 Ti.
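
A minimal sketch of that suggestion, assuming the same TF 1.x `ConfigProto` setup; the `allow_growth` alternative is a standard TensorFlow option, not something this repo requires:

```python
import tensorflow as tf

tfconfig = tf.ConfigProto()
# Allow this process to use half of the GPU memory (~4 GB on a GTX 1080).
tfconfig.gpu_options.per_process_gpu_memory_fraction = 0.5
# Alternatively, let TensorFlow grow its allocation on demand:
# tfconfig.gpu_options.allow_growth = True

sess = tf.Session(config=tfconfig)
```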

@jiafeixiaoye (Author)

@nguyeho7 thanks for your suggestion. I changed it to 0.5 and it runs normally.

wm10240 commented Jun 21, 2018

Hi @nguyeho7 @jiafeixiaoye, I also get the same error. I added the code like @jiafeixiaoye did, but it doesn't work (my setup is four NVIDIA 1080 Ti cards). The error is below; if you have any suggestions I would be very grateful:

2018-06-21 08:53:49.853249: I tensorflow/core/common_runtime/bfc_allocator.cc:686] Stats:
Limit:                  5856854016
InUse:                  5832717824
MaxInUse:               5845060608
NumAllocs:                    2163
MaxAllocSize:           1121255424

2018-06-21 08:53:49.853344: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2018-06-21 08:53:49.853378: W tensorflow/core/framework/op_kernel.cc:1198] Resource exhausted: OOM when allocating tensor with shape[2,50,50,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_call
    return fn(*args)
  File "/root/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1329, in _run_fn
    status, run_metadata)
  File "/root/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2,100,100,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: tower_5/resnet_v1_101_2/block2/unit_4/bottleneck_v1/conv3/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tower_5/resnet_v1_101_2/block2/unit_4/bottleneck_v1/conv2/Relu, resnet_v1_101/block2/unit_4/bottleneck_v1/conv3/weights/read/_1533)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[Node: tower_4/gradients/tower_4/resnet_v1_101_3/block3/unit_11/bottleneck_v1/conv2/Conv2D_grad/tuple/control_dependency_1/_6953 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_33582_tower_4/gradients/tower_4/resnet_v1_101_3/block3/unit_11/bottleneck_v1/conv2/Conv2D_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 265, in <module>
    train(args)
  File "train.py", line 213, in train
    sess_ret = sess.run(sess2run, feed_dict=feed_dict)
  File "/root/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/root/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "/root/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1344, in _do_run
    options, run_metadata)
  File "/root/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2,100,100,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: tower_5/resnet_v1_101_2/block2/unit_4/bottleneck_v1/conv3/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tower_5/resnet_v1_101_2/block2/unit_4/bottleneck_v1/conv2/Relu, resnet_v1_101/block2/unit_4/bottleneck_v1/conv3/weights/read/_1533)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
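
As the "Hint" lines in these logs suggest, TensorFlow can list the live tensor allocations when an OOM occurs if a `RunOptions` proto is passed to `sess.run`. A minimal sketch based on the call in the traceback above (`sess2run` and `feed_dict` are the script's own variables):

```python
import tensorflow as tf

# Ask TensorFlow to dump the current tensor allocations in the
# error message whenever an OOM is raised during this run call.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

sess_ret = sess.run(sess2run, feed_dict=feed_dict, options=run_options)
```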

@jiafeixiaoye (Author)

Hi @wm10240,
did you change the compute capability in all of the make.sh files in lib_kernel? The GTX 1080 is sm_61; the default value in make.sh is sm_52.

@karansomaiah

Hey @jiafeixiaoye I tried the changes suggested and I still see these errors. Any other suggestions for me?
Also, @wm10240 were you able to resolve the issue?

@karansomaiah

Update:
I tried it on the GTX 1080 Ti. I didn't have to change sm_52 to sm_61.
I was running the training script as-is, not having realized that the 0-7 argument means using 8 GPUs. I changed it to the number of GPUs I actually have, and now it works fine.
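
A related, generic way to keep a multi-GPU training script on only the GPUs you actually have is to restrict which devices the process can see at all. This is a plain CUDA/TensorFlow sketch, not this repo's own command-line mechanism, and the variable must be set before any GPU context is created:

```python
import os

# Expose only GPUs 0 and 1 to this process (instead of all eight).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Import TensorFlow only after the variable is set, so the hidden
# GPUs are never initialized.
import tensorflow as tf
```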
