
Sample for report_tensor_allocations_upon_oom and RunOptions #17076

Open
Yagun opened this issue Feb 16, 2018 · 21 comments
Labels
type:docs-feature Doc issues for new feature, or clarifications about functionality

Comments

@Yagun

Yagun commented Feb 16, 2018

This is a feature request.

Please add some examples to the docs describing how to use report_tensor_allocations_upon_oom and the other options of RunOptions.

All I could find is this file:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/profiler/model_analyzer_test.py

But it is not obvious. For example, it contains:

from tensorflow.core.protobuf import config_pb2

and then

with session.Session() as sess:
    sess.run(c, options=config_pb2.RunOptions(
        report_tensor_allocations_upon_oom=True))

And more questions arise, like "What is config_pb2?", etc.

Thanks.
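(For reference: config_pb2 is the Python module generated from tensorflow/core/protobuf/config.proto, and the public tf.RunOptions appears to be the same protobuf class, so the internal import isn't needed. A minimal sketch, assuming TF 1.x and a trivial graph:)

    import tensorflow as tf

    # tf.RunOptions is the RunOptions message from config.proto;
    # report_tensor_allocations_upon_oom asks the runtime to report
    # per-tensor allocation info if a step runs out of memory.
    opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)

    c = tf.constant(1.0)
    with tf.Session() as sess:
        sess.run(c, options=opts)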

@cy89 cy89 added the type:feature Feature requests label Feb 16, 2018
@cy89

cy89 commented Feb 16, 2018

@Yagun Do you care about CPU, GPU, or both?

@cy89 cy89 added stat:awaiting response Status - Awaiting response from author type:docs-bug Document issues and removed type:feature Feature requests labels Feb 16, 2018
@georgh

georgh commented Feb 17, 2018

I got it to work like this:

import tensorflow as tf  # TF 1.x
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
sess.run(op, feed_dict=fdict, options=run_options)

This will produce messages like this:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[100000,60,190] and type double on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[Node: Tile = Tile[T=DT_DOUBLE, Tmultiples=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](ExpandDims, Tile/multiples)]]

Current usage from device: /job:localhost/replica:0/task:0/device:GPU:0, allocator: GPU_0_bfc
  144.96MiB from cKR/sub

         [[Node: concat/_11 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_8207_concat", tensor_type=DT_DOUBLE, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Current usage from device: /job:localhost/replica:0/task:0/device:GPU:0, allocator: GPU_0_bfc
  144.96MiB from cKR/sub

But it seems like it does not contain all allocations, rendering it a bit pointless :/
The error message above is from a 15GB P100 GPU, and it says only 145 MiB were allocated, but it fails on allocating a tensor of shape [100000,60,190] -> around 9GB.
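(The arithmetic checks out: 100000 * 60 * 190 = 1.14e9 elements, and at 8 bytes per double that is about 9.12e9 bytes, i.e. roughly 8.5 GiB, so the 145 MiB reported clearly isn't the whole allocation picture.)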

@cy89 is there a way to get even more details?

@Yagun for more Information regarding runoptions you may look here:
https://github.com/tensorflow/tensorflow/blob/r1.5/tensorflow/core/protobuf/config.proto
But it is really limited and not a real tutorial.
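(config.proto also defines trace_level, which together with RunMetadata gives per-op timing and memory stats. A hedged sketch, assuming TF 1.x and an existing sess/op:)

    import tensorflow as tf
    from tensorflow.python.client import timeline

    # Ask the runtime to record a full trace for this single step.
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(op, options=run_options, run_metadata=run_metadata)

    # step_stats includes per-node timings and allocator information,
    # viewable in chrome://tracing.
    tl = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(tl.generate_chrome_trace_format())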

I think TF is in real need of an in-depth tutorial on understanding its core and how to debug in case of errors. Handling OOM on the GPU is quite a pain without understanding the allocations.

@Yagun
Author

Yagun commented Feb 17, 2018

@cy89, I care more about the GPU, because it usually has only 2 to 4 GB of RAM.

@georgh, thank you very much for this example.

@cy89

cy89 commented Feb 22, 2018

@zheng-xq can you please point @georgh at tools he can use to better inspect GPU memory allocation? Or is there a docs page we can point him at, or build for him?

@cy89 cy89 added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Feb 22, 2018
@georgh

georgh commented Feb 23, 2018

@cy89 @zheng-xq
The main problem I encountered was the missing allocation information for placeholders. My findings regarding the problem are summarized in #17092 (it would be great if a tensorflower could take a look at this).
I think doc pages for inspecting GPU allocation would be great. But I just found
https://www.tensorflow.org/programmers_guide/debugger and https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/profiler/README.md
so these might be the tools/docs I didn't find before. (The site has changed quite a bit since the last time I had a look.) Maybe they even display placeholder memory usage. I have to test them now.
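(The debugger guide linked above covers tfdbg; a minimal sketch of wrapping a session with its CLI, assuming TF 1.x and an existing sess/op:)

    from tensorflow.python import debug as tf_debug

    # Wrapping the session drops every run() call into the tfdbg CLI,
    # where individual tensors and their sizes can be inspected.
    sess = tf_debug.LocalCLIDebugWrapperSession(sess)
    sess.run(op)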

@poxvoculi
Contributor

I agree that this is a good feature request, i.e. there should be a guide to memory-use debugging in TF, especially GPU memory use, since there are some non-obvious tricks going on. Eventually someone from TF will probably get around to doing it, but probably not soon, so I'm going to mark it contributions welcome. @georgh I see that you made some progress in the other thread, using @yaroslavvb's tool. It would be great if either of you wanted to contribute some notes on this topic.

@nerai

nerai commented Oct 4, 2020

Kindly requesting an update: what is the status of this ticket?

@m00dy

m00dy commented Oct 28, 2020

I would like to know the most recent status of this ticket.

@Flamefire
Contributor

Also pinging in here. TF reports

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info

But there doesn't seem to be a way to do that in TF 2.x.
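(A partial workaround that does exist in TF 2, assuming TF 2.5+: query the allocator's current/peak usage around the failing step. This is not the RunOptions flag itself, just a sketch:)

    import tensorflow as tf

    # Returns a dict with 'current' and 'peak' allocated bytes for the device.
    info = tf.config.experimental.get_memory_info('GPU:0')
    print('current:', info['current'], 'peak:', info['peak'])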

@rmothukuru rmothukuru added type:docs-feature Doc issues for new feature, or clarifications about functionality and removed type:docs-bug Document issues labels Mar 9, 2021
@rmothukuru
Contributor

rmothukuru commented Mar 9, 2021

@Yagun,
Performance bottlenecks can be debugged using the Profiler tool. Please find the explanation on the TF site and the example code.

Hope this helps.
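(For anyone who can't easily use the notebook workflow, the programmatic profiler API can also be driven from a plain script; a minimal sketch, assuming TF 2.2+ and a writable logdir directory:)

    import tensorflow as tf

    tf.profiler.experimental.start('logdir')  # begin collecting a trace
    # ... run the training/inference step(s) to inspect ...
    tf.profiler.experimental.stop()  # write the trace for TensorBoard's Profile tab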

@rmothukuru rmothukuru added the stat:awaiting response Status - Awaiting response from author label Mar 9, 2021
@google-ml-butler

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Mar 19, 2021
@Flamefire
Contributor

My previous comment seems to have been missed: there still doesn't seem to be a way to get the OOM info in TF 2, even though TF 2's error message suggests it. See also #37556 and

"\nHint: If you want to see a list of allocated tensors when "
"OOM happens, add report_tensor_allocations_upon_oom "
"to RunOptions for current allocation info.\n"));

@rmothukuru
Contributor

rmothukuru commented Mar 23, 2021

@Flamefire,
Thank you for your response. Did you try Profiler Tool, mentioned in this comment?

@Flamefire
Contributor

Not yet, as it involves some setup and the use of a notebook, which is a bit involved when running on HPC nodes. An option producing some output on stdout/stderr in case of failure would have been much more usable in that context.

And if that is not possible, then TF should not suggest it in its error message. I spent quite some time looking into how to follow the advice to "add report_tensor_allocations_upon_oom to RunOptions".
In the end, this is what this issue is about: either there is a way to do that in TF2, in which case some documentation is needed, or there is not, in which case the option should be added (preferably) or the error message changed.

@google-ml-butler

Closing as stale. Please reopen if you'd like to work on this further.

@Flamefire
Contributor

What's wrong with the bot? There has been activity here since its last comment, so why close it?

@rmothukuru rmothukuru reopened this Mar 30, 2021
@rmothukuru rmothukuru removed stale This label marks the issue/pr stale - to be closed automatically if no activity stat:awaiting response Status - Awaiting response from author labels Mar 30, 2021
@rmothukuru
Contributor

@Flamefire,
Sorry for the inconvenience. Reopened the issue.

@blime4

blime4 commented Aug 18, 2022

Amazing. This bug is 4 years old and still open.

@github-actions

This issue is stale because it has been open for 180 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Mar 28, 2023
@nerai

nerai commented Jun 22, 2023

.

@github-actions github-actions bot removed stale This label marks the issue/pr stale - to be closed automatically if no activity stat:contribution welcome Status - Contributions welcome labels Jul 8, 2023
@bluelancer

5 yrs...
