
Sample for report_tensor_allocations_upon_oom and RunOptions #17076

Open
Yagun opened this issue Feb 16, 2018 · 21 comments
Labels
type:docs-feature Doc issues for new feature, or clarifications about functionality

Comments

@Yagun

Yagun commented Feb 16, 2018

This is a feature request.

Please add some examples to the docs describing how to use report_tensor_allocations_upon_oom and the other options of RunOptions.

All I could find is this file:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/profiler/model_analyzer_test.py

But it is not obvious. For example, it contains:

from tensorflow.core.protobuf import config_pb2

and then

with session.Session() as sess:
    sess.run(c, options=config_pb2.RunOptions(
        report_tensor_allocations_upon_oom=True))

And more questions arise, like "What is config_pb2?", etc.

Thanks.
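(For reference: config_pb2 is the Python module generated from tensorflow/core/protobuf/config.proto, and the public tf.RunOptions appears to be the same protobuf class, so the internal import isn't needed. A minimal sketch, assuming TF 1.x and a trivial graph:)

    import tensorflow as tf

    # tf.RunOptions is the RunOptions message from config.proto;
    # report_tensor_allocations_upon_oom asks the runtime to report
    # per-tensor allocation info if a step runs out of memory.
    opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)

    c = tf.constant(1.0)
    with tf.Session() as sess:
        sess.run(c, options=opts)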

@cy89 cy89 added the type:feature Feature requests label Feb 16, 2018
@cy89

cy89 commented Feb 16, 2018

@Yagun Do you care about CPU, GPU, or both?

@cy89 cy89 added stat:awaiting response Status - Awaiting response from author type:docs-bug Document issues and removed type:feature Feature requests labels Feb 16, 2018
@georgh

georgh commented Feb 17, 2018

I got it to work like this:

import tensorflow as tf  # TF 1.x
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
sess.run(op, feed_dict=fdict, options=run_options)

This will produce messages like this:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[100000,60,190] and type double on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[Node: Tile = Tile[T=DT_DOUBLE, Tmultiples=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](ExpandDims, Tile/multiples)]]

Current usage from device: /job:localhost/replica:0/task:0/device:GPU:0, allocator: GPU_0_bfc
  144.96MiB from cKR/sub

         [[Node: concat/_11 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_8207_concat", tensor_type=DT_DOUBLE, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Current usage from device: /job:localhost/replica:0/task:0/device:GPU:0, allocator: GPU_0_bfc
  144.96MiB from cKR/sub

But it seems like it does not contain all allocations, rendering it a bit pointless :/
The error message above is from a 15GB P100 GPU, and it says only 145 MiB were allocated, but it fails on allocating a tensor of shape [100000,60,190] -> around 9GB.
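(The arithmetic checks out: 100000 * 60 * 190 = 1.14e9 elements, and at 8 bytes per double that is about 9.12e9 bytes, i.e. roughly 8.5 GiB, so the 145 MiB reported clearly isn't the whole allocation picture.)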

@cy89 is there a way to get even more details?

@Yagun for more Information regarding runoptions you may look here:
https://github.com/tensorflow/tensorflow/blob/r1.5/tensorflow/core/protobuf/config.proto
But it is really limited and not a real tutorial.
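(config.proto also defines trace_level, which together with RunMetadata gives per-op timing and memory stats. A hedged sketch, assuming TF 1.x and an existing sess/op:)

    import tensorflow as tf
    from tensorflow.python.client import timeline

    # Ask the runtime to record a full trace for this single step.
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(op, options=run_options, run_metadata=run_metadata)

    # step_stats includes per-node timings and allocator information,
    # viewable in chrome://tracing.
    tl = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(tl.generate_chrome_trace_format())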

I think TF is in real need of an in-depth tutorial on understanding its core and how to debug in case of errors. Handling OOM on the GPU is quite a pain without understanding the allocations.

@Yagun
Author

Yagun commented Feb 17, 2018

@cy89, I care more about the GPU, because it usually has only 2 to 4 GB of RAM.

@georgh, thank you very much for this example.

@cy89

cy89 commented Feb 22, 2018

@zheng-xq can you please point @georgh at tools he can use to better inspect GPU memory allocation? Or is there a docs page we can point him at, or build for him?

@cy89 cy89 added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Feb 22, 2018
@georgh

georgh commented Feb 23, 2018

@cy89 @zheng-xq
The main problem I encountered was the missing allocation information for placeholders. My findings regarding the problem are summarized in #17092 (it would be great if a tensorflower could take a look at this).
I think doc pages for inspecting GPU allocation would be great. But I just found
https://www.tensorflow.org/programmers_guide/debugger and https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/profiler/README.md
so these might be the tools/docs I didn't find before. (The site has changed quite a bit since the last time I had a look.) Maybe they even display placeholder memory usage. I have to test them now.
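(The debugger guide linked above covers tfdbg; a minimal sketch of wrapping a session with its CLI, assuming TF 1.x and an existing sess/op:)

    from tensorflow.python import debug as tf_debug

    # Wrapping the session drops every run() call into the tfdbg CLI,
    # where individual tensors and their sizes can be inspected.
    sess = tf_debug.LocalCLIDebugWrapperSession(sess)
    sess.run(op)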

@poxvoculi
Contributor

I agree that this is a good feature request, i.e. there should be a guide to memory-use debugging in TF, especially GPU memory use, since there are some non-obvious tricks going on. Eventually someone from TF will probably get around to doing it, but probably not soon, so I'm going to mark it contributions welcome. @georgh I see that you made some progress in the other thread, using @yaroslavvb's tool. It would be great if either of you wanted to contribute some notes on this topic.

@nerai

nerai commented Oct 4, 2020

Kindly requesting an update: what is the status of this ticket?

@m00dy

m00dy commented Oct 28, 2020

I would like to know the most recent status of this ticket.

@Flamefire
Contributor

Also pinging in here. TF reports

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info

But there doesn't seem to be a way to do that in TF 2.x.
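(A partial workaround that does exist in TF 2, assuming TF 2.5+: query the allocator's current/peak usage around the failing step. This is not the RunOptions flag itself, just a sketch:)

    import tensorflow as tf

    # Returns a dict with 'current' and 'peak' allocated bytes for the device.
    info = tf.config.experimental.get_memory_info('GPU:0')
    print('current:', info['current'], 'peak:', info['peak'])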

@rmothukuru rmothukuru added type:docs-feature Doc issues for new feature, or clarifications about functionality and removed type:docs-bug Document issues labels Mar 9, 2021
@rmothukuru
Contributor

rmothukuru commented Mar 9, 2021

@Yagun,
Performance bottlenecks can be debugged using the Profiler tool. Please find the explanation on the TF site and the example code.

Hope this helps.
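(For anyone who can't easily use the notebook workflow, the programmatic profiler API can also be driven from a plain script; a minimal sketch, assuming TF 2.2+ and a writable logdir directory:)

    import tensorflow as tf

    tf.profiler.experimental.start('logdir')  # begin collecting a trace
    # ... run the training/inference step(s) to inspect ...
    tf.profiler.experimental.stop()  # write the trace for TensorBoard's Profile tab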

@rmothukuru rmothukuru added the stat:awaiting response Status - Awaiting response from author label Mar 9, 2021
@google-ml-butler

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Mar 19, 2021
@Flamefire
Contributor

My previous comment seems to have been missed: there still doesn't seem to be a way to get the OOM info in TF 2, even though TF 2's error message suggests it. See also #37556 and

"\nHint: If you want to see a list of allocated tensors when "
"OOM happens, add report_tensor_allocations_upon_oom "
"to RunOptions for current allocation info.\n"));

@rmothukuru
Contributor

rmothukuru commented Mar 23, 2021

@Flamefire,
Thank you for your response. Did you try Profiler Tool, mentioned in this comment?

@Flamefire
Contributor

Not yet, as it involves some setup and the use of a notebook, which is a bit involved when running on HPC nodes. An option producing some output on stdout/stderr in case of failure would have been much more usable in that context.

And if that is not possible, then TF should not suggest it in its error message. I spent quite some time looking into how to follow the advice to "add report_tensor_allocations_upon_oom to RunOptions".
In the end, this is what this issue is about: either there is a way to do that in TF2, in which case some documentation is needed, or there is not, in which case the option should be added (preferably) or the error message changed.

@google-ml-butler

Closing as stale. Please reopen if you'd like to work on this further.

@Flamefire
Contributor

What's wrong with the bot? There has been activity here since its last comment, so why close it?

@rmothukuru rmothukuru reopened this Mar 30, 2021
@rmothukuru rmothukuru removed stale This label marks the issue/pr stale - to be closed automatically if no activity stat:awaiting response Status - Awaiting response from author labels Mar 30, 2021
@rmothukuru
Contributor

@Flamefire,
Sorry for the inconvenience. Reopened the issue.

@blime4

blime4 commented Aug 18, 2022

Amazing. This bug is 4 years old and still open.

@github-actions

This issue is stale because it has been open for 180 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Mar 28, 2023
@nerai

nerai commented Jun 22, 2023

.

@github-actions github-actions bot removed stale This label marks the issue/pr stale - to be closed automatically if no activity stat:contribution welcome Status - Contributions welcome labels Jul 8, 2023
@bluelancer

5 yrs...
