Improve cuda OOM message #32101
Comments
This seems quite doable to hack something up to look at this, but the question is what exactly to hack up...
@ezyang assuming you're talking about the vis: a view over all device memory (even though other processes can use it) with all allocated tensors + storages. One way to do it could be a long ribbon, possibly wrapped over several lines; allocated tensors could use odd/even highlighting to discriminate between different tensors allocated next to each other (+ maybe another color for unused but allocated tensor storage). Then even a coarse view would give an idea of fragmentation.
A measure of fragmentation could be some relation like the amount of uncommitted memory versus the maximum number of bytes that can be allocated contiguously.
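For illustration, a minimal sketch of such a metric computed from torch.cuda.memory_snapshot(); the segment/block dict keys used here ("blocks", "size", "state") are the ones exposed by recent PyTorch versions and are not a stable public format, so treat the details as an assumption:

```python
import torch

def fragmentation_estimate():
    # 1 - (largest free block) / (total free memory inside reserved segments).
    # 0 means all free memory is one contiguous block; values close to 1 mean
    # the free memory is scattered across many small blocks.
    free_blocks = [
        block["size"]
        for segment in torch.cuda.memory_snapshot()
        for block in segment["blocks"]
        if block["state"] == "inactive"
    ]
    total_free = sum(free_blocks)
    return 0.0 if total_free == 0 else 1.0 - max(free_blocks) / total_free
```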
One problem with the default OOM message is that it doesn't report how much GPU memory is taken by other processes (or conversely, what capacity is left given all other processes). I recently ran into this again.
I suppose we could shell out to nvidia-smi in this situation :>
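A minimal sketch of what that could look like, assuming nvidia-smi is on PATH (the query flags below exist in current drivers, but the exact fields may vary):

```python
import subprocess

def gpu_process_memory(device_index=0):
    # Ask nvidia-smi for per-process memory usage on one GPU; unlike PyTorch's
    # own counters, this also sees memory held by other processes.
    out = subprocess.run(
        ["nvidia-smi", f"--id={device_index}",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip().splitlines()
```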
Anyway, the global memory situation or info on the presence of other processes using the GPU is useful.
|
It also occurred to me that it did not report other processes even when they were inside the same docker container (just some background processes), but at least the memory counter showed that they do exist; even that memory counter would be useful for the OOM message.
Noting that you can get all tensors allocated on CUDA in the program using python's garbage collector:
And compute the size used in bytes by each tensor with:
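(The snippets referenced in this comment are not shown above; what follows is a minimal sketch of the gc-based approach, using t.numel() * t.element_size() as the per-tensor byte size, which ignores storage shared between views.)

```python
import gc
import torch

# Collect all live CUDA tensors known to the Python garbage collector.
cuda_tensors = [obj for obj in gc.get_objects()
                if torch.is_tensor(obj) and obj.is_cuda]

# Approximate bytes used by each tensor's data.
sizes = [t.numel() * t.element_size() for t in cuda_tensors]
print(f"{len(cuda_tensors)} CUDA tensors, {sum(sizes) / 2**20:.1f} MiB total")
```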
Now more and more tensors are allocated from C++ (including output volumes from ops, gradients, etc.). They still add memory pressure on the caching allocator, but cannot be inspected like this :(
I don't know about that; I wrote a function to compare the size of tensors in memory w.r.t. what PyTorch shows as allocated, and it matches:
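(The function referenced here is not included in the thread; a sketch of that kind of cross-check might look like the following. The two numbers only agree up to allocator rounding and view/storage sharing.)

```python
import gc
import torch

def compare_live_tensors_with_allocator(device=0):
    # Bytes referenced by live CUDA tensors on this device, found via gc.
    live = sum(t.numel() * t.element_size()
               for t in gc.get_objects()
               if torch.is_tensor(t) and t.is_cuda and t.device.index == device)
    # What the caching allocator reports as currently allocated to tensors.
    allocated = torch.cuda.memory_allocated(device)
    print(f"live tensors: {live / 2**20:.1f} MiB, "
          f"memory_allocated: {allocated / 2**20:.1f} MiB")
```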
|
This is based on #31497 (after the last messages with @albanD and @ezyang)
Condensed, I had a script there that printed:
in a model eval loop. And at some point it got an OOM:
Some things:
- torch.cuda.memory_allocated() of 11Gb is not reported in the OOM message.
- Terminology discrepancy: torch.cuda.memory_cached seems to be equivalent to "already allocated" from the OOM message. In the presence of the also-existing torch.cuda.memory_allocated this is confusing. Probably the OOM message should also say "cached". Otherwise, it's pretty easy to mix up allocated, cached, and reserved (see the snippet after this list).
- It would be nice for the OOM message to include by default a small glossary/explainer explaining all these various memory counters.
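For reference, a small sketch of the counters in question, assuming a CUDA device is available (memory_cached() from the releases discussed here corresponds to memory_reserved() in newer ones; this is just an illustration, not output from the run above):

```python
import torch

x = torch.empty(1024, 1024, device="cuda")  # hypothetical allocation

print("memory_allocated:", torch.cuda.memory_allocated())        # bytes inside live tensors
print("memory_reserved: ", torch.cuda.memory_reserved())         # bytes held by the caching allocator (formerly memory_cached)
print("max_memory_allocated:", torch.cuda.max_memory_allocated())  # peak of memory_allocated
```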
Related: my previous feature request about adding some measure of fragmentation by default: #29554, and my old issue about allocator stats: #1529.
Recently there were a few reports of the default allocator causing problems with widely varying batch sizes (myself included). To confirm the guess, it would be nice to have an easily interpretable allocator state visualization / dump (super cool would be a way to dump an HTML vis). Currently there exist torch.cuda.memory_stats, torch.cuda.memory_usage and torch.cuda.memory_snapshot. It would be nice to have some default advice on what to save / use when debugging suspected fragmentation. In addition, they are not searchable for whatever reason: https://pytorch.org/docs/master/search.html?q=memory_usage&check_keywords=yes&area=default#
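As one possible answer to "what to save", here is a sketch that dumps the aggregated counters plus the raw per-segment snapshot to files for offline inspection (the file names are just examples, and the snapshot format is not a stable public API):

```python
import json
import torch

def dump_allocator_state(prefix="oom_debug"):
    # Aggregated counters: allocation/free counts, current and peak sizes, etc.
    with open(f"{prefix}_stats.json", "w") as f:
        json.dump(torch.cuda.memory_stats(), f, indent=2)
    # Per-segment / per-block view of the caching allocator, useful for
    # spotting many small inactive blocks, i.e. fragmentation.
    with open(f"{prefix}_snapshot.json", "w") as f:
        json.dump(torch.cuda.memory_snapshot(), f, indent=2)
    # Human-readable summary table of the same counters.
    print(torch.cuda.memory_summary())
```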
cc @ngimel @jlin27 @mruberry