Added requested_bytes to CUDA Caching Allocator Stats #88575
Conversation
This pull request was exported from Phabricator. Differential Revision: D40810674
Summary: Pull Request resolved: pytorch#88575

The caching allocator can be configured to round memory allocations in order to reduce fragmentation. Sometimes, however, the overhead from rounding can be higher than the fragmentation it helps reduce.

We have added a new stat to CUDA caching allocator stats to help track whether rounding is adding too much overhead and to help tune the roundup_power2_divisions flag:
- "requested_bytes.{current,peak,allocated,freed}": memory requested by client code; compare this with allocated_bytes to check whether allocation rounding adds too much overhead

Test Plan: Added test case in caffe2/test/test_cuda.py
Differential Revision: D40810674
fbshipit-source-id: 7c6d71e6c307d87d6b6b5db5d45d1e9b5b354f42
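For illustration (not part of the PR itself), here is a minimal sketch of how the stat could be consumed, assuming a PyTorch build that includes this change. The key names below use the aggregate "all" pool form that torch.cuda.memory_stats() applies to its byte counters; the summary above abbreviates them.

```python
# Minimal sketch (not part of this PR) of reading the new stat, assuming a
# PyTorch build that includes this change.
import torch

x = torch.empty(1000, 1000, device="cuda")  # ~4 MB fp32 request

stats = torch.cuda.memory_stats()
requested = stats["requested_bytes.all.current"]  # bytes the client asked for
allocated = stats["allocated_bytes.all.current"]  # bytes after rounding

# If rounding adds significant overhead, allocated noticeably exceeds
# requested; that gap is exactly what this stat is meant to expose.
overhead = (allocated - requested) / max(requested, 1)
print(f"requested={requested} allocated={allocated} overhead={overhead:.1%}")
```

Comparing the two counters directly gives the tuning signal the summary describes.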
Force-pushed from f6b245f to 3bd959a
Force-pushed from 3bd959a to 609851b
Force-pushed from 609851b to 7f16c56
Force-pushed from 7f16c56 to 050cba1
Force-pushed from 050cba1 to 9b98296
Force-pushed from 9b98296 to e80ba76
It makes sense to record this stat. I have a few inline comments. I also think that there is code missing to handle resetting the statistic when all the other statistics are reset. I don't see tests for reading requested_bytes out of the block info.
Force-pushed from e80ba76 to 63f645d
Summary (updated after review): same as above, except the new stat now carries a per-pool breakdown:
- "requested_bytes.{all,large_pool,small_pool}.{current,peak,allocated,freed}": memory requested by client code; compare this with allocated_bytes to check whether allocation rounding adds too much overhead
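A short sketch of what the per-pool breakdown enables, again assuming a build with this revision: comparing lifetime rounding overhead separately for the large and small pools.

```python
# Sketch of a per-pool overhead comparison using the key layout above.
import torch

stats = torch.cuda.memory_stats()
for pool in ("all", "large_pool", "small_pool"):
    requested = stats[f"requested_bytes.{pool}.allocated"]  # lifetime total
    allocated = stats[f"allocated_bytes.{pool}.allocated"]
    if requested:
        print(f"{pool}: rounding overhead {(allocated - requested) / requested:.1%}")
```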
Force-pushed from 63f645d to 348f39f
Force-pushed from 3f07abe to 4e65787
Force-pushed from 4e65787 to 1f0e522
Thank you for the review! I've addressed the inline comments, added tests for reading requested_bytes out of segment info/block info, and added code to reset the statistics when all the other statistics are reset.
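As a rough sketch of what such tests might exercise: the requested_size field name and block state string below are assumptions based on the general torch.cuda.memory_snapshot() layout, not confirmed in this thread.

```python
# Rough sketch of checking requested bytes in block info and resetting stats.
# The 'requested_size' field name is an assumption, not confirmed here.
import torch

x = torch.empty(1 << 20, device="cuda")

for segment in torch.cuda.memory_snapshot():
    for block in segment["blocks"]:
        if block["state"] == "active_allocated":
            # The client-requested size can never exceed the rounded size
            # actually carved out of the segment.
            assert block["requested_size"] <= block["size"]

# Resetting peak/accumulated statistics should cover the new counters too,
# alongside all the existing ones.
torch.cuda.reset_peak_memory_stats()
torch.cuda.reset_accumulated_memory_stats()
```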
Looks good. I have a couple of nits that should be addressed (missing block-info export and test), but otherwise this is fine by me.
Force-pushed from 1f0e522 to 22f94d8
Force-pushed from 22f94d8 to 28f9656
@c-odrin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged) |
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Memory usage increase after pytorch#88575
Memory usage increases after #88575. Docker crashes with exit code 137 (128 + SIGKILL), which indicates the process was killed for running out of memory.
Pull Request resolved: #94548
Approved by: https://github.com/seemethere
Summary:
The caching allocator can be configured to round memory allocations in order to reduce fragmentation. Sometimes, however, the overhead from rounding can be higher than the fragmentation it helps reduce.
We have added a new stat to CUDA caching allocator stats to help track whether rounding is adding too much overhead and to help tune the roundup_power2_divisions flag:
- "requested_bytes.{current,peak,allocated,freed}": memory requested by client code; compare this with allocated_bytes to check whether allocation rounding adds too much overhead
Test Plan: Added test case in caffe2/test/test_cuda.py
Differential Revision: D40810674
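To connect the stat back to tuning, here is a hedged sketch of adjusting roundup_power2_divisions and re-checking the counters. PYTORCH_CUDA_ALLOC_CONF must take effect before the CUDA allocator initializes, so it is set before importing torch; the specific divisions value is only an example.

```python
# Sketch of tuning roundup_power2_divisions against the new stat. The env
# var must be read before the CUDA allocator initializes, so set it before
# importing torch.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "roundup_power2_divisions:4"

import torch

# An awkwardly sized request that power-of-two rounding will pad.
x = torch.empty(3 * 1024 * 1024 + 1, dtype=torch.uint8, device="cuda")

stats = torch.cuda.memory_stats()
print("requested:", stats["requested_bytes.all.current"])
print("allocated:", stats["allocated_bytes.all.current"])
# If the gap stays large across a real workload, try a different divisions
# value (or drop the setting) and compare the two counters again.
```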