
Added requested_bytes to CUDA Caching Allocator Stats #88575

Closed
wants to merge 2 commits

Conversation

c-odrin
Contributor

@c-odrin c-odrin commented Nov 7, 2022

Summary:
The caching allocator can be configured to round memory allocations in order to reduce fragmentation. Sometimes, however, the overhead from rounding can be higher than the fragmentation it helps reduce.

We have added a new stat to the CUDA caching allocator stats to track whether rounding adds too much overhead and to help tune the roundup_power2_divisions flag:
- "requested_bytes.{current,peak,allocated,freed}": memory requested by client code; compare this with allocated_bytes to check whether allocation rounding adds too much overhead (a sketch of such a comparison follows below).

Test Plan: Added test case in caffe2/test/test_cuda.py

Differential Revision: D40810674
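
For illustration, a minimal sketch of the kind of comparison described above, assuming the new stat shows up in torch.cuda.memory_stats() under requested_bytes.* keys mirroring the existing allocated_bytes.* layout (key names are taken from this PR's summaries, not verified here):

```python
import torch

# Minimal sketch (assumes a PyTorch build that includes this PR and that the
# new stat appears as "requested_bytes.all.*" in torch.cuda.memory_stats(),
# mirroring the existing "allocated_bytes.all.*" keys).
def rounding_overhead_fraction(device=None):
    stats = torch.cuda.memory_stats(device)
    allocated = stats["allocated_bytes.all.allocated"]
    requested = stats["requested_bytes.all.allocated"]
    if allocated == 0:
        return 0.0
    # Fraction of all allocated bytes that exist only because of rounding.
    return (allocated - requested) / allocated

x = torch.randn(1001, 1001, device="cuda")  # an odd size to provoke rounding
print(f"rounding overhead: {rounding_overhead_fraction():.1%}")
```

If this fraction stays high for a workload, the roundup_power2_divisions setting in the PYTORCH_CUDA_ALLOC_CONF environment variable is the knob this stat is meant to help tune.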

@pytorch-bot

pytorch-bot bot commented Nov 7, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88575

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 8 Failures

As of commit bbaa879:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D40810674

1 similar comment
c-odrin added a commit to c-odrin/pytorch that referenced this pull request Nov 7, 2022
c-odrin added a commit to c-odrin/pytorch that referenced this pull request Nov 7, 2022
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D40810674

c-odrin added a commit to c-odrin/pytorch that referenced this pull request Nov 7, 2022
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D40810674

c-odrin added a commit to c-odrin/pytorch that referenced this pull request Nov 9, 2022
c-odrin added a commit to c-odrin/pytorch that referenced this pull request Nov 10, 2022
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D40810674

1 similar comment
c-odrin added a commit to c-odrin/pytorch that referenced this pull request Nov 11, 2022
Contributor

@zdevito zdevito left a comment

It makes sense to record this stat. I have a few inline comments. I also think that there is code missing to handle resetting the statistic when all the other statistics are reset. I don't see tests for reading requested_bytes out of the block info.

c10/cuda/CUDACachingAllocator.cpp (outdated, resolved)
c10/cuda/CUDACachingAllocator.cpp (outdated, resolved)
c10/cuda/CUDACachingAllocator.cpp (outdated, resolved)
c10/cuda/CUDACachingAllocator.h (outdated, resolved)
c10/cuda/CUDACachingAllocator.h (outdated, resolved)
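
As an aside on the reset concern raised above, a minimal sketch of the behavior one would expect once reset handling is in place; the requested_bytes.* keys and the assumption that they reset alongside the other stats come from the PR summary and the author's later reply, not from the code itself:

```python
import torch

# Hypothetical reset check (key names assumed from the PR summary): after
# resetting the accumulated and peak stats, requested_bytes should fall back
# in line with the other allocator counters.
x = torch.randn(1024, 1024, device="cuda")    # trigger an allocation
torch.cuda.reset_accumulated_memory_stats()   # zeroes the allocated/freed counts
torch.cuda.reset_peak_memory_stats()          # peak becomes the current value
stats = torch.cuda.memory_stats()
assert stats["requested_bytes.all.allocated"] == 0
assert stats["requested_bytes.all.freed"] == 0
assert stats["requested_bytes.all.peak"] == stats["requested_bytes.all.current"]
```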
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D40810674

c-odrin added a commit to c-odrin/pytorch that referenced this pull request Dec 6, 2022
Summary:
Pull Request resolved: pytorch#88575

The caching allocator can be configured to round memory allocations in order to reduce fragmentation. Sometimes however, the overhead from rounding can be higher than the fragmentation it helps reduce.

We have added a new stat to CUDA caching allocator stats to help track if rounding is adding too much overhead and help tune the roundup_power2_divisions flag:
    - "requested_bytes.{all,large_pool,small_pool}.{current,peak,allocated,freed}": memory requested by client code, compare this with allocated_bytes to check if allocation rounding adds too much overhead

Test Plan: Added test case in caffe2/test/test_cuda.py

Differential Revision: D40810674

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D40810674

c-odrin added a commit to c-odrin/pytorch that referenced this pull request Dec 6, 2022
c-odrin added a commit to c-odrin/pytorch that referenced this pull request Dec 11, 2022
c-odrin added a commit to c-odrin/pytorch that referenced this pull request Dec 11, 2022
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D40810674

@c-odrin
Contributor Author

c-odrin commented Dec 12, 2022

It makes sense to record this stat. I have a few inline comments. I also think that there is code missing to handle resetting the statistic when all the other statistics are reset. I don't see tests for reading requested_bytes out of the block info.

Thank you for the review! I've addressed the inline comments, added tests for reading requested_bytes out of the segment info/block info, and added code to reset the statistics when all the other statistics are reset.
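
For reference, a rough sketch of what reading the new field out of segment/block info might look like, assuming each block dict returned by torch.cuda.memory_snapshot() gains a requested_size entry; that field name is a guess for illustration, not something stated in this thread:

```python
import torch

# Hypothetical sketch: compare per-block requested vs. rounded sizes using the
# allocator snapshot. "requested_size" is an assumed field name; check
# torch.cuda.memory_snapshot() on a build that includes this PR.
def bytes_lost_to_rounding():
    wasted = 0
    for segment in torch.cuda.memory_snapshot():
        for block in segment["blocks"]:
            if block["state"] == "active_allocated":
                wasted += block["size"] - block.get("requested_size", block["size"])
    return wasted

x = torch.randn(513, 1023, device="cuda")  # odd shape to provoke rounding
print(f"bytes allocated purely due to rounding: {bytes_lost_to_rounding()}")
```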

@c-odrin c-odrin requested a review from zdevito December 12, 2022 15:23
Contributor

@zdevito zdevito left a comment


Looks good. I have a couple nits that should be addressed (missing blockinfo export and test), but this otherwise looks good to me.

c10/cuda/CUDACachingAllocator.cpp (outdated, resolved)
torch/csrc/cuda/Module.cpp (resolved)
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D40810674

c-odrin added a commit to c-odrin/pytorch that referenced this pull request Jan 10, 2023
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D40810674

@facebook-github-bot
Contributor

@c-odrin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

1 similar comment

@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorch-bot pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) Feb 9, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

huydhn added a commit to huydhn/pytorch that referenced this pull request Feb 9, 2023
Memory usage increase after pytorch#88575
@huydhn huydhn reopened this Feb 9, 2023
@huydhn huydhn closed this Feb 9, 2023
@pytorch pytorch deleted a comment from pytorchmergebot Feb 9, 2023
@pytorch pytorch deleted a comment from pytorchmergebot Feb 9, 2023
pytorchmergebot pushed a commit that referenced this pull request Feb 10, 2023
Memory usage increases after #88575. Docker crashes with exit code 137, which clearly means out of memory.

Pull Request resolved: #94548
Approved by: https://github.com/seemethere
Labels: ciflow/trunk (Trigger trunk jobs on your pull request), fb-exported, Merged