
Added requested_bytes to CUDA Caching Allocator Stats #88575

Closed
wants to merge 2 commits

Conversation

c-odrin
Contributor

@c-odrin c-odrin commented Nov 7, 2022

Summary:
The caching allocator can be configured to round memory allocations in order to reduce fragmentation. Sometimes, however, the overhead from rounding can be higher than the fragmentation it helps reduce.

We have added a new stat to the CUDA caching allocator stats to track whether rounding adds too much overhead and to help tune the roundup_power2_divisions flag:
- "requested_bytes.{current,peak,allocated,freed}": memory requested by client code; compare this with allocated_bytes to check whether allocation rounding adds too much overhead (a sketch of such a comparison follows below).

Test Plan: Added test case in caffe2/test/test_cuda.py

Differential Revision: D40810674
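
For illustration, a minimal sketch of the kind of comparison described above, assuming the new stat shows up in torch.cuda.memory_stats() under requested_bytes.* keys mirroring the existing allocated_bytes.* layout (key names are taken from this PR's summaries, not verified here):

```python
import torch

# Minimal sketch (assumes a PyTorch build that includes this PR and that the
# new stat appears as "requested_bytes.all.*" in torch.cuda.memory_stats(),
# mirroring the existing "allocated_bytes.all.*" keys).
def rounding_overhead_fraction(device=None):
    stats = torch.cuda.memory_stats(device)
    allocated = stats["allocated_bytes.all.allocated"]
    requested = stats["requested_bytes.all.allocated"]
    if allocated == 0:
        return 0.0
    # Fraction of all allocated bytes that exist only because of rounding.
    return (allocated - requested) / allocated

x = torch.randn(1001, 1001, device="cuda")  # an odd size to provoke rounding
print(f"rounding overhead: {rounding_overhead_fraction():.1%}")
```

If this fraction stays high for a workload, the roundup_power2_divisions setting in the PYTORCH_CUDA_ALLOC_CONF environment variable is the knob this stat is meant to help tune.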

@pytorch-bot

pytorch-bot bot commented Nov 7, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88575

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 8 Failures

As of commit bbaa879:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D40810674

1 similar comment
c-odrin added a commit to c-odrin/pytorch that referenced this pull request Nov 7, 2022
c-odrin added a commit to c-odrin/pytorch that referenced this pull request Nov 7, 2022
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D40810674

c-odrin added a commit to c-odrin/pytorch that referenced this pull request Nov 7, 2022
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D40810674

c-odrin added a commit to c-odrin/pytorch that referenced this pull request Nov 9, 2022
c-odrin added a commit to c-odrin/pytorch that referenced this pull request Nov 10, 2022
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D40810674

1 similar comment
c-odrin added a commit to c-odrin/pytorch that referenced this pull request Nov 11, 2022
Contributor

@zdevito zdevito left a comment

It makes sense to record this stat. I have a few inline comments. I also think that there is code missing to handle resetting the statistic when all the other statistics are reset. I don't see tests for reading requested_bytes out of the block info.

c10/cuda/CUDACachingAllocator.cpp (outdated, resolved)
c10/cuda/CUDACachingAllocator.cpp (outdated, resolved)
c10/cuda/CUDACachingAllocator.cpp (outdated, resolved)
c10/cuda/CUDACachingAllocator.h (outdated, resolved)
c10/cuda/CUDACachingAllocator.h (outdated, resolved)
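
As an aside on the reset concern raised above, a minimal sketch of the behavior one would expect once reset handling is in place; the requested_bytes.* keys and the assumption that they reset alongside the other stats come from the PR summary and the author's later reply, not from the code itself:

```python
import torch

# Hypothetical reset check (key names assumed from the PR summary): after
# resetting the accumulated and peak stats, requested_bytes should fall back
# in line with the other allocator counters.
x = torch.randn(1024, 1024, device="cuda")    # trigger an allocation
torch.cuda.reset_accumulated_memory_stats()   # zeroes the allocated/freed counts
torch.cuda.reset_peak_memory_stats()          # peak becomes the current value
stats = torch.cuda.memory_stats()
assert stats["requested_bytes.all.allocated"] == 0
assert stats["requested_bytes.all.freed"] == 0
assert stats["requested_bytes.all.peak"] == stats["requested_bytes.all.current"]
```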
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D40810674

c-odrin added a commit to c-odrin/pytorch that referenced this pull request Dec 6, 2022
Summary:
Pull Request resolved: pytorch#88575

The caching allocator can be configured to round memory allocations in order to reduce fragmentation. Sometimes however, the overhead from rounding can be higher than the fragmentation it helps reduce.

We have added a new stat to CUDA caching allocator stats to help track if rounding is adding too much overhead and help tune the roundup_power2_divisions flag:
    - "requested_bytes.{all,large_pool,small_pool}.{current,peak,allocated,freed}": memory requested by client code, compare this with allocated_bytes to check if allocation rounding adds too much overhead

Test Plan: Added test case in caffe2/test/test_cuda.py

Differential Revision: D40810674

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D40810674

c-odrin added a commit to c-odrin/pytorch that referenced this pull request Dec 6, 2022
c-odrin added a commit to c-odrin/pytorch that referenced this pull request Dec 11, 2022
c-odrin added a commit to c-odrin/pytorch that referenced this pull request Dec 11, 2022
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D40810674

@c-odrin
Contributor Author

c-odrin commented Dec 12, 2022

It makes sense to record this stat. I have a few inline comments. I also think that there is code missing to handle resetting the statistic when all the other statistics are reset. I don't see tests for reading requested_bytes out of the block info.

Thank you for the review! I've addressed the inline comments, added tests for reading requested_bytes out of the segment info/block info, and added code to reset the statistics when all the other statistics are reset.
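
For reference, a rough sketch of what reading the new field out of segment/block info might look like, assuming each block dict returned by torch.cuda.memory_snapshot() gains a requested_size entry; that field name is a guess for illustration, not something stated in this thread:

```python
import torch

# Hypothetical sketch: compare per-block requested vs. rounded sizes using the
# allocator snapshot. "requested_size" is an assumed field name; check
# torch.cuda.memory_snapshot() on a build that includes this PR.
def bytes_lost_to_rounding():
    wasted = 0
    for segment in torch.cuda.memory_snapshot():
        for block in segment["blocks"]:
            if block["state"] == "active_allocated":
                wasted += block["size"] - block.get("requested_size", block["size"])
    return wasted

x = torch.randn(513, 1023, device="cuda")  # odd shape to provoke rounding
print(f"bytes allocated purely due to rounding: {bytes_lost_to_rounding()}")
```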

@c-odrin c-odrin requested a review from zdevito December 12, 2022 15:23
Contributor

@zdevito zdevito left a comment


Looks good. I have a couple nits that should be addressed (missing blockinfo export and test), but this otherwise looks good to me.

c10/cuda/CUDACachingAllocator.cpp (outdated, resolved)
torch/csrc/cuda/Module.cpp (resolved)
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D40810674

c-odrin added a commit to c-odrin/pytorch that referenced this pull request Jan 10, 2023
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D40810674

@facebook-github-bot
Contributor

@c-odrin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

1 similar comment

@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorch-bot pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) Feb 9, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

huydhn added a commit to huydhn/pytorch that referenced this pull request Feb 9, 2023
Memory usage increase after pytorch#88575
@huydhn huydhn reopened this Feb 9, 2023
@huydhn huydhn closed this Feb 9, 2023
@pytorch pytorch deleted a comment from pytorchmergebot Feb 9, 2023
@pytorch pytorch deleted a comment from pytorchmergebot Feb 9, 2023
pytorchmergebot pushed a commit that referenced this pull request Feb 10, 2023
Memory usage increases after #88575. Docker crashes with exit code 137, which clearly means out of memory.

Pull Request resolved: #94548
Approved by: https://github.com/seemethere
Labels: ciflow/trunk (Trigger trunk jobs on your pull request), fb-exported, Merged