Add memory_tracker tool to help profile memory usage #88825
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88825
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 0bf8642.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 22f1a95 to 49c2f52.
Force-pushed from 65fb257 to dab82f6.
@zhaojuanmao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Besides these global stats, it will also be helpful to tell how much memory is activation, how much is temporary, etc.
Those will be added in follow-up PRs.
Force-pushed from dab82f6 to 9a66a5f.
@zhaojuanmao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Force-pushed from 9a66a5f to 53aadda.
@zhaojuanmao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
LGTM! I left some nits. Feel free to take a look, address as desired, and land.
Force-pushed from 53aadda to 2325e4b.
@zhaojuanmao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Failures are not related.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA: 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: The following mandatory check(s) failed (Rule …). Dig deeper by viewing the failures on hud. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Force-pushed from 2325e4b to 0bf8642.
Merge started. Your change will be merged once all checks pass (ETA: 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@zhaojuanmao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge
The merge job was canceled. If you believe this is a mistake, then you can re-trigger it through pytorch-bot.
@pytorchbot merge -f "flaky CI"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This PR adds a memory_tracker API that shows operator-level memory traces for the allocated_memory, active_memory, and reserved_memory stats, and also prints a summary of the top 20 operators that generate memory. The implementation mainly uses TorchDispatchMode and module hooks to collect traces and add markers (a rough sketch of this approach appears after this description).

Follow-up PRs will:
1. Allow tracing more than one iteration.
2. Dump JSON data for visualization.
3. Add a unit test for DDP training.
4. Add a unit test for FSDP training.
5. Add a unit test for activation checkpointing + DDP/FSDP training.
6. Add traces for activation memory and the top operators that generate activation memory.
7. Print summaries for more breakdowns, such as model size and optimizer states.
8. Add traces for temporary memory, or memory consumed by CUDA streams or the NCCL library, if possible.
9. Connect the tool with OOM memory debugging.
10. Add a dynamic programming (DP) algorithm to find the best activation checkpointing locations based on the operator-level activation memory traces.
11. Add the same traces and DP algorithm for module-level memory stats, since FSDP wrapping depends on module-level memory; users who are not model authors and have to apply activation checkpointing at the module level need module-level memory traces as well.

Current test result for memory_tracker_example.py in a notebook:

Top 20 ops that generate memory:
bn1.forward.cudnn_batch_norm.default_0: 98.0009765625MB
maxpool.forward.max_pool2d_with_indices.default_0: 74.5MB
layer1.0.conv1.backward.max_pool2d_with_indices_backward.default_0: 49.0MB
layer1.0.bn1.forward.cudnn_batch_norm.default_1: 24.5009765625MB
layer1.0.bn2.forward.cudnn_batch_norm.default_2: 24.5009765625MB
layer1.1.bn1.forward.cudnn_batch_norm.default_3: 24.5009765625MB
layer1.1.bn2.forward.cudnn_batch_norm.default_4: 24.5009765625MB
layer1.2.bn1.forward.cudnn_batch_norm.default_5: 24.5009765625MB
layer1.2.bn2.forward.cudnn_batch_norm.default_6: 24.5009765625MB
layer1.0.conv1.forward.convolution.default_1: 24.5MB
layer1.0.conv2.forward.convolution.default_2: 24.5MB
layer1.1.conv1.forward.convolution.default_3: 24.5MB
layer1.1.conv2.forward.convolution.default_4: 24.5MB
layer1.2.conv1.forward.convolution.default_5: 24.5MB
layer1.2.conv2.forward.convolution.default_6: 24.5MB
maxpool.backward.threshold_backward.default_32: 23.5MB
layer2.0.downsample.backward.convolution_backward.default_26: 12.2802734375MB
layer2.0.bn1.forward.cudnn_batch_norm.default_7: 12.2509765625MB
layer2.0.bn2.forward.cudnn_batch_norm.default_8: 12.2509765625MB
layer2.0.downsample.1.forward.cudnn_batch_norm.default_9: 12.2509765625MB

<img width="1079" alt="Screen Shot 2022-11-10 at 10 03 06 AM" src="https://user-images.githubusercontent.com/48731194/201172577-ddfb769c-fb0f-4962-80df-92456b77903e.png">

Pull Request resolved: pytorch#88825
Approved by: https://github.com/awgu
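For readers unfamiliar with the technique the description names, below is a minimal, hypothetical sketch of how a TorchDispatchMode plus module hooks can attribute per-operator memory usage. This is not the PR's actual MemoryTracker code: the names `MemoryTraceMode` and `attach_module_markers`, the toy model, and the reporting format are made up for illustration, and the sketch only labels modules via forward pre-hooks (the real tool also marks backward ops).

```python
import torch
import torch.nn as nn
from torch.utils._python_dispatch import TorchDispatchMode


class MemoryTraceMode(TorchDispatchMode):
    """Record CUDA memory stats after every dispatched aten op (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.current_module = "root"  # updated by the module pre-hooks below
        self.traces = []              # (label, delta_allocated_MB, reserved_MB, active_MB)

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        before = torch.cuda.memory_allocated()
        out = func(*args, **(kwargs or {}))
        after = torch.cuda.memory_allocated()
        active = torch.cuda.memory_stats().get("active_bytes.all.current", 0)
        self.traces.append((
            # e.g. "0.aten.addmm.default"; backward ops keep the last-seen module name in this sketch
            f"{self.current_module}.{func}",
            (after - before) / 2**20,            # memory newly allocated by this op
            torch.cuda.memory_reserved() / 2**20,
            active / 2**20,
        ))
        return out


def attach_module_markers(model: nn.Module, mode: MemoryTraceMode) -> None:
    """Forward pre-hooks tag each op with the module that is currently executing."""
    for name, module in model.named_modules():
        module.register_forward_pre_hook(
            lambda m, inp, _name=name: setattr(mode, "current_module", _name or "root")
        )


if __name__ == "__main__" and torch.cuda.is_available():
    model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()
    mode = MemoryTraceMode()
    attach_module_markers(model, mode)
    with mode:
        model(torch.randn(64, 1024, device="cuda")).sum().backward()
    # Summarize the top 20 ops by newly allocated memory, similar in spirit to the PR's output.
    for label, delta, reserved, active in sorted(mode.traces, key=lambda t: -t[1])[:20]:
        print(f"{label}: +{delta:.2f}MB allocated, {reserved:.1f}MB reserved, {active:.1f}MB active")
```

Because the dispatch mode intercepts every aten-level operator while it is active, this style of tracer can attribute allocations to individual kernels without modifying the model code, which is what makes the operator-level breakdown in the summary above possible.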