
Conversation

@zhaojuanmao (Contributor) commented Nov 10, 2022

Adds a memory_tracker API that shows operator-level memory traces for the allocated_memory, active_memory, and reserved_memory stats, and also prints a summary of the top 20 operators that generate the most memory.

The implementation mainly uses TorchDispatchMode and module hooks to collect the traces and add markers.
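
For illustration only, here is a minimal sketch of that general pattern (not the exact code in this PR): a TorchDispatchMode records how much CUDA memory each dispatched op allocates, while module forward pre-hooks set a marker naming the module that is currently running. The class and helper names below are made up for this example.

```python
# Minimal illustrative sketch, not the PR's actual implementation: a
# TorchDispatchMode records how much CUDA memory each dispatched op
# allocates, while module forward pre-hooks mark the current module.
import torch
from torch.utils._python_dispatch import TorchDispatchMode


class MemoryTraceMode(TorchDispatchMode):
    def __init__(self):
        super().__init__()
        self.traces = []          # (marker, op name, MB allocated by the op)
        self.current_marker = ""  # updated by the module hooks installed below

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        before = torch.cuda.memory_allocated()
        out = func(*args, **(kwargs or {}))
        delta_mb = (torch.cuda.memory_allocated() - before) / 2**20
        self.traces.append((self.current_marker, str(func), delta_mb))
        return out


def install_markers(model, mode):
    # Tag subsequent ops with the qualified name of the module currently running.
    for name, module in model.named_modules():
        if name:  # skip the root module itself
            module.register_forward_pre_hook(
                lambda mod, inp, name=name: setattr(mode, "current_marker", f"{name}.forward")
            )
```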

Follow-up PRs will:

  1. allow tracing more than one iteration
  2. dump JSON data for visualization
  3. add a unit test for DDP training
  4. add a unit test for FSDP training
  5. add a unit test for activation checkpointing + DDP/FSDP training
  6. add traces for activation memory and the top operators that generate activation memory
  7. print summaries for more breakdowns, like model size, optimizer states, etc.
  8. add traces for temporary memory and memory consumed by CUDA streams or the NCCL library, if possible
  9. connect the tool with OOM debugging
  10. add a dynamic programming (DP) algorithm to find the best activation checkpointing locations based on the operator-level activation memory traces (a rough sketch of the idea appears after this list)
  11. add the same traces and DP algorithm for module-level memory stats; FSDP wrapping depends on module-level memory, and model users (as opposed to model authors) who apply activation checkpointing at the module level need module-level memory traces as well
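
As a rough illustration of the idea in item 10 (not the planned implementation), a simple DP can split the op sequence into k checkpointed segments so that the largest per-segment activation memory is minimized. The function name and the very simplified cost model below are assumptions for this sketch only.

```python
# Purely illustrative sketch (not the planned implementation): split a
# sequence of per-op activation memory sizes into k checkpointed segments,
# minimizing the largest per-segment activation memory that must be held
# (and recomputed) during backward.
from functools import lru_cache


def best_checkpoint_splits(activation_mb, k):
    """Return (peak_mb, boundaries) minimizing the max segment sum over k segments."""
    n = len(activation_mb)
    prefix = [0.0]
    for m in activation_mb:
        prefix.append(prefix[-1] + m)

    @lru_cache(maxsize=None)
    def dp(i, segments_left):
        # Minimal achievable peak over ops i..n-1 using segments_left segments.
        if segments_left == 1:
            return prefix[n] - prefix[i], (n,)
        best_peak, best_bounds = float("inf"), ()
        for j in range(i + 1, n - segments_left + 2):
            head = prefix[j] - prefix[i]
            tail_peak, tail_bounds = dp(j, segments_left - 1)
            peak = max(head, tail_peak)
            if peak < best_peak:
                best_peak, best_bounds = peak, (j,) + tail_bounds
        return best_peak, best_bounds

    return dp(0, k)


# Example: peak, bounds = best_checkpoint_splits([98.0, 74.5, 24.5, 24.5, 12.25], 2)
```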

======================================================

Current test result for memory_tracker_example.py in a notebook:

Top 20 ops that generate memory are:
bn1.forward.cudnn_batch_norm.default_0: 98.0009765625MB
maxpool.forward.max_pool2d_with_indices.default_0: 74.5MB
layer1.0.conv1.backward.max_pool2d_with_indices_backward.default_0: 49.0MB
layer1.0.bn1.forward.cudnn_batch_norm.default_1: 24.5009765625MB
layer1.0.bn2.forward.cudnn_batch_norm.default_2: 24.5009765625MB
layer1.1.bn1.forward.cudnn_batch_norm.default_3: 24.5009765625MB
layer1.1.bn2.forward.cudnn_batch_norm.default_4: 24.5009765625MB
layer1.2.bn1.forward.cudnn_batch_norm.default_5: 24.5009765625MB
layer1.2.bn2.forward.cudnn_batch_norm.default_6: 24.5009765625MB
layer1.0.conv1.forward.convolution.default_1: 24.5MB
layer1.0.conv2.forward.convolution.default_2: 24.5MB
layer1.1.conv1.forward.convolution.default_3: 24.5MB
layer1.1.conv2.forward.convolution.default_4: 24.5MB
layer1.2.conv1.forward.convolution.default_5: 24.5MB
layer1.2.conv2.forward.convolution.default_6: 24.5MB
maxpool.backward.threshold_backward.default_32: 23.5MB
layer2.0.downsample.backward.convolution_backward.default_26: 12.2802734375MB
layer2.0.bn1.forward.cudnn_batch_norm.default_7: 12.2509765625MB
layer2.0.bn2.forward.cudnn_batch_norm.default_8: 12.2509765625MB
layer2.0.downsample.1.forward.cudnn_batch_norm.default_9: 12.2509765625MB

[Screenshot: Screen Shot 2022-11-10 at 10 03 06 AM]
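
For context, a usage flow roughly along the lines of the example script could look like the sketch below; it reuses the illustrative MemoryTraceMode and install_markers helpers sketched earlier rather than the actual API added by this PR, and the model choice is just an assumption for the example.

```python
# Usage sketch tying the illustrative pieces above together; this is not the
# actual API added by this PR.
import torch
import torchvision

model = torchvision.models.resnet34().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

mode = MemoryTraceMode()
install_markers(model, mode)

inp = torch.randn(8, 3, 224, 224, device="cuda")
with mode:
    model(inp).sum().backward()
optimizer.step()
optimizer.zero_grad()

# Sort recorded ops by the memory they allocated to get a "top ops" style summary.
for marker, op, mb in sorted(mode.traces, key=lambda t: t[2], reverse=True)[:20]:
    print(f"{marker}.{op}: {mb:.4f}MB")
```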

@pytorch-bot (bot) commented Nov 10, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88825

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0bf8642:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@zhaojuanmao zhaojuanmao added the module: distributed_tool tools to help distributed training label Nov 13, 2022
@zhaojuanmao zhaojuanmao force-pushed the memoryTracker branch 2 times, most recently from 65fb257 to dab82f6, on November 13, 2022 01:39
@zhaojuanmao zhaojuanmao changed the title [WIP] add memory_tracker tool to help profiling memory usages add memory_tracker tool to help profiling memory usages Nov 13, 2022
@facebook-github-bot (Contributor)

@zhaojuanmao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Comment on lines 219 to 238 (Contributor):

Besides these global stats, it will also be helpful to tell how much memory is activation, how much is temporary, etc.

Contributor Author:

Those will be added in follow-up PRs.

@facebook-github-bot (Contributor)

@zhaojuanmao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

@zhaojuanmao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@awgu (Collaborator) left a comment

LGTM! I left some nits. Feel free to take a look, address as desired, and land.

@facebook-github-bot (Contributor)

@zhaojuanmao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@zhaojuanmao (Contributor Author)

failures are not related

@zhaojuanmao (Contributor Author)

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 29, 2022
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: The following mandatory check(s) failed (Rule superuser):

Dig deeper by viewing the failures on hud

Details for Dev Infra team (raised by workflow job)

@kit1980 (Contributor) commented Nov 29, 2022

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@facebook-github-bot (Contributor)

@zhaojuanmao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@zhaojuanmao (Contributor Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

The merge job was canceled. If you believe this is a mistake, then you can re-trigger it through pytorch-bot.

@zhaojuanmao (Contributor Author)

@pytorchbot merge -f "flaky CI"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022

Pull Request resolved: pytorch#88825
Approved by: https://github.com/awgu
@github-actions github-actions bot deleted the memoryTracker branch May 30, 2024 01:55

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: distributed_tool (tools to help distributed training), release notes: distributed (tools)
