tune down batch-size for res2net to avoid OOM #122977

shunting314 · 2024-03-29T20:17:41Z

Stack from ghstack (oldest at bottom):

-> tune down batch-size for res2net to avoid OOM #122977

The batch size for this model was previously 64. It was later changed to 256, which causes OOM in the cudagraphs setting. This PR tunes the batch size down to 128.

Sharing more logs from my local run:

cuda,res2net101_26w_4s,128,1.603578,110.273572,335.263494,1.042566,11.469964,11.001666,807,2,7,6,0,0
cuda,res2net101_26w_4s,256,1.714980,207.986155,344.013071,1.058278,22.260176,21.034332,807,2,7,6,0,0

The log shows that torch.compile uses 11GB at batch size 128 and 21GB at batch size 256. I guess the benchmark script has extra overhead that causes the model to OOM at batch size 256 in the dashboard run.
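As a rough check of these numbers, here is a minimal sketch that parses the two log rows from the local run and compares the compiled peak memory at each batch size. It assumes the third field is the batch size and the ninth field is the torch.compile peak memory in GB; that mapping is inferred from the 11GB and 21GB figures quoted in this description, not from a documented header. The names `LOG`, `BATCH_SIZE_COL`, `COMPILED_PEAK_MEM_COL`, and `peak_memory_by_batch_size` are illustrative and not part of the benchmark script.

```python
import csv
import io

# The two rows shared in the PR description, copied verbatim.
LOG = """\
cuda,res2net101_26w_4s,128,1.603578,110.273572,335.263494,1.042566,11.469964,11.001666,807,2,7,6,0,0
cuda,res2net101_26w_4s,256,1.714980,207.986155,344.013071,1.058278,22.260176,21.034332,807,2,7,6,0,0
"""

# Assumed column positions (0-based), inferred from the description:
# field 2 holds the batch size, field 8 the torch.compile peak memory in GB.
BATCH_SIZE_COL = 2
COMPILED_PEAK_MEM_COL = 8


def peak_memory_by_batch_size(log_text):
    """Map batch size -> compiled peak memory (GB) for each row in the log."""
    result = {}
    for row in csv.reader(io.StringIO(log_text)):
        if not row:
            continue  # skip blank lines
        result[int(row[BATCH_SIZE_COL])] = float(row[COMPILED_PEAK_MEM_COL])
    return result


mem = peak_memory_by_batch_size(LOG)
ratio = mem[256] / mem[128]
print(f"batch 128: {mem[128]:.2f} GB, batch 256: {mem[256]:.2f} GB, ratio: {ratio:.2f}x")
```

Under those assumptions, doubling the batch size comes close to doubling the peak memory, which is consistent with batch size 256 leaving little headroom for any extra overhead in the dashboard run.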

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang

[ghstack-poisoned]

pytorch-bot · 2024-03-29T20:17:45Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122977

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit 767ca08 with merge base 57a9a64:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

inductor / rocm6.0-py3.8-inductor / test (inductor, 1, 1, linux.rocm.gpu.2) (gh)
test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited
pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu) (gh)
test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_complex128

BROKEN TRUNK - The following job failed, but the failure was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / linux-jammy-py3.8-gcc11 / test (docs_test, 1, 1, linux.2xlarge) (gh)
Process completed with exit code 2.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 801a995 Pull Request resolved: #122977

shunting314 · 2024-03-29T23:08:15Z

@pytorchbot merge -i

pytorchmergebot · 2024-03-29T23:10:12Z

Merge started

Your change will be merged while ignoring the following 2 checks: pull / linux-jammy-py3.8-gcc11 / test (docs_test, 1, 1, linux.2xlarge), pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

Summary:
The batch size for this model was previously 64. It was later changed to 256, which causes OOM in the cudagraphs setting. This PR tunes the batch size down to 128.

Sharing more logs from my local run:

```
cuda,res2net101_26w_4s,128,1.603578,110.273572,335.263494,1.042566,11.469964,11.001666,807,2,7,6,0,0
cuda,res2net101_26w_4s,256,1.714980,207.986155,344.013071,1.058278,22.260176,21.034332,807,2,7,6,0,0
```

The log shows that torch.compile uses 11GB at batch size 128 and 21GB at batch size 256. I guess the benchmark script has extra overhead that causes the model to OOM at batch size 256 in the dashboard run.

X-link: pytorch/pytorch#122977
Approved by: https://github.com/Chillee
Reviewed By: atalman
Differential Revision: D55561255
Pulled By: shunting314
fbshipit-source-id: 9863e86776d8ed30397806bda330f53c9815f61e


tune down batch-size for res2net to avoid OOM

767ca08

[ghstack-poisoned]

pytorch-bot bot added ciflow/inductor module: dynamo topic: not user facing topic category labels Mar 29, 2024

shunting314 added a commit that referenced this pull request Mar 29, 2024

tune down batch-size for res2net to avoid OOM

56a2b85

ghstack-source-id: 801a995 Pull Request resolved: #122977

shunting314 requested a review from Chillee March 29, 2024 20:17

Chillee approved these changes Mar 29, 2024

View reviewed changes

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 29, 2024

pytorchmergebot added the merging label Mar 29, 2024

pytorchmergebot added the Merged label Mar 30, 2024

pytorchmergebot closed this in aaba3a8 Mar 30, 2024

pytorchmergebot removed the merging label Mar 30, 2024

github-actions bot deleted the gh/shunting314/123/head branch April 30, 2024 01:51
