Reduce overhead in CUDAGraph Trees #98529
Conversation
Is it still slower than the previous cudagraphs implementation was? Do you have an idea why?
@ngimel It's not really slower on any real model (same speed for HuggingFace, etc.), but for ResNet in torchbench, graph.replay() is so quick that just reconstructing the tensors and storages is a noticeable overhead. We also do not have a well-tuned batch size for that model. It's not really about graph breaks: even with no graph breaks, there is still substantial memory overhead in the existing implementation because it does not share memory between forward and backward. For inference, so few tensors are output compared to training that there should be practically no difference. A further training optimization would be to reuse the same TensorImpls for tensors saved for backward, since we know they are live if the backward hasn't run and they do not escape the backward.
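To illustrate the per-replay bookkeeping being discussed: before reusing a prior replay's memory, CUDAGraph Trees has to know whether the storages of that replay's outputs are still alive. A minimal conceptual sketch of such a liveness check, using plain Python weak references (a stand-in, not the actual PyTorch internals, which use C++ StorageWeakRef objects):

```python
# Conceptual sketch: tracking storage liveness via weak references.
# FakeStorage is a hypothetical stand-in for a CUDA storage.
import weakref


class FakeStorage:
    def __init__(self, nbytes):
        self.nbytes = nbytes


outputs = [FakeStorage(1024) for _ in range(8)]
weakrefs = [weakref.ref(s) for s in outputs]

# The caller drops some outputs between replays; their storages die.
del outputs[:4]

# Liveness check before the next replay: which storages survived?
alive = [r for r in weakrefs if r() is not None]
print(len(alive))  # 4
```

Doing a check like this tensor-by-tensor on every replay is exactly the kind of per-call overhead that dominates when graph.replay() itself is nearly free.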
@pytorchbot merge
Pull Request resolved: #98529. Approved by: https://github.com/jansel, https://github.com/ngimel
Stack from ghstack (oldest at bottom):
Significantly reduces the overhead of constructing Tensors and Storages and of checking Storage liveness. Removes the regression for the HF models I tested, and removes 75% of the overhead of the extremely overhead-bound resnet50 training we have in torchbench (0.91x base commit, 1.02x torchinductor default, 1.16x this PR, 1.25x previous cudagraphs impl).

This PR takes care of all of the lower-hanging fruit:
- Computes storage aliasing at record time instead of at runtime. We no longer need a runtime storage cache; instead we index directly into the existing alias if there is one, or construct a new Storage.
- Moves the heavyweight C++ calls into a batch: getting storage weakrefs and constructing tensors.
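The record-time aliasing idea above can be sketched as follows. This is an illustrative toy, not the PR's actual code; the function names and the use of data pointers as alias keys are assumptions for the example. Aliasing is computed once while recording; each replay then just indexes a precomputed table instead of hashing into a runtime storage cache:

```python
# Conceptual sketch of record-time alias computation (hypothetical names).

def compute_alias_indices(record_data_ptrs):
    """At record time: for each output, remember the index of an earlier
    output sharing its storage, or None if it owns a fresh storage."""
    first_seen = {}
    aliases = []
    for i, ptr in enumerate(record_data_ptrs):
        if ptr in first_seen:
            aliases.append(first_seen[ptr])  # aliases an earlier output
        else:
            first_seen[ptr] = i
            aliases.append(None)             # owns a new storage
    return aliases


def reconstruct_storages(alias_indices, make_storage):
    """At replay time: no cache lookups, just index into the table."""
    storages = []
    for idx in alias_indices:
        storages.append(storages[idx] if idx is not None else make_storage())
    return storages


# Outputs 0 and 2 share a storage; 1 and 3 each own their own.
aliases = compute_alias_indices([0x100, 0x200, 0x100, 0x300])
print(aliases)                     # [None, None, 0, None]
storages = reconstruct_storages(aliases, make_storage=object)
print(storages[0] is storages[2])  # True
```

The point of the split is that the dictionary work happens once per recording, while the hot per-replay path is a plain list walk, which is also what makes it easy to batch the remaining C++ calls.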
cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire