Enable shape padding, cache benchmarking #100982
Conversation
For `python benchmarks/dynamo/huggingface.py --performance --backend inductor --amp --training --only PegasusForCausalLM` I got 319 cache hits and 22 cache misses. Previously, increased compile time was the blocker. It is much reduced now.
This wasn't enabled previously because of concerns around compilation time. From a recent run, this increases compile time by an average of 1-2s and reduces memory compression by ~2%, but for significant performance gains (up to 30% on many HF models). There were flaky failures on my last benchmarking run; waiting for an update now that the multithreading non-deterministic failures are fixed. We can wait for that before landing.
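The shape-padding idea being benchmarked can be sketched in plain Python: round each matmul dimension up to a hardware-friendly multiple before running the kernel, then slice the result back. The helper name `pad_dim` and the multiple-of-8 alignment below are illustrative assumptions, not the PR's actual implementation:

```python
def pad_dim(n: int, multiple: int = 8) -> int:
    """Round n up to the next multiple of `multiple`.

    GPU matmul kernels tend to run fastest when dimensions are
    aligned; the multiple-of-8 choice here is only illustrative.
    """
    remainder = n % multiple
    return n if remainder == 0 else n + multiple - remainder

# e.g. a 1234 x 771 matmul would be padded to 1240 x 776
m, k = 1234, 771
m_pad, k_pad = pad_dim(m), pad_dim(k)
```

The padding costs extra memory (hence the ~2% memory-compression regression mentioned above), which is why the PR benchmarks padded vs. unpadded timings per shape instead of padding unconditionally.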
-with torch.autograd.detect_anomaly(check_nan=False):
+with torch.autograd.set_multithreading_enabled(
+    False
+), torch.autograd.detect_anomaly(check_nan=False):
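For readers unfamiliar with the comma-chained `with` syntax in the new lines: the managers enter left to right and exit right to left, so multithreading is disabled before anomaly detection starts and re-enabled after it ends. A small stand-alone sketch, where both context managers are hypothetical stand-ins for the torch ones:

```python
from contextlib import contextmanager

calls = []  # records the order in which the managers run

@contextmanager
def set_multithreading_enabled(flag):
    # hypothetical stand-in for torch.autograd.set_multithreading_enabled
    calls.append(("mt_enter", flag))
    try:
        yield
    finally:
        calls.append(("mt_exit", flag))

@contextmanager
def detect_anomaly(check_nan=True):
    # hypothetical stand-in for torch.autograd.detect_anomaly
    calls.append(("anomaly_enter", check_nan))
    try:
        yield
    finally:
        calls.append(("anomaly_exit", check_nan))

# Comma-chained managers: enter left-to-right, exit right-to-left.
with set_multithreading_enabled(False), detect_anomaly(check_nan=False):
    calls.append(("body",))
```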
huh, I'm a little surprised we needed this, since you already added this context manager in the outermost part of AOT Autograd (so I'd expect it to already be turned on?) https://github.com/pytorch/pytorch/pull/100982/files#diff-df954bbf954d2dcb81f687876053267ffa4ddb36ed86b7d2bd76319ff2b94416L3136
I'm trying to find the source of a strange bug on CI that I can't repro locally: `FakeTensor` doesn't have an attribute `fake_mode`. I was throwing things at the wall to see if anything fixed it. Still haven't found it yet.
if op is torch.ops.aten.addmm:
    input_pad = None
    if input is not None and input.is_cuda:
        input_pad = torch.randn_like(input)
    pad_time = do_bench(
        lambda: pad_addmm(
            input_pad,
            mat1_pad,
            mat2_pad,
            mat1,
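`do_bench` here is Triton's benchmarking helper (`triton.testing.do_bench`), which warms up the kernel and reports a robust per-call timing. As a rough illustration of what such a timing loop does, here is a minimal pure-Python stand-in; `do_bench_sketch` is hypothetical, and the real helper additionally clears the GPU L2 cache between reps and measures with CUDA events rather than wall clock:

```python
import time

def do_bench_sketch(fn, warmup=3, rep=10):
    """Rough stand-in for triton.testing.do_bench: run fn a few times
    to warm up, then return the median wall-clock time per call in ms."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(rep):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1e3)
    times.sort()
    return times[len(times) // 2]

elapsed_ms = do_bench_sketch(lambda: sum(range(10_000)))
```

In the PR's usage, the padded and unpadded variants are each timed this way and padding is only applied where it actually wins.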
could it be that you are reusing the original inputs here, instead of making a copy (the source of your weird error)?
I got the same error with no changes in #101619. `mat1` and `mat2` are reassigned here; they're not the originals. `mat1_pad` and `mat2_pad` are redundant. My guess is they were there so the matrices wouldn't be in cache, but Triton's `do_bench` clears the cache anyway.
Rebase of #100982 which was already accepted cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 [ghstack-poisoned]
Rebase of #100982 which was already accepted Pull Request resolved: #102200 Approved by: https://github.com/ngimel, https://github.com/jansel
Stack from ghstack (oldest at bottom):
This wasn't enabled previously because of concerns around compilation time. From a recent run, this increases compile time by an average of 1-2s and reduces memory compression by ~2%, but for significant performance gains (up to 30% on many HF models). There were flaky failures on my last benchmarking run; waiting for an update now that the multithreading non-deterministic failures are fixed. We can wait for that before landing.
cc @soumith @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire