[inductor] convert layout of conv weight ahead of time for inference #103642

shunting314 · 2023-06-15T01:36:02Z

Stack from ghstack (oldest at bottom):

-> [inductor] convert layout of conv weight ahead of time for inference #103642

This PR handles inference. I will do a similar thing for training later.

Some manual testing results show this can improve inference perf by 2-3% (an absolute improvement, not a relative one).

convmixer: 4.285x -> 4.309x
resnet50: 2.170x -> 2.203x
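To make the "absolute, not relative" distinction concrete, here is a small arithmetic sketch over the two numbers above. It assumes the reported figures are speedups over eager and that a 1.0x speedup counts as 100%; the helper name `improvement` is only illustrative.

```python
# Speedup over eager before and after this PR, as reported in the description.
speedups = {
    "convmixer": (4.285, 4.309),
    "resnet50": (2.170, 2.203),
}


def improvement(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_points = (after - before) * 100  # assumes 1.0x speedup == 100%
    relative_percent = (after - before) / before * 100
    return round(absolute_points, 1), round(relative_percent, 2)


results = {name: improvement(*pair) for name, pair in speedups.items()}
for name, (abs_pts, rel_pct) in results.items():
    print(f"{name}: +{abs_pts} points absolute, +{rel_pct}% relative")
```

Under that reading, the absolute gains land in the quoted 2-3% range, while the relative gains are noticeably smaller.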

This PR is built upon freezing, since without freezing the weight input of a conv node may not be a parameter directly but the output of precision-converting ops. It is much easier to implement this PR after freezing.

Commands

TORCHINDUCTOR_FREEZING=1 python benchmarks/dynamo/timm_models.py --backend inductor --amp --performance --only convmixer_768_32 --inference

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78

[ghstack-poisoned]

pytorch-bot · 2023-06-15T01:36:04Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/103642


✅ 1 Unrelated Failure

As of commit b8c4556:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

linux-bionic-cuda12.1-py3.10-gcc9-bazel-test / build-and-test (default, 1, 1, linux.4xlarge.nvidia.gpu, unstable) (gh)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 705643764afe28030fbc46bfe289acda5840cf37 Pull Request resolved: #103642

torch/_inductor/compile_fx.py

torch/_inductor/freezing.py

eellison

Cool! A few comments. Let me know what you think.

torch/_inductor/freezing.py

jansel · 2023-06-21T16:03:13Z

At a glance this looks good to me. I'll let @eellison do the final review.

shunting314 · 2023-06-21T23:02:28Z

@eellison @jansel this is more complex than I initially thought. I just found one issue. We convert convolution weights to channels last (either by changing the GraphModule attribute directly or by adding aten.clone nodes to the graph to do the layout conversion). Later on, in compile_fx_inner, when we do FakeTensorProp, the changed strides of the conv weight are propagated through the graph. After the propagation, an output tensor that was contiguous in eager may become a channels-last tensor.
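For readers less familiar with the two layouts, the following is a minimal sketch in plain Python of how strides differ between a contiguous (NCHW) tensor and a channels-last tensor of the same logical shape. The helper names and the example weight shape are hypothetical and are not part of this PR.

```python
def contiguous_strides(shape):
    """Row-major (NCHW-contiguous) strides, counted in elements."""
    strides = []
    running = 1
    for dim in reversed(shape):
        strides.append(running)
        running *= dim
    return tuple(reversed(strides))


def channels_last_strides(shape):
    """Strides of a 4-D tensor with logical NCHW shape stored as NHWC."""
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)


# Hypothetical conv weight shape: (out_channels, in_channels, kH, kW).
weight_shape = (64, 3, 7, 7)
nchw = contiguous_strides(weight_shape)
nhwc = channels_last_strides(weight_shape)
print(nchw)  # (147, 49, 7, 1)
print(nhwc)  # (147, 1, 21, 3)
```

The shape is identical in both cases; only the strides differ, which is why stride propagation can silently change an output's layout.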

I think that to avoid this and make sure we don't change the output tensors' layout, we need to:

1. Collect the layouts of all output tensors before running the pass added by this PR. These represent the eager layouts we need for the output tensors.
2. Perform the pass added by this PR.
3. Explicitly add aten.clone nodes at the end of the graph to convert each output tensor's strides to those collected in step 1.

Another variant of step 3: collect all the output tensors' strides after performing step 2 (this needs an extra FakeTensorProp) and apply step 3 only to those output tensors whose strides changed. But this may not be necessary, since I assume adding redundant no-op aten.clone nodes is fine. We can remove those no-op aten.clone nodes from the graph in a post_grad pass (maybe we already do).
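The bookkeeping behind the three steps and the variant can be sketched in plain Python as below. This is only a toy model: the output names, the stride values, and the helper `strides_needing_restore` are hypothetical, and the real pass would obtain strides from FakeTensorProp and insert aten.clone nodes rather than build dictionaries.

```python
def strides_needing_restore(eager_strides, compiled_strides):
    """Variant of step 3: only outputs whose strides changed need a clone."""
    return {
        name: eager
        for name, eager in eager_strides.items()
        if compiled_strides[name] != eager
    }


# Step 1: strides of each graph output as eager would produce them.
eager_strides = {
    "out0": (147, 49, 7, 1),  # contiguous 4-D output
    "out1": (10, 1),          # 2-D output, unaffected by the pass
}

# Step 2: strides after the weight-layout pass and stride propagation.
compiled_strides = {
    "out0": (147, 1, 21, 3),  # became channels last
    "out1": (10, 1),
}

# Step 3 as proposed: restore every output (some restores are no-ops).
restore_all = dict(eager_strides)
# Variant: restore only the outputs whose strides actually changed.
restore_changed = strides_needing_restore(eager_strides, compiled_strides)
print(restore_changed)
```

Restoring every output is simpler and avoids the extra propagation, at the cost of some no-op clones that a later pass would have to clean up.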

Wdyt?

pytorchmergebot · 2023-06-28T17:42:28Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).


Context: eellison's review comment [here](#103642 (comment)) objects to my code calling `torch.fx.GraphModule.recompile` after I changed the graph. We did not simply remove the call to `recompile` at that time, since doing so would increase the risk that users see or run stale Python code. In this PR, I recompile the GraphModule lazily without increasing that risk. When training BertForMaskedLM, `GraphModule.recompile` is called 707 times and takes 1.8s in total; the whole compilation takes around 60 seconds. By spot checking, I found that the main reason we call recompile so frequently is the inductor pattern matcher. E.g., if we want to replace src_fn with dst_fn, we need to trace both src_fn and dst_fn. After tracing is done, we create a GraphModule, and the init method of GraphModule calls recompile. By recompiling lazily, we reduce the number of calls to `GraphModule._real_recompile` to 37 and its total execution time to 0.045s. (In this PR, `recompile` just marks the class as needing recompilation and is very lightweight; `_real_recompile` does the real recompilation.) cc voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx ipiszy chenyang78 aakhundov [ghstack-poisoned]
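The lazy-recompile idea described above can be illustrated with a small standalone sketch. This is not the `torch.fx.GraphModule` implementation: the class `LazyModule`, its `set_source` method, and the call counter are hypothetical, and only the split between a cheap `recompile` and an expensive `_real_recompile` mirrors the description.

```python
class LazyModule:
    """Toy stand-in for a module whose code is regenerated from source."""

    def __init__(self, source):
        self._source = source
        self._compiled = None
        self._needs_recompile = True
        self.real_recompile_calls = 0  # counter kept only for illustration

    def recompile(self):
        # Cheap: only mark the generated code as stale.
        self._needs_recompile = True

    def _real_recompile(self):
        # Expensive: regenerate the callable from the current source.
        self.real_recompile_calls += 1
        namespace = {}
        exec(self._source, namespace)
        self._compiled = namespace["forward"]
        self._needs_recompile = False

    def set_source(self, source):
        self._source = source
        self.recompile()

    def __call__(self, *args):
        # Recompile on demand, so callers never run stale code.
        if self._needs_recompile:
            self._real_recompile()
        return self._compiled(*args)


module = LazyModule("def forward(x):\n    return x\n")
for i in range(5):
    # Five edits in a row trigger five cheap recompile() calls.
    module.set_source(f"def forward(x):\n    return x + {i}\n")

result = module(10)
print(result, module.real_recompile_calls)
```

However many edits happen between two uses, the expensive step runs at most once per use, which is the effect the numbers above describe.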

[wip][inductor] convert layout of conv weight ahead of time

3b03e06

[ghstack-poisoned]

shunting314 added a commit that referenced this pull request Jun 15, 2023

[wip][inductor] convert layout of conv weight ahead of time

3df114e

ghstack-source-id: 705643764afe28030fbc46bfe289acda5840cf37 Pull Request resolved: #103642

github-actions bot added ciflow/inductor module: inductor labels Jun 15, 2023

XiaobingSuper reviewed Jun 15, 2023

torch/_inductor/compile_fx.py

shunting314 changed the title ~~[wip][inductor] convert layout of conv weight ahead of time~~ [wip][inductor] convert layout of conv weight ahead of time for inference Jun 15, 2023

Update on "[wip][inductor] convert layout of conv weight ahead of time for inference"

ca24229

[ghstack-poisoned]

shunting314 added a commit that referenced this pull request Jun 15, 2023

[wip][inductor] convert layout of conv weight ahead of time for inference

025d708

ghstack-source-id: 3288073a1050922f36e67ccaad694f58942b9f5f Pull Request resolved: #103642

Update on "[wip][inductor] convert layout of conv weight ahead of time for inference"

379fabe

[ghstack-poisoned]

shunting314 added a commit that referenced this pull request Jun 15, 2023

[wip][inductor] convert layout of conv weight ahead of time for inference

05a6bdd

ghstack-source-id: 749f016d781016cec09406d271a4d0f531469ab7 Pull Request resolved: #103642

Update on "[wip][inductor] convert layout of conv weight ahead of time for inference"

53e9ed9

[ghstack-poisoned]

shunting314 added a commit that referenced this pull request Jun 16, 2023

[wip][inductor] convert layout of conv weight ahead of time for inference

5315329

ghstack-source-id: 43e0ce6937b992ccb4e13099fd1b57251a70bc36 Pull Request resolved: #103642

Update on "[wip][inductor] convert layout of conv weight ahead of time for inference"

27b08b4

[ghstack-poisoned]

shunting314 added a commit that referenced this pull request Jun 16, 2023

[wip][inductor] convert layout of conv weight ahead of time for inference

f1f355c

ghstack-source-id: cecab3094abb8a908d8e36dd3075d3c29ba08699 Pull Request resolved: #103642

pytorch-bot bot added the release notes: AO frontend label Jun 16, 2023

shunting314 changed the title ~~[wip][inductor] convert layout of conv weight ahead of time for inference~~ [inductor] convert layout of conv weight ahead of time for inference Jun 16, 2023

shunting314 requested review from jansel, eellison and Chillee June 16, 2023 00:55

shunting314 added a commit that referenced this pull request Jun 16, 2023

[inductor] convert layout of conv weight ahead of time for inference

1c23dca

ghstack-source-id: 8026814f2d984c5454d8fc28b2bab7aa0fc833de Pull Request resolved: #103642

XiaobingSuper reviewed Jun 16, 2023

torch/_inductor/freezing.py (outdated)

shunting314 added a commit that referenced this pull request Jun 16, 2023

[inductor] convert layout of conv weight ahead of time for inference

0f9abc1

ghstack-source-id: 310f78944dec0abf991d0ccaab408ccbcfcb16a5 Pull Request resolved: #103642

eellison reviewed Jun 16, 2023

torch/_inductor/freezing.py (outdated)

torch/_inductor/freezing.py (outdated)

jansel removed their request for review June 21, 2023 16:03

shunting314 requested a review from eellison June 21, 2023 23:08

pytorchmergebot added Merged and removed merging labels Jun 28, 2023

pytorchmergebot closed this in 98f00f8 Jun 28, 2023

jeanschmidt added the ciflow/slow label Jun 29, 2023

facebook-github-bot deleted the gh/shunting314/66/head branch July 2, 2023 14:16

shunting314 mentioned this pull request Jul 15, 2023

recompile fx.GraphModule lazily #105257

Closed
