Enable Lowering Channels last Conv1x1 when max autotune is set #107004
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/107004
Note: Links to docs will display an error until the docs builds have been completed.
✅ 1 Unrelated Failure as of commit 80262e7 with merge base ed07821. BROKEN TRUNK: the following job failed but was also present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
I have a couple of questions.
The mm shapes you get when you convert a 1x1 conv to an mm are pretty unusual, so I think the cuBLAS heuristics are not as well tuned for them. For example, one addmm in resnet:
This is for inference; I'm getting training data as well. But the pure mm shapes are faster, so it should help.
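For context, here is a minimal standalone sketch (not code from this PR; shapes are illustrative) of how a 1x1 convolution over a channels-last input reduces to a single tall, skinny mm, the kind of shape the cuBLAS heuristics are reportedly less tuned for:

```python
import torch

# Illustrative shapes only; a typical ResNet 1x1 conv yields a tall, skinny mm.
N, C_in, C_out, H, W = 8, 64, 256, 56, 56
x = torch.randn(N, C_in, H, W).to(memory_format=torch.channels_last)
w = torch.randn(C_out, C_in, 1, 1)

ref = torch.nn.functional.conv2d(x, w)

# Channels-last strides make permute(0, 2, 3, 1) a dense view, so the reshape
# below is copy-free and the conv becomes an (N*H*W, C_in) @ (C_in, C_out) mm.
x2d = x.permute(0, 2, 3, 1).reshape(N * H * W, C_in)
y2d = x2d @ w.reshape(C_out, C_in).t()
y = y2d.reshape(N, H, W, C_out).permute(0, 3, 1, 2)

# Matches the conv result up to floating-point rounding.
torch.testing.assert_close(y, ref, rtol=1e-4, atol=1e-4)
```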
dilation = pad_listlike(dilation, ndim)
output_padding = pad_listlike(output_padding, ndim)

def channels_last_conv():
Any intuitive understanding for why this prefers channels last?
Since we permute the input with permute(0, 2, 3, 1), a channels-last input yields a dense tensor, whereas a contiguous (NCHW) input yields a badly strided tensor that is a poor operand for mm.
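A quick standalone check of this (not code from this PR):

```python
import torch

nchw = torch.randn(2, 8, 16, 16)                    # default contiguous (NCHW) layout
nhwc = nchw.to(memory_format=torch.channels_last)   # same data, channels-last strides

# Channels-last input: the NHWC permutation is already dense.
print(nhwc.permute(0, 2, 3, 1).is_contiguous())  # True
# Contiguous NCHW input: the same permutation is a strided, non-dense view.
print(nchw.permute(0, 2, 3, 1).is_contiguous())  # False
```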
layout = conv_layout(x, weight, None, **kwargs)
req_stride_order = ir.get_stride_order(
    V.graph.sizevars.size_hints(layout.stride)
)
return req_stride_order == ir.NHWC_STRIDE_ORDER
just to make sure I understand -- this is checking that the layout of the output is NHWC, and if it is, then x and w are both NHWC?
When we lower conv1x1 as mm, we make the output NHWC. This check verifies, by looking at the output layout, that we would have made it NHWC anyway.
When you call into aten.convolution, if either x or w is channels last, the other is made channels last as well by the op before the kernel is invoked.
Re: "x and w are both NHWC": this is true with respect to the convolution kernel itself, but not necessarily the inputs to aten::convolution. Just one input needs to be NHWC for the output to be NHWC (see the sketch below).
Within the conv1x1-as-mm lowering we force the input to channels-last strides, so we want to avoid:
- an extra copy that makes the input channels last where that copy wouldn't have happened otherwise
- accidentally changing the output striding, which might have downstream effects
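The "just one input needs to be NHWC" behavior can be observed in eager mode. This is a hedged sketch of the cuDNN/CUDA path described above; other backends may choose a different output layout:

```python
import torch

# Requires a CUDA build with cuDNN; other backends may pick a different output layout.
x = torch.randn(2, 8, 16, 16, device="cuda").to(memory_format=torch.channels_last)
w = torch.randn(4, 8, 1, 1, device="cuda")  # weight left in the default contiguous layout

y = torch.nn.functional.conv2d(x, w)
# One channels-last operand is enough for the convolution to produce a channels-last output.
print(y.is_contiguous(memory_format=torch.channels_last))  # expected: True
```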
… set" This can lead to a large speedup when max autotune is set, e.g. resnet 2.1x -> 2.5x, particularly in combination with freezing. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 kadeng muchulee8 aakhundov [ghstack-poisoned]
|
@pytorchbot merge -f "unrelated failure"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
This can lead to a large speedup when max autotune is set, e.g. resnet 2.1x -> 2.5x, particularly in combination with freezing.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov
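For readers who want to exercise this lowering, a hedged usage sketch; the toy model, shapes, and the `torch._inductor.config.freezing` flag are illustrative assumptions and not part of this PR:

```python
import torch
import torch.nn as nn
import torch._inductor.config as inductor_config

# Assumption: inference freezing is turned on via this inductor config flag.
inductor_config.freezing = True

# Toy model with a 1x1 convolution, run in channels-last memory format on GPU.
model = nn.Sequential(nn.Conv2d(64, 256, kernel_size=1), nn.ReLU())
model = model.eval().cuda().to(memory_format=torch.channels_last)
x = torch.randn(8, 64, 56, 56, device="cuda").to(memory_format=torch.channels_last)

with torch.no_grad():
    # "max-autotune" is what gates the conv1x1-as-mm lowering this PR enables.
    compiled = torch.compile(model, mode="max-autotune")
    out = compiled(x)
```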