Enable Lowering Channels last Conv1x1 when max autotune is set #107004
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/107004
Note: Links to docs will display an error until the docs builds have been completed.
✅ 1 Unrelated Failure as of commit 80262e7 with merge base ed07821. BROKEN TRUNK: the following job failed but was also present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
I have a couple of questions.
The mm shapes you get when you convert a 1x1 conv to an mm are pretty unusual, so I think the cuBLAS heuristics are not as well tuned for them. For example, one addmm in resnet:
This is for inference; I'm getting training data as well. But the pure mm shapes are faster, so it should help.
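For context, here is a minimal standalone sketch (not code from this PR; shapes are illustrative) of how a 1x1 convolution over a channels-last input reduces to a single tall, skinny mm, the kind of shape the cuBLAS heuristics are reportedly less tuned for:

```python
import torch

# Illustrative shapes only; a typical ResNet 1x1 conv yields a tall, skinny mm.
N, C_in, C_out, H, W = 8, 64, 256, 56, 56
x = torch.randn(N, C_in, H, W).to(memory_format=torch.channels_last)
w = torch.randn(C_out, C_in, 1, 1)

ref = torch.nn.functional.conv2d(x, w)

# Channels-last strides make permute(0, 2, 3, 1) a dense view, so the reshape
# below is copy-free and the conv becomes an (N*H*W, C_in) @ (C_in, C_out) mm.
x2d = x.permute(0, 2, 3, 1).reshape(N * H * W, C_in)
y2d = x2d @ w.reshape(C_out, C_in).t()
y = y2d.reshape(N, H, W, C_out).permute(0, 3, 1, 2)

# Matches the conv result up to floating-point rounding.
torch.testing.assert_close(y, ref, rtol=1e-4, atol=1e-4)
```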
dilation = pad_listlike(dilation, ndim)
output_padding = pad_listlike(output_padding, ndim)

def channels_last_conv():
Any intuitive understanding for why this prefers channels last?
Since we permute the input with permute(0, 2, 3, 1), a channels-last input yields a dense tensor, whereas a contiguous (NCHW) input yields a badly strided tensor that is a poor operand for mm.
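A quick standalone check of this (not code from this PR):

```python
import torch

nchw = torch.randn(2, 8, 16, 16)                    # default contiguous (NCHW) layout
nhwc = nchw.to(memory_format=torch.channels_last)   # same data, channels-last strides

# Channels-last input: the NHWC permutation is already dense.
print(nhwc.permute(0, 2, 3, 1).is_contiguous())  # True
# Contiguous NCHW input: the same permutation is a strided, non-dense view.
print(nchw.permute(0, 2, 3, 1).is_contiguous())  # False
```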
layout = conv_layout(x, weight, None, **kwargs)
req_stride_order = ir.get_stride_order(
    V.graph.sizevars.size_hints(layout.stride)
)
return req_stride_order == ir.NHWC_STRIDE_ORDER
just to make sure I understand -- this is checking that the layout of the output is NHWC, and if it is, then x and w are both NHWC?
When we lower conv1x1 as mm, we make the output NHWC. This check verifies, by looking at the output layout, that we would have made it NHWC anyway.
When you call into aten.convolution, if either x or w is channels last, the other is made channels last as well by the op before the kernel is invoked.
Re: "x and w are both NHWC": this is true with respect to the convolution kernel itself, but not necessarily the inputs to aten::convolution. Just one input needs to be NHWC for the output to be NHWC (see the sketch below).
Within the conv1x1-as-mm lowering we force the input to channels-last strides, so we want to avoid:
- an extra copy that makes the input channels last where that copy wouldn't have happened otherwise
- accidentally changing the output striding, which might have downstream effects
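The "just one input needs to be NHWC" behavior can be observed in eager mode. This is a hedged sketch of the cuDNN/CUDA path described above; other backends may choose a different output layout:

```python
import torch

# Requires a CUDA build with cuDNN; other backends may pick a different output layout.
x = torch.randn(2, 8, 16, 16, device="cuda").to(memory_format=torch.channels_last)
w = torch.randn(4, 8, 1, 1, device="cuda")  # weight left in the default contiguous layout

y = torch.nn.functional.conv2d(x, w)
# One channels-last operand is enough for the convolution to produce a channels-last output.
print(y.is_contiguous(memory_format=torch.channels_last))  # expected: True
```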
… set" This can lead to a large speedup when max autotune is set, e.g. resnet 2.1x -> 2.5x, particularly in combination with freezing. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 kadeng muchulee8 aakhundov [ghstack-poisoned]
|
@pytorchbot merge -f "unrelated failure"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
This can lead to a large speedup when max autotune is set, e.g. resnet 2.1x -> 2.5x, particularly in combination with freezing.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov
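For readers who want to exercise this lowering, a hedged usage sketch; the toy model, shapes, and the `torch._inductor.config.freezing` flag are illustrative assumptions and not part of this PR:

```python
import torch
import torch.nn as nn
import torch._inductor.config as inductor_config

# Assumption: inference freezing is turned on via this inductor config flag.
inductor_config.freezing = True

# Toy model with a 1x1 convolution, run in channels-last memory format on GPU.
model = nn.Sequential(nn.Conv2d(64, 256, kernel_size=1), nn.ReLU())
model = model.eval().cuda().to(memory_format=torch.channels_last)
x = torch.randn(8, 64, 56, 56, device="cuda").to(memory_format=torch.channels_last)

with torch.no_grad():
    # "max-autotune" is what gates the conv1x1-as-mm lowering this PR enables.
    compiled = torch.compile(model, mode="max-autotune")
    out = compiled(x)
```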