Remove fgrad_input from slow_conv2d #64280
Conversation
ooh so delightful cc @jbschlosser

Actually, @jbschlosser, I'm going to leave reviewing and landing this one up to you, since it is a logical step towards convolution rationalization.

Very timely PR, @peterbell10 :) love it!
Very nice work! Left a couple of comments below.
```diff
@@ -9490,31 +9490,31 @@
     CPU: slow_conv_transpose3d_backward_cpu
     CUDA: slow_conv_transpose3d_backward_cuda

-- func: thnn_conv2d.out(Tensor self, Tensor weight, int[2] kernel_size, Tensor? bias=None, int[2] stride=1, int[2] padding=0, *, Tensor(a!) out) -> Tensor(a!)
+- func: _slow_conv2d.out(Tensor self, Tensor weight, int[2] kernel_size, Tensor? bias=None, int[2] stride=1, int[2] padding=0, *, Tensor(a!) out) -> Tensor(a!)
```
The name change is sorely needed and I welcome it, but I predict some additional internal breakage. Based on how much there is, it might make sense to limit the scope of this PR to only `fgrad_input` removal and bump the name change to a future PR, TBD.
I've restored the name of the top-level function, but I think changing the names of `_forward` and `_backward` is okay since they aren't really public and the signature has to change either way.
```cpp
if (grad_bias.defined()) {
  Tensor grad_bias_mut = grad_bias;
  at::sum_out(grad_bias_mut, grad_output, IntArrayRef{0, 2, 3});
}
```
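For readers following along, here is a minimal standalone sketch (my illustration, not part of the PR) checking numerically that the conv2d bias gradient is exactly `grad_output` summed over dims `{0, 2, 3}`, which is what the `at::sum_out` call above computes:

```python
import torch
import torch.nn.functional as F

x = torch.rand(2, 3, 8, 8)
w = torch.rand(4, 3, 3, 3)
b = torch.rand(4, requires_grad=True)

out = F.conv2d(x, w, b)
grad_output = torch.rand_like(out)
out.backward(grad_output)

# The bias gradient is the reduction of grad_output over the batch (0)
# and spatial (2, 3) dimensions, matching at::sum_out(..., {0, 2, 3}).
assert torch.allclose(b.grad, grad_output.sum(dim=(0, 2, 3)))
```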
This is a much nicer way to compute `grad_bias` and seems correct, AFAICT.
A couple of questions regarding this:
- Are there any tests verifying the `grad_bias` calculation? I briefly checked but nothing stood out.
- Do you by chance have any comparative benchmarking results for the change?
> Are there any tests verifying the `grad_bias` calculation? I briefly checked but nothing stood out.
The nn tests are a bit confusing, but I think the gradcheck is done here, in `_check_gradients` (`pytorch/torch/testing/_internal/common_nn.py`, line 5807 at 59fcbd1):

```python
def _check_gradients(self, test_case, module, input_tuple):
```
That test includes inputs and parameters, so `bias` should be covered.
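As a self-contained illustration (my sketch, not the test suite's code), the same property can also be checked directly with `torch.autograd.gradcheck`, which requires double precision:

```python
import torch
import torch.nn.functional as F

x = torch.rand(1, 2, 5, 5, dtype=torch.double)
w = torch.rand(3, 2, 3, 3, dtype=torch.double)
b = torch.rand(3, dtype=torch.double, requires_grad=True)

# Only the bias requires grad, so this exercises just the grad_bias path.
torch.autograd.gradcheck(lambda bias: F.conv2d(x, w, bias), (b,))
```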
> Do you by chance have any comparative benchmarking results for the change?
I've done some simple benchmarking which showed positive results. LMK if you want more.
For CUDA, I've run a convolution in a loop under nvprof with these parameters:
```python
input = torch.rand(1, 50, 260, 260, device=device, requires_grad=False)
weight = torch.rand(10, 50, 3, 3, device=device, requires_grad=False)
bias = torch.rand(10, device=device, requires_grad=True)
```
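A harness along these lines could reproduce the setup (a reconstruction under stated assumptions: the exact loop isn't in the PR, and disabling cuDNN is assumed here to force the slow conv path); run the whole script under `nvprof`:

```python
import torch
import torch.nn.functional as F

torch.backends.cudnn.enabled = False  # assumption: fall back to the slow conv path
device = "cuda"

input = torch.rand(1, 50, 260, 260, device=device, requires_grad=False)
weight = torch.rand(10, 50, 3, 3, device=device, requires_grad=False)
bias = torch.rand(10, device=device, requires_grad=True)

for _ in range(100):
    out = F.conv2d(input, weight, bias)
    out.backward(torch.ones_like(out))
    bias.grad = None  # avoid accumulating gradients across iterations
torch.cuda.synchronize()  # ensure all kernels are captured by the profiler
```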
The result is that it calls one kernel instead of two (the cuBLAS gemm here becomes `dot_kernel` + `reduce_1Block_kernel`). The kernel execution time is also noticeably faster:
|          | Master (us) | This PR (us) |
|----------|-------------|--------------|
| Forward  | 23.8        | 14.6         |
| Backward | 22.4        | 18.1         |
On CPU (without nvprof), the bias overhead is dwarfed by the main convolution, and the same shapes showed no measurable difference. So I also tried a smaller input size with H=120, C_in=2, which showed some minor improvement in the forward case (presumably from vectorization).
|                  | Master (us) | This PR (us) |
|------------------|-------------|--------------|
| Forward (Big)    | 15,100      | 15,100       |
| Backward (Big)   | 13,100      | 13,100       |
| Forward (Small)  | 526         | 499          |
| Backward (Small) | 1,050       | 1,050        |
```cpp
TORCH_CHECK(!bias.defined() || bias.is_contiguous(),
            "bias tensor has to be contiguous");
```
Just to verify my understanding: `bias` no longer needs to be contiguous because `grad_bias` is now computed using `at::sum_out` instead of an ad-hoc kernel that required contiguous input?
That's the general idea. This replaces cuBLAS library calls, which don't support completely arbitrary strides, with `TensorIterator` kernels that do.
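A quick illustration of the relaxed constraint (my example, not from the PR; whether the slow path is actually hit depends on dispatch for the given build and device):

```python
import torch
import torch.nn.functional as F

x = torch.rand(1, 3, 8, 8)
w = torch.rand(4, 3, 3, 3)
bias = torch.rand(8)[::2]  # strided view: shape (4,), stride (2,), not contiguous
assert not bias.is_contiguous()

# With the contiguity TORCH_CHECK removed, a strided bias is accepted.
out = F.conv2d(x, w, bias)
```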
LGTM! Thanks for the update :)
This looks awesome, and I'm glad we don't need BLAS to add the bias or compute the bias grad; it's a leftover from Caffe1 times.
@jbschlosser has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Hey @peterbell10, do you mind doing a rebase for this? Auto-rebase is failing internally.
@jbschlosser has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

FYI @peterbell10, there's still quite a bit of internal breakage happening due to the renaming. I'm looking into how much of it is unavoidable due to the signature change.

@jbschlosser, any update on this?
Hey @peterbell10, sorry for the lack of updates. I got the internal situation figured out and it's merged now. |
Stack from ghstack:
Differential Revision: D30830887