[StaticRuntime][ATen] Add out variant for narrow_copy #49502
Conversation
💊 Dr. CI failures summary (as of commit e92e14b): 2 extra GitHub checks failed.
This pull request was exported from Phabricator. Differential Revision: D25596290
Force-pushed: 6e66d8d to 0cd6d0d
Hi @hlu1, thanks for the PR! Also, I am not sure why you rewrote a full kernel by hand instead of using the built-in one for copy. The built-in copy is most likely going to perform better when dealing with a wide range of sizes and shapes, no?
Rephrasing Alban's point -- does this actually need to be a native op, or can it be an op in e.g. the static runtime namespace, or hidden (torch._C._narrow_copy or something)?
The XLA failure looks a bit weird; I'll report back once my local build finishes.
aten/src/ATen/native/TensorShape.cpp (outdated)
```
Tensor narrow_copy_dense(const Tensor& self, int64_t dim, int64_t start, int64_t length) {
  if (self.is_cuda()) {
```
We might prefer to separate CPU/CUDA kernels instead of writing if/else inside the kernel ;)
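For illustration, the dispatcher-friendly shape is one kernel per backend rather than a device branch inside a single function. A hedged sketch using the custom-op registration API (the `myops` namespace and the kernel bodies are stand-ins, not the PR's code):

```
#include <torch/library.h>
#include <ATen/ATen.h>

// Stand-in kernels; the real PR would put the specialized logic here.
at::Tensor narrow_copy_cpu_sketch(const at::Tensor& self, int64_t dim,
                                  int64_t start, int64_t length) {
  return at::narrow(self, dim, start, length).clone();  // CPU fast path goes here
}

at::Tensor narrow_copy_cuda_sketch(const at::Tensor& self, int64_t dim,
                                   int64_t start, int64_t length) {
  return at::narrow(self, dim, start, length).clone();  // CUDA kernel goes here
}

TORCH_LIBRARY(myops, m) {
  m.def("narrow_copy(Tensor self, int dim, int start, int length) -> Tensor");
}

// The dispatcher routes to the right kernel based on the input's device,
// so no if (self.is_cuda()) branch is needed inside either kernel.
TORCH_LIBRARY_IMPL(myops, CPU, m) {
  m.impl("narrow_copy", narrow_copy_cpu_sketch);
}
TORCH_LIBRARY_IMPL(myops, CUDA, m) {
  m.impl("narrow_copy", narrow_copy_cuda_sketch);
}
```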
Sure
I did find a torch-xla bug revealed by this, but a proper fix is better done in this PR. I suggested some changes inline.
Also please add a test in https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/test/math_kernel_test.cpp to make sure the math kernel result matches your specialized CPU kernel result.
Thanks!
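A test along those lines might look like the following minimal sketch (the test name is illustrative and the real math_kernel_test.cpp may use its own helper macros; this just checks the copy against the view-based reference):

```
// Minimal sketch, not the actual PR test: compare narrow_copy's result
// against the reference produced by the view-based at::narrow.
#include <gtest/gtest.h>
#include <ATen/ATen.h>

TEST(MathKernelTest, NarrowCopy) {
  auto x = at::rand({5, 8, 7});
  for (int64_t dim = 0; dim < 3; ++dim) {
    const int64_t start = 1, length = 4;
    auto expected = x.narrow(dim, start, length);          // view-based reference
    auto actual = at::narrow_copy(x, dim, start, length);  // kernel under test
    ASSERT_TRUE(at::allclose(expected, actual));
  }
}
```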
Yes, functionally it's the same as a narrow followed by a clone. In terms of namespace, there is already a `narrow_copy`.
```
  dispatch:
    CPU, CUDA: narrow_copy_dense
    SparseCPU, SparseCUDA: narrow_copy_sparse

- func: narrow_copy.out(Tensor self, int dim, int start, int length, *, Tensor(a!) out) -> Tensor(a!)
```
Somehow I found my inline comment disappeared, so redoing the comment :P
You can simply leave `narrow_copy_dense` as it is and add a new `narrow_copy_cpu` with the CPU logic. And since `narrow_copy_dense` actually works for all backends, you can simply update this section to (note there is no need to explicitly specify CUDA anymore):
```
CPU: narrow_copy_cpu
SparseCPU,......
Math: narrow_copy_dense
```
It doesn't seem to work. Here are the error messages:
```
RuntimeError: 0 INTERNAL ASSERT FAILED at "caffe2/aten/src/ATen/core/boxing/KernelFunction.cpp":27, please report a bug to PyTorch. aten::narrow_copy has kernels registered to both Math and a backend mapped to AutogradOther. This makes the backend kernel unreachable (see Note [Ambiguity in AutogradOther kernel]). If it's intended to override Math kernel behavior, please open an issue to request a dedicated Autograd dispatch key for the backend.

Canonical state
~~~~~~~~~~~
name: aten::narrow_copy
schema: aten::narrow_copy(Tensor self, int dim, int start, int length) -> (Tensor)
debug: registered at buck-out/dev/gen/caffe2/aten/gen_aten=RegisterSchema.cpp/RegisterSchema.cpp:20
alias analysis kind: FROM_SCHEMA
CPU: registered at buck-out/dev/gen/caffe2/aten/gen_aten=RegisterCPU.cpp/RegisterCPU.cpp:5778 :: (Tensor _0, int _1, int _2, int _3) -> (Tensor _0) [ boxed unboxed ]
SparseCPU: registered at buck-out/dev/gen/caffe2/aten/gen_aten=RegisterSparseCPU.cpp/RegisterSparseCPU.cpp:543 :: (Tensor _0, int _1, int _2, int _3) -> (Tensor _0) [ boxed unboxed ]
SparseCUDA: registered at buck-out/dev/gen/caffe2/aten/gen_aten=RegisterSparseCUDA.cpp/RegisterSparseCUDA.cpp:639 :: (Tensor _0, int _1, int _2, int _3) -> (Tensor _0) [ boxed unboxed ]
Tracer: registered at buck-out/dev/gen/caffe2/generate-code/autograd/generated/TraceType_3.cpp:10432 :: (Tensor _0, int _1, int _2, int _3) -> (Tensor _0) [ boxed unboxed ]
Autograd[alias]: registered at buck-out/dev/gen/caffe2/generate-code/autograd/generated/VariableType_3.cpp:9879 :: (Tensor _0, int _1, int _2, int _3) -> (Tensor _0) [ boxed unboxed ]
Math[alias]: registered at buck-out/dev/gen/caffe2/aten/gen_aten=RegisterMath.cpp/RegisterMath.cpp:5590 :: (Tensor _0, int _1, int _2, int _3) -> (Tensor _0) [ boxed unboxed ]
```
Ah I see, it's because currently no grad formula is implemented for `narrow_copy`:
```
In [2]: a = torch.rand(3, 3, requires_grad=True)

In [3]: b = a.narrow_copy(0, 0, 1)

In [4]: b
Out[4]: tensor([[0.6334, 0.4079, 0.6572]], grad_fn=<NotImplemented>)
```
Changing `Math` to `DefaultBackend` should fix it! :D
Force-pushed: 163272f to da75aca
Could you clarify which output you mean here? When you do a narrow and an in-place copy, no memory is actually allocated, as one is a view and the other is an in-place op.
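To illustrate that claim (a minimal sketch of mine, not code from the PR): `at::narrow` returns a view that aliases the input's storage, and `copy_` writes in place into an existing buffer.

```
#include <ATen/ATen.h>
#include <iostream>

int main() {
  auto src = at::arange(12, at::kFloat).reshape({3, 4});
  auto out = at::empty({1, 4});

  // narrow produces a view: no new memory is allocated for it.
  auto view = at::narrow(src, /*dim=*/0, /*start=*/1, /*length=*/1);
  std::cout << view.is_alias_of(src) << "\n";  // 1: shares storage with src

  // copy_ is in-place: it fills out's existing buffer.
  out.copy_(view);
  std::cout << out.is_alias_of(src) << "\n";   // 0: out keeps its own storage
}
```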
No such function exists in OSS PyTorch, I think. Or did I miss it?
Codecov Report
```
@@            Coverage Diff             @@
##           master   #49502      +/-   ##
==========================================
- Coverage   80.72%   80.72%   -0.01%
==========================================
  Files        1909     1909
  Lines      207051   207081      +30
==========================================
+ Hits       167144   167162      +18
- Misses      39907    39919      +12
```
Also, to be clear, we are talking about moving both the out variant and the non-out variant to a different namespace.
Ah, my mistake, narrow_copy already existed in prior versions.
Summary:
Pull Request resolved: pytorch#49502

It broke the OSS CI the last time I landed it, mostly CUDA tests and Python bindings.

Similar to permute_out, add the out variant of `aten::narrow` (slice in c2), which does an actual copy. `aten::narrow` creates a view; however, a copy is incurred when we call `input.contiguous` in the ops that follow `aten::narrow`, in `concat_add_mul_replacenan_clip`, `casted_batch_one_hot_lengths`, and `batch_box_cox`. {F351263599}

Test Plan:
Unit test:
```
buck test //caffe2/aten:math_kernel_test
buck test //caffe2/test:sparse -- test_narrow
```
Benchmark with the adindexer model:
```
bs = 1 is neutral
Before: I1214 21:32:51.919239 3285258 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0886948. Iters per second: 11274.6
After:  I1214 21:32:52.492352 3285277 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0888019. Iters per second: 11261

bs = 20 shows more gains, probably because the tensors are bigger and therefore the cost of copying is higher
Before: I1214 21:20:19.702445 3227229 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.527563. Iters per second: 1895.51
After:  I1214 21:20:20.370173 3227307 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.508734. Iters per second: 1965.67
```
Reviewed By: ajyu

Differential Revision: D25596290

fbshipit-source-id: bff813f29a0fd36fa56d937426a6d3a03f3af977
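As an aside, the hidden copy described in the summary is easy to see; the following is a minimal illustration of mine, not code from the PR:

```
#include <ATen/ATen.h>
#include <iostream>

int main() {
  auto x = at::rand({4, 8});

  // Narrowing along dim 1 yields a non-contiguous view that shares x's storage.
  auto v = at::narrow(x, /*dim=*/1, /*start=*/2, /*length=*/3);
  std::cout << v.is_contiguous() << "\n";  // 0

  // contiguous() must therefore materialize a fresh copy.
  auto c = v.contiguous();
  std::cout << c.is_alias_of(x) << "\n";   // 0: a real copy was made
}
```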
Force-pushed: da75aca to e92e14b
permute_out is internal only. I didn't export it to OSS.
This pull request has been merged in 4e76616.
@hlu1 actually, after looking at the code, it has nothing to do with this, my bad! It is just a narrow followed by a clone.

Also, I didn't have time to do a full review before you merged this (and it is generally nice to have an accept on GitHub before merging, not just a request changes)... I think you're not testing your new function anywhere? You can add a new OpInfo for it here so that the proper tests will get generated automatically.

Also, I still think the point above about making this a composite op is valid; you could implement it by doing:
```
Tensor& narrow_copy_dense_cpu_out(
    const Tensor& self, int64_t dim, int64_t start, int64_t length, Tensor& output
) {
  // resize output
  auto output_sizes = self.sizes().vec();
  output_sizes[dim] = length;
  at::native::resize_(output, output_sizes);

  // write the content into output
  return output.copy_(at::narrow(self, dim, start, length));
}
```
This will avoid the …
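For reference, here is a self-contained variant of that sketch using only public Tensor methods (`narrow_copy_out_sketch` is an illustrative name, not a PyTorch API), with a usage example:

```
#include <ATen/ATen.h>

// Composite out variant: resize the output, then copy the narrowed view into it.
at::Tensor& narrow_copy_out_sketch(
    const at::Tensor& self, int64_t dim, int64_t start, int64_t length,
    at::Tensor& output) {
  auto output_sizes = self.sizes().vec();
  output_sizes[dim] = length;
  output.resize_(output_sizes);  // reuses output's buffer when it is large enough
  return output.copy_(at::narrow(self, dim, start, length));
}

int main() {
  auto self = at::rand({4, 6});
  auto out = at::empty({0});
  narrow_copy_out_sketch(self, /*dim=*/1, /*start=*/2, /*length=*/3, out);
  // out is now a contiguous 4x3 copy of self's columns 2..4.
}
```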