
[StaticRuntime][ATen] Add out variant for narrow_copy #49502

Closed

Conversation

@hlu1 (Contributor) commented Dec 16, 2020

Summary:
It broke the OSS CI the last time I landed it, mostly CUDA tests and Python bindings.

Similar to permute_out, add an out variant of aten::narrow (slice in c2) that does an actual copy. aten::narrow creates a view; however, a copy is incurred when we call input.contiguous in the ops that follow aten::narrow, namely concat_add_mul_replacenan_clip, casted_batch_one_hot_lengths, and batch_box_cox.

{F351263599}
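For context, a minimal standalone sketch (editor's illustration, not code from this PR) of why the view created by aten::narrow still ends up costing a copy once a downstream op asks for contiguous memory:

```cpp
#include <ATen/ATen.h>

int main() {
  at::Tensor input = at::rand({8, 16});

  // aten::narrow only creates a view; no data is copied at this point.
  at::Tensor view = input.narrow(/*dim=*/1, /*start=*/0, /*length=*/4);

  // Narrowing along dim 1 leaves the view non-contiguous, so when a
  // following op calls .contiguous() it allocates a fresh tensor and
  // copies the data, on every single iteration of the model.
  at::Tensor materialized = view.contiguous();

  // The out variant added here lets the caller hand in a preallocated
  // destination instead, so the copy lands in a reusable buffer.
  return 0;
}
```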

Test Plan:
Unit test:

buck test //caffe2/aten:native_test
buck test //caffe2/test:sparse -- test_narrow

Benchmark with the adindexer model:

bs = 1 is neutral

Before:
I1214 21:32:51.919239 3285258 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0886948. Iters per second: 11274.6
After:
I1214 21:32:52.492352 3285277 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0888019. Iters per second: 11261

bs = 20 shows more gains probably because the tensors are bigger and therefore the cost of copying is higher

Before:
I1214 21:20:19.702445 3227229 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.527563. Iters per second: 1895.51
After:
I1214 21:20:20.370173 3227307 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.508734. Iters per second: 1965.67

Differential Revision: D25596290

@facebook-github-bot (Contributor) commented Dec 16, 2020

💊 CI failures summary and remediations

As of commit e92e14b (more details on the Dr. CI page):


  • 3/3 failures possibly* introduced in this PR
    • 3/3 non-CircleCI failure(s)

Extra GitHub checks: 2 failed



@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D25596290

@facebook-github-bot added the "cla signed" and "oncall: jit" labels on Dec 16, 2020

@albanD (Collaborator) commented Jan 11, 2021

Hi @hlu1, thanks for the PR!
I am not sure how this is different from doing a.narrow().copy_(b). Is the only goal here to remove the framework overhead of making these two calls and have just one (the benchmark you shared seems to indicate that it actually makes no difference, even though I'm not sure which op you're actually benchmarking)? Or is there something else?

Also, I am not sure why you rewrite a full kernel by hand instead of using the builtin one for copy? The builtin copy is most likely going to perform better when dealing with a wide range of sizes and shapes, no?

@gchanan (Contributor) commented Jan 11, 2021

Rephrasing Alban's point -- does this actually need to be a native op, or can it be an op in e.g. the static runtime namespace or hidden (torch._C._narrow_copy or something).

@ailzhang (Contributor) left a comment

The XLA failure looks a bit weird; I will report back once my local build finishes.

Tensor narrow_copy_dense(const Tensor& self, int64_t dim, int64_t start, int64_t length) {
  if (self.is_cuda()) {
Contributor

We might prefer to separate CPU/CUDA kernels instead of writing if/else inside the kernel ;)
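For illustration, a minimal sketch of that split (editor's sketch with hypothetical names and simplified bodies, not the PR's actual kernels): a device-generic kernel composed from existing ops, plus a CPU-specific kernel selected by the dispatcher instead of an if (self.is_cuda()) branch inside one function.

```cpp
#include <ATen/ATen.h>

// Device-generic fallback: built from existing ops, valid on any backend.
at::Tensor narrow_copy_generic(const at::Tensor& self, int64_t dim,
                               int64_t start, int64_t length) {
  return self.narrow(dim, start, length).clone(at::MemoryFormat::Contiguous);
}

// CPU-only kernel: may assume dense CPU storage and take shortcuts
// (e.g. a hand-rolled copy loop); here it just copies via the narrow view.
at::Tensor narrow_copy_cpu_sketch(const at::Tensor& self, int64_t dim,
                                  int64_t start, int64_t length) {
  TORCH_CHECK(self.device().is_cpu(), "expected a CPU tensor");
  auto sizes = self.sizes().vec();
  sizes[dim] = length;
  at::Tensor output = at::empty(sizes, self.options());
  output.copy_(self.narrow(dim, start, length));
  return output;
}
```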

Contributor Author

Sure

@ailzhang (Contributor) left a comment

I did find a torch-xla bug revealed by this, but a more proper fix is better done in this PR. I suggested some changes inline.
Also, please add a test in https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/test/math_kernel_test.cpp to make sure the math kernel result matches your specialized CPU kernel result.
Thanks!

@hlu1 (Contributor, Author) commented Jan 12, 2021

Yes, functionally it's the same as a.narrow().copy_(b). The difference here is that narrow_copy_dense_cpu_out allows the output to be preallocated. Also, the kernel is a specialized implementation that is faster than the copy kernel. The second benchmark (batch_size = 20) does show quite a bit of speedup.

In terms of namespace, there is already a narrow_copy in at::native; I'm just adding an out variant. I don't see a compelling reason to move the out variant to another namespace.
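As a rough usage sketch of what the out variant buys (editor's illustration; the at::narrow_copy_out name and argument order are assumed from the narrow_copy.out schema shown below): the caller owns the output buffer, so a runtime such as Static Runtime can allocate it once and reuse it across iterations.

```cpp
#include <ATen/ATen.h>

int main() {
  at::Tensor input = at::rand({20, 64});
  // Preallocate the destination once, outside the hot loop.
  at::Tensor out = at::empty({20, 16}, input.options());

  for (int iter = 0; iter < 100; ++iter) {
    // Writes the narrowed slice straight into `out`; no per-iteration
    // tensor allocation on this path.
    at::narrow_copy_out(out, input, /*dim=*/1, /*start=*/0, /*length=*/16);
  }
  return 0;
}
```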

  dispatch:
    CPU, CUDA: narrow_copy_dense
    SparseCPU, SparseCUDA: narrow_copy_sparse

- func: narrow_copy.out(Tensor self, int dim, int start, int length, *, Tensor(a!) out) -> Tensor(a!)
Contributor

Somehow my inline comment disappeared, so I am redoing it :P
You can simply leave narrow_copy_dense as it is and add a new narrow_copy_cpu with the CPU logic. And since narrow_copy_dense actually works for all backends, you can simply update this section as follows (note there is no need to explicitly specify CUDA anymore):

CPU: narrow_copy_cpu
SparseCPU,......
Math: narrow_copy_dense

Contributor Author

It doesn't seem to work. Here are the error messages:

> RuntimeError: 0 INTERNAL ASSERT FAILED at "caffe2/aten/src/ATen/core/boxing/KernelFunction.cpp":27, please report a bug to PyTorch. aten::narrow_copy has kernels registered to both Math and a backend mapped to AutogradOther. This makes the backend kernel unreachable (see Note [Ambiguity in AutogradOther kernel]). If it's intended to override Math kernel behavior, please open an issue to request a dedicated Autograd dispatch key for the backend.
Canonical state
~~~~~~~~~~~
name: aten::narrow_copy
schema: aten::narrow_copy(Tensor self, int dim, int start, int length) -> (Tensor)
debug: registered at buck-out/dev/gen/caffe2/aten/gen_aten=RegisterSchema.cpp/RegisterSchema.cpp:20
alias analysis kind: FROM_SCHEMA
CPU: registered at buck-out/dev/gen/caffe2/aten/gen_aten=RegisterCPU.cpp/RegisterCPU.cpp:5778 :: (Tensor _0, int _1, int _2, int _3) -> (Tensor _0) [ boxed unboxed ]
SparseCPU: registered at buck-out/dev/gen/caffe2/aten/gen_aten=RegisterSparseCPU.cpp/RegisterSparseCPU.cpp:543 :: (Tensor _0, int _1, int _2, int _3) -> (Tensor _0) [ boxed unboxed ]
SparseCUDA: registered at buck-out/dev/gen/caffe2/aten/gen_aten=RegisterSparseCUDA.cpp/RegisterSparseCUDA.cpp:639 :: (Tensor _0, int _1, int _2, int _3) -> (Tensor _0) [ boxed unboxed ]
Tracer: registered at buck-out/dev/gen/caffe2/generate-code/autograd/generated/TraceType_3.cpp:10432 :: (Tensor _0, int _1, int _2, int _3) -> (Tensor _0) [ boxed unboxed ]
Autograd[alias]: registered at buck-out/dev/gen/caffe2/generate-code/autograd/generated/VariableType_3.cpp:9879 :: (Tensor _0, int _1, int _2, int _3) -> (Tensor _0) [ boxed unboxed ]
Math[alias]: registered at buck-out/dev/gen/caffe2/aten/gen_aten=RegisterMath.cpp/RegisterMath.cpp:5590 :: (Tensor _0, int _1, int _2, int _3) -> (Tensor _0) [ boxed unboxed ]

Contributor

Ah I see, it's because currently no grad formula is implemented for narrow_copy.

In [2]:  a = torch.rand(3, 3, requires_grad=True)

In [3]: b = a.narrow_copy(0, 0, 1)

In [4]: b
Out[4]: tensor([[0.6334, 0.4079, 0.6572]], grad_fn=<NotImplemented>)

Changing Math to DefaultBackend should fix it! :D
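In other words, the earlier suggestion would become roughly the following (editor's sketch of the dispatch section, reusing the kernel names from the comments above and the sparse entries from the existing schema):

CPU: narrow_copy_cpu
SparseCPU, SparseCUDA: narrow_copy_sparse
DefaultBackend: narrow_copy_dense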


@albanD (Collaborator) commented Jan 12, 2021

> Yes, functionally it's the same as a.narrow().copy_(b). The difference here is that narrow_copy_dense_cpu_out allows the output to be preallocated.

Could you clarify which output you mean here? When you do narrow and inplace copy, no memory is actually allocated as one is a view and the other is an inplace op.
In general, we don't have out variants for things that don't allocate memory, hence my question.

> Similar to permute_out

No such function exists in OSS PyTorch, I think. Or did I miss it?

@codecov codecov bot commented Jan 12, 2021

Codecov Report

Merging #49502 (e92e14b) into master (cf45d65) will decrease coverage by 0.00%.
The diff coverage is 87.09%.

@@            Coverage Diff             @@
##           master   #49502      +/-   ##
==========================================
- Coverage   80.72%   80.72%   -0.01%     
==========================================
  Files        1909     1909              
  Lines      207051   207081      +30     
==========================================
+ Hits       167144   167162      +18     
- Misses      39907    39919      +12     

@gchanan (Contributor) commented Jan 12, 2021

Also, to be clear, we are talking about moving both the out variant and the non-out variant to a different namespace.

@gchanan (Contributor) commented Jan 12, 2021

Ah, my mistake, narrow_copy already existed in prior versions.

The updated commit message:

Summary:
Pull Request resolved: pytorch#49502

It broke the OSS CI the last time I landed it, mostly CUDA tests and Python bindings.

Similar to permute_out, add an out variant of `aten::narrow` (slice in c2) that does an actual copy. `aten::narrow` creates a view; however, a copy is incurred when we call `input.contiguous` in the ops that follow `aten::narrow`, namely `concat_add_mul_replacenan_clip`, `casted_batch_one_hot_lengths`, and `batch_box_cox`.

{F351263599}

Test Plan:
Unit test:

```
buck test //caffe2/aten:math_kernel_test
buck test //caffe2/test:sparse -- test_narrow
```
Benchmark with the adindexer model:
```
bs = 1 is neutral

Before:
I1214 21:32:51.919239 3285258 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0886948. Iters per second: 11274.6
After:
I1214 21:32:52.492352 3285277 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0888019. Iters per second: 11261

bs = 20 shows more gains probably because the tensors are bigger and therefore the cost of copying is higher

Before:
I1214 21:20:19.702445 3227229 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.527563. Iters per second: 1895.51
After:
I1214 21:20:20.370173 3227307 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.508734. Iters per second: 1965.67
```

Reviewed By: ajyu

Differential Revision: D25596290

fbshipit-source-id: bff813f29a0fd36fa56d937426a6d3a03f3af977

@hlu1 (Contributor, Author) commented Jan 12, 2021

> > Yes, functionally it's the same as a.narrow().copy_(b). The difference here is that narrow_copy_dense_cpu_out allows the output to be preallocated.
>
> Could you clarify which output you mean here? When you do narrow and inplace copy, no memory is actually allocated as one is a view and the other is an inplace op.
> In general, we don't have out variants for things that don't allocate memory, hence my question.

When you do a.narrow().copy_(b), b's memory has to be allocated at some point because it's a separate tensor from a. narrow_copy already existed in the codebase before this diff, and it does a copy. narrow_copy_dense_cpu_out is merely an out variant which allows you to pass the output in.

> Similar to permute_out

permute_out is internal only. I didn't export it to OSS.

@facebook-github-bot (Contributor)

This pull request has been merged in 4e76616.

@albanD (Collaborator) commented Jan 13, 2021

> Yes, functionally it's the same as a.narrow().copy_(b).

@hlu1 Actually, after looking at the code, it has nothing to do with this, my bad! It is just a narrow followed by a clone, like a.narrow().clone() (which is how the non-out version is implemented!).

Also, I didn't have time to do a full review before you merged this (and it is generally nice to have an approval on GitHub before merging, not just a request for changes)...

I think you're not testing your new function anywhere? You can add a new OpInfo for it here so that the proper tests will get generated automatically.

Also, I still think the point above about making this a composite op is valid; you could implement it by doing:

Tensor& narrow_copy_dense_cpu_out( 
  const Tensor& self, int64_t dim, int64_t start, int64_t length, Tensor& output
) {
  // resize output
  auto output_sizes = self.sizes().vec();
  output_sizes[dim] = length;
  at::native::resize_(output, output_sizes);

  // write the content into output
  return output.copy_(at::narrow(self, dim, start, length));
}

This will avoid the contiguous() in the current function, which might incur a full copy of the input (which would be very bad for perf).
Also I am still expecting copy_ to be faster than a for-loop around a memcpy in general.

Labels: cla signed, fb-exported, Merged, oncall: jit

5 participants