
[StaticRuntime][ATen] Add out variant for narrow_copy #49502

Closed

Conversation

@hlu1 (Contributor) commented Dec 16, 2020

Summary:
It broke the OSS CI the last time I landed it, mostly CUDA tests and Python bindings.

Similar to permute_out, add an out variant of aten::narrow (slice in c2) that does an actual copy. aten::narrow creates a view; however, a copy is incurred when we call input.contiguous in the ops that follow aten::narrow, namely concat_add_mul_replacenan_clip, casted_batch_one_hot_lengths, and batch_box_cox.

{F351263599}
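For context, a minimal standalone sketch (editor's illustration, not code from this PR) of why the view created by aten::narrow still ends up costing a copy once a downstream op asks for contiguous memory:

```cpp
#include <ATen/ATen.h>

int main() {
  at::Tensor input = at::rand({8, 16});

  // aten::narrow only creates a view; no data is copied at this point.
  at::Tensor view = input.narrow(/*dim=*/1, /*start=*/0, /*length=*/4);

  // Narrowing along dim 1 leaves the view non-contiguous, so when a
  // following op calls .contiguous() it allocates a fresh tensor and
  // copies the data, on every single iteration of the model.
  at::Tensor materialized = view.contiguous();

  // The out variant added here lets the caller hand in a preallocated
  // destination instead, so the copy lands in a reusable buffer.
  return 0;
}
```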

Test Plan:
Unit test:

buck test //caffe2/aten:native_test
buck test //caffe2/test:sparse -- test_narrow

Benchmark with the adindexer model:

bs = 1 is neutral

Before:
I1214 21:32:51.919239 3285258 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0886948. Iters per second: 11274.6
After:
I1214 21:32:52.492352 3285277 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0888019. Iters per second: 11261

bs = 20 shows more gains probably because the tensors are bigger and therefore the cost of copying is higher

Before:
I1214 21:20:19.702445 3227229 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.527563. Iters per second: 1895.51
After:
I1214 21:20:20.370173 3227307 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.508734. Iters per second: 1965.67

Differential Revision: D25596290

@facebook-github-bot (Contributor) commented Dec 16, 2020

💊 CI failures summary and remediations

As of commit e92e14b (more details on the Dr. CI page):


  • 3/3 failures possibly* introduced in this PR
    • 3/3 non-CircleCI failure(s)

Extra GitHub checks: 2 failed



@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D25596290

@facebook-github-bot added the "cla signed" and "oncall: jit" labels on Dec 16, 2020

@albanD (Collaborator) commented Jan 11, 2021

Hi @hlu1, thanks for the PR!
I am not sure how this is different from doing a.narrow().copy_(b). Is the only goal here to remove the framework overhead of making these two calls and have just one (the benchmark you shared seems to indicate that it actually makes no difference, even though I'm not sure which op you're actually benchmarking)? Or is there something else?

Also, I am not sure why you rewrite a full kernel by hand instead of using the builtin one for copy? The builtin copy is most likely going to perform better when dealing with a wide range of sizes and shapes, no?

@gchanan (Contributor) commented Jan 11, 2021

Rephrasing Alban's point -- does this actually need to be a native op, or can it be an op in e.g. the static runtime namespace or hidden (torch._C._narrow_copy or something).

@ailzhang (Contributor) left a comment

The XLA failure looks a bit weird; I will report back once my local build finishes.

Tensor narrow_copy_dense(const Tensor& self, int64_t dim, int64_t start, int64_t length) {
  if (self.is_cuda()) {
Contributor

We might prefer to separate CPU/CUDA kernels instead of writing if/else inside the kernel ;)
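For illustration, a minimal sketch of that split (editor's sketch with hypothetical names and simplified bodies, not the PR's actual kernels): a device-generic kernel composed from existing ops, plus a CPU-specific kernel selected by the dispatcher instead of an if (self.is_cuda()) branch inside one function.

```cpp
#include <ATen/ATen.h>

// Device-generic fallback: built from existing ops, valid on any backend.
at::Tensor narrow_copy_generic(const at::Tensor& self, int64_t dim,
                               int64_t start, int64_t length) {
  return self.narrow(dim, start, length).clone(at::MemoryFormat::Contiguous);
}

// CPU-only kernel: may assume dense CPU storage and take shortcuts
// (e.g. a hand-rolled copy loop); here it just copies via the narrow view.
at::Tensor narrow_copy_cpu_sketch(const at::Tensor& self, int64_t dim,
                                  int64_t start, int64_t length) {
  TORCH_CHECK(self.device().is_cpu(), "expected a CPU tensor");
  auto sizes = self.sizes().vec();
  sizes[dim] = length;
  at::Tensor output = at::empty(sizes, self.options());
  output.copy_(self.narrow(dim, start, length));
  return output;
}
```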

Contributor Author

Sure

@ailzhang (Contributor) left a comment

I did find a torch-xla bug revealed by this, but a more proper fix is better done in this PR. I suggested some changes inline.
Also, please add a test in https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/test/math_kernel_test.cpp to make sure the math kernel result matches your specialized CPU kernel result.
Thanks!

@hlu1 (Contributor, Author) commented Jan 12, 2021

Yes, functionally it's the same as a.narrow().copy_(b). The difference here is that narrow_copy_dense_cpu_out allows the output to be preallocated. Also, the kernel is a specialized implementation that is faster than the copy kernel. The second benchmark (batch_size = 20) does show quite a bit of speedup.

In terms of namespace, there is already a narrow_copy in at::native; I'm just adding an out variant. I don't see a compelling reason to move the out variant to another namespace.
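As a rough usage sketch of what the out variant buys (editor's illustration; the at::narrow_copy_out name and argument order are assumed from the narrow_copy.out schema shown below): the caller owns the output buffer, so a runtime such as Static Runtime can allocate it once and reuse it across iterations.

```cpp
#include <ATen/ATen.h>

int main() {
  at::Tensor input = at::rand({20, 64});
  // Preallocate the destination once, outside the hot loop.
  at::Tensor out = at::empty({20, 16}, input.options());

  for (int iter = 0; iter < 100; ++iter) {
    // Writes the narrowed slice straight into `out`; no per-iteration
    // tensor allocation on this path.
    at::narrow_copy_out(out, input, /*dim=*/1, /*start=*/0, /*length=*/16);
  }
  return 0;
}
```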

  dispatch:
    CPU, CUDA: narrow_copy_dense
    SparseCPU, SparseCUDA: narrow_copy_sparse

- func: narrow_copy.out(Tensor self, int dim, int start, int length, *, Tensor(a!) out) -> Tensor(a!)
Contributor

Somehow my inline comment disappeared, so I am redoing it :P
You can simply leave narrow_copy_dense as it is and add a new narrow_copy_cpu with the CPU logic. And since narrow_copy_dense actually works for all backends, you can simply update this section as follows (note there is no need to explicitly specify CUDA anymore):

CPU: narrow_copy_cpu
SparseCPU,......
Math: narrow_copy_dense

Contributor Author

It doesn't seem to work. Here are the error messages:

> RuntimeError: 0 INTERNAL ASSERT FAILED at "caffe2/aten/src/ATen/core/boxing/KernelFunction.cpp":27, please report a bug to PyTorch. aten::narrow_copy has kernels registered to both Math and a backend mapped to AutogradOther. This makes the backend kernel unreachable (see Note [Ambiguity in AutogradOther kernel]). If it's intended to override Math kernel behavior, please open an issue to request a dedicated Autograd dispatch key for the backend.
Canonical state
~~~~~~~~~~~
name: aten::narrow_copy
schema: aten::narrow_copy(Tensor self, int dim, int start, int length) -> (Tensor)
debug: registered at buck-out/dev/gen/caffe2/aten/gen_aten=RegisterSchema.cpp/RegisterSchema.cpp:20
alias analysis kind: FROM_SCHEMA
CPU: registered at buck-out/dev/gen/caffe2/aten/gen_aten=RegisterCPU.cpp/RegisterCPU.cpp:5778 :: (Tensor _0, int _1, int _2, int _3) -> (Tensor _0) [ boxed unboxed ]
SparseCPU: registered at buck-out/dev/gen/caffe2/aten/gen_aten=RegisterSparseCPU.cpp/RegisterSparseCPU.cpp:543 :: (Tensor _0, int _1, int _2, int _3) -> (Tensor _0) [ boxed unboxed ]
SparseCUDA: registered at buck-out/dev/gen/caffe2/aten/gen_aten=RegisterSparseCUDA.cpp/RegisterSparseCUDA.cpp:639 :: (Tensor _0, int _1, int _2, int _3) -> (Tensor _0) [ boxed unboxed ]
Tracer: registered at buck-out/dev/gen/caffe2/generate-code/autograd/generated/TraceType_3.cpp:10432 :: (Tensor _0, int _1, int _2, int _3) -> (Tensor _0) [ boxed unboxed ]
Autograd[alias]: registered at buck-out/dev/gen/caffe2/generate-code/autograd/generated/VariableType_3.cpp:9879 :: (Tensor _0, int _1, int _2, int _3) -> (Tensor _0) [ boxed unboxed ]
Math[alias]: registered at buck-out/dev/gen/caffe2/aten/gen_aten=RegisterMath.cpp/RegisterMath.cpp:5590 :: (Tensor _0, int _1, int _2, int _3) -> (Tensor _0) [ boxed unboxed ]

Contributor

Ah I see, it's because currently no grad formula is implemented for narrow_copy.

In [2]:  a = torch.rand(3, 3, requires_grad=True)

In [3]: b = a.narrow_copy(0, 0, 1)

In [4]: b
Out[4]: tensor([[0.6334, 0.4079, 0.6572]], grad_fn=<NotImplemented>)

Changing Math to DefaultBackend should fix it! :D
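In other words, the earlier suggestion would become roughly the following (editor's sketch of the dispatch section, reusing the kernel names from the comments above and the sparse entries from the existing schema):

CPU: narrow_copy_cpu
SparseCPU, SparseCUDA: narrow_copy_sparse
DefaultBackend: narrow_copy_dense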


@albanD (Collaborator) commented Jan 12, 2021

> Yes, functionally it's the same as a.narrow().copy_(b). The difference here is that narrow_copy_dense_cpu_out allows the output to be preallocated.

Could you clarify which output you mean here? When you do narrow and inplace copy, no memory is actually allocated as one is a view and the other is an inplace op.
In general, we don't have out variants for things that don't allocate memory, hence my question.

> Similar to permute_out

No such function exists in OSS PyTorch, I think. Or did I miss it?

@codecov codecov bot commented Jan 12, 2021

Codecov Report

Merging #49502 (e92e14b) into master (cf45d65) will decrease coverage by 0.00%.
The diff coverage is 87.09%.

@@            Coverage Diff             @@
##           master   #49502      +/-   ##
==========================================
- Coverage   80.72%   80.72%   -0.01%     
==========================================
  Files        1909     1909              
  Lines      207051   207081      +30     
==========================================
+ Hits       167144   167162      +18     
- Misses      39907    39919      +12     

@gchanan (Contributor) commented Jan 12, 2021

Also, to be clear, we are talking about moving both the out variant and the non-out variant to a different namespace.

@gchanan (Contributor) commented Jan 12, 2021

Ah, my mistake, narrow_copy already existed in prior versions.

The updated commit message:

Summary:
Pull Request resolved: pytorch#49502

It broke the OSS CI the last time I landed it, mostly CUDA tests and Python bindings.

Similar to permute_out, add an out variant of `aten::narrow` (slice in c2) that does an actual copy. `aten::narrow` creates a view; however, a copy is incurred when we call `input.contiguous` in the ops that follow `aten::narrow`, namely `concat_add_mul_replacenan_clip`, `casted_batch_one_hot_lengths`, and `batch_box_cox`.

{F351263599}

Test Plan:
Unit test:

```
buck test //caffe2/aten:math_kernel_test
buck test //caffe2/test:sparse -- test_narrow
```
Benchmark with the adindexer model:
```
bs = 1 is neutral

Before:
I1214 21:32:51.919239 3285258 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0886948. Iters per second: 11274.6
After:
I1214 21:32:52.492352 3285277 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0888019. Iters per second: 11261

bs = 20 shows more gains probably because the tensors are bigger and therefore the cost of copying is higher

Before:
I1214 21:20:19.702445 3227229 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.527563. Iters per second: 1895.51
After:
I1214 21:20:20.370173 3227307 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.508734. Iters per second: 1965.67
```

Reviewed By: ajyu

Differential Revision: D25596290

fbshipit-source-id: bff813f29a0fd36fa56d937426a6d3a03f3af977

@hlu1 (Contributor, Author) commented Jan 12, 2021

> > Yes, functionally it's the same as a.narrow().copy_(b). The difference here is that narrow_copy_dense_cpu_out allows the output to be preallocated.
>
> Could you clarify which output you mean here? When you do narrow and inplace copy, no memory is actually allocated as one is a view and the other is an inplace op.
> In general, we don't have out variants for things that don't allocate memory, hence my question.

When you do a.narrow().copy_(b), b's memory has to be allocated at some point because it's a separate tensor from a. narrow_copy already existed in the codebase before this diff, and it does a copy. narrow_copy_dense_cpu_out is merely an out variant which allows you to pass the output in.

> Similar to permute_out

permute_out is internal only. I didn't export it to OSS.

@facebook-github-bot (Contributor)

This pull request has been merged in 4e76616.

@albanD (Collaborator) commented Jan 13, 2021

> Yes, functionally it's the same as a.narrow().copy_(b).

@hlu1 Actually, after looking at the code, it has nothing to do with this, my bad! It is just a narrow followed by a clone, like a.narrow().clone() (which is how the non-out version is implemented!).

Also, I didn't have time to do a full review before you merged this (and it is generally nice to have an approval on GitHub before merging, not just a request for changes)...

I think you're not testing your new function anywhere? You can add a new OpInfo for it here so that the proper tests will get generated automatically.

Also, I still think the point above about making this a composite op is valid; you could implement it by doing:

Tensor& narrow_copy_dense_cpu_out( 
  const Tensor& self, int64_t dim, int64_t start, int64_t length, Tensor& output
) {
  // resize output
  auto output_sizes = self.sizes().vec();
  output_sizes[dim] = length;
  at::native::resize_(output, output_sizes);

  // write the content into output
  return output.copy_(at::narrow(self, dim, start, length));
}

This will avoid the contiguous() in the current function, which might incur a full copy of the input (which would be very bad for perf).
Also I am still expecting copy_ to be faster than a for-loop around a memcpy in general.

Labels: cla signed, fb-exported, Merged, oncall: jit

5 participants