Added flip() fn in ATen (CPU + CUDA) #7873

Merged 967 commits into pytorch:master on Jun 16, 2018

Conversation

@weiyangfb (Contributor) commented May 26, 2018:

Summary:

  1. fixes flip a Tensor #229
  2. implemented torch.flip() to reverse a tensor (contiguous or non-contiguous) along the specified dimensions
  3. implemented forward and backward functions for both CPU and CUDA
  4. added tests in test_torch, test_cuda, and test_autograd

Details:
Given that a tensor element's offset = sum_i indices[i] * strides[i], we can flip the indices of each element and then copy its value to the corresponding offset.

Usage:
x = torch.arange(8).view(2, 2, 2).flip(0, 1, 2) # flip along the 1st, 2nd, and 3rd dimensions
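
To make the offset arithmetic in the Details section concrete, here is a minimal Python sketch (an illustration only, not the PR's actual kernel) that flips a contiguous tensor by recomputing destination offsets; the final assert reuses the usage example above.

```
import torch

def flip_by_offsets(t, dims):
    # Illustration of offset = sum_i indices[i] * strides[i]: for every source element,
    # flip its indices along `dims` and write the value to the resulting offset.
    src = t.contiguous()
    out = torch.empty_like(src)
    sizes, strides = src.size(), src.stride()
    flat_src, flat_out = src.view(-1), out.view(-1)
    for linear in range(src.numel()):
        # recover the multi-dimensional indices from the linear index
        rem, indices = linear, []
        for s in strides:
            indices.append(rem // s)
            rem = rem % s
        # flip the requested indices and recompute the destination offset
        dst = sum((sizes[d] - 1 - i if d in dims else i) * strides[d]
                  for d, i in enumerate(indices))
        flat_out[dst] = flat_src[linear]
    return out

x = torch.arange(8).view(2, 2, 2)
assert torch.equal(flip_by_offsets(x, (0, 1, 2)), x.flip(0, 1, 2))
```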

Future work:

  1. use thrust to speed up CUDA implementation

@ivan-bilan:

Great, can't wait for this to be released.

@sethah (Contributor) commented May 27, 2018:

Will this need an entry in _torch_docs.py?

@ngimel (Collaborator) commented May 28, 2018:

Nice work!
Please unify the dimension error checking for the CUDA and CPU versions (right now it's 50 lines of copy-pasted code).
For the CUDA implementation, please run a collapseDims pass on the input (see https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/cuda/detail/TensorInfo.cuh), so that e.g. a last-dimension flip of a multi-D contiguous tensor is the same as a dimension flip of a 2D tensor.
Also, instead of implementing a specialized kernel for this, you can create a TensorInfo object for the flipped tensor with negative strides for the flipped dimensions (negative strides are generally not supported, and the TensorInfo IndexType is usually unsigned, but you can instantiate it with a signed type) and run kernelPointwiseApply2 from CUDAApplyUtils.cuh with CopyOp. That way, you don't have to reimplement the indexToOffset and back functions (TensorInfo already has them), and you don't have to materialize an indices tensor (which is really bad for performance).
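
The negative-stride idea can be previewed outside ATen. Below is a small NumPy sketch (NumPy is used only because it exposes raw strides; this is not the ATen code being requested) showing that a flip is just a view whose base pointer is shifted to the end of the flipped dimension and whose stride for that dimension is negated, which a pointwise copy kernel can then materialize.

```
import numpy as np
from numpy.lib.stride_tricks import as_strided

x = np.arange(6, dtype=np.float32).reshape(2, 3)

# Flip dim 1: shift the base pointer by (size - 1) * stride and negate that stride.
flipped_view = as_strided(x[:, -1:],              # base now points at the last column
                          shape=x.shape,
                          strides=(x.strides[0], -x.strides[1]))
assert (flipped_view == x[:, ::-1]).all()

# Copying the strided view into a fresh contiguous buffer materializes the flipped
# tensor, which is what a pointwise copy with a signed-stride TensorInfo would do.
out = np.ascontiguousarray(flipped_view)
```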

@fmassa (Member) commented May 28, 2018:

To follow up on my previous post, here is an (untested) implementation of flip that uses a combination of meshgrid and advanced indexing. It might be good to benchmark it against the current implementation.

import torch

def multi_meshgrid(*args):
    """
    Creates a meshgrid from possibly many
    elements (instead of only 2).
    Returns a tuple of n-d tensors, each with as
    many dimensions as there are arguments.
    """
    args = list(args)
    template = [1 for _ in args]
    for i in range(len(args)):
        n = args[i].shape[0]
        template_copy = template.copy()
        template_copy[i] = n
        args[i] = args[i].view(*template_copy)
        # there will be some broadcast magic going on
    return tuple(args)

def flip(tensor, dims):
    if not isinstance(dims, (tuple, list)):
        dims = [dims]
    indices = [torch.arange(tensor.shape[dim] - 1, -1, -1,
        dtype=torch.int64) for dim in dims]
    multi_indices = multi_meshgrid(*indices)
    # start from full slices, then replace the flipped dims with index tensors
    final_indices = [slice(None) for _ in tensor.shape]
    for i, dim in enumerate(dims):
        final_indices[dim] = multi_indices[i]
    flipped = tensor[tuple(final_indices)]
    # need to permute the final dimensions
    # if dims is not consecutive, but I'm lazy
    # now :-)
    return flipped

@weiyangfb (Contributor, Author):

@sethah Yes, I agree an entry should be added to _torch_docs.py; I will do that after the code is finalized.

@weiyangfb (Contributor, Author):

@fmassa I just gave it a try, and your implementation is indeed much faster!! Here are some results:

Your implementation:

data = torch.arange(1000000).view(1000,1000)
%timeit flip(data, (0,1))
----------------------------

100 loops, best of 3: 7.62 ms per loop

My implementation:

data = torch.arange(1000000).view(1000,1000)
%timeit data.flip(0,1)
----------------------------

100 loops, best of 3: 19.5 ms per loop

@weiyangfb (Contributor, Author):

@ngimel Thanks a ton for the great suggestions! I will modify the CUDA implementation to use TensorInfo. Could you help me understand how to apply negative strides for the flipped dimensions, and why a signed IndexType is important here? Can I also ask why the collapseDims pass is relevant to flipping an nD tensor (maybe with a simple example)?

And yes, I will reuse the error checks.

@weiyangfb (Contributor, Author):

@fmassa Your implementation is very nice and only requires one copy of the input tensor. Can I translate your code into the CPU implementation of flip() using tensor.index()?

@fmassa (Member) commented May 30, 2018:

@weiyangfb definitely! My implementation also works on the GPU; it might be good to benchmark it there as well to see how it compares to the dedicated kernel.
It should probably be a bit slower because I called arange on the CPU, but that could be called on the GPU as well.

The benefit of writing it as a native function is that we don't need to implement a backward pass for it (even though the backward of flip is simply a flip).
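
As a quick sanity check of that last point, the sketch below (written against the flip() added in this PR, or any recent PyTorch) verifies that the gradient of flip is just the flipped upstream gradient.

```
import torch

x = torch.randn(2, 3, requires_grad=True)
grad_out = torch.randn(2, 3)

# y[i, j] = x[1 - i, 2 - j], so dL/dx is grad_out with the same dims flipped.
x.flip(0, 1).backward(grad_out)
assert torch.equal(x.grad, grad_out.flip(0, 1))
```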

@ngimel (Collaborator) commented May 30, 2018:

collapseDims is important because it reduces the amount of indexing math you have to do. Suppose you have a contiguous 4D tensor where you want to flip the last dimension. You can collapse the first 3 dims to view this tensor as 2D; then your indexing math is simpler (you have to loop over just 2 dimensions). If you are flipping multiple dimensions, applying collapseDims is much trickier (it may be impossible if your flip dimensions are not contiguous, say you want to flip 0 and 2), but for a single flipped dimension collapseDims should help.
Now, to negative strides. Suppose you want to flip a 1D tensor. You can create a TensorInfo object with the data pointer pointing to the end of your output tensor, set the stride of the 0-th dimension to -1, copy your original tensor to the tensor described by this TensorInfo object (using the standard pointwiseApply kernel that's already in ATen), and then view the result as a contiguous tensor. Similarly for flips in other dimensions or multiple flipped dimensions: you'd have to move the base pointer and set negative strides for the dimensions you want to flip, but that is CPU code, not GPU. Obviously, since you want negative strides, you cannot use an unsigned IndexType for those values.
That said, it is quite possible that @fmassa's implementation already achieves a good fraction of peak bandwidth; you should benchmark it first (not the absolute time, but what bandwidth you achieve compared to the maximum on your card), in which case you can just use it.
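
A tiny sanity check of the collapsing argument (my own sketch, not ATen's collapseDims pass): flipping the last dimension of a contiguous 4D tensor gives the same result as collapsing the first three dims and flipping dim 1 of the resulting 2D view.

```
import torch

t = torch.arange(2 * 3 * 4 * 5).view(2, 3, 4, 5)

# Collapse the leading dims, flip the (now) last dim, and reshape back.
assert torch.equal(t.flip(3), t.view(-1, 5).flip(1).view_as(t))
```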

@weiyangfb (Contributor, Author):

@fmassa You are completely right! I translated your code and tested it out. Now the CPU performance is similar, while on the GPU my current implementation is slightly faster.

data = torch.arange(1000000).view(1000,1000)
%timeit flip_meshgrid(data, (0,1))
--------------------------------------------

100 loops, best of 3: 10.8 ms per loop

data = torch.arange(1000000).view(1000,1000)
%timeit data.flip(0,1)
--------------------------------------------

100 loops, best of 3: 11 ms per loop

data_cuda = torch.arange(1000000, device=torch.device('cuda')).view(1000,1000)
%timeit flip_meshgrid(data_cuda, (0,1))
--------------------------------------------

1000 loops, best of 3: 1.72 ms per loop

data_cuda = torch.arange(1000000, device=torch.device('cuda')).view(1000,1000)
%timeit data_cuda.flip(0,1)
--------------------------------------------

1000 loops, best of 3: 637 µs per loop

@fmassa (Member) commented May 30, 2018:

Nice, thanks for the benchmarks @weiyangfb! Can you also try adding a torch.cuda.synchronize() when benchmarking the CUDA kernels? Also, I'd be curious to know what fraction of the time was spent on the indexing and what fraction on the torch.arange. Would it be possible to check that as well?

Thanks!

@weiyangfb (Contributor, Author) commented May 30, 2018:

@ngimel Thanks a lot for the detailed instructions! Even though collapseDims might not help in the case of an nD flip, I'd love to use it to speed up the single-dimension flip. So if I understand correctly, I will probably need to apply collapseDims to the nD input tensor with the dim to be flipped excluded, which gives a 2D tensor. Then I will need to use IndexToOffset along with a negative stride to move elements from the src to the dst tensor. One quick question: setting the stride of the 0-th dimension to -1 works for a 1D tensor, so what formula works for a 2D tensor?

Currently I have removed the materialized indices in the CUDA kernel and tested it. I am still not quite sure how to measure GPU bandwidth; here are some numbers from nvidia-smi --query-gpu=gpu_name,gpu_bus_id,utilization.gpu,utilization.memory,memory.used --format=csv -l

@fmassa implementation:

data_cuda = torch.arange(1000000, device=cuda).view(1000,1000)
%timeit flip_meshgrid(data_cuda, (0,1))
-------------------------------------------------------------------
Tesla K40m, 00000000:28:00.0, 83 %, 29 %, 478 MiB

1000 loops, best of 3: 1.72 ms per loop

My implementation with materialized indices:

data_cuda = torch.arange(1000000, device=cuda).view(1000,1000)
%timeit data_cuda.flip(0,1)
-------------------------------------------------------------------
Tesla K40m, 00000000:28:00.0, 90 %, 73 %, 478 MiB

1000 loops, best of 3: 637 µs per loop

My current implementation without materialized indices:

data_cuda = torch.arange(1000000, device=cuda).view(1000,1000)
%timeit data_cuda.flip(0,1)
-------------------------------------------------------------------
Tesla K40m, 00000000:28:00.0, 85 %, 36 %, 463 MiB

1000 loops, best of 3: 357 µs per loop

@ngimel (Collaborator) commented May 31, 2018:

@weiyangfb, correct about collapseDims. As for bandwidth, you can compute it as bytes/time; in your case that's 8e6/357e-6 = 22.4 GB/s, which is not that great. K40 peak bandwidth is around 200 GB/s. (8e6 because your tensor has 1 million elements, 4 bytes per element, and each element has to be read and written, hence 4*2.) You can also compare your time with e.g. the time for a pointwise operation on a tensor of the same size, e.g. a *= 2.
For an nD tensor, for each dimension that you are flipping you have to shift the base pointer by (dim[i]-1)*stride[i] and set the stride to -stride[i], where stride[i] are the strides of a contiguous tensor of those dimensions. At least I think so, please check my math.
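
For reference, the bandwidth arithmetic above can be written out in a few lines of Python (the element size and timing come from the numbers quoted in this thread, not a new measurement):

```
numel = 1000 * 1000        # elements in the benchmark tensor
bytes_per_element = 4      # float32
time_s = 357e-6            # measured kernel time from the benchmark above

# Each element is read once and written once, hence the factor of 2.
achieved = numel * bytes_per_element * 2 / time_s
print("%.1f GB/s" % (achieved / 1e9))   # ~22.4 GB/s vs. ~200 GB/s peak on a K40
```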

@weiyangfb (Contributor, Author):

@pytorchbot retest this please

@weiyangfb (Contributor, Author):

@ngimel Huge thanks for walking me through the CUDA performance details and the formula for flipping! For an nD tensor, I think your math is correct. Thanks a lot for sharing the formula! I will implement this for the case of flipping a single dim and update the PR in a bit.

This performance analysis is super helpful! I will keep tracking it. I am also trying to use a.t().contiguous() as a benchmark. Is it going to be a tighter lower bound, since flip() requires similar non-contiguous memory access?

@ngimel (Collaborator) commented Jun 4, 2018:

For flipping, save for some alignment issues (which can certainly be avoided, e.g. by using a 1024x1024 tensor), your accesses are still contiguous (elements that are adjacent in the original tensor are still adjacent in the flipped one, even if they are in a different order), so comparing with a regular pointwise op is better. You might want to run your comparison against a real 2D tensor that cannot be collapsed to 1D (you can create one e.g. by running torch.chunk on the 1st dim), to include the index math that you necessarily have for flipping.

@weiyangfb (Contributor, Author):

@ngimel Thanks a lot! Now it all makes sense! I am using TensorInfo and collapseDims to speed up the cases where the flip dim is the 1st or last dim. Here are some performance results:

data_cuda = torch.arange(1000000, device=cuda).view(100,100,100)
%timeit data_cuda.flip(0)
----------------------------------
10000 loops, best of 3: 178 µs per loop
data_cuda = torch.arange(1000000, device=cuda).view(100,100,100)
%timeit data_cuda.flip(2)
----------------------------------
10000 loops, best of 3: 181 µs per loop

benchmark:

data_cuda = torch.arange(1000000, device=cuda).view(100,100,100)
%timeit data_cuda.mul(2)
----------------------------------
10000 loops, best of 3: 86.3 µs per loop

And if I understand correctly, collapseDims might not be able to squeeze an nD tensor down to 2D if the flip dim is not the 1st or last dim, so I am using the previous implementation for those cases.

data_cuda = torch.arange(1000000, device=cuda).view(100,100,100)
%timeit data_cuda.flip(1)
----------------------------------
1000 loops, best of 3: 364 µs per loop

@weiyangfb (Contributor, Author) commented Jun 13, 2018:

@fmassa Using torch.cuda.synchronize() does not change the runtime much; am I doing it correctly?

data_cuda = torch.arange(1000000, device=cuda).view(1000,1000)
def meshgrid():
    flip_meshgrid(data_cuda, (0, 1))
    torch.cuda.synchronize()
%timeit meshgrid()

1000 loops, best of 3: 1.76 ms per loop

data_cuda = torch.arange(1000000, device=cuda).view(1000,1000)
def flip():
    data_cuda.flip(0,1)
    torch.cuda.synchronize()
%timeit flip()

1000 loops, best of 3: 353 µs per loop

I don't know why, though. Can I ask how one would normally profile the fraction of time spent?
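
One possible way to split the time (a sketch, not something from this thread) is to time each piece with CUDA events, which make the synchronization explicit; the same helper can wrap the flip_meshgrid call from earlier in the thread to separate the arange/meshgrid setup from the indexing itself.

```
import torch

def timed(fn, iters=100):
    # Requires a CUDA device; the events measure the GPU time of the work launched between them.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # milliseconds per call

data = torch.arange(1000000, device='cuda').view(1000, 1000)
arange_ms = timed(lambda: torch.arange(999, -1, -1, device='cuda'))  # index construction only
flip_ms = timed(lambda: data.flip(0, 1))                             # full flip
print("arange: %.3f ms, flip: %.3f ms" % (arange_ms, flip_ms))
```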

@weiyangfb (Contributor, Author):

@fmassa @ngimel is this PR ready for stamp?

@ngimel (Collaborator) left a comment:

Generally looks good, I'd still like to reduce the amount of integer math.

@weiyangfb (Contributor, Author):

@pytorchbot retest this please

@weiyangfb (Contributor, Author):

The failing caffe2 test seems unrelated.

@weiyangfb (Contributor, Author):

@pytorchbot retest this please

@weiyangfb (Contributor, Author):

@pytorchbot retest this please

@weiyangfb (Contributor, Author):

@pytorchbot retest this please

@weiyangfb (Contributor, Author):

The failing caffe2 and lint tests seem unrelated; can I get a stamp on this?

@soumith merged commit c9b8d85 into pytorch:master on Jun 16, 2018
@weiyangfb deleted the flip_tensor branch on June 22, 2018
@adam-dziedzic:

Can you reproduce these results:

>>> a
tensor([[1., 1., 1.],
        [1., 0., 2.]], dtype=torch.float64)
>>> torch.flip(a, [1])
tensor([[1., 1.],
        [0., 1.]], dtype=torch.float64)

?

@weiyangfb (Contributor, Author):

@adam-dziedzic Yes, I can reproduce your results. I think this is a bug. Let me create an issue for this.

@ashwhall:

@weiyangfb Does this operation copy the memory or give a view into it? I'm flipping HD video, so copying the data is a real memory bottleneck.

@soumith (Member) commented Jul 24, 2018:

@ashwhall it copies the memory over, but does it pretty efficiently.

facebook-github-bot pushed a commit that referenced this pull request Nov 8, 2018
Summary:
- a workaround for #13292; a complete fix requires investigating the root cause when using advanced indexing
- this PR brings the `flip()` CUDA implementation to the CPU kernel
- with this change:
```
>>> t = torch.randn(1, 3, 4, 5)
>>> t.flip(1, 3).shape
torch.Size([1, 3, 4, 5])
```
- performance:
```
====== with this PR ======
>>> a = torch.randn(1000, 1000)
>>> %timeit -r 100 a.flip(0, 1)
1.98 ms ± 579 µs per loop (mean ± std. dev. of 100 runs, 1000 loops each)

====== Perf at previous PR #7873 ======
100 loops, best of 3: 11 ms per loop
```
Pull Request resolved: #13344

Differential Revision: D12968003

Pulled By: weiyangfb

fbshipit-source-id: 66f434049d143a0575a35b5c983b3e0577a1a28d
@soumith mentioned this pull request on Dec 12, 2018
Successfully merging this pull request may close these issues: flip a Tensor (#229)