Added flip() fn in ATen (CPU + CUDA) #7873
Conversation
Great, can't wait for this to be released.
Will this need an entry in […]?
Nice work!
Just to follow up on my previous post, here is an (untested) implementation:

```python
import torch

def multi_meshgrid(*args):
    """
    Creates a meshgrid from possibly many
    elements (instead of only 2).
    Returns an nd tensor with as many dimensions
    as there are arguments.
    """
    args = list(args)
    template = [1 for _ in args]
    for i in range(len(args)):
        n = args[i].shape[0]
        template_copy = template.copy()
        template_copy[i] = n
        args[i] = args[i].view(*template_copy)
    # there will be some broadcast magic going on
    return tuple(args)

def flip(tensor, dims):
    if not isinstance(dims, (tuple, list)):
        dims = [dims]
    indices = [torch.arange(tensor.shape[dim] - 1, -1, -1,
                            dtype=torch.int64) for dim in dims]
    multi_indices = multi_meshgrid(*indices)
    final_indices = [slice(None)] * tensor.dim()
    for i, dim in enumerate(dims):
        final_indices[dim] = multi_indices[i]
    flipped = tensor[tuple(final_indices)]
    # need to permute the final dimensions
    # if dims is not consecutive, but I'm lazy
    # now :-)
    return flipped
```
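The advanced-indexing-plus-broadcasting trick above can be sanity-checked without PyTorch; here is a NumPy rendition (the names `np_multi_meshgrid` and `np_flip` are mine, purely for illustration):

```python
import numpy as np

def np_multi_meshgrid(*args):
    # reshape each 1-D index array so broadcasting expands it into a grid
    args = list(args)
    for i, a in enumerate(args):
        shape = [1] * len(args)
        shape[i] = a.shape[0]
        args[i] = a.reshape(shape)
    return tuple(args)

def np_flip(arr, dims):
    if not isinstance(dims, (tuple, list)):
        dims = [dims]
    indices = [np.arange(arr.shape[d] - 1, -1, -1) for d in dims]
    multi_indices = np_multi_meshgrid(*indices)
    final_indices = [slice(None)] * arr.ndim
    for i, d in enumerate(dims):
        final_indices[d] = multi_indices[i]
    return arr[tuple(final_indices)]

x = np.arange(6).reshape(2, 3)
print(np_flip(x, (0, 1)))  # same result as x[::-1, ::-1]
```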
```cpp
std::stringstream ss;
ss << "expected input tensor dims not empty, "
   << "but got tensor dims size=" << flip_dims_size;
throw std::runtime_error(ss.str());
```
```cpp
// check duplicates in dims
auto flip_dims_v = std::vector<int64_t>(dims);
// std::unique only removes adjacent duplicates, so sort first
std::sort(flip_dims_v.begin(), flip_dims_v.end());
flip_dims_v.erase(std::unique(flip_dims_v.begin(), flip_dims_v.end()), flip_dims_v.end());
if ((int64_t)flip_dims_v.size() < flip_dims_size) {
```
```cpp
// check len of dims
if (flip_dims_size > total_dims) {
  std::stringstream ss;
  ss << "expected flip dims size <= tensor total dims, "
  // ...
}
```
```cpp
if (min_d < 0) {
  std::stringstream ss;
```

```cpp
Tensor flip_cpu(const Tensor& self, IntList dims) {
  int64_t total_dims = self.dim(), flip_dims_size = dims.size();
```
```cpp
}

// check if dims axis within range
int64_t min_d = total_dims, max_d = 0;
for (auto d : dims) {
```

```cpp
  throw std::runtime_error(ss.str());
}

Tensor out_t = self.clone();
```
@@ -0,0 +1,156 @@
```cpp
#include "ATen/NativeFunctions.h"
#include "ATen/ATen.h"
#include <algorithm>
```

```cpp
flip_dims_v.erase(std::unique(flip_dims_v.begin(), flip_dims_v.end()), flip_dims_v.end());
if ((int64_t)flip_dims_v.size() < flip_dims_size) {
  std::stringstream ss;
  ss << "dims has duplicates, "
```
@sethah Yes, I agree that an entry should be added there.
@fmassa I just gave it a try, and your implementation is indeed much faster! Here are some results:

Your implementation: 100 loops, best of 3: 7.62 ms per loop
My implementation: 100 loops, best of 3: 19.5 ms per loop
@ngimel Thanks a ton for the great suggestions! I will modify the CUDA implementation as you suggested. And yes, I will reuse the error checks.
@fmassa Your implementation is very nice and only requires one copy of the input tensor. Can I translate your code into the CPU implementation?
@weiyangfb Definitely! My implementation also works on the GPU; it might be good to benchmark it as well to see how it compares to the dedicated kernel. The benefit of writing it as a native function is that we don't need to implement a backward pass for it (even if the backward of flip is simply a flip).
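The parenthetical above (the backward of flip is simply a flip) follows from flip being its own inverse; a quick NumPy check of that involution property:

```python
import numpy as np

x = np.arange(12).reshape(3, 4)
flipped = x[::-1, ::-1]      # flip along both dims
# flipping twice along the same dims is the identity, which is why
# the gradient of flip is just flip applied to the upstream gradient
restored = flipped[::-1, ::-1]
print(np.array_equal(restored, x))  # True
```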
@fmassa You are completely right! I translated your code and tested it out. Now the performance on CPU is similar, while on GPU my current implementation is slightly faster.

CPU: 100 loops, best of 3: 10.8 ms per loop vs. 100 loops, best of 3: 11 ms per loop
GPU: 1000 loops, best of 3: 1.72 ms per loop vs. 1000 loops, best of 3: 637 µs per loop
Nice, thanks for the benchmarks @weiyangfb! Can you also try adding a […]? Thanks!
@ngimel Thanks a lot for the detailed instructions! I removed the materialized indices in the CUDA kernel and tested it. I am still not quite sure how to measure GPU bandwidth; here I looked at some numbers from @fmassa's implementation:

My implementation with materialized indices: […]
My current implementation without materialized indices: […]
```cpp
  temp[i] = indices[i].size(0);
  indices[i] = indices[i].view(IntList(temp));
}
return self.index(TensorList(indices));
```
@weiyangfb, correct about collapseDims. As for bandwidth, you can compute it as bytes/time; in your case that's 8e6/357e-6 = 22.4 GB/s, which is not that great: K40 peak bandwidth is around 200 GB/s. (8e6 bytes because your tensor has 1 million elements at 4 bytes per element, and each element has to be read and written, hence 4*2.) You can also compare your time with, e.g., the time for a pointwise operation on a same-size tensor, such as `a *= 2`.
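The bandwidth arithmetic above can be written out as a tiny script (the element count, timing, and K40 peak figure are the numbers quoted in the thread):

```python
# Effective bandwidth = bytes moved / elapsed time.
elements = 1000 * 1000       # the 1000x1000 tensor from the benchmark
bytes_per_element = 4        # float32
bytes_moved = elements * bytes_per_element * 2  # each element is read and written
elapsed_s = 357e-6           # measured kernel time: 357 microseconds
bandwidth_gb_s = bytes_moved / elapsed_s / 1e9
print(f"{bandwidth_gb_s:.1f} GB/s")  # ~22.4 GB/s, vs. ~200 GB/s K40 peak
```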
@pytorchbot retest this please
@ngimel Huge thanks for walking me through the CUDA performance details and the formula for flipping! For an nD tensor, I think your math is correct. Thanks a lot for sharing the formula! I will implement this for the case of flipping on a single dim, and will update the PR in a bit. This performance analysis is super helpful; I will keep tracking it! I am also trying to use a.t().contiguous() as a benchmark. Is it going to be a tighter lower bound, since flip() requires similar non-contiguous memory access?
For flipping, save for some alignment issues (which can be avoided for sure by, e.g., using a 1024x1024 tensor), your accesses are still contiguous (elements that are adjacent in the original tensor are still adjacent in the flipped one, even if they are in a different order), so comparing with a regular pointwise op is better. You might want to run your comparison against a real 2D tensor that cannot be collapsed to 1D (you can create one by, e.g., running torch.chunk on the 1st dim), to add some of the index math that you necessarily have for flipping.
@ngimel Thanks a lot! Now it all makes sense! I am using the following as the benchmark: […]

And if I understand it correctly, […]
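As a rough CPU-only stand-in for the comparison being discussed (flip vs. transpose-then-contiguous vs. a plain copy as the bandwidth-bound lower bound), here is a NumPy timing sketch; absolute numbers will of course differ from the CUDA measurements in the thread:

```python
import timeit
import numpy as np

a = np.random.rand(1000, 1000).astype(np.float32)

# flip: accesses are still contiguous, just traversed in reversed order
flip_s = timeit.timeit(lambda: np.ascontiguousarray(a[::-1, ::-1]), number=100)
# transpose + contiguous: strided, cache-unfriendly reads
t_s = timeit.timeit(lambda: np.ascontiguousarray(a.T), number=100)
# plain copy: the bandwidth-bound lower bound
copy_s = timeit.timeit(lambda: a.copy(), number=100)

print(f"flip {flip_s:.4f}s  transpose {t_s:.4f}s  copy {copy_s:.4f}s")
```

On typical hardware the flip lands much closer to the plain copy than the transpose does, which matches ngimel's point that flipped accesses remain contiguous.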
This provides a bare-minimum MPI Process Group implementation; the commit is on top of @pietern's Gloo Process Group PR.
* [c10d] MPI Process Group Implementation (ref: pytorch#7434)
* Better exception, atexit func, and addressed comments
* Clang formatting changes
* Static initialization and addressed comments
* Added constness back
* Test will now launch mpi processes if found
* CMakeLists changed

* Fix Windows doc for import error
* Fix doc again
* Fix wrong format

…#8334) Signed-off-by: Edward Z. Yang <ezyang@fb.com>

* Add cursors to C++ API
* Small self nits
* s/struct/class
* Use more STL-like names for cursors

* Implement arange_like operator
* add ONNX symbolic
* lint
* change name
* Comment the hack
…sts; 3. using TensorInfo and collapseDims to speed up CUDA impl for cases where flip dim is the 1st or last dim
@fmassa Using […], here are the benchmark numbers:

```python
data_cuda = torch.arange(1000000, device='cuda').view(1000, 1000)
def meshgrid():
    flip_meshgrid(data_cuda, (0, 1))
    torch.cuda.synchronize()
%timeit meshgrid()
```
1000 loops, best of 3: 1.76 ms per loop

```python
data_cuda = torch.arange(1000000, device='cuda').view(1000, 1000)
def flip():
    data_cuda.flip(0, 1)
    torch.cuda.synchronize()
%timeit flip()
```
1000 loops, best of 3: 353 µs per loop

I don't know why, though. Can I ask how one normally profiles the fraction of time spent?
Generally looks good, I'd still like to reduce the amount of integer math.
```cpp
    int64_t total_dims) {
  for (IndexType linear_index = blockIdx.x * blockDim.x + threadIdx.x;
       linear_index < N;
       linear_index += gridDim.x * blockDim.x) {
    int64_t cur_indices = linear_index, rem = 0, dst_offset = 0;
    for (int64_t i = 0; i < total_dims; i++) {
```
```cpp
for (int64_t i = 0; i < total_dims; i++) {
  int64_t temp = cur_indices;
  cur_indices = cur_indices / in_tensor_info.strides[i];
  rem = temp - cur_indices * in_tensor_info.strides[i];
```
```cpp
    int flip_dim,
    int64_t total_dims) {
  for (IndexType linear_index = blockIdx.x * blockDim.x + threadIdx.x;
       linear_index < N;
       linear_index += gridDim.x * blockDim.x) {
    int64_t cur_indices = linear_index, rem = 0, dst_offset = 0;
```
…64_t) IndexType for indices in pointwise CUDA kernel
@pytorchbot retest this please
The failing caffe2 test seems unrelated.
@pytorchbot retest this please
The failing caffe2 and lint tests seem unrelated. Can I get a stamp on this?
Can you reproduce these results: […]?
@adam-dziedzic Yes, I can reproduce your results. I think this is a bug; let me create an issue for it.
@weiyangfb Does this operation copy the memory or give a view into it? I'm flipping HD video, so copying the data is a real memory bottleneck.
@ashwhall It copies the memory, but does so pretty efficiently.
Summary:
- a workaround for #13292; a complete fix requires investigating the root cause when using advanced indexing
- this PR brings the `flip()` CUDA implementation over to the CPU kernel
- with this change:
```
>>> t = torch.randn(1, 3, 4, 5)
>>> t.flip(1, 3).shape
torch.Size([1, 3, 4, 5])
```
- performance:
```
====== with this PR ======
>>> a = torch.randn(1000, 1000)
>>> %timeit -r 100 a.flip(0, 1)
1.98 ms ± 579 µs per loop (mean ± std. dev. of 100 runs, 1000 loops each)

====== Perf at previous PR #7873 ======
100 loops, best of 3: 11 ms per loop
```
Pull Request resolved: #13344
Differential Revision: D12968003
Pulled By: weiyangfb
fbshipit-source-id: 66f434049d143a0575a35b5c983b3e0577a1a28d
Summary:
Details:
Given that a tensor element's offset = sum_i indices[i] * strides[i], we can flip the indices for each element and then copy values to the corresponding offsets.
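The offset arithmetic described above can be sketched in a few lines of Python (`flipped_offset` is a hypothetical helper name, purely to illustrate the formula):

```python
# offset = sum_i index[i] * stride[i]; flipping dim d maps index[d] -> size[d] - 1 - index[d]
def flipped_offset(index, sizes, strides, flip_dims):
    return sum(
        ((sizes[d] - 1 - index[d]) if d in flip_dims else index[d]) * strides[d]
        for d in range(len(sizes))
    )

sizes, strides = (2, 2, 2), (4, 2, 1)   # a contiguous 2x2x2 tensor
# flipping all three dims sends element (0, 0, 0) to linear offset 7 (the last slot)
print(flipped_offset((0, 0, 0), sizes, strides, {0, 1, 2}))  # -> 7
```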
Usage:
```
x = torch.arange(8).view(2, 2, 2).flip(0, 1, 2)  # flip along the 1st, 2nd, and 3rd dimensions
```
Future work: