
grid_sample backward pass performance scales poorly with input size #64977

Closed

to-mi opened this issue Sep 14, 2021 · 12 comments
Labels
module: autograd (Related to torch.autograd, and the autograd engine in general)
module: cpu (CPU specific problem (e.g., perf, algorithm))
module: interpolation
module: nn (Related to torch.nn)
module: performance (Issues related to performance, either of kernel code or framework glue)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@to-mi (Contributor) commented Sep 14, 2021

🐛 Bug

The backward pass of grid_sample (to get grad with regard to grid) depends heavily on the input size (at least in mode="bilinear"). I don't see why this should be the case, as the grid determines which pixels in the input affect the computation (but perhaps I'm mistaken?). It's also possible to do a grid sample implementation using basic PyTorch operations where the performance doesn't scale as badly with input size (though it's, of course, not as optimized otherwise as grid_sample).
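
For reference, here is a minimal sketch of such a basic-ops implementation. It is illustrative rather than the exact code used for the comparison: it assumes mode="bilinear", padding_mode="border", align_corners=True, and the helper name manual_grid_sample_2d is made up for this example.

import torch


def manual_grid_sample_2d(input, grid):
    # input: (N, C, H_in, W_in); grid: (N, H_out, W_out, 2) with values in [-1, 1].
    N, C, H_in, W_in = input.shape

    # Map normalized coordinates to pixel coordinates (align_corners=True),
    # then clamp to the image border (padding_mode="border").
    x = ((grid[..., 0] + 1) * (W_in - 1) / 2).clamp(0, W_in - 1)
    y = ((grid[..., 1] + 1) * (H_in - 1) / 2).clamp(0, H_in - 1)

    # Integer corner indices and bilinear weights. The weights stay differentiable
    # with respect to grid, so the grid gradient never needs a dense grad_input buffer.
    x0 = x.floor().long().clamp(0, W_in - 1)
    y0 = y.floor().long().clamp(0, H_in - 1)
    x1 = (x0 + 1).clamp(max=W_in - 1)
    y1 = (y0 + 1).clamp(max=H_in - 1)
    wx = (x - x0.to(x.dtype)).unsqueeze(1)
    wy = (y - y0.to(y.dtype)).unsqueeze(1)

    def gather(iy, ix):
        # Pick input[n, :, iy, ix] for every output location: (N, C, H_out, W_out).
        idx = (iy * W_in + ix).reshape(N, 1, -1).expand(-1, C, -1)
        return input.reshape(N, C, -1).gather(2, idx).reshape(N, C, *iy.shape[1:])

    v00, v01 = gather(y0, x0), gather(y0, x1)
    v10, v11 = gather(y1, x0), gather(y1, x1)
    top = v00 * (1 - wx) + v01 * wx
    bottom = v10 * (1 - wx) + v11 * wx
    return top * (1 - wy) + bottom * wy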

To Reproduce

Code to time with different input sizes:

import timeit

import torch


def grid_sample_test(input, grid, backward):
    # Reset the grid gradient between timed runs so backward() starts from a clean state.
    if backward and grid.grad is not None:
        grid.grad.zero_()
    samples = torch.nn.functional.grid_sample(
        input,
        grid,
        mode="bilinear",
        padding_mode="border",
        align_corners=True,
    )
    m = samples.mean()
    if backward:
        m.backward()

    return samples


_input = None
_grid = None
_backward = None

if __name__ == "__main__":
    torch.manual_seed(15)
    torch.set_num_threads(1)

    N = 100
    C = 2
    repeats = 100
    H_out = 13
    W_out = 13
    dtype = torch.double
    devices = ["cpu"]
    backwards = [False, True]

    input_sizes = [(30, 40), (300, 400), (1000, 1200)]

    # Grid values are uniform in [-1, 1), so all sample locations lie inside the input.
    grid_cpu = 2.0 * torch.rand((N, H_out, W_out, 2), dtype=dtype) - 1.0

    for input_size in input_sizes:
        H_in, W_in = input_size
        input_cpu = torch.rand(
            (1, C, H_in, W_in),
            requires_grad=False,
            dtype=dtype,
        ).expand((N, -1, -1, -1))

        for _backward in backwards:
            for device in devices:
                _grid = grid_cpu.clone().detach().to(device).requires_grad_(True)
                _input = input_cpu.to(device)

                t = timeit.timeit(
                    "grid_sample_test(_input, _grid, _backward)",
                    globals=globals(),
                    number=repeats,
                )
                print(
                    f"device={device:>4} backward={str(_backward):>5} input size={H_in:>4}x{W_in:<4}: {t:5.2f}"
                )

Example output, with the last column being the time in seconds:

device= cpu backward=False input size=  30x40  :  0.03
device= cpu backward= True input size=  30x40  :  0.54
device= cpu backward=False input size= 300x400 :  0.04
device= cpu backward= True input size= 300x400 :  5.19
device= cpu backward=False input size=1000x1200:  0.11
device= cpu backward= True input size=1000x1200: 48.56

Expected behavior

I would expect the performance to scale less drastically with input size.

Environment

collect_env.py output:

Collecting environment information...
PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Pop!_OS 20.04 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.11 (default, Aug  3 2021, 15:09:35)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.11.0-7633-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce MX250
Nvidia driver version: 470.57.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.20.3
[pip3] torch==1.9.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               10.2.89              hfd86e86_1  
[conda] mkl                       2021.3.0           h06a4308_520  
[conda] mkl-service               2.4.0            py38h7f8727e_0  
[conda] mkl_fft                   1.3.0            py38h42c9631_2  
[conda] mkl_random                1.2.2            py38h51133e4_0  
[conda] mypy-extensions           0.4.3                    pypi_0    pypi
[conda] numpy                     1.20.3           py38hf144106_0  
[conda] numpy-base                1.20.3           py38h74d4b33_0  

Used conda env:

name: grid_sample
channels:
  - default
  - pytorch
dependencies:
  - python=3.8
  - pytorch=1.9.0
  - numpy
  - ipython
  - black

Additional context

I would also be interested in any (possibly temporary) workarounds.

cc @VitalyFedyunin @ngimel @heitorschueroff @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7 @mruberry @jbschlosser @walterddr

mruberry added the module: interpolation, module: nn, module: performance, module: cpu, and triaged labels on Sep 15, 2021
@mruberry (Collaborator)

Thank you for reporting this issue, @to-mi. It would be nice to have a faster grid_sample operator.

@to-mi (Contributor, Author) commented Sep 15, 2021

To add some more information, I profiled the example code with `py-spy record -o profile.svg --native -- python example_only_torch.py` and got the following flame graph:

[flame graph: profile.svg]

I haven't used this tool before, but if I'm reading the graph correctly, it seems that most of the time is spent creating a zero-filled tensor to hold the gradient of `input`, here:

auto grad_input = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT);

If this is the case, I wonder how easy it would be to add a special case for `input` not requiring a gradient (which is the case I'm interested in).
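
As a rough sanity check of that hypothesis (this snippet is an added illustration, not part of the original profiling), one could time just the allocation of a zero-filled tensor with the same shape as the expanded input from the repro script and compare how it grows with the backward timings above:

import timeit

import torch

# Shapes mirror the repro script: batch N=100, C=2 channels, double precision.
for H_in, W_in in [(30, 40), (300, 400), (1000, 1200)]:
    t = timeit.timeit(
        lambda: torch.zeros((100, 2, H_in, W_in), dtype=torch.double),
        number=100,
    )
    print(f"zeros((100, 2, {H_in:>4}, {W_in:<4})): {t:5.2f} s")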

@albanD (Collaborator) commented Sep 15, 2021

> If this is the case, I wonder how easy it would be to have a special case for input not requiring gradient

There is nothing preventing us from doing that: either at the autograd level, by specifying two formulas (one for grad input and one for grad grid), or at the formula level, where autograd can give you a boolean mask of which gradients need to be computed.

But both options would require a significant rewrite of the actual kernels.
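
(For illustration only: the Python-level analogue of that boolean mask is ctx.needs_input_grad in a custom torch.autograd.Function. The toy Function below is just a sketch of the idea, not the grid_sample kernel itself.)

import torch


class MaskedMul(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a * b

    @staticmethod
    def backward(ctx, grad_out):
        a, b = ctx.saved_tensors
        grad_a = grad_b = None
        # ctx.needs_input_grad is the boolean mask: only compute what was requested.
        if ctx.needs_input_grad[0]:
            grad_a = grad_out * b
        if ctx.needs_input_grad[1]:
            grad_b = grad_out * a
        return grad_a, grad_b


# Usage: out = MaskedMul.apply(a, b); gradients are skipped for tensors with requires_grad=False.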

jbschlosser added the module: autograd label on Sep 15, 2021
@to-mi (Contributor, Author) commented Sep 17, 2021

For the record, here's the performance for the example script with a simple hack of removing most of the input gradient related stuff for the particular case in the script:

device= cpu backward=False input size=  30x40  :  0.03
device= cpu backward= True input size=  30x40  :  0.08
device= cpu backward=False input size= 300x400 :  0.04
device= cpu backward= True input size= 300x400 :  0.09
device= cpu backward=False input size=1000x1200:  0.11
device= cpu backward= True input size=1000x1200:  0.29

which can be compared with the current implementation:

device= cpu backward=False input size=  30x40  :  0.03
device= cpu backward= True input size=  30x40  :  0.54
device= cpu backward=False input size= 300x400 :  0.04
device= cpu backward= True input size= 300x400 :  5.19
device= cpu backward=False input size=1000x1200:  0.11
device= cpu backward= True input size=1000x1200: 48.56

I'm thinking about how much work it would be for me to contribute a PR for this. I haven't fully thought this through, and I have only really looked at the CPU kernel with bilinear interpolation, but I think using the boolean mask would be easier and require fewer changes to the kernels (it would not require splitting the kernels into two). My current (possibly naive) understanding is:

  • This would require changing the signature of the functions in tools/autograd/derivatives.yaml and aten/src/ATen/native/native_functions.yaml (so that the boolean mask for which gradients are requested is passed in).
  • If the bool for the input requiring grad could be used as a template parameter in the kernel functions, one could perhaps just use if-clauses to remove the related computations, assuming the compiler would then optimise the branching away.

Would contributing a PR for this require implementing the changes in all of the kernels (CPU, CUDA), all interpolation modes, and possibly also the 3D version, or would it be enough to do only some of them (so that only those would avoid unnecessarily computing the gradient in this special case)? And I guess it would need new tests for the case where the input gradient is not required?

@albanD (Collaborator) commented Sep 17, 2021

Hi,

Your understanding is correct. One detail is that for inputs that do not require gradients, you should return an undefined Tensor (what you get when doing `Tensor foo;` or `return Tensor();`).

If you want to contribute the improvement, we won't block it on supporting everything under the sun. You should, though, properly document on the C++ side what is supported and what isn't.
Don't hesitate to ask if you have any questions!

@to-mi (Contributor, Author) commented Sep 27, 2021

Hi,

I started implementing this and I'm wondering if the approach I've taken seems ok. What would be the best way to get some quick feedback on this?

The code is here: https://github.com/to-mi/pytorch/commits/optimize_grid_sample_2d_on_no_input_grad

Here's a brief description:

  • I have tried to go with rather minimal changes. It would probably be possible to make a more elegant version with a bit larger refactoring (or possibly with better understanding of PyTorch internals and C++ functionalities).
  • Changed the native_functions.yaml and derivatives.yaml so that the gradient input mask is passed to the functions.
  • Changed the CPU kernels:
    (1) added bool input_requires_grad template parameter to the backward function,
    (2) added if branches based on it to remove input gradient computations if it's not requested,
    (3) feed in TensorAccessor<scalar_t, 3>* gInp_slice_ptr instead of TensorAccessor<scalar_t, 3>& gInp_slice so that I can pass a nullptr in case gradient for input is not requested. (A bit inelegant perhaps, but allows to keep one signature for backward function and not require breaking it to smaller pieces. Perhaps there's a more elegant way to achieve this?)
  • Changed CUDA kernel:
    (1) added bool input_requires_grad template parameter to the backward function,
    (2) added if branches based on it to remove input gradient computations if it's not requested,
    (3) feed in TensorInfo<scalar_t, index_t>() instead of getTensorInfo<scalar_t, index_t>(grad_input) in case gradient for input is not requested.
  • Have not touched the CPU fallback kernel.

If this seems ok, I think I would still add some comments to the code documenting the changes, and add test(s) to test/test_nn.py for the case where no input gradient is required.

@albanD (Collaborator) commented Sep 28, 2021

Thanks for sharing this.
The change looks quite good to me. I am not sure you need to add that many comments, as "output_mask" is a common argument for such functions.

Also, the split into different commits is really nice. If you want to try it, it would be nice to use ghstack (https://github.com/ezyang/ghstack) to easily send one PR per commit; that will make it easier to review.
If you don't want to learn that tool, a single PR will work as well :)

@to-mi (Contributor, Author) commented Oct 1, 2021

Thanks @albanD for the comments and help. I made a PR (#65986) for this.

I wasn't able to use ghstack. I think it would require me to have write access to the pytorch repository (or at least I couldn't easily figure out how to run it otherwise).

@ezyang (Contributor) commented Oct 4, 2021

Now you do have rights, try again.

@to-mi (Contributor, Author) commented Oct 4, 2021

Thanks. So should I close the current PR and do a new one through ghstack? (I could also rewrite the commits for the new one so that the changes made following the review comments get melded into the "main" commits of the PR.)

@lezcano (Collaborator) commented Oct 4, 2021

Yes, you can just close this one and open a new one with ghstack :)

@to-mi (Contributor, Author) commented Oct 14, 2021

PRs #66068, #66069, #66070, #66071 should fix this issue.

Thanks everyone (and especially @albanD) for the help, for guiding me through the process, and for making it pleasant to contribute.

to-mi closed this as completed Oct 14, 2021
pytorchmergebot pushed a commit that referenced this issue Feb 23, 2022
Fixes #71415
I have implemented the changes that replicate what @to-mi did in PR #65986 for the 3D case:

> Fixes #64977
>
> Avoids creating a tensor for and calculating `input` gradient if it's not needed in the backward pass of `grid_sample` (2d case, native CPU & CUDA kernels). Especially the tensor creation seemed time consuming (see #64977).
>
> Brief description of the changes:
>
>     * I have tried to go with rather minimal changes. It would probably be possible to make a more elegant version with a bit larger refactoring (or possibly with better understanding of PyTorch internals and C++ functionalities).
>
>     * Changed the `native_functions.yaml` and `derivatives.yaml` so that the gradient input mask is passed to the functions.
>
>     * Changed the CPU kernels:
>       (1) added `bool input_requires_grad` template parameter to the `backward` function,
>       (2) added if branches based on it to remove `input` gradient computations if it's not requested,
>       (3) feed in `TensorAccessor<scalar_t, 3>* gInp_slice_ptr` instead of `TensorAccessor<scalar_t, 3>& gInp_slice` so that I can pass a `nullptr` in case gradient for `input` is not requested. (A bit inelegant perhaps, but allows to keep one signature for `backward` function and not require breaking it to smaller pieces. Perhaps there's a more elegant way to achieve this?)
>
>     * Changed CUDA kernel:
>       (1) added ~`bool input_requires_grad` template parameter~ `const bool input_requires_grad` argument to the `backward` function,
>       (2) added if branches based on it to remove `input` gradient computations if it's not requested,
>       (3) feed in `TensorInfo<scalar_t, index_t>()` instead of `getTensorInfo<scalar_t, index_t>(grad_input)` in case gradient for `input` is not requested.
>
>     * Modified tests in `test/test_nn.py` so that they run also cases with no `input` gradient needed.
>
>     * Have not touched the CPU fallback kernel.

Note: the changes numbered (3) are N/A in this case.

Pull Request resolved: #71759
facebook-github-bot pushed a commit that referenced this issue Feb 24, 2022
Summary: same as the commit message above (fixes #71415 by replicating to-mi's 2D changes from #65986 for the 3D case).

Pull Request resolved: #71759

Test Plan: automation

Reviewed By: malfet

Differential Revision: D34426596

fbshipit-source-id: 34eb5c1f1697ff7a8acd185739844e6e25e18f8e