
grid_sample backward pass performance scales poorly with input size #64977

Closed

to-mi opened this issue Sep 14, 2021 · 12 comments
Labels
module: autograd (Related to torch.autograd, and the autograd engine in general)
module: cpu (CPU specific problem (e.g., perf, algorithm))
module: interpolation
module: nn (Related to torch.nn)
module: performance (Issues related to performance, either of kernel code or framework glue)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@to-mi (Contributor) commented Sep 14, 2021

🐛 Bug

The backward pass of grid_sample (to get grad with regard to grid) depends heavily on the input size (at least in mode="bilinear"). I don't see why this should be the case, as the grid determines which pixels in the input affect the computation (but perhaps I'm mistaken?). It's also possible to do a grid sample implementation using basic PyTorch operations where the performance doesn't scale as badly with input size (though it's, of course, not as optimized otherwise as grid_sample).
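
For reference, here is a minimal sketch of such a basic-ops implementation. It is illustrative rather than the exact code used for the comparison: it assumes mode="bilinear", padding_mode="border", align_corners=True, and the helper name manual_grid_sample_2d is made up for this example.

import torch


def manual_grid_sample_2d(input, grid):
    # input: (N, C, H_in, W_in); grid: (N, H_out, W_out, 2) with values in [-1, 1].
    N, C, H_in, W_in = input.shape

    # Map normalized coordinates to pixel coordinates (align_corners=True),
    # then clamp to the image border (padding_mode="border").
    x = ((grid[..., 0] + 1) * (W_in - 1) / 2).clamp(0, W_in - 1)
    y = ((grid[..., 1] + 1) * (H_in - 1) / 2).clamp(0, H_in - 1)

    # Integer corner indices and bilinear weights. The weights stay differentiable
    # with respect to grid, so the grid gradient never needs a dense grad_input buffer.
    x0 = x.floor().long().clamp(0, W_in - 1)
    y0 = y.floor().long().clamp(0, H_in - 1)
    x1 = (x0 + 1).clamp(max=W_in - 1)
    y1 = (y0 + 1).clamp(max=H_in - 1)
    wx = (x - x0.to(x.dtype)).unsqueeze(1)
    wy = (y - y0.to(y.dtype)).unsqueeze(1)

    def gather(iy, ix):
        # Pick input[n, :, iy, ix] for every output location: (N, C, H_out, W_out).
        idx = (iy * W_in + ix).reshape(N, 1, -1).expand(-1, C, -1)
        return input.reshape(N, C, -1).gather(2, idx).reshape(N, C, *iy.shape[1:])

    v00, v01 = gather(y0, x0), gather(y0, x1)
    v10, v11 = gather(y1, x0), gather(y1, x1)
    top = v00 * (1 - wx) + v01 * wx
    bottom = v10 * (1 - wx) + v11 * wx
    return top * (1 - wy) + bottom * wy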

To Reproduce

Code to time with different input sizes:

import timeit

import torch


def grid_sample_test(input, grid, backward):
    # Reset the grid gradient between timed runs so backward() starts from a clean state.
    if backward and grid.grad is not None:
        grid.grad.zero_()
    samples = torch.nn.functional.grid_sample(
        input,
        grid,
        mode="bilinear",
        padding_mode="border",
        align_corners=True,
    )
    m = samples.mean()
    if backward:
        m.backward()

    return samples


_input = None
_grid = None
_backward = None

if __name__ == "__main__":
    torch.manual_seed(15)
    torch.set_num_threads(1)

    N = 100
    C = 2
    repeats = 100
    H_out = 13
    W_out = 13
    dtype = torch.double
    devices = ["cpu"]
    backwards = [False, True]

    input_sizes = [(30, 40), (300, 400), (1000, 1200)]

    # Grid values are uniform in [-1, 1), so all sample locations lie inside the input.
    grid_cpu = 2.0 * torch.rand((N, H_out, W_out, 2), dtype=dtype) - 1.0

    for input_size in input_sizes:
        H_in, W_in = input_size
        input_cpu = torch.rand(
            (1, C, H_in, W_in),
            requires_grad=False,
            dtype=dtype,
        ).expand((N, -1, -1, -1))

        for _backward in backwards:
            for device in devices:
                _grid = grid_cpu.clone().detach().to(device).requires_grad_(True)
                _input = input_cpu.to(device)

                t = timeit.timeit(
                    "grid_sample_test(_input, _grid, _backward)",
                    globals=globals(),
                    number=repeats,
                )
                print(
                    f"device={device:>4} backward={str(_backward):>5} input size={H_in:>4}x{W_in:<4}: {t:5.2f}"
                )

Example output, with the last column being the time in seconds:

device= cpu backward=False input size=  30x40  :  0.03
device= cpu backward= True input size=  30x40  :  0.54
device= cpu backward=False input size= 300x400 :  0.04
device= cpu backward= True input size= 300x400 :  5.19
device= cpu backward=False input size=1000x1200:  0.11
device= cpu backward= True input size=1000x1200: 48.56

Expected behavior

I would expect the performance to scale less drastically with input size.

Environment

collect_env.py output:

Collecting environment information...
PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Pop!_OS 20.04 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.11 (default, Aug  3 2021, 15:09:35)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.11.0-7633-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce MX250
Nvidia driver version: 470.57.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.20.3
[pip3] torch==1.9.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               10.2.89              hfd86e86_1  
[conda] mkl                       2021.3.0           h06a4308_520  
[conda] mkl-service               2.4.0            py38h7f8727e_0  
[conda] mkl_fft                   1.3.0            py38h42c9631_2  
[conda] mkl_random                1.2.2            py38h51133e4_0  
[conda] mypy-extensions           0.4.3                    pypi_0    pypi
[conda] numpy                     1.20.3           py38hf144106_0  
[conda] numpy-base                1.20.3           py38h74d4b33_0  

Used conda env:

name: grid_sample
channels:
  - default
  - pytorch
dependencies:
  - python=3.8
  - pytorch=1.9.0
  - numpy
  - ipython
  - black

Additional context

I would also be interested in any (possibly temporary) workarounds.

cc @VitalyFedyunin @ngimel @heitorschueroff @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7 @mruberry @jbschlosser @walterddr

mruberry added the module: interpolation, module: nn, module: performance, module: cpu, and triaged labels on Sep 15, 2021
@mruberry (Collaborator)

Thank you for reporting this issue, @to-mi. It would be nice to have a faster grid_sample operator.

@to-mi (Contributor, Author) commented Sep 15, 2021

To add some more information, I profiled the example code with `py-spy record -o profile.svg --native -- python example_only_torch.py` and got the following flame graph:

[flame graph: profile.svg]

I haven't used this tool before, but if I'm reading the graph correctly, it seems that most of the time is spent creating a zero-filled tensor to hold the gradient of `input`, here:

auto grad_input = at::zeros_like(input, LEGACY_CONTIGUOUS_MEMORY_FORMAT);

If this is the case, I wonder how easy it would be to add a special case for `input` not requiring a gradient (which is the case I'm interested in).
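
As a rough sanity check of that hypothesis (this snippet is an added illustration, not part of the original profiling), one could time just the allocation of a zero-filled tensor with the same shape as the expanded input from the repro script and compare how it grows with the backward timings above:

import timeit

import torch

# Shapes mirror the repro script: batch N=100, C=2 channels, double precision.
for H_in, W_in in [(30, 40), (300, 400), (1000, 1200)]:
    t = timeit.timeit(
        lambda: torch.zeros((100, 2, H_in, W_in), dtype=torch.double),
        number=100,
    )
    print(f"zeros((100, 2, {H_in:>4}, {W_in:<4})): {t:5.2f} s")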

@albanD (Collaborator) commented Sep 15, 2021

> If this is the case, I wonder how easy it would be to have a special case for input not requiring gradient

There is nothing preventing us from doing that: either at the autograd level, by specifying two formulas (one for grad input and one for grad grid), or at the formula level, where autograd can give you a boolean mask of which gradients need to be computed.

But both options would require a significant rewrite of the actual kernels.
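
(For illustration only: the Python-level analogue of that boolean mask is ctx.needs_input_grad in a custom torch.autograd.Function. The toy Function below is just a sketch of the idea, not the grid_sample kernel itself.)

import torch


class MaskedMul(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a * b

    @staticmethod
    def backward(ctx, grad_out):
        a, b = ctx.saved_tensors
        grad_a = grad_b = None
        # ctx.needs_input_grad is the boolean mask: only compute what was requested.
        if ctx.needs_input_grad[0]:
            grad_a = grad_out * b
        if ctx.needs_input_grad[1]:
            grad_b = grad_out * a
        return grad_a, grad_b


# Usage: out = MaskedMul.apply(a, b); gradients are skipped for tensors with requires_grad=False.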

jbschlosser added the module: autograd label on Sep 15, 2021
@to-mi (Contributor, Author) commented Sep 17, 2021

For the record, here's the performance for the example script with a simple hack of removing most of the input gradient related stuff for the particular case in the script:

device= cpu backward=False input size=  30x40  :  0.03
device= cpu backward= True input size=  30x40  :  0.08
device= cpu backward=False input size= 300x400 :  0.04
device= cpu backward= True input size= 300x400 :  0.09
device= cpu backward=False input size=1000x1200:  0.11
device= cpu backward= True input size=1000x1200:  0.29

which can be compared with the current implementation:

device= cpu backward=False input size=  30x40  :  0.03
device= cpu backward= True input size=  30x40  :  0.54
device= cpu backward=False input size= 300x400 :  0.04
device= cpu backward= True input size= 300x400 :  5.19
device= cpu backward=False input size=1000x1200:  0.11
device= cpu backward= True input size=1000x1200: 48.56

I'm thinking about how much work it would be for me to contribute a PR for this. I haven't fully thought this through, and I have only really looked at the CPU kernel with bilinear interpolation, but I think using the boolean mask would be easier and require fewer changes to the kernels (it would not require splitting the kernels into two). My current (possibly naive) understanding is:

  • This would require changing the signature of the functions in tools/autograd/derivatives.yaml and aten/src/ATen/native/native_functions.yaml (so that the boolean mask for which gradients are requested is passed in).
  • If the bool for the input requiring grad could be used as a template parameter in the kernel functions, one could perhaps just use if-clauses to remove the related computations, assuming the compiler would then optimise the branching away.

Would contributing a PR for this require implementing the changes in all of the kernels (CPU, CUDA), all interpolation modes, and possibly also the 3D version, or would it be enough to do only some of them (so that only those would avoid unnecessarily computing the gradient in this special case)? And I guess it would need new tests for the case where the input gradient is not required?

@albanD (Collaborator) commented Sep 17, 2021

Hi,

Your understanding is correct. One detail is that for inputs that do not require gradients, you should return an undefined Tensor (what you get when doing `Tensor foo;` or `return Tensor();`).

If you want to contribute the improvement, we won't block it on supporting everything under the sun. You should, though, properly document on the C++ side what is supported and what isn't.
Don't hesitate to ask if you have any questions!

@to-mi (Contributor, Author) commented Sep 27, 2021

Hi,

I started implementing this and I'm wondering if the approach I've taken seems ok. What would be the best way to get some quick feedback on this?

The code is here: https://github.com/to-mi/pytorch/commits/optimize_grid_sample_2d_on_no_input_grad

Here's a brief description:

  • I have tried to go with rather minimal changes. It would probably be possible to make a more elegant version with a bit larger refactoring (or possibly with better understanding of PyTorch internals and C++ functionalities).
  • Changed the native_functions.yaml and derivatives.yaml so that the gradient input mask is passed to the functions.
  • Changed the CPU kernels:
    (1) added bool input_requires_grad template parameter to the backward function,
    (2) added if branches based on it to remove input gradient computations if it's not requested,
    (3) feed in TensorAccessor<scalar_t, 3>* gInp_slice_ptr instead of TensorAccessor<scalar_t, 3>& gInp_slice so that I can pass a nullptr in case gradient for input is not requested. (A bit inelegant perhaps, but allows to keep one signature for backward function and not require breaking it to smaller pieces. Perhaps there's a more elegant way to achieve this?)
  • Changed CUDA kernel:
    (1) added bool input_requires_grad template parameter to the backward function,
    (2) added if branches based on it to remove input gradient computations if it's not requested,
    (3) feed in TensorInfo<scalar_t, index_t>() instead of getTensorInfo<scalar_t, index_t>(grad_input) in case gradient for input is not requested.
  • Have not touched the CPU fallback kernel.

If this seems ok, I think I would still add some comments to the code documenting the changes, and add test(s) to test/test_nn.py for the case where no input gradient is required.

@albanD (Collaborator) commented Sep 28, 2021

Thanks for sharing this.
The change looks quite good to me. I am not sure you need to add that many comments, as "output_mask" is a common argument for such functions.

Also, the split into different commits is really nice. If you want to try it, it would be nice to use ghstack (https://github.com/ezyang/ghstack) to easily send one PR per commit; that will make it easier to review.
If you don't want to learn that tool, a single PR will work as well :)

@to-mi (Contributor, Author) commented Oct 1, 2021

Thanks @albanD for the comments and help. I made a PR (#65986) for this.

I wasn't able to use ghstack. I think it would require me to have write access to the pytorch repository (or at least I couldn't easily figure out how to run it otherwise).

@ezyang (Contributor) commented Oct 4, 2021

Now you do have rights, try again.

@to-mi (Contributor, Author) commented Oct 4, 2021

Thanks. So should I close the current PR and do a new one through ghstack? (I could also rewrite the commits for the new one so that the changes made following the review comments get melded into the "main" commits of the PR.)

@lezcano (Collaborator) commented Oct 4, 2021

Yes, you can just close this one and open a new one with ghstack :)

@to-mi (Contributor, Author) commented Oct 14, 2021

PRs #66068, #66069, #66070, #66071 should fix this issue.

Thanks everyone (and especially @albanD) for the help, for guiding me through the process, and for making it pleasant to contribute.

to-mi closed this as completed Oct 14, 2021
pytorchmergebot pushed a commit that referenced this issue Feb 23, 2022
Fixes #71415
I have implemented the changes that replicate what @to-mi did in PR #65986 for the 3D case:

> Fixes #64977
>
> Avoids creating a tensor for and calculating `input` gradient if it's not needed in the backward pass of `grid_sample` (2d case, native CPU & CUDA kernels). Especially the tensor creation seemed time consuming (see #64977).
>
> Brief description of the changes:
>
>     * I have tried to go with rather minimal changes. It would probably be possible to make a more elegant version with a bit larger refactoring (or possibly with better understanding of PyTorch internals and C++ functionalities).
>
>     * Changed the `native_functions.yaml` and `derivatives.yaml` so that the gradient input mask is passed to the functions.
>
>     * Changed the CPU kernels:
>       (1) added `bool input_requires_grad` template parameter to the `backward` function,
>       (2) added if branches based on it to remove `input` gradient computations if it's not requested,
>       (3) feed in `TensorAccessor<scalar_t, 3>* gInp_slice_ptr` instead of `TensorAccessor<scalar_t, 3>& gInp_slice` so that I can pass a `nullptr` in case gradient for `input` is not requested. (A bit inelegant perhaps, but allows to keep one signature for `backward` function and not require breaking it to smaller pieces. Perhaps there's a more elegant way to achieve this?)
>
>     * Changed CUDA kernel:
>       (1) added ~`bool input_requires_grad` template parameter~ `const bool input_requires_grad` argument to the `backward` function,
>       (2) added if branches based on it to remove `input` gradient computations if it's not requested,
>       (3) feed in `TensorInfo<scalar_t, index_t>()` instead of `getTensorInfo<scalar_t, index_t>(grad_input)` in case gradient for `input` is not requested.
>
>     * Modified tests in `test/test_nn.py` so that they run also cases with no `input` gradient needed.
>
>     * Have not touched the CPU fallback kernel.

Note: the changes numbered (3) are N/A in this case.

Pull Request resolved: #71759
facebook-github-bot pushed a commit that referenced this issue Feb 24, 2022
Summary: same as the commit message above (fixes #71415 by replicating to-mi's 2D changes from #65986 for the 3D case).

Pull Request resolved: #71759

Test Plan: automation

Reviewed By: malfet

Differential Revision: D34426596

fbshipit-source-id: 34eb5c1f1697ff7a8acd185739844e6e25e18f8e