[2/2] Added backward pass on CUDA for interpolation with anti-alias option #4211

vfdev-5 · 2021-07-27T14:55:31Z

Description:

Added backward pass on CUDA for interpolation with anti-alias option
- bilinear 2d
- bicubic 2d
Added tests on CUDA only

This code is based on #4208
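
For readers new to the op, here is a rough 1-D, pure-Python sketch of what an anti-aliased bilinear resize and its backward pass compute. It is not the CUDA kernel from this PR. It assumes a PIL-style triangle filter that is widened by the downsampling factor, and the names `triangle`, `aa_weights`, `resize_forward` and `resize_backward` are hypothetical. Under that assumption the backward pass is the transpose of the forward weighting: each output gradient is scattered back to the input samples with the same weights.

```python
def triangle(x):
    """Bilinear (triangle) filter with support [-1, 1]."""
    return max(0.0, 1.0 - abs(x))


def aa_weights(in_size, out_size):
    """For each output index, return (first input index, normalized weights)."""
    scale = in_size / out_size
    # Assumption: when downsampling (scale > 1) the filter is stretched by the
    # scale factor, which is what provides the anti-aliasing.
    stretch = max(scale, 1.0)
    support = 1.0 * stretch
    table = []
    for i in range(out_size):
        center = scale * (i + 0.5)
        lo = max(0, int(center - support + 0.5))
        hi = min(in_size, int(center + support + 0.5))
        weights = [triangle((j - center + 0.5) / stretch) for j in range(lo, hi)]
        total = sum(weights)
        table.append((lo, [w / total for w in weights]))
    return table


def resize_forward(x, out_size):
    """Forward pass: every output sample is a weighted sum of input samples."""
    table = aa_weights(len(x), out_size)
    return [sum(w * x[lo + k] for k, w in enumerate(ws)) for lo, ws in table]


def resize_backward(grad_out, in_size):
    """Backward pass: scatter each output gradient with the forward weights."""
    table = aa_weights(in_size, len(grad_out))
    grad_in = [0.0] * in_size
    for g, (lo, ws) in zip(grad_out, table):
        for k, w in enumerate(ws):
            grad_in[lo + k] += g * w
    return grad_in
```

A quick sanity check for a pair like this is the adjoint identity: the dot product of `resize_forward(x, n)` with `g` should equal the dot product of `x` with `resize_backward(g, len(x))`.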

fmassa

LGTM, thanks!

I only have one minor comment, otherwise good to merge.

As a next step, are you planning to wrap the autograd function within torchvision, or to move those functions directly to PyTorch?

torchvision/csrc/ops/cuda/interpolate_aa_kernels.cu

github-actions · 2021-08-04T11:45:31Z

Hey @fmassa!

You merged this PR, but no labels were added.

…i-alias option (#4211)

Summary:
* WIP on backward op interpolation with AA
* Removed cuda tests and reformat cpp code
* Fixed clang wrong formatting
* Added channels last test case
* Added CUDA support for backward pass, interpolation with AA
* Removed unused buffers

Reviewed By: NicolasHug

Differential Revision: D30417194

fbshipit-source-id: 4aab5bc21621859cfc4254da6a230e0c8a8cffc2

Co-authored-by: vfdev-5 <vfdev-5@gmail.com>

) Summary:

Description:
- Added antialias flag to interpolate (CUDA)
- forward and backward for bicubic mode
- added tests

Previous PR for CPU bilinear, #65142
Previous PR for CPU bicubic, #68819

### Benchmarks

<details>
<summary>
Bilinear forward pass, PIL, PTH CPU and PTH CUDA
</summary>

Code: https://gist.github.com/vfdev-5/b173761a567f2283b3c649c3c0574112

```
Torch version: 1.11.0a0+gitd032369
Torch config: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_61,code=sm_61
  - CuDNN 8.0.5
  - Build settings: BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=OFF,
Num threads: 8
```

Downsampling (bilinear), input `torch.Size([1, 3, 906, 438])`, 8 threads. Times are in microseconds (us).

| Output size | Input layout and dtype | Reference, PIL 8.4.0, mode: RGB | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda |
| --- | --- | --- | --- | --- |
| (320, 196) | channels_first contiguous torch.float32 | 2851.2 | 874.1 | 57.1 |
| (320, 196) | channels_last non-contiguous torch.float32 | 2856.1 | 1155.8 | 130.6 |
| (460, 220) | channels_first contiguous torch.float32 | 3705.9 | 1005.8 | 66.3 |
| (460, 220) | channels_last non-contiguous torch.float32 | 3742.9 | 1332.8 | 143.5 |
| (120, 96) | channels_first contiguous torch.float32 | 1768.0 | 725.2 | 77.9 |
| (120, 96) | channels_last non-contiguous torch.float32 | 1753.7 | 942.5 | 144.0 |
| (1200, 196) | channels_first contiguous torch.float32 | 9522.6 | 2593.8 | 157.8 |
| (1200, 196) | channels_last non-contiguous torch.float32 | 9513.5 | 3622.7 | 241.5 |
| (120, 1200) | channels_first contiguous torch.float32 | 2240.1 | 565.5 | 93.3 |
| (120, 1200) | channels_last non-contiguous torch.float32 | 2244.2 | 972.7 | 170.8 |

Downsampling (bilinear), input `torch.Size([1, 1, 906, 438])`, 8 threads. Times are in microseconds (us).

| Output size | Input layout and dtype | Reference, PIL 8.4.0, mode: F | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda |
| --- | --- | --- | --- | --- |
| (320, 196) | contiguous torch.float32 | 1441.3 | 386.1 | 22.3 |
| (460, 220) | contiguous torch.float32 | 1815.2 | 376.8 | 27.8 |
| (120, 96) | contiguous torch.float32 | 962.3 | 400.0 | 29.4 |
| (1200, 196) | contiguous torch.float32 | 4749.7 | 910.1 | 63.7 |
| (120, 1200) | contiguous torch.float32 | 1098.1 | 272.0 | 36.4 |

</details>

<details>
<summary>
Bicubic forward pass, PIL, PTH CPU and PTH CUDA
</summary>

Code: https://gist.github.com/vfdev-5/b173761a567f2283b3c649c3c0574112

```
Torch version: 1.11.0a0+gitd032369
Num threads: 8
```

Downsampling (bicubic), input `torch.Size([1, 3, 906, 438])`, 8 threads. Times are in microseconds (us).

| Output size | Input layout and dtype | Reference, PIL 8.4.0, mode: RGB | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda |
| --- | --- | --- | --- | --- |
| (320, 196) | channels_first contiguous torch.float32 | 4522.4 | 1406.7 | 170.3 |
| (320, 196) | channels_last non-contiguous torch.float32 | 4530.0 | 1435.4 | 242.2 |
| (460, 220) | channels_first contiguous torch.float32 | 5726.4 | 1628.6 | 164.0 |
| (460, 220) | channels_last non-contiguous torch.float32 | 5722.6 | 1665.6 | 234.7 |
| (120, 96) | channels_first contiguous torch.float32 | 2909.1 | 1461.5 | 276.9 |
| (120, 96) | channels_last non-contiguous torch.float32 | 2892.9 | 1458.7 | 345.1 |
| (1200, 196) | channels_first contiguous torch.float32 | 14699.2 | 4283.9 | 407.1 |
| (1200, 196) | channels_last non-contiguous torch.float32 | 14711.3 | 4321.1 | 477.0 |
| (120, 1200) | channels_first contiguous torch.float32 | 3467.0 | 980.0 | 339.2 |
| (120, 1200) | channels_last non-contiguous torch.float32 | 3465.2 | 982.3 | 407.8 |

Downsampling (bicubic), input `torch.Size([1, 1, 906, 438])`, 8 threads. Times are in microseconds (us).

| Output size | Input layout and dtype | Reference, PIL 8.4.0, mode: F | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda |
| --- | --- | --- | --- | --- |
| (320, 196) | contiguous torch.float32 | 2396.7 | 877.8 | 68.1 |
| (460, 220) | contiguous torch.float32 | 3068.2 | 777.3 | 64.7 |
| (120, 96) | contiguous torch.float32 | 1540.2 | 829.3 | 100.4 |
| (1200, 196) | contiguous torch.float32 | 7919.5 | 1467.8 | 151.6 |
| (120, 1200) | contiguous torch.float32 | 1695.7 | 631.2 | 117.7 |

</details>

<details>
<summary>
Bilinear backward pass, PTH CPU and PTH CUDA
</summary>

Code: https://gist.github.com/vfdev-5/b173761a567f2283b3c649c3c0574112

```
- Measure only backward op
Torch version: 1.11.0a0+gitd032369
Num threads: 8
```

Downsampling backward (bilinear), input `torch.Size([1, 3, 906, 438])`, 8 threads. Times are in microseconds (us).

| Output size | Input layout and dtype | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda |
| --- | --- | --- | --- |
| (320, 196) | channels_first contiguous torch.float32 | 4686.8 | 215.7 |
| (320, 196) | channels_last non-contiguous torch.float32 | 5101.1 | 220.5 |
| (460, 220) | channels_first contiguous torch.float32 | 6011.2 | 204.4 |
| (460, 220) | channels_last non-contiguous torch.float32 | 6396.0 | 210.0 |
| (120, 96) | channels_first contiguous torch.float32 | 2035.6 | 250.2 |
| (120, 96) | channels_last non-contiguous torch.float32 | 1589.6 | 252.5 |
| (1200, 196) | channels_first contiguous torch.float32 | 11392.5 | 256.5 |
| (1200, 196) | channels_last non-contiguous torch.float32 | 11640.2 | 263.9 |
| (120, 1200) | channels_first contiguous torch.float32 | 11769.6 | 465.9 |
| (120, 1200) | channels_last non-contiguous torch.float32 | 12407.0 | 474.4 |

Downsampling backward (bilinear), input `torch.Size([1, 1, 906, 438])`, 8 threads. Times are in microseconds (us).

| Output size | Input layout and dtype | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda |
| --- | --- | --- | --- |
| (320, 196) | contiguous torch.float32 | 3931.0 | 133.3 |
| (460, 220) | contiguous torch.float32 | 5594.8 | 133.9 |
| (120, 96) | contiguous torch.float32 | 1272.6 | 133.0 |
| (1200, 196) | contiguous torch.float32 | 10618.1 | 134.0 |
| (120, 1200) | contiguous torch.float32 | 11082.2 | 154.6 |

</details>

<details>
<summary>
Bicubic backward pass, PTH CPU and PTH CUDA
</summary>

Code: https://gist.github.com/vfdev-5/b173761a567f2283b3c649c3c0574112

```
- Measure only backward op
Torch version: 1.11.0a0+gitd032369
Num threads: 8
```

Downsampling backward (bicubic), input `torch.Size([1, 3, 906, 438])`, 8 threads. Times are in microseconds (us), except the (120, 1200) rows, which were reported in milliseconds (ms).

| Output size | Input layout and dtype | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda |
| --- | --- | --- | --- |
| (320, 196) | channels_first contiguous torch.float32 | 6791.2 | 618.9 |
| (320, 196) | channels_last non-contiguous torch.float32 | 7125.2 | 622.9 |
| (460, 220) | channels_first contiguous torch.float32 | 8806.2 | 600.3 |
| (460, 220) | channels_last non-contiguous torch.float32 | 9167.6 | 607.5 |
| (120, 96) | channels_first contiguous torch.float32 | 3683.6 | 693.8 |
| (120, 96) | channels_last non-contiguous torch.float32 | 3617.4 | 695.0 |
| (1200, 196) | channels_first contiguous torch.float32 | 17548.2 | 779.4 |
| (1200, 196) | channels_last non-contiguous torch.float32 | 17966.2 | 786.5 |
| (120, 1200) | channels_first contiguous torch.float32 | 28.4 ms | 1.6 ms |
| (120, 1200) | channels_last non-contiguous torch.float32 | 28.4 ms | 1.6 ms |

Downsampling backward (bicubic), input `torch.Size([1, 1, 906, 438])`, 8 threads. Times are in microseconds (us).

| Output size | Input layout and dtype | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda |
| --- | --- | --- | --- |
| (320, 196) | contiguous torch.float32 | 6266.1 | 208.5 |
| (460, 220) | contiguous torch.float32 | 8218.3 | 200.8 |
| (120, 96) | contiguous torch.float32 | 3458.9 | 231.9 |
| (1200, 196) | contiguous torch.float32 | 15729.3 | 261.6 |
| (120, 1200) | contiguous torch.float32 | 26279.8 | 547.0 |

</details>

Code is moved from torchvision: pytorch/vision#4211 and optimized

Pull Request resolved: #70930

Reviewed By: zou3519

Differential Revision: D33817902

Pulled By: jbschlosser

fbshipit-source-id: d63a620f8972ff36b63841f0bc6c820466f58f69
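
As a reading aid for the backward timings in the commit message above, the CPU-to-CUDA ratio can be worked out for a few rows. The values below are copied from the channels_first contiguous rows for input `torch.Size([1, 3, 906, 438])`. The dictionary layout and the name `speedup` are illustrative only and are not part of the benchmark gist.

```python
# (mode, output size) -> (cpu time in us, cuda time in us).
# Values copied from the backward benchmark rows reported in the commit message.
backward_us = {
    ("bilinear", (320, 196)): (4686.8, 215.7),
    ("bilinear", (1200, 196)): (11392.5, 256.5),
    ("bicubic", (320, 196)): (6791.2, 618.9),
    ("bicubic", (1200, 196)): (17548.2, 779.4),
}


def speedup(cpu_us, cuda_us):
    """Return how many times faster the CUDA timing is than the CPU timing."""
    return cpu_us / cuda_us


speedups = {key: speedup(*times) for key, times in backward_us.items()}

for (mode, size), ratio in speedups.items():
    print(f"{mode:8s} backward -> {size}: CUDA is about {ratio:.1f}x faster than CPU")
```

These ratios describe only the single machine and build listed in the benchmark configuration.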

) Summary: Description: - Added antialias flag to interpolate (CUDA) - forward and backward for bicubic mode - added tests Previous PR for CPU bilinear, #65142 Previous PR for CPU bicubic, #68819 ### Benchmarks <details> <summary> Bilinear forward pass, PIL, PTH CPU and PTH CUDA </summary> Code: https://gist.github.com/vfdev-5/b173761a567f2283b3c649c3c0574112 ``` Torch version: 1.11.0a0+gitd032369 Torch config: PyTorch built with: - GCC 9.3 - C++ Version: 201402 - OpenMP 201511 (a.k.a. OpenMP 4.5) - CPU capability usage: AVX2 - CUDA Runtime 11.1 - NVCC architecture flags: -gencode;arch=compute_61,code=sm_61 - CuDNN 8.0.5 - Build settings: BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=OFF, Num threads: 8 [----------------------------------- Downsampling (bilinear): torch.Size([1, 3, 906, 438]) -> (320, 196) -----------------------------------] | Reference, 
PIL 8.4.0, mode: RGB | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda 8 threads: ---------------------------------------------------------------------------------------------------------------------------------- channels_first contiguous torch.float32 | 2851.2 | 874.1 | 57.1 channels_last non-contiguous torch.float32 | 2856.1 | 1155.8 | 130.6 Times are in microseconds (us). [----------------------------------- Downsampling (bilinear): torch.Size([1, 3, 906, 438]) -> (460, 220) -----------------------------------] | Reference, PIL 8.4.0, mode: RGB | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda 8 threads: ---------------------------------------------------------------------------------------------------------------------------------- channels_first contiguous torch.float32 | 3705.9 | 1005.8 | 66.3 channels_last non-contiguous torch.float32 | 3742.9 | 1332.8 | 143.5 Times are in microseconds (us). [------------------------------------ Downsampling (bilinear): torch.Size([1, 3, 906, 438]) -> (120, 96) -----------------------------------] | Reference, PIL 8.4.0, mode: RGB | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda 8 threads: ---------------------------------------------------------------------------------------------------------------------------------- channels_first contiguous torch.float32 | 1768.0 | 725.2 | 77.9 channels_last non-contiguous torch.float32 | 1753.7 | 942.5 | 144.0 Times are in microseconds (us). [----------------------------------- Downsampling (bilinear): torch.Size([1, 3, 906, 438]) -> (1200, 196) ----------------------------------] | Reference, PIL 8.4.0, mode: RGB | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda 8 threads: ---------------------------------------------------------------------------------------------------------------------------------- channels_first contiguous torch.float32 | 9522.6 | 2593.8 | 157.8 channels_last non-contiguous torch.float32 | 9513.5 | 3622.7 | 241.5 Times are in microseconds (us). 
[----------------------------------- Downsampling (bilinear): torch.Size([1, 3, 906, 438]) -> (120, 1200) ----------------------------------] | Reference, PIL 8.4.0, mode: RGB | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda 8 threads: ---------------------------------------------------------------------------------------------------------------------------------- channels_first contiguous torch.float32 | 2240.1 | 565.5 | 93.3 channels_last non-contiguous torch.float32 | 2244.2 | 972.7 | 170.8 Times are in microseconds (us). [------------------------- Downsampling (bilinear): torch.Size([1, 1, 906, 438]) -> (320, 196) --------------------------] | Reference, PIL 8.4.0, mode: F | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda 8 threads: --------------------------------------------------------------------------------------------------------------- contiguous torch.float32 | 1441.3 | 386.1 | 22.3 Times are in microseconds (us). [------------------------- Downsampling (bilinear): torch.Size([1, 1, 906, 438]) -> (460, 220) --------------------------] | Reference, PIL 8.4.0, mode: F | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda 8 threads: --------------------------------------------------------------------------------------------------------------- contiguous torch.float32 | 1815.2 | 376.8 | 27.8 Times are in microseconds (us). [-------------------------- Downsampling (bilinear): torch.Size([1, 1, 906, 438]) -> (120, 96) --------------------------] | Reference, PIL 8.4.0, mode: F | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda 8 threads: --------------------------------------------------------------------------------------------------------------- contiguous torch.float32 | 962.3 | 400.0 | 29.4 Times are in microseconds (us). 
[------------------------- Downsampling (bilinear): torch.Size([1, 1, 906, 438]) -> (1200, 196) -------------------------] | Reference, PIL 8.4.0, mode: F | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda 8 threads: --------------------------------------------------------------------------------------------------------------- contiguous torch.float32 | 4749.7 | 910.1 | 63.7 Times are in microseconds (us). [------------------------- Downsampling (bilinear): torch.Size([1, 1, 906, 438]) -> (120, 1200) -------------------------] | Reference, PIL 8.4.0, mode: F | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda 8 threads: --------------------------------------------------------------------------------------------------------------- contiguous torch.float32 | 1098.1 | 272.0 | 36.4 Times are in microseconds (us). ``` </details> <details> <summary> Bicubic forward pass, PIL, PTH CPU and PTH CUDA </summary> Code: https://gist.github.com/vfdev-5/b173761a567f2283b3c649c3c0574112 ``` Torch version: 1.11.0a0+gitd032369 Torch config: PyTorch built with: - GCC 9.3 - C++ Version: 201402 - OpenMP 201511 (a.k.a. 
OpenMP 4.5) - CPU capability usage: AVX2 - CUDA Runtime 11.1 - NVCC architecture flags: -gencode;arch=compute_61,code=sm_61 - CuDNN 8.0.5 - Build settings: BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=OFF, Num threads: 8 [------------------------------------ Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (320, 196) -----------------------------------] | Reference, PIL 8.4.0, mode: RGB | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda 8 threads: ---------------------------------------------------------------------------------------------------------------------------------- channels_first contiguous torch.float32 | 4522.4 | 1406.7 | 170.3 channels_last non-contiguous torch.float32 | 4530.0 | 1435.4 | 242.2 Times are in microseconds (us). 
[------------------------------------ Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (460, 220) -----------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |              5726.4               |          1628.6           |           164.0
      channels_last non-contiguous torch.float32  |              5722.6               |          1665.6           |           234.7

Times are in microseconds (us).

[------------------------------------ Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (120, 96) ------------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |              2909.1               |          1461.5           |           276.9
      channels_last non-contiguous torch.float32  |              2892.9               |          1458.7           |           345.1

Times are in microseconds (us).

[----------------------------------- Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (1200, 196) -----------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |              14699.2              |          4283.9           |           407.1
      channels_last non-contiguous torch.float32  |              14711.3              |          4321.1           |           477.0

Times are in microseconds (us).

[----------------------------------- Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (120, 1200) -----------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |              3467.0               |           980.0           |           339.2
      channels_last non-contiguous torch.float32  |              3465.2               |           982.3           |           407.8

Times are in microseconds (us).

[-------------------------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (320, 196) --------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
      contiguous torch.float32   |             2396.7              |           877.8           |           68.1

Times are in microseconds (us).

[-------------------------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (460, 220) --------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
      contiguous torch.float32   |             3068.2              |           777.3           |           64.7

Times are in microseconds (us).

[-------------------------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (120, 96) ---------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
      contiguous torch.float32   |             1540.2              |           829.3           |           100.4

Times are in microseconds (us).

[------------------------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (1200, 196) --------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
      contiguous torch.float32   |             7919.5              |          1467.8           |           151.6

Times are in microseconds (us).

[------------------------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (120, 1200) --------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
      contiguous torch.float32   |             1695.7              |           631.2           |           117.7

Times are in microseconds (us).
```

</details>

<details>
<summary>
Bilinear backward pass, PTH CPU and PTH CUDA
</summary>

Code: https://gist.github.com/vfdev-5/b173761a567f2283b3c649c3c0574112

```
- Measure only backward op

Torch version: 1.11.0a0+gitd032369
Torch config: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_61,code=sm_61
  - CuDNN 8.0.5
  - Build settings: BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=OFF,

Num threads: 8

[------------- Downsampling backward (bilinear): torch.Size([1, 3, 906, 438]) -> (320, 196) ------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |          4686.8           |           215.7
      channels_last non-contiguous torch.float32  |          5101.1           |           220.5

Times are in microseconds (us).

[------------- Downsampling backward (bilinear): torch.Size([1, 3, 906, 438]) -> (460, 220) ------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |          6011.2           |           204.4
      channels_last non-contiguous torch.float32  |          6396.0           |           210.0

Times are in microseconds (us).

[------------- Downsampling backward (bilinear): torch.Size([1, 3, 906, 438]) -> (120, 96) -------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |          2035.6           |           250.2
      channels_last non-contiguous torch.float32  |          1589.6           |           252.5

Times are in microseconds (us).

[------------ Downsampling backward (bilinear): torch.Size([1, 3, 906, 438]) -> (1200, 196) ------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |          11392.5          |           256.5
      channels_last non-contiguous torch.float32  |          11640.2          |           263.9

Times are in microseconds (us).

[------------ Downsampling backward (bilinear): torch.Size([1, 3, 906, 438]) -> (120, 1200) ------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |          11769.6          |           465.9
      channels_last non-contiguous torch.float32  |          12407.0          |           474.4

Times are in microseconds (us).

[---- Downsampling backward (bilinear): torch.Size([1, 1, 906, 438]) -> (320, 196) ----]
                                 |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
      contiguous torch.float32   |          3931.0           |           133.3

Times are in microseconds (us).

[---- Downsampling backward (bilinear): torch.Size([1, 1, 906, 438]) -> (460, 220) ----]
                                 |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
      contiguous torch.float32   |          5594.8           |           133.9

Times are in microseconds (us).

[---- Downsampling backward (bilinear): torch.Size([1, 1, 906, 438]) -> (120, 96) -----]
                                 |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
      contiguous torch.float32   |          1272.6           |           133.0

Times are in microseconds (us).

[--- Downsampling backward (bilinear): torch.Size([1, 1, 906, 438]) -> (1200, 196) ----]
                                 |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
      contiguous torch.float32   |          10618.1          |           134.0

Times are in microseconds (us).

[--- Downsampling backward (bilinear): torch.Size([1, 1, 906, 438]) -> (120, 1200) ----]
                                 |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
      contiguous torch.float32   |          11082.2          |           154.6

Times are in microseconds (us).
```

</details>

<details>
<summary>
Bicubic backward pass, PTH CPU and PTH CUDA
</summary>

Code: https://gist.github.com/vfdev-5/b173761a567f2283b3c649c3c0574112

```
- Measure only backward op

Torch version: 1.11.0a0+gitd032369
Torch config: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_61,code=sm_61
  - CuDNN 8.0.5
  - Build settings: BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=OFF,

Num threads: 8

[------------- Downsampling backward (bicubic): torch.Size([1, 3, 906, 438]) -> (320, 196) -------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |          6791.2           |           618.9
      channels_last non-contiguous torch.float32  |          7125.2           |           622.9

Times are in microseconds (us).

[------------- Downsampling backward (bicubic): torch.Size([1, 3, 906, 438]) -> (460, 220) -------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |          8806.2           |           600.3
      channels_last non-contiguous torch.float32  |          9167.6           |           607.5

Times are in microseconds (us).

[-------------- Downsampling backward (bicubic): torch.Size([1, 3, 906, 438]) -> (120, 96) -------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |          3683.6           |           693.8
      channels_last non-contiguous torch.float32  |          3617.4           |           695.0

Times are in microseconds (us).

[------------- Downsampling backward (bicubic): torch.Size([1, 3, 906, 438]) -> (1200, 196) ------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |          17548.2          |           779.4
      channels_last non-contiguous torch.float32  |          17966.2          |           786.5

Times are in microseconds (us).

[------------- Downsampling backward (bicubic): torch.Size([1, 3, 906, 438]) -> (120, 1200) ------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |           28.4            |            1.6
      channels_last non-contiguous torch.float32  |           28.4            |            1.6

Times are in milliseconds (ms).

[---- Downsampling backward (bicubic): torch.Size([1, 1, 906, 438]) -> (320, 196) -----]
                                 |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
      contiguous torch.float32   |          6266.1           |           208.5

Times are in microseconds (us).

[---- Downsampling backward (bicubic): torch.Size([1, 1, 906, 438]) -> (460, 220) -----]
                                 |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
      contiguous torch.float32   |          8218.3           |           200.8

Times are in microseconds (us).

[----- Downsampling backward (bicubic): torch.Size([1, 1, 906, 438]) -> (120, 96) -----]
                                 |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
      contiguous torch.float32   |          3458.9           |           231.9

Times are in microseconds (us).

[---- Downsampling backward (bicubic): torch.Size([1, 1, 906, 438]) -> (1200, 196) ----]
                                 |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
      contiguous torch.float32   |          15729.3          |           261.6

Times are in microseconds (us).

[---- Downsampling backward (bicubic): torch.Size([1, 1, 906, 438]) -> (120, 1200) ----]
                                 |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
      contiguous torch.float32   |          26279.8          |           547.0

Times are in microseconds (us).
```

</details>

Code is moved from torchvision: pytorch/vision#4211 and optimized

Pull Request resolved: #70930

Reviewed By: zou3519

Differential Revision: D33817902

Pulled By: jbschlosser

fbshipit-source-id: d63a620f8972ff36b63841f0bc6c820466f58f69
(cherry picked from commit d358cfd)
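
For readers following the backward-pass numbers above, here is a rough plain-Python sketch of what a backward op for antialiased resizing has to compute. It assumes a PIL-style triangle (bilinear) filter whose support is widened by the downscale factor, works on a single 1D row, and is not the CUDA kernel from this PR. The helper names `aa_linear_weights`, `resize_forward` and `resize_backward` are hypothetical and exist only for this illustration. The forward pass is a weighted sum over a window of input pixels; the backward pass scatters each output gradient back through the same weights.

```python
def aa_linear_weights(in_size, out_size):
    """Return one (start_index, normalized_weights) pair per output pixel.

    Assumption: PIL-style triangle filter, support scaled by max(scale, 1).
    """
    scale = in_size / out_size
    support = max(scale, 1.0)  # half-width of the filter in input pixels
    rows = []
    for i in range(out_size):
        center = (i + 0.5) * scale
        lo = max(int(center - support + 0.5), 0)
        hi = min(int(center + support + 0.5), in_size)
        raw = [max(0.0, 1.0 - abs((j + 0.5 - center) / support))
               for j in range(lo, hi)]
        total = sum(raw)
        rows.append((lo, [v / total for v in raw]))
    return rows


def resize_forward(x, rows):
    """Forward pass: each output is a weighted sum over its input window."""
    return [sum(w * x[lo + k] for k, w in enumerate(ws)) for lo, ws in rows]


def resize_backward(grad_out, rows, in_size):
    """Backward pass: accumulate output gradients into the inputs they read."""
    grad_in = [0.0] * in_size
    for g, (lo, ws) in zip(grad_out, rows):
        for k, w in enumerate(ws):
            grad_in[lo + k] += g * w
    return grad_in


# 906 -> 320 mirrors the height change in the benchmark titles above.
rows = aa_linear_weights(906, 320)
x = [float((7 * i) % 13) for i in range(906)]
g = [float((3 * i) % 5) for i in range(320)]
y = resize_forward(x, rows)
gx = resize_backward(g, rows, 906)

# Adjoint identity: <forward(x), g> == <x, backward(g)> up to rounding.
lhs = sum(a * b for a, b in zip(y, g))
rhs = sum(a * b for a, b in zip(x, gx))
assert abs(lhs - rhs) <= 1e-6 * max(1.0, abs(lhs))
```

The final assertion checks that the backward pass is the transpose of the forward pass, which is the property a gradient test for such an op relies on. In PyTorch 1.11 and later, the op itself is reachable through `torch.nn.functional.interpolate(..., mode="bilinear", antialias=True)` (or `mode="bicubic"`).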

…930)

Summary:
Description:
- Added antialias flag to interpolate (CUDA)
- forward and backward for bicubic mode
- added tests

Previous PR for CPU bilinear, pytorch/pytorch#65142
Previous PR for CPU bicubic, pytorch/pytorch#68819

### Benchmarks

<details>
<summary>
Bilinear forward pass, PIL, PTH CPU and PTH CUDA
</summary>

Code: https://gist.github.com/vfdev-5/b173761a567f2283b3c649c3c0574112

```
Torch version: 1.11.0a0+gitd032369
Torch config: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_61,code=sm_61
  - CuDNN 8.0.5
  - Build settings: BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=OFF,

Num threads: 8

[----------------------------------- Downsampling (bilinear): torch.Size([1, 3, 906, 438]) -> (320, 196) -----------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |              2851.2               |           874.1           |           57.1
      channels_last non-contiguous torch.float32  |              2856.1               |          1155.8           |           130.6

Times are in microseconds (us).

[----------------------------------- Downsampling (bilinear): torch.Size([1, 3, 906, 438]) -> (460, 220) -----------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |              3705.9               |          1005.8           |           66.3
      channels_last non-contiguous torch.float32  |              3742.9               |          1332.8           |           143.5

Times are in microseconds (us).

[------------------------------------ Downsampling (bilinear): torch.Size([1, 3, 906, 438]) -> (120, 96) -----------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |              1768.0               |           725.2           |           77.9
      channels_last non-contiguous torch.float32  |              1753.7               |           942.5           |           144.0

Times are in microseconds (us).

[----------------------------------- Downsampling (bilinear): torch.Size([1, 3, 906, 438]) -> (1200, 196) ----------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |              9522.6               |          2593.8           |           157.8
      channels_last non-contiguous torch.float32  |              9513.5               |          3622.7           |           241.5

Times are in microseconds (us).

[----------------------------------- Downsampling (bilinear): torch.Size([1, 3, 906, 438]) -> (120, 1200) ----------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |              2240.1               |           565.5           |           93.3
      channels_last non-contiguous torch.float32  |              2244.2               |           972.7           |           170.8

Times are in microseconds (us).

[------------------------- Downsampling (bilinear): torch.Size([1, 1, 906, 438]) -> (320, 196) --------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
      contiguous torch.float32   |             1441.3              |           386.1           |           22.3

Times are in microseconds (us).

[------------------------- Downsampling (bilinear): torch.Size([1, 1, 906, 438]) -> (460, 220) --------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
      contiguous torch.float32   |             1815.2              |           376.8           |           27.8

Times are in microseconds (us).

[-------------------------- Downsampling (bilinear): torch.Size([1, 1, 906, 438]) -> (120, 96) --------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
      contiguous torch.float32   |              962.3              |           400.0           |           29.4

Times are in microseconds (us).

[------------------------- Downsampling (bilinear): torch.Size([1, 1, 906, 438]) -> (1200, 196) -------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
      contiguous torch.float32   |             4749.7              |           910.1           |           63.7

Times are in microseconds (us).

[------------------------- Downsampling (bilinear): torch.Size([1, 1, 906, 438]) -> (120, 1200) -------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
      contiguous torch.float32   |             1098.1              |           272.0           |           36.4

Times are in microseconds (us).
```

</details>

<details>
<summary>
Bicubic forward pass, PIL, PTH CPU and PTH CUDA
</summary>

Code: https://gist.github.com/vfdev-5/b173761a567f2283b3c649c3c0574112

```
Torch version: 1.11.0a0+gitd032369
Torch config: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_61,code=sm_61
  - CuDNN 8.0.5
  - Build settings: BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=OFF,

Num threads: 8

[------------------------------------ Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (320, 196) -----------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |              4522.4               |          1406.7           |           170.3
      channels_last non-contiguous torch.float32  |              4530.0               |          1435.4           |           242.2

Times are in microseconds (us).

[------------------------------------ Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (460, 220) -----------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |              5726.4               |          1628.6           |           164.0
      channels_last non-contiguous torch.float32  |              5722.6               |          1665.6           |           234.7

Times are in microseconds (us).

[------------------------------------ Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (120, 96) ------------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |              2909.1               |          1461.5           |           276.9
      channels_last non-contiguous torch.float32  |              2892.9               |          1458.7           |           345.1

Times are in microseconds (us).

[----------------------------------- Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (1200, 196) -----------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |              14699.2              |          4283.9           |           407.1
      channels_last non-contiguous torch.float32  |              14711.3              |          4321.1           |           477.0

Times are in microseconds (us).

[----------------------------------- Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (120, 1200) -----------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |              3467.0               |           980.0           |           339.2
      channels_last non-contiguous torch.float32  |              3465.2               |           982.3           |           407.8

Times are in microseconds (us).

[-------------------------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (320, 196) --------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
      contiguous torch.float32   |             2396.7              |           877.8           |           68.1

Times are in microseconds (us).

[-------------------------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (460, 220) --------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
      contiguous torch.float32   |             3068.2              |           777.3           |           64.7

Times are in microseconds (us).

[-------------------------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (120, 96) ---------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
      contiguous torch.float32   |             1540.2              |           829.3           |           100.4

Times are in microseconds (us).

[------------------------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (1200, 196) --------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
      contiguous torch.float32   |             7919.5              |          1467.8           |           151.6

Times are in microseconds (us).

[------------------------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (120, 1200) --------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
      contiguous torch.float32   |             1695.7              |           631.2           |           117.7

Times are in microseconds (us).
```

</details>

<details>
<summary>
Bilinear backward pass, PTH CPU and PTH CUDA
</summary>

Code: https://gist.github.com/vfdev-5/b173761a567f2283b3c649c3c0574112

```
- Measure only backward op

Torch version: 1.11.0a0+gitd032369
Torch config: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_61,code=sm_61
  - CuDNN 8.0.5
  - Build settings: BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=OFF,

Num threads: 8

[------------- Downsampling backward (bilinear): torch.Size([1, 3, 906, 438]) -> (320, 196) ------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |          4686.8           |           215.7
      channels_last non-contiguous torch.float32  |          5101.1           |           220.5

Times are in microseconds (us).

[------------- Downsampling backward (bilinear): torch.Size([1, 3, 906, 438]) -> (460, 220) ------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |          6011.2           |           204.4
      channels_last non-contiguous torch.float32  |          6396.0           |           210.0

Times are in microseconds (us).

[------------- Downsampling backward (bilinear): torch.Size([1, 3, 906, 438]) -> (120, 96) -------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |          2035.6           |           250.2
      channels_last non-contiguous torch.float32  |          1589.6           |           252.5

Times are in microseconds (us).

[------------ Downsampling backward (bilinear): torch.Size([1, 3, 906, 438]) -> (1200, 196) ------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |          11392.5          |           256.5
      channels_last non-contiguous torch.float32  |          11640.2          |           263.9

Times are in microseconds (us).

[------------ Downsampling backward (bilinear): torch.Size([1, 3, 906, 438]) -> (120, 1200) ------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |          11769.6          |           465.9
      channels_last non-contiguous torch.float32  |          12407.0          |           474.4

Times are in microseconds (us).

[---- Downsampling backward (bilinear): torch.Size([1, 1, 906, 438]) -> (320, 196) ----]
                                 |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
      contiguous torch.float32   |          3931.0           |           133.3

Times are in microseconds (us).


[---- Downsampling backward (bilinear): torch.Size([1, 1, 906, 438]) -> (460, 220) ----]
                            | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
  contiguous torch.float32  | 5594.8                  | 133.9

Times are in microseconds (us).

[---- Downsampling backward (bilinear): torch.Size([1, 1, 906, 438]) -> (120, 96) -----]
                            | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
  contiguous torch.float32  | 1272.6                  | 133.0

Times are in microseconds (us).

[--- Downsampling backward (bilinear): torch.Size([1, 1, 906, 438]) -> (1200, 196) ----]
                            | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
  contiguous torch.float32  | 10618.1                 | 134.0

Times are in microseconds (us).

[--- Downsampling backward (bilinear): torch.Size([1, 1, 906, 438]) -> (120, 1200) ----]
                            | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
  contiguous torch.float32  | 11082.2                 | 154.6

Times are in microseconds (us). ``` </details> <details> <summary> Bicubic backward pass, PTH CPU and PTH CUDA </summary> Code: https://gist.github.com/vfdev-5/b173761a567f2283b3c649c3c0574112 ```
- Measure only backward op

Torch version: 1.11.0a0+gitd032369
Torch config: PyTorch built with:
- GCC 9.3
- C++ Version: 201402
- OpenMP 201511 (a.k.a.
OpenMP 4.5)
- CPU capability usage: AVX2
- CUDA Runtime 11.1
- NVCC architecture flags: -gencode;arch=compute_61,code=sm_61
- CuDNN 8.0.5
- Build settings: BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=OFF,

Num threads: 8

[------------- Downsampling backward (bicubic): torch.Size([1, 3, 906, 438]) -> (320, 196) -------------]
                                              | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
  channels_first contiguous torch.float32     | 6791.2                  | 618.9
  channels_last non-contiguous torch.float32  | 7125.2                  | 622.9

Times are in microseconds (us).

[------------- Downsampling backward (bicubic): torch.Size([1, 3, 906, 438]) -> (460, 220) -------------]
                                              | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
  channels_first contiguous torch.float32     | 8806.2                  | 600.3
  channels_last non-contiguous torch.float32  | 9167.6                  | 607.5

Times are in microseconds (us).

[-------------- Downsampling backward (bicubic): torch.Size([1, 3, 906, 438]) -> (120, 96) -------------]
                                              | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
  channels_first contiguous torch.float32     | 3683.6                  | 693.8
  channels_last non-contiguous torch.float32  | 3617.4                  | 695.0

Times are in microseconds (us).

[------------- Downsampling backward (bicubic): torch.Size([1, 3, 906, 438]) -> (1200, 196) ------------]
                                              | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
  channels_first contiguous torch.float32     | 17548.2                 | 779.4
  channels_last non-contiguous torch.float32  | 17966.2                 | 786.5

Times are in microseconds (us).

[------------- Downsampling backward (bicubic): torch.Size([1, 3, 906, 438]) -> (120, 1200) ------------]
                                              | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
  channels_first contiguous torch.float32     | 28.4                    | 1.6
  channels_last non-contiguous torch.float32  | 28.4                    | 1.6

Times are in milliseconds (ms).

[---- Downsampling backward (bicubic): torch.Size([1, 1, 906, 438]) -> (320, 196) -----]
                            | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
  contiguous torch.float32  | 6266.1                  | 208.5

Times are in microseconds (us).

[---- Downsampling backward (bicubic): torch.Size([1, 1, 906, 438]) -> (460, 220) -----]
                            | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
  contiguous torch.float32  | 8218.3                  | 200.8

Times are in microseconds (us).

[----- Downsampling backward (bicubic): torch.Size([1, 1, 906, 438]) -> (120, 96) -----]
                            | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
  contiguous torch.float32  | 3458.9                  | 231.9

Times are in microseconds (us).

[---- Downsampling backward (bicubic): torch.Size([1, 1, 906, 438]) -> (1200, 196) ----]
                            | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
  contiguous torch.float32  | 15729.3                 | 261.6

Times are in microseconds (us).

[---- Downsampling backward (bicubic): torch.Size([1, 1, 906, 438]) -> (120, 1200) ----]
                            | 1.11.0a0+gitd032369 cpu | 1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
  contiguous torch.float32  | 26279.8                 | 547.0

Times are in microseconds (us). ``` </details>

Code is moved from torchvision: pytorch/vision#4211 and optimized

Pull Request resolved: pytorch/pytorch#70930
Reviewed By: zou3519
Differential Revision: D33817902
Pulled By: jbschlosser
fbshipit-source-id: d63a620f8972ff36b63841f0bc6c820466f58f69
(cherry picked from commit d358cfd)
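
For readers who want the CPU-to-CUDA ratio rather than raw timings, the short sketch below recomputes it from a few rows of the benchmark output above (3-channel input, `channels_first contiguous torch.float32`). The names `TIMINGS_US`, `speedup`, and `summarize` are made up for this illustration and are not part of the PR or the benchmark gist; only the numbers are taken from the output.

```python
# Backward-pass timings in microseconds, copied from the benchmark output above
# (input torch.Size([1, 3, 906, 438]), channels_first contiguous torch.float32).
# Keys are (mode, output size); values are (cpu_us, cuda_us).
# TIMINGS_US, speedup and summarize are hypothetical names used only here.
TIMINGS_US = {
    ("bilinear", (320, 196)): (4686.8, 215.7),
    ("bilinear", (460, 220)): (6011.2, 204.4),
    ("bilinear", (120, 96)): (2035.6, 250.2),
    ("bicubic", (320, 196)): (6791.2, 618.9),
    ("bicubic", (460, 220)): (8806.2, 600.3),
    ("bicubic", (120, 96)): (3683.6, 693.8),
}


def speedup(cpu_us, cuda_us):
    """Return how many times faster the CUDA timing is than the CPU timing."""
    return cpu_us / cuda_us


def summarize(timings):
    """Map each (mode, size) key to its CPU/CUDA ratio, rounded to one decimal."""
    return {key: round(speedup(cpu, cuda), 1) for key, (cpu, cuda) in timings.items()}


if __name__ == "__main__":
    for (mode, size), ratio in summarize(TIMINGS_US).items():
        print(f"{mode:8s} -> {size}: CUDA is about {ratio}x faster than CPU")
```

The ratios describe only the machine in the benchmark header (8 CPU threads, an `sm_61` GPU); other hardware will give different numbers.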

vfdev-5 and others added 6 commits July 23, 2021 15:26

WIP on backward op interpolation with AA

3d9240d

Removed cuda tests and reformat cpp code

206c2be

Merge branch 'master' into add-backward-interp-aa

3529515

Fixed clang wrong formatting

5375e8c

Added channels last test case

1ed1a16

Added CUDA support for backward pass, interpolation with AA

c51f8c1

facebook-github-bot added the cla signed label Jul 27, 2021

vfdev-5 added 3 commits July 28, 2021 22:45

Merge branch 'master' into add-backward-interp-aa-cuda

699dfb1

Merge branch 'master' into add-backward-interp-aa-cuda

66ac82f

Merge branch 'master' into add-backward-interp-aa-cuda

c20bf1c

fmassa approved these changes Aug 4, 2021

torchvision/csrc/ops/cuda/interpolate_aa_kernels.cu (review comment: outdated, resolved)

vfdev-5 and others added 2 commits August 4, 2021 03:27

Removed unused buffers

8fa06a4

Merge branch 'master' into add-backward-interp-aa-cuda

85d21ff

fmassa merged commit a9b38db into pytorch:master Aug 4, 2021

vfdev-5 deleted the add-backward-interp-aa-cuda branch August 4, 2021 11:45

vfdev-5 added the enhancement and module: ops labels Aug 4, 2021

vfdev-5 mentioned this pull request Jan 6, 2022

Added antialias flag to interpolate (CUDA, bilinear and bicubic) pytorch/pytorch#70930

Closed

pmeier mentioned this pull request Jun 19, 2023

refactor transforms v2 tests #7562

Merged
