Optimized bilinear interpolation using TensorIterator #51653

Closed
wants to merge 12 commits into from

vfdev-5
Collaborator

@vfdev-5 vfdev-5 commented Feb 3, 2021

Related to #10482

Description:

  • Optimized linear interpolation (linear, bilinear and trilinear modes for the 1d, 2d and 3d cases) using TensorIterator

Results

Interpolation 2d - 6 thread(s)

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 320, 320] | [256, 256] | True | False | 0.3938 | 0.0782 | 5.0339
[1, 3, 320, 320] | [512, 512] | True | False | 1.5585 | 0.4105 | 3.7965
[1, 3, 320, 320] | [256, 256] | False | False | 0.3481 | 0.0760 | 4.5780
[1, 3, 320, 320] | [512, 512] | False | False | 1.5848 | 0.4091 | 3.8734
[1, 3, 320, 320] | [256, 256] | False | True | 1.2058 | 1.2034 | 1.0020
[1, 3, 320, 320] | [512, 512] | False | True | 4.8691 | 4.8537 | 1.0032
[32, 128, 64, 64] | [32, 32] | False | True | 6.3915 | 6.4041 | 0.9980
[32, 128, 64, 64] | [128, 128] | False | True | 166.1769 | 164.5621 | 1.0098
[32, 128, 64, 64] | [32, 32] | True | False | 3.7194 | 2.4720 | 1.5046
[32, 128, 64, 64] | [128, 128] | True | False | 86.6704 | 52.3754 | 1.6548
[1, 3, 500, 500] | [256, 256] | True | False | 0.3270 | 0.0792 | 4.1307
[1, 3, 500, 500] | [800, 800] | True | False | 3.3116 | 0.5567 | 5.9482
[1, 3, 500, 500] | [256, 256] | False | False | 0.3763 | 0.0773 | 4.8700
[1, 3, 500, 500] | [800, 800] | False | False | 3.2577 | 0.5590 | 5.8279

Interpolation 1d - 6 thread(s)

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[4, 512, 320] | 256 | True | False | 0.2795 | 0.1032 | 2.7089
[4, 512, 320] | 512 | True | False | 0.5533 | 0.1888 | 2.9303

Interpolation 3d - 6 thread(s)

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 16, 320, 320] | [8, 256, 256] | True | False | 4.4105 | 2.1236 | 2.0769
[1, 3, 16, 320, 320] | [32, 512, 512] | True | False | 83.9426 | 42.6641 | 1.9675
[1, 3, 16, 320, 320] | [8, 256, 256] | False | True | 15.5736 | 15.5758 | 0.9999
[1, 3, 16, 320, 320] | [32, 512, 512] | False | True | 272.4795 | 273.2745 | 0.9971

Interpolation 2d - 1 thread(s)

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 320, 320] | [256, 256] | True | False | 1.0240 | 0.4145 | 2.4705
[1, 3, 320, 320] | [512, 512] | True | False | 4.0771 | 1.3836 | 2.9467
[1, 3, 320, 320] | [256, 256] | False | False | 0.9771 | 0.3270 | 2.9878
[1, 3, 320, 320] | [512, 512] | False | False | 4.1732 | 1.2209 | 3.4180
[1, 3, 320, 320] | [256, 256] | False | True | 1.5466 | 1.5363 | 1.0067
[1, 3, 320, 320] | [512, 512] | False | True | 6.1555 | 6.1199 | 1.0058
[32, 128, 64, 64] | [32, 32] | False | True | 27.6362 | 27.5901 | 1.0017
[32, 128, 64, 64] | [128, 128] | False | True | 468.6442 | 465.5163 | 1.0067
[32, 128, 64, 64] | [32, 32] | True | False | 20.1495 | 10.0694 | 2.0011
[32, 128, 64, 64] | [128, 128] | True | False | 400.0401 | 204.0662 | 1.9603
[1, 3, 500, 500] | [256, 256] | True | False | 0.8956 | 0.3366 | 2.6606
[1, 3, 500, 500] | [800, 800] | True | False | 8.6554 | 2.9530 | 2.9310
[1, 3, 500, 500] | [256, 256] | False | False | 1.0921 | 0.3385 | 3.2263
[1, 3, 500, 500] | [800, 800] | False | False | 8.9594 | 2.9627 | 3.0241

Interpolation 1d - 1 thread(s)

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[4, 512, 320] | 256 | True | False | 1.5233 | 0.5027 | 3.0301
[4, 512, 320] | 512 | True | False | 3.0302 | 0.9735 | 3.1128

Interpolation 3d - 1 thread(s)

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 16, 320, 320] | [8, 256, 256] | True | False | 12.0477 | 11.3196 | 1.0643
[1, 3, 16, 320, 320] | [32, 512, 512] | True | False | 222.8618 | 209.9955 | 1.0613
[1, 3, 16, 320, 320] | [8, 256, 256] | False | True | 17.9883 | 17.9937 | 0.9997
[1, 3, 16, 320, 320] | [32, 512, 512] | False | True | 380.7244 | 380.1916 | 1.0014

Versions and build configs

PyTorch master: 1.9.0.dev20210223
PyTorch master build setting:

BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

PR : 1.9.0a0+74b172b
PR build setting:

BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/g++-7, CXX_FLAGS=-O3 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON,

This description is based on the benchmarks and the code from https://github.com/vfdev-5/interpolate-tensoriterator/tree/master/step_six.

TL;DR

  • Generic linear upsampling implementation using TensorIterator for the Nd case (a single loop function covers the 1d, 2d and 3d cases)
    • can be generalized to the nearest and bicubic interpolation modes.
  • Works for both channels-first and channels-last layouts.
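
A minimal pure-Python sketch of the idea behind the generic linear path, assuming the usual half-pixel source-index rule for align_corners=False. The names `linear_indices_weights` and `upsample_linear_1d` are illustrative and do not come from this PR. Indices and weights are computed once per output position, and the same 1d rule would be applied per dimension in the Nd case.

```python
import math


def linear_indices_weights(in_size, out_size):
    """Per-output (i0, i1, w0, w1) for 1d linear interpolation.

    Assumes the half-pixel rule (align_corners=False):
    src = scale * (dst + 0.5) - 0.5, clamped at 0.
    """
    scale = in_size / out_size
    table = []
    for dst in range(out_size):
        src = max(scale * (dst + 0.5) - 0.5, 0.0)
        i0 = min(int(math.floor(src)), in_size - 1)
        i1 = min(i0 + 1, in_size - 1)  # clamp at the right border
        w1 = src - i0
        table.append((i0, i1, 1.0 - w1, w1))
    return table


def upsample_linear_1d(values, out_size):
    """Resample a list of floats to out_size using the precomputed table."""
    table = linear_indices_weights(len(values), out_size)
    return [w0 * values[i0] + w1 * values[i1] for i0, i1, w0, w1 in table]


print(upsample_linear_1d([0.0, 10.0], 4))
```

Precomputing the table keeps the inner loop to two loads and a weighted sum per output element, which is the part a vectorized kernel would then optimize.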

Joint work with Francisco Massa (@fmassa).

@facebook-github-bot
Contributor

facebook-github-bot commented Feb 3, 2021

💊 CI failures summary and remediations

As of commit 74b172b (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


@vfdev-5 vfdev-5 marked this pull request as draft February 3, 2021 17:54
@vfdev-5 vfdev-5 marked this pull request as ready for review February 10, 2021 16:40
@VitalyFedyunin
Contributor

@fmassa let me know if you cannot review it.

@vfdev-5 please benchmark non-contiguous cases

@vfdev-5
Collaborator Author

vfdev-5 commented Feb 10, 2021

@VitalyFedyunin there is already a channels-last (non-contiguous) case here, which is routed to the original vectorized implementation.

Are you thinking of particular cases such as torch.rand(b, c, H, W)[:, :, :h, :w]?
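
To make the distinction between these layouts concrete, here is a small pure-Python sketch of the stride arithmetic involved. It does not call torch, and the helper names are made up for illustration. A sliced view keeps the strides of its parent tensor, so it is non-contiguous in a different way from a channels-last tensor.

```python
def contiguous_strides(shape):
    """Row-major (channels-first) strides, in elements."""
    strides, step = [], 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def channels_last_strides(shape):
    """Strides of an (N, C, H, W) tensor stored as NHWC in memory."""
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)


def is_contiguous(shape, strides):
    """True if strides match the row-major layout (size-1 dims ignored)."""
    expected = 1
    for size, stride in zip(reversed(shape), reversed(strides)):
        if size != 1:
            if stride != expected:
                return False
            expected *= size
    return True


full = (1, 3, 320, 320)  # shape of the parent tensor, (b, c, H, W)
view = (1, 3, 256, 256)  # shape after [:, :, :h, :w]; strides stay the parent's

print(is_contiguous(full, contiguous_strides(full)))
print(is_contiguous(view, contiguous_strides(full)))
print(is_contiguous(full, channels_last_strides(full)))
```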

@VitalyFedyunin VitalyFedyunin added the triaged label Feb 10, 2021
@fmassa
Member

fmassa commented Feb 11, 2021

@VitalyFedyunin I'll review this PR today on Monday

@vfdev-5
Collaborator Author

vfdev-5 commented Feb 15, 2021

@VitalyFedyunin as discussed with @fmassa, I'd like to mention a performance drawback of the dispatch stub in our particular case of 3d linear interpolation.
I compared the execution times of my prototype code (used for development and as the codebase for this PR) and of PyTorch built with this PR:

---- Benchmark 3D ----
Input tensor: [1, 3, 16, 320, 320]
Num threads: 6

- Bench upsample_trilinear3d_cpu (2000 rounds) - downsampling to [8, 256, 256]
Elapsed time (ms): 2.11485
- Bench ti_upsample_trilinear3d_kernel_impl (2000 rounds) - downsampling to [8, 256, 256]
Elapsed time (ms): 1.05858

- Bench upsample_trilinear3d_cpu (2000 rounds) - upsampling to [32, 512, 512]
Elapsed time (ms): 44.3299
- Bench ti_upsample_trilinear3d_kernel_impl (2000 rounds) - upsampling to [32, 512, 512]
Elapsed time (ms): 26.0051

I would expect roughly similar times for upsample_trilinear3d_cpu and ti_upsample_trilinear3d_kernel_impl, as they run more or less the same code.

I also tried registering the dispatch manually, as in master...Quansight:upsample-tensor-iterator-another-dispatch (see aten/src/ATen/native/UpSampleTrilinear3d.cpp):

REGISTER_ARCH_DISPATCH(upsample_trilinear3d_kernel, DEFAULT, &upsample_trilinear3d_kernel_impl);
REGISTER_AVX_DISPATCH(upsample_trilinear3d_kernel, &upsample_trilinear3d_kernel_impl);
REGISTER_AVX2_DISPATCH(upsample_trilinear3d_kernel, &upsample_trilinear3d_kernel_impl);

and it looks like this restores the expected times:

---- Benchmark 3D ----
Input tensor: [1, 3, 16, 320, 320]
Num threads: 6

- Bench upsample_trilinear3d_cpu (750 rounds) - downsampling to [8, 256, 256]
Elapsed time (ms): 1.05641
- Bench ti_upsample_trilinear3d_kernel_impl (750 rounds) - downsampling to [8, 256, 256]
Elapsed time (ms): 1.05557
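
As a rough mental model of what per-capability registration does, here is a toy Python registry. Everything in it is hypothetical: the `KernelStub` class, the capability names as strings and the selection rule only mirror the idea of registering one implementation for DEFAULT, AVX and AVX2. It is not how PyTorch's dispatch is implemented.

```python
class KernelStub:
    """Toy registry: one kernel per CPU capability, best available wins."""

    # Ordered from least to most capable (hypothetical ordering).
    LEVELS = ("DEFAULT", "AVX", "AVX2")

    def __init__(self, name):
        self.name = name
        self._kernels = {}

    def register(self, level, fn):
        if level not in self.LEVELS:
            raise ValueError(f"unknown capability: {level}")
        self._kernels[level] = fn

    def __call__(self, cpu_level, *args):
        # Walk down from the CPU's capability to the first registered kernel.
        start = self.LEVELS.index(cpu_level)
        for level in reversed(self.LEVELS[: start + 1]):
            if level in self._kernels:
                return self._kernels[level](*args)
        raise RuntimeError(f"no kernel registered for {self.name}")


def upsample_kernel_impl(x):
    return [2 * v for v in x]  # placeholder body, not real interpolation


stub = KernelStub("upsample_trilinear3d_kernel")
for level in KernelStub.LEVELS:
    # The same function is registered for every capability level.
    stub.register(level, upsample_kernel_impl)

print(stub("AVX2", [1, 2, 3]))
```

Registering one function for every level means the same kernel is selected whatever the CPU reports, which is the effect the three REGISTER_* lines aim for.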

Again, this slowdown happens only in the 3D case, where the code unrolls a template loop over the dimensions:
https://github.com/Quansight/pytorch/blob/64516dbc7dc9891f1702954a741ff756463799a0/aten/src/ATen/native/cpu/UpSampleMoreKernel.cpp#L457-L460
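
The linked lines are not reproduced in this thread. As a hypothetical pure-Python analogue of interpolating over the dimensions one at a time, `interp_nd` below recurses once per dimension. The function name and the nested-list layout are assumptions made for illustration, and a C++ template version would resolve the recursion at compile time rather than at run time.

```python
def interp_nd(data, coords):
    """Linearly interpolate nested lists at fractional coords.

    Each coordinate is assumed to lie in [0, size - 1] for its dimension.
    """
    if not coords:
        return data  # all dimensions consumed: data is a scalar
    x, rest = coords[0], coords[1:]
    last = len(data) - 1
    i0 = min(int(x), last)
    i1 = min(i0 + 1, last)  # clamp at the border
    w1 = x - i0
    return (1.0 - w1) * interp_nd(data[i0], rest) + w1 * interp_nd(data[i1], rest)


grid_2d = [[0.0, 1.0], [2.0, 3.0]]
grid_3d = [grid_2d, [[4.0, 5.0], [6.0, 7.0]]]
print(interp_nd(grid_2d, (0.5, 0.5)))
print(interp_nd(grid_3d, (0.5, 0.5, 0.5)))
```

Each extra dimension doubles the number of input values read per output value (2, 4, then 8 for the 3d case), so the 3d variant has the deepest recursion to flatten.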

PS: I also updated the full results with more test cases, including non-contiguous ones.

@codecov

codecov bot commented Feb 15, 2021

Codecov Report

Merging #51653 (74b172b) into master (64847c7) will increase coverage by 0.00%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #51653   +/-   ##
=======================================
  Coverage   80.79%   80.79%           
=======================================
  Files        1972     1972           
  Lines      216093   216093           
=======================================
+ Hits       174586   174587    +1     
+ Misses      41507    41506    -1     

Member

@fmassa fmassa left a comment

I've made some minor comments about the structure of the code.

It would be great to understand the slowdown when integrating this implementation into PyTorch for the 3d case.

@fmassa
Member

fmassa commented Feb 15, 2021

Additionally, @vfdev-5 can you maybe put the performance results in a table which is linked in the PR? This will make it easier to understand what's going on.

Something like

2d - 1 thread

config master this PR speed-up
[1x3x256x256] -> [1x3x500x500] xxx xxx xxx
[1x3x256x256] -> [1x3x128x128] xxx xxx xxx
...

2d - 6 threads

config master this PR speed-up
[1x3x256x256] -> [1x3x500x500] xxx xxx xxx
[1x3x256x256] -> [1x3x128x128] xxx xxx xxx
...

3d

config master this PR speed-up
[1x3x8x256x256] -> [1x3x8x500x500] xxx xxx xxx
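
A small sketch of how rows in this layout could be produced, assuming speed-up is defined as the master time divided by the PR time (which agrees with the ratios reported in the description up to rounding). The helper names are illustrative.

```python
def speedup(master_time, pr_time):
    """Ratio > 1 means the PR is faster than master."""
    return master_time / pr_time


def table_row(config, master_time, pr_time):
    """Format one row in the 'config master this PR speed-up' layout."""
    ratio = speedup(master_time, pr_time)
    return f"{config} {master_time:.4f} {pr_time:.4f} {ratio:.4f}"


# Timings taken from the 2d, 6-thread results in the PR description.
print(table_row("[1x3x320x320] -> [1x3x512x512]", 1.5585, 0.4105))
```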

@vfdev-5
Collaborator Author

vfdev-5 commented Feb 22, 2021

Results for 149b976, i.e. with TensorIterator used instead of cpu_upsample_linear_channels_last.
The TensorIterator implementation is faster in some cases and slower in others (see the speed-ups < 1.0):

Interpolation 2d - 6 thread(s)

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 320, 320] | [256, 256] | True | False | 0.3261 | 0.1239 | 2.6318
[1, 3, 320, 320] | [512, 512] | True | False | 1.2854 | 0.4086 | 3.1458
[1, 3, 320, 320] | [256, 256] | False | False | 0.3488 | 0.0777 | 4.4919
[1, 3, 320, 320] | [512, 512] | False | False | 1.3063 | 0.4084 | 3.1984
[1, 3, 320, 320] | [256, 256] | False | True | 1.0897 | 0.3289 | 3.3132
[1, 3, 320, 320] | [512, 512] | False | True | 4.2505 | 1.3841 | 3.0711
[32, 128, 64, 64] | [32, 32] | False | True | 2.2961 | 2.9314 | 0.7833
[32, 128, 64, 64] | [128, 128] | False | True | 35.9384 | 35.6293 | 1.0087
[32, 128, 64, 64] | [32, 32] | True | False | 3.6902 | 3.5451 | 1.0409
[32, 128, 64, 64] | [128, 128] | True | False | 86.7835 | 52.4501 | 1.6546
[1, 3, 500, 500] | [256, 256] | True | False | 0.3266 | 0.0785 | 4.1601
[1, 3, 500, 500] | [800, 800] | True | False | 3.1868 | 0.5580 | 5.7114
[1, 3, 500, 500] | [256, 256] | False | False | 0.3771 | 0.0793 | 4.7555
[1, 3, 500, 500] | [800, 800] | False | False | 3.2693 | 0.5610 | 5.8271

Interpolation 1d - 6 thread(s)

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[4, 512, 320] | 256 | True | False | 0.2808 | 0.1044 | 2.6907
[4, 512, 320] | 512 | True | False | 0.5524 | 0.1887 | 2.9269

Interpolation 3d - 6 thread(s)

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 16, 320, 320] | [8, 256, 256] | True | False | 4.4017 | 0.9578 | 4.5958
[1, 3, 16, 320, 320] | [32, 512, 512] | True | False | 84.0302 | 24.0669 | 3.4915
[1, 3, 16, 320, 320] | [8, 256, 256] | False | True | 13.6098 | 3.0288 | 4.4934
[1, 3, 16, 320, 320] | [32, 512, 512] | False | True | 246.6380 | 64.3400 | 3.8334

Interpolation 2d - 1 thread(s)

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 320, 320] | [256, 256] | True | False | 0.8967 | 0.4551 | 1.9703
[1, 3, 320, 320] | [512, 512] | True | False | 3.5399 | 1.7594 | 2.0120
[1, 3, 320, 320] | [256, 256] | False | False | 0.9760 | 0.3305 | 2.9531
[1, 3, 320, 320] | [512, 512] | False | False | 3.6266 | 1.7643 | 2.0555
[1, 3, 320, 320] | [256, 256] | False | True | 1.0093 | 1.6589 | 0.6084
[1, 3, 320, 320] | [512, 512] | False | True | 4.0231 | 7.1302 | 0.5642
[32, 128, 64, 64] | [32, 32] | False | True | 5.8736 | 9.6382 | 0.6094
[32, 128, 64, 64] | [128, 128] | False | True | 108.2541 | 117.1183 | 0.9243
[32, 128, 64, 64] | [32, 32] | True | False | 19.9122 | 14.0883 | 1.4134
[32, 128, 64, 64] | [128, 128] | True | False | 398.8196 | 205.5317 | 1.9404
[1, 3, 500, 500] | [256, 256] | True | False | 0.8944 | 0.3388 | 2.6404
[1, 3, 500, 500] | [800, 800] | True | False | 8.6327 | 2.9568 | 2.9196
[1, 3, 500, 500] | [256, 256] | False | False | 1.0921 | 0.3405 | 3.2076
[1, 3, 500, 500] | [800, 800] | False | False | 8.9394 | 2.9654 | 3.0145

Interpolation 1d - 1 thread(s)

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[4, 512, 320] | 256 | True | False | 1.5233 | 0.5066 | 3.0071
[4, 512, 320] | 512 | True | False | 3.0312 | 0.9796 | 3.0943

Interpolation 3d - 1 thread(s)

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 16, 320, 320] | [8, 256, 256] | True | False | 12.0408 | 4.8498 | 2.4827
[1, 3, 16, 320, 320] | [32, 512, 512] | True | False | 222.8379 | 105.1315 | 2.1196
[1, 3, 16, 320, 320] | [8, 256, 256] | False | True | 13.3036 | 17.2361 | 0.7718
[1, 3, 16, 320, 320] | [32, 512, 512] | False | True | 245.9575 | 297.0317 | 0.8281

Versions and build configs

PyTorch master: 1.8.0.dev20210208+cu110
PyTorch master build setting:

BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.0, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

PR : 1.9.0a0+149b976
PR build setting:

BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/g++-7, CXX_FLAGS=-O3 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON,

Member

@fmassa fmassa left a comment

Looks great, thanks a lot!

Contributor

@facebook-github-bot facebook-github-bot left a comment

@fmassa has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin
Contributor

Please apply the REGISTER_AVX2_DISPATCH workaround while I look into the dispatch issue (as it might take additional time).

@fmassa
Member

fmassa commented Feb 25, 2021

I propose we do the following: given that the current PR already gives a significant speedup for the 3d case (despite the issues with REGISTER_DISPATCH), I would say to merge the PR as is, while keeping in mind that we can still obtain a 2x speed improvement on top of it when we fix the REGISTER_AVX2_DISPATCH situation.
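
The "2x" figure can be checked against the prototype timings posted earlier in this thread (6 threads, input [1, 3, 16, 320, 320]). The snippet below only redoes that arithmetic, and the variable names are ad hoc.

```python
# (time via upsample_trilinear3d_cpu, time via the direct
#  ti_upsample_trilinear3d_kernel_impl call), in ms, copied from the thread.
timings_ms = {
    "downsampling to [8, 256, 256]": (2.11485, 1.05858),
    "upsampling to [32, 512, 512]": (44.3299, 26.0051),
}

gaps = {case: built / direct for case, (built, direct) in timings_ms.items()}
for case, gap in gaps.items():
    print(f"{case}: {gap:.2f}x")
```

On these numbers the gap is close to 2x for downsampling and about 1.7x for upsampling, so the remaining headroom depends on the case.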

@facebook-github-bot
Contributor

@fmassa merged this pull request in 66f07c0.

@vfdev-5 vfdev-5 deleted the upsample-tensor-iterator branch March 1, 2021 17:16
aocsa pushed a commit to Quansight/pytorch that referenced this pull request Mar 15, 2021
Summary:
Related to pytorch#10482

Description:

- Optimized bilinear interpolation for 1d, 2d, 3d cases using TensorIterator

<details>
<summary>
Interpolation 2d - 6 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 320, 320] | [256, 256] | True | False | 0.3938 | 0.0782 | 5.0339
[1, 3, 320, 320] | [512, 512] | True | False | 1.5585 | 0.4105 | 3.7965
[1, 3, 320, 320] | [256, 256] | False | False | 0.3481 | 0.0760 | 4.5780
[1, 3, 320, 320] | [512, 512] | False | False | 1.5848 | 0.4091 | 3.8734
[1, 3, 320, 320] | [256, 256] | False | True | 1.2058 | 1.2034 | 1.0020
[1, 3, 320, 320] | [512, 512] | False | True | 4.8691 | 4.8537 | 1.0032
[32, 128, 64, 64] | [32, 32] | False | True | 6.3915 | 6.4041 | 0.9980
[32, 128, 64, 64] | [128, 128] | False | True | 166.1769 | 164.5621 | 1.0098
[32, 128, 64, 64] | [32, 32] | True | False | 3.7194 | 2.4720 | 1.5046
[32, 128, 64, 64] | [128, 128] | True | False | 86.6704 | 52.3754 | 1.6548
[1, 3, 500, 500] | [256, 256] | True | False | 0.3270 | 0.0792 | 4.1307
[1, 3, 500, 500] | [800, 800] | True | False | 3.3116 | 0.5567 | 5.9482
[1, 3, 500, 500] | [256, 256] | False | False | 0.3763 | 0.0773 | 4.8700
[1, 3, 500, 500] | [800, 800] | False | False | 3.2577 | 0.5590 | 5.8279

</details>

<details>
<summary>
Interpolation 1d - 6 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[4, 512, 320] | 256 | True | False | 0.2795 | 0.1032 | 2.7089
[4, 512, 320] | 512 | True | False | 0.5533 | 0.1888 | 2.9303

</details>

<details>
<summary>
Interpolation 3d - 6 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 16, 320, 320] | [8, 256, 256] | True | False | 4.4105 | 2.1236 | 2.0769
[1, 3, 16, 320, 320] | [32, 512, 512] | True | False | 83.9426 | 42.6641 | 1.9675
[1, 3, 16, 320, 320] | [8, 256, 256] | False | True | 15.5736 | 15.5758 | 0.9999
[1, 3, 16, 320, 320] | [32, 512, 512] | False | True | 272.4795 | 273.2745 | 0.9971

</details>

<details>
<summary>
Interpolation 2d - 1 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 320, 320] | [256, 256] | True | False | 1.0240 | 0.4145 | 2.4705
[1, 3, 320, 320] | [512, 512] | True | False | 4.0771 | 1.3836 | 2.9467
[1, 3, 320, 320] | [256, 256] | False | False | 0.9771 | 0.3270 | 2.9878
[1, 3, 320, 320] | [512, 512] | False | False | 4.1732 | 1.2209 | 3.4180
[1, 3, 320, 320] | [256, 256] | False | True | 1.5466 | 1.5363 | 1.0067
[1, 3, 320, 320] | [512, 512] | False | True | 6.1555 | 6.1199 | 1.0058
[32, 128, 64, 64] | [32, 32] | False | True | 27.6362 | 27.5901 | 1.0017
[32, 128, 64, 64] | [128, 128] | False | True | 468.6442 | 465.5163 | 1.0067
[32, 128, 64, 64] | [32, 32] | True | False | 20.1495 | 10.0694 | 2.0011
[32, 128, 64, 64] | [128, 128] | True | False | 400.0401 | 204.0662 | 1.9603
[1, 3, 500, 500] | [256, 256] | True | False | 0.8956 | 0.3366 | 2.6606
[1, 3, 500, 500] | [800, 800] | True | False | 8.6554 | 2.9530 | 2.9310
[1, 3, 500, 500] | [256, 256] | False | False | 1.0921 | 0.3385 | 3.2263
[1, 3, 500, 500] | [800, 800] | False | False | 8.9594 | 2.9627 | 3.0241

</details>

<details>
<summary>
Interpolation 1d - 1 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[4, 512, 320] | 256 | True | False | 1.5233 | 0.5027 | 3.0301
[4, 512, 320] | 512 | True | False | 3.0302 | 0.9735 | 3.1128

</details>

<details>
<summary>
Interpolation 3d - 1 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 16, 320, 320] | [8, 256, 256] | True | False | 12.0477 | 11.3196 | 1.0643
[1, 3, 16, 320, 320] | [32, 512, 512] | True | False | 222.8618 | 209.9955 | 1.0613
[1, 3, 16, 320, 320] | [8, 256, 256] | False | True | 17.9883 | 17.9937 | 0.9997
[1, 3, 16, 320, 320] | [32, 512, 512] | False | True | 380.7244 | 380.1916 | 1.0014

</details>

<details>
<summary>
Versions and build configs
</summary>

PyTorch master: 1.9.0.dev20210223
PyTorch master build setting:
```
BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
```

PR : 1.9.0a0+74b172b
PR build setting:
```
BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/g++-7, CXX_FLAGS=-O3 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON,
```
</details>

This description is based on the benchmarks and the code from [here](https://github.com/vfdev-5/interpolate-tensoriterator/tree/master/step_six).

TL;DR
- Linear upsampling generic implementation using TensorIterator for Nd case (single loop function for 1d, 2d and 3d cases)
  - can be generalized to nearest, bicubic interpolation modes.
- works for channels first and last cases.

Joint work with Francisco Massa (fmassa).

Pull Request resolved: pytorch#51653

Reviewed By: malfet

Differential Revision: D26619437

Pulled By: fmassa

fbshipit-source-id: 7d435e23881c5b40a18bf0dbcab4906d5462025f
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
Summary:
Related to pytorch#10482

Description:

- Optimized bilinear interpolation for 1d, 2d, 3d cases using TensorIterator

<details>
<summary>
Interpolation 2d - 6 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 320, 320] | [256, 256] | True | False | 0.3938 | 0.0782 | 5.0339
[1, 3, 320, 320] | [512, 512] | True | False | 1.5585 | 0.4105 | 3.7965
[1, 3, 320, 320] | [256, 256] | False | False | 0.3481 | 0.0760 | 4.5780
[1, 3, 320, 320] | [512, 512] | False | False | 1.5848 | 0.4091 | 3.8734
[1, 3, 320, 320] | [256, 256] | False | True | 1.2058 | 1.2034 | 1.0020
[1, 3, 320, 320] | [512, 512] | False | True | 4.8691 | 4.8537 | 1.0032
[32, 128, 64, 64] | [32, 32] | False | True | 6.3915 | 6.4041 | 0.9980
[32, 128, 64, 64] | [128, 128] | False | True | 166.1769 | 164.5621 | 1.0098
[32, 128, 64, 64] | [32, 32] | True | False | 3.7194 | 2.4720 | 1.5046
[32, 128, 64, 64] | [128, 128] | True | False | 86.6704 | 52.3754 | 1.6548
[1, 3, 500, 500] | [256, 256] | True | False | 0.3270 | 0.0792 | 4.1307
[1, 3, 500, 500] | [800, 800] | True | False | 3.3116 | 0.5567 | 5.9482
[1, 3, 500, 500] | [256, 256] | False | False | 0.3763 | 0.0773 | 4.8700
[1, 3, 500, 500] | [800, 800] | False | False | 3.2577 | 0.5590 | 5.8279

</details>

<details>
<summary>
Interpolation 1d - 6 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[4, 512, 320] | 256 | True | False | 0.2795 | 0.1032 | 2.7089
[4, 512, 320] | 512 | True | False | 0.5533 | 0.1888 | 2.9303

</details>

<details>
<summary>
Interpolation 3d - 6 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 16, 320, 320] | [8, 256, 256] | True | False | 4.4105 | 2.1236 | 2.0769
[1, 3, 16, 320, 320] | [32, 512, 512] | True | False | 83.9426 | 42.6641 | 1.9675
[1, 3, 16, 320, 320] | [8, 256, 256] | False | True | 15.5736 | 15.5758 | 0.9999
[1, 3, 16, 320, 320] | [32, 512, 512] | False | True | 272.4795 | 273.2745 | 0.9971

</details>

<details>
<summary>
Interpolation 2d - 1 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 320, 320] | [256, 256] | True | False | 1.0240 | 0.4145 | 2.4705
[1, 3, 320, 320] | [512, 512] | True | False | 4.0771 | 1.3836 | 2.9467
[1, 3, 320, 320] | [256, 256] | False | False | 0.9771 | 0.3270 | 2.9878
[1, 3, 320, 320] | [512, 512] | False | False | 4.1732 | 1.2209 | 3.4180
[1, 3, 320, 320] | [256, 256] | False | True | 1.5466 | 1.5363 | 1.0067
[1, 3, 320, 320] | [512, 512] | False | True | 6.1555 | 6.1199 | 1.0058
[32, 128, 64, 64] | [32, 32] | False | True | 27.6362 | 27.5901 | 1.0017
[32, 128, 64, 64] | [128, 128] | False | True | 468.6442 | 465.5163 | 1.0067
[32, 128, 64, 64] | [32, 32] | True | False | 20.1495 | 10.0694 | 2.0011
[32, 128, 64, 64] | [128, 128] | True | False | 400.0401 | 204.0662 | 1.9603
[1, 3, 500, 500] | [256, 256] | True | False | 0.8956 | 0.3366 | 2.6606
[1, 3, 500, 500] | [800, 800] | True | False | 8.6554 | 2.9530 | 2.9310
[1, 3, 500, 500] | [256, 256] | False | False | 1.0921 | 0.3385 | 3.2263
[1, 3, 500, 500] | [800, 800] | False | False | 8.9594 | 2.9627 | 3.0241

</details>

<details>
<summary>
Interpolation 1d - 1 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[4, 512, 320] | 256 | True | False | 1.5233 | 0.5027 | 3.0301
[4, 512, 320] | 512 | True | False | 3.0302 | 0.9735 | 3.1128

</details>

<details>
<summary>
Interpolation 3d - 1 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 16, 320, 320] | [8, 256, 256] | True | False | 12.0477 | 11.3196 | 1.0643
[1, 3, 16, 320, 320] | [32, 512, 512] | True | False | 222.8618 | 209.9955 | 1.0613
[1, 3, 16, 320, 320] | [8, 256, 256] | False | True | 17.9883 | 17.9937 | 0.9997
[1, 3, 16, 320, 320] | [32, 512, 512] | False | True | 380.7244 | 380.1916 | 1.0014

</details>

<details>
<summary>
Versions and build configs
</summary>

PyTorch master: 1.9.0.dev20210223
PyTorch master build setting:
```
BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
```

PR : 1.9.0a0+74b172b
PR build setting:
```
BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/g++-7, CXX_FLAGS=-O3 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON,
```
</details>

This description is based on the benchmarks and the code from [here](https://github.com/vfdev-5/interpolate-tensoriterator/tree/master/step_six).

TL;DR
- Generic linear upsampling implementation using TensorIterator for the Nd case (a single loop function covers the 1d, 2d and 3d cases)
  - can be generalized to the nearest and bicubic interpolation modes.
- Works for both channels first and channels last inputs.
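
The idea behind the TL;DR can be sketched in plain Python. This is only an illustration under stated assumptions, not the C++ kernel from this PR: the helper names (`linear_taps`, `resize_1d`, `resize_2d`) are hypothetical, and the source-index mapping assumes the half-pixel convention (`align_corners=False`). Every output position along a dimension gets two precomputed (index, weight) pairs, and the same 1d routine is then reused for each spatial dimension, which is one way to see why a single loop can serve the 1d, 2d and 3d cases.

```python
import math


def linear_taps(in_size, out_size):
    """Precompute two (source index, weight) pairs per output index."""
    scale = in_size / out_size
    taps = []
    for dst in range(out_size):
        # Half-pixel mapping (assumed align_corners=False), clamped at 0.
        src = max(scale * (dst + 0.5) - 0.5, 0.0)
        i0 = min(int(math.floor(src)), in_size - 1)
        i1 = min(i0 + 1, in_size - 1)  # stay inside the input at the border
        w1 = src - i0
        taps.append(((i0, 1.0 - w1), (i1, w1)))
    return taps


def resize_1d(row, out_size):
    """Linear interpolation of a 1d sequence using the precomputed taps."""
    taps = linear_taps(len(row), out_size)
    return [sum(w * row[i] for i, w in pair) for pair in taps]


def resize_2d(image, out_h, out_w):
    """Bilinear resize: interpolate along width, then along height."""
    rows = [resize_1d(r, out_w) for r in image]
    cols = [resize_1d(list(c), out_h) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


if __name__ == "__main__":
    print(resize_1d([0.0, 10.0], 4))
    for line in resize_2d([[0.0, 10.0], [20.0, 30.0]], 4, 4):
        print(line)
```

The indices and weights depend only on the input and output sizes, so they can be computed once per dimension and shared by every batch element and channel.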

Joint work with Francisco Massa (fmassa).

Pull Request resolved: pytorch#51653

Reviewed By: malfet

Differential Revision: D26619437

Pulled By: fmassa

fbshipit-source-id: 7d435e23881c5b40a18bf0dbcab4906d5462025f
facebook-github-bot pushed a commit that referenced this pull request Apr 6, 2021
…2d/3d channels last impl) (#54500)

Summary:
Related to #10482

A follow-up PR to #51653

Description:
- Replaces the nearest/linear/cubic implementations with the generic interpolation implementation
- Retains the original 2d/3d channels last implementation because the generic one is slower with 1 thread (see the appendix below)

Speed-ups are observed in the following cases:
- upsample_nearest channels first
- upsample_bicubic channels first/last
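
A rough way to picture a generic implementation shared by the three modes: each mode only changes how many (index, weight) pairs are produced per output position (1 for nearest, 2 for linear, 4 for cubic), while the accumulation loop stays the same. The sketch below is a 1d pure-Python illustration with hypothetical names (`cubic_coeffs`, `taps`, `interpolate_1d`); the half-pixel index mapping and the cubic convolution coefficient `a = -0.75` are assumptions, and the actual kernel in the PR is written in C++ on top of TensorIterator.

```python
import math


def cubic_coeffs(t, a=-0.75):
    """Cubic convolution weights for neighbours at offsets -1, 0, 1, 2."""
    def near(x):  # kernel branch for |x| <= 1
        return ((a + 2.0) * x - (a + 3.0)) * x * x + 1.0

    def far(x):  # kernel branch for 1 < |x| < 2
        return ((a * x - 5.0 * a) * x + 8.0 * a) * x - 4.0 * a

    return [far(t + 1.0), near(t), near(1.0 - t), far(2.0 - t)]


def taps(mode, in_size, out_size, dst):
    """Return the (source index, weight) pairs for one output position."""
    scale = in_size / out_size

    def clamp(i):
        return min(max(i, 0), in_size - 1)

    if mode == "nearest":
        return [(clamp(int(math.floor(dst * scale))), 1.0)]
    src = scale * (dst + 0.5) - 0.5  # half-pixel mapping (assumed)
    if mode == "linear":
        src = max(src, 0.0)
        i0 = int(math.floor(src))
        t = src - i0
        return [(clamp(i0), 1.0 - t), (clamp(i0 + 1), t)]
    if mode == "cubic":
        i0 = int(math.floor(src))
        t = src - i0
        return [(clamp(i0 + k - 1), w) for k, w in enumerate(cubic_coeffs(t))]
    raise ValueError(f"unknown mode: {mode}")


def interpolate_1d(row, out_size, mode):
    """Mode-independent loop: a weighted sum over the taps of each output."""
    n = len(row)
    return [sum(w * row[i] for i, w in taps(mode, n, out_size, d))
            for d in range(out_size)]


if __name__ == "__main__":
    data = [0.0, 10.0, 20.0, 30.0]
    for mode in ("nearest", "linear", "cubic"):
        print(mode, interpolate_1d(data, 8, mode))
```

Only `taps` knows about the interpolation mode; `interpolate_1d` is identical for all three.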

### Results for this PR

<details>
<summary>

Benchmark results between 8518b0e (master) and 73137d8 (this PR)

</summary>

```
Description:
- 20210331-092940_pth_nightly_results_1.9.0a0+git8518b0e.6
- 20210331-092940_pth_nightly_results_1.9.0a0+git8518b0e.1
- 20210331-092940_pr_results_1.9.0a0+git73137d8.6
- 20210331-092940_pr_results_1.9.0a0+git73137d8.1

[---------- upsample_bilinear2d channels_first contiguous torch.float32 ----------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          331.8       |          334.6
      [1, 3, 320, 320] -> (512, 512)   |         1261.7       |         1271.5
      [32, 128, 64, 64] -> (32, 32)    |        10164.6       |        10251.4
      [32, 128, 64, 64] -> (128, 128)  |       195966.1       |       197141.8
      [1, 3, 500, 500] -> (256, 256)   |          347.7       |          348.3
      [1, 3, 500, 500] -> (800, 800)   |         3044.9       |         3071.4
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |           76.1       |           77.0
      [1, 3, 320, 320] -> (512, 512)   |          244.8       |          247.6
      [32, 128, 64, 64] -> (32, 32)    |         2329.4       |         2315.8
      [32, 128, 64, 64] -> (128, 128)  |        47855.3       |        49047.7
      [1, 3, 500, 500] -> (256, 256)   |           78.1       |           78.7
      [1, 3, 500, 500] -> (800, 800)   |          569.3       |          575.6

Times are in microseconds (us).

[------- upsample_bilinear2d channels_first non-contiguous torch.float32 --------]
                                      |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |         339.0        |         340.3
      [1, 3, 320, 320] -> (512, 512)  |        1266.1        |        1277.3
      [1, 3, 500, 500] -> (256, 256)  |         348.8        |         351.3
      [1, 3, 500, 500] -> (800, 800)  |        3054.5        |        3077.3
6 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |          76.6        |          77.4
      [1, 3, 320, 320] -> (512, 512)  |         246.0        |         248.1
      [1, 3, 500, 500] -> (256, 256)  |          78.3        |          79.5
      [1, 3, 500, 500] -> (800, 800)  |         572.2        |         580.0

Times are in microseconds (us).

[--------- upsample_bilinear2d channels_last non-contiguous torch.float32 --------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         965.4        |         964.9
      [1, 3, 320, 320] -> (512, 512)   |        3856.2        |        3866.8
      [32, 128, 64, 64] -> (32, 32)    |        5808.3        |        5812.8
      [32, 128, 64, 64] -> (128, 128)  |       99575.2        |       97226.2
      [2, 128, 64, 46] -> (32, 32)     |         110.5        |         109.0
      [2, 128, 64, 46] -> (128, 128)   |        1662.3        |        1612.0
      [1, 128, 64, 46] -> (32, 32)     |          55.6        |          55.5
      [1, 128, 64, 46] -> (128, 128)   |         467.0        |         463.9
      [1, 3, 500, 500] -> (256, 256)   |         967.7        |         966.7
      [1, 3, 500, 500] -> (800, 800)   |        9394.7        |        9436.6
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         962.2        |         965.4
      [1, 3, 320, 320] -> (512, 512)   |        3844.3        |        3844.3
      [32, 128, 64, 64] -> (32, 32)    |        2270.0        |        2267.6
      [32, 128, 64, 64] -> (128, 128)  |       31909.7        |       32106.5
      [2, 128, 64, 46] -> (32, 32)     |          61.3        |          59.9
      [2, 128, 64, 46] -> (128, 128)   |         912.3        |         893.5
      [1, 128, 64, 46] -> (32, 32)     |          55.5        |          55.3
      [1, 128, 64, 46] -> (128, 128)   |         467.0        |         466.4
      [1, 3, 500, 500] -> (256, 256)   |         967.2        |         971.1
      [1, 3, 500, 500] -> (800, 800)   |        9383.2        |        9417.4

Times are in microseconds (us).

[------ upsample_linear1d channels_first contiguous torch.float32 -------]
                              |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |        513.5         |         521.8
      [4, 512, 320] -> [512]  |        999.0         |        1011.8
6 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |        103.7         |         104.9
      [4, 512, 320] -> [512]  |        192.2         |         194.9

Times are in microseconds (us).

[------------- upsample_trilinear3d channels_first contiguous torch.float32 -------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |          5.4         |          5.5
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |        111.2         |        111.1
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |          1.1         |          1.0
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |         23.4         |         23.2

Times are in milliseconds (ms).

[----------- upsample_trilinear3d channels_last non-contiguous torch.float32 ------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |        13521.9       |        12939.9
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |       244561.3       |       236595.6
      [1, 16, 32, 64, 64] -> [16, 32, 32]     |          362.2       |          365.5
      [1, 16, 32, 64, 64] -> [64, 128, 128]   |        38141.4       |        37957.7
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |        12980.4       |        12962.7
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |       236256.4       |       236364.5
      [1, 16, 32, 64, 64] -> [16, 32, 32]     |          367.9       |          393.2
      [1, 16, 32, 64, 64] -> [64, 128, 128]   |        38222.5       |        38198.3

Times are in microseconds (us).

[----------- upsample_nearest2d channels_first contiguous torch.float32 ----------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         1205.7       |          107.2
      [1, 3, 320, 320] -> (512, 512)   |         4793.5       |          357.7
      [32, 128, 64, 64] -> (32, 32)    |        26550.0       |         6227.1
      [32, 128, 64, 64] -> (128, 128)  |       341140.3       |       116404.4
      [1, 3, 500, 500] -> (256, 256)   |         1208.6       |          122.9
      [1, 3, 500, 500] -> (800, 800)   |        11648.0       |          848.1
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          220.5       |           32.6
      [1, 3, 320, 320] -> (512, 512)   |          865.4       |           78.1
      [32, 128, 64, 64] -> (32, 32)    |         4890.9       |         2201.2
      [32, 128, 64, 64] -> (128, 128)  |        73533.8       |        32315.4
      [1, 3, 500, 500] -> (256, 256)   |          222.3       |           35.0
      [1, 3, 500, 500] -> (800, 800)   |         2107.5       |          170.7

Times are in microseconds (us).

[----------- upsample_nearest2d channels_first contiguous torch.uint8 -----------]
                                      |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |        1457.0        |         310.7
      [1, 3, 320, 320] -> (512, 512)  |        5808.0        |        1196.6
      [1, 3, 500, 500] -> (256, 256)  |        1460.9        |         312.7
      [1, 3, 500, 500] -> (800, 800)  |       14094.3        |        2903.5
6 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |         264.8        |          66.8
      [1, 3, 320, 320] -> (512, 512)  |        1046.0        |         228.9
      [1, 3, 500, 500] -> (256, 256)  |         266.0        |          68.0
      [1, 3, 500, 500] -> (800, 800)  |        2546.6        |         535.8

Times are in microseconds (us).

[-------- upsample_nearest2d channels_first non-contiguous torch.float32 --------]
                                      |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |        1284.3        |        109.9
      [1, 3, 320, 320] -> (512, 512)  |        4870.0        |        361.6
      [1, 3, 500, 500] -> (256, 256)  |        1482.8        |        123.3
      [1, 3, 500, 500] -> (800, 800)  |       12050.3        |        858.8
6 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |         240.2        |         32.8
      [1, 3, 320, 320] -> (512, 512)  |         886.1        |         78.4
      [1, 3, 500, 500] -> (256, 256)  |         274.9        |         34.9
      [1, 3, 500, 500] -> (800, 800)  |        2188.8        |        174.0

Times are in microseconds (us).

[--------- upsample_nearest2d channels_first non-contiguous torch.uint8 ---------]
                                      |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |        1501.9        |         312.2
      [1, 3, 320, 320] -> (512, 512)  |        5853.4        |        1202.1
      [1, 3, 500, 500] -> (256, 256)  |        1574.0        |         313.9
      [1, 3, 500, 500] -> (800, 800)  |       14210.2        |        2904.5
6 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |         277.2        |          67.2
      [1, 3, 320, 320] -> (512, 512)  |        1059.8        |         228.9
      [1, 3, 500, 500] -> (256, 256)  |         292.2        |          68.1
      [1, 3, 500, 500] -> (800, 800)  |        2574.4        |         536.2

Times are in microseconds (us).

[--------- upsample_nearest2d channels_last non-contiguous torch.float32 ---------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         746.0        |         751.1
      [1, 3, 320, 320] -> (512, 512)   |        2967.6        |        2979.2
      [32, 128, 64, 64] -> (32, 32)    |        3408.5        |        3379.0
      [32, 128, 64, 64] -> (128, 128)  |       90166.4        |       90023.0
      [2, 128, 64, 46] -> (32, 32)     |          74.8        |          74.5
      [2, 128, 64, 46] -> (128, 128)   |        1591.2        |        1594.3
      [1, 128, 64, 46] -> (32, 32)     |          39.3        |          39.2
      [1, 128, 64, 46] -> (128, 128)   |         420.3        |         419.1
      [1, 3, 500, 500] -> (256, 256)   |         751.6        |         756.3
      [1, 3, 500, 500] -> (800, 800)   |        7222.2        |        7268.6
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         144.9        |         140.1
      [1, 3, 320, 320] -> (512, 512)   |         560.7        |         540.6
      [32, 128, 64, 64] -> (32, 32)    |        1418.1        |        1418.6
      [32, 128, 64, 64] -> (128, 128)  |       28158.4        |       26411.4
      [2, 128, 64, 46] -> (32, 32)     |          18.4        |          17.8
      [2, 128, 64, 46] -> (128, 128)   |         532.3        |         552.0
      [1, 128, 64, 46] -> (32, 32)     |          13.9        |          13.6
      [1, 128, 64, 46] -> (128, 128)   |          81.3        |          82.9
      [1, 3, 500, 500] -> (256, 256)   |         145.9        |         141.6
      [1, 3, 500, 500] -> (800, 800)   |        1363.4        |        1316.2

Times are in microseconds (us).

[---------- upsample_nearest2d channels_last non-contiguous torch.uint8 ----------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         795.7        |         824.1
      [1, 3, 320, 320] -> (512, 512)   |        3163.4        |        3274.8
      [32, 128, 64, 64] -> (32, 32)    |         798.8        |         812.2
      [32, 128, 64, 64] -> (128, 128)  |       25259.6        |       25453.1
      [2, 128, 64, 46] -> (32, 32)     |          39.3        |          39.9
      [2, 128, 64, 46] -> (128, 128)   |         493.7        |         499.9
      [1, 128, 64, 46] -> (32, 32)     |          22.6        |          22.9
      [1, 128, 64, 46] -> (128, 128)   |         249.7        |         254.0
      [32, 64, 128, 64] -> (32, 32)    |         475.3        |         507.4
      [32, 64, 128, 64] -> (128, 128)  |       13709.7        |       13767.5
      [1, 3, 500, 500] -> (256, 256)   |         804.0        |         827.6
      [1, 3, 500, 500] -> (800, 800)   |        7764.9        |        7982.7
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         150.1        |         151.4
      [1, 3, 320, 320] -> (512, 512)   |         589.5        |         592.6
      [32, 128, 64, 64] -> (32, 32)    |         141.3        |         194.5
      [32, 128, 64, 64] -> (128, 128)  |        6916.5        |        7445.0
      [2, 128, 64, 46] -> (32, 32)     |          10.0        |          12.5
      [2, 128, 64, 46] -> (128, 128)   |          95.8        |         141.1
      [1, 128, 64, 46] -> (32, 32)     |           8.1        |          10.0
      [1, 128, 64, 46] -> (128, 128)   |          52.5        |          74.3
      [32, 64, 128, 64] -> (32, 32)    |          79.8        |         123.7
      [32, 64, 128, 64] -> (128, 128)  |        3639.9        |        4087.9
      [1, 3, 500, 500] -> (256, 256)   |         150.7        |         152.2
      [1, 3, 500, 500] -> (800, 800)   |        1430.9        |        1440.7

Times are in microseconds (us).

[------ upsample_nearest1d channels_first contiguous torch.float32 ------]
                              |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |        1601.7        |        241.7
      [4, 512, 320] -> [512]  |        3188.5        |        435.7
6 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |         291.9        |         53.3
      [4, 512, 320] -> [512]  |         577.8        |         88.1

Times are in microseconds (us).

[------- upsample_nearest1d channels_first contiguous torch.uint8 -------]
                              |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |        2010.1        |         532.3
      [4, 512, 320] -> [512]  |        3999.7        |        1011.4
6 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |         364.2        |         104.6
      [4, 512, 320] -> [512]  |         722.8        |         193.5

Times are in microseconds (us).

[-------------- upsample_nearest3d channels_first contiguous torch.float32 --------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |        14801.0       |         977.5
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |       217368.5       |       41577.3
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         2670.3       |         210.7
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |        42023.6       |       10971.6

Times are in microseconds (us).

[--------------- upsample_nearest3d channels_first contiguous torch.uint8 ---------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |        17151.7       |        3195.8
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |       221221.0       |       50524.5
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         3085.3       |         588.6
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |        39842.0       |        9141.0

Times are in microseconds (us).

[------------ upsample_nearest3d channels_last non-contiguous torch.float32 -------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         7694.1       |         7729.0
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |       138104.6       |       138158.0
      [1, 16, 32, 64, 64] -> [16, 32, 32]     |          251.1       |          252.4
      [1, 16, 32, 64, 64] -> [64, 128, 128]   |        28991.5       |        28882.8
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         1398.3       |         1402.6
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |        28056.5       |        28123.2
      [1, 16, 32, 64, 64] -> [16, 32, 32]     |           50.8       |           51.1
      [1, 16, 32, 64, 64] -> [64, 128, 128]   |         7595.7       |         7540.7

Times are in microseconds (us).

[------------- upsample_nearest3d channels_last non-contiguous torch.uint8 --------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         8147.8       |         8176.2
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |       114658.1       |       114992.7
      [1, 16, 32, 64, 64] -> [16, 32, 32]     |          364.3       |          356.0
      [1, 16, 32, 64, 64] -> [64, 128, 128]   |        17276.0       |        16331.0
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         1469.4       |         1476.1
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |        20647.1       |        20722.6
      [1, 16, 32, 64, 64] -> [16, 32, 32]     |           69.7       |           68.4
      [1, 16, 32, 64, 64] -> [64, 128, 128]   |         3125.7       |         2948.2

Times are in microseconds (us).

[----------- upsample_bicubic2d channels_first contiguous torch.float32 ----------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          5961.0      |         1680.2
      [1, 3, 320, 320] -> (512, 512)   |         23803.7      |         6591.0
      [32, 128, 64, 64] -> (32, 32)    |        620609.4      |        37981.6
      [32, 128, 64, 64] -> (128, 128)  |      10120286.1      |       646305.5
      [1, 3, 500, 500] -> (256, 256)   |          6005.4      |         1694.6
      [1, 3, 500, 500] -> (800, 800)   |         58271.9      |        16047.6
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          6218.5      |          347.1
      [1, 3, 320, 320] -> (512, 512)   |         24144.6      |         1253.4
      [32, 128, 64, 64] -> (32, 32)    |        612762.5      |         6934.8
      [32, 128, 64, 64] -> (128, 128)  |       9906221.2      |       127411.1
      [1, 3, 500, 500] -> (256, 256)   |          6241.9      |          350.2
      [1, 3, 500, 500] -> (800, 800)   |         59052.2      |         2984.8

Times are in microseconds (us).

[-------- upsample_bicubic2d channels_first non-contiguous torch.float32 --------]
                                      |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |        6050.9        |        1694.3
      [1, 3, 320, 320] -> (512, 512)  |       23897.1        |        6607.9
      [1, 3, 500, 500] -> (256, 256)  |        6282.8        |        1693.9
      [1, 3, 500, 500] -> (800, 800)  |       58608.1        |       16061.0
6 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |        6243.7        |         347.6
      [1, 3, 320, 320] -> (512, 512)  |       24779.9        |        1253.8
      [1, 3, 500, 500] -> (256, 256)  |        6348.0        |         350.7
      [1, 3, 500, 500] -> (800, 800)  |       59255.6        |        2983.8

Times are in microseconds (us).

[--------- upsample_bicubic2d channels_last non-contiguous torch.float32 ---------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          6117.0      |         1688.2
      [1, 3, 320, 320] -> (512, 512)   |         23967.4      |         6644.8
      [32, 128, 64, 64] -> (32, 32)    |        679574.0      |        78477.4
      [32, 128, 64, 64] -> (128, 128)  |      1033432.5      |       817649.0
      [2, 128, 64, 46] -> (32, 32)     |          9828.0      |         4449.2
      [2, 128, 64, 46] -> (128, 128)   |        134989.3      |        42817.4
      [1, 128, 64, 46] -> (32, 32)     |          4508.2      |         2228.6
      [1, 128, 64, 46] -> (128, 128)   |         59404.9      |        21400.4
      [1, 3, 500, 500] -> (256, 256)   |          6359.0      |         1712.7
      [1, 3, 500, 500] -> (800, 800)   |         58717.6      |        16086.6
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          6922.0      |          349.5
      [1, 3, 320, 320] -> (512, 512)   |         24916.5      |         1260.2
      [32, 128, 64, 64] -> (32, 32)    |        454240.4      |        16491.4
      [32, 128, 64, 64] -> (128, 128)  |       7198101.5      |       159921.9
      [2, 128, 64, 46] -> (32, 32)     |         10082.8      |          891.1
      [2, 128, 64, 46] -> (128, 128)   |        151037.0      |         7704.2
      [1, 128, 64, 46] -> (32, 32)     |          4325.5      |          633.9
      [1, 128, 64, 46] -> (128, 128)   |         62400.4      |         3853.5
      [1, 3, 500, 500] -> (256, 256)   |          6374.9      |          354.9
      [1, 3, 500, 500] -> (800, 800)   |         58638.8      |         2992.0

Times are in microseconds (us).

Intermediate benchmark sources:

- results/20210331-092940_pth_nightly_results_1.9.0a0+git8518b0e.log.save
- results/20210331-092940_pr_results_1.9.0a0+git73137d8.log.save
```

[Source file](https://raw.githubusercontent.com/vfdev-5/interpolate-tensoriterator/master/step_seven/results/20210326-061238_pr_1.9.0a0%2Bgita17040a_vs_pth_1.9.0a0%2Bgit8518b0e_results.md)

</details>

This description is based on the benchmarks and the code from [here](https://github.com/vfdev-5/interpolate-tensoriterator/tree/master/step_seven).
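
To read the tables above as speed-ups, divide the master time by the PR time for the same row. The short sketch below does this for a few 6-thread rows copied from the results; the helper name `speedup` and the row labels are only illustrative.

```python
# (label, master time in us, PR time in us), copied from the 6-thread rows
# of the float32 tables above for [1, 3, 320, 320] -> (256, 256).
ROWS = [
    ("upsample_nearest2d channels_first contiguous", 220.5, 32.6),
    ("upsample_bicubic2d channels_first contiguous", 6218.5, 347.1),
    ("upsample_bicubic2d channels_last non-contiguous", 6922.0, 349.5),
    ("upsample_bilinear2d channels_first contiguous", 76.1, 77.0),
]


def speedup(master_us, pr_us):
    """Ratio above 1 means the PR is faster than master."""
    return master_us / pr_us


if __name__ == "__main__":
    for label, master_us, pr_us in ROWS:
        print(f"{label}: {speedup(master_us, pr_us):.2f}x")
```

A ratio close to 1, as in the bilinear row, indicates that the case is unchanged by this PR, while the nearest and bicubic rows correspond to the speed-ups listed in the summary.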

Joint work with Francisco Massa (fmassa).

 ---

Appendix: Results without original 2d/3d channels last implementation

<details>
<summary>

Quick benchmark results between 8518b0e (master) and [this branch](master...Quansight:vfdev-5/generic-upsample-tensor-iterator)

</summary>

```
Description:
- 20212303-061238_pth_nightly_results_1.9.0a0+git8518b0e.opencv.6
- 20212303-061238_pth_nightly_results_1.9.0a0+git8518b0e.opencv.1
- 20212303-061238_pr_results_1.9.0a0+gite3a9544.opencv.6
- 20212303-061238_pr_results_1.9.0a0+gite3a9544.opencv.1

[----------------- upsample_bilinear2d channels_first contiguous -----------------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          348.5       |          331.7
      [1, 3, 320, 320] -> (512, 512)   |         1254.0       |         1178.1
      [32, 128, 64, 64] -> (32, 32)    |        10409.4       |        10009.1
      [32, 128, 64, 64] -> (128, 128)  |       210175.8       |       204542.5
      [1, 3, 500, 500] -> (256, 256)   |          348.5       |          329.5
      [1, 3, 500, 500] -> (800, 800)   |         3079.8       |         2890.1
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |           76.4       |           73.4
      [1, 3, 320, 320] -> (512, 512)   |          247.1       |          232.0
      [32, 128, 64, 64] -> (32, 32)    |         2371.1       |         2340.5
      [32, 128, 64, 64] -> (128, 128)  |        62182.6       |        54089.9
      [1, 3, 500, 500] -> (256, 256)   |           78.2       |           75.8
      [1, 3, 500, 500] -> (800, 800)   |          569.0       |          541.3

Times are in microseconds (us).

[-------------- upsample_bilinear2d channels_first non-contiguous ---------------]
                                      |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |         340.5        |         321.9
      [1, 3, 320, 320] -> (512, 512)  |        1256.1        |        1179.0
      [1, 3, 500, 500] -> (256, 256)  |         351.4        |         332.0
      [1, 3, 500, 500] -> (800, 800)  |        3089.1        |        2898.6
6 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |          77.2        |          75.0
      [1, 3, 320, 320] -> (512, 512)  |         246.6        |         232.7
      [1, 3, 500, 500] -> (256, 256)  |          78.6        |          75.4
      [1, 3, 500, 500] -> (800, 800)  |         576.3        |         539.6

Times are in microseconds (us).

[------------------------ upsample_bilinear2d channels_last non-contiguous ------------------------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544  |  opencv 4.5.1
1 threads: -----------------------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          971.9       |         1324.6       |       99.6
      [1, 3, 320, 320] -> (512, 512)   |         3867.8       |         5329.9       |      271.5
      [32, 128, 64, 64] -> (32, 32)    |         6010.6       |         6304.3       |
      [32, 128, 64, 64] -> (128, 128)  |       112299.9       |       116956.8       |
      [2, 128, 64, 46] -> (32, 32)     |          110.1       |          133.2       |
      [2, 128, 64, 46] -> (128, 128)   |         1690.1       |         1838.6       |
      [1, 128, 64, 46] -> (32, 32)     |           55.8       |           73.4       |      185.8
      [1, 128, 64, 46] -> (128, 128)   |          474.5       |          684.9       |     1445.7
      [1, 3, 500, 500] -> (256, 256)   |          972.9       |         1343.0       |      149.5
      [1, 3, 500, 500] -> (800, 800)   |         9460.2       |        12925.8       |      685.1
6 threads: -----------------------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          956.6       |          260.1       |       27.1
      [1, 3, 320, 320] -> (512, 512)   |         3867.3       |          967.1       |       63.6
      [32, 128, 64, 64] -> (32, 32)    |         2489.4       |         2427.0       |
      [32, 128, 64, 64] -> (128, 128)  |        37462.1       |        41329.8       |
      [2, 128, 64, 46] -> (32, 32)     |           61.2       |           38.9       |
      [2, 128, 64, 46] -> (128, 128)   |          904.2       |          652.0       |
      [1, 128, 64, 46] -> (32, 32)     |           57.1       |           32.0       |      191.1
      [1, 128, 64, 46] -> (128, 128)   |          491.4       |          138.1       |     1485.8
      [1, 3, 500, 500] -> (256, 256)   |          977.0       |          257.8       |       36.6
      [1, 3, 500, 500] -> (800, 800)   |         9470.0       |         2696.0       |      142.8

Times are in microseconds (us).

[------------- upsample_linear1d channels_first contiguous --------------]
                              |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |        516.5         |         524.7
      [4, 512, 320] -> [512]  |        993.8         |        1008.0
6 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |        104.3         |         105.4
      [4, 512, 320] -> [512]  |        193.5         |         195.6

Times are in microseconds (us).

[-------------------- upsample_trilinear3d channels_first contiguous --------------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |          5.5         |         11.5
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |        116.3         |        213.1
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |          1.1         |          2.1
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |         36.1         |         47.2

Times are in milliseconds (ms).

[------------------ upsample_trilinear3d channels_last non-contiguous -------------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         13.1         |         19.9
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |        242.3         |        349.4
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         13.1         |          4.4
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |        242.4         |         87.2

Times are in milliseconds (ms).

[------------------ upsample_nearest2d channels_first contiguous -----------------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         1194.5       |          107.8
      [1, 3, 320, 320] -> (512, 512)   |         4813.8       |          365.5
      [32, 128, 64, 64] -> (32, 32)    |        26745.6       |         6280.6
      [32, 128, 64, 64] -> (128, 128)  |       357686.7       |       129032.9
      [1, 3, 500, 500] -> (256, 256)   |         1205.9       |          123.8
      [1, 3, 500, 500] -> (800, 800)   |        11770.3       |          879.2
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          220.2       |           32.7
      [1, 3, 320, 320] -> (512, 512)   |          867.2       |           78.7
      [32, 128, 64, 64] -> (32, 32)    |         5789.6       |         2241.8
      [32, 128, 64, 64] -> (128, 128)  |        89125.3       |        41881.3
      [1, 3, 500, 500] -> (256, 256)   |          224.3       |           34.8
      [1, 3, 500, 500] -> (800, 800)   |         2182.8       |          176.6

Times are in microseconds (us).

[--------------- upsample_nearest2d channels_first non-contiguous ---------------]
                                      |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |        1279.5        |        110.2
      [1, 3, 320, 320] -> (512, 512)  |        4908.1        |        367.1
      [1, 3, 500, 500] -> (256, 256)  |        1488.1        |        123.4
      [1, 3, 500, 500] -> (800, 800)  |       12186.4        |        879.3
6 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |         241.8        |         32.6
      [1, 3, 320, 320] -> (512, 512)  |         889.0        |         79.2
      [1, 3, 500, 500] -> (256, 256)  |         279.2        |         35.6
      [1, 3, 500, 500] -> (800, 800)  |        2226.5        |        174.3

Times are in microseconds (us).

[------------------------ upsample_nearest2d channels_last non-contiguous -------------------------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544  |  opencv 4.5.1
1 threads: -----------------------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          752.1       |          487.2       |      75.5
      [1, 3, 320, 320] -> (512, 512)   |         2992.6       |         1880.0       |     251.4
      [32, 128, 64, 64] -> (32, 32)    |         3458.6       |         3466.5       |
      [32, 128, 64, 64] -> (128, 128)  |       102350.7       |       103919.4       |
      [2, 128, 64, 46] -> (32, 32)     |           75.2       |           85.2       |
      [2, 128, 64, 46] -> (128, 128)   |         1637.0       |         1690.4       |
      [1, 128, 64, 46] -> (32, 32)     |           39.6       |           47.2       |      37.6
      [1, 128, 64, 46] -> (128, 128)   |          426.3       |          449.0       |     412.4
      [1, 3, 500, 500] -> (256, 256)   |          757.5       |          495.5       |      85.0
      [1, 3, 500, 500] -> (800, 800)   |         7281.4       |         4532.6       |     622.8
6 threads: -----------------------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |          139.3       |          104.1       |      75.7
      [1, 3, 320, 320] -> (512, 512)   |          535.5       |          361.2       |      73.0
      [32, 128, 64, 64] -> (32, 32)    |         1518.6       |         1458.2       |
      [32, 128, 64, 64] -> (128, 128)  |        37117.7       |        40142.4       |
      [2, 128, 64, 46] -> (32, 32)     |           17.6       |           26.6       |
      [2, 128, 64, 46] -> (128, 128)   |          537.6       |          629.4       |
      [1, 128, 64, 46] -> (32, 32)     |           13.7       |           22.1       |      38.8
      [1, 128, 64, 46] -> (128, 128)   |           83.6       |           94.5       |     420.2
      [1, 3, 500, 500] -> (256, 256)   |          140.8       |          104.9       |      87.8
      [1, 3, 500, 500] -> (800, 800)   |         1317.8       |          853.8       |     139.7

Times are in microseconds (us).

[------------- upsample_nearest1d channels_first contiguous -------------]
                              |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |        1594.3        |        247.4
      [4, 512, 320] -> [512]  |        3222.6        |        440.4
6 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |         294.4        |         53.7
      [4, 512, 320] -> [512]  |         575.0        |         88.5

Times are in microseconds (us).

[--------------------- upsample_nearest3d channels_first contiguous ---------------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |        14952.7       |        1005.7
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |       224955.6       |       46228.0
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         2887.2       |         206.2
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |        56872.0       |       13566.3

Times are in microseconds (us).

[------------------- upsample_nearest3d channels_last non-contiguous --------------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         7772.3       |         4770.9
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |       144655.1       |       108605.0
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         1401.9       |          877.7
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |        35939.6       |        28621.5

Times are in microseconds (us).

[------------------ upsample_bicubic2d channels_first contiguous -----------------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         6038.7       |         2340.4
      [1, 3, 320, 320] -> (512, 512)   |        24040.6       |         9205.9
      [32, 128, 64, 64] -> (32, 32)    |       471016.3       |        52059.1
      [32, 128, 64, 64] -> (128, 128)  |      7705594.5       |       884743.9
      [1, 3, 500, 500] -> (256, 256)   |         6061.5       |         2361.9
      [1, 3, 500, 500] -> (800, 800)   |        58940.7       |        22401.8
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         6594.3       |          466.5
      [1, 3, 320, 320] -> (512, 512)   |        25361.5       |         1729.1
      [32, 128, 64, 64] -> (32, 32)    |       487783.5       |        11550.0
      [32, 128, 64, 64] -> (128, 128)  |      7963636.6       |       196017.3
      [1, 3, 500, 500] -> (256, 256)   |         6443.8       |          464.1
      [1, 3, 500, 500] -> (800, 800)   |        61891.9       |         4257.2

Times are in microseconds (us).

[--------------- upsample_bicubic2d channels_first non-contiguous ---------------]
                                      |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |        6116.7        |        2357.0
      [1, 3, 320, 320] -> (512, 512)  |       24182.0        |        9213.9
      [1, 3, 500, 500] -> (256, 256)  |        6349.6        |        2358.5
      [1, 3, 500, 500] -> (800, 800)  |       59365.2        |       22431.2
6 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)  |        7155.1        |         464.6
      [1, 3, 320, 320] -> (512, 512)  |       24566.8        |        1712.4
      [1, 3, 500, 500] -> (256, 256)  |        7217.5        |         466.6
      [1, 3, 500, 500] -> (800, 800)  |       59880.2        |        4148.8

Times are in microseconds (us).

[------------------------ upsample_bicubic2d channels_last non-contiguous -------------------------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544  |  opencv 4.5.1
1 threads: -----------------------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         6184.3       |         2360.0       |      215.0
      [1, 3, 320, 320] -> (512, 512)   |        24499.7       |         9231.1       |      510.7
      [32, 128, 64, 64] -> (32, 32)    |       548304.5       |        93517.8       |
      [32, 128, 64, 64] -> (128, 128)  |      7810958.3       |      1086334.6       |
      [2, 128, 64, 46] -> (32, 32)     |        10883.4       |         5594.9       |
      [2, 128, 64, 46] -> (128, 128)   |       153253.2       |        57071.2       |
      [1, 128, 64, 46] -> (32, 32)     |         4519.4       |         2826.5       |      619.7
      [1, 128, 64, 46] -> (128, 128)   |        61339.7       |        28470.7       |     3654.5
      [1, 3, 500, 500] -> (256, 256)   |         6444.8       |         2389.9       |      292.9
      [1, 3, 500, 500] -> (800, 800)   |        59448.0       |        22479.1       |     1316.9
6 threads: -----------------------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |         6370.1       |          464.9       |       61.3
      [1, 3, 320, 320] -> (512, 512)   |        25365.6       |         1767.5       |      145.7
      [32, 128, 64, 64] -> (32, 32)    |       502888.7       |        22016.3       |
      [32, 128, 64, 64] -> (128, 128)  |      8072918.9       |       234567.0       |
      [2, 128, 64, 46] -> (32, 32)     |        11171.4       |         1049.5       |
      [2, 128, 64, 46] -> (128, 128)   |       152612.5       |        11264.8       |
      [1, 128, 64, 46] -> (32, 32)     |         4359.3       |          791.4       |      651.1
      [1, 128, 64, 46] -> (128, 128)   |        61346.5       |         7563.9       |     3765.2
      [1, 3, 500, 500] -> (256, 256)   |         6644.4       |          469.7       |       77.4
      [1, 3, 500, 500] -> (800, 800)   |        59947.2       |         4154.3       |      313.2

Times are in microseconds (us).

Intermediate benchmark sources:

- results/20212303-061238_pth_nightly_results_1.9.0a0+git8518b0e.log.save.opencv
- results/20212303-061238_pr_results_1.9.0a0+gite3a9544.log.save.opencv

```

[Source file](https://raw.githubusercontent.com/vfdev-5/interpolate-tensoriterator/master/step_seven/results/20212303-061238_pr_1.9.0a0%2Bgite3a9544_vs_pth_1.9.0a0%2Bgit8518b0e_results.opencv.md)
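
As a reading aid for the tables above, here is a minimal sketch of how per-row speed-ups could be derived from the table rows. The helper names `parse_row` and `speedup` are hypothetical and are not part of the benchmark scripts. It assumes the first timing column is the nightly build (`1.9.0a0+git8518b0e`) and the second is the PR build (`1.9.0a0+gite3a9544`), as the result file names suggest. The sample rows are copied from the 6-thread `upsample_bicubic2d channels_first contiguous` table, the 1-thread `upsample_trilinear3d channels_first contiguous` table, and the 1-thread `upsample_nearest2d channels_last non-contiguous` table.

```python
from typing import List, Optional, Tuple


def parse_row(line: str) -> Tuple[str, List[Optional[float]]]:
    """Split one table row into its case label and per-column timings.

    Empty cells (e.g. a missing opencv measurement) become None.
    """
    label, *cells = [part.strip() for part in line.split("|")]
    return label, [float(c) if c else None for c in cells]


def speedup(baseline: float, candidate: float) -> float:
    """Ratio above 1 means the candidate build is faster than the baseline."""
    return baseline / candidate


# Rows copied verbatim from the tables. Units differ between tables (us vs ms),
# which does not matter here because each ratio is taken within a single row.
rows = [
    "[1, 3, 320, 320] -> (256, 256)   |         6594.3       |          466.5",
    "[1, 3, 16, 320, 320] -> [8, 256, 256]   |          5.5         |         11.5",
    "[32, 128, 64, 64] -> (32, 32)    |         3458.6       |         3466.5       |",
]

results = {}
for row in rows:
    label, cells = parse_row(row)
    nightly, pr = cells[0], cells[1]  # assumed column order: nightly, then PR
    results[label] = speedup(nightly, pr)
    verdict = "faster" if results[label] > 1 else "slower"
    print(f"{label}: x{results[label]:.2f} ({verdict})")
```

On these three rows the bicubic case comes out roughly 14x faster in the PR column, the trilinear 3d channels_first case comes out roughly 2x slower, and the nearest2d channels_last case is essentially unchanged. A ratio is only meaningful within one row, since timings across tables were measured in different units.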
</details>

Pull Request resolved: #54500

Reviewed By: glaringlee

Differential Revision: D27463566

Pulled By: fmassa

fbshipit-source-id: ceac3a8cee0eeb1a4ddd9344accffcc65449a49a