Optimized bilinear interpolation using TensorIterator #51653
…rator - MemoryFormat: channel first only
@VitalyFedyunin there is already a channels-last (non-contiguous) case here, which is routed to the original vectorized implementation. Are you thinking of particular cases like
@VitalyFedyunin I'll review this PR
@VitalyFedyunin as discussed with @fmassa, I'd like to mention a performance drawback with the dispatch stub in our particular case of 3d linear interpolation.
The expected result should have roughly similar times between
I also tried registering the dispatch manually, as here: master...Quansight:upsample-tensor-iterator-another-dispatch (see aten/src/ATen/native/UpSampleTrilinear3d.cpp),
and it looks like I could restore the expected times this way.
Again, this performance slowdown happens only in the 3d case, where the code unrolls a template loop over the dimensions:
PS: I also updated the full results with more test cases: non-contiguous cases.
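To illustrate what "a loop over the dimensions" means here, the sketch below interpolates an N-d nested list by recursing one dimension at a time, so the same function covers 1d, 2d and 3d. It is a plain-Python illustration, not the PR's C++ kernel: the names `source_index_and_weight` and `interp_nd` are hypothetical, and the `align_corners=False` index convention is an assumption.

```python
from math import floor


def source_index_and_weight(dst, in_size, out_size):
    """Return (left index, right index, right weight) for one output position.

    Assumes the align_corners=False convention: src = scale * (dst + 0.5) - 0.5,
    clamped at zero.
    """
    scale = in_size / out_size
    src = max(scale * (dst + 0.5) - 0.5, 0.0)
    i0 = min(int(floor(src)), in_size - 1)
    i1 = min(i0 + 1, in_size - 1)  # clamp the right neighbour at the border
    return i0, i1, src - i0


def interp_nd(data, out_index, out_shape):
    """Linearly interpolate nested-list `data` at `out_index`.

    Each recursion level handles one dimension; a compiled kernel could
    unroll this recursion at compile time instead.
    """
    if not out_index:  # no dimensions left: `data` is a scalar
        return data
    i0, i1, w = source_index_and_weight(out_index[0], len(data), out_shape[0])
    lo = interp_nd(data[i0], out_index[1:], out_shape[1:])
    hi = interp_nd(data[i1], out_index[1:], out_shape[1:])
    return (1.0 - w) * lo + w * hi


if __name__ == "__main__":
    row = [0.0, 10.0]
    print([interp_nd(row, (i,), (4,)) for i in range(4)])
```

Each extra dimension doubles the number of neighbours that are blended (2 in 1d, 4 in 2d, 8 in 3d), which is one reason the 3d case is the most sensitive to how well the loop is inlined.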
Codecov Report
@@ Coverage Diff @@
## master #51653 +/- ##
=======================================
Coverage 80.79% 80.79%
=======================================
Files 1972 1972
Lines 216093 216093
=======================================
+ Hits 174586 174587 +1
+ Misses 41507 41506 -1 |
I've made some minor comments about the structure of the code.
It would be great to understand the slowdown when integrating this implementation into PyTorch for the 3d case.
Additionally, @vfdev-5, could you put the performance results in a table that is linked in the PR? This will make it easier to understand what's going on. Something like:
2d - 1 thread
2d - 6 threads
3d
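A small helper along these lines could produce such a table from raw timings. This is only a sketch: the function name `speedup_row` is hypothetical, the column names are taken from the tables in the commit summary of this PR, and the speed-up is assumed to be the master time divided by the PR time.

```python
HEADER = "In | Out | Is contiguous | Channels last | master | this PR | speed-up"
SEPARATOR = "---|---|---|---|---|---|---"


def speedup_row(in_shape, out_size, contiguous, channels_last, t_master, t_pr):
    """Format one markdown table row; speed-up is assumed to be t_master / t_pr."""
    speedup = t_master / t_pr
    cells = [
        str(list(in_shape)),
        str(out_size),
        str(contiguous),
        str(channels_last),
        f"{t_master:.4f}",
        f"{t_pr:.4f}",
        f"{speedup:.4f}",
    ]
    return " | ".join(cells)


if __name__ == "__main__":
    print(HEADER)
    print(SEPARATOR)
    print(speedup_row((1, 3, 320, 320), [256, 256], True, False, 2.0, 1.0))
```

Speed-ups recomputed from already rounded timings can differ in the last digits from ones computed on the raw measurements.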
- Removed int32/int64 index dispatch
- Added more comments and other updates according to the review
Results for 149b976, i.e. when TensorIterator is used instead of cpu_upsample_linear_channels_last:
Interpolation 2d - 6 thread(s)
Interpolation 1d - 6 thread(s)
Interpolation 3d - 6 thread(s)
Interpolation 2d - 1 thread(s)
Interpolation 1d - 1 thread(s)
Interpolation 3d - 1 thread(s)
Versions and build configs
PyTorch master: 1.8.0.dev20210208+cu110
PR: 1.9.0a0+149b976
Looks great, thanks a lot!
@fmassa has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Please do |
I propose we do the following: given that the current PR already gives a significant speedup for the 3d case (despite the issues with |
Summary:
Related to pytorch#10482

Description:
- Optimized bilinear interpolation for 1d, 2d, 3d cases using TensorIterator

<details>
<summary>
Interpolation 2d - 6 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 320, 320] | [256, 256] | True | False | 0.3938 | 0.0782 | 5.0339
[1, 3, 320, 320] | [512, 512] | True | False | 1.5585 | 0.4105 | 3.7965
[1, 3, 320, 320] | [256, 256] | False | False | 0.3481 | 0.0760 | 4.5780
[1, 3, 320, 320] | [512, 512] | False | False | 1.5848 | 0.4091 | 3.8734
[1, 3, 320, 320] | [256, 256] | False | True | 1.2058 | 1.2034 | 1.0020
[1, 3, 320, 320] | [512, 512] | False | True | 4.8691 | 4.8537 | 1.0032
[32, 128, 64, 64] | [32, 32] | False | True | 6.3915 | 6.4041 | 0.9980
[32, 128, 64, 64] | [128, 128] | False | True | 166.1769 | 164.5621 | 1.0098
[32, 128, 64, 64] | [32, 32] | True | False | 3.7194 | 2.4720 | 1.5046
[32, 128, 64, 64] | [128, 128] | True | False | 86.6704 | 52.3754 | 1.6548
[1, 3, 500, 500] | [256, 256] | True | False | 0.3270 | 0.0792 | 4.1307
[1, 3, 500, 500] | [800, 800] | True | False | 3.3116 | 0.5567 | 5.9482
[1, 3, 500, 500] | [256, 256] | False | False | 0.3763 | 0.0773 | 4.8700
[1, 3, 500, 500] | [800, 800] | False | False | 3.2577 | 0.5590 | 5.8279

</details>

<details>
<summary>
Interpolation 1d - 6 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[4, 512, 320] | 256 | True | False | 0.2795 | 0.1032 | 2.7089
[4, 512, 320] | 512 | True | False | 0.5533 | 0.1888 | 2.9303

</details>

<details>
<summary>
Interpolation 3d - 6 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 16, 320, 320] | [8, 256, 256] | True | False | 4.4105 | 2.1236 | 2.0769
[1, 3, 16, 320, 320] | [32, 512, 512] | True | False | 83.9426 | 42.6641 | 1.9675
[1, 3, 16, 320, 320] | [8, 256, 256] | False | True | 15.5736 | 15.5758 | 0.9999
[1, 3, 16, 320, 320] | [32, 512, 512] | False | True | 272.4795 | 273.2745 | 0.9971

</details>

<details>
<summary>
Interpolation 2d - 1 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 320, 320] | [256, 256] | True | False | 1.0240 | 0.4145 | 2.4705
[1, 3, 320, 320] | [512, 512] | True | False | 4.0771 | 1.3836 | 2.9467
[1, 3, 320, 320] | [256, 256] | False | False | 0.9771 | 0.3270 | 2.9878
[1, 3, 320, 320] | [512, 512] | False | False | 4.1732 | 1.2209 | 3.4180
[1, 3, 320, 320] | [256, 256] | False | True | 1.5466 | 1.5363 | 1.0067
[1, 3, 320, 320] | [512, 512] | False | True | 6.1555 | 6.1199 | 1.0058
[32, 128, 64, 64] | [32, 32] | False | True | 27.6362 | 27.5901 | 1.0017
[32, 128, 64, 64] | [128, 128] | False | True | 468.6442 | 465.5163 | 1.0067
[32, 128, 64, 64] | [32, 32] | True | False | 20.1495 | 10.0694 | 2.0011
[32, 128, 64, 64] | [128, 128] | True | False | 400.0401 | 204.0662 | 1.9603
[1, 3, 500, 500] | [256, 256] | True | False | 0.8956 | 0.3366 | 2.6606
[1, 3, 500, 500] | [800, 800] | True | False | 8.6554 | 2.9530 | 2.9310
[1, 3, 500, 500] | [256, 256] | False | False | 1.0921 | 0.3385 | 3.2263
[1, 3, 500, 500] | [800, 800] | False | False | 8.9594 | 2.9627 | 3.0241

</details>

<details>
<summary>
Interpolation 1d - 1 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[4, 512, 320] | 256 | True | False | 1.5233 | 0.5027 | 3.0301
[4, 512, 320] | 512 | True | False | 3.0302 | 0.9735 | 3.1128

</details>

<details>
<summary>
Interpolation 3d - 1 thread(s)
</summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 16, 320, 320] | [8, 256, 256] | True | False | 12.0477 | 11.3196 | 1.0643
[1, 3, 16, 320, 320] | [32, 512, 512] | True | False | 222.8618 | 209.9955 | 1.0613
[1, 3, 16, 320, 320] | [8, 256, 256] | False | True | 17.9883 | 17.9937 | 0.9997
[1, 3, 16, 320, 320] | [32, 512, 512] | False | True | 380.7244 | 380.1916 | 1.0014

</details>

<details>
<summary>
Versions and build configs
</summary>

PyTorch master: 1.9.0.dev20210223

PyTorch master build setting:
```
BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
```

PR: 1.9.0a0+74b172b

PR build setting:
```
BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/g++-7, CXX_FLAGS=-O3 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON,
```

</details>

This description is based on the benchmarks and the code from [here](https://github.com/vfdev-5/interpolate-tensoriterator/tree/master/step_six).

TL;DR
- Linear upsampling generic implementation using TensorIterator for Nd case (single loop function for 1d, 2d and 3d cases)
- can be generalized to nearest, bicubic interpolation modes.
- works for channels first and last cases.

Joint work with Francisco Massa (fmassa).

Pull Request resolved: pytorch#51653

Reviewed By: malfet

Differential Revision: D26619437

Pulled By: fmassa

fbshipit-source-id: 7d435e23881c5b40a18bf0dbcab4906d5462025f
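The `Is contiguous` and `Channels last` columns in the tables above describe the memory layout of the input. As a rough illustration of the difference, the sketch below computes element strides for a 4-d `(N, C, H, W)` shape in both layouts. The helper names are hypothetical and the code is plain Python, not PyTorch API.

```python
def contiguous_strides(shape):
    """Strides (in elements) of a row-major, channels-first tensor."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def channels_last_strides(shape):
    """Strides (in elements) when an (N, C, H, W) tensor is stored as N, H, W, C."""
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)


if __name__ == "__main__":
    shape = (1, 3, 320, 320)
    print(contiguous_strides(shape))
    print(channels_last_strides(shape))
```

In the channels-last layout the channel stride is 1, so the values of all channels at one pixel sit next to each other. That is why such inputs show up as non-contiguous in the tables and are handled by a separate code path.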
…2d/3d channels last impl) (#54500) Summary: Related to #10482 A follow-up PR to #51653 Description: - Replaces nearest/linear/cubic implementations with generic interpolation implementation - Retains 2d/3d channels last implementation due to perf slowdown for 1 thread (see below appendix note) Speed-ups for cases: - upsample_nearest channels first - upsample_bicubic channels first/last ### Results for this PR <details> <summary> Benchmark results between 8518b0e (master) and 73137d8 (this PR) </summary> ``` Description: - 20210331-092940_pth_nightly_results_1.9.0a0+git8518b0e.6 - 20210331-092940_pth_nightly_results_1.9.0a0+git8518b0e.1 - 20210331-092940_pr_results_1.9.0a0+git73137d8.6 - 20210331-092940_pr_results_1.9.0a0+git73137d8.1 [---------- upsample_bilinear2d channels_first contiguous torch.float32 ----------] | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8 1 threads: ------------------------------------------------------------------------ [1, 3, 320, 320] -> (256, 256) | 331.8 | 334.6 [1, 3, 320, 320] -> (512, 512) | 1261.7 | 1271.5 [32, 128, 64, 64] -> (32, 32) | 10164.6 | 10251.4 [32, 128, 64, 64] -> (128, 128) | 195966.1 | 197141.8 [1, 3, 500, 500] -> (256, 256) | 347.7 | 348.3 [1, 3, 500, 500] -> (800, 800) | 3044.9 | 3071.4 6 threads: ------------------------------------------------------------------------ [1, 3, 320, 320] -> (256, 256) | 76.1 | 77.0 [1, 3, 320, 320] -> (512, 512) | 244.8 | 247.6 [32, 128, 64, 64] -> (32, 32) | 2329.4 | 2315.8 [32, 128, 64, 64] -> (128, 128) | 47855.3 | 49047.7 [1, 3, 500, 500] -> (256, 256) | 78.1 | 78.7 [1, 3, 500, 500] -> (800, 800) | 569.3 | 575.6 Times are in microseconds (us). 
[------- upsample_bilinear2d channels_first non-contiguous torch.float32 --------] | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8 1 threads: ----------------------------------------------------------------------- [1, 3, 320, 320] -> (256, 256) | 339.0 | 340.3 [1, 3, 320, 320] -> (512, 512) | 1266.1 | 1277.3 [1, 3, 500, 500] -> (256, 256) | 348.8 | 351.3 [1, 3, 500, 500] -> (800, 800) | 3054.5 | 3077.3 6 threads: ----------------------------------------------------------------------- [1, 3, 320, 320] -> (256, 256) | 76.6 | 77.4 [1, 3, 320, 320] -> (512, 512) | 246.0 | 248.1 [1, 3, 500, 500] -> (256, 256) | 78.3 | 79.5 [1, 3, 500, 500] -> (800, 800) | 572.2 | 580.0 Times are in microseconds (us). [--------- upsample_bilinear2d channels_last non-contiguous torch.float32 --------] | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8 1 threads: ------------------------------------------------------------------------ [1, 3, 320, 320] -> (256, 256) | 965.4 | 964.9 [1, 3, 320, 320] -> (512, 512) | 3856.2 | 3866.8 [32, 128, 64, 64] -> (32, 32) | 5808.3 | 5812.8 [32, 128, 64, 64] -> (128, 128) | 99575.2 | 97226.2 [2, 128, 64, 46] -> (32, 32) | 110.5 | 109.0 [2, 128, 64, 46] -> (128, 128) | 1662.3 | 1612.0 [1, 128, 64, 46] -> (32, 32) | 55.6 | 55.5 [1, 128, 64, 46] -> (128, 128) | 467.0 | 463.9 [1, 3, 500, 500] -> (256, 256) | 967.7 | 966.7 [1, 3, 500, 500] -> (800, 800) | 9394.7 | 9436.6 6 threads: ------------------------------------------------------------------------ [1, 3, 320, 320] -> (256, 256) | 962.2 | 965.4 [1, 3, 320, 320] -> (512, 512) | 3844.3 | 3844.3 [32, 128, 64, 64] -> (32, 32) | 2270.0 | 2267.6 [32, 128, 64, 64] -> (128, 128) | 31909.7 | 32106.5 [2, 128, 64, 46] -> (32, 32) | 61.3 | 59.9 [2, 128, 64, 46] -> (128, 128) | 912.3 | 893.5 [1, 128, 64, 46] -> (32, 32) | 55.5 | 55.3 [1, 128, 64, 46] -> (128, 128) | 467.0 | 466.4 [1, 3, 500, 500] -> (256, 256) | 967.2 | 971.1 [1, 3, 500, 500] -> (800, 800) | 9383.2 | 9417.4 Times are in microseconds (us). 
[------ upsample_linear1d channels_first contiguous torch.float32 -------] | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8 1 threads: --------------------------------------------------------------- [4, 512, 320] -> [256] | 513.5 | 521.8 [4, 512, 320] -> [512] | 999.0 | 1011.8 6 threads: --------------------------------------------------------------- [4, 512, 320] -> [256] | 103.7 | 104.9 [4, 512, 320] -> [512] | 192.2 | 194.9 Times are in microseconds (us). [------------- upsample_trilinear3d channels_first contiguous torch.float32 -------------] | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8 1 threads: ------------------------------------------------------------------------------- [1, 3, 16, 320, 320] -> [8, 256, 256] | 5.4 | 5.5 [1, 3, 16, 320, 320] -> [32, 512, 512] | 111.2 | 111.1 6 threads: ------------------------------------------------------------------------------- [1, 3, 16, 320, 320] -> [8, 256, 256] | 1.1 | 1.0 [1, 3, 16, 320, 320] -> [32, 512, 512] | 23.4 | 23.2 Times are in milliseconds (ms). [----------- upsample_trilinear3d channels_last non-contiguous torch.float32 ------------] | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8 1 threads: ------------------------------------------------------------------------------- [1, 3, 16, 320, 320] -> [8, 256, 256] | 13521.9 | 12939.9 [1, 3, 16, 320, 320] -> [32, 512, 512] | 244561.3 | 236595.6 [1, 16, 32, 64, 64] -> [16, 32, 32] | 362.2 | 365.5 [1, 16, 32, 64, 64] -> [64, 128, 128] | 38141.4 | 37957.7 6 threads: ------------------------------------------------------------------------------- [1, 3, 16, 320, 320] -> [8, 256, 256] | 12980.4 | 12962.7 [1, 3, 16, 320, 320] -> [32, 512, 512] | 236256.4 | 236364.5 [1, 16, 32, 64, 64] -> [16, 32, 32] | 367.9 | 393.2 [1, 16, 32, 64, 64] -> [64, 128, 128] | 38222.5 | 38198.3 Times are in microseconds (us). 
[----------- upsample_nearest2d channels_first contiguous torch.float32 ----------] | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8 1 threads: ------------------------------------------------------------------------ [1, 3, 320, 320] -> (256, 256) | 1205.7 | 107.2 [1, 3, 320, 320] -> (512, 512) | 4793.5 | 357.7 [32, 128, 64, 64] -> (32, 32) | 26550.0 | 6227.1 [32, 128, 64, 64] -> (128, 128) | 341140.3 | 116404.4 [1, 3, 500, 500] -> (256, 256) | 1208.6 | 122.9 [1, 3, 500, 500] -> (800, 800) | 11648.0 | 848.1 6 threads: ------------------------------------------------------------------------ [1, 3, 320, 320] -> (256, 256) | 220.5 | 32.6 [1, 3, 320, 320] -> (512, 512) | 865.4 | 78.1 [32, 128, 64, 64] -> (32, 32) | 4890.9 | 2201.2 [32, 128, 64, 64] -> (128, 128) | 73533.8 | 32315.4 [1, 3, 500, 500] -> (256, 256) | 222.3 | 35.0 [1, 3, 500, 500] -> (800, 800) | 2107.5 | 170.7 Times are in microseconds (us). [----------- upsample_nearest2d channels_first contiguous torch.uint8 -----------] | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8 1 threads: ----------------------------------------------------------------------- [1, 3, 320, 320] -> (256, 256) | 1457.0 | 310.7 [1, 3, 320, 320] -> (512, 512) | 5808.0 | 1196.6 [1, 3, 500, 500] -> (256, 256) | 1460.9 | 312.7 [1, 3, 500, 500] -> (800, 800) | 14094.3 | 2903.5 6 threads: ----------------------------------------------------------------------- [1, 3, 320, 320] -> (256, 256) | 264.8 | 66.8 [1, 3, 320, 320] -> (512, 512) | 1046.0 | 228.9 [1, 3, 500, 500] -> (256, 256) | 266.0 | 68.0 [1, 3, 500, 500] -> (800, 800) | 2546.6 | 535.8 Times are in microseconds (us). 
[-------- upsample_nearest2d channels_first non-contiguous torch.float32 --------] | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8 1 threads: ----------------------------------------------------------------------- [1, 3, 320, 320] -> (256, 256) | 1284.3 | 109.9 [1, 3, 320, 320] -> (512, 512) | 4870.0 | 361.6 [1, 3, 500, 500] -> (256, 256) | 1482.8 | 123.3 [1, 3, 500, 500] -> (800, 800) | 12050.3 | 858.8 6 threads: ----------------------------------------------------------------------- [1, 3, 320, 320] -> (256, 256) | 240.2 | 32.8 [1, 3, 320, 320] -> (512, 512) | 886.1 | 78.4 [1, 3, 500, 500] -> (256, 256) | 274.9 | 34.9 [1, 3, 500, 500] -> (800, 800) | 2188.8 | 174.0 Times are in microseconds (us). [--------- upsample_nearest2d channels_first non-contiguous torch.uint8 ---------] | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8 1 threads: ----------------------------------------------------------------------- [1, 3, 320, 320] -> (256, 256) | 1501.9 | 312.2 [1, 3, 320, 320] -> (512, 512) | 5853.4 | 1202.1 [1, 3, 500, 500] -> (256, 256) | 1574.0 | 313.9 [1, 3, 500, 500] -> (800, 800) | 14210.2 | 2904.5 6 threads: ----------------------------------------------------------------------- [1, 3, 320, 320] -> (256, 256) | 277.2 | 67.2 [1, 3, 320, 320] -> (512, 512) | 1059.8 | 228.9 [1, 3, 500, 500] -> (256, 256) | 292.2 | 68.1 [1, 3, 500, 500] -> (800, 800) | 2574.4 | 536.2 Times are in microseconds (us). 
[--------- upsample_nearest2d channels_last non-contiguous torch.float32 ---------] | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8 1 threads: ------------------------------------------------------------------------ [1, 3, 320, 320] -> (256, 256) | 746.0 | 751.1 [1, 3, 320, 320] -> (512, 512) | 2967.6 | 2979.2 [32, 128, 64, 64] -> (32, 32) | 3408.5 | 3379.0 [32, 128, 64, 64] -> (128, 128) | 90166.4 | 90023.0 [2, 128, 64, 46] -> (32, 32) | 74.8 | 74.5 [2, 128, 64, 46] -> (128, 128) | 1591.2 | 1594.3 [1, 128, 64, 46] -> (32, 32) | 39.3 | 39.2 [1, 128, 64, 46] -> (128, 128) | 420.3 | 419.1 [1, 3, 500, 500] -> (256, 256) | 751.6 | 756.3 [1, 3, 500, 500] -> (800, 800) | 7222.2 | 7268.6 6 threads: ------------------------------------------------------------------------ [1, 3, 320, 320] -> (256, 256) | 144.9 | 140.1 [1, 3, 320, 320] -> (512, 512) | 560.7 | 540.6 [32, 128, 64, 64] -> (32, 32) | 1418.1 | 1418.6 [32, 128, 64, 64] -> (128, 128) | 28158.4 | 26411.4 [2, 128, 64, 46] -> (32, 32) | 18.4 | 17.8 [2, 128, 64, 46] -> (128, 128) | 532.3 | 552.0 [1, 128, 64, 46] -> (32, 32) | 13.9 | 13.6 [1, 128, 64, 46] -> (128, 128) | 81.3 | 82.9 [1, 3, 500, 500] -> (256, 256) | 145.9 | 141.6 [1, 3, 500, 500] -> (800, 800) | 1363.4 | 1316.2 Times are in microseconds (us). 
[---------- upsample_nearest2d channels_last non-contiguous torch.uint8 ----------] | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8 1 threads: ------------------------------------------------------------------------ [1, 3, 320, 320] -> (256, 256) | 795.7 | 824.1 [1, 3, 320, 320] -> (512, 512) | 3163.4 | 3274.8 [32, 128, 64, 64] -> (32, 32) | 798.8 | 812.2 [32, 128, 64, 64] -> (128, 128) | 25259.6 | 25453.1 [2, 128, 64, 46] -> (32, 32) | 39.3 | 39.9 [2, 128, 64, 46] -> (128, 128) | 493.7 | 499.9 [1, 128, 64, 46] -> (32, 32) | 22.6 | 22.9 [1, 128, 64, 46] -> (128, 128) | 249.7 | 254.0 [32, 64, 128, 64] -> (32, 32) | 475.3 | 507.4 [32, 64, 128, 64] -> (128, 128) | 13709.7 | 13767.5 [1, 3, 500, 500] -> (256, 256) | 804.0 | 827.6 [1, 3, 500, 500] -> (800, 800) | 7764.9 | 7982.7 6 threads: ------------------------------------------------------------------------ [1, 3, 320, 320] -> (256, 256) | 150.1 | 151.4 [1, 3, 320, 320] -> (512, 512) | 589.5 | 592.6 [32, 128, 64, 64] -> (32, 32) | 141.3 | 194.5 [32, 128, 64, 64] -> (128, 128) | 6916.5 | 7445.0 [2, 128, 64, 46] -> (32, 32) | 10.0 | 12.5 [2, 128, 64, 46] -> (128, 128) | 95.8 | 141.1 [1, 128, 64, 46] -> (32, 32) | 8.1 | 10.0 [1, 128, 64, 46] -> (128, 128) | 52.5 | 74.3 [32, 64, 128, 64] -> (32, 32) | 79.8 | 123.7 [32, 64, 128, 64] -> (128, 128) | 3639.9 | 4087.9 [1, 3, 500, 500] -> (256, 256) | 150.7 | 152.2 [1, 3, 500, 500] -> (800, 800) | 1430.9 | 1440.7 Times are in microseconds (us). [------ upsample_nearest1d channels_first contiguous torch.float32 ------] | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8 1 threads: --------------------------------------------------------------- [4, 512, 320] -> [256] | 1601.7 | 241.7 [4, 512, 320] -> [512] | 3188.5 | 435.7 6 threads: --------------------------------------------------------------- [4, 512, 320] -> [256] | 291.9 | 53.3 [4, 512, 320] -> [512] | 577.8 | 88.1 Times are in microseconds (us). 
[------- upsample_nearest1d channels_first contiguous torch.uint8 -------] | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8 1 threads: --------------------------------------------------------------- [4, 512, 320] -> [256] | 2010.1 | 532.3 [4, 512, 320] -> [512] | 3999.7 | 1011.4 6 threads: --------------------------------------------------------------- [4, 512, 320] -> [256] | 364.2 | 104.6 [4, 512, 320] -> [512] | 722.8 | 193.5 Times are in microseconds (us). [-------------- upsample_nearest3d channels_first contiguous torch.float32 --------------] | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8 1 threads: ------------------------------------------------------------------------------- [1, 3, 16, 320, 320] -> [8, 256, 256] | 14801.0 | 977.5 [1, 3, 16, 320, 320] -> [32, 512, 512] | 217368.5 | 41577.3 6 threads: ------------------------------------------------------------------------------- [1, 3, 16, 320, 320] -> [8, 256, 256] | 2670.3 | 210.7 [1, 3, 16, 320, 320] -> [32, 512, 512] | 42023.6 | 10971.6 Times are in microseconds (us). [--------------- upsample_nearest3d channels_first contiguous torch.uint8 ---------------] | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8 1 threads: ------------------------------------------------------------------------------- [1, 3, 16, 320, 320] -> [8, 256, 256] | 17151.7 | 3195.8 [1, 3, 16, 320, 320] -> [32, 512, 512] | 221221.0 | 50524.5 6 threads: ------------------------------------------------------------------------------- [1, 3, 16, 320, 320] -> [8, 256, 256] | 3085.3 | 588.6 [1, 3, 16, 320, 320] -> [32, 512, 512] | 39842.0 | 9141.0 Times are in microseconds (us). 
[------------ upsample_nearest3d channels_last non-contiguous torch.float32 -------------] | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8 1 threads: ------------------------------------------------------------------------------- [1, 3, 16, 320, 320] -> [8, 256, 256] | 7694.1 | 7729.0 [1, 3, 16, 320, 320] -> [32, 512, 512] | 138104.6 | 138158.0 [1, 16, 32, 64, 64] -> [16, 32, 32] | 251.1 | 252.4 [1, 16, 32, 64, 64] -> [64, 128, 128] | 28991.5 | 28882.8 6 threads: ------------------------------------------------------------------------------- [1, 3, 16, 320, 320] -> [8, 256, 256] | 1398.3 | 1402.6 [1, 3, 16, 320, 320] -> [32, 512, 512] | 28056.5 | 28123.2 [1, 16, 32, 64, 64] -> [16, 32, 32] | 50.8 | 51.1 [1, 16, 32, 64, 64] -> [64, 128, 128] | 7595.7 | 7540.7 Times are in microseconds (us). [------------- upsample_nearest3d channels_last non-contiguous torch.uint8 --------------] | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8 1 threads: ------------------------------------------------------------------------------- [1, 3, 16, 320, 320] -> [8, 256, 256] | 8147.8 | 8176.2 [1, 3, 16, 320, 320] -> [32, 512, 512] | 114658.1 | 114992.7 [1, 16, 32, 64, 64] -> [16, 32, 32] | 364.3 | 356.0 [1, 16, 32, 64, 64] -> [64, 128, 128] | 17276.0 | 16331.0 6 threads: ------------------------------------------------------------------------------- [1, 3, 16, 320, 320] -> [8, 256, 256] | 1469.4 | 1476.1 [1, 3, 16, 320, 320] -> [32, 512, 512] | 20647.1 | 20722.6 [1, 16, 32, 64, 64] -> [16, 32, 32] | 69.7 | 68.4 [1, 16, 32, 64, 64] -> [64, 128, 128] | 3125.7 | 2948.2 Times are in microseconds (us). 
[----------- upsample_bicubic2d channels_first contiguous torch.float32 ----------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |      5961.0          |      1680.2
      [1, 3, 320, 320] -> (512, 512)   |     23803.7          |      6591.0
      [32, 128, 64, 64] -> (32, 32)    |    620609.4          |     37981.6
      [32, 128, 64, 64] -> (128, 128)  |  10120286.1          |    646305.5
      [1, 3, 500, 500] -> (256, 256)   |      6005.4          |      1694.6
      [1, 3, 500, 500] -> (800, 800)   |     58271.9          |     16047.6
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |      6218.5          |       347.1
      [1, 3, 320, 320] -> (512, 512)   |     24144.6          |      1253.4
      [32, 128, 64, 64] -> (32, 32)    |    612762.5          |      6934.8
      [32, 128, 64, 64] -> (128, 128)  |   9906221.2          |    127411.1
      [1, 3, 500, 500] -> (256, 256)   |      6241.9          |       350.2
      [1, 3, 500, 500] -> (800, 800)   |     59052.2          |      2984.8

Times are in microseconds (us).

[-------- upsample_bicubic2d channels_first non-contiguous torch.float32 --------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |      6050.9          |      1694.3
      [1, 3, 320, 320] -> (512, 512)   |     23897.1          |      6607.9
      [1, 3, 500, 500] -> (256, 256)   |      6282.8          |      1693.9
      [1, 3, 500, 500] -> (800, 800)   |     58608.1          |     16061.0
6 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |      6243.7          |       347.6
      [1, 3, 320, 320] -> (512, 512)   |     24779.9          |      1253.8
      [1, 3, 500, 500] -> (256, 256)   |      6348.0          |       350.7
      [1, 3, 500, 500] -> (800, 800)   |     59255.6          |      2983.8

Times are in microseconds (us).
[--------- upsample_bicubic2d channels_last non-contiguous torch.float32 ---------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |      6117.0          |      1688.2
      [1, 3, 320, 320] -> (512, 512)   |     23967.4          |      6644.8
      [32, 128, 64, 64] -> (32, 32)    |    679574.0          |     78477.4
      [32, 128, 64, 64] -> (128, 128)  |   1033432.5          |    817649.0
      [2, 128, 64, 46] -> (32, 32)     |      9828.0          |      4449.2
      [2, 128, 64, 46] -> (128, 128)   |    134989.3          |     42817.4
      [1, 128, 64, 46] -> (32, 32)     |      4508.2          |      2228.6
      [1, 128, 64, 46] -> (128, 128)   |     59404.9          |     21400.4
      [1, 3, 500, 500] -> (256, 256)   |      6359.0          |      1712.7
      [1, 3, 500, 500] -> (800, 800)   |     58717.6          |     16086.6
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |      6922.0          |       349.5
      [1, 3, 320, 320] -> (512, 512)   |     24916.5          |      1260.2
      [32, 128, 64, 64] -> (32, 32)    |    454240.4          |     16491.4
      [32, 128, 64, 64] -> (128, 128)  |   7198101.5          |    159921.9
      [2, 128, 64, 46] -> (32, 32)     |     10082.8          |       891.1
      [2, 128, 64, 46] -> (128, 128)   |    151037.0          |      7704.2
      [1, 128, 64, 46] -> (32, 32)     |      4325.5          |       633.9
      [1, 128, 64, 46] -> (128, 128)   |     62400.4          |      3853.5
      [1, 3, 500, 500] -> (256, 256)   |      6374.9          |       354.9
      [1, 3, 500, 500] -> (800, 800)   |     58638.8          |      2992.0

Times are in microseconds (us).

Intermediate benchmark sources:

- results/20210331-092940_pth_nightly_results_1.9.0a0+git8518b0e.log.save
- results/20210331-092940_pr_results_1.9.0a0+git73137d8.log.save
```

[Source file](https://raw.githubusercontent.com/vfdev-5/interpolate-tensoriterator/master/step_seven/results/20210326-061238_pr_1.9.0a0%2Bgita17040a_vs_pth_1.9.0a0%2Bgit8518b0e_results.md)

</details>

This description is based on the benchmarks and the code from [here](https://github.com/vfdev-5/interpolate-tensoriterator/tree/master/step_seven).

Joint work with Francisco Massa (fmassa).
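The speed-up implied by any row in these tables is the baseline time divided by the PR time. A minimal sketch of that arithmetic, using four rows copied from the `upsample_bicubic2d channels_first contiguous torch.float32` table above; the `ROWS` list and the `speedup` helper are illustrative names, not part of the benchmark scripts:

```python
# Rows copied from the "upsample_bicubic2d channels_first contiguous
# torch.float32" table: (case, threads, baseline_us, pr_us).
# baseline = 1.9.0a0+git8518b0e, pr = 1.9.0a0+git73137d8.
ROWS = [
    ("[1, 3, 320, 320] -> (256, 256)", 1, 5961.0, 1680.2),
    ("[1, 3, 320, 320] -> (256, 256)", 6, 6218.5, 347.1),
    ("[32, 128, 64, 64] -> (32, 32)", 1, 620609.4, 37981.6),
    ("[32, 128, 64, 64] -> (32, 32)", 6, 612762.5, 6934.8),
]


def speedup(baseline_us: float, pr_us: float) -> float:
    """Return how many times faster the PR build is than the baseline."""
    return baseline_us / pr_us


if __name__ == "__main__":
    for case, threads, base, pr in ROWS:
        print(f"{case:<32} {threads} thread(s): {speedup(base, pr):6.1f}x")
```

For the first case this gives roughly 3.5x on one thread and about 18x on six threads, which reflects that the baseline bicubic timings barely change between one and six threads while the PR timings do.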
---

Appendix: Results without original 2d/3d channels last implementation

<details>
<summary>

Quick benchmark results between 8518b0e (master) and [this branch](master...Quansight:vfdev-5/generic-upsample-tensor-iterator)

</summary>

```
Description:
- 20212303-061238_pth_nightly_results_1.9.0a0+git8518b0e.opencv.6
- 20212303-061238_pth_nightly_results_1.9.0a0+git8518b0e.opencv.1
- 20212303-061238_pr_results_1.9.0a0+gite3a9544.opencv.6
- 20212303-061238_pr_results_1.9.0a0+gite3a9544.opencv.1

[----------------- upsample_bilinear2d channels_first contiguous -----------------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |       348.5          |       331.7
      [1, 3, 320, 320] -> (512, 512)   |      1254.0          |      1178.1
      [32, 128, 64, 64] -> (32, 32)    |     10409.4          |     10009.1
      [32, 128, 64, 64] -> (128, 128)  |    210175.8          |    204542.5
      [1, 3, 500, 500] -> (256, 256)   |       348.5          |       329.5
      [1, 3, 500, 500] -> (800, 800)   |      3079.8          |      2890.1
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |        76.4          |        73.4
      [1, 3, 320, 320] -> (512, 512)   |       247.1          |       232.0
      [32, 128, 64, 64] -> (32, 32)    |      2371.1          |      2340.5
      [32, 128, 64, 64] -> (128, 128)  |     62182.6          |     54089.9
      [1, 3, 500, 500] -> (256, 256)   |        78.2          |        75.8
      [1, 3, 500, 500] -> (800, 800)   |       569.0          |       541.3

Times are in microseconds (us).
[-------------- upsample_bilinear2d channels_first non-contiguous ---------------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |       340.5          |       321.9
      [1, 3, 320, 320] -> (512, 512)   |      1256.1          |      1179.0
      [1, 3, 500, 500] -> (256, 256)   |       351.4          |       332.0
      [1, 3, 500, 500] -> (800, 800)   |      3089.1          |      2898.6
6 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |        77.2          |        75.0
      [1, 3, 320, 320] -> (512, 512)   |       246.6          |       232.7
      [1, 3, 500, 500] -> (256, 256)   |        78.6          |        75.4
      [1, 3, 500, 500] -> (800, 800)   |       576.3          |       539.6

Times are in microseconds (us).

[------------------------ upsample_bilinear2d channels_last non-contiguous ------------------------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544  |  opencv 4.5.1
1 threads: -----------------------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |       971.9          |      1324.6          |        99.6
      [1, 3, 320, 320] -> (512, 512)   |      3867.8          |      5329.9          |       271.5
      [32, 128, 64, 64] -> (32, 32)    |      6010.6          |      6304.3          |
      [32, 128, 64, 64] -> (128, 128)  |    112299.9          |    116956.8          |
      [2, 128, 64, 46] -> (32, 32)     |       110.1          |       133.2          |
      [2, 128, 64, 46] -> (128, 128)   |      1690.1          |      1838.6          |
      [1, 128, 64, 46] -> (32, 32)     |        55.8          |        73.4          |       185.8
      [1, 128, 64, 46] -> (128, 128)   |       474.5          |       684.9          |      1445.7
      [1, 3, 500, 500] -> (256, 256)   |       972.9          |      1343.0          |       149.5
      [1, 3, 500, 500] -> (800, 800)   |      9460.2          |     12925.8          |       685.1
6 threads: -----------------------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |       956.6          |       260.1          |        27.1
      [1, 3, 320, 320] -> (512, 512)   |      3867.3          |       967.1          |        63.6
      [32, 128, 64, 64] -> (32, 32)    |      2489.4          |      2427.0          |
      [32, 128, 64, 64] -> (128, 128)  |     37462.1          |     41329.8          |
      [2, 128, 64, 46] -> (32, 32)     |        61.2          |        38.9          |
      [2, 128, 64, 46] -> (128, 128)   |       904.2          |       652.0          |
      [1, 128, 64, 46] -> (32, 32)     |        57.1          |        32.0          |       191.1
      [1, 128, 64, 46] -> (128, 128)   |       491.4          |       138.1          |      1485.8
      [1, 3, 500, 500] -> (256, 256)   |       977.0          |       257.8          |        36.6
      [1, 3, 500, 500] -> (800, 800)   |      9470.0          |      2696.0          |       142.8

Times are in microseconds (us).

[------------- upsample_linear1d channels_first contiguous --------------]
                              |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |       516.5          |       524.7
      [4, 512, 320] -> [512]  |       993.8          |      1008.0
6 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |       104.3          |       105.4
      [4, 512, 320] -> [512]  |       193.5          |       195.6

Times are in microseconds (us).

[-------------------- upsample_trilinear3d channels_first contiguous --------------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         5.5          |        11.5
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |       116.3          |       213.1
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |         1.1          |         2.1
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |        36.1          |        47.2

Times are in milliseconds (ms).

[------------------ upsample_trilinear3d channels_last non-contiguous -------------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |        13.1          |        19.9
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |       242.3          |       349.4
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |        13.1          |         4.4
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |       242.4          |        87.2

Times are in milliseconds (ms).
[------------------ upsample_nearest2d channels_first contiguous -----------------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |      1194.5          |       107.8
      [1, 3, 320, 320] -> (512, 512)   |      4813.8          |       365.5
      [32, 128, 64, 64] -> (32, 32)    |     26745.6          |      6280.6
      [32, 128, 64, 64] -> (128, 128)  |    357686.7          |    129032.9
      [1, 3, 500, 500] -> (256, 256)   |      1205.9          |       123.8
      [1, 3, 500, 500] -> (800, 800)   |     11770.3          |       879.2
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |       220.2          |        32.7
      [1, 3, 320, 320] -> (512, 512)   |       867.2          |        78.7
      [32, 128, 64, 64] -> (32, 32)    |      5789.6          |      2241.8
      [32, 128, 64, 64] -> (128, 128)  |     89125.3          |     41881.3
      [1, 3, 500, 500] -> (256, 256)   |       224.3          |        34.8
      [1, 3, 500, 500] -> (800, 800)   |      2182.8          |       176.6

Times are in microseconds (us).

[--------------- upsample_nearest2d channels_first non-contiguous ---------------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |      1279.5          |       110.2
      [1, 3, 320, 320] -> (512, 512)   |      4908.1          |       367.1
      [1, 3, 500, 500] -> (256, 256)   |      1488.1          |       123.4
      [1, 3, 500, 500] -> (800, 800)   |     12186.4          |       879.3
6 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |       241.8          |        32.6
      [1, 3, 320, 320] -> (512, 512)   |       889.0          |        79.2
      [1, 3, 500, 500] -> (256, 256)   |       279.2          |        35.6
      [1, 3, 500, 500] -> (800, 800)   |      2226.5          |       174.3

Times are in microseconds (us).
[------------------------ upsample_nearest2d channels_last non-contiguous -------------------------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544  |  opencv 4.5.1
1 threads: -----------------------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |       752.1          |       487.2          |        75.5
      [1, 3, 320, 320] -> (512, 512)   |      2992.6          |      1880.0          |       251.4
      [32, 128, 64, 64] -> (32, 32)    |      3458.6          |      3466.5          |
      [32, 128, 64, 64] -> (128, 128)  |    102350.7          |    103919.4          |
      [2, 128, 64, 46] -> (32, 32)     |        75.2          |        85.2          |
      [2, 128, 64, 46] -> (128, 128)   |      1637.0          |      1690.4          |
      [1, 128, 64, 46] -> (32, 32)     |        39.6          |        47.2          |        37.6
      [1, 128, 64, 46] -> (128, 128)   |       426.3          |       449.0          |       412.4
      [1, 3, 500, 500] -> (256, 256)   |       757.5          |       495.5          |        85.0
      [1, 3, 500, 500] -> (800, 800)   |      7281.4          |      4532.6          |       622.8
6 threads: -----------------------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |       139.3          |       104.1          |        75.7
      [1, 3, 320, 320] -> (512, 512)   |       535.5          |       361.2          |        73.0
      [32, 128, 64, 64] -> (32, 32)    |      1518.6          |      1458.2          |
      [32, 128, 64, 64] -> (128, 128)  |     37117.7          |     40142.4          |
      [2, 128, 64, 46] -> (32, 32)     |        17.6          |        26.6          |
      [2, 128, 64, 46] -> (128, 128)   |       537.6          |       629.4          |
      [1, 128, 64, 46] -> (32, 32)     |        13.7          |        22.1          |        38.8
      [1, 128, 64, 46] -> (128, 128)   |        83.6          |        94.5          |       420.2
      [1, 3, 500, 500] -> (256, 256)   |       140.8          |       104.9          |        87.8
      [1, 3, 500, 500] -> (800, 800)   |      1317.8          |       853.8          |       139.7

Times are in microseconds (us).

[------------- upsample_nearest1d channels_first contiguous -------------]
                              |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |      1594.3          |       247.4
      [4, 512, 320] -> [512]  |      3222.6          |       440.4
6 threads: ---------------------------------------------------------------
      [4, 512, 320] -> [256]  |       294.4          |        53.7
      [4, 512, 320] -> [512]  |       575.0          |        88.5

Times are in microseconds (us).
[--------------------- upsample_nearest3d channels_first contiguous ---------------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |     14952.7          |      1005.7
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |    224955.6          |     46228.0
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |      2887.2          |       206.2
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |     56872.0          |     13566.3

Times are in microseconds (us).

[------------------- upsample_nearest3d channels_last non-contiguous --------------------]
                                              |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |      7772.3          |      4770.9
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |    144655.1          |    108605.0
6 threads: -------------------------------------------------------------------------------
      [1, 3, 16, 320, 320] -> [8, 256, 256]   |      1401.9          |       877.7
      [1, 3, 16, 320, 320] -> [32, 512, 512]  |     35939.6          |     28621.5

Times are in microseconds (us).
[------------------ upsample_bicubic2d channels_first contiguous -----------------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |      6038.7          |      2340.4
      [1, 3, 320, 320] -> (512, 512)   |     24040.6          |      9205.9
      [32, 128, 64, 64] -> (32, 32)    |    471016.3          |     52059.1
      [32, 128, 64, 64] -> (128, 128)  |   7705594.5          |    884743.9
      [1, 3, 500, 500] -> (256, 256)   |      6061.5          |      2361.9
      [1, 3, 500, 500] -> (800, 800)   |     58940.7          |     22401.8
6 threads: ------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |      6594.3          |       466.5
      [1, 3, 320, 320] -> (512, 512)   |     25361.5          |      1729.1
      [32, 128, 64, 64] -> (32, 32)    |    487783.5          |     11550.0
      [32, 128, 64, 64] -> (128, 128)  |   7963636.6          |    196017.3
      [1, 3, 500, 500] -> (256, 256)   |      6443.8          |       464.1
      [1, 3, 500, 500] -> (800, 800)   |     61891.9          |      4257.2

Times are in microseconds (us).

[--------------- upsample_bicubic2d channels_first non-contiguous ---------------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544
1 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |      6116.7          |      2357.0
      [1, 3, 320, 320] -> (512, 512)   |     24182.0          |      9213.9
      [1, 3, 500, 500] -> (256, 256)   |      6349.6          |      2358.5
      [1, 3, 500, 500] -> (800, 800)   |     59365.2          |     22431.2
6 threads: -----------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |      7155.1          |       464.6
      [1, 3, 320, 320] -> (512, 512)   |     24566.8          |      1712.4
      [1, 3, 500, 500] -> (256, 256)   |      7217.5          |       466.6
      [1, 3, 500, 500] -> (800, 800)   |     59880.2          |      4148.8

Times are in microseconds (us).
[------------------------ upsample_bicubic2d channels_last non-contiguous -------------------------]
                                       |  1.9.0a0+git8518b0e  |  1.9.0a0+gite3a9544  |  opencv 4.5.1
1 threads: -----------------------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |      6184.3          |      2360.0          |       215.0
      [1, 3, 320, 320] -> (512, 512)   |     24499.7          |      9231.1          |       510.7
      [32, 128, 64, 64] -> (32, 32)    |    548304.5          |     93517.8          |
      [32, 128, 64, 64] -> (128, 128)  |   7810958.3          |   1086334.6          |
      [2, 128, 64, 46] -> (32, 32)     |     10883.4          |      5594.9          |
      [2, 128, 64, 46] -> (128, 128)   |    153253.2          |     57071.2          |
      [1, 128, 64, 46] -> (32, 32)     |      4519.4          |      2826.5          |       619.7
      [1, 128, 64, 46] -> (128, 128)   |     61339.7          |     28470.7          |      3654.5
      [1, 3, 500, 500] -> (256, 256)   |      6444.8          |      2389.9          |       292.9
      [1, 3, 500, 500] -> (800, 800)   |     59448.0          |     22479.1          |      1316.9
6 threads: -----------------------------------------------------------------------------------------
      [1, 3, 320, 320] -> (256, 256)   |      6370.1          |       464.9          |        61.3
      [1, 3, 320, 320] -> (512, 512)   |     25365.6          |      1767.5          |       145.7
      [32, 128, 64, 64] -> (32, 32)    |    502888.7          |     22016.3          |
      [32, 128, 64, 64] -> (128, 128)  |   8072918.9          |    234567.0          |
      [2, 128, 64, 46] -> (32, 32)     |     11171.4          |      1049.5          |
      [2, 128, 64, 46] -> (128, 128)   |    152612.5          |     11264.8          |
      [1, 128, 64, 46] -> (32, 32)     |      4359.3          |       791.4          |       651.1
      [1, 128, 64, 46] -> (128, 128)   |     61346.5          |      7563.9          |      3765.2
      [1, 3, 500, 500] -> (256, 256)   |      6644.4          |       469.7          |        77.4
      [1, 3, 500, 500] -> (800, 800)   |     59947.2          |      4154.3          |       313.2

Times are in microseconds (us).
Intermediate benchmark sources:

- results/20212303-061238_pth_nightly_results_1.9.0a0+git8518b0e.log.save.opencv
- results/20212303-061238_pr_results_1.9.0a0+gite3a9544.log.save.opencv
```

[Source file](https://raw.githubusercontent.com/vfdev-5/interpolate-tensoriterator/master/step_seven/results/20212303-061238_pr_1.9.0a0%2Bgite3a9544_vs_pth_1.9.0a0%2Bgit8518b0e_results.opencv.md)

</details>

Pull Request resolved: #54500

Reviewed By: glaringlee

Differential Revision: D27463566

Pulled By: fmassa

fbshipit-source-id: ceac3a8cee0eeb1a4ddd9344accffcc65449a49a
Related to #10482
Description:
Results
Interpolation 2d - 6 thread(s)
Interpolation 1d - 6 thread(s)
Interpolation 3d - 6 thread(s)
Interpolation 2d - 1 thread(s)
Interpolation 1d - 1 thread(s)
Interpolation 3d - 1 thread(s)
Versions and build configs
PyTorch master: 1.9.0.dev20210223
PyTorch master build setting:
PR: 1.9.0a0+74b172b
PR build setting:
This description is based on the benchmarks and the code from here.
TL;DR
Joint work with Francisco Massa (@fmassa).
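The layout of the result tables matches what `torch.utils.benchmark.Compare` prints. The following is a minimal sketch of how one such measurement could be taken, not the benchmark script used for this PR: the `bench_interpolate` helper, the chosen shapes, and `align_corners=False` are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F
import torch.utils.benchmark as benchmark


def bench_interpolate(shape, size, mode, num_threads,
                      channels_last=False, min_run_time=0.2):
    """Time F.interpolate for one 4D input shape, output size and thread count.

    Hypothetical helper for illustration; returns a benchmark Measurement.
    """
    x = torch.rand(*shape)
    if channels_last:
        # Same logical NCHW shape, NHWC strides in memory (4D inputs only).
        x = x.contiguous(memory_format=torch.channels_last)
    # align_corners is only accepted by the linear/cubic modes;
    # False is an assumed setting here.
    kwargs = {} if mode == "nearest" else {"align_corners": False}
    timer = benchmark.Timer(
        stmt="F.interpolate(x, size=size, mode=mode, **kwargs)",
        globals={"F": F, "x": x, "size": size, "mode": mode, "kwargs": kwargs},
        num_threads=num_threads,
        label=f"upsample_{mode} channels_last={channels_last}",
        sub_label=f"{list(shape)} -> {size}",
        description=torch.__version__,
    )
    return timer.blocked_autorange(min_run_time=min_run_time)


if __name__ == "__main__":
    results = []
    for num_threads in (1, 6):
        for size in ((256, 256), (512, 512)):
            results.append(
                bench_interpolate((1, 3, 320, 320), size, "bilinear",
                                  num_threads, channels_last=True)
            )
    # Rows are grouped by thread count; columns are keyed by `description`.
    benchmark.Compare(results).print()
```

Running the same script under two PyTorch builds and passing both sets of measurements to `Compare` would give one column per build, since `description` is set to the build's version string.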