Optimized bilinear interpolation using TensorIterator #51653

vfdev-5 · 2021-02-03T17:53:09Z

Related to #10482

Description:

Optimized bilinear interpolation for 1d, 2d, 3d cases using TensorIterator

Results

Interpolation 2d - 6 thread(s)

In	Out	Is contiguous	Channels last	master	this PR	speed-up
[1, 3, 320, 320]	[256, 256]	True	False	0.3938	0.0782	5.0339
[1, 3, 320, 320]	[512, 512]	True	False	1.5585	0.4105	3.7965
[1, 3, 320, 320]	[256, 256]	False	False	0.3481	0.0760	4.5780
[1, 3, 320, 320]	[512, 512]	False	False	1.5848	0.4091	3.8734
[1, 3, 320, 320]	[256, 256]	False	True	1.2058	1.2034	1.0020
[1, 3, 320, 320]	[512, 512]	False	True	4.8691	4.8537	1.0032
[32, 128, 64, 64]	[32, 32]	False	True	6.3915	6.4041	0.9980
[32, 128, 64, 64]	[128, 128]	False	True	166.1769	164.5621	1.0098
[32, 128, 64, 64]	[32, 32]	True	False	3.7194	2.4720	1.5046
[32, 128, 64, 64]	[128, 128]	True	False	86.6704	52.3754	1.6548
[1, 3, 500, 500]	[256, 256]	True	False	0.3270	0.0792	4.1307
[1, 3, 500, 500]	[800, 800]	True	False	3.3116	0.5567	5.9482
[1, 3, 500, 500]	[256, 256]	False	False	0.3763	0.0773	4.8700
[1, 3, 500, 500]	[800, 800]	False	False	3.2577	0.5590	5.8279

Interpolation 1d - 6 thread(s)

In	Out	Is contiguous	Channels last	master	this PR	speed-up
[4, 512, 320]	256	True	False	0.2795	0.1032	2.7089
[4, 512, 320]	512	True	False	0.5533	0.1888	2.9303

Interpolation 3d - 6 thread(s)

In	Out	Is contiguous	Channels last	master	this PR	speed-up
[1, 3, 16, 320, 320]	[8, 256, 256]	True	False	4.4105	2.1236	2.0769
[1, 3, 16, 320, 320]	[32, 512, 512]	True	False	83.9426	42.6641	1.9675
[1, 3, 16, 320, 320]	[8, 256, 256]	False	True	15.5736	15.5758	0.9999
[1, 3, 16, 320, 320]	[32, 512, 512]	False	True	272.4795	273.2745	0.9971

Interpolation 2d - 1 thread(s)

In	Out	Is contiguous	Channels last	master	this PR	speed-up
[1, 3, 320, 320]	[256, 256]	True	False	1.0240	0.4145	2.4705
[1, 3, 320, 320]	[512, 512]	True	False	4.0771	1.3836	2.9467
[1, 3, 320, 320]	[256, 256]	False	False	0.9771	0.3270	2.9878
[1, 3, 320, 320]	[512, 512]	False	False	4.1732	1.2209	3.4180
[1, 3, 320, 320]	[256, 256]	False	True	1.5466	1.5363	1.0067
[1, 3, 320, 320]	[512, 512]	False	True	6.1555	6.1199	1.0058
[32, 128, 64, 64]	[32, 32]	False	True	27.6362	27.5901	1.0017
[32, 128, 64, 64]	[128, 128]	False	True	468.6442	465.5163	1.0067
[32, 128, 64, 64]	[32, 32]	True	False	20.1495	10.0694	2.0011
[32, 128, 64, 64]	[128, 128]	True	False	400.0401	204.0662	1.9603
[1, 3, 500, 500]	[256, 256]	True	False	0.8956	0.3366	2.6606
[1, 3, 500, 500]	[800, 800]	True	False	8.6554	2.9530	2.9310
[1, 3, 500, 500]	[256, 256]	False	False	1.0921	0.3385	3.2263
[1, 3, 500, 500]	[800, 800]	False	False	8.9594	2.9627	3.0241

Interpolation 1d - 1 thread(s)

In	Out	Is contiguous	Channels last	master	this PR	speed-up
[4, 512, 320]	256	True	False	1.5233	0.5027	3.0301
[4, 512, 320]	512	True	False	3.0302	0.9735	3.1128

Interpolation 3d - 1 thread(s)

In	Out	Is contiguous	Channels last	master	this PR	speed-up
[1, 3, 16, 320, 320]	[8, 256, 256]	True	False	12.0477	11.3196	1.0643
[1, 3, 16, 320, 320]	[32, 512, 512]	True	False	222.8618	209.9955	1.0613
[1, 3, 16, 320, 320]	[8, 256, 256]	False	True	17.9883	17.9937	0.9997
[1, 3, 16, 320, 320]	[32, 512, 512]	False	True	380.7244	380.1916	1.0014

Versions and build configs

PyTorch master: 1.9.0.dev20210223
PyTorch master build setting:

BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

PR : 1.9.0a0+74b172b
PR build setting:

BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/g++-7, CXX_FLAGS=-O3 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON,

This description is based on the benchmarks and the code from here: https://github.com/vfdev-5/interpolate-tensoriterator/tree/master/step_six

TL;DR

- Linear upsampling generic implementation using TensorIterator for Nd case (single loop function for 1d, 2d and 3d cases)
- can be generalized to nearest, bicubic interpolation modes.
- works for channels first and last cases.
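
As a rough illustration of the "single loop function" idea, here is a minimal pure-Python sketch for the 1d case: per output position, precompute the two source indices and their weights, then run one plain loop over them. It assumes the `align_corners=False` coordinate mapping. The helper names `linear_index_and_weight` and `upsample_linear1d` are made up for this example and are not the PR's C++ kernels.

```python
import math


def linear_index_and_weight(out_index, in_size, out_size):
    """Return (i0, i1, w0, w1) for one output position (align_corners=False)."""
    scale = in_size / out_size
    # Map the centre of the output element back into input coordinates,
    # clamping at 0 so the left border reuses the first input element.
    src = max((out_index + 0.5) * scale - 0.5, 0.0)
    i0 = min(int(math.floor(src)), in_size - 1)
    i1 = min(i0 + 1, in_size - 1)  # right neighbour, clamped at the border
    w1 = src - i0
    return i0, i1, 1.0 - w1, w1


def upsample_linear1d(row, out_size):
    # Step 1: precompute indices and weights once per output position.
    table = [linear_index_and_weight(o, len(row), out_size) for o in range(out_size)]
    # Step 2: a single loop that only gathers two values and blends them.
    return [w0 * row[i0] + w1 * row[i1] for i0, i1, w0, w1 in table]


print(upsample_linear1d([0.0, 10.0], 4))
```

Because the indices and weights depend only on the output position, the same table can be reused for every batch element and channel, which is what makes a single generic loop possible.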

Joint work with Francisco Massa (@fmassa).

…rator - MemoryFormat: channel first only

facebook-github-bot · 2021-02-03T17:53:22Z

💊 CI failures summary and remediations

As of commit 74b172b (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

…sor-iterator

VitalyFedyunin · 2021-02-10T21:48:47Z

@fmassa let me know if you cannot review it.

@vfdev-5 please benchmark non contiguous cases

vfdev-5 · 2021-02-10T21:57:45Z

@VitalyFedyunin there is already a channels-last (non-contiguous) case here, which is routed to the original vectorized implementation.

Are you thinking of particular cases like torch.rand(b, c, H, W)[:, :, :h, :w]?
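
To make the two kinds of "non-contiguous" concrete, the sketch below computes element strides for a dense NCHW tensor, a channels-last tensor, and a sliced view like the one above. It is plain Python with hypothetical helper names, not PyTorch API. Skipping size-1 dimensions in the check follows the usual convention that such dimensions do not affect contiguity.

```python
def contiguous_strides(shape):
    """Strides (in elements) of a dense row-major tensor of this shape."""
    strides, step = [], 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def channels_last_strides(shape):
    """Strides of an (N, C, H, W) tensor stored in N, H, W, C memory order."""
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)


def is_contiguous(shape, strides):
    # Size-1 dimensions are skipped: their stride never affects addressing.
    expected = contiguous_strides(shape)
    return all(s == e for dim, s, e in zip(shape, strides, expected) if dim != 1)


b, c, H, W, h, w = 1, 3, 500, 500, 320, 320
parent_strides = contiguous_strides((b, c, H, W))
# Slicing [:, :, :h, :w] keeps the parent's strides but shrinks the shape.
sliced_shape, sliced_strides = (b, c, h, w), parent_strides

print(is_contiguous((b, c, H, W), parent_strides))
print(is_contiguous(sliced_shape, sliced_strides))
print(is_contiguous(sliced_shape, channels_last_strides(sliced_shape)))
```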

fmassa · 2021-02-11T09:05:15Z

@VitalyFedyunin I'll review this PR ~~today~~ on Monday

vfdev-5 · 2021-02-15T10:22:54Z

@VitalyFedyunin as discussed with @fmassa, I'd like to mention a performance drawback with the dispatch stub in our particular case of 3d linear interpolation.
I compared the execution times of my prototype code (used for development and as the codebase for this PR) and PyTorch built with this PR:

---- Benchmark 3D ----
Input tensor: [1, 3, 16, 320, 320]
Num threads: 6

- Bench upsample_trilinear3d_cpu (2000 rounds) - downsampling to [8, 256, 256]
Elapsed time (ms): 2.11485
- Bench ti_upsample_trilinear3d_kernel_impl (2000 rounds) - downsampling to [8, 256, 256]
Elapsed time (ms): 1.05858

- Bench upsample_trilinear3d_cpu (2000 rounds) - upsampling to [32, 512, 512]
Elapsed time (ms): 44.3299
- Bench ti_upsample_trilinear3d_kernel_impl (2000 rounds) - upsampling to [32, 512, 512]
Elapsed time (ms): 26.0051

The expected result is roughly similar times for upsample_trilinear3d_cpu and ti_upsample_trilinear3d_kernel_impl, as it is more or less the same code.

I also tried to register the dispatch manually, as here: master...Quansight:upsample-tensor-iterator-another-dispatch, see aten/src/ATen/native/UpSampleTrilinear3d.cpp

REGISTER_ARCH_DISPATCH(upsample_trilinear3d_kernel, DEFAULT, &upsample_trilinear3d_kernel_impl);
REGISTER_AVX_DISPATCH(upsample_trilinear3d_kernel, &upsample_trilinear3d_kernel_impl);
REGISTER_AVX2_DISPATCH(upsample_trilinear3d_kernel, &upsample_trilinear3d_kernel_impl);

and it looks like this way I could restore the expected times:

---- Benchmark 3D ----
Input tensor: [1, 3, 16, 320, 320]
Num threads: 6

- Bench upsample_trilinear3d_cpu (750 rounds) - downsampling to [8, 256, 256]
Elapsed time (ms): 1.05641
- Bench ti_upsample_trilinear3d_kernel_impl (750 rounds) - downsampling to [8, 256, 256]
Elapsed time (ms): 1.05557
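
As a conceptual analogy only, the manual registration above can be pictured as filling a small table that maps a CPU capability to a kernel function, from which the best supported entry is picked at run time. The Python below is a hypothetical sketch of that idea. `KernelStub`, `PREFERENCE` and `upsample_kernel_impl` are invented names, and this is not how PyTorch's C++ dispatch stub is actually implemented.

```python
# Capabilities ordered from most to least preferred (assumed for illustration).
PREFERENCE = ["AVX2", "AVX", "DEFAULT"]


class KernelStub:
    """Toy registry: one kernel function per CPU capability."""

    def __init__(self):
        self._table = {}

    def register(self, capability, fn):
        self._table[capability] = fn

    def choose(self, supported):
        """Return the first registered kernel whose capability is supported."""
        for cap in PREFERENCE:
            if cap in supported and cap in self._table:
                return self._table[cap]
        raise RuntimeError("no kernel registered for this CPU")


def upsample_kernel_impl(values):
    # Stand-in for a real kernel body.
    return [2 * v for v in values]


stub = KernelStub()
# Mirrors the three manual registrations: the same function for every capability.
for cap in PREFERENCE:
    stub.register(cap, upsample_kernel_impl)

kernel = stub.choose({"DEFAULT", "AVX", "AVX2"})
print(kernel([1, 2]))
```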

Again, this performance slowdown happens only for the 3D case, where the code unrolls a template loop over the dimensions:
https://github.com/Quansight/pytorch/blob/64516dbc7dc9891f1702954a741ff756463799a0/aten/src/ATen/native/cpu/UpSampleMoreKernel.cpp#L457-L460
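
For intuition about what "a loop over the dimensions" means here, the following is a small recursive Python sketch: each dimension blends two neighbouring slices, and each slice is itself interpolated over the remaining dimensions. The function name `interpolate_nd` and the `(i0, i1, w0, w1)` tuple layout are assumptions for this example. The linked C++ code does the equivalent with templates resolved at compile time rather than run-time recursion.

```python
def interpolate_nd(data, taps):
    """Linearly interpolate one output value from nested lists.

    `taps` holds one (i0, i1, w0, w1) tuple per remaining dimension,
    outermost dimension first.
    """
    if not taps:
        return data  # all dimensions consumed: `data` is a scalar
    (i0, i1, w0, w1), rest = taps[0], taps[1:]
    # Blend the two neighbouring slices along this dimension, recursing
    # into the remaining dimensions for each of them.
    return w0 * interpolate_nd(data[i0], rest) + w1 * interpolate_nd(data[i1], rest)


cube = [[[0.0, 1.0], [2.0, 3.0]], [[4.0, 5.0], [6.0, 7.0]]]
centre = [(0, 1, 0.5, 0.5)] * 3  # halfway between neighbours in every dimension
print(interpolate_nd(cube, centre))
```

With three dimensions the recursion touches 2^3 = 8 input values per output element, compared with 4 for 2d and 2 for 1d.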

PS: I also updated the full results with more test cases: non-contiguous cases.

…sor-iterator

aten/src/ATen/native/UpSample.h

codecov · 2021-02-15T15:41:40Z

Codecov Report

Merging #51653 (74b172b) into master (64847c7) will increase coverage by 0.00%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #51653   +/-   ##
=======================================
  Coverage   80.79%   80.79%           
=======================================
  Files        1972     1972           
  Lines      216093   216093           
=======================================
+ Hits       174586   174587    +1     
+ Misses      41507    41506    -1

fmassa

I've made some minor comments about the structure of the code.

It would be great to understand the slowdown when integrating this implementation into PyTorch for the 3d case.

aten/src/ATen/native/cpu/UpSampleMoreKernel.cpp

fmassa · 2021-02-15T17:42:16Z

Additionally, @vfdev-5 can you maybe put the performance results in a table which is linked in the PR? This will make it easier to understand what's going on.

Something like

2d - 1 thread

config	master	this PR	speed-up
[1x3x256x256] -> [1x3x500x500]	xxx	xxx	xxx
[1x3x256x256] -> [1x3x128x128]	xxx	xxx	xxx
...

2d - 6 threads

config	master	this PR	speed-up
[1x3x256x256] -> [1x3x500x500]	xxx	xxx	xxx
[1x3x256x256] -> [1x3x128x128]	xxx	xxx	xxx
...

3d

config	master	this PR	speed-up
[1x3x8x256x256] -> [1x3x8x500x500]	xxx	xxx	xxx

…sor-iterator

- Removed int32/int64 index dispatch
- Added more comments and other updates according to the review

vfdev-5 · 2021-02-22T15:49:17Z

Results for 149b976, i.e. when TensorIterator is used instead of cpu_upsample_linear_channels_last.
The TensorIterator implementation is faster in some cases and slower in others (see the speed-ups < 1.0):

Interpolation 2d - 6 thread(s)

In	Out	Is contiguous	Channels last	master	this PR	speed-up
[1, 3, 320, 320]	[256, 256]	True	False	0.3261	0.1239	2.6318
[1, 3, 320, 320]	[512, 512]	True	False	1.2854	0.4086	3.1458
[1, 3, 320, 320]	[256, 256]	False	False	0.3488	0.0777	4.4919
[1, 3, 320, 320]	[512, 512]	False	False	1.3063	0.4084	3.1984
[1, 3, 320, 320]	[256, 256]	False	True	1.0897	0.3289	3.3132
[1, 3, 320, 320]	[512, 512]	False	True	4.2505	1.3841	3.0711
[32, 128, 64, 64]	[32, 32]	False	True	2.2961	2.9314	0.7833
[32, 128, 64, 64]	[128, 128]	False	True	35.9384	35.6293	1.0087
[32, 128, 64, 64]	[32, 32]	True	False	3.6902	3.5451	1.0409
[32, 128, 64, 64]	[128, 128]	True	False	86.7835	52.4501	1.6546
[1, 3, 500, 500]	[256, 256]	True	False	0.3266	0.0785	4.1601
[1, 3, 500, 500]	[800, 800]	True	False	3.1868	0.5580	5.7114
[1, 3, 500, 500]	[256, 256]	False	False	0.3771	0.0793	4.7555
[1, 3, 500, 500]	[800, 800]	False	False	3.2693	0.5610	5.8271

Interpolation 1d - 6 thread(s)

In	Out	Is contiguous	Channels last	master	this PR	speed-up
[4, 512, 320]	256	True	False	0.2808	0.1044	2.6907
[4, 512, 320]	512	True	False	0.5524	0.1887	2.9269

Interpolation 3d - 6 thread(s)

In	Out	Is contiguous	Channels last	master	this PR	speed-up
[1, 3, 16, 320, 320]	[8, 256, 256]	True	False	4.4017	0.9578	4.5958
[1, 3, 16, 320, 320]	[32, 512, 512]	True	False	84.0302	24.0669	3.4915
[1, 3, 16, 320, 320]	[8, 256, 256]	False	True	13.6098	3.0288	4.4934
[1, 3, 16, 320, 320]	[32, 512, 512]	False	True	246.6380	64.3400	3.8334

Interpolation 2d - 1 thread(s)

In	Out	Is contiguous	Channels last	master	this PR	speed-up
[1, 3, 320, 320]	[256, 256]	True	False	0.8967	0.4551	1.9703
[1, 3, 320, 320]	[512, 512]	True	False	3.5399	1.7594	2.0120
[1, 3, 320, 320]	[256, 256]	False	False	0.9760	0.3305	2.9531
[1, 3, 320, 320]	[512, 512]	False	False	3.6266	1.7643	2.0555
[1, 3, 320, 320]	[256, 256]	False	True	1.0093	1.6589	0.6084
[1, 3, 320, 320]	[512, 512]	False	True	4.0231	7.1302	0.5642
[32, 128, 64, 64]	[32, 32]	False	True	5.8736	9.6382	0.6094
[32, 128, 64, 64]	[128, 128]	False	True	108.2541	117.1183	0.9243
[32, 128, 64, 64]	[32, 32]	True	False	19.9122	14.0883	1.4134
[32, 128, 64, 64]	[128, 128]	True	False	398.8196	205.5317	1.9404
[1, 3, 500, 500]	[256, 256]	True	False	0.8944	0.3388	2.6404
[1, 3, 500, 500]	[800, 800]	True	False	8.6327	2.9568	2.9196
[1, 3, 500, 500]	[256, 256]	False	False	1.0921	0.3405	3.2076
[1, 3, 500, 500]	[800, 800]	False	False	8.9394	2.9654	3.0145

Interpolation 1d - 1 thread(s)

In	Out	Is contiguous	Channels last	master	this PR	speed-up
[4, 512, 320]	256	True	False	1.5233	0.5066	3.0071
[4, 512, 320]	512	True	False	3.0312	0.9796	3.0943

Interpolation 3d - 1 thread(s)

In	Out	Is contiguous	Channels last	master	this PR	speed-up
[1, 3, 16, 320, 320]	[8, 256, 256]	True	False	12.0408	4.8498	2.4827
[1, 3, 16, 320, 320]	[32, 512, 512]	True	False	222.8379	105.1315	2.1196
[1, 3, 16, 320, 320]	[8, 256, 256]	False	True	13.3036	17.2361	0.7718
[1, 3, 16, 320, 320]	[32, 512, 512]	False	True	245.9575	297.0317	0.8281
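
The speed-up column appears to be the master time divided by the PR time, so values below 1.0 mark cases where the PR is slower. The short sketch below recomputes it for three rows copied from the 2d, 1 thread table and lists the regressions. Because the published times are rounded, the recomputed ratios may differ from the table in the last digit.

```python
def speed_up(master_time, pr_time):
    """Speed-up as assumed for the tables: > 1 means the PR is faster."""
    return master_time / pr_time


# (label, master, this PR) copied from the "Interpolation 2d - 1 thread(s)" table.
rows = [
    ("[1, 3, 320, 320] -> [256, 256], channels last", 1.0093, 1.6589),
    ("[32, 128, 64, 64] -> [128, 128], channels last", 108.2541, 117.1183),
    ("[32, 128, 64, 64] -> [128, 128], contiguous", 398.8196, 205.5317),
]

regressions = [
    (label, speed_up(master, pr))
    for label, master, pr in rows
    if speed_up(master, pr) < 1.0
]

for label, ratio in regressions:
    print(f"{label}: {ratio:.4f}")
```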

Versions and build configs

PyTorch master: 1.8.0.dev20210208+cu110
PyTorch master build setting:

BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.0, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

PR : 1.9.0a0+149b976
PR build setting:

BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/g++-7, CXX_FLAGS=-O3 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON,

…sor-iterator

aten/src/ATen/native/cpu/UpSampleMoreKernel.cpp

fmassa

Looks great, thanks a lot!

facebook-github-bot

@fmassa has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

VitalyFedyunin · 2021-02-24T22:05:38Z

Please apply the REGISTER_AVX2_DISPATCH workaround while I review the issue with dispatch (as it might take additional time).

fmassa · 2021-02-25T13:23:13Z

I propose we do the following: given that the current PR already gives a significant speedup for the 3d case (despite the issues with REGISTER_DISPATCH), I would say to merge the PR as is, while keeping in mind that we can still obtain a 2x speed improvement on top of it when we fix the REGISTER_AVX2_DISPATCH situation.

facebook-github-bot · 2021-03-01T17:15:55Z

@fmassa merged this pull request in 66f07c0.

Summary:
Related to pytorch#10482

Description:
- Optimized bilinear interpolation for 1d, 2d, 3d cases using TensorIterator

This description is based on the benchmarks and the code from [here](https://github.com/vfdev-5/interpolate-tensoriterator/tree/master/step_six).

TL;DR
- Linear upsampling generic implementation using TensorIterator for Nd case (single loop function for 1d, 2d and 3d cases)
- can be generalized to nearest, bicubic interpolation modes.
- works for channels first and last cases.

Joint work with Francisco Massa (fmassa).

Pull Request resolved: pytorch#51653

Reviewed By: malfet

Differential Revision: D26619437

Pulled By: fmassa

fbshipit-source-id: 7d435e23881c5b40a18bf0dbcab4906d5462025f

…2d/3d channels last impl) (#54500)

Summary:
Related to #10482

A follow-up PR to #51653

Description:
- Replaces nearest/linear/cubic implementations with generic interpolation implementation
- Retains 2d/3d channels last implementation due to perf slowdown for 1 thread (see below appendix note)

Speed-ups for cases:
- upsample_nearest channels first
- upsample_bicubic channels first/last

### Results for this PR

<details>
<summary>
Benchmark results between 8518b0e (master) and 73137d8 (this PR)
</summary>

```
Description:
- 20210331-092940_pth_nightly_results_1.9.0a0+git8518b0e.6
- 20210331-092940_pth_nightly_results_1.9.0a0+git8518b0e.1
- 20210331-092940_pr_results_1.9.0a0+git73137d8.6
- 20210331-092940_pr_results_1.9.0a0+git73137d8.1

[---------- upsample_bilinear2d channels_first contiguous torch.float32 ----------]
                                  | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)  |              331.8 |              334.6
  [1, 3, 320, 320] -> (512, 512)  |             1261.7 |             1271.5
  [32, 128, 64, 64] -> (32, 32)   |            10164.6 |            10251.4
  [32, 128, 64, 64] -> (128, 128) |           195966.1 |           197141.8
  [1, 3, 500, 500] -> (256, 256)  |              347.7 |              348.3
  [1, 3, 500, 500] -> (800, 800)  |             3044.9 |             3071.4
6 threads: ------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)  |               76.1 |               77.0
  [1, 3, 320, 320] -> (512, 512)  |              244.8 |              247.6
  [32, 128, 64, 64] -> (32, 32)   |             2329.4 |             2315.8
  [32, 128, 64, 64] -> (128, 128) |            47855.3 |            49047.7
  [1, 3, 500, 500] -> (256, 256)  |               78.1 |               78.7
  [1, 3, 500, 500] -> (800, 800)  |              569.3 |              575.6

Times are in microseconds (us).
[------- upsample_bilinear2d channels_first non-contiguous torch.float32 --------]
                                  | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8
1 threads: -----------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)  |              339.0 |              340.3
  [1, 3, 320, 320] -> (512, 512)  |             1266.1 |             1277.3
  [1, 3, 500, 500] -> (256, 256)  |              348.8 |              351.3
  [1, 3, 500, 500] -> (800, 800)  |             3054.5 |             3077.3
6 threads: -----------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)  |               76.6 |               77.4
  [1, 3, 320, 320] -> (512, 512)  |              246.0 |              248.1
  [1, 3, 500, 500] -> (256, 256)  |               78.3 |               79.5
  [1, 3, 500, 500] -> (800, 800)  |              572.2 |              580.0

Times are in microseconds (us).

[--------- upsample_bilinear2d channels_last non-contiguous torch.float32 --------]
                                  | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)  |              965.4 |              964.9
  [1, 3, 320, 320] -> (512, 512)  |             3856.2 |             3866.8
  [32, 128, 64, 64] -> (32, 32)   |             5808.3 |             5812.8
  [32, 128, 64, 64] -> (128, 128) |            99575.2 |            97226.2
  [2, 128, 64, 46] -> (32, 32)    |              110.5 |              109.0
  [2, 128, 64, 46] -> (128, 128)  |             1662.3 |             1612.0
  [1, 128, 64, 46] -> (32, 32)    |               55.6 |               55.5
  [1, 128, 64, 46] -> (128, 128)  |              467.0 |              463.9
  [1, 3, 500, 500] -> (256, 256)  |              967.7 |              966.7
  [1, 3, 500, 500] -> (800, 800)  |             9394.7 |             9436.6
6 threads: ------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)  |              962.2 |              965.4
  [1, 3, 320, 320] -> (512, 512)  |             3844.3 |             3844.3
  [32, 128, 64, 64] -> (32, 32)   |             2270.0 |             2267.6
  [32, 128, 64, 64] -> (128, 128) |            31909.7 |            32106.5
  [2, 128, 64, 46] -> (32, 32)    |               61.3 |               59.9
  [2, 128, 64, 46] -> (128, 128)  |              912.3 |              893.5
  [1, 128, 64, 46] -> (32, 32)    |               55.5 |               55.3
  [1, 128, 64, 46] -> (128, 128)  |              467.0 |              466.4
  [1, 3, 500, 500] -> (256, 256)  |              967.2 |              971.1
  [1, 3, 500, 500] -> (800, 800)  |             9383.2 |             9417.4

Times are in microseconds (us).
[------ upsample_linear1d channels_first contiguous torch.float32 -------]
                         | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8
1 threads: ---------------------------------------------------------------
  [4, 512, 320] -> [256] |              513.5 |              521.8
  [4, 512, 320] -> [512] |              999.0 |             1011.8
6 threads: ---------------------------------------------------------------
  [4, 512, 320] -> [256] |              103.7 |              104.9
  [4, 512, 320] -> [512] |              192.2 |              194.9

Times are in microseconds (us).

[------------- upsample_trilinear3d channels_first contiguous torch.float32 -------------]
                                         | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8
1 threads: -------------------------------------------------------------------------------
  [1, 3, 16, 320, 320] -> [8, 256, 256]  |                5.4 |                5.5
  [1, 3, 16, 320, 320] -> [32, 512, 512] |              111.2 |              111.1
6 threads: -------------------------------------------------------------------------------
  [1, 3, 16, 320, 320] -> [8, 256, 256]  |                1.1 |                1.0
  [1, 3, 16, 320, 320] -> [32, 512, 512] |               23.4 |               23.2

Times are in milliseconds (ms).

[----------- upsample_trilinear3d channels_last non-contiguous torch.float32 ------------]
                                         | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8
1 threads: -------------------------------------------------------------------------------
  [1, 3, 16, 320, 320] -> [8, 256, 256]  |            13521.9 |            12939.9
  [1, 3, 16, 320, 320] -> [32, 512, 512] |           244561.3 |           236595.6
  [1, 16, 32, 64, 64] -> [16, 32, 32]    |              362.2 |              365.5
  [1, 16, 32, 64, 64] -> [64, 128, 128]  |            38141.4 |            37957.7
6 threads: -------------------------------------------------------------------------------
  [1, 3, 16, 320, 320] -> [8, 256, 256]  |            12980.4 |            12962.7
  [1, 3, 16, 320, 320] -> [32, 512, 512] |           236256.4 |           236364.5
  [1, 16, 32, 64, 64] -> [16, 32, 32]    |              367.9 |              393.2
  [1, 16, 32, 64, 64] -> [64, 128, 128]  |            38222.5 |            38198.3

Times are in microseconds (us).
[----------- upsample_nearest2d channels_first contiguous torch.float32 ----------]
                                  | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)  |             1205.7 |              107.2
  [1, 3, 320, 320] -> (512, 512)  |             4793.5 |              357.7
  [32, 128, 64, 64] -> (32, 32)   |            26550.0 |             6227.1
  [32, 128, 64, 64] -> (128, 128) |           341140.3 |           116404.4
  [1, 3, 500, 500] -> (256, 256)  |             1208.6 |              122.9
  [1, 3, 500, 500] -> (800, 800)  |            11648.0 |              848.1
6 threads: ------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)  |              220.5 |               32.6
  [1, 3, 320, 320] -> (512, 512)  |              865.4 |               78.1
  [32, 128, 64, 64] -> (32, 32)   |             4890.9 |             2201.2
  [32, 128, 64, 64] -> (128, 128) |            73533.8 |            32315.4
  [1, 3, 500, 500] -> (256, 256)  |              222.3 |               35.0
  [1, 3, 500, 500] -> (800, 800)  |             2107.5 |              170.7

Times are in microseconds (us).

[----------- upsample_nearest2d channels_first contiguous torch.uint8 -----------]
                                  | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8
1 threads: -----------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)  |             1457.0 |              310.7
  [1, 3, 320, 320] -> (512, 512)  |             5808.0 |             1196.6
  [1, 3, 500, 500] -> (256, 256)  |             1460.9 |              312.7
  [1, 3, 500, 500] -> (800, 800)  |            14094.3 |             2903.5
6 threads: -----------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)  |              264.8 |               66.8
  [1, 3, 320, 320] -> (512, 512)  |             1046.0 |              228.9
  [1, 3, 500, 500] -> (256, 256)  |              266.0 |               68.0
  [1, 3, 500, 500] -> (800, 800)  |             2546.6 |              535.8

Times are in microseconds (us).
[-------- upsample_nearest2d channels_first non-contiguous torch.float32 --------]
                                  | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8
1 threads: -----------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)  |             1284.3 |              109.9
  [1, 3, 320, 320] -> (512, 512)  |             4870.0 |              361.6
  [1, 3, 500, 500] -> (256, 256)  |             1482.8 |              123.3
  [1, 3, 500, 500] -> (800, 800)  |            12050.3 |              858.8
6 threads: -----------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)  |              240.2 |               32.8
  [1, 3, 320, 320] -> (512, 512)  |              886.1 |               78.4
  [1, 3, 500, 500] -> (256, 256)  |              274.9 |               34.9
  [1, 3, 500, 500] -> (800, 800)  |             2188.8 |              174.0

Times are in microseconds (us).

[--------- upsample_nearest2d channels_first non-contiguous torch.uint8 ---------]
                                  | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8
1 threads: -----------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)  |             1501.9 |              312.2
  [1, 3, 320, 320] -> (512, 512)  |             5853.4 |             1202.1
  [1, 3, 500, 500] -> (256, 256)  |             1574.0 |              313.9
  [1, 3, 500, 500] -> (800, 800)  |            14210.2 |             2904.5
6 threads: -----------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)  |              277.2 |               67.2
  [1, 3, 320, 320] -> (512, 512)  |             1059.8 |              228.9
  [1, 3, 500, 500] -> (256, 256)  |              292.2 |               68.1
  [1, 3, 500, 500] -> (800, 800)  |             2574.4 |              536.2

Times are in microseconds (us).
[--------- upsample_nearest2d channels_last non-contiguous torch.float32 ---------]
                                   | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 746.0 | 751.1
  [1, 3, 320, 320] -> (512, 512)   | 2967.6 | 2979.2
  [32, 128, 64, 64] -> (32, 32)    | 3408.5 | 3379.0
  [32, 128, 64, 64] -> (128, 128)  | 90166.4 | 90023.0
  [2, 128, 64, 46] -> (32, 32)     | 74.8 | 74.5
  [2, 128, 64, 46] -> (128, 128)   | 1591.2 | 1594.3
  [1, 128, 64, 46] -> (32, 32)     | 39.3 | 39.2
  [1, 128, 64, 46] -> (128, 128)   | 420.3 | 419.1
  [1, 3, 500, 500] -> (256, 256)   | 751.6 | 756.3
  [1, 3, 500, 500] -> (800, 800)   | 7222.2 | 7268.6
6 threads: ------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 144.9 | 140.1
  [1, 3, 320, 320] -> (512, 512)   | 560.7 | 540.6
  [32, 128, 64, 64] -> (32, 32)    | 1418.1 | 1418.6
  [32, 128, 64, 64] -> (128, 128)  | 28158.4 | 26411.4
  [2, 128, 64, 46] -> (32, 32)     | 18.4 | 17.8
  [2, 128, 64, 46] -> (128, 128)   | 532.3 | 552.0
  [1, 128, 64, 46] -> (32, 32)     | 13.9 | 13.6
  [1, 128, 64, 46] -> (128, 128)   | 81.3 | 82.9
  [1, 3, 500, 500] -> (256, 256)   | 145.9 | 141.6
  [1, 3, 500, 500] -> (800, 800)   | 1363.4 | 1316.2

Times are in microseconds (us).

[---------- upsample_nearest2d channels_last non-contiguous torch.uint8 ----------]
                                   | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 795.7 | 824.1
  [1, 3, 320, 320] -> (512, 512)   | 3163.4 | 3274.8
  [32, 128, 64, 64] -> (32, 32)    | 798.8 | 812.2
  [32, 128, 64, 64] -> (128, 128)  | 25259.6 | 25453.1
  [2, 128, 64, 46] -> (32, 32)     | 39.3 | 39.9
  [2, 128, 64, 46] -> (128, 128)   | 493.7 | 499.9
  [1, 128, 64, 46] -> (32, 32)     | 22.6 | 22.9
  [1, 128, 64, 46] -> (128, 128)   | 249.7 | 254.0
  [32, 64, 128, 64] -> (32, 32)    | 475.3 | 507.4
  [32, 64, 128, 64] -> (128, 128)  | 13709.7 | 13767.5
  [1, 3, 500, 500] -> (256, 256)   | 804.0 | 827.6
  [1, 3, 500, 500] -> (800, 800)   | 7764.9 | 7982.7
6 threads: ------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 150.1 | 151.4
  [1, 3, 320, 320] -> (512, 512)   | 589.5 | 592.6
  [32, 128, 64, 64] -> (32, 32)    | 141.3 | 194.5
  [32, 128, 64, 64] -> (128, 128)  | 6916.5 | 7445.0
  [2, 128, 64, 46] -> (32, 32)     | 10.0 | 12.5
  [2, 128, 64, 46] -> (128, 128)   | 95.8 | 141.1
  [1, 128, 64, 46] -> (32, 32)     | 8.1 | 10.0
  [1, 128, 64, 46] -> (128, 128)   | 52.5 | 74.3
  [32, 64, 128, 64] -> (32, 32)    | 79.8 | 123.7
  [32, 64, 128, 64] -> (128, 128)  | 3639.9 | 4087.9
  [1, 3, 500, 500] -> (256, 256)   | 150.7 | 152.2
  [1, 3, 500, 500] -> (800, 800)   | 1430.9 | 1440.7

Times are in microseconds (us).

[------ upsample_nearest1d channels_first contiguous torch.float32 ------]
                          | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8
1 threads: ---------------------------------------------------------------
  [4, 512, 320] -> [256]  | 1601.7 | 241.7
  [4, 512, 320] -> [512]  | 3188.5 | 435.7
6 threads: ---------------------------------------------------------------
  [4, 512, 320] -> [256]  | 291.9 | 53.3
  [4, 512, 320] -> [512]  | 577.8 | 88.1

Times are in microseconds (us).

[------- upsample_nearest1d channels_first contiguous torch.uint8 -------]
                          | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8
1 threads: ---------------------------------------------------------------
  [4, 512, 320] -> [256]  | 2010.1 | 532.3
  [4, 512, 320] -> [512]  | 3999.7 | 1011.4
6 threads: ---------------------------------------------------------------
  [4, 512, 320] -> [256]  | 364.2 | 104.6
  [4, 512, 320] -> [512]  | 722.8 | 193.5

Times are in microseconds (us).

[-------------- upsample_nearest3d channels_first contiguous torch.float32 --------------]
                                          | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8
1 threads: -------------------------------------------------------------------------------
  [1, 3, 16, 320, 320] -> [8, 256, 256]   | 14801.0 | 977.5
  [1, 3, 16, 320, 320] -> [32, 512, 512]  | 217368.5 | 41577.3
6 threads: -------------------------------------------------------------------------------
  [1, 3, 16, 320, 320] -> [8, 256, 256]   | 2670.3 | 210.7
  [1, 3, 16, 320, 320] -> [32, 512, 512]  | 42023.6 | 10971.6

Times are in microseconds (us).

[--------------- upsample_nearest3d channels_first contiguous torch.uint8 ---------------]
                                          | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8
1 threads: -------------------------------------------------------------------------------
  [1, 3, 16, 320, 320] -> [8, 256, 256]   | 17151.7 | 3195.8
  [1, 3, 16, 320, 320] -> [32, 512, 512]  | 221221.0 | 50524.5
6 threads: -------------------------------------------------------------------------------
  [1, 3, 16, 320, 320] -> [8, 256, 256]   | 3085.3 | 588.6
  [1, 3, 16, 320, 320] -> [32, 512, 512]  | 39842.0 | 9141.0

Times are in microseconds (us).

[------------ upsample_nearest3d channels_last non-contiguous torch.float32 -------------]
                                          | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8
1 threads: -------------------------------------------------------------------------------
  [1, 3, 16, 320, 320] -> [8, 256, 256]   | 7694.1 | 7729.0
  [1, 3, 16, 320, 320] -> [32, 512, 512]  | 138104.6 | 138158.0
  [1, 16, 32, 64, 64] -> [16, 32, 32]     | 251.1 | 252.4
  [1, 16, 32, 64, 64] -> [64, 128, 128]   | 28991.5 | 28882.8
6 threads: -------------------------------------------------------------------------------
  [1, 3, 16, 320, 320] -> [8, 256, 256]   | 1398.3 | 1402.6
  [1, 3, 16, 320, 320] -> [32, 512, 512]  | 28056.5 | 28123.2
  [1, 16, 32, 64, 64] -> [16, 32, 32]     | 50.8 | 51.1
  [1, 16, 32, 64, 64] -> [64, 128, 128]   | 7595.7 | 7540.7

Times are in microseconds (us).

[------------- upsample_nearest3d channels_last non-contiguous torch.uint8 --------------]
                                          | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8
1 threads: -------------------------------------------------------------------------------
  [1, 3, 16, 320, 320] -> [8, 256, 256]   | 8147.8 | 8176.2
  [1, 3, 16, 320, 320] -> [32, 512, 512]  | 114658.1 | 114992.7
  [1, 16, 32, 64, 64] -> [16, 32, 32]     | 364.3 | 356.0
  [1, 16, 32, 64, 64] -> [64, 128, 128]   | 17276.0 | 16331.0
6 threads: -------------------------------------------------------------------------------
  [1, 3, 16, 320, 320] -> [8, 256, 256]   | 1469.4 | 1476.1
  [1, 3, 16, 320, 320] -> [32, 512, 512]  | 20647.1 | 20722.6
  [1, 16, 32, 64, 64] -> [16, 32, 32]     | 69.7 | 68.4
  [1, 16, 32, 64, 64] -> [64, 128, 128]   | 3125.7 | 2948.2

Times are in microseconds (us).

[----------- upsample_bicubic2d channels_first contiguous torch.float32 ----------]
                                   | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 5961.0 | 1680.2
  [1, 3, 320, 320] -> (512, 512)   | 23803.7 | 6591.0
  [32, 128, 64, 64] -> (32, 32)    | 620609.4 | 37981.6
  [32, 128, 64, 64] -> (128, 128)  | 10120286.1 | 646305.5
  [1, 3, 500, 500] -> (256, 256)   | 6005.4 | 1694.6
  [1, 3, 500, 500] -> (800, 800)   | 58271.9 | 16047.6
6 threads: ------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 6218.5 | 347.1
  [1, 3, 320, 320] -> (512, 512)   | 24144.6 | 1253.4
  [32, 128, 64, 64] -> (32, 32)    | 612762.5 | 6934.8
  [32, 128, 64, 64] -> (128, 128)  | 9906221.2 | 127411.1
  [1, 3, 500, 500] -> (256, 256)   | 6241.9 | 350.2
  [1, 3, 500, 500] -> (800, 800)   | 59052.2 | 2984.8

Times are in microseconds (us).

[-------- upsample_bicubic2d channels_first non-contiguous torch.float32 --------]
                                   | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8
1 threads: -----------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 6050.9 | 1694.3
  [1, 3, 320, 320] -> (512, 512)   | 23897.1 | 6607.9
  [1, 3, 500, 500] -> (256, 256)   | 6282.8 | 1693.9
  [1, 3, 500, 500] -> (800, 800)   | 58608.1 | 16061.0
6 threads: -----------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 6243.7 | 347.6
  [1, 3, 320, 320] -> (512, 512)   | 24779.9 | 1253.8
  [1, 3, 500, 500] -> (256, 256)   | 6348.0 | 350.7
  [1, 3, 500, 500] -> (800, 800)   | 59255.6 | 2983.8

Times are in microseconds (us).

[--------- upsample_bicubic2d channels_last non-contiguous torch.float32 ---------]
                                   | 1.9.0a0+git8518b0e | 1.9.0a0+git73137d8
1 threads: ------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 6117.0 | 1688.2
  [1, 3, 320, 320] -> (512, 512)   | 23967.4 | 6644.8
  [32, 128, 64, 64] -> (32, 32)    | 679574.0 | 78477.4
  [32, 128, 64, 64] -> (128, 128)  | 1033432.5 | 817649.0
  [2, 128, 64, 46] -> (32, 32)     | 9828.0 | 4449.2
  [2, 128, 64, 46] -> (128, 128)   | 134989.3 | 42817.4
  [1, 128, 64, 46] -> (32, 32)     | 4508.2 | 2228.6
  [1, 128, 64, 46] -> (128, 128)   | 59404.9 | 21400.4
  [1, 3, 500, 500] -> (256, 256)   | 6359.0 | 1712.7
  [1, 3, 500, 500] -> (800, 800)   | 58717.6 | 16086.6
6 threads: ------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 6922.0 | 349.5
  [1, 3, 320, 320] -> (512, 512)   | 24916.5 | 1260.2
  [32, 128, 64, 64] -> (32, 32)    | 454240.4 | 16491.4
  [32, 128, 64, 64] -> (128, 128)  | 7198101.5 | 159921.9
  [2, 128, 64, 46] -> (32, 32)     | 10082.8 | 891.1
  [2, 128, 64, 46] -> (128, 128)   | 151037.0 | 7704.2
  [1, 128, 64, 46] -> (32, 32)     | 4325.5 | 633.9
  [1, 128, 64, 46] -> (128, 128)   | 62400.4 | 3853.5
  [1, 3, 500, 500] -> (256, 256)   | 6374.9 | 354.9
  [1, 3, 500, 500] -> (800, 800)   | 58638.8 | 2992.0

Times are in microseconds (us).

Intermediate benchmark sources:
- results/20210331-092940_pth_nightly_results_1.9.0a0+git8518b0e.log.save
- results/20210331-092940_pr_results_1.9.0a0+git73137d8.log.save
```

[Source file](https://raw.githubusercontent.com/vfdev-5/interpolate-tensoriterator/master/step_seven/results/20210326-061238_pr_1.9.0a0%2Bgita17040a_vs_pth_1.9.0a0%2Bgit8518b0e_results.md)

</details>

This description is based on the benchmarks and the code from [here](https://github.com/vfdev-5/interpolate-tensoriterator/tree/master/step_seven).

Joint work with Francisco Massa (fmassa).
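
The benchmark tables above list raw timings only. A minimal sketch of how a speed-up ratio could be derived from them, using three rows copied from the 6-thread `upsample_bicubic2d channels_last non-contiguous torch.float32` table (the helper name `speedup` and the variable names are illustrative, not part of the PR):

```python
# Timings in microseconds, copied from the 6-thread
# "upsample_bicubic2d channels_last non-contiguous torch.float32" table.
# Each entry: (case, nightly 1.9.0a0+git8518b0e, PR build 1.9.0a0+git73137d8).
timings_us = [
    ("[1, 3, 320, 320] -> (256, 256)", 6922.0, 349.5),
    ("[32, 128, 64, 64] -> (32, 32)", 454240.4, 16491.4),
    ("[1, 3, 500, 500] -> (800, 800)", 58638.8, 2992.0),
]


def speedup(baseline_us, candidate_us):
    """Return baseline / candidate; values above 1.0 mean the candidate is faster."""
    return baseline_us / candidate_us


# Hypothetical helper structure: map each case to its speed-up ratio.
speedups = {case: speedup(base, pr) for case, base, pr in timings_us}

for case, ratio in speedups.items():
    print(f"{case}: {ratio:.1f}x")
```

The same ratio applies to any row of the tables, as long as both columns use the same time unit.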
---

Appendix: Results without original 2d/3d channels last implementation

<details>
<summary>

Quick benchmark results between 8518b0e (master) and [this branch](master...Quansight:vfdev-5/generic-upsample-tensor-iterator)

</summary>

```
Description:
- 20212303-061238_pth_nightly_results_1.9.0a0+git8518b0e.opencv.6
- 20212303-061238_pth_nightly_results_1.9.0a0+git8518b0e.opencv.1
- 20212303-061238_pr_results_1.9.0a0+gite3a9544.opencv.6
- 20212303-061238_pr_results_1.9.0a0+gite3a9544.opencv.1

[----------------- upsample_bilinear2d channels_first contiguous -----------------]
                                   | 1.9.0a0+git8518b0e | 1.9.0a0+gite3a9544
1 threads: ------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 348.5 | 331.7
  [1, 3, 320, 320] -> (512, 512)   | 1254.0 | 1178.1
  [32, 128, 64, 64] -> (32, 32)    | 10409.4 | 10009.1
  [32, 128, 64, 64] -> (128, 128)  | 210175.8 | 204542.5
  [1, 3, 500, 500] -> (256, 256)   | 348.5 | 329.5
  [1, 3, 500, 500] -> (800, 800)   | 3079.8 | 2890.1
6 threads: ------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 76.4 | 73.4
  [1, 3, 320, 320] -> (512, 512)   | 247.1 | 232.0
  [32, 128, 64, 64] -> (32, 32)    | 2371.1 | 2340.5
  [32, 128, 64, 64] -> (128, 128)  | 62182.6 | 54089.9
  [1, 3, 500, 500] -> (256, 256)   | 78.2 | 75.8
  [1, 3, 500, 500] -> (800, 800)   | 569.0 | 541.3

Times are in microseconds (us).

[-------------- upsample_bilinear2d channels_first non-contiguous ---------------]
                                   | 1.9.0a0+git8518b0e | 1.9.0a0+gite3a9544
1 threads: -----------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 340.5 | 321.9
  [1, 3, 320, 320] -> (512, 512)   | 1256.1 | 1179.0
  [1, 3, 500, 500] -> (256, 256)   | 351.4 | 332.0
  [1, 3, 500, 500] -> (800, 800)   | 3089.1 | 2898.6
6 threads: -----------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 77.2 | 75.0
  [1, 3, 320, 320] -> (512, 512)   | 246.6 | 232.7
  [1, 3, 500, 500] -> (256, 256)   | 78.6 | 75.4
  [1, 3, 500, 500] -> (800, 800)   | 576.3 | 539.6

Times are in microseconds (us).

[------------------------ upsample_bilinear2d channels_last non-contiguous ------------------------]
                                   | 1.9.0a0+git8518b0e | 1.9.0a0+gite3a9544 | opencv 4.5.1
1 threads: -----------------------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 971.9 | 1324.6 | 99.6
  [1, 3, 320, 320] -> (512, 512)   | 3867.8 | 5329.9 | 271.5
  [32, 128, 64, 64] -> (32, 32)    | 6010.6 | 6304.3 |
  [32, 128, 64, 64] -> (128, 128)  | 112299.9 | 116956.8 |
  [2, 128, 64, 46] -> (32, 32)     | 110.1 | 133.2 |
  [2, 128, 64, 46] -> (128, 128)   | 1690.1 | 1838.6 |
  [1, 128, 64, 46] -> (32, 32)     | 55.8 | 73.4 | 185.8
  [1, 128, 64, 46] -> (128, 128)   | 474.5 | 684.9 | 1445.7
  [1, 3, 500, 500] -> (256, 256)   | 972.9 | 1343.0 | 149.5
  [1, 3, 500, 500] -> (800, 800)   | 9460.2 | 12925.8 | 685.1
6 threads: -----------------------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 956.6 | 260.1 | 27.1
  [1, 3, 320, 320] -> (512, 512)   | 3867.3 | 967.1 | 63.6
  [32, 128, 64, 64] -> (32, 32)    | 2489.4 | 2427.0 |
  [32, 128, 64, 64] -> (128, 128)  | 37462.1 | 41329.8 |
  [2, 128, 64, 46] -> (32, 32)     | 61.2 | 38.9 |
  [2, 128, 64, 46] -> (128, 128)   | 904.2 | 652.0 |
  [1, 128, 64, 46] -> (32, 32)     | 57.1 | 32.0 | 191.1
  [1, 128, 64, 46] -> (128, 128)   | 491.4 | 138.1 | 1485.8
  [1, 3, 500, 500] -> (256, 256)   | 977.0 | 257.8 | 36.6
  [1, 3, 500, 500] -> (800, 800)   | 9470.0 | 2696.0 | 142.8

Times are in microseconds (us).

[------------- upsample_linear1d channels_first contiguous --------------]
                          | 1.9.0a0+git8518b0e | 1.9.0a0+gite3a9544
1 threads: ---------------------------------------------------------------
  [4, 512, 320] -> [256]  | 516.5 | 524.7
  [4, 512, 320] -> [512]  | 993.8 | 1008.0
6 threads: ---------------------------------------------------------------
  [4, 512, 320] -> [256]  | 104.3 | 105.4
  [4, 512, 320] -> [512]  | 193.5 | 195.6

Times are in microseconds (us).

[-------------------- upsample_trilinear3d channels_first contiguous --------------------]
                                          | 1.9.0a0+git8518b0e | 1.9.0a0+gite3a9544
1 threads: -------------------------------------------------------------------------------
  [1, 3, 16, 320, 320] -> [8, 256, 256]   | 5.5 | 11.5
  [1, 3, 16, 320, 320] -> [32, 512, 512]  | 116.3 | 213.1
6 threads: -------------------------------------------------------------------------------
  [1, 3, 16, 320, 320] -> [8, 256, 256]   | 1.1 | 2.1
  [1, 3, 16, 320, 320] -> [32, 512, 512]  | 36.1 | 47.2

Times are in milliseconds (ms).

[------------------ upsample_trilinear3d channels_last non-contiguous -------------------]
                                          | 1.9.0a0+git8518b0e | 1.9.0a0+gite3a9544
1 threads: -------------------------------------------------------------------------------
  [1, 3, 16, 320, 320] -> [8, 256, 256]   | 13.1 | 19.9
  [1, 3, 16, 320, 320] -> [32, 512, 512]  | 242.3 | 349.4
6 threads: -------------------------------------------------------------------------------
  [1, 3, 16, 320, 320] -> [8, 256, 256]   | 13.1 | 4.4
  [1, 3, 16, 320, 320] -> [32, 512, 512]  | 242.4 | 87.2

Times are in milliseconds (ms).

[------------------ upsample_nearest2d channels_first contiguous -----------------]
                                   | 1.9.0a0+git8518b0e | 1.9.0a0+gite3a9544
1 threads: ------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 1194.5 | 107.8
  [1, 3, 320, 320] -> (512, 512)   | 4813.8 | 365.5
  [32, 128, 64, 64] -> (32, 32)    | 26745.6 | 6280.6
  [32, 128, 64, 64] -> (128, 128)  | 357686.7 | 129032.9
  [1, 3, 500, 500] -> (256, 256)   | 1205.9 | 123.8
  [1, 3, 500, 500] -> (800, 800)   | 11770.3 | 879.2
6 threads: ------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 220.2 | 32.7
  [1, 3, 320, 320] -> (512, 512)   | 867.2 | 78.7
  [32, 128, 64, 64] -> (32, 32)    | 5789.6 | 2241.8
  [32, 128, 64, 64] -> (128, 128)  | 89125.3 | 41881.3
  [1, 3, 500, 500] -> (256, 256)   | 224.3 | 34.8
  [1, 3, 500, 500] -> (800, 800)   | 2182.8 | 176.6

Times are in microseconds (us).

[--------------- upsample_nearest2d channels_first non-contiguous ---------------]
                                   | 1.9.0a0+git8518b0e | 1.9.0a0+gite3a9544
1 threads: -----------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 1279.5 | 110.2
  [1, 3, 320, 320] -> (512, 512)   | 4908.1 | 367.1
  [1, 3, 500, 500] -> (256, 256)   | 1488.1 | 123.4
  [1, 3, 500, 500] -> (800, 800)   | 12186.4 | 879.3
6 threads: -----------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 241.8 | 32.6
  [1, 3, 320, 320] -> (512, 512)   | 889.0 | 79.2
  [1, 3, 500, 500] -> (256, 256)   | 279.2 | 35.6
  [1, 3, 500, 500] -> (800, 800)   | 2226.5 | 174.3

Times are in microseconds (us).

[------------------------ upsample_nearest2d channels_last non-contiguous -------------------------]
                                   | 1.9.0a0+git8518b0e | 1.9.0a0+gite3a9544 | opencv 4.5.1
1 threads: -----------------------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 752.1 | 487.2 | 75.5
  [1, 3, 320, 320] -> (512, 512)   | 2992.6 | 1880.0 | 251.4
  [32, 128, 64, 64] -> (32, 32)    | 3458.6 | 3466.5 |
  [32, 128, 64, 64] -> (128, 128)  | 102350.7 | 103919.4 |
  [2, 128, 64, 46] -> (32, 32)     | 75.2 | 85.2 |
  [2, 128, 64, 46] -> (128, 128)   | 1637.0 | 1690.4 |
  [1, 128, 64, 46] -> (32, 32)     | 39.6 | 47.2 | 37.6
  [1, 128, 64, 46] -> (128, 128)   | 426.3 | 449.0 | 412.4
  [1, 3, 500, 500] -> (256, 256)   | 757.5 | 495.5 | 85.0
  [1, 3, 500, 500] -> (800, 800)   | 7281.4 | 4532.6 | 622.8
6 threads: -----------------------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 139.3 | 104.1 | 75.7
  [1, 3, 320, 320] -> (512, 512)   | 535.5 | 361.2 | 73.0
  [32, 128, 64, 64] -> (32, 32)    | 1518.6 | 1458.2 |
  [32, 128, 64, 64] -> (128, 128)  | 37117.7 | 40142.4 |
  [2, 128, 64, 46] -> (32, 32)     | 17.6 | 26.6 |
  [2, 128, 64, 46] -> (128, 128)   | 537.6 | 629.4 |
  [1, 128, 64, 46] -> (32, 32)     | 13.7 | 22.1 | 38.8
  [1, 128, 64, 46] -> (128, 128)   | 83.6 | 94.5 | 420.2
  [1, 3, 500, 500] -> (256, 256)   | 140.8 | 104.9 | 87.8
  [1, 3, 500, 500] -> (800, 800)   | 1317.8 | 853.8 | 139.7

Times are in microseconds (us).

[------------- upsample_nearest1d channels_first contiguous -------------]
                          | 1.9.0a0+git8518b0e | 1.9.0a0+gite3a9544
1 threads: ---------------------------------------------------------------
  [4, 512, 320] -> [256]  | 1594.3 | 247.4
  [4, 512, 320] -> [512]  | 3222.6 | 440.4
6 threads: ---------------------------------------------------------------
  [4, 512, 320] -> [256]  | 294.4 | 53.7
  [4, 512, 320] -> [512]  | 575.0 | 88.5

Times are in microseconds (us).

[--------------------- upsample_nearest3d channels_first contiguous ---------------------]
                                          | 1.9.0a0+git8518b0e | 1.9.0a0+gite3a9544
1 threads: -------------------------------------------------------------------------------
  [1, 3, 16, 320, 320] -> [8, 256, 256]   | 14952.7 | 1005.7
  [1, 3, 16, 320, 320] -> [32, 512, 512]  | 224955.6 | 46228.0
6 threads: -------------------------------------------------------------------------------
  [1, 3, 16, 320, 320] -> [8, 256, 256]   | 2887.2 | 206.2
  [1, 3, 16, 320, 320] -> [32, 512, 512]  | 56872.0 | 13566.3

Times are in microseconds (us).

[------------------- upsample_nearest3d channels_last non-contiguous --------------------]
                                          | 1.9.0a0+git8518b0e | 1.9.0a0+gite3a9544
1 threads: -------------------------------------------------------------------------------
  [1, 3, 16, 320, 320] -> [8, 256, 256]   | 7772.3 | 4770.9
  [1, 3, 16, 320, 320] -> [32, 512, 512]  | 144655.1 | 108605.0
6 threads: -------------------------------------------------------------------------------
  [1, 3, 16, 320, 320] -> [8, 256, 256]   | 1401.9 | 877.7
  [1, 3, 16, 320, 320] -> [32, 512, 512]  | 35939.6 | 28621.5

Times are in microseconds (us).

[------------------ upsample_bicubic2d channels_first contiguous -----------------]
                                   | 1.9.0a0+git8518b0e | 1.9.0a0+gite3a9544
1 threads: ------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 6038.7 | 2340.4
  [1, 3, 320, 320] -> (512, 512)   | 24040.6 | 9205.9
  [32, 128, 64, 64] -> (32, 32)    | 471016.3 | 52059.1
  [32, 128, 64, 64] -> (128, 128)  | 7705594.5 | 884743.9
  [1, 3, 500, 500] -> (256, 256)   | 6061.5 | 2361.9
  [1, 3, 500, 500] -> (800, 800)   | 58940.7 | 22401.8
6 threads: ------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 6594.3 | 466.5
  [1, 3, 320, 320] -> (512, 512)   | 25361.5 | 1729.1
  [32, 128, 64, 64] -> (32, 32)    | 487783.5 | 11550.0
  [32, 128, 64, 64] -> (128, 128)  | 7963636.6 | 196017.3
  [1, 3, 500, 500] -> (256, 256)   | 6443.8 | 464.1
  [1, 3, 500, 500] -> (800, 800)   | 61891.9 | 4257.2

Times are in microseconds (us).

[--------------- upsample_bicubic2d channels_first non-contiguous ---------------]
                                   | 1.9.0a0+git8518b0e | 1.9.0a0+gite3a9544
1 threads: -----------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 6116.7 | 2357.0
  [1, 3, 320, 320] -> (512, 512)   | 24182.0 | 9213.9
  [1, 3, 500, 500] -> (256, 256)   | 6349.6 | 2358.5
  [1, 3, 500, 500] -> (800, 800)   | 59365.2 | 22431.2
6 threads: -----------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 7155.1 | 464.6
  [1, 3, 320, 320] -> (512, 512)   | 24566.8 | 1712.4
  [1, 3, 500, 500] -> (256, 256)   | 7217.5 | 466.6
  [1, 3, 500, 500] -> (800, 800)   | 59880.2 | 4148.8

Times are in microseconds (us).

[------------------------ upsample_bicubic2d channels_last non-contiguous -------------------------]
                                   | 1.9.0a0+git8518b0e | 1.9.0a0+gite3a9544 | opencv 4.5.1
1 threads: -----------------------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 6184.3 | 2360.0 | 215.0
  [1, 3, 320, 320] -> (512, 512)   | 24499.7 | 9231.1 | 510.7
  [32, 128, 64, 64] -> (32, 32)    | 548304.5 | 93517.8 |
  [32, 128, 64, 64] -> (128, 128)  | 7810958.3 | 1086334.6 |
  [2, 128, 64, 46] -> (32, 32)     | 10883.4 | 5594.9 |
  [2, 128, 64, 46] -> (128, 128)   | 153253.2 | 57071.2 |
  [1, 128, 64, 46] -> (32, 32)     | 4519.4 | 2826.5 | 619.7
  [1, 128, 64, 46] -> (128, 128)   | 61339.7 | 28470.7 | 3654.5
  [1, 3, 500, 500] -> (256, 256)   | 6444.8 | 2389.9 | 292.9
  [1, 3, 500, 500] -> (800, 800)   | 59448.0 | 22479.1 | 1316.9
6 threads: -----------------------------------------------------------------------------------------
  [1, 3, 320, 320] -> (256, 256)   | 6370.1 | 464.9 | 61.3
  [1, 3, 320, 320] -> (512, 512)   | 25365.6 | 1767.5 | 145.7
  [32, 128, 64, 64] -> (32, 32)    | 502888.7 | 22016.3 |
  [32, 128, 64, 64] -> (128, 128)  | 8072918.9 | 234567.0 |
  [2, 128, 64, 46] -> (32, 32)     | 11171.4 | 1049.5 |
  [2, 128, 64, 46] -> (128, 128)   | 152612.5 | 11264.8 |
  [1, 128, 64, 46] -> (32, 32)     | 4359.3 | 791.4 | 651.1
  [1, 128, 64, 46] -> (128, 128)   | 61346.5 | 7563.9 | 3765.2
  [1, 3, 500, 500] -> (256, 256)   | 6644.4 | 469.7 | 77.4
  [1, 3, 500, 500] -> (800, 800)   | 59947.2 | 4154.3 | 313.2

Times are in microseconds (us).

Intermediate benchmark sources:
- results/20212303-061238_pth_nightly_results_1.9.0a0+git8518b0e.log.save.opencv
- results/20212303-061238_pr_results_1.9.0a0+gite3a9544.log.save.opencv
```

[Source file](https://raw.githubusercontent.com/vfdev-5/interpolate-tensoriterator/master/step_seven/results/20212303-061238_pr_1.9.0a0%2Bgite3a9544_vs_pth_1.9.0a0%2Bgit8518b0e_results.opencv.md)

</details>

Pull Request resolved: #54500

Reviewed By: glaringlee

Differential Revision: D27463566

Pulled By: fmassa

fbshipit-source-id: ceac3a8cee0eeb1a4ddd9344accffcc65449a49a
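
For readers unfamiliar with what the benchmarked `upsample_linear1d` operation computes, here is a pure-Python reference sketch. It assumes the half-pixel (`align_corners=False`) coordinate convention and is not the TensorIterator kernel from the PR; the function names `source_index` and `upsample_linear1d_reference` are illustrative only.

```python
import math


def source_index(dst_index, scale):
    """Map an output index to a fractional input coordinate.

    Assumes the half-pixel (align_corners=False) convention and clamps
    negative coordinates to 0.
    """
    return max(scale * (dst_index + 0.5) - 0.5, 0.0)


def upsample_linear1d_reference(values, out_size):
    """Linearly interpolate a 1-D list of floats to out_size samples."""
    in_size = len(values)
    scale = in_size / out_size
    out = []
    for i in range(out_size):
        src = source_index(i, scale)
        i0 = min(int(math.floor(src)), in_size - 1)  # left neighbour
        i1 = min(i0 + 1, in_size - 1)                # right neighbour, clamped at the border
        w1 = src - i0                                # weight of the right neighbour
        out.append((1.0 - w1) * values[i0] + w1 * values[i1])
    return out


result = upsample_linear1d_reference([0.0, 1.0], 4)
print(result)
```

The 2d and 3d variants apply the same index and weight computation independently along each spatial dimension and combine the weights multiplicatively.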

Optimized bilinear interpolation for 1d, 2d, 3d cases using TensorIterator - MemoryFormat: channel first only

02d5f06

facebook-github-bot added the cla signed label Feb 3, 2021

vfdev-5 marked this pull request as draft February 3, 2021 17:54

pytorchbot added the open source label Feb 3, 2021

vfdev-5 added 3 commits February 5, 2021 22:15

Fixes windows compile bug and number of buffers (pointed by asan)

1bd5484

Merge branch 'master' of github.com:pytorch/pytorch into upsample-tensor-iterator

ff97392

Replaced code by Francisco's implementation

ce3f1b4

vfdev-5 marked this pull request as ready for review February 10, 2021 16:40

VitalyFedyunin requested a review from fmassa February 10, 2021 21:48

VitalyFedyunin added the triaged label Feb 10, 2021

Added pragma once to upsample.h

ee32a0c

Merge branch 'master' of github.com:pytorch/pytorch into upsample-tensor-iterator

ee5b18c

hameerabbasi reviewed Feb 15, 2021

aten/src/ATen/native/UpSample.h (resolved)

fmassa mentioned this pull request Feb 15, 2021

Reduce code duplication in interpolate and make it more generic #10482

Open

fmassa reviewed Feb 15, 2021


vfdev-5 added 3 commits February 19, 2021 16:28

Merge branch 'master' of github.com:pytorch/pytorch into upsample-tensor-iterator

7132cfb

- Removed cpu_upsample_linear_channels_last
- Removed int32/int64 index dispatch
- Added more comments and other updates according to the review

149b976

Restored original cpu_upsample_linear_channels_last implementation

594f1ec

vfdev-5 added 2 commits February 23, 2021 09:51

Merge branch 'master' of github.com:pytorch/pytorch into upsample-tensor-iterator

6bdce00

Removed index_t from main methods

63ac2c3

fmassa reviewed Feb 23, 2021

aten/src/ATen/native/cpu/UpSampleMoreKernel.cpp (outdated, resolved)

Removed index_t from upsample_linearNd_kernel_impl and unused vars

74b172b

fmassa approved these changes Feb 23, 2021


facebook-github-bot reviewed Feb 23, 2021


facebook-github-bot closed this in 66f07c0 Mar 1, 2021

facebook-github-bot added the Merged label Mar 1, 2021

vfdev-5 deleted the upsample-tensor-iterator branch March 1, 2021 17:16

vfdev-5 mentioned this pull request Mar 3, 2021

Optimized bilinear interpolation channels last case using TensorIterator #53211

Closed

vfdev-5 mentioned this pull request Mar 23, 2021

Optimized generic interpolation using TensorIterator (keeps original 2d/3d channels last impl) #54500

Closed

renovate bot mentioned this pull request Jun 18, 2021

chore(deps): update dependency torchvision to v0.10.0 pplmx/di-ting#48

Merged

1 task


vfdev-5 commented Feb 3, 2021 (edited)

facebook-github-bot commented Feb 3, 2021 (edited)

VitalyFedyunin commented Feb 10, 2021

vfdev-5 commented Feb 10, 2021 (edited)

fmassa commented Feb 11, 2021 (edited)

vfdev-5 commented Feb 15, 2021 (edited)

codecov bot commented Feb 15, 2021 (edited)

fmassa left a comment

fmassa commented Feb 15, 2021 (edited)

vfdev-5 commented Feb 22, 2021

fmassa left a comment

facebook-github-bot left a comment

VitalyFedyunin commented Feb 24, 2021

fmassa commented Feb 25, 2021

facebook-github-bot commented Mar 1, 2021
