
Add uint8 support for interpolate for CPU images #90771

Closed

Conversation

Member

@NicolasHug NicolasHug commented Dec 13, 2022

Joint work with @vfdev-5

This PR introduces native uint8 support for interpolate(), for bilinear and bicubic modes for CPU images (mode=nearest[_exact] was already supported ).

On a typical torchvision training job on ImageNet, the speedup is ~4X when AVX2 is supported, comparing the native uint8 path (this PR) against torchvision's current Resize():

AA = antialias
float = uint8->float->interpolate()->round()->clamp()->uint8 (what Resize() currently does)

input_size         output_size channels_last AA    mode       num_threads  speed-up float vs uint8 (this PR)
(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=1   4X    2.6ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=1   2.1X  1.3ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=1   3X    2.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=1   4X    2.4ms vs 0.6ms

(Note: bicubic support was removed from this PR for now; the bicubic results below are kept for reference.)
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=1   4X    2.9ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=1   5X    3.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=1   3X    2.4ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=1   4X    2.8ms vs 0.7ms

There is still room for further speed-ups (see TODOs in the code).
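To make the comparison concrete, here is a minimal sketch (assuming a PyTorch build that includes this PR) of the two code paths being benchmarked: the existing float round-trip versus the native uint8 call. Sizes are illustrative.

```python
import torch
import torch.nn.functional as F

img = torch.randint(0, 256, (1, 3, 270, 268), dtype=torch.uint8)

# "float" path: what torchvision's Resize() currently does
out_f = F.interpolate(img.float(), size=(224, 224), mode="bilinear",
                      align_corners=False, antialias=True)
out_f = out_f.round().clamp(0, 255).to(torch.uint8)

# native uint8 path introduced by this PR: no dtype round-trip
out_u8 = F.interpolate(img, size=(224, 224), mode="bilinear",
                       align_corners=False, antialias=True)
```

Both paths produce a uint8 tensor of the target size; the native path skips the float conversion, rounding, and clamping.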

More benchmark details

with AVX2 support - speedups typically range from 1.5X to 10X. A few edge cases are slower; investigating why is worth a follow-up.

AA = antialias
float = uint8->float->interpolate()->round()->clamp()->uint8 (what Resize() currently does)

input_size         output_size channels_last AA    mode       num_threads  speed-up float vs uint8 (this PR)
(1, 3, 64, 64) -> (224, 224)       True    True    bilinear   num_threads=1   5X    1.1ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    False   bilinear   num_threads=1   5X    1.2ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   True    bilinear   num_threads=1   2.8X  0.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   False   bilinear   num_threads=1   7X    1.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    True    bicubic    num_threads=1   5X    1.2ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    False   bicubic    num_threads=1   12X   2.9ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   True    bicubic    num_threads=1   3X    0.8ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   False   bicubic    num_threads=1   7X    1.8ms vs 0.2ms

(1, 3, 64, 64) -> (224, 224)       True    True    bilinear   num_threads=2   2.6X  0.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    False   bilinear   num_threads=2   2.8X  0.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   True    bilinear   num_threads=2   1.7X  0.4ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   False   bilinear   num_threads=2   1.4X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    True    bicubic    num_threads=2   2.7X  0.7ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    False   bicubic    num_threads=2   7X    1.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   True    bicubic    num_threads=2   1.8X  0.4ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   False   bicubic    num_threads=2   4X    1.0ms vs 0.2ms

(1, 3, 224, 224) -> (270, 268)     True    True    bilinear   num_threads=1   4X    2.5ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    False   bilinear   num_threads=1   3.0X  1.8ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   True    bilinear   num_threads=1   3X    1.8ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   False   bilinear   num_threads=1   4X    2.3ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    True    bicubic    num_threads=1   4X    2.7ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    False   bicubic    num_threads=1   7X    4.3ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   True    bicubic    num_threads=1   3X    2.1ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   False   bicubic    num_threads=1   4X    2.6ms vs 0.6ms

(1, 3, 224, 224) -> (270, 268)     True    True    bilinear   num_threads=2   2.7X  1.6ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    False   bilinear   num_threads=2   2.6X  1.5ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   True    bilinear   num_threads=2   2.1X  1.2ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   False   bilinear   num_threads=2   1.6X  0.9ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    True    bicubic    num_threads=2   2.8X  1.7ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    False   bicubic    num_threads=2   5X    2.8ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   True    bicubic    num_threads=2   2.3X  1.4ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   False   bicubic    num_threads=2   3X    1.9ms vs 0.6ms

(1, 3, 256, 256) -> (1024, 1024)   True    True    bilinear   num_threads=1   4X    26.6ms vs 6.7ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bilinear   num_threads=1   4X    23.9ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bilinear   num_threads=1   2.5X  16.8ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bilinear   num_threads=1   5X    33.1ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   True    True    bicubic    num_threads=1   4X    25.9ms vs 7.3ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bicubic    num_threads=1   8X    59.6ms vs 7.3ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bicubic    num_threads=1   1.9X  14.3ms vs 7.4ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bicubic    num_threads=1   5X    35.4ms vs 7.3ms

(1, 3, 256, 256) -> (1024, 1024)   True    True    bilinear   num_threads=2   2.0X  13.6ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bilinear   num_threads=2   2.2X  14.8ms vs 6.7ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bilinear   num_threads=2   1.3X  8.8ms vs 6.9ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bilinear   num_threads=2   1.2X  8.4ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   True    True    bicubic    num_threads=2   1.8X  12.8ms vs 7.3ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bicubic    num_threads=2   4X    32.1ms vs 7.2ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bicubic    num_threads=2   1.4X  10.1ms vs 7.3ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bicubic    num_threads=2   2.9X  20.9ms vs 7.3ms

(1, 3, 224, 224) -> (64, 64)       True    True    bilinear   num_threads=1   1.4X  0.5ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    False   bilinear   num_threads=1   0.7X  0.2ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bilinear   num_threads=1   1.3X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   False   bilinear   num_threads=1   1.4X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    True    bicubic    num_threads=1   2.1X  0.7ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    False   bicubic    num_threads=1   1.3X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bicubic    num_threads=1   1.9X  0.6ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   False   bicubic    num_threads=1   1.0X  0.3ms vs 0.3ms

(1, 3, 224, 224) -> (64, 64)       True    True    bilinear   num_threads=2   1.0X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    False   bilinear   num_threads=2   0.6X  0.2ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bilinear   num_threads=2   0.8X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   False   bilinear   num_threads=2   1.4X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    True    bicubic    num_threads=2   1.4X  0.5ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    False   bicubic    num_threads=2   1.2X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bicubic    num_threads=2   1.2X  0.4ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)       False   False   bicubic    num_threads=2   0.9X  0.3ms vs 0.3ms

(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=1   4X    2.6ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=1   2.1X  1.3ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=1   3X    2.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=1   4X    2.4ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=1   4X    2.9ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=1   5X    3.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=1   3X    2.4ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=1   4X    2.8ms vs 0.7ms

(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=2   1.5X  1.0ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=2   1.2X  0.8ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=2   2.3X  1.5ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=2   1.9X  1.2ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=2   1.6X  1.2ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=2   4X    2.4ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=2   2.4X  1.6ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=2   2.8X  1.8ms vs 0.6ms

(1, 3, 1024, 1024) -> (256, 256)   True    True    bilinear   num_threads=1   2.1X  12.8ms vs 6.1ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bilinear   num_threads=1   0.6X  3.8ms vs 5.9ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bilinear   num_threads=1   1.2X  7.1ms vs 6.1ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bilinear   num_threads=1   1.9X  11.0ms vs 5.9ms
(1, 3, 1024, 1024) -> (256, 256)   True    True    bicubic    num_threads=1   2.0X  12.6ms vs 6.4ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bicubic    num_threads=1   1.0X  6.1ms vs 6.0ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bicubic    num_threads=1   1.8X  11.3ms vs 6.4ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bicubic    num_threads=1   0.8X  4.6ms vs 6.0ms

(1, 3, 1024, 1024) -> (256, 256)   True    True    bilinear   num_threads=2   1.6X  9.3ms vs 6.0ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bilinear   num_threads=2   0.3X  2.0ms vs 5.8ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bilinear   num_threads=2   1.2X  7.2ms vs 6.0ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bilinear   num_threads=2   0.3X  1.6ms vs 5.8ms
(1, 3, 1024, 1024) -> (256, 256)   True    True    bicubic    num_threads=2   1.1X  7.1ms vs 6.5ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bicubic    num_threads=2   0.6X  3.3ms vs 5.9ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bicubic    num_threads=2   0.9X  5.9ms vs 6.3ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bicubic    num_threads=2   0.4X  2.4ms vs 5.9ms

without AVX2 support - no significant speed-up, but there are various possible improvements (see TODOs)

AA = antialias
float = uint8->float->interpolate()->round()->clamp()->uint8 (what Resize() currently does)

input_size         output_size channels_last AA    mode       num_threads  speed-up float vs uint8 (this PR)
(1, 3, 64, 64) -> (224, 224)       True    True    bilinear   num_threads=1   0.9X  1.5ms vs 1.6ms
(1, 3, 64, 64) -> (224, 224)       True    False   bilinear   num_threads=1   0.9X  1.5ms vs 1.6ms
(1, 3, 64, 64) -> (224, 224)       False   True    bilinear   num_threads=1   0.8X  0.9ms vs 1.1ms
(1, 3, 64, 64) -> (224, 224)       False   False   bilinear   num_threads=1   1.5X  1.7ms vs 1.1ms
(1, 3, 64, 64) -> (224, 224)       True    True    bicubic    num_threads=1   0.9X  1.6ms vs 1.8ms
(1, 3, 64, 64) -> (224, 224)       True    False   bicubic    num_threads=1   2.1X  3.9ms vs 1.9ms
(1, 3, 64, 64) -> (224, 224)       False   True    bicubic    num_threads=1   0.8X  1.1ms vs 1.4ms
(1, 3, 64, 64) -> (224, 224)       False   False   bicubic    num_threads=1   1.7X  2.4ms vs 1.5ms

(1, 3, 64, 64) -> (224, 224)       True    True    bilinear   num_threads=2   0.9X  0.8ms vs 0.8ms
(1, 3, 64, 64) -> (224, 224)       True    False   bilinear   num_threads=2   0.9X  0.8ms vs 0.8ms
(1, 3, 64, 64) -> (224, 224)       False   True    bilinear   num_threads=2   0.9X  0.5ms vs 0.6ms
(1, 3, 64, 64) -> (224, 224)       False   False   bilinear   num_threads=2   0.7X  0.5ms vs 0.7ms
(1, 3, 64, 64) -> (224, 224)       True    True    bicubic    num_threads=2   0.9X  0.9ms vs 1.0ms
(1, 3, 64, 64) -> (224, 224)       True    False   bicubic    num_threads=2   2.1X  2.0ms vs 1.0ms
(1, 3, 64, 64) -> (224, 224)       False   True    bicubic    num_threads=2   0.8X  0.6ms vs 0.8ms
(1, 3, 64, 64) -> (224, 224)       False   False   bicubic    num_threads=2   1.7X  1.3ms vs 0.8ms

(1, 3, 224, 224) -> (270, 268)     True    True    bilinear   num_threads=1   1.0X  3.0ms vs 3.0ms
(1, 3, 224, 224) -> (270, 268)     True    False   bilinear   num_threads=1   1.0X  2.8ms vs 2.9ms
(1, 3, 224, 224) -> (270, 268)     False   True    bilinear   num_threads=1   1.0X  2.3ms vs 2.2ms
(1, 3, 224, 224) -> (270, 268)     False   False   bilinear   num_threads=1   1.4X  3.3ms vs 2.3ms
(1, 3, 224, 224) -> (270, 268)     True    True    bicubic    num_threads=1   1.0X  3.5ms vs 3.5ms
(1, 3, 224, 224) -> (270, 268)     True    False   bicubic    num_threads=1   1.7X  6.1ms vs 3.5ms
(1, 3, 224, 224) -> (270, 268)     False   True    bicubic    num_threads=1   0.9X  2.6ms vs 2.9ms
(1, 3, 224, 224) -> (270, 268)     False   False   bicubic    num_threads=1   1.4X  4.2ms vs 2.9ms

(1, 3, 224, 224) -> (270, 268)     True    True    bilinear   num_threads=2   1.0X  1.7ms vs 1.7ms
(1, 3, 224, 224) -> (270, 268)     True    False   bilinear   num_threads=2   0.9X  1.6ms vs 1.8ms
(1, 3, 224, 224) -> (270, 268)     False   True    bilinear   num_threads=2   0.9X  1.3ms vs 1.4ms
(1, 3, 224, 224) -> (270, 268)     False   False   bilinear   num_threads=2   0.7X  1.1ms vs 1.6ms
(1, 3, 224, 224) -> (270, 268)     True    True    bicubic    num_threads=2   1.0X  2.0ms vs 2.0ms
(1, 3, 224, 224) -> (270, 268)     True    False   bicubic    num_threads=2   1.7X  3.2ms vs 1.9ms
(1, 3, 224, 224) -> (270, 268)     False   True    bicubic    num_threads=2   0.8X  1.5ms vs 1.9ms
(1, 3, 224, 224) -> (270, 268)     False   False   bicubic    num_threads=2   1.2X  2.3ms vs 1.9ms

(1, 3, 256, 256) -> (1024, 1024)   True    True    bilinear   num_threads=1   1.1X  34.7ms vs 32.4ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bilinear   num_threads=1   1.0X  31.2ms vs 32.4ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bilinear   num_threads=1   1.0X  23.5ms vs 22.7ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bilinear   num_threads=1   1.9X  42.5ms vs 22.7ms
(1, 3, 256, 256) -> (1024, 1024)   True    True    bicubic    num_threads=1   0.9X  33.9ms vs 37.4ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bicubic    num_threads=1   2.2X  84.0ms vs 37.5ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bicubic    num_threads=1   1.0X  28.4ms vs 28.8ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bicubic    num_threads=1   2.0X  56.7ms vs 28.8ms

(1, 3, 256, 256) -> (1024, 1024)   True    True    bilinear   num_threads=2   1.1X  17.5ms vs 16.4ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bilinear   num_threads=2   1.1X  17.7ms vs 16.4ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bilinear   num_threads=2   0.8X  8.8ms vs 11.4ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bilinear   num_threads=2   1.0X  11.1ms vs 11.4ms
(1, 3, 256, 256) -> (1024, 1024)   True    True    bicubic    num_threads=2   1.1X  19.9ms vs 18.8ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bicubic    num_threads=2   2.3X  42.5ms vs 18.7ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bicubic    num_threads=2   1.0X  14.1ms vs 14.5ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bicubic    num_threads=2   2.0X  28.4ms vs 14.5ms

(1, 3, 224, 224) -> (64, 64)       True    True    bilinear   num_threads=1   1.0X  0.6ms vs 0.6ms
(1, 3, 224, 224) -> (64, 64)       True    False   bilinear   num_threads=1   0.7X  0.3ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)       False   True    bilinear   num_threads=1   0.9X  0.5ms vs 0.6ms
(1, 3, 224, 224) -> (64, 64)       False   False   bilinear   num_threads=1   1.7X  0.6ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)       True    True    bicubic    num_threads=1   1.0X  0.8ms vs 0.8ms
(1, 3, 224, 224) -> (64, 64)       True    False   bicubic    num_threads=1   1.1X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (64, 64)       False   True    bicubic    num_threads=1   0.9X  0.7ms vs 0.8ms
(1, 3, 224, 224) -> (64, 64)       False   False   bicubic    num_threads=1   0.9X  0.4ms vs 0.4ms

(1, 3, 224, 224) -> (64, 64)       True    True    bilinear   num_threads=2   1.0X  0.4ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)       True    False   bilinear   num_threads=2   0.8X  0.2ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bilinear   num_threads=2   0.9X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   False   bilinear   num_threads=2   1.3X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (64, 64)       True    True    bicubic    num_threads=2   1.0X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (64, 64)       True    False   bicubic    num_threads=2   1.3X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bicubic    num_threads=2   0.9X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (64, 64)       False   False   bicubic    num_threads=2   1.2X  0.3ms vs 0.3ms

(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=1   0.8X  2.1ms vs 2.5ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=1   0.7X  1.6ms vs 2.4ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=1   1.2X  2.4ms vs 2.1ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=1   1.3X  2.6ms vs 2.0ms
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=1   1.1X  3.4ms vs 3.0ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=1   1.7X  4.8ms vs 2.8ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=1   1.1X  2.9ms vs 2.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=1   1.4X  3.5ms vs 2.4ms

(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=2   0.9X  1.2ms vs 1.3ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=2   1.3X  1.6ms vs 1.2ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=2   0.8X  0.9ms vs 1.1ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=2   1.3X  1.3ms vs 1.0ms
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=2   1.4X  2.2ms vs 1.6ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=2   1.9X  2.8ms vs 1.5ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=2   0.8X  1.1ms vs 1.4ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=2   1.7X  2.1ms vs 1.3ms

(1, 3, 1024, 1024) -> (256, 256)   True    True    bilinear   num_threads=1   1.0X  10.0ms vs 9.9ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bilinear   num_threads=1   0.7X  4.6ms vs 6.2ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bilinear   num_threads=1   0.9X  9.1ms vs 9.8ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bilinear   num_threads=1   1.7X  9.4ms vs 5.7ms
(1, 3, 1024, 1024) -> (256, 256)   True    True    bicubic    num_threads=1   1.0X  15.2ms vs 14.8ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bicubic    num_threads=1   1.0X  7.6ms vs 7.5ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bicubic    num_threads=1   0.9X  13.3ms vs 14.4ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bicubic    num_threads=1   0.8X  5.9ms vs 7.0ms

(1, 3, 1024, 1024) -> (256, 256)   True    True    bilinear   num_threads=2   1.2X  6.0ms vs 5.2ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bilinear   num_threads=2   0.7X  2.3ms vs 3.2ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bilinear   num_threads=2   1.0X  4.8ms vs 5.0ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bilinear   num_threads=2   0.7X  1.9ms vs 2.9ms
(1, 3, 1024, 1024) -> (256, 256)   True    True    bicubic    num_threads=2   1.6X  12.3ms vs 7.5ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bicubic    num_threads=2   1.0X  3.9ms vs 3.9ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bicubic    num_threads=2   1.0X  7.0ms vs 7.3ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bicubic    num_threads=2   0.9X  3.0ms vs 3.5ms

Benchmark code

import operator_benchmark as op_bench
import torch

"""Microbenchmarks for interpolate operator."""


class InterpolateBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, input_size, output_size, channels_last=False, mode='linear', antialias=False, dtype=torch.float):

        input_image = torch.randint(0, 256, size=input_size, dtype=torch.uint8, device='cpu')
                                    
        if channels_last:
            input_image = input_image.contiguous(memory_format=torch.channels_last)

        self.inputs = {
            "input_image": input_image,
            "output_size": output_size,
            "mode": mode,
            "antialias": antialias,
            "dtype":dtype,
        }

        self.set_module_name("interpolate")

    def forward(self, input_image, output_size, mode, antialias, dtype):
        if dtype == torch.float:
            input_image = input_image.float()

        out = torch.nn.functional.interpolate(input_image, size=output_size, mode=mode, align_corners=False, antialias=antialias)
        if dtype == torch.float:
            # uint8 range is [0, 255], so clamp at 255 (not 256)
            out = out.round().clamp(min=0, max=255).to(torch.uint8)
        return out


def make_config():
    sizes = (
        ((224, 224), (64, 64)),
        ((270, 268), (224, 224)),
        ((256, 256), (1024, 1024)),
    )

    attrs = []
    for (HW1, HW2) in sizes:
        attrs.append([(1, 3, *HW1), HW2])  # 3 channels
        # attrs.append([(1, 1, *HW1), HW2])  # 1 channel

        attrs.append([(1, 3, *HW2), HW1])  # 3 channels
        # attrs.append([(1, 1, *HW2), HW1])  # 1 channel

    config = op_bench.config_list(
        attr_names=["input_size", "output_size"],
        attrs=attrs,
        cross_product_configs={
            'channels_last': [True, False],
            'mode': ["bilinear", "bicubic"],
            'antialias': [True, False],
            # 'dtype': [torch.float, torch.uint8]
            # 'dtype': [torch.uint8]
            'dtype': [torch.float]
        },
        tags=["short"],
    )

    return config

config = make_config()
op_bench.generate_pt_test(config, InterpolateBenchmark)


if __name__ == "__main__":
    op_bench.benchmark_runner.main()
Results-parsing script (a separate file, run on two saved benchmark outputs):

import re
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("f1", nargs="?", default="main")
parser.add_argument("f2", nargs="?", default="new")
args = parser.parse_args()

with open(args.f1) as f:
    main = f.readlines()
with open(args.f2) as f:
    new = f.readlines()

out = []

for main_line, new_line in zip(main, new):
    # num_threads=1  # TODO: remove
    if main_line.startswith("num_threads="):
        num_threads = int(main_line.split("=")[-1])
    if main_line.startswith("# Input"):
        deets = f"{main_line.strip()}, {num_threads=}"
    if main_line.startswith("Forward"):
        main_time = float(main_line.split()[-1])
        new_time = float(new_line.split()[-1])
        ratio = main_time / new_time
        fmt = ".1f" if ratio < 3 else ".0f"
        improv = f"{ratio:{fmt}}X"
        time_fmt = ",.3f" if new_time < 100 else ",.1f"
        deets = deets.strip().replace("# Input: ", "")
        deets = deets.replace(": ", "=")
        deets = deets.replace("input_size=", "")
        deets = deets.replace(", output_size=", " -> ")
        deets = deets.replace("dtype=torch.", "")
        deets = deets.replace("mode=", "")
        deets = deets.replace("antialias=", "")
        deets = deets.replace("channels_last=", "")
        # deets = deets.replace("channels_last=True, ", "")
        split = deets.split(",")

        # size = ','.join(split[:-3])
        # mode, dtype, threads = split[-3:]
        # deets = f"{size:<30} {mode:<15} {dtype:<10} {threads:<15}"

        size = ','.join(split[:-5])
        channels_last, mode, antialias, dtype, threads = split[-5:]
        deets = f"{size:<33} {channels_last:<7} {antialias:<7} {mode:<10} {threads:<15}"

        l = f"{deets}  {improv:<5} {main_time / 1000:{time_fmt}}ms vs {new_time / 1000:{time_fmt}}ms"
        out.append(l)


def key(s):
    # s = ''.join(s.split()[1:]) # remove "N.nX" part
    num_threads = (int(re.findall(r"num_threads=(\d+)", s)[0]),)

    input_shape, output_shape = re.findall(r"\(.*?\)", s)
    input_shape = input_shape[1:-1]  # remove parenthesis
    input_HW = tuple(int(x) for x in input_shape.split(",")[-2:])
    input_C = (-int(input_shape.split(",")[1]),)

    output_HW = tuple(int(x) for x in output_shape[1:-1].split(","))
    is_downsample = (output_HW[0] < input_HW[0],)
    if "linear" in s:
        mode = "linear"
    elif "nearest-exact" in s:
        mode = "nearest-exact"
    else:
        # assert "nearest" in s
        mode = "nearest"
    mode = (mode,)
    return is_downsample + input_HW + output_HW + num_threads + input_C + mode

for i, l in enumerate(sorted(out, key=key)):
    if i % 8 == 0:
        print()
    # if i % 10 == 0 and i % 40 != 0:
    #     print()
    # if i % 40 == 0:
    #     print("-" * 100)
    print(l)

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @datumbox @vfdev-5 @pmeier

@pytorch-bot

pytorch-bot bot commented Dec 13, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/90771

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 35443d0:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@vadimkantorov
Contributor

vadimkantorov commented Dec 13, 2022

Related: #5580
Somewhat related: #54389 (lerp for uint8 - so interpolation, but in another sense :))

@vadimkantorov
Contributor

Maybe support for bool+nearest can also be enabled without much hassle (especially given that uint8+nearest is already supported). I also wonder, for bool specifically, are all other modes equivalent to nearest?

Bool interpolation is useful when working with segmentation masks
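A hedged sketch of the workaround available today (sizes are illustrative): since uint8 + nearest is already supported, a bool segmentation mask can be resized by round-tripping through uint8.

```python
import torch
import torch.nn.functional as F

mask = torch.zeros(1, 1, 8, 8, dtype=torch.bool)
mask[..., 2:6, 2:6] = True

# nearest keeps values in {0, 1}, so the uint8 round-trip is lossless
resized = F.interpolate(mask.to(torch.uint8), size=(16, 16), mode="nearest").bool()
```

Native bool support would simply remove the two conversions.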

@vadimkantorov
Contributor

vadimkantorov commented Dec 13, 2022

@NicolasHug There are a bunch of performance benchmarks of basic transforms in the Albumentations repo: https://github.com/albumentations-team/albumentations#benchmarking-results

It would be interesting to have such benchmarks run by pytorch itself (to ensure the correct version of torchvision is used, etc.) and to have the results published. Maybe also contribute to the original albumentations benchmarking code.

https://github.com/albumentations-team/albumentations/blob/master/benchmark/README.md

It would be interesting to see the basic transforms for which core torch / torchvision is slower than albumentations

@vfdev-5
Collaborator

vfdev-5 commented Dec 13, 2022

@vadimkantorov albu uses opencv for the majority of its transforms, and opencv is highly optimized. On the other hand, if we can match Pillow-SIMD runtimes in pytorch, that would be good.
While benchmarking, we also need to make sure we are measuring the same ops: for example, resize downsampling in opencv does not use antialiasing, while Pillow does.
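The antialiasing point can be checked directly. This sketch (random input, illustrative sizes) shows that the flag noticeably changes downsampling output, so cross-library comparisons must hold it fixed:

```python
import torch
import torch.nn.functional as F

x = torch.randint(0, 256, (1, 3, 256, 256), dtype=torch.uint8).float()

down_aa = F.interpolate(x, size=(64, 64), mode="bilinear",
                        align_corners=False, antialias=True)
down_no = F.interpolate(x, size=(64, 64), mode="bilinear",
                        align_corners=False, antialias=False)

# antialias=True averages over a wider footprint when downsampling,
# so the two results differ on typical input
diff = (down_aa - down_no).abs().mean()
```

Comparing an antialiased resize against a non-antialiased one measures two different ops, not two implementations of the same op.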

@vfdev-5 vfdev-5 deleted the interpolate_uint8_images_linear_cpu_support_dev branch February 10, 2023 08:08
vfdev-5 added a commit that referenced this pull request Mar 15, 2023
## Description

- Based on #96651
  - Improved performance of the vectorized uint8 RGB interpolate case
    - Unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD
  - RGBA-case performance is unchanged after the refactoring (see Source link below)
- Fixed memory pointer alignment and added more comments (addressing reviews from #96651)

## Results

```
[------------------------------------------------------------------------------------------ Resize -----------------------------------------------------------------------------------------]
                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitcc42a3f) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=True     |          38.8          |                56.0             |                 133.2                |            2.4
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=False    |                        |                37.5             |                 112.8                |            3.0
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=True    |         128.7          |               157.0             |                 305.4                |            1.9
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=False   |                        |               146.4             |                 288.7                |            2.0
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=True    |         179.4          |               215.8             |                 442.5                |            2.1
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=False   |                        |               212.5             |                 436.9                |            2.1
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=True     |         113.3          |               127.9             |                 464.8                |            3.6
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=False    |                        |                56.8             |                 365.5                |            6.4
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=True    |         281.7          |               325.2             |                 722.4                |            2.2
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=False   |                        |               239.1             |                 593.5                |            2.5
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=True     |         186.2          |               200.7             |                 833.8                |            4.2
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=False    |                        |                75.2             |                 651.4                |            8.7
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=True    |         410.0          |               444.5             |                1128.4                |            2.5
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=False   |                        |               309.3             |                 917.6                |            3.0
```

Note: for the other cases (see Source below), the speed-up is roughly 1.0 +/- 0.1, which can be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230315-144416-pr_vs_nightly_speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 17, 2023
…cpu uint8 RGB-case"


## Description

- Based on #96651
  - Improved performance of the vectorized interpolate uint8 RGB-case
    - Unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD
  - RGBA-case performance is unchanged after the refactoring (see the Source link below)
- Fixed memory pointer alignment and added more comments (addressing reviews on #96651)
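
For context, the separable antialias resampling this PR vectorizes first computes, per output pixel, a window of contributing input indices and normalized bilinear weights. Below is a minimal pure-Python sketch of that weight computation in one dimension (an illustration of the standard Pillow-style scheme, not the actual C++ implementation; all names are mine):

```python
def bilinear_filter(x):
    # Triangle (bilinear) filter with support 1.0.
    x = abs(x)
    return 1.0 - x if x < 1.0 else 0.0

def resize_weights_1d(in_size, out_size, antialias=True):
    # For each output index along one dimension, return
    # (first_input_index, normalized_weights). With antialias enabled
    # and downscaling, the filter support is stretched by the scale
    # factor so that every covered input pixel contributes.
    scale = in_size / out_size
    support_scale = max(scale, 1.0) if antialias else 1.0
    support = 1.0 * support_scale  # bilinear filter support is 1.0
    out = []
    for i in range(out_size):
        center = (i + 0.5) * scale
        lo = max(int(center - support + 0.5), 0)
        hi = min(int(center + support + 0.5), in_size)
        weights = [bilinear_filter((j + 0.5 - center) / support_scale)
                   for j in range(lo, hi)]
        total = sum(weights)
        out.append((lo, [w / total for w in weights]))
    return out
```

The horizontal and vertical passes each reduce one dimension with such a weight table; the vectorized kernel applies these weights across the interleaved RGB(A) channels of a channels-last row.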

## Results

```
[------------------------------------------------------------------------------------------ Resize -----------------------------------------------------------------------------------------]
                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+git0968a5d) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=True     |          39.0          |                56.6             |                 133.2                |            2.4
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=False    |                        |                36.9             |                 112.8                |            3.1
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=True    |         128.1          |               152.5             |                 305.4                |            2.0
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=False   |                        |               141.1             |                 288.7                |            2.0
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=True    |         179.6          |               208.8             |                 442.5                |            2.1
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=False   |                        |               206.4             |                 436.9                |            2.1
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=True     |         113.3          |               132.1             |                 464.8                |            3.5
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=False    |                        |                57.2             |                 365.5                |            6.4
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=True    |         281.7          |               327.4             |                 722.4                |            2.2
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=False   |                        |               230.2             |                 593.5                |            2.6
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=True     |         186.9          |               210.5             |                 833.8                |            4.0
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=False    |                        |                75.6             |                 651.4                |            8.6
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=True    |         410.3          |               450.9             |                1128.4                |            2.5
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=False   |                        |               298.7             |                 917.6                |            3.1

```

Note: for the other cases (see Source below), the speed-up is roughly 1.0 +/- 0.1, which can be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230315-162238-pr_vs_nightly_speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 17, 2023
vfdev-5 added a commit that referenced this pull request Mar 20, 2023
vfdev-5 added a commit that referenced this pull request Mar 20, 2023
vfdev-5 added a commit that referenced this pull request Mar 21, 2023
…erpolate cpu uint8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved performance of the vectorized **bilinear** interpolate uint8 RGB-case, **channels last**
    - Unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
  - RGBA-case performance is unchanged after the refactoring (see the Source link below)
- Fixed memory pointer alignment and added more comments (addressing reviews on #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitc005105) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.670 (+-0.445)    |         57.366 (+-0.799)        |          132.147 (+-1.236)           |      2.304 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         37.825 (+-0.417)        |          111.789 (+-1.175)           |      2.955 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   127.898 (+-1.335)    |        153.081 (+-2.346)        |          302.518 (+-2.632)           |      1.976 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        141.695 (+-1.415)        |          286.663 (+-2.494)           |      2.023 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   179.735 (+-2.054)    |        210.613 (+-3.116)        |          439.375 (+-4.014)           |      2.086 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        207.601 (+-1.639)        |          438.537 (+-4.143)           |      2.112 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   112.679 (+-1.321)    |        130.863 (+-1.987)        |          446.804 (+-3.283)           |      3.414 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         57.968 (+-0.270)        |          374.244 (+-13.598)          |      6.456 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   282.398 (+-3.485)    |        322.986 (+-1.947)        |          720.197 (+-3.467)           |      2.230 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        231.625 (+-2.006)        |          592.834 (+-3.903)           |      2.559 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   185.711 (+-1.666)    |        201.069 (+-2.182)        |          787.868 (+-3.648)           |      3.918 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         75.975 (+-0.696)        |          651.016 (+-3.926)           |      8.569 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   410.236 (+-6.021)    |        451.486 (+-3.939)        |         1123.923 (+-14.988)          |      2.489 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        299.597 (+-1.887)        |          915.347 (+-4.486)           |      3.055 (+-0.000)    

      # More test-cases from #90771
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    60.751 (+-0.285)    |         78.538 (+-1.282)        |          170.465 (+-1.830)           |      2.170 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   133.619 (+-2.035)    |        159.614 (+-1.587)        |          330.971 (+-3.249)           |      2.074 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |   950.243 (+-10.641)   |        891.369 (+-17.946)       |         2805.510 (+-25.503)          |      3.147 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.771 (+-0.961)    |         72.253 (+-1.020)        |          135.933 (+-1.625)           |      1.881 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   139.107 (+-2.143)    |        165.844 (+-2.177)        |          321.112 (+-2.904)           |      1.936 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   691.470 (+-9.566)    |        764.942 (+-11.192)       |         2050.880 (+-22.188)          |      2.681 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         77.375 (+-1.345)        |          169.646 (+-1.640)           |      2.193 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        159.115 (+-3.935)        |          329.754 (+-2.590)           |      2.072 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        877.248 (+-5.736)        |         2815.870 (+-22.589)          |      3.210 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         53.120 (+-0.316)        |          112.024 (+-1.225)           |      2.109 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        147.330 (+-1.871)        |          299.152 (+-3.353)           |      2.030 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        472.182 (+-10.785)       |         1698.601 (+-16.785)          |      3.597 (+-0.000)    
```
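
The Speed-up column in these tables is simply the nightly timing divided by the PR timing (both in the same units). A trivial sketch checking the first row (helper name is mine):

```python
def speedup(nightly_time, pr_time):
    # Speed-up of the PR build relative to nightly: > 1.0 means faster.
    return nightly_time / pr_time

# First row above: nightly 132.147 vs PR 57.366.
ratio = round(speedup(132.147, 57.366), 3)  # matches the table's 2.304
```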

Note: for the other cases (see Source below), the speed-up is roughly 1.0 +/- 0.1, which can be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230320-160044-pr_vs_nightly-speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 21, 2023
vfdev-5 added a commit that referenced this pull request Mar 22, 2023
…erpolate cpu uint8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved performance of the vectorized **bilinear** interpolate uint8 RGB-case, **channels last**
    - Unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
  - RGBA-case performance is unchanged after the refactoring (see the Source link below)
- Fixed memory pointer alignment and added more comments (addressing reviews on #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+git8d955df) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.649 (+-0.306)    |         55.828 (+-0.370)        |          132.147 (+-1.236)           |      2.367 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         36.826 (+-0.229)        |          111.789 (+-1.175)           |      3.036 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   128.233 (+-1.313)    |        153.827 (+-1.229)        |          302.518 (+-2.632)           |      1.967 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        143.886 (+-1.409)        |          286.663 (+-2.494)           |      1.992 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   179.504 (+-1.825)    |        211.569 (+-1.336)        |          439.375 (+-4.014)           |      2.077 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        209.888 (+-1.443)        |          438.537 (+-4.143)           |      2.089 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   112.891 (+-1.118)    |        129.373 (+-1.396)        |          446.804 (+-3.283)           |      3.454 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         56.858 (+-0.227)        |          374.244 (+-13.598)          |      6.582 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   282.917 (+-2.992)    |        324.378 (+-1.694)        |          720.197 (+-3.467)           |      2.220 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        236.078 (+-1.679)        |          592.834 (+-3.903)           |      2.511 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   185.595 (+-1.633)    |        202.000 (+-1.920)        |          787.868 (+-3.648)           |      3.900 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         75.421 (+-0.512)        |          651.016 (+-3.926)           |      8.632 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   409.691 (+-2.735)    |        449.927 (+-2.500)        |         1123.923 (+-14.988)          |      2.498 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        306.691 (+-2.095)        |          915.347 (+-4.486)           |      2.985 (+-0.000)    

      # More test-cases from #90771
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    60.740 (+-0.278)    |         78.745 (+-0.286)        |          170.465 (+-1.830)           |      2.165 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   133.029 (+-1.619)    |        162.393 (+-1.289)        |          330.971 (+-3.249)           |      2.038 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |   948.849 (+-2.749)    |        896.127 (+-3.696)        |         2805.510 (+-25.503)          |      3.131 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.505 (+-0.319)    |         70.617 (+-0.344)        |          135.933 (+-1.625)           |      1.925 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   138.671 (+-1.953)    |        165.638 (+-1.473)        |          321.112 (+-2.904)           |      1.939 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   689.492 (+-2.917)    |        758.162 (+-3.719)        |         2050.880 (+-22.188)          |      2.705 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         77.300 (+-0.307)        |          169.646 (+-1.640)           |      2.195 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        159.525 (+-1.225)        |          329.754 (+-2.590)           |      2.067 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        890.106 (+-3.358)        |         2815.870 (+-22.589)          |      3.164 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         52.399 (+-0.314)        |          112.024 (+-1.225)           |      2.138 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        148.780 (+-1.282)        |          299.152 (+-3.353)           |      2.011 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        479.273 (+-3.432)        |         1698.601 (+-16.785)          |      3.544 (+-0.000)    
```

Note: there is no perf regression for the other cases. Some of them (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230321-145513-pr_vs_nightly-speedup-md)
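Timings like the rows above can be collected with `torch.utils.benchmark` along these lines (a rough sketch, not the exact harness behind the gist; the size and flags match the first table row):

```python
import torch
import torch.nn.functional as F
from torch.utils import benchmark

# uint8, channels_last input as in the table rows.
x = torch.randint(0, 256, (1, 3, 256, 256), dtype=torch.uint8)
x = x.contiguous(memory_format=torch.channels_last)

timer = benchmark.Timer(
    stmt="F.interpolate(x, size=(32, 32), mode='bilinear', antialias=True)",
    globals={"F": F, "x": x},
    num_threads=1,  # the table reports single-threaded numbers
)
measurement = timer.blocked_autorange(min_run_time=1)
print(measurement)  # median runtime; the table is in microseconds
```

`blocked_autorange` picks the number of inner-loop iterations automatically, which is why the table's `(+-...)` spreads are small.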


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 22, 2023
…t8 RGB-case (channels last)"



[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 23, 2023
…erpolate cpu uint8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved performance of the vectorized **bilinear** interpolate uint8 RGB-case, **channels last**
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD (labeled `Pillow (9.0.0.post1)` in the results)
  - RGBA-case performance is unchanged after the refactoring (see Source link below)
- Fixed memory pointer alignment and added more comments (addressing review feedback on #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitce4be01) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.548 (+-0.280)    |         57.536 (+-0.210)        |          132.147 (+-1.236)           |      2.297 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         38.532 (+-0.219)        |          111.789 (+-1.175)           |      2.901 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   127.689 (+-1.348)    |        156.262 (+-1.213)        |          302.518 (+-2.632)           |      1.936 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        145.483 (+-1.077)        |          286.663 (+-2.494)           |      1.970 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   178.117 (+-1.956)    |        215.053 (+-1.470)        |          439.375 (+-4.014)           |      2.043 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        211.340 (+-2.239)        |          438.537 (+-4.143)           |      2.075 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   112.593 (+-1.266)    |        130.414 (+-1.633)        |          446.804 (+-3.283)           |      3.426 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         58.767 (+-0.203)        |          374.244 (+-13.598)          |      6.368 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   283.210 (+-2.937)    |        324.157 (+-1.895)        |          720.197 (+-3.467)           |      2.222 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        239.800 (+-2.492)        |          592.834 (+-3.903)           |      2.472 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   186.255 (+-1.629)    |        204.834 (+-1.496)        |          787.868 (+-3.648)           |      3.846 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         77.335 (+-0.341)        |          651.016 (+-3.926)           |      8.418 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   410.286 (+-2.439)    |        443.934 (+-2.899)        |         1123.923 (+-14.988)          |      2.532 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        312.220 (+-2.307)        |          915.347 (+-4.486)           |      2.932 (+-0.000)    

      # More test-cases from #90771
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    60.611 (+-0.337)    |         80.849 (+-1.780)        |          170.465 (+-1.830)           |      2.108 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   132.971 (+-1.624)    |        164.892 (+-1.426)        |          330.971 (+-3.249)           |      2.007 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |   948.467 (+-3.179)    |        891.414 (+-5.282)        |         2805.510 (+-25.503)          |      3.147 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.539 (+-0.327)    |         72.471 (+-0.367)        |          135.933 (+-1.625)           |      1.876 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   138.669 (+-1.867)    |        168.628 (+-1.213)        |          321.112 (+-2.904)           |      1.904 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   689.933 (+-3.175)    |        746.911 (+-2.985)        |         2050.880 (+-22.188)          |      2.746 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         78.347 (+-0.338)        |          169.646 (+-1.640)           |      2.165 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        162.194 (+-1.089)        |          329.754 (+-2.590)           |      2.033 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        894.476 (+-2.738)        |         2815.870 (+-22.589)          |      3.148 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         52.728 (+-0.406)        |          112.024 (+-1.225)           |      2.125 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        151.560 (+-1.128)        |          299.152 (+-3.353)           |      1.974 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        500.053 (+-4.288)        |         1698.601 (+-16.785)          |      3.397 (+-0.000)    
```

Note: there is no perf regression for the other cases. Some of them (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230322-132441-pr_vs_nightly-speedup-md)
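Since the `Pillow (9.0.0.post1)` column is the reference, the output comparison against Pillow can be sketched like this (assuming Pillow and a PyTorch build with this PR are installed; note that `Image.resize` takes `(width, height)`):

```python
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image

arr = np.random.randint(0, 256, size=(270, 268, 3), dtype=np.uint8)  # HWC uint8

# Pillow bilinear resize (Pillow's BILINEAR filter is antialiased on downscaling).
pil_out = np.asarray(Image.fromarray(arr).resize((224, 224), Image.BILINEAR))

# torch: NCHW channels_last uint8 tensor through the vectorized bilinear path.
t = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0)
t = t.contiguous(memory_format=torch.channels_last)
out = F.interpolate(t, size=(224, 224), mode="bilinear", antialias=True)
out = out.squeeze(0).permute(1, 2, 0).numpy()

print(pil_out.shape, out.shape)
```

Both produce a `(224, 224, 3)` uint8 array; the benchmark tables compare the runtime of these two calls.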


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 23, 2023
…t8 RGB-case (channels last)"



[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 23, 2023
…erpolate cpu uint8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved perfs for vectorized **bilinear** interpolate uint8 RGB-case, **channels last**
    - unified RGB and RGBA processing code such that RGB input is not copied into RGBA
  - Performances are more close to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
  - RGBA case perfs are the same after refactoring (see Source link below) 
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitce4be01) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.548 (+-0.280)    |         57.536 (+-0.210)        |          132.147 (+-1.236)           |      2.297 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         38.532 (+-0.219)        |          111.789 (+-1.175)           |      2.901 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   127.689 (+-1.348)    |        156.262 (+-1.213)        |          302.518 (+-2.632)           |      1.936 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        145.483 (+-1.077)        |          286.663 (+-2.494)           |      1.970 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   178.117 (+-1.956)    |        215.053 (+-1.470)        |          439.375 (+-4.014)           |      2.043 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        211.340 (+-2.239)        |          438.537 (+-4.143)           |      2.075 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   112.593 (+-1.266)    |        130.414 (+-1.633)        |          446.804 (+-3.283)           |      3.426 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         58.767 (+-0.203)        |          374.244 (+-13.598)          |      6.368 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   283.210 (+-2.937)    |        324.157 (+-1.895)        |          720.197 (+-3.467)           |      2.222 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        239.800 (+-2.492)        |          592.834 (+-3.903)           |      2.472 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   186.255 (+-1.629)    |        204.834 (+-1.496)        |          787.868 (+-3.648)           |      3.846 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         77.335 (+-0.341)        |          651.016 (+-3.926)           |      8.418 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   410.286 (+-2.439)    |        443.934 (+-2.899)        |         1123.923 (+-14.988)          |      2.532 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        312.220 (+-2.307)        |          915.347 (+-4.486)           |      2.932 (+-0.000)    

      # More test-cases from #90771
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    60.611 (+-0.337)    |         80.849 (+-1.780)        |          170.465 (+-1.830)           |      2.108 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   132.971 (+-1.624)    |        164.892 (+-1.426)        |          330.971 (+-3.249)           |      2.007 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |   948.467 (+-3.179)    |        891.414 (+-5.282)        |         2805.510 (+-25.503)          |      3.147 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.539 (+-0.327)    |         72.471 (+-0.367)        |          135.933 (+-1.625)           |      1.876 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   138.669 (+-1.867)    |        168.628 (+-1.213)        |          321.112 (+-2.904)           |      1.904 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   689.933 (+-3.175)    |        746.911 (+-2.985)        |         2050.880 (+-22.188)          |      2.746 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         78.347 (+-0.338)        |          169.646 (+-1.640)           |      2.165 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        162.194 (+-1.089)        |          329.754 (+-2.590)           |      2.033 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        894.476 (+-2.738)        |         2815.870 (+-22.589)          |      3.148 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         52.728 (+-0.406)        |          112.024 (+-1.225)           |      2.125 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        151.560 (+-1.128)        |          299.152 (+-3.353)           |      1.974 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        500.053 (+-4.288)        |         1698.601 (+-16.785)          |      3.397 (+-0.000)    
```

Note: there is no perf regression for the other cases. A few cases (see Source below) show small speed-ups; for the rest, the ratio is roughly 1.0 +/- 0.1, which can be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230322-132441-pr_vs_nightly-speedup-md)
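For reference, the comparison underlying these numbers can be sketched in a few lines. This is our own minimal sketch, not the PR's test or benchmark code: it runs the native uint8 bilinear path and the older uint8 -> float -> interpolate -> round -> clamp round-trip, and checks that they agree up to small rounding differences. It assumes a PyTorch build that includes native uint8 interpolate support.

```python
import torch

# Native uint8 channels_last bilinear resize (the path added by this PR),
# compared against the float round-trip that Resize() performed before.
img = torch.randint(0, 256, (1, 3, 270, 268), dtype=torch.uint8)
img = img.contiguous(memory_format=torch.channels_last)

native = torch.nn.functional.interpolate(
    img, size=(224, 224), mode="bilinear", antialias=True)

via_float = torch.nn.functional.interpolate(
    img.float(), size=(224, 224), mode="bilinear", antialias=True)
via_float = via_float.round().clamp(0, 255).to(torch.uint8)

# Both paths should agree up to small rounding differences.
max_diff = (native.int() - via_float.int()).abs().max().item()
```

The native path avoids the two dtype conversions and the intermediate float tensor, which is where most of the reported speed-up comes from.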


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 23, 2023
vfdev-5 added a commit that referenced this pull request Mar 29, 2023
vfdev-5 added a commit that referenced this pull request Mar 29, 2023
pytorchmergebot pushed a commit that referenced this pull request Mar 30, 2023
…erpolate cpu uint8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved performance of the vectorized **bilinear** interpolate for the uint8 RGB case, **channels last**
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD (labeled `Pillow (9.0.0.post1)` in the results)
  - RGBA-case performance is unchanged after the refactoring (see Source link below)
- Fixed memory pointer alignment and added more comments (addressing reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD
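A single row of the table below can be reproduced with a rough single-threaded timing sketch like the following. This is our own harness, not the PR's benchmark script; the input shape and arguments match the `(520, 520) -> (224, 224) aa=True` row.

```python
import torch
import torch.utils.benchmark as benchmark

# One table row: channels_last uint8 bilinear resize, single thread.
x = torch.randint(0, 256, (1, 3, 520, 520), dtype=torch.uint8)
x = x.contiguous(memory_format=torch.channels_last)

timer = benchmark.Timer(
    stmt="torch.nn.functional.interpolate(x, size=(224, 224), "
         "mode='bilinear', antialias=True)",
    globals={"torch": torch, "x": x},
    num_threads=1,
)
measurement = timer.timeit(100)
print(measurement.median * 1e6, "us per call")
```

Absolute numbers depend heavily on the CPU and on AVX2 availability, so only the PR-vs-nightly ratio is directly comparable across machines.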

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitd6e220c) PR  |  torch (2.1.0a0+git2b75955) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.674 (+-0.323)    |         57.591 (+-0.244)        |          131.033 (+-1.448)           |      2.275 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         39.471 (+-0.166)        |          113.911 (+-1.736)           |      2.886 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   128.512 (+-1.916)    |        161.592 (+-1.242)        |          299.679 (+-2.099)           |      1.855 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        150.994 (+-1.180)        |          285.331 (+-1.919)           |      1.890 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   180.045 (+-2.223)    |        220.581 (+-1.363)        |          431.057 (+-3.536)           |      1.954 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        219.391 (+-1.409)        |          429.410 (+-3.620)           |      1.957 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   113.911 (+-1.024)    |        129.457 (+-1.295)        |          459.610 (+-13.322)          |      3.550 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         59.800 (+-0.199)        |          400.015 (+-11.815)          |      6.689 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   283.050 (+-2.664)    |        339.143 (+-1.209)        |          683.555 (+-4.466)           |      2.016 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        250.601 (+-1.236)        |          603.545 (+-2.644)           |      2.408 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   186.723 (+-2.213)    |        199.960 (+-1.343)        |          860.867 (+-21.763)          |      4.305 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         79.188 (+-0.261)        |          703.019 (+-25.805)          |      8.878 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   412.353 (+-4.476)    |        462.230 (+-1.983)        |         1101.673 (+-49.299)          |      2.383 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        327.973 (+-1.852)        |          941.062 (+-5.549)           |      2.869 (+-0.000)    

      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    61.191 (+-0.926)    |         80.795 (+-0.518)        |          160.853 (+-1.506)           |      1.991 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   134.488 (+-2.129)    |        169.147 (+-1.324)        |          327.343 (+-2.846)           |      1.935 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |  1037.045 (+-24.982)   |        938.623 (+-9.010)        |         2603.360 (+-20.530)          |      2.774 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.792 (+-0.613)    |         73.692 (+-0.264)        |          131.829 (+-1.333)           |      1.789 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   139.596 (+-1.944)    |        173.778 (+-1.039)        |          320.063 (+-2.562)           |      1.842 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   690.132 (+-10.946)   |        772.758 (+-2.864)        |         2036.860 (+-36.109)          |      2.636 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         78.747 (+-0.799)        |          158.479 (+-1.702)           |      2.013 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        167.046 (+-1.077)        |          322.104 (+-2.764)           |      1.928 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        918.967 (+-5.251)        |         2611.388 (+-29.917)          |      2.842 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         55.336 (+-0.251)        |          113.869 (+-1.243)           |      2.058 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        156.505 (+-1.095)        |          299.861 (+-2.710)           |      1.916 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        514.344 (+-1.905)        |         1776.796 (+-19.660)          |      3.454 (+-0.000)    

```

Note: there is no perf regression for the other cases. A few cases (see Source below) show small speed-ups; for the rest, the ratio is roughly 1.0 +/- 0.1, which can be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230329-181023-pr_vs_nightly-speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 datumbox pmeier

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Mar 30, 2023
## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitd6e220c) PR  |  torch (2.1.0a0+git2b75955) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.674 (+-0.323)    |         57.591 (+-0.244)        |          131.033 (+-1.448)           |      2.275 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         39.471 (+-0.166)        |          113.911 (+-1.736)           |      2.886 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   128.512 (+-1.916)    |        161.592 (+-1.242)        |          299.679 (+-2.099)           |      1.855 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        150.994 (+-1.180)        |          285.331 (+-1.919)           |      1.890 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   180.045 (+-2.223)    |        220.581 (+-1.363)        |          431.057 (+-3.536)           |      1.954 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        219.391 (+-1.409)        |          429.410 (+-3.620)           |      1.957 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   113.911 (+-1.024)    |        129.457 (+-1.295)        |          459.610 (+-13.322)          |      3.550 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         59.800 (+-0.199)        |          400.015 (+-11.815)          |      6.689 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   283.050 (+-2.664)    |        339.143 (+-1.209)        |          683.555 (+-4.466)           |      2.016 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        250.601 (+-1.236)        |          603.545 (+-2.644)           |      2.408 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   186.723 (+-2.213)    |        199.960 (+-1.343)        |          860.867 (+-21.763)          |      4.305 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         79.188 (+-0.261)        |          703.019 (+-25.805)          |      8.878 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   412.353 (+-4.476)    |        462.230 (+-1.983)        |         1101.673 (+-49.299)          |      2.383 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        327.973 (+-1.852)        |          941.062 (+-5.549)           |      2.869 (+-0.000)    

      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    61.191 (+-0.926)    |         80.795 (+-0.518)        |          160.853 (+-1.506)           |      1.991 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   134.488 (+-2.129)    |        169.147 (+-1.324)        |          327.343 (+-2.846)           |      1.935 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |  1037.045 (+-24.982)   |        938.623 (+-9.010)        |         2603.360 (+-20.530)          |      2.774 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.792 (+-0.613)    |         73.692 (+-0.264)        |          131.829 (+-1.333)           |      1.789 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   139.596 (+-1.944)    |        173.778 (+-1.039)        |          320.063 (+-2.562)           |      1.842 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   690.132 (+-10.946)   |        772.758 (+-2.864)        |         2036.860 (+-36.109)          |      2.636 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         78.747 (+-0.799)        |          158.479 (+-1.702)           |      2.013 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        167.046 (+-1.077)        |          322.104 (+-2.764)           |      1.928 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        918.967 (+-5.251)        |         2611.388 (+-29.917)          |      2.842 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         55.336 (+-0.251)        |          113.869 (+-1.243)           |      2.058 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        156.505 (+-1.095)        |          299.861 (+-2.710)           |      1.916 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        514.344 (+-1.905)        |         1776.796 (+-19.660)          |      3.454 (+-0.000)    

```
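Timings of this kind can be reproduced with `torch.utils.benchmark`. A hedged sketch, not the exact harness used for the table above (sizes and `min_run_time` are illustrative):

```python
import torch
import torch.nn.functional as F
from torch.utils.benchmark import Timer

x = torch.randint(0, 256, (1, 3, 256, 256), dtype=torch.uint8)
x = x.contiguous(memory_format=torch.channels_last)

timer = Timer(
    stmt="F.interpolate(x, size=(224, 224), mode='bilinear', antialias=True)",
    globals={"F": F, "x": x},
    num_threads=1,
)
# blocked_autorange repeats the statement until enough runtime is collected
measurement = timer.blocked_autorange(min_run_time=0.5)
print(f"median: {measurement.median * 1e6:.1f} us")
```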

Note: there is no perf regression for the other cases. A few cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which can be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230329-181023-pr_vs_nightly-speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 datumbox pmeier

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Mar 30, 2023
… (channels last) (#96848)

Pull Request resolved: #96848
Approved by: https://github.com/NicolasHug, https://github.com/peterbell10
vfdev-5 added a commit to vfdev-5/vision that referenced this pull request May 4, 2023
Description:
- Now that pytorch/pytorch#90771 is merged, let Resize() rely on interpolate()'s native uint8 handling instead of converting to and from float.

  - uint8 input is no longer cast to float32 for nearest mode, nor for bilinear mode when AVX2 is available.

Context: pytorch#7217

Benchmarks:
```
[----------- Resize cpu torch.uint8 InterpolationMode.NEAREST -----------]
                         |  resize v2  |  resize stable  |  resize nightly
1 threads: ---------------------------------------------------------------
      (3, 400, 400)      |      457    |        461      |        480
      (16, 3, 400, 400)  |     6870    |       6850      |      10100

Times are in microseconds (us).

[---------- Resize cpu torch.uint8 InterpolationMode.BILINEAR -----------]
                         |  resize v2  |  resize stable  |  resize nightly
1 threads: ---------------------------------------------------------------
      (3, 400, 400)      |      326    |        329      |        844
      (16, 3, 400, 400)  |     4380    |       4390      |      14800

Times are in microseconds (us).
```

[Source](https://gist.github.com/vfdev-5/a2e30ed50b5996807c9b09d5d33d8bc2)
vfdev-5 added a commit to vfdev-5/vision that referenced this pull request May 9, 2023
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged module: cpu CPU specific problem (e.g., perf, algorithm) module: interpolation module: vision release notes: nn release notes category