Enforce contiguous outputs on the transforms v2 kernels? #6839

Open
pmeier opened this issue Oct 26, 2022 · 8 comments
@pmeier (Collaborator) commented Oct 26, 2022

All the performance benchmarks that we did so far for transforms v1 vs. v2 were on contiguous inputs. However, we have a few kernels that leave the output in a noncontiguous state:

  • affine_image_tensor in case image.numel() > 0 and image.ndim == 4 and fill is not None
  • convert_color_space in case we only strip the alpha channel, i.e. RGB_ALPHA -> RGB and GRAY_ALPHA -> GRAY
  • rotate_image_tensor in case image.numel() > 0 and image.ndim == 4 and fill is not None
  • crop_image_tensor
  • center_crop_image_tensor
  • five_crop_image_tensor
  • ten_crop_image_tensor

Where applicable, the same also holds for the *_mask and *_video kernels, since they are thin wrappers around the *_image_tensor ones.
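
To make concrete why the crop kernels in the list above fall into this bucket: cropping can be implemented as plain slicing, which produces a view rather than a copy. A minimal illustration (not the actual kernel code):

import torch

image = torch.rand(3, 224, 224)
crop = image[..., 10:110, 20:120]    # plain slicing returns a view, not a copy
print(crop.is_contiguous())          # False
contiguous_crop = crop.contiguous()  # the enforcement under discussion: one extra copy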

We should benchmark, at least for a few kernels, whether noncontiguous inputs cause a performance degradation that is larger than the cost of enforcing contiguous outputs on the kernels above. If so, we should probably enforce contiguous outputs for our kernels.

cc @vfdev-5 @datumbox @bjuncek

@NicolasHug (Member) commented
IIRC Normalize()'s perf is fairly sensitive to contiguity.

Same for Resize(); depending on what was used for decoding (PIL vs decode_jpeg()), you'll end up with different contiguity.
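
For illustration, the decoding difference can be checked directly (a sketch; the asset path is the one used later in this thread):

from PIL import Image
from torchvision.io import read_image
from torchvision.transforms.functional import to_tensor

path = "test/assets/encode_jpeg/grace_hopper_517x606.jpg"
print(to_tensor(Image.open(path)).is_contiguous())  # True: to_tensor copies into a fresh CHW buffer
print(read_image(path).is_contiguous())             # False: interleaved HWC data viewed as CHW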

@datumbox datumbox added this to To do in Transforms V2 via automation Oct 26, 2022
@pmeier (Collaborator, Author) commented Oct 26, 2022

I'll benchmark them in a bit, but first another question: if we know that these kernels are sensitive to contiguity, should they know how to handle the situation? Meaning, as opposed to what is proposed above, we would shift the burden to the kernels that "need" contiguous inputs instead of enforcing contiguous outputs everywhere.

@NicolasHug (Member) commented
> should they know how to handle the situation?

Do you mean it should be up to the kernel to convert its input if it makes it run faster?

I don't know if this is worth doing; the same goes for the global enforcement of contiguity, BTW. While this may speed up that one particular kernel, it can still make all the other kernels in the rest of the pipeline slower.

@pmeier (Collaborator, Author) commented Oct 26, 2022

> Do you mean it should be up to the kernel to convert its input if it makes it run faster?

Yes.

> I don't know if this is worth doing; the same goes for the global enforcement of contiguity, BTW. While this may speed up that one particular kernel, it can still make all the other kernels in the rest of the pipeline slower.

Agreed. I need to benchmark whether this has an impact or not. Otherwise, there is no need to enforce one or the other.
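
A minimal sketch of the kernel-side option being discussed (illustrative, not the actual v2 code): only the kernels known to be contiguity-sensitive pay for the conversion.

import torch

def contiguity_sensitive_kernel(image: torch.Tensor) -> torch.Tensor:
    # .contiguous() returns the input unchanged (no copy) if it is already
    # contiguous, so only noncontiguous inputs pay for the conversion
    image = image.contiguous()
    ...  # the actual kernel computation would go here
    return image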

@datumbox datumbox moved this from To do to Backlog in Transforms V2 Oct 26, 2022
@vadimkantorov commented
Preserving memory_format (contiguous or channels_last) can be quite important. If there are performance implications to preserving the memory format by default (which could avoid some copies later on), maybe some ops could accept an additional memory_format argument?
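
A hypothetical sketch of such an argument (none of the v2 kernels actually take this parameter):

import torch

def some_kernel(image: torch.Tensor, memory_format: torch.memory_format = torch.preserve_format) -> torch.Tensor:
    out = image * 2  # stand-in for the actual computation
    if memory_format is not torch.preserve_format:
        # e.g. torch.contiguous_format, or torch.channels_last for 4D inputs
        out = out.to(memory_format=memory_format)
    return out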

@datumbox datumbox moved this from Backlog to To do in Transforms V2 Oct 31, 2022
@pmeier (Collaborator, Author) commented Dec 12, 2022

As stated by @NicolasHug in #6839 (comment), torchvision.io.read_image produces noncontiguous outputs. The strides are not random though: after image.unsqueeze(0), we get torch.channels_last as the memory format. I did some quick benchmarks to assess the impact of this on our transformation pipelines:
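
To make the stride claim concrete, a quick check (the stride values match the table below):

import torch
from torchvision.io import read_image

image = read_image("test/assets/encode_jpeg/grace_hopper_517x606.jpg")
print(image.stride())  # (1, 1551, 3): channels have stride 1, i.e. interleaved HWC data
print(image.unsqueeze(0).is_contiguous(memory_format=torch.channels_last))  # True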

Script
import functools
import itertools

from torch.utils import benchmark
from torchvision.io import read_image
from torchvision.prototype.transforms import functional as F


image_other = read_image("test/assets/encode_jpeg/grace_hopper_517x606.jpg").float()
image_channels_last = image_other.unsqueeze(0)
image_contiguous_format = image_other.contiguous()
images = [
    (image_contiguous_format, "contiguous_format"),
    (image_channels_last, "channels_last"),
    (image_other, f"other, stride={image_other.stride()}"),
]

fns = [
    (
        functools.partial(
            F.normalize_image_tensor, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
        ),
        "normalize",
    ),
    (
        functools.partial(
            F.resize_image_tensor, size=256, interpolation=F.InterpolationMode.BILINEAR
        ),
        "resize BILINEAR",
    ),
    (
        functools.partial(
            F.resize_image_tensor, size=256, interpolation=F.InterpolationMode.NEAREST
        ),
        "resize NEAREST",
    ),
    (
        functools.partial(
            F.affine_image_tensor,
            angle=0.0,
            translate=[0.0, 0.0],
            scale=1.0,
            shear=[0.0, 0.0],
            interpolation=F.InterpolationMode.BILINEAR,
        ),
        "affine BILINEAR",
    ),
    (
        functools.partial(
            F.affine_image_tensor,
            angle=0.0,
            translate=[0.0, 0.0],
            scale=1.0,
            shear=[0.0, 0.0],
            interpolation=F.InterpolationMode.NEAREST,
        ),
        "affine NEAREST",
    ),
]

measurements = [
    benchmark.Timer(
        stmt="fn(image)",
        globals=dict(fn=fn, image=image),
        label="impact of input contiguity",
        sub_label=name,
        description=memory_format,
    ).blocked_autorange(min_run_time=5)
    for (fn, name), (image, memory_format) in itertools.product(fns, images)
]

comparison = benchmark.Compare(measurements)
comparison.trim_significant_figures()
comparison.print()
[----------------------------------------------- impact of input contiguity ----------------------------------------------]
                                                      |  contiguous_format  |  channels_last  |  other, stride=(1, 1551, 3)
1 threads: ----------------------------------------------------------------------------------------------------------------
      normalize / float32                             |          161        |       2110      |             2120           
      normalize with prior `.contiguous()` / float32  |          160        |       1100      |             1100           
      resize BILINEAR / uint8                         |         1370        |       1050      |             1070           
      resize BILINEAR / float32                       |         1200        |        870      |              884           
      resize NEAREST / uint8                          |          307        |        320      |              321           
      resize NEAREST / float32                        |          121        |        141      |              139           
      affine BILINEAR / uint8                         |         6140        |       6180      |             5990           
      affine BILINEAR / float32                       |         5630        |       5640      |             5460           
      affine NEAREST / uint8                          |         3660        |       3650      |             3680           
      affine NEAREST / float32                        |         3120        |       3150      |             3140           

Times are in microseconds (us).
  • normalize is ~10x faster on contiguous inputs
  • naively calling image.contiguous() in normalize significantly improves performance for noncontiguous inputs, while having very little effect on the contiguous case (see the sketch after this list)
  • resize with bilinear interpolation is roughly 30% faster on channels_last inputs
  • resize with nearest interpolation is marginally faster on contiguous inputs
  • affine seems not to be impacted by contiguity
  • dtype is irrelevant
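
For reference, a wrapper along these lines is what the "with prior `.contiguous()`" row above measures (a sketch; the exact harness isn't shown):

from torchvision.prototype.transforms import functional as F

def normalize_contiguous(image, mean, std):
    # .contiguous() is a no-op (no copy) when the input is already contiguous
    return F.normalize_image_tensor(image.contiguous(), mean=mean, std=std)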

My conclusion here is that we should not enforce contiguity, but rather identify the kernels that benefit from it and enforce it there. One option to do that is to run our benchmark again (#6818), but this time for inputs in the channels_last format.

As for normalize, I think we can safely include a .contiguous() call in case inplace=True.

@datumbox (Contributor) commented

> My conclusion here is that we should not enforce contiguity, but rather identify the kernels that benefit from it and enforce it there

I agree. This can be added as an implementation detail / performance hack on the specific identified kernels. We can then easily remove the workarounds if Core optimizes the kernels.

@pmeier (Collaborator, Author) commented Dec 12, 2022

I ran the functional API benchmark from #6818 again with a few tweaks:

  • Instead of using a random image, I used Grace Hopper
  • I read it with torchvision.io.read_image
  • For noncontiguous inputs, I left the image as is, which corresponds to torch.channels_last if we were to add a singleton batch dimension (see the sketch after this list)
  • For contiguous inputs, I simply called .contiguous() on it
  • All measurements were done on CPU on a single thread
  • I tested float32 and uint8 to make sure that the assessment in #6839 (comment) was not a fluke (spoiler: it wasn't)
  • I tested resize for NEAREST and BILINEAR interpolation separately, since they would otherwise be averaged together
  • I tested both v1 and v2 to make sure we didn't introduce a regression
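
The input preparation, sketched (variable names are illustrative):

from torchvision.io import read_image

image = read_image("test/assets/encode_jpeg/grace_hopper_517x606.jpg")  # uint8; use .float() for float32
inputs = {
    "noncontiguous": image,            # as returned by read_image, strides (1, 1551, 3)
    "contiguous": image.contiguous(),
}
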
full log
  [---- adjust_brightness @ torchvision==0.15.0a0+93723b4 ----]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |  824 (+- 23)  |  580 (+- 17)
        uint8 / noncontiguous    |  824 (+- 23)  |  580 (+- 20)
        float32 / contiguous     |  351 (+-  6)  |  123 (+-  2)
        float32 / noncontiguous  |  347 (+-  5)  |  114 (+-  2)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +47.9% (improvement)
  [------ adjust_contrast @ torchvision==0.15.0a0+93723b4 ------]
                                 |       v1       |       v2     
  1 threads: ----------------------------------------------------
        uint8 / contiguous       |  1608 (+-262)  |  874 (+- 21) 
        uint8 / noncontiguous    |  1768 (+-239)  |  1094 (+- 70)
        float32 / contiguous     |  422 (+-  4)   |  291 (+-  3) 
        float32 / noncontiguous  |  662 (+- 16)   |  624 (+- 24) 
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +30.1% (improvement)
  [------- adjust_gamma @ torchvision==0.15.0a0+93723b4 ------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    4 (+-  0)  |    4 (+-  0)
        uint8 / noncontiguous    |    4 (+-  0)  |    4 (+-  0)
        float32 / contiguous     |    3 (+-  0)  |    4 (+-  0)
        float32 / noncontiguous  |    3 (+-  0)  |    4 (+-  0)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: -1.4% (slowdown)
  [-------- adjust_hue @ torchvision==0.15.0a0+93723b4 -------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |   22 (+-  1)  |   15 (+-  1)
        uint8 / noncontiguous    |   21 (+-  1)  |   19 (+-  1)
        float32 / contiguous     |   20 (+-  1)  |   14 (+-  1)
        float32 / noncontiguous  |   19 (+-  1)  |   18 (+-  1)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: +19.6% (improvement)
  [----- adjust_saturation @ torchvision==0.15.0a0+93723b4 -----]
                                 |       v1       |       v2     
  1 threads: ----------------------------------------------------
        uint8 / contiguous       |  1137 (+- 50)  |  815 (+- 24) 
        uint8 / noncontiguous    |  2187 (+- 65)  |  1915 (+- 61)
        float32 / contiguous     |  438 (+-  9)   |  279 (+-  5) 
        float32 / noncontiguous  |  1493 (+- 39)  |  1452 (+- 42)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +19.9% (improvement)
  [----- adjust_sharpness @ torchvision==0.15.0a0+93723b4 ----]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    4 (+-  0)  |    3 (+-  0)
        uint8 / noncontiguous    |    5 (+-  0)  |    4 (+-  0)
        float32 / contiguous     |    4 (+-  0)  |    2 (+-  0)
        float32 / noncontiguous  |    5 (+-  0)  |    4 (+-  0)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: +32.3% (improvement)
  [---------- affine @ torchvision==0.15.0a0+93723b4 ---------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    4 (+-  1)  |    4 (+-  1)
        uint8 / noncontiguous    |    4 (+-  1)  |    4 (+-  1)
        float32 / contiguous     |    3 (+-  1)  |    3 (+-  1)
        float32 / noncontiguous  |    3 (+-  1)  |    3 (+-  1)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: -0.4% (slowdown)
  [-------- autocontrast @ torchvision==0.15.0a0+93723b4 -------]
                                 |       v1       |       v2     
  1 threads: ----------------------------------------------------
        uint8 / contiguous       |  741 (+- 29)   |  770 (+- 25) 
        uint8 / noncontiguous    |  3601 (+- 94)  |  6593 (+- 96)
        float32 / contiguous     |  342 (+-  5)   |  323 (+-  4) 
        float32 / noncontiguous  |  6210 (+-101)  |  6135 (+- 82)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: -20.1% (slowdown)
  [------- center_crop @ torchvision==0.15.0a0+93723b4 -------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    9 (+-  0)  |    5 (+-  0)
        uint8 / noncontiguous    |    9 (+-  0)  |    5 (+-  0)
        float32 / contiguous     |    9 (+-  0)  |    5 (+-  0)
        float32 / noncontiguous  |    9 (+-  0)  |    5 (+-  0)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +41.9% (improvement)
  [--- convert_color_space @ torchvision==0.15.0a0+93723b4 ---]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |  368 (+- 11)  |  284 (+- 10)
        uint8 / noncontiguous    |  583 (+- 15)  |  513 (+- 18)
        float32 / contiguous     |  162 (+-  3)  |   90 (+-  1)
        float32 / noncontiguous  |  391 (+-  9)  |  409 (+- 17)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +18.6% (improvement)
  [----- convert_dtype @ torchvision==0.15.0a0+93723b4 -----]
                               |       v1      |       v2    
  1 threads: ------------------------------------------------
        uint8 / contiguous     |  158 (+-  2)  |  113 (+-  3)
        uint8 / noncontiguous  |  159 (+-  2)  |  114 (+-  2)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +28.5% (improvement)
  [----------- crop @ torchvision==0.15.0a0+93723b4 ----------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    4 (+-  0)  |    4 (+-  0)
        uint8 / noncontiguous    |    4 (+-  0)  |    4 (+-  0)
        float32 / contiguous     |    5 (+-  0)  |    4 (+-  0)
        float32 / noncontiguous  |    4 (+-  0)  |    4 (+-  0)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +16.9% (improvement)
  [--------- elastic @ torchvision==0.15.0a0+93723b4 ---------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    5 (+-  0)  |    5 (+-  0)
        uint8 / noncontiguous    |    5 (+-  0)  |    5 (+-  0)
        float32 / contiguous     |    4 (+-  0)  |    4 (+-  0)
        float32 / noncontiguous  |    4 (+-  0)  |    4 (+-  0)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: +0.8% (improvement)
  [--------- equalize @ torchvision==0.15.0a0+93723b4 --------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    3 (+-  0)  |    3 (+-  0)
        uint8 / noncontiguous    |    3 (+-  0)  |    3 (+-  0)
        float32 / contiguous     |               |    3 (+-  0)
        float32 / noncontiguous  |               |    4 (+-  0)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: -7.6% (slowdown)
  [---------- erase @ torchvision==0.15.0a0+93723b4 ----------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |   28 (+-  1)  |   27 (+-  1)
        uint8 / noncontiguous    |   27 (+-  1)  |   26 (+-  1)
        float32 / contiguous     |   75 (+-  1)  |   76 (+-  2)
        float32 / noncontiguous  |   77 (+-  1)  |   80 (+-  1)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: -0.7% (slowdown)
  [-------- five_crop @ torchvision==0.15.0a0+93723b4 --------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |   30 (+-  2)  |   16 (+-  1)
        uint8 / noncontiguous    |   30 (+-  1)  |   16 (+-  1)
        float32 / contiguous     |   30 (+-  1)  |   16 (+-  1)
        float32 / noncontiguous  |   30 (+-  1)  |   16 (+-  1)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +45.1% (improvement)
  [------ gaussian_blur @ torchvision==0.15.0a0+93723b4 ------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    4 (+-  0)  |    4 (+-  0)
        uint8 / noncontiguous    |    4 (+-  0)  |    4 (+-  0)
        float32 / contiguous     |    3 (+-  0)  |    3 (+-  0)
        float32 / noncontiguous  |    4 (+-  0)  |    4 (+-  0)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: +1.8% (improvement)
  [----- horizontal_flip @ torchvision==0.15.0a0+93723b4 -----]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |   32 (+-  1)  |   32 (+-  1)
        uint8 / noncontiguous    |  711 (+- 15)  |  710 (+- 16)
        float32 / contiguous     |   88 (+-  2)  |   89 (+-  2)
        float32 / noncontiguous  |  695 (+- 33)  |  746 (+- 21)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: -1.8% (slowdown)
  [---------- invert @ torchvision==0.15.0a0+93723b4 ---------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |   30 (+-  1)  |   20 (+-  0)
        uint8 / noncontiguous    |   30 (+-  1)  |   20 (+-  0)
        float32 / contiguous     |   77 (+-  0)  |   75 (+-  0)
        float32 / noncontiguous  |   74 (+-  0)  |   71 (+-  1)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +18.4% (improvement)
  [--------- normalize @ torchvision==0.15.0a0+93723b4 ---------]
                                 |       v1       |       v2     
  1 threads: ----------------------------------------------------
        float32 / contiguous     |  234 (+-  4)   |  167 (+-  2) 
        float32 / noncontiguous  |  2254 (+- 73)  |  2186 (+- 89)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +15.8% (improvement)
  [----------- pad @ torchvision==0.15.0a0+93723b4 -----------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |   53 (+-  1)  |   52 (+-  1)
        uint8 / noncontiguous    |  248 (+- 10)  |  250 (+- 11)
        float32 / contiguous     |  126 (+-  4)  |  127 (+-  4)
        float32 / noncontiguous  |  289 (+- 11)  |  287 (+- 12)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +0.2% (improvement)
  [------- perspective @ torchvision==0.15.0a0+93723b4 -------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    8 (+-  0)  |    8 (+-  0)
        uint8 / noncontiguous    |    8 (+-  0)  |    8 (+-  0)
        float32 / contiguous     |    7 (+-  0)  |    7 (+-  0)
        float32 / noncontiguous  |    7 (+-  0)  |    7 (+-  0)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: +0.5% (improvement)
  [-------- posterize @ torchvision==0.15.0a0+93723b4 --------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |   28 (+-  0)  |   26 (+-  0)
        uint8 / noncontiguous    |   28 (+-  0)  |   24 (+-  0)
        float32 / contiguous     |               |  217 (+-  3)
        float32 / noncontiguous  |               |  202 (+-  3)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +9.6% (improvement)
  [------- resized_crop @ torchvision==0.15.0a0+93723b4 ------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    3 (+-  1)  |    3 (+-  1)
        uint8 / noncontiguous    |    3 (+-  1)  |    3 (+-  1)
        float32 / contiguous     |    3 (+-  1)  |    3 (+-  1)
        float32 / noncontiguous  |    3 (+-  1)  |    3 (+-  1)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: +0.7% (improvement)
  [------ resize BILINEAR @ torchvision==0.15.0a0+93723b4 ------]
                                 |       v1       |       v2     
  1 threads: ----------------------------------------------------
        uint8 / contiguous       |  1292 (+- 69)  |  1240 (+- 56)
        uint8 / noncontiguous    |  910 (+- 67)   |  898 (+- 53) 
        float32 / contiguous     |  1086 (+- 41)  |  1183 (+- 62)
        float32 / noncontiguous  |  743 (+- 33)   |  843 (+- 61) 
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: -4.3% (slowdown)
  [------ resize NEAREST @ torchvision==0.15.0a0+93723b4 -----]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |  643 (+-179)  |  280 (+-  9)
        uint8 / noncontiguous    |  292 (+-149)  |  291 (+-  8)
        float32 / contiguous     |  418 (+-  5)  |  110 (+-  3)
        float32 / noncontiguous  |  122 (+-  3)  |  119 (+-  5)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +33.2% (improvement)
  [---------- rotate @ torchvision==0.15.0a0+93723b4 ---------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    4 (+-  1)  |    4 (+-  1)
        uint8 / noncontiguous    |    4 (+-  1)  |    4 (+-  1)
        float32 / contiguous     |    3 (+-  1)  |    3 (+-  1)
        float32 / noncontiguous  |    3 (+-  1)  |    3 (+-  1)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: -0.6% (slowdown)
  [--------- solarize @ torchvision==0.15.0a0+93723b4 --------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    1 (+-  0)  |    1 (+-  0)
        uint8 / noncontiguous    |    1 (+-  0)  |    1 (+-  0)
        float32 / contiguous     |    1 (+-  0)  |    1 (+-  0)
        float32 / noncontiguous  |    1 (+-  0)  |    1 (+-  0)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: +3.1% (improvement)
  [--------- ten_crop @ torchvision==0.15.0a0+93723b4 --------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |   95 (+-  3)  |   67 (+-  2)
        uint8 / noncontiguous    |  775 (+- 27)  |  739 (+- 21)
        float32 / contiguous     |  146 (+-  7)  |  123 (+-  5)
        float32 / noncontiguous  |  803 (+- 21)  |  771 (+- 18)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +13.6% (improvement)
  [------ vertical_flip @ torchvision==0.15.0a0+93723b4 ------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |   34 (+-  1)  |   32 (+-  1)
        uint8 / noncontiguous    |   34 (+-  1)  |   32 (+-  1)
        float32 / contiguous     |   85 (+-  2)  |   82 (+-  2)
        float32 / noncontiguous  |   77 (+-  1)  |   77 (+-  1)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +3.2% (improvement)

Here are the kernels that are significantly slowed down by noncontiguous inputs:

[------ adjust_contrast @ torchvision==0.15.0a0+93723b4 ------]
                               |       v1       |       v2     
1 threads: ----------------------------------------------------
      uint8 / contiguous       |  1608 (+-262)  |  874 (+- 21) 
      uint8 / noncontiguous    |  1768 (+-239)  |  1094 (+- 70)
      float32 / contiguous     |  422 (+-  4)   |  291 (+-  3) 
      float32 / noncontiguous  |  662 (+- 16)   |  624 (+- 24) 
      
[----- adjust_saturation @ torchvision==0.15.0a0+93723b4 -----]
                               |       v1       |       v2     
1 threads: ----------------------------------------------------
      uint8 / contiguous       |  1137 (+- 50)  |  815 (+- 24) 
      uint8 / noncontiguous    |  2187 (+- 65)  |  1915 (+- 61)
      float32 / contiguous     |  438 (+-  9)   |  279 (+-  5) 
      float32 / noncontiguous  |  1493 (+- 39)  |  1452 (+- 42)
      
[-------- autocontrast @ torchvision==0.15.0a0+93723b4 -------]
                               |       v1       |       v2     
1 threads: ----------------------------------------------------
      uint8 / contiguous       |  741 (+- 29)   |  770 (+- 25) 
      uint8 / noncontiguous    |  3601 (+- 94)  |  6593 (+- 96)
      float32 / contiguous     |  342 (+-  5)   |  323 (+-  4) 
      float32 / noncontiguous  |  6210 (+-101)  |  6135 (+- 82)
      
[--- convert_color_space @ torchvision==0.15.0a0+93723b4 ---]
                               |       v1      |       v2    
1 threads: --------------------------------------------------
      uint8 / contiguous       |  368 (+- 11)  |  284 (+- 10)
      uint8 / noncontiguous    |  583 (+- 15)  |  513 (+- 18)
      float32 / contiguous     |  162 (+-  3)  |   90 (+-  1)
      float32 / noncontiguous  |  391 (+-  9)  |  409 (+- 17)
      
[----- horizontal_flip @ torchvision==0.15.0a0+93723b4 -----]
                               |       v1      |       v2    
1 threads: --------------------------------------------------
      uint8 / contiguous       |   32 (+-  1)  |   32 (+-  1)
      uint8 / noncontiguous    |  711 (+- 15)  |  710 (+- 16)
      float32 / contiguous     |   88 (+-  2)  |   89 (+-  2)
      float32 / noncontiguous  |  695 (+- 33)  |  746 (+- 21)
      
[--------- normalize @ torchvision==0.15.0a0+93723b4 ---------]
                               |       v1       |       v2     
1 threads: ----------------------------------------------------
      float32 / contiguous     |  234 (+-  4)   |  167 (+-  2) 
      float32 / noncontiguous  |  2254 (+- 73)  |  2186 (+- 89)
      
[----------- pad @ torchvision==0.15.0a0+93723b4 -----------]
                               |       v1      |       v2    
1 threads: --------------------------------------------------
      uint8 / contiguous       |   53 (+-  1)  |   52 (+-  1)
      uint8 / noncontiguous    |  248 (+- 10)  |  250 (+- 11)
      float32 / contiguous     |  126 (+-  4)  |  127 (+-  4)
      float32 / noncontiguous  |  289 (+- 11)  |  287 (+- 12)
      
[--------- ten_crop @ torchvision==0.15.0a0+93723b4 --------]
                               |       v1      |       v2    
1 threads: --------------------------------------------------
      uint8 / contiguous       |   95 (+-  3)  |   67 (+-  2)
      uint8 / noncontiguous    |  775 (+- 27)  |  739 (+- 21)
      float32 / contiguous     |  146 (+-  7)  |  123 (+-  5)
      float32 / noncontiguous  |  803 (+- 21)  |  771 (+- 18)

The only counterexample that we have is resize with BILINEAR interpolation:

[------ resize BILINEAR @ torchvision==0.15.0a0+93723b4 ------]
                               |       v1       |       v2     
1 threads: ----------------------------------------------------
      uint8 / contiguous       |  1292 (+- 69)  |  1240 (+- 56)
      uint8 / noncontiguous    |  910 (+- 67)   |  898 (+- 53) 
      float32 / contiguous     |  1086 (+- 41)  |  1183 (+- 62)
      float32 / noncontiguous  |  743 (+- 33)   |  843 (+- 61) 

Next, I'll check our pipelines to see if and where noncontiguity comes into play.
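
One way such a check could look (illustrative; the transform names are just examples):

from torchvision.io import read_image
from torchvision.prototype import transforms

image = read_image("test/assets/encode_jpeg/grace_hopper_517x606.jpg")
for transform in [transforms.Resize(256), transforms.CenterCrop(224)]:
    image = transform(image)
    print(type(transform).__name__, image.is_contiguous(), image.stride())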
