Enforce contiguous outputs on the transforms v2 kernels? #6839

Open
pmeier opened this issue Oct 26, 2022 · 8 comments
@pmeier (Collaborator) commented Oct 26, 2022

All the performance benchmarks that we did so far for transforms v1 vs. v2 were on contiguous inputs. However, we have a few kernels that leave the output in a noncontiguous state:

  • affine_image_tensor in case image.numel() > 0 and image.ndim == 4 and fill is not None
  • convert_color_space in case we only strip the alpha channel, i.e. RGB_ALPHA -> RGB and GRAY_ALPHA -> GRAY
  • rotate_image_tensor in case image.numel() > 0 and image.ndim == 4 and fill is not None
  • crop_image_tensor
  • center_crop_image_tensor
  • five_crop_image_tensor
  • ten_crop_image_tensor

Where applicable, the same also holds for the *_mask and *_video kernels, since they are thin wrappers around the *_image_tensor ones.
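
To make concrete why the crop kernels in the list above fall into this bucket: cropping can be implemented as plain slicing, which produces a view rather than a copy. A minimal illustration (not the actual kernel code):

import torch

image = torch.rand(3, 224, 224)
crop = image[..., 10:110, 20:120]    # plain slicing returns a view, not a copy
print(crop.is_contiguous())          # False
contiguous_crop = crop.contiguous()  # the enforcement under discussion: one extra copy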

We should benchmark, at least for a few kernels, whether noncontiguous inputs cause a performance degradation that is larger than the cost of enforcing contiguous outputs on the kernels above. If so, we should probably enforce contiguous outputs for our kernels.

cc @vfdev-5 @datumbox @bjuncek

@NicolasHug (Member) commented
IIRC Normalize()'s perf is fairly sensitive to contiguity.

Same for Resize(); depending on what was used for decoding (PIL vs decode_jpeg()), you'll end up with different contiguity.
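
For illustration, the decoding difference can be checked directly (a sketch; the asset path is the one used later in this thread):

from PIL import Image
from torchvision.io import read_image
from torchvision.transforms.functional import to_tensor

path = "test/assets/encode_jpeg/grace_hopper_517x606.jpg"
print(to_tensor(Image.open(path)).is_contiguous())  # True: to_tensor copies into a fresh CHW buffer
print(read_image(path).is_contiguous())             # False: interleaved HWC data viewed as CHW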

@datumbox datumbox added this to To do in Transforms V2 via automation Oct 26, 2022
@pmeier (Collaborator, Author) commented Oct 26, 2022

I'll benchmark them in a bit, but first another question: if we know that these kernels are sensitive to contiguity, should they know how to handle the situation? Meaning, as opposed to what is proposed above, we would shift the burden to the kernels that "need" contiguous inputs instead of enforcing contiguous outputs everywhere.

@NicolasHug (Member) commented
> should they know how to handle the situation?

Do you mean it should be up to the kernel to convert its input if it makes it run faster?

I don't know if this is worth doing; the same goes for the global enforcement of contiguity, BTW. While this may speed up that one particular kernel, it can still make all the other kernels in the rest of the pipeline slower.

@pmeier (Collaborator, Author) commented Oct 26, 2022

> Do you mean it should be up to the kernel to convert its input if it makes it run faster?

Yes.

> I don't know if this is worth doing; the same goes for the global enforcement of contiguity, BTW. While this may speed up that one particular kernel, it can still make all the other kernels in the rest of the pipeline slower.

Agreed. I need to benchmark whether this has an impact or not. Otherwise, there is no need to enforce one or the other.
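
A minimal sketch of the kernel-side option being discussed (illustrative, not the actual v2 code): only the kernels known to be contiguity-sensitive pay for the conversion.

import torch

def contiguity_sensitive_kernel(image: torch.Tensor) -> torch.Tensor:
    # .contiguous() returns the input unchanged (no copy) if it is already
    # contiguous, so only noncontiguous inputs pay for the conversion
    image = image.contiguous()
    ...  # the actual kernel computation would go here
    return image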

@datumbox datumbox moved this from To do to Backlog in Transforms V2 Oct 26, 2022
@vadimkantorov commented
Preserving memory_format (contiguous or channels_last) can be quite important. If there are performance implications to preserving the memory format by default (which could avoid some copies later on), maybe some ops could accept an additional memory_format argument?
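
A hypothetical sketch of such an argument (none of the v2 kernels actually take this parameter):

import torch

def some_kernel(image: torch.Tensor, memory_format: torch.memory_format = torch.preserve_format) -> torch.Tensor:
    out = image * 2  # stand-in for the actual computation
    if memory_format is not torch.preserve_format:
        # e.g. torch.contiguous_format, or torch.channels_last for 4D inputs
        out = out.to(memory_format=memory_format)
    return out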

@datumbox datumbox moved this from Backlog to To do in Transforms V2 Oct 31, 2022
@pmeier (Collaborator, Author) commented Dec 12, 2022

As stated by @NicolasHug in #6839 (comment), torchvision.io.read_image produces noncontiguous outputs. The strides are not random though: after image.unsqueeze(0), we get torch.channels_last as the memory format. I did some quick benchmarks to assess the impact of this on our transformation pipelines:
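
To make the stride claim concrete, a quick check (the stride values match the table below):

import torch
from torchvision.io import read_image

image = read_image("test/assets/encode_jpeg/grace_hopper_517x606.jpg")
print(image.stride())  # (1, 1551, 3): channels have stride 1, i.e. interleaved HWC data
print(image.unsqueeze(0).is_contiguous(memory_format=torch.channels_last))  # True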

Script
import functools
import itertools

from torch.utils import benchmark
from torchvision.io import read_image
from torchvision.prototype.transforms import functional as F


image_other = read_image("test/assets/encode_jpeg/grace_hopper_517x606.jpg").float()
image_channels_last = image_other.unsqueeze(0)
image_contiguous_format = image_other.contiguous()
images = [
    (image_contiguous_format, "contiguous_format"),
    (image_channels_last, "channels_last"),
    (image_other, f"other, stride={image_other.stride()}"),
]

fns = [
    (
        functools.partial(
            F.normalize_image_tensor, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
        ),
        "normalize",
    ),
    (
        functools.partial(
            F.resize_image_tensor, size=256, interpolation=F.InterpolationMode.BILINEAR
        ),
        "resize BILINEAR",
    ),
    (
        functools.partial(
            F.resize_image_tensor, size=256, interpolation=F.InterpolationMode.NEAREST
        ),
        "resize NEAREST",
    ),
    (
        functools.partial(
            F.affine_image_tensor,
            angle=0.0,
            translate=[0.0, 0.0],
            scale=1.0,
            shear=[0.0, 0.0],
            interpolation=F.InterpolationMode.BILINEAR,
        ),
        "affine BILINEAR",
    ),
    (
        functools.partial(
            F.affine_image_tensor,
            angle=0.0,
            translate=[0.0, 0.0],
            scale=1.0,
            shear=[0.0, 0.0],
            interpolation=F.InterpolationMode.NEAREST,
        ),
        "affine NEAREST",
    ),
]

measurements = [
    benchmark.Timer(
        stmt="fn(image)",
        globals=dict(fn=fn, image=image),
        label="impact of input contiguity",
        sub_label=name,
        description=memory_format,
    ).blocked_autorange(min_run_time=5)
    for (fn, name), (image, memory_format) in itertools.product(fns, images)
]

comparison = benchmark.Compare(measurements)
comparison.trim_significant_figures()
comparison.print()
[----------------------------------------------- impact of input contiguity ----------------------------------------------]
                                                      |  contiguous_format  |  channels_last  |  other, stride=(1, 1551, 3)
1 threads: ----------------------------------------------------------------------------------------------------------------
      normalize / float32                             |          161        |       2110      |             2120           
      normalize with prior `.contiguous()` / float32  |          160        |       1100      |             1100           
      resize BILINEAR / uint8                         |         1370        |       1050      |             1070           
      resize BILINEAR / float32                       |         1200        |        870      |              884           
      resize NEAREST / uint8                          |          307        |        320      |              321           
      resize NEAREST / float32                        |          121        |        141      |              139           
      affine BILINEAR / uint8                         |         6140        |       6180      |             5990           
      affine BILINEAR / float32                       |         5630        |       5640      |             5460           
      affine NEAREST / uint8                          |         3660        |       3650      |             3680           
      affine NEAREST / float32                        |         3120        |       3150      |             3140           

Times are in microseconds (us).
  • normalize is ~10x faster on contiguous inputs
  • naively calling image.contiguous() in normalize significantly improves performance for noncontiguous inputs, while having very little effect on the contiguous case (see the sketch after this list)
  • resize with bilinear interpolation is roughly 30% faster on channels_last inputs
  • resize with nearest interpolation is marginally faster on contiguous inputs
  • affine seems not to be impacted by contiguity
  • dtype is irrelevant
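
For reference, a wrapper along these lines is what the "with prior `.contiguous()`" row above measures (a sketch; the exact harness isn't shown):

from torchvision.prototype.transforms import functional as F

def normalize_contiguous(image, mean, std):
    # .contiguous() is a no-op (no copy) when the input is already contiguous
    return F.normalize_image_tensor(image.contiguous(), mean=mean, std=std)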

My conclusion here is that we should not enforce contiguity, but rather identify the kernels that benefit from it and enforce it there. One option to do that is to run our benchmark again (#6818), but this time for inputs in the channels_last format.

As for normalize, I think we can safely include a .contiguous() call in case inplace=True.

@datumbox (Contributor) commented

> My conclusion here is that we should not enforce contiguity, but rather identify the kernels that benefit from it and enforce it there

I agree. This can be added as an implementation detail / performance hack on the specific identified kernels. We can then easily remove the workarounds if Core optimizes the kernels.

@pmeier (Collaborator, Author) commented Dec 12, 2022

I ran the functional API benchmark from #6818 again with a few tweaks:

  • Instead of using a random image, I used Grace Hopper
  • I read it with torchvision.io.read_image
  • For noncontiguous inputs, I left the image as is, which corresponds to torch.channels_last if we were to add a singleton batch dimension (see the sketch after this list)
  • For contiguous inputs, I simply called .contiguous() on it
  • All measurements were done on CPU on a single thread
  • I tested float32 and uint8 to make sure that the assessment in #6839 (comment) was not a fluke (spoiler: it wasn't)
  • I tested resize for NEAREST and BILINEAR interpolation separately, since they would otherwise be averaged together
  • I tested both v1 and v2 to make sure we didn't introduce a regression
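
The input preparation, sketched (variable names are illustrative):

from torchvision.io import read_image

image = read_image("test/assets/encode_jpeg/grace_hopper_517x606.jpg")  # uint8; use .float() for float32
inputs = {
    "noncontiguous": image,            # as returned by read_image, strides (1, 1551, 3)
    "contiguous": image.contiguous(),
}
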
full log
  [---- adjust_brightness @ torchvision==0.15.0a0+93723b4 ----]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |  824 (+- 23)  |  580 (+- 17)
        uint8 / noncontiguous    |  824 (+- 23)  |  580 (+- 20)
        float32 / contiguous     |  351 (+-  6)  |  123 (+-  2)
        float32 / noncontiguous  |  347 (+-  5)  |  114 (+-  2)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +47.9% (improvement)
  [------ adjust_contrast @ torchvision==0.15.0a0+93723b4 ------]
                                 |       v1       |       v2     
  1 threads: ----------------------------------------------------
        uint8 / contiguous       |  1608 (+-262)  |  874 (+- 21) 
        uint8 / noncontiguous    |  1768 (+-239)  |  1094 (+- 70)
        float32 / contiguous     |  422 (+-  4)   |  291 (+-  3) 
        float32 / noncontiguous  |  662 (+- 16)   |  624 (+- 24) 
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +30.1% (improvement)
  [------- adjust_gamma @ torchvision==0.15.0a0+93723b4 ------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    4 (+-  0)  |    4 (+-  0)
        uint8 / noncontiguous    |    4 (+-  0)  |    4 (+-  0)
        float32 / contiguous     |    3 (+-  0)  |    4 (+-  0)
        float32 / noncontiguous  |    3 (+-  0)  |    4 (+-  0)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: -1.4% (slowdown)
  [-------- adjust_hue @ torchvision==0.15.0a0+93723b4 -------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |   22 (+-  1)  |   15 (+-  1)
        uint8 / noncontiguous    |   21 (+-  1)  |   19 (+-  1)
        float32 / contiguous     |   20 (+-  1)  |   14 (+-  1)
        float32 / noncontiguous  |   19 (+-  1)  |   18 (+-  1)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: +19.6% (improvement)
  [----- adjust_saturation @ torchvision==0.15.0a0+93723b4 -----]
                                 |       v1       |       v2     
  1 threads: ----------------------------------------------------
        uint8 / contiguous       |  1137 (+- 50)  |  815 (+- 24) 
        uint8 / noncontiguous    |  2187 (+- 65)  |  1915 (+- 61)
        float32 / contiguous     |  438 (+-  9)   |  279 (+-  5) 
        float32 / noncontiguous  |  1493 (+- 39)  |  1452 (+- 42)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +19.9% (improvement)
  [----- adjust_sharpness @ torchvision==0.15.0a0+93723b4 ----]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    4 (+-  0)  |    3 (+-  0)
        uint8 / noncontiguous    |    5 (+-  0)  |    4 (+-  0)
        float32 / contiguous     |    4 (+-  0)  |    2 (+-  0)
        float32 / noncontiguous  |    5 (+-  0)  |    4 (+-  0)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: +32.3% (improvement)
  [---------- affine @ torchvision==0.15.0a0+93723b4 ---------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    4 (+-  1)  |    4 (+-  1)
        uint8 / noncontiguous    |    4 (+-  1)  |    4 (+-  1)
        float32 / contiguous     |    3 (+-  1)  |    3 (+-  1)
        float32 / noncontiguous  |    3 (+-  1)  |    3 (+-  1)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: -0.4% (slowdown)
  [-------- autocontrast @ torchvision==0.15.0a0+93723b4 -------]
                                 |       v1       |       v2     
  1 threads: ----------------------------------------------------
        uint8 / contiguous       |  741 (+- 29)   |  770 (+- 25) 
        uint8 / noncontiguous    |  3601 (+- 94)  |  6593 (+- 96)
        float32 / contiguous     |  342 (+-  5)   |  323 (+-  4) 
        float32 / noncontiguous  |  6210 (+-101)  |  6135 (+- 82)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: -20.1% (slowdown)
  [------- center_crop @ torchvision==0.15.0a0+93723b4 -------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    9 (+-  0)  |    5 (+-  0)
        uint8 / noncontiguous    |    9 (+-  0)  |    5 (+-  0)
        float32 / contiguous     |    9 (+-  0)  |    5 (+-  0)
        float32 / noncontiguous  |    9 (+-  0)  |    5 (+-  0)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +41.9% (improvement)
  [--- convert_color_space @ torchvision==0.15.0a0+93723b4 ---]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |  368 (+- 11)  |  284 (+- 10)
        uint8 / noncontiguous    |  583 (+- 15)  |  513 (+- 18)
        float32 / contiguous     |  162 (+-  3)  |   90 (+-  1)
        float32 / noncontiguous  |  391 (+-  9)  |  409 (+- 17)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +18.6% (improvement)
  [----- convert_dtype @ torchvision==0.15.0a0+93723b4 -----]
                               |       v1      |       v2    
  1 threads: ------------------------------------------------
        uint8 / contiguous     |  158 (+-  2)  |  113 (+-  3)
        uint8 / noncontiguous  |  159 (+-  2)  |  114 (+-  2)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +28.5% (improvement)
  [----------- crop @ torchvision==0.15.0a0+93723b4 ----------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    4 (+-  0)  |    4 (+-  0)
        uint8 / noncontiguous    |    4 (+-  0)  |    4 (+-  0)
        float32 / contiguous     |    5 (+-  0)  |    4 (+-  0)
        float32 / noncontiguous  |    4 (+-  0)  |    4 (+-  0)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +16.9% (improvement)
  [--------- elastic @ torchvision==0.15.0a0+93723b4 ---------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    5 (+-  0)  |    5 (+-  0)
        uint8 / noncontiguous    |    5 (+-  0)  |    5 (+-  0)
        float32 / contiguous     |    4 (+-  0)  |    4 (+-  0)
        float32 / noncontiguous  |    4 (+-  0)  |    4 (+-  0)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: +0.8% (improvement)
  [--------- equalize @ torchvision==0.15.0a0+93723b4 --------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    3 (+-  0)  |    3 (+-  0)
        uint8 / noncontiguous    |    3 (+-  0)  |    3 (+-  0)
        float32 / contiguous     |               |    3 (+-  0)
        float32 / noncontiguous  |               |    4 (+-  0)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: -7.6% (slowdown)
  [---------- erase @ torchvision==0.15.0a0+93723b4 ----------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |   28 (+-  1)  |   27 (+-  1)
        uint8 / noncontiguous    |   27 (+-  1)  |   26 (+-  1)
        float32 / contiguous     |   75 (+-  1)  |   76 (+-  2)
        float32 / noncontiguous  |   77 (+-  1)  |   80 (+-  1)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: -0.7% (slowdown)
  [-------- five_crop @ torchvision==0.15.0a0+93723b4 --------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |   30 (+-  2)  |   16 (+-  1)
        uint8 / noncontiguous    |   30 (+-  1)  |   16 (+-  1)
        float32 / contiguous     |   30 (+-  1)  |   16 (+-  1)
        float32 / noncontiguous  |   30 (+-  1)  |   16 (+-  1)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +45.1% (improvement)
  [------ gaussian_blur @ torchvision==0.15.0a0+93723b4 ------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    4 (+-  0)  |    4 (+-  0)
        uint8 / noncontiguous    |    4 (+-  0)  |    4 (+-  0)
        float32 / contiguous     |    3 (+-  0)  |    3 (+-  0)
        float32 / noncontiguous  |    4 (+-  0)  |    4 (+-  0)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: +1.8% (improvement)
  [----- horizontal_flip @ torchvision==0.15.0a0+93723b4 -----]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |   32 (+-  1)  |   32 (+-  1)
        uint8 / noncontiguous    |  711 (+- 15)  |  710 (+- 16)
        float32 / contiguous     |   88 (+-  2)  |   89 (+-  2)
        float32 / noncontiguous  |  695 (+- 33)  |  746 (+- 21)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: -1.8% (slowdown)
  [---------- invert @ torchvision==0.15.0a0+93723b4 ---------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |   30 (+-  1)  |   20 (+-  0)
        uint8 / noncontiguous    |   30 (+-  1)  |   20 (+-  0)
        float32 / contiguous     |   77 (+-  0)  |   75 (+-  0)
        float32 / noncontiguous  |   74 (+-  0)  |   71 (+-  1)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +18.4% (improvement)
  [--------- normalize @ torchvision==0.15.0a0+93723b4 ---------]
                                 |       v1       |       v2     
  1 threads: ----------------------------------------------------
        float32 / contiguous     |  234 (+-  4)   |  167 (+-  2) 
        float32 / noncontiguous  |  2254 (+- 73)  |  2186 (+- 89)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +15.8% (improvement)
  [----------- pad @ torchvision==0.15.0a0+93723b4 -----------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |   53 (+-  1)  |   52 (+-  1)
        uint8 / noncontiguous    |  248 (+- 10)  |  250 (+- 11)
        float32 / contiguous     |  126 (+-  4)  |  127 (+-  4)
        float32 / noncontiguous  |  289 (+- 11)  |  287 (+- 12)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +0.2% (improvement)
  [------- perspective @ torchvision==0.15.0a0+93723b4 -------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    8 (+-  0)  |    8 (+-  0)
        uint8 / noncontiguous    |    8 (+-  0)  |    8 (+-  0)
        float32 / contiguous     |    7 (+-  0)  |    7 (+-  0)
        float32 / noncontiguous  |    7 (+-  0)  |    7 (+-  0)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: +0.5% (improvement)
  [-------- posterize @ torchvision==0.15.0a0+93723b4 --------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |   28 (+-  0)  |   26 (+-  0)
        uint8 / noncontiguous    |   28 (+-  0)  |   24 (+-  0)
        float32 / contiguous     |               |  217 (+-  3)
        float32 / noncontiguous  |               |  202 (+-  3)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +9.6% (improvement)
  [------- resized_crop @ torchvision==0.15.0a0+93723b4 ------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    3 (+-  1)  |    3 (+-  1)
        uint8 / noncontiguous    |    3 (+-  1)  |    3 (+-  1)
        float32 / contiguous     |    3 (+-  1)  |    3 (+-  1)
        float32 / noncontiguous  |    3 (+-  1)  |    3 (+-  1)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: +0.7% (improvement)
  [------ resize BILINEAR @ torchvision==0.15.0a0+93723b4 ------]
                                 |       v1       |       v2     
  1 threads: ----------------------------------------------------
        uint8 / contiguous       |  1292 (+- 69)  |  1240 (+- 56)
        uint8 / noncontiguous    |  910 (+- 67)   |  898 (+- 53) 
        float32 / contiguous     |  1086 (+- 41)  |  1183 (+- 62)
        float32 / noncontiguous  |  743 (+- 33)   |  843 (+- 61) 
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: -4.3% (slowdown)
  [------ resize NEAREST @ torchvision==0.15.0a0+93723b4 -----]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |  643 (+-179)  |  280 (+-  9)
        uint8 / noncontiguous    |  292 (+-149)  |  291 (+-  8)
        float32 / contiguous     |  418 (+-  5)  |  110 (+-  3)
        float32 / noncontiguous  |  122 (+-  3)  |  119 (+-  5)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +33.2% (improvement)
  [---------- rotate @ torchvision==0.15.0a0+93723b4 ---------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    4 (+-  1)  |    4 (+-  1)
        uint8 / noncontiguous    |    4 (+-  1)  |    4 (+-  1)
        float32 / contiguous     |    3 (+-  1)  |    3 (+-  1)
        float32 / noncontiguous  |    3 (+-  1)  |    3 (+-  1)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: -0.6% (slowdown)
  [--------- solarize @ torchvision==0.15.0a0+93723b4 --------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |    1 (+-  0)  |    1 (+-  0)
        uint8 / noncontiguous    |    1 (+-  0)  |    1 (+-  0)
        float32 / contiguous     |    1 (+-  0)  |    1 (+-  0)
        float32 / noncontiguous  |    1 (+-  0)  |    1 (+-  0)
  
  Times are in milliseconds (ms).
  
  Aggregated performance change of v2 vs. v1: +3.1% (improvement)
  [--------- ten_crop @ torchvision==0.15.0a0+93723b4 --------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |   95 (+-  3)  |   67 (+-  2)
        uint8 / noncontiguous    |  775 (+- 27)  |  739 (+- 21)
        float32 / contiguous     |  146 (+-  7)  |  123 (+-  5)
        float32 / noncontiguous  |  803 (+- 21)  |  771 (+- 18)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +13.6% (improvement)
  [------ vertical_flip @ torchvision==0.15.0a0+93723b4 ------]
                                 |       v1      |       v2    
  1 threads: --------------------------------------------------
        uint8 / contiguous       |   34 (+-  1)  |   32 (+-  1)
        uint8 / noncontiguous    |   34 (+-  1)  |   32 (+-  1)
        float32 / contiguous     |   85 (+-  2)  |   82 (+-  2)
        float32 / noncontiguous  |   77 (+-  1)  |   77 (+-  1)
  
  Times are in microseconds (us).
  
  Aggregated performance change of v2 vs. v1: +3.2% (improvement)

Here are the kernels that are significantly slowed down by noncontiguous inputs:

[------ adjust_contrast @ torchvision==0.15.0a0+93723b4 ------]
                               |       v1       |       v2     
1 threads: ----------------------------------------------------
      uint8 / contiguous       |  1608 (+-262)  |  874 (+- 21) 
      uint8 / noncontiguous    |  1768 (+-239)  |  1094 (+- 70)
      float32 / contiguous     |  422 (+-  4)   |  291 (+-  3) 
      float32 / noncontiguous  |  662 (+- 16)   |  624 (+- 24) 
      
[----- adjust_saturation @ torchvision==0.15.0a0+93723b4 -----]
                               |       v1       |       v2     
1 threads: ----------------------------------------------------
      uint8 / contiguous       |  1137 (+- 50)  |  815 (+- 24) 
      uint8 / noncontiguous    |  2187 (+- 65)  |  1915 (+- 61)
      float32 / contiguous     |  438 (+-  9)   |  279 (+-  5) 
      float32 / noncontiguous  |  1493 (+- 39)  |  1452 (+- 42)
      
[-------- autocontrast @ torchvision==0.15.0a0+93723b4 -------]
                               |       v1       |       v2     
1 threads: ----------------------------------------------------
      uint8 / contiguous       |  741 (+- 29)   |  770 (+- 25) 
      uint8 / noncontiguous    |  3601 (+- 94)  |  6593 (+- 96)
      float32 / contiguous     |  342 (+-  5)   |  323 (+-  4) 
      float32 / noncontiguous  |  6210 (+-101)  |  6135 (+- 82)
      
[--- convert_color_space @ torchvision==0.15.0a0+93723b4 ---]
                               |       v1      |       v2    
1 threads: --------------------------------------------------
      uint8 / contiguous       |  368 (+- 11)  |  284 (+- 10)
      uint8 / noncontiguous    |  583 (+- 15)  |  513 (+- 18)
      float32 / contiguous     |  162 (+-  3)  |   90 (+-  1)
      float32 / noncontiguous  |  391 (+-  9)  |  409 (+- 17)
      
[----- horizontal_flip @ torchvision==0.15.0a0+93723b4 -----]
                               |       v1      |       v2    
1 threads: --------------------------------------------------
      uint8 / contiguous       |   32 (+-  1)  |   32 (+-  1)
      uint8 / noncontiguous    |  711 (+- 15)  |  710 (+- 16)
      float32 / contiguous     |   88 (+-  2)  |   89 (+-  2)
      float32 / noncontiguous  |  695 (+- 33)  |  746 (+- 21)
      
[--------- normalize @ torchvision==0.15.0a0+93723b4 ---------]
                               |       v1       |       v2     
1 threads: ----------------------------------------------------
      float32 / contiguous     |  234 (+-  4)   |  167 (+-  2) 
      float32 / noncontiguous  |  2254 (+- 73)  |  2186 (+- 89)
      
[----------- pad @ torchvision==0.15.0a0+93723b4 -----------]
                               |       v1      |       v2    
1 threads: --------------------------------------------------
      uint8 / contiguous       |   53 (+-  1)  |   52 (+-  1)
      uint8 / noncontiguous    |  248 (+- 10)  |  250 (+- 11)
      float32 / contiguous     |  126 (+-  4)  |  127 (+-  4)
      float32 / noncontiguous  |  289 (+- 11)  |  287 (+- 12)
      
[--------- ten_crop @ torchvision==0.15.0a0+93723b4 --------]
                               |       v1      |       v2    
1 threads: --------------------------------------------------
      uint8 / contiguous       |   95 (+-  3)  |   67 (+-  2)
      uint8 / noncontiguous    |  775 (+- 27)  |  739 (+- 21)
      float32 / contiguous     |  146 (+-  7)  |  123 (+-  5)
      float32 / noncontiguous  |  803 (+- 21)  |  771 (+- 18)

The only counterexample that we have is resize with BILINEAR interpolation:

[------ resize BILINEAR @ torchvision==0.15.0a0+93723b4 ------]
                               |       v1       |       v2     
1 threads: ----------------------------------------------------
      uint8 / contiguous       |  1292 (+- 69)  |  1240 (+- 56)
      uint8 / noncontiguous    |  910 (+- 67)   |  898 (+- 53) 
      float32 / contiguous     |  1086 (+- 41)  |  1183 (+- 62)
      float32 / noncontiguous  |  743 (+- 33)   |  843 (+- 61) 

Next, I'll check our pipelines to see if and where noncontiguity comes into play.
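
One way such a check could look (illustrative; the transform names are just examples):

from torchvision.io import read_image
from torchvision.prototype import transforms

image = read_image("test/assets/encode_jpeg/grace_hopper_517x606.jpg")
for transform in [transforms.Resize(256), transforms.CenterCrop(224)]:
    image = transform(image)
    print(type(transform).__name__, image.is_contiguous(), image.stride())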
