
[prototype] Speed up Augment Transform Classes #6835

Merged: 5 commits into pytorch:main on Oct 26, 2022

Conversation

datumbox (Contributor) commented on Oct 25, 2022

This PR applies a few optimizations to the Transform classes in _augment.py.

cc @vfdev-5 @bjuncek @pmeier

Comment on lines +43 to +50
if isinstance(value, (int, float)):
self.value = [value]
elif isinstance(value, str):
self.value = None
elif isinstance(value, tuple):
self.value = list(value)
else:
self.value = value
datumbox (Contributor, Author):

Moving the value handling from run time to the constructor. This is more of a code-quality change than a performance one.
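The constructor-time normalization can be sketched in isolation. The class below is a hypothetical, minimal illustration of the pattern (not the torchvision implementation): all isinstance branching happens once in __init__, so the per-call path only reads the already-normalized attribute.

```python
# Hypothetical sketch: hoist argument normalization out of the hot path.
class RandomErasingLike:
    def __init__(self, value=0.0):
        # Normalize `value` once at construction time.
        if isinstance(value, (int, float)):
            self.value = [float(value)]
        elif isinstance(value, str):
            self.value = None  # e.g. a "random" sentinel, resolved at call time
        elif isinstance(value, tuple):
            self.value = list(value)
        else:
            self.value = value

    def __call__(self):
        # The per-call path now contains no isinstance checks.
        return self.value
```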

@@ -121,8 +119,7 @@ def _check_inputs(self, flat_inputs: List[Any]) -> None:
     def _mixup_onehotlabel(self, inpt: features.OneHotLabel, lam: float) -> features.OneHotLabel:
         if inpt.ndim < 2:
             raise ValueError("Need a batch of one hot labels")
-        output = inpt.clone()
-        output = output.roll(1, 0).mul_(1.0 - lam).add_(output.mul_(lam))
+        output = inpt.roll(1, 0).mul_(1.0 - lam).add_(inpt.mul(lam))
datumbox (Contributor, Author) commented on Oct 25, 2022:

As we saw in previous optimizations (for example #6821), removing clone in favour of moving the copy on the first multiplication has speed benefits. Below I benchmark the following two alternatives for various inputs (images and labels) to showcase this:

def withclone(inpt, lam=0.5):
    output = inpt.clone()
    return output.roll(1, 0).mul_(1.0 - lam).add_(output.mul_(lam))


def withoutclone(inpt, lam=0.5):
    return inpt.roll(1, 0).mul_(1.0 - lam).add_(inpt.mul(lam))

Below we benchmark the two methods above in isolation:

[-------------- Mixing cpu torch.float32 --------------]
                         |  withclone   |  withoutclone 
1 threads: ---------------------------------------------
      (16, 3, 400, 400)  |    17200     |      15100    
      (16, 1000)         |       22     |         21    
6 threads: ---------------------------------------------
      (16, 3, 400, 400)  |    17700     |      15400    
      (16, 1000)         |       23     |         20    

Times are in microseconds (us).

[------------- Mixing cuda torch.float32 --------------]
                         |  withclone   |  withoutclone 
1 threads: ---------------------------------------------
      (16, 3, 400, 400)  |    275.3     |       226     
      (16, 1000)         |     32.8     |        26     
6 threads: ---------------------------------------------
      (16, 3, 400, 400)  |    275.9     |       226     
      (16, 1000)         |     32.6     |        26     

Times are in microseconds (us).

pmeier (Contributor):

Because it took me a while to figure out why the clone is needed in the first place: without it,

inpt.roll(1, 0).mul_(1.0 - lam).add_(inpt.mul_(lam))

would modify inpt in place due to inpt.mul_(lam). This is the inplace operation @datumbox is talking about above. inpt.roll always returns a copy, so we can safely use inplace operations afterwards.
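The aliasing point can be illustrated without torch at all. In this purely illustrative sketch, Python lists stand in for tensors: the "rolled" result is a fresh copy (mirroring torch.roll returning a new tensor) and may be updated freely, while the input batch is only ever read.

```python
# Illustrative helper (not torchvision code): lists stand in for tensors.
def mixup(batch, lam=0.5):
    # Like inpt.roll(1, 0): produces a fresh copy, safe to build on.
    rolled = batch[-1:] + batch[:-1]
    # Read `batch` out-of-place; never mutate the caller's input.
    return [r * (1.0 - lam) + x * lam for r, x in zip(rolled, batch)]
```

Running it shows the input is left untouched, which is exactly the property the out-of-place inpt.mul(lam) preserves.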

Comment on lines 242 to 244

+    inverse_paste_alpha_mask = ~paste_alpha_mask
     # Copy-paste images:
-    image = (image * (~paste_alpha_mask)) + (paste_image * paste_alpha_mask)
+    image = (image * inverse_paste_alpha_mask).add_(paste_image * paste_alpha_mask)
datumbox (Contributor, Author) commented on Oct 25, 2022:

We simply cache ~paste_alpha_mask to avoid recomputing it (see the code below this point) and use in-place addition on the intermediate results. I don't have benchmarks for this because I think it's fairly uncontroversial.
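The caching idea can be shown with a small, purely hypothetical helper over lists of 0/1 mask values (names mirror the PR, but this is not the torchvision code): the inverse mask is computed once and reused, instead of evaluating the inversion at every use site.

```python
# Hypothetical sketch: cache the inverted mask instead of recomputing it.
def copy_paste_pixels(image, paste_image, paste_alpha_mask):
    # Compute the inverse once; every later use reads the cached list.
    inverse_mask = [1 - m for m in paste_alpha_mask]
    # Where the mask is 1, take the paste pixel; where 0, keep the original.
    return [
        img * inv + paste * m
        for img, paste, m, inv in zip(image, paste_image, paste_alpha_mask, inverse_mask)
    ]
```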

pmeier (Contributor):

Nit: I would prefer not to mix the regular operators like * and + with the flow interface like .add_ and .mul_. Since the regular operators don't support inplace operations on the right-hand side of a statement, I would prefer using the flow style whenever inplace operations are present:

image = image.mul(inverse_paste_alpha_mask).add_(paste_image.mul(paste_alpha_mask))

pmeier (Contributor):

Non-blocking: we could perform inplace operations on paste_image in the case where we needed to resize it above. Otherwise it is one of the inputs, so we need out-of-place operations, as is done here.

@pmeier pmeier (Contributor) left a comment:

LGTM if CI is green. Thanks Vasilis!


@@ -243,11 +239,12 @@ def _copy_paste(
     if blending:
         paste_alpha_mask = F.gaussian_blur(paste_alpha_mask.unsqueeze(0), kernel_size=[5, 5], sigma=[2.0])

+    inverse_paste_alpha_mask = ~paste_alpha_mask
pmeier (Contributor):

I just thought of another "evil hack" that avoids the out of place inversion here. There are two places where paste_alpha_mask or its inverse is used:

  1. image = (image * inverse_paste_alpha_mask).add_(paste_image * paste_alpha_mask)
  2. masks = masks * inverse_paste_alpha_mask

By reordering the first computation to

image = (paste_image * paste_alpha_mask).add_(image * inverse_paste_alpha_mask)

we no longer need the paste_alpha_mask after inverse_paste_alpha_mask is used the first time. Thus, we can perform inplace operations on it:

image = (paste_image * paste_alpha_mask).add_(image * paste_alpha_mask.logical_not_())

That turns the second statement into

masks = masks * paste_alpha_mask

Although that looks wrong, it is actually correct, since paste_alpha_mask at this point contains its own inverse.

Going for this needs at least an extensive comment, since it is not obvious. I'm not sure it is worth it.
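For illustration, here is a pure-Python sketch of the reordering trick, with lists of 0/1 values standing in for tensors (the helper and its names are hypothetical, not the torchvision code): once the mask has been inverted in place, the same object serves as the inverse at every later use site.

```python
# Hypothetical sketch of the reuse-after-inplace-inversion trick.
def copy_paste(image, paste_image, masks, paste_alpha_mask):
    # Use paste_alpha_mask while it still holds the original values.
    out_image = [p * m for p, m in zip(paste_image, paste_alpha_mask)]
    # Invert in place: paste_alpha_mask now holds its own inverse.
    for i, m in enumerate(paste_alpha_mask):
        paste_alpha_mask[i] = 1 - m
    # This *looks* like multiplying by the mask, but it is the inverse now.
    out_image = [o + img * m for o, img, m in zip(out_image, image, paste_alpha_mask)]
    out_masks = [[mm * m for mm, m in zip(mask, paste_alpha_mask)] for mask in masks]
    return out_image, out_masks
```

As the discussion notes, the final `masks * paste_alpha_mask` line only reads correctly if the reader knows the in-place inversion happened above, which is why an extensive comment would be needed.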

datumbox (Contributor, Author):

@pmeier I'm not 100% sure, but I suspect this will end up with the same number of operations as before. You do perform an inplace operation, but at the end you still need to invert the bool mask and write the result back to memory. I think your proposal would reduce the memory footprint, though, so it's still worth trying. Do you want to bring it as an optimization in a follow-up PR? You could refactor it a bit, splitting it into multiple lines and avoiding mixing a non-inplace op with the inplace op on paste_alpha_mask, to make things clearer to the reader (that's debatable; maybe it's clear already). You could then measure only those two lines in the new PR and show whether the proposal is faster. WDYT?

pmeier (Contributor):

You are right:

def current(*, image, masks, paste_image, paste_alpha_mask, size1_ne_size2):
    inverse_paste_alpha_mask = paste_alpha_mask.logical_not()

    # Copy-paste images:
    image = image.mul(inverse_paste_alpha_mask).add_(paste_image.mul(paste_alpha_mask))
    # Copy-paste masks:
    masks = masks * inverse_paste_alpha_mask


def proposal(*, image, masks, paste_image, paste_alpha_mask, size1_ne_size2):
    if size1_ne_size2:
        tmp = paste_image.mul_(paste_alpha_mask)
    else:
        tmp = paste_image.mul(paste_alpha_mask)
    inverse_paste_alpha_mask = paste_alpha_mask.logical_not_()

    # Copy-paste images:
    image = tmp.add_(image.mul(inverse_paste_alpha_mask))
    # Copy-paste masks:
    masks = masks.mul(inverse_paste_alpha_mask)
Benchmark script

import itertools

import torch
from torch.utils import benchmark

shapes = [
    (3, 512, 512),  # single image
    (5, 3, 512, 512),  # single video
]
dtypes = [torch.uint8, torch.float32]
devices = ["cpu", "cuda"]


def current(*, image, masks, paste_image, paste_alpha_mask, size1_ne_size2):
    inverse_paste_alpha_mask = paste_alpha_mask.logical_not()

    # Copy-paste images:
    image = image.mul(inverse_paste_alpha_mask).add_(paste_image.mul(paste_alpha_mask))
    # Copy-paste masks:
    masks = masks * inverse_paste_alpha_mask


def proposal(*, image, masks, paste_image, paste_alpha_mask, size1_ne_size2):
    if size1_ne_size2:
        tmp = paste_image.mul_(paste_alpha_mask)
    else:
        tmp = paste_image.mul(paste_alpha_mask)
    inverse_paste_alpha_mask = paste_alpha_mask.logical_not_()

    # Copy-paste images:
    image = tmp.add_(image.mul(inverse_paste_alpha_mask))
    # Copy-paste masks:
    masks = masks.mul(inverse_paste_alpha_mask)


timers = [
    benchmark.Timer(
        stmt="fn(image=image, masks=masks, paste_image=paste_image, paste_alpha_mask=paste_alpha_mask, size1_ne_size2=size1_ne_size2)",
        globals=dict(
            fn=fn,
            image=torch.testing.make_tensor(
                shape,
                dtype=dtype,
                device=device,
                low=0,
                high=1 if dtype.is_floating_point else torch.iinfo(dtype).max,
            ),
            masks=torch.testing.make_tensor(
                (4, *shape),
                dtype=torch.uint8,
                device=device,
                low=0,
                high=2,
            ),
            paste_image=torch.testing.make_tensor(
                shape,
                dtype=dtype,
                device=device,
                low=0,
                high=1 if dtype.is_floating_point else torch.iinfo(dtype).max,
            ),
            paste_alpha_mask=torch.testing.make_tensor(
                shape,
                dtype=torch.bool,
                device=device,
                low=0,
                high=2,
            ),
            size1_ne_size2=size1_ne_size2,
        ),
        sub_label=f"{shape!s:16} / {str(dtype).replace('torch.', ''):7} / {device:4} / size1_ne_size2={size1_ne_size2}",
        description=fn.__name__,
    )
    for fn, shape, dtype, device, size1_ne_size2 in itertools.product(
        [
            current,
            proposal,
        ],
        shapes,
        dtypes,
        devices,
        [False, True],
    )
]

measurements = [timer.blocked_autorange(min_run_time=10) for timer in timers]

comparison = benchmark.Compare(measurements)
comparison.print()

                                                                |  current  |  proposal
1 threads: ----------------------------------------------------------------------------
      (3, 512, 512)    / uint8   / cpu  / size1_ne_size2=False  |   1964.5  |   2155.9 
      (3, 512, 512)    / uint8   / cpu  / size1_ne_size2=True   |   1956.6  |   2169.5 
      (3, 512, 512)    / uint8   / cuda / size1_ne_size2=False  |    142.8  |    140.8 
      (3, 512, 512)    / uint8   / cuda / size1_ne_size2=True   |    143.4  |    139.6 
      (3, 512, 512)    / float32 / cpu  / size1_ne_size2=False  |   2053.3  |   2270.9 
      (3, 512, 512)    / float32 / cpu  / size1_ne_size2=True   |   2052.9  |   2217.7 
      (3, 512, 512)    / float32 / cuda / size1_ne_size2=False  |    196.9  |    197.0 
      (3, 512, 512)    / float32 / cuda / size1_ne_size2=True   |    196.9  |    196.7 
      (5, 3, 512, 512) / uint8   / cpu  / size1_ne_size2=False  |  10575.0  |  10932.7 
      (5, 3, 512, 512) / uint8   / cpu  / size1_ne_size2=True   |  10541.8  |  10887.1 
      (5, 3, 512, 512) / uint8   / cuda / size1_ne_size2=False  |    695.4  |    696.4 
      (5, 3, 512, 512) / uint8   / cuda / size1_ne_size2=True   |    695.2  |    697.6 
      (5, 3, 512, 512) / float32 / cpu  / size1_ne_size2=False  |  14170.3  |  14315.7 
      (5, 3, 512, 512) / float32 / cpu  / size1_ne_size2=True   |  14110.5  |  13735.6 
      (5, 3, 512, 512) / float32 / cuda / size1_ne_size2=False  |    946.1  |    949.2 
      (5, 3, 512, 512) / float32 / cuda / size1_ne_size2=True   |    948.9  |    951.2 

Times are in microseconds (us).
  • It makes no difference on CUDA.
  • There is a slowdown on CPU.
  • Only for the combination (5, 3, 512, 512) / float32 / cpu / size1_ne_size2=True is there a speed-up, of ~400µs. The speed-up likely comes from the size1_ne_size2=True condition that I suggested in [prototype] Speed up Augment Transform Classes #6835 (comment).

I would leave it as is.

@datumbox datumbox merged commit c84dbfa into pytorch:main Oct 26, 2022
@datumbox datumbox deleted the prototype/speedup_classes branch October 26, 2022 09:25
facebook-github-bot pushed a commit that referenced this pull request Oct 27, 2022
Summary:
* Moving value estimation of `RandomErasing` from runtime to constructor.

* Speed up mixing on MixUp/Cutmix and small optimization on SimpleCopyPaste.

* Apply nits.

Reviewed By: YosuaMichael

Differential Revision: D40722897

fbshipit-source-id: feb2a56ec9d0a80991e5a2acaff4846d82168639