
[prototype] Speed up Augment Transform Classes #6835

Merged: 5 commits into pytorch:main on Oct 26, 2022

Conversation

datumbox (Contributor) commented on Oct 25, 2022

This PR applies a few optimizations to the Transform classes in _augment.py.

cc @vfdev-5 @bjuncek @pmeier

Comment on lines +43 to +50
if isinstance(value, (int, float)):
self.value = [value]
elif isinstance(value, str):
self.value = None
elif isinstance(value, tuple):
self.value = list(value)
else:
self.value = value
datumbox (Contributor, Author):

Moving the value handling from run time to the constructor. This is more of a code-quality change than a performance one.
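The constructor-time normalization can be sketched in isolation. The class below is a hypothetical, minimal illustration of the pattern (not the torchvision implementation): all isinstance branching happens once in __init__, so the per-call path only reads the already-normalized attribute.

```python
# Hypothetical sketch: hoist argument normalization out of the hot path.
class RandomErasingLike:
    def __init__(self, value=0.0):
        # Normalize `value` once at construction time.
        if isinstance(value, (int, float)):
            self.value = [float(value)]
        elif isinstance(value, str):
            self.value = None  # e.g. a "random" sentinel, resolved at call time
        elif isinstance(value, tuple):
            self.value = list(value)
        else:
            self.value = value

    def __call__(self):
        # The per-call path now contains no isinstance checks.
        return self.value
```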

@@ -121,8 +119,7 @@ def _check_inputs(self, flat_inputs: List[Any]) -> None:
     def _mixup_onehotlabel(self, inpt: features.OneHotLabel, lam: float) -> features.OneHotLabel:
         if inpt.ndim < 2:
             raise ValueError("Need a batch of one hot labels")
-        output = inpt.clone()
-        output = output.roll(1, 0).mul_(1.0 - lam).add_(output.mul_(lam))
+        output = inpt.roll(1, 0).mul_(1.0 - lam).add_(inpt.mul(lam))
datumbox (Contributor, Author) commented on Oct 25, 2022:

As we saw in previous optimizations (for example #6821), removing clone in favour of moving the copy on the first multiplication has speed benefits. Below I benchmark the following two alternatives for various inputs (images and labels) to showcase this:

def withclone(inpt, lam=0.5):
    output = inpt.clone()
    return output.roll(1, 0).mul_(1.0 - lam).add_(output.mul_(lam))


def withoutclone(inpt, lam=0.5):
    return inpt.roll(1, 0).mul_(1.0 - lam).add_(inpt.mul(lam))

Below we benchmark the two methods above in isolation:

[-------------- Mixing cpu torch.float32 --------------]
                         |  withclone   |  withoutclone 
1 threads: ---------------------------------------------
      (16, 3, 400, 400)  |    17200     |      15100    
      (16, 1000)         |       22     |         21    
6 threads: ---------------------------------------------
      (16, 3, 400, 400)  |    17700     |      15400    
      (16, 1000)         |       23     |         20    

Times are in microseconds (us).

[------------- Mixing cuda torch.float32 --------------]
                         |  withclone   |  withoutclone 
1 threads: ---------------------------------------------
      (16, 3, 400, 400)  |    275.3     |       226     
      (16, 1000)         |     32.8     |        26     
6 threads: ---------------------------------------------
      (16, 3, 400, 400)  |    275.9     |       226     
      (16, 1000)         |     32.6     |        26     

Times are in microseconds (us).

pmeier (Contributor):

Because it took me a while to figure out why the clone is needed in the first place: without it,

inpt.roll(1, 0).mul_(1.0 - lam).add_(inpt.mul_(lam))

would modify inpt in place due to inpt.mul_(lam). This is the inplace operation @datumbox is talking about above. inpt.roll always returns a copy, so we can safely use inplace operations afterwards.
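The aliasing point can be illustrated without torch at all. In this purely illustrative sketch, Python lists stand in for tensors: the "rolled" result is a fresh copy (mirroring torch.roll returning a new tensor) and may be updated freely, while the input batch is only ever read.

```python
# Illustrative helper (not torchvision code): lists stand in for tensors.
def mixup(batch, lam=0.5):
    # Like inpt.roll(1, 0): produces a fresh copy, safe to build on.
    rolled = batch[-1:] + batch[:-1]
    # Read `batch` out-of-place; never mutate the caller's input.
    return [r * (1.0 - lam) + x * lam for r, x in zip(rolled, batch)]
```

Running it shows the input is left untouched, which is exactly the property the out-of-place inpt.mul(lam) preserves.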

Comment on lines 242 to 244

+    inverse_paste_alpha_mask = ~paste_alpha_mask
     # Copy-paste images:
-    image = (image * (~paste_alpha_mask)) + (paste_image * paste_alpha_mask)
+    image = (image * inverse_paste_alpha_mask).add_(paste_image * paste_alpha_mask)
datumbox (Contributor, Author) commented on Oct 25, 2022:

We simply cache ~paste_alpha_mask to avoid recomputing it (see the code below this point) and use in-place addition on the intermediate results. I don't have benchmarks for this because I think it's fairly uncontroversial.
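The caching idea can be shown with a small, purely hypothetical helper over lists of 0/1 mask values (names mirror the PR, but this is not the torchvision code): the inverse mask is computed once and reused, instead of evaluating the inversion at every use site.

```python
# Hypothetical sketch: cache the inverted mask instead of recomputing it.
def copy_paste_pixels(image, paste_image, paste_alpha_mask):
    # Compute the inverse once; every later use reads the cached list.
    inverse_mask = [1 - m for m in paste_alpha_mask]
    # Where the mask is 1, take the paste pixel; where 0, keep the original.
    return [
        img * inv + paste * m
        for img, paste, m, inv in zip(image, paste_image, paste_alpha_mask, inverse_mask)
    ]
```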

pmeier (Contributor):

Nit: I would prefer not to mix the regular operators like * and + with the flow interface like .add_ and .mul_. Since the regular operators don't support inplace operations on the right-hand side of a statement, I would prefer using the flow style whenever inplace operations are present:

image = image.mul(inverse_paste_alpha_mask).add_(paste_image.mul(paste_alpha_mask))

pmeier (Contributor):

Non-blocking: we could perform inplace operations on paste_image in the case where we needed to resize it above. Otherwise it is one of the inputs, so we need out-of-place operations, as is done here.

@pmeier pmeier (Contributor) left a comment:

LGTM if CI is green. Thanks Vasilis!


@@ -243,11 +239,12 @@ def _copy_paste(
     if blending:
         paste_alpha_mask = F.gaussian_blur(paste_alpha_mask.unsqueeze(0), kernel_size=[5, 5], sigma=[2.0])

+    inverse_paste_alpha_mask = ~paste_alpha_mask
pmeier (Contributor):

I just thought of another "evil hack" that avoids the out of place inversion here. There are two places where paste_alpha_mask or its inverse is used:

  1. image = (image * inverse_paste_alpha_mask).add_(paste_image * paste_alpha_mask)
  2. masks = masks * inverse_paste_alpha_mask

By reordering the first computation to

image = (paste_image * paste_alpha_mask).add_(image * inverse_paste_alpha_mask)

we no longer need the paste_alpha_mask after inverse_paste_alpha_mask is used the first time. Thus, we can perform inplace operations on it:

image = (paste_image * paste_alpha_mask).add_(image * paste_alpha_mask.logical_not_())

That turns the second statement into

masks = masks * paste_alpha_mask

Although that looks wrong, it is actually correct, since paste_alpha_mask at this point contains its own inverse.

Going for this needs at least an extensive comment, since it is not obvious. I'm not sure it is worth it.
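For illustration, here is a pure-Python sketch of the reordering trick, with lists of 0/1 values standing in for tensors (the helper and its names are hypothetical, not the torchvision code): once the mask has been inverted in place, the same object serves as the inverse at every later use site.

```python
# Hypothetical sketch of the reuse-after-inplace-inversion trick.
def copy_paste(image, paste_image, masks, paste_alpha_mask):
    # Use paste_alpha_mask while it still holds the original values.
    out_image = [p * m for p, m in zip(paste_image, paste_alpha_mask)]
    # Invert in place: paste_alpha_mask now holds its own inverse.
    for i, m in enumerate(paste_alpha_mask):
        paste_alpha_mask[i] = 1 - m
    # This *looks* like multiplying by the mask, but it is the inverse now.
    out_image = [o + img * m for o, img, m in zip(out_image, image, paste_alpha_mask)]
    out_masks = [[mm * m for mm, m in zip(mask, paste_alpha_mask)] for mask in masks]
    return out_image, out_masks
```

As the discussion notes, the final `masks * paste_alpha_mask` line only reads correctly if the reader knows the in-place inversion happened above, which is why an extensive comment would be needed.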

datumbox (Contributor, Author):

@pmeier I'm not 100% sure, but I suspect this will end up with the same number of operations as before. You do perform an inplace operation, but at the end you still need to invert the bool mask and write the result back to memory. I think your proposal would reduce the memory footprint, though, so it's still worth trying. Do you want to bring it as an optimization in a follow-up PR? You could refactor it a bit, splitting it into multiple lines and avoiding mixing a non-inplace op with the inplace op on paste_alpha_mask, to make things clearer to the reader (that's debatable; maybe it's clear already). You could then measure only those two lines in the new PR and show whether the proposal is faster. WDYT?

pmeier (Contributor):

You are right:

def current(*, image, masks, paste_image, paste_alpha_mask, size1_ne_size2):
    inverse_paste_alpha_mask = paste_alpha_mask.logical_not()

    # Copy-paste images:
    image = image.mul(inverse_paste_alpha_mask).add_(paste_image.mul(paste_alpha_mask))
    # Copy-paste masks:
    masks = masks * inverse_paste_alpha_mask


def proposal(*, image, masks, paste_image, paste_alpha_mask, size1_ne_size2):
    if size1_ne_size2:
        tmp = paste_image.mul_(paste_alpha_mask)
    else:
        tmp = paste_image.mul(paste_alpha_mask)
    inverse_paste_alpha_mask = paste_alpha_mask.logical_not_()

    # Copy-paste images:
    image = tmp.add_(image.mul(inverse_paste_alpha_mask))
    # Copy-paste masks:
    masks = masks.mul(inverse_paste_alpha_mask)
Benchmark script

import itertools

import torch
from torch.utils import benchmark

shapes = [
    (3, 512, 512),  # single image
    (5, 3, 512, 512),  # single video
]
dtypes = [torch.uint8, torch.float32]
devices = ["cpu", "cuda"]


def current(*, image, masks, paste_image, paste_alpha_mask, size1_ne_size2):
    inverse_paste_alpha_mask = paste_alpha_mask.logical_not()

    # Copy-paste images:
    image = image.mul(inverse_paste_alpha_mask).add_(paste_image.mul(paste_alpha_mask))
    # Copy-paste masks:
    masks = masks * inverse_paste_alpha_mask


def proposal(*, image, masks, paste_image, paste_alpha_mask, size1_ne_size2):
    if size1_ne_size2:
        tmp = paste_image.mul_(paste_alpha_mask)
    else:
        tmp = paste_image.mul(paste_alpha_mask)
    inverse_paste_alpha_mask = paste_alpha_mask.logical_not_()

    # Copy-paste images:
    image = tmp.add_(image.mul(inverse_paste_alpha_mask))
    # Copy-paste masks:
    masks = masks.mul(inverse_paste_alpha_mask)


timers = [
    benchmark.Timer(
        stmt="fn(image=image, masks=masks, paste_image=paste_image, paste_alpha_mask=paste_alpha_mask, size1_ne_size2=size1_ne_size2)",
        globals=dict(
            fn=fn,
            image=torch.testing.make_tensor(
                shape,
                dtype=dtype,
                device=device,
                low=0,
                high=1 if dtype.is_floating_point else torch.iinfo(dtype).max,
            ),
            masks=torch.testing.make_tensor(
                (4, *shape),
                dtype=torch.uint8,
                device=device,
                low=0,
                high=2,
            ),
            paste_image=torch.testing.make_tensor(
                shape,
                dtype=dtype,
                device=device,
                low=0,
                high=1 if dtype.is_floating_point else torch.iinfo(dtype).max,
            ),
            paste_alpha_mask=torch.testing.make_tensor(
                shape,
                dtype=torch.bool,
                device=device,
                low=0,
                high=2,
            ),
            size1_ne_size2=size1_ne_size2,
        ),
        sub_label=f"{shape!s:16} / {str(dtype).replace('torch.', ''):7} / {device:4} / size1_ne_size2={size1_ne_size2}",
        description=fn.__name__,
    )
    for fn, shape, dtype, device, size1_ne_size2 in itertools.product(
        [
            current,
            proposal,
        ],
        shapes,
        dtypes,
        devices,
        [False, True],
    )
]

measurements = [timer.blocked_autorange(min_run_time=10) for timer in timers]

comparison = benchmark.Compare(measurements)
comparison.print()

                                                                |  current  |  proposal
1 threads: ----------------------------------------------------------------------------
      (3, 512, 512)    / uint8   / cpu  / size1_ne_size2=False  |   1964.5  |   2155.9 
      (3, 512, 512)    / uint8   / cpu  / size1_ne_size2=True   |   1956.6  |   2169.5 
      (3, 512, 512)    / uint8   / cuda / size1_ne_size2=False  |    142.8  |    140.8 
      (3, 512, 512)    / uint8   / cuda / size1_ne_size2=True   |    143.4  |    139.6 
      (3, 512, 512)    / float32 / cpu  / size1_ne_size2=False  |   2053.3  |   2270.9 
      (3, 512, 512)    / float32 / cpu  / size1_ne_size2=True   |   2052.9  |   2217.7 
      (3, 512, 512)    / float32 / cuda / size1_ne_size2=False  |    196.9  |    197.0 
      (3, 512, 512)    / float32 / cuda / size1_ne_size2=True   |    196.9  |    196.7 
      (5, 3, 512, 512) / uint8   / cpu  / size1_ne_size2=False  |  10575.0  |  10932.7 
      (5, 3, 512, 512) / uint8   / cpu  / size1_ne_size2=True   |  10541.8  |  10887.1 
      (5, 3, 512, 512) / uint8   / cuda / size1_ne_size2=False  |    695.4  |    696.4 
      (5, 3, 512, 512) / uint8   / cuda / size1_ne_size2=True   |    695.2  |    697.6 
      (5, 3, 512, 512) / float32 / cpu  / size1_ne_size2=False  |  14170.3  |  14315.7 
      (5, 3, 512, 512) / float32 / cpu  / size1_ne_size2=True   |  14110.5  |  13735.6 
      (5, 3, 512, 512) / float32 / cuda / size1_ne_size2=False  |    946.1  |    949.2 
      (5, 3, 512, 512) / float32 / cuda / size1_ne_size2=True   |    948.9  |    951.2 

Times are in microseconds (us).
  • It makes no difference on CUDA.
  • There is a slowdown on CPU.
  • Only for the combination (5, 3, 512, 512) / float32 / cpu / size1_ne_size2=True is there a speed-up, of ~400µs. The speed-up likely comes from the size1_ne_size2=True condition that I suggested in [prototype] Speed up Augment Transform Classes #6835 (comment).

I would leave it as is.

@datumbox datumbox merged commit c84dbfa into pytorch:main Oct 26, 2022
@datumbox datumbox deleted the prototype/speedup_classes branch October 26, 2022 09:25
facebook-github-bot pushed a commit that referenced this pull request Oct 27, 2022
Summary:
* Moving value estimation of `RandomErasing` from runtime to constructor.

* Speed up mixing on MixUp/Cutmix and small optimization on SimpleCopyPaste.

* Apply nits.

Reviewed By: YosuaMichael

Differential Revision: D40722897

fbshipit-source-id: feb2a56ec9d0a80991e5a2acaff4846d82168639