improve performance of {invert, solarize}_image_tensor #6819

pmeier · 2022-10-24T10:37:56Z

Per title. The benchmark script from #6818 yields the following:

[--------- invert @ torchvision==0.15.0a0+62da7d4 ---------]
                                            |   v1   |   v2 
1 threads: -------------------------------------------------
      (1, 512, 512)       / uint8   / cpu   |    84  |     8
      (1, 512, 512)       / uint8   / cuda  |    20  |     4
      (1, 512, 512)       / float32 / cpu   |    28  |    27
      (1, 512, 512)       / float32 / cuda  |    20  |     6
      (3, 512, 512)       / uint8   / cpu   |   230  |    18
      (3, 512, 512)       / uint8   / cuda  |    30  |     4
      (3, 512, 512)       / float32 / cpu   |    61  |    62
      (3, 512, 512)       / float32 / cuda  |    40  |    27
      (5, 3, 512, 512)    / uint8   / cpu   |  1100  |    75
      (5, 3, 512, 512)    / uint8   / cuda  |    91  |    34
      (5, 3, 512, 512)    / float32 / cpu   |   500  |   500
      (5, 3, 512, 512)    / float32 / cuda  |   144  |   134
      (4, 5, 3, 512, 512) / uint8   / cpu   |  4400  |   400
      (4, 5, 3, 512, 512) / uint8   / cuda  |   400  |   133
      (4, 5, 3, 512, 512) / float32 / cpu   |  8000  |  7100
      (4, 5, 3, 512, 512) / float32 / cuda  |   540  |   535
2 threads: -------------------------------------------------
      (1, 512, 512)       / uint8   / cpu   |    49  |     7
      (1, 512, 512)       / float32 / cpu   |    21  |    20
      (3, 512, 512)       / uint8   / cpu   |   100  |    13
      (3, 512, 512)       / float32 / cpu   |    40  |    38
      (5, 3, 512, 512)    / uint8   / cpu   |   587  |    42
      (5, 3, 512, 512)    / float32 / cpu   |   150  |   150
      (4, 5, 3, 512, 512) / uint8   / cpu   |  2300  |   140
      (4, 5, 3, 512, 512) / float32 / cpu   |  7000  |  6600
4 threads: -------------------------------------------------
      (1, 512, 512)       / uint8   / cpu   |    30  |     6
      (1, 512, 512)       / float32 / cpu   |    16  |    14
      (3, 512, 512)       / uint8   / cpu   |    70  |    10
      (3, 512, 512)       / float32 / cpu   |    26  |    25
      (5, 3, 512, 512)    / uint8   / cpu   |   300  |    25
      (5, 3, 512, 512)    / float32 / cpu   |    83  |    90
      (4, 5, 3, 512, 512) / uint8   / cpu   |  1170  |    80
      (4, 5, 3, 512, 512) / float32 / cpu   |  6000  |  6000

Times are in microseconds (us).

Performance increased for images by roughly 12.5x.
Performance increased for videos by roughly 14.9x.

uint8 is much faster now, while float32 is on par with what we had before.

cc @vfdev-5 @datumbox @bjuncek

torchvision/prototype/transforms/functional/_color.py

datumbox

It's looking good. Perhaps address @vfdev-5's remark below, to ensure the previous and new behaviours are identical?

Worth noting that you also improve solarize which depends on invert. To get this speed gain, you would have to implement solarize again on v2, to call the new kernel. Do you plan to do this now or separately?

torchvision/prototype/transforms/functional/_color.py

datumbox · 2022-10-24T12:07:12Z

There is a related linter issue.

pmeier · 2022-10-24T12:48:30Z

torchvision/prototype/transforms/functional/_color.py

+    if image.dtype == torch.uint8:
+        return image.bitwise_not()
+    else:
+        return _FT._max_value(image.dtype) - image


I couldn't believe it, but this seems to be the fastest way:

def scalar_sub(image):
    return _max_value(image.dtype) - image


def new_full(image):
    return image.new_full(image.shape, _max_value(image.dtype)).sub_(image)


def full_like(image):
    return torch.full_like(image, _max_value(image.dtype)).sub_(image)

[-------------------- invert float or signed integer tensors -------------------]
                                         |  scalar_sub  |  new_full  |  full_like
1 threads: ----------------------------------------------------------------------
      (3, 256, 256)    / int32   / cpu   |          20  |        28  |         26
      (3, 256, 256)    / int32   / cuda  |           5  |         8  |          7
      (3, 256, 256)    / int64   / cpu   |          88  |       109  |        107
      (3, 256, 256)    / int64   / cuda  |          13  |        22  |         22
      (3, 256, 256)    / float64 / cpu   |          35  |        49  |         49
      (3, 256, 256)    / float64 / cuda  |          12  |        21  |         21
      (5, 3, 256, 256) / int32   / cpu   |          76  |       112  |        110
      (5, 3, 256, 256) / int32   / cuda  |          34  |        67  |         67
      (5, 3, 256, 256) / int64   / cpu   |         439  |       510  |        524
      (5, 3, 256, 256) / int64   / cuda  |          68  |       133  |        133
      (5, 3, 256, 256) / float64 / cpu   |         140  |       215  |        220
      (5, 3, 256, 256) / float64 / cuda  |          68  |       133  |        133

Times are in microseconds (us).
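The `uint8` branch in the diff above works because, for an 8-bit unsigned value, flipping every bit gives the same result as subtracting the value from 255. Below is a minimal pure-Python sketch of that identity. It does not use torch, and `invert_u8_sub` and `invert_u8_not` are hypothetical names introduced only for illustration.

```python
def invert_u8_sub(x: int) -> int:
    """Invert an 8-bit value by subtracting it from the maximum (255)."""
    return 255 - x


def invert_u8_not(x: int) -> int:
    """Invert an 8-bit value by flipping its bits.

    Python ints are unbounded, so ~x is negative; masking with 0xFF keeps
    only the low 8 bits, which mirrors what a uint8 bitwise NOT produces.
    """
    return ~x & 0xFF


# The two formulations agree on every possible uint8 value.
all_equal = all(invert_u8_sub(x) == invert_u8_not(x) for x in range(256))
print(all_equal)
```

The identity only holds when the maximum value is all ones in binary, which is the case for `uint8`. That is presumably why the float and signed integer dtypes in the table above still go through a subtraction.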

pmeier · 2022-10-24T12:50:55Z

> Worth noting that you also improve solarize which depends on invert. To get this speed gain, you would have to implement solarize again on v2, to call the new kernel. Do you plan to do this now or separately?

Let's do it in a follow-up PR. Maybe there are more improvements possible. Will deal with it next.

Edit: There is nothing to optimize in solarize and so I implemented it here.
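As a rough illustration of how solarize can sit on top of the new invert kernel, here is a minimal sketch. It is not the code merged in this PR: `max_value` is a hypothetical stand-in for the private `_FT._max_value` helper (assumed here to return 1.0 for floating point dtypes and the dtype maximum for integer dtypes), and the function names are shortened for readability.

```python
from typing import Union

import torch


def max_value(dtype: torch.dtype) -> Union[int, float]:
    # Hypothetical stand-in for the private _FT._max_value helper:
    # 1.0 for floating point images, the dtype maximum for integer images.
    return 1.0 if dtype.is_floating_point else torch.iinfo(dtype).max


def invert(image: torch.Tensor) -> torch.Tensor:
    # Mirrors the two branches shown in the diff earlier in the thread.
    if image.dtype == torch.uint8:
        return image.bitwise_not()
    return max_value(image.dtype) - image


def solarize(image: torch.Tensor, threshold: float) -> torch.Tensor:
    # Invert only the pixels at or above the threshold; keep the rest.
    return torch.where(image >= threshold, invert(image), image)


image = torch.tensor([[0, 100, 128, 255]], dtype=torch.uint8)
result = solarize(image, threshold=128)
print(result)  # values: [[0, 100, 127, 0]]
```

In this sketch `invert` is evaluated on the full tensor and `torch.where` then selects per pixel, so any speedup in invert carries over to solarize directly.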

pmeier · 2022-10-24T13:25:38Z

Re-ran a reduced version of the benchmark with the recent changes:

[-------- invert @ torchvision==0.15.0a0+62da7d4 --------]
                                            |   v1   |  v2
1 threads: -----------------------------------------------
      (3, 512, 512)       / uint8   / cpu   |   236  |  17
      (3, 512, 512)       / uint8   / cuda  |    28  |   4
      (5, 3, 512, 512)    / uint8   / cpu   |  1140  |  70
      (5, 3, 512, 512)    / uint8   / cuda  |    93  |  34
2 threads: -----------------------------------------------
      (3, 512, 512)       / uint8   / cpu   |   126  |  12
      (5, 3, 512, 512)    / uint8   / cpu   |   582  |  41
4 threads: -----------------------------------------------
      (3, 512, 512)       / uint8   / cpu   |    69  |   7
      (5, 3, 512, 512)    / uint8   / cpu   |   300  |  23

Times are in microseconds (us).

Performance increased for images by roughly 13.8x.
Performance increased for videos by roughly 16.2x.

[-------- solarize @ torchvision==0.15.0a0+62da7d4 --------]
                                            |   v1   |   v2 
1 threads: -------------------------------------------------
      (3, 512, 512)       / uint8   / cpu   |  1000  |  1000
      (3, 512, 512)       / uint8   / cuda  |    54  |    30
      (5, 3, 512, 512)    / uint8   / cpu   |  6000  |  5300
      (5, 3, 512, 512)    / uint8   / cuda  |   230  |   180
2 threads: -------------------------------------------------
      (3, 512, 512)       / uint8   / cpu   |  1000  |   600
      (5, 3, 512, 512)    / uint8   / cpu   |  4000  |  3000
4 threads: -------------------------------------------------
      (3, 512, 512)       / uint8   / cpu   |   668  |   582
      (5, 3, 512, 512)    / uint8   / cpu   |  3140  |  2900

Times are in microseconds (us).

Performance increased for images by roughly 21%.
Performance increased for videos by roughly 16%.
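Tables in this layout can be produced with `torch.utils.benchmark`. The sketch below is not the benchmark script from #6818; it is a reduced, CPU-only comparison of two ways to invert a `uint8` tensor, and the labels, shapes, and `min_run_time` are assumptions chosen for illustration.

```python
import torch
from torch.utils import benchmark

results = []
for shape in [(3, 512, 512), (5, 3, 512, 512)]:
    # Random uint8 data standing in for an image and a short video.
    image = torch.randint(0, 256, shape, dtype=torch.uint8)
    for description, stmt in [
        ("subtract", "255 - image"),
        ("bitwise_not", "image.bitwise_not()"),
    ]:
        timer = benchmark.Timer(
            stmt=stmt,
            globals={"image": image},
            label="invert uint8",
            sub_label=f"{shape} / uint8 / cpu",
            description=description,
            num_threads=1,
        )
        # Collect one Measurement per (shape, statement) pair.
        results.append(timer.blocked_autorange(min_run_time=0.1))

# Prints one row per sub_label and one column per description.
benchmark.Compare(results).print()
```

Absolute timings depend on hardware and the installed torch build, so the numbers printed locally will not match the tables in this thread.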

Summary:
* improve performance of invert_image_tensor
* cleanup
* lint
* more cleanup
* use new invert in solarize

Reviewed By: YosuaMichael

Differential Revision: D40722898

fbshipit-source-id: b2157759d439a184be49d69b91e5ea82e2d96cc6

improve performance of invert_image_tensor

f1a248f

pmeier added the module: transforms, Perf, and prototype labels Oct 24, 2022

pmeier requested review from vfdev-5 and datumbox October 24, 2022 10:37

facebook-github-bot added the cla signed label Oct 24, 2022

pmeier mentioned this pull request Oct 24, 2022

Performance improvements for transforms v2 vs. v1 #6818

Closed

31 tasks

datumbox approved these changes Oct 24, 2022

cleanup

0cc6cdd

lint

34c075c

pmeier added 3 commits October 24, 2022 14:51

more cleanup

7adfeb5

Merge branch 'main' into perf/invert

d8af69b

use new invert in solarize

f8f68d7

pmeier changed the title ~~improve performance of invert_image_tensor~~ improve performance of {invert, solarize}_image_tensor Oct 24, 2022

pmeier merged commit 7f5513d into pytorch:main Oct 24, 2022

pmeier deleted the perf/invert branch October 24, 2022 13:27

pmeier mentioned this pull request Oct 24, 2022

remove unneccesary checks from posterize_image_tensor #6823

Closed

datumbox reviewed Oct 24, 2022

pmeier mentioned this pull request Oct 24, 2022

revert 255 -> max_value fix #6826

Merged
