improve performance of {invert, solarize}_image_tensor #6819

Merged
merged 6 commits into from
Oct 24, 2022

Collaborator

@pmeier pmeier commented Oct 24, 2022

Per title. The benchmark script from #6818 yields the following:

[--------- invert @ torchvision==0.15.0a0+62da7d4 ---------]
                                            |   v1   |   v2 
1 threads: -------------------------------------------------
      (1, 512, 512)       / uint8   / cpu   |    84  |     8
      (1, 512, 512)       / uint8   / cuda  |    20  |     4
      (1, 512, 512)       / float32 / cpu   |    28  |    27
      (1, 512, 512)       / float32 / cuda  |    20  |     6
      (3, 512, 512)       / uint8   / cpu   |   230  |    18
      (3, 512, 512)       / uint8   / cuda  |    30  |     4
      (3, 512, 512)       / float32 / cpu   |    61  |    62
      (3, 512, 512)       / float32 / cuda  |    40  |    27
      (5, 3, 512, 512)    / uint8   / cpu   |  1100  |    75
      (5, 3, 512, 512)    / uint8   / cuda  |    91  |    34
      (5, 3, 512, 512)    / float32 / cpu   |   500  |   500
      (5, 3, 512, 512)    / float32 / cuda  |   144  |   134
      (4, 5, 3, 512, 512) / uint8   / cpu   |  4400  |   400
      (4, 5, 3, 512, 512) / uint8   / cuda  |   400  |   133
      (4, 5, 3, 512, 512) / float32 / cpu   |  8000  |  7100
      (4, 5, 3, 512, 512) / float32 / cuda  |   540  |   535
2 threads: -------------------------------------------------
      (1, 512, 512)       / uint8   / cpu   |    49  |     7
      (1, 512, 512)       / float32 / cpu   |    21  |    20
      (3, 512, 512)       / uint8   / cpu   |   100  |    13
      (3, 512, 512)       / float32 / cpu   |    40  |    38
      (5, 3, 512, 512)    / uint8   / cpu   |   587  |    42
      (5, 3, 512, 512)    / float32 / cpu   |   150  |   150
      (4, 5, 3, 512, 512) / uint8   / cpu   |  2300  |   140
      (4, 5, 3, 512, 512) / float32 / cpu   |  7000  |  6600
4 threads: -------------------------------------------------
      (1, 512, 512)       / uint8   / cpu   |    30  |     6
      (1, 512, 512)       / float32 / cpu   |    16  |    14
      (3, 512, 512)       / uint8   / cpu   |    70  |    10
      (3, 512, 512)       / float32 / cpu   |    26  |    25
      (5, 3, 512, 512)    / uint8   / cpu   |   300  |    25
      (5, 3, 512, 512)    / float32 / cpu   |    83  |    90
      (4, 5, 3, 512, 512) / uint8   / cpu   |  1170  |    80
      (4, 5, 3, 512, 512) / float32 / cpu   |  6000  |  6000

Times are in microseconds (us).

Performance increased for images by roughly 12.5x.
Performance increased for videos by roughly 14.9x.

uint8 is much faster now, while float32 is on par with what we had before.
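The uint8 gain is consistent with the `bitwise_not` branch shown in the review thread of this PR: for unsigned 8-bit values, `255 - x` and `~x` are the same number, so the subtraction can be replaced by a single bitwise op. A minimal sketch that checks this exhaustively follows. The names `max_value`, `invert_reference`, and `invert_sketch` are hypothetical and only for illustration; they are not torchvision's API.

```python
import torch


def max_value(dtype):
    # Hypothetical stand-in for torchvision's private helper (assumption):
    # floating point images live in [0, 1], integer images use the dtype maximum.
    if dtype.is_floating_point:
        return 1.0
    return torch.iinfo(dtype).max


def invert_reference(image):
    # Subtraction-based inversion, valid for every supported dtype.
    return max_value(image.dtype) - image


def invert_sketch(image):
    # uint8 fast path: flipping all bits equals 255 - x for unsigned bytes.
    if image.dtype == torch.uint8:
        return image.bitwise_not()
    return invert_reference(image)


# Exhaustive check over every possible uint8 value.
all_bytes = torch.arange(256, dtype=torch.uint8)
assert torch.equal(invert_sketch(all_bytes), invert_reference(all_bytes))
```

Because the check covers all 256 values, it also answers the question of whether the old and new behaviours agree for uint8 inputs under the assumptions above.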

cc @vfdev-5 @datumbox @bjuncek

Contributor

@datumbox datumbox left a comment

It's looking good. Perhaps address @vfdev-5's remark below, to ensure the previous and new behaviours are identical?

Worth noting that you also improve solarize which depends on invert. To get this speed gain, you would have to implement solarize again on v2, to call the new kernel. Do you plan to do this now or separately?

torchvision/prototype/transforms/functional/_color.py
@datumbox
Contributor

There is a related linter issue.

if image.dtype == torch.uint8:
    return image.bitwise_not()
else:
    return _FT._max_value(image.dtype) - image
Collaborator Author

I couldn't believe it, but this seems to be the fastest way:

def scalar_sub(image):
    return _max_value(image.dtype) - image


def new_full(image):
    return image.new_full(image.shape, _max_value(image.dtype)).sub_(image)


def full_like(image):
    return torch.full_like(image, _max_value(image.dtype)).sub_(image)

[-------------------- invert float or signed integer tensors -------------------]
                                         |  scalar_sub  |  new_full  |  full_like
1 threads: ----------------------------------------------------------------------
      (3, 256, 256)    / int32   / cpu   |      20      |     28     |      26   
      (3, 256, 256)    / int32   / cuda  |       5      |      8     |       7   
      (3, 256, 256)    / int64   / cpu   |      88      |    109     |     107   
      (3, 256, 256)    / int64   / cuda  |      13      |     22     |      22   
      (3, 256, 256)    / float64 / cpu   |      35      |     49     |      49   
      (3, 256, 256)    / float64 / cuda  |      12      |     21     |      21   
      (5, 3, 256, 256) / int32   / cpu   |      76      |    112     |     110   
      (5, 3, 256, 256) / int32   / cuda  |      34      |     67     |      67   
      (5, 3, 256, 256) / int64   / cpu   |     439      |    510     |     524   
      (5, 3, 256, 256) / int64   / cuda  |      68      |    133     |     133   
      (5, 3, 256, 256) / float64 / cpu   |     140      |    215     |     220   
      (5, 3, 256, 256) / float64 / cuda  |      68      |    133     |     133   

Times are in microseconds (us).
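A comparison like the one above could be reproduced roughly as follows. This is a sketch under assumptions, not the script used for the numbers in this thread: `max_value` is a hypothetical stand-in for the private `_max_value` helper, and `make_image` and `run` are illustrative names. It uses `torch.utils.benchmark`, whose `Compare` output has the same layout as the tables here.

```python
import torch
from torch.utils import benchmark


def max_value(dtype):
    # Hypothetical stand-in for the private `_max_value` helper (assumption).
    return 1.0 if dtype.is_floating_point else torch.iinfo(dtype).max


def scalar_sub(image):
    return max_value(image.dtype) - image


def new_full(image):
    return image.new_full(image.shape, max_value(image.dtype)).sub_(image)


def full_like(image):
    return torch.full_like(image, max_value(image.dtype)).sub_(image)


def make_image(shape, dtype):
    # Random test input; the value range is an arbitrary choice for the sketch.
    if dtype.is_floating_point:
        return torch.rand(shape, dtype=dtype)
    return torch.randint(0, 256, shape, dtype=dtype)


def run(shape=(3, 256, 256), dtypes=(torch.int32, torch.float64), min_run_time=0.2):
    results = []
    for dtype in dtypes:
        image = make_image(shape, dtype)
        for fn in (scalar_sub, new_full, full_like):
            timer = benchmark.Timer(
                stmt="fn(image)",
                globals={"fn": fn, "image": image},
                label="invert float or signed integer tensors",
                sub_label=f"{shape} / {dtype}",
                description=fn.__name__,
            )
            results.append(timer.blocked_autorange(min_run_time=min_run_time))
    return results


if __name__ == "__main__":
    benchmark.Compare(run()).print()
```

The sketch only covers CPU tensors; timing CUDA inputs would additionally require creating the image on the GPU, and absolute numbers will differ from machine to machine.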

@pmeier
Collaborator Author

pmeier commented Oct 24, 2022

> Worth noting that you also improve solarize which depends on invert. To get this speed gain, you would have to implement solarize again on v2, to call the new kernel. Do you plan to do this now or separately?

Let's do it in a follow-up PR. Maybe there are more improvements possible. Will deal with it next.

Edit: There is nothing to optimize in solarize and so I implemented it here.
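For context, solarize can be expressed as a thin wrapper around invert, which is why a faster invert kernel carries over directly. The sketch below assumes the usual solarize semantics (pixels at or above the threshold are inverted, the rest are kept); `invert_sketch` and `solarize_sketch` are hypothetical names, not the functions merged in this PR.

```python
import torch


def invert_sketch(image):
    # Same two-branch idea as the kernel discussed in this PR.
    if image.dtype == torch.uint8:
        return image.bitwise_not()
    bound = 1.0 if image.dtype.is_floating_point else torch.iinfo(image.dtype).max
    return bound - image


def solarize_sketch(image, threshold):
    # Assumed semantics: invert every pixel whose value is >= threshold.
    return torch.where(image >= threshold, invert_sketch(image), image)
```

Under these assumptions the remaining cost is the comparison and the `torch.where` selection, which fits the observation that there is little left to optimize in solarize itself.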

@pmeier pmeier changed the title improve performance of invert_image_tensor improve performance of {invert, solarize}_image_tensor Oct 24, 2022
@pmeier
Collaborator Author

pmeier commented Oct 24, 2022

I re-ran a reduced version of the benchmark with the recent changes:

[-------- invert @ torchvision==0.15.0a0+62da7d4 --------]
                                            |   v1   |  v2
1 threads: -----------------------------------------------
      (3, 512, 512)       / uint8   / cpu   |   236  |  17
      (3, 512, 512)       / uint8   / cuda  |    28  |   4
      (5, 3, 512, 512)    / uint8   / cpu   |  1140  |  70
      (5, 3, 512, 512)    / uint8   / cuda  |    93  |  34
2 threads: -----------------------------------------------
      (3, 512, 512)       / uint8   / cpu   |   126  |  12
      (5, 3, 512, 512)    / uint8   / cpu   |   582  |  41
4 threads: -----------------------------------------------
      (3, 512, 512)       / uint8   / cpu   |    69  |   7
      (5, 3, 512, 512)    / uint8   / cpu   |   300  |  23

Times are in microseconds (us).

Performance increased for images by roughly 13.8x.
Performance increased for videos by roughly 16.2x.

[-------- solarize @ torchvision==0.15.0a0+62da7d4 --------]
                                            |   v1   |   v2 
1 threads: -------------------------------------------------
      (3, 512, 512)       / uint8   / cpu   |  1000  |  1000
      (3, 512, 512)       / uint8   / cuda  |    54  |    30
      (5, 3, 512, 512)    / uint8   / cpu   |  6000  |  5300
      (5, 3, 512, 512)    / uint8   / cuda  |   230  |   180
2 threads: -------------------------------------------------
      (3, 512, 512)       / uint8   / cpu   |  1000  |   600
      (5, 3, 512, 512)    / uint8   / cpu   |  4000  |  3000
4 threads: -------------------------------------------------
      (3, 512, 512)       / uint8   / cpu   |   668  |   582
      (5, 3, 512, 512)    / uint8   / cpu   |  3140  |  2900

Times are in microseconds (us).

Performance increased for images by roughly 21%.
Performance increased for videos by roughly 16%.
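The aggregate percentages come from the benchmark script in #6818, whose averaging is not shown in this thread. As a rough cross-check, per-row ratios can be computed directly from the solarize table above. The dictionary layout and the `speedup` helper are illustrative only, and the sketch does not attempt to reproduce the script's summary figures.

```python
# (shape, device, threads) -> (v1, v2) timings in microseconds, copied from the solarize table.
solarize_us = {
    ("(3, 512, 512)", "cpu", 1): (1000, 1000),
    ("(3, 512, 512)", "cuda", 1): (54, 30),
    ("(5, 3, 512, 512)", "cpu", 1): (6000, 5300),
    ("(5, 3, 512, 512)", "cuda", 1): (230, 180),
    ("(3, 512, 512)", "cpu", 2): (1000, 600),
    ("(5, 3, 512, 512)", "cpu", 2): (4000, 3000),
    ("(3, 512, 512)", "cpu", 4): (668, 582),
    ("(5, 3, 512, 512)", "cpu", 4): (3140, 2900),
}


def speedup(v1, v2):
    # Ratio > 1 means v2 is faster than v1.
    return v1 / v2


for (shape, device, threads), (v1, v2) in solarize_us.items():
    print(f"{shape:<18} {device:<5} {threads} thread(s): {speedup(v1, v2):.2f}x")
```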

@pmeier pmeier merged commit 7f5513d into pytorch:main Oct 24, 2022
@pmeier pmeier deleted the perf/invert branch October 24, 2022 13:27
facebook-github-bot pushed a commit that referenced this pull request Oct 27, 2022
Summary:
* improve performance of invert_image_tensor

* cleanup

* lint

* more cleanup

* use new invert in solarize

Reviewed By: YosuaMichael

Differential Revision: D40722898

fbshipit-source-id: b2157759d439a184be49d69b91e5ea82e2d96cc6