improve performance of {invert, solarize}_image_tensor #6819
Conversation
It's looking good. Perhaps address @vfdev-5's remark below, to ensure the previous and new behaviours are identical?

Worth noting that you also improve `solarize`, which depends on `invert`. To get this speed gain, you would have to implement `solarize` again in v2, to call the new kernel. Do you plan to do this now or separately?
There is a related linter issue.

```python
if image.dtype == torch.uint8:
    return image.bitwise_not()
else:
    return _FT._max_value(image.dtype) - image
```
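The branch above is the whole trick: for `uint8`, bitwise NOT is equivalent to `255 - x` but avoids materializing a scalar subtraction. A minimal, self-contained re-implementation (the `_max_value` helper here is a hedged stand-in for torchvision's private `_FT._max_value`, not its actual code):

```python
import torch


def _max_value(dtype: torch.dtype):
    # Hypothetical stand-in for torchvision's private helper: the largest
    # representable value for the dtype (1.0 for floating point images).
    if dtype.is_floating_point:
        return 1.0
    return torch.iinfo(dtype).max


def invert_image_tensor(image: torch.Tensor) -> torch.Tensor:
    # uint8 fast path: ~x == 255 - x for unsigned 8-bit values.
    if image.dtype == torch.uint8:
        return image.bitwise_not()
    # Generic path: subtract from the dtype's maximum value.
    return _max_value(image.dtype) - image
```

Note that the integer branch returns a Python `int` from `_max_value`, so the subtraction keeps the input's integer dtype rather than promoting to float.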
I couldn't believe it, but this seems to be the fastest way:

```python
def scalar_sub(image):
    return _max_value(image.dtype) - image


def new_full(image):
    return image.new_full(image.shape, _max_value(image.dtype)).sub_(image)


def full_like(image):
    return torch.full_like(image, _max_value(image.dtype)).sub_(image)
```
```
[-------------------- invert float or signed integer tensors -------------------]
                                        |  scalar_sub  |  new_full  |  full_like
1 threads: -----------------------------------------------------------------------
      (3, 256, 256) / int32 / cpu       |      20      |      28    |      26
      (3, 256, 256) / int32 / cuda      |       5      |       8    |       7
      (3, 256, 256) / int64 / cpu       |      88      |     109    |     107
      (3, 256, 256) / int64 / cuda      |      13      |      22    |      22
      (3, 256, 256) / float64 / cpu     |      35      |      49    |      49
      (3, 256, 256) / float64 / cuda    |      12      |      21    |      21
      (5, 3, 256, 256) / int32 / cpu    |      76      |     112    |     110
      (5, 3, 256, 256) / int32 / cuda   |      34      |      67    |      67
      (5, 3, 256, 256) / int64 / cpu    |     439      |     510    |     524
      (5, 3, 256, 256) / int64 / cuda   |      68      |     133    |     133
      (5, 3, 256, 256) / float64 / cpu  |     140      |     215    |     220
      (5, 3, 256, 256) / float64 / cuda |      68      |     133    |     133

Times are in microseconds (us).
```
Let's do it in a follow-up PR. Maybe there are more improvements possible; I will deal with it next. Edit: it turned out there was nothing left to optimize there.
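For the follow-up, a v2 `solarize` kernel can reuse the faster invert path directly. A minimal sketch (function names and structure are hypothetical, not torchvision's actual v2 code):

```python
import torch


def _invert(image: torch.Tensor) -> torch.Tensor:
    # Fast uint8 path via bitwise NOT; generic path via scalar subtraction.
    if image.dtype == torch.uint8:
        return image.bitwise_not()
    max_value = 1.0 if image.dtype.is_floating_point else torch.iinfo(image.dtype).max
    return max_value - image


def solarize_image_tensor(image: torch.Tensor, threshold) -> torch.Tensor:
    # Solarize inverts only the pixels at or above the threshold.
    return torch.where(image >= threshold, _invert(image), image)
```

Because `torch.where` selects elementwise, the whole image is still inverted once; the win over the old implementation comes entirely from `_invert` being cheaper on `uint8`.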
Re-ran a reduced version of the benchmark with the recent changes.
Summary:
* improve performance of invert_image_tensor
* cleanup
* lint
* more cleanup
* use new invert in solarize

Reviewed By: YosuaMichael
Differential Revision: D40722898
fbshipit-source-id: b2157759d439a184be49d69b91e5ea82e2d96cc6
Per title. The benchmark script from #6818 yields the following: `uint8` is much faster now, while `float32` is on par with what we had before.

cc @vfdev-5 @datumbox @bjuncek
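The `uint8` speedup rests on `bitwise_not` being equivalent to subtracting from 255 for unsigned 8-bit values, which is easy to sanity-check (illustrative only):

```python
import torch

# Random uint8 image; for unsigned 8-bit values, ~x == 255 - x elementwise,
# so the bitwise_not fast path is a drop-in replacement for the subtraction.
img = torch.randint(0, 256, (3, 256, 256), dtype=torch.uint8)
assert torch.equal(img.bitwise_not(), 255 - img)
```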