
Remaining BBox kernel perf optimizations #6896

Merged 8 commits into pytorch:main on Nov 3, 2022

Conversation

@datumbox (Contributor) commented Nov 3, 2022

Some of the opts highlighted at #6872

cc @vfdev-5 @bjuncek @pmeier

Comment on lines +184 to +188
w_ratio = new_width / old_width
h_ratio = new_height / old_height
ratios = torch.tensor([w_ratio, h_ratio, w_ratio, h_ratio], device=bounding_box.device)
return (
bounding_box.reshape(-1, 2, 2).mul(ratios).to(bounding_box.dtype).reshape(bounding_box.shape),
bounding_box.mul(ratios).to(bounding_box.dtype),
@datumbox (Contributor, Author):
Improvement:

[------------ resize cpu torch.float32 ------------]
                |        old       |        new     
1 threads: -----------------------------------------
      (128, 4)  |   13 (+-  0) us  |    8 (+-  0) us
6 threads: -----------------------------------------
      (128, 4)  |   13 (+-  0) us  |    8 (+-  0) us

Times are in microseconds (us).

[----------- resize cuda torch.float32 ------------]
                |        old       |        new     
1 threads: -----------------------------------------
      (128, 4)  |   37 (+-  0) us  |   31 (+-  0) us
6 threads: -----------------------------------------
      (128, 4)  |   37 (+-  0) us  |   31 (+-  0) us

Times are in microseconds (us).

[------------- resize cpu torch.uint8 -------------]
                |        old       |        new     
1 threads: -----------------------------------------
      (128, 4)  |   19 (+-  0) us  |   13 (+-  0) us
6 threads: -----------------------------------------
      (128, 4)  |   19 (+-  0) us  |   13 (+-  0) us

Times are in microseconds (us).

[------------ resize cuda torch.uint8 -------------]
                |        old       |        new     
1 threads: -----------------------------------------
      (128, 4)  |   45 (+-  0) us  |   39 (+-  0) us
6 threads: -----------------------------------------
      (128, 4)  |   45 (+-  0) us  |   39 (+-  1) us

Times are in microseconds (us).
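The resize optimization benchmarked above can be sketched as a standalone snippet: build a single `[w_ratio, h_ratio, w_ratio, h_ratio]` tensor and scale all four XYXY coordinates with one elementwise multiply. The function name and signature below are illustrative, not torchvision's actual kernel API:

```python
import torch

def resize_bounding_box(bounding_box: torch.Tensor, old_size, new_size) -> torch.Tensor:
    # Hypothetical standalone sketch of the PR's approach: one 4-element
    # ratio tensor and a single elementwise mul over (N, 4) XYXY boxes,
    # with a final cast back to the input dtype (matters for e.g. uint8).
    old_height, old_width = old_size
    new_height, new_width = new_size
    w_ratio = new_width / old_width
    h_ratio = new_height / old_height
    ratios = torch.tensor(
        [w_ratio, h_ratio, w_ratio, h_ratio], device=bounding_box.device
    )
    return bounding_box.mul(ratios).to(bounding_box.dtype)

boxes = torch.tensor([[10.0, 20.0, 30.0, 40.0]])
resized = resize_bounding_box(boxes, old_size=(100, 100), new_size=(200, 50))
print(resized)  # tensor([[ 5., 40., 15., 80.]])
```

Because the boxes stay in their flat (N, 4) layout, broadcasting handles the per-coordinate scaling with no intermediate reshapes.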

@vfdev-5 (Collaborator) commented Nov 3, 2022

Maybe we can merge this after #6879.

@vfdev-5 (Collaborator) left a comment:
Nice optim for resize, thanks @datumbox

@datumbox (Contributor, Author) commented Nov 3, 2022

@vfdev-5 I just pushed a couple of untested opts. Could you check again which you think are safe? I'll do benchmarks after we confirm which ones we want in.

@datumbox datumbox requested a review from vfdev-5 November 3, 2022 11:43
@vfdev-5 (Collaborator) commented Nov 3, 2022

I'll cherry-pick for elastic those that make sense. Thanks for the pointers!

@@ -388,8 +389,7 @@ def _affine_bounding_box_xyxy(
     new_points = torch.matmul(points, transposed_affine_matrix)
     tr, _ = torch.min(new_points, dim=0, keepdim=True)
     # Translate bounding boxes
-    out_bboxes[:, 0::2] = out_bboxes[:, 0::2] - tr[:, 0]
-    out_bboxes[:, 1::2] = out_bboxes[:, 1::2] - tr[:, 1]
+    out_bboxes.sub_(tr.repeat((1, 2)))
@datumbox (Contributor, Author):
Improvement for both changes:

[-------------------- bbox_rotate cpu -------------------]
                     |      False       |       True     
1 threads: ----------------------------------------------
      torch.float32  |  265 (+- 40) us  |  225 (+-  2) us
      torch.float64  |  261 (+-  1) us  |  241 (+-  1) us
      torch.int32    |  258 (+-  1) us  |  239 (+-  2) us
      torch.int64    |  260 (+-  1) us  |  239 (+-  1) us
6 threads: ----------------------------------------------
      torch.float32  |  466 (+- 10) us  |  405 (+- 20) us
      torch.float64  |  483 (+- 10) us  |  422 (+- 55) us
      torch.int32    |  479 (+- 10) us  |  420 (+- 10) us
      torch.int64    |  482 (+- 18) us  |  422 (+- 10) us

Times are in microseconds (us).

[-------------------- bbox_rotate cpu -------------------]
                     |      False       |       True     
1 threads: ----------------------------------------------
      torch.float32  |  498 (+- 46) us  |  432 (+-  0) us
      torch.float64  |  489 (+-  1) us  |  446 (+-  0) us
      torch.int32    |  503 (+-  0) us  |  459 (+-  3) us
      torch.int64    |  504 (+-  3) us  |  458 (+-  0) us
6 threads: ----------------------------------------------
      torch.float32  |  573 (+-  2) us  |  530 (+-  0) us
      torch.float64  |  600 (+- 20) us  |  554 (+- 20) us
      torch.int32    |  609 (+- 20) us  |  560 (+- 10) us
      torch.int64    |  598 (+- 58) us  |  563 (+- 10) us

Times are in microseconds (us).
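The translation change in the hunk above replaces two strided slice assignments with one in-place subtraction of a repeated `[tx, ty, tx, ty]` row. A minimal sketch showing the two forms are equivalent on XYXY boxes (the tensors here are made-up examples, not the PR's test data):

```python
import torch

boxes = torch.tensor([[10.0, 20.0, 30.0, 40.0],
                      [5.0, 5.0, 15.0, 25.0]])
tr = torch.tensor([[2.0, 3.0]])  # (1, 2) translation: (tx, ty)

# Old: two strided slice assignments, one per axis.
old = boxes.clone()
old[:, 0::2] = old[:, 0::2] - tr[:, 0]  # shift x1, x2 by tx
old[:, 1::2] = old[:, 1::2] - tr[:, 1]  # shift y1, y2 by ty

# New: a single in-place subtraction of [tx, ty, tx, ty].
new = boxes.clone()
new.sub_(tr.repeat((1, 2)))

assert torch.equal(old, new)
print(new)  # tensor([[ 8., 17., 28., 37.], [ 3.,  2., 13., 22.]])
```

The single fused `sub_` avoids launching two separate strided kernels and allocates no extra output, which is where the measured gain plausibly comes from.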

@datumbox datumbox changed the title [WIP] Remaining BBox kernel perf optimizations Remaining BBox kernel perf optimizations Nov 3, 2022
@datumbox datumbox added module: transforms Perf For performance improvements prototype labels Nov 3, 2022
@vfdev-5 (Collaborator) left a comment:
LGTM, thanks @datumbox

@datumbox datumbox merged commit f1b840d into pytorch:main Nov 3, 2022
@datumbox datumbox deleted the prototype/bbox_speedups branch November 3, 2022 13:07
facebook-github-bot pushed a commit that referenced this pull request Nov 4, 2022
Summary:
* Bbox resize optimization

* Other (untested) optimizations on `_affine_bounding_box_xyxy` and `elastic_bounding_box`.

* fix conflict

* Reverting changes on elastic

* revert one more change

* Further improvement

Reviewed By: datumbox

Differential Revision: D41020550

fbshipit-source-id: dfd1f2d91490b45176f1976bcec1fc99248f8587
4 participants