Vectorize cpu tensor conversions #80905
Conversation
This adds vectorization to the copy kernel acting between different dtypes through the use of `at::vec::convert`. Currently `vec::convert` falls back to a scalar copy loop for most dtypes, but the compiler is still better able to auto-vectorize that loop since it doesn't involve stride calculations. In a simple timeit benchmark I see around a 2x speedup copying from int32 to various dtypes:

| To dtype | Master (us) | This PR (us) |
|----------|-------------|--------------|
| int64    | 23.8        | 10.3         |
| float32  | 16.8        | 8.18         |
| float64  | 18.0        | 9.47         |
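The effect described above can be illustrated with a pure-Python analogy (this is not the ATen code itself, and the function names below are invented for illustration): a scalar conversion loop with explicit per-element index arithmetic versus a whole-buffer conversion that walks both buffers linearly, which is the shape of loop that `at::vec::convert` hands to the compiler.

```python
import timeit
from array import array

n = 1 << 16
src = array('i', range(n))  # 'i' is C int, typically int32

def strided_convert(src, stride=1):
    """Scalar loop with explicit index arithmetic, mimicking the
    strided fallback path (hypothetical stand-in, not ATen code)."""
    dst = array('q', bytes(8 * len(src)))  # 'q' is int64
    for i in range(len(src)):
        dst[i] = src[i * stride]
    return dst

def contiguous_convert(src):
    """Whole-buffer conversion: no per-element stride math, so the
    runtime (or, in C++, the auto-vectorizer) can stream through it."""
    return array('q', src)

# Both paths produce the same int64 result.
assert contiguous_convert(src) == strided_convert(src)

t_strided = timeit.timeit(lambda: strided_convert(src), number=20)
t_contig = timeit.timeit(lambda: contiguous_convert(src), number=20)
print(f"strided: {t_strided:.4f}s  contiguous: {t_contig:.4f}s")
```

The absolute numbers here mean nothing for the PR, but the same measurement pattern (timeit over a fixed-size int32 source converted to a wider dtype) matches the benchmark described above.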
✅ No failures (0 pending) as of commit 9d811be. This comment was automatically generated by Dr. CI.
@ngimel ping
@peterbell10 you'll have to rebase before landing, I think
Pull Request resolved: #80905. Approved by: https://github.com/ngimel
@pytorchbot revert -m="This stack broke macos tests on trunk" -c=ignoredsignal
@pytorchbot successfully started a revert job. Check the current status here.
@peterbell10 your PR has been successfully reverted.
This reverts commit 948cc54. Reverted #80905 on behalf of https://github.com/peterbell10 because this stack broke macos tests on trunk.
@pytorchbot merge -l
@pytorchbot successfully started a merge job. Check the current status here.
Hey @peterbell10. |
Summary: This vectorizes the copy kernel between different dtypes using `at::vec::convert` (see the PR description above).
Pull Request resolved: #80905
Approved by: https://github.com/ngimel
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/84146f3d0db1a39e6a4b363e15e30c6f6f159f75
Reviewed By: seemethere
Differential Revision: D38769975
fbshipit-source-id: 32d13aa275f58a2ba4be90c59933b3e3ea68017e