Vectorize cpu tensor conversions #80905
Conversation
This adds vectorization to the copy kernel acting between different dtypes through the use of `at::vec::convert`. Currently `vec::convert` falls back to a scalar copy loop for most dtypes, but the compiler is still better able to auto-vectorize that loop since it doesn't involve stride calculations. In a simple timeit benchmark I see around a 2x speedup copying from int32 to various dtypes:

| To dtype | Master (us) | This PR (us) |
|----------|-------------|--------------|
| int64    | 23.8        | 10.3         |
| float32  | 16.8        | 8.18         |
| float64  | 18.0        | 9.47         |
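The effect described above can be illustrated with a pure-Python analogy (this is not the ATen code itself, and the function names below are invented for illustration): a scalar conversion loop with explicit per-element index arithmetic versus a whole-buffer conversion that walks both buffers linearly, which is the shape of loop that `at::vec::convert` hands to the compiler.

```python
import timeit
from array import array

n = 1 << 16
src = array('i', range(n))  # 'i' is C int, typically int32

def strided_convert(src, stride=1):
    """Scalar loop with explicit index arithmetic, mimicking the
    strided fallback path (hypothetical stand-in, not ATen code)."""
    dst = array('q', bytes(8 * len(src)))  # 'q' is int64
    for i in range(len(src)):
        dst[i] = src[i * stride]
    return dst

def contiguous_convert(src):
    """Whole-buffer conversion: no per-element stride math, so the
    runtime (or, in C++, the auto-vectorizer) can stream through it."""
    return array('q', src)

# Both paths produce the same int64 result.
assert contiguous_convert(src) == strided_convert(src)

t_strided = timeit.timeit(lambda: strided_convert(src), number=20)
t_contig = timeit.timeit(lambda: contiguous_convert(src), number=20)
print(f"strided: {t_strided:.4f}s  contiguous: {t_contig:.4f}s")
```

The absolute numbers here mean nothing for the PR, but the same measurement pattern (timeit over a fixed-size int32 source converted to a wider dtype) matches the benchmark described above.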
✅ No failures (0 pending) as of commit 9d811be. This comment was automatically generated by Dr. CI.
@ngimel ping
@peterbell10 you'll have to rebase before landing, I think
Pull Request resolved: #80905. Approved by: https://github.com/ngimel
@pytorchbot revert -m="This stack broke macos tests on trunk" -c=ignoredsignal
@pytorchbot successfully started a revert job. Check the current status here.
@peterbell10 your PR has been successfully reverted.
This reverts commit 948cc54. Reverted #80905 on behalf of https://github.com/peterbell10 because this stack broke macos tests on trunk.
@pytorchbot merge -l
@pytorchbot successfully started a merge job. Check the current status here.
Hey @peterbell10. |
Summary: This vectorizes the copy kernel between different dtypes using `at::vec::convert` (see the PR description above).
Pull Request resolved: #80905
Approved by: https://github.com/ngimel
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/84146f3d0db1a39e6a4b363e15e30c6f6f159f75
Reviewed By: seemethere
Differential Revision: D38769975
fbshipit-source-id: 32d13aa275f58a2ba4be90c59933b3e3ea68017e