Vectorize cpu tensor conversions #80905


Closed
wants to merge 12 commits

Conversation

peterbell10
Collaborator

@peterbell10 peterbell10 commented Jul 5, 2022

Stack from ghstack (oldest at bottom):

This adds vectorization to the copy kernel when copying between different
dtypes, through the use of `at::vec::convert`. Currently `vec::convert`
falls back to a scalar copy loop for most dtypes; however, the compiler
is still better able to auto-vectorize that loop since it involves no
stride calculations.

In a simple `timeit` benchmark I see around a 2x speedup copying from
int32 to various dtypes:

| To dtype | Master (us) | This PR (us) |
|----------|-------------|--------------|
| int64    | 23.8        | 10.3         |
| float32  | 16.8        | 8.18         |
| float64  | 18.0        | 9.47         |

@facebook-github-bot
Contributor

facebook-github-bot commented Jul 5, 2022

🔗 Helpful links

✅ No Failures (0 Pending)

As of commit 9d811be (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



@peterbell10 peterbell10 marked this pull request as ready for review July 6, 2022 11:47
@peterbell10 peterbell10 requested a review from ngimel July 6, 2022 11:47
peterbell10 added a commit to peterbell10/pytorch that referenced this pull request Jul 8, 2022
ghstack-source-id: 8d3e6a0
Pull Request resolved: pytorch#80905
@peterbell10
Collaborator Author

@ngimel ping

peterbell10 added a commit to peterbell10/pytorch that referenced this pull request Jul 27, 2022
ghstack-source-id: e9f681a
Pull Request resolved: pytorch#80905
@ngimel
Collaborator

ngimel commented Aug 5, 2022

@peterbell10 you'll have to rebase before landing I think

pytorchmergebot pushed a commit that referenced this pull request Aug 6, 2022
Pull Request resolved: #80905
Approved by: https://github.com/ngimel
@peterbell10
Collaborator Author

@pytorchbot revert -m="This stack broke macos tests on trunk" -c=ignoredsignal

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here

@pytorchmergebot
Collaborator

@peterbell10 your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Aug 6, 2022
This reverts commit 948cc54.

Reverted #80905 on behalf of https://github.com/peterbell10 due to This stack broke macos tests on trunk
@peterbell10 peterbell10 reopened this Aug 6, 2022
@peterbell10 peterbell10 added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 6, 2022
peterbell10 added a commit to peterbell10/pytorch that referenced this pull request Aug 15, 2022
ghstack-source-id: 0f8ad0b
Pull Request resolved: pytorch#80905
@peterbell10
Collaborator Author

@pytorchbot merge -l

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered with the land checks (-l) flag. If you did not specify this flag yourself, you are likely enrolled in the land checks rollout. This means that your change will be merged once all checks on your PR have passed since you have added the ciflow/trunk label to your PR (ETA 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@github-actions
Contributor

Hey @peterbell10.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Aug 17, 2022
Pull Request resolved: #80905
Approved by: https://github.com/ngimel

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/84146f3d0db1a39e6a4b363e15e30c6f6f159f75

Reviewed By: seemethere

Differential Revision: D38769975

fbshipit-source-id: 32d13aa275f58a2ba4be90c59933b3e3ea68017e
@facebook-github-bot facebook-github-bot deleted the gh/peterbell10/349/head branch August 20, 2022 14:19