Improve complex lerp performance #84844

Closed
wants to merge 7 commits into from

peterbell10
Collaborator

@peterbell10 peterbell10 commented Sep 11, 2022

Stack from ghstack (oldest at bottom):

The complex lerp kernel uses `std::abs(z) < 0.5`, which involves
computing a sqrt. Comparing the squared magnitude against 0.25 instead
has much lower latency and so performs much better overall.

In a simple timeit benchmark I see more than a 10x speedup on CPU for a
4096-element complex lerp, from 84 us to 6.7 us.
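
As a rough illustration of the idea (not the actual kernel code), the check `|z| < 0.5` can be rewritten as `|z|^2 < 0.25`, which needs only multiplies and an add. The sketch below uses plain Python complex numbers; the function names `weight_is_small_abs`, `weight_is_small_sq`, and `lerp` are hypothetical, and the two-branch lerp formula is an assumption about how such a kernel might use the check.

```python
def weight_is_small_abs(w: complex) -> bool:
    """Original-style check: abs() of a complex number computes a magnitude,
    which requires a square root."""
    return abs(w) < 0.5


def weight_is_small_sq(w: complex) -> bool:
    """Rewritten-style check: compare |w|^2 with 0.5^2 = 0.25, so no square
    root is needed."""
    return w.real * w.real + w.imag * w.imag < 0.25


def lerp(start: complex, end: complex, w: complex) -> complex:
    """Scalar lerp sketch (assumed form): pick one of two algebraically equal
    formulas depending on whether the weight is small."""
    diff = end - start
    if weight_is_small_sq(w):
        return start + w * diff
    return end - diff * (1 - w)
```

For example, `lerp(0j, 2 + 2j, 0.25 + 0j)` takes the small-weight branch and returns `0.5 + 0.5j`. Very close to the threshold the two checks may disagree because of floating-point rounding, but since both branches compute the same interpolation algebraically, that should not affect correctness.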

@pytorch-bot

pytorch-bot bot commented Sep 11, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/84844

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 92b3719:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

peterbell10 added a commit to peterbell10/pytorch that referenced this pull request Sep 11, 2022
ghstack-source-id: d402414a6a71a8e522a25f2819f83248e19b0b05
Pull Request resolved: pytorch#84844
peterbell10 added a commit to peterbell10/pytorch that referenced this pull request Sep 22, 2022
ghstack-source-id: 3fd059b6e41f541a6a48b26d2c87e67c01fe236d
Pull Request resolved: pytorch#84844
peterbell10 added a commit to peterbell10/pytorch that referenced this pull request Sep 23, 2022
ghstack-source-id: 3fd059b6e41f541a6a48b26d2c87e67c01fe236d
Pull Request resolved: pytorch#84844
@peterbell10 peterbell10 marked this pull request as ready for review September 23, 2022 16:01
@facebook-github-bot
Contributor

/easycla

As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details.

This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.

@linux-foundation-easycla

linux-foundation-easycla bot commented Oct 4, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

@peterbell10
Collaborator Author

@ngimel ping

@ngimel
Collaborator

ngimel commented Oct 13, 2022

@pytorchbot rebase

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 13, 2022
@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased gh/peterbell10/420/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/84844)

pytorchmergebot pushed a commit that referenced this pull request Oct 13, 2022
ghstack-source-id: 3baffe91f2c44d0e29df0d39459a2e4ac457c7cc
Pull Request resolved: #84844
@ngimel
Collaborator

ngimel commented Oct 13, 2022

@pytorchbot merge -g

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks on your PR pass since you used the green (-g) flag (ETA: 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@github-actions

Hey @peterbell10.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

Labels
ciflow/trunk, cla signed, Merged, open source, release notes: complex, topic: performance

5 participants