Improve complex lerp performance #84844
The complex lerp kernel uses `std::abs(z) < 0.5`, which involves computing a sqrt. Comparing the squared magnitude against 0.25 instead has much lower latency and so performs much better overall. In a simple timeit benchmark I see more than a 10x speedup on CPU for a 4096-element complex lerp, from 84 us to 6.7 us.
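The change described above can be sketched in standalone C++ roughly as follows. This is an illustration, not the actual kernel code from the PR: the names `is_small_weight_abs`, `is_small_weight_sq`, and `lerp_sketch` are hypothetical, and the two-branch lerp formula is an assumption about how the kernel uses the `< 0.5` test.

```cpp
#include <complex>

// Original-style predicate: std::abs on a complex value returns the
// magnitude, which requires a square root (or hypot) per element.
template <typename T>
bool is_small_weight_abs(std::complex<T> w) {
  return std::abs(w) < T(0.5);
}

// Rewritten predicate: compare the squared magnitude against
// 0.25 == 0.5 * 0.5, which needs only multiplies and an add.
template <typename T>
bool is_small_weight_sq(std::complex<T> w) {
  const T re = w.real();
  const T im = w.imag();
  return re * re + im * im < T(0.25);
}

// Hypothetical scalar lerp (assumed formula): pick one of two algebraically
// equivalent expressions depending on the size of the weight.
template <typename T>
std::complex<T> lerp_sketch(std::complex<T> start,
                            std::complex<T> end,
                            std::complex<T> weight) {
  const std::complex<T> diff = end - start;
  return is_small_weight_sq(weight)
      ? start + weight * diff
      : end - diff * (std::complex<T>(1) - weight);
}
```

For ordinary magnitudes the two predicates agree. They can differ for weights whose magnitude is within rounding error of 0.5, or for extreme values where the squares overflow or underflow. Since both lerp branches compute the same quantity mathematically, a different branch choice near the boundary should only affect rounding.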
/easycla

As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details. This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.
@ngimel ping
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased
@pytorchbot merge -g
Merge started. Your change will be merged once all checks on your PR pass since you used the green (-g) flag (ETA: 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Hey @peterbell10.