Optimize GELU BFloat16 Impl in CPU path #79378
🔗 Helpful links
✅ No Failures (0 Pending)
As of commit a917806 (more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚
This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as
/easycla
As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details. This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.
@pytorchbot label ciflow/trunk
Can't add following labels to PR: ciflow/trunk. Please ping one of the reviewers for help.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/79378
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit ddda96a: This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Description
For the slow path (non-contiguous inputs) with `none` or `tanh` approximate, the current bfloat16 implementation in ATen is not performance friendly. This PR uses float32 as the intermediate type, in order to reduce the heavy cost of converting bf16 to fp32.

Test
IceLake 2S 32C (Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz)

single socket (32 cores):
- approximate is `none`:
- approximate is `tanh`:

single core:
- approximate is `none`:
- approximate is `tanh`:

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10
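
The float32-intermediate idea from the description can be illustrated with a small, library-free sketch. This is not the ATen kernel from this PR; it only shows the general pattern of widening a bfloat16 value once, evaluating GELU at higher precision, and narrowing once at the end. The helper names `to_bf16`, `gelu_wide`, and `gelu_bf16` are hypothetical, and Python floats (doubles) stand in for the float32 intermediate.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even).

    Hypothetical helper: bfloat16 is modeled as the upper 16 bits of an
    IEEE float32. Intended for finite inputs within float32 range.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Bias so the discarded low 16 bits round to nearest, ties to even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def gelu_wide(x: float, approximate: str = "none") -> float:
    """GELU evaluated entirely in the wide intermediate type."""
    if approximate == "tanh":
        inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
        return 0.5 * x * (1.0 + math.tanh(inner))
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_bf16(x_bf16: float, approximate: str = "none") -> float:
    """Widen once, compute, narrow once (assumed pattern, not the PR's code)."""
    return to_bf16(gelu_wide(x_bf16, approximate))


if __name__ == "__main__":
    for raw in (-2.0, -0.5, 0.0, 0.1, 1.0, 3.0):
        x = to_bf16(raw)
        print(x, gelu_bf16(x, "none"), gelu_bf16(x, "tanh"))
```

Every arithmetic step inside `gelu_wide` runs in the wide type, so the only bf16 conversions are the one on input and the one on output, which is the cost the description aims to reduce.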