Optimize GELU BFloat16 Impl in CPU path #79378

Closed

Collaborator

@yanbing-j yanbing-j commented Jun 12, 2022

Description

On the slow path (non-contiguous inputs), with approximate set to either none or tanh, the current bfloat16 implementation in ATen is not performance friendly. This PR uses float32 as the intermediate type in order to reduce the heavy cost of converting bf16 to fp32.
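
As a rough illustration of the idea, the sketch below emulates bfloat16 rounding in pure Python and contrasts rounding after every arithmetic step with widening once, computing in a wider type, and narrowing once at the end. It is not the ATen kernel from this PR: the helper names (`to_bf16`, `gelu_tanh_per_op`, `gelu_tanh_via_float`) are hypothetical, Python's `float` is a double rather than float32, and the per-step rounding is an assumption about where the conversion cost comes from.

```python
import math
import struct

K_BETA = math.sqrt(2.0 / math.pi)  # constant of the tanh approximation
K_KAPPA = 0.044715                 # cubic coefficient of the tanh approximation


def to_bf16(x: float) -> float:
    """Round a finite value to the nearest bfloat16 (ties to even).

    bfloat16 is the upper 16 bits of an IEEE float32, so we round the
    float32 bit pattern and clear the lower 16 bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even bias
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def gelu_tanh_per_op(x: float) -> float:
    """Assumed slow pattern: narrow back to bf16 after every step."""
    r = to_bf16
    x = r(x)
    x3 = r(r(x * x) * x)
    inner = r(K_BETA * r(x + r(K_KAPPA * x3)))
    return r(r(0.5 * x) * r(1.0 + r(math.tanh(inner))))


def gelu_tanh_via_float(x: float) -> float:
    """Widen once, compute in the wider type, narrow once at the end."""
    x = to_bf16(x)  # stands in for loading a stored bf16 element
    inner = K_BETA * (x + K_KAPPA * x * x * x)
    return to_bf16(0.5 * x * (1.0 + math.tanh(inner)))


if __name__ == "__main__":
    for value in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(value, gelu_tanh_per_op(value), gelu_tanh_via_float(value))
```

Both functions return a value representable in bfloat16. The second performs one rounding on input and one on output per element, regardless of how many arithmetic steps the formula has.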

Test

Ice Lake, 2 sockets, 32 cores per socket (Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz)

single socket (32 cores):
approximate is none:

| input shapes | forward (base) (ms) | backward (base) (ms) | forward (optimized) (ms) | backward (optimized) (ms) |
| --- | --- | --- | --- | --- |
| [16, 32, 32] | 0.361 | 1.055 | 0.348 | 0.672 |
| [32, 32, 64] | 0.084 | 2.003 | 0.076 | 1.426 |
| [32, 64, 128] | 0.237 | 2.007 | 0.22 | 1.454 |
| [64, 128, 128] | 2.23 | 6.348 | 1.943 | 4.103 |

approximate is tanh:

| input shapes | forward (base) (ms) | backward (base) (ms) | forward (optimized) (ms) | backward (optimized) (ms) |
| --- | --- | --- | --- | --- |
| [16, 32, 32] | 0.203 | 1.209 | 0.138 | 0.474 |
| [32, 32, 64] | 0.063 | 2.497 | 0.043 | 0.985 |
| [32, 64, 128] | 0.201 | 2.707 | 0.141 | 1.205 |
| [64, 128, 128] | 1.549 | 8.749 | 1.065 | 3.635 |

single core:
approximate is none:

| input shapes | forward (base) (ms) | backward (base) (ms) | forward (optimized) (ms) | backward (optimized) (ms) |
| --- | --- | --- | --- | --- |
| [16, 32, 32] | 0.359 | 1.055 | 0.267 | 0.592 |
| [32, 32, 64] | 1.11 | 3.483 | 1.063 | 2.373 |
| [32, 64, 128] | 4.478 | 13.866 | 4.27 | 9.426 |
| [64, 128, 128] | 17.675 | 55.231 | 16.805 | 37.509 |

approximate is tanh:

| input shapes | forward (base) (ms) | backward (base) (ms) | forward (optimized) (ms) | backward (optimized) (ms) |
| --- | --- | --- | --- | --- |
| [16, 32, 32] | 0.202 | 1.212 | 0.138 | 0.473 |
| [32, 32, 64] | 0.776 | 4.843 | 0.531 | 1.872 |
| [32, 64, 128] | 3.203 | 19.267 | 2.16 | 7.243 |
| [64, 128, 128] | 12.33 | 76.834 | 8.286 | 29.553 |
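
To read the timings as speedups, divide the base time by the optimized time. The sketch below does this for the single-core [64, 128, 128] rows, using values copied from the tables above. The `speedup` helper and the dictionary layout are illustrative only and are not part of the PR.

```python
def speedup(base_ms: float, optimized_ms: float) -> float:
    """Return how many times faster the optimized run is than the base run."""
    return base_ms / optimized_ms


# (base ms, optimized ms) for input shape [64, 128, 128], single core,
# copied from the benchmark tables in the PR description.
single_core_64_128_128 = {
    ("none", "forward"): (17.675, 16.805),
    ("none", "backward"): (55.231, 37.509),
    ("tanh", "forward"): (12.33, 8.286),
    ("tanh", "backward"): (76.834, 29.553),
}

if __name__ == "__main__":
    for (approximate, direction), (base, opt) in single_core_64_128_128.items():
        print(f"approximate={approximate} {direction}: {speedup(base, opt):.2f}x")
```

In these rows the backward pass with the tanh approximation shows the largest ratio, and the forward pass with approximate set to none shows the smallest.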

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

@facebook-github-bot
Contributor

facebook-github-bot commented Jun 12, 2022

✅ No Failures (0 Pending)

As of commit a917806 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

@yanbing-j yanbing-j force-pushed the yanbing/gelu_bf16_vec_optimize branch from e78dd68 to d1dd927 Compare June 15, 2022 09:23
@yanbing-j yanbing-j added the intel This tag is for PR from Intel label Jun 16, 2022
@yanbing-j yanbing-j force-pushed the yanbing/gelu_bf16_vec_optimize branch from 86a2b7b to edb2202 Compare July 6, 2022 06:11
@github-actions

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Sep 19, 2022
@facebook-github-bot
Contributor

/easycla

As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details.

This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.

@linux-foundation-easycla

linux-foundation-easycla bot commented Oct 4, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

@yanbing-j
Collaborator Author

@pytorchbot label ciflow/trunk

@pytorch-bot

pytorch-bot bot commented Oct 19, 2022

Can't add following labels to PR: ciflow/trunk. Please ping one of the reviewers for help.

@github-actions github-actions bot closed this Nov 18, 2022
@chunyuan-w chunyuan-w reopened this Dec 12, 2022
@pytorch-bot

pytorch-bot bot commented Dec 12, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/79378

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ddda96a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Dec 12, 2022
@chunyuan-w chunyuan-w added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 13, 2022
@yanbing-j yanbing-j marked this pull request as ready for review December 19, 2022 13:17
@drisspg drisspg added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Dec 22, 2022
@mingfeima
Collaborator

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Labels
ciflow/trunk - Trigger trunk jobs on your pull request
cla signed
intel - This tag is for PR from Intel
Merged
module: cpu - CPU specific problem (e.g., perf, algorithm)
open source
Stale
triaged - This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
Status: Done

7 participants