add mixed data type support for LayerNorm backward on CPU #88064

Closed · wants to merge 2 commits

Conversation

@jiayisunx (Collaborator) commented Oct 31, 2022

Motivation

Amp provides convenience methods for mixed precision. If users run bfloat16 models with amp, torch.autocast keeps module parameters in the accumulation dtype, which leaves gamma and beta in float while the input/output are in bfloat16. The same goes for backward: the parameters are in float, and X, dX, and dY are in bfloat16.
Mixed data type support for the LayerNorm backward is therefore also needed for training models that use LayerNorm.
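
To make this concrete, here is a minimal sketch of the scenario (module choice and shapes are arbitrary; the expected dtypes follow the description above): under CPU autocast with bfloat16, the LayerNorm affine parameters stay in float32 while the activations, and hence the gradients seen by the backward kernel, are bfloat16.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16))
x = torch.randn(8, 8, 16)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)            # LayerNorm sees a bfloat16 input coming from the Linear layer
    loss = y.float().sum()

loss.backward()

print(y.dtype)                     # expected: torch.bfloat16
print(model[1].weight.dtype)       # expected: torch.float32 (gamma/beta stay in the acc dtype)
print(model[1].weight.grad.dtype)  # expected: torch.float32
```

The backward pass therefore has to combine bfloat16 X/dY/dX with float gamma and beta, which is the mixed data type case this PR adds to the CPU kernel.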

Testing

Single socket (ICX, 32 cores):

| shape | fp32 forward (ms) | bf16 forward (ms) | mix forward (ms) | fp32 backward (ms) | bf16 backward (ms) | mix backward (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| (1, 8, 16) | 0.012 | 0.012 | 0.012 | 0.071 | 0.065 | 0.062 |
| (8, 8, 16) | 0.015 | 0.014 | 0.015 | 0.074 | 0.070 | 0.063 |
| (32, 8, 16) | 0.062 | 0.016 | 0.016 | 0.073 | 0.073 | 0.072 |
| (64, 128, 56, 56) | 2.467 | 0.907 | 0.0897 | 12.993 | 7.603 | 7.777 |
| (64, 128, 256, 256) | 48.904 | 25.589 | 25.472 | 343.992 | 183.133 | 188.222 |

Single core (ICX):

| shape | fp32 forward (ms) | bf16 forward (ms) | mix forward (ms) | fp32 backward (ms) | bf16 backward (ms) | mix backward (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| (1, 8, 16) | 0.012 | 0.012 | 0.012 | 0.050 | 0.050 | 0.050 |
| (8, 8, 16) | 0.014 | 0.014 | 0.014 | 0.052 | 0.054 | 0.053 |
| (32, 8, 16) | 0.034 | 0.019 | 0.018 | 0.059 | 0.067 | 0.066 |
| (64, 128, 56, 56) | 66.791 | 17.725 | 19.799 | 119.431 | 106.123 | 107.446 |
| (64, 128, 256, 256) | 1542.477 | 402.132 | 527.044 | 3019.437 | 2336.318 | 2448.320 |
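
The benchmark script itself is not part of the PR description; the sketch below shows one way such backward timings could be reproduced, assuming the kernel is reached through torch.nn.functional.layer_norm, that normalization is over the last dimension, and that the mixed case means bfloat16 activations with float32 affine parameters (shape and iteration count are illustrative).

```python
import time

import torch
import torch.nn.functional as F

# torch.set_num_threads(1)  # uncomment to mimic the single-core run


def bench_layer_norm_backward(shape, x_dtype, param_dtype, iters=200):
    """Average LayerNorm backward time in ms for the given activation/parameter dtypes."""
    N = shape[-1]  # normalized over the last dimension (assumption)
    x = torch.randn(shape, dtype=x_dtype, requires_grad=True)
    weight = torch.randn(N, dtype=param_dtype, requires_grad=True)
    bias = torch.randn(N, dtype=param_dtype, requires_grad=True)
    grad = torch.randn(shape, dtype=x_dtype)  # dY in the activation dtype

    y = F.layer_norm(x, (N,), weight, bias)
    for _ in range(10):  # warmup
        y.backward(grad, retain_graph=True)
        x.grad = weight.grad = bias.grad = None

    start = time.perf_counter()
    for _ in range(iters):
        y.backward(grad, retain_graph=True)
        x.grad = weight.grad = bias.grad = None
    return (time.perf_counter() - start) / iters * 1e3


for name, x_dtype, p_dtype in [
    ("fp32", torch.float32, torch.float32),
    ("bf16", torch.bfloat16, torch.bfloat16),
    ("mix", torch.bfloat16, torch.float32),  # exercises the mixed-dtype backward path
]:
    print(f"{name} backward: {bench_layer_norm_backward((32, 8, 16), x_dtype, p_dtype):.3f} ms")
```

Timings collected this way include Python dispatch overhead, so the absolute numbers will not match the tables above; thread count also matters for the single-socket vs. single-core comparison.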

cc @VitalyFedyunin @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

pytorch-bot added the release notes: nn label Oct 31, 2022
@pytorch-bot (bot) commented Oct 31, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88064

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 27f3948:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jiayisunx marked this pull request as draft October 31, 2022 03:03
@linux-foundation-easycla (bot) commented Oct 31, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

@jgong5 (Collaborator) commented Oct 31, 2022

@jiayisunx Do you mind elaborating on the motivation of this PR in the description?

@jiayisunx (Collaborator, Author)

@pytorchbot label intel

pytorch-bot added the intel label Nov 17, 2022
github-actions added the module: cpu label Dec 5, 2022
jiayisunx marked this pull request as ready for review December 6, 2022 08:42
@jiayisunx (Collaborator, Author)

@jgong5, @mingfeima, could you please help review this PR? Thanks.

bdhirsh added the triaged label Dec 7, 2022
@jiayisunx (Collaborator, Author)

@pytorchbot label intel priority

@pytorch-bot (bot) commented Dec 14, 2022

Didn't find following labels among repository labels: priority

zhuhaozhe added the intel priority label Dec 14, 2022
@malfet (Contributor) left a comment

Looks like layer_norm_backward_kernel_mixed_type is just a copy of LayerNormBackwardKernelImplInternal with slightly different argument types.

Perhaps a better approach would be to just make ACC_T a template argument of LayerNormBackwardKernelImplInternal and call it slightly differently in the mixed precision scenario:

```cpp
fVec x_fvec0, x_fvec1, dy_fvec0, dy_fvec1, gamma_fvec0, gamma_fvec1;
std::tie(x_fvec0, x_fvec1) = convert_bfloat16_float(x_bvec);
std::tie(dy_fvec0, dy_fvec1) = convert_bfloat16_float(dy_bvec);
std::tie(gamma_fvec0, gamma_fvec1) = load2f(gamma_data, N);
```
@jiayisunx (Collaborator, Author) commented Dec 15, 2022

> Looks like layer_norm_backward_kernel_mixed_type is just a copy of LayerNormBackwardKernelImplInternal with slightly different argument types.
>
> Perhaps a better approach would be to just make ACC_T a template argument of LayerNormBackwardKernelImplInternal and call it slightly differently in the mixed precision scenario

@malfet thanks for your comments. Actually, this backward implementation is consistent with the forward one, and there are many places that need to be modified to support the mixed data type, not just slightly different argument types. For example, here we cannot simply call vec::map3_reduce_all to do this part.

jiayisunx closed this Dec 15, 2022
jiayisunx reopened this Dec 15, 2022
@jiayisunx (Collaborator, Author)

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased layer_norm onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout layer_norm && git pull --rebase)

@jiayisunx (Collaborator, Author)

@malfet, I have refactored the LayerNorm backward kernel. Could you please review this PR again?

atalman added this to the 2.0.0 milestone Jan 11, 2023
jiayisunx force-pushed the layer_norm branch 2 times, most recently from 3770921 to 1a4595e, January 18, 2023 07:46
@jiayisunx (Collaborator, Author)

@malfet

@malfet (Contributor) left a comment

Looks good, though the template specialization is essentially copy-and-paste from the generic template. Would it be possible to reuse more code by introducing more templates, i.e. why can't bf16 utilize the same vec::map primitives?

Comment on lines +75 to +86
```cpp
inline std::tuple<Vectorized<float>, Vectorized<float>> load2f(const BFloat16* ptr, int64_t count) {
  return convert_bfloat16_float(Vectorized<BFloat16>::loadu(ptr, count));
}

inline std::tuple<Vectorized<float>, Vectorized<float>> load2f(const float* ptr, int64_t count) {
  using Vec = Vectorized<float>;
  if (count > Vec::size()) {
    return std::make_tuple(Vec::loadu(ptr), Vec::loadu(ptr + Vec::size(), count - Vec::size()));
  } else {
    return std::make_tuple(Vec::loadu(ptr, count), Vec(0));
  }
}
```
Contributor

Hmm, what's the difference between this implementation and the previous one? Should it be just the same template with a default argument, something like `load2f(const BFloat16* ptr, int64_t count = 1);`?

Also, why int64 rather than uint64?

@jiayisunx (Collaborator, Author)

The previous one is the helper for the mixed data type parameter Vec::load(ptr); this implementation is the helper for Vec::load(ptr, count). It uses int64 because loadu(const void* ptr, int64_t count) uses int64.

@jiayisunx (Collaborator, Author)

> Looks good, though the template specialization is essentially copy-and-paste from the generic template. Would it be possible to reuse more code by introducing more templates, i.e. why can't bf16 utilize the same vec::map primitives?

Actually, bf16 can utilize the same vec::map primitives, but the mixed data type cannot. The template specialization is for the mixed data type cases.

@jiayisunx (Collaborator, Author)

@pytorchbot merge

pytorch-bot added the ciflow/trunk label Feb 9, 2023
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again. You can rebase by leaving the following comment on this PR:
@pytorchbot rebase

Details for Dev Infra team: raised by workflow job.

@jiayisunx (Collaborator, Author)

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased layer_norm onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout layer_norm && git pull --rebase)

@jiayisunx (Collaborator, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Labels: ciflow/trunk, intel priority, intel, Merged, module: cpu, open source, release notes: nn, triaged