add mixed data type support for LayerNorm backward on CPU #88064
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88064
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 27f3948. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@jiayisunx Do you mind elaborating on the motivation for this PR in the description?
@pytorchbot label intel
@jgong5, @mingfeima, could you please help review this PR? Thanks.
@pytorchbot label intel priority
Didn't find following labels among repository labels: priority
Looks like layer_norm_backward_kernel_mixed_type is just a copy of LayerNormBackwardKernelImplInternal with slightly different argument types. Perhaps a better approach would be to just move ACC_T to a template argument of LayerNormBackwardKernelImplInternal and call it slightly differently in the mixed precision type scenario.
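For illustration, a hedged sketch of what that suggestion could look like; the parameter list mirrors the existing kernel, but the exact form here is illustrative rather than the code that was ultimately merged:

```cpp
#include <ATen/core/Tensor.h>

// Illustrative signature only: the accumulation type becomes a second template
// parameter, so the uniform case (T == ACC_T) and the mixed case
// (T = BFloat16, ACC_T = float) can share one kernel body.
template <typename T, typename ACC_T>
void LayerNormBackwardKernelImplInternal(
    const at::Tensor& dY, const at::Tensor& X,
    const at::Tensor& mean, const at::Tensor& rstd, const at::Tensor& gamma,
    int64_t M, int64_t N,
    at::Tensor* dX, at::Tensor* dgamma, at::Tensor* dbeta);
```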
fVec x_fvec0, x_fvec1, dy_fvec0, dy_fvec1, gamma_fvec0, gamma_fvec1;
std::tie(x_fvec0, x_fvec1) = convert_bfloat16_float(x_bvec);
std::tie(dy_fvec0, dy_fvec1) = convert_bfloat16_float(dy_bvec);
std::tie(gamma_fvec0, gamma_fvec1) = load2f(gamma_data, N);
@malfet thanks for your comments. Actually, this backward implementation is consistent with the forward one, and there are many places that need to be modified to support mixed data types, not just slightly different argument types. For example, here we cannot simply call vec::map3_reduce_all to do this part.
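To make the point concrete, here is a hedged sketch (assumed names, not the PR's exact code) of the per-row reductions when X and dY are BFloat16: each bVec load has to be widened into two fVec halves before accumulating, which is why a single fused vec::map3_reduce_all over the input type does not fit.

```cpp
#include <ATen/cpu/vec/vec.h>
#include <ATen/cpu/vec/functional.h>
#include <tuple>

// Sketch: accumulate ds = sum(dY * X) and db = sum(dY) for one row when the
// activations are BFloat16 but the accumulation is done in float.
static void row_sums_bf16(const at::BFloat16* X_ptr, const at::BFloat16* dY_ptr,
                          int64_t N, float& ds, float& db) {
  using bVec = at::vec::Vectorized<at::BFloat16>;
  using fVec = at::vec::Vectorized<float>;
  fVec ds_fvec(0), db_fvec(0);
  int64_t j = 0;
  for (; j + bVec::size() <= N; j += bVec::size()) {
    bVec x_bvec = bVec::loadu(X_ptr + j);
    bVec dy_bvec = bVec::loadu(dY_ptr + j);
    // Each BFloat16 vector is widened into two float vectors.
    fVec x_fvec0, x_fvec1, dy_fvec0, dy_fvec1;
    std::tie(x_fvec0, x_fvec1) = at::vec::convert_bfloat16_float(x_bvec);
    std::tie(dy_fvec0, dy_fvec1) = at::vec::convert_bfloat16_float(dy_bvec);
    ds_fvec = ds_fvec + dy_fvec0 * x_fvec0 + dy_fvec1 * x_fvec1;
    db_fvec = db_fvec + dy_fvec0 + dy_fvec1;
  }
  // Horizontal reduction of the float accumulators.
  ds = at::vec::vec_reduce_all<float>(
      [](fVec& a, fVec& b) { return a + b; }, ds_fvec, fVec::size());
  db = at::vec::vec_reduce_all<float>(
      [](fVec& a, fVec& b) { return a + b; }, db_fvec, fVec::size());
  for (; j < N; ++j) {  // scalar tail for the leftover columns
    ds += float(dY_ptr[j]) * float(X_ptr[j]);
    db += float(dY_ptr[j]);
  }
}
```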
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased from 118b40d to 2fc868b.
@malfet, I have refactored the LayerNorm backward kernel, could you please review this PR again?
Force-pushed from 3770921 to 1a4595e.
Looks good, though the template specialization is essentially copy-and-paste from the generic template. Would it be possible to reuse more code by introducing more templates? I.e., why can't bf16 utilize the same vec::map primitives?
inline std::tuple<Vectorized<float>, Vectorized<float>> load2f(const BFloat16* ptr, int64_t count) {
  return convert_bfloat16_float(Vectorized<BFloat16>::loadu(ptr, count));
}

inline std::tuple<Vectorized<float>, Vectorized<float>> load2f(const float* ptr, int64_t count) {
  using Vec = Vectorized<float>;
  if (count > Vec::size()) {
    return std::make_tuple(Vec::loadu(ptr), Vec::loadu(ptr + Vec::size(), count - Vec::size()));
  } else {
    return std::make_tuple(Vec::loadu(ptr, count), Vec(0));
  }
}
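For context, a hedged usage sketch of these count-taking overloads in the remainder handling of a row; gamma_data comes from the surrounding kernel and the index math is only illustrative. Because both overloads return a pair of float vectors, the same tail code serves float and BFloat16 gamma.

```cpp
// Illustrative tail handling only, assuming the load2f overloads above and a
// gamma_data pointer from the surrounding kernel. One step of the main loop
// covers 2 * fVec::size() columns (one bVec), so the tail is what remains.
using fVec = at::vec::Vectorized<float>;
const int64_t step = 2 * fVec::size();
const int64_t d = N - N % step;   // columns handled by the full-width loop
if (d < N) {
  const int64_t count = N - d;    // 1 .. step - 1 leftover columns
  fVec gamma_fvec0, gamma_fvec1;
  std::tie(gamma_fvec0, gamma_fvec1) = load2f(gamma_data + d, count);
  // ... use gamma_fvec0 / gamma_fvec1 for the remaining columns ...
}
```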
Hmm, what's the difference between this implementation and the previous one? Should it just be the same template with a default argument, something like load2f(const BFloat16* ptr, int64_t count = 1)? Also, why int64 rather than uint64?
The previous one is the helper for the mixed data type parameter loaded with Vec::loadu(ptr); this implementation is the helper for Vec::loadu(ptr, count). And it uses int64 because loadu(const void* ptr, int64_t count) uses int64.
Actually, bf16 can utilize the same vec::map primitives.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again.
@pytorchbot rebase
@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased from 1a4595e to 27f3948.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Motivation
Amp provides convenience methods for mixed precision. If users run bfloat16 models with amp, torch.autocast keeps module parameters in the accumulation dtype, which leaves gamma and beta in float while input and output are in bfloat16. The same goes for backward: the parameters are in float, and X, dX, and dY are in bfloat16.
Mixed data type support for LayerNorm backward on CPU is therefore needed to train models that use LayerNorm under amp.
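As a rough illustration (a minimal sketch using the C++ frontend; the shapes and values are arbitrary), this is the dtype combination autocast produces and that this PR makes differentiable on CPU: BFloat16 activations and gradients with float LayerNorm parameters.

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  // BFloat16 input, float32 gamma/beta -- the mix produced by CPU autocast.
  auto x = torch::randn({8, 128}, torch::kBFloat16).requires_grad_(true);
  auto gamma = torch::ones({128}, torch::requires_grad());   // float32
  auto beta = torch::zeros({128}, torch::requires_grad());   // float32

  auto y = torch::layer_norm(x, {128}, gamma, beta, /*eps=*/1e-5);
  y.sum().backward();  // exercises the mixed data type backward kernel on CPU

  // dX stays in BFloat16 while dgamma/dbeta are accumulated in float.
  std::cout << x.grad().dtype() << " " << gamma.grad().dtype() << std::endl;
  return 0;
}
```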
Testing
Single socket (icx, 32 cores):
Single core (icx):
cc @VitalyFedyunin @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10