Float16 rowwise sparse adagrad with stochastic rounding #370
Conversation
Force-pushed from c4f712c to 28d46bc
Thanks so much @uyongw for this valuable contribution. I added comments except to RowWiseSparseAdagradFused.cc. I'll go over RowWiseSparseAdagradFused.cc soon.
Reuse the rowwise_sparse_adagrad_fused_ref(...) API with float16 data type as a template parameter. Implemented the float16 weights update with stochastic rounding. Random integer generation is based on xoshiro128++. A 64-byte per-thread global random buffer is maintained and updated each time a new random number is generated. Signed-off-by: Yong Wu <yong.wu@intel.com>
Support the GenerateRowWiseSparseAdaGradFused(...) JIT API with float16 data type as a template parameter while keeping backward compatibility. Implement the float16 weights update with stochastic rounding (SR). Random integer generation is based on xoshiro128++. A 64-byte per-thread global random buffer is maintained and updated each time a new random number is generated. Support random number generation and the float16 weights update with SR for both AVX2 and AVX512. Signed-off-by: Yong Wu <yong.wu@intel.com>
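For reference, below is a minimal scalar sketch of the xoshiro128++ generator mentioned in these commits (the published Blackman/Vigna algorithm). The state here is a plain 4x32-bit struct rather than FBGEMM's per-thread 64-byte buffer; seeding is left to the caller:

```cpp
#include <cstdint>

// xoshiro128++: small-state PRNG with a 128-bit state and 32-bit output.
static inline uint32_t rotl32(uint32_t x, int k) {
  return (x << k) | (x >> (32 - k));
}

struct Xoshiro128pp {
  uint32_t s[4]; // must be seeded to a non-zero state

  uint32_t next() {
    const uint32_t result = rotl32(s[0] + s[3], 7) + s[0];
    const uint32_t t = s[1] << 9;
    // state transition
    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];
    s[2] ^= t;
    s[3] = rotl32(s[3], 11);
    return result;
  }
};
```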
…g stochastic rounding Signed-off-by: Yong Wu <yong.wu@intel.com>
… stochastic rounding Signed-off-by: Yong Wu <yong.wu@intel.com>
facebook-github-bot left a comment
@jspark1105 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
An observation is that the FP16/SR AVX512 JIT code will be faster than AVX2 on SKX. We could probably use AVX512 as the default for the FP16/SR kernel.
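As a rough illustration of that suggestion, here is a dispatch sketch assuming the fbgemmHasAvx512Support()/fbgemmHasAvx2Support() helpers declared in fbgemm/Utils.h; the enum and function names below are invented for this example and are not FBGEMM's actual dispatch code:

```cpp
#include "fbgemm/Utils.h" // assumed: fbgemm::fbgemmHasAvx512Support(), fbgemm::fbgemmHasAvx2Support()

// Hypothetical ISA choice for the FP16/SR kernel: prefer AVX512 when the CPU
// supports it (as suggested above), then AVX2, then the scalar reference path.
enum class Fp16SrIsa { REF, AVX2, AVX512 };

inline Fp16SrIsa pickFp16SrIsa() {
  if (fbgemm::fbgemmHasAvx512Support()) {
    return Fp16SrIsa::AVX512; // observed faster than AVX2 for FP16/SR on SKX
  }
  if (fbgemm::fbgemmHasAvx2Support()) {
    return Fp16SrIsa::AVX2;
  }
  return Fp16SrIsa::REF;      // scalar reference fallback
}
```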
facebook-github-bot left a comment
@jspark1105 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
REF path:
- fix local variable shadow
- fix memory leak
- fix FP32->FP16 rounding
- unify shift logic

JIT path:
- fix comments
- fix FP32->FP16 rounding
- GPR allocation: RAX for rand_buffer and R12 for length_R

Bench:
- fix bytes_padded calculation

Signed-off-by: Yong Wu <yong.wu@intel.com>
Use portable immintrin.h instead. Signed-off-by: Yong Wu <yong.wu@intel.com>
Force-pushed from bce5308 to 02cfbee
facebook-github-bot left a comment
@jspark1105 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@jspark1105 Thank you for taking the time to provide all the valuable comments.
…exceptions in JIT. Signed-off-by: Yong Wu <yong.wu@intel.com>
@jspark1105 merged this pull request in b9f1a80.
target_compile_options(fbgemm_avx2 PRIVATE "/arch:AVX2")
target_compile_options(fbgemm_avx512 PRIVATE "/arch:AVX512")
else(MSVC)
target_compile_options(fbgemm_generic PRIVATE "-mf16c")
Wouldn't that make fbgemm code incompatible with SandyBridge CPUs?
@malfet That would be true. We perhaps need a pure C implementation of _cvtss_sh/_cvtsh_ss in the REF implementation (there is one already, but it is missing the misc. rounding handling). I can create a PR next week for this purpose if needed. @jspark1105
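For context, here is a minimal sketch of what such a pure C fallback could look like: an fp32-to-fp16 conversion with round-to-nearest-even (the default behavior of _cvtss_sh), handling subnormals, overflow, and NaN without any F16C instructions. This is only an illustration, not the existing REF routine:

```cpp
#include <cstdint>
#include <cstring>

// Hardware-independent fp32 -> fp16 bit-pattern conversion, round-to-nearest-even.
inline uint16_t float_to_half_rn(float f) {
  uint32_t x;
  std::memcpy(&x, &f, sizeof(x));

  const uint32_t sign = (x >> 16) & 0x8000u;
  const uint32_t fexp = (x >> 23) & 0xffu;
  uint32_t mant = x & 0x007fffffu;
  const int32_t hexp = static_cast<int32_t>(fexp) - 127 + 15; // re-bias exponent

  if (fexp == 0xffu) {                        // Inf or NaN (force quiet NaN)
    return static_cast<uint16_t>(sign | 0x7c00u | (mant ? 0x0200u : 0u));
  }
  if (hexp >= 0x1f) {                         // too large for fp16 -> +/-Inf
    return static_cast<uint16_t>(sign | 0x7c00u);
  }
  if (hexp <= 0) {                            // fp16 subnormal or zero
    if (hexp < -10) {
      return static_cast<uint16_t>(sign);     // underflows to signed zero
    }
    mant |= 0x00800000u;                      // restore the implicit leading 1
    const int shift = 14 - hexp;              // 14..24 dropped bits
    uint32_t hmant = mant >> shift;
    const uint32_t rem = mant & ((1u << shift) - 1u);
    const uint32_t half = 1u << (shift - 1);
    if (rem > half || (rem == half && (hmant & 1u))) {
      ++hmant;                                // round to nearest, ties to even
    }
    return static_cast<uint16_t>(sign | hmant);
  }
  // Normal case: keep the top 10 mantissa bits, round on the 13 dropped bits.
  uint32_t h = sign | (static_cast<uint32_t>(hexp) << 10) | (mant >> 13);
  const uint32_t rem = mant & 0x1fffu;
  if (rem > 0x1000u || (rem == 0x1000u && (h & 1u))) {
    ++h;                                      // carry may propagate into the exponent
  }
  return static_cast<uint16_t>(h);
}
```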
This PR has been reverted due to an illegal instruction error in PyTorch tests. Let me check if it's related and if so commit with a C implementation.
Sorry for introducing this issue. And thanks for the fast turnaround! @jspark1105
… D21519194) (pytorch#370)
Summary:
Add FP16 weights support to fused rowwise sparse adagrad.
- The FP16 weights update uses stochastic rounding by default
  * use_stochastic_rounding=true | false
- Extend the existing APIs on both the REF and JIT paths and keep API backward compatibility;
  * REF: rowwise_sparse_adagrad_fused_ref
  * JIT: GenerateRowWiseSparseAdaGradFused
- Both APIs are thread-safe; a 4x64-byte per-thread random buffer is maintained in both the JIT and REF implementations;
- Random number generation is based on the small-state generator xoshiro128++;
- Both AVX2 and AVX512 are implemented in the JIT kernel

Pull Request resolved: pytorch#370
Differential Revision: D21610037
Pulled By: jspark1105
fbshipit-source-id: b1356300116ae42cd2e060cca2ba58ded300d95a
… D21519194) (pytorch#375)
Summary:
Pull Request resolved: pytorch#375
Add FP16 weights support to fused rowwise sparse adagrad.
- The FP16 weights update uses stochastic rounding by default
  * use_stochastic_rounding=true | false
- Extend the existing APIs on both the REF and JIT paths and keep API backward compatibility;
  * REF: rowwise_sparse_adagrad_fused_ref
  * JIT: GenerateRowWiseSparseAdaGradFused
- Both APIs are thread-safe; a 4x64-byte per-thread random buffer is maintained in both the JIT and REF implementations;
- Random number generation is based on the small-state generator xoshiro128++;
- Both AVX2 and AVX512 are implemented in the JIT kernel

Pull Request resolved: pytorch#370
Test Plan: Imported from GitHub, without a `Test Plan:` line. pytorch/pytorch#38602
Differential Revision: D21610037
Pulled By: jspark1105
fbshipit-source-id: 9866b65dfefc7653753a703ee2f9e209d831f6fc
… D21519194) (pytorch#375)
Summary:
Pull Request resolved: pytorch#375
Add FP16 weights support to fused rowwise sparse adagrad.
- The FP16 weights update uses stochastic rounding by default
  * use_stochastic_rounding=true | false
- Extend the existing APIs on both the REF and JIT paths and keep API backward compatibility;
  * REF: rowwise_sparse_adagrad_fused_ref
  * JIT: GenerateRowWiseSparseAdaGradFused
- Both APIs are thread-safe; a 4x64-byte per-thread random buffer is maintained in both the JIT and REF implementations;
- Random number generation is based on the small-state generator xoshiro128++;
- Both AVX2 and AVX512 are implemented in the JIT kernel

Pull Request resolved: pytorch#370
Test Plan: Imported from GitHub, without a `Test Plan:` line. pytorch/pytorch#38602
Differential Revision: D21610037
Pulled By: jspark1105
fbshipit-source-id: 21a3eb91ac2f12727a2e383ec6a45b7d3273d173
… D21519194) (#375)
Summary:
Pull Request resolved: #375
Add FP16 weights support to fused rowwise sparse adagrad.
- The FP16 weights update uses stochastic rounding by default
  * use_stochastic_rounding=true | false
- Extend the existing APIs on both the REF and JIT paths and keep API backward compatibility;
  * REF: rowwise_sparse_adagrad_fused_ref
  * JIT: GenerateRowWiseSparseAdaGradFused
- Both APIs are thread-safe; a 4x64-byte per-thread random buffer is maintained in both the JIT and REF implementations;
- Random number generation is based on the small-state generator xoshiro128++;
- Both AVX2 and AVX512 are implemented in the JIT kernel

Pull Request resolved: #370
Test Plan: Imported from GitHub, without a `Test Plan:` line. pytorch/pytorch#38602
Reviewed By: jianyuh
Differential Revision: D21610037
Pulled By: jspark1105
fbshipit-source-id: 13df1013d651d2be590e82ddc0387b68e65f9023
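To make the stochastic-rounding idea concrete, here is a small hedged sketch of one common way to round an fp32 update to fp16 stochastically: add random bits into the 13 mantissa bits that fp16 drops, then truncate. It uses the F16C intrinsic _cvtss_sh (hence the -mf16c discussion above) and a random 32-bit word such as one produced by xoshiro128++. It is an illustration of the technique, not FBGEMM's exact kernel code:

```cpp
#include <cstdint>
#include <cstring>
#include <immintrin.h> // _cvtss_sh (requires F16C, i.e. -mf16c)

// Stochastically round an fp32 value to fp16: fp32 carries 23 mantissa bits,
// fp16 keeps 10, so the 13 dropped bits determine rounding. Adding 13 random
// bits to the raw bit pattern and then truncating rounds up in magnitude with
// probability equal to the dropped fraction. Illustrative sketch only; edge
// cases near Inf/NaN would need guarding in real code.
inline uint16_t float_to_half_stochastic(float value, uint32_t rnd) {
  uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  bits += (rnd >> 19); // 13 random bits added into the soon-to-be-dropped field
  float bumped;
  std::memcpy(&bumped, &bits, sizeof(bumped));
  // Truncate (round toward zero) to fp16 after the random bump.
  return _cvtss_sh(bumped, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
}
```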
Summary:
Add FP16 weights support to fused rowwise sparse adagrad.