Float16 rowwise sparse adagrad with stochastic rounding #370
Conversation
Force-pushed from c4f712c to 28d46bc
Thanks so much @uyongw for this valuable contribution. I added comments except to RowWiseSparseAdagradFused.cc. I'll go over RowWiseSparseAdagradFused.cc soon.
Reuse the rowwise_sparse_adagrad_fused_ref(...) API with float16 data type as a template parameter. Implemented the float16 weights update with stochastic rounding. Random integer generation is based on xoshiro128++. A 64-byte per-thread global random buffer is maintained and updated each time a new random number is generated. Signed-off-by: Yong Wu <yong.wu@intel.com>
Support the GenerateRowWiseSparseAdaGradFused(...) JIT API with float16 data type as a template parameter while keeping backward compatibility. Implement the float16 weights update with stochastic rounding (SR). Random integer generation is based on xoshiro128++. A 64-byte per-thread global random buffer is maintained and updated each time a new random number is generated. Support random number generation and the float16 weights update with SR for both AVX2 and AVX512. Signed-off-by: Yong Wu <yong.wu@intel.com>
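For reference, below is a minimal scalar sketch of the xoshiro128++ generator mentioned in these commits (the published Blackman/Vigna algorithm). The state here is a plain 4x32-bit struct rather than FBGEMM's per-thread 64-byte buffer; seeding is left to the caller:

```cpp
#include <cstdint>

// xoshiro128++: small-state PRNG with a 128-bit state and 32-bit output.
static inline uint32_t rotl32(uint32_t x, int k) {
  return (x << k) | (x >> (32 - k));
}

struct Xoshiro128pp {
  uint32_t s[4]; // must be seeded to a non-zero state

  uint32_t next() {
    const uint32_t result = rotl32(s[0] + s[3], 7) + s[0];
    const uint32_t t = s[1] << 9;
    // state transition
    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];
    s[2] ^= t;
    s[3] = rotl32(s[3], 11);
    return result;
  }
};
```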
…g stochastic rounding Signed-off-by: Yong Wu <yong.wu@intel.com>
… stochastic rounding Signed-off-by: Yong Wu <yong.wu@intel.com>
facebook-github-bot left a comment
@jspark1105 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
An observation is that the FP16/SR AVX512 JIT code will be faster than AVX2 on SKX. We could probably use AVX512 as the default for the FP16/SR kernel.
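As a rough illustration of that suggestion, here is a dispatch sketch assuming the fbgemmHasAvx512Support()/fbgemmHasAvx2Support() helpers declared in fbgemm/Utils.h; the enum and function names below are invented for this example and are not FBGEMM's actual dispatch code:

```cpp
#include "fbgemm/Utils.h" // assumed: fbgemm::fbgemmHasAvx512Support(), fbgemm::fbgemmHasAvx2Support()

// Hypothetical ISA choice for the FP16/SR kernel: prefer AVX512 when the CPU
// supports it (as suggested above), then AVX2, then the scalar reference path.
enum class Fp16SrIsa { REF, AVX2, AVX512 };

inline Fp16SrIsa pickFp16SrIsa() {
  if (fbgemm::fbgemmHasAvx512Support()) {
    return Fp16SrIsa::AVX512; // observed faster than AVX2 for FP16/SR on SKX
  }
  if (fbgemm::fbgemmHasAvx2Support()) {
    return Fp16SrIsa::AVX2;
  }
  return Fp16SrIsa::REF;      // scalar reference fallback
}
```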
facebook-github-bot left a comment
@jspark1105 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
REF path:
- fix local variable shadow
- fix memory leak
- fix FP32->FP16 rounding
- unify shift logic

JIT path:
- fix comments
- fix FP32->FP16 rounding
- GPR allocation: RAX for rand_buffer and R12 for length_R

Bench:
- fix bytes_padded calculation

Signed-off-by: Yong Wu <yong.wu@intel.com>
Use portable immintrin.h instead. Signed-off-by: Yong Wu <yong.wu@intel.com>
Force-pushed from bce5308 to 02cfbee
facebook-github-bot left a comment
@jspark1105 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@jspark1105 Thank you for taking the time to provide all the valuable comments.
…exceptions in JIT. Signed-off-by: Yong Wu <yong.wu@intel.com>
@jspark1105 merged this pull request in b9f1a80.
target_compile_options(fbgemm_avx2 PRIVATE "/arch:AVX2")
target_compile_options(fbgemm_avx512 PRIVATE "/arch:AVX512")
else(MSVC)
target_compile_options(fbgemm_generic PRIVATE "-mf16c")
Wouldn't that make fbgemm code incompatible with SandyBridge CPUs?
@malfet That would be true. We perhaps need a pure C implementation of _cvtss_sh/_cvtsh_ss in the REF implementation (there is one already, but it is missing the misc. rounding handling). I can create a PR next week for this purpose if needed. @jspark1105
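For context, here is a minimal sketch of what such a pure C fallback could look like: an fp32-to-fp16 conversion with round-to-nearest-even (the default behavior of _cvtss_sh), handling subnormals, overflow, and NaN without any F16C instructions. This is only an illustration, not the existing REF routine:

```cpp
#include <cstdint>
#include <cstring>

// Hardware-independent fp32 -> fp16 bit-pattern conversion, round-to-nearest-even.
inline uint16_t float_to_half_rn(float f) {
  uint32_t x;
  std::memcpy(&x, &f, sizeof(x));

  const uint32_t sign = (x >> 16) & 0x8000u;
  const uint32_t fexp = (x >> 23) & 0xffu;
  uint32_t mant = x & 0x007fffffu;
  const int32_t hexp = static_cast<int32_t>(fexp) - 127 + 15; // re-bias exponent

  if (fexp == 0xffu) {                        // Inf or NaN (force quiet NaN)
    return static_cast<uint16_t>(sign | 0x7c00u | (mant ? 0x0200u : 0u));
  }
  if (hexp >= 0x1f) {                         // too large for fp16 -> +/-Inf
    return static_cast<uint16_t>(sign | 0x7c00u);
  }
  if (hexp <= 0) {                            // fp16 subnormal or zero
    if (hexp < -10) {
      return static_cast<uint16_t>(sign);     // underflows to signed zero
    }
    mant |= 0x00800000u;                      // restore the implicit leading 1
    const int shift = 14 - hexp;              // 14..24 dropped bits
    uint32_t hmant = mant >> shift;
    const uint32_t rem = mant & ((1u << shift) - 1u);
    const uint32_t half = 1u << (shift - 1);
    if (rem > half || (rem == half && (hmant & 1u))) {
      ++hmant;                                // round to nearest, ties to even
    }
    return static_cast<uint16_t>(sign | hmant);
  }
  // Normal case: keep the top 10 mantissa bits, round on the 13 dropped bits.
  uint32_t h = sign | (static_cast<uint32_t>(hexp) << 10) | (mant >> 13);
  const uint32_t rem = mant & 0x1fffu;
  if (rem > 0x1000u || (rem == 0x1000u && (h & 1u))) {
    ++h;                                      // carry may propagate into the exponent
  }
  return static_cast<uint16_t>(h);
}
```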
This PR has been reverted due to an illegal instruction error in PyTorch tests. Let me check if it's related and if so commit with a C implementation.
Sorry for introducing this issue. And thanks for the fast turnaround! @jspark1105
… D21519194) (pytorch#370)
Summary:
Add FP16 weights support to fused rowwise sparse adagrad.
- The FP16 weights update uses stochastic rounding by default
  * use_stochastic_rounding=true | false
- Extend the existing APIs on both the REF and JIT paths and keep API backward compatibility;
  * REF: rowwise_sparse_adagrad_fused_ref
  * JIT: GenerateRowWiseSparseAdaGradFused
- Both APIs are thread-safe; a 4x64-byte per-thread random buffer is maintained in both the JIT and REF implementations;
- Random number generation is based on the small-state generator xoshiro128++;
- Both AVX2 and AVX512 are implemented in the JIT kernel

Pull Request resolved: pytorch#370
Differential Revision: D21610037
Pulled By: jspark1105
fbshipit-source-id: b1356300116ae42cd2e060cca2ba58ded300d95a
… D21519194) (pytorch#375)
Summary:
Pull Request resolved: pytorch#375
Add FP16 weights support to fused rowwise sparse adagrad.
- The FP16 weights update uses stochastic rounding by default
  * use_stochastic_rounding=true | false
- Extend the existing APIs on both the REF and JIT paths and keep API backward compatibility;
  * REF: rowwise_sparse_adagrad_fused_ref
  * JIT: GenerateRowWiseSparseAdaGradFused
- Both APIs are thread-safe; a 4x64-byte per-thread random buffer is maintained in both the JIT and REF implementations;
- Random number generation is based on the small-state generator xoshiro128++;
- Both AVX2 and AVX512 are implemented in the JIT kernel

Pull Request resolved: pytorch#370
Test Plan: Imported from GitHub, without a `Test Plan:` line. pytorch/pytorch#38602
Differential Revision: D21610037
Pulled By: jspark1105
fbshipit-source-id: 9866b65dfefc7653753a703ee2f9e209d831f6fc
… D21519194) (pytorch#375)
Summary:
Pull Request resolved: pytorch#375
Add FP16 weights support to fused rowwise sparse adagrad.
- The FP16 weights update uses stochastic rounding by default
  * use_stochastic_rounding=true | false
- Extend the existing APIs on both the REF and JIT paths and keep API backward compatibility;
  * REF: rowwise_sparse_adagrad_fused_ref
  * JIT: GenerateRowWiseSparseAdaGradFused
- Both APIs are thread-safe; a 4x64-byte per-thread random buffer is maintained in both the JIT and REF implementations;
- Random number generation is based on the small-state generator xoshiro128++;
- Both AVX2 and AVX512 are implemented in the JIT kernel

Pull Request resolved: pytorch#370
Test Plan: Imported from GitHub, without a `Test Plan:` line. pytorch/pytorch#38602
Differential Revision: D21610037
Pulled By: jspark1105
fbshipit-source-id: 21a3eb91ac2f12727a2e383ec6a45b7d3273d173
… D21519194) (#375)
Summary:
Pull Request resolved: #375
Add FP16 weights support to fused rowwise sparse adagrad.
- The FP16 weights update uses stochastic rounding by default
  * use_stochastic_rounding=true | false
- Extend the existing APIs on both the REF and JIT paths and keep API backward compatibility;
  * REF: rowwise_sparse_adagrad_fused_ref
  * JIT: GenerateRowWiseSparseAdaGradFused
- Both APIs are thread-safe; a 4x64-byte per-thread random buffer is maintained in both the JIT and REF implementations;
- Random number generation is based on the small-state generator xoshiro128++;
- Both AVX2 and AVX512 are implemented in the JIT kernel

Pull Request resolved: #370
Test Plan: Imported from GitHub, without a `Test Plan:` line. pytorch/pytorch#38602
Reviewed By: jianyuh
Differential Revision: D21610037
Pulled By: jspark1105
fbshipit-source-id: 13df1013d651d2be590e82ddc0387b68e65f9023
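To make the stochastic-rounding idea concrete, here is a small hedged sketch of one common way to round an fp32 update to fp16 stochastically: add random bits into the 13 mantissa bits that fp16 drops, then truncate. It uses the F16C intrinsic _cvtss_sh (hence the -mf16c discussion above) and a random 32-bit word such as one produced by xoshiro128++. It is an illustration of the technique, not FBGEMM's exact kernel code:

```cpp
#include <cstdint>
#include <cstring>
#include <immintrin.h> // _cvtss_sh (requires F16C, i.e. -mf16c)

// Stochastically round an fp32 value to fp16: fp32 carries 23 mantissa bits,
// fp16 keeps 10, so the 13 dropped bits determine rounding. Adding 13 random
// bits to the raw bit pattern and then truncating rounds up in magnitude with
// probability equal to the dropped fraction. Illustrative sketch only; edge
// cases near Inf/NaN would need guarding in real code.
inline uint16_t float_to_half_stochastic(float value, uint32_t rnd) {
  uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  bits += (rnd >> 19); // 13 random bits added into the soon-to-be-dropped field
  float bumped;
  std::memcpy(&bumped, &bits, sizeof(bumped));
  // Truncate (round toward zero) to fp16 after the random bump.
  return _cvtss_sh(bumped, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
}
```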
Summary:
Add FP16 weights support to fused rowwise sparse adagrad.