
Conversation

@uyongw (Contributor) commented May 11, 2020

Summary:

Add FP16 weights support to fused rowwise sparse adagrad.

  • FP16 weights updates use stochastic rounding by default (a scalar sketch of the rounding follows this list)
    • use_stochastic_rounding=true | false
  • Extend the existing APIs on both the REF and JIT paths while keeping API backward compatibility:
    • REF: rowwise_sparse_adagrad_fused_ref
    • JIT: GenerateRowWiseSparseAdaGradFused
  • Both APIs are thread-safe; a 4x64-byte per-thread random buffer is maintained in both the JIT and REF implementations.
  • Random number generation is based on the small-state generator xoshiro128++.
  • Both AVX2 and AVX512 versions are implemented in the JIT kernel.
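
For intuition, below is a minimal scalar sketch of FP32→FP16 stochastic rounding, under the assumption that the kernel follows the usual add-noise-then-truncate scheme: random bits are added to the 13 mantissa bits that the conversion discards, so the value rounds up with probability equal to the discarded fraction. The function name and the handling of special values are illustrative only; the real update lives in the REF/JIT implementations listed above.

```cpp
#include <cstdint>
#include <cstring>

// Sketch only: stochastically round one fp32 value to fp16 bits.
// 'rand13' stands for 13 uniformly random bits, e.g. drawn from the
// per-thread xoshiro128++ buffer mentioned above.
static inline uint16_t fp32_to_fp16_stochastic(float f, uint32_t rand13) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  // fp32 carries 23 mantissa bits, fp16 keeps 10, so 13 bits are dropped.
  // Adding random noise to those 13 bits and then truncating rounds up
  // with probability equal to the discarded fraction.
  bits += rand13 & 0x1fffu;
  uint32_t sign = (bits >> 16) & 0x8000u;
  int exp16 = static_cast<int>((bits >> 23) & 0xffu) - 127 + 15;
  uint32_t mant = (bits >> 13) & 0x3ffu; // truncate to 10 mantissa bits
  if (exp16 <= 0 || exp16 >= 0x1f) {
    // Zeros, denormals, overflow, Inf and NaN need dedicated handling
    // that is omitted to keep the sketch short.
    return static_cast<uint16_t>(sign);
  }
  return static_cast<uint16_t>(sign | (static_cast<uint32_t>(exp16) << 10) | mant);
}
```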

@uyongw force-pushed the uyongw_fp16_rowwise_sparse_adagrad branch from c4f712c to 28d46bc on May 11, 2020 14:54
@jspark1105 (Contributor)

Thanks so much @uyongw for this valuable contribution. I added comments on everything except RowWiseSparseAdagradFused.cc; I'll go over RowWiseSparseAdagradFused.cc soon.

uyongw added 4 commits May 12, 2020 07:42
Reuse the rowwise_sparse_adagrad_fused_ref(...) API with the
float16 data type as a template parameter. Implement the float16
weights update with stochastic rounding.

Random integer generation is based on xoshiro128++. A
64-byte per-thread global random buffer is maintained and
updated each time a new random number is generated.

Signed-off-by: Yong Wu <yong.wu@intel.com>
Support the GenerateRowWiseSparseAdaGradFused(...) JIT API with
the float16 data type as a template parameter and keep backward
compatibility. Implement the float16 weights update with stochastic
rounding (SR).

Random integer generation is based on xoshiro128++. A
64-byte per-thread global random buffer is maintained and
updated each time a new random number is generated.

Support both AVX2 and AVX512 random number generation and the float16
weights update with SR.

Signed-off-by: Yong Wu <yong.wu@intel.com>
…g stochastic rounding

Signed-off-by: Yong Wu <yong.wu@intel.com>
… stochastic rounding

Signed-off-by: Yong Wu <yong.wu@intel.com>
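
The commit messages above lean on xoshiro128++ for the per-thread random state. For reference, here is a minimal sketch following the public-domain reference implementation by Blackman and Vigna; the exact state layout, seeding, and buffering inside the REF/JIT kernels may differ.

```cpp
#include <cstdint>

// Minimal xoshiro128++ sketch (Blackman & Vigna reference algorithm).
// Each thread keeps its own 4 x 32-bit state, matching the idea of a
// small per-thread random buffer described in the commits above.
struct Xoshiro128pp {
  uint32_t s[4]; // must be seeded to a not-all-zero state

  static uint32_t rotl(uint32_t x, int k) {
    return (x << k) | (x >> (32 - k));
  }

  uint32_t next() {
    const uint32_t result = rotl(s[0] + s[3], 7) + s[0];
    const uint32_t t = s[1] << 9;
    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];
    s[2] ^= t;
    s[3] = rotl(s[3], 11);
    return result;
  }
};
```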
@facebook-github-bot (Contributor) left a comment

@jspark1105 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@uyongw (Contributor, Author) commented May 12, 2020

One observation: the FP16/SR AVX512 JIT code is faster than the AVX2 version on SKX. We could probably use AVX512 as the default for the FP16/SR kernel.
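
If AVX512 were made the default for the FP16/SR kernel whenever the CPU supports it, the selection could look roughly like the sketch below. It assumes the existing fbgemm::fbgemmHasAvx512Support()/fbgemmHasAvx2Support() runtime checks; the enum and function are hypothetical and not FBGEMM's actual dispatch code.

```cpp
#include "fbgemm/Utils.h" // fbgemmHasAvx512Support / fbgemmHasAvx2Support

// Hypothetical instruction-set selection for the FP16/SR kernel: prefer
// AVX512 when available, fall back to AVX2, otherwise take the REF path.
enum class SparseAdagradIsa { Ref, Avx2, Avx512 };

inline SparseAdagradIsa chooseFp16SrIsa() {
  if (fbgemm::fbgemmHasAvx512Support()) {
    return SparseAdagradIsa::Avx512; // observed faster than AVX2 on SKX
  }
  if (fbgemm::fbgemmHasAvx2Support()) {
    return SparseAdagradIsa::Avx2;
  }
  return SparseAdagradIsa::Ref;
}
```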

@facebook-github-bot (Contributor) left a comment

@jspark1105 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

uyongw added 2 commits May 13, 2020 08:56
REF path:
- fix local variable shadowing
- fix memory leak
- fix FP32->FP16 rounding
- unify shift logic

JIT path:
- fix comments
- fix FP32->FP16 rounding
- GPR allocation: RAX for rand_buffer and R12 for length_R

Bench:
- fix bytes_padded calculation

Signed-off-by: Yong Wu <yong.wu@intel.com>
Use portable immintrin.h instead.

Signed-off-by: Yong Wu <yong.wu@intel.com>
@uyongw force-pushed the uyongw_fp16_rowwise_sparse_adagrad branch from bce5308 to 02cfbee on May 13, 2020 02:20
@facebook-github-bot (Contributor) left a comment

@jspark1105 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@uyongw (Contributor, Author) commented May 13, 2020

@jspark1105 Thank you for taking the time to leave all the valuable comments.

…exceptions in JIT.

Signed-off-by: Yong Wu <yong.wu@intel.com>
@facebook-github-bot (Contributor)

@jspark1105 merged this pull request in b9f1a80.

target_compile_options(fbgemm_avx2 PRIVATE "/arch:AVX2")
target_compile_options(fbgemm_avx512 PRIVATE "/arch:AVX512")
else(MSVC)
target_compile_options(fbgemm_generic PRIVATE "-mf16c")
@malfet (Contributor):

Wouldn't that make the fbgemm code incompatible with Sandy Bridge CPUs?

@uyongw (Contributor, Author) replied:

@malfet That would be true. We would perhaps need a pure C implementation of _cvtss_sh/_cvtsh_ss in the REF implementation (there is one already, but it is missing the miscellaneous rounding handling). I can create a PR next week for this purpose if needed. @jspark1105
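
For illustration, a pure-C fallback for the fp16→fp32 direction (_cvtsh_ss) could look like the sketch below; the fp32→fp16 direction with its rounding-mode handling takes more care and is omitted. This is only a sketch of the idea, not the fallback that eventually landed.

```cpp
#include <cstdint>
#include <cstring>

// Sketch of a pure-C replacement for _cvtsh_ss (fp16 bits -> fp32),
// so the REF path would not require F16C hardware.
static inline float cvtsh_ss_scalar(uint16_t h) {
  uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
  uint32_t exp  = (h >> 10) & 0x1fu;
  uint32_t mant = h & 0x3ffu;
  uint32_t bits;
  if (exp == 0x1fu) {
    bits = sign | 0x7f800000u | (mant << 13);   // Inf / NaN (payload kept)
  } else if (exp == 0) {
    if (mant == 0) {
      bits = sign;                              // +/- 0
    } else {                                    // fp16 subnormal: normalize
      int e = -1;
      do { mant <<= 1; ++e; } while ((mant & 0x400u) == 0);
      bits = sign | (static_cast<uint32_t>(127 - 15 - e) << 23) |
             ((mant & 0x3ffu) << 13);
    }
  } else {                                      // normal number
    bits = sign | ((exp + (127 - 15)) << 23) | (mant << 13);
  }
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}
```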

@jspark1105 (Contributor) replied:

This PR has been reverted due to an illegal instruction error in PyTorch tests. Let me check whether it's related and, if so, re-land it with a C implementation.

@uyongw (Contributor, Author) replied:

Sorry for introducing this issue, and thanks for the fast turnaround! @jspark1105

jspark1105 pushed a commit to jspark1105/FBGEMM that referenced this pull request May 16, 2020
… D21519194) (pytorch#370)

Summary:
Add FP16 weights support to fused rowwise sparse adagrad.
- FP16 weights updates use stochastic rounding by default
  * use_stochastic_rounding=true | false
- Extend the existing APIs on both the REF and JIT paths while keeping API backward compatibility:
  * REF: rowwise_sparse_adagrad_fused_ref
  * JIT: GenerateRowWiseSparseAdaGradFused
- Both APIs are thread-safe; a 4x64-byte per-thread random buffer is maintained in both the JIT and REF implementations.
- Random number generation is based on the small-state generator xoshiro128++.
- Both AVX2 and AVX512 versions are implemented in the JIT kernel.
Pull Request resolved: pytorch#370

Differential Revision: D21610037

Pulled By: jspark1105

fbshipit-source-id: b1356300116ae42cd2e060cca2ba58ded300d95a
jspark1105 pushed a commit to jspark1105/FBGEMM that referenced this pull request May 16, 2020
… D21519194) (pytorch#375)

Summary:
Pull Request resolved: pytorch#375

Add FP16 weights support to fused rowwise sparse adagrad.
- FP16 weights updates use stochastic rounding by default
  * use_stochastic_rounding=true | false
- Extend the existing APIs on both the REF and JIT paths while keeping API backward compatibility:
  * REF: rowwise_sparse_adagrad_fused_ref
  * JIT: GenerateRowWiseSparseAdaGradFused
- Both APIs are thread-safe; a 4x64-byte per-thread random buffer is maintained in both the JIT and REF implementations.
- Random number generation is based on the small-state generator xoshiro128++.
- Both AVX2 and AVX512 versions are implemented in the JIT kernel.
Pull Request resolved: pytorch#370

Test Plan:
Imported from GitHub, without a `Test Plan:` line.
pytorch/pytorch#38602

Differential Revision: D21610037

Pulled By: jspark1105

fbshipit-source-id: 9866b65dfefc7653753a703ee2f9e209d831f6fc
jspark1105 pushed a commit to jspark1105/FBGEMM that referenced this pull request May 16, 2020
… D21519194) (pytorch#375)

Summary:
Pull Request resolved: pytorch#375

Add FP16 weights support to fused rowwise sparse adagrad.
- FP16 weights updates use stochastic rounding by default
  * use_stochastic_rounding=true | false
- Extend the existing APIs on both the REF and JIT paths while keeping API backward compatibility:
  * REF: rowwise_sparse_adagrad_fused_ref
  * JIT: GenerateRowWiseSparseAdaGradFused
- Both APIs are thread-safe; a 4x64-byte per-thread random buffer is maintained in both the JIT and REF implementations.
- Random number generation is based on the small-state generator xoshiro128++.
- Both AVX2 and AVX512 versions are implemented in the JIT kernel.
Pull Request resolved: pytorch#370

Test Plan:
Imported from GitHub, without a `Test Plan:` line.
pytorch/pytorch#38602

Differential Revision: D21610037

Pulled By: jspark1105

fbshipit-source-id: 21a3eb91ac2f12727a2e383ec6a45b7d3273d173
facebook-github-bot pushed a commit that referenced this pull request May 16, 2020
… D21519194) (#375)

Summary:
Pull Request resolved: #375

Add FP16 weights support to fused rowwise sparse adagrad.
- FP16 weights updates use stochastic rounding by default
  * use_stochastic_rounding=true | false
- Extend the existing APIs on both the REF and JIT paths while keeping API backward compatibility:
  * REF: rowwise_sparse_adagrad_fused_ref
  * JIT: GenerateRowWiseSparseAdaGradFused
- Both APIs are thread-safe; a 4x64-byte per-thread random buffer is maintained in both the JIT and REF implementations.
- Random number generation is based on the small-state generator xoshiro128++.
- Both AVX2 and AVX512 versions are implemented in the JIT kernel.
Pull Request resolved: #370

Test Plan:
Imported from GitHub, without a `Test Plan:` line.
pytorch/pytorch#38602

Reviewed By: jianyuh

Differential Revision: D21610037

Pulled By: jspark1105

fbshipit-source-id: 13df1013d651d2be590e82ddc0387b68e65f9023
q10 pushed a commit to q10/FBGEMM that referenced this pull request Apr 10, 2025
Summary:
X-link: pytorch#3270

Pull Request resolved: facebookresearch/FBGEMM#370

Make the fp8 bmm output contiguous, since [silu_mul](https://fburl.com/code/sa1faq0w) requires the fp8 bmm output tensor's stride(-1) to be 1. This diff fixes the issue.

Reviewed By: jspark1105

Differential Revision: D64811808

fbshipit-source-id: e0f213f24fbf8bf989576371af1e2ada4cafbfb1