
Enable PyTorch Bfloat16 for CPU and add MKL-DNN bfloat16 optimization for Cooper Lake #23509

Open
Jianhui-Li opened this issue Jul 29, 2019 · 6 comments

Jianhui-Li commented Jul 29, 2019

Enable PyTorch Bfloat16 for CPU and add MKL-DNN bfloat16 optimization for Cooper Lake

Motivation

Bfloat16 is a 16-bit floating point representation with the same exponent bit-width as the 32-bit floating point representation (FP32). It improves deep learning training performance by reducing both computation and memory bandwidth while keeping accuracy at the same level as FP32. It has been adopted by various deep learning hardware, and PyTorch recently added initial bfloat16 support.

Intel’s upcoming Cooper Lake 14nm Intel Xeon® processor family will add bfloat16 support, which provides a 2x speedup for SIMD FMA instructions and a 2x benefit on memory access. MKL-DNN v1.0 introduced bfloat16 support, and more is expected in future releases. Compared to the FP32 baseline, we project a 1.6x+ end-to-end performance benefit for a wide range of vision models, and expect benefits for speech recognition, recommendation engines, and machine translation as well.

Pitch

Support the PyTorch bfloat16 feature on the CPU path and optimize it for Intel Cooper Lake.

Additional context

On the CPU path, we plan to extend bfloat16 tensor operation support to the same coverage as FP16. The input and output tensors are all in the public format. We need to override the basic data type operations (e.g., “+”, “-”, “*”, “/”) in the BFloat16 class and modify the bfloat16 tensor operations where special handling is needed (e.g., accumulating in FP32 precision in batch normalization). On CPUs prior to Cooper Lake, the basic bfloat16 operations are emulated: inputs are converted to FP32 and the result is rounded back to bfloat16 using round-to-nearest mode.
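For illustration, a minimal sketch of this emulation approach (hypothetical helper names, not the actual PyTorch implementation, assuming round-to-nearest-even and ignoring NaN handling):

```cpp
#include <cstdint>
#include <cstring>

// Round an FP32 value to bfloat16 storage (upper 16 bits of the FP32 pattern),
// using round-to-nearest-even.
static inline uint16_t fp32_to_bf16(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  // Bias derived from the bit just below the truncation point implements
  // round-to-nearest-even before the lower 16 bits are discarded.
  uint32_t rounding_bias = 0x7FFF + ((bits >> 16) & 1);
  return static_cast<uint16_t>((bits + rounding_bias) >> 16);
}

// Widen bfloat16 back to FP32 by placing it in the upper 16 bits.
static inline float bf16_to_fp32(uint16_t h) {
  uint32_t bits = static_cast<uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Emulated "+": convert inputs to FP32, compute, round the result back.
static inline uint16_t bf16_add(uint16_t a, uint16_t b) {
  return fp32_to_bf16(bf16_to_fp32(a) + bf16_to_fp32(b));
}
```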

On the MKL-DNN path, the input tensor is expected to be converted to the MKL-DNN blocking format, i.e., represented internally as an MKL-DNN tensor. The dispatch to the MKL-DNN bfloat16 operation happens within the MKL-DNN operations, not according to the first-level tensor type id.

On CPUs prior to Cooper Lake, users can run bfloat16 models with this enabling effort, but may see lower performance than the FP32 baseline on both the CPU and MKL-DNN paths. For best performance, users need to run bfloat16 models on the MKL-DNN path on Cooper Lake.

@colesbury added the module: cpu, module: performance, and triaged labels on Jul 29, 2019
colesbury (Member) commented:

cc @izdeby

EikanWang commented Oct 14, 2019

Intel is trying to enable some BF16 ops for CPU and found an issue here: vec256 does not support an accumulate type for “+”, “-”, “*”, “/”.

The framework uses vec256 to vectorize some ops, but vec256 does not support an accumulate type. Take the sum of a Vec256 of BF16 as an example:

  • The accumulate type of BF16 is FP32
  • The function prototype of the vec256 sum is `template <class T> Vec256<T> inline operator+(const Vec256<T> &a, const Vec256<T> &b)`

So for the BF16 sum, we need to change the interface to `template <class acc_t, class data_t> Vec256<acc_t> inline operator+(const Vec256<acc_t> &a, const Vec256<data_t> &b)`. However, a vec256 can store 16 BF16 numbers but only 8 FP32 numbers, so we cannot execute the sum element-wise because the element counts differ. The framework may need to introduce vec512 to store FP32, so that developers can take vec512 as the accumulate type of vec256. Then for BF16, the interface should look like `template <class acc_t, class data_t> Vec512<acc_t> inline operator+(const Vec512<acc_t> &a, const Vec256<data_t> &b)`.

int8_t, uint8_t, char, int16_t, and int32_t have the same issue because their accumulate type is int64_t.

The framework cannot use vec256 to store both the scalar data and its accumulate-type data for accumulation operations like “+”, “-”, “*”, “/”, because the bit width of most accumulate types is larger than that of the corresponding scalar type; with different element counts, the accumulation cannot be executed element-wise.
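To make the width mismatch concrete, here is a hypothetical, self-contained sketch (plain structs, not the actual ATen Vec256/Vec512 classes) of the mixed-width operator described above:

```cpp
#include <array>
#include <cstdint>
#include <cstring>

struct Vec256BF16 { std::array<uint16_t, 16> lanes; };  // 16 x bf16 = 256 bits
struct Vec512FP32 { std::array<float, 16> lanes; };     // 16 x fp32 = 512 bits

static inline float bf16_to_fp32(uint16_t h) {
  uint32_t bits = static_cast<uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Accumulate 16 bf16 lanes into a 512-bit FP32 accumulator, element-wise.
// The accumulator must be twice as wide because each FP32 lane is twice
// the size of a bf16 lane.
inline Vec512FP32 operator+(const Vec512FP32& acc, const Vec256BF16& data) {
  Vec512FP32 out;
  for (int i = 0; i < 16; ++i) {
    out.lanes[i] = acc.lanes[i] + bf16_to_fp32(data.lanes[i]);
  }
  return out;
}
```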

Jianhui-Li commented Jan 15, 2020

Let me summarize what we plan to do to enhance the accumulation support for BF16 Vec256. We found that ResNet-50 did not converge to SOTA accuracy with BF16. We root-caused the problem to the softmax op. Softmax sums a large number of BF16 values, and the sum can deviate from the true value by a lot. Because BF16 has only a 7-bit mantissa, when the larger value is 2^7 times bigger than the smaller value, the smaller value is discarded. When we perform a reduction operation like SUM, the intermediate accumulated value can become 2^7 times bigger than the inputs. We need to introduce FP32 as the accumulation data type for all reduce operations performed on the BF16 data type. For the scalar data type, the mechanism is already there but needs to be enhanced to support BF16.
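As a standalone illustration of this effect (a sketch with hypothetical conversion helpers, not PyTorch code): accumulating 4096 ones with a bfloat16 running sum stalls at 256, while an FP32 accumulator returns the exact result.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

static inline uint16_t fp32_to_bf16(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  uint32_t rounding_bias = 0x7FFF + ((bits >> 16) & 1);  // round to nearest even
  return static_cast<uint16_t>((bits + rounding_bias) >> 16);
}

static inline float bf16_to_fp32(uint16_t h) {
  uint32_t bits = static_cast<uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

int main() {
  const int n = 4096;
  uint16_t bf16_sum = fp32_to_bf16(0.0f);
  float fp32_sum = 0.0f;
  for (int i = 0; i < n; ++i) {
    // BF16 accumulator: round back to bf16 after every addition.
    bf16_sum = fp32_to_bf16(bf16_to_fp32(bf16_sum) + 1.0f);
    // FP32 accumulator: only the input is bf16, the running sum stays FP32.
    fp32_sum += bf16_to_fp32(fp32_to_bf16(1.0f));
  }
  // Prints "bf16 acc: 256.0  fp32 acc: 4096.0": once the running sum hits 256,
  // adding 1.0 rounds back to 256, so further inputs are discarded.
  std::printf("bf16 acc: %.1f  fp32 acc: %.1f\n", bf16_to_fp32(bf16_sum), fp32_sum);
  return 0;
}
```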

We also need to extend the vector reduction API to support a higher-precision accumulation data type and a longer vector length. Currently, PyTorch uses the vec256 class to provide vectorized programming support; however, vec256 does not support an accumulate data type. Below is an example of the current vec256 sum operator using BF16.

  `template <class T> Vec256<T> inline operator+(const Vec256<T> &a, const Vec256<T> &b)`

We propose to add an interface to support accumulation.

  `template <class acc_t, class data_t> Vec512<acc_t> inline operator+(const Vec512<acc_t> &a, const Vec256<data_t> &b)`

The interface adds acc_t, which is float32 for BF16, and changes the accumulator's vector length from Vec256 to Vec512. We also propose to introduce a Vec512 class to PyTorch.

The softmax op is fixed by #24457. #29280 enhances the BF16 reduction support for the scalar data type.

rcownie commented Nov 17, 2020

This rounding will depend on which binary tree you use to accumulate the sum. Accumulating up a balanced binary tree does not have the same problem of adding probably-very-small values to a probably-much-larger sum, e.g.

 sum = (((((((a + b) + c) + d) + e) + f) + g) + h)

vs sum = ((a + b) + (c + d)) + ((e + f) + (g + h))

Typically, hardware would use the balanced tree since it gives the minimal latency.
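For reference, a minimal generic sketch (plain C++, not PyTorch code) contrasting the sequential left fold with the balanced-tree (pairwise) reduction described above:

```cpp
#include <cstddef>

// Sequential left fold: ((x0 + x1) + x2) + ... ; the running sum can grow
// much larger than the next addend, which is where small values get lost.
float sequential_sum(const float* x, std::size_t n) {
  float s = 0.0f;
  for (std::size_t i = 0; i < n; ++i) s += x[i];
  return s;
}

// Pairwise (balanced-tree) reduction: partial sums stay similar in magnitude,
// which limits rounding error and also minimizes latency in hardware.
float pairwise_sum(const float* x, std::size_t n) {
  if (n == 0) return 0.0f;
  if (n == 1) return x[0];
  std::size_t half = n / 2;
  return pairwise_sum(x, half) + pairwise_sum(x + half, n - half);
}
```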

garymm commented Sep 15, 2021

Was this ever done? This page suggests it was.

jgong5 commented Sep 25, 2021

@garymm Many ATen ops already have BF16 supported and optimized (including matmul). We are in the process of filling the remaining gaps; PRs have been submitted for some of them (e.g., conv and linear). Please refer to the comments in #55374 (comment) for details.
