Enable PyTorch Bfloat16 for CPU and add MKL-DNN bfloat16 optimization for Cooper Lake #23509
Comments
cc @izdeby |
Intel is working to enable some BF16 ops for CPU and found an issue here - vec256 does not support an accumulate type for “+”, “-”, “*”, “/”. The framework uses vec256 to vectorize some ops, but vec256 has no notion of an accumulate type. Take the sum of a BF16 vec256 as an example.
So for the BF16 sum, we need to change the interface. The framework cannot use vec256 to hold both the scalar data and its accumulate-type data for accumulation operations like “+”, “-”, “*”, “/”, because the bit width of most accumulate types is larger than that of the corresponding scalar type. With different element counts per vector, the accumulation cannot be executed element-wise. |
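The element-count mismatch can be illustrated with a small Python sketch (not the actual ATen code; lane counts assume a 256-bit register): a Vec256 holds 16 BF16 lanes but only 8 FP32 lanes, so one BF16 vector widens into two FP32 accumulator vectors.

```python
import struct

def widen_bf16(u16):
    # A BF16 value is the top 16 bits of the equivalent FP32 value,
    # so widening is just a 16-bit left shift of the bit pattern.
    return struct.unpack("<f", struct.pack("<I", u16 << 16))[0]

def bf16_vec_to_fp32_vecs(lanes):
    # A 256-bit register holds 16 BF16 lanes but only 8 FP32 lanes,
    # so one BF16 "Vec256" widens into TWO FP32 "Vec256"s -- this
    # element-count mismatch is why a plain element-wise "+" between
    # the scalar vector and its accumulate-type vector cannot work.
    assert len(lanes) == 16
    return ([widen_bf16(u) for u in lanes[:8]],
            [widen_bf16(u) for u in lanes[8:]])

ones = [0x3F80] * 16      # 0x3F80 is the BF16 bit pattern of 1.0
lo, hi = bf16_vec_to_fp32_vecs(ones)
print(lo, hi)
```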
Let me summarize what we plan to do to enhance the accumulation support for BF16 Vec256. We found that ResNet-50 did not converge to SOTA accuracy with BF16, and we root-caused the problem to the Softmax op. Softmax sums a large number of BF16 values, and the sum can deviate from the true value by a lot. Because BF16 has only a 7-bit mantissa, when the larger value is 2^7 times bigger than the smaller value, the smaller value is discarded. When we perform a reduction operation like SUM, the intermediate accumulated value can be 2^7 times bigger than the inputs. We need to introduce FP32 as the accumulation data type for all reduce operations performed on the BF16 data type. For the scalar data type, the mechanism is already there but needs to be enhanced to support BF16. We also need to extend the vector reduction API to support a higher-precision accumulation data type and a longer vector length. PyTorch currently uses the vec256 class to provide vectorization support; however, vec256 does not support an accumulate data type. Below is an example of the current vec256 sum operator with BF16.
We propose to add an interface to support accumulation.
The interface adds acc_t, which is float32 for BF16, and changes the vector length from Vec256 to Vec512. We also propose to introduce a Vec512 class to PyTorch. The softmax op is fixed in #24457. #29280 enhances the BF16 reduction support for the scalar data type. |
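The accuracy problem is easy to reproduce. The sketch below (plain Python, with BF16 rounding emulated via round-to-nearest-even bit truncation; not PyTorch's actual acc_t implementation) sums 4096 ones: the BF16 accumulator stalls once the running sum is 2^8 larger than the addend, while an FP32 accumulator gives the exact result.

```python
import struct

def to_bf16(x):
    # Round an FP32 value to BF16 (round-to-nearest-even) and widen back.
    u = struct.unpack("<I", struct.pack("<f", x))[0]
    u = (u + 0x7FFF + ((u >> 16) & 1)) & 0xFFFF_0000
    return struct.unpack("<f", struct.pack("<I", u))[0]

data = [1.0] * 4096   # every value exactly representable in BF16

# Naive reduction with a BF16 accumulator: once the running sum
# reaches 256 = 2^8, adding 1.0 is lost to the 7-bit mantissa.
acc_bf16 = 0.0
for v in data:
    acc_bf16 = to_bf16(acc_bf16 + v)

# Reduction with an FP32 accumulator (acc_t = float in the proposal),
# rounding only the final result.
acc_fp32 = 0.0
for v in data:
    acc_fp32 += v
result = to_bf16(acc_fp32)

print(acc_bf16, result)   # 256.0 (stalled) vs. 4096.0 (exact)
```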
This rounding will depend on which binary tree you use to accumulate the sum. Accumulating up a balanced binary tree does not have the same problem of adding probably-very-small values to a probably-much-larger sum, e.g.
sum = ((((((a + b) + c) + d) + e) + f) + g) + h vs. sum = ((a + b) + (c + d)) + ((e + f) + (g + h)). Typically hardware would use the balanced tree since it gives the minimal latency. |
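The effect of the summation tree can be shown with a small Python experiment (BF16 rounding emulated via bit truncation; a sketch, not a model of any particular hardware): a balanced pairwise tree keeps the operands at each level at similar magnitudes, so summing 4096 ones is exact, while the left-to-right chain stalls.

```python
import struct

def to_bf16(x):
    # Round an FP32 value to BF16 (round-to-nearest-even) and widen back.
    u = struct.unpack("<I", struct.pack("<f", x))[0]
    u = (u + 0x7FFF + ((u >> 16) & 1)) & 0xFFFF_0000
    return struct.unpack("<f", struct.pack("<I", u))[0]

def seq_sum_bf16(xs):
    # Left-to-right chain: ((a + b) + c) + d + ...
    acc = 0.0
    for v in xs:
        acc = to_bf16(acc + v)
    return acc

def tree_sum_bf16(xs):
    # Balanced binary tree: ((a + b) + (c + d)) + ((e + f) + (g + h)) ...
    if len(xs) == 1:
        return xs[0]
    m = len(xs) // 2
    return to_bf16(tree_sum_bf16(xs[:m]) + tree_sum_bf16(xs[m:]))

data = [1.0] * 4096
print(seq_sum_bf16(data), tree_sum_bf16(data))  # 256.0 vs. 4096.0
```

With all-equal inputs every partial sum in the tree is a power of two and therefore exact in BF16; the chained sum loses every addend once the accumulator reaches 256.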
Was this ever done? This page suggests it was. |
@garymm Many ATen ops already have BF16 support and optimizations (including matmul). We are working to fill the remaining gaps, and PRs have been submitted for some of them (e.g. conv and linear). Please refer to #55374 (comment) for details. |
Enable PyTorch Bfloat16 for CPU and add MKL-DNN bfloat16 optimization for Cooper Lake
Motivation
Bfloat16 is a 16-bit floating point representation with the same exponent bit-width as the 32-bit floating point representation (FP32). It improves deep learning training performance by reducing both computation and memory bandwidth while keeping accuracy at the same level as FP32. It has been adopted by various deep learning hardware, and PyTorch recently added initial Bfloat16 support.
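A quick way to see the "same exponent bit-width as FP32" property: a value near the FP32 maximum survives rounding to Bfloat16, while IEEE half precision (FP16, 5 exponent bits) cannot represent it at all. A small Python sketch, with BF16 emulated by bit manipulation:

```python
import math
import struct

def to_bf16(x):
    # Round FP32 to BF16 (round-to-nearest-even); BF16 keeps FP32's
    # 8 exponent bits and truncates the mantissa from 23 to 7 bits.
    u = struct.unpack("<I", struct.pack("<f", x))[0]
    u = (u + 0x7FFF + ((u >> 16) & 1)) & 0xFFFF_0000
    return struct.unpack("<f", struct.pack("<I", u))[0]

big = 3.0e38                        # close to the FP32 max (~3.4e38)
print(math.isfinite(to_bf16(big)))  # True: BF16 shares FP32's range

try:
    struct.pack("<e", big)          # "<e" is IEEE FP16: max ~65504
    fp16_ok = True
except OverflowError:
    fp16_ok = False
print(fp16_ok)                      # False: FP16 cannot hold it
```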
Intel’s upcoming Cooper Lake 14nm Intel Xeon® processor family will add Bfloat16 support, which provides a 2x speedup for SIMD FMA instructions and a 2x performance benefit on memory access. MKL-DNN v1.0 introduced bfloat16 support, with more expected in future releases. Compared to the FP32 baseline, we project a 1.6x+ end-to-end performance benefit for a wide range of vision models, and expect benefits for speech recognition, recommendation engines, and machine translation as well.
Pitch
Support the PyTorch bfloat16 feature on the CPU path and optimize it for Intel Cooper Lake.
Additional context
On the CPU path, we plan to extend Bfloat16 tensor operation support to the same coverage as FP16. The input and output tensors are all in public format. We need to override the basic data type operations (“+”, “-”, “*”, “/”) in the Bfloat16 class and modify the Bfloat16 tensor operations where special handling is needed (e.g. accumulating in FP32 precision in batch norm). On CPUs prior to Cooper Lake, the basic Bfloat16 data operations are emulated: inputs are converted to FP32 and results are rounded back to Bfloat16 using round-to-nearest mode.
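The emulation scheme above can be sketched as: widen both BF16 operands to FP32 (exact, since every BF16 value is representable in FP32), compute in FP32, and round the result back to BF16 with round-to-nearest-even. This is an illustrative Python model, not the actual PyTorch implementation:

```python
import struct

def to_bf16(x):
    # Round an FP32 result to BF16 using round-to-nearest-even.
    u = struct.unpack("<I", struct.pack("<f", x))[0]
    u = (u + 0x7FFF + ((u >> 16) & 1)) & 0xFFFF_0000
    return struct.unpack("<f", struct.pack("<I", u))[0]

def bf16_add(a, b):
    # Emulated BF16 "+": operands (already BF16-representable) are
    # exact in FP32, so only the final rounding step loses precision.
    return to_bf16(a + b)

# With a 7-bit mantissa the BF16 spacing at magnitude 256 is 2.0, so:
print(bf16_add(256.0, 1.0))   # 256.0 -- halfway case, ties to even
print(bf16_add(258.0, 1.0))   # 260.0 -- halfway case, ties to even
```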
On the MKL-DNN path, the input tensor is expected to be converted to MKL-DNN blocking format and represented internally as an MKL-DNN tensor. The dispatch to MKL-DNN Bfloat16 operations will happen within MKL-DNN operations, not according to the first-level tensor type ID.
On CPUs prior to Cooper Lake, users can run BFloat16 models with this enabling effort, but may see lower Bfloat16 performance than the FP32 baseline on both the CPU and MKL-DNN paths. For best performance, users should run Bfloat16 models on the MKL-DNN path on Cooper Lake.