
Enable PyTorch Bfloat16 for CPU and add MKL-DNN bfloat16 optimization for Cooper Lake #23509

Open
Jianhui-Li opened this issue Jul 29, 2019 · 6 comments

Jianhui-Li commented Jul 29, 2019

Enable PyTorch Bfloat16 for CPU and add MKL-DNN bfloat16 optimization for Cooper Lake

Motivation

Bfloat16 is a 16-bit floating point representation with the same exponent bit-width as the 32-bit floating point representation (FP32). It improves deep learning training performance by reducing both computation and memory bandwidth while keeping accuracy at the same level as FP32. It has been adopted by various deep learning hardware, and PyTorch recently added initial bfloat16 support.

Intel’s upcoming Cooper Lake 14nm Intel Xeon® processor family will add bfloat16 support, which provides a 2x speedup for SIMD FMA instructions and a 2x benefit on memory access. MKL-DNN v1.0 introduced bfloat16 support, and more is expected in future releases. Compared to the FP32 baseline, we project a 1.6x+ end-to-end performance benefit for a wide range of vision models, and expect benefits for speech recognition, recommendation engines, and machine translation as well.

Pitch

Support the PyTorch bfloat16 feature on the CPU path and optimize it for Intel Cooper Lake.

Additional context

On the CPU path, we plan to extend bfloat16 tensor operation support to the same coverage as FP16. The input and output tensors are all in the public format. We need to override the basic data type operations (e.g., “+”, “-”, “*”, “/”) in the BFloat16 class and modify the bfloat16 tensor operations where special handling is needed (e.g., accumulating in FP32 precision in batch normalization). On CPUs prior to Cooper Lake, the basic bfloat16 operations are emulated: inputs are converted to FP32 and the result is rounded back to bfloat16 using round-to-nearest mode.
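For illustration, a minimal sketch of this emulation approach (hypothetical helper names, not the actual PyTorch implementation, assuming round-to-nearest-even and ignoring NaN handling):

```cpp
#include <cstdint>
#include <cstring>

// Round an FP32 value to bfloat16 storage (upper 16 bits of the FP32 pattern),
// using round-to-nearest-even.
static inline uint16_t fp32_to_bf16(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  // Bias derived from the bit just below the truncation point implements
  // round-to-nearest-even before the lower 16 bits are discarded.
  uint32_t rounding_bias = 0x7FFF + ((bits >> 16) & 1);
  return static_cast<uint16_t>((bits + rounding_bias) >> 16);
}

// Widen bfloat16 back to FP32 by placing it in the upper 16 bits.
static inline float bf16_to_fp32(uint16_t h) {
  uint32_t bits = static_cast<uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Emulated "+": convert inputs to FP32, compute, round the result back.
static inline uint16_t bf16_add(uint16_t a, uint16_t b) {
  return fp32_to_bf16(bf16_to_fp32(a) + bf16_to_fp32(b));
}
```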

On the MKL-DNN path, the input tensor is expected to be converted to the MKL-DNN blocking format, i.e., represented internally as an MKL-DNN tensor. The dispatch to the MKL-DNN bfloat16 operation happens within the MKL-DNN operations, not according to the first-level tensor type id.

On CPUs prior to Cooper Lake, users can run bfloat16 models with this enabling effort, but may see lower performance than the FP32 baseline on both the CPU and MKL-DNN paths. For best performance, users need to run bfloat16 models on the MKL-DNN path on Cooper Lake.

@colesbury added the module: cpu, module: performance, and triaged labels on Jul 29, 2019
colesbury (Member) commented:

cc @izdeby

EikanWang commented Oct 14, 2019

Intel is trying to enable some BF16 ops for CPU and found an issue here: vec256 does not support an accumulate type for “+”, “-”, “*”, “/”.

The framework uses vec256 to vectorize some ops, but vec256 does not support an accumulate type. Take the sum of a Vec256 of BF16 as an example:

  • The accumulate type of BF16 is FP32
  • The function prototype of the vec256 sum is `template <class T> Vec256<T> inline operator+(const Vec256<T> &a, const Vec256<T> &b)`

So for the BF16 sum, we need to change the interface to `template <class acc_t, class data_t> Vec256<acc_t> inline operator+(const Vec256<acc_t> &a, const Vec256<data_t> &b)`. However, a vec256 can store 16 BF16 numbers but only 8 FP32 numbers, so we cannot execute the sum element-wise because the element counts differ. The framework may need to introduce vec512 to store FP32, so that developers can take vec512 as the accumulate type of vec256. Then for BF16, the interface should look like `template <class acc_t, class data_t> Vec512<acc_t> inline operator+(const Vec512<acc_t> &a, const Vec256<data_t> &b)`.

int8_t, uint8_t, char, int16_t, and int32_t have the same issue because their accumulate type is int64_t.

The framework cannot use vec256 to store both the scalar data and its accumulate-type data for accumulation operations like “+”, “-”, “*”, “/”, because the bit width of most accumulate types is larger than that of the corresponding scalar type; with different element counts, the accumulation cannot be executed element-wise.
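To make the width mismatch concrete, here is a hypothetical, self-contained sketch (plain structs, not the actual ATen Vec256/Vec512 classes) of the mixed-width operator described above:

```cpp
#include <array>
#include <cstdint>
#include <cstring>

struct Vec256BF16 { std::array<uint16_t, 16> lanes; };  // 16 x bf16 = 256 bits
struct Vec512FP32 { std::array<float, 16> lanes; };     // 16 x fp32 = 512 bits

static inline float bf16_to_fp32(uint16_t h) {
  uint32_t bits = static_cast<uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Accumulate 16 bf16 lanes into a 512-bit FP32 accumulator, element-wise.
// The accumulator must be twice as wide because each FP32 lane is twice
// the size of a bf16 lane.
inline Vec512FP32 operator+(const Vec512FP32& acc, const Vec256BF16& data) {
  Vec512FP32 out;
  for (int i = 0; i < 16; ++i) {
    out.lanes[i] = acc.lanes[i] + bf16_to_fp32(data.lanes[i]);
  }
  return out;
}
```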

Jianhui-Li commented Jan 15, 2020

Let me summarize what we plan to do to enhance the accumulation support for BF16 Vec256. We found that ResNet-50 did not converge to SOTA accuracy with BF16. We root-caused the problem to the softmax op. Softmax sums a large number of BF16 values, and the sum can deviate from the true value by a lot. Because BF16 has only a 7-bit mantissa, when the larger value is 2^7 times bigger than the smaller value, the smaller value is discarded. When we perform a reduction operation like SUM, the intermediate accumulated value can become 2^7 times bigger than the inputs. We need to introduce FP32 as the accumulation data type for all reduce operations performed on the BF16 data type. For the scalar data type, the mechanism is already there but needs to be enhanced to support BF16.
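As a standalone illustration of this effect (a sketch with hypothetical conversion helpers, not PyTorch code): accumulating 4096 ones with a bfloat16 running sum stalls at 256, while an FP32 accumulator returns the exact result.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

static inline uint16_t fp32_to_bf16(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  uint32_t rounding_bias = 0x7FFF + ((bits >> 16) & 1);  // round to nearest even
  return static_cast<uint16_t>((bits + rounding_bias) >> 16);
}

static inline float bf16_to_fp32(uint16_t h) {
  uint32_t bits = static_cast<uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

int main() {
  const int n = 4096;
  uint16_t bf16_sum = fp32_to_bf16(0.0f);
  float fp32_sum = 0.0f;
  for (int i = 0; i < n; ++i) {
    // BF16 accumulator: round back to bf16 after every addition.
    bf16_sum = fp32_to_bf16(bf16_to_fp32(bf16_sum) + 1.0f);
    // FP32 accumulator: only the input is bf16, the running sum stays FP32.
    fp32_sum += bf16_to_fp32(fp32_to_bf16(1.0f));
  }
  // Prints "bf16 acc: 256.0  fp32 acc: 4096.0": once the running sum hits 256,
  // adding 1.0 rounds back to 256, so further inputs are discarded.
  std::printf("bf16 acc: %.1f  fp32 acc: %.1f\n", bf16_to_fp32(bf16_sum), fp32_sum);
  return 0;
}
```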

We also need to extend the vector reduction API to support a higher-precision accumulation data type and a longer vector length. Currently, PyTorch uses the vec256 class to provide vectorized programming support; however, vec256 does not support an accumulate data type. Below is an example of the current vec256 sum operator using BF16.

  `template <class T> Vec256<T> inline operator+(const Vec256<T> &a, const Vec256<T> &b)`

We propose to add an interface to support accumulation.

  `template <class acc_t, class data_t> Vec512<acc_t> inline operator+(const Vec512<acc_t> &a, const Vec256<data_t> &b)`

The interface adds acc_t, which is float32 for BF16, and changes the accumulator's vector length from Vec256 to Vec512. We also propose to introduce a Vec512 class to PyTorch.

The softmax op is fixed by #24457. #29280 enhances the BF16 reduction support for the scalar data type.

rcownie commented Nov 17, 2020

This rounding will depend on which binary tree you use to accumulate the sum. Accumulating up a balanced binary tree does not have the same problem of adding probably-very-small values to a probably-much-larger sum, e.g.

 sum = (((((((a + b) + c) + d) + e) + f) + g) + h)

vs sum = ((a + b) + (c + d)) + ((e + f) + (g + h))

Typically, hardware would use the balanced tree since it gives the minimal latency.
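For reference, a minimal generic sketch (plain C++, not PyTorch code) contrasting the sequential left fold with the balanced-tree (pairwise) reduction described above:

```cpp
#include <cstddef>

// Sequential left fold: ((x0 + x1) + x2) + ... ; the running sum can grow
// much larger than the next addend, which is where small values get lost.
float sequential_sum(const float* x, std::size_t n) {
  float s = 0.0f;
  for (std::size_t i = 0; i < n; ++i) s += x[i];
  return s;
}

// Pairwise (balanced-tree) reduction: partial sums stay similar in magnitude,
// which limits rounding error and also minimizes latency in hardware.
float pairwise_sum(const float* x, std::size_t n) {
  if (n == 0) return 0.0f;
  if (n == 1) return x[0];
  std::size_t half = n / 2;
  return pairwise_sum(x, half) + pairwise_sum(x + half, n - half);
}
```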

garymm commented Sep 15, 2021

Was this ever done? This page suggests it was.

jgong5 commented Sep 25, 2021

@garymm Many ATen ops already have BF16 supported and optimized (including matmul). We are in the process of filling the remaining gaps; PRs have been submitted for some of them (e.g., conv and linear). Please refer to the comments in #55374 (comment) for details.
