[RFC] CPU float16 performance optimization on eager mode. #97068

Open
mingfeima opened this issue Mar 18, 2023 · 1 comment
Labels
feature A request for a proper, new feature. module: cpu CPU specific problem (e.g., perf, algorithm) module: half Related to float16 half-precision floats triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@mingfeima (Collaborator) commented Mar 18, 2023

🚀 The feature, motivation and pitch

This RFC proposes improving float16 performance as well as Op coverage on the PyTorch CPU backend in eager mode.

Float16 and BFloat16 are both commonly used reduced-precision floating point types for improving performance in neural network inference/training. On the CPU side, previous optimization efforts have focused more on BFloat16, leaving float16 in a relatively primitive state.

On the 4th generation Intel® Xeon® Scalable processor (Sapphire Rapids), a new fp16 instruction set architecture for Intel® AVX-512 has been added, i.e. avx512-fp16. The instruction set supports a wide range of general-purpose numeric operations for fp16. On the next generation of Xeon, Intel® Advanced Matrix Extensions (AMX) will have fp16 support, i.e. amx-fp16.

This proposal targets the scenario in which a model is pretrained on GPU with float16/float32 mixed precision and users intend to deploy it on CPU without modifying the model weights; many HuggingFace models, for instance, fall into this scenario.

This project targets the following:

  • Improve float16 Op coverage on PyTorch from 52% to ~80% (for reference, BFloat16 Op coverage is 83%). TorchBench models and HuggingFace models will be prioritized.
  • Improve float16 performance: since current fp16 performance on CPU is very low, fp16 vs. fp32 speedup is used as the metric here. The speedup is expected to match bf16; on average a 4x-5x speedup should be achieved on hardware with amx-fp16 support.
  • Add Automatic Mixed Precision (AMP) support for float16 on CPU.
  • Improve float16 numerical stability: use fp32 as the accumulation type in reduction Ops, e.g. mean (see the sketch after this list).
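
A minimal standalone sketch (not PyTorch code) of why reductions over fp16 data should accumulate in fp32. It uses only the scalar F16C conversions _cvtsh_ss/_cvtss_sh; the data values and sizes are made up for illustration.

```cpp
// Mean of 100,000 fp16 values of 0.1: an fp16 running sum stalls once it
// exceeds ~1024 (its ulp grows past 0.1), while an fp32 accumulator stays
// accurate. Compile with e.g. -mf16c.
#include <immintrin.h>
#include <cstdio>
#include <vector>

int main() {
  std::vector<unsigned short> data(100000, _cvtss_sh(0.1f, _MM_FROUND_TO_NEAREST_INT));

  // Naive: keep the running sum in fp16.
  unsigned short sum_h = _cvtss_sh(0.0f, _MM_FROUND_TO_NEAREST_INT);
  for (unsigned short v : data)
    sum_h = _cvtss_sh(_cvtsh_ss(sum_h) + _cvtsh_ss(v), _MM_FROUND_TO_NEAREST_INT);

  // Proposed behavior: accumulate in fp32, cast only the final result back.
  float sum_f = 0.0f;
  for (unsigned short v : data) sum_f += _cvtsh_ss(v);

  std::printf("fp16 accumulator mean: %f\n", _cvtsh_ss(sum_h) / (float)data.size());
  std::printf("fp32 accumulator mean: %f\n", sum_f / (float)data.size());
  return 0;
}
```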

Technically, the optimization will be carried out as follows:

Compute-intensive Ops (e.g. Convolution, GEMM, and RNN):

  • Rely on oneDNN for optimal performance when the hardware has float16 acceleration (a minimal oneDNN sketch follows this list).
  • Functional coverage will be added for hardware without float16 acceleration, with no performance gain expected there.
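
As a rough illustration of the oneDNN path, here is a minimal sketch of dispatching an fp16 GEMM to oneDNN, assuming the oneDNN v3.x C++ API; the shapes are made up, and depending on the oneDNN build and host ISA the primitive may fall back to a reference kernel or fail to create.

```cpp
// fp16 matmul dispatched to oneDNN: describe fp16 tensors and let the library
// pick the best available kernel (amx-fp16 / avx512-fp16 / reference).
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;

int main() {
  engine eng(engine::kind::cpu, 0);
  stream strm(eng);

  const memory::dim M = 64, K = 256, N = 128;
  auto src_md = memory::desc({M, K}, memory::data_type::f16, memory::format_tag::ab);
  auto wei_md = memory::desc({K, N}, memory::data_type::f16, memory::format_tag::ab);
  auto dst_md = memory::desc({M, N}, memory::data_type::f16, memory::format_tag::ab);

  auto pd = matmul::primitive_desc(eng, src_md, wei_md, dst_md);
  auto mm = matmul(pd);

  memory src(src_md, eng), wei(wei_md, eng), dst(dst_md, eng);
  mm.execute(strm, {{DNNL_ARG_SRC, src}, {DNNL_ARG_WEIGHTS, wei}, {DNNL_ARG_DST, dst}});
  strm.wait();
  return 0;
}
```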

Generic ATen kernels:

  • Extend avx256 and avx512 vectorization utils for dtype Half. Add native conversion intrinsics: _mm256_cvtph_ps/_mm256_cvtps_ph (rounding mode: RNE).
  • Unary and binary Op kernels: map to fp32 for computation (see the sketch after this list).
  • Non-arithmetic Ops: do a direct memory copy (no dtype conversion), e.g. cat, index_select.
  • Reduction Ops: use fp32 as the accumulation type.
  • NN Ops: reuse kernels of BFloat16.
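
A minimal sketch (not the actual ATen kernel) of the "map to fp32 for computation" pattern for a pointwise op, using only the conversion intrinsics named above; the op (y = a*x + b), function name, and multiple-of-8 assumption are illustrative.

```cpp
// Load 8 fp16 values, widen with _mm256_cvtph_ps, compute in fp32, and narrow
// back with _mm256_cvtps_ph using round-to-nearest-even.
// Compile with e.g. -mavx2 -mf16c.
#include <immintrin.h>
#include <cstddef>

// y[i] = a * x[i] + b over fp16 storage; n assumed to be a multiple of 8.
void axpb_half(const unsigned short* x, unsigned short* y,
               float a, float b, std::size_t n) {
  const __m256 va = _mm256_set1_ps(a);
  const __m256 vb = _mm256_set1_ps(b);
  for (std::size_t i = 0; i < n; i += 8) {
    __m128i xh = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + i));
    __m256  xf = _mm256_cvtph_ps(xh);                       // fp16 -> fp32
    __m256  yf = _mm256_add_ps(_mm256_mul_ps(va, xf), vb);  // compute in fp32
    __m128i yh = _mm256_cvtps_ph(yf, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(y + i), yh); // fp32 -> fp16 (RNE)
  }
}
```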

Test Plan:

  • Extend vec_test_all_types_AVX2 and vec_test_all_types_AVX512 for float16.
  • Add OpInfo at torch/testing/_internal/common_methods_invocations.py.
  • Provide specific test cases for reduced floating point types (a standalone round-trip check is sketched after this list).
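
One example of such a test, sketched here as a standalone program rather than in the PyTorch test framework: every non-NaN fp16 bit pattern should survive an fp16 -> fp32 -> fp16 round trip under RNE, since fp32 represents all fp16 values exactly. NaN patterns are skipped because payload handling may differ.

```cpp
// Exhaustive fp16 round-trip check over all 65536 bit patterns.
// Compile with e.g. -mf16c.
#include <immintrin.h>
#include <cstdio>

static bool is_nan_fp16(unsigned short h) {
  return (h & 0x7c00) == 0x7c00 && (h & 0x03ff) != 0;
}

int main() {
  int mismatches = 0;
  for (unsigned i = 0; i < 0x10000; ++i) {
    unsigned short h = static_cast<unsigned short>(i);
    if (is_nan_fp16(h)) continue;
    float f = _cvtsh_ss(h);                                        // fp16 -> fp32
    unsigned short back = _cvtss_sh(f, _MM_FROUND_TO_NEAREST_INT); // fp32 -> fp16 (RNE)
    if (back != h) ++mismatches;
  }
  std::printf("round-trip mismatches: %d\n", mismatches);          // expected: 0
  return mismatches != 0;
}
```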

Alternatives

No response

Additional context

Previous RFC on extending AMP fp16 on CPU:

Float16 support in TorchInductor is being worked on in parallel (implemented in a similar way to the BFloat16 support) and depends on the explicit vectorization utils of at::vec::Vectorized<Half>.

Pull requests related to this feature request:

cc @jgong5 @XiaobingSuper @sanchitintel @ashokei @jingxu10

@mingfeima added the module: cpu CPU specific problem (e.g., perf, algorithm), module: half Related to float16 half-precision floats, and feature A request for a proper, new feature. labels Mar 18, 2023
@jgong5 changed the title [Feature Proposal] CPU float16 performance optimization on eager mode. [RFC] CPU float16 performance optimization on eager mode. Mar 20, 2023
@jgong5 (Collaborator) commented Mar 20, 2023

More clarification on avx512-fp16: we are not going to leverage avx512-fp16 to optimize non-conv/gemm ATen ops (primarily pointwise and reduction ops). Instead, we plan to follow a similar "fused type cast" approach to the one we use to optimize those ops for the bf16 data type, i.e. fp16 data are converted to/from fp32 and the conversion is fused with the computation, which happens in fp32. The type cast will rely on the f16c instruction set, as noted by @mingfeima in the RFC description. This is due to the following considerations:

  1. avx512-fp16 support needs GCC 12, which introduces an extra compiler dependency for PyTorch, while f16c is well supported by older compilers.
  2. Reductions have to use an fp32 accumulation data type anyway (see the sketch after this list).
  3. Pointwise ops might benefit a bit from avx512-fp16 vs. the fused type cast approach, but we don't expect much, since they are memory bound in most cases.
  4. A similar fused type cast approach will be adopted in the Inductor CPP codegen as well. We expect this to be good enough thanks to the fusion of more ops.
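
To make point 2 concrete, here is a minimal sketch (not PyTorch code) of a vectorized fp16 sum that keeps its accumulator in fp32 and uses only F16C type casts rather than avx512-fp16 arithmetic; the function name and the multiple-of-8 assumption are for brevity.

```cpp
// Sum n fp16 values (given as raw 16-bit storage) with fp32 accumulator lanes.
// Compile with e.g. -mavx2 -mf16c.
#include <immintrin.h>
#include <cstddef>

float sum_fp16_fp32_acc(const unsigned short* src, std::size_t n) {
  __m256 acc = _mm256_setzero_ps();               // fp32 accumulator lanes
  for (std::size_t i = 0; i < n; i += 8) {
    __m128i h = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
    acc = _mm256_add_ps(acc, _mm256_cvtph_ps(h)); // widen to fp32, then add
  }
  // Horizontal reduction of the 8 fp32 lanes.
  alignas(32) float lanes[8];
  _mm256_store_ps(lanes, acc);
  float total = 0.0f;
  for (float v : lanes) total += v;
  return total;
}
```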

amx-fp16 will be leveraged to optimize conv and gemm ops via the oneDNN library.

@cpuhrsch added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Mar 21, 2023
CaoE added commits that referenced this issue on Sep 7, Sep 18, and Sep 19, 2023

The PRs are part of #97068, adding fp16 support for mkldnn conv and mkldnn deconv to leverage avx_ne_convert, avx512-fp16, and amx-fp16 via the oneDNN library.

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10
pytorchmergebot pushed a commit that referenced this issue Sep 19, 2023
The PR is part of #97068, which adds fp16 support for mkldnn conv and mkldnn deconv to leverage avx_ne_convert, avx512-fp16, and amx-fp16 via the oneDNN library.

Pull Request resolved: #99496
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
Projects
Status: In Progress