[RFC] CPU float16 performance optimization on eager mode. #97068

Open
mingfeima opened this issue Mar 18, 2023 · 1 comment
Labels
feature A request for a proper, new feature. module: cpu CPU specific problem (e.g., perf, algorithm) module: half Related to float16 half-precision floats triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@mingfeima (Collaborator) commented Mar 18, 2023

🚀 The feature, motivation and pitch

This RFC proposes improving float16 performance as well as Op coverage on the PyTorch CPU backend in eager mode.

Float16 and BFloat16 are both commonly used reduced-precision floating point types for improving performance in neural network inference/training. On the CPU side, previous optimization efforts have focused more on BFloat16, leaving float16 in a relatively primitive state.

On the 4th generation Intel® Xeon® Scalable processor (Sapphire Rapids), a new fp16 instruction set architecture for Intel® AVX-512 has been added, i.e. avx512-fp16. The instruction set supports a wide range of general-purpose numeric operations for fp16. On the next generation of Xeon, Intel® Advanced Matrix Extensions (AMX) will have fp16 support, i.e. amx-fp16.

This proposal targets the scenario in which a model is pretrained on GPU with float16/float32 mixed precision and users intend to deploy it on CPU without modifying the model weights; many HuggingFace models, for instance, fall into this scenario.

This project targets the following:

  • Improve float16 Op coverage on PyTorch from 52% to ~80% (for reference, BFloat16 Op coverage is 83%). TorchBench models and HuggingFace models will be prioritized.
  • Improve float16 performance: since current fp16 performance on CPU is very low, fp16 vs. fp32 speedup is used as the metric here. The speedup is expected to match bf16; on average a 4x-5x speedup should be achieved on hardware with amx-fp16 support.
  • Add Automatic Mixed Precision (AMP) support for float16 on CPU.
  • Improve float16 numerical stability: use fp32 as the accumulation type in reduction Ops, e.g. mean (see the sketch after this list).
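
A minimal standalone sketch (not PyTorch code) of why reductions over fp16 data should accumulate in fp32. It uses only the scalar F16C conversions _cvtsh_ss/_cvtss_sh; the data values and sizes are made up for illustration.

```cpp
// Mean of 100,000 fp16 values of 0.1: an fp16 running sum stalls once it
// exceeds ~1024 (its ulp grows past 0.1), while an fp32 accumulator stays
// accurate. Compile with e.g. -mf16c.
#include <immintrin.h>
#include <cstdio>
#include <vector>

int main() {
  std::vector<unsigned short> data(100000, _cvtss_sh(0.1f, _MM_FROUND_TO_NEAREST_INT));

  // Naive: keep the running sum in fp16.
  unsigned short sum_h = _cvtss_sh(0.0f, _MM_FROUND_TO_NEAREST_INT);
  for (unsigned short v : data)
    sum_h = _cvtss_sh(_cvtsh_ss(sum_h) + _cvtsh_ss(v), _MM_FROUND_TO_NEAREST_INT);

  // Proposed behavior: accumulate in fp32, cast only the final result back.
  float sum_f = 0.0f;
  for (unsigned short v : data) sum_f += _cvtsh_ss(v);

  std::printf("fp16 accumulator mean: %f\n", _cvtsh_ss(sum_h) / (float)data.size());
  std::printf("fp32 accumulator mean: %f\n", sum_f / (float)data.size());
  return 0;
}
```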

Technically, the optimization will be carried out as follows:

Compute-intensive Ops (e.g. Convolution, GEMM, and RNN):

  • Rely on oneDNN for optimal performance when the hardware has float16 acceleration (a minimal oneDNN sketch follows this list).
  • Functional coverage will be added for hardware without float16 acceleration, with no performance gain expected there.
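
As a rough illustration of the oneDNN path, here is a minimal sketch of dispatching an fp16 GEMM to oneDNN, assuming the oneDNN v3.x C++ API; the shapes are made up, and depending on the oneDNN build and host ISA the primitive may fall back to a reference kernel or fail to create.

```cpp
// fp16 matmul dispatched to oneDNN: describe fp16 tensors and let the library
// pick the best available kernel (amx-fp16 / avx512-fp16 / reference).
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;

int main() {
  engine eng(engine::kind::cpu, 0);
  stream strm(eng);

  const memory::dim M = 64, K = 256, N = 128;
  auto src_md = memory::desc({M, K}, memory::data_type::f16, memory::format_tag::ab);
  auto wei_md = memory::desc({K, N}, memory::data_type::f16, memory::format_tag::ab);
  auto dst_md = memory::desc({M, N}, memory::data_type::f16, memory::format_tag::ab);

  auto pd = matmul::primitive_desc(eng, src_md, wei_md, dst_md);
  auto mm = matmul(pd);

  memory src(src_md, eng), wei(wei_md, eng), dst(dst_md, eng);
  mm.execute(strm, {{DNNL_ARG_SRC, src}, {DNNL_ARG_WEIGHTS, wei}, {DNNL_ARG_DST, dst}});
  strm.wait();
  return 0;
}
```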

Generic ATen kernels:

  • Extend avx256 and avx512 vectorization utils for dtype Half. Add native conversion intrinsics: _mm256_cvtph_ps/_mm256_cvtps_ph (rounding mode: RNE).
  • Unary and binary Op kernels: map to fp32 for computation (see the sketch after this list).
  • Non-arithmetic Ops: do a direct memory copy (no dtype conversion), e.g. cat, index_select.
  • Reduction Ops: use fp32 as the accumulation type.
  • NN Ops: reuse kernels of BFloat16.
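
A minimal sketch (not the actual ATen kernel) of the "map to fp32 for computation" pattern for a pointwise op, using only the conversion intrinsics named above; the op (y = a*x + b), function name, and multiple-of-8 assumption are illustrative.

```cpp
// Load 8 fp16 values, widen with _mm256_cvtph_ps, compute in fp32, and narrow
// back with _mm256_cvtps_ph using round-to-nearest-even.
// Compile with e.g. -mavx2 -mf16c.
#include <immintrin.h>
#include <cstddef>

// y[i] = a * x[i] + b over fp16 storage; n assumed to be a multiple of 8.
void axpb_half(const unsigned short* x, unsigned short* y,
               float a, float b, std::size_t n) {
  const __m256 va = _mm256_set1_ps(a);
  const __m256 vb = _mm256_set1_ps(b);
  for (std::size_t i = 0; i < n; i += 8) {
    __m128i xh = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + i));
    __m256  xf = _mm256_cvtph_ps(xh);                       // fp16 -> fp32
    __m256  yf = _mm256_add_ps(_mm256_mul_ps(va, xf), vb);  // compute in fp32
    __m128i yh = _mm256_cvtps_ph(yf, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(y + i), yh); // fp32 -> fp16 (RNE)
  }
}
```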

Test Plan:

  • Extend vec_test_all_types_AVX2 and vec_test_all_types_AVX512 for float16.
  • Add OpInfo at torch/testing/_internal/common_methods_invocations.py.
  • Provide specific test cases for reduced floating point types (a standalone round-trip check is sketched after this list).
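
One example of such a test, sketched here as a standalone program rather than in the PyTorch test framework: every non-NaN fp16 bit pattern should survive an fp16 -> fp32 -> fp16 round trip under RNE, since fp32 represents all fp16 values exactly. NaN patterns are skipped because payload handling may differ.

```cpp
// Exhaustive fp16 round-trip check over all 65536 bit patterns.
// Compile with e.g. -mf16c.
#include <immintrin.h>
#include <cstdio>

static bool is_nan_fp16(unsigned short h) {
  return (h & 0x7c00) == 0x7c00 && (h & 0x03ff) != 0;
}

int main() {
  int mismatches = 0;
  for (unsigned i = 0; i < 0x10000; ++i) {
    unsigned short h = static_cast<unsigned short>(i);
    if (is_nan_fp16(h)) continue;
    float f = _cvtsh_ss(h);                                        // fp16 -> fp32
    unsigned short back = _cvtss_sh(f, _MM_FROUND_TO_NEAREST_INT); // fp32 -> fp16 (RNE)
    if (back != h) ++mismatches;
  }
  std::printf("round-trip mismatches: %d\n", mismatches);          // expected: 0
  return mismatches != 0;
}
```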

Alternatives

No response

Additional context

Previous RFC on extending AMP fp16 on CPU:

Float16 support in TorchInductor is being worked on in parallel (implemented in a similar way to the BFloat16 support) and depends on the explicit vectorization utils of at::vec::Vectorized<Half>.

Pull requests related to this feature request:

cc @jgong5 @XiaobingSuper @sanchitintel @ashokei @jingxu10

@mingfeima added the module: cpu CPU specific problem (e.g., perf, algorithm), module: half Related to float16 half-precision floats, and feature A request for a proper, new feature. labels Mar 18, 2023
@jgong5 changed the title [Feature Proposal] CPU float16 performance optimization on eager mode. [RFC] CPU float16 performance optimization on eager mode. Mar 20, 2023
@jgong5 (Collaborator) commented Mar 20, 2023

More clarification on avx512-fp16: we are not going to leverage avx512-fp16 to optimize non-conv/gemm ATen ops (primarily pointwise and reduction ops). Instead, we plan to follow a similar "fused type cast" approach to the one we use to optimize those ops for the bf16 data type, i.e. fp16 data are converted to/from fp32 and the conversion is fused with the computation, which happens in fp32. The type cast will rely on the f16c instruction set, as noted by @mingfeima in the RFC description. This is due to the following considerations:

  1. avx512-fp16 support needs GCC 12, which introduces an extra compiler dependency for PyTorch, while f16c is well supported by older compilers.
  2. Reductions have to use an fp32 accumulation data type anyway (see the sketch after this list).
  3. Pointwise ops might benefit a bit from avx512-fp16 vs. the fused type cast approach, but we don't expect much, since they are memory bound in most cases.
  4. A similar fused type cast approach will be adopted in the Inductor CPP codegen as well. We expect this to be good enough thanks to the fusion of more ops.
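
To make point 2 concrete, here is a minimal sketch (not PyTorch code) of a vectorized fp16 sum that keeps its accumulator in fp32 and uses only F16C type casts rather than avx512-fp16 arithmetic; the function name and the multiple-of-8 assumption are for brevity.

```cpp
// Sum n fp16 values (given as raw 16-bit storage) with fp32 accumulator lanes.
// Compile with e.g. -mavx2 -mf16c.
#include <immintrin.h>
#include <cstddef>

float sum_fp16_fp32_acc(const unsigned short* src, std::size_t n) {
  __m256 acc = _mm256_setzero_ps();               // fp32 accumulator lanes
  for (std::size_t i = 0; i < n; i += 8) {
    __m128i h = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
    acc = _mm256_add_ps(acc, _mm256_cvtph_ps(h)); // widen to fp32, then add
  }
  // Horizontal reduction of the 8 fp32 lanes.
  alignas(32) float lanes[8];
  _mm256_store_ps(lanes, acc);
  float total = 0.0f;
  for (float v : lanes) total += v;
  return total;
}
```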

amx-fp16 will be leveraged to optimize conv and gemm ops via the oneDNN library.

@cpuhrsch added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Mar 21, 2023
CaoE added commits that referenced this issue on Sep 7, Sep 18, and Sep 19, 2023

The PRs are part of #97068, adding fp16 support for mkldnn conv and mkldnn deconv to leverage avx_ne_convert, avx512-fp16, and amx-fp16 via the oneDNN library.

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10
pytorchmergebot pushed a commit that referenced this issue Sep 19, 2023
The PR is part of #97068, which adds fp16 support for mkldnn conv and mkldnn deconv to leverage avx_ne_convert, avx512-fp16, and amx-fp16 via the oneDNN library.

Pull Request resolved: #99496
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
Projects
Status: In Progress