RFC: SDPA Optimization CPU #56
@mingfeima @jgong5, please have a look and help review, thanks.
> The following will be nice to have for PT 2.1:
>
> * Support data type of float16.
> * Enable the SDPA graph rewriting for Inductor.
I suppose the fusion would automatically apply to the inductor after we add the kernel support to SDPA, right?
Yes, the fusion is automatically applied now. However, there would be assertion checks for the intermediate outputs' shapes and strides, which are not the same for the CPU and CUDA implementations. For example, the output `logsumexp`, which is used for the backward calculation, has the shape `{batch_size, query_length, num_head}` on CPU and `{batch_size, num_head, query_length}` on CUDA. Besides, other outputs like `cum_seq_q` and `cum_seq_k` are not used and are assigned zero-size tensors on CPU. For now, I have temporarily disabled the CPU fusion path and will enable it once the issue mentioned above is fixed.
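To make the layout mismatch concrete, here is a minimal sketch (the tensor sizes are made up for illustration; these are plain stand-in tensors, not outputs of any internal API):

```python
import torch

# Stand-in tensors with the layouts described above (hypothetical sizes).
batch_size, num_head, query_length = 2, 16, 128

logsumexp_cpu = torch.empty(batch_size, query_length, num_head)   # CPU layout
logsumexp_cuda = torch.empty(batch_size, num_head, query_length)  # CUDA layout

# A shape/stride assertion written against the CUDA layout fails for the CPU
# output unless the tensor is permuted first.
assert logsumexp_cpu.shape != logsumexp_cuda.shape
assert logsumexp_cpu.permute(0, 2, 1).shape == logsumexp_cuda.shape
```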
Inductor currently falls back to the aten function for sdpa_flash and sdpa_efficient_attention, so it should be similar in this case, unless you are talking about the pattern matcher.
Yeah, after doing the fused attention pattern matching in Inductor, some graph nodes could be replaced with one SDPA node. If the SDPA node uses sdpa_flash, Inductor would check all of the outputs' shapes and strides in the generated C++ output code.
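As a rough sketch of what that rewrite amounts to, written in plain PyTorch rather than the actual matcher code (shapes and function names here are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def unfused_attention(q, k, v):
    # The decomposed graph the pattern matcher recognizes (roughly):
    # matmul -> scale -> softmax -> matmul
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

def fused_attention(q, k, v):
    # The single SDPA node the matched subgraph is replaced with.
    return F.scaled_dot_product_attention(q, k, v)

q, k, v = (torch.randn(1, 8, 128, 64) for _ in range(3))
torch.testing.assert_close(unfused_attention(q, k, v), fused_attention(q, k, v),
                           atol=1e-4, rtol=1e-4)
```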
`RFC-0025-sdpa-optm-cpu.md` (outdated):
> Here are the detailed implementation items:
>
> * The flash attention CPU kernel is added, in which both forward and backward paths are implemented for data types float32 and bfloat16.
> * Write an SDPA selecting function for CPU to automatically choose one SDPA implementation among several ones.
Can we elaborate more on the algorithm used by the selecting function?
Thanks, modified.
`RFC-0025-sdpa-optm-cpu.md` (outdated):
> Here are the detailed implementation items:
>
> * The flash attention CPU kernel is added, in which both forward and backward paths are implemented for data types float32 and bfloat16.
Please add more details on how the fusions are implemented on CPU.
Thanks, modified.
> ## **Performance**
> All validations are run on an SPR machine.
>
> ### NanoGPT's SDPA kernel
Can you also provide shape details from the benchmarks?
Thanks, added.
> For the SDPA optimization, there are two things that need to be discussed, and I hope to have your opinions.
>
> One is about the util functions for SDPA selection. The current util functions are under the CUDA folder, i.e. `transformers/cuda/sdp_utils`. For CPU, we have similar functions in `transformers/sdp_utils_cpp` (see #105131). It would be good to know whether we need to make them a unified API.
What's the problem with unifying these utils?
There are compilation issues when files in the CUDA folder include files in non-CUDA folders, and vice versa. I'll spend more time figuring out a solution.
I think this can be resolved. There is functionality needed for CUDA dispatching that requires the CUDA utils to be built, but the common utils can be abstracted out and shared between the two.
Thanks, will do it.
The common utils abstraction has been done in #105131.
cc @drisspg
LGTM. I know we are also working on EfficientAttention. Do we plan to submit a separate RFC or combine the two? @mingfeima
No, just this one.
Can you elaborate on the difference between EfficientAttention and what is discussed here?
`RFC-0025-sdpa-optm-cpu.md` (outdated):
> Here are the detailed implementation items:
>
> * The flash attention CPU kernel is added, in which both forward and backward paths are implemented for data types float32 and bfloat16. Blocking is applied on the dimensions of query length and kv length, and the fusion of gemm + softmax update + gemm is done at once for each block. Specifically, FP32In-FP32Out and FP32In-BF16Out adopt the MKL gemm and BF16In-BF16Out adopts the oneDNN one. Parallelization is on the dimensions of batch size, head number and query length for the forward path, and on the dimensions of batch size and head number for the backward path. In addition, the causal attention mask is supported. As the attention is masked for the unseen tokens, early termination is applied and we only calculate the blocks in the lower triangular part.
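For readers who want the quoted scheme spelled out, below is a plain-PyTorch sketch of the blocked forward pass (block sizes, loop order and the function name are illustrative; the real kernel is C++ with MKL/oneDNN gemms and OpenMP-style parallelism, so this is a reference for the math only):

```python
import math
import torch

def blocked_sdpa_forward(q, k, v, q_block=64, kv_block=64, causal=False):
    """Blocked attention forward with an online softmax, mirroring the
    gemm + softmax-update + gemm fusion per block described above."""
    B, H, Lq, E = q.shape
    Lk = k.shape[2]
    scale = 1.0 / math.sqrt(E)
    out = torch.empty_like(q)
    for b in range(B):                        # parallelized over batch in the kernel
        for h in range(H):                    # ... and over heads
            for qs in range(0, Lq, q_block):  # ... and over query blocks (forward)
                qe = min(qs + q_block, Lq)
                qi = q[b, h, qs:qe]                                   # [bq, E]
                acc = torch.zeros(qe - qs, E, dtype=q.dtype)
                row_max = torch.full((qe - qs,), float("-inf"), dtype=q.dtype)
                row_sum = torch.zeros(qe - qs, dtype=q.dtype)
                for ks in range(0, Lk, kv_block):
                    if causal and ks > qe - 1:
                        break                 # early termination above the diagonal
                    ke = min(ks + kv_block, Lk)
                    s = (qi @ k[b, h, ks:ke].transpose(0, 1)) * scale  # gemm 1
                    if causal:
                        qpos = torch.arange(qs, qe).unsqueeze(1)
                        kpos = torch.arange(ks, ke).unsqueeze(0)
                        s = s.masked_fill(kpos > qpos, float("-inf"))
                    new_max = torch.maximum(row_max, s.max(dim=-1).values)
                    corr = torch.exp(row_max - new_max)                # softmax update
                    p = torch.exp(s - new_max.unsqueeze(1))
                    row_sum = row_sum * corr + p.sum(dim=-1)
                    acc = acc * corr.unsqueeze(1) + p @ v[b, h, ks:ke]  # gemm 2
                    row_max = new_max
                out[b, h, qs:qe] = acc / row_sum.unsqueeze(1)
    return out

# Sanity check against the unfused math path (float32, causal).
q, k, v = (torch.randn(1, 2, 256, 64) for _ in range(3))
ref = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
torch.testing.assert_close(blocked_sdpa_forward(q, k, v, causal=True), ref,
                           atol=1e-4, rtol=1e-4)
```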
How is mixed_dtype supposed to be handled? Will it be built off of this: pytorch/pytorch#103333?
Currently, we only support the case where the three inputs (q, k, v) and the output have the same data type. Do you have any suggestions?
This makes sense. I guess I was confused about `FP32In-BF16Out`: is this used for the second bmm after the softmax?
Sorry, I accidentally made a typo. Fixed from `FP32In-BF16Out` to `BF16In-FP32Out`, which is used for the two bmms in the forward path when the input dtype is BF16.
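As a small illustration of the `BF16In-FP32Out` naming (plain PyTorch cannot request FP32 accumulation from a BF16 gemm directly, so the upcast is written out explicitly here; the real kernel keeps BF16 operands and obtains the FP32 result from the MKL/oneDNN gemm itself):

```python
import torch

q_blk = torch.randn(64, 64, dtype=torch.bfloat16)  # BF16 inputs to the block gemm
k_blk = torch.randn(64, 64, dtype=torch.bfloat16)

# BF16In-FP32Out: BF16 operands, FP32 result. The explicit .float() upcast is only
# a stand-in for the FP32 accumulation done inside the fused gemm.
scores_fp32 = q_blk.float() @ k_blk.float().transpose(0, 1)
assert scores_fp32.dtype == torch.float32
```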
Yeah, I think this will be great to have. I'm also still curious whether there is a difference from the EfficientAttention mentioned above, but otherwise this looks great.
The Efficient Attention mentioned above is another fused SDPA algorithm, which is enabled on CUDA but not on CPU yet. After the flash attention work is merged, we will also upstream Efficient Attention for CPU soon.
@drisspg BTW, does this RFC need to be merged? If so, could you please help merge it, as we don't have the permissions.
If you are referring to mem_eff_attention, this is just another implementation (by xformers) of the original flash attention algorithm. It was also included because it had different performance characteristics and different hardware support. I would be surprised if both are needed for CPU.
@drisspg Indeed, mem_eff_attention is another form of flash attention; we have done both on the CPU side, but mostly from the perspective of filling the feature gap between the CPU and CUDA devices in PyTorch. As @Valentine233 commented, we still prioritize flash attention and will continue to work on it for new features such as GQA, etc.
Okay, I would say you don't need to create a CPU impl for mem_eff_attention if it is going to behave identically to sdpa_flash_attention's CPU impl. We can just update the sdp_dispatcher to never dispatch to mem_eff for CPU. This is a small point, but I just want to make sure there isn't any wasted work. All that being said, I will merge this RFC in! Thanks for all the discussion.
Wow, cool. I did not know this before; then it totally makes sense to drop the CPU mem_eff_attention :)
Just back from PTO. Thanks for your comments. It's good that we only need to keep the one implementation that is best for CPU.
Feature RFC: pytorch/rfcs#56
TODO:
- [ ] Support for dropout>0
- [ ] Support Inductor path
cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 voznesenskym penguinwu EikanWang Guobing-Chen zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 kadeng muchulee8 aakhundov
Feature RFC: pytorch/rfcs#56. The flash attention CPU kernel is added, for forward path FP32. Blocking is applied on dimensions of query length and kv length and the fusion of gemm + softmax update + gemm is done at once for each block. Parallelization is on the dimensions of batch size, head number and query length. In addition, the causal attention mask is supported. As the attention is masked for the unseen tokens, early termination is applied and we only calculate the blocks in the lower triangular part.
Pull Request resolved: #103826
Approved by: https://github.com/drisspg, https://github.com/jgong5
ghstack dependencies: #104583, #104584

Feature RFC: pytorch/rfcs#56. The flash attention CPU kernel is added, for backward path FP32. Parallelization is on the dimensions of batch size and head number.
Pull Request resolved: #104693
Approved by: https://github.com/jgong5, https://github.com/drisspg
ghstack dependencies: #104583, #104584, #103826

Feature RFC: pytorch/rfcs#56. The support for BF16 is added in flash attention CPU kernel, for both forward and backward paths.
Pull Request resolved: #104863
Approved by: https://github.com/jgong5, https://github.com/drisspg
ghstack dependencies: #104583, #104584, #103826, #104693

Feature RFC: pytorch/rfcs#56. Enable the SDPA graph rewriting for Inductor CPU.
Pull Request resolved: #107128
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #104583, #104584, #103826, #104693, #104863
Feature RFC: pytorch/rfcs#56. Write an SDPA selecting function for CPU to automatically choose one SDPA implementation among several ones. There are two CPU implementations which could be chosen: the unfused SDPA and flash attention. In general, flash attention has a higher priority than the unfused SDPA. For cases where flash attention is not applicable, such as manually disabling flash attention or the inputs not 4 dimensional, the unfused SDPA is chosen.

## Performance of the stack

### NanoGPT's SDPA kernel
Using benchmark [repo](https://github.com/mingfeima/bench_sdpa/blob/main/README.md), with one socket. Shape: Batch size 1, Sequence length 1024, Head number 25, Head size 64. Machine: SPR.

| Dtype | Causal | Mode | SDPA | Time (ms per iter) | Speedup |
| -------- | -------- | ------- | ------- | ------- | ------- |
| float32 | FALSE | Inference | Unfused | 3.081 | |
| | | | Flash attention | 1.665 | **1.85045** |
| float32 | TRUE | Inference | Unfused | 3.463 | |
| | | | Flash attention | 1.662 | **2.083634** |
| bfloat16 | FALSE | Inference | Unfused | 1.203 | |
| | | | Flash attention | 1.154 | **1.042461** |
| bfloat16 | TRUE | Inference | Unfused | 1.543 | |
| | | | Flash attention | 1.154 | **1.337088** |
| float32 | FALSE | Training | Unfused | 54.938 | |
| | | | Flash attention | 23.029 | **2.385601** |
| float32 | TRUE | Training | Unfused | 58.266 | |
| | | | Flash attention | 17.835 | **3.266947** |
| bfloat16 | FALSE | Training | Unfused | 18.924 | |
| | | | Flash attention | 18.886 | **1.002012** |
| bfloat16 | TRUE | Training | Unfused | 21.08 | |
| | | | Flash attention | 14.172 | **1.48744** |

### Stable Diffusion
Following model's [BKM](https://github.com/intel-innersource/frameworks.ai.models.intel-models/blob/develop/quickstart/diffusion/pytorch/stable_diffusion/inference/cpu/README.md). Mode: Inference; Machine: SPR.

| Dtype | SDPA | Throughput (fps) | Speedup SDPA | Total Time (ms) | Speedup |
| -------- | -------- | ------- | ------- | ------- | ------- |
| float32 | Unfused | 1.63 | | 1139 | |
| | Flash attention | 1.983 | 1.216564 | 547.488 | **2.080411** |
| bfloat16 | Flash attention in IPEX | 4.784 | | 429.051 | |
| | Flash attention | 4.857 | 1.015259 | 408.823 | **1.049479** |

### LLM models of Torchbench
Dtype: float32; Mode: Inference, single socket; Machine: CPX.

Model name | SDPA | Inductor_new | Inductor_old | Inductor Ratio(old/new)
-- | -- | -- | -- | --
hf_Albert | Unfused -> Flash attention | 0.048629309 | 0.05591545 | **1.14983024**
hf_Bert | Unfused -> Flash attention | 0.053156243 | 0.060732115 | **1.142520841**
hf_Bert_large | Unfused -> Flash attention | 0.141089502 | 0.155190077 | **1.099940636**
llama | Unfused -> Flash attention | 0.033250106 | 0.033720745 | **1.01415451**

Dtype: bfloat16; Mode: Inference, single socket; Machine: SPR.

Model name | SDPA | Inductor_new | Inductor_old | Inductor Ratio(old/new)
-- | -- | -- | -- | --
hf_Albert | Unfused -> Flash attention | 0.020681298 | 0.020718282 | **1.001788324**
hf_Bert | Unfused -> Flash attention | 0.019932816 | 0.019935424 | **1.000130842**
hf_Bert_large | Unfused -> Flash attention | 0.047949174 | 0.048312502 | **1.007577355**
llama | Unfused -> Flash attention | 0.018528057 | 0.01861126 | **1.0044907**

Pull Request resolved: #105131
Approved by: https://github.com/drisspg
ghstack dependencies: #104583, #104584, #103826, #104693, #104863, #107128
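A minimal Python sketch of the selection logic described in the commit message above (the enum and function names here are hypothetical; the actual implementation is the C++ `sdp_utils_cpp` added in #105131):

```python
from enum import Enum, auto
import torch

class Backend(Enum):
    FLASH_ATTENTION = auto()   # fused CPU flash attention kernel
    MATH = auto()              # the unfused SDPA fallback

def select_sdp_backend_cpu(query, key, value, flash_enabled=True):
    # Flash attention has higher priority; fall back to the unfused SDPA when it
    # is manually disabled or the inputs are not 4-dimensional. Per the
    # discussion above, mem-efficient attention is not dispatched to on CPU.
    inputs_are_4d = query.dim() == key.dim() == value.dim() == 4
    if flash_enabled and inputs_are_4d:
        return Backend.FLASH_ATTENTION
    return Backend.MATH

q = k = v = torch.randn(1, 8, 128, 64)
assert select_sdp_backend_cpu(q, k, v) is Backend.FLASH_ATTENTION
assert select_sdp_backend_cpu(q.squeeze(0), k.squeeze(0), v.squeeze(0)) is Backend.MATH
```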