RFC: SDPA Optimization CPU #56
@mingfeima @jgong5, please have a look and help review, thanks.
> The following will be nice to have for PT 2.1:
>
> * Support data type of float16.
> * Enable the SDPA graph rewriting for Inductor.
I suppose the fusion would automatically apply to the inductor after we add the kernel support to SDPA, right?
Yes, the fusion is automatically applied now. However, there would be assertion checks for the intermediate outputs' shapes and strides, which are not the same for the CPU and CUDA implementations. For example, the output `logsumexp`, which is used for the backward calculation, has the shape `{batch_size, query_length, num_head}` on CPU and `{batch_size, num_head, query_length}` on CUDA. Besides, other outputs like `cum_seq_q` and `cum_seq_k` are not used and are assigned zero-size tensors on CPU. For now, I have temporarily disabled the CPU fusion path and will enable it once the issue mentioned above is fixed.
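To make the layout mismatch concrete, here is a minimal sketch (the tensor sizes are made up for illustration; these are plain stand-in tensors, not outputs of any internal API):

```python
import torch

# Stand-in tensors with the layouts described above (hypothetical sizes).
batch_size, num_head, query_length = 2, 16, 128

logsumexp_cpu = torch.empty(batch_size, query_length, num_head)   # CPU layout
logsumexp_cuda = torch.empty(batch_size, num_head, query_length)  # CUDA layout

# A shape/stride assertion written against the CUDA layout fails for the CPU
# output unless the tensor is permuted first.
assert logsumexp_cpu.shape != logsumexp_cuda.shape
assert logsumexp_cpu.permute(0, 2, 1).shape == logsumexp_cuda.shape
```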
Inductor currently falls back to the aten function for sdpa_flash and sdpa_efficient_attention, so it should be similar in this case, unless you are talking about the pattern matcher.
Yeah, after doing the fused attention pattern matching in Inductor, some graph nodes could be replaced with one SDPA node. If the SDPA node uses sdpa_flash, Inductor would check all of the outputs' shapes and strides in the generated C++ output code.
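As a rough sketch of what that rewrite amounts to, written in plain PyTorch rather than the actual matcher code (shapes and function names here are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def unfused_attention(q, k, v):
    # The decomposed graph the pattern matcher recognizes (roughly):
    # matmul -> scale -> softmax -> matmul
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

def fused_attention(q, k, v):
    # The single SDPA node the matched subgraph is replaced with.
    return F.scaled_dot_product_attention(q, k, v)

q, k, v = (torch.randn(1, 8, 128, 64) for _ in range(3))
torch.testing.assert_close(unfused_attention(q, k, v), fused_attention(q, k, v),
                           atol=1e-4, rtol=1e-4)
```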
`RFC-0025-sdpa-optm-cpu.md` (outdated):
> Here are the detailed implementation items:
>
> * The flash attention CPU kernel is added, in which both forward and backward paths are implemented for data types float32 and bfloat16.
> * Write an SDPA selecting function for CPU to automatically choose one SDPA implementation among several ones.
Can we elaborate more on the algorithm used by the selecting function?
Thanks, modified.
`RFC-0025-sdpa-optm-cpu.md` (outdated):
> Here are the detailed implementation items:
>
> * The flash attention CPU kernel is added, in which both forward and backward paths are implemented for data types float32 and bfloat16.
Please add more details on how the fusions are implemented on CPU.
Thanks, modified.
> ## **Performance**
> All validations are run on an SPR machine.
>
> ### NanoGPT's SDPA kernel
Can you also provide shape details from the benchmarks?
Thanks, added.
> For the SDPA optimization, there are two things that need to be discussed, and I hope to have your opinions.
>
> One is about the util functions for SDPA selection. The current util functions are under the CUDA folder, i.e. `transformers/cuda/sdp_utils`. For CPU, we have similar functions in `transformers/sdp_utils_cpp` (see #105131). It would be good to know whether we need to make them a unified API.
What's the problem with unifying these utils?
There are compilation issues when files in the CUDA folder include files in non-CUDA folders, and vice versa. I'll spend more time figuring out a solution.
I think this can be resolved. There is functionality needed for CUDA dispatching that requires the CUDA utils to be built, but the common utils can be abstracted out and shared between the two.
Thanks, will do it.
The common utils abstraction has been done in #105131.
cc @drisspg
LGTM. I know we are also working on EfficientAttention. Do we plan to submit a separate RFC or combine the two? @mingfeima
No, just this one.
Can you elaborate on the difference between EfficientAttention and what is discussed here?
`RFC-0025-sdpa-optm-cpu.md` (outdated):
> Here are the detailed implementation items:
>
> * The flash attention CPU kernel is added, in which both forward and backward paths are implemented for data types float32 and bfloat16. Blocking is applied on the dimensions of query length and kv length, and the fusion of gemm + softmax update + gemm is done at once for each block. Specifically, FP32In-FP32Out and FP32In-BF16Out adopt the MKL gemm and BF16In-BF16Out adopts the oneDNN one. Parallelization is on the dimensions of batch size, head number and query length for the forward path, and on the dimensions of batch size and head number for the backward path. In addition, the causal attention mask is supported. As the attention is masked for the unseen tokens, early termination is applied and we only calculate the blocks in the lower triangular part.
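For readers who want the quoted scheme spelled out, below is a plain-PyTorch sketch of the blocked forward pass (block sizes, loop order and the function name are illustrative; the real kernel is C++ with MKL/oneDNN gemms and OpenMP-style parallelism, so this is a reference for the math only):

```python
import math
import torch

def blocked_sdpa_forward(q, k, v, q_block=64, kv_block=64, causal=False):
    """Blocked attention forward with an online softmax, mirroring the
    gemm + softmax-update + gemm fusion per block described above."""
    B, H, Lq, E = q.shape
    Lk = k.shape[2]
    scale = 1.0 / math.sqrt(E)
    out = torch.empty_like(q)
    for b in range(B):                        # parallelized over batch in the kernel
        for h in range(H):                    # ... and over heads
            for qs in range(0, Lq, q_block):  # ... and over query blocks (forward)
                qe = min(qs + q_block, Lq)
                qi = q[b, h, qs:qe]                                   # [bq, E]
                acc = torch.zeros(qe - qs, E, dtype=q.dtype)
                row_max = torch.full((qe - qs,), float("-inf"), dtype=q.dtype)
                row_sum = torch.zeros(qe - qs, dtype=q.dtype)
                for ks in range(0, Lk, kv_block):
                    if causal and ks > qe - 1:
                        break                 # early termination above the diagonal
                    ke = min(ks + kv_block, Lk)
                    s = (qi @ k[b, h, ks:ke].transpose(0, 1)) * scale  # gemm 1
                    if causal:
                        qpos = torch.arange(qs, qe).unsqueeze(1)
                        kpos = torch.arange(ks, ke).unsqueeze(0)
                        s = s.masked_fill(kpos > qpos, float("-inf"))
                    new_max = torch.maximum(row_max, s.max(dim=-1).values)
                    corr = torch.exp(row_max - new_max)                # softmax update
                    p = torch.exp(s - new_max.unsqueeze(1))
                    row_sum = row_sum * corr + p.sum(dim=-1)
                    acc = acc * corr.unsqueeze(1) + p @ v[b, h, ks:ke]  # gemm 2
                    row_max = new_max
                out[b, h, qs:qe] = acc / row_sum.unsqueeze(1)
    return out

# Sanity check against the unfused math path (float32, causal).
q, k, v = (torch.randn(1, 2, 256, 64) for _ in range(3))
ref = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
torch.testing.assert_close(blocked_sdpa_forward(q, k, v, causal=True), ref,
                           atol=1e-4, rtol=1e-4)
```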
How is mixed_dtype supposed to be handled? Will it be built off of this: pytorch/pytorch#103333?
Currently, we only support the case where the three inputs (q, k, v) and the output have the same data type. Do you have any suggestions?
This makes sense. I guess I was confused about `FP32In-BF16Out`: is this used for the second bmm after the softmax?
Sorry, I accidentally made a typo. Fixed from `FP32In-BF16Out` to `BF16In-FP32Out`, which is used for the two bmms in the forward path when the input dtype is BF16.
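As a small illustration of the `BF16In-FP32Out` naming (plain PyTorch cannot request FP32 accumulation from a BF16 gemm directly, so the upcast is written out explicitly here; the real kernel keeps BF16 operands and obtains the FP32 result from the MKL/oneDNN gemm itself):

```python
import torch

q_blk = torch.randn(64, 64, dtype=torch.bfloat16)  # BF16 inputs to the block gemm
k_blk = torch.randn(64, 64, dtype=torch.bfloat16)

# BF16In-FP32Out: BF16 operands, FP32 result. The explicit .float() upcast is only
# a stand-in for the FP32 accumulation done inside the fused gemm.
scores_fp32 = q_blk.float() @ k_blk.float().transpose(0, 1)
assert scores_fp32.dtype == torch.float32
```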
Yeah, I think this will be great to have. I'm also still curious whether there is a difference from the EfficientAttention mentioned above, but otherwise this looks great.
The Efficient Attention mentioned above is another fused SDPA algorithm, which is enabled on CUDA but not on CPU yet. After the flash attention work is merged, we will also upstream Efficient Attention for CPU soon.
@drisspg BTW, does this RFC need to be merged? If so, could you please help merge it, as we don't have the permissions.
If you are referring to mem_eff_attention, this is just another implementation (by xformers) of the original flash attention algorithm. It was also included because it had different performance characteristics and different hardware support. I would be surprised if both are needed for CPU.
@drisspg Indeed, mem_eff_attention is another form of flash attention; we have done both on the CPU side, but mostly from the perspective of filling the feature gap between the CPU and CUDA devices in PyTorch. As @Valentine233 commented, we still prioritize flash attention and will continue to work on it for new features such as GQA, etc.
Okay, I would say you don't need to create a CPU impl for mem_eff_attention if it is going to behave identically to sdpa_flash_attention's CPU impl. We can just update the sdp_dispatcher to never dispatch to mem_eff for CPU. This is a small point, but I just want to make sure there isn't any wasted work. All that being said, I will merge this RFC in! Thanks for all the discussion.
Wow, cool. I did not know this before; then it totally makes sense to drop the CPU mem_eff_attention :)
Just back from PTO. Thanks for your comments. It's good that we only need to keep the one implementation that is best for CPU.
Feature RFC: pytorch/rfcs#56
TODO:
- [ ] Support for dropout>0
- [ ] Support Inductor path
cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 voznesenskym penguinwu EikanWang Guobing-Chen zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 kadeng muchulee8 aakhundov
Feature RFC: pytorch/rfcs#56. The flash attention CPU kernel is added, for forward path FP32. Blocking is applied on dimensions of query length and kv length and the fusion of gemm + softmax update + gemm is done at once for each block. Parallelization is on the dimensions of batch size, head number and query length. In addition, the causal attention mask is supported. As the attention is masked for the unseen tokens, early termination is applied and we only calculate the blocks in the lower triangular part.
Pull Request resolved: #103826
Approved by: https://github.com/drisspg, https://github.com/jgong5
ghstack dependencies: #104583, #104584

Feature RFC: pytorch/rfcs#56. The flash attention CPU kernel is added, for backward path FP32. Parallelization is on the dimensions of batch size and head number.
Pull Request resolved: #104693
Approved by: https://github.com/jgong5, https://github.com/drisspg
ghstack dependencies: #104583, #104584, #103826

Feature RFC: pytorch/rfcs#56. The support for BF16 is added in flash attention CPU kernel, for both forward and backward paths.
Pull Request resolved: #104863
Approved by: https://github.com/jgong5, https://github.com/drisspg
ghstack dependencies: #104583, #104584, #103826, #104693

Feature RFC: pytorch/rfcs#56. Enable the SDPA graph rewriting for Inductor CPU.
Pull Request resolved: #107128
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #104583, #104584, #103826, #104693, #104863
Feature RFC: pytorch/rfcs#56. Write an SDPA selecting function for CPU to automatically choose one SDPA implementation among several ones. There are two CPU implementations which could be chosen: the unfused SDPA and flash attention. In general, flash attention has a higher priority than the unfused SDPA. For cases where flash attention is not applicable, such as manually disabling flash attention or the inputs not 4 dimensional, the unfused SDPA is chosen.

## Performance of the stack

### NanoGPT's SDPA kernel
Using benchmark [repo](https://github.com/mingfeima/bench_sdpa/blob/main/README.md), with one socket. Shape: Batch size 1, Sequence length 1024, Head number 25, Head size 64. Machine: SPR.

| Dtype | Causal | Mode | SDPA | Time (ms per iter) | Speedup |
| -------- | -------- | ------- | ------- | ------- | ------- |
| float32 | FALSE | Inference | Unfused | 3.081 | |
| | | | Flash attention | 1.665 | **1.85045** |
| float32 | TRUE | Inference | Unfused | 3.463 | |
| | | | Flash attention | 1.662 | **2.083634** |
| bfloat16 | FALSE | Inference | Unfused | 1.203 | |
| | | | Flash attention | 1.154 | **1.042461** |
| bfloat16 | TRUE | Inference | Unfused | 1.543 | |
| | | | Flash attention | 1.154 | **1.337088** |
| float32 | FALSE | Training | Unfused | 54.938 | |
| | | | Flash attention | 23.029 | **2.385601** |
| float32 | TRUE | Training | Unfused | 58.266 | |
| | | | Flash attention | 17.835 | **3.266947** |
| bfloat16 | FALSE | Training | Unfused | 18.924 | |
| | | | Flash attention | 18.886 | **1.002012** |
| bfloat16 | TRUE | Training | Unfused | 21.08 | |
| | | | Flash attention | 14.172 | **1.48744** |

### Stable Diffusion
Following model's [BKM](https://github.com/intel-innersource/frameworks.ai.models.intel-models/blob/develop/quickstart/diffusion/pytorch/stable_diffusion/inference/cpu/README.md). Mode: Inference; Machine: SPR.

| Dtype | SDPA | Throughput (fps) | Speedup SDPA | Total Time (ms) | Speedup |
| -------- | -------- | ------- | ------- | ------- | ------- |
| float32 | Unfused | 1.63 | | 1139 | |
| | Flash attention | 1.983 | 1.216564 | 547.488 | **2.080411** |
| bfloat16 | Flash attention in IPEX | 4.784 | | 429.051 | |
| | Flash attention | 4.857 | 1.015259 | 408.823 | **1.049479** |

### LLM models of Torchbench
Dtype: float32; Mode: Inference, single socket; Machine: CPX.

Model name | SDPA | Inductor_new | Inductor_old | Inductor Ratio(old/new)
-- | -- | -- | -- | --
hf_Albert | Unfused -> Flash attention | 0.048629309 | 0.05591545 | **1.14983024**
hf_Bert | Unfused -> Flash attention | 0.053156243 | 0.060732115 | **1.142520841**
hf_Bert_large | Unfused -> Flash attention | 0.141089502 | 0.155190077 | **1.099940636**
llama | Unfused -> Flash attention | 0.033250106 | 0.033720745 | **1.01415451**

Dtype: bfloat16; Mode: Inference, single socket; Machine: SPR.

Model name | SDPA | Inductor_new | Inductor_old | Inductor Ratio(old/new)
-- | -- | -- | -- | --
hf_Albert | Unfused -> Flash attention | 0.020681298 | 0.020718282 | **1.001788324**
hf_Bert | Unfused -> Flash attention | 0.019932816 | 0.019935424 | **1.000130842**
hf_Bert_large | Unfused -> Flash attention | 0.047949174 | 0.048312502 | **1.007577355**
llama | Unfused -> Flash attention | 0.018528057 | 0.01861126 | **1.0044907**

Pull Request resolved: #105131
Approved by: https://github.com/drisspg
ghstack dependencies: #104583, #104584, #103826, #104693, #104863, #107128
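A minimal Python sketch of the selection logic described in the commit message above (the enum and function names here are hypothetical; the actual implementation is the C++ `sdp_utils_cpp` added in #105131):

```python
from enum import Enum, auto
import torch

class Backend(Enum):
    FLASH_ATTENTION = auto()   # fused CPU flash attention kernel
    MATH = auto()              # the unfused SDPA fallback

def select_sdp_backend_cpu(query, key, value, flash_enabled=True):
    # Flash attention has higher priority; fall back to the unfused SDPA when it
    # is manually disabled or the inputs are not 4-dimensional. Per the
    # discussion above, mem-efficient attention is not dispatched to on CPU.
    inputs_are_4d = query.dim() == key.dim() == value.dim() == 4
    if flash_enabled and inputs_are_4d:
        return Backend.FLASH_ATTENTION
    return Backend.MATH

q = k = v = torch.randn(1, 8, 128, 64)
assert select_sdp_backend_cpu(q, k, v) is Backend.FLASH_ATTENTION
assert select_sdp_backend_cpu(q.squeeze(0), k.squeeze(0), v.squeeze(0)) is Backend.MATH
```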