Skip to content

[ISel] Commute FMUL and inserting zero into vector lane #146096

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

juliannagele
Copy link
Member

When inserting zero into a vector that is the result of a multiplication we can do the insertion earlier, i.e. into an operand of the fmul instead.

@llvmbot
Copy link
Member

llvmbot commented Jun 27, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Julian Nagele (juliannagele)

Changes

When inserting zero into a vector that is the result of a multiplication we can do the insertion earlier, i.e. into an operand of the fmul instead.


Full diff: https://github.com/llvm/llvm-project/pull/146096.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+32)
  • (modified) llvm/test/CodeGen/AArch64/arm64-vmul.ll (+10)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index c8b1eafd35495..867833f67b822 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -26268,11 +26268,43 @@ static SDValue removeRedundantInsertVectorElt(SDNode *N) {
   return ExtractVec;
 }
 
+static SDValue commuteInsertVectorEltFMul(SDNode *N, SelectionDAG &DAG) {
+  assert(N->getOpcode() == ISD::INSERT_VECTOR_ELT && "Unexpected node!");
+  SDValue InsertVec = N->getOperand(0);
+  SDValue InsertVal = N->getOperand(1);
+  SDValue InsertIdx = N->getOperand(2);
+
+  // Only handle constant 0 insertion...
+  if (!(isNullConstant(InsertVal) || isNullFPConstant(InsertVal)))
+    return SDValue();
+  // ... into the result of an FMUL.
+  if (InsertVec.getOpcode() != ISD::FMUL)
+    return SDValue();
+
+  // Insert into the operand of FMUL instead.
+  SDValue FMulOp = InsertVec.getOperand(0);
+
+  if (!InsertVec.hasOneUse() || !FMulOp.hasOneUse())
+    return SDValue();
+
+  SDValue InsertOp =
+      DAG.getNode(ISD::INSERT_VECTOR_ELT, SDLoc(N), FMulOp.getValueType(),
+                  FMulOp, InsertVal, InsertIdx);
+  SDValue FMul =
+      DAG.getNode(ISD::FMUL, SDLoc(InsertVec), InsertVec.getValueType(),
+                  InsertOp, InsertVec.getOperand(1));
+  DAG.ReplaceAllUsesWith(N, &FMul);
+  return FMul;
+}
+
 static SDValue
 performInsertVectorEltCombine(SDNode *N, TargetLowering::DAGCombinerInfo &DCI) {
   if (SDValue Res = removeRedundantInsertVectorElt(N))
     return Res;
 
+  if (SDValue Res = commuteInsertVectorEltFMul(N, DCI.DAG))
+    return Res;
+
   return performPostLD1Combine(N, DCI, true);
 }
 
diff --git a/llvm/test/CodeGen/AArch64/arm64-vmul.ll b/llvm/test/CodeGen/AArch64/arm64-vmul.ll
index 937a17ca6c1e0..1e33f81b9b835 100644
--- a/llvm/test/CodeGen/AArch64/arm64-vmul.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-vmul.ll
@@ -1186,6 +1186,16 @@ define double @fmul_lane_d(double %A, <2 x double> %vec) nounwind {
   ret double %res
 }
 
+define <4 x float> @fmul_insert_zero(<4 x float> %A, <4 x float> %B) {
+; CHECK-LABEL: fmul_insert_zero:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov.s v0[3], wzr
+; CHECK-NEXT:    fmul.4s v0, v0, v1
+; CHECK-NEXT:    ret
+  %mul = fmul <4 x float> %A, %B
+  %mul_set_lane = insertelement <4 x float> %mul, float 0.000000e+00, i64 3
+  ret <4 x float> %mul_set_lane
+}
 
 
 define <2 x float> @fmulx_lane_2s(<2 x float> %A, <2 x float> %B) nounwind {

Copy link
Collaborator

@davemgreen davemgreen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give context why this is better? Thanks

@juliannagele
Copy link
Member Author

We've observed that doing the insert on an operand of the mul we potentially reduce the critical path, in particular in combination with #146538

Copy link
Contributor

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To add maybe a bit more context, commuting the ops may allow the insert to happen earlier, while if we do it after the mul we need to wait for the mul (and its other operand) to be completed.

It should not increase the critial path I think, so in the worst case should not make things worse

@davemgreen
Copy link
Collaborator

OK sounds good. Like you said I don't think this would be worse (and I guess there is a chance to generate fma?). Could we add a test for a case that shows improvement?

How do we know which operand to zero? If it needs trace info then that is only available later in MachineCombiner. (Sometimes I can see that it replies on other DAG optimizations, in which case it sounds OK, it just might not always be reliable).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants