[LoadStoreVectorizer] Propagate alignment through contiguous chain #145733
Conversation
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-vectorizers

Author: Drew Kersnar (dakersnar)

Changes

At this point in the vectorization pass, we are guaranteed to have a contiguous chain with defined offsets for each element. Using this information, we can derive and upgrade the alignment of elements in the chain based on their offset from previous well-aligned elements. This enables vectorization of chains that are longer than the maximum vector length of the target. The algorithm is also robust to the head of the chain not being well-aligned: if we find a better alignment while iterating from the beginning to the end of the chain, we use that alignment going forward.

Full diff: https://github.com/llvm/llvm-project/pull/145733.diff

2 Files Affected:
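Before the diff itself, here is a standalone, hedged sketch of the offset arithmetic the description refers to, written against LLVM's real Align utilities (llvm/Support/Alignment.h). The chain offsets and alignments below are made-up illustration values, not taken from the patch; commonAlignment(A, Offset) returns the largest power of two dividing both of its arguments.

#include "llvm/Support/Alignment.h"
#include <cstdint>
#include <cstdio>

using llvm::Align;
using llvm::commonAlignment;

int main() {
  // Hypothetical contiguous chain: byte offset from the leader, plus the
  // alignment currently attached to each access. Illustration only.
  struct Elem { uint64_t Offset; Align A; };
  Elem Chain[] = {{0, Align(16)}, {4, Align(4)},  {8, Align(8)},
                  {12, Align(4)}, {16, Align(4)}, {28, Align(4)}};

  uint64_t BestOffset = Chain[0].Offset;
  Align BestAlign = Chain[0].A;
  for (Elem &E : Chain) {
    if (E.A > BestAlign) { // a better-aligned element resets the reference
      BestAlign = E.A;
      BestOffset = E.Offset;
    }
    // Greatest common power-of-two divisor of the reference alignment and
    // the distance from the reference element.
    Align New = commonAlignment(BestAlign, E.Offset - BestOffset);
    if (New > E.A)
      E.A = New; // e.g. offset 16 upgrades to align 16, offset 28 to align 4
    std::printf("offset %2llu -> align %llu\n",
                (unsigned long long)E.Offset,
                (unsigned long long)E.A.value());
  }
  return 0;
}

Run on these values, the element 16 bytes past the 16-byte-aligned leader becomes 16-byte aligned, which is exactly the upgrade the CHECK lines in the tests below rely on for the second <4 x i32>/<4 x float> access.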
diff --git a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
index 89f63c3b66aad..e14a936b764e5 100644
--- a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
@@ -343,6 +343,9 @@ class Vectorizer {
/// Postcondition: For all i, ret[i][0].second == 0, because the first instr
/// in the chain is the leader, and an instr touches distance 0 from itself.
std::vector<Chain> gatherChains(ArrayRef<Instruction *> Instrs);
+
+ /// Propagates the best alignment in a chain of contiguous accesses
+ void propagateBestAlignmentsInChain(ArrayRef<ChainElem> C) const;
};
class LoadStoreVectorizerLegacyPass : public FunctionPass {
@@ -716,6 +719,14 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
unsigned AS = getLoadStoreAddressSpace(C[0].Inst);
unsigned VecRegBytes = TTI.getLoadStoreVecRegBitWidth(AS) / 8;
+ // We know that the accesses are contiguous. Propagate alignment
+ // information so that slices of the chain can still be vectorized.
+ propagateBestAlignmentsInChain(C);
+ LLVM_DEBUG({
+ dbgs() << "LSV: Chain after alignment propagation:\n";
+ dumpChain(C);
+ });
+
std::vector<Chain> Ret;
for (unsigned CBegin = 0; CBegin < C.size(); ++CBegin) {
// Find candidate chains of size not greater than the largest vector reg.
@@ -1634,3 +1645,27 @@ std::optional<APInt> Vectorizer::getConstantOffset(Value *PtrA, Value *PtrB,
.sextOrTrunc(OrigBitWidth);
return std::nullopt;
}
+
+void Vectorizer::propagateBestAlignmentsInChain(ArrayRef<ChainElem> C) const {
+ ChainElem BestAlignedElem = C[0];
+ Align BestAlignSoFar = getLoadStoreAlignment(C[0].Inst);
+
+ for (const ChainElem &E : C) {
+ Align OrigAlign = getLoadStoreAlignment(E.Inst);
+ if (OrigAlign > BestAlignSoFar) {
+ BestAlignedElem = E;
+ BestAlignSoFar = OrigAlign;
+ }
+
+ APInt OffsetFromBestAlignedElem =
+ E.OffsetFromLeader - BestAlignedElem.OffsetFromLeader;
+ assert(OffsetFromBestAlignedElem.isNonNegative());
+ // commonAlignment is equivalent to a greatest common power-of-two divisor;
+ // it returns the largest power of 2 that divides both A and B.
+ Align NewAlign = commonAlignment(
+ BestAlignSoFar, OffsetFromBestAlignedElem.getLimitedValue());
+ if (NewAlign > OrigAlign)
+ setLoadStoreAlignment(E.Inst, NewAlign);
+ }
+ return;
+}
diff --git a/llvm/test/Transforms/LoadStoreVectorizer/prop-align.ll b/llvm/test/Transforms/LoadStoreVectorizer/prop-align.ll
new file mode 100644
index 0000000000000..a1878dc051d99
--- /dev/null
+++ b/llvm/test/Transforms/LoadStoreVectorizer/prop-align.ll
@@ -0,0 +1,296 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -passes=load-store-vectorizer -S < %s | FileCheck %s
+
+; The IR has the first float3 labeled with align 16, and that 16 should
+; be propagated such that the second set of 4 values
+; can also be vectorized together.
+%struct.float3 = type { float, float, float }
+%struct.S1 = type { %struct.float3, %struct.float3, i32, i32 }
+
+define void @testStore(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testStore(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: store <4 x float> zeroinitializer, ptr [[TMP0]], align 16
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds [[STRUCT_S1:%.*]], ptr [[TMP0]], i64 0, i32 1, i32 1
+; CHECK-NEXT: store <4 x i32> zeroinitializer, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: ret void
+;
+ store float 0.000000e+00, ptr %1, align 16
+ %getElem = getelementptr inbounds %struct.float3, ptr %1, i64 0, i32 1
+ store float 0.000000e+00, ptr %getElem, align 4
+ %getElem8 = getelementptr inbounds %struct.float3, ptr %1, i64 0, i32 2
+ store float 0.000000e+00, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1
+ store float 0.000000e+00, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1, i32 1
+ store float 0.000000e+00, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1, i32 2
+ store float 0.000000e+00, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 2
+ store i32 0, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 3
+ store i32 0, ptr %getElem13, align 4
+ ret void
+}
+
+define void @testLoad(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testLoad(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, ptr [[TMP0]], align 16
+; CHECK-NEXT: [[L11:%.*]] = extractelement <4 x float> [[TMP2]], i32 0
+; CHECK-NEXT: [[L22:%.*]] = extractelement <4 x float> [[TMP2]], i32 1
+; CHECK-NEXT: [[L33:%.*]] = extractelement <4 x float> [[TMP2]], i32 2
+; CHECK-NEXT: [[L44:%.*]] = extractelement <4 x float> [[TMP2]], i32 3
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds [[STRUCT_S1:%.*]], ptr [[TMP0]], i64 0, i32 1, i32 1
+; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: [[L55:%.*]] = extractelement <4 x i32> [[TMP3]], i32 0
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32 [[L55]] to float
+; CHECK-NEXT: [[L66:%.*]] = extractelement <4 x i32> [[TMP3]], i32 1
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast i32 [[L66]] to float
+; CHECK-NEXT: [[L77:%.*]] = extractelement <4 x i32> [[TMP3]], i32 2
+; CHECK-NEXT: [[L88:%.*]] = extractelement <4 x i32> [[TMP3]], i32 3
+; CHECK-NEXT: ret void
+;
+ %l1 = load float, ptr %1, align 16
+ %getElem = getelementptr inbounds %struct.float3, ptr %1, i64 0, i32 1
+ %l2 = load float, ptr %getElem, align 4
+ %getElem8 = getelementptr inbounds %struct.float3, ptr %1, i64 0, i32 2
+ %l3 = load float, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1
+ %l4 = load float, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1, i32 1
+ %l5 = load float, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1, i32 2
+ %l6 = load float, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 2
+ %l7 = load i32, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 3
+ %l8 = load i32, ptr %getElem13, align 4
+ ret void
+}
+
+; Also test without the struct geps, to check that it still works with i8 geps/ptradd.
+
+define void @testStorei8(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testStorei8(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: store <4 x float> zeroinitializer, ptr [[TMP0]], align 16
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 16
+; CHECK-NEXT: store <4 x i32> zeroinitializer, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: ret void
+;
+ store float 0.000000e+00, ptr %1, align 16
+ %getElem = getelementptr inbounds i8, ptr %1, i64 4
+ store float 0.000000e+00, ptr %getElem, align 4
+ %getElem8 = getelementptr inbounds i8, ptr %1, i64 8
+ store float 0.000000e+00, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds i8, ptr %1, i64 12
+ store float 0.000000e+00, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds i8, ptr %1, i64 16
+ store float 0.000000e+00, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds i8, ptr %1, i64 20
+ store float 0.000000e+00, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds i8, ptr %1, i64 24
+ store i32 0, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds i8, ptr %1, i64 28
+ store i32 0, ptr %getElem13, align 4
+ ret void
+}
+
+define void @testLoadi8(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testLoadi8(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, ptr [[TMP0]], align 16
+; CHECK-NEXT: [[L11:%.*]] = extractelement <4 x float> [[TMP2]], i32 0
+; CHECK-NEXT: [[L22:%.*]] = extractelement <4 x float> [[TMP2]], i32 1
+; CHECK-NEXT: [[L33:%.*]] = extractelement <4 x float> [[TMP2]], i32 2
+; CHECK-NEXT: [[L44:%.*]] = extractelement <4 x float> [[TMP2]], i32 3
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 16
+; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: [[L55:%.*]] = extractelement <4 x i32> [[TMP3]], i32 0
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32 [[L55]] to float
+; CHECK-NEXT: [[L66:%.*]] = extractelement <4 x i32> [[TMP3]], i32 1
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast i32 [[L66]] to float
+; CHECK-NEXT: [[L77:%.*]] = extractelement <4 x i32> [[TMP3]], i32 2
+; CHECK-NEXT: [[L88:%.*]] = extractelement <4 x i32> [[TMP3]], i32 3
+; CHECK-NEXT: ret void
+;
+ %l1 = load float, ptr %1, align 16
+ %getElem = getelementptr inbounds i8, ptr %1, i64 4
+ %l2 = load float, ptr %getElem, align 4
+ %getElem8 = getelementptr inbounds i8, ptr %1, i64 8
+ %l3 = load float, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds i8, ptr %1, i64 12
+ %l4 = load float, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds i8, ptr %1, i64 16
+ %l5 = load float, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds i8, ptr %1, i64 20
+ %l6 = load float, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds i8, ptr %1, i64 24
+ %l7 = load i32, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds i8, ptr %1, i64 28
+ %l8 = load i32, ptr %getElem13, align 4
+ ret void
+}
+
+
+; This version of the test adjusts the struct to hold two i32s at the beginning,
+; but still assumes that the first float3 is 16-byte aligned. If the alignment
+; propagation works correctly, it should be possible to cover this struct in
+; three accesses: a 2x32, a 4x32, and a 4x32. Without the alignment
+; propagation, the last 4x32 would instead be split into a 2x32 and a 2x32.
+%struct.S2 = type { i32, i32, %struct.float3, %struct.float3, i32, i32 }
+
+define void @testStore_2(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testStore_2(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: store <2 x i32> zeroinitializer, ptr [[TMP0]], align 8
+; CHECK-NEXT: [[GETELEM1:%.*]] = getelementptr inbounds [[STRUCT_S2:%.*]], ptr [[TMP0]], i64 0, i32 2
+; CHECK-NEXT: store <4 x float> zeroinitializer, ptr [[GETELEM1]], align 16
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds [[STRUCT_S2]], ptr [[TMP0]], i64 0, i32 3, i32 1
+; CHECK-NEXT: store <4 x i32> zeroinitializer, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: ret void
+;
+ store i32 0, ptr %1, align 8
+ %getElem = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 1
+ store i32 0, ptr %getElem, align 4
+ %getElem1 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2
+ store float 0.000000e+00, ptr %getElem1, align 16
+ %getElem2 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2, i32 1
+ store float 0.000000e+00, ptr %getElem2, align 4
+ %getElem8 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2, i32 2
+ store float 0.000000e+00, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3
+ store float 0.000000e+00, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3, i32 1
+ store float 0.000000e+00, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3, i32 2
+ store float 0.000000e+00, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 4
+ store i32 0, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 5
+ store i32 0, ptr %getElem13, align 4
+ ret void
+}
+
+define void @testLoad_2(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testLoad_2(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: [[TMP2:%.*]] = load <2 x i32>, ptr [[TMP0]], align 8
+; CHECK-NEXT: [[L1:%.*]] = extractelement <2 x i32> [[TMP2]], i32 0
+; CHECK-NEXT: [[L22:%.*]] = extractelement <2 x i32> [[TMP2]], i32 1
+; CHECK-NEXT: [[GETELEM1:%.*]] = getelementptr inbounds [[STRUCT_S2:%.*]], ptr [[TMP0]], i64 0, i32 2
+; CHECK-NEXT: [[TMP3:%.*]] = load <4 x float>, ptr [[GETELEM1]], align 16
+; CHECK-NEXT: [[L33:%.*]] = extractelement <4 x float> [[TMP3]], i32 0
+; CHECK-NEXT: [[L44:%.*]] = extractelement <4 x float> [[TMP3]], i32 1
+; CHECK-NEXT: [[L55:%.*]] = extractelement <4 x float> [[TMP3]], i32 2
+; CHECK-NEXT: [[L66:%.*]] = extractelement <4 x float> [[TMP3]], i32 3
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds [[STRUCT_S2]], ptr [[TMP0]], i64 0, i32 3, i32 1
+; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: [[L77:%.*]] = extractelement <4 x i32> [[TMP4]], i32 0
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast i32 [[L77]] to float
+; CHECK-NEXT: [[L88:%.*]] = extractelement <4 x i32> [[TMP4]], i32 1
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32 [[L88]] to float
+; CHECK-NEXT: [[L99:%.*]] = extractelement <4 x i32> [[TMP4]], i32 2
+; CHECK-NEXT: [[L010:%.*]] = extractelement <4 x i32> [[TMP4]], i32 3
+; CHECK-NEXT: ret void
+;
+ %l = load i32, ptr %1, align 8
+ %getElem = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 1
+ %l2 = load i32, ptr %getElem, align 4
+ %getElem1 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2
+ %l3 = load float, ptr %getElem1, align 16
+ %getElem2 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2, i32 1
+ %l4 = load float, ptr %getElem2, align 4
+ %getElem8 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2, i32 2
+ %l5 = load float, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3
+ %l6 = load float, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3, i32 1
+ %l7 = load float, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3, i32 2
+ %l8 = load float, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 4
+ %l9 = load i32, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 5
+ %l0 = load i32, ptr %getElem13, align 4
+ ret void
+}
+
+; Also test without the struct geps, to check that it still works with i8 geps/ptradd.
+
+define void @testStorei8_2(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testStorei8_2(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: store <2 x i32> zeroinitializer, ptr [[TMP0]], align 8
+; CHECK-NEXT: [[GETELEM1:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 8
+; CHECK-NEXT: store <4 x float> zeroinitializer, ptr [[GETELEM1]], align 16
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 24
+; CHECK-NEXT: store <4 x i32> zeroinitializer, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: ret void
+;
+ store i32 0, ptr %1, align 8
+ %getElem = getelementptr inbounds i8, ptr %1, i64 4
+ store i32 0, ptr %getElem, align 4
+ %getElem1 = getelementptr inbounds i8, ptr %1, i64 8
+ store float 0.000000e+00, ptr %getElem1, align 16
+ %getElem2 = getelementptr inbounds i8, ptr %1, i64 12
+ store float 0.000000e+00, ptr %getElem2, align 4
+ %getElem8 = getelementptr inbounds i8, ptr %1, i64 16
+ store float 0.000000e+00, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds i8, ptr %1, i64 20
+ store float 0.000000e+00, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds i8, ptr %1, i64 24
+ store float 0.000000e+00, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds i8, ptr %1, i64 28
+ store float 0.000000e+00, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds i8, ptr %1, i64 32
+ store i32 0, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds i8, ptr %1, i64 36
+ store i32 0, ptr %getElem13, align 4
+ ret void
+}
+
+define void @testLoadi8_2(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testLoadi8_2(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT: [[TMP2:%.*]] = load <2 x i32>, ptr [[TMP0]], align 8
+; CHECK-NEXT: [[L1:%.*]] = extractelement <2 x i32> [[TMP2]], i32 0
+; CHECK-NEXT: [[L22:%.*]] = extractelement <2 x i32> [[TMP2]], i32 1
+; CHECK-NEXT: [[GETELEM1:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 8
+; CHECK-NEXT: [[TMP3:%.*]] = load <4 x float>, ptr [[GETELEM1]], align 16
+; CHECK-NEXT: [[L33:%.*]] = extractelement <4 x float> [[TMP3]], i32 0
+; CHECK-NEXT: [[L44:%.*]] = extractelement <4 x float> [[TMP3]], i32 1
+; CHECK-NEXT: [[L55:%.*]] = extractelement <4 x float> [[TMP3]], i32 2
+; CHECK-NEXT: [[L66:%.*]] = extractelement <4 x float> [[TMP3]], i32 3
+; CHECK-NEXT: [[GETELEM10:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 24
+; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, ptr [[GETELEM10]], align 16
+; CHECK-NEXT: [[L77:%.*]] = extractelement <4 x i32> [[TMP4]], i32 0
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast i32 [[L77]] to float
+; CHECK-NEXT: [[L88:%.*]] = extractelement <4 x i32> [[TMP4]], i32 1
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32 [[L88]] to float
+; CHECK-NEXT: [[L99:%.*]] = extractelement <4 x i32> [[TMP4]], i32 2
+; CHECK-NEXT: [[L010:%.*]] = extractelement <4 x i32> [[TMP4]], i32 3
+; CHECK-NEXT: ret void
+;
+ %l = load i32, ptr %1, align 8
+ %getElem = getelementptr inbounds i8, ptr %1, i64 4
+ %l2 = load i32, ptr %getElem, align 4
+ %getElem1 = getelementptr inbounds i8, ptr %1, i64 8
+ %l3 = load float, ptr %getElem1, align 16
+ %getElem2 = getelementptr inbounds i8, ptr %1, i64 12
+ %l4 = load float, ptr %getElem2, align 4
+ %getElem8 = getelementptr inbounds i8, ptr %1, i64 16
+ %l5 = load float, ptr %getElem8, align 8
+ %getElem9 = getelementptr inbounds i8, ptr %1, i64 20
+ %l6 = load float, ptr %getElem9, align 4
+ %getElem10 = getelementptr inbounds i8, ptr %1, i64 24
+ %l7 = load float, ptr %getElem10, align 4
+ %getElem11 = getelementptr inbounds i8, ptr %1, i64 28
+ %l8 = load float, ptr %getElem11, align 4
+ %getElem12 = getelementptr inbounds i8, ptr %1, i64 32
+ %l9 = load i32, ptr %getElem12, align 8
+ %getElem13 = getelementptr inbounds i8, ptr %1, i64 36
+ %l0 = load i32, ptr %getElem13, align 4
+ ret void
+}
Nice
// If this is a load/store of an alloca, we might have upgraded the alloca's
// alignment earlier. Get the new alignment.
if (AS == DL.getAllocaAddrSpace()) {
  Alignment = std::max(
      Alignment,
      getOrEnforceKnownAlignment(getLoadStorePointerOperand(C[0].Inst),
                                 MaybeAlign(), DL, C[0].Inst, nullptr, &DT));
}
Regarding this change: I don't totally understand why we delay the upgrade of load/store alignment, but I'd imagine it is preferred to minimize IR changes before vectorization is certain.

If this change gets accepted, we are OK with eagerly upgrading load/store alignment. So, for consistency, we should also be OK with doing so eagerly for the existing alloca alignment upgrade optimization. Hence, I am bundling that simplification with this change, which lets us remove this handling in vectorizeChain by instead calling setLoadStoreAlignment on line 837.

There is one test case change that results from this, in massive_indirection.ll. It is not a regression, as the target supports unaligned loads. The upgrade from an alignment of 8 to an alignment of 4294967296 that happened with this call to getOrEnforceKnownAlignment was never the intended purpose of this block of code, merely an unintended side effect; this is evidenced by the fact that there are no allocas in the test case.
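For illustration, a minimal sketch of the eager-upgrade direction described above. It assumes the pass-internal ChainElem type and the C, AS, DL, and DT values available in splitChainByAlignment; it shows the shape of the simplification, not the exact code landed in the patch.

#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Dominators.h"
#include "llvm/IR/Instructions.h"        // getLoadStoreAlignment / setLoadStoreAlignment
#include "llvm/Transforms/Utils/Local.h" // getOrEnforceKnownAlignment

using namespace llvm;

// Sketch: eagerly fold a known (e.g. alloca-derived) alignment back onto the
// chain leader's load/store. Once written to the IR here, the propagation in
// propagateBestAlignmentsInChain carries it down the chain, and vectorizeChain
// no longer needs its own alloca special case.
// ChainElem is the LSV-internal {Instruction *Inst; APInt OffsetFromLeader}.
static void upgradeLeaderAlignment(ArrayRef<ChainElem> C, unsigned AS,
                                   const DataLayout &DL, DominatorTree &DT) {
  if (AS != DL.getAllocaAddrSpace())
    return;
  Align Known = getOrEnforceKnownAlignment(
      getLoadStorePointerOperand(C[0].Inst), MaybeAlign(), DL, C[0].Inst,
      /*AC=*/nullptr, &DT);
  if (Known > getLoadStoreAlignment(C[0].Inst))
    setLoadStoreAlignment(C[0].Inst, Known);
}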
LGTM, but please wait for review by someone with more LSV experience before landing.

Adding @michalpaszkowski to reviewers, since I saw you reviewed a recently merged LSV change. If you can think of anyone else who would be a better fit to review this, let me know :)