Add support for some multi-store cases in affine loop fusion #162
This PR is a stepping stone towards supporting generic multi-store source loop nests in affine loop fusion. It extends the algorithm to support fusion of multi-store loop nests that:
1. have only one store that writes to a function-local live out, and
2. the remaining stores are involved in loop nest self dependences or no dependences within the function.
Hi Diego,
Thanks very much! The single-store limitation has indeed been a key
limitation that we may want to immediately get rid of. I'll be happy to
provide feedback on this PR - @andydavis1 will have more useful feedback
than I here though.
…On 02/10/2019 05:01, Diego Caballero wrote:
Hello!
We've been giving affine loop fusion a try and we are pretty happy with
the initial results! Great work! We are seeing quite a few loop nests
being fused in our models!
We noticed that currently only single-store producer loops are supported
and would like to contribute a fix for that since it seems to be an
important limitation. Even though our loop nests usually have a single
store, this limitation is exposed when several single-store loop nests
can be fused. The following example is a snippet of a model we are
working on, where 4 loop nests could be fused into a single one:
    func @main(%in : memref<1048576x256xf32>, %out : memref<1048576x256xf32>) {
      %cst = constant 0.000000e+00 : f32
      %6 = alloc() : memref<1048576x256xf32>
      affine.for %arg7 = 0 to 1048576 {
        affine.for %arg8 = 0 to 256 {
          %13 = affine.load %in[%arg7, %arg8] : memref<1048576x256xf32>
          %14 = affine.load %in[%arg7, %arg8] : memref<1048576x256xf32>
          %15 = mulf %14, %13 : f32
          affine.store %15, %6[%arg7, %arg8] : memref<1048576x256xf32>
        }
      }
      %7 = alloc() : memref<1048576x256xf32>
      affine.for %arg7 = 0 to 1048576 {
        affine.for %arg8 = 0 to 256 {
          %13 = affine.load %6[%arg7, %arg8] : memref<1048576x256xf32>
          %14 = cmpf "ogt", %13, %cst : f32
          %15 = select %14, %13, %cst : f32
          affine.store %15, %7[%arg7, %arg8] : memref<1048576x256xf32>
        }
      }
      %8 = alloc() : memref<1048576x256xf32>
      affine.for %arg7 = 0 to 1048576 {
        affine.for %arg8 = 0 to 256 {
          %13 = affine.load %7[%arg7, %arg8] : memref<1048576x256xf32>
          %14 = affine.load %7[%arg7, %arg8] : memref<1048576x256xf32>
          %15 = mulf %14, %13 : f32
          affine.store %15, %8[%arg7, %arg8] : memref<1048576x256xf32>
        }
      }
      affine.for %arg7 = 0 to 1048576 {
        affine.for %arg8 = 0 to 256 {
          %13 = affine.load %8[%arg7, %arg8] : memref<1048576x256xf32>
          %14 = cmpf "ogt", %13, %cst : f32
          %15 = select %14, %13, %cst : f32
          affine.store %15, %out[%arg7, %arg8] : memref<1048576x256xf32>
        }
      }
      dealloc %6 : memref<1048576x256xf32>
      dealloc %7 : memref<1048576x256xf32>
      dealloc %8 : memref<1048576x256xf32>
      return
    }
However, affine loop fusion only fuses the first 3 loops and not the 4th
one:
    module {
      func @main(%arg0: memref<1048576x256xf32>, %arg1: memref<1048576x256xf32>) {
        %cst = constant 0.000000e+00 : f32
        %0 = alloc() : memref<1048576x256xf32>
        %1 = alloc() : memref<1048576x256xf32>
        %2 = alloc() : memref<1048576x256xf32>
        affine.for %arg2 = 0 to 1048576 {
          affine.for %arg3 = 0 to 256 {
            %3 = affine.load %arg0[%arg2, %arg3] : memref<1048576x256xf32>
            %4 = affine.load %arg0[%arg2, %arg3] : memref<1048576x256xf32>
            %5 = mulf %4, %3 : f32
            affine.store %5, %0[%arg2, %arg3] : memref<1048576x256xf32>
            %6 = affine.load %0[%arg2, %arg3] : memref<1048576x256xf32>
            %7 = cmpf "ogt", %6, %cst : f32
            %8 = select %7, %6, %cst : f32
            affine.store %8, %1[%arg2, %arg3] : memref<1048576x256xf32>
            %9 = affine.load %1[%arg2, %arg3] : memref<1048576x256xf32>
            %10 = affine.load %1[%arg2, %arg3] : memref<1048576x256xf32>
            %11 = mulf %10, %9 : f32
            affine.store %11, %2[%arg2, %arg3] : memref<1048576x256xf32>
          }
        }
        affine.for %arg2 = 0 to 1048576 {
          affine.for %arg3 = 0 to 256 {
            %3 = affine.load %2[%arg2, %arg3] : memref<1048576x256xf32>
            %4 = cmpf "ogt", %3, %cst : f32
            %5 = select %4, %3, %cst : f32
            affine.store %5, %arg1[%arg2, %arg3] : memref<1048576x256xf32>
          }
        }
        dealloc %0 : memref<1048576x256xf32>
        dealloc %1 : memref<1048576x256xf32>
        dealloc %2 : memref<1048576x256xf32>
        return
      }
    }
This happens because the algorithm starts with the 3rd loop nest as
consumer and fuses the 2nd and 1st ones into it. Then it picks the 4th as
consumer and tries to fuse the previously fused nest into it, which is
not possible because that nest now has multiple stores.
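The ordering problem described above can be replayed with a toy model. This is only a sketch: `LoopNest`, `makeNest`, `canFuseSingleStoreOnly`, `fuseInto` and `replayExample` are hypothetical names, not part of MLIR, and the model tracks nothing except which memrefs each nest stores to.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a loop nest: only the memrefs it stores to.
struct LoopNest {
  std::vector<std::string> storedMemrefs;
};

// Builds a nest with a single store to `memref`.
LoopNest makeNest(const std::string &memref) {
  LoopNest nest;
  nest.storedMemrefs.push_back(memref);
  return nest;
}

// Models the single-store restriction: a producer is fusable only if it
// contains exactly one store.
bool canFuseSingleStoreOnly(const LoopNest &producer) {
  return producer.storedMemrefs.size() == 1;
}

// Fuses `producer` into `consumer` when allowed. The fused nest keeps the
// stores of both nests. Returns true if fusion happened.
bool fuseInto(const LoopNest &producer, LoopNest &consumer) {
  if (!canFuseSingleStoreOnly(producer))
    return false;
  consumer.storedMemrefs.insert(consumer.storedMemrefs.begin(),
                                producer.storedMemrefs.begin(),
                                producer.storedMemrefs.end());
  return true;
}

// Replays the fusion order from the description and returns how many loop
// nests remain afterwards.
int replayExample() {
  LoopNest n1 = makeNest("%6");
  LoopNest n2 = makeNest("%7");
  LoopNest n3 = makeNest("%8");
  LoopNest n4 = makeNest("%out");
  int remaining = 4;
  if (fuseInto(n2, n3)) --remaining;  // 3rd nest is the consumer, 2nd is fused
  if (fuseInto(n1, n3)) --remaining;  // 1st is fused into the same consumer
  assert(n3.storedMemrefs.size() == 3 && "fused nest now has three stores");
  if (fuseInto(n3, n4)) --remaining;  // rejected: producer has multiple stores
  return remaining;
}
```

Under these assumptions `replayExample()` returns 2, matching the two loop nests left in the output above.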
This PR is a stepping stone towards supporting generic multi-store
producer loop nests in affine loop fusion. It extends the algorithm to
support fusion of multi-store producer loop nests that:
1. have only one store that writes to a function-local live out, and
2. the remaining stores are only involved in loop nest self dependences
or no dependences within the function.
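The two conditions above can be sketched against a simplified dependence graph. Everything here is an assumption made for illustration: `Edge`, `Node`, `OutEdges` and `uniqueStoreToLiveOut` are hypothetical stand-ins rather than the types in LoopFusion.cpp, memrefs are plain strings, and, as in the PR, self dependences are assumed not to appear as graph edges.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-ins for the dependence-graph types.
struct Edge {
  unsigned dstId;      // node that depends on this one
  std::string memref;  // memref carrying the dependence
};

struct Node {
  unsigned id;
  std::vector<std::string> storedMemrefs;  // one entry per store in the nest
};

using OutEdges = std::unordered_map<unsigned, std::vector<Edge>>;

// Returns the index (into node.storedMemrefs) of the only store whose memref
// has an outgoing edge to another node. Returns -1 if there is no such store
// or if there is more than one.
int uniqueStoreToLiveOut(const Node &node, const OutEdges &outEdges) {
  // The lookup does not depend on the store, so do it once up front and
  // exit early when the node has no outgoing edges at all.
  auto it = outEdges.find(node.id);
  if (it == outEdges.end())
    return -1;
  const std::vector<Edge> &nodeOutEdges = it->second;

  int unique = -1;
  for (std::size_t i = 0; i < node.storedMemrefs.size(); ++i) {
    const std::string &memref = node.storedMemrefs[i];
    // Skip stores with no consumer in another node: they are only read
    // inside the same nest, write to a function argument, or are dead.
    bool hasConsumer =
        std::any_of(nodeOutEdges.begin(), nodeOutEdges.end(),
                    [&](const Edge &edge) { return edge.memref == memref; });
    if (!hasConsumer)
      continue;
    if (unique != -1)
      return -1;  // a second live-out store: not handled by this rule
    unique = static_cast<int>(i);
  }
  return unique;
}
```

For the fused nest in the example, which stores to `%0`, `%1` and `%2` but is only read by the last nest through `%2`, this sketch would select the store to `%2`.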
It would be great to get feedback on this fix, if this approach is
aligned with what you envision for the generic multi-output problem and
if there are more cases that could be problematic and are currently not
properly addressed in this PR.
Thanks!
Diego
Commit Summary
* Add support for some multi-store cases in affine fusion
File Changes
* *M* lib/Transforms/LoopFusion.cpp
<https://github.com/tensorflow/mlir/pull/162/files#diff-0> (64)
* *M* test/Transforms/loop-fusion.mlir
<https://github.com/tensorflow/mlir/pull/162/files#diff-1> (75)
Patch Links:
* https://github.com/tensorflow/mlir/pull/162.patch
* https://github.com/tensorflow/mlir/pull/162.diff
Hello Diego, Glad you are using Affine Loop Fusion, and thanks for the feedback and PR!!! I'll take a look at the PR in a bit. |
Great! Thank you both! |
A first round of superficial comments.
lib/Transforms/LoopFusion.cpp
Outdated
AffineStoreOp getUniqueStoreToLiveOut(Node *node) {
  AffineStoreOp uniqueStore;
  for (auto *op : node->stores) {
    auto storeOpInst = cast<AffineStoreOp>(op);
Nit: The "Inst" suffix dates from when MLIR operations were called Instructions. You can just drop it now, i.e.,
storeOpInst -> storeOp
lib/Transforms/LoopFusion.cpp
Outdated
@@ -322,6 +322,38 @@ struct MemRefDependenceGraph {
    return false;
  }

  // Returns the unique AffineStoreOp in `node` that meets all the following:
  // *) store is the only one that writes to a function-local live out memref,
  // *) store is not the source of a `node` self-dependence.
a self dependence on `node`?
lib/Transforms/LoopFusion.cpp
Outdated
// *) store is the only one that writes to a function-local live out memref,
// *) store is not the source of a `node` self-dependence.
// Otherwise, returns a null AffineStoreOp.
AffineStoreOp getUniqueStoreToLiveOut(Node *node) {
This should be a local / static function.
It uses class member outEdges. I could make it a static non-member function but I think it makes sense as it is. It's similar to other member functions, such as writesToLiveInOrEscapingMemrefs.
Sorry, had missed it; fine as is.
lib/Transforms/LoopFusion.cpp
Outdated
// live-out.
// TODO(andydavis) Support more generic multi-output src loop nests
// fusion.
auto srcStoreOpInst = mdg->getUniqueStoreToLiveOut(srcNode);
srcStoreOpInst -> srcStoreOp
@@ -1259,6 +1260,7 @@ func @should_not_fuse_multi_output_producer() {
// CHECK-NEXT: }
// CHECK-NEXT: affine.for %{{.*}} = 0 to 10 {
// CHECK-NEXT: %{{.*}} = affine.load %{{.*}}[%{{.*}}] : memref<10xf32>
// CHECK-NEXT: %{{.*}} = affine.load %{{.*}}[%{{.*}}] : memref<10xf32>
No need to match the result here and in the one above.
lib/Transforms/LoopFusion.cpp
Outdated
@@ -972,25 +1004,17 @@ static Value *createPrivateMemRef(AffineForOp forOp, Operation *srcStoreOpInst,
// TODO(andydavis) Generalize this to handle more live in/out cases.
static bool canFuseSrcWhichWritesToLiveOut(unsigned srcId, unsigned dstId,
                                           Value *memref,
                                           AffineStoreOp srcStoreOpInst,
srcStoreOpInst -> srcStoreOp
test/Transforms/loop-fusion.mlir
Outdated
  %v1 = affine.load %m[%i1] : memref<10xf32>
}
// CHECK: affine.for [[i0:%.*]] = 0 to 10 {
// CHECK-NEXT: affine.store {{%.*}}, [[LOCAL_M:%.*]]{{\[}}[[i0]]{{\]}} : memref<10xf32>
%{{.*}} to be consistent.
test/Transforms/loop-fusion.mlir
Outdated
affine.for %i1 = 0 to 10 {
  %v1 = affine.load %m[%i1] : memref<10xf32>
}
// CHECK: affine.for [[i0:%.*]] = 0 to 10 {
You can capture this as:
%[[i0:.*]] and then use %[[i0]]. This way you won't need the escapes below.
Nice! Thanks! I forgot to ask about alternatives for this since I found it a bit uncomfortable.
test/Transforms/loop-fusion.mlir
Outdated
}
// CHECK: affine.for [[i0:%.*]] = 0 to 10 {
// CHECK-NEXT: affine.store {{%.*}}, [[LOCAL_M:%.*]]{{\[}}[[i0]]{{\]}} : memref<10xf32>
// CHECK-NEXT: [[v0:%.*]] = affine.load [[LOCAL_M]]{{\[}}[[i0]]{{\]}} : memref<10xf32>
This could be rewritten as:
[%[[i0]]]
test/Transforms/loop-fusion.mlir
Outdated
  %v0 = affine.load %m[%i1] : memref<10xf32>
}
// CHECK: affine.for [[i0:%.*]] = 0 to 10 {
// CHECK-NEXT: affine.store {{%.*}}, {{%.*\[}}[[i0]]{{\]}} : memref<10xf32>
Likewise. Capture %[[i0:.*]] and use [%[[i0]]] -- more readable this way.
Thanks, Uday! Addressed.
This PR looks great to me! Just some efficiency and documentation related suggestions.
lib/Transforms/LoopFusion.cpp
Outdated
@@ -322,6 +322,38 @@ struct MemRefDependenceGraph {
    return false;
  }

  // Returns the unique AffineStoreOp in `node` that meets all the following:
  // *) store is the only one that writes to a function-local live out memref,
I sort of prefer:
"function-local memref live out of `node`"
The comments inside should be fine.
lib/Transforms/LoopFusion.cpp
Outdated
@@ -972,33 +1004,25 @@ static Value *createPrivateMemRef(AffineForOp forOp, Operation *srcStoreOpInst,
// TODO(andydavis) Generalize this to handle more live in/out cases.
static bool canFuseSrcWhichWritesToLiveOut(unsigned srcId, unsigned dstId,
                                           Value *memref,
                                           AffineStoreOp srcStoreOp,
The comment for this function needs an update: 'srcNode' could write to multiple memrefs, but only one of them may have an outgoing edge.
Function comment doesn't document 'srcStoreOp'.
I revisited this a bit since memref is no longer needed and some of the checks should actually be asserts after getUniqueStoreToLiveOut. Please let me know if the documentation is not clear. We can iterate on it.
lib/Transforms/LoopFusion.cpp
Outdated
// Check that all stores are to the same memref.
if (storeMemrefs.size() != 1 ||
    mdg->getOutEdgeCount(srcNode->id, memref) != 1)
// Return false if 'srcNode' has more than one output edge on 'memref'.
unless 'srcNode' has exactly one outgoing edge on 'memref'.
Removed
lib/Transforms/LoopFusion.cpp
Outdated
// *) store is the only one that writes to a function-local live out memref,
// *) store is not the source of a self-dependence on `node`.
// Otherwise, returns a null AffineStoreOp.
AffineStoreOp getUniqueStoreToLiveOut(Node *node) {
getUniqueOutgoingStore or getUniqueLiveOutStore sounds better to me.
test/Transforms/loop-fusion.mlir
Outdated
// -----

// CHECK-LABEL: func @should_fuse_function_live_out_multi_store_producer
func @should_fuse_function_live_out_multi_store_producer(%live_out_m : memref<10xf32>) {
Nit: This is really %arg_m or %live_in_out_m.
lib/Transforms/LoopFusion.cpp
Outdated
// live-out.
// TODO(andydavis) Support more generic multi-output src loop nests
// fusion.
auto srcStoreOp = mdg->getUniqueStoreToLiveOut(srcNode);
Shouldn't this store be on 'memref'? Since you are actually checking getOutEdgeCount != 1, you've made sure to eliminate the "no store case" on memref. So, if this finds a store, it will have to be on 'memref'.
Yeah, it looks like it. I had some problems with this assumption in an early implementation but probably something was wrong. I think it's better to keep getUniqueStoreToLiveOut a bit more generic so I'll add an assert for that constraint here. Thanks!
lib/Transforms/LoopFusion.cpp
Outdated
for (auto *op : node->stores) {
  auto storeOp = cast<AffineStoreOp>(op);
  auto *memref = storeOp.getMemRef();
  auto outEdgeIt = outEdges.find(node->id);
This is invariant and can be hoisted out.
lib/Transforms/LoopFusion.cpp
Outdated
// (self-dependence edges are not represented in graph at the moment),
// *) writes to a function live out memref (function parameter), or
// *) is dead.
if (outEdgeIt == outEdges.end() ||
This is invariant and can be hoisted out and you can exit early. If the node doesn't have outgoing edges, just return nullptr?!
Yeah, good catch!
lib/Transforms/LoopFusion.cpp
Outdated
// *) writes to a function live out memref (function parameter), or
// *) is dead.
if (outEdgeIt == outEdges.end() ||
    llvm::all_of(outEdgeIt->second, [=](const Edge &edge) {
Hoist out outEdgeIt->second access as well.
const auto &nodeOutEdges = outEdgeIt->second;
The only thing needed inside is:
if (llvm::all_of(nodeOutEdges, [=](const Edge &edge) {
      ...
    }))
  continue;
Thanks, Uday! |
// -----

// CHECK-LABEL: func @should_fuse_self_dependence_multi_store_producer() {
func @should_fuse_self_dependence_multi_store_producer() {
Do you have any multi-store fusion unit tests where one of the fused loops stores to a live out memref?
func @test(%live_out : memref<64x9xi32>) {
// Add loop nest here which stores to %live_out
// Add other loop nests to fuse with above loop nest.
return %live_out : memref<64x9xi32>
}
Is the one in 2325 (should_fuse_function_live_out_multi_store_producer) what you are looking for?
Yes. Thanks.
Is anything else needed? |
This looks great to me. |
Yes. Looks good. Thanks Diego! |
This PR is a stepping stone towards supporting generic multi-store source loop nests in affine loop fusion. It extends the algorithm to support fusion of multi-store loop nests that: 1. have only one store that writes to a function-local live out, and 2. the remaining stores are involved in loop nest self dependences or no dependences within the function. Closes #162 COPYBARA_INTEGRATE_REVIEW=tensorflow/mlir#162 from dcaballe:dcaballe/multi-output-fusion 7fb7dec6fe8b45f5ce176f018bfe37b256420c45 PiperOrigin-RevId: 273773907
tensorflow#162 introduced a bug that incorrectly allowed fusion of producer loops with multiple outgoing edges. This commit fixes that problem. It also introduces a new flag to disable sibling loop fusion so that we can test producer-consumer fusion in isolation.
tensorflow/mlir#162 introduced a bug that incorrectly allowed fusion of producer loops with multiple outgoing edges. This commit fixes that problem. It also introduces a new flag to disable sibling loop fusion so that we can test producer-consumer fusion in isolation. Closes #259 COPYBARA_INTEGRATE_REVIEW=tensorflow/mlir#259 from dcaballe:dcaballe/fix_multi_out_edge_producer_fusion 578d5661705fd5c56c555832d5e0528df88c5282 PiperOrigin-RevId: 283531105 Change-Id: I3a6173463ea20bd35555c24fa451bfbf2dfac098
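One way to picture the guard this follow-up restores, based on the `getOutEdgeCount(srcNode->id, memref) != 1` check quoted earlier in the review, is the sketch below. It is an assumption about the shape of the fix rather than the committed code: `Edge`, `outEdgeCountOn` and `hasSingleOutgoingEdgeOn` are hypothetical names and memrefs are modeled as strings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a dependence edge leaving the producer nest.
struct Edge {
  unsigned dstId;      // consumer node
  std::string memref;  // memref carrying the dependence
};

// Counts the producer's outgoing edges that carry a dependence on `memref`.
unsigned outEdgeCountOn(const std::vector<Edge> &producerOutEdges,
                        const std::string &memref) {
  unsigned count = 0;
  for (const Edge &edge : producerOutEdges)
    if (edge.memref == memref)
      ++count;
  return count;
}

// Producer-consumer fusion is only considered when exactly one outgoing
// edge exists on `memref`; zero or several edges reject the candidate.
bool hasSingleOutgoingEdgeOn(const std::vector<Edge> &producerOutEdges,
                             const std::string &memref) {
  return outEdgeCountOn(producerOutEdges, memref) == 1;
}
```

In this model a producer whose result is read by two different consumer nests has two edges on the same memref and is rejected.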