Skip to content
This repository has been archived by the owner on Apr 23, 2021. It is now read-only.

Add support for some multi-store cases in affine loop fusion #162

Closed

Conversation

dcaballe
Copy link
Contributor

@dcaballe dcaballe commented Oct 1, 2019

Hello!

We've been giving affine loop fusion a try and we are pretty happy with the initial results! Great work! We are seeing quite a few loop nests being fused in our models!

We noticed that currently only single-store producer loops are supported and would like to contribute a fix for that since it seems to be an important limitation. Even though our loop nests usually have a single store, this limitation is exposed when several single-store loop nests can be fused. The following example is a snippet of a model we are working on, where 4 loop nests could be fused into a single one:

func @main(%in : memref<1048576x256xf32>, %out : memref<1048576x256xf32>) {
  %cst = constant 0.000000e+00 : f32

  %6 = alloc() : memref<1048576x256xf32>
  affine.for %arg7 = 0 to 1048576 {
    affine.for %arg8 = 0 to 256 {
      %13 = affine.load %in[%arg7, %arg8] : memref<1048576x256xf32>
      %14 = affine.load %in[%arg7, %arg8] : memref<1048576x256xf32>
      %15 = mulf %14, %13 : f32
      affine.store %15, %6[%arg7, %arg8] : memref<1048576x256xf32>
    }
  }
  %7 = alloc() : memref<1048576x256xf32>
  affine.for %arg7 = 0 to 1048576 {
    affine.for %arg8 = 0 to 256 {
      %13 = affine.load %6[%arg7, %arg8] : memref<1048576x256xf32>
      %14 = cmpf "ogt", %13, %cst : f32
      %15 = select %14, %13, %cst : f32
      affine.store %15, %7[%arg7, %arg8] : memref<1048576x256xf32>
    }
  }
  %8 = alloc() : memref<1048576x256xf32>
  affine.for %arg7 = 0 to 1048576 {
    affine.for %arg8 = 0 to 256 {
      %13 = affine.load %7[%arg7, %arg8] : memref<1048576x256xf32>
      %14 = affine.load %7[%arg7, %arg8] : memref<1048576x256xf32>
      %15 = mulf %14, %13 : f32
      affine.store %15, %8[%arg7, %arg8] : memref<1048576x256xf32>
    }
  }
  affine.for %arg7 = 0 to 1048576 {
    affine.for %arg8 = 0 to 256 {
      %13 = affine.load %8[%arg7, %arg8] : memref<1048576x256xf32>
      %14 = cmpf "ogt", %13, %cst : f32
      %15 = select %14, %13, %cst : f32
      affine.store %15, %out[%arg7, %arg8] : memref<1048576x256xf32>
    }
  }
  dealloc %6 : memref<1048576x256xf32>
  dealloc %7 : memref<1048576x256xf32>
  dealloc %8 : memref<1048576x256xf32>
  return
}

However, affine loop fusion only fuses the first 3 loops and not the 4th one:

module {
  func @main(%arg0: memref<1048576x256xf32>, %arg1: memref<1048576x256xf32>) {
    %cst = constant 0.000000e+00 : f32
    %0 = alloc() : memref<1048576x256xf32>
    %1 = alloc() : memref<1048576x256xf32>
    %2 = alloc() : memref<1048576x256xf32>
    affine.for %arg2 = 0 to 1048576 {
      affine.for %arg3 = 0 to 256 {
        %3 = affine.load %arg0[%arg2, %arg3] : memref<1048576x256xf32>
        %4 = affine.load %arg0[%arg2, %arg3] : memref<1048576x256xf32>
        %5 = mulf %4, %3 : f32
        affine.store %5, %0[%arg2, %arg3] : memref<1048576x256xf32>
        %6 = affine.load %0[%arg2, %arg3] : memref<1048576x256xf32>
        %7 = cmpf "ogt", %6, %cst : f32
        %8 = select %7, %6, %cst : f32
        affine.store %8, %1[%arg2, %arg3] : memref<1048576x256xf32>
        %9 = affine.load %1[%arg2, %arg3] : memref<1048576x256xf32>
        %10 = affine.load %1[%arg2, %arg3] : memref<1048576x256xf32>
        %11 = mulf %10, %9 : f32
        affine.store %11, %2[%arg2, %arg3] : memref<1048576x256xf32>
      }
    }
    affine.for %arg2 = 0 to 1048576 {
      affine.for %arg3 = 0 to 256 {
        %3 = affine.load %2[%arg2, %arg3] : memref<1048576x256xf32>
        %4 = cmpf "ogt", %3, %cst : f32
        %5 = select %4, %3, %cst : f32
        affine.store %5, %arg1[%arg2, %arg3] : memref<1048576x256xf32>
      }
    }
    dealloc %0 : memref<1048576x256xf32>
    dealloc %1 : memref<1048576x256xf32>
    dealloc %2 : memref<1048576x256xf32>
    return
  }
}

This happens because the algorithm starts with the 3rd loop nest as consumer and it fuses 2nd and 1st one into it. Then, it picks 4th as consumer and tries to fuse the previously fused one into it, which it's not possible because it has multiple stores.

This PR is a stepping stone towards supporting generic multi-store producer loop nests in affine loop fusion. It extends the algorithm to support fusion of multi-store producer loop nests that:

  1. have only one store that writes to a function-local live out, and
  2. the remaining stores are only involved in loop nest self dependences or no dependences within the function.

It would be great to get feedback on this fix, if this approach is aligned with what you envision for the generic multi-output problem and if there are more cases that could be problematic and are currently not properly addressed in this PR.

Thanks!
Diego

This PR is a stepping stone towards supporting generic multi-store
source loop nests in affine loop fusion. It extends the algorithm to
support fusion of multi-store loop nests that:
 1. have only one store that writes to a function-local live out, and
 2. the remaining stores are involved in loop nest self dependences
    or no dependences within the function.
@bondhugula
Copy link
Contributor

bondhugula commented Oct 2, 2019 via email

@andydavis1
Copy link
Contributor

andydavis1 commented Oct 2, 2019

Hello Diego,

Glad you are using Affine Loop Fusion, and thanks for the feedback and PR!!! I'll take a look at the PR in a bit.
W.r.t future plans for this pass: I do have a plan for a new Affine Loop Fusion pass, which will handle the many of these corner cases in a unified way (and result in a simpler pass). I've started to build out the utilities for the new pass here: https://github.com/tensorflow/mlir/blob/master/include/mlir/Transforms/LoopFusionUtils.h
Stay tuned!

@dcaballe
Copy link
Contributor Author

dcaballe commented Oct 2, 2019

Great! Thank you both!
Please, keep me posted on the new fusion algorithm and let me know if I can help with reviews or something. Do you plan to send an RFC highlighting the main changes, limitations and improvements?

Copy link
Contributor

@bondhugula bondhugula left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A first round of superficial comments.

AffineStoreOp getUniqueStoreToLiveOut(Node *node) {
AffineStoreOp uniqueStore;
for (auto *op : node->stores) {
auto storeOpInst = cast<AffineStoreOp>(op);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: The "Inst" suffix used to be used earlier when MLIR operations used to be called Instructions. You can just drop it now, i.e.,
storeOpInst -> storeOp

@@ -322,6 +322,38 @@ struct MemRefDependenceGraph {
return false;
}

// Returns the unique AffineStoreOp in `node` that meets all the following:
// *) store is the only one that writes to a function-local live out memref,
// *) store is not the source of a `node` self-dependence.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a self dependence on node ?

// *) store is the only one that writes to a function-local live out memref,
// *) store is not the source of a `node` self-dependence.
// Otherwise, returns a null AffineStoreOp.
AffineStoreOp getUniqueStoreToLiveOut(Node *node) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a local / static function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It uses class member outEdges. I could make it a static non-member function but I think it makes sense as it is. It's similar to other member functions, such as writesToLiveInOrEscapingMemrefs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, had missed it; fine as is.

// live-out.
// TODO(andydavis) Support more generic multi-output src loop nests
// fusion.
auto srcStoreOpInst = mdg->getUniqueStoreToLiveOut(srcNode);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

srcStoreOpInst -> srcStoreOp

@@ -1259,6 +1260,7 @@ func @should_not_fuse_multi_output_producer() {
// CHECK-NEXT: }
// CHECK-NEXT: affine.for %{{.*}} = 0 to 10 {
// CHECK-NEXT: %{{.*}} = affine.load %{{.*}}[%{{.*}}] : memref<10xf32>
// CHECK-NEXT: %{{.*}} = affine.load %{{.*}}[%{{.*}}] : memref<10xf32>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to match the result here and in the one above.

@@ -972,25 +1004,17 @@ static Value *createPrivateMemRef(AffineForOp forOp, Operation *srcStoreOpInst,
// TODO(andydavis) Generalize this to handle more live in/out cases.
static bool canFuseSrcWhichWritesToLiveOut(unsigned srcId, unsigned dstId,
Value *memref,
AffineStoreOp srcStoreOpInst,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

srcStoreOpInst -> srcStoreOp

%v1 = affine.load %m[%i1] : memref<10xf32>
}
// CHECK: affine.for [[i0:%.*]] = 0 to 10 {
// CHECK-NEXT: affine.store {{%.*}}, [[LOCAL_M:%.*]]{{\[}}[[i0]]{{\]}} : memref<10xf32>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

%{{.*}} to be consistent.

affine.for %i1 = 0 to 10 {
%v1 = affine.load %m[%i1] : memref<10xf32>
}
// CHECK: affine.for [[i0:%.*]] = 0 to 10 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can capture this as:
%[[i0:.*]] and then use %[[i0]]. This way you won't need the escapes below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks! I forgot to ask about alternatives for this since I found it a bit uncomfortable.

}
// CHECK: affine.for [[i0:%.*]] = 0 to 10 {
// CHECK-NEXT: affine.store {{%.*}}, [[LOCAL_M:%.*]]{{\[}}[[i0]]{{\]}} : memref<10xf32>
// CHECK-NEXT: [[v0:%.*]] = affine.load [[LOCAL_M]]{{\[}}[[i0]]{{\]}} : memref<10xf32>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be rewritten as:
[%[[i0]]]

%v0 = affine.load %m[%i1] : memref<10xf32>
}
// CHECK: affine.for [[i0:%.*]] = 0 to 10 {
// CHECK-NEXT: affine.store {{%.*}}, {{%.*\[}}[[i0]]{{\]}} : memref<10xf32>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise. Capture %[[i0:.*]] and use [%[[i0]]] -- more readable this way.

Copy link
Contributor Author

@dcaballe dcaballe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Uday! Addressed.

// *) store is the only one that writes to a function-local live out memref,
// *) store is not the source of a `node` self-dependence.
// Otherwise, returns a null AffineStoreOp.
AffineStoreOp getUniqueStoreToLiveOut(Node *node) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It uses class member outEdges. I could make it a static non-member function but I think it makes sense as it is. It's similar to other member functions, such as writesToLiveInOrEscapingMemrefs.

affine.for %i1 = 0 to 10 {
%v1 = affine.load %m[%i1] : memref<10xf32>
}
// CHECK: affine.for [[i0:%.*]] = 0 to 10 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks! I forgot to ask about alternatives for this since I found it a bit uncomfortable.

Copy link
Contributor

@bondhugula bondhugula left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks great to me! Just some efficiency and documentation related suggestions.

@@ -322,6 +322,38 @@ struct MemRefDependenceGraph {
return false;
}

// Returns the unique AffineStoreOp in `node` that meets all the following:
// *) store is the only one that writes to a function-local live out memref,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sort of prefer:
"function-local memref live out of node"
The comments inside should be fine.

// *) store is the only one that writes to a function-local live out memref,
// *) store is not the source of a `node` self-dependence.
// Otherwise, returns a null AffineStoreOp.
AffineStoreOp getUniqueStoreToLiveOut(Node *node) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, had missed it; fine as is.

@@ -972,33 +1004,25 @@ static Value *createPrivateMemRef(AffineForOp forOp, Operation *srcStoreOpInst,
// TODO(andydavis) Generalize this to handle more live in/out cases.
static bool canFuseSrcWhichWritesToLiveOut(unsigned srcId, unsigned dstId,
Value *memref,
AffineStoreOp srcStoreOp,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment for this function needs an update 'srcNode' could write to multiple memrefs, but only one of them may have an outgoing edge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function comment doesn't document 'srcStoreOp'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I revisited this a bit since memref is no longer needed and some of the checks should be actually asserts after getUniqueStoreToLiveOut. Please, let me know if the documentation is not clear. We can iterate on it.

// Check that all stores are to the same memref.
if (storeMemrefs.size() != 1 ||
mdg->getOutEdgeCount(srcNode->id, memref) != 1)
// Return false if 'srcNode' has more than one output edge on 'memref'.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unless 'srcNode' has exactly one outgoing edge on 'memref'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

// *) store is the only one that writes to a function-local live out memref,
// *) store is not the source of a self-dependence on `node`.
// Otherwise, returns a null AffineStoreOp.
AffineStoreOp getUniqueStoreToLiveOut(Node *node) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getUniqueOutgoingStore or getUniqueLiveOutStore sounds better to me.

// -----

// CHECK-LABEL: func @should_fuse_function_live_out_multi_store_producer
func @should_fuse_function_live_out_multi_store_producer(%live_out_m : memref<10xf32>) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This is really %arg_m or %live_in_out_m.

// live-out.
// TODO(andydavis) Support more generic multi-output src loop nests
// fusion.
auto srcStoreOp = mdg->getUniqueStoreToLiveOut(srcNode);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this store be on 'memref'? Since you are actually checking getOutEdgeCount != 1, you've made sure to eliminate the "no store case" on memref. So, if this finds a store, it will have to be on 'memref'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it looks like it. I had some problems with this assumption in an early implementation but probably something was wrong. I think it's better to keep getUniqueStoreToLiveOut a bit more generic so I'll add an assert for that constraint here. Thanks!

for (auto *op : node->stores) {
auto storeOp = cast<AffineStoreOp>(op);
auto *memref = storeOp.getMemRef();
auto outEdgeIt = outEdges.find(node->id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is invariant and can be hoisted out.

// (self-dependence edges are not represented in graph at the moment),
// *) writes to a function live out memref (function parameter), or
// *) is dead.
if (outEdgeIt == outEdges.end() ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is invariant and can be hoisted out and you can exit early. If the node doesn't have outgoing edges, just return nullptr?!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good catch!

// *) writes to a function live out memref (function parameter), or
// *) is dead.
if (outEdgeIt == outEdges.end() ||
llvm::all_of(outEdgeIt->second, [=](const Edge &edge) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hoist out outEdgeIt->second access as well.

const auto &nodeOutEdges = outEdgeIt->second;

The only thing needed inside is:

   if (llvm::all_of(outEdges, [=](const Edge &edge) { 
       ...
       ) 
       continue;

@dcaballe
Copy link
Contributor Author

dcaballe commented Oct 3, 2019

Thanks, Uday!

// -----

// CHECK-LABEL: func @should_fuse_self_dependence_multi_store_producer() {
func @should_fuse_self_dependence_multi_store_producer() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any multi-store fusion unit tests where one of the fused loops stores to a live out memref?
func @test(%live_out : : memref<64x9xi32>) {
// Add loop nest here which stores to %live_out
// Add other loop nests to fuse with above loop nest.
return %live_out : memref<64x9xi32>
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the one in 2325 (should_fuse_function_live_out_multi_store_producer) what you are looking for?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Thanks.

@dcaballe
Copy link
Contributor Author

dcaballe commented Oct 7, 2019

Anything else is needed?

@bondhugula
Copy link
Contributor

This looks great to me.

@andydavis1
Copy link
Contributor

Yes. Looks good. Thanks Diego!

tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Oct 9, 2019
This PR is a stepping stone towards supporting generic multi-store
source loop nests in affine loop fusion. It extends the algorithm to
support fusion of multi-store loop nests that:
 1. have only one store that writes to a function-local live out, and
 2. the remaining stores are involved in loop nest self dependences
    or no dependences within the function.

Closes #162

COPYBARA_INTEGRATE_REVIEW=tensorflow/mlir#162 from dcaballe:dcaballe/multi-output-fusion 7fb7dec6fe8b45f5ce176f018bfe37b256420c45
PiperOrigin-RevId: 273773907
dcaballe added a commit to dcaballe/mlir that referenced this pull request Nov 22, 2019
tensorflow#162 introduced a bug that
incorrectly allowed fusion of producer loops with multiple outgoing
edges. This commit fixes that problem. It also introduces a new flag to
disable sibling loop fusion so that we can test producer-consumer fusion
in isolation.
mlir-copybara-bot pushed a commit that referenced this pull request Dec 3, 2019
#162 introduced a bug that
incorrectly allowed fusion of producer loops with multiple outgoing
edges. This commit fixes that problem. It also introduces a new flag to
disable sibling loop fusion so that we can test producer-consumer fusion
in isolation.

Closes #259

COPYBARA_INTEGRATE_REVIEW=#259 from dcaballe:dcaballe/fix_multi_out_edge_producer_fusion 578d566
PiperOrigin-RevId: 283531105
tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Dec 3, 2019
tensorflow/mlir#162 introduced a bug that
incorrectly allowed fusion of producer loops with multiple outgoing
edges. This commit fixes that problem. It also introduces a new flag to
disable sibling loop fusion so that we can test producer-consumer fusion
in isolation.

Closes #259

COPYBARA_INTEGRATE_REVIEW=tensorflow/mlir#259 from dcaballe:dcaballe/fix_multi_out_edge_producer_fusion 578d5661705fd5c56c555832d5e0528df88c5282
PiperOrigin-RevId: 283531105
Change-Id: I3a6173463ea20bd35555c24fa451bfbf2dfac098
swift-ci pushed a commit to apple/llvm-project that referenced this pull request Dec 24, 2019
This PR is a stepping stone towards supporting generic multi-store
source loop nests in affine loop fusion. It extends the algorithm to
support fusion of multi-store loop nests that:
 1. have only one store that writes to a function-local live out, and
 2. the remaining stores are involved in loop nest self dependences
    or no dependences within the function.

Closes tensorflow/mlir#162

COPYBARA_INTEGRATE_REVIEW=tensorflow/mlir#162 from dcaballe:dcaballe/multi-output-fusion 7fb7dec6fe8b45f5ce176f018bfe37b256420c45
PiperOrigin-RevId: 273773907
swift-ci pushed a commit to apple/llvm-project that referenced this pull request Dec 24, 2019
tensorflow/mlir#162 introduced a bug that
incorrectly allowed fusion of producer loops with multiple outgoing
edges. This commit fixes that problem. It also introduces a new flag to
disable sibling loop fusion so that we can test producer-consumer fusion
in isolation.

Closes tensorflow/mlir#259

COPYBARA_INTEGRATE_REVIEW=tensorflow/mlir#259 from dcaballe:dcaballe/fix_multi_out_edge_producer_fusion 578d5661705fd5c56c555832d5e0528df88c5282
PiperOrigin-RevId: 283531105
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants