Rewrite ReduceOp to support arbitrary reduce operations #1305

Merged: 23 commits merged into triton-lang:main from the generic-reduction branch on Apr 13, 2023

Conversation

peterbell10 (Contributor, Author) commented Mar 8, 2023

Fixes #1285

This changes `tt.reduce` to replace `redOp` with a region containing arbitrary code. For example, `tl.sum` is now lowered as:

%res = "tt.reduce"(%arg0) ({
^bb0(%arg1: f32, %arg2: f32):
  %add = arith.addf %arg1, %arg2 : f32
  tt.reduce.return %add : f32
}) {axis = 1 : i32} : (tensor<128x128xf32>) -> tensor<128xf32>

Support for index reductions at the MLIR level is also dropped in favor of simultaneous reductions over multiple tensors, which generalizes the code without loss of performance. So, for example, `argmin` is lowered as:

  %7 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
  %8 = tt.view %7 : (tensor<256xi32>) -> tensor<1x256xi32>
  %9:2 = "tt.reduce"(%6, %8) ({
  ^bb0(%arg4: f32, %arg5: i32, %arg6: f32, %arg7: i32):
    %14 = arith.cmpf olt, %arg4, %arg6 : f32
    %15 = arith.cmpf ogt, %arg4, %arg6 : f32
    %16 = arith.cmpi slt, %arg5, %arg7 : i32
    %17 = arith.select %16, %arg5, %arg7 : i32
    %18 = arith.select %15, %arg7, %17 : i32
    %19 = arith.select %14, %arg5, %18 : i32
    %20 = arith.cmpf olt, %arg4, %arg6 : f32
    %21 = arith.select %20, %arg4, %arg6 : f32
    tt.reduce.return %21, %19 : f32, i32
  }) {axis = 1 : i32} : (tensor<1x256xf32>, tensor<1x256xi32>) -> (tensor<1xf32>, tensor<1xi32>)
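
At the user level, the direction this enables is writing the combine step as an ordinary function. A rough sketch follows; the exact frontend entry point was still being settled in this PR, so `tl.reduce(x, axis, combine_fn)` below is an assumption, not code from this diff:

```python
import triton
import triton.language as tl

@triton.jit
def add_combine(a, b):
    # This body becomes the tt.reduce region shown above
    # (arith.addf followed by tt.reduce.return).
    return a + b

@triton.jit
def row_sum_kernel(x_ptr, out_ptr, M: tl.constexpr, N: tl.constexpr):
    rows = tl.arange(0, M)
    cols = tl.arange(0, N)
    x = tl.load(x_ptr + rows[:, None] * N + cols[None, :])
    # Assumed entry point: reduce along axis 1 with a user-supplied
    # combine function instead of a fixed RedOp enum value.
    row_sums = tl.reduce(x, 1, add_combine)
    tl.store(out_ptr + rows, row_sums)
```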

Comment on lines 34 to 58
// Create a new copy of the reduce block, and inline it
Block *currentBlock = rewriter.getBlock();
Region &parent = *currentBlock->getParent();
rewriter.cloneRegionBefore(reduceOp, &parent.front());
auto &newReduce = parent.front();
auto returnOp = dyn_cast<triton::GenericReduceReturnOp>(newReduce.getTerminator());
rewriter.mergeBlockBefore(&newReduce, &*rewriter.getInsertionPoint(), {acc, cur});
acc = returnOp.getResult();
// Delete the terminator, which is no longer used
rewriter.eraseOp(returnOp);
peterbell10 (Contributor, Author):

This is the main change compared to ReduceOpToLLVM.cpp.

Comment on lines 1208 to 1218
def prod(input: tl.tensor, axis: int, builder: ir.builder) -> tl.tensor:

    def make_mul(reduce_op):
        ir_scalar_ty = input.type.scalar.to_ir(builder)
        region = reduce_op.get_region(0)
        with insertion_guard(builder):
            block = builder.create_block_with_parent(region, [ir_scalar_ty] * 2)
            fmul = builder.create_fmul(block.arg(0), block.arg(1))
            builder.create_reduce_ret(fmul)

    return reduction(input, axis, make_mul, builder)
peterbell10 (Contributor, Author):

I've been using this for testing but the end goal would be to have the compiler build the inner function from a lambda, or something like that. I might need some help with that though.

Collaborator:

Haha yeah, it's not entirely trivial. I think it means the ASTVisitor should be modified to create MLIR functions out of lambdas, and then the reduce op could merge in the basic block from this function.

def TT_GenericReduceOp: TT_Op<"generic_reduce",
    [Pure, DeclareOpInterfaceMethods<InferTypeOpInterface>]> {
  let summary = "Reduction using generic combination algorithm";
  let arguments = (ins TT_Tensor:$operand, I32Attr:$axis);
peterbell10 (Contributor, Author):

@ptillet assuming I can get index reductions working, do you think it would be reasonable to replace ReduceOp entirely?

Collaborator:

Yes, if index reductions can work, then I think we could replace ReduceOp with the new op. We'll have to do some heavier testing to make sure that the performance hasn't decreased

Comment on lines 1177 to 1186
axis = _constexpr_to_value(axis)
n = input.shape[axis]
index = arange(0, n, _builder=_builder)
new_shape = [constexpr(1)] * len(input.shape)
new_shape[axis] = constexpr(n)
index = view(index, new_shape, _builder=_builder)
index = broadcast_to(index, input.shape, _builder=_builder)

values, indices = semantic.min_with_index(input, index, axis, _builder)
return indices
peterbell10 (Contributor, Author):

This is my strategy for argmin/argmax. Instead of special-casing them, I just lower them as a reduction over two tensors:

  %7 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
  %8 = tt.view %7 : (tensor<256xi32>) -> tensor<1x256xi32>
  %9:2 = "tt.generic_reduce"(%6, %8) ({
  ^bb0(%arg4: f32, %arg5: i32, %arg6: f32, %arg7: i32):
    %15 = arith.cmpf olt, %arg4, %arg6 : f32
    %16 = arith.cmpf ogt, %arg4, %arg6 : f32
    %17 = arith.minsi %arg5, %arg7 : i32
    %18 = arith.select %16, %arg7, %17 : i32
    %19 = arith.select %15, %arg5, %18 : i32
    %20 = arith.minf %arg4, %arg6 : f32
    tt.generic_reduce.return %20, %19 : f32, i32
  }) {axis = 1 : i32} : (tensor<1x256xf32>, tensor<1x256xi32>) -> (tensor<1xf32>, tensor<1xi32>)

This has some really nice properties.

  1. The reduction code is the same whether you discard the min/max value or not.
  2. It generalizes perfectly to higher numbers of tensors, e.g. the 3 needed for aten.var_mean.
  3. argmin/argmax-specific logic is defined entirely at the Python level.
  4. In my limited testing so far, it performs identically.
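
A Python-level sketch of the same combine logic (assuming the multi-tensor `tl.reduce` entry point; a hypothetical helper, not code from this diff):

```python
import triton
import triton.language as tl

@triton.jit
def argmin_combine(value1, index1, value2, index2):
    # Mirrors the region above: keep the smaller value; on ties keep the
    # smaller index (the cmpf olt/ogt pair leaves ties to the index select).
    lt = value1 < value2
    gt = value1 > value2
    tie_index = tl.minimum(index1, index2)
    index = tl.where(lt, index1, tl.where(gt, index2, tie_index))
    value = tl.minimum(value1, value2)
    return value, index

# Usage sketch: value and index tensors are reduced simultaneously, e.g.
#   min_val, min_idx = tl.reduce((x, idx), 1, argmin_combine)
```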

include/triton/Analysis/Utility.h (review thread resolved)
@@ -80,6 +80,8 @@ TritonGPUConversionTarget::TritonGPUConversionTarget(
  // Some ops from SCF are illegal
  addIllegalOp<scf::ExecuteRegionOp, scf::ParallelOp, scf::ReduceOp,
               scf::ReduceReturnOp>();
  // We have custom versions of some arith operators
  addIllegalOp<arith::CmpIOp, arith::CmpFOp, arith::SelectOp>();
peterbell10 (Contributor, Author):

I did start running into edge cases in some of the Dialect conversion code, where these were slipping through despite there being a conversion rule for them. It's possible that nested regions are handled differently by MLIR; not sure.

barrier();
for (unsigned i = 0; i < op.getNumOperands(); ++i) {
  store(acc[i], writePtrs[i]);
}
peterbell10 (Contributor, Author):

The new changes here basically just change

foo(acc)
if (withIndex)
    foo(accIndex)

into equivalent for loops.

@peterbell10 peterbell10 changed the title POC: Add generic reduction operator to mlir dialect Rewrite ReduceOp to support arbitrary reduce operations Mar 14, 2023
@peterbell10 peterbell10 marked this pull request as ready for review March 14, 2023 21:12
param_types = [ty.to_ir(_builder) for ty in prototype.param_types]
block = _builder.create_block_with_parent(region, param_types)
args = [tensor(block.arg(i), ty) for i, ty in enumerate(prototype.param_types)]
results = _generator.call_JitFunction(combine_fn, args, kwargs={})
peterbell10 (Contributor, Author):

I'm in two minds about whether this is hacky or elegant, but it works. I pass the CodeGenerator in via the `_generator` argument, much like the `_builder` argument, then call this function (which I factored out of `visit_Call`) to generate the appropriate function definition and call it.
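
Since the combine function is an ordinary `@triton.jit` function invoked through `call_JitFunction`, multi-tensor reductions such as the one-pass `var_mean` from the linked issue fall out naturally. A sketch, assuming `tl.reduce` accepts a tuple of tensors (hypothetical kernel, not part of this PR):

```python
import triton
import triton.language as tl

@triton.jit
def welford_combine(mean1, m2_1, n1, mean2, m2_2, n2):
    # Chan's parallel update: merge two partial (mean, M2, count) states.
    n = n1 + n2
    delta = mean2 - mean1
    mean = mean1 + delta * n2 / n
    m2 = m2_1 + m2_2 + delta * delta * n1 * n2 / n
    return mean, m2, n

@triton.jit
def var_mean_kernel(x_ptr, mean_ptr, var_ptr, N: tl.constexpr):
    x = tl.load(x_ptr + tl.arange(0, N))
    zeros = tl.zeros((N,), dtype=tl.float32)     # per-element M2 starts at zero
    ones = tl.full((N,), 1.0, dtype=tl.float32)  # per-element counts
    # Three tensors reduced in one pass over the same combine region.
    mean, m2, n = tl.reduce((x, zeros, ones), 0, welford_combine)
    tl.store(mean_ptr, mean)
    tl.store(var_ptr, m2 / n)
```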

peterbell10 added a commit to peterbell10/triton that referenced this pull request Mar 14, 2023
This is cherry-picked from triton-lang#1305

If you call a `JITFunction` twice in the same kernel, first with
`int32` then with `uint32`, the second call will treat the unsigned
value as signed. This passes through MLIR without error because MLIR
uses the same types for both, but different operation calls will be
generated.
ptillet pushed a commit that referenced this pull request Mar 15, 2023
This is cherry-picked from #1305

If you call a `JITFunction` twice in the same kernel, first with `int32`
then with `uint32`, the second call will treat the unsigned value as
signed. This passes through MLIR without error because MLIR uses the
same types for both, but different operation calls will be generated so
you may silently get the wrong result.
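
A minimal illustration of the pitfall described in that commit message (a hypothetical kernel showing the shape of the bug, not an exact reproduction):

```python
import triton
import triton.language as tl

@triton.jit
def halve(x):
    # The generated IR differs for signed vs. unsigned inputs
    # (e.g. signed vs. unsigned integer division).
    return x // 2

@triton.jit
def kernel(out_ptr, BLOCK: tl.constexpr):
    i = tl.arange(0, BLOCK)   # int32
    u = i.to(tl.uint32)       # uint32
    a = halve(i)              # first call: helper specialized for int32
    b = halve(u)              # before the fix, the int32 specialization was
                              # silently reused, treating u as signed
    tl.store(out_ptr + i, a + b.to(tl.int32))
```
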
ptillet (Collaborator) commented Mar 16, 2023

Thanks for the PR. Things are busy right now, but we will review it next week!

ptillet (Collaborator) commented Mar 23, 2023

(sorry, things have been busy and haven't had time to review this yet!)

peterbell10 (Contributor, Author):

@ptillet do you have any idea when you might have time to review this?

lib/Analysis/Membar.cpp (review thread resolved, outdated)
include/triton/Dialect/TritonGPU/IR/TritonGPUOps.td (review thread resolved, outdated)
include/triton/Dialect/Triton/IR/TritonOps.td (review thread resolved)
lib/Analysis/Utility.cpp (review thread resolved, outdated)
}

// TODO: This always takes layout from the first argument which
// is fine for argmin/argmax but may not be optimal generally
Contributor:

I think you limit all arguments of reduce to have the same encoding. So this is just fine?

  if (t.getShape() != srcShape) {
    rop.emitError() << "shape mismatch";
  }

peterbell10 (Contributor, Author):

The concern is that the first argument might be cheap to convert but the second argument slow to convert. In that case this will remove the cheap layout conversion and add a more expensive one.

Also, I don't think there's ever a case where shape mismatch can happen.

test/Conversion/triton_ops.mlir (review thread resolved)
lib/Dialect/TritonGPU/Transforms/Utility.cpp (review thread resolved, outdated)
test/TritonGPU/combine.mlir (review thread resolved, outdated)
peterbell10 (Contributor, Author):

@Jokeren I've fixed the merge conflicts with #1497 and #1514. Tests are passing for me on an A100.

@ptillet ptillet enabled auto-merge (squash) April 12, 2023 16:55
ptillet (Collaborator) commented Apr 12, 2023

Benchmark-related changes were merged yesterday, so it's possible the tests got flaky. I'll investigate later today.

@ptillet ptillet merged commit e152183 into triton-lang:main Apr 13, 2023
ptillet (Collaborator) commented Apr 13, 2023

Thanks again for the PR @peterbell10 . And thanks @Jokeren for the review.

peterbell10 added a commit to peterbell10/triton that referenced this pull request Apr 13, 2023
A small oversight in triton-lang#1305: since `view` can rearrange elements, it should be avoided here. Instead, I use indexing with `None` to create new dimensions.
ptillet added a commit that referenced this pull request Apr 13, 2023
A small oversight in #1305: since `view` can rearrange elements, it should be avoided here. Instead, I use indexing with `None` to create new dimensions.

Co-authored-by: Philippe Tillet <phil@openai.com>
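
For reference, the pattern the fix moves to (a sketch with a hypothetical kernel): indexing with `None` adds a unit dimension without permuting elements, whereas `view` makes no ordering guarantee.

```python
import triton
import triton.language as tl

@triton.jit
def expand_index_demo(out_ptr, N: tl.constexpr):
    index = tl.arange(0, N)
    # None-indexing inserts a leading unit dimension and keeps element order;
    # tl.view(index, (1, N)) is free to rearrange elements, hence the fix.
    index_2d = index[None, :]
    tl.store(out_ptr + tl.arange(0, N), tl.sum(index_2d, axis=0))
```
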
@peterbell10 peterbell10 deleted the generic-reduction branch August 18, 2023 14:32
pingzhuu pushed commits to siliconflow/triton that referenced this pull request on Apr 2, 2024 (mirrors of the cherry-pick and the changes described above).
Development

Successfully merging this pull request may close these issues.

[frontend] allow var_mean to be implemented in one pass
3 participants