[opt] Support atomic min/max in warp reduction optimization #2956

Merged: 8 commits from reduction-min-max into taichi-dev:master on Sep 22, 2021
Conversation

@strongoier (Contributor) commented on Sep 17, 2021:

Related issues: #2487, #2951, #2952

Previously, only atomic add/sub were supported by the warp reduction optimization. This PR adds support for atomic min/max as well.
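For context, the optimization has each thread accumulate into thread-local storage and then combines partial results within each warp using shuffle intrinsics, so only one atomic per warp reaches the global destination. Below is a minimal CUDA sketch of that idea for max; it is illustrative only (helper names such as atomic_max_f32 and warp_reduce_max are hypothetical, a 32-lane warp is assumed, and this is not the code Taichi actually generates):

#include <cfloat>

// CUDA has no native floating-point atomicMax, so emulate one with a
// compare-and-swap loop (illustrative helper, ignoring NaN).
__device__ void atomic_max_f32(float *dest, float val) {
  int old = __float_as_int(*dest);
  while (__int_as_float(old) < val) {
    int assumed = old;
    old = atomicCAS(reinterpret_cast<int *>(dest), assumed,
                    __float_as_int(val));
    if (old == assumed)
      break;
  }
}

// Butterfly (XOR) reduction across the 32 lanes of a warp.
__device__ float warp_reduce_max(float v) {
  for (int offset = 16; offset > 0; offset >>= 1)
    v = fmaxf(v, __shfl_xor_sync(0xffffffffu, v, offset));
  return v;
}

__global__ void reduce_max_kernel(const float *in, int n, float *result) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  float v = (i < n) ? in[i] : -FLT_MAX;  // -FLT_MAX is the identity for max
  v = warp_reduce_max(v);
  if ((threadIdx.x & 31) == 0)  // lane 0 only: one atomic per warp
    atomic_max_f32(result, v);
}

Each warp then issues a single atomic instead of 32, and the shuffles stay in registers, which is where the large speedup reported below comes from.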

Thanks @yolo2themoon for providing the following example:

import taichi as ti

ti.init(kernel_profiler=True, arch=ti.cuda)

n = 2048 * 2048 * 2

a1 = ti.ndarray(ti.f32, n)
a2 = ti.ndarray(ti.f32, n)
f1 = ti.field(ti.f32, n)
f2 = ti.field(ti.f32, n)

temp_add = ti.field(ti.f32, ())
temp_max = ti.field(ti.f32, ())

@ti.kernel
def atomicadd_field():
    for P in range(n):
        ti.atomic_add(temp_add[None], f1[P])

@ti.kernel
def atomicadd_array(field_in: ti.any_arr()):
    for P in range(n):
        ti.atomic_add(temp_add[None], field_in[P])

@ti.kernel
def atomicmax_field():
    for P in range(n):
        ti.atomic_max(temp_max[None], f2[P])

@ti.kernel
def atomicmax_array(field_in: ti.any_arr()):
    for P in range(n):
        ti.atomic_max(temp_max[None], field_in[P])

@ti.kernel
def init():
    temp_add[None] = 0.
    temp_max[None] = 0.

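# Warm up once so kernel compilation is excluded from the timings below.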
init()
atomicadd_field()
atomicadd_array(a1)
atomicmax_field()
atomicmax_array(a2)
ti.clear_kernel_profile_info()
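# Time 100 runs of each kernel.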
for i in range(100):
    init()
    atomicadd_field()
    atomicadd_array(a1)
    atomicmax_field()
    atomicmax_array(a2)
ti.print_kernel_profile_info()

Profiling results before this PR:

CUDA Profiler
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
[ 49.77%  52.260 s  10100x |    5.015     5.174     5.767 ms] atomicmax_array_c10_0_kernel_4_range_for
[ 49.29%  51.757 s  10100x |    4.945     5.124     5.699 ms] atomicmax_field_c8_0_kernel_3_range_for
[  0.46%   0.480 s  10200x |    0.045     0.047     0.050 ms] atomicadd_array_c6_0_kernel_2_range_for
[  0.45%   0.476 s  10200x |    0.045     0.047     0.049 ms] atomicadd_field_c4_0_kernel_1_range_for
[  0.03%   0.035 s  10200x |    0.003     0.003     0.007 ms] init_c12_0_kernel_0_serial
-------------------------------------------------------------------------
[100.00%] Total kernel execution time: 105.009 s   number of records: 5
=========================================================================

Profiling results after this PR:

CUDA Profiler
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
[ 26.41%   0.575 s  10100x |    0.056     0.057     0.059 ms] atomicmax_field_c8_0_kernel_3_range_for
[ 26.37%   0.574 s  10100x |    0.055     0.057     0.058 ms] atomicmax_array_c10_0_kernel_4_range_for
[ 22.95%   0.500 s  10200x |    0.048     0.049     0.050 ms] atomicadd_array_c6_0_kernel_2_range_for
[ 22.59%   0.492 s  10200x |    0.047     0.048     0.049 ms] atomicadd_field_c4_0_kernel_1_range_for
[  1.67%   0.036 s  10200x |    0.003     0.004     0.005 ms] init_c12_0_kernel_0_serial
-------------------------------------------------------------------------
[100.00%] Total kernel execution time:   2.178 s   number of records: 5
=========================================================================

We can see a ~90x speedup for atomic max (average 5.1 ms → 0.057 ms per launch).


@strongoier (Author) commented: /format

@strongoier changed the title from "[opt] Support min/max in warp reduction optimization" to "[opt] Support atomic min/max in warp reduction optimization" on Sep 17, 2021
op_type == AtomicOpType::max || op_type == AtomicOpType::min;
}

AtomicOpType atomic_op_genre(AtomicOpType op_type) {
Member commented on the diff above:

Why genre...?

@strongoier (Author) replied:

See latest commit.

@strongoier (Author) commented: /format

@strongoier marked this pull request as a draft on September 18, 2021 at 16:42
@strongoier (Author) commented: /format

@strongoier marked this pull request as ready for review on September 20, 2021 at 08:10
@k-ye (Member) left a review:

LGTM!

BTW, do you still remember why ND-Array wasn't supported in the beginning?

@@ -124,5 +124,57 @@ inline bool needs_grad(DataType dt) {
   return is_real(dt);
 }

+inline TypedConstant get_max_value(DataType dt) {
+  if (dt->is_primitive(PrimitiveTypeID::i8)) {
Member commented on the diff above:

nit: in the future maybe we can predefine something like PER_TYPE(i8, int8), so we don't have to repeat this for every type.
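A sketch of that X-macro idea (hypothetical; assumes Taichi's int8/float32 type aliases, the TypedConstant(DataType, value) constructor, and the TI_NOT_IMPLEMENTED macro):

#include <limits>

// Hypothetical X-macro listing each (PrimitiveTypeID, C++ type) pair once.
#define PER_PRIMITIVE_TYPE(F) \
  F(i8, int8)                 \
  F(i16, int16)               \
  F(i32, int32)               \
  F(i64, int64)               \
  F(f32, float32)             \
  F(f64, float64)

inline TypedConstant get_max_value(DataType dt) {
#define MAX_VALUE_CASE(id, cpp_type)                                \
  if (dt->is_primitive(PrimitiveTypeID::id))                        \
    return TypedConstant(dt, std::numeric_limits<cpp_type>::max());
  PER_PRIMITIVE_TYPE(MAX_VALUE_CASE)
#undef MAX_VALUE_CASE
  TI_NOT_IMPLEMENTED;
}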

@@ -134,7 +143,12 @@ void make_thread_local_offload(OffloadedStmt *offload) {
       TypeFactory::create_vector_or_scalar_type(1, data_type, true));

   auto zero = offload->tls_prologue->insert(
-      std::make_unique<ConstStmt>(TypedConstant(data_type, 0)), -1);
+      std::make_unique<ConstStmt>(dest.second == AtomicOpType::max
Member commented on the diff above:

nit: get_reduction_init_value()?
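For reference, a sketch of the helper that comment suggests (hypothetical; assumes a get_min_value counterpart to the new get_max_value):

// Identity element for each reduction kind, used to initialize the
// thread-local accumulator in the TLS prologue (hypothetical sketch).
inline TypedConstant get_reduction_init_value(AtomicOpType op_type,
                                              DataType dt) {
  switch (op_type) {
    case AtomicOpType::add:
      return TypedConstant(dt, 0);
    case AtomicOpType::max:
      return get_min_value(dt);  // every input is >= the type's minimum
    case AtomicOpType::min:
      return get_max_value(dt);  // every input is <= the type's maximum
    default:
      TI_NOT_IMPLEMENTED;
  }
}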

@k-ye merged commit 5a18210 into taichi-dev:master on Sep 22, 2021
@strongoier (Author) commented:

> LGTM!
>
> BTW, do you still remember why ND-Array wasn't supported in the beginning?

There was no alias analysis for ExternalPtrStmt, so the maybe_same_address checks conservatively returned true and the access pattern was never recognized as an optimization opportunity. #2952 fixed that.
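To make the conservatism concrete, a generic illustration (a self-contained simplification, not Taichi's actual pass):

// When one side of an aliasing query is a pointer kind the analysis
// does not understand, the only safe answer is "maybe", and a "maybe"
// on the reduction destination disables the rewrite.
enum class PtrKind { field, external };

struct Access {
  PtrKind kind;
  int snode_id;  // meaningful only for field accesses
};

bool maybe_same_address(const Access &a, const Access &b) {
  if (a.kind == PtrKind::external || b.kind == PtrKind::external)
    return true;  // pre-#2952: no alias analysis for external pointers
  return a.snode_id == b.snode_id;  // fields: compare destinations
}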

@strongoier deleted the reduction-min-max branch on September 22, 2021
@qiao-bo mentioned this pull request on Sep 23, 2021