[opt] Support atomic min/max in warp reduction optimization #2956

Merged: 8 commits from reduction-min-max into taichi-dev:master on Sep 22, 2021
Conversation

@strongoier (Contributor) commented on Sep 17, 2021:

Related issues: #2487, #2951, #2952

Previously, only atomic add/sub were supported by the warp reduction optimization. This PR adds support for atomic min/max as well.
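For context, the optimization has each thread accumulate into thread-local storage and then combines partial results within each warp using shuffle intrinsics, so only one atomic per warp reaches the global destination. Below is a minimal CUDA sketch of that idea for max; it is illustrative only (helper names such as atomic_max_f32 and warp_reduce_max are hypothetical, a 32-lane warp is assumed, and this is not the code Taichi actually generates):

#include <cfloat>

// CUDA has no native floating-point atomicMax, so emulate one with a
// compare-and-swap loop (illustrative helper, ignoring NaN).
__device__ void atomic_max_f32(float *dest, float val) {
  int old = __float_as_int(*dest);
  while (__int_as_float(old) < val) {
    int assumed = old;
    old = atomicCAS(reinterpret_cast<int *>(dest), assumed,
                    __float_as_int(val));
    if (old == assumed)
      break;
  }
}

// Butterfly (XOR) reduction across the 32 lanes of a warp.
__device__ float warp_reduce_max(float v) {
  for (int offset = 16; offset > 0; offset >>= 1)
    v = fmaxf(v, __shfl_xor_sync(0xffffffffu, v, offset));
  return v;
}

__global__ void reduce_max_kernel(const float *in, int n, float *result) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  float v = (i < n) ? in[i] : -FLT_MAX;  // -FLT_MAX is the identity for max
  v = warp_reduce_max(v);
  if ((threadIdx.x & 31) == 0)  // lane 0 only: one atomic per warp
    atomic_max_f32(result, v);
}

Each warp then issues a single atomic instead of 32, and the shuffles stay in registers, which is where the large speedup reported below comes from.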

Thanks @yolo2themoon for providing the following example:

import taichi as ti

ti.init(kernel_profiler=True, arch=ti.cuda)

n = 2048 * 2048 * 2

a1 = ti.ndarray(ti.f32, n)
a2 = ti.ndarray(ti.f32, n)
f1 = ti.field(ti.f32, n)
f2 = ti.field(ti.f32, n)

temp_add = ti.field(ti.f32, ())
temp_max = ti.field(ti.f32, ())

@ti.kernel
def atomicadd_field():
    for P in range(n):
        ti.atomic_add(temp_add[None], f1[P])

@ti.kernel
def atomicadd_array(field_in: ti.any_arr()):
    for P in range(n):
        ti.atomic_add(temp_add[None], field_in[P])

@ti.kernel
def atomicmax_field():
    for P in range(n):
        ti.atomic_max(temp_max[None], f2[P])

@ti.kernel
def atomicmax_array(field_in: ti.any_arr()):
    for P in range(n):
        ti.atomic_max(temp_max[None], field_in[P])

@ti.kernel
def init():
    temp_add[None] = 0.
    temp_max[None] = 0.

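# Warm up once so kernel compilation is excluded from the timings below.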
init()
atomicadd_field()
atomicadd_array(a1)
atomicmax_field()
atomicmax_array(a2)
ti.clear_kernel_profile_info()
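# Time 100 runs of each kernel.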
for i in range(100):
    init()
    atomicadd_field()
    atomicadd_array(a1)
    atomicmax_field()
    atomicmax_array(a2)
ti.print_kernel_profile_info()

Profiling results before this PR:

CUDA Profiler
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
[ 49.77%  52.260 s  10100x |    5.015     5.174     5.767 ms] atomicmax_array_c10_0_kernel_4_range_for
[ 49.29%  51.757 s  10100x |    4.945     5.124     5.699 ms] atomicmax_field_c8_0_kernel_3_range_for
[  0.46%   0.480 s  10200x |    0.045     0.047     0.050 ms] atomicadd_array_c6_0_kernel_2_range_for
[  0.45%   0.476 s  10200x |    0.045     0.047     0.049 ms] atomicadd_field_c4_0_kernel_1_range_for
[  0.03%   0.035 s  10200x |    0.003     0.003     0.007 ms] init_c12_0_kernel_0_serial
-------------------------------------------------------------------------
[100.00%] Total kernel execution time: 105.009 s   number of records: 5
=========================================================================

Profiling results after this PR:

CUDA Profiler
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
[ 26.41%   0.575 s  10100x |    0.056     0.057     0.059 ms] atomicmax_field_c8_0_kernel_3_range_for
[ 26.37%   0.574 s  10100x |    0.055     0.057     0.058 ms] atomicmax_array_c10_0_kernel_4_range_for
[ 22.95%   0.500 s  10200x |    0.048     0.049     0.050 ms] atomicadd_array_c6_0_kernel_2_range_for
[ 22.59%   0.492 s  10200x |    0.047     0.048     0.049 ms] atomicadd_field_c4_0_kernel_1_range_for
[  1.67%   0.036 s  10200x |    0.003     0.004     0.005 ms] init_c12_0_kernel_0_serial
-------------------------------------------------------------------------
[100.00%] Total kernel execution time:   2.178 s   number of records: 5
=========================================================================

We can see a ~90x speedup for atomic max (average 5.1 ms → 0.057 ms per launch).


@strongoier (Author) commented: /format

@strongoier changed the title from "[opt] Support min/max in warp reduction optimization" to "[opt] Support atomic min/max in warp reduction optimization" on Sep 17, 2021
op_type == AtomicOpType::max || op_type == AtomicOpType::min;
}

AtomicOpType atomic_op_genre(AtomicOpType op_type) {
Member commented on the diff above:

Why genre...?

@strongoier (Author) replied:

See latest commit.

@strongoier (Author) commented: /format

@strongoier marked this pull request as a draft on September 18, 2021 at 16:42
@strongoier (Author) commented: /format

@strongoier marked this pull request as ready for review on September 20, 2021 at 08:10
@k-ye (Member) left a review:

LGTM!

BTW, do you still remember why ND-Array wasn't supported in the beginning?

@@ -124,5 +124,57 @@ inline bool needs_grad(DataType dt) {
   return is_real(dt);
 }

+inline TypedConstant get_max_value(DataType dt) {
+  if (dt->is_primitive(PrimitiveTypeID::i8)) {
Member commented on the diff above:

nit: in the future maybe we can predefine something like PER_TYPE(i8, int8), so we don't have to repeat this for every type.
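A sketch of that X-macro idea (hypothetical; assumes Taichi's int8/float32 type aliases, the TypedConstant(DataType, value) constructor, and the TI_NOT_IMPLEMENTED macro):

#include <limits>

// Hypothetical X-macro listing each (PrimitiveTypeID, C++ type) pair once.
#define PER_PRIMITIVE_TYPE(F) \
  F(i8, int8)                 \
  F(i16, int16)               \
  F(i32, int32)               \
  F(i64, int64)               \
  F(f32, float32)             \
  F(f64, float64)

inline TypedConstant get_max_value(DataType dt) {
#define MAX_VALUE_CASE(id, cpp_type)                                \
  if (dt->is_primitive(PrimitiveTypeID::id))                        \
    return TypedConstant(dt, std::numeric_limits<cpp_type>::max());
  PER_PRIMITIVE_TYPE(MAX_VALUE_CASE)
#undef MAX_VALUE_CASE
  TI_NOT_IMPLEMENTED;
}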

@@ -134,7 +143,12 @@ void make_thread_local_offload(OffloadedStmt *offload) {
       TypeFactory::create_vector_or_scalar_type(1, data_type, true));

   auto zero = offload->tls_prologue->insert(
-      std::make_unique<ConstStmt>(TypedConstant(data_type, 0)), -1);
+      std::make_unique<ConstStmt>(dest.second == AtomicOpType::max
Member commented on the diff above:

nit: get_reduction_init_value()?
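For reference, a sketch of the helper that comment suggests (hypothetical; assumes a get_min_value counterpart to the new get_max_value):

// Identity element for each reduction kind, used to initialize the
// thread-local accumulator in the TLS prologue (hypothetical sketch).
inline TypedConstant get_reduction_init_value(AtomicOpType op_type,
                                              DataType dt) {
  switch (op_type) {
    case AtomicOpType::add:
      return TypedConstant(dt, 0);
    case AtomicOpType::max:
      return get_min_value(dt);  // every input is >= the type's minimum
    case AtomicOpType::min:
      return get_max_value(dt);  // every input is <= the type's maximum
    default:
      TI_NOT_IMPLEMENTED;
  }
}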

@k-ye merged commit 5a18210 into taichi-dev:master on Sep 22, 2021
@strongoier (Author) commented:

> LGTM!
>
> BTW, do you still remember why ND-Array wasn't supported in the beginning?

There was no alias analysis for ExternalPtrStmt, so the maybe_same_address checks conservatively returned true and the access pattern was never recognized as an optimization opportunity. #2952 fixed that.
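To make the conservatism concrete, a generic illustration (a self-contained simplification, not Taichi's actual pass):

// When one side of an aliasing query is a pointer kind the analysis
// does not understand, the only safe answer is "maybe", and a "maybe"
// on the reduction destination disables the rewrite.
enum class PtrKind { field, external };

struct Access {
  PtrKind kind;
  int snode_id;  // meaningful only for field accesses
};

bool maybe_same_address(const Access &a, const Access &b) {
  if (a.kind == PtrKind::external || b.kind == PtrKind::external)
    return true;  // pre-#2952: no alias analysis for external pointers
  return a.snode_id == b.snode_id;  // fields: compare destinations
}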

@strongoier deleted the reduction-min-max branch on September 22, 2021
@qiao-bo mentioned this pull request on Sep 23, 2021