[IR] Improve dynamic SNode performance #1178

yuanming-hu · 2020-06-07T20:20:20Z

Description
For some reason, dynamic SNodes have become very slow recently. After bisecting, I found that the commit that introduced this behavior is c6086e0

Test case:

import taichi as ti
import time

ti.init(arch=ti.gpu, print_ir=True)

x = ti.Vector(2, dt=ti.f32)
y = ti.Vector(2, dt=ti.f32)

ti.root.dynamic(ti.i, 1024 * 1024, 1024).place(x, y)

@ti.kernel
def initialize():
    for i in range(1024 * 1024):
        x[i] = [i, i]
        
@ti.kernel
def copy():
    for i in x:
        y[i] = x[i]

initialize()

while True:
    t = time.time()
    copy()
    ti.sync()
    print((time.time() - t) * 1000)

Before that commit, copy takes 1.6 ms; after it, copy takes > 5 s.

The IR:

Before:

kernel {
  $0 = offloaded clear_list S1dynamic
  $1 = offloaded listgen S0root->S1dynamic
  $2 = offloaded struct_for(S1dynamic) block_dim=0 {
    <i32 x1> $3 = loop index 0
    <gen*x1> $4 = get root
    <i32 x1> $5 = const [0]
    <gen*x1> $6 = [S0root][root]::lookup($4, $5) activate = false
    <gen*x1> $7 = get child [S0root->S1dynamic] $6
    <i32 x1> $8 = bit_extract($3 + 0, 0~20)
    <gen*x1> $9 = [S1dynamic][dynamic]::lookup($7, $8) activate = false
    <f32*x1> $10 = get child [S1dynamic->S2place_f32] $9
    <f32 x1> $11 = global load $10
    <f32*x1> $12 = get child [S1dynamic->S4place_f32] $9
    <f32*x1> $13 : global store [$12 <- $11]
    <f32*x1> $14 = get child [S1dynamic->S3place_f32] $9
    <f32 x1> $15 = global load $14
    <f32*x1> $16 = get child [S1dynamic->S5place_f32] $9
    <f32*x1> $17 : global store [$16 <- $15]
  }
}

After:

kernel {
  $0 = offloaded clear_list S1dynamic
  $1 = offloaded listgen S0root->S1dynamic
  $2 = offloaded struct_for(S1dynamic) block_dim=0 {
    <i32 x1> $3 = loop $2 index 0
    <gen*x1> $4 = get root
    <i32 x1> $5 = const [0]
    <gen*x1> $6 = [S0root][root]::lookup($4, $5) activate = false
    <gen*x1> $7 = get child [S0root->S1dynamic] $6
    <i32 x1> $8 = bit_extract($3 + 0, 0~20)
    <gen*x1> $9 = [S1dynamic][dynamic]::lookup($7, $8) activate = false
    <f32*x1> $10 = get child [S1dynamic->S2place_f32] $9
    <f32 x1> $11 = global load $10
    <i32 x1> $12 = loop $2 index 0
    <i32 x1> $13 = bit_extract($12 + 0, 0~20)
    <gen*x1> $14 = [S1dynamic][dynamic]::lookup($7, $13) activate = true
    <f32*x1> $15 = get child [S1dynamic->S4place_f32] $14
    <f32*x1> $16 : global store [$15 <- $11]
    <i32 x1> $17 = loop $2 index 0
    <i32 x1> $18 = bit_extract($17 + 0, 0~20)
    <gen*x1> $19 = [S1dynamic][dynamic]::lookup($7, $18) activate = false
    <f32*x1> $20 = get child [S1dynamic->S3place_f32] $19
    <f32 x1> $21 = global load $20
    <i32 x1> $22 = loop $2 index 0
    <i32 x1> $23 = bit_extract($22 + 0, 0~20)
    <gen*x1> $24 = [S1dynamic][dynamic]::lookup($7, $23) activate = true
    <f32*x1> $25 = get child [S1dynamic->S5place_f32] $24
    <f32*x1> $26 : global store [$25 <- $21]
  }
}

I guess it's the dynamic SNode activation that leads to the very low performance: in the IR after the commit, the lookups feeding the global stores have activate = true. Probably the activation weakening pass stopped working after we introduced LoopIndexStmt. I'm fixing that now.
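To make the activation-weakening idea concrete, here is a minimal sketch in plain Python. It is not Taichi's actual pass: the `LoopIndex` and `Lookup` classes and the `weaken_activation` function are hypothetical names invented for illustration, and the sketch assumes the lookup index is the loop index statement itself (in the real IR above it goes through a `bit_extract` first). The reasoning it models is that a struct-for over an SNode only visits elements that are already active, so a lookup into that same SNode at the loop's own index should not need `activate = true`.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LoopIndex:
    """Toy stand-in for a loop index statement (hypothetical, not Taichi's class)."""
    loop: str   # id of the enclosing struct-for, e.g. "$2"
    index: int  # which loop index is read, e.g. 0


@dataclass(frozen=True)
class Lookup:
    """Toy stand-in for an SNode lookup statement (hypothetical)."""
    snode: str       # SNode being looked up, e.g. "S1dynamic"
    index: object    # statement that supplies the index
    activate: bool   # whether the lookup may activate the element


def weaken_activation(stmts, loop_snode, loop_id):
    """Clear `activate` on lookups into the looped-over SNode at the loop's own index."""
    out = []
    for s in stmts:
        if (isinstance(s, Lookup) and s.activate
                and s.snode == loop_snode
                and isinstance(s.index, LoopIndex)
                and s.index.loop == loop_id):
            # The struct-for only reaches active elements, so no activation is needed.
            s = replace(s, activate=False)
        out.append(s)
    return out


body = [
    Lookup("S1dynamic", LoopIndex("$2", 0), activate=True),  # indexed by the loop index
    Lookup("S1dynamic", "some_other_index", activate=True),  # arbitrary index, left alone
]
weakened = weaken_activation(body, loop_snode="S1dynamic", loop_id="$2")
```

A pass shaped like this depends on recognizing which statement is the loop index, which is one plausible way a change in how loop indices are represented could stop it from matching.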


yuanming-hu · 2020-06-07T20:24:57Z

Btw, @xumingkuan do you have an idea why LoopIndexStmts are not CSEd? This happens in the latest commit as well.

xumingkuan · 2020-06-07T20:27:58Z

Interesting... I thought whole_kernel_cse should've done CSE in this case after #1082?

xumingkuan · 2020-06-07T20:41:17Z

Interesting... I thought whole_kernel_cse should've done CSE in this case after #1082?

I will investigate this later.
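For readers unfamiliar with the term, the CSE in question can be sketched as follows. This is a toy value-numbering pass in plain Python over hypothetical `(name, op, operands)` tuples, not the `whole_kernel_cse` implementation, and it is only valid for side-effect-free statements. It shows why merging the repeated `loop $2 index 0` statements matters: once they are aliased to one statement, the `bit_extract` statements that consume them become identical and can be merged too.

```python
def cse(stmts):
    """Toy common-subexpression elimination.

    stmts: list of (name, op, operands) tuples for pure statements.
    Returns (kept, alias), where alias maps each removed name to its survivor.
    """
    seen = {}    # (op, operands) -> name of the first statement computing it
    alias = {}   # removed statement name -> surviving statement name
    kept = []
    for name, op, operands in stmts:
        # Rewrite operands through the alias map so downstream duplicates line up.
        operands = tuple(alias.get(o, o) for o in operands)
        key = (op, operands)
        if key in seen:
            alias[name] = seen[key]
        else:
            seen[key] = name
            kept.append((name, op, operands))
    return kept, alias


# Statement names mirror part of the "After" IR above; the tuple encoding is an assumption.
ir = [
    ("$3", "loop_index", ("$2", 0)),
    ("$8", "bit_extract", ("$3", 0, 20)),
    ("$12", "loop_index", ("$2", 0)),
    ("$13", "bit_extract", ("$12", 0, 20)),
]
kept, alias = cse(ir)
```

If the two `loop_index` statements were never treated as equal, `$8` and `$13` would have different operands and neither they nor the lookups that use them could be merged.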

yuanming-hu · 2020-06-12T18:25:14Z

taichi_elements (bd9ebdc7b86f748a1bdca7b74ee95151f3a8c863)

with Taichi ea30eec :
GPU: 4-5 FPS CPU: 6 FPS

with Taichi v0.6.4:
GPU: 50 FPS CPU: 16 FPS

xumingkuan · 2020-06-12T20:41:47Z

After #1229, TODO:

kernel {
  $0 = offloaded clear_list S17dynamic
  $1 = offloaded listgen S0root->S17dynamic
  $2 = offloaded struct_for(S17dynamic) block_dim=0 {
    <i32 x1> $3 = loop $2 index 0
    <gen*x1> $4 = get root
    <i32 x1> $5 = const [0]
    <gen*x1> $6 = [S0root][root]::lookup($4, $5) activate = false
    <gen*x1> $7 = get child [S0root->S17dynamic] $6
    ...
    <gen*x1> $296 = [S17dynamic][dynamic]::lookup($7, $3) activate = true
    <f32*x1> $297 = get child [S17dynamic->S18place_f32] $296
    <f32 x1> $298 = atomic add($297, $295)
    <f32 x1> $299 = mul $294 $268
    <f32*x1> $300 = get child [S17dynamic->S19place_f32] $296
    <f32 x1> $301 = atomic add($300, $299)
  }
}

xumingkuan · 2020-06-13T04:19:45Z

master branch, before lower_access:

  $13 = offloaded struct_for(S17dynamic) block_dim=0 {
    <i32 x1> $14 = loop $13 index 0
    <f32*x1> $15 = global ptr [S18place_f32], index [$14] activate=false
    <f32 x1> $16 = global load $15
    <f32 x1> $17 = const [128.0]
    <f32 x1> $18 = mul $16 $17
    <f32 x1> $19 = const [0.5]
    <f32 x1> $20 = sub $18 $19
    <i32 x1> $21 = cast_value<i32> $20
    <f32*x1> $22 = global ptr [S19place_f32], index [$14] activate=false
    <f32 x1> $23 = global load $22
    <f32 x1> $24 = const [128.0]
    <f32 x1> $25 = mul $23 $24
    <f32 x1> $26 = const [0.5]
    <f32 x1> $27 = sub $25 $26
    <i32 x1> $28 = cast_value<i32> $27
    <i32 x1> $137 = const [2]
    <i32 x1> $138 = add $28 $137
    <i32 x1> $237 = const [2]
    <i32 x1> $238 = add $21 $237
    <f32*x1> $285 = global ptr [S13place_f32], index [$238, $138] activate=false
    <f32 x1> $286 = global load $285
    <f32*x1> $287 = global ptr [S14place_f32], index [$238, $138] activate=false
    <f32 x1> $288 = global load $287

yuanming-hu self-assigned this Jun 7, 2020

xumingkuan mentioned this issue Jun 7, 2020

[Opt] [bug] Improve the whole kernel CSE pass (stage 1) and fix a bug #1180

Merged

This was referenced Jun 7, 2020

[Perf] Improve dynamic SNode performance #1182

Merged

[misc] Remove usage of deprecated functions in benchmarks #1193

Merged

[llvm] [opt] CSE for LoopIndexStmt #1203

Merged

xumingkuan mentioned this issue Jun 12, 2020

[Opt] Eliminate useless OffsetAndExtractBitsStmt of a LoopIndexStmt #1229

Merged

This was referenced Jun 13, 2020

[Perf] Improve dynamic SNode performance (stage 2) #1232

Merged

[perf] Improve dynamic SNode performance (stage 3) #1238

Merged

This was referenced Jun 16, 2020

[Bug] [opt] [doc] Fix OffsetAndExtractBitsStmt optimization and improve documentation on virtual/physical indices #1259

Merged

[perf] Refactor kernel profiler #1261

Merged
