[IR] Improve dynamic SNode performance #1178

Open
yuanming-hu opened this issue Jun 7, 2020 · 6 comments
@yuanming-hu
Member

Description
For some reason, dynamic SNodes have become very slow recently. After bisecting, I found that the commit that introduced this behavior is c6086e0.

Test case:

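import taichi as ti
import time

# print_ir=True dumps the IR listings shown below
ti.init(arch=ti.gpu, print_ir=True)

x = ti.Vector(2, dt=ti.f32)
y = ti.Vector(2, dt=ti.f32)

# Dynamic SNode along i: up to 1024 * 1024 elements, chunk size 1024
ti.root.dynamic(ti.i, 1024 * 1024, 1024).place(x, y)

@ti.kernel
def initialize():
    # Range-for: populate all 1024 * 1024 elements of x
    for i in range(1024 * 1024):
        x[i] = [i, i]

@ti.kernel
def copy():
    # Struct-for over the dynamic SNode; this is the slow kernel
    for i in x:
        y[i] = x[i]

initialize()

while True:
    t = time.time()
    copy()
    ti.sync()  # wait for the device before reading the timer
    print((time.time() - t) * 1000)  # elapsed time in ms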

Before that commit, copy takes 1.6 ms; after it, copy takes more than 5 s.

The IR:

Before:

kernel {
  $0 = offloaded clear_list S1dynamic
  $1 = offloaded listgen S0root->S1dynamic
  $2 = offloaded struct_for(S1dynamic) block_dim=0 {
    <i32 x1> $3 = loop index 0
    <gen*x1> $4 = get root
    <i32 x1> $5 = const [0]
    <gen*x1> $6 = [S0root][root]::lookup($4, $5) activate = false
    <gen*x1> $7 = get child [S0root->S1dynamic] $6
    <i32 x1> $8 = bit_extract($3 + 0, 0~20)
    <gen*x1> $9 = [S1dynamic][dynamic]::lookup($7, $8) activate = false
    <f32*x1> $10 = get child [S1dynamic->S2place_f32] $9
    <f32 x1> $11 = global load $10
    <f32*x1> $12 = get child [S1dynamic->S4place_f32] $9
    <f32*x1> $13 : global store [$12 <- $11]
    <f32*x1> $14 = get child [S1dynamic->S3place_f32] $9
    <f32 x1> $15 = global load $14
    <f32*x1> $16 = get child [S1dynamic->S5place_f32] $9
    <f32*x1> $17 : global store [$16 <- $15]
  }
}

After:

kernel {
  $0 = offloaded clear_list S1dynamic
  $1 = offloaded listgen S0root->S1dynamic
  $2 = offloaded struct_for(S1dynamic) block_dim=0 {
    <i32 x1> $3 = loop $2 index 0
    <gen*x1> $4 = get root
    <i32 x1> $5 = const [0]
    <gen*x1> $6 = [S0root][root]::lookup($4, $5) activate = false
    <gen*x1> $7 = get child [S0root->S1dynamic] $6
    <i32 x1> $8 = bit_extract($3 + 0, 0~20)
    <gen*x1> $9 = [S1dynamic][dynamic]::lookup($7, $8) activate = false
    <f32*x1> $10 = get child [S1dynamic->S2place_f32] $9
    <f32 x1> $11 = global load $10
    <i32 x1> $12 = loop $2 index 0
    <i32 x1> $13 = bit_extract($12 + 0, 0~20)
    <gen*x1> $14 = [S1dynamic][dynamic]::lookup($7, $13) activate = true
    <f32*x1> $15 = get child [S1dynamic->S4place_f32] $14
    <f32*x1> $16 : global store [$15 <- $11]
    <i32 x1> $17 = loop $2 index 0
    <i32 x1> $18 = bit_extract($17 + 0, 0~20)
    <gen*x1> $19 = [S1dynamic][dynamic]::lookup($7, $18) activate = false
    <f32*x1> $20 = get child [S1dynamic->S3place_f32] $19
    <f32 x1> $21 = global load $20
    <i32 x1> $22 = loop $2 index 0
    <i32 x1> $23 = bit_extract($22 + 0, 0~20)
    <gen*x1> $24 = [S1dynamic][dynamic]::lookup($7, $23) activate = true
    <f32*x1> $25 = get child [S1dynamic->S5place_f32] $24
    <f32*x1> $26 : global store [$25 <- $21]
  }
}

I guess it's the dynamic SNode activation (the lookups with activate = true in the "After" IR) that leads to the very low performance. The activation weakening pass has probably stopped working since we introduced LoopIndexStmt. I'm fixing that now.
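
To make the hypothesis concrete, here is a minimal Python sketch of what an activation weakening pass does, on a toy IR that mimics the dumps above. This is not Taichi's implementation (the real pass lives in the compiler and works on the actual IR classes). `Stmt`, `weaken_activation`, and the assumption that a `bit_extract` of the loop index still addresses the element being visited are all hypothetical and only for illustration.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Stmt:
    """One toy IR statement; `name` plays the role of `$3`, `$9`, ..."""
    name: str
    op: str                      # e.g. "loop_index", "bit_extract", "lookup"
    args: Tuple[str, ...] = ()   # names of operand statements
    snode: Optional[str] = None  # SNode being looked up (lookups only)
    activate: bool = False       # lookups only


def weaken_activation(body: List[Stmt], loop_snode: str) -> List[Stmt]:
    """Clear `activate` on lookups into `loop_snode` indexed by the loop index.

    A struct-for only visits elements that are already active, so such a
    lookup never needs to activate anything.
    """
    defs: Dict[str, Stmt] = {s.name: s for s in body}

    def from_loop_index(name: str) -> bool:
        stmt = defs.get(name)
        if stmt is None:
            return False
        if stmt.op == "loop_index":
            return True
        # Assumption of this sketch: bit_extract of the loop index still
        # refers to the element the struct-for is currently visiting.
        return stmt.op == "bit_extract" and from_loop_index(stmt.args[0])

    out: List[Stmt] = []
    for s in body:
        if (s.op == "lookup" and s.activate and s.snode == loop_snode
                and from_loop_index(s.args[-1])):
            s = replace(s, activate=False)
        out.append(s)
    return out


# Toy version of $12..$14 from the "After" IR.
body = [
    Stmt("$12", "loop_index"),
    Stmt("$13", "bit_extract", ("$12",)),
    Stmt("$14", "lookup", ("$7", "$13"), snode="S1dynamic", activate=True),
]
weakened = weaken_activation(body, "S1dynamic")
print(weakened[2].activate)  # False
```

A pass like this depends on recognizing where the lookup index comes from, which is why a change to how loop indices are represented can silently disable it.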

@yuanming-hu yuanming-hu self-assigned this Jun 7, 2020
@yuanming-hu
Member Author

Btw, @xumingkuan do you have an idea why LoopIndexStmts are not CSEd? This happens in the latest commit as well.
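
For readers unfamiliar with the term, the following is a rough Python sketch of common subexpression elimination on a toy version of the "After" IR. It is only an illustration under simplifying assumptions: statements are plain tuples, every opcode is treated as pure, and the `cse` function is hypothetical rather than Taichi's `whole_kernel_cse`. A real pass must also avoid merging loads across stores and other side effects.

```python
from typing import Dict, List, Tuple

# A toy statement: (result name, opcode, operands). Operands are either
# names of earlier results or literal strings.
Stmt = Tuple[str, str, Tuple[str, ...]]


def cse(body: List[Stmt]) -> List[Stmt]:
    """Drop statements that recompute a value that is already available."""
    seen: Dict[Tuple[str, Tuple[str, ...]], str] = {}  # (op, operands) -> name
    alias: Dict[str, str] = {}                          # removed name -> kept name
    out: List[Stmt] = []
    for name, op, args in body:
        # Rewrite operands first so that chains of duplicates also match.
        args = tuple(alias.get(a, a) for a in args)
        key = (op, args)
        if key in seen:
            alias[name] = seen[key]
        else:
            seen[key] = name
            out.append((name, op, args))
    return out


# $3/$8 and $12/$13 from the "After" IR compute the same values.
body = [
    ("$3", "loop_index", ("$2", "0")),
    ("$8", "bit_extract", ("$3", "0~20")),
    ("$12", "loop_index", ("$2", "0")),
    ("$13", "bit_extract", ("$12", "0~20")),
]
print(cse(body))
# [('$3', 'loop_index', ('$2', '0')), ('$8', 'bit_extract', ('$3', '0~20'))]
```

Once the repeated loop index statements collapse into `$3`, the repeated `bit_extract` statements become identical as well and are removed in the same sweep.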

@xumingkuan
Collaborator

Interesting... I thought whole_kernel_cse should've done CSE in this case after #1082?

@xumingkuan
Collaborator

> Interesting... I thought whole_kernel_cse should've done CSE in this case after #1082?

I will investigate this later.

@yuanming-hu
Member Author

taichi_elements (bd9ebdc7b86f748a1bdca7b74ee95151f3a8c863)

With Taichi ea30eec:
GPU: 4-5 FPS, CPU: 6 FPS

With Taichi v0.6.4:
GPU: 50 FPS, CPU: 16 FPS
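
As a quick sanity check on the size of the regression, the frame rates above can be turned into slowdown factors. The snippet below is plain arithmetic on the numbers quoted in this comment; the helper name `slowdown` is made up for illustration.

```python
def slowdown(fps_fast: float, fps_slow: float) -> float:
    """How many times longer a frame takes at `fps_slow` than at `fps_fast`."""
    return fps_fast / fps_slow


# v0.6.4 versus ea30eec, using the FPS figures quoted above.
gpu = (slowdown(50, 5), slowdown(50, 4))
cpu = slowdown(16, 6)
print(f"GPU: {gpu[0]:.1f}x to {gpu[1]:.1f}x slower")  # GPU: 10.0x to 12.5x slower
print(f"CPU: {cpu:.1f}x slower")                      # CPU: 2.7x slower
```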

@xumingkuan
Collaborator

After #1229, TODO:

kernel {
  $0 = offloaded clear_list S17dynamic
  $1 = offloaded listgen S0root->S17dynamic
  $2 = offloaded struct_for(S17dynamic) block_dim=0 {
    <i32 x1> $3 = loop $2 index 0
    <gen*x1> $4 = get root
    <i32 x1> $5 = const [0]
    <gen*x1> $6 = [S0root][root]::lookup($4, $5) activate = false
    <gen*x1> $7 = get child [S0root->S17dynamic] $6
    ...
    <gen*x1> $296 = [S17dynamic][dynamic]::lookup($7, $3) activate = true
    <f32*x1> $297 = get child [S17dynamic->S18place_f32] $296
    <f32 x1> $298 = atomic add($297, $295)
    <f32 x1> $299 = mul $294 $268
    <f32*x1> $300 = get child [S17dynamic->S19place_f32] $296
    <f32 x1> $301 = atomic add($300, $299)
  }
}

@xumingkuan
Collaborator

xumingkuan commented Jun 13, 2020

master branch, before lower_access:

  $13 = offloaded struct_for(S17dynamic) block_dim=0 {
    <i32 x1> $14 = loop $13 index 0
    <f32*x1> $15 = global ptr [S18place_f32], index [$14] activate=false
    <f32 x1> $16 = global load $15
    <f32 x1> $17 = const [128.0]
    <f32 x1> $18 = mul $16 $17
    <f32 x1> $19 = const [0.5]
    <f32 x1> $20 = sub $18 $19
    <i32 x1> $21 = cast_value<i32> $20
    <f32*x1> $22 = global ptr [S19place_f32], index [$14] activate=false
    <f32 x1> $23 = global load $22
    <f32 x1> $24 = const [128.0]
    <f32 x1> $25 = mul $23 $24
    <f32 x1> $26 = const [0.5]
    <f32 x1> $27 = sub $25 $26
    <i32 x1> $28 = cast_value<i32> $27
    <i32 x1> $137 = const [2]
    <i32 x1> $138 = add $28 $137
    <i32 x1> $237 = const [2]
    <i32 x1> $238 = add $21 $237
    <f32*x1> $285 = global ptr [S13place_f32], index [$238, $138] activate=false
    <f32 x1> $286 = global load $285
    <f32*x1> $287 = global ptr [S14place_f32], index [$238, $138] activate=false
    <f32 x1> $288 = global load $287
