
Passing struct vs dereferencing fields in struct performance #4802

Open
bsavery opened this issue Apr 15, 2022 · 6 comments
Labels
advanced optimization The issue or bug is related to advanced optimization discussion Welcome discussion!

Comments

bsavery (Contributor) commented Apr 15, 2022

In short, I see a significant performance decrease when passing a struct to a function versus dereferencing the struct's fields and passing the values.

Here I show two ways to intersect a ray with a bunch of spheres: one passes the sphere struct directly, the other extracts the sphere's fields and passes those.

import taichi as ti
import time

ti.init(arch=ti.gpu)

n_spheres = 100000

# just a ray pointing in Z direction
ray_org = ti.Vector([0.0, 0.0, 0.0])
ray_dir = ti.Vector([0.0, 0.0, 1.0])

# random set of spheres
sphere_type = ti.types.struct(center=ti.types.vector(3, ti.f32), radius=ti.f32)
spheres = sphere_type.field(shape=(n_spheres,))

@ti.kernel
def fill_spheres():
    # generate random spheres in 0-100 xyz with radius 0-10
    for i in spheres:
        spheres[i].radius = ti.random() * 10.0
        spheres[i].center = ti.Vector([ti.random(), ti.random(), ti.random()]) * 100.0
fill_spheres()


@ti.kernel
def pass_reference() -> ti.i32:
    num_hit = 0
    for i in spheres:
        sphere = spheres[i]
        if intersect_sphere(sphere, ray_org, ray_dir):
            num_hit += 1
    return num_hit


@ti.kernel
def pass_decomposed() -> ti.i32:
    num_hit = 0
    for i in spheres:
        sphere = spheres[i]
        center, radius = sphere.center, sphere.radius
        if intersect_center_radius(center, radius, ray_org, ray_dir):
            num_hit += 1
    return num_hit


@ti.func
def intersect_sphere(sphere, ray_origin, ray_direction):
    # return if the ray hits the sphere
    oc = ray_origin - sphere.center
    a = ray_direction.norm_sqr()
    half_b = oc.dot(ray_direction)
    c = (oc.norm_sqr() - sphere.radius**2)
    discriminant = (half_b**2) - a * c

    return discriminant >= 0.0

@ti.func
def intersect_center_radius(center, radius, ray_origin, ray_direction):
    # return if the ray hits the decomposed sphere
    oc = ray_origin - center
    a = ray_direction.norm_sqr()
    half_b = oc.dot(ray_direction)
    c = (oc.norm_sqr() - radius**2)
    discriminant = (half_b**2) - a * c

    return discriminant >= 0.0


t = time.time()
pass_reference()
print('Passing reference', time.time() - t)

t = time.time()
pass_decomposed()
print('Passing decomposed', time.time() - t)
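For reference, the discriminant test in both intersection functions comes from the standard ray-sphere quadratic; a quick derivation (my own summary, not from the thread):

```latex
% Ray p(t) = o + t\,d, sphere centre c, radius r. With oc = o - c:
\lVert oc + t\,d \rVert^2 = r^2
\;\Longrightarrow\;
\underbrace{(d \cdot d)}_{a}\, t^2
  + 2\,\underbrace{(oc \cdot d)}_{\text{half\_b}}\, t
  + \underbrace{(oc \cdot oc - r^2)}_{c} = 0
```

Real roots (i.e. a hit) exist iff the reduced discriminant is non-negative: \(\text{half\_b}^2 - a\,c \ge 0\), which is exactly the `discriminant >= 0.0` check in both functions.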

Also note that the backend (Metal vs. Vulkan vs. CPU) affects the results quite a bit:
Metal
Passing reference 0.23202180862426758
Passing decomposed 0.09195494651794434

Vulkan
Passing reference 0.07849979400634766
Passing decomposed 0.05303597450256348

CPU
Passing reference 0.08902120590209961
Passing decomposed 0.06803393363952637

Originally posted by @bsavery in #4784 (reply in thread)

FantasyVR (Collaborator) commented:

cc @qiao-bo @turbo0628

k-ye (Member) commented Apr 18, 2022

As a good first step, we can enable ti.init(..., print_ir=True) to see if there's significant difference in CHI IR. See #4784 (reply in thread)
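As a rough sketch of that first step (the exact `ti.init` keyword set varies between Taichi versions, so treat this as an assumption about the debugging flags rather than a definitive recipe):

```python
import taichi as ti

# print_ir dumps the CHI IR for each compiled kernel to stdout,
# letting you diff the IR generated for the two kernels.
ti.init(arch=ti.vulkan, print_ir=True)
```

Running the reproduction script with this flag on, then diffing the dumps for `pass_reference` and `pass_decomposed`, would show whether the struct argument survives into the IR differently.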

k-ye (Member) commented Apr 18, 2022

Note that you might also want to exclude the timing for the first run, because that counts JIT time as well.
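The warm-up pattern can be sketched backend-agnostically (plain Python below; the `kernel` callable is a stand-in for a real Taichi kernel, and the `ti.sync()` remark reflects my assumption that GPU launches are asynchronous):

```python
import time

def benchmark(kernel, n_runs=10):
    """Time a callable, excluding the one-time JIT compilation cost."""
    kernel()  # warm-up run: triggers (and absorbs) JIT compilation
    # On a GPU backend you would also call ti.sync() here so the
    # warm-up launch finishes before the clock starts.
    t0 = time.perf_counter()
    for _ in range(n_runs):
        kernel()
    # Again, ti.sync() before stopping the clock on a GPU backend.
    return (time.perf_counter() - t0) / n_runs
```

Averaging over several timed runs also smooths out launch-overhead noise, which matters at the sub-millisecond scale seen below.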

bsavery (Contributor, Author) commented Apr 18, 2022

@k-ye thanks for that.

So with a simple change to exclude the JIT time (running each kernel once before timing) I get:

Vulkan
Passing reference 0.02786087989807129
Passing decomposed 0.047796010971069336 (worse!)

CPU
Passing reference 0.01233816146850586
Passing decomposed 0.011642217636108398

Metal:
Passing reference 0.0014309883117675781
Passing decomposed 0.0014178752899169922

So the two kernels do come out as different code, and the previous timings seemed mostly to reflect that the compilation / JIT time differs for each kernel (because of the different code being generated).

Side question: is there a way to avoid recompiling kernels on each test run? I.e., if I run python test.py and the kernel code hasn't changed, it would be nice not to recompile the next time I call python test.py. I'm not sure how you would do this, maybe with timestamps or by checking whether the IR has changed, but you could imagine this being useful.

k-ye (Member) commented Apr 18, 2022

Is there a way to not recompile kernels each test run?

Yup, @PGZXB is working on an offline cache system (#4401), starting with the LLVM backend. We are using the AST as the cache key for now. @PGZXB only has a day or so per week to work on Taichi, so the feature is a bit slow to land, but we are moving in that direction. Thanks for your suggestion :-)
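Once that lands, enabling it would presumably look something like the following (hypothetical sketch: the keyword name and its default have changed across Taichi releases, so check the docs for your version):

```python
import taichi as ti

# offline_cache persists compiled kernels to disk, keyed on the AST,
# so an unchanged `python test.py` skips recompilation on the next run.
ti.init(arch=ti.cpu, offline_cache=True)
```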

@FantasyVR FantasyVR added discussion Welcome discussion! advanced optimization The issue or bug is related to advanced optimization labels Apr 21, 2022
bobcao3 (Collaborator) commented Apr 21, 2022

That's quite fascinating. I'd imagine the generated CHI-IR should be quite similar, and it's weird that only Vulkan shows a big regression. (And I'm assuming this is on a Mac, where SPIR-V is actually translated to MSL by MoltenVK...)

We should check three things:

  1. Check for consistency and reproducibility on other devices (i.e. whether this is a problem that interacts with specific environment or this is an issue only from codegen)
  2. Compare the CHI-IR differences
  3. Compare the SPIR-V differences

Something else that can be quite helpful is to run the two SPIR-V binaries through the Radeon Graphics Analyzer as well, to get the raw assembly for instruction-count and cycle-latency readings.

Projects: Backlog

4 participants