[metal] Implement range_for using grid stride loop #780
Conversation
// We don't clamp this to kMaxNumThreadsGridStrideLoop, because we know
// for sure that we need exactly |num_elems| threads.
// sdf_renderer.py benchmark for setting |num_threads|:
// - num_elems: ~20 samples/s
// - kMaxNumThreadsGridStrideLoop: ~12 samples/s
ka.num_threads = num_elems;
In order to fit different applications, can we make kMaxNumThreadsGridStrideLoop configurable, like ti.cfg.device_memory_GB?
If so, please use a generic name (i.e. not ti.cfg.metal_xxx, just ti.cfg.xxx), since OpenGL may implement the grid-stride loop later.
Right. Note that kMaxNumThreadsGridStrideLoop is currently inside the metal namespace, so it's not meant to be shared. Once all the backends adopt this approach, I think we can add a new config field as you suggested.
Thanks for the PR!
That's probably for coalesced memory access on CUDA.
The contiguous-range scheme is an alternative solution. But in contrast to it, I think the way GPU caches are designed makes it more important to have spatial locality within a warp than to have temporal locality within a thread.
LGTM!
I will merge this one and test with stride = whole grid size.
With this change, we no longer need a sync to figure out the number of threads to launch.
Related issue = #722
Might be easier to see what's going on with an example. The following snippet is part of the output from taichi/tests/python/test_loops.py (lines 143 to 155 at f0d6bd7):
I have a question about the grid-stride loop. In the tutorial, each thread advances by the size of the entire grid. Do you know why?
In Metal, I first figure out the number of elements in the kernel, then compute range_ = (total_elems + grid_size - 1) / grid_size, and each thread only covers [thread_id * range_, (thread_id + 1) * range_). I thought this could somewhat improve the spatial locality?