
[metal] Implement range_for using grid stride loop #780

Merged
k-ye merged 2 commits into taichi-dev:master from k-ye:range on Apr 15, 2020

Conversation


@k-ye (Member) commented on Apr 14, 2020

With this change, we no longer need a sync to figure out the number of threads to launch.

Related issue = #722


It might be easier to see what's going on with an example. The Metal snippet below is part of the generated output for this test:

import taichi as ti

def test_loop_arg_as_range():
    # Dynamic range loops are intended to make sure global tmps work
    x = ti.var(ti.i32)
    n = 1000

    @ti.layout
    def layout():
        ti.root.dense(ti.i, n).place(x)

    @ti.kernel
    def test(b: ti.i32, e: ti.i32):
        for i in range(b, e):
            x[i - b] = i

void mtl_k0000_test_c4_0_1_func(
    device byte* root_addr,
    device byte* global_tmps_addr,
    device byte* args_addr,
    device byte* runtime_addr,
    const int linear_loop_idx_) {
  device Runtime *runtime_ = reinterpret_cast<device Runtime *>(runtime_addr);
  mtl_k0000_test_c4_0_args args_ctx_(args_addr);
  device RandState* rand_state_ = reinterpret_cast<device RandState*>(runtime_->rand_seeds + (linear_loop_idx_ % 65536));
  const int tmp9 = linear_loop_idx_;
  const int32_t tmp10 = *args_ctx_.arg0();
  const int32_t tmp11 = (tmp9 - tmp10);
  S0 tmp13(root_addr);
  S0_ch tmp15 = tmp13.children(tmp14);
  S1 tmp16 = tmp15.get0();
  auto tmp17 = (((0 + tmp11) >> 0) & ((1 << 10) - 1));
  S1_ch tmp21 = tmp16.children(tmp17);
  device int32_t* tmp22 = tmp21.get0().val;
  *tmp22 = tmp9;
}

kernel void mtl_k0000_test_c4_0_1(
    device byte* root_addr [[buffer(0)]],
    device byte* global_tmps_addr [[buffer(1)]],
    device byte* args_addr [[buffer(2)]],
    device byte* runtime_addr [[buffer(3)]],
    const uint ugrid_size_ [[threads_per_grid]],
    const uint utid_ [[thread_position_in_grid]]) {
  // range_for, range known at runtime
  device int32_t* tmp37 = reinterpret_cast<device int32_t*>(global_tmps_addr + 0);
  int32_t tmp38 = *tmp37;  // begin_expr
  device int32_t* tmp39 = reinterpret_cast<device int32_t*>(global_tmps_addr + 4);
  int32_t tmp40 = *tmp39;  // end_expr
  const int total_elems = tmp40 - tmp38;
  const int range_ = max((int)((total_elems + ugrid_size_ - 1) / ugrid_size_), 1);
  const int begin_ = (range_ * (int)utid_) + tmp38;
  const int end_ = min(range_ * (int)(utid_ + 1), total_elems) + tmp38;
  for (int ii = begin_; ii < end_; ++ii) {
    mtl_k0000_test_c4_0_1_func(root_addr, global_tmps_addr, args_addr, runtime_addr, ii);
  }
}

I have a question about the grid-stride loop. In the tutorial, each thread advances by the size of the entire grid. Do you know why?

In Metal, I first figure out the number of elements in the kernel, then compute range_ = (total_elems + grid_size - 1) / grid_size, and each thread only covers [thread_id * range_, (thread_id + 1) * range_). I thought this could somewhat improve the spatial locality?
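
To make the two schemes concrete, here is a small host-side sketch in plain C++ (not part of this PR; the grid size, element count and begin offset are made-up numbers) that reproduces the per-thread range computation from the generated kernel above and, for comparison, the classic grid-stride indexing from the CUDA tutorial:

#include <algorithm>
#include <cstdio>

int main() {
  // Made-up numbers for illustration.
  const int kGridSize = 4;     // pretend we launched 4 threads in total
  const int kTotalElems = 10;  // number of loop iterations to cover
  const int kBegin = 0;        // corresponds to tmp38 (begin_expr) above

  // Scheme in this PR: each thread owns one contiguous chunk of iterations.
  const int range_ = std::max((kTotalElems + kGridSize - 1) / kGridSize, 1);
  for (int tid = 0; tid < kGridSize; ++tid) {
    const int begin_ = range_ * tid + kBegin;
    const int end_ = std::min(range_ * (tid + 1), kTotalElems) + kBegin;
    for (int ii = begin_; ii < end_; ++ii)
      std::printf("chunked:     thread %d -> iteration %d\n", tid, ii);
  }

  // Classic grid-stride loop from the CUDA tutorial: each thread starts at
  // its own index and advances by the total number of threads in the grid.
  for (int tid = 0; tid < kGridSize; ++tid) {
    for (int ii = kBegin + tid; ii < kBegin + kTotalElems; ii += kGridSize)
      std::printf("grid-stride: thread %d -> iteration %d\n", tid, ii);
  }
  return 0;
}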

@k-ye requested a review from yuanming-hu on April 14, 2020
Comment on lines +668 to +673
// We don't clamp this to kMaxNumThreadsGridStrideLoop, because we know
// for sure that we need |num_elems| of threads.
// sdf_renderer.py benchmark for setting |num_threads|
// - num_elems: ~20 samples/s
// - kMaxNumThreadsGridStrideLoop: ~12 samples/s
ka.num_threads = num_elems;
Collaborator:

To fit different applications, can we make kMaxNumThreadsGridStrideLoop configurable, like ti.cfg.device_memory_GB?
If so, please use a generic name (i.e. not ti.cfg.metal_xxx, just ti.cfg.xxx), since OpenGL may implement grid-stride loops later.

k-ye (Member, Author):

Right. Note that kMaxNumThreadsGridStrideLoop is currently inside the metal namespace, so it's not meant to be shared.

Once all the backends adopt this approach, I think we can start adding a new config field as you suggested?

@yuanming-hu (Member) commented:

Thanks for the PR!

> I have a question about the grid-stride loop. In the tutorial, each thread advances by the size of the entire grid. Do you know why?

That's probably for coalesced memory access on CUDA.

> In Metal, I first figure out the number of elements in the kernel, then compute range_ = (total_elems + grid_size - 1) / grid_size, and each thread only covers [thread_id * range_, (thread_id + 1) * range_). I thought this could somewhat improve the spatial locality?

This is an alternative solution. In contrast to striding by gridDim * blockDim, you are striding by 1. I'm not familiar with Metal, but I guess striding by 1 does not give memory accesses as efficient as striding by gridDim * blockDim, because in this implementation the threads in a warp are not accessing consecutive bytes in memory. (I assume your Intel/AMD GPUs have roughly the same design as CUDA GPUs, where a hardware memory coalescer tries to batch the memory transactions of a warp when it can.)

I think the way GPU caches are designed makes it more important to have spatial locality within a warp than temporal locality within a thread.
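
To see why striding by the grid size coalesces better, here is a small sketch in plain C++ (not from this PR; the warp size and per-thread chunk size are made-up numbers) that prints the first element each lane of a 32-wide warp touches under the two schemes:

#include <cstdio>

int main() {
  // Made-up numbers for illustration.
  const int kWarpSize = 32;  // threads that execute in lockstep
  const int kRange = 1024;   // chunk size per thread in the chunked scheme

  // First element touched by each lane of one warp under the two schemes.
  for (int lane = 0; lane < kWarpSize; ++lane) {
    const int grid_stride_first = lane;       // 0, 1, 2, ..., 31: consecutive,
                                              // so the coalescer can merge them
    const int chunked_first = lane * kRange;  // 0, 1024, 2048, ...: far apart,
                                              // roughly one transaction per lane
    std::printf("lane %2d: grid-stride -> %d, chunked -> %d\n",
                lane, grid_stride_first, chunked_first);
  }
  return 0;
}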

@yuanming-hu (Member) left a review comment:

LGTM!

@k-ye (Member, Author) commented on Apr 15, 2020:

I will merge this one and then test with stride = whole grid size.

@k-ye merged commit c2a1056 into taichi-dev:master on Apr 15, 2020
@k-ye deleted the range branch on April 15, 2020