
[metal] Implement range_for using grid stride loop #780

Merged
k-ye merged 2 commits into taichi-dev:master from k-ye:range on Apr 15, 2020

Conversation


@k-ye (Member) commented on Apr 14, 2020

With this change, we no longer need a sync to figure out the number of threads to launch.

Related issue = #722


It might be easier to see what's going on with an example. The Metal snippet below is part of the generated output for this test:

import taichi as ti

def test_loop_arg_as_range():
    # Dynamic range loops are intended to make sure global tmps work
    x = ti.var(ti.i32)
    n = 1000

    @ti.layout
    def layout():
        ti.root.dense(ti.i, n).place(x)

    @ti.kernel
    def test(b: ti.i32, e: ti.i32):
        for i in range(b, e):
            x[i - b] = i

void mtl_k0000_test_c4_0_1_func(
    device byte* root_addr,
    device byte* global_tmps_addr,
    device byte* args_addr,
    device byte* runtime_addr,
    const int linear_loop_idx_) {
  device Runtime *runtime_ = reinterpret_cast<device Runtime *>(runtime_addr);
  mtl_k0000_test_c4_0_args args_ctx_(args_addr);
  device RandState* rand_state_ = reinterpret_cast<device RandState*>(runtime_->rand_seeds + (linear_loop_idx_ % 65536));
  const int tmp9 = linear_loop_idx_;
  const int32_t tmp10 = *args_ctx_.arg0();
  const int32_t tmp11 = (tmp9 - tmp10);
  S0 tmp13(root_addr);
  S0_ch tmp15 = tmp13.children(tmp14);
  S1 tmp16 = tmp15.get0();
  auto tmp17 = (((0 + tmp11) >> 0) & ((1 << 10) - 1));
  S1_ch tmp21 = tmp16.children(tmp17);
  device int32_t* tmp22 = tmp21.get0().val;
  *tmp22 = tmp9;
}

kernel void mtl_k0000_test_c4_0_1(
    device byte* root_addr [[buffer(0)]],
    device byte* global_tmps_addr [[buffer(1)]],
    device byte* args_addr [[buffer(2)]],
    device byte* runtime_addr [[buffer(3)]],
    const uint ugrid_size_ [[threads_per_grid]],
    const uint utid_ [[thread_position_in_grid]]) {
  // range_for, range known at runtime
  device int32_t* tmp37 = reinterpret_cast<device int32_t*>(global_tmps_addr + 0);
  int32_t tmp38 = *tmp37;  // begin_expr
  device int32_t* tmp39 = reinterpret_cast<device int32_t*>(global_tmps_addr + 4);
  int32_t tmp40 = *tmp39;  // end_expr
  const int total_elems = tmp40 - tmp38;
  const int range_ = max((int)((total_elems + ugrid_size_ - 1) / ugrid_size_), 1);
  const int begin_ = (range_ * (int)utid_) + tmp38;
  const int end_ = min(range_ * (int)(utid_ + 1), total_elems) + tmp38;
  for (int ii = begin_; ii < end_; ++ii) {
    mtl_k0000_test_c4_0_1_func(root_addr, global_tmps_addr, args_addr, runtime_addr, ii);
  }
}

I have a question about the grid-stride loop. In the tutorial, each thread advances by the size of the entire grid. Do you know why?

In Metal, I first figure out the number of elements in the kernel, then compute range_ = (total_elems + grid_size - 1) / grid_size, and each thread only covers [thread_id * range_, (thread_id + 1) * range_). I thought this could somewhat improve the spatial locality?
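
To make the two schemes concrete, here is a small host-side sketch in plain C++ (not part of this PR; the grid size, element count and begin offset are made-up numbers) that reproduces the per-thread range computation from the generated kernel above and, for comparison, the classic grid-stride indexing from the CUDA tutorial:

#include <algorithm>
#include <cstdio>

int main() {
  // Made-up numbers for illustration.
  const int kGridSize = 4;     // pretend we launched 4 threads in total
  const int kTotalElems = 10;  // number of loop iterations to cover
  const int kBegin = 0;        // corresponds to tmp38 (begin_expr) above

  // Scheme in this PR: each thread owns one contiguous chunk of iterations.
  const int range_ = std::max((kTotalElems + kGridSize - 1) / kGridSize, 1);
  for (int tid = 0; tid < kGridSize; ++tid) {
    const int begin_ = range_ * tid + kBegin;
    const int end_ = std::min(range_ * (tid + 1), kTotalElems) + kBegin;
    for (int ii = begin_; ii < end_; ++ii)
      std::printf("chunked:     thread %d -> iteration %d\n", tid, ii);
  }

  // Classic grid-stride loop from the CUDA tutorial: each thread starts at
  // its own index and advances by the total number of threads in the grid.
  for (int tid = 0; tid < kGridSize; ++tid) {
    for (int ii = kBegin + tid; ii < kBegin + kTotalElems; ii += kGridSize)
      std::printf("grid-stride: thread %d -> iteration %d\n", tid, ii);
  }
  return 0;
}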

@k-ye requested a review from yuanming-hu on April 14, 2020
Comment on lines +668 to +673
// We don't clamp this to kMaxNumThreadsGridStrideLoop, because we know
// for sure that we need |num_elems| of threads.
// sdf_renderer.py benchmark for setting |num_threads|
// - num_elems: ~20 samples/s
// - kMaxNumThreadsGridStrideLoop: ~12 samples/s
ka.num_threads = num_elems;
Collaborator:

To fit different applications, can we make kMaxNumThreadsGridStrideLoop configurable, like ti.cfg.device_memory_GB?
If so, please use a generic name (i.e. not ti.cfg.metal_xxx, just ti.cfg.xxx), since OpenGL may implement grid-stride loops later.

k-ye (Member, Author):

Right. Note that kMaxNumThreadsGridStrideLoop is currently inside the metal namespace, so it's not meant to be shared.

Once all the backends adopt this approach, I think we can start adding a new config field as you suggested?

@yuanming-hu (Member) commented:

Thanks for the PR!

> I have a question about the grid-stride loop. In the tutorial, each thread advances by the size of the entire grid. Do you know why?

That's probably for coalesced memory access on CUDA.

> In Metal, I first figure out the number of elements in the kernel, then compute range_ = (total_elems + grid_size - 1) / grid_size, and each thread only covers [thread_id * range_, (thread_id + 1) * range_). I thought this could somewhat improve the spatial locality?

This is an alternative solution. In contrast to striding by gridDim * blockDim, you are striding by 1. I'm not familiar with Metal, but I guess striding by 1 does not give memory accesses as efficient as striding by gridDim * blockDim, because in this implementation the threads in a warp are not accessing consecutive bytes in memory. (I assume your Intel/AMD GPUs have roughly the same design as CUDA GPUs, where a hardware memory coalescer tries to batch the memory transactions of a warp when it can.)

I think the way GPU caches are designed makes it more important to have spatial locality within a warp than temporal locality within a thread.
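
To see why striding by the grid size coalesces better, here is a small sketch in plain C++ (not from this PR; the warp size and per-thread chunk size are made-up numbers) that prints the first element each lane of a 32-wide warp touches under the two schemes:

#include <cstdio>

int main() {
  // Made-up numbers for illustration.
  const int kWarpSize = 32;  // threads that execute in lockstep
  const int kRange = 1024;   // chunk size per thread in the chunked scheme

  // First element touched by each lane of one warp under the two schemes.
  for (int lane = 0; lane < kWarpSize; ++lane) {
    const int grid_stride_first = lane;       // 0, 1, 2, ..., 31: consecutive,
                                              // so the coalescer can merge them
    const int chunked_first = lane * kRange;  // 0, 1024, 2048, ...: far apart,
                                              // roughly one transaction per lane
    std::printf("lane %2d: grid-stride -> %d, chunked -> %d\n",
                lane, grid_stride_first, chunked_first);
  }
  return 0;
}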

@yuanming-hu (Member) left a review comment:

LGTM!

@k-ye (Member, Author) commented on Apr 15, 2020:

I will merge this one and then test with stride = whole grid size.

@k-ye merged commit c2a1056 into taichi-dev:master on Apr 15, 2020
@k-ye deleted the range branch on April 15, 2020