
[cuda] Use cuMemsetD32 to fill scalar ndarray #3907

Merged: 12 commits merged into taichi-dev:master on Dec 31, 2021

Conversation

qiao-bo (Collaborator) commented Dec 29, 2021

Replace the previous kernel-based Ndarray fill method with a memset-based implementation.
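For context, here is a minimal standalone sketch of the driver-API pattern behind this change (illustrative, not the actual Taichi code): the float fill value is bit-cast to a 32-bit pattern and written with a single cuMemsetD32 call instead of launching a fill kernel. Error checking is omitted for brevity.

```cpp
#include <cuda.h>
#include <cstdint>
#include <cstring>
#include <cstdio>

// Reinterpret the float's bits as a 32-bit pattern and memset the buffer.
// cuMemsetD32 writes the pattern to n 32-bit words in one driver call.
void fill_f32(CUdeviceptr buf, float value, size_t n) {
  uint32_t pattern;
  std::memcpy(&pattern, &value, sizeof(pattern));  // bit-cast float -> u32
  cuMemsetD32(buf, pattern, n);
}

int main() {
  cuInit(0);
  CUdevice dev;
  cuDeviceGet(&dev, 0);
  CUcontext ctx;
  cuCtxCreate(&ctx, 0, dev);

  const size_t n = 2048 * 2048;
  CUdeviceptr buf;
  cuMemAlloc(&buf, n * sizeof(float));
  fill_f32(buf, 2.5f, n);  // one memset instead of a kernel launch

  float first;
  cuMemcpyDtoH(&first, buf, sizeof(first));
  std::printf("first element: %f\n", first);  // 2.500000

  cuMemFree(buf);
  cuCtxDestroy(ctx);
}
```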

Performance improvement:
test_script.py:

```python
import taichi as ti
import time

ti.init(arch=ti.cuda)

print("test fill")
N = 2048 * 2048

a = ti.ndarray(ti.f32, N)
a.fill(2.5)  # warm-up fill, excludes one-time setup cost from the timing

iterations = 1000
t_start = time.perf_counter()
for i in range(iterations):
    a.fill(float(i))
    # assert a[i] == float(i)
t_used = time.perf_counter() - t_start
print('total time:', "{:.3f}".format(t_used * 1000), "ms")
```

Tested on an RTX 3080:

| iterations | Master | Memset |
| --- | --- | --- |
| 10 | 0.280 s | 0.060 ms |
| 100 | 4.778 s | 0.530 ms |
| 1000 | 46.425 s | 0.005 s |

Status:
This PR addresses only scalar ndarrays on CUDA; support for Matrix/Vector ndarrays and for the CPU backend will follow in later PRs.

netlify bot commented Dec 29, 2021

✔️ Deploy Preview for jovial-fermat-aa59dc canceled.

🔨 Explore the source changes: 9e90eb1

🔍 Inspect the deploy log: https://app.netlify.com/sites/jovial-fermat-aa59dc/deploys/61ce816fa359c00008619276

qiao-bo (Collaborator, Author) commented Dec 29, 2021

/format

qiao-bo added this to Performance Improvement in Backends Performance on Dec 29, 2021
bobcao3 (Collaborator) left a comment
Maybe we can utilize cuStreamBeginCapture to make the command list actually deferred
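For reference, this is roughly what the stream-capture idea would look like with the CUDA driver API — a hedged sketch, not the route this PR ends up taking. It assumes `stream` is a non-default stream (the legacy default stream cannot be captured) and uses the CUDA 11-era cuGraphInstantiate signature.

```cpp
#include <cuda.h>

// Work issued on `stream` is recorded into a CUDA graph instead of
// executing eagerly, then replayed later with a single launch.
void capture_and_replay(CUstream stream, CUdeviceptr buf, size_t n) {
  cuStreamBeginCapture(stream, CU_STREAM_CAPTURE_MODE_GLOBAL);
  // From here on, calls on `stream` are recorded, not executed.
  cuMemsetD32Async(buf, /*pattern=*/0, n, stream);

  CUgraph graph;
  cuStreamEndCapture(stream, &graph);

  CUgraphExec exec;
  cuGraphInstantiate(&exec, graph, /*phErrorNode=*/nullptr,
                     /*logBuffer=*/nullptr, /*bufferSize=*/0);
  cuGraphLaunch(exec, stream);  // the deferred work actually runs here

  cuGraphExecDestroy(exec);
  cuGraphDestroy(graph);
}
```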

qiao-bo (Collaborator, Author) commented Dec 30, 2021

> Maybe we can utilize cuStreamBeginCapture to make the command list actually deferred

@bobcao3 thanks for the information. At the moment we do not want to defer the fill execution, so we don't need to over-utilize the CommandList utility for this. Maybe a simpler solution is to have the fill method in CudaDevice itself. WDYT?

bobcao3 (Collaborator) commented Dec 30, 2021

> > Maybe we can utilize cuStreamBeginCapture to make the command list actually deferred
>
> @bobcao3 thanks for the information. At the moment we do not want to defer the fill execution, so we don't need to over-utilize the CommandList utility for this. Maybe a simpler solution is to have the fill method in CudaDevice itself. WDYT?

@qiao-bo I don't think that's ideal. Not all APIs support immediate-mode resource content filling. In addition, in hardware these commands are executed by the DMA/transfer engine, which uses a queue. Since all of this is on the C++ side, it is fine to have it under CommandList; it's reasonably fast. Moreover, once we move everything into CommandList, we should expect a drop in overall host-side overhead, as CUDA then has a fully built command graph to work with and uses less implicit sync.
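A minimal host-side sketch of the record-then-submit pattern described above (the names and structure are illustrative, not Taichi's actual CommandList API):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

class CommandList {
 public:
  // Recording is cheap: nothing touches the device yet.
  void buffer_fill(uint32_t *ptr, std::size_t n, uint32_t pattern) {
    commands_.push_back([=] {
      // Stand-in for the real backend call (e.g. cuMemsetD32 on CUDA).
      for (std::size_t i = 0; i < n; ++i) ptr[i] = pattern;
    });
  }

  // Submission hands the backend the whole batch at once.
  void submit() {
    for (auto &cmd : commands_) cmd();
    commands_.clear();
  }

 private:
  std::vector<std::function<void()>> commands_;
};
```

The point of the pattern is that recording costs almost nothing on the host, and submission presents the backend with a complete batch, which is what lets the driver schedule the work with less implicit synchronization.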

qiao-bo (Collaborator, Author) commented Dec 30, 2021

> > > Maybe we can utilize cuStreamBeginCapture to make the command list actually deferred
> >
> > @bobcao3 thanks for the information. At the moment we do not want to defer the fill execution, so we don't need to over-utilize the CommandList utility for this. Maybe a simpler solution is to have the fill method in CudaDevice itself. WDYT?
>
> @qiao-bo I don't think that's ideal. Not all APIs support immediate-mode resource content filling. In addition, in hardware these commands are executed by the DMA/transfer engine, which uses a queue. Since all of this is on the C++ side, it is fine to have it under CommandList; it's reasonably fast. Moreover, once we move everything into CommandList, we should expect a drop in overall host-side overhead, as CUDA then has a fully built command graph to work with and uses less implicit sync.

I agree with the logic, but it seems we have not implemented any device utilities to support immediate content filling for the LLVM backend? I see a potentially major change here; does it still fit into this PR?

qiao-bo (Collaborator, Author) commented Dec 30, 2021

/format

qiao-bo requested a review from bobcao3 on December 30, 2021 at 16:21
qiao-bo (Collaborator, Author) commented Dec 31, 2021

/format

bobcao3 (Collaborator) left a comment

LGTM

qiao-bo merged commit 9725b6f into taichi-dev:master on Dec 31, 2021
qiao-bo deleted the fill branch on December 31, 2021 at 05:19
```diff
@@ -604,5 +604,18 @@ uint64_t *LlvmProgramImpl::get_ndarray_alloc_info_ptr(DeviceAllocation &alloc) {
    return (uint64_t *)cpu_device()->get_alloc_info(alloc).ptr;
  }
}

void LlvmProgramImpl::fill_ndarray(DeviceAllocation &alloc,
```
k-ye (Member) left a comment

@qiao-bo nit: Do we need the mutable reference DeviceAllocation&?

If not, let's make it const. Otherwise, let's follow Google's style and use a pointer.
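For illustration, the two conventions in question — a sketch with hypothetical signatures and a stubbed type, not the code as landed:

```cpp
#include <cstddef>
#include <cstdint>

struct DeviceAllocation { /* stub for illustration */ };

// Google C++ style: parameters that are only read are passed by const
// reference...
void fill_ndarray(const DeviceAllocation &alloc, std::size_t size, uint32_t data) {}

// ...while parameters a function mutates are passed by pointer, so the
// mutation is visible at the call site: mutate_alloc(&alloc).
void mutate_alloc(DeviceAllocation *alloc) {}
```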

qiao-bo (Collaborator, Author) commented Dec 31, 2021
@k-ye, made const in #3921. Thanks!

qiao-bo moved this from Performance Improvement to Done in Backends Performance on Feb 15, 2022