
[cuda] Use cuMemsetD32 to fill scalar ndarray #3907

Merged: 12 commits merged into taichi-dev:master on Dec 31, 2021

Conversation

qiao-bo (Collaborator) commented Dec 29, 2021

Replace the previous kernel-based Ndarray fill method with a memset-based implementation.
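For context, here is a minimal standalone sketch of the driver-API pattern behind this change (illustrative, not the actual Taichi code): the float fill value is bit-cast to a 32-bit pattern and written with a single cuMemsetD32 call instead of launching a fill kernel. Error checking is omitted for brevity.

```cpp
#include <cuda.h>
#include <cstdint>
#include <cstring>
#include <cstdio>

// Reinterpret the float's bits as a 32-bit pattern and memset the buffer.
// cuMemsetD32 writes the pattern to n 32-bit words in one driver call.
void fill_f32(CUdeviceptr buf, float value, size_t n) {
  uint32_t pattern;
  std::memcpy(&pattern, &value, sizeof(pattern));  // bit-cast float -> u32
  cuMemsetD32(buf, pattern, n);
}

int main() {
  cuInit(0);
  CUdevice dev;
  cuDeviceGet(&dev, 0);
  CUcontext ctx;
  cuCtxCreate(&ctx, 0, dev);

  const size_t n = 2048 * 2048;
  CUdeviceptr buf;
  cuMemAlloc(&buf, n * sizeof(float));
  fill_f32(buf, 2.5f, n);  // one memset instead of a kernel launch

  float first;
  cuMemcpyDtoH(&first, buf, sizeof(first));
  std::printf("first element: %f\n", first);  // 2.500000

  cuMemFree(buf);
  cuCtxDestroy(ctx);
}
```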

Performance improvement:
test_script.py:

```python
import taichi as ti
import time

ti.init(arch=ti.cuda)

print("test fill")
N = 2048 * 2048

a = ti.ndarray(ti.f32, N)
a.fill(2.5)  # warm-up fill, excludes one-time setup cost from the timing

iterations = 1000
t_start = time.perf_counter()
for i in range(iterations):
    a.fill(float(i))
    # assert a[i] == float(i)
t_used = time.perf_counter() - t_start
print('total time:', "{:.3f}".format(t_used * 1000), "ms")
```

Tested on an RTX 3080:

| iterations | Master | Memset |
| --- | --- | --- |
| 10 | 0.280 s | 0.060 ms |
| 100 | 4.778 s | 0.530 ms |
| 1000 | 46.425 s | 0.005 s |

Status:
This PR addresses only scalar ndarrays on CUDA; support for Matrix/Vector ndarrays and for the CPU backend will follow in later PRs.

netlify bot commented Dec 29, 2021

✔️ Deploy Preview for jovial-fermat-aa59dc canceled.

🔨 Explore the source changes: 9e90eb1

🔍 Inspect the deploy log: https://app.netlify.com/sites/jovial-fermat-aa59dc/deploys/61ce816fa359c00008619276

qiao-bo (Collaborator, Author) commented Dec 29, 2021

/format

qiao-bo added this to Performance Improvement in Backends Performance on Dec 29, 2021
bobcao3 (Collaborator) left a comment
Maybe we can utilize cuStreamBeginCapture to make the command list actually deferred
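For reference, this is roughly what the stream-capture idea would look like with the CUDA driver API — a hedged sketch, not the route this PR ends up taking. It assumes `stream` is a non-default stream (the legacy default stream cannot be captured) and uses the CUDA 11-era cuGraphInstantiate signature.

```cpp
#include <cuda.h>

// Work issued on `stream` is recorded into a CUDA graph instead of
// executing eagerly, then replayed later with a single launch.
void capture_and_replay(CUstream stream, CUdeviceptr buf, size_t n) {
  cuStreamBeginCapture(stream, CU_STREAM_CAPTURE_MODE_GLOBAL);
  // From here on, calls on `stream` are recorded, not executed.
  cuMemsetD32Async(buf, /*pattern=*/0, n, stream);

  CUgraph graph;
  cuStreamEndCapture(stream, &graph);

  CUgraphExec exec;
  cuGraphInstantiate(&exec, graph, /*phErrorNode=*/nullptr,
                     /*logBuffer=*/nullptr, /*bufferSize=*/0);
  cuGraphLaunch(exec, stream);  // the deferred work actually runs here

  cuGraphExecDestroy(exec);
  cuGraphDestroy(graph);
}
```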

qiao-bo (Collaborator, Author) commented Dec 30, 2021

> Maybe we can utilize cuStreamBeginCapture to make the command list actually deferred

@bobcao3 thanks for the information. At the moment we do not want to defer the fill execution, so we don't need to over-utilize the CommandList utility for this. Maybe a simpler solution is to have the fill method in CudaDevice itself. WDYT?

bobcao3 (Collaborator) commented Dec 30, 2021

> > Maybe we can utilize cuStreamBeginCapture to make the command list actually deferred
>
> @bobcao3 thanks for the information. At the moment we do not want to defer the fill execution, so we don't need to over-utilize the CommandList utility for this. Maybe a simpler solution is to have the fill method in CudaDevice itself. WDYT?

@qiao-bo I don't think that's ideal. Not all APIs support immediate-mode resource content filling. In addition, in hardware these commands are executed by the DMA/transfer engine, which uses a queue. Since all of this is on the C++ side, it is fine to have it under CommandList; it's reasonably fast. Moreover, once we move everything into CommandList, we should expect a drop in overall host-side overhead, as CUDA then has a fully built command graph to work with and uses less implicit sync.
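A minimal host-side sketch of the record-then-submit pattern described above (the names and structure are illustrative, not Taichi's actual CommandList API):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

class CommandList {
 public:
  // Recording is cheap: nothing touches the device yet.
  void buffer_fill(uint32_t *ptr, std::size_t n, uint32_t pattern) {
    commands_.push_back([=] {
      // Stand-in for the real backend call (e.g. cuMemsetD32 on CUDA).
      for (std::size_t i = 0; i < n; ++i) ptr[i] = pattern;
    });
  }

  // Submission hands the backend the whole batch at once.
  void submit() {
    for (auto &cmd : commands_) cmd();
    commands_.clear();
  }

 private:
  std::vector<std::function<void()>> commands_;
};
```

The point of the pattern is that recording costs almost nothing on the host, and submission presents the backend with a complete batch, which is what lets the driver schedule the work with less implicit synchronization.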

qiao-bo (Collaborator, Author) commented Dec 30, 2021

> > > Maybe we can utilize cuStreamBeginCapture to make the command list actually deferred
> >
> > @bobcao3 thanks for the information. At the moment we do not want to defer the fill execution, so we don't need to over-utilize the CommandList utility for this. Maybe a simpler solution is to have the fill method in CudaDevice itself. WDYT?
>
> @qiao-bo I don't think that's ideal. Not all APIs support immediate-mode resource content filling. In addition, in hardware these commands are executed by the DMA/transfer engine, which uses a queue. Since all of this is on the C++ side, it is fine to have it under CommandList; it's reasonably fast. Moreover, once we move everything into CommandList, we should expect a drop in overall host-side overhead, as CUDA then has a fully built command graph to work with and uses less implicit sync.

I agree with the logic, but it seems we have not implemented any device utilities to support immediate content filling for the LLVM backend? I see a potentially major change here; does it still fit into this PR?

qiao-bo (Collaborator, Author) commented Dec 30, 2021

/format

qiao-bo requested a review from bobcao3 on December 30, 2021 at 16:21
qiao-bo (Collaborator, Author) commented Dec 31, 2021

/format

bobcao3 (Collaborator) left a comment

LGTM

qiao-bo merged commit 9725b6f into taichi-dev:master on Dec 31, 2021
qiao-bo deleted the fill branch on December 31, 2021 at 05:19
```diff
@@ -604,5 +604,18 @@ uint64_t *LlvmProgramImpl::get_ndarray_alloc_info_ptr(DeviceAllocation &alloc) {
    return (uint64_t *)cpu_device()->get_alloc_info(alloc).ptr;
  }
}

void LlvmProgramImpl::fill_ndarray(DeviceAllocation &alloc,
```
k-ye (Member) left a comment

@qiao-bo nit: Do we need the mutable reference DeviceAllocation&?

If not, let's make it const. Otherwise, let's follow Google's style and use a pointer.
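For illustration, the two conventions in question — a sketch with hypothetical signatures and a stubbed type, not the code as landed:

```cpp
#include <cstddef>
#include <cstdint>

struct DeviceAllocation { /* stub for illustration */ };

// Google C++ style: parameters that are only read are passed by const
// reference...
void fill_ndarray(const DeviceAllocation &alloc, std::size_t size, uint32_t data) {}

// ...while parameters a function mutates are passed by pointer, so the
// mutation is visible at the call site: mutate_alloc(&alloc).
void mutate_alloc(DeviceAllocation *alloc) {}
```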

qiao-bo (Collaborator, Author) commented Dec 31, 2021
@k-ye, made const in #3921. Thanks!

qiao-bo moved this from Performance Improvement to Done in Backends Performance on Feb 15, 2022