stream create, copy and destroy example #3470

Closed
jinz2014 opened this issue May 3, 2024 · 8 comments

@jinz2014 commented May 3, 2024

Running the stream create, copy, and destroy example on an AMD MI210 shows that the time is about 2x to 3x longer than on an Nvidia A100 for the cases below. Thanks for your comments and suggestions.

Link:
https://github.com/zjin-lcf/HeCBench/tree/master/src/streamCreateCopyDestroy-hip/
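
For context, here is a minimal sketch of the pattern the benchmark times (illustrative only; the actual HeCBench source at the link above may allocate buffers and assign streams differently, and the function name here is made up):

```cpp
// Hypothetical sketch of the timed pattern: create streams, issue one
// async host-to-device copy per buffer round-robin across the streams,
// synchronize, then destroy the streams.
#include <hip/hip_runtime.h>
#include <vector>

void create_copy_sync_destroy(int nStreams, int nBuffers, size_t bytes,
                              const char* hSrc, char* dDst) {
  std::vector<hipStream_t> streams(nStreams);
  for (auto& s : streams) hipStreamCreate(&s);

  // One copy per buffer, spread over the streams.
  for (int b = 0; b < nBuffers; ++b)
    hipMemcpyAsync(dDst + (size_t)b * bytes, hSrc + (size_t)b * bytes, bytes,
                   hipMemcpyHostToDevice, streams[b % nStreams]);

  hipDeviceSynchronize();
  for (auto& s : streams) hipStreamDestroy(s);
}
```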

MI210
Create+Copy+Synchronize+Destroy time for 1 streams and 5000 buffers and 16 iterations 49.6401 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 5000 buffers and 8 iterations 50.2982 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 5000 buffers and 4 iterations 57.4719 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 5000 buffers and 2 iterations 54.3432 (ms)

https://github.com/zjin-lcf/HeCBench/tree/master/src/streamCreateCopyDestroy-cuda

A100
Create+Copy+Synchronize+Destroy time for 1 streams and 5000 buffers and 16 iterations 23.3694 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 5000 buffers and 8 iterations 23.2853 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 5000 buffers and 4 iterations 23.38 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 5000 buffers and 2 iterations 23.2302 (ms)

@shadidashmiz self-assigned this May 13, 2024

@shadidashmiz (Contributor) commented

It looks like your test time increases linearly with the number of buffers you allocate, so this does not look like a hipStream issue.

@jinz2014 (Author) commented

In that test, the two datacenter GPUs are not installed in the same host, so I am not sure whether the different hosts affect the execution time. To check, I ran the CUDA and HIP programs on a desktop computer with both Nvidia and AMD GPUs.

RTX3090

Create+Copy+Synchronize+Destroy time for 1 streams and 1 buffers and 128 iterations 0.0128745 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 1 buffers and 64 iterations 0.00757924 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 1 buffers and 32 iterations 0.0062003 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 1 buffers and 16 iterations 0.0063163 (ms)
Create+Copy+Synchronize+Destroy time for 1 streams and 100 buffers and 64 iterations 0.270007 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 100 buffers and 32 iterations 0.255549 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 100 buffers and 16 iterations 0.27291 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 100 buffers and 8 iterations 0.267216 (ms)
Create+Copy+Synchronize+Destroy time for 1 streams and 1000 buffers and 32 iterations 2.53417 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 1000 buffers and 16 iterations 2.52819 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 1000 buffers and 8 iterations 2.52339 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 1000 buffers and 4 iterations 2.52614 (ms)
Create+Copy+Synchronize+Destroy time for 1 streams and 5000 buffers and 16 iterations 12.7661 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 5000 buffers and 8 iterations 12.7234 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 5000 buffers and 4 iterations 12.7502 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 5000 buffers and 2 iterations 12.7159 (ms)

gfx1030

Create+Copy+Synchronize+Destroy time for 1 streams and 1 buffers and 128 iterations 1.99878 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 1 buffers and 64 iterations 0.574584 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 1 buffers and 32 iterations 0.610492 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 1 buffers and 16 iterations 0.587304 (ms)
Create+Copy+Synchronize+Destroy time for 1 streams and 100 buffers and 64 iterations 1.39792 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 100 buffers and 32 iterations 1.39171 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 100 buffers and 16 iterations 1.41488 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 100 buffers and 8 iterations 1.43967 (ms)
Create+Copy+Synchronize+Destroy time for 1 streams and 1000 buffers and 32 iterations 9.0404 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 1000 buffers and 16 iterations 9.03053 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 1000 buffers and 8 iterations 9.05028 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 1000 buffers and 4 iterations 9.15136 (ms)
Create+Copy+Synchronize+Destroy time for 1 streams and 5000 buffers and 16 iterations 43.0856 (ms)
Create+Copy+Synchronize+Destroy time for 2 streams and 5000 buffers and 8 iterations 43.0919 (ms)
Create+Copy+Synchronize+Destroy time for 4 streams and 5000 buffers and 4 iterations 43.1138 (ms)
Create+Copy+Synchronize+Destroy time for 8 streams and 5000 buffers and 2 iterations 43.166 (ms)

@bdenhollander commented

I profiled your code on Windows on gfx1032. The majority of the time was spent in memcpy rather than in creating and destroying streams. This code may be more of a host-to-device copy benchmark.
[profiler screenshot]

@jinz2014 (Author) commented

Yes, most of the time is spent on the data copy. I updated the title of the issue.

@jinz2014 changed the title from "stream create and destroy example" to "stream create, copy and destroy example" on May 22, 2024
@jinz2014 (Author) commented

@bdenhollander What is your profiler?

@bdenhollander commented

The screenshot is from Visual Studio 2019's built-in profiler.

@schung-amd commented

Hi @jinz2014, hipMemcpyAsync (and cudaMemcpyAsync on the CUDA side) is asynchronous with respect to compute operations, but not necessarily with respect to other memory copies; only one copy can be executing at a time per direction across a PCIe link. This means that your host-to-device copies are actually serialized, even though you used the async version. You can check this by comparing performance against the regular hipMemcpy.
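
As a minimal sketch of that check (not the benchmark code; the stream count and copy size are arbitrary, and the host buffer is pinned so the async calls are not forced to be synchronous), the async timing should land close to the blocking one if the copies serialize on the link:

```cpp
// Compare back-to-back blocking hipMemcpy against hipMemcpyAsync spread
// over several streams. Similar timings suggest the H2D copies serialize.
#include <hip/hip_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
  const int nStreams = 4;
  const size_t bytes = 16 << 20;  // 16 MiB per copy (arbitrary)

  char* hSrc;  // pinned, so hipMemcpyAsync is truly asynchronous
  hipHostMalloc((void**)&hSrc, nStreams * bytes);
  char* dDst;
  hipMalloc((void**)&dDst, nStreams * bytes);

  hipStream_t streams[nStreams];
  for (int i = 0; i < nStreams; ++i) hipStreamCreate(&streams[i]);

  // Blocking copies, one after another.
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < nStreams; ++i)
    hipMemcpy(dDst + i * bytes, hSrc + i * bytes, bytes,
              hipMemcpyHostToDevice);
  auto t1 = std::chrono::steady_clock::now();

  // Async copies on separate streams, then a single synchronize.
  for (int i = 0; i < nStreams; ++i)
    hipMemcpyAsync(dDst + i * bytes, hSrc + i * bytes, bytes,
                   hipMemcpyHostToDevice, streams[i]);
  hipDeviceSynchronize();
  auto t2 = std::chrono::steady_clock::now();

  printf("blocking: %.3f ms  async: %.3f ms\n",
         std::chrono::duration<double, std::milli>(t1 - t0).count(),
         std::chrono::duration<double, std::milli>(t2 - t1).count());

  for (int i = 0; i < nStreams; ++i) hipStreamDestroy(streams[i]);
  hipFree(dDst);
  hipHostFree(hSrc);
  return 0;
}
```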

If you want to profile the performance of multiple streams, you should do so with kernels that perform computation, as kernel execution can overlap with memory transfers. In that case you should also pin your host memory with hipHostMalloc, as otherwise the copies will be synchronous (as per the docs). If you want to profile the performance of hipMemcpy itself, you should transfer a larger amount of data in a single memcpy to reduce the share of overhead being measured.
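
As a minimal sketch of that setup (not the benchmark code; the scale kernel, sizes, and stream count are made up for illustration), pinning the host buffer lets each stream's kernel overlap with another stream's copy:

```cpp
// Overlap H2D copies with a trivial kernel across streams. The copies still
// serialize on the PCIe link, but each kernel can run while another
// stream's copy is in flight.
#include <hip/hip_runtime.h>

__global__ void scale(float* d, int n, float f) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] *= f;
}

int main() {
  const int nStreams = 4;
  const int n = 1 << 20;  // elements per chunk (arbitrary)
  const size_t bytes = n * sizeof(float);

  float* h;  // pinned with hipHostMalloc so the async copies do not block
  hipHostMalloc((void**)&h, nStreams * bytes);
  float* d;
  hipMalloc((void**)&d, nStreams * bytes);

  hipStream_t streams[nStreams];
  for (int i = 0; i < nStreams; ++i) hipStreamCreate(&streams[i]);

  // Issue copy + kernel per stream; stream i's kernel overlaps with
  // stream i+1's copy.
  for (int i = 0; i < nStreams; ++i) {
    hipMemcpyAsync(d + i * n, h + i * n, bytes, hipMemcpyHostToDevice,
                   streams[i]);
    scale<<<(n + 255) / 256, 256, 0, streams[i]>>>(d + i * n, n, 2.0f);
  }
  hipDeviceSynchronize();

  for (int i = 0; i < nStreams; ++i) hipStreamDestroy(streams[i]);
  hipFree(d);
  hipHostFree(h);
  return 0;
}
```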

@jinz2014 (Author) commented Oct 3, 2024

Thanks.

@jinz2014 closed this as completed Oct 3, 2024