Benchmarks - Add GPU Stream Micro Benchmark #697
base: main
@microsoft-github-policy-service agree company="Microsoft"
Pull Request Overview
This PR introduces a new GPU stream micro benchmark that measures double-precision memory operations performance on GPUs. Key changes include new C++ and Python benchmark implementations, comprehensive unit tests, and updated user documentation with detailed metrics.
Reviewed Changes
Copilot reviewed 12 out of 15 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
tests/benchmarks/micro_benchmarks/test_gpu_stream.py | Added unit tests for command generation and result parsing. |
superbench/benchmarks/micro_benchmarks/gpu_stream/*.hpp & .cpp | New benchmark implementation including kernels, utils, and option parsing. |
superbench/benchmarks/micro_benchmarks/gpu_stream.py | Python wrapper for launching the GPU stream benchmark. |
docs/user-tutorial/benchmarks/micro-benchmarks.md | Updated documentation to include the new GPU stream benchmark. |
examples/benchmarks/gpu_stream.py | Minimal example usage for the GPU stream benchmark. |
Files not reviewed (3)
- superbench/benchmarks/micro_benchmarks/cuda_common.cmake: Language not supported
- superbench/benchmarks/micro_benchmarks/gpu_stream/CMakeLists.txt: Language not supported
- superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.cu: Language not supported
Comments suppressed due to low confidence (1)
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:42
- The usage message incorrectly refers to 'gpu_copy' instead of 'gpu_stream'. Please update the message to accurately reflect the benchmark name.
std::cout << "Usage: gpu_copy " << "--size <size in bytes> " << "--num_warm_up <num_warm_up> " << "--num_loops <num_loops> " << "[--check_data]" << std::endl;
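For reference, the flags in that usage line could be assembled from the Python side roughly as follows. This is only a sketch: the binary name `gpu_stream` follows the suggestion in the comment above, `build_gpu_stream_command` is a hypothetical helper rather than a function from this PR, and the argument values are placeholders.

```python
import shlex


def build_gpu_stream_command(size, num_warm_up, num_loops, check_data=False, binary="gpu_stream"):
    """Assemble a command line from the flags listed in the usage message.

    The helper name and the default binary name are assumptions for illustration.
    """
    args = [
        binary,
        "--size", str(size),
        "--num_warm_up", str(num_warm_up),
        "--num_loops", str(num_loops),
    ]
    if check_data:
        # --check_data is an optional switch that takes no value.
        args.append("--check_data")
    return shlex.join(args)


# Placeholder values, not recommended settings.
command = build_gpu_stream_command(size=4096, num_warm_up=1, num_loops=10, check_data=True)
print(command)
```

Keeping the flag names in one place like this makes a mismatch between the wrapper and the usage text easier to spot.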
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp
| STREAM\_SCALE\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the scale operation with specified buffer size and block size. |
| STREAM\_ADD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the add operation with specified buffer size and block size. |
| STREAM\_TRIAD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the triad operation with specified buffer size and block size. |
| STREAM\_COPY\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_pct | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the copy operation with specified buffer size and block size. |
| STREAM\_SCALE\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_pct | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the scale operation with specified buffer size and block size. |
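To make the naming scheme in these rows concrete, here is a small sketch that splits a metric name into its parts. The regular expression mirrors the patterns shown in the table; `parse_metric_name` and the sample name (including its buffer and block values) are hypothetical and not taken from the PR.

```python
import re

# Mirrors the documented patterns: operation, GPU index, buffer size, block size,
# and a suffix that is either "bw" (bandwidth) or "pct" (efficiency).
METRIC_PATTERN = re.compile(
    r"^STREAM_(?P<op>COPY|SCALE|ADD|TRIAD)_double_gpu_(?P<gpu>[0-9])"
    r"_buffer_(?P<buffer>[0-9]+)_block_(?P<block>[0-9]+)_(?P<kind>bw|pct)$"
)


def parse_metric_name(name):
    """Return the fields encoded in a metric name, or None if it does not match."""
    match = METRIC_PATTERN.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("gpu", "buffer", "block"):
        fields[key] = int(fields[key])
    return fields


# Hypothetical metric name for illustration only.
sample = "STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_1024_pct"
parsed = parse_metric_name(sample)
print(parsed)
```

Note that the documented pattern uses a single digit for the GPU index, so the sketch does the same.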
Align the `|` position?
#### Introduction

Measure the memory bandwidth of GPU using the STREAM benchmark. The benchmark tests various memory operations including copy, scale, add, and triad for double datatype.
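As a reading aid for the four operations named above, the following pure-Python sketch spells out the classic STREAM definitions and the usual way a bandwidth figure is derived from them. It is not the PR's CUDA kernel code: the function names are hypothetical, and the byte counting (two arrays touched for copy and scale, three for add and triad, with GB taken as 1e9 bytes) follows the conventional STREAM accounting, which is assumed here rather than confirmed by the PR.

```python
BYTES_PER_DOUBLE = 8

# Arrays read or written per element in the classic STREAM accounting (assumed).
ARRAYS_TOUCHED = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def stream_copy(a):
    # c[i] = a[i]
    return list(a)


def stream_scale(c, scalar):
    # b[i] = scalar * c[i]
    return [scalar * x for x in c]


def stream_add(a, b):
    # c[i] = a[i] + b[i]
    return [x + y for x, y in zip(a, b)]


def stream_triad(b, c, scalar):
    # a[i] = b[i] + scalar * c[i]
    return [x + scalar * y for x, y in zip(b, c)]


def bandwidth_gb_per_s(op, num_elements, seconds):
    """Bytes moved by one pass of `op`, divided by elapsed time, in GB/s (1e9 bytes)."""
    bytes_moved = ARRAYS_TOUCHED[op] * num_elements * BYTES_PER_DOUBLE
    return bytes_moved / seconds / 1e9


a = [1.0, 2.0, 3.0]
c = stream_copy(a)
b = stream_scale(c, 2.0)
c = stream_add(a, b)
a = stream_triad(b, c, 2.0)
print(a)

# Hypothetical timing: one triad pass over 2**27 doubles in 2 ms.
print(bandwidth_gb_per_s("triad", 2**27, 0.002))
```

On a GPU each list comprehension would correspond to one kernel launch over the buffer, with the elapsed time measured around the launch.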
Is fp64 required? Why not support fp32, bf16, etc.?
NVIDIA's STREAM benchmark runs fp64 and fp32. In our current implementation, the fp64 results are validated against NVIDIA's results, but the fp32 results underperform, so additional data types will be debugged and added in a future PR.
- set(CMAKE_CUDA_STANDARD 11)
+ set(CMAKE_CUDA_STANDARD 17)
Will this break existing benchmarks? Maybe split it into a separate PR.
Yes. The variant library in gpu_stream.hpp needs C++17.
Added GPU Stream benchmark - measures the GPU memory bandwidth and efficiency for double datatype through various memory operations including copy, scale, add, and triad.
`gpu-stream` detailing its introduction, metrics, and descriptions.
`gpu-stream`. Example output is in `superbenchmark/tests/data/gpu_stream.log`.