Benchmarks - Add GPU Stream Micro Benchmark #697

WenqingLan1 · 2025-04-08T17:20:00Z

Added the GPU Stream benchmark, which measures GPU memory bandwidth and efficiency for the double data type through various memory operations including copy, scale, add, and triad.

- Added documentation for gpu-stream detailing its introduction, metrics, and descriptions.
- Added unit tests for gpu-stream. Example output is in superbenchmark/tests/data/gpu_stream.log.
- Updated the CUDA standard from C++11 to C++17 for compatibility with the new benchmark.
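For readers unfamiliar with STREAM, the four operations named above can be sketched in plain Python as a CPU reference. This is a minimal sketch following the conventional STREAM definitions, not the code in this PR: the function names, the array roles, the per-operation byte accounting, and the use of 10^9 bytes per GB are all assumptions made for illustration.

```python
# Reference (CPU) versions of the four STREAM operations.
# Assumption: conventional STREAM definitions; the PR's CUDA kernels may differ in detail.

def stream_copy(a):
    """c[i] = a[i]"""
    return list(a)

def stream_scale(c, scalar):
    """b[i] = scalar * c[i]"""
    return [scalar * x for x in c]

def stream_add(a, b):
    """c[i] = a[i] + b[i]"""
    return [x + y for x, y in zip(a, b)]

def stream_triad(b, c, scalar):
    """a[i] = b[i] + scalar * c[i]"""
    return [x + scalar * y for x, y in zip(b, c)]

# Number of arrays read or written per element for each operation.
# Copy and scale read one array and write one; add and triad read two and write one.
ARRAYS_TOUCHED = {"copy": 2, "scale": 2, "add": 3, "triad": 3}

def bandwidth_gb_per_s(op, num_elements, seconds, bytes_per_element=8):
    """Effective bandwidth in GB/s (assuming 1 GB = 1e9 bytes, 8-byte doubles)."""
    bytes_moved = ARRAYS_TOUCHED[op] * num_elements * bytes_per_element
    return bytes_moved / seconds / 1e9
```

Under these assumptions, a triad over one million doubles that takes 0.5 ms moves 24 MB and would be reported as roughly 48 GB/s.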

WenqingLan1 · 2025-04-08T17:27:57Z

@microsoft-github-policy-service agree company="Microsoft"

Copilot

Pull Request Overview

This PR introduces a new GPU stream micro benchmark that measures double-precision memory operations performance on GPUs. Key changes include new C++ and Python benchmark implementations, comprehensive unit tests, and updated user documentation with detailed metrics.

Reviewed Changes

Copilot reviewed 12 out of 15 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/benchmarks/micro_benchmarks/test_gpu_stream.py | Added unit tests for command generation and result parsing. |
| superbench/benchmarks/micro_benchmarks/gpu_stream/*.hpp & .cpp | New benchmark implementation including kernels, utils, and option parsing. |
| superbench/benchmarks/micro_benchmarks/gpu_stream.py | Python wrapper for launching the GPU stream benchmark. |
| docs/user-tutorial/benchmarks/micro-benchmarks.md | Updated documentation to include the new GPU stream benchmark. |
| examples/benchmarks/gpu_stream.py | Minimal example usage for the GPU stream benchmark. |

Files not reviewed (3)

superbench/benchmarks/micro_benchmarks/cuda_common.cmake: Language not supported
superbench/benchmarks/micro_benchmarks/gpu_stream/CMakeLists.txt: Language not supported
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.cu: Language not supported

Comments suppressed due to low confidence (1)

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:42

The usage message incorrectly refers to 'gpu_copy' instead of 'gpu_stream'. Please update the message to accurately reflect the benchmark name.

std::cout << "Usage: gpu_copy " << "--size <size in bytes> " << "--num_warm_up <num_warm_up> " << "--num_loops <num_loops> " << "[--check_data]" << std::endl;
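The flags in that usage message suggest how a wrapper might assemble the command line. The following is a hedged sketch only: `build_gpu_stream_command` is a hypothetical helper, not the function in superbench/benchmarks/micro_benchmarks/gpu_stream.py, and the binary name `gpu_stream` is assumed from the review comment above. Only the four flags come from the quoted usage string.

```python
import shlex

def build_gpu_stream_command(size, num_warm_up, num_loops,
                             check_data=False, binary="gpu_stream"):
    """Assemble a command line from the flags shown in the usage message.

    Hypothetical helper for illustration; the binary name is an assumption.
    """
    args = [
        binary,
        "--size", str(size),                # buffer size in bytes
        "--num_warm_up", str(num_warm_up),  # untimed warm-up iterations
        "--num_loops", str(num_loops),      # timed iterations
    ]
    if check_data:
        args.append("--check_data")         # optional flag, takes no value
    return shlex.join(args)
```

Building the command from a list and joining it with `shlex.join` keeps each argument quoted correctly if a value ever contains spaces.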

abuccts · 2025-04-30T03:47:03Z

docs/user-tutorial/benchmarks/micro-benchmarks.md

+| STREAM\_SCALE\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the scale operation with specified buffer size and block size.                        |
+| STREAM\_ADD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw   | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the add operation with specified buffer size and block size.                          |
+| STREAM\_TRIAD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the triad operation with specified buffer size and block size.                          |                        |
+| STREAM\_COPY\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_pct | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the copy operation with specified buffer size and block size.                         |
+| STREAM\_SCALE\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_pct | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the scale operation with specified buffer size and block size.                        |


align the | position?
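The metric names in the documented rows follow a fixed pattern, so downstream tooling could split them into fields with a regular expression. This is a minimal sketch assuming the names appear exactly as documented (without the markdown backslash escapes); `parse_metric_name` and the sample values in the usage note are hypothetical and not part of this PR.

```python
import re

# Pattern taken from the documented metric names, e.g.
# STREAM_COPY_double_gpu_[0-9]_buffer_[0-9]+_block_[0-9]+_bw
METRIC_RE = re.compile(
    r"^STREAM_(?P<op>COPY|SCALE|ADD|TRIAD)_double_gpu_(?P<gpu>[0-9])"
    r"_buffer_(?P<buffer>[0-9]+)_block_(?P<block>[0-9]+)_(?P<kind>bw|pct)$"
)

def parse_metric_name(name):
    """Return the fields encoded in a metric name, or None if it does not match."""
    match = METRIC_RE.match(name)
    if match is None:
        return None
    return {
        "op": match["op"].lower(),       # copy, scale, add, or triad
        "gpu": int(match["gpu"]),        # GPU index (single digit per the docs)
        "buffer": int(match["buffer"]),  # buffer size
        "block": int(match["block"]),    # block size
        "kind": match["kind"],           # "bw" (GB/s) or "pct" (efficiency %)
    }
```

For example, `parse_metric_name("STREAM_TRIAD_double_gpu_0_buffer_1073741824_block_256_bw")` yields the operation `triad`, GPU `0`, buffer `1073741824`, block `256`, and kind `bw`; a name with a data type other than `double` returns `None`.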

abuccts · 2025-04-30T03:47:43Z

docs/user-tutorial/benchmarks/micro-benchmarks.md

+
+#### Introduction
+
+Measure the memory bandwidth of GPU using the STREAM benchmark. The benchmark tests various memory operations including copy, scale, add, and triad for double datatype.


Is fp64 required? Why not support fp32, bf16, etc.?

WenqingLan1 · 2025-05-02

NVIDIA's STREAM benchmark runs both fp64 and fp32. In our current implementation, the fp64 results have been validated against NVIDIA's results, but the fp32 results underperform, so more data types will be debugged and added in a future PR.

abuccts · 2025-04-30T03:49:20Z

superbench/benchmarks/micro_benchmarks/cuda_common.cmake

-    set(CMAKE_CUDA_STANDARD 11)
+    set(CMAKE_CUDA_STANDARD 17)


Will this break existing benchmarks? Maybe split it into a separate PR.

WenqingLan1 · 2025-05-02

Yes. The variant library used in gpu_stream.hpp needs C++17.

WenqingLan1 added 2 commits April 3, 2025 18:08

add gpu-stream micro bench

8558be5

cleanup

9d6486c

WenqingLan1 requested review from cp5555, guoshzhao and a team as code owners April 8, 2025 17:20

guoshzhao requested review from abuccts, polarG and Copilot April 21, 2025 22:44

Copilot AI reviewed Apr 21, 2025

abuccts reviewed Apr 30, 2025

Merge branch 'microsoft:main' into feat/gpu-stream

f947786

polarG added the micro-benchmarks label Apr 30, 2025

WenqingLan1 and others added 2 commits May 2, 2025 22:07

fix typo

bfcc14b

Merge branch 'microsoft:main' into feat/gpu-stream

bbe1584

guoshzhao mentioned this pull request May 14, 2025

V0.12.0 Release Plan #710

Open

34 tasks
