HeCBench

This repository contains a collection of heterogeneous computing benchmarks written with CUDA, HIP, SYCL/DPC++, and OpenMP-4.5 target offloading for studying performance, portability, and productivity.

Background, use cases and future work

Z. Jin and J. S. Vetter, "A Benchmark Suite for Improving Performance Portability of the SYCL Programming Model," 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Raleigh, NC, USA, 2023, pp. 325-327, doi: 10.1109/ISPASS57527.2023.00041. (https://ieeexplore.ieee.org/document/10158214)

Software installation

AMD ROCm
Intel DPC++ compiler or Intel oneAPI toolkit
NVIDIA HPC SDK

Dependencies

Certain SYCL benchmarks require oneDPL, oneTBB, SYCLomatic, or oneMKL Interfaces.

Benchmark categories

Each benchmark is assigned to a single category. Although this classification is approximate, it serves as a starting point for users of the benchmark suite. Please see the Reference for more information about each benchmark.

Automotive

daphne

Bandwidth

allreduce, cmembench, babelstream, ccl, memcpy, memtest, pingpong, randomAccess, shmembench, triad 

Bioinformatics

all-pairs-distance, bsw, ccs, cm, deredundancy, diamond, epistasis, extend2, frna, fsm, ga, logan, minibude, minimap2, nbnxm, nw, pcc, prna, sa, snake

Computer vision and image processing

affine, aobench, asmooth, background-subtract, bezier-surface, bilateral, bm3d, boxfilter, cbsfil, car, ced, colorwheel, convolution1D, convolution3D, convolutionDeformable, convolutionSeperable, dct8x8, debayer, depixel, degrid, doh, dpid, egs, face, flame, gabor, gamma-correction, hogbom, mandelbrot, marchCubes, match, medianfilter, morphology, mriQ, ne, opticalFlow, perlin, sobel, tonemapping, recursiveGaussian, resize, sad, seam-carving, spm, srad, ssim, stencil1d, stencil3d, surfel, voxelization, zoom

Cryptography

aes, bitcracker, bitpermute, chacha20, columnarSolver, ecdh, keccaktreehash, merkle, present  

Data compression and reduction

atomicAggregate, atomicCAS, atomicCost, atomicIntrinsics, atomicPerf, atomicSystemWide, bitpacking, bscan, bwt, compute-score, contract, dxtc2, filter, fpc, histogram, lzss, minmax, mpc, mtf, rle, sc, scan, scan2, scan3, segment-reduce

Data encoding, decoding, or verification

ans, crc64, crs, entropy, jenkins-hash, ldpc, md5hash, murmurhash3

Finance

aop, black-scholes, binomial, bonds, libor

Geoscience

aidw, coordinates, geodesic, hausdorff, haversine, stsg

Graph and Tree

cc, floydwarshall, floydwarshall2, gc, hbc, hungarian, mis, sssp, rsmt

Language and kernel features

aligned-types, asta, blockAccess, blockexchange, collision, concurrentKernels, conversion, copy, dispatch, graphExecution, ert, interleave, intrinsics-cast, kernelLaunch, layout, mallocFree, maxFlops, mixbench, nosync, openmp, overlap, p2p, pad, pitch, popcount, prefetch, reverse, ring, saxpy-ompt, shuffle, simpleMultiDevice, streamCreateCopyDestroy, streamOrderedAllocation, streamPriority, streamUM, tensorAccessor, threadfence, warpexchange, vote, wmma, wordcount, zerocopy 

Machine learning

accuracy, adam, addBiasResidualLayerNorm, attention, attentionMultiHead, backprop, bincount, bn, channelShuffle, channelSum, clink, concat, crossEntropy, dense-embedding, dropout, dwconv, dwconv1d, expdist, flip, gd, gelu, ge-spmm, glu, gmm, gru, kalman, kmc, kmeans, knn, layernorm, lda, lif, logprob, lr, lrn, mask, matern, maxpool3d, mcpr, meanshift, mf-sgd, mmcsf, mnist, mrc, multinomial, nlll, nonzero, overlay, p4, page-rank, permute, perplexity, pointwise, pool, qkv, qtclustering, remap, relu, resnet-kernels, rowwiseMoments, rotary, sampling, scel, softmax, softmax-fused, softmax-online, stddev, streamcluster, swish, unfold, vol2col, wedford, winograd, word2vec

Math

atan2, blas-dot, blas-fp8gemm, blas-gemm, blas-gemmBatched, blas-gemmStridedBatched, blas-gemmEx, blas-gemmEx2, complex, cross, determinant, divergence, dp, eigenvalue, f16max, f16sp, frechet, fresnel, fwt, gaussian, geam, gels, gemv, hellinger, hmm, idivide, interval, jaccard, jacobi, kurtosis, lanczos, langford, lci, lebesgue, leukocyte, lfib4, log2, lud, ludb, michalewicz, matrix-rotate, matrixT, minkowski, mr, mrg32k3a, norm2, nqueen, ntt, phmm, pnpoly, reverse2D, rfs, romberg, rsc, sddmm-batch, secp256k1, simpleSpmv, slu, spd2s, spgeam, spgemm, spmm, spmv, spnnz, sps2d, spsort, sptrsv, thomas, wyllie, zeropoint

Random number generation

mt, permutate, qrg, rng-wallace, sobol, urng

Search

bfs, bsearch, b+tree, grep, keogh, s8n, ss, sss, tsp

Signal processing

extrema, fft, lombscargle, sosfil, zmddft

Simulation

ace, adv, amgmk, axhelm, bh, bspline-vgh, burger, cooling, ccsd-trpdrv, che, chemv, chi2, clenergy, cmp, cobahh, d2q9_bgk, d3q19_bgk, damage, ddbp, dslash, easyWave, eikonal, fdtd3d, feynman-kac, fhd, fluidSim, gibbs, goulash, gpp, grrt, haccmk, halo-finder, heartwall, heat, heat2d, henry, hexicton, hotspot, hotspot3D, hpl, hwt1d, hypterm, ising, iso2dfd, laplace, laplace3d, lavaMD, lid-driven-cavity, logic-resim, logic-rewrite, loopback, lsqt, lulesh, mcmd, md, mdh, metropolis, miniFE, minimod, minisweep, miniWeather, multimaterial, myocte, nbody, particle-diffusion, particlefilter, particles, pathfinder, pns, projectile, pso, qem, rainflow, rayleighBenardConvection, reaction, rsbench, rtm8, rushlarsen, s3d, su3sheath, simplemoc, slit, sparkler, sph, sw4ck, tensorT, testSNAP, tissue, tpacf, tqs, tridiagonal, tsa, vanGenuchten, vmc, wlcpow, wsm5, xlqc, xsbench

Sorting

bitonic-sort, hybridsort, is, merge, quicksort, radixsort, segsort, sort, sortKV, split, warpsort

Robotics

inversek2j, rodrigues

Run a benchmark

Option 1: Makefile scripts that build and run an individual benchmark

  Navigate to a benchmark in CUDA (benchmark-cuda) and type  
  `make ARCH=sm_70 run`  // run on an NVIDIA GPU device with compute capability 7.0
  
  Navigate to a benchmark in HIP (benchmark-hip) and type  
  `make run`
  
  Navigate to a benchmark in SYCL (benchmark-sycl) and type   
 `make CUDA=yes CUDA_ARCH=sm_70 GCC_TOOLCHAIN="" run` (targeting an NVIDIA GPU)
 `make HIP=yes HIP_ARCH=gfx908 run`                   (targeting an AMD GPU)  
 `make run` or `make CC=icpx run`                     (targeting an Intel GPU)
  NOTE: `--gcc-toolchain` may be required for the SYCL compiler to select the proper GNU toolchain; otherwise unset GCC_TOOLCHAIN
 
  Navigate to a benchmark in OpenMP (benchmark-omp) and type  
  `make -f Makefile.nvc run`  (targeting NVIDIA GPUs)
  `make -f Makefile.aomp run` (targeting AMD GPUs)
  `make run`                  (targeting Intel GPUs) 
  
  Users may need to set appropriate values (e.g., `sm_80`, `sm_90`, `gfx906`, `gfx1030`) for their target offloading devices, for example:
  `make -f Makefile.nvc SM=cc80 run`
  `make -f Makefile.aomp ARCH=gfx906 run`
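
For example, a complete build-and-run of one CUDA benchmark might look like the sketch below; the `src/backprop-cuda` path and the `sm_80` value are assumptions to adjust for your checkout and GPU.

```
# assumed layout: benchmark directories such as backprop-cuda live under src/
cd src/backprop-cuda
# build for an NVIDIA GPU with compute capability 8.0 and run the benchmark
make ARCH=sm_80 run
```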

Option 2: Python scripts that help build, run, and gather results from the benchmarks, along with a basic script to compare results from two different runs.

It works with a `.json` file that lists the benchmark names, a regex for extracting the timings from the benchmark output, and any arguments that must be passed to the benchmark binary. At the moment `subset.json` covers only a subset of the CUDA, HIP, and SYCL benchmarks, so more work is required to support the rest. In addition, if some benchmarks in the `.json` list fail, a text file listing benchmarks to skip can be supplied when running all of them; benchmarks in that file can still be run explicitly.
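
For illustration only, an entry in such a `.json` file pairs a benchmark name with the regex used to extract its timing and the arguments passed to its binary. The field names and values below are a hypothetical sketch; `subset.json` defines the actual schema.

```
{
  "backprop": {
    "args": "65536",
    "regex": "Device time.*: ([0-9.]+)"
  }
}
```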

For example, to run all the SYCL benchmarks and then all the CUDA
benchmarks and compare the two:

```
./autohecbench.py sycl -o sycl.csv
./autohecbench.py cuda -o cuda.csv
./autohecbench-compare.py sycl.csv cuda.csv
```

It can also be used to run a single benchmark:

```
./autohecbench.py backprop-sycl --verbose
```

To run a single benchmark using the Intel oneAPI Toolkit on an Intel GPU:

```
./autohecbench.py backprop-sycl --sycl-type opencl --compiler-name icpx --verbose
```

To run a single benchmark using the Intel oneAPI Toolkit on a CPU:

```
./autohecbench.py backprop-sycl --sycl-type cpu --compiler-name icpx --verbose
```

By default it will run a warmup iteration before running each benchmark,
and it is possible to run the benchmarks multiple times with `-r`:
```
./autohecbench.py backprop-sycl -r 20 -o mandel-sycl.csv
```

It also has options to pick the CUDA SM version or HIP architecture, along with a few other parameters. Type `./autohecbench.py` to see all the options.

Dataset

For Rodinia benchmarks, please download the dataset at http://lava.cs.virginia.edu/Rodinia/download.htm
For other benchmarks, datasets are either included with the benchmarks or can be downloaded via the links described in the README files.

Known issues

The programs have not been evaluated on Windows or macOS
The latest Intel SYCL compiler (not the Intel oneAPI toolkit) may be needed to build some SYCL programs successfully
For certain programs, kernel results produced by the different programming languages on the same platform do not match exactly
Not all programs automate the verification of host and device results
Not all CUDA programs have SYCL, HIP, or OpenMP equivalents
Not all programs have OpenMP target offloading implementations
Raw performance of any program may be suboptimal
Some programs may take a long time to complete on an integrated GPU
Some host programs contain platform-specific intrinsics, so they may cause compile errors on a PowerPC platform

Emulation

When double-precision floating-point operations are not supported on certain Intel GPU devices, software emulation may be enabled; see FP64 emulation for instructions.
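
As a rough sketch, the environment variables described in Intel's FP64 emulation documentation can typically be set before running a SYCL benchmark (availability depends on the GPU and driver):

```
# enable FP64 software emulation in the Intel GPU stack (per Intel's FP64 emulation documentation)
export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1
# then run a benchmark as usual, e.g. through the Python script
./autohecbench.py backprop-sycl --sycl-type opencl --compiler-name icpx --verbose
```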

Feedback from the papers

Faqir-Rhazoui, Y. and García, C., 2024. SYCL in the edge: performance and energy evaluation for heterogeneous acceleration. The Journal of Supercomputing, pp.1-21.

Dearing, M.T., Tao, Y., Wu, X., Lan, Z. and Taylor, V., 2024. LASSI: An LLM-based Automated Self-Correcting Pipeline for Translating Parallel Scientific Codes. arXiv preprint arXiv:2407.01638.

Marzen, L., Dutta, A. and Jannesari, A., 2024. Static Generation of Efficient OpenMP Offload Data Mappings. arXiv preprint arXiv:2406.13881.

Ivanov, I.R., Zinenko, O., Domke, J., Endo, T. and Moses, W.S., 2024, March. Retargeting and Respecializing GPU Workloads for Performance Portability. In 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) (pp. 119-132). IEEE.

Shilpage, W.R. and Wright, S.A., 2023, May. An investigation into the performance and portability of SYCL compiler implementations. In International Conference on High Performance Computing (pp. 605-619). Cham: Springer Nature Switzerland.

Tian, S., Chapman, B. and Doerfert, J., 2023, August. Maximizing Parallelism and GPU Utilization For Direct GPU Compilation Through Ensemble Execution. In Proceedings of the 52nd International Conference on Parallel Processing Workshops (pp. 112-118).

Tian, S., Scogland, T., Chapman, B. and Doerfert, J., 2023, November. OpenMP kernel language extensions for performance portable GPU codes. In Proceedings of the SC'23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (pp. 876-883).

Murtovi, A., Georgakoudis, G., Parasyris, K., Liao, C., Laguna, I. and Steffen, B., 2023. Enhancing Performance Through Control-flow Unmerging and Loop Unrolling (No. LLNL-CONF-849354). Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States).

Alpay, A., Soproni, B., Wünsche, H. and Heuveline, V., 2022, May. Exploring the possibility of a hipSYCL-based implementation of oneAPI. In Proceedings of the 10th International Workshop on OpenCL (pp. 1-12).

Thavappiragasam, M. and Kale, V., 2022, November. OpenMP’s Asynchronous Offloading for All-pairs Shortest Path Graph Algorithms on GPUs. In 2022 IEEE/ACM International Workshop on Hierarchical Parallelism for Exascale Computing (HiPar) (pp. 1-11). IEEE.

Doerfert, J., Jasper, M., Huber, J., Abdelaal, K., Georgakoudis, G., Scogland, T. and Parasyris, K., 2022, October. Breaking the vendor lock: performance portable programming through OpenMP as target independent runtime layer. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (pp. 494-504).

Tian, S., Huber, J., Parasyris, K., Chapman, B. and Doerfert, J., 2022, November. Direct GPU compilation and execution for host applications with OpenMP Parallelism. In 2022 IEEE/ACM Eighth Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC) (pp. 43-51). IEEE.

Jin, Z. and Vetter, J.S., 2022, December. Understanding performance portability of bioinformatics applications in SYCL on an NVIDIA GPU. In 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (pp. 2190-2195). IEEE.

Jin, Z., 2021. The Rodinia Benchmarks in SYCL (No. ORNL/TM-2021/2338). Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States).

Experimental Results

Early results are shown here

Reference

accuracy (cuda)

Accuracy of prediction (https://pytorch.org/)

ace (cuda)

Phase-field simulation of dendritic solidification (https://github.com/myousefi2016/Allen-Cahn-CUDA)

adam (cuda)

Adaptive moment estimation (https://github.com/hpcaitech/ColossalAI)

addBiasResidualLayerNorm (cuda)

Combines the bias and residual of the previous block with the computation of layer normalization (https://github.com/NVIDIA/FasterTransformer)

adv (cuda)

Advection (https://github.com/Nek5000/nekBench/tree/master/adv)

aes (opencl)

AES encrypt and decrypt (https://github.com/Multi2Sim/m2s-bench-amdsdk-2.5-src)

affine (opencl)

Affine transformation (https://github.com/Xilinx/SDAccel_Examples/tree/master/vision/affine)

aidw (cuda)

Adaptive inverse distance weighting (Mei, G., Xu, N. & Xu, L. Improving GPU-accelerated adaptive IDW interpolation algorithm using fast kNN search. SpringerPlus 5, 1389 (2016))

aligned-types (cuda)

Alignment specification for variables of structured types (http://docs.nvidia.com/cuda/cuda-samples/index.html)

allreduce (cuda)

The ring allreduce and ring allgather (https://github.com/baidu-research/baidu-allreduce)

all-pairs-distance (cuda)

All-pairs distance calculation (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2910913/)

amgmk (openmp)

The relax kernel in the AMGmk benchmark (https://asc.llnl.gov/CORAL-benchmarks/Micro/amgmk-v1.0.tar.gz)

ans (cuda)

Asymmetric numeral systems decoding (https://github.com/weissenberger/multians)

aobench (openmp)

A lightweight ambient occlusion renderer (https://code.google.com/archive/p/aobench)

aop (cuda)

American options pricing (https://github.com/NVIDIA-developer-blog)

asmooth (cuda)

Adaptive smoothing (http://www.hcs.harvard.edu/admiralty/)

asta (cuda)

Array of structure of tiled array for data layout transposition (https://github.com/chai-benchmarks/chai)

atan2 (cpp)

Approximate the atan2 math function (https://github.com/cms-patatrack/pixeltrack-standalone)

atomicAggregate (cuda)

Atomic aggregate (https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/)

atomicIntrinsics (cuda)

Atomic add, subtract, min, max, AND, OR, XOR (http://docs.nvidia.com/cuda/cuda-samples/index.html)

atomicCAS (cuda)

64-bit atomic add, min, and max with compare and swap (https://github.com/treecode/Bonsai/blob/master/runtime/profiling/derived_atomic_functions.h)

atomicCost

Evaluate the cost of atomic add operations

atomicPerf (cuda)

Evaluate atomic add operations over global and shared memory (https://stackoverflow.com/questions/22367238/cuda-atomic-operation-performance-in-different-scenarios)

atomicReduction (hip)

Integer sum reduction with atomics (https://github.com/ROCm-Developer-Tools/HIP-Examples/tree/master/reduction)

atomicSystemWide (cuda)

System-wide atomics (http://docs.nvidia.com/cuda/cuda-samples/index.html)

attention (pseudocodes)

Ham, T.J., et al., 2020, February. A^3: Accelerating Attention Mechanisms in Neural Networks with Approximation. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) (pp. 328-341). IEEE.

attentionMultiHead (cuda)

Implementation of multi-head attention (https://github.com/IrishCoffee/cudnnMultiHeadAttention)

axhelm (cuda)

Helmholtz matrix-vector product (https://github.com/Nek5000/nekBench/tree/master/axhelm)

babelstream (cuda)

Measure memory transfer rates for copy, add, mul, triad, dot, and nstream (https://github.com/UoB-HPC/BabelStream)

background-subtract (cuda)

Background subtraction (Alptekin Temizel et al. Experiences on Image and Video Processing with CUDA and OpenCL, In Applications of GPU Computing Series, GPU Computing Gems Emerald Edition, Morgan Kaufmann, 2011, Pages 547-567)

backprop (opencl)

Backpropagation in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)

bezier-surface (opencl)

The Bezier surface (https://github.com/chai-benchmarks/chai)

bfs (opencl)

The breadth-first search in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)

bh (cuda)

Simulate the gravitational forces in a star cluster using the Barnes-Hut n-body algorithm (https://userweb.cs.txstate.edu/~burtscher/research/ECL-BH/)

bilateral (cuda)

Bilateral filter (https://github.com/jstraub/cudaPcl)

bincount (cuda)

Count the number of values that fall into each bin (https://pytorch.org/)

binomial (cuda)

Evaluate fair call price for a given set of European options under binomial model (https://docs.nvidia.com/cuda/cuda-samples/index.html)

bitcracker (cuda)

Open-source password cracking tool for storage devices (https://github.com/e-ago/bitcracker.git)

bitonic-sort (sycl)

Bitonic sorting (https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/)

bitpacking (cuda)

A bit-level operation that aims to reduce the number of bits required to store each value (https://github.com/NVIDIA/nvcomp)

bitpermute (cuda)

Permute the data using bit-level operations in an array (https://github.com/supranational/sppark)

black-scholes (cuda)

The Black-Scholes simulation (https://github.com/cavazos-lab/FinanceBench)

blas-dot (cuda)

A dot product between two real vectors

blas-fp8gemm (cuda)

Scaled matrix-matrix multiplication in an 8-bit floating-point format

blas-gemm (sycl)

General matrix-matrix multiplications (https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-dpcpp/2025-0/gemm.html)

blas-gemmBatched (cuda)

Batched general matrix-matrix multiplication (https://github.com/pyrovski/cublasSgemmBatched-example)

blas-gemmStridedBatched (cuda)

Strided batched general matrix-matrix multiplication (https://github.com/pyrovski/cublasSgemmBatched-example)

blas-gemmEx (cuda)

Extended general matrix-matrix multiplications (https://godweiyang.com/2021/08/24/gemm/, https://github.com/UoB-HPC/abc-pvc-deepdive)

blas-gemmEx2 (cuda)

Extended general matrix-matrix multiplications using cuBLASLt, hipBLASLt, and oneDNN

blockAccess (cuda)

Block access from the CUB's collective primitives (https://github.com/NVIDIA/cub)

blockexchange (cuda)

Rearrange data partitioned across a thread block (https://github.com/NVIDIA/cub)

bm3d (cuda)

Block-matching and 3D filtering method for image denoising (https://github.com/DawyD/bm3d-gpu)

bn (cuda)

Bayesian network learning (https://github.com/OSU-STARLAB/UVM_benchmark/blob/master/non_UVM_benchmarks)

bonds (cuda)

Fixed-rate bond with flat forward curve (https://github.com/cavazos-lab/FinanceBench)

boxfilter (cuda)

Box filtering (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)

bscan (cuda)

Binary scan in a block (Harris, M. and Garland, M., 2012. Optimizing parallel prefix operations for the Fermi architecture. In GPU Computing Gems Jade Edition (pp. 29-38). Morgan Kaufmann.)

bsearch (cuda)

Classic and vectorizable binary search algorithms (https://www.sciencedirect.com/science/article/abs/pii/S0743731517302836)

bspline-vgh (openmp)

Compute the value, gradient, and Hessian at random positions in a 3D box (https://github.com/QMCPACK/miniqmc/blob/OMP_offload/src/OpenMP/main.cpp)

bsw (cuda)

GPU accelerated Smith-Waterman for performing batch alignments (https://github.com/mgawan/ADEPT)

burger (openmp)

2D Burgers' equation (https://github.com/soumyasen1809/OpenMP_C_12_steps_to_Navier_Stokes)

bwt (cuda)

Burrows-Wheeler transform (https://github.com/jedbrooke/cuda_bwt)

b+tree (opencl)

B+Tree in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)

car (cuda)

Content adaptive resampling (https://github.com/sunwj/CAR)

cbsfil (cuda)

Cubic b-spline filtering (https://github.com/DannyRuijters/CubicInterpolationCUDA)

cc (cuda)

Connected components (https://userweb.cs.txstate.edu/~burtscher/research/ECL-CC/)

ccl (cuda)

Collective communications library (https://github.com/NVIDIA/nccl)

ccs (cuda)

Condition-dependent Correlation Subgroups (https://github.com/abhatta3/Condition-dependent-Correlation-Subgroups-CCS)

ccsd-trpdrv (c)

The CCSD tengy kernel, which was converted from Fortran to C by Jeff Hammond, in NWChem (https://github.com/jeffhammond/nwchem-ccsd-trpdrv)

ced (opencl)

Canny edge detection (https://github.com/chai-benchmarks/chai)

cfd (opencl)

The CFD solver in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)

chacha20 (c)

ChaCha20 stream cipher (https://github.com/983/ChaCha20)

channelShuffle (cuda)

Divide the channels in a tensor into groups and rearrange them (https://pytorch.org/)

channelSum (cuda)

Per-channel sum of values (https://pytorch.org/)

che (cuda)

Phase-field simulation of spinodal decomposition using the Cahn-Hilliard equation (https://github.com/myousefi2016/Cahn-Hilliard-CUDA)

chemv (cuda)