Accelerating Reduction and Scan Using Tensor Core Units

This repository contains the benchmarks for the paper below. If you use this code, please cite:

@inproceedings{Dakkak:2019:ARS:3330345.3331057,
 author = {Dakkak, Abdul and Li, Cheng and Xiong, Jinjun and Gelado, Isaac and Hwu, Wen-mei},
 title = {Accelerating Reduction and Scan Using Tensor Core Units},
 booktitle = {Proceedings of the ACM International Conference on Supercomputing},
 series = {ICS '19},
 year = {2019},
 isbn = {978-1-4503-6079-1},
 location = {Phoenix, Arizona},
 pages = {46--57},
 numpages = {12},
 url = {http://doi.acm.org/10.1145/3330345.3331057},
 doi = {10.1145/3330345.3331057},
 acmid = {3331057},
 publisher = {ACM},
 address = {New York, NY, USA},
}

Installing latest cmake

CMake version >= 3.8 is required. (There is a known problem with Hunter under CMake 3.10.2, so avoid that release.)

  cd /tmp
  wget https://cmake.org/files/v3.10/cmake-3.10.1-Linux-x86_64.sh
  sudo sh cmake-3.10.1-Linux-x86_64.sh --prefix=/usr/local --exclude-subdir

You may also want to remove the default installation first: sudo apt-get remove cmake

On ppc64le you need to build CMake from source.
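A source build looks like the following (using the same 3.10.1 release as above; adjust the version as needed):

    cd /tmp
    wget https://cmake.org/files/v3.10/cmake-3.10.1.tar.gz
    tar xf cmake-3.10.1.tar.gz && cd cmake-3.10.1
    ./bootstrap
    make -j4
    sudo make install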

Compile

To compile the project, run the following commands:

mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make

If you get errors about nvcc not supporting your gcc version, point CMake at an older host compiler:

cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_HOST_COMPILER=`which gcc-6` ..

Available Benchmarks

The following benchmarks are currently available.

| Name | Description |
|------|-------------|
| CUBLAS_WMMA_GEMM | cuBLAS GEMM w/ Tensor Cores. AType, BType = half; CType = float |
| CUBLAS_GEMM | cuBLAS GEMM w/o Tensor Cores. AType, BType = half; CType = float |
| CUTLASS_WGEMM | CUTLASS GEMM w/ Tensor Cores. AType, BType = half; CType = float |
| CUDA_WMMA_GEMM_NAIVE | Naive CUDA GEMM w/ Tensor Cores. AType, BType = half; CType = float |
| CUDA_WMMA_GEMM_SHAREDMEM | Shared-memory CUDA GEMM w/ Tensor Cores. AType, BType = half; CType = float |
| CUDA_WMMA_GEMM | Optimized CUDA GEMM (from the CUDA samples). AType, BType = half; CType = float |
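For orientation, here is a minimal sketch of what the naive WMMA GEMM variant looks like. The kernel name and launch geometry are illustrative, not the repository's exact code; it assumes row-major storage and M, N, K that are multiples of 16, with one warp per 16x16 output tile:

    #include <mma.h>
    using namespace nvcuda;

    // Each warp computes one 16x16 tile of C = A * B on the tensor cores.
    __global__ void wmma_gemm_naive(const half *A, const half *B, float *C,
                                    int M, int N, int K) {
      // Warp's tile coordinates within the output matrix.
      int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
      int warpN = blockIdx.y * blockDim.y + threadIdx.y;

      wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
      wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
      wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
      wmma::fill_fragment(acc, 0.0f);

      // March along K, accumulating 16x16x16 products.
      for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a, A + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(b, B + k * N + warpN * 16, N);
        wmma::mma_sync(acc, a, b, acc);
      }
      wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, acc, N,
                              wmma::mem_row_major);
    }

A typical launch supplies one warp per output tile, e.g. blockDim = (128, 4) with a grid covering all M/16 by N/16 tiles.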

Comparison

[figures: gemm_small, gemm_medium, gemm_large]

| Name | Description |
|------|-------------|
| CUBLAS_WMMA_HGEMM | cuBLAS HGEMM w/ Tensor Cores. AType, BType = half; CType = half |
| CUBLAS_HGEMM | cuBLAS HGEMM w/o Tensor Cores. AType, BType = half; CType = half |

Comparison

[figure: hgemm]

| Name | Description |
|------|-------------|
| CUBLAS_GEMV | cublasSgemv: y = alpha * A * x + beta * y. AType, xType, yType = float |
| CUDA_WMMA_GEMV_CUBLAS | Use cuBLAS GEMM w/ Tensor Cores for GEMV. AType, xType = half; yType = float |
| CUDA_WMMA_GEMV_NAIVE | Use naive CUDA GEMM w/ Tensor Cores for GEMV. AType, xType = half; yType = float |
| CUDA_WMMA_GEMV_SHAREDMEM | Use shared-memory CUDA GEMM w/ Tensor Cores for GEMV. AType, xType = half; yType = float |
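The CUBLAS variant treats the vector as a width-1 matrix so the product can run through the tensor-core GEMM path. A minimal sketch under that assumption (the wrapper name wmma_gemv is hypothetical; the cuBLAS calls use the CUDA 9/10-era cublasGemmEx signature):

    #include <cublas_v2.h>
    #include <cuda_fp16.h>

    // y = A * x via GEMM with n = 1. A is M x K, row-major, half precision.
    // Row-major A is read as a K x M column-major matrix, so op(A) = T.
    void wmma_gemv(cublasHandle_t handle,
                   const half *A, const half *x, float *y, int M, int K) {
      cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);  // allow tensor cores
      const float alpha = 1.0f, beta = 0.0f;
      cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                   M, /*n=*/1, K,
                   &alpha,
                   A, CUDA_R_16F, K,   // lda = K for row-major A
                   x, CUDA_R_16F, K,   // x treated as a K x 1 matrix
                   &beta,
                   y, CUDA_R_32F, M,
                   CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
    }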

Comparison

[figure: gemv (inType = half, outType = half)]

Segmented PrefixSum

| Name | Description |
|------|-------------|
| CUB_BLOCK_SEGMENTED_PREFIXSUM | cub::BlockScan. segment_size = THREADS_PER_BLOCK * ITEMS_PER_THREAD |
| CUB_WARP_SEGMENTED_PREFIXSUM | cub::WarpScan. segment_size = LOGICAL_THREADS_PER_WARP |
| THRUST_SEGMENTED_PREFIXSUM | thrust::inclusive_scan_by_key. inType = half; outType = half |
| CUDA_WMMA_SEGMENTED_PREFIXSUM_16 | Uses Tensor Cores. segment_size = 16; each warp computes WMMA_TILES_PER_WARP * 16 segments |
| CUDA_WMMA_SEGMENTED_PREFIXSUM_16N | Uses Tensor Cores. segment_size = 16n (n a positive integer); each warp computes 16 segments |
| CUDA_WMMA_SEGMENTED_PREFIXSUM_256 | Uses Tensor Cores. segment_size = 256; each warp computes WMMA_TILES_PER_WARP segments |
| CUDA_WMMA_SEGMENTED_PREFIXSUM_256N | Uses Tensor Cores. segment_size = 256n (n a positive integer); each warp computes 1 segment |
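The CUDA_WMMA variants above all rest on the same observation: multiplying a 16x16 tile X by an upper-triangular matrix of ones U (U[k][j] = 1 when k <= j) gives (XU)[i][j] = sum over k <= j of X[i][k], i.e. a row-wise inclusive scan. A minimal sketch for segment_size = 16, assuming one warp per block and a float accumulator for clarity (the repository's kernels use half outputs and multiple warps):

    #include <mma.h>
    using namespace nvcuda;

    // x: packed 16x16 tiles, each row a 16-element segment.
    // u: upper-triangular ones, u[k*16 + j] = (k <= j) ? 1 : 0, in half.
    // y: row-wise inclusive scans of the corresponding tiles.
    __global__ void wmma_segmented_scan16(const half *x, const half *u,
                                          float *y) {
      wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> xf;
      wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> uf;
      wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
      wmma::fill_fragment(acc, 0.0f);

      const int tile = blockIdx.x;       // one warp (block) per tile
      wmma::load_matrix_sync(xf, x + tile * 256, 16);
      wmma::load_matrix_sync(uf, u, 16);
      wmma::mma_sync(acc, xf, uf, acc);  // acc = X * U, a row-wise scan
      wmma::store_matrix_sync(y + tile * 256, acc, 16, wmma::mem_row_major);
    }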

Tuning WMMA_TILES_PER_WARP and WARPS_PER_BLOCK for CUDA_WMMA_SEGMENTED_PREFIXSUM_16 and CUDA_WMMA_SEGMENTED_PREFIXSUM_256 shows that (WMMA_TILES_PER_WARP = 2, WARPS_PER_BLOCK = 4) performs best.

[figure: tune_prefixsum]

Comparison

[figure: seg_prefix_sum]

Full PrefixSum

| Name | Description |
|------|-------------|
| CUB_FULL_PREFIXSUM | cub::DeviceScan::ExclusiveSum. inType = half; outType = half |
| THRUST_FULL_PREFIXSUM | thrust::inclusive_scan. inType = half; outType = half |
| CUDA_WMMA_FULL_PREFIXSUM_3KERS_256 | Three kernels: CUDA_WMMA_SEGMENTED_PREFIXSUM_256 scans the segments, cub::DeviceScan::ExclusiveSum scans the partial sums, and a third kernel adds the partial sums back. inType = half; outType = half |
| CUDA_WMMA_FULL_PREFIXSUM_3KERS | Same as above but with CUDA_WMMA_SEGMENTED_PREFIXSUM_256N as the first kernel. inType = half; outType = half |
| CUDA_WMMA_FULL_PREFIXSUM_CG | One kernel; uses cooperative-groups grid synchronization to scan the partial sums. inType = half; outType = half |

Note: There's a bug in cub::WarpScan for LOGICAL_THREADS_PER_WARP = 16.
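The three-kernel variants follow the standard scan-then-propagate decomposition. A host-side sketch of the structure (segmented_scan_256 and add_offsets are hypothetical stand-ins for the repository's kernels; the cub call is the real API, and cub's half support requires sm_53 or later):

    #include <cub/cub.cuh>
    #include <cuda_fp16.h>

    // Hypothetical kernels standing in for the repository's:
    //   segmented_scan_256: scans each 256-element segment and writes each
    //                       segment's total to partials.
    //   add_offsets:        adds partials[seg] to every element of seg.
    __global__ void segmented_scan_256(const half *in, half *out,
                                       half *partials);
    __global__ void add_offsets(half *out, const half *partials);

    void full_prefixsum(const half *d_in, half *d_out, half *d_partials,
                        int num_segments) {
      // Kernel 1: independent per-segment scans on the tensor cores.
      segmented_scan_256<<<num_segments, 256>>>(d_in, d_out, d_partials);

      // Kernel 2: exclusive scan of the per-segment totals (in place).
      void *d_temp = nullptr;
      size_t temp_bytes = 0;
      cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_partials,
                                    d_partials, num_segments); // size query
      cudaMalloc(&d_temp, temp_bytes);
      cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_partials,
                                    d_partials, num_segments);
      cudaFree(d_temp);

      // Kernel 3: add each segment's offset to its scanned elements.
      add_offsets<<<num_segments, 256>>>(d_out, d_partials);
    }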

Comparison

[figure: prefix_sum]

Segmented Reduction

| Name | Description |
|------|-------------|
| CUB_BLOCK_SEGMENTED_REDUCTION | cub::BlockReduce. segment_size = THREADS_PER_BLOCK * ITEMS_PER_THREAD |
| CUB_DEVICE_SEGMENTED_REDUCTION | cub::DeviceSegmentedReduce::Sum |
| THRUST_SEGMENTED_REDUCTION | thrust::reduce_by_key |
| CUDA_WMMA_SEGMENTED_REDUCTION_16 | Uses Tensor Cores. segment_size = 16; each warp computes WMMA_TILES_PER_WARP * 16 segments |
| CUDA_WMMA_SEGMENTED_REDUCTION_16N | Uses Tensor Cores. segment_size = 16n (n a positive integer); each warp computes 16 segments |
| CUDA_WMMA_SEGMENTED_REDUCTION_256 | Uses Tensor Cores. segment_size = 256; each warp computes WMMA_TILES_PER_WARP segments |
| CUDA_WMMA_SEGMENTED_REDUCTION_256N | Uses Tensor Cores. segment_size = 256n (n a positive integer); each warp computes 1 segment |
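The reduction kernels use the same tensor-core trick as the scans, but with an all-ones multiplier J: (XJ)[i][j] = sum over k of X[i][k], so every column of X * J holds the row (segment) sums. A minimal sketch for segment_size = 16, again with one warp per block and a float accumulator for clarity:

    #include <mma.h>
    using namespace nvcuda;

    // x: packed 16x16 tiles, each row a 16-element segment.
    // ones: 16x16 all-ones matrix in half precision.
    // sums: 16 segment sums per tile.
    __global__ void wmma_segmented_reduce16(const half *x, const half *ones,
                                            float *sums) {
      wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> xf;
      wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> jf;
      wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
      wmma::fill_fragment(acc, 0.0f);

      const int tile = blockIdx.x;       // one warp (block) per tile
      wmma::load_matrix_sync(xf, x + tile * 256, 16);
      wmma::load_matrix_sync(jf, ones, 16);
      wmma::mma_sync(acc, xf, jf, acc);  // every column now holds the row sums

      __shared__ float t[256];
      wmma::store_matrix_sync(t, acc, 16, wmma::mem_row_major);
      __syncwarp();
      if (threadIdx.x < 16)              // read column 0: one sum per segment
        sums[tile * 16 + threadIdx.x] = t[threadIdx.x * 16];
    }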

Tuning WMMA_TILES_PER_WARP and WARPS_PER_BLOCK for CUDA_WMMA_SEGMENTED_REDUCTION_16 and CUDA_WMMA_SEGMENTED_REDUCTION_256 shows that (WMMA_TILES_PER_WARP = 1, WARPS_PER_BLOCK = 8) performs best.

Comparison

[figures: seg256_reduction, seg512_reduction, seg1024_reduction]

Full Reduction

| Name | Description |
|------|-------------|
| CUB_FULL_REDUCTION | cub::DeviceReduce::Sum. inType = half; outType = half |
| THRUST_FULL_REDUCTION | thrust::reduce. inType = half; outType = half |
| CUDA_WMMA_FULL_REDUCTION_2KERS | Two kernels: CUDA_WMMA_SEGMENTED_REDUCTION_256N reduces the segments, then cub::DeviceSegmentedReduce::Sum reduces all the partial sums. inType = half; outType = half |
| CUDA_WMMA_FULL_REDUCTION_CG | One kernel; uses cooperative-groups grid synchronization. inType = half; outType = half |
| CUDA_WMMA_FULL_REDUCTION_ATOMIC_W_BLOCK_SYNC | One kernel; uses atomicAdd with block synchronization. inType = half; outType = half |
| CUDA_WMMA_FULL_REDUCTION_ATOMIC_W_ATOMIC_BALLOT | One kernel; uses atomicAdd with ballot. inType = half; outType = half |
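For the cooperative-groups variant, the shape of the kernel is roughly as follows. This is a simplified sketch with plain float arithmetic instead of WMMA tiles; grid.sync() requires compiling with -rdc=true and launching via cudaLaunchCooperativeKernel:

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    // Each block reduces a grid-stride slice into d_partials[blockIdx.x],
    // the whole grid synchronizes, then block 0 reduces the partials.
    __global__ void full_reduction_cg(const float *d_in, float *d_partials,
                                      float *d_out, int n) {
      cg::grid_group grid = cg::this_grid();
      __shared__ float smem[256];  // assumes blockDim.x == 256

      // Block-local tree reduction over this block's slice.
      float v = 0.0f;
      for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
           i += gridDim.x * blockDim.x)
        v += d_in[i];
      smem[threadIdx.x] = v;
      __syncthreads();
      for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) smem[threadIdx.x] += smem[threadIdx.x + s];
        __syncthreads();
      }
      if (threadIdx.x == 0) d_partials[blockIdx.x] = smem[0];

      grid.sync();  // wait for every block's partial sum

      if (blockIdx.x == 0 && threadIdx.x == 0) {
        float total = 0.0f;
        for (int b = 0; b < gridDim.x; ++b) total += d_partials[b];
        *d_out = total;
      }
    }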

Comparison

[figure: reduction]

Running Individual Benchmarks

You can benchmark each primitive individually using

./bench --benchmark_filter=[name_of_primitive]

For example:

./bench --benchmark_filter=WMMA_GEMM

Further controls over the benchmarks are described in the --help output.

Benchmark all the primitives

./bench

The above will print something like the following to stdout:

------------------------------------------------------------------------------
Benchmark                       Time           CPU Iterations UserCounters...
------------------------------------------------------------------------------
SGEMM/1000/1/1/-1/1             5 us          5 us     126475 K=1 M=1000 N=1 alpha=-1 beta=1
SGEMM/128/169/1728/1/0        539 us        534 us       1314 K=1.728k M=128 N=169 alpha=1 beta=0
SGEMM/128/729/1200/1/0       1042 us       1035 us        689 K=1.2k M=128 N=729 alpha=1 beta=0
SGEMM/192/169/1728/1/0        729 us        724 us        869 K=1.728k M=192 N=169 alpha=1 beta=0
SGEMM/256/169/1/1/1             9 us          9 us      75928 K=1 M=256 N=169 alpha=1 beta=1
SGEMM/256/729/1/1/1            35 us         35 us      20285 K=1 M=256 N=729 alpha=1 beta=1
SGEMM/384/169/1/1/1            18 us         18 us      45886 K=1 M=384 N=169 alpha=1 beta=1
SGEMM/384/169/2304/1/0       2475 us       2412 us        327 K=2.304k M=384 N=169 alpha=1 beta=0
SGEMM/50/1000/1/1/1            10 us         10 us      73312 K=1 M=50 N=1000 alpha=1 beta=1
SGEMM/50/1000/4096/1/0       6364 us       5803 us        100 K=4.096k M=50 N=1000 alpha=1 beta=0
SGEMM/50/4096/1/1/1            46 us         45 us      13491 K=1 M=50 N=4.096k alpha=1 beta=1
SGEMM/50/4096/4096/1/0      29223 us      26913 us         20 K=4.096k M=50 N=4.096k alpha=1 beta=0
SGEMM/50/4096/9216/1/0      55410 us      55181 us         10 K=9.216k M=50 N=4.096k alpha=1 beta=0
SGEMM/96/3025/1/1/1            55 us         51 us      14408 K=1 M=96 N=3.025k alpha=1 beta=1
SGEMM/96/3025/363/1/0        1313 us       1295 us        570 K=363 M=96 N=3.025k alpha=1 beta=0

Output as JSON using

./bench --benchmark_out_format=json --benchmark_out=test.json

or preferably

./bench --benchmark_out_format=json --benchmark_out=`hostname`.json

On Minsky With PowerAI

mkdir -p build && cd build && rm -fr * && cmake -DCMAKE_BUILD_TYPE=Release ..

Disable CPU frequency scaling

If you see this error:

***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.

you might want to disable the CPU frequency scaling while running the benchmark:

sudo cpupower frequency-set --governor performance
./mybench
sudo cpupower frequency-set --governor powersave

Create Figures

python plot/plot.py plot/spec/full_reduction.yml

or generate all figures

python plot/plot.py all

Run with Docker

Install nvidia-docker, then list the available benchmarks:

nvidia-docker run  --rm raiproject/tensorcore_bench:latest bench --benchmark_list_tests

You can run benchmarks in the following way (probably with the --benchmark_filter flag).

nvidia-docker run --privileged --rm -v `readlink -f .`:/data -u `id -u`:`id -g` raiproject/tensorcore_bench:amd64-latest ./run_benchmarks.sh
  • --privileged is needed to set the NUMA policy for NUMA benchmarks.
  • -v `readlink -f .`:/data maps the current directory into the container as /data.
  • --benchmark_out=/data/`hostname`.json tells the bench binary to write out to /data, which is mapped to the current directory.
  • -u `id -u`:`id -g` tells docker to run as user id -u and group id -g, which is the current user and group. This means that files that docker produces will be modifiable from the host system without root permission.
