"HPC is all about reducing data movement". Optimizing GEMM on GPU and CPU platforms share the same idea: to hide the memory latency with massive parallelism, cache-/register-level data re-use, and manual prefetching. On CPUs, both instruction-level and data-level parallelisms are exploited as well as delicate prefetching schemes are designed to hide the memory latency. Meanwhile, we partition the input matrices and pack them before computing to ensure a "smooth and low-latancy" computing kernel. The prefetching for matrix C
is especially critical to the CPU GEMM performance.
On GPU, we also need to take advantage of the low-latency cache. There are rich opportunities on GPUs for us to exploit cache-level data re-use. More details could be found in the official document of CUTLASS.
All questions are encouraged to sent to yujiazhai94@gmail.com.
- We compiled the program with
gcc 7.3.0
under Ubuntu 18.04.5 LTS. - NVIDIA cuBLAS version:
CUDA cuBLAS 11.3.1.68
.
Just three steps.
- We first modify the path of
nvcc
inMakefile
. - Second, type in
make
to compile. A binary executablesgemm_gpu
will be generated. - Third, run the binary using
./sgemm_gpu [kernel_number]
, wherekernel_number
selects the kernel for benchmark.0
represents NVIDIA cuBLAS and1-11
represent 11 kernels demonstrating the optimizing strategies.
Here we take the column-major implemetation for SGEMM. Both A and B are not transposed.
Kernel 1 is the most naive implementation of SGEMM in CUDA. This is the triple-for-loop implementation with register re-use when updating C(i,j)
. In this version, each threa block (TB) is responsible for a 32x32
sub-block of C
, and each thread computes only a single element of the C
matrix.
Kernel2 partitions the matrix A
and matrix B
into 32x32
blocks. These 32x32
blocks are loaded into shared memory before being loaded for GEMM computation. When loading the data into shared memory (this is called as packing in CPU GEMM), each thread is responsible to load/store one element and we set 1024 threads per TB using __launch_bounds__(1024)
. After packing is completed, all the threads are synchronized and then start to compute for their own element. Since each TB is still to compute a 32x32
matrix C
, each thread remains to take a single element of C
.
In short, this version adds cache blocking upon the previous version, with the parameter set {Ms,Ns,Ks}={32,32,32}
.
We bring a simple optimization upon kernel 2 here: storing threadIdx.x
before re-using it massively, in order to reduce living registers and benefit the compiler optimization. The performance slightly improves in this step.
In the previous version, the memory access on the shared memory is not coalesced. We re-ordered the memory access pattern on the shared memory: making the shared memory col-major but transposing matrix B
when packing it into the shared memory. This doubles the performance.
In this step, we ask each thread to compute 4 elements for the C
matrix. Therefore, we now have 256 threads in a TB to compute the 32x32
matrix C
that the TB is responsible for. Using the CPU-GEMM language, the micro kernel's shape is: 4x1
: that is to say, after the packing routine completes, each thread loads a 4x1
A and an 1x1
B and computes C(4x1)
+= A(4x1)*B(1x1)
.
Starting from this step, we restrict 256 threads for each TB.
Since our target machine supports a 128-bit transaction from the DRAM, we can apply the vectorized load operation using the float4
data type.
Considering there are sufficient registers (64K) for a TB while we only assign 256 threads for each TB, it should be safe, in terms of the performance, for us to assign more workloads to each thread. Now we ask each thread to compute a 4x4
sub-matrix of C
so we gain massive data re-use at the register level compared with the previous step.
Additionally, when the input matrices are large, we can increase Ms
and Ns
and maintain enough TBs to map to streaming multiprocessors. Here we increase {Ms,Ns}
from the previous {32,32}
to {64,64}
but decreased the Ks
from 32
to 16
to maintain the same shared memory consumption. Since everything but the two parameters Ms,Ns
are different, we deduce that the asking TBs to do more jobs benefits the performance when the input matrices are large enough.
Assign more workloads for each TB AND each thread.
Since each warp contains 32 threads and the memory accesses to the same memory address in shared memory within the same warp can be coalesced, we introduce a {Mw,Nw}
={4xMr,8xNr}
to benefit the warp-level parallelism. We refer readers to (Huang, 2018) for more details.
Kernel 10 (Kernel9 + prefetching (Huang, 2018))
We introduce the double buffer strategy for the shared memory buffers to cancel an unnecessary syncthreads inside the loop body, pushing the performance to the limit - mostly same as the close-sourced NVIDIA cuBLAS.