Apologies if you felt a wave of excitement after reading the introduction and are eager to jump into the first kernel: there is another important introductory concept to get familiar with first, one that is vital for our purpose of optimizing matrix multiplication.

### BLAS standard, SGEMM and cuBLAS
Three new scary abbreviations. What are they?

BLAS stands for Basic Linear Algebra Subprograms. It is a library specification for linear algebra routines that was introduced in the late 1970s - essentially, since everyone needed to write the same linear algebra routines (dot products, matrix multiplies, etc.), it was agreed to have a standardized set of subprograms.

A key distinction, though, is that BLAS defines the specifications of these subprograms, not the actual implementations.

This means that BLAS defines what arguments a routine will take, what exactly it must compute mathematically as well as naming conventions.

This should start lighting some bulbs with regard to the seemingly bizarre arguments you will see passed to our kernel, which is an SGEMM (Single-precision GEneral Matrix-Matrix multiply) as per the BLAS naming convention. Further, BLAS dictates that an SGEMM routine must compute:
\[
C \;\leftarrow\; \alpha \cdot (A \times B) \;+\; \beta \cdot C
\]
Here \(A\), \(B\) and \(C\) are matrices, and \(\alpha\) and \(\beta\) are scalars. This will not make much sense right now either, but I am putting it here so the reader can jump back and have a déjà vu moment when everything starts to click.
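
To make the formula a little more concrete, here is a minimal pure-Python sketch of what it computes. The function name `sgemm_reference` and the nested-list storage are assumptions made for illustration only; they are not part of BLAS, and Python floats are double precision rather than the single precision a real SGEMM uses.

```python
def sgemm_reference(alpha, A, B, beta, C):
    """Return alpha * (A x B) + beta * C for matrices stored as nested lists."""
    m, k, n = len(A), len(B), len(B[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):          # row of the output
        for j in range(n):      # column of the output
            acc = 0.0
            for p in range(k):  # dot product of row i of A with column j of B
                acc += A[i][p] * B[p][j]
            out[i][j] = alpha * acc + beta * C[i][j]
    return out


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]

plain = sgemm_reference(1.0, A, B, 0.0, C)   # alpha=1, beta=0: a plain A x B
scaled = sgemm_reference(2.0, A, B, 0.5, C)  # 2 * (A x B) + 0.5 * C
```

Setting `alpha = 1` and `beta = 0` recovers an ordinary matrix product, which is the case we will mostly care about.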

What BLAS does not specify is how to implement this routine and compute this value. As a result, vendors such as NVIDIA, Intel, IBM, Cray, etc. provide their own highly optimized BLAS implementations for their specialized hardware. It is important to note, however, that not all of these BLAS implementations target GPUs; many are CPU-based.

The one we will focus on (and try to catch up with) is cuBLAS, NVIDIA's implementation for GPUs, which can also leverage Tensor Cores on GPUs that have them.

After this happened, bigger numerical packages (think NumPy, TensorFlow, PyTorch, etc.) thought 'Hey, let's not reinvent the wheel' and decided to call BLAS under the hood for their matrix math, via the implementations created by the vendors.

As mentioned, the implementations doing the heavy math-lifting can be either CPU-based or GPU-based. This should bake in the intuition that BLAS simply standardizes the conventions and correctness of linear algebra subprograms, not their performance, which can vary hugely.
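
As a rough illustration of "same contract, different implementation", the sketch below has two hypothetical functions, `matmul_ijk` and `matmul_ikj`, that honor the same interface and return the same product but walk through memory in a different order. Neither is taken from any real BLAS library; they only hint at the freedom an implementer has once the result is pinned down.

```python
def matmul_ijk(A, B):
    """Textbook loop order: finish one output element at a time."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            C[i][j] = acc
    return C


def matmul_ikj(A, B):
    """Reordered loops: sweep a whole row of B for each element of A."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a_ip = A[i][p]
            row_b = B[p]
            row_c = C[i]
            for j in range(n):
                row_c[j] += a_ip * row_b[j]
    return C


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]       # 2 x 3
B = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]  # 3 x 2

# Both implementations satisfy the same "spec": identical inputs, identical output.
same_answer = matmul_ijk(A, B) == matmul_ikj(A, B)
```

How fast each version runs depends on the hardware and the memory layout, which is exactly the part the specification leaves open.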

Here is a small example of how a library you might have used before, PyTorch, for <>, might unknowingly have been your first contact with matrix multiplication that takes advantage of GPU hardware.

In [None]:
import torch

a = torch.randn(1000, 1000, device="cuda")  # GPU tensor
b = torch.randn(1000, 1000, device="cuda")
c = a @ b  # runs GEMM on GPU (cuBLAS under the hood)

Why this is so cool is that we could dig up some code written in the 1980s that uses linear algebra routines with BLAS naming conventions and argument lists, and it would still compile today if we link it against a BLAS library. Conversely, we can also think about it this way: cutting-edge AI, like training LLMs, is still driven by the same 40-year-old API contract that standardized the matrix multiplies being used.

#### cuBLAS
So, circling back and digging a bit deeper into our BLAS implementation of choice: cuBLAS is the product of technical geniuses at NVIDIA spending years fine-tuning and optimizing kernels for high-performance BLAS operations on NVIDIA GPUs using CUDA.

However, the end user is shielded from most of this and can simply call the routines.

Example use case:

In [None]:
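#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    // problem size: C (m x n) = A (m x k) * B (k x n); cuBLAS assumes column-major storage
    const int m = 1024, n = 1024, k = 1024;
    const int lda = m, ldb = k, ldc = m;    // leading dimensions (rows of each column-major matrix)
    const float alpha = 1.0f, beta = 0.0f;  // C <- 1 * (A x B) + 0 * C

    // initialize cuBLAS
    cublasHandle_t handle;
    cublasCreate(&handle);

    // allocate matrices in GPU global memory (real code would fill A and B, e.g. via cudaMemcpy)
    float *A_device, *B_device, *C_device;
    cudaMalloc(&A_device, sizeof(float) * m * k);
    cudaMalloc(&B_device, sizeof(float) * k * n);
    cudaMalloc(&C_device, sizeof(float) * m * n);  // this is the output buffer

    // multiply matrices A and B on GPU, store result in C
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, A_device, lda,
                        B_device, ldb,
                &beta,  C_device, ldc);

    // release GPU memory and the cuBLAS handle
    cudaFree(A_device); cudaFree(B_device); cudaFree(C_device);
    cublasDestroy(handle);
    return 0;
}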
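
For readers curious about `m`, `n`, `k` and the `lda`/`ldb`/`ldc` arguments in the call above, here is a rough pure-Python sketch of the same argument list for the no-transpose case. The function name `sgemm_colmajor` is hypothetical, and this is only a model of the semantics, not how cuBLAS is implemented. It assumes flat, column-major buffers, which is the layout cuBLAS expects, so element (row, col) of a matrix with leading dimension `ld` lives at index `row + col * ld`.

```python
def sgemm_colmajor(m, n, k, alpha, A, lda, B, ldb, beta, C, ldc):
    """C <- alpha * (A x B) + beta * C on flat column-major buffers (no transposes).

    A is m x k, B is k x n, C is m x n; C is updated in place.
    """
    for j in range(n):              # column of C
        for i in range(m):          # row of C
            acc = 0.0
            for p in range(k):
                # column-major indexing: (row, col) -> row + col * leading_dimension
                acc += A[i + p * lda] * B[p + j * ldb]
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc]


# A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], stored column by column
A_buf = [1.0, 3.0, 2.0, 4.0]
B_buf = [5.0, 7.0, 6.0, 8.0]
C_buf = [0.0, 0.0, 0.0, 0.0]

sgemm_colmajor(2, 2, 2, 1.0, A_buf, 2, B_buf, 2, 0.0, C_buf, 2)
```

After the call, `C_buf` holds the product [[19, 22], [43, 50]] laid out column by column. When a matrix is stored densely, its leading dimension is simply its number of rows, which is why the example passes `2` for all three.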

We will not actually dig too much into this code, as (drum roll please...) our **main objective here is to get as close as we can to the performance of the cuBLAS SGEMM routine with our iteratively improving kernels**.