<div style="background-color:#192015; color:#7fa637; padding:12px; border-radius:8px; max-width:80%; width:auto; margin:0 auto;">

![nvmath-python](_assets/nvmath_head_panel@0.25x.png)

<p style="font-size:0.85em; margin-top:8px;">
Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES<br>
SPDX-License-Identifier: BSD-3-Clause
</p>

</div>


# Getting started with nvmath-python: kernel fusion

This tutorial introduces the basics of **nvmath-python**: how it fits into the existing scientific computing ecosystem in Python and what makes it a useful addition to that ecosystem. This notebook explains the fundamentals of kernel fusion in nvmath-python.

<div class="alert alert-box alert-info">
    To use this notebook you will need a computer equipped with an NVIDIA GPU as well as an environment with properly installed Python libraries and (optionally) the CUDA Toolkit. Please refer to the nvmath-python documentation to get familiar with the <a href="https://docs.nvidia.com/cuda/nvmath-python/0.2.1/installation.html#install-nvmath-python">installation options</a>.
</div>

## Introduction

**nvmath-python** is a library designed to bridge the gap between Python's scientific computing community and **NVIDIA's CUDA-X math libraries**. The community has done solid work in enabling GPU computing through **CuPy**, **PyTorch**, and other libraries and frameworks. They all build on NVIDIA CUDA-X math libraries, such as cuBLAS and cuFFT, and demonstrate substantial performance gains compared to CPU libraries. At the same time, being constrained to *NumPy-like* APIs, these GPU libraries do not always exploit the full potential and advanced features of the underlying NVIDIA CUDA-X libraries. This is where nvmath-python comes in handy. It reimagines math library API design so that it is intuitive, Pythonic, and still performant in sophisticated usage scenarios.

Like other *NumPy-like* libraries, nvmath-python implements core numerical algorithms useful in many scientific and engineering fields. However, nvmath-python does not aim to replace or duplicate them. First and foremost, nvmath-python **is NOT** an *array library*: it does not implement traditional array library functionality, such as array *indexing* and *slicing*.

## nvmath-python is NOT an array library

In [None]:
# Basic NumPy array creation, indexing and slicing
import numpy as np

# 1D array
a = np.arange(10)  # create array with values from 0 to 9
print("a =", a)  # print the array
print("a[2] =", a[2])  # access the third element (index 2)
print("a[2:7:2] =", a[2:7:2])  # slice from index 2 to 6 with step 2

# 2D array
b = np.arange(12).reshape(3, 4)  # create 3x4 array (matrix)
print("b =", b)  # print the matrix
print("b[0:2, 1:4] (submatrix) =", b[0:2, 1:4])  # slice rows 0-1 and columns 1-3
print("b[:,1] (second column) =", b[:, 1])  # access all rows in second column
print("b[0] (first row) =", b[0])  # access first row
print("b[1,2] =", b[1, 2])  # access element at row 1, column 2

Instead, nvmath-python is designed to **co-exist with array libraries** (NumPy, CuPy, PyTorch, *etc.*):

In [None]:
import numpy as np
import nvmath

n, m, k = 2, 4, 5
a_cpu = np.random.rand(n, k)
b_cpu = np.random.rand(k, m)

c_cpu = nvmath.linalg.advanced.matmul(a_cpu, b_cpu)  # matrix multiplication
print("c_cpu.shape =", c_cpu.shape)  # should be (2, 4)
print(type(c_cpu))  # should be numpy.ndarray

In [None]:
import nvmath
import cupy as cp

n, m, k = 2, 4, 5
a_gpu = cp.random.rand(n, k)
b_gpu = cp.random.rand(k, m)

c_gpu = nvmath.linalg.advanced.matmul(a_gpu, b_gpu)  # matrix multiplication
print("c_gpu.shape =", c_gpu.shape)  # should be (2, 4)
print(type(c_gpu))  # should be cupy.ndarray

In [None]:
import torch
import nvmath

n, m, k = 2, 4, 5

# CPU tensors
a_cpu = torch.rand(n, k)
b_cpu = torch.rand(k, m)

# On CPU
c_cpu = nvmath.linalg.advanced.matmul(a_cpu, b_cpu)
print("c_cpu.shape =", c_cpu.shape)  # should be (2, 4)
print("type(c_cpu) =", type(c_cpu))  # should be torch.Tensor

# Run on GPU if available
if torch.cuda.is_available():
    # move the same tensors to GPU
    a_gpu = a_cpu.cuda()
    b_gpu = b_cpu.cuda()
    c_gpu = nvmath.linalg.advanced.matmul(a_gpu, b_gpu)
    print("c_gpu.shape =", c_gpu.shape)
    print("c_gpu device =", getattr(c_gpu, "device", "unknown"))
    print("type(c_gpu) =", type(c_gpu))

<div style="background-color:#76B900; color:#ffffff; padding:12px; border-radius:8px; max-width:80%; width:auto; margin:0 auto;">
<h3 style="margin-top:0; color:#ffffff">Takeaways</h3>

- nvmath-python accepts arrays from multiple libraries: NumPy (CPU), CuPy (GPU), and PyTorch (CPU/GPU).
- To tell where a result lives, use simple type/device checks, e.g. `isinstance(c, np.ndarray)`, or `isinstance(c, cp.ndarray)` and `c.device`; for PyTorch, inspect `c.is_cuda` / `c.device`.
- **Remember**: array semantics (indexing/slicing) are provided by the array library (NumPy/CuPy/PyTorch). nvmath-python focuses on math operations and interoperates with those array types.
</div>
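
The checks listed in the takeaways can be wrapped in one small helper. The sketch below is illustrative and not part of nvmath-python: `array_location` is a hypothetical name, and it relies on duck typing (the top-level module of the object's type and its `device` attribute) so that it does not need to import CuPy or PyTorch.

```python
def array_location(x):
    """Return (library, device) for a NumPy-, CuPy- or PyTorch-like array.

    Hypothetical helper: it inspects the type's module name instead of
    importing every array library.
    """
    library = type(x).__module__.split(".")[0]
    if library == "numpy":
        return library, "cpu"  # NumPy arrays always live in host memory
    if library == "cupy":
        # cupy.ndarray.device is a cupy.cuda.Device with an integer id
        return library, f"cuda:{x.device.id}"
    if library == "torch":
        return library, str(x.device)  # e.g. "cpu" or "cuda:0"
    # Unknown library: report its device attribute if it has one
    return library, str(getattr(x, "device", "unknown"))
```

For instance, calling `array_location(c_cpu)` on the NumPy result from the first `matmul` cell would give `("numpy", "cpu")`.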

## Benchmarking GPU codes with `cupyx.profiler.benchmark`

Since GPU kernels are launched asynchronously, a host call may return before the device work finishes. Naive timing (e.g. Python's `time.time()` function or Jupyter's `%%timeit`) may therefore capture only the host-side launch overhead rather than the true device execution time. The `cupyx.profiler.benchmark` function uses CUDA events, proper synchronization, warm-ups, and repeated runs to produce stable, device-level timing measurements. It removes much of the noise introduced by Python overhead, one-time setup costs, and asynchronous execution, and it reports aggregated statistics so you get reproducible, comparable numbers.
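
The effect of asynchronous launches on naive timing can be reproduced without a GPU. The sketch below is only an analogy: a background thread stands in for the device, `fake_kernel` is a hypothetical stand-in for real GPU work, and `future.result()` plays the role of a device synchronization.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_kernel(duration=0.05):
    """Stand-in for GPU work: just sleeps for `duration` seconds."""
    time.sleep(duration)


with ThreadPoolExecutor(max_workers=1) as device:
    # Naive timing: stop the clock as soon as the "launch" returns.
    t0 = time.perf_counter()
    future = device.submit(fake_kernel)  # returns immediately, like a kernel launch
    launch_time = time.perf_counter() - t0

    # Correct timing: wait for the work to finish before stopping the clock.
    future.result()  # analogous to synchronizing with the device
    total_time = time.perf_counter() - t0

print(f"time until launch returned: {launch_time:.4f} s")
print(f"time until work finished:   {total_time:.4f} s")
```

The first number reflects only the cost of submitting the work, which is what a naive timer would report for an asynchronous GPU call.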

In [None]:
import numpy as np
import cupyx as cpx


# Helper function to benchmark two implementations F and (optionally) F_alternative
# When F_alternative is provided, in addition to raw performance numbers (seconds)
# speedup of F relative to F_alternative is reported
def benchmark(
    F, F_name="Implementation", F_alternative=None, F_alternative_name="Alternative implementation", n_repeat=10, n_warmup=1
):
    timing = cpx.profiler.benchmark(F, n_repeat=n_repeat, n_warmup=n_warmup)  # warm-up + repeated runs
    perf = np.min(timing.gpu_times)  # best time from repeated runs
    print(f"{F_name} performance = {perf:0.4f} sec")

    if F_alternative is not None:
        timing_alt = cpx.profiler.benchmark(F_alternative, n_repeat=n_repeat, n_warmup=n_warmup)
        perf_alt = np.min(timing_alt.gpu_times)
        print(f"{F_alternative_name} performance = {perf_alt:0.4f} sec")
        print(f"Speedup = {perf_alt / perf:0.4f}x")
    else:
        perf_alt = None

    return perf, perf_alt

It's time to do some real benchmarking and compare how nvmath-python's `matmul` performs relative to CuPy's.

<div style="background-color:#76B900; color:#ffffff; padding:12px; border-radius:8px; max-width:80%; width:auto; margin:0 auto;">
<h3 style="margin-top:0; color:#ffffff">Practical notes</h3>

- Make sure data is already allocated on device before benchmarking (transfer costs should be excluded unless you intend to measure them).
- Run several repeats and inspect distributions (median is usually more robust than min or mean).
- For end-to-end profiling (memory transfers, kernel launches, kernel internals) use NVIDIA tools such as Nsight Systems (`nsys`) or Nsight Compute (`ncu`); the older `nvprof` is deprecated. `cupyx.profiler.benchmark` focuses on accurate kernel timing within Python workflows.
- Watch out for GPU power/clock state and thermal throttling; for stable numbers, use consistent GPU governor/clock settings if available.
</div>
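
To inspect the distribution of timings rather than a single number, the repeated measurements can be summarized with the standard `statistics` module. This is a minimal sketch: `summarize_timings` is a hypothetical helper, and it assumes `times` is a flat sequence of per-run durations in seconds (for example, one row of `timing.gpu_times` converted to a list).

```python
import statistics


def summarize_timings(times):
    """Return min, median, mean and sample standard deviation of run times (seconds)."""
    times = [float(t) for t in times]
    return {
        "min": min(times),
        "median": statistics.median(times),
        "mean": statistics.mean(times),
        "stdev": statistics.stdev(times) if len(times) > 1 else 0.0,
    }


# Made-up sample: one slow outlier inflates the mean but barely moves the median.
sample = [0.010, 0.011, 0.010, 0.012, 0.050]
summary = summarize_timings(sample)
print(summary)
```

In this made-up sample the median stays close to the typical run time, while the mean is pulled up by the single outlier.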

In [None]:
m, n, k = 8192, 4096, 2048  # Use large enough sizes to get measurable times

a = cp.random.rand(m, k, dtype=cp.float32)
b = cp.random.rand(k, n, dtype=cp.float32)

# It's time to do real benchmarking to compare how nvmath-python
# performs on `matmul` relative to `cupy`:
benchmark(
    lambda: nvmath.linalg.advanced.matmul(a, b),  # nvmath-python implementation
    F_name="nvmath-python matmul",
    F_alternative=lambda: cp.matmul(a, b),  # CuPy implementation
    F_alternative_name="CuPy matmul",
)

Is there anything wrong with nvmath-python? Why does the "advanced" `matmul` run only as well as CuPy's `matmul`?

The explanation is simple: both CuPy and nvmath-python rely on the same cuBLAS library, and both perform nothing more than a plain matrix-matrix multiplication. This is the reason (and the only reason) why we observe essentially **identical** performance. To make a difference, we need a slightly more **sophisticated use case** that does not simply translate to a *NumPy-like* `matmul` call.

## Composite operations with nvmath-python

In many real scenarios the `matmul` operation we considered earlier is chained with other operations. For example, GEMM (*General Matrix Multiplication*) is a commonly used building block in scientific and engineering applications. In its simplified form, GEMM performs the following operation:

$$
\mathrm{D_{m \times n} \leftarrow \alpha \cdot (A_{m \times k} \cdot B_{k \times n}) + \beta \cdot C_{m \times n}}
$$

Using an array library one can easily implement this chained operation:

In [None]:
m, n, k = 10_000_000, 40, 10  # Take tall-and-skinny matrices to illustrate kernel fusion benefits

a = cp.random.rand(m, k, dtype=cp.float32)
b = cp.random.rand(k, n, dtype=cp.float32)
c = cp.random.rand(m, n, dtype=cp.float32)

alpha = 1.5
beta = 0.5

d = alpha * a @ b + beta * c
d.shape

With nvmath-python you can perform the same composite operation with a single function call:

In [None]:
d = nvmath.linalg.advanced.matmul(a, b, c, alpha=alpha, beta=beta)
d.shape

It's time to benchmark each alternative:

In [None]:
benchmark(
    lambda: nvmath.linalg.advanced.matmul(a, b, c, alpha=alpha, beta=beta),  # nvmath-python implementation
    F_name="nvmath-python GEMM",
    F_alternative=lambda: alpha * a @ b + beta * c,  # CuPy implementation
    F_alternative_name="CuPy equivalent",
)

Both nvmath-python and CuPy perform the same composite operation, and both are accelerated by cuBLAS. Why is nvmath-python significantly faster?

The "magic" behind this is called kernel fusion. In the case of CuPy, every basic operation is a separate function call on the host, which launches the respective GPU kernel asynchronously, returns to the host, submits the next kernel for execution, and so on. In the case of nvmath-python, the chained operation is performed as a whole in a single fused kernel. This avoids the overhead of multiple kernel launches but, more importantly, executing as a fused kernel allows memory accesses to be optimized, which significantly increases the *arithmetic intensity* of the kernel.
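
A back-of-the-envelope estimate shows why fusion matters for the tall-and-skinny shapes used above. The sketch below counts only global-memory traffic under an idealized model in which every kernel reads its inputs from memory and writes its output back, ignoring caches. The function names are hypothetical, and the unfused path follows the four CuPy operations in `alpha * a @ b + beta * c`.

```python
def unfused_bytes(m, n, k, itemsize=4):
    """Idealized traffic of four separate kernels: scale, matmul, scale, add."""
    scale_a = 2 * m * k              # read a, write alpha * a
    matmul = m * k + k * n + m * n   # read both operands, write the product
    scale_c = 2 * m * n              # read c, write beta * c
    add = 3 * m * n                  # read two temporaries, write d
    return itemsize * (scale_a + matmul + scale_c + add)


def fused_bytes(m, n, k, itemsize=4):
    """Idealized traffic of one fused kernel: read a, b, c once and write d once."""
    return itemsize * (m * k + k * n + 2 * m * n)


def gemm_flops(m, n, k):
    """Nominal operation count: 2*m*n*k for the product plus 3*m*n for scaling and adding."""
    return 2 * m * n * k + 3 * m * n


m, n, k = 10_000_000, 40, 10  # float32 shapes from the example above
flops = gemm_flops(m, n, k)
for name, nbytes in [("unfused", unfused_bytes(m, n, k)), ("fused", fused_bytes(m, n, k))]:
    print(f"{name}: {nbytes / 1e9:.1f} GB moved, {flops / nbytes:.2f} FLOP/byte")
```

Under this model the fused kernel moves roughly a third of the data, so its arithmetic intensity is about three times higher. Real kernels benefit from caches, so treat these numbers as a rough guide only.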

<div style="background-color:#76B900; color:#ffffff; padding:12px; border-radius:8px; max-width:80%; width:auto; margin:0 auto;">
<h3 style="margin-top:0; color:#ffffff">Exercise 1. Evaluate performance of CuPy <code>@</code>-based vs. <code>matmul</code>-based implementations of GEMM</h3>
In the above examples we implemented GEMM using the <code>@</code> operator for matrix multiplication. Implement a CuPy variant using the <code>matmul</code> function. Benchmark the <code>@</code> variant against the <code>matmul</code> variant and explain the performance difference (if any).

<strong>Hint</strong>: Consider operation precedence along with computational costs of each chained operation.
</div>

## NVIDIA Nsight plugin for JupyterLab

NVIDIA provides a JupyterLab plugin for the NVIDIA Nsight tools that lets you run performance profiling from within notebooks. The following command installs the plugin into your environment:

```bash
pip install jupyterlab-nvidia-nsight
```

After the installation you will see a new **NVIDIA Nsight** tab in JupyterLab's menu. In that menu, select **Profiling with Nsight Systems...**. This will restart the JupyterLab kernel. Then select the cells you wish to profile and execute **Run and profile selected cells...**.

In [None]:
# Profile this cell with Nsight Systems...
d = alpha * a @ b + beta * c
d.shape

After running the profiler and opening the generated report, you should see something like this:

<img src="./_assets/nsys-report-cupy.png" alt="Nsight Systems report example" width="75%"/>

Note the number of kernels and their execution times. You should see that the CuPy multiply kernel takes the longest and consumes more time than the SGEMM kernel.

Let us profile nvmath-python's implementation now:

In [None]:
# Profile this cell with Nsight Systems...
d = nvmath.linalg.advanced.matmul(a, b, c, alpha=alpha, beta=beta)
type(d)

By opening the **Nsight Systems** UI, you should see something like this:

<img src="./_assets/nsys-report-nvmath.png" alt="Nsight Systems report example" width="75%"/>

Note that the only kernel in the timeline is SGEMM. Everything else has been fused into that kernel. This is the real reason why nvmath-python performs much better on a composite operation like GEMM.

## GEMM with fused epilogs

The library goes one step further and allows fusing the GEMM operation with an epilog function $f(x)$, which is applied element-wise to the result of the GEMM operation. Specifically,

$$
\mathrm{D_{m \times n}} \leftarrow f(\mathrm{\alpha \cdot (A_{m \times k} \cdot B_{k \times n}) + \beta \cdot C_{m \times n}})
$$
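
The semantics of a fused epilog can be written down in plain Python. The sketch below is a reference for the formula only, not how the library implements it: `gemm_with_epilog` and `relu` are hypothetical names, matrices are nested lists, and ReLU is used as an example of an element-wise $f$. The epilog is applied to each output element as soon as it is computed, so no intermediate matrix is stored.

```python
def relu(x):
    """Example element-wise epilog: max(x, 0)."""
    return x if x > 0 else 0.0


def gemm_with_epilog(alpha, A, B, beta, C, f=relu):
    """Compute D = f(alpha * (A @ B) + beta * C) for matrices given as nested lists."""
    m, k, n = len(A), len(B), len(B[0])
    D = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = sum(A[i][p] * B[p][j] for p in range(k))
            # The epilog runs on the freshly computed element (single pass).
            D[i][j] = f(alpha * acc + beta * C[i][j])
    return D


A = [[1.0, -2.0], [3.0, 4.0]]
B = [[1.0, 0.0], [0.0, 1.0]]  # identity, so A @ B == A
C = [[0.0, 0.0], [-10.0, 0.0]]
print(gemm_with_epilog(1.0, A, B, 1.0, C))  # [[1.0, 0.0], [0.0, 4.0]]
```

Negative entries of the GEMM result are clamped to zero by the ReLU epilog; passing `f=lambda x: x` recovers the plain GEMM.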

For a deeper dive into GEMM epilogs and other advanced matrix-matrix multiplication techniques in nvmath-python, please refer to the collection of notebooks under the `notebooks/matmul/` directory:
* [notebooks/matmul/01_introduction.ipynb](../matmul/01_introduction.ipynb)
* [notebooks/matmul/02_epilogs.ipynb](../matmul/02_epilogs.ipynb)
* [notebooks/matmul/03_backpropagation.ipynb](../matmul/03_backpropagation.ipynb)
* [notebooks/matmul/04_fp8.ipynb](../matmul/04_fp8.ipynb)