<div style="background-color:#192015; color:#7fa637; padding:12px; border-radius:8px; max-width:80%; width:auto; margin:0 auto;">

![nvmath-python](_assets/nvmath_head_panel@0.25x.png)

<p style="font-size:0.85em; margin-top:8px;">
Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES<br>
SPDX-License-Identifier: BSD-3-Clause
</p>

</div>

# Getting started with nvmath-python: stateful APIs, autotuning

This tutorial introduces the basics of **nvmath-python**: how it fits into and interoperates with the existing scientific computing ecosystem in Python, and what makes it a useful addition to that ecosystem.
<div class="alert alert-box alert-info">
    To use this notebook you will need a computer equipped with an NVIDIA GPU as well as an environment with properly installed Python libraries and (optionally) the CUDA Toolkit. Please refer to the nvmath-python documentation for the available <a href="https://docs.nvidia.com/cuda/nvmath-python/0.2.1/installation.html#install-nvmath-python">installation options</a>.
</div>

This section introduces the nvmath-python stateful APIs and autotuning.

But first, let's borrow the benchmarking helper function we created in the previous section to continue performance experiments with stateful APIs.

In [None]:
import numpy as np
import cupyx as cpx


# Helper function to benchmark two implementations F and (optionally) F_alternative
# When F_alternative is provided, in addition to raw performance numbers (seconds)
# speedup of F relative to F_alternative is reported
def benchmark(
    F, F_name="Implementation", F_alternative=None, F_alternative_name="Alternative implementation", n_repeat=10, n_warmup=1
):
    timing = cpx.profiler.benchmark(F, n_repeat=n_repeat, n_warmup=n_warmup)  # warm-up + repeated runs
    perf = np.min(timing.gpu_times)  # best time from repeated runs
    print(f"{F_name} performance = {perf:0.4f} sec")

    if F_alternative is not None:
        timing_alt = cpx.profiler.benchmark(F_alternative, n_repeat=n_repeat, n_warmup=n_warmup)
        perf_alt = np.min(timing_alt.gpu_times)
        print(f"{F_alternative_name} performance = {perf_alt:0.4f} sec")
        print(f"Speedup = {perf_alt / perf:0.4f}x")
    else:
        perf_alt = None

    return perf, perf_alt

Let us consider a scenario that is typical in neural networks: a batch of matrix-matrix multiplications, each combined with a bias addition and a **ReLU** *activation*.
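
As a point of reference, the fused operation can be written on the CPU with NumPy. This is only a sketch of the math, assuming the bias vector holds one entry per row of the result and is broadcast across its columns. The helper name `matmul_bias_relu_reference` and the tiny input values are made up for illustration and are not part of nvmath-python.

```python
import numpy as np


def matmul_bias_relu_reference(a, b, bias):
    """CPU reference: ReLU(a @ b + bias), with bias broadcast across columns."""
    # bias has shape (m,); view it as (m, 1) so each row gets its own offset
    return np.maximum(a @ b + bias[:, None], 0)


# Tiny hand-checkable example (values chosen for illustration only)
a_ref = np.array([[1.0, -2.0], [3.0, 4.0]], dtype=np.float32)
b_ref = np.array([[1.0], [1.0]], dtype=np.float32)
bias_ref = np.array([2.0, -10.0], dtype=np.float32)

# a_ref @ b_ref is [[-1], [7]]; adding the bias gives [[1], [-3]]; ReLU clips the -3 to 0
d_ref = matmul_bias_relu_reference(a_ref, b_ref, bias_ref)
print(d_ref)
```

A reference like this is handy for validating GPU results: copy the operands and the output to the host with `cp.asnumpy` and compare them using a tolerance suited to `float32`.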

In [None]:
import cupy as cp
import nvmath
from nvmath.linalg.advanced import MatmulEpilog

m, n, k = 124, 10, 15

a = cp.random.rand(m, k, dtype=cp.float32)
b = cp.random.rand(k, n, dtype=cp.float32)
d = cp.empty((m, n), dtype=cp.float32)
bias = cp.random.rand(m, dtype=cp.float32)

d = nvmath.linalg.advanced.matmul(a, b, epilog=MatmulEpilog(MatmulEpilog.RELU_BIAS), epilog_inputs={"bias": bias})

print("d type =", type(d))
print("d shape =", d.shape)

## Stateful and stateless APIs

What we have used so far is the *stateless* (or *function-form*) API for `matmul` in nvmath-python. This convenience API performs the desired operation in a single function call and returns the result. Under the hood, however, a lot of machinery runs before the actual computation takes place. Let us illustrate that using nvmath-python's logging capabilities:

In [None]:
# Demonstrate what happens under the hood using nvmath-python's logging capabilities
import logging

# Configure the root logger to INFO and include timestamps
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)-8s %(message)s", force=True)
logging.disable(logging.NOTSET)  # ensure logging is enabled

logging.info("About to call matmul() — inspect logs for execution flow")

d = nvmath.linalg.advanced.matmul(a, b, epilog=MatmulEpilog(MatmulEpilog.RELU_BIAS), epilog_inputs={"bias": bias})

print("d type =", type(d))
print("d shape =", d.shape)

logging.info("Completed matmul() call")
logging.disable(logging.CRITICAL)  # disable logging

As you can see, quite a bit of action has happened during the call to `matmul`:

* **SPECIFICATION PHASE:** In this phase nvmath-python analyzes the input arguments and creates an internal `Matmul` object with all the data required for the underlying cuBLASLt library call. Note that this phase can add noticeable overhead when the inputs are relatively small.
* **PLANNING PHASE:** This is where cuBLASLt analyzes the data prepared in the `Matmul` object and searches for a suitable algorithm to perform the operation efficiently. It uses internal heuristics to select the most promising algorithm. The planning phase also adds noticeable overhead before the actual computation even starts.
* **EXECUTION PHASE:** This is the phase where the actual computation happens.
* **RESOURCE MANAGEMENT PHASE:** This is the final phase, where the `Matmul` resources are released. Did you notice the corresponding INFO line `The Matmul object's resources have been released` in the log?
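
The four phases above can be mimicked with a toy class in plain Python. This is a sketch under stated assumptions: `ToyOperation` and `toy_matmul` are hypothetical names, the "planning" step does nothing useful, and none of this reflects how nvmath-python or cuBLASLt are implemented. It only shows how a function-form call can wrap an object that goes through specification, planning, execution, and release.

```python
class ToyOperation:
    """Hypothetical stand-in that mimics the four phases; not the nvmath-python Matmul class."""

    def __init__(self, a, b):
        # SPECIFICATION: validate the operands and record what the plan depends on
        if len(a[0]) != len(b):
            raise ValueError("inner dimensions do not match")
        self.a, self.b = a, b
        self.log = ["specification"]
        self.planned = False

    def plan(self):
        # PLANNING: a real library would select an algorithm here
        self.planned = True
        self.log.append("planning")

    def execute(self):
        # EXECUTION: the actual computation (a naive matrix product on nested lists)
        if not self.planned:
            raise RuntimeError("plan() must be called before execute()")
        self.log.append("execution")
        cols = list(zip(*self.b))
        return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in self.a]

    def free(self):
        # RESOURCE MANAGEMENT: drop references to whatever the object holds
        self.a = self.b = None
        self.log.append("resources released")

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.free()
        return False  # do not swallow exceptions


def toy_matmul(a, b):
    """Function-form wrapper: runs all four phases on every call."""
    with ToyOperation(a, b) as op:
        op.plan()
        result = op.execute()
    return result, op.log


result, phases = toy_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)
print(phases)
```

Every call to `toy_matmul` repeats all four steps, which is exactly the pattern visible in the log of the stateless `matmul` call.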

Now imagine that our workflow involves a **series of matrix multiplications**. In this scenario we go through all the phases over and over again, once per multiplication. What if the matrix shapes and layouts do not change? What if the input `dtypes` do not change? In that case it would be desirable to perform the *specification* and the *planning* once and amortize their cost over *multiple executions*. For exactly this more sophisticated scenario, nvmath-python offers the **stateful** (or **class-form**) API for matrix-matrix multiplication. With this API you construct an object (an instance of the class `Matmul`; did you notice the log line `The Matmul operation has been created`?), plan it once, and then execute it as many times as needed, resetting the operands between executions.

Let us illustrate this with code:

In [None]:
import cupy as cp
import nvmath
from nvmath.linalg.advanced import MatmulEpilog

m, n, k, batch_size = 124, 10, 15, 8

a = cp.random.rand(batch_size, m, k, dtype=cp.float32)
b = cp.random.rand(batch_size, k, n, dtype=cp.float32)
d = cp.empty((batch_size, m, n), dtype=cp.float32)
bias = cp.random.rand(batch_size, m, dtype=cp.float32)


def matmul_batched_stateless(a, b, bias):
    for i in range(batch_size):
        d[i] = nvmath.linalg.advanced.matmul(
            a[i], b[i], epilog=MatmulEpilog(MatmulEpilog.RELU_BIAS), epilog_inputs={"bias": bias[i]}
        )

    return d


def matmul_batched_stateful(a, b, bias):
    with nvmath.linalg.advanced.Matmul(a[0], b[0]) as mm:
        mm.plan(epilog=MatmulEpilog(MatmulEpilog.RELU_BIAS), epilog_inputs={"bias": bias[0]})
        d[0] = mm.execute()  # store the result for the first pair of operands as well
        for i in range(1, batch_size):
            mm.reset_operands(a=a[i], b=b[i], epilog_inputs={"bias": bias[i]})
            d[i] = mm.execute()

    return d

In [None]:
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)-8s %(message)s", force=True)
logging.disable(logging.NOTSET)

logging.info("*************** Stateless API ***************")

matmul_batched_stateless(a, b, bias)

logging.info("*************** Stateful API ***************")

matmul_batched_stateful(a, b, bias)

logging.disable(logging.CRITICAL)

You can see from the log that with the stateless API there are 8 **SPECIFICATION**, 8 **PLANNING**, and 8 **EXECUTION** phases. With the stateful implementation we reduced the number of **SPECIFICATION** and **PLANNING** phases to 1, while the number of **EXECUTION** phases stays at 8. Let us see the performance impact.
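
Before measuring, a rough additive cost model gives a feel for what to expect. The sketch below assumes each phase has a fixed cost and that the costs simply add up. The per-phase numbers are hypothetical placeholders, not measurements, and the model ignores the cost of resetting operands and releasing resources.

```python
def total_time(n_calls, t_spec, t_plan, t_exec, stateful):
    """Estimated time for n_calls multiplications under a simple additive cost model."""
    # Specification and planning happen once (stateful) or on every call (stateless)
    setups = 1 if stateful else n_calls
    return setups * (t_spec + t_plan) + n_calls * t_exec


# Hypothetical per-phase costs in microseconds; measure your own system for real numbers
t_spec, t_plan, t_exec = 200, 300, 50
n_calls = 8

stateless = total_time(n_calls, t_spec, t_plan, t_exec, stateful=False)
stateful = total_time(n_calls, t_spec, t_plan, t_exec, stateful=True)
print(f"stateless: {stateless} us, stateful: {stateful} us, speedup: {stateless / stateful:.2f}x")
```

Under this model the benefit grows with the number of calls and with the ratio of setup cost to execution cost, which is why small matrices tend to benefit the most.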

In [None]:
benchmark(
    lambda: matmul_batched_stateful(a, b, bias), "Stateful API", lambda: matmul_batched_stateless(a, b, bias), "Stateless API"
)

<div style="background-color:#76B900; color:#ffffff; padding:12px; border-radius:8px; max-width:80%; width:auto; margin:0 auto;">
<h3 style="margin-top:0; color:#ffffff">Exercise 3. Batch dimension vs. batch sequence </h3>
In the above example we implemented batching as a sequence of matrices processed one by one in a loop. This is a common technique when data is streamed or when the entire batch does not fit into GPU memory. An alternative approach is to add a dedicated batch dimension and operate on the batch as a single tensor. The nvmath-python library supports both use cases.

Implement the batch dimension approach and compare its performance to the batch sequence approach. Explain the performance difference (if any).
</div>

## Autotuning with nvmath-python

Using stateful APIs becomes even more important when there is a need for autotuning. In many cases the built-in heuristics for `matmul` kernel selection work reasonably well out of the box. There are cases, however, where the underlying cuBLASLt library may choose a suboptimal kernel and additional tuning is required. Tuning has a cost of its own, so let us see how that cost can be amortized over multiple executions.
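
Conceptually, autotuning amounts to timing several candidate implementations and preferring the one that measures fastest. The sketch below illustrates that idea, together with a break-even estimate, in plain Python. It is not how nvmath-python or cuBLASLt implement autotuning: `rank_by_time`, `break_even_executions`, the two stand-in "kernels", and all cost figures are hypothetical.

```python
import math
import time


def rank_by_time(candidates, iterations=5, timer=time.perf_counter):
    """Time each candidate `iterations` times; return (best_time, name) pairs, fastest first."""
    ranking = []
    for name, run in candidates.items():
        samples = []
        for _ in range(iterations):
            start = timer()
            run()
            samples.append(timer() - start)
        ranking.append((min(samples), name))
    return sorted(ranking)


def break_even_executions(tuning_cost, t_default, t_tuned):
    """Smallest number of executions after which tuning has paid for itself, or None."""
    saving = t_default - t_tuned
    if saving <= 0:
        return None  # the tuned choice is not faster, so the cost is never recovered
    return math.ceil(tuning_cost / saving)


# Two stand-in "kernels" that compute the same sum in different ways (illustration only)
data = list(range(50_000))


def loop_sum():
    total = 0
    for x in data:
        total += x
    return total


candidates = {"builtin_sum": lambda: sum(data), "python_loop": loop_sum}
ranking = rank_by_time(candidates)
for best, name in ranking:
    print(f"{name}: {best:.6f} s")

# Hypothetical costs in microseconds: tuning takes 500 000, each execution saves 2 000
break_even = break_even_executions(tuning_cost=500_000, t_default=10_000, t_tuned=8_000)
print("executions needed to amortize tuning:", break_even)
```

The break-even estimate makes the trade-off explicit: tuning only pays off when the same plan is executed often enough, which is the situation the stateful API is designed for.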



In [None]:
m, n, k, batch_size = 124, 1024, 1512, 1024

a = cp.random.rand(batch_size, m, k, dtype=cp.float32)
b = cp.random.rand(batch_size, k, n, dtype=cp.float32)
d = cp.empty((batch_size, m, n), dtype=cp.float32)
bias = cp.random.rand(batch_size, m, dtype=cp.float32)


def matmul_batched_stateful_autotuned(a, b, bias):
    with nvmath.linalg.advanced.Matmul(a[0], b[0]) as mm:
        mm.plan(epilog=MatmulEpilog(MatmulEpilog.RELU_BIAS), epilog_inputs={"bias": bias[0]})
        mm.autotune(iterations=5)
        d[0] = mm.execute()  # store the result for the first pair of operands as well
        for i in range(1, batch_size):
            mm.reset_operands(a=a[i], b=b[i], epilog_inputs={"bias": bias[i]})
            d[i] = mm.execute()

    return d


benchmark(
    lambda: matmul_batched_stateful_autotuned(a, b, bias),
    "Stateful API with autotuning",
    lambda: matmul_batched_stateful(a, b, bias),
    "Stateful API without autotuning",
)

<div style="background-color:#76B900; color:#ffffff; padding:12px; border-radius:8px; max-width:80%; width:auto; margin:0 auto;">
<h3 style="margin-top:0; color:#ffffff">Takeaways</h3>

- The stateless API is convenient for single operations but repeats specification and planning on every call.
- The stateful API performs specification and planning once and then allows multiple executions, which can significantly improve performance for repeated operations on same-shaped inputs.
- An nvmath-python operation goes through four phases: specification, planning, execution, and resource management.
- Autotuning can find faster kernels when the built-in heuristics are suboptimal, providing additional performance gains once its cost is amortized over many executions.
</div>