<div style="background-color:#192015; color:#7fa637; padding:12px; border-radius:8px; max-width:80%; width:auto; margin:0 auto;">

![nvmath-python](_assets/nvmath_head_panel@0.25x.png)

<p style="font-size:0.85em; margin-top:8px;">
Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES<br>
SPDX-License-Identifier: BSD-3-Clause
</p>

</div>

# Getting started with nvmath-python: memory and execution spaces

This tutorial covers the basics of **nvmath-python**: how it fits into the existing scientific computing ecosystem in Python and what makes it a useful addition to that ecosystem. This notebook introduces memory and execution spaces, two concepts that underlie how nvmath-python operates.

<div class="alert alert-box alert-info">
    To use this notebook you will need a computer equipped with an NVIDIA GPU and an environment with properly installed Python libraries and (optionally) the CUDA Toolkit. Please refer to the nvmath-python documentation to get familiar with the <a href="https://docs.nvidia.com/cuda/nvmath-python/0.2.1/installation.html#install-nvmath-python">installation options</a>.
</div>

## Memory and execution spaces

The flexibility to choose an array library (or multiple libraries!) to interoperate with applies to GPU-only libraries (e.g. CuPy), CPU-only libraries (e.g. NumPy), and CPU-GPU libraries (e.g. PyTorch). This is possible because nvmath-python is backed by both GPU libraries (such as cuBLAS or cuFFT) and CPU libraries (NVPL for NVIDIA Grace CPUs and Intel MKL for x86 hosts). This allows easy code migration between CPU and GPU, as well as the implementation of complex hybrid workflows that combine CPU and GPU execution.

The *memory space* is the memory that holds input data and results. It is tied to a specific device (or the host) and is allocated and released through the respective device or host API calls. The *execution space* is where the data is actually processed. Memory and execution spaces are not necessarily the same. This is important to remember because data transfer between memory spaces often incurs non-negligible costs. These costs may be high not only when data moves between a host CPU and a GPU device, but also when it moves between two GPU devices.

Let's take an example.

In [None]:
import cupy as cp
import numpy as np
import nvmath

m, n, k = 8000, 2000, 4000
a_cpu = np.random.randn(m, k).astype(np.float32)
b_cpu = np.random.randn(k, n).astype(np.float32)

a_gpu = cp.random.randn(m, k, dtype=cp.float32)
b_gpu = cp.random.randn(k, n, dtype=cp.float32)

d_cpu = nvmath.linalg.advanced.matmul(a_cpu, b_cpu)
d_gpu = nvmath.linalg.advanced.matmul(a_gpu, b_gpu)
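# Print both types explicitly: a notebook cell displays only its last bare expression
print(type(d_cpu))  # numpy.ndarray
print(type(d_gpu))  # cupy.ndarray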

We will again use the benchmarking helper:

In [None]:
import numpy as np
import cupyx as cpx


# Helper function to benchmark two implementations F and (optionally) F_alternative
# When F_alternative is provided, in addition to raw performance numbers (seconds)
# speedup of F relative to F_alternative is reported
def benchmark(
    F, F_name="Implementation", F_alternative=None, F_alternative_name="Alternative implementation", n_repeat=10, n_warmup=1
):
    timing = cpx.profiler.benchmark(F, n_repeat=n_repeat, n_warmup=n_warmup)  # warm-up + repeated runs
    perf = np.min(timing.gpu_times)  # best time from repeated runs
    print(f"{F_name} performance = {perf:0.4f} sec")

    if F_alternative is not None:
        timing_alt = cpx.profiler.benchmark(F_alternative, n_repeat=n_repeat, n_warmup=n_warmup)
        perf_alt = np.min(timing_alt.gpu_times)
        print(f"{F_alternative_name} performance = {perf_alt:0.4f} sec")
        print(f"Speedup = {perf_alt / perf:0.4f}x")
    else:
        perf_alt = None

    return perf, perf_alt

In [None]:
benchmark(
    lambda: nvmath.linalg.advanced.matmul(a_gpu, b_gpu),
    F_name="Matmul with GPU inputs",
    F_alternative=lambda: nvmath.linalg.advanced.matmul(a_cpu, b_cpu),
    F_alternative_name="Matmul with CPU inputs",
)

The difference is noticeable, but where does the cost come from? `nvmath.linalg.advanced.matmul` belongs to the category of *specialized APIs*. In contrast to *generic APIs* such as `nvmath.fft.fft`, specialized APIs serve very specific needs, which comes at the cost of generality. Specifically, `nvmath.linalg.advanced.matmul` supports the GPU execution space only. When it receives CPU tensor inputs, it implicitly copies them into its *execution space*, performs the operation, and then copies the result back to the original *memory space*.
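To get a feel for how much data these implicit copies involve, the following back-of-the-envelope sketch counts the bytes that cross the CPU-GPU boundary for the `m, n, k` shapes used in the example above. It assumes each operand is copied exactly once in each direction and that nothing is cached between calls; this is an illustrative assumption, not a statement about the library's internals.

```python
# Shapes from the matmul example; float32 elements occupy 4 bytes each.
m, n, k = 8000, 2000, 4000
BYTES_PER_FLOAT32 = 4

# Operands a (m x k) and b (k x n) travel from host to device.
bytes_to_gpu = (m * k + k * n) * BYTES_PER_FLOAT32
# The result d (m x n) travels from device back to host.
bytes_to_cpu = (m * n) * BYTES_PER_FLOAT32
total_bytes = bytes_to_gpu + bytes_to_cpu

# A dense matmul performs one multiply and one add per (i, j, l) triple.
flops = 2 * m * n * k

print(f"Host -> device: {bytes_to_gpu / 1e6:.0f} MB")  # 160 MB
print(f"Device -> host: {bytes_to_cpu / 1e6:.0f} MB")  # 64 MB
print(f"Total moved:    {total_bytes / 1e6:.0f} MB")   # 224 MB
print(f"Arithmetic:     {flops / 1e9:.0f} GFLOP")      # 128 GFLOP
```

Dividing the total byte count by the host-device bandwidth of your system gives a rough lower bound on the time spent in transfers, which can be compared with the benchmark numbers above.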

The next example uses the library's logging mechanism to illustrate what happens under the hood of nvmath-python.

In [None]:
import logging

# Configure root logger to show info messages from nvmath and its internals
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)-8s %(message)s", force=True)
logging.disable(logging.NOTSET)  # ensure logging is enabled

# Run matmul with GPU inputs (execution space == GPU)
logging.info("******************************************************")
logging.info("********************* GPU INPUTS *********************")
logging.info("******************************************************")
d_gpu = nvmath.linalg.advanced.matmul(a_gpu, b_gpu)
print("d_gpu type:", type(d_gpu))

# Run matmul with CPU inputs (this will cause nvmath to copy to execution space internally)
logging.info("******************************************************")
logging.info("********************* CPU INPUTS *********************")
logging.info("******************************************************")
d_cpu = nvmath.linalg.advanced.matmul(a_cpu, b_cpu)
print("d_cpu type:", type(d_cpu))

In the case of GPU inputs the `= SPECIFICATION PHASE =` section reports:

`The input operands' memory space is cuda, and the execution space is on device 0.`

while in the case of CPU inputs the report is different:

`The input operands' memory space is cpu, and the execution space is on device 0.`

Such significant overhead cannot be ignored. nvmath-python's logging mechanism is a great tool for understanding potential costs and refactoring the code to minimize their impact.
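When the full log is too verbose, a small custom handler can keep only the lines that report memory and execution spaces. The sketch below uses only the standard `logging` module. `SpaceReportHandler` is a hypothetical helper name, and the sketch assumes that nvmath-python messages reach the root logger, as the `basicConfig` setup above suggests. To stay runnable without a GPU, it replays the two messages quoted above instead of calling nvmath.

```python
import logging


class SpaceReportHandler(logging.Handler):
    """Hypothetical helper: keep only messages that mention the execution space."""

    def __init__(self):
        super().__init__(level=logging.INFO)
        self.reports = []

    def emit(self, record):
        message = record.getMessage()
        if "execution space" in message:
            self.reports.append(message)


root = logging.getLogger()
previous_level = root.level
handler = SpaceReportHandler()
root.addHandler(handler)
root.setLevel(logging.INFO)
try:
    # Stand-in for real nvmath calls: replay the messages quoted in the text.
    root.info("The input operands' memory space is cuda, and the execution space is on device 0.")
    root.info("Unrelated message that the handler should skip.")
    root.info("The input operands' memory space is cpu, and the execution space is on device 0.")
finally:
    # Restore the root logger so later cells are unaffected.
    root.removeHandler(handler)
    root.setLevel(previous_level)

for report in handler.reports:
    print(report)
```

In the notebook, the three stand-in `root.info(...)` calls would be replaced by the `nvmath.linalg.advanced.matmul` calls from the previous cell.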

<div style="background-color:#76B900; color:#ffffff; padding:12px; border-radius:8px; max-width:80%; width:auto; margin:0 auto;">
<h3 style="margin-top:0; color:#ffffff">Exercise 2. Estimate CPU-GPU data transfer overhead</h3>
We see a non-negligible performance difference between data residing in the CPU memory space and data residing in the GPU memory space in the above example. Given that the execution space is always the GPU, estimate the data transfer cost. Implement a dedicated benchmark as a cross-check.
</div>


Next, let us illustrate the data flow in the case of the library's **fast Fourier transforms (FFT)**:

In [None]:
N = 10000
e_cpu = (np.random.randn(N) + 1j * np.random.randn(N)).astype(np.complex64)
e_gpu = cp.array(e_cpu)  # move NumPy data to GPU as CuPy array (complex64)

# Compute FFT with nvmath; the log reports which execution space is selected for each input
logging.info("******************************************************")
logging.info("********************* CPU INPUTS *********************")
logging.info("******************************************************")
r_cpu = nvmath.fft.fft(e_cpu)
print("r_cpu type:", type(r_cpu), getattr(r_cpu, "dtype", None), getattr(r_cpu, "shape", None))

logging.info("******************************************************")
logging.info("********************* GPU INPUTS *********************")
logging.info("******************************************************")
r_gpu = nvmath.fft.fft(e_gpu)
print("r_gpu type:", type(r_gpu), getattr(r_gpu, "dtype", None), getattr(r_gpu, "shape", None))

Note that when the input operand resides in CPU memory, the library chooses the CPU as the execution space, because FFT belongs to the *generic APIs*, which provide consistent behavior between CPU and GPU. For GPU inputs, the library selects the GPU as the execution space.
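The selection rule observed in both examples can be summarized in a few lines of plain Python. This is only a sketch of the behavior described in this notebook, not nvmath-python's actual implementation. The function names and the `gpu_only` flag are hypothetical, and the stand-in classes merely mimic the package names of NumPy and CuPy arrays so that the sketch runs without a GPU.

```python
def infer_memory_space(operand):
    """Hypothetical helper: guess where an array lives from its package name."""
    package = type(operand).__module__.split(".")[0]
    if package == "numpy":
        return "cpu"
    if package == "cupy":
        return "cuda"
    if package == "torch":
        # torch.Tensor exposes its placement as .device.type ("cpu" or "cuda").
        return operand.device.type
    raise TypeError(f"Unsupported operand type: {type(operand)!r}")


def choose_execution_space(operand, gpu_only=False):
    """Generic APIs follow the operand; GPU-only specialized APIs always use the GPU."""
    if gpu_only:
        return "cuda"
    return infer_memory_space(operand)


def needs_transfer(operand, gpu_only=False):
    """A copy is required whenever memory and execution spaces differ."""
    return infer_memory_space(operand) != choose_execution_space(operand, gpu_only)


# Stand-ins that only mimic the package of real arrays (illustration only).
class FakeHostArray:
    __module__ = "numpy"


class FakeDeviceArray:
    __module__ = "cupy"


for name, operand in [("host", FakeHostArray()), ("device", FakeDeviceArray())]:
    for gpu_only in (False, True):
        kind = "specialized (GPU-only)" if gpu_only else "generic"
        space = choose_execution_space(operand, gpu_only)
        copy = needs_transfer(operand, gpu_only)
        print(f"{name} operand, {kind} API: execution on {space}, transfer needed: {copy}")
```

Only the combination of a host operand with a GPU-only API triggers a transfer, which matches the matmul benchmark earlier in this notebook.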

<div style="background-color:#76B900; color:#ffffff; padding:12px; border-radius:8px; max-width:80%; width:auto; margin:0 auto;">
<h3 style="margin-top:0; color:#ffffff">Takeaways</h3>

- Memory space (where data is stored) and execution space (where computation happens) may differ, leading to data transfer costs.
- Some specialized APIs, e.g. `nvmath.linalg.advanced.matmul`, only support GPU execution, automatically transferring CPU data to GPU with associated overhead.
- Generic APIs like `nvmath.fft.fft` adapt to input location: CPU inputs execute on CPU, GPU inputs execute on GPU.
- Use nvmath-python's logging mechanism to understand internal operations and identify potential bottlenecks.
</div>
