---
title: The Cost of Numerical Linear Algebra
description: Overview of measuring computational costs in numerical linear algebra including floating point operations, matrix-vector queries, and entry evaluations
keywords: [computational cost, floating point operations, flops, matrix-vector multiplication, precision, hardware efficiency, numerical linear algebra]
---

Randomized Numerical Linear Algebra (RandNLA) achieves speedups over classical NLA algorithms by one or both of the following strategies:
1. reducing the complexity of algorithms
1. reorganizing computation to take advantage of modern hardware


While practitioners may care about particular quantities like runtime, energy use, or monetary cost, these metrics are highly system dependent.
As such, it is beneficial to understand the underlying costs of algorithms in a more abstract way.
In this section, we provide a brief overview of some common ways of measuring costs in NLA, which will be useful for understanding the improvements that RandNLA can provide.
Of course, there is no "best" way of measuring costs.

## Floating point operations

Perhaps the most common way of measuring the cost of numerical algorithms is by counting the number of floating point operations (flops) required to complete the algorithm.
This provides a mathematical (hardware-independent) measure of algorithmic complexity, which is useful for comparing algorithms and understanding how they scale with the input size.
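
As a small illustration of what counting flops means, the sketch below tallies the multiplications and additions in a naive matrix-vector product and compares the total to the usual $2nd$ count. The function name `matvec_with_flop_count` and the small sizes are illustrative choices, not part of any library.

```python
import numpy as np

def matvec_with_flop_count(A, x):
    """Naive matrix-vector product that tallies every multiply and add."""
    n, d = A.shape
    y = np.zeros(n)
    flops = 0
    for i in range(n):
        for j in range(d):
            y[i] += A[i, j] * x[j]  # one multiplication and one addition
            flops += 2
    return y, flops

# Small illustrative sizes (the pure-Python loop is slow for large inputs)
rng = np.random.default_rng(0)
n_small, d_small = 30, 20
A_small = rng.standard_normal((n_small, d_small))
x_small = rng.standard_normal(d_small)

y_small, flop_count = matvec_with_flop_count(A_small, x_small)
print(flop_count, 2 * n_small * d_small)
```

The count depends only on the dimensions of the input, not on the machine or the library used to carry out the arithmetic.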



### Flops are not necessarily representative of cost

Flops alone don't tell us everything, since the flop rate (number of flops per unit time) can vary significantly depending on the specific operation being performed and the hardware being used.
In modern computing environments, other factors such as memory access patterns, communication costs, and parallelism can have a significant impact on the actual runtime of an algorithm.
In fact, some modern hardware devices are designed specifically to perform certain types of linear-algebra operations (e.g., matrix-matrix multiplication) very efficiently (see e.g. [NVIDIA's Tensor Cores](https://www.nvidia.com/en-us/data-center/tensor-cores/)).



To illustrate this point, let's consider the following basic linear-algebra operations:
1. matrix-vector multiplication: $\vec{A}\vec{x}$ 
2. matrix-matrix multiplication: $\vec{A}^\T\vec{A}$
3. QR factorization of $\vec{A}$


We'll time each of these operations on an $n\times d$ matrix $\vec{A}$. By dividing the number of flops by the time, we get the flop rate of each primitive, which gives a sense of their relative efficiency.

In [1]:
import numpy as np
import time
import pandas as pd

In [78]:
# Generate a random matrix and vector
n = 2000
d = 500

A = np.random.randn(n,d)
x = np.random.randn(d)

In [79]:
# Define the operations and their flop counts
operations = {
    'A@x': {
        'func': lambda: A@x,
        'flops': 2 * n * d
    },
    'A.T@A': {
        'func': lambda: A.T@A,
        'flops': 2 * n * d**2
    },
    'QR(A)': {
        'func': lambda: np.linalg.qr(A),
        'flops': 2 * n * d**2 - (2/3) * d**3
    },
}

We'll now time each of these operations and compare the flop rates.

In [80]:
# Time the operations
n_repeat = 20 # Number of repetitions for averaging

results = [] # Initialize results list

# Time each operation
for method_name, operation in operations.items():
    start = time.perf_counter()  # monotonic, high-resolution timer
    for _ in range(n_repeat):
        operation['func']()
    end = time.perf_counter()
    
    avg_time = (end - start) / n_repeat
    
    results.append({
        'method': method_name,
        'flops/s': operation['flops']/avg_time 
    })

# Create DataFrame
results_df = pd.DataFrame(results)
results_df['flops/s (relative)'] = 100*results_df['flops/s'] / results_df['flops/s'].max()
results_df.style.format({'flops/s':'{:1.1e}','flops/s (relative)':'{:1.0f}%'})

| method | flops/s | flops/s (relative) |
|---|---|---|
| A@x | 6.0e+08 | 0% |
| A.T@A | 1.3e+11 | 100% |
| QR(A) | 1.1e+10 | 8% |


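The results of our experiment are striking. Even these basic linear algebra operations differ in efficiency by orders of magnitude.
For instance, in the run above, the flop rate of matrix-vector multiplication is roughly 200x lower than that of matrix-matrix multiplication, and the flop rate of QR factorization is roughly 10x lower.
This means that an algorithm that performs more flops can still run faster if those flops are organized into operations, such as matrix-matrix multiplication, that the hardware executes efficiently.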
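
One way to see how the organization of flops matters is to compute the same quantity in two ways. The sketch below forms $\vec{A}^\T\vec{A}$ once as $d$ separate matrix-vector products and once as a single matrix-matrix product. Both use about $2nd^2$ flops. The variable names (`A_demo`, `G_cols`, `G_full`) are illustrative, and the timings will vary from system to system.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n_demo, d_demo = 2000, 500
A_demo = rng.standard_normal((n_demo, d_demo))

# Option 1: d separate matrix-vector products, one per column of A
start = time.perf_counter()
G_cols = np.empty((d_demo, d_demo))
for j in range(d_demo):
    G_cols[:, j] = A_demo.T @ A_demo[:, j]
time_matvecs = time.perf_counter() - start

# Option 2: a single matrix-matrix product
start = time.perf_counter()
G_full = A_demo.T @ A_demo
time_matmul = time.perf_counter() - start

print(f"{d_demo} matvecs: {time_matvecs:.4f} s, one matmul: {time_matmul:.4f} s")
```

The two results agree up to rounding error, so any difference in runtime comes from how the arithmetic is organized rather than from how much of it there is.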

### Precision impacts cost

While NLA has traditionally been performed in double precision, many modern hardware devices (e.g., GPUs) are optimized for single precision or even lower precision arithmetic. 
Perhaps unsurprisingly, using lower precision can substantially increase the flop rate of numerical linear algebra.

To illustrate this, we compare the flop rates of matrix-matrix multiplication in single and double precision.

In [84]:
# Create matrices in different precisions
A_f64 = np.random.randn(n, d).astype(np.float64)
A_f32 = A_f64.astype(np.float32)

Similar to above, we'll benchmark the flop rates of matrix-matrix multiplication in single and double precision.

In [86]:

# Precision experiment: Compare float64 vs float32 matrix multiplication
precision_ops = {
    'A.T@A (float64)': {
        'func': lambda: A_f64.T @ A_f64,
        'flops': 2 * n * d**2
    },
    'A.T@A (float32)': {
        'func': lambda: A_f32.T @ A_f32,
        'flops': 2 * n * d**2
    }
}

precision_results = []

# Time each precision
for method_name, operation in precision_ops.items():
    start = time.perf_counter()  # monotonic, high-resolution timer
    for _ in range(n_repeat):
        operation['func']()
    end = time.perf_counter()
    
    avg_time = (end - start) / n_repeat
    
    precision_results.append({
        'method': method_name,
        'time (s)': avg_time,
        'flops/s': operation['flops']/avg_time 
    })

# Create DataFrame
precision_df = pd.DataFrame(precision_results)
precision_df['speedup'] = precision_df['flops/s'] / precision_df['flops/s'].min()
precision_df.style.format({'time (s)': '{:.4f}', 'flops/s':'{:.1e}', 'speedup': '{:.1f}x'})

| method | time (s) | flops/s | speedup |
|---|---|---|---|
| A.T@A (float64) | 0.0155 | 6.5e+10 | 1.0x |
| A.T@A (float32) | 0.0080 | 1.3e+11 | 1.9x |


Here we observe roughly a 2x speedup by switching from 64-bit double precision to 32-bit single precision.
On GPUs, the speedups from using lower precision can be even more pronounced, as the hardware is often specifically designed to take advantage of lower-precision arithmetic.

Of course, lower-precision formats have lower precision 🤔, so we must be careful to ensure that the results of our computations are still accurate enough for our purposes.
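
To get a rough sense of the accuracy trade-off, the sketch below compares a single-precision product against a double-precision reference. The names `B_f64`, `B_f32`, and `rel_err` are illustrative, and the error observed will depend on the matrix and on the underlying BLAS library.

```python
import numpy as np

rng = np.random.default_rng(0)
B_f64 = rng.standard_normal((2000, 500))
B_f32 = B_f64.astype(np.float32)

G_f64 = B_f64.T @ B_f64  # double-precision reference
G_f32 = B_f32.T @ B_f32  # computed and stored in float32

# Relative error in the Frobenius norm (the difference is formed in float64)
rel_err = np.linalg.norm(G_f32 - G_f64) / np.linalg.norm(G_f64)

print(f"relative error of float32 product: {rel_err:.1e}")
print(f"float32 eps: {np.finfo(np.float32).eps:.1e}")
print(f"float64 eps: {np.finfo(np.float64).eps:.1e}")
```

Whether an error of this size is acceptable depends on the application.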

## Matrix-vector queries

An increasingly popular way of measuring the cost of numerical algorithms is by counting the number of matrix-vector queries required to complete the algorithm.
Here, a matrix-vector query to $\vec{A}$ is an evaluation of the linear map $\vec{x} \mapsto \vec{A}\vec{x}$ for some input vector $\vec{x}$.
In some cases we may also have access to transpose queries $\vec{y} \mapsto \vec{A}^\T\vec{y}$. 

Some examples where this access model is natural include:
- The arithmetic and runtime costs of Krylov subspace methods like power iteration, conjugate gradient, and Lanczos are often dominated by the cost of matrix-vector products.
- A legacy PDE solver may give us access to the action of the solution operator of a PDE, even though we do not have access to the entries of $\vec{A}$ directly.
- Computing matrix-vector products with a Hessian (of e.g. a neural network) can be much cheaper than computing the Hessian itself {cite:p}`pearlmutter_94`.
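
A minimal sketch of this access model is given below, assuming a hypothetical `MatvecOracle` wrapper (not from any library) that exposes only the map $\vec{x} \mapsto \vec{A}\vec{x}$ and counts how often it is called. Power iteration is then written purely in terms of queries. The test matrix with a known top eigenvalue is also an illustrative assumption.

```python
import numpy as np

class MatvecOracle:
    """Hypothetical wrapper exposing only x -> A @ x and counting queries."""

    def __init__(self, A):
        self._A = A
        self.shape = A.shape
        self.num_queries = 0

    def matvec(self, x):
        self.num_queries += 1
        return self._A @ x

def power_iteration(oracle, num_iters, rng):
    """Estimate the dominant eigenvalue of a symmetric matrix using only queries."""
    x = rng.standard_normal(oracle.shape[1])
    x /= np.linalg.norm(x)
    for _ in range(num_iters):
        y = oracle.matvec(x)
        x = y / np.linalg.norm(y)
    return x @ oracle.matvec(x)  # Rayleigh quotient (one extra query)

# Symmetric test matrix with eigenvalues 10 and 49 values in [0, 1]
rng = np.random.default_rng(0)
dim = 50
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
eigvals = np.concatenate(([10.0], np.linspace(0.0, 1.0, dim - 1)))
S = (Q * eigvals) @ Q.T

oracle = MatvecOracle(S)
num_iters = 30
estimate = power_iteration(oracle, num_iters, rng)
print(f"estimate: {estimate:.6f}, queries used: {oracle.num_queries}")
```

In this model, the cost of the algorithm is the value of `num_queries`, regardless of how each product is computed internally.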

There are more theoretical motivations as well. 
Perhaps the most exciting is the potential for *lower bounds* on the number of matrix-vector queries required to solve a problem.
This offers a way to measure the complexity of various linear algebra tasks.

## Entry evaluations

In some settings, evaluating a single entry of a matrix may be the dominant cost. 
For instance, if $\vec{A}$ is a kernel matrix, then the $(i,j)$ entry of $\vec{A}$ takes the form $K(\vec{x}_i, \vec{x}_j)$, where $K : \R^m\times \R^m\to \R$ is a function acting on data of dimension $m$, and may be costly to evaluate.
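
As a rough sketch of this cost model, the example below wraps a kernel in a hypothetical `KernelEntryOracle` class (not from any library) that counts entry evaluations. The Gaussian kernel and the `bandwidth` parameter are assumptions chosen for illustration. Forming a few columns of the kernel matrix requires far fewer evaluations than forming the whole matrix.

```python
import numpy as np

class KernelEntryOracle:
    """Hypothetical oracle returning entries K(x_i, x_j) and counting evaluations."""

    def __init__(self, X, bandwidth=1.0):
        self.X = X
        self.bandwidth = bandwidth
        self.num_evals = 0

    def entry(self, i, j):
        self.num_evals += 1
        diff = self.X[i] - self.X[j]
        # Gaussian kernel, assumed here purely for illustration
        return np.exp(-(diff @ diff) / (2 * self.bandwidth**2))

rng = np.random.default_rng(0)
num_points, m = 200, 5
X = rng.standard_normal((num_points, m))
kernel = KernelEntryOracle(X)

# Evaluate only three columns of the kernel matrix
cols = [3, 17, 42]
C = np.array([[kernel.entry(i, j) for j in cols] for i in range(num_points)])

print(kernel.num_evals, "evaluations, versus", num_points**2, "for the full matrix")
```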
