
The speed of matrix inversion is relatively slow for many small matrices #91536

Open

Reported by @yoshipon

Description
🐛 Describe the bug

I found that torch.linalg.inv is relatively slow for large batches (e.g., 100,000) of small (< 32 × 32) matrices compared to CuPy.

The complete benchmark code is uploaded here: https://gist.github.com/yoshipon/dc7e14635d48656c767d47132351eaf6
As an excerpt, torch.linalg.inv for size = (100000, 4, 4) takes 639 µs with the following code:

%%timeit x = torch.randn(*size, dtype=dtype, device="cuda"); torch.cuda.synchronize()
y = torch.linalg.inv(x)
torch.cuda.synchronize()
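For reference, a similar measurement can be scripted outside IPython with torch.utils.benchmark, which handles CUDA synchronization for you. This is a minimal sketch, not the reporter's original benchmark: it uses a smaller batch than the 100,000 in the gist and falls back to CPU so it stays runnable on machines without a GPU.

```python
import torch
import torch.utils.benchmark as benchmark

# Fall back to CPU when no GPU is present so the sketch stays runnable;
# the numbers in the report were measured on a Tesla V100.
device = "cuda" if torch.cuda.is_available() else "cpu"

# A smaller batch than the report's 100,000, for a quick check.
x = torch.randn(1000, 4, 4, device=device)

timer = benchmark.Timer(
    stmt="torch.linalg.inv(x)",
    globals={"torch": torch, "x": x},
)
print(timer.timeit(100))  # prints median/IQR timing statistics
```

torch.utils.benchmark.Timer inserts the necessary torch.cuda.synchronize() calls itself, so the manual synchronization in the %%timeit cells above is not needed here.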

On the other hand, the CuPy-based version takes only 288 µs (roughly 2× faster) with the following code:

%%timeit x = torch.randn(*size, dtype=dtype, device="cuda"); torch.cuda.synchronize()
y = cupy_inv(x)
torch.cuda.synchronize()

Note that cupy_inv uses the *getrfBatched and *getriBatched routines of cuBLAS and is defined as follows:

import cupy as cp
import torch

def cupy_inv(x_):
    # Zero-copy exchange between PyTorch and CuPy via DLPack.
    x = cp.from_dlpack(x_)
    return torch.from_dlpack(cp.linalg.inv(x))

I would appreciate it if torch.linalg.inv could be sped up, because it is quite important for my research field of multichannel audio signal processing.
Multichannel (microphone array) signal processing is important for many audio applications, including distant speech recognition [1], speech enhancement [2], and source separation [3].
Because we usually invert more than 100K small matrices per training sample in a forward pass, this often becomes a bottleneck in training speed.

[1] T. Ochiai, et al. "Multichannel end-to-end speech recognition." ICML, 2017.
[2] L. Drude, et al. "Unsupervised training of neural mask-based beamforming." INTERSPEECH, 2019.
[3] Y. Bando, et al. "Neural full-rank spatial covariance analysis for blind source separation." IEEE SP Letters, 2021.
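Until the batched path is optimized, one possible workaround (a sketch, not something the report proposes; the helper name inv_via_solve is hypothetical) is to compute the inverse by solving A X = I with torch.linalg.solve. Whether this is faster than torch.linalg.inv depends on the backend's batched LU kernels, so it should be benchmarked on the target hardware.

```python
import torch

def inv_via_solve(a: torch.Tensor) -> torch.Tensor:
    """Batched inverse computed by solving A X = I.

    A hypothetical alternative to torch.linalg.inv for (B, n, n) inputs;
    benchmark before relying on it.
    """
    n = a.shape[-1]
    # Expand the identity over the batch dimension explicitly, so that
    # torch.linalg.solve treats it as a batch of (n, n) right-hand sides.
    eye = torch.eye(n, dtype=a.dtype, device=a.device).expand(a.shape)
    return torch.linalg.solve(a, eye)
```

For spatial covariance matrices, which are Hermitian positive definite, torch.linalg.cholesky followed by torch.cholesky_inverse may be another option worth profiling, since it exploits the structure that a general LU-based inverse ignores.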

Versions

PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.9.13 (main, Aug 25 2022, 23:26:10) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-862.el7.x86_64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB

Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy==0.991
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.5
[pip3] numpydoc==1.4.0
[pip3] pytorch-ignite==0.4.10
[pip3] torch==1.13.1+cu117
[pip3] torchaudio==0.13.1+cu117
[pip3] torchvision==0.14.1+cu117
[conda] blas 1.0 mkl
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py39h7f8727e_0
[conda] mkl_fft 1.3.1 py39hd3c417c_0
[conda] mkl_random 1.2.2 py39h51133e4_0
[conda] numpy 1.21.5 py39h6c91a56_3
[conda] numpy-base 1.21.5 py39ha15fc14_3
[conda] numpydoc 1.4.0 py39h06a4308_0
[conda] pytorch-ignite 0.4.10 pypi_0 pypi
[conda] torch 1.13.1+cu117 pypi_0 pypi
[conda] torchaudio 0.13.1+cu117 pypi_0 pypi
[conda] torchvision 0.14.1+cu117 pypi_0 pypi

cc @ngimel @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @lezcano

Labels

    module: cuda - Related to torch.cuda, and CUDA support in general
    module: linear algebra - Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply (matmul)
    module: performance - Issues related to performance, either of kernel code or framework glue
    triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
