🐛 Describe the bug
I found that `torch.linalg.inv` is relatively slow for many (100,000) small (< 32 × 32) matrices compared to CuPy. The complete benchmark code is uploaded here: https://gist.github.com/yoshipon/dc7e14635d48656c767d47132351eaf6

As an excerpt, `torch.linalg.inv` for `size = (100000, 4, 4)` costs 639 µs with the following code:

```python
%%timeit x = torch.randn(*size, dtype=dtype, device="cuda"); torch.cuda.synchronize()
y = torch.linalg.inv(x)
torch.cuda.synchronize()
```
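The `%%timeit` cell above only runs inside IPython/Jupyter. As a plain-script alternative (a sketch, not the exact benchmark: the batch size is reduced and a CPU fallback is added so it runs anywhere), `torch.utils.benchmark.Timer` gives comparable timings and handles CUDA synchronization automatically:

```python
# Standalone sketch of the benchmark above: times batched torch.linalg.inv
# on many small matrices. Falls back to CPU when no GPU is available.
import torch
import torch.utils.benchmark as benchmark

device = "cuda" if torch.cuda.is_available() else "cpu"
size = (10_000, 4, 4)  # smaller than the 100k batch in the report

x = torch.randn(*size, dtype=torch.float32, device=device)

t = benchmark.Timer(
    stmt="torch.linalg.inv(x)",
    globals={"torch": torch, "x": x},
)
measurement = t.timeit(10)  # Timer synchronizes the CUDA stream for us
print(f"batched inv of {size} on {device}: {measurement.mean * 1e6:.1f} µs")
```

`Timer.timeit` returns a `Measurement` whose `mean` is seconds per run, which makes it easy to compare against the CuPy numbers below.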
On the other hand, CuPy's equivalent costs only 288 µs (about 2× faster) with the following code:

```python
%%timeit x = torch.randn(*size, dtype=dtype, device="cuda"); torch.cuda.synchronize()
y = cupy_inv(x)
torch.cuda.synchronize()
```
Note that `cupy_inv` uses cuBLAS's `*getrfBatched` and `*getriBatched` and is defined as follows:

```python
import cupy as cp
import torch

def cupy_inv(x_):
    x = cp.from_dlpack(x_)
    return torch.from_dlpack(cp.linalg.inv(x))
```
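The helper above relies on DLPack for zero-copy exchange between frameworks. Since CuPy needs a GPU, here is an illustrative sketch of the same round trip using NumPy on CPU (an assumption for portability, requiring numpy >= 1.22 and a recent torch; the mechanics mirror the CuPy version):

```python
# DLPack round trip: torch -> numpy (zero-copy) -> invert -> torch.
import numpy as np
import torch

def numpy_inv(x_: torch.Tensor) -> torch.Tensor:
    x = np.from_dlpack(x_)               # zero-copy view of the torch tensor
    return torch.from_dlpack(np.linalg.inv(x))

x = torch.randn(1000, 4, 4, dtype=torch.float64)
y = numpy_inv(x)

# The batched product x @ y should be close to the identity for each matrix.
eye = torch.eye(4, dtype=torch.float64).expand_as(x)
print(torch.allclose(x @ y, eye, atol=1e-6))
```

Because both directions go through DLPack, no data is copied on the way in; only `np.linalg.inv` allocates a new array, exactly as `cp.linalg.inv` does in the CUDA version.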
I would appreciate it if `torch.linalg.inv` could be sped up, because it is quite important in my research field of multichannel audio signal processing. Multichannel (microphone array) signal processing is important for many audio applications, including distant speech recognition [1], speech enhancement [2], and source separation [3].

[1] T. Ochiai, et al. "Multichannel end-to-end speech recognition." ICML, 2017.
[2] L. Drude, et al. "Unsupervised training of neural mask-based beamforming." INTERSPEECH, 2019.
[3] Y. Bando, et al. "Neural full-rank spatial covariance analysis for blind source separation." IEEE SP Letters, 2021.

Because we usually invert more than 100K small matrices per training sample in a forward pass, this operation often becomes a bottleneck in training speed.
Versions
```
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31
Python version: 3.9.13 (main, Aug 25 2022, 23:26:10) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-862.el7.x86_64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy==0.991
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.5
[pip3] numpydoc==1.4.0
[pip3] pytorch-ignite==0.4.10
[pip3] torch==1.13.1+cu117
[pip3] torchaudio==0.13.1+cu117
[pip3] torchvision==0.14.1+cu117
[conda] blas 1.0 mkl
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py39h7f8727e_0
[conda] mkl_fft 1.3.1 py39hd3c417c_0
[conda] mkl_random 1.2.2 py39h51133e4_0
[conda] numpy 1.21.5 py39h6c91a56_3
[conda] numpy-base 1.21.5 py39ha15fc14_3
[conda] numpydoc 1.4.0 py39h06a4308_0
[conda] pytorch-ignite 0.4.10 pypi_0 pypi
[conda] torch 1.13.1+cu117 pypi_0 pypi
[conda] torchaudio 0.13.1+cu117 pypi_0 pypi
[conda] torchvision 0.14.1+cu117 pypi_0 pypi
```
cc @ngimel @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @lezcano