
The speed of matrix inversion is relatively slow for many small matrices #91536

Open

Reported by @yoshipon

Description
🐛 Describe the bug

I found that torch.linalg.inv is relatively slow for large batches (e.g., 100,000) of small (< 32 × 32) matrices compared to CuPy.

The complete benchmark code is uploaded here: https://gist.github.com/yoshipon/dc7e14635d48656c767d47132351eaf6
As an excerpt, torch.linalg.inv for size = (100000, 4, 4) takes 639 µs with the following code:

%%timeit x = torch.randn(*size, dtype=dtype, device="cuda"); torch.cuda.synchronize()
y = torch.linalg.inv(x)
torch.cuda.synchronize()
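For reference, a similar measurement can be scripted outside IPython with torch.utils.benchmark, which handles CUDA synchronization for you. This is a minimal sketch, not the reporter's original benchmark: it uses a smaller batch than the 100,000 in the gist and falls back to CPU so it stays runnable on machines without a GPU.

```python
import torch
import torch.utils.benchmark as benchmark

# Fall back to CPU when no GPU is present so the sketch stays runnable;
# the numbers in the report were measured on a Tesla V100.
device = "cuda" if torch.cuda.is_available() else "cpu"

# A smaller batch than the report's 100,000, for a quick check.
x = torch.randn(1000, 4, 4, device=device)

timer = benchmark.Timer(
    stmt="torch.linalg.inv(x)",
    globals={"torch": torch, "x": x},
)
print(timer.timeit(100))  # prints median/IQR timing statistics
```

torch.utils.benchmark.Timer inserts the necessary torch.cuda.synchronize() calls itself, so the manual synchronization in the %%timeit cells above is not needed here.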

On the other hand, the CuPy-based version takes only 288 µs (roughly 2× faster) with the following code:

%%timeit x = torch.randn(*size, dtype=dtype, device="cuda"); torch.cuda.synchronize()
y = cupy_inv(x)
torch.cuda.synchronize()

Note that cupy_inv uses the *getrfBatched and *getriBatched routines of cuBLAS and is defined as follows:

import cupy as cp
import torch

def cupy_inv(x_):
    # Zero-copy exchange between PyTorch and CuPy via DLPack.
    x = cp.from_dlpack(x_)
    return torch.from_dlpack(cp.linalg.inv(x))

I would appreciate it if torch.linalg.inv could be sped up, because it is quite important for my research field of multichannel audio signal processing.
Multichannel (microphone array) signal processing is important for many audio applications, including distant speech recognition [1], speech enhancement [2], and source separation [3].
Because we usually invert more than 100K small matrices per training sample in a forward pass, this often becomes a bottleneck in training speed.

[1] T. Ochiai, et al. "Multichannel end-to-end speech recognition." ICML, 2017.
[2] L. Drude, et al. "Unsupervised training of neural mask-based beamforming." INTERSPEECH, 2019.
[3] Y. Bando, et al. "Neural full-rank spatial covariance analysis for blind source separation." IEEE SP Letters, 2021.
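Until the batched path is optimized, one possible workaround (a sketch, not something the report proposes; the helper name inv_via_solve is hypothetical) is to compute the inverse by solving A X = I with torch.linalg.solve. Whether this is faster than torch.linalg.inv depends on the backend's batched LU kernels, so it should be benchmarked on the target hardware.

```python
import torch

def inv_via_solve(a: torch.Tensor) -> torch.Tensor:
    """Batched inverse computed by solving A X = I.

    A hypothetical alternative to torch.linalg.inv for (B, n, n) inputs;
    benchmark before relying on it.
    """
    n = a.shape[-1]
    # Expand the identity over the batch dimension explicitly, so that
    # torch.linalg.solve treats it as a batch of (n, n) right-hand sides.
    eye = torch.eye(n, dtype=a.dtype, device=a.device).expand(a.shape)
    return torch.linalg.solve(a, eye)
```

For spatial covariance matrices, which are Hermitian positive definite, torch.linalg.cholesky followed by torch.cholesky_inverse may be another option worth profiling, since it exploits the structure that a general LU-based inverse ignores.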

Versions

PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.9.13 (main, Aug 25 2022, 23:26:10) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-862.el7.x86_64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB

Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy==0.991
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.5
[pip3] numpydoc==1.4.0
[pip3] pytorch-ignite==0.4.10
[pip3] torch==1.13.1+cu117
[pip3] torchaudio==0.13.1+cu117
[pip3] torchvision==0.14.1+cu117
[conda] blas 1.0 mkl
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py39h7f8727e_0
[conda] mkl_fft 1.3.1 py39hd3c417c_0
[conda] mkl_random 1.2.2 py39h51133e4_0
[conda] numpy 1.21.5 py39h6c91a56_3
[conda] numpy-base 1.21.5 py39ha15fc14_3
[conda] numpydoc 1.4.0 py39h06a4308_0
[conda] pytorch-ignite 0.4.10 pypi_0 pypi
[conda] torch 1.13.1+cu117 pypi_0 pypi
[conda] torchaudio 0.13.1+cu117 pypi_0 pypi
[conda] torchvision 0.14.1+cu117 pypi_0 pypi

cc @ngimel @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @lezcano

Labels

    module: cuda - Related to torch.cuda, and CUDA support in general
    module: linear algebra - Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply (matmul)
    module: performance - Issues related to performance, either of kernel code or framework glue
    triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
