
Potential efficiency improvement of density-fitting (cderi, get_jk) #1915

Open

ajz34 opened this issue Oct 22, 2023 · 7 comments

@ajz34
Contributor

ajz34 commented Oct 22, 2023

Hi devs!

While performing RI-JK (SCF) computations, I found two possible efficiency bottlenecks. These problems may be alleviated by minor code modifications.

Because of some of the remarks below, I don't think I'm ready to propose a pull request for this issue myself.

Coulomb integral not parallelized: numpy.einsum without optimize=True

For the evaluation of Coulomb integrals, code such as

pyscf/pyscf/df/df_jk.py

Lines 260 to 261 in 231da24

rho = numpy.einsum('ix,px->ip', dmtril, eri1)
vj += numpy.einsum('ip,px->ix', rho, eri1)
and

pyscf/pyscf/df/df_jk.py

Lines 290 to 291 in 231da24

rho = numpy.einsum('ix,px->ip', dmtril, eri1)
vj += numpy.einsum('ip,px->ix', rho, eri1)
invokes numpy.einsum, whose optional parameter optimize defaults to False.

With optimize=False, numpy.einsum contracts the tensors or matrices without parallelization. That can be a large efficiency loss when many CPU threads are available.

Simply adding optimize=True to these numpy.einsum calls resolves the problem.

Since much of the code in PySCF uses numpy.einsum without optimize=True, I think this is not only a problem for RI-JK SCF but also for other tasks.
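
As a minimal standalone timing sketch (independent of PySCF; the shapes below are assumptions roughly mimicking dmtril and eri1), the difference can be seen like this:

import time
import numpy as np

# Illustrative shapes only: dmtril is (nset, nao_pair), eri1 is (naux, nao_pair).
nset, naux, nao_pair = 2, 100, 100_000
rng = np.random.default_rng(0)
dmtril = rng.standard_normal((nset, nao_pair))
eri1 = rng.standard_normal((naux, nao_pair))

for opt in (False, True):
    t0 = time.perf_counter()
    # optimize=False uses numpy's own single-threaded einsum loop;
    # optimize=True allows the contraction to be routed through threaded BLAS.
    rho = np.einsum('ix,px->ip', dmtril, eri1, optimize=opt)
    vj = np.einsum('ip,px->ix', rho, eri1, optimize=opt)
    print(f'optimize={opt}: {time.perf_counter() - t0:.3f} s')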

Cholesky decomposed ERI with scipy.linalg.solve_triangular

I found that scipy.linalg.solve_triangular probably just copies or transposes the large 3c-2e integral block ints, i.e. $J_{P, \mu \nu} \to J_{\mu \nu, P}$, before the actual solve. Since NumPy/SciPy often performs such copies or transpositions with only one thread, this step can be extremely slow.

For the evaluation of the Cholesky-decomposed ERI $Y_{Q, \mu \nu}$ (mf.with_df._cderi; only the incore case is discussed for simplicity),
$$\sum_{Q} L_{PQ} Y_{Q, \mu \nu} = J_{P, \mu \nu} \quad \text{or} \quad \mathbf{L} \mathbf{Y} = \mathbf{J}$$
where $J_{P, \mu \nu}$ is the 3c-2e ERI and $L_{PQ}$ is the lower-triangular Cholesky factor of the 2c-2e ERI $J_{PQ}$,
$$\sum_{R} L_{PR} L_{QR} = J_{PQ} \quad \text{or} \quad \mathbf{L} \mathbf{L}^\dagger = \mathbf{J}$$

pyscf/pyscf/df/incore.py

Lines 180 to 188 in 231da24

if decompose_j2c == 'cd':
    if ints.flags.c_contiguous:
        ints = lib.transpose(ints, out=bufs2).T
        bufs1, bufs2 = bufs2, bufs1
    dat = scipy.linalg.solve_triangular(low, ints, lower=True,
                                        overwrite_b=True, check_finite=False)
    if dat.flags.f_contiguous:
        dat = lib.transpose(dat.T, out=bufs2)
    cderi[:,p0:p1] = dat

PySCF's gto.moleintor.getints3c will probably generate ints as a C-contiguous matrix $J_{P, \mu \nu}$, or equivalently $J_{\mu \nu, P}$ as an F-contiguous matrix.

SciPy seems to accept only Fortran-ordered arrays as arguments to its BLAS interface. If the matrix transposition is to be avoided, it is better to pass $J_{\mu \nu, P}$ as an F-contiguous matrix. The problem then becomes solving for $\mathbf{Y}^\dagger$ from $\mathbf{Y}^\dagger \mathbf{L}^\dagger = \mathbf{J}^\dagger$ F-contiguously, and finally transposing $\mathbf{Y}^\dagger$ back to $\mathbf{Y}$ as a C-contiguous matrix.

What is tricky is that scipy.linalg.solve_triangular calls the LAPACK routine dtrtrs, which cannot handle $\mathbf{L}^\dagger$ on the right-hand side. The BLAS routine dtrsm is needed to solve $\mathbf{Y}^\dagger \mathbf{L}^\dagger = \mathbf{J}^\dagger$, but there seems to be no high-level SciPy function wrapping ?trsm. So, to avoid too much data copying and transposition, we may need to call scipy.linalg.blas.dtrsm explicitly:

if decompose_j2c == 'cd':
    if not ints.flags.c_contiguous:
        log.warn("`ints` is not c_contiguous here. Additional transpose is involved.")
        ints = np.array(ints, order='C')
    cderi[:, p0:p1] = scipy.linalg.blas.dtrsm(
        1, low.T, ints.T, side=1, lower=False, overwrite_b=True).T

This code also has some potential problems:

  1. It essentially assumes that ints, as returned (transposed) by gto.moleintor.getints3c, is C-contiguous; I don't know whether there can be exceptions.
  2. It assumes the integrals are of double type.
  3. It avoids the transpose but still requires some data copying. The assignment to cderi[:, p0:p1] also copies data, which seems unavoidable if ints and cderi use different buffers in memory. I think this copy could be avoided as well, but that may require changes to the buffer-assignment logic.
  4. The code above is only for incore.py; I haven't validated the outcore.py case.
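
As a cross-check, here is a small self-contained sketch (independent of PySCF; the names j2c/j3c and the shapes are assumptions) showing that the dtrsm(side=1) route on transposed views reproduces scipy.linalg.solve_triangular:

import numpy as np
import scipy.linalg
from scipy.linalg.blas import dtrsm

naux, nao_pair = 200, 1000
rng = np.random.default_rng(0)

# Fake positive-definite 2c-2e metric J_{PQ} and its lower Cholesky factor L.
a = rng.standard_normal((naux, naux))
j2c = a @ a.T + naux * np.eye(naux)
low = np.linalg.cholesky(j2c)

# Fake 3c-2e block J_{P,munu}, C-contiguous as getints3c would typically return it.
j3c = rng.standard_normal((naux, nao_pair))

# Route 1: solve L Y = J; solve_triangular internally brings the C-contiguous
# right-hand side into Fortran order first (the slow step discussed above).
y_ref = scipy.linalg.solve_triangular(low, j3c, lower=True, check_finite=False)

# Route 2: solve Y^T L^T = J^T with dtrsm(side=1); j3c.T and low.T are
# F-contiguous views, so no transposition of the big array is needed.
y_trsm = dtrsm(1.0, low.T, j3c.T, side=1, lower=False).T

assert np.allclose(y_ref, y_trsm)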

Simple benchmark of the efficiency improvement

C25H52, RI-JK/def2-TZVP (202 electrons, 1087 basis functions, 2811 auxiliary basis functions), Intel 6150 with 32 physical cores, fully incore.
PySCF compiled with MKL (haswell).

| use np.einsum(optimize=True)* | use dtrsm | RI-JK time |
| :---: | :---: | ---: |
| no | no | 80.66 sec |
| yes | no | 48.25 sec |
| yes | yes | 35.64 sec |

* This option was applied globally, meaning that I manually modified the default in numpy/core/einsumfunc.py directly.
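
A less invasive way to apply the same global change, instead of editing numpy/core/einsumfunc.py, might be to monkey-patch numpy.einsum. A rough sketch (with the caveat that code which bound einsum via "from numpy import einsum" before the patch keeps the original function):

import functools
import numpy

_orig_einsum = numpy.einsum

@functools.wraps(_orig_einsum)
def _einsum_with_opt(*args, **kwargs):
    # Default optimize to True unless the caller explicitly set it.
    kwargs.setdefault('optimize', True)
    return _orig_einsum(*args, **kwargs)

numpy.einsum = _einsum_with_opt   # affects subsequent numpy.einsum(...) calls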

benchmark_code.py.txt

@baopengbp

  1. A simple one:
    rho = lib.dot(dmtril, eri1.T)
    vj += lib.dot(rho, eri1)

  2. OpenMP with dtrsm may be faster; ctypes should be used.

@ajz34
Contributor Author

ajz34 commented Nov 3, 2023

@baopengbp

  1. A simple one:
    rho = lib.dot(dmtril, eri1.T)
    vj += lib.dot(rho, eri1)

Yes, exactly 😄

  2. OpenMP with dtrsm may be faster; ctypes should be used.

I guess that's about equally fast. Given F-contiguous matrices, SciPy's wrapper of the BLAS dtrsm (linked against MKL) should be roughly as fast as calling MKL directly, and it is also multi-threaded. So ctypes could be used, but it is certainly not the only option.
I haven't tested the OpenBLAS case.

Though it seems (I used to believe) that SciPy runs dtrtrs or dtrsm single-threaded, that is actually because SciPy first transposes the input matrices to F-contiguous order. The transposition costs a lot of time (it could be accelerated, but SciPy does it single-threaded), and then SciPy performs dtrsm itself fast and multi-threaded.

@ajz34
Contributor Author

ajz34 commented Nov 3, 2023

Additional note for get_k: dsyrk instead of dgemm

This is probably not a large improvement, but it is simple and worth a try. The following code could actually use numpy.dot instead of lib.dot, though it is possible that the latter is more robust for other kinds of matrix multiplication.

vk[k] += lib.dot(buf1.T, buf1)

A possible reason that numpy.dot could be faster than lib.dot here is that NumPy dispatches products of the form $\mathbf{A}^\dagger \mathbf{A}$ to dsyrk instead of dgemm:

https://github.com/numpy/numpy/blob/c35050022fe9e613b316c4d586d59686b9dbc34d/numpy/_core/src/umath/matmul.c.src#L161-L196
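
As an illustration, here is a small standalone sketch (assumed shapes, not PySCF code) of the syrk idea, calling scipy.linalg.blas.dsyrk explicitly for comparison:

import numpy as np
from scipy.linalg.blas import dsyrk

naux_nocc, nao = 400, 300
buf1 = np.random.default_rng(0).standard_normal((naux_nocc, nao))

# Plain matrix product; numpy may recognize the A^T @ A pattern (see the
# numpy source linked above) and dispatch it to syrk.
vk_dot = buf1.T @ buf1

# Explicit dsyrk: trans=1 computes alpha * A^T A and fills only one triangle
# (lower=1 here), roughly half the FLOPs of a general dgemm.
c = dsyrk(1.0, buf1, trans=1, lower=1)
vk_syrk = np.tril(c) + np.tril(c, -1).T   # symmetrize from the lower triangle

assert np.allclose(vk_dot, vk_syrk)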

| use np.einsum(optimize=True)* | use dtrsm | use numpy.dot for get_k | RI-JK time |
| :---: | :---: | :---: | ---: |
| no | no | no | 80.66 sec |
| yes | no | no | 48.25 sec |
| yes | yes | no | 35.64 sec |
| yes | yes | yes | 30.49 sec |

@ajz34 ajz34 changed the title Potential efficiency improvement of density-fitting (cderi, get_j) Potential efficiency improvement of density-fitting (cderi, get_jk) Nov 3, 2023
@baopengbp

That is useful; I saw about a 30% speedup from this dot.

@baopengbp

  2. OpenMP with dtrsm may be faster; ctypes should be used.

I guess that's about equally fast. Given F-contiguous matrices, SciPy's wrapper of the BLAS dtrsm (linked against MKL) should be roughly as fast as calling MKL directly, and it is also multi-threaded. So ctypes could be used, but it is certainly not the only option. I haven't tested the OpenBLAS case.

Though it seems (I used to believe) that SciPy runs dtrtrs or dtrsm single-threaded, that is actually because SciPy first transposes the input matrices to F-contiguous order. The transposition costs a lot of time (it could be accelerated, but SciPy does it single-threaded), and then SciPy performs dtrsm itself fast and multi-threaded.

Yes, it took a long time to transpose the array for dtrsm and back.


@chillenb
Contributor

chillenb commented May 5, 2024

I like the improvements discovered in this thread so much that I made a pull request (#2205). Some differences:

  • the trsm variant of the appropriate type is found using scipy.linalg.get_blas_funcs (a sketch follows this list)
  • it falls back to scipy.linalg.solve_triangular if the input is F-contiguous
  • the outcore path is handled as well
  • I've always found np.matmul to be slightly faster than np.dot for unclear reasons, perhaps some weird OpenMP behavior, so I used it.
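
For reference, a minimal sketch (assumed usage, not the actual code of PR #2205) of selecting the dtype-appropriate ?trsm with scipy.linalg.get_blas_funcs:

import numpy as np
from scipy.linalg import get_blas_funcs

rng = np.random.default_rng(0)
low = np.linalg.cholesky(np.eye(4) + 0.1 * np.ones((4, 4)))   # lower-triangular factor
ints = rng.standard_normal((4, 10))                           # C-contiguous "3c-2e" block

trsm, = get_blas_funcs(('trsm',), (low, ints))   # dtrsm for float64, ztrsm for complex, ...
# Solve the transposed problem Y^T L^T = J^T so that the F-contiguous views
# ints.T and low.T go to BLAS without an explicit transpose.
y = trsm(1.0, low.T, ints.T, side=1, lower=False).T

assert np.allclose(low @ y, ints)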

sunqm pushed a commit that referenced this issue May 23, 2024
* use trsm in df.incore

* use np.matmul in df/df_jk.py

* faster df.outcore with trsm

* avoid numpy mem copy with h5py read_direct

* clarify chained matmuls

* use lib.dot to avoid syrk, df/outcore.py swap buffers