
Potential efficiency improvement of density-fitting (cderi, get_jk) #1915

Open

ajz34 opened this issue Oct 22, 2023 · 7 comments

@ajz34
Contributor

ajz34 commented Oct 22, 2023

Hi devs!

While performing RI-JK (SCF) computations, I found two possible efficiency bottlenecks. These problems may be alleviated by minor code modifications.

Because of some of the remarks below, I don't think I'm ready to propose a pull request for this issue myself.

Coulomb integral not parallelized: numpy.einsum without optimize=True

For the evaluation of Coulomb integrals, code such as

pyscf/pyscf/df/df_jk.py

Lines 260 to 261 in 231da24

rho = numpy.einsum('ix,px->ip', dmtril, eri1)
vj += numpy.einsum('ip,px->ix', rho, eri1)
and

pyscf/pyscf/df/df_jk.py

Lines 290 to 291 in 231da24

rho = numpy.einsum('ix,px->ip', dmtril, eri1)
vj += numpy.einsum('ip,px->ix', rho, eri1)
invokes numpy.einsum, whose optional parameter optimize defaults to False.

With optimize=False, numpy.einsum contracts the tensors or matrices without parallelization. That can be a large efficiency loss when many CPU threads are available.

Simply adding optimize=True to these numpy.einsum calls resolves the problem.

Since much of the code in PySCF uses numpy.einsum without optimize=True, I think this is not only a problem for RI-JK SCF but also for other tasks.
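
As a minimal standalone timing sketch (independent of PySCF; the shapes below are assumptions roughly mimicking dmtril and eri1), the difference can be seen like this:

import time
import numpy as np

# Illustrative shapes only: dmtril is (nset, nao_pair), eri1 is (naux, nao_pair).
nset, naux, nao_pair = 2, 100, 100_000
rng = np.random.default_rng(0)
dmtril = rng.standard_normal((nset, nao_pair))
eri1 = rng.standard_normal((naux, nao_pair))

for opt in (False, True):
    t0 = time.perf_counter()
    # optimize=False uses numpy's own single-threaded einsum loop;
    # optimize=True allows the contraction to be routed through threaded BLAS.
    rho = np.einsum('ix,px->ip', dmtril, eri1, optimize=opt)
    vj = np.einsum('ip,px->ix', rho, eri1, optimize=opt)
    print(f'optimize={opt}: {time.perf_counter() - t0:.3f} s')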

Cholesky decomposed ERI with scipy.linalg.solve_triangular

I found that scipy.linalg.solve_triangular probably just copies or transposes the large 3c-2e integral block ints, i.e. $J_{P, \mu \nu} \to J_{\mu \nu, P}$, before the actual solve. Since NumPy/SciPy often performs such copies or transpositions with only one thread, this step can be extremely slow.

For the evaluation of the Cholesky-decomposed ERI $Y_{Q, \mu \nu}$ (mf.with_df._cderi; only the incore case is discussed for simplicity),
$$\sum_{Q} L_{PQ} Y_{Q, \mu \nu} = J_{P, \mu \nu} \quad \text{or} \quad \mathbf{L} \mathbf{Y} = \mathbf{J}$$
where $J_{P, \mu \nu}$ is the 3c-2e ERI and $L_{PQ}$ is the lower-triangular Cholesky factor of the 2c-2e ERI $J_{PQ}$,
$$\sum_{R} L_{PR} L_{QR} = J_{PQ} \quad \text{or} \quad \mathbf{L} \mathbf{L}^\dagger = \mathbf{J}$$

pyscf/pyscf/df/incore.py

Lines 180 to 188 in 231da24

if decompose_j2c == 'cd':
    if ints.flags.c_contiguous:
        ints = lib.transpose(ints, out=bufs2).T
        bufs1, bufs2 = bufs2, bufs1
    dat = scipy.linalg.solve_triangular(low, ints, lower=True,
                                        overwrite_b=True, check_finite=False)
    if dat.flags.f_contiguous:
        dat = lib.transpose(dat.T, out=bufs2)
    cderi[:,p0:p1] = dat

PySCF's gto.moleintor.getints3c will probably generate ints as a C-contiguous matrix $J_{P, \mu \nu}$, or equivalently $J_{\mu \nu, P}$ as an F-contiguous matrix.

SciPy seems to accept only Fortran-ordered arrays as arguments to its BLAS interface. If the matrix transposition is to be avoided, it is better to pass $J_{\mu \nu, P}$ as an F-contiguous matrix. The problem then becomes solving for $\mathbf{Y}^\dagger$ from $\mathbf{Y}^\dagger \mathbf{L}^\dagger = \mathbf{J}^\dagger$ F-contiguously, and finally transposing $\mathbf{Y}^\dagger$ back to $\mathbf{Y}$ as a C-contiguous matrix.

What is tricky is that scipy.linalg.solve_triangular calls the LAPACK routine dtrtrs, which cannot handle $\mathbf{L}^\dagger$ on the right-hand side. The BLAS routine dtrsm is needed to solve $\mathbf{Y}^\dagger \mathbf{L}^\dagger = \mathbf{J}^\dagger$, but there seems to be no high-level SciPy function wrapping ?trsm. So, to avoid too much data copying and transposition, we may need to call scipy.linalg.blas.dtrsm explicitly:

if decompose_j2c == 'cd':
    if not ints.flags.c_contiguous:
        log.warn("`ints` is not c_contiguous here. Additional transpose is involved.")
        ints = np.array(ints, order='C')
    cderi[:, p0:p1] = scipy.linalg.blas.dtrsm(
        1, low.T, ints.T, side=1, lower=False, overwrite_b=True).T

This code also has some potential problems:

  1. It essentially assumes that ints, as returned (transposed) by gto.moleintor.getints3c, is C-contiguous; I don't know whether there can be exceptions.
  2. It assumes the integrals are of double type.
  3. It avoids the transpose but still requires some data copying. The assignment to cderi[:, p0:p1] also copies data, which seems unavoidable if ints and cderi use different buffers in memory. I think this copy could be avoided as well, but that may require changes to the buffer-assignment logic.
  4. The code above is only for incore.py; I haven't validated the outcore.py case.
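
As a cross-check, here is a small self-contained sketch (independent of PySCF; the names j2c/j3c and the shapes are assumptions) showing that the dtrsm(side=1) route on transposed views reproduces scipy.linalg.solve_triangular:

import numpy as np
import scipy.linalg
from scipy.linalg.blas import dtrsm

naux, nao_pair = 200, 1000
rng = np.random.default_rng(0)

# Fake positive-definite 2c-2e metric J_{PQ} and its lower Cholesky factor L.
a = rng.standard_normal((naux, naux))
j2c = a @ a.T + naux * np.eye(naux)
low = np.linalg.cholesky(j2c)

# Fake 3c-2e block J_{P,munu}, C-contiguous as getints3c would typically return it.
j3c = rng.standard_normal((naux, nao_pair))

# Route 1: solve L Y = J; solve_triangular internally brings the C-contiguous
# right-hand side into Fortran order first (the slow step discussed above).
y_ref = scipy.linalg.solve_triangular(low, j3c, lower=True, check_finite=False)

# Route 2: solve Y^T L^T = J^T with dtrsm(side=1); j3c.T and low.T are
# F-contiguous views, so no transposition of the big array is needed.
y_trsm = dtrsm(1.0, low.T, j3c.T, side=1, lower=False).T

assert np.allclose(y_ref, y_trsm)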

Simple benchmark of the efficiency improvement

C25H52, RI-JK/def2-TZVP (202 electrons, 1087 basis functions, 2811 auxiliary basis functions), Intel 6150 with 32 physical cores, fully incore.
PySCF compiled with MKL (haswell).

| use np.einsum(optimize=True)* | use dtrsm | RI-JK time |
| :---: | :---: | ---: |
| no | no | 80.66 sec |
| yes | no | 48.25 sec |
| yes | yes | 35.64 sec |

* This option was applied globally, meaning that I manually modified the default in numpy/core/einsumfunc.py directly.
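
A less invasive way to apply the same global change, instead of editing numpy/core/einsumfunc.py, might be to monkey-patch numpy.einsum. A rough sketch (with the caveat that code which bound einsum via "from numpy import einsum" before the patch keeps the original function):

import functools
import numpy

_orig_einsum = numpy.einsum

@functools.wraps(_orig_einsum)
def _einsum_with_opt(*args, **kwargs):
    # Default optimize to True unless the caller explicitly set it.
    kwargs.setdefault('optimize', True)
    return _orig_einsum(*args, **kwargs)

numpy.einsum = _einsum_with_opt   # affects subsequent numpy.einsum(...) calls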

benchmark_code.py.txt

@baopengbp

  1. A simple one:
    rho = lib.dot(dmtril, eri1.T)
    vj += lib.dot(rho, eri1)

  2. OpenMP with dtrsm may be faster; ctypes should be used.

@ajz34
Contributor Author

ajz34 commented Nov 3, 2023

@baopengbp

  1. A simple one:
    rho = lib.dot(dmtril, eri1.T)
    vj += lib.dot(rho, eri1)

Yes, exactly 😄

  2. OpenMP with dtrsm may be faster; ctypes should be used.

I guess that's about equally fast. Given F-contiguous matrices, SciPy's wrapper of the BLAS dtrsm (linked against MKL) should be roughly as fast as calling MKL directly, and it is also multi-threaded. So ctypes could be used, but it is certainly not the only option.
I haven't tested the OpenBLAS case.

Though it seems (I used to believe) that SciPy runs dtrtrs or dtrsm single-threaded, that is actually because SciPy first transposes the input matrices to F-contiguous order. The transposition costs a lot of time (it could be accelerated, but SciPy does it single-threaded), and then SciPy performs dtrsm itself fast and multi-threaded.

@ajz34
Contributor Author

ajz34 commented Nov 3, 2023

Additional note for get_k: dsyrk instead of dgemm

This is probably not a large improvement, but it is simple and worth a try. The following code could actually use numpy.dot instead of lib.dot, though it is possible that the latter is more robust for other kinds of matrix multiplication.

vk[k] += lib.dot(buf1.T, buf1)

A possible reason that numpy.dot could be faster than lib.dot here is that NumPy dispatches products of the form $\mathbf{A}^\dagger \mathbf{A}$ to dsyrk instead of dgemm:

https://github.com/numpy/numpy/blob/c35050022fe9e613b316c4d586d59686b9dbc34d/numpy/_core/src/umath/matmul.c.src#L161-L196
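
As an illustration, here is a small standalone sketch (assumed shapes, not PySCF code) of the syrk idea, calling scipy.linalg.blas.dsyrk explicitly for comparison:

import numpy as np
from scipy.linalg.blas import dsyrk

naux_nocc, nao = 400, 300
buf1 = np.random.default_rng(0).standard_normal((naux_nocc, nao))

# Plain matrix product; numpy may recognize the A^T @ A pattern (see the
# numpy source linked above) and dispatch it to syrk.
vk_dot = buf1.T @ buf1

# Explicit dsyrk: trans=1 computes alpha * A^T A and fills only one triangle
# (lower=1 here), roughly half the FLOPs of a general dgemm.
c = dsyrk(1.0, buf1, trans=1, lower=1)
vk_syrk = np.tril(c) + np.tril(c, -1).T   # symmetrize from the lower triangle

assert np.allclose(vk_dot, vk_syrk)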

| use np.einsum(optimize=True)* | use dtrsm | use numpy.dot for get_k | RI-JK time |
| :---: | :---: | :---: | ---: |
| no | no | no | 80.66 sec |
| yes | no | no | 48.25 sec |
| yes | yes | no | 35.64 sec |
| yes | yes | yes | 30.49 sec |

@ajz34 ajz34 changed the title Potential efficiency improvement of density-fitting (cderi, get_j) Potential efficiency improvement of density-fitting (cderi, get_jk) Nov 3, 2023
@baopengbp

That is useful; I saw about a 30% speedup from this dot.

@baopengbp

  2. OpenMP with dtrsm may be faster; ctypes should be used.

I guess that's about equally fast. Given F-contiguous matrices, SciPy's wrapper of the BLAS dtrsm (linked against MKL) should be roughly as fast as calling MKL directly, and it is also multi-threaded. So ctypes could be used, but it is certainly not the only option. I haven't tested the OpenBLAS case.

Though it seems (I used to believe) that SciPy runs dtrtrs or dtrsm single-threaded, that is actually because SciPy first transposes the input matrices to F-contiguous order. The transposition costs a lot of time (it could be accelerated, but SciPy does it single-threaded), and then SciPy performs dtrsm itself fast and multi-threaded.

Yes, it took a long time to transpose the array for dtrsm and back.


@chillenb
Contributor

chillenb commented May 5, 2024

I like the improvements discovered in this thread so much that I made a pull request (#2205). Some differences:

  • the trsm variant of the appropriate type is found using scipy.linalg.get_blas_funcs (a sketch follows this list)
  • it falls back to scipy.linalg.solve_triangular if the input is F-contiguous
  • the outcore path is handled as well
  • I've always found np.matmul to be slightly faster than np.dot for unclear reasons, perhaps some weird OpenMP behavior, so I used it.
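
For reference, a minimal sketch (assumed usage, not the actual code of PR #2205) of selecting the dtype-appropriate ?trsm with scipy.linalg.get_blas_funcs:

import numpy as np
from scipy.linalg import get_blas_funcs

rng = np.random.default_rng(0)
low = np.linalg.cholesky(np.eye(4) + 0.1 * np.ones((4, 4)))   # lower-triangular factor
ints = rng.standard_normal((4, 10))                           # C-contiguous "3c-2e" block

trsm, = get_blas_funcs(('trsm',), (low, ints))   # dtrsm for float64, ztrsm for complex, ...
# Solve the transposed problem Y^T L^T = J^T so that the F-contiguous views
# ints.T and low.T go to BLAS without an explicit transpose.
y = trsm(1.0, low.T, ints.T, side=1, lower=False).T

assert np.allclose(low @ y, ints)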

sunqm pushed a commit that referenced this issue May 23, 2024
* use trsm in df.incore

* use np.matmul in df/df_jk.py

* faster df.outcore with trsm

* avoid numpy mem copy with h5py read_direct

* clarify chained matmuls

* use lib.dot to avoid syrk, df/outcore.py swap buffers