Potential efficiency improvement of density-fitting (cderi, get_jk) #1915
Comments
Yes, exactly 😄
I guess that's equally fast. Given F-contiguous matrices, scipy's wrapper of LAPACK's … though it seems (I used to believe) that scipy runs …
Additional note for get_k: dsyrk instead of dgemm. This seems not a great improvement, but it's simple and probably worth a try. The following code can actually utilize dsyrk:
Line 304 in 14d8882
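A rough sketch of the dsyrk idea (the shapes and the name `b` are illustrative assumptions, not PySCF's actual buffers): since the K build contracts a tensor with its own transpose, `dsyrk` computes only one triangle of the symmetric result, at roughly half the FLOPs of `dgemm`.

```python
import numpy as np
from scipy.linalg import blas

rng = np.random.default_rng(0)
naux, nao = 10, 6
# Hypothetical stand-in for a half-transformed 3-index tensor B_{P,mu}
b = rng.standard_normal((naux, nao))

# dgemm route: K = B^T B (both triangles computed)
k_gemm = b.T @ b

# dsyrk route: trans=1 computes B^T B, lower=1 fills only the lower triangle
k_syrk = blas.dsyrk(1.0, b, trans=1, lower=1)
# reconstruct the full symmetric matrix from the lower triangle
k_syrk = k_syrk + np.tril(k_syrk, -1).T

assert np.allclose(k_gemm, k_syrk)
```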
Possible reason that …
That is useful; I saw about a 30% speedup for this dot.
Yes, it took a long time to transpose the array for dtrsm and back.
I like the improvements discovered in this thread so much that I made a pull request (#2205). Some differences:
Hi devs!
While performing RI-JK (SCF) computations, I found two possible efficiency bottlenecks. These problems may be alleviated by minor code modifications.
Given some remarks below, I don't think I'm ready to propose a pull request for this issue.
Coulomb integral not parallelized: numpy.einsum without optimize=True

For evaluation of Coulomb integrals, code like

pyscf/pyscf/df/df_jk.py
Lines 260 to 261 in 231da24

pyscf/pyscf/df/df_jk.py
Lines 290 to 291 in 231da24

uses numpy.einsum, which has the optional parameter optimize (default False). With optimize=False, numpy.einsum just contracts tensors or matrices without parallelism, which can be a great efficiency loss when using many CPU threads. Simply adding optimize=True to the numpy.einsum calls will resolve this problem. Since much code in PySCF uses numpy.einsum without optimize=True, I think this is not only a problem for RI-JK SCF, but for other tasks as well.
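As a small illustration of the point above (the array names and shapes here are made up, not PySCF's actual variables):

```python
import numpy as np

rng = np.random.default_rng(0)
naux, nao = 20, 30
eri1 = rng.standard_normal((naux, nao, nao))   # stand-in for 3c-2e integrals
dm = rng.standard_normal((nao, nao))           # stand-in for a density matrix

# optimize=False (the default): a single-threaded elementwise contraction loop
rho_slow = np.einsum('Pij,ji->P', eri1, dm)

# optimize=True: the contraction is dispatched to (threaded) BLAS where possible
rho_fast = np.einsum('Pij,ji->P', eri1, dm, optimize=True)

assert np.allclose(rho_slow, rho_fast)
```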
Cholesky decomposed ERI with scipy.linalg.solve_triangular
I found that, probably, scipy.linalg.solve_triangular just copies or transposes the large 3c-2e integrals ints from $J_{P, \mu \nu}$ to $J_{\mu \nu, P}$. Since NumPy/SciPy often performs copies or transpositions with only one thread, this can be extremely slow.
For evaluation of the Cholesky-decomposed ERI $Y_{Q, \mu \nu}$ (mf.with_df._cderi; I discuss the incore case for simplicity),

$$\sum_{Q} L_{PQ} Y_{Q, \mu \nu} = J_{P, \mu \nu} \text{ or } \mathbf{L} \mathbf{Y} = \mathbf{J}$$

where $L_{PQ}$ is the lower triangular matrix of the Cholesky-decomposed 2c-2e ERI $J_{PQ}$,

$$\sum_{R} L_{PR} L_{QR} = J_{PQ} \text{ or } \mathbf{L} \mathbf{L}^\dagger = \mathbf{J}$$

and $J_{P, \mu \nu}$ is the 3c-2e ERI:

pyscf/pyscf/df/incore.py
Lines 180 to 188 in 231da24
PySCF's gto.moleintor.getints3c will probably generate ints as $J_{P, \mu \nu}$, a C-contiguous matrix, or equivalently $J_{\mu \nu, P}$ as an F-contiguous matrix.
SciPy seems to only accept Fortran-ordered arrays as arguments to its BLAS interface. If matrix transposition is to be avoided, then $J_{\mu \nu, P}$, as an F-contiguous matrix, is the better argument to pass. Our problem is then to solve for $\mathbf{Y}^\dagger$ from $\mathbf{Y}^\dagger \mathbf{L}^\dagger = \mathbf{J}^\dagger$ F-contiguously, and then transpose $\mathbf{Y}^\dagger$ to $\mathbf{Y}$ as a C-contiguous matrix.
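To illustrate why passing the transpose is attractive (a trivial sketch; names and shapes are made up): transposing a NumPy array only creates a view on the same buffer, so the F-contiguous $J_{\mu \nu, P}$ is available for free from the C-contiguous $J_{P, \mu \nu}$.

```python
import numpy as np

naux, npair = 4, 6
j = np.empty((naux, npair))   # C-contiguous J_{P,munu}
jt = j.T                      # J_{munu,P}: an F-contiguous *view*, no copy

assert j.flags.c_contiguous
assert jt.flags.f_contiguous
assert jt.base is j           # same buffer; the transposition is free
```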
What's tricky is that scipy.linalg.solve_triangular calls the LAPACK function dtrtrs, which cannot handle $\mathbf{L}^\dagger$ lying on the right side. The BLAS function dtrsm is required to solve $\mathbf{Y}^\dagger \mathbf{L}^\dagger = \mathbf{J}^\dagger$; however, there seems to be no function in scipy wrapping ?trsm as a high-level API. So, to avoid too much data copying and transposition, we may need to explicitly call scipy.linalg.blas.dtrsm:
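A minimal sketch of such an explicit dtrsm call (matrix sizes and the names low, ints, cderi are illustrative assumptions, not PySCF's actual buffers):

```python
import numpy as np
import scipy.linalg
from scipy.linalg import blas

rng = np.random.default_rng(0)
naux, npair = 6, 8

# Hypothetical 2c-2e ERI J_PQ (symmetric positive definite) and its Cholesky factor L
j2c = rng.standard_normal((naux, naux))
j2c = j2c @ j2c.T + naux * np.eye(naux)
low = np.linalg.cholesky(j2c)

# Stand-in for the 3c-2e integrals J_{P,munu}, C-contiguous as getints3c would give
ints = rng.standard_normal((naux, npair))

# ints.T is an F-contiguous view J_{munu,P}, so BLAS can use it without a copy.
# side=1: triangular matrix on the right; trans_a=1: use L^T; lower=1: L is lower.
# This solves X L^T = J^T, i.e. X = Y^T where L Y = J.
yt = blas.dtrsm(1.0, low, ints.T, side=1, lower=1, trans_a=1)
cderi = yt.T

# Reference solution via the high-level API
ref = scipy.linalg.solve_triangular(low, ints, lower=True)
assert np.allclose(cderi, ref)
```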
This code also has some potential problems:
- The result ints, as transposed from gto.moleintor.getints3c, is C-contiguous; but I don't know whether there could be any exceptions.
- cderi[:, p0:p1] also requires a data copy, which seems unavoidable if ints and cderi use different buffers in memory. I think this data copy could be avoided, but that may involve some changes to the buffer-assignment logic.

Simple benchmark efficiency improvement
C25H52, RI-JK/def2-TZVP (202 electrons, 1087 basis functions, 2811 auxiliary basis functions), Intel 6150 with 32 physical cores, fully incore
PySCF compiled with MKL, haswell
np.einsum(optimize=True) *
dtrsm

* This option is applied globally, meaning that I manually modified the signature in numpy/core/einsumfunc.py directly.
benchmark_code.py.txt