DFT running slow in SLURM system #1956

Open
ftsong22 opened this issue Nov 14, 2023 · 10 comments

Comments

@ftsong22

I am running a demo DFT calculation on a small molecule on a SLURM cluster. Here is the Python script, dft.py:

from pyscf import gto, dft
import numpy as np
from time import perf_counter

def run():
    xyz = """
    C     0.0324075417    1.5597651675    0.0173503882
    C     0.0165327393    0.0274410148    0.0305054658
    C     0.7401306193   -0.5295158287    1.2618054783
    C    -1.4291837338   -0.5260272859   -0.0298377316
    C    -2.1505122569   -0.1629995548   -1.2436891238
    C    -2.7262202740    0.1392620880   -2.2551761211
    H     1.0589579093    1.9399387642    0.0096348267
    H    -0.4845464252    1.9534338078   -0.8620306386
    H    -0.4643663260    1.9623774373    0.9087359643
    H     0.5364722319   -0.3291892511   -0.8679611487
    H     0.2428191250   -0.2118700025    2.1862800077
    H     0.7634809319   -1.6244926985    1.2541585815
    H     1.7738173614   -0.1720416972    1.3033542233
    H    -1.3996184762   -1.6199713974    0.0562607039
    H    -1.9858455552   -0.1676037942    0.8474974509
    H    -3.2444660827    0.4018884507   -3.1437253070
    """
    begin = perf_counter()
    mol = gto.Mole()
    mol.atom = xyz
    mol.basis = 'ccpvtz'
    mol.symmetry = False
    mol.max_memory = 12000
    mol.verbose = 4
    mol.build()
    mf = dft.RKS(mol).density_fit()
    mf.grids.level = 4
    mf.xc = 'PBE'
    mf.kernel()
    end = perf_counter()
    print(f"Time: {end-begin:.2f} s")

When I simply run the script with python dft.py, the code runs on the management node and finishes in ~53 s. But when I submit it to a computation node with srun python dft.py, the exact same code takes ~105 s. The CPU on the computation node is comparable to, if not better than, the one on the management node. Why does this happen? And is there a reference for how long this calculation should take on a single CPU core?

@jamesETsmith
Collaborator

There are a lot of different reasons this could happen. Could you give us a few details about:

  • How many cores are on the management and computation node?
  • What are the default settings of srun on your system? Does it give you exclusive access to a node or does it request a single process?
  • Are you setting the number of threads with the appropriate environment variables? (A quick check is sketched right after this list.)
  • Did you install pyscf with pip or build from sources?
  • Are you using the best version of BLAS on your computational nodes?
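
For the third point, a quick check along these lines (a rough sketch, not from the original report) shows what the job actually received and what PySCF will use:

    import os
    from pyscf import lib

    # What did SLURM allocate? (SLURM_CPUS_PER_TASK is only set when --cpus-per-task is given)
    print("SLURM_CPUS_PER_TASK:", os.environ.get("SLURM_CPUS_PER_TASK"))
    # What will OpenMP use by default?
    print("OMP_NUM_THREADS:", os.environ.get("OMP_NUM_THREADS"))
    # Number of threads PySCF's OpenMP-parallelized code will use
    print("PySCF threads:", lib.num_threads())

    # Optionally pin PySCF to the allocated cores if the defaults look wrong
    lib.num_threads(int(os.environ.get("SLURM_CPUS_PER_TASK", "1")))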

A couple of general notes about performance:

  • You should be careful to check whether another job is running on the same computation node. My guess is that you're running into an oversubscription issue because you aren't getting exclusive access to the node. See the SLURM docs for the --exclusive option.
  • If you're interested in maximizing performance, you should compile pyscf from source on your computation nodes.
  • As a last resort, reach out to your system administrator and have them double-check that you're running the job as expected.

@ftsong22
Author

@jamesETsmith Thank you for replying. I found that when using srun, the program defaults to a single thread, while running directly on the management node it used all 32 threads. However, when I calculated the same molecule with the same settings in Psi4 (except for a small difference in the grid), it took only 45 s with a single thread. Could you please estimate, or try this code yourself, and let me know how long this calculation should take with a single thread in PySCF?

@ajz34
Contributor

ajz34 commented Nov 18, 2023

I'm just curious about this issue 😹 (not going to dive deep into this topic right now).

I agree with jamesETsmith's comments. I also did some very crude profiling of this task on two machines.

Machine specification

Memory benchmarks were performed with the aid of Intel MLC.

                            Xeon 6150     Ryzen 7945HX
Deployment                  Server        Laptop
Architecture                SKL           Zen 4
BLAS                        MKL (conda)   OpenBLAS (pypi)
PySCF compile               compiled      pypi
DRAM bandwidth (seq-read)   216 GB/sec    58 GB/sec
Loaded latency (no delay)   198 ns        811 ns
L2 cache                    1 MB/core     1 MB/core
L3 cache                    24.75 MB      64 MB

Timing benchmark

Note that the tricks described in #1915 are applied in both cases, and RI-J or RI-JK is used. But I don't think that is important, since the Hartree-Fock part is quite fast in this case.

CPU            Functional   Physical cores   Task time
Xeon 6150      PBE          1                117.68 sec
               PBE          8                30.16 sec
               PBE          16               23.38 sec
               HF           16               1.75 sec
Ryzen 7945HX   PBE          1                121.88 sec
               PBE          8                30.41 sec
               PBE          16               28.57 sec
               HF           16               2.06 sec
  • The major differences between these two machines are DRAM bandwidth and memory load latency.
  • The Xeon 6150 is significantly faster than the Ryzen 7945HX for the 16-core task, though they behave almost the same for the 8-core task. (By "core" I mean a physical CPU core; a CPU can have many cores, and cores = threads when there is no hyper-threading.)
  • The Xeon 6150 is also faster than the Ryzen 7945HX for the 1-core (single-threaded) task. I guess that is because the Xeon 6150's frequency is high with 1 core but drops significantly with 8 cores, whereas the Ryzen 7945HX's frequency is steady.

I guess there's a memory or cache/TLB-miss related efficiency problem.

Profiling

For the Ryzen 7945HX benchmark (16 cores), the own-time (wall time) percentages of some functions are:

Function                 Percentage
_dot_ao_ao_sparse        58.4%
_dot_ao_dm_sparse        16.2%
eval_gto                 6.7%
_contract_rho_sparse     3.9%
nr_direct_drv            3.2%
_scale_ao_sparse         3.0%
_eval_xc                 2.2%
others                   6.4%
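
For reference, a rough per-function breakdown can also be collected with the standard-library cProfile (this is not the profiler used for the numbers above), e.g.:

    # Crude per-function timing with the standard library (not the profiler used
    # for the table above); run() is the driver from dft.py posted in this issue.
    import cProfile, pstats
    from dft import run

    prof = cProfile.Profile()
    prof.enable()
    run()
    prof.disable()
    # Top 15 entries by cumulative time; the sparse dot products appear via their
    # Python wrappers in pyscf/dft/numint.py.
    pstats.Stats(prof).sort_stats("cumulative").print_stats(15)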

Preliminary hacking

I tried to change _dot_ao_ao_sparse to its _dot_ao_ao_dense counterpart; specifically, I changed the following code

pyscf/dft/numint.py, lines 1156 to 1163 at commit e64ed31:

    elif xctype == 'GGA':
        ao_deriv = 1
        for i, ao, mask, wv in block_loop(ao_deriv):
            wv[0] *= .5  # *.5 because vmat + vmat.T at the end
            aow = _scale_ao_sparse(ao[:4], wv[:4], mask, ao_loc, out=aow)
            _dot_ao_ao_sparse(ao[0], aow, None, nbins, mask, pair_mask, ao_loc,
                              hermi=0, out=vmat[i])
        vmat = lib.hermi_sum(vmat, axes=(0,2,1))
into

    elif xctype == 'GGA':
        ao_deriv = 1
        for i, ao, mask, wv in block_loop(ao_deriv):
            wv[0] *= .5  # *.5 because vmat + vmat.T at the end
            aow = numpy.einsum("xgi, xg -> gi", ao[:4], wv[:4], optimize=True)
            _dot_ao_ao_dense(ao[0], aow, None, vmat[i])
        vmat = lib.hermi_sum(vmat, axes=(0,2,1))

Physical cores   Time before change   Time after change   Time reduction
1                121.88 sec           103.76 sec          15%
16               28.57 sec            17.21 sec           40%

Note that this is certainly not a solution; a sparse grid contraction should, in principle, have the potential to be faster than a dense one.

Preliminary conclusion?

I'm not sure whether these conclusions are correct; I'm a newbie when it comes to program efficiency.

Generally,

  • For small systems, DFT approximations (even non-hybrid ones, which scale as $O(N^3)$) are probably inevitably slower than HF ($O(N^4)$). I think this is not specific to PySCF but also applies to other atomic-orbital-basis programs, based on a FLOP-count analysis. The situation may be different for large systems.
  • Perhaps efficiency could be improved for the sparse DFT grid AO-AO contraction, or for other sparse grid operations.
  • Although these functions are parallelized (the CPU runs at 800% / 1600%), this task does not benefit much from adding more CPU cores.

My guess about the efficiency problem is that:

  • Although _dot_ao_ao_dense should be dominated by compute-intensive GEMM, _dot_ao_ao_sparse is probably implemented in a memory-intensive way (I haven't double-checked this with a roofline model).
  • I guess there could be a serious cache/TLB-miss problem in _dot_ao_ao_sparse, which may leave some room for optimization. I haven't validated that; in fact, I currently don't know the algorithms used for the sparse DFT integrals and contractions.

So, assuming the sparse DFT grid contraction is implemented in a memory-intensive way, the slower timing for this task on the computation node may come down to one of the following:

  • In some cases, the DRAM bandwidth or memory latency of the computation node is worse than that of the management node. CPU frequency and instruction throughput may be better on the computation node, but those matter more for compute-intensive tasks than for memory-intensive ones.
  • For a memory-intensive task, if your job does not have the node to itself, other users' jobs may consume the DRAM bandwidth. On the management node, by contrast, you may get the full DRAM bandwidth, which makes the task run faster.
  • Or someone else simply used up the CPU resources unintentionally; that happens sometimes.

@susilehtola
Contributor

susilehtola commented Nov 18, 2023 via email

sunqm added a commit to sunqm/pyscf that referenced this issue Nov 18, 2023
@sunqm
Collaborator

sunqm commented Nov 18, 2023

I didn't carefully optimize the memory locality and data transfer in the sparse algorithm. As @ajz34 guessed, it carries a high penalty in multi-threaded execution due to cache misses and memory access, and its performance may be strongly affected by the machine load, the number of threads, and other aspects of the runtime environment.

In PR #1962, I turned off the sparse algorithm for small systems. It should give you more consistent performance.

@verena-neufeld
Contributor

Thanks! Is this resolved now?

@ftsong22
Author

Thank you all for your help! Based on @ajz34's benchmarks, my time (approximately 110 seconds with a single thread) seems to be reasonable. However, when I tried the same program later, the performance was inconsistent, sometimes taking up to 160 seconds. I will try the latest update by @sunqm later to see if it can solve the problem.

There is another issue regarding the use of density fitting. My original code uses density fitting to compare with Psi4, which uses df by default. In this case, I found Psi4 to be significantly faster, taking only around 50 seconds, while PySCF takes about 110 seconds, which is reasonable as confirmed by @ajz34. (I am referring to using a single thread in all cases.) But then I turned off df in both PySCF and Psi4 (for Psi4 I set scf_type='pk'). I discovered that while the PySCF time only increased to somewhere around 150 seconds, the Psi4 time jumped to 190 seconds.

This seems very strange to me. Should the performance increase multiple times when density fitting is turned on? Or is it because Psi4 is heavily optimized for density fitting as it is the default?
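
For reference, the two PySCF settings being compared boil down to something like this (a sketch; mol is built as in dft.py above):

    from pyscf import dft

    mf_df = dft.RKS(mol).density_fit()   # density-fitted (RI) Coulomb, as in the original script
    mf_df.xc = 'PBE'
    e_df = mf_df.kernel()

    mf_conv = dft.RKS(mol)               # conventional integrals, analogous to Psi4's scf_type='pk'
    mf_conv.xc = 'PBE'
    e_conv = mf_conv.kernel()

    print("DF error in total energy:", e_df - e_conv)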

@susilehtola
Contributor

There is another issue regarding the use of density fitting. My original code uses density fitting to compare with Psi4, which uses df by default. In this case, I found Psi4 to be significantly faster, taking only around 50 seconds, while PySCF takes about 110 seconds, which is reasonable as confirmed by @ajz34. (I am referring to using a single thread in all cases.) But then I turned off df in both PySCF and Psi4 (for Psi4 I set scf_type='pk'). I discovered that while the PySCF time only increased to somewhere around 150 seconds, the Psi4 time jumped to 190 seconds.

This seems very strange to me. Should the performance increase multiple times when density fitting is turned on? Or is it because Psi4 is heavily optimized for density fitting as it is the default?

This is completely expected. Psi4 is a program optimized for segmented contracted basis sets, while PySCF uses general contractions, and cc-pVTZ is a generally contracted basis set. The integrals are therefore slower in Psi4; if you study heavy atoms, the speed differences can be huge e.g. 2 seconds vs several hours. When density fitting is used, the differences are smaller since the number of integrals is smaller, $\mathcal{O}(N^2)$ instead of $\mathcal{O}(N^4)$.

When sufficiently accurate auxiliary basis sets are used, the error made in the DF approximation can be made insignificant; see e.g. my recent work on automatic generation of auxiliary basis sets in J. Chem. Theory Comput. 17, 6886 (2021) and J. Chem. Theory Comput. 19, 6242 (2023).
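
In PySCF, the auxiliary basis can be chosen explicitly when density fitting is switched on; a minimal sketch, using the standard cc-pVTZ-JKFIT fitting set shipped with PySCF rather than an automatically generated one:

    from pyscf import dft

    # Explicit auxiliary basis for the density-fitting step
    mf = dft.RKS(mol).density_fit(auxbasis='cc-pvtz-jkfit')
    mf.xc = 'PBE'
    mf.kernel()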

@bogdanoff
Contributor

see e.g. my recent work on automatic generation of auxiliary basis sets in J. Chem. Theory Comput. 17, 6886 (2021) and J. Chem. Theory Comput. 19, 6242 (2023).

That is actually quite impressive! Is there an implementation to generate those for use in PySCF?

@susilehtola
Contributor

see e.g. my recent work on automatic generation of auxiliary basis sets in J. Chem. Theory Comput. 17, 6886 (2021) and J. Chem. Theory Comput. 19, 6242 (2023).

That is actually quite impressive! Is there an implementation to generate those for use in PySCF?

The implementation in the basis set utility of ERKALE is openly available on GitHub. The tool reads in and writes out the basis sets in Gaussian'94 format. You can use the Basis Set Exchange Python library to convert between basis set formats if necessary.

The long-term plan is to include the necessary functionality natively in the Basis Set Exchange.
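
A rough sketch of that workflow (the convert_formatted_basis_file helper and the file names here are assumptions; check the API of your installed basis_set_exchange version):

    # Convert an ERKALE-generated Gaussian'94 basis file to NWChem format,
    # which PySCF's basis parser understands, then use it as the auxiliary basis.
    # NOTE: helper name and signature assumed; verify against your BSE version.
    import basis_set_exchange as bse
    from pyscf import gto, dft

    bse.convert_formatted_basis_file('myaux.gbs', 'myaux.nw',
                                     in_fmt='gaussian94', out_fmt='nwchem')

    with open('myaux.nw') as f:
        nwchem_str = f.read()
    # Pick out the entries for the elements in the molecule
    auxbasis = {el: gto.basis.parse(nwchem_str, el) for el in ('C', 'H')}

    mf = dft.RKS(mol).density_fit(auxbasis=auxbasis)
    mf.xc = 'PBE'
    mf.kernel()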
