DFT running slow in SLURM system #1956

Open
ftsong22 opened this issue Nov 14, 2023 · 10 comments

Comments

@ftsong22

I am running a demo DFT calculation on a small molecule on a SLURM cluster. Here is the Python script, dft.py:

from pyscf import gto, dft
import numpy as np
from time import perf_counter

def run():
    xyz = """
    C     0.0324075417    1.5597651675    0.0173503882
    C     0.0165327393    0.0274410148    0.0305054658
    C     0.7401306193   -0.5295158287    1.2618054783
    C    -1.4291837338   -0.5260272859   -0.0298377316
    C    -2.1505122569   -0.1629995548   -1.2436891238
    C    -2.7262202740    0.1392620880   -2.2551761211
    H     1.0589579093    1.9399387642    0.0096348267
    H    -0.4845464252    1.9534338078   -0.8620306386
    H    -0.4643663260    1.9623774373    0.9087359643
    H     0.5364722319   -0.3291892511   -0.8679611487
    H     0.2428191250   -0.2118700025    2.1862800077
    H     0.7634809319   -1.6244926985    1.2541585815
    H     1.7738173614   -0.1720416972    1.3033542233
    H    -1.3996184762   -1.6199713974    0.0562607039
    H    -1.9858455552   -0.1676037942    0.8474974509
    H    -3.2444660827    0.4018884507   -3.1437253070
    """
    begin = perf_counter()
    mol = gto.Mole()
    mol.atom = xyz
    mol.basis = 'ccpvtz'
    mol.symmetry = False
    mol.max_memory = 12000
    mol.verbose = 4
    mol.build()
    mf = dft.RKS(mol).density_fit()
    mf.grids.level = 4
    mf.xc = 'PBE'
    mf.kernel()
    end = perf_counter()
    print(f"Time: {end-begin:.2f} s")

When I simply run the script with python dft.py, the code runs on the management node and finishes in ~53 s. But when I submit it to a computation node with srun python dft.py, the exact same code takes ~105 s. The CPU on the computation node is comparable to, if not better than, the one on the management node. Why does this happen? And is there a reference for how long this calculation should take on a single CPU core?

@jamesETsmith
Collaborator

There are a lot of different reasons this could happen. Could you give us a few details about:

  • How many cores are on the management and computation node?
  • What are the default settings of srun on your system? Does it give you exclusive access to a node or does it request a single process?
  • Are you setting the number of threads with the appropriate environment variables? (A quick check is sketched right after this list.)
  • Did you install pyscf with pip or build from sources?
  • Are you using the best version of BLAS on your computational nodes?
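
For the third point, a quick check along these lines (a rough sketch, not from the original report) shows what the job actually received and what PySCF will use:

    import os
    from pyscf import lib

    # What did SLURM allocate? (SLURM_CPUS_PER_TASK is only set when --cpus-per-task is given)
    print("SLURM_CPUS_PER_TASK:", os.environ.get("SLURM_CPUS_PER_TASK"))
    # What will OpenMP use by default?
    print("OMP_NUM_THREADS:", os.environ.get("OMP_NUM_THREADS"))
    # Number of threads PySCF's OpenMP-parallelized code will use
    print("PySCF threads:", lib.num_threads())

    # Optionally pin PySCF to the allocated cores if the defaults look wrong
    lib.num_threads(int(os.environ.get("SLURM_CPUS_PER_TASK", "1")))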

A couple of general notes about performance:

  • You should be careful to check whether another job is running on the same computation node. My guess is that you're running into an oversubscription issue because you aren't getting exclusive access to the node. See the SLURM docs for the --exclusive option.
  • If you're interested in maximizing performance, you should compile pyscf from source on your computation nodes.
  • As a last resort, reach out to your system administrator and have them double-check that you're running the job as expected.

@ftsong22
Author

@jamesETsmith Thank you for replying. I found that when using srun, the program defaults to a single thread, while running directly on the management node it used all 32 threads. However, when I calculated the same molecule with the same settings in Psi4 (except for a small difference in the grid), it took only 45 s with a single thread. Could you please estimate, or try this code yourself, and let me know how long this calculation should take with a single thread in PySCF?

@ajz34
Contributor

ajz34 commented Nov 18, 2023

I'm just curious about this issue 😹 (not going to dive deep into this topic right now).

I agree with jamesETsmith's comments. I also did some very crude profiling of this task on two machines.

Machine specification

Memory benchmarks were performed with the aid of Intel MLC.

                            Xeon 6150     Ryzen 7945HX
Deployment                  Server        Laptop
Architecture                SKL           Zen 4
BLAS                        MKL (conda)   OpenBLAS (pypi)
PySCF compile               compiled      pypi
DRAM bandwidth (seq-read)   216 GB/sec    58 GB/sec
Loaded latency (no delay)   198 ns        811 ns
L2 cache                    1 MB/core     1 MB/core
L3 cache                    24.75 MB      64 MB

Timing benchmark

Note that the tricks described in #1915 are applied in both cases, and RI-J or RI-JK is used. But I don't think that is important, since the Hartree-Fock part is quite fast in this case.

CPU            Functional   Physical cores   Task time
Xeon 6150      PBE          1                117.68 sec
               PBE          8                30.16 sec
               PBE          16               23.38 sec
               HF           16               1.75 sec
Ryzen 7945HX   PBE          1                121.88 sec
               PBE          8                30.41 sec
               PBE          16               28.57 sec
               HF           16               2.06 sec
  • The major differences between these two machines are DRAM bandwidth and memory load latency.
  • The Xeon 6150 is significantly faster than the Ryzen 7945HX for the 16-core task, though they behave almost the same for the 8-core task. (By "core" I mean a physical CPU core; a CPU can have many cores, and cores = threads when there is no hyper-threading.)
  • The Xeon 6150 is also faster than the Ryzen 7945HX for the 1-core (single-threaded) task. I guess that is because the Xeon 6150's frequency is high with 1 core but drops significantly with 8 cores, whereas the Ryzen 7945HX's frequency is steady.

I guess there's a memory or cache/TLB-miss related efficiency problem.

Profiling

For the Ryzen 7945HX benchmark (16 cores), the own-time (wall time) percentages of some functions are:

Function                 Percentage
_dot_ao_ao_sparse        58.4%
_dot_ao_dm_sparse        16.2%
eval_gto                 6.7%
_contract_rho_sparse     3.9%
nr_direct_drv            3.2%
_scale_ao_sparse         3.0%
_eval_xc                 2.2%
others                   6.4%
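
For reference, a rough per-function breakdown can also be collected with the standard-library cProfile (this is not the profiler used for the numbers above), e.g.:

    # Crude per-function timing with the standard library (not the profiler used
    # for the table above); run() is the driver from dft.py posted in this issue.
    import cProfile, pstats
    from dft import run

    prof = cProfile.Profile()
    prof.enable()
    run()
    prof.disable()
    # Top 15 entries by cumulative time; the sparse dot products appear via their
    # Python wrappers in pyscf/dft/numint.py.
    pstats.Stats(prof).sort_stats("cumulative").print_stats(15)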

Preliminary hacking

I tried to change _dot_ao_ao_sparse to its _dot_ao_ao_dense counterpart; specifically, I changed the following code

pyscf/dft/numint.py, lines 1156 to 1163 at commit e64ed31:

    elif xctype == 'GGA':
        ao_deriv = 1
        for i, ao, mask, wv in block_loop(ao_deriv):
            wv[0] *= .5  # *.5 because vmat + vmat.T at the end
            aow = _scale_ao_sparse(ao[:4], wv[:4], mask, ao_loc, out=aow)
            _dot_ao_ao_sparse(ao[0], aow, None, nbins, mask, pair_mask, ao_loc,
                              hermi=0, out=vmat[i])
        vmat = lib.hermi_sum(vmat, axes=(0,2,1))
into

    elif xctype == 'GGA':
        ao_deriv = 1
        for i, ao, mask, wv in block_loop(ao_deriv):
            wv[0] *= .5  # *.5 because vmat + vmat.T at the end
            aow = numpy.einsum("xgi, xg -> gi", ao[:4], wv[:4], optimize=True)
            _dot_ao_ao_dense(ao[0], aow, None, vmat[i])
        vmat = lib.hermi_sum(vmat, axes=(0,2,1))

Physical cores   Time before change   Time after change   Time reduction
1                121.88 sec           103.76 sec          15%
16               28.57 sec            17.21 sec           40%

Note that this is certainly not a solution; a sparse grid contraction should, in principle, have the potential to be faster than a dense one.

Preliminary conclusion?

I'm not sure whether these conclusions are correct; I'm a newbie when it comes to program efficiency.

Generally,

  • For small systems, DFT approximations (even non-hybrid ones, which scale as $O(N^3)$) are probably inevitably slower than HF ($O(N^4)$). I think this is not specific to PySCF but also applies to other atomic-orbital-basis programs, based on a FLOP-count analysis. The situation may be different for large systems.
  • Perhaps efficiency could be improved for the sparse DFT grid AO-AO contraction, or for other sparse grid operations.
  • Although these functions are parallelized (the CPU runs at 800% / 1600%), this task does not benefit much from adding more CPU cores.

My guess about the efficiency problem is that:

  • Although _dot_ao_ao_dense should be dominated by compute-intensive GEMM, _dot_ao_ao_sparse is probably implemented in a memory-intensive way (I haven't double-checked this with a roofline model).
  • I guess there could be a serious cache/TLB-miss problem in _dot_ao_ao_sparse, which may leave some room for optimization. I haven't validated that; in fact, I currently don't know the algorithms used for the sparse DFT integrals and contractions.

So, assuming the sparse DFT grid contraction is implemented in a memory-intensive way, the slower timing for this task on the computation node may come down to one of the following:

  • In some cases, the DRAM bandwidth or memory latency of the computation node is worse than that of the management node. CPU frequency and instruction throughput may be better on the computation node, but those matter more for compute-intensive tasks than for memory-intensive ones.
  • For a memory-intensive task, if your job does not have the node to itself, other users' jobs may consume the DRAM bandwidth. On the management node, by contrast, you may get the full DRAM bandwidth, which makes the task run faster.
  • Or someone else simply used up the CPU resources unintentionally; that happens sometimes.

@susilehtola
Contributor

susilehtola commented Nov 18, 2023 via email

sunqm added a commit to sunqm/pyscf that referenced this issue Nov 18, 2023
@sunqm
Collaborator

sunqm commented Nov 18, 2023

I didn't carefully optimize the memory locality and data transfer in the sparse algorithm. As @ajz34 guessed, it carries a high penalty in multi-threaded execution due to cache misses and memory access, and its performance may be strongly affected by the machine load, the number of threads, and other aspects of the runtime environment.

In PR #1962, I turned off the sparse algorithm for small systems. It should give you more consistent performance.

@verena-neufeld
Contributor

Thanks! Is this resolved now?

@ftsong22
Author

Thank you all for your help! Based on @ajz34's benchmarks, my time (approximately 110 seconds with a single thread) seems to be reasonable. However, when I tried the same program later, the performance was inconsistent, sometimes taking up to 160 seconds. I will try the latest update by @sunqm later to see if it can solve the problem.

There is another issue regarding the use of density fitting. My original code uses density fitting to compare with Psi4, which uses df by default. In this case, I found Psi4 to be significantly faster, taking only around 50 seconds, while PySCF takes about 110 seconds, which is reasonable as confirmed by @ajz34. (I am referring to using a single thread in all cases.) But then I turned off df in both PySCF and Psi4 (for Psi4 I set scf_type='pk'). I discovered that while the PySCF time only increased to somewhere around 150 seconds, the Psi4 time jumped to 190 seconds.

This seems very strange to me. Should the performance increase multiple times when density fitting is turned on? Or is it because Psi4 is heavily optimized for density fitting as it is the default?
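
For reference, the two PySCF settings being compared boil down to something like this (a sketch; mol is built as in dft.py above):

    from pyscf import dft

    mf_df = dft.RKS(mol).density_fit()   # density-fitted (RI) Coulomb, as in the original script
    mf_df.xc = 'PBE'
    e_df = mf_df.kernel()

    mf_conv = dft.RKS(mol)               # conventional integrals, analogous to Psi4's scf_type='pk'
    mf_conv.xc = 'PBE'
    e_conv = mf_conv.kernel()

    print("DF error in total energy:", e_df - e_conv)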

@susilehtola
Contributor

There is another issue regarding the use of density fitting. My original code uses density fitting to compare with Psi4, which uses df by default. In this case, I found Psi4 to be significantly faster, taking only around 50 seconds, while PySCF takes about 110 seconds, which is reasonable as confirmed by @ajz34. (I am referring to using a single thread in all cases.) But then I turned off df in both PySCF and Psi4 (for Psi4 I set scf_type='pk'). I discovered that while the PySCF time only increased to somewhere around 150 seconds, the Psi4 time jumped to 190 seconds.

This seems very strange to me. Should the performance increase multiple times when density fitting is turned on? Or is it because Psi4 is heavily optimized for density fitting as it is the default?

This is completely expected. Psi4 is a program optimized for segmented contracted basis sets, while PySCF uses general contractions, and cc-pVTZ is a generally contracted basis set. The integrals are therefore slower in Psi4; if you study heavy atoms, the speed differences can be huge e.g. 2 seconds vs several hours. When density fitting is used, the differences are smaller since the number of integrals is smaller, $\mathcal{O}(N^2)$ instead of $\mathcal{O}(N^4)$.

When sufficiently accurate auxiliary basis sets are used, the error made in the DF approximation can be made insignificant; see e.g. my recent work on automatic generation of auxiliary basis sets in J. Chem. Theory Comput. 17, 6886 (2021) and J. Chem. Theory Comput. 19, 6242 (2023).
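
In PySCF, the auxiliary basis can be chosen explicitly when density fitting is switched on; a minimal sketch, using the standard cc-pVTZ-JKFIT fitting set shipped with PySCF rather than an automatically generated one:

    from pyscf import dft

    # Explicit auxiliary basis for the density-fitting step
    mf = dft.RKS(mol).density_fit(auxbasis='cc-pvtz-jkfit')
    mf.xc = 'PBE'
    mf.kernel()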

@bogdanoff
Contributor

see e.g. my recent work on automatic generation of auxiliary basis sets in J. Chem. Theory Comput. 17, 6886 (2021) and J. Chem. Theory Comput. 19, 6242 (2023).

That is actually quite impressive! Is there an implementation to generate those for use in PySCF?

@susilehtola
Contributor

see e.g. my recent work on automatic generation of auxiliary basis sets in J. Chem. Theory Comput. 17, 6886 (2021) and J. Chem. Theory Comput. 19, 6242 (2023).

That is actually quite impressive! Is there an implementation to generate those for use in PySCF?

The implementation in the basis set utility of ERKALE is openly available on GitHub. The tool reads in and writes out the basis sets in Gaussian'94 format. You can use the Basis Set Exchange Python library to convert between basis set formats if necessary.

The long-term plan is to include the necessary functionality natively in the Basis Set Exchange.
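
A rough sketch of that workflow (the convert_formatted_basis_file helper and the file names here are assumptions; check the API of your installed basis_set_exchange version):

    # Convert an ERKALE-generated Gaussian'94 basis file to NWChem format,
    # which PySCF's basis parser understands, then use it as the auxiliary basis.
    # NOTE: helper name and signature assumed; verify against your BSE version.
    import basis_set_exchange as bse
    from pyscf import gto, dft

    bse.convert_formatted_basis_file('myaux.gbs', 'myaux.nw',
                                     in_fmt='gaussian94', out_fmt='nwchem')

    with open('myaux.nw') as f:
        nwchem_str = f.read()
    # Pick out the entries for the elements in the molecule
    auxbasis = {el: gto.basis.parse(nwchem_str, el) for el in ('C', 'H')}

    mf = dft.RKS(mol).density_fit(auxbasis=auxbasis)
    mf.xc = 'PBE'
    mf.kernel()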
