DFT running slow in SLURM system #1956
There are a lot of different reasons this could happen. Could you give us a few details about:
A couple of general notes about performance:
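One frequent cause of exactly this kind of `srun` slowdown (a hedged guess, not something established in this thread) is a mismatch between the cores SLURM binds the task to and the threads OpenMP/MKL try to spawn. A minimal batch-script sketch, assuming an OpenMP-threaded PySCF build; the core count is a placeholder:

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16    # placeholder: request the cores you intend to use

# Without these, srun may confine the process to one core while the
# threaded BLAS/OpenMP code still spawns many threads (oversubscription),
# or the job may silently run single-threaded.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun python dft.py
```

Inside the script, `pyscf.lib.num_threads()` reports the thread count PySCF actually uses, which is worth printing on both nodes before comparing timings.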
@jamesETsmith Thank you for replying. I found that when using
I'm just curious about this issue 😹 (not going to dive deep into this topic currently). I agree with the comments of jamesETsmith. I also tried to make some very crude profiling of this task on two machines.

**Machine specification**

Memory benchmarks were performed with the aid of Intel MLC.
**Timing benchmark**

Note that the tricks described in #1915 are applied in both cases, with RI-J or RI-JK utilized. But I think that's not important, as Hartree-Fock is quite fast in this case.
I guess there's a memory- or cache/TLB-miss-related efficiency problem.

**Profiling**

For the benchmark on the Ryzen 7945HX (16 cores), the own-time (wall time) percentages of some functions are
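Own-time percentages like these can be collected for the Python layer with the standard-library profiler (the C layer needs an external profiler such as `perf` or VTune). A minimal, self-contained sketch, with a dummy hot function standing in for the real numint kernels:

```python
import cProfile
import io
import pstats

def hot_kernel():
    # Dummy stand-in for the real hot spots (e.g. the numint contractions);
    # in the real case you would profile the actual DFT driver call instead.
    s = 0.0
    for i in range(100000):
        s += i * 0.5
    return s

pr = cProfile.Profile()
pr.enable()
hot_kernel()
pr.disable()

# Rank functions by own time ("tottime"), mirroring the tables above
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("tottime").print_stats(10)
report = buf.getvalue()
print(report)
```

Sorting by `tottime` (own time, excluding sub-calls) rather than `cumtime` is what isolates the contraction kernels from the drivers that call them.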
**Preliminary hacking**

I tried to change Lines 1156 to 1163 in e64ed31
into

```python
elif xctype == 'GGA':
    ao_deriv = 1
    for i, ao, mask, wv in block_loop(ao_deriv):
        wv[0] *= .5  # *.5 because vmat + vmat.T at the end
        aow = numpy.einsum("xgi, xg -> gi", ao[:4], wv[:4], optimize=True)
        _dot_ao_ao_dense(ao[0], aow, None, vmat[i])
vmat = lib.hermi_sum(vmat, axes=(0,2,1))
```
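The `wv[0] *= .5` followed by `lib.hermi_sum` works because the final symmetrization `V + V.T` doubles every contribution, so the diagonal (density) part is pre-halved. A self-contained numpy sketch of the same identity, with a plain `v + v.T` standing in for `lib.hermi_sum` (the array shapes are invented for illustration):

```python
import numpy

rng = numpy.random.default_rng(0)
ngrid, nao = 64, 6
ao = rng.standard_normal((4, ngrid, nao))  # AO values + 3 gradient components
wv = rng.standard_normal((4, ngrid))       # weighted XC potential on the grid

# Reference: build the symmetric GGA matrix term by term:
# V_ij = sum_g wv0 ao0_i ao0_j + sum_{x=1..3} wvx (ao0_i aox_j + aox_i ao0_j)
ref = numpy.einsum("gi,g,gj->ij", ao[0], wv[0], ao[0])
for x in range(1, 4):
    ref += numpy.einsum("gi,g,gj->ij", ao[0], wv[x], ao[x])
    ref += numpy.einsum("gi,g,gj->ij", ao[x], wv[x], ao[0])

# Halve-then-symmetrize trick, as in the snippet above
wv2 = wv.copy()
wv2[0] *= .5                               # pre-halve: v + v.T doubles it back
aow = numpy.einsum("xgi,xg->gi", ao, wv2, optimize=True)
v = ao[0].T @ aow                          # dense dot, like _dot_ao_ao_dense
v = v + v.T                                # stands in for lib.hermi_sum

assert numpy.allclose(v, ref)
```

Only the `wv[0]` (rho) term appears twice identically under transposition, which is why only that row is halved while the gradient rows are not.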
Note that this is certainly not a solution; a sparse grid contraction should in principle be faster than a dense one.

**Preliminary conclusion?**

I'm not sure whether these are correct; I'm a newbie to program efficiency. Generally,
My guess about the efficiency problem is that
So probably, since the sparse DFT grid contraction is implemented in a memory-intensive way, the slower timing for this task on the computation node may be due to the following conditions:
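To make the cache-miss guess concrete: a sparse grid contraction gathers the AO columns that are non-negligible on a grid block, contracts the small dense block, and scatters the result back, trading FLOPs for irregular memory traffic. A toy numpy sketch of that gather/contract/scatter pattern (the mask and sizes are invented for illustration, not PySCF's actual screening):

```python
import numpy

rng = numpy.random.default_rng(1)
ngrid, nao = 128, 20
ao = rng.standard_normal((ngrid, nao))
w = rng.standard_normal(ngrid)

# Dense contraction: one contiguous matrix product over all AOs
v_dense = ao.T @ (ao * w[:, None])

# "Sparse" contraction: only a subset of AOs is non-negligible on this
# grid block; gather those columns, contract the small block, scatter back.
mask = numpy.array([0, 3, 5, 7, 11, 19])      # non-negligible AO indices
ao_small = ao[:, mask]                        # gather: extra memory traffic
v_small = ao_small.T @ (ao_small * w[:, None])
v_sparse = numpy.zeros((nao, nao))
v_sparse[numpy.ix_(mask, mask)] += v_small    # scatter: irregular writes

# The sparse path reproduces the dense result on the masked block
assert numpy.allclose(v_sparse[numpy.ix_(mask, mask)],
                      v_dense[numpy.ix_(mask, mask)])
```

The gather and scatter steps are exactly the accesses that stress the cache and memory subsystem when many threads run them concurrently, which would fit the observed sensitivity to the machine's memory bandwidth.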
Note that Psi4 uses density fitting by default
I didn't carefully optimize the memory locality and the data transfer in the sparse algorithm. As @ajz34 guessed, it has a high penalty in multi-threaded execution due to cache misses and memory access patterns, and the performance may be strongly affected by the load, the number of threads, and other runtime conditions. In PR #1962, I turned off the sparse algorithm for small systems. It should give you more consistent performance.
Thanks! Is this resolved now?
Thank you all for your help! Based on @ajz34's benchmarks, my time (approximately 110 seconds with a single thread) seems reasonable. However, when I tried the same program later, the performance was inconsistent, sometimes taking up to 160 seconds. I will try the latest update by @sunqm later to see if it solves the problem.

There is another issue regarding the use of density fitting. My original code uses density fitting to compare with Psi4, which uses DF by default. In this case, I found Psi4 to be significantly faster, taking only around 50 seconds, while PySCF takes about 110 seconds, which is reasonable as confirmed by @ajz34. (I am referring to a single thread in all cases.) But then I turned off DF in both PySCF and Psi4 (for Psi4 I set scf_type='pk'). I discovered that while the PySCF time only increased to somewhere around 150 seconds, the Psi4 time jumped to 190 seconds. This seems very strange to me. Shouldn't the performance improve several-fold when density fitting is turned on? Or is it because Psi4 is heavily optimized for density fitting, since it is the default?
This is completely expected. Psi4 is a program optimized for segmented contracted basis sets, while PySCF uses general contractions, and cc-pVTZ is a generally contracted basis set. The integrals are therefore slower in Psi4; if you study heavy atoms, the speed differences can be huge, e.g. 2 seconds vs. several hours. When density fitting is used, the differences are smaller since the number of integrals is smaller. When sufficiently accurate auxiliary basis sets are used, the error made in the DF approximation can be made insignificant; see e.g. my recent work on automatic generation of auxiliary basis sets in J. Chem. Theory Comput. 17, 6886 (2021) and J. Chem. Theory Comput. 19, 6242 (2023).
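For context, the reduction in integral count comes from factorizing the four-center integrals over an auxiliary basis $\{P\}$; this is the generic textbook RI identity, not the specific RI-J/RI-JK code paths discussed above:

```latex
(ij|kl) \;\approx\; \sum_{PQ} (ij|P)\,\bigl[\mathbf{J}^{-1}\bigr]_{PQ}\,(Q|kl),
\qquad J_{PQ} = (P|Q)
```

This replaces $O(N^4)$ four-center integrals with $O(N^2 N_{\mathrm{aux}})$ three-center ones, and the quality of the auxiliary set controls the DF error, which is the subject of the auxiliary-basis generation papers cited above.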
That is actually quite impressive! Is there an implementation to generate those for use in PySCF?
The implementation in the basis set utility of ERKALE is openly available on GitHub. The tool reads in and writes out the basis sets in Gaussian'94 format. You can use the Basis Set Exchange Python library to convert between basis set formats if necessary. The long-term plan is to include the necessary functionality natively in the Basis Set Exchange.
I am running a demo DFT calculation on a small molecule on a SLURM cluster. Here is the Python script `dft.py`:

When I simply run the script with `python dft.py`, the code runs on the management node and finishes in ~53 s. But when I submit it to the computation node with `srun python dft.py`, the exact same code takes ~105 s. The CPU on the computation node is similar to that of the management node, if not better. Why does this happen? And may I have a reference for how much time this calculation should take on a single CPU core?