Description
We have noticed unexpected behavior of CUDA compilation on Blackwell cards (sm_100 and sm_120): the ECP type 2 general kernel, as well as the general first- and second-derivative kernels for both ECP type 1 and type 2, returns incorrect results.
We have reproduced the problem on the RTX 5080 and 5090, and in issue #508 the same problem shows up on a B200. It cannot be reproduced on the A100 or earlier cards. The CUDA version does not seem to matter: the problem reproduces with both CUDA 12.8 and 13.0.
A minimal reproducible case in GPU4PySCF is:
```python
import pyscf
import numpy as np

mol = pyscf.M(atom = '''
I 1.0 0.0 0.0
''',
basis = """
I D
0.30900000000 1.0000000
""",
ecp = """
ECP
I nelec 28
I D
2 15.06890800 35.43952900
END
""",
charge = -1,
verbose = 0)

from pyscf import dft
mf = dft.RKS(mol, xc = "PBE")
hcore_cpu = mf.get_hcore()

from gpu4pyscf import dft
mf = dft.RKS(mol, xc = "PBE")
hcore_gpu = mf.get_hcore().get()

print(np.max(np.abs(hcore_gpu - hcore_cpu)))
print(np.where(np.abs(hcore_gpu - hcore_cpu) > 1e-10))
```
In this example, the block of the ECP matrix between D-type orbitals and the D-type ECP produces a wrong result on Blackwell cards. The expected result (obtained on an A100) is
```
7.105427357601002e-15
(array([], dtype=int64), array([], dtype=int64))
```
and the actual result (obtained on an RTX 5080) is
```
0.0008569962110911433
(array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4]))
```
The incorrect values vary with the software and hardware configuration, but they are always nonzero.
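To make the two diagnostic prints concrete, here is a minimal sketch with synthetic matrices (the numbers below are made up; only the shape of the discrepancy mimics the corrupted 5×5 D-orbital block):

```python
import numpy as np

# Synthetic stand-in for hcore_cpu / hcore_gpu: identical 6x6 matrices,
# except for an injected error on the first five diagonal entries,
# mimicking a corrupted D-orbital block.
cpu = np.zeros((6, 6))
gpu = cpu.copy()
gpu[np.arange(5), np.arange(5)] += 1e-3

# Same two checks as in the reproducer above: the largest deviation,
# and the row/column indices of every entry that deviates.
print(np.max(np.abs(gpu - cpu)))            # → 0.001
print(np.where(np.abs(gpu - cpu) > 1e-10))  # → (array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4]))
```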
We have also prepared a "minimal" CUDA example as the attached file (please remove the .cpp suffix; GitHub does not accept .cu files). It is a single .cu file with the input data built in, which we hope makes debugging easier. We have also inserted print statements to make the unexpected behavior clear.
To compile, run the following command (you can drop the `-gencode` flags for architectures you do not need):

```bash
nvcc -std=c++17 --ptxas-options=-v -O2 -gencode=arch=compute_120,code=sm_120 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_80,code=sm_80 ecp_type2_reproduce.cu
```
This produces an `a.out` executable; run it with `./a.out`.
The correct output is:

```
ur = 0.00000, dca = 0.00000, npi = 1, ai,ci = 0.30900,0.33416, radi = 0.33416,0.00000,0.00000,0.00000,0.00000
ur = 0.00000, dca = 0.00000, dcb = 0.00000, npi = 1, npj = 1, ai,ci = 0.30900,0.33416, aj,cj = 0.30900,0.33416, radi = 0.33416,0.00000,0.00000,0.00000,0.00000, radj = 0.33416,0.00000,0.00000,0.00000,0.00000,
0.00048, 0.00000, 0.00000, -0.00024, 0.00000, -0.00024,
0.00000, 0.00036, 0.00000, 0.00000, 0.00000, 0.00000,
0.00000, 0.00000, 0.00036, 0.00000, 0.00000, 0.00000,
-0.00024, 0.00000, 0.00000, 0.00048, 0.00000, -0.00024,
0.00000, 0.00000, 0.00000, 0.00000, 0.00036, 0.00000,
-0.00024, 0.00000, 0.00000, -0.00024, 0.00000, 0.00048,
```
and the output from an RTX 5080 is:

```
ur = 0.00000, dca = 0.00000, npi = 1, ai,ci = 0.30900,0.33416, radi = 1.33416,0.00000,3.14159,2.71828,1.41421
ur = 0.00000, dca = 0.00000, dcb = 0.00000, npi = 1, npj = 1, ai,ci = 0.30900,0.33416, aj,cj = 0.30900,0.33416, radi = 1.00000,0.00000,3.14159,2.71828,1.41421, radj = 0.33416,0.00000,0.00000,0.00000,0.00000,
0.00144, 0.00000, 0.00000, -0.00072, 0.00000, -0.00072,
0.00000, 0.00108, 0.00000, 0.00000, 0.00000, 0.00000,
0.00000, 0.00000, 0.00108, 0.00000, 0.00000, 0.00000,
-0.00072, 0.00000, 0.00000, 0.00144, 0.00000, -0.00072,
0.00000, 0.00000, 0.00000, 0.00000, 0.00108, 0.00000,
-0.00072, 0.00000, 0.00000, -0.00072, 0.00000, 0.00144,
```
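Incidentally, at the printed precision every nonzero entry of the wrong matrix is exactly three times the corresponding correct entry, which may be a useful hint (for example, a contribution being accumulated more than once, or the contaminated `radi` values scaling the radial sum). A quick check on the printed numbers:

```python
import numpy as np

# The two 6x6 matrices exactly as printed above:
# correct output (A100) vs wrong output (RTX 5080).
correct = np.array([
    [ 0.00048, 0.00000, 0.00000, -0.00024, 0.00000, -0.00024],
    [ 0.00000, 0.00036, 0.00000,  0.00000, 0.00000,  0.00000],
    [ 0.00000, 0.00000, 0.00036,  0.00000, 0.00000,  0.00000],
    [-0.00024, 0.00000, 0.00000,  0.00048, 0.00000, -0.00024],
    [ 0.00000, 0.00000, 0.00000,  0.00000, 0.00036,  0.00000],
    [-0.00024, 0.00000, 0.00000, -0.00024, 0.00000,  0.00048],
])
wrong = np.array([
    [ 0.00144, 0.00000, 0.00000, -0.00072, 0.00000, -0.00072],
    [ 0.00000, 0.00108, 0.00000,  0.00000, 0.00000,  0.00000],
    [ 0.00000, 0.00000, 0.00108,  0.00000, 0.00000,  0.00000],
    [-0.00072, 0.00000, 0.00000,  0.00144, 0.00000, -0.00072],
    [ 0.00000, 0.00000, 0.00000,  0.00000, 0.00108,  0.00000],
    [-0.00072, 0.00000, 0.00000, -0.00072, 0.00000,  0.00144],
])

# Every entry of the wrong matrix is 3x the correct one.
print(np.allclose(wrong, 3.0 * correct))  # → True
```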
Here are some phenomena we have observed that might hint at the origin of this unexpected behavior:
- We have very carefully analyzed every memory access, both local and shared, to make sure no out-of-bounds access occurs. compute-sanitizer does not report any warnings or errors on this minimal CUDA example. (It does report race warnings on the GPU4PySCF minimal example, because we use shared-memory reductions within a warp without explicit synchronization; this is unrelated.)
- In the minimal CUDA example, in the function `type2_cart()`, the computation of the local array `radi` is contaminated by the function `type2_facs_rad()`. At the end of `type2_facs_rad()`, we assigned some values to the `buf` variable, which clearly should have no effect; however, those assigned values show up in `radi`. We believe register overlap / contamination is part of the reason.
- When we initialized the unused parts of some local arrays (`reg_angi` and `reg_angj` in `type2_cart()`), the problem disappeared. We believe this only changed the optimization pattern and bypassed the problem.
- Adding or removing the `__restrict__` keyword on function arguments does not help.
- Marking `type2_facs_rad()` as `__noinline__` introduced a different problem: the result is still incorrect, but it becomes a different wrong number.
- Marking every device function `__forceinline__` increased the compiler (cicc) memory usage to 32 GB for the CUDA example and to more than 128 GB for the GPU4PySCF compilation (even after isolating the `type2_cart()` function compilation into its own file). Even though it does fix the problem in the CUDA example, we cannot afford the memory requirement for compilation and so cannot test the other kernels.
- When we turned off compiler optimization using the flag `--Ofast-compile max`, the problem disappeared.
- If you wonder why this minimal CUDA example is 5000 lines long: when we tried to remove some unused switch cases, the problem also disappeared. Those cases are unused only in this minimal example, so they are necessary to keep.