
Unexpected CUDA behavior encountered in ECP kernels #521

@henryw7

Description

We have noticed unexpected CUDA compilation behavior on Blackwell cards (sm_100 and sm_120). The ECP type 2 general kernel, as well as the general first- and second-derivative kernels for both ECP type 1 and type 2, returns incorrect results.

We have reproduced the problem on RTX 5080 and 5090, and in issue #508 the same problem shows up on B200. It cannot be reproduced on A100 or earlier cards. The CUDA version doesn't seem to matter: we reproduce the problem with both CUDA 12.8 and 13.0.

A minimal reproducible example in GPU4PySCF is:

import pyscf
import numpy as np

mol = pyscf.M(atom = '''
    I  1.0 0.0 0.0
''',
basis = """
I    D
      0.30900000000          1.0000000
""",
ecp = """
ECP
I nelec 28
I D
2      15.06890800            35.43952900
END
""",
charge = -1,
verbose = 0)

from pyscf import dft
mf = dft.RKS(mol, xc = "PBE")
hcore_cpu = mf.get_hcore()

from gpu4pyscf import dft
mf = dft.RKS(mol, xc = "PBE")
hcore_gpu = mf.get_hcore().get()

print(np.max(np.abs(hcore_gpu - hcore_cpu)))
print(np.where(np.abs(hcore_gpu - hcore_cpu) > 1e-10))

In this example, the block of the ECP matrix with D-type orbitals and a D-type ECP produces a wrong result on Blackwell cards. The expected result (obtained on an A100) is

7.105427357601002e-15
(array([], dtype=int64), array([], dtype=int64))

and the actual result (obtained on an RTX 5080) is

0.0008569962110911433
(array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4]))

This unexpected result may vary with the software and hardware setup, but it is always non-zero.
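
For completeness, the ECP contribution can be inspected directly on the CPU side (a hedged sketch using PySCF's standard integral interface; it assumes the kinetic and nuclear-attraction parts of hcore agree between CPU and GPU, so the difference above comes entirely from the ECP term):

h_ecp_cpu = mol.intor('ECPscalar')        # reference scalar ECP integrals on CPU
print(h_ecp_cpu.round(5))                 # the D-shell block that goes wrong on Blackwell
print((hcore_gpu - hcore_cpu).round(5))   # the error sits in the same block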

We also prepared a "minimal" CUDA example, attached below (please remove the .cpp suffix; GitHub doesn't accept .cu files). It is a single .cu file with the input data built in, which we hope makes debugging easier. We have also inserted print statements to make the unexpected behavior clear.

ecp_type2_reproduce.cu.cpp

To compile, run the following command (you can remove the -gencode entries for architectures you don't need):

nvcc -std=c++17 --ptxas-options=-v -O2 -gencode=arch=compute_120,code=sm_120 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_80,code=sm_80 ecp_type2_reproduce.cu

This generates an a.out executable. To run it:

./a.out

The correct output is

ur = 0.00000, dca = 0.00000, npi = 1, ai,ci = 0.30900,0.33416, radi = 0.33416,0.00000,0.00000,0.00000,0.00000
ur = 0.00000, dca = 0.00000, dcb = 0.00000, npi = 1, npj = 1, ai,ci = 0.30900,0.33416, aj,cj = 0.30900,0.33416, radi = 0.33416,0.00000,0.00000,0.00000,0.00000, radj = 0.33416,0.00000,0.00000,0.00000,0.00000,
 0.00048,  0.00000,  0.00000, -0.00024,  0.00000, -0.00024, 
 0.00000,  0.00036,  0.00000,  0.00000,  0.00000,  0.00000, 
 0.00000,  0.00000,  0.00036,  0.00000,  0.00000,  0.00000, 
-0.00024,  0.00000,  0.00000,  0.00048,  0.00000, -0.00024, 
 0.00000,  0.00000,  0.00000,  0.00000,  0.00036,  0.00000, 
-0.00024,  0.00000,  0.00000, -0.00024,  0.00000,  0.00048,

and the output from an RTX 5080 is

ur = 0.00000, dca = 0.00000, npi = 1, ai,ci = 0.30900,0.33416, radi = 1.33416,0.00000,3.14159,2.71828,1.41421
ur = 0.00000, dca = 0.00000, dcb = 0.00000, npi = 1, npj = 1, ai,ci = 0.30900,0.33416, aj,cj = 0.30900,0.33416, radi = 1.00000,0.00000,3.14159,2.71828,1.41421, radj = 0.33416,0.00000,0.00000,0.00000,0.00000,
 0.00144,  0.00000,  0.00000, -0.00072,  0.00000, -0.00072, 
 0.00000,  0.00108,  0.00000,  0.00000,  0.00000,  0.00000, 
 0.00000,  0.00000,  0.00108,  0.00000,  0.00000,  0.00000, 
-0.00072,  0.00000,  0.00000,  0.00144,  0.00000, -0.00072, 
 0.00000,  0.00000,  0.00000,  0.00000,  0.00108,  0.00000, 
-0.00072,  0.00000,  0.00000, -0.00072,  0.00000,  0.00144,

Here are some phenomena we have observed that might hint at the origin of this unexpected behavior:

  • We have very carefully analyzed every memory access, in both local and shared memory, to make sure no out-of-bounds access occurs. compute-sanitizer doesn't report any warnings or errors on this minimal CUDA example (the exact invocations are shown after this list). It does report race warnings on the GPU4PySCF minimal example, because we use shared-memory reductions within a warp without explicit synchronization; those warnings are unrelated.
  • In the minimal CUDA example, in the function type2_cart(), the computation of the local array radi is contaminated by the function type2_facs_rad(). At the end of type2_facs_rad(), we assign some values to the buf variable, which should do nothing; however, those assigned values show up in radi (see the sketch after this list). We believe register overlap / contamination is part of the reason.
  • When we initialize the unused part of some local arrays (reg_angi and reg_angj in type2_cart()), the problem disappears (also shown in the sketch after this list). We believe this only changes the optimization pattern and bypasses the problem.
  • Adding or removing the __restrict__ keyword on function arguments doesn't help.
  • Marking type2_facs_rad() as __noinline__ introduces a different problem: the result is still incorrect, but it becomes a different number.
  • Applying __forceinline__ to every device function raises the memory usage of compilation (cicc) to as much as 32 GB for the CUDA example and more than 128 GB for the GPU4PySCF build (even after isolating the compilation of type2_cart() into its own file). Although this does fix the problem in the CUDA example, we cannot afford the memory requirement for compilation and therefore cannot test the other kernels.
  • When we dial back compiler optimization with the flag --Ofast-compile max, the problem disappears.
  • If you wonder why this "minimal" CUDA example is 5000 lines long: when we tried to remove some of the unused switch cases, the problem also disappeared. Those cases are unused only in this minimal example; they are necessary in the full code.
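
For reference, the sanitizer runs mentioned in the first bullet above are of the standard form (memcheck is the default tool; racecheck is the one that flags the warp-level shared-memory reductions in the GPU4PySCF build):

compute-sanitizer --tool memcheck ./a.out
compute-sanitizer --tool racecheck ./a.out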
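
To make the probe and the workaround from the second and third bullets concrete, here is a hedged sketch. The names type2_facs_rad, type2_cart, buf, radi, and reg_angi mirror the attached file, but the bodies, array sizes, and constants below are simplified placeholders; this toy compiles and runs, but it does not reproduce the bug (the full 5000-line attachment is needed for that).

#include <cstdio>

// Stand-in for the radial-factor routine in the attachment. The trailing
// stores into the scratch buffer are the probe: buf is dead after its last
// real use, so a correct compilation must never let these sentinel values
// become observable anywhere else.
__device__ void type2_facs_rad(double* rad)
{
    double buf[4];
    for (int i = 0; i < 4; i++) buf[i] = 0.33416;  // placeholder "work"
    rad[0] = buf[0];                                // last real use of buf
    for (int i = 1; i < 5; i++) rad[i] = 0.0;
    buf[1] = 3.14159;  // probe: on sm_100/sm_120 these sentinels showed up
    buf[2] = 2.71828;  // inside radi in the real kernel, pointing at
    buf[3] = 1.41421;  // register / local-memory overlap
}

// Stand-in for the real type2_cart() (a __device__ function in the
// attachment; a __global__ wrapper is used here so the toy is launchable).
__global__ void type2_cart(double* out)
{
    double radi[5];
    double reg_angi[16];

    for (int i = 0; i < 5; i++) reg_angi[i] = 1.0;  // placeholder "used" part

    // Workaround from the third bullet: zero-initialize the *unused* tail.
    // Semantically a no-op, but it changes the optimization pattern enough
    // that the wrong results disappear in the real kernel.
    for (int i = 5; i < 16; i++) reg_angi[i] = 0.0;

    type2_facs_rad(radi);
    for (int i = 0; i < 5; i++) out[i] = radi[i] * reg_angi[i];
}

int main()
{
    double h_out[5];
    double* d_out;
    cudaMalloc(&d_out, sizeof(h_out));
    type2_cart<<<1, 1>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 5; i++) printf("%.5f,", h_out[i]);  // expect 0.33416,0.00000,...
    printf("\n");
    cudaFree(d_out);
    return 0;
}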

Thanks @wxj6000 and @sunqm for the help with debugging.
