Description
We have noticed unexpected behavior of CUDA compilation on Blackwell cards (sm_100 and sm_120): the ECP type 2 general kernel, as well as the general first- and second-derivative kernels for both ECP type 1 and type 2, returns incorrect results.
We have reproduced the problem on the RTX 5080 and 5090, and in issue #508 the same problem shows up on a B200. It cannot be reproduced on the A100 or earlier cards. The CUDA version does not seem to matter: the problem reproduces with both CUDA 12.8 and 13.0.
A minimal reproducible case in GPU4PySCF is:
```python
import pyscf
import numpy as np

mol = pyscf.M(atom = '''
I 1.0 0.0 0.0
''',
basis = """
I D
0.30900000000 1.0000000
""",
ecp = """
ECP
I nelec 28
I D
2 15.06890800 35.43952900
END
""",
charge = -1,
verbose = 0)

from pyscf import dft
mf = dft.RKS(mol, xc = "PBE")
hcore_cpu = mf.get_hcore()

from gpu4pyscf import dft
mf = dft.RKS(mol, xc = "PBE")
hcore_gpu = mf.get_hcore().get()

print(np.max(np.abs(hcore_gpu - hcore_cpu)))
print(np.where(np.abs(hcore_gpu - hcore_cpu) > 1e-10))
```
In this example, the block of the ECP matrix between D-type orbitals and the D-type ECP produces a wrong result on Blackwell cards. The expected result (obtained on an A100) is
```
7.105427357601002e-15
(array([], dtype=int64), array([], dtype=int64))
```
and the actual result (obtained on an RTX 5080) is
```
0.0008569962110911433
(array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4]))
```
The incorrect values vary with the software and hardware configuration, but they are always nonzero.
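To make the two diagnostic prints concrete, here is a minimal sketch with synthetic matrices (the numbers below are made up; only the shape of the discrepancy mimics the corrupted 5×5 D-orbital block):

```python
import numpy as np

# Synthetic stand-in for hcore_cpu / hcore_gpu: identical 6x6 matrices,
# except for an injected error on the first five diagonal entries,
# mimicking a corrupted D-orbital block.
cpu = np.zeros((6, 6))
gpu = cpu.copy()
gpu[np.arange(5), np.arange(5)] += 1e-3

# Same two checks as in the reproducer above: the largest deviation,
# and the row/column indices of every entry that deviates.
print(np.max(np.abs(gpu - cpu)))            # → 0.001
print(np.where(np.abs(gpu - cpu) > 1e-10))  # → (array([0, 1, 2, 3, 4]), array([0, 1, 2, 3, 4]))
```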
We have also prepared a "minimal" CUDA example as the attached file (please remove the .cpp suffix; GitHub does not accept .cu files). It is a single .cu file with the input data built in, which we hope makes debugging easier. We have also inserted print statements to make the unexpected behavior clear.
To compile, run the following command (you can drop the `-gencode` flags for architectures you do not need):

```bash
nvcc -std=c++17 --ptxas-options=-v -O2 -gencode=arch=compute_120,code=sm_120 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_80,code=sm_80 ecp_type2_reproduce.cu
```
This produces an `a.out` executable; run it with `./a.out`.
The correct output is:

```
ur = 0.00000, dca = 0.00000, npi = 1, ai,ci = 0.30900,0.33416, radi = 0.33416,0.00000,0.00000,0.00000,0.00000
ur = 0.00000, dca = 0.00000, dcb = 0.00000, npi = 1, npj = 1, ai,ci = 0.30900,0.33416, aj,cj = 0.30900,0.33416, radi = 0.33416,0.00000,0.00000,0.00000,0.00000, radj = 0.33416,0.00000,0.00000,0.00000,0.00000,
0.00048, 0.00000, 0.00000, -0.00024, 0.00000, -0.00024,
0.00000, 0.00036, 0.00000, 0.00000, 0.00000, 0.00000,
0.00000, 0.00000, 0.00036, 0.00000, 0.00000, 0.00000,
-0.00024, 0.00000, 0.00000, 0.00048, 0.00000, -0.00024,
0.00000, 0.00000, 0.00000, 0.00000, 0.00036, 0.00000,
-0.00024, 0.00000, 0.00000, -0.00024, 0.00000, 0.00048,
```
and the output from an RTX 5080 is:

```
ur = 0.00000, dca = 0.00000, npi = 1, ai,ci = 0.30900,0.33416, radi = 1.33416,0.00000,3.14159,2.71828,1.41421
ur = 0.00000, dca = 0.00000, dcb = 0.00000, npi = 1, npj = 1, ai,ci = 0.30900,0.33416, aj,cj = 0.30900,0.33416, radi = 1.00000,0.00000,3.14159,2.71828,1.41421, radj = 0.33416,0.00000,0.00000,0.00000,0.00000,
0.00144, 0.00000, 0.00000, -0.00072, 0.00000, -0.00072,
0.00000, 0.00108, 0.00000, 0.00000, 0.00000, 0.00000,
0.00000, 0.00000, 0.00108, 0.00000, 0.00000, 0.00000,
-0.00072, 0.00000, 0.00000, 0.00144, 0.00000, -0.00072,
0.00000, 0.00000, 0.00000, 0.00000, 0.00108, 0.00000,
-0.00072, 0.00000, 0.00000, -0.00072, 0.00000, 0.00144,
```
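Incidentally, at the printed precision every nonzero entry of the wrong matrix is exactly three times the corresponding correct entry, which may be a useful hint (for example, a contribution being accumulated more than once, or the contaminated `radi` values scaling the radial sum). A quick check on the printed numbers:

```python
import numpy as np

# The two 6x6 matrices exactly as printed above:
# correct output (A100) vs wrong output (RTX 5080).
correct = np.array([
    [ 0.00048, 0.00000, 0.00000, -0.00024, 0.00000, -0.00024],
    [ 0.00000, 0.00036, 0.00000,  0.00000, 0.00000,  0.00000],
    [ 0.00000, 0.00000, 0.00036,  0.00000, 0.00000,  0.00000],
    [-0.00024, 0.00000, 0.00000,  0.00048, 0.00000, -0.00024],
    [ 0.00000, 0.00000, 0.00000,  0.00000, 0.00036,  0.00000],
    [-0.00024, 0.00000, 0.00000, -0.00024, 0.00000,  0.00048],
])
wrong = np.array([
    [ 0.00144, 0.00000, 0.00000, -0.00072, 0.00000, -0.00072],
    [ 0.00000, 0.00108, 0.00000,  0.00000, 0.00000,  0.00000],
    [ 0.00000, 0.00000, 0.00108,  0.00000, 0.00000,  0.00000],
    [-0.00072, 0.00000, 0.00000,  0.00144, 0.00000, -0.00072],
    [ 0.00000, 0.00000, 0.00000,  0.00000, 0.00108,  0.00000],
    [-0.00072, 0.00000, 0.00000, -0.00072, 0.00000,  0.00144],
])

# Every entry of the wrong matrix is 3x the correct one.
print(np.allclose(wrong, 3.0 * correct))  # → True
```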
Here are some phenomena we have observed that might hint at the origin of this unexpected behavior:
- We have very carefully analyzed every memory access, both local and shared, to make sure no out-of-bounds access occurs. compute-sanitizer does not report any warnings or errors on this minimal CUDA example. (It does report race warnings on the GPU4PySCF minimal example, because we use shared-memory reductions within a warp without explicit synchronization; this is unrelated.)
- In the minimal CUDA example, in the function `type2_cart()`, the computation of the local array `radi` is contaminated by the function `type2_facs_rad()`. At the end of `type2_facs_rad()`, we assigned some values to the `buf` variable, which clearly should have no effect; however, those assigned values show up in `radi`. We believe register overlap / contamination is part of the reason.
- When we initialized the unused parts of some local arrays (`reg_angi` and `reg_angj` in `type2_cart()`), the problem disappeared. We believe this only changed the optimization pattern and bypassed the problem.
- Adding or removing the `__restrict__` keyword on function arguments does not help.
- Marking `type2_facs_rad()` as `__noinline__` introduced a different problem: the result is still incorrect, but it becomes a different wrong number.
- Marking every device function `__forceinline__` increased the compiler (cicc) memory usage to 32 GB for the CUDA example and to more than 128 GB for the GPU4PySCF compilation (even after isolating the `type2_cart()` function compilation into its own file). Even though it does fix the problem in the CUDA example, we cannot afford the memory requirement for compilation and so cannot test the other kernels.
- When we turned off compiler optimization using the flag `--Ofast-compile max`, the problem disappeared.
- If you wonder why this minimal CUDA example is 5000 lines long: when we tried to remove some unused switch cases, the problem also disappeared. Those cases are unused only in this minimal example, so they are necessary to keep.