Segfault libxc.so #188

Closed
markperri opened this issue Jul 25, 2024 · 20 comments
@markperri

I installed pyscf into my environment in a Jupyter Notebook Docker container running Ubuntu 22.04 with Python 3.11:

pip install pyscf gpu4pyscf-cuda12x cutensor-cu12

When I test with the given example I get a segfault:

import pyscf
from gpu4pyscf.dft import rks

atom ='''
O       0.0000000000    -0.0000000000     0.1174000000
H      -0.7570000000    -0.0000000000    -0.4696000000
H       0.7570000000     0.0000000000    -0.4696000000
'''

mol = pyscf.M(atom=atom, basis='def2-tzvpp')
mf = rks.RKS(mol, xc='LDA').density_fit()

e_dft = mf.kernel()  # compute total energy

Segmentation fault

kernel: python[394296]: segfault at 0 ip 00007f506ab6fcff sp 00007ffdbc440778 error 6 in libxc.so[7f506ab65000+1e0000]
kernel: Code: a9 1a 00 48 8b 44 24 08 48 83 c4 18 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 8b 05 a9 73 1d 00 66 0f ef c9 66 0f ef c0 <48> c7 07 00 00 00 00 c7 47 20 00 00 00 00 48 89 47 08 48 c7 47 38

It looks like I have two copies of libxc.so:

/opt/conda/lib/python3.11/site-packages/gpu4pyscf/lib/deps/lib/libxc.so
/opt/conda/lib/python3.11/site-packages/pyscf/lib/deps/lib/libxc.so
pip freeze | grep scf
gpu4pyscf-cuda12x==1.0
gpu4pyscf-libxc-cuda12x==0.4
pyscf==2.6.2
pyscf-dispersion==1.0.2

Do you have any thoughts on how to fix the segfault?
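
Since two copies of libxc.so are installed, a quick diagnostic (a minimal sketch, not from the original report) is to check which libxc objects the failing process has actually mapped before it crashes, e.g. by reading /proc/self/maps on Linux:

import pyscf                   # the failing example imports both packages
from gpu4pyscf.dft import rks

# List every libxc shared object currently mapped into this process;
# anything mapped only later (during mf.kernel()) will not show up yet.
with open('/proc/self/maps') as maps:
    libxc_paths = {line.split()[-1] for line in maps if 'libxc' in line}

for path in sorted(libxc_paths):
    print(path)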

@wxj6000
Collaborator

wxj6000 commented Jul 25, 2024

The following things can be helpful to identify the issue:

  1. Run the following code to see if it is an issue related to libxc.so in PySCF or libxc.so (CUDA version) in gpu4pyscf.
import pyscf
from pyscf.dft import rks

atom ='''
O       0.0000000000    -0.0000000000     0.1174000000
H      -0.7570000000    -0.0000000000    -0.4696000000
H       0.7570000000     0.0000000000    -0.4696000000
'''

mol = pyscf.M(atom=atom, basis='def2-tzvpp')
mf = rks.RKS(mol, xc='LDA').density_fit()

e_dft = mf.kernel()  # compute total energy
  2. What is your GPU type?
  3. What is the message before the Segmentation fault?

@markperri
Author

markperri commented Jul 25, 2024

Thanks for the quick response.

  1. That code runs fine:

converged SCF energy = -75.2427927513195

  2. I am using an A100-40 on Jetstream2. It is sliced to 1/5 of a GPU in the hypervisor on this VM size. I also tried it on a g3.xl VM size, which uses the entire GPU, and got the same error.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GRID A100X-8C                  On  | 00000000:04:00.0 Off |                    0 |
| N/A   N/A    P0              N/A /  N/A |      0MiB /  8192MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
  3. There are no messages before that line; it's just Segmentation fault. I have to look in /var/log/messages to see the details. I'm not sure if that's due to running it in a Docker container.
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyscf
>>> from gpu4pyscf.dft import rks
>>>
>>> atom ='''
... O       0.0000000000    -0.0000000000     0.1174000000
... H      -0.7570000000    -0.0000000000    -0.4696000000
... H       0.7570000000     0.0000000000    -0.4696000000
... '''
>>>
>>> mol = pyscf.M(atom=atom, basis='def2-tzvpp')
>>> mf = rks.RKS(mol, xc='LDA').density_fit()
>>> e_dft = mf.kernel()  # compute total energy
Segmentation fault

/var/log/messages:

kernel: python[445393]: segfault at 0 ip 00007f8cb2b6fcff sp 00007ffc13e36838 error 6 in libxc.so[7f8cb2b65000+1e0000]
kernel: Code: a9 1a 00 48 8b 44 24 08 48 83 c4 18 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 8b 05 a9 73 1d 00 66 0f ef c9 66 0f ef c0 <48> c7 07 00 00 00 00 c7 47 20 00 00 00 00 48 89 47 08 48 c7 47 38

Thanks,
Mark
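
Since the crash produces only a bare Segmentation fault in the terminal, one small addition (a sketch, not part of the original exchange) is to enable the standard-library fault handler, which prints the Python-level traceback to stderr when the process receives SIGSEGV:

import faulthandler
faulthandler.enable()  # print the Python traceback on SIGSEGV before the process dies

import pyscf
from gpu4pyscf.dft import rks

mol = pyscf.M(atom='O 0 0 0.1174; H -0.757 0 -0.4696; H 0.757 0 -0.4696',
              basis='def2-tzvpp')
mf = rks.RKS(mol, xc='LDA').density_fit()
e_dft = mf.kernel()  # if this segfaults, the last Python frame is reported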

@wxj6000
Collaborator

wxj6000 commented Jul 26, 2024

@markperri Thanks for the info. I tried to create a similar environment, but I was not able to reproduce the issue. If possible, could you please share your Dockerfile?

You have probably tried this already, but sometimes it helps to reinstall or to create a fresh conda environment to rule out conflicts.

@markperri
Author

@wxj6000 Here is a minimal Dockerfile that gives the same error. I wonder if there's something about the way this system is set up. I'll see if I can find another CUDA application tomorrow to test the installation in general.
Thanks,
Mark

FROM nvidia/cuda:12.2.0-devel-ubuntu22.04

RUN apt-get update -y && \
    apt-get install -y --no-install-recommends \
    python3-dev \
    python3-pip \
    python3-wheel \
    python3-setuptools && \
    rm -rf /var/lib/apt/lists/* /var/cache/apt/archives/*


ENV CUDA_HOME="/usr/local/cuda" LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}"
RUN echo "export PATH=${CUDA_HOME}/bin:\$PATH" >> /etc/bash.bashrc
RUN echo "export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:\$LD_LIBRARY_PATH" >> /etc/bash.bashrc

RUN pip3 install pyscf gpu4pyscf-cuda12x cutensor-cu12
root@23aed08bf45d:/# python3
Python 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyscf
>>> from gpu4pyscf.dft import rks
>>>
>>> atom ='''
... O       0.0000000000    -0.0000000000     0.1174000000
... H      -0.7570000000    -0.0000000000    -0.4696000000
... H       0.7570000000     0.0000000000    -0.4696000000
... '''
>>>
>>> mol = pyscf.M(atom=atom, basis='def2-tzvpp')
>>> mf = rks.RKS(mol, xc='LDA').density_fit()
>>>
>>> e_dft = mf.kernel()  # compute total energy
Segmentation fault (core dumped)

/var/log/messages:

python3[506069]: segfault at 0 ip 00007fa842b6fcff sp 00007ffc591d6028 error 6 in libxc.so[7fa842b65000+1e0000]
kernel: Code: a9 1a 00 48 8b 44 24 08 48 83 c4 18 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 8b 05 a9 73 1d 00 66 0f ef c9 66 0f ef c0 <48> c7 07 00 00 00 00 c7 47 20 00 00 00 00 48 89 47 08 48 c7 47 38

@markperri
Author

@wxj6000 I ran a NAMD container from NVIDIA NGC and it runs fine on the GPU, so at least we know the Docker / GPU setup is working. I'm not sure what else to test.

Fri Jul 26 14:40:34 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GRID A100X-40C                 On  | 00000000:04:00.0 Off |                    0 |
| N/A   N/A    P0              N/A /  N/A |    672MiB / 40960MiB |     61%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     11413      C   namd2                                       671MiB |
+---------------------------------------------------------------------------------------+

@wxj6000
Collaborator

wxj6000 commented Jul 27, 2024

@markperri I tried the Dockerfile you provided. The container works fine on my side. Let me check whether there is a memory leak in the modules.

Python 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyscf
>>> from gpu4pyscf.dft import rks
>>> 
>>> atom ='''
... O       0.0000000000    -0.0000000000     0.1174000000
... H      -0.7570000000    -0.0000000000    -0.4696000000
... H       0.7570000000     0.0000000000    -0.4696000000
... '''
>>> 
>>> mol = pyscf.M(atom=atom, basis='def2-tzvpp')
>>> mf = rks.RKS(mol, xc='LDA').density_fit()
>>> 
>>> e_dft = mf.kernel()  # compute total energy
/usr/local/lib/python3.10/dist-packages/cupy/cuda/compiler.py:233: PerformanceWarning: Jitify is performing a one-time only warm-up to populate the persistent cache, this may take a few seconds and will be improved in a future release...
  jitify._init_module()
converged SCF energy = -75.2427927513248
>>> print(f"total energy = {e_dft}")
total energy = -75.24279275132476
>>> 

@markperri
Author

@wxj6000 I compiled gpu4pyscf from source and it still gives the same error. I'll contact the Jetstream2 staff and see if they have any ideas.

Thanks,
Mark

@wxj6000
Collaborator

wxj6000 commented Jul 28, 2024

@markperri I went through the code related to libxc and improved the memory-allocation interface to libxc, but I am not sure whether it helps on your side.
https://github.com/pyscf/gpu4pyscf/actions/runs/10133763490/job/28019283314?pr=189

@markperri
Author

Thanks for trying. I compiled from source at commit 8fdfaa8, but I get the same segfault:

kernel: python[43743]: segfault at 0 ip 00007f3fad76ddf3 sp 00007ffeda1ba2c8 error 6 in libxc.so.15[7f3fad763000+224000]
kernel: Code: 00 00 00 75 05 48 83 c4 18 c3 e8 58 68 ff ff 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 8b 05 b5 b2 21 00 66 0f ef c9 66 0f ef c0 <48> c7 07 00 00 00 00 c7 47 20 00 00 00 00 48 89 47 08 48 c7 47 38

wxj6000 self-assigned this on Aug 14, 2024
wxj6000 added the bug label on Aug 14, 2024
@wxj6000
Collaborator

wxj6000 commented Aug 21, 2024

@markperri Can you please check whether this PR resolves the issue? #180

@markperri
Author

Thanks, is that the libxc_overhead branch? I installed it, but it doesn't seem to help:

pip install git+https://github.com/pyscf/gpu4pyscf.git@libxc_overhead
pip install cutensor-cu12

(base) jovyan@d67ddf22943d:/tmp$ python
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyscf
>>> from gpu4pyscf.dft import rks
>>>
>>> atom ='''
... O       0.0000000000    -0.0000000000     0.1174000000
... H      -0.7570000000    -0.0000000000    -0.4696000000
... H       0.7570000000     0.0000000000    -0.4696000000
... '''
>>>
>>> mol = pyscf.M(atom=atom, basis='def2-tzvpp')
>>> mf = rks.RKS(mol, xc='LDA').density_fit()
>>>
>>> e_dft = mf.kernel()  # compute total energy
Segmentation fault

@wxj6000
Collaborator

wxj6000 commented Aug 21, 2024

Right, it is the libxc_overhead branch. Just to confirm: did you remove the existing package first, if one was installed?

I also registered an account on ChemCompute, but I don't have access to JupyterHub since I no longer have an academic email. Is there any chance of getting a development environment for debugging?

@markperri
Author

Yes, this is without any gpu4pyscf installed.

@markperri
Author

Oh and @wxj6000 you should have Jupyter Notebook access now.
Thanks,
Mark

@wxj6000
Collaborator

wxj6000 commented Aug 21, 2024

@markperri Thank you for giving me access for debugging. It seems that unified memory, which is required by libxc.so, is disabled on this device. Please check the managedMemory entry in the device-property dict below, and see the CUDA documentation for details: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements

We can switch to libxc on the CPU when unified memory is not supported on the device. We will let you know the progress.

{'name': b'GRID A100X-8C', 'totalGlobalMem': 8585609216, 'sharedMemPerBlock': 49152, 'regsPerBlock': 65536, 'warpSize': 32, 'maxThreadsPerBlock': 1024, 'maxThreadsDim': (1024, 1024, 64), 'maxGridSize': (2147483647, 65535, 65535), 'clockRate': 1410000, 'totalConstMem': 65536, 'major': 8, 'minor': 0, 'textureAlignment': 512, 'texturePitchAlignment': 32, 'multiProcessorCount': 108, 'kernelExecTimeoutEnabled': 0, 'integrated': 0, 'canMapHostMemory': 1, 'computeMode': 0, 'maxTexture1D': 131072, 'maxTexture2D': (131072, 65536), 'maxTexture3D': (16384, 16384, 16384), 'concurrentKernels': 1, 'ECCEnabled': 1, 'pciBusID': 4, 'pciDeviceID': 0, 'pciDomainID': 0, 'tccDriver': 0, 'memoryClockRate': 1215000, 'memoryBusWidth': 5120, 'l2CacheSize': 41943040, 'maxThreadsPerMultiProcessor': 2048, 'isMultiGpuBoard': 0, 'cooperativeLaunch': 1, 'cooperativeMultiDeviceLaunch': 1, 'deviceOverlap': 1, 'maxTexture1DMipmap': 32768, 'maxTexture1DLinear': 268435456, 'maxTexture1DLayered': (32768, 2048), 'maxTexture2DMipmap': (32768, 32768), 'maxTexture2DLinear': (131072, 65000, 2097120), 'maxTexture2DLayered': (32768, 32768, 2048), 'maxTexture2DGather': (32768, 32768), 'maxTexture3DAlt': (8192, 8192, 32768), 'maxTextureCubemap': 32768, 'maxTextureCubemapLayered': (32768, 2046), 'maxSurface1D': 32768, 'maxSurface1DLayered': (32768, 2048), 'maxSurface2D': (131072, 65536), 'maxSurface2DLayered': (32768, 32768, 2048), 'maxSurface3D': (16384, 16384, 16384), 'maxSurfaceCubemap': 32768, 'maxSurfaceCubemapLayered': (32768, 2046), 'surfaceAlignment': 512, 'asyncEngineCount': 5, 'unifiedAddressing': 1, 'streamPrioritiesSupported': 1, 'globalL1CacheSupported': 1, 'localL1CacheSupported': 1, 'sharedMemPerMultiprocessor': 167936, 'regsPerMultiprocessor': 65536, 'managedMemory': 0, 'multiGpuBoardGroupID': 0, 'hostNativeAtomicSupported': 0, 'singleToDoublePrecisionPerfRatio': 2, 'pageableMemoryAccess': 0, 'concurrentManagedAccess': 0, 'computePreemptionSupported': 1, 'canUseHostPointerForRegisteredMem': 0, 'sharedMemPerBlockOptin': 166912, 'pageableMemoryAccessUsesHostPageTables': 0, 'directManagedMemAccessFromHost': 0, 'uuid': b'_:\x16\x9f_\xd6\x11\xef\xbex\x9d\x11\x11\x8e+\xa9', 'luid': b'', 'luidDeviceNodeMask': 0, 'persistingL2CacheMaxSize': 26214400, 'maxBlocksPerMultiProcessor': 32, 'accessPolicyMaxWindowSize': 134213632, 'reservedSharedMemPerBlock': 1024}
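
The same property can be queried programmatically through CuPy (a minimal sketch; this is not necessarily how gpu4pyscf performs the check internally):

import cupy

# managedMemory == 0 means the device/driver combination exposes no unified
# (managed) memory, which is what the vGPU slice above reports.
props = cupy.cuda.runtime.getDeviceProperties(0)
print('managedMemory:', props['managedMemory'])
print('concurrentManagedAccess:', props['concurrentManagedAccess'])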

@markperri
Author

Oh I see. The way their hypervisor works with vGPUs doesn't allow unified memory. Looks like this package won't be compatible with their system then.
Thanks,
Mark

@wxj6000
Collaborator

wxj6000 commented Aug 25, 2024

@markperri The issue has been fixed in v1.0.1. Most tasks can be executed on ChemCompute now. But due to the limited memory of a GPU slice, some tasks, such as Hessian calculations, may still run out of memory.

Thank you for your feedback and your cooperation!

@markperri
Author

Thanks! It works great now. I increased the instance size to use the entire GPU and the out-of-memory problems are fixed. However, I had to install it from GitHub; there is something wrong with the package on PyPI. It just downloads every version and then gives up:

(base) jovyan@7db95487cf10:/tmp$ pip install gpu4pyscf
Collecting gpu4pyscf
  Downloading gpu4pyscf-1.0.1.tar.gz (206 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 206.8/206.8 kB 6.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  WARNING: Generating metadata for package gpu4pyscf produced metadata for project name gpu4pyscf-cuda12x. Fix your #egg=gpu4pyscf fragments.
Discarding https://files.pythonhosted.org/packages/3d/68/07452d97f874c77d622e42969fb54c265a734d4f7be86f18944400625bb2/gpu4pyscf-1.0.1.tar.gz (from https://pypi.org/simple/gpu4pyscf/): Requested gpu4pyscf-cuda12x from https://files.pythonhosted.org/packages/3d/68/07452d97f874c77d622e42969fb54c265a734d4f7be86f18944400625bb2/gpu4pyscf-1.0.1.tar.gz has inconsistent name: expected 'gpu4pyscf', but metadata has 'gpu4pyscf-cuda12x'
  Downloading gpu4pyscf-1.0.tar.gz (204 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 205.0/205.0 kB 19.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  WARNING: Generating metadata for package gpu4pyscf produced metadata for project name gpu4pyscf-cuda12x. Fix your #egg=gpu4pyscf fragments.
Discarding https://files.pythonhosted.org/packages/17/00/a9bfefd38206230cd4542106b13cc1d08dcdc6b76f0be112bb4be5fb23f4/gpu4pyscf-1.0.tar.gz (from https://pypi.org/simple/gpu4pyscf/): Requested gpu4pyscf-cuda12x from https://files.pythonhosted.org/packages/17/00/a9bfefd38206230cd4542106b13cc1d08dcdc6b76f0be112bb4be5fb23f4/gpu4pyscf-1.0.tar.gz has inconsistent name: expected 'gpu4pyscf', but metadata has 'gpu4pyscf-cuda12x'
  Downloading gpu4pyscf-0.8.2.tar.gz (204 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 204.9/204.9 kB 13.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  WARNING: Generating metadata for package gpu4pyscf produced metadata for project name gpu4pyscf-cuda12x. Fix your #egg=gpu4pyscf fragments.
Discarding https://files.pythonhosted.org/packages/bb/dc/b33d96a33a406758cf9cd0ea14e5654c3d1310ee9ae7ff466ed6567816ae/gpu4pyscf-0.8.2.tar.gz (from https://pypi.org/simple/gpu4pyscf/): Requested gpu4pyscf-cuda12x from https://files.pythonhosted.org/packages/bb/dc/b33d96a33a406758cf9cd0ea14e5654c3d1310ee9ae7ff466ed6567816ae/gpu4pyscf-0.8.2.tar.gz has inconsistent name: expected 'gpu4pyscf', but metadata has 'gpu4pyscf-cuda12x'

It continues to download older versions of gpu4pyscf and then errors out.

@wxj6000
Collaborator

wxj6000 commented Aug 25, 2024

@markperri pip3 install gpu4pyscf-cuda12x will resolve the issue.

@markperri
Author

Oh yes, sorry. Forgot that part!
