Segfault libxc.so #188

Closed
markperri opened this issue Jul 25, 2024 · 20 comments
@markperri

I installed pyscf into my environment in a Jupyter Notebook Docker container running Ubuntu 22.04 with Python 3.11:

pip install pyscf gpu4pyscf-cuda12x cutensor-cu12

When I test with the given example I get a segfault:

import pyscf
from gpu4pyscf.dft import rks

atom ='''
O       0.0000000000    -0.0000000000     0.1174000000
H      -0.7570000000    -0.0000000000    -0.4696000000
H       0.7570000000     0.0000000000    -0.4696000000
'''

mol = pyscf.M(atom=atom, basis='def2-tzvpp')
mf = rks.RKS(mol, xc='LDA').density_fit()

e_dft = mf.kernel()  # compute total energy

Segmentation fault

kernel: python[394296]: segfault at 0 ip 00007f506ab6fcff sp 00007ffdbc440778 error 6 in libxc.so[7f506ab65000+1e0000]
kernel: Code: a9 1a 00 48 8b 44 24 08 48 83 c4 18 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 8b 05 a9 73 1d 00 66 0f ef c9 66 0f ef c0 <48> c7 07 00 00 00 00 c7 47 20 00 00 00 00 48 89 47 08 48 c7 47 38

It looks like I have two copies of libxc.so:

/opt/conda/lib/python3.11/site-packages/gpu4pyscf/lib/deps/lib/libxc.so
/opt/conda/lib/python3.11/site-packages/pyscf/lib/deps/lib/libxc.so
pip freeze | grep scf
gpu4pyscf-cuda12x==1.0
gpu4pyscf-libxc-cuda12x==0.4
pyscf==2.6.2
pyscf-dispersion==1.0.2

Do you have any thoughts on how to fix the segfault?
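
Since two copies of libxc.so are installed, a quick diagnostic (a minimal sketch, not from the original report) is to check which libxc objects the failing process has actually mapped before it crashes, e.g. by reading /proc/self/maps on Linux:

import pyscf                   # the failing example imports both packages
from gpu4pyscf.dft import rks

# List every libxc shared object currently mapped into this process;
# anything mapped only later (during mf.kernel()) will not show up yet.
with open('/proc/self/maps') as maps:
    libxc_paths = {line.split()[-1] for line in maps if 'libxc' in line}

for path in sorted(libxc_paths):
    print(path)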

@wxj6000
Collaborator

wxj6000 commented Jul 25, 2024

The following things can be helpful to identify the issue:

  1. Run the following code to see if it is an issue related to libxc.so in PySCF or libxc.so (CUDA version) in gpu4pyscf.
import pyscf
from pyscf.dft import rks

atom ='''
O       0.0000000000    -0.0000000000     0.1174000000
H      -0.7570000000    -0.0000000000    -0.4696000000
H       0.7570000000     0.0000000000    -0.4696000000
'''

mol = pyscf.M(atom=atom, basis='def2-tzvpp')
mf = rks.RKS(mol, xc='LDA').density_fit()

e_dft = mf.kernel()  # compute total energy
  2. What is your GPU type?
  3. What is the message before the Segmentation fault?

@markperri
Author

markperri commented Jul 25, 2024

Thanks for the quick response.

  1. That code runs fine:

converged SCF energy = -75.2427927513195

  2. I am using an A100-40 on Jetstream2. It is sliced to 1/5 of a GPU in the hypervisor on this VM size. I also tried it on a g3.xl VM size, which uses the entire GPU, and got the same error.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GRID A100X-8C                  On  | 00000000:04:00.0 Off |                    0 |
| N/A   N/A    P0              N/A /  N/A |      0MiB /  8192MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
  3. There are no messages before that line; it's just Segmentation fault. I have to look in /var/log/messages to see the details. I'm not sure if that's due to running it in a Docker container.
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyscf
>>> from gpu4pyscf.dft import rks
>>>
>>> atom ='''
... O       0.0000000000    -0.0000000000     0.1174000000
... H      -0.7570000000    -0.0000000000    -0.4696000000
... H       0.7570000000     0.0000000000    -0.4696000000
... '''
>>>
>>> mol = pyscf.M(atom=atom, basis='def2-tzvpp')
>>> mf = rks.RKS(mol, xc='LDA').density_fit()
>>> e_dft = mf.kernel()  # compute total energy
Segmentation fault

/var/log/messages:

kernel: python[445393]: segfault at 0 ip 00007f8cb2b6fcff sp 00007ffc13e36838 error 6 in libxc.so[7f8cb2b65000+1e0000]
kernel: Code: a9 1a 00 48 8b 44 24 08 48 83 c4 18 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 8b 05 a9 73 1d 00 66 0f ef c9 66 0f ef c0 <48> c7 07 00 00 00 00 c7 47 20 00 00 00 00 48 89 47 08 48 c7 47 38

Thanks,
Mark
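
Since the crash produces only a bare Segmentation fault in the terminal, one small addition (a sketch, not part of the original exchange) is to enable the standard-library fault handler, which prints the Python-level traceback to stderr when the process receives SIGSEGV:

import faulthandler
faulthandler.enable()  # print the Python traceback on SIGSEGV before the process dies

import pyscf
from gpu4pyscf.dft import rks

mol = pyscf.M(atom='O 0 0 0.1174; H -0.757 0 -0.4696; H 0.757 0 -0.4696',
              basis='def2-tzvpp')
mf = rks.RKS(mol, xc='LDA').density_fit()
e_dft = mf.kernel()  # if this segfaults, the last Python frame is reported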

@wxj6000
Collaborator

wxj6000 commented Jul 26, 2024

@markperri Thanks for the info. I tried to create a similar environment, but I was not able to reproduce the issue. If possible, could you please share your Dockerfile?

You have probably tried this already, but sometimes it helps to reinstall or to create a fresh conda environment to rule out conflicts.

@markperri
Author

@wxj6000 Here is a minimal Dockerfile that gives the same error. I wonder if there's something about the way this system is set up. I'll see if I can find another CUDA application tomorrow to test the installation in general.
Thanks,
Mark

FROM nvidia/cuda:12.2.0-devel-ubuntu22.04

RUN apt-get update -y && \
    apt-get install -y --no-install-recommends \
    python3-dev \
    python3-pip \
    python3-wheel \
    python3-setuptools && \
    rm -rf /var/lib/apt/lists/* /var/cache/apt/archives/*


ENV CUDA_HOME="/usr/local/cuda" LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}"
RUN echo "export PATH=${CUDA_HOME}/bin:\$PATH" >> /etc/bash.bashrc
RUN echo "export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:\$LD_LIBRARY_PATH" >> /etc/bash.bashrc

RUN pip3 install pyscf gpu4pyscf-cuda12x cutensor-cu12
root@23aed08bf45d:/# python3
Python 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyscf
>>> from gpu4pyscf.dft import rks
>>>
>>> atom ='''
... O       0.0000000000    -0.0000000000     0.1174000000
... H      -0.7570000000    -0.0000000000    -0.4696000000
... H       0.7570000000     0.0000000000    -0.4696000000
... '''
>>>
>>> mol = pyscf.M(atom=atom, basis='def2-tzvpp')
>>> mf = rks.RKS(mol, xc='LDA').density_fit()
>>>
>>> e_dft = mf.kernel()  # compute total energy
Segmentation fault (core dumped)

/var/log/messages:

python3[506069]: segfault at 0 ip 00007fa842b6fcff sp 00007ffc591d6028 error 6 in libxc.so[7fa842b65000+1e0000]
kernel: Code: a9 1a 00 48 8b 44 24 08 48 83 c4 18 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 8b 05 a9 73 1d 00 66 0f ef c9 66 0f ef c0 <48> c7 07 00 00 00 00 c7 47 20 00 00 00 00 48 89 47 08 48 c7 47 38

@markperri
Author

@wxj6000 I ran a NAMD container from NVIDIA NGC and it runs fine on the GPU, so at least we know the Docker / GPU setup is working. I'm not sure what else to test.

Fri Jul 26 14:40:34 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GRID A100X-40C                 On  | 00000000:04:00.0 Off |                    0 |
| N/A   N/A    P0              N/A /  N/A |    672MiB / 40960MiB |     61%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     11413      C   namd2                                       671MiB |
+---------------------------------------------------------------------------------------+

@wxj6000
Collaborator

wxj6000 commented Jul 27, 2024

@markperri I tried the Dockerfile you provided. The container works fine on my side. Let me check whether there is a memory leak in the modules.

Python 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyscf
>>> from gpu4pyscf.dft import rks
>>> 
>>> atom ='''
... O       0.0000000000    -0.0000000000     0.1174000000
... H      -0.7570000000    -0.0000000000    -0.4696000000
... H       0.7570000000     0.0000000000    -0.4696000000
... '''
>>> 
>>> mol = pyscf.M(atom=atom, basis='def2-tzvpp')
>>> mf = rks.RKS(mol, xc='LDA').density_fit()
>>> 
>>> e_dft = mf.kernel()  # compute total energy
/usr/local/lib/python3.10/dist-packages/cupy/cuda/compiler.py:233: PerformanceWarning: Jitify is performing a one-time only warm-up to populate the persistent cache, this may take a few seconds and will be improved in a future release...
  jitify._init_module()
converged SCF energy = -75.2427927513248
>>> print(f"total energy = {e_dft}")
total energy = -75.24279275132476
>>> 

@markperri
Author

@wxj6000 I compiled gpu4pyscf from source and it still gives the same error. I'll contact the Jetstream2 staff and see if they have any ideas.

Thanks,
Mark

@wxj6000
Collaborator

wxj6000 commented Jul 28, 2024

@markperri I went through the code related to libxc and improved the memory-allocation interface to libxc, but I am not sure whether it helps on your side.
https://github.com/pyscf/gpu4pyscf/actions/runs/10133763490/job/28019283314?pr=189

@markperri
Author

Thanks for trying. I compiled from source at commit 8fdfaa8, but I get the same segfault:

kernel: python[43743]: segfault at 0 ip 00007f3fad76ddf3 sp 00007ffeda1ba2c8 error 6 in libxc.so.15[7f3fad763000+224000]
kernel: Code: 00 00 00 75 05 48 83 c4 18 c3 e8 58 68 ff ff 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 8b 05 b5 b2 21 00 66 0f ef c9 66 0f ef c0 <48> c7 07 00 00 00 00 c7 47 20 00 00 00 00 48 89 47 08 48 c7 47 38

wxj6000 self-assigned this on Aug 14, 2024
wxj6000 added the bug label on Aug 14, 2024
@wxj6000
Collaborator

wxj6000 commented Aug 21, 2024

@markperri Can you please check whether this PR resolves the issue? #180

@markperri
Author

Thanks, is that the libxc_overhead branch? I installed it, but it doesn't seem to help:

pip install git+https://github.com/pyscf/gpu4pyscf.git@libxc_overhead
pip install cutensor-cu12

(base) jovyan@d67ddf22943d:/tmp$ python
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyscf
>>> from gpu4pyscf.dft import rks
>>>
>>> atom ='''
... O       0.0000000000    -0.0000000000     0.1174000000
... H      -0.7570000000    -0.0000000000    -0.4696000000
... H       0.7570000000     0.0000000000    -0.4696000000
... '''
>>>
>>> mol = pyscf.M(atom=atom, basis='def2-tzvpp')
>>> mf = rks.RKS(mol, xc='LDA').density_fit()
>>>
>>> e_dft = mf.kernel()  # compute total energy
Segmentation fault

@wxj6000
Collaborator

wxj6000 commented Aug 21, 2024

Right, it is the libxc_overhead branch. Just to confirm: did you remove the existing package first, if one was installed?

I also registered an account on ChemCompute, but I don't have access to JupyterHub since I no longer have an academic email. Is there any chance of getting a development environment for debugging?

@markperri
Author

Yes, this is without any gpu4pyscf installed.

@markperri
Author

Oh and @wxj6000 you should have Jupyter Notebook access now.
Thanks,
Mark

@wxj6000
Collaborator

wxj6000 commented Aug 21, 2024

@markperri Thank you for giving me access for debugging. It seems that unified memory, which is required by libxc.so, is disabled on this device. Please check the managedMemory entry in the device-property dict below, and see the CUDA documentation for details: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements

We can switch to libxc on the CPU when unified memory is not supported on the device. We will let you know the progress.

{'name': b'GRID A100X-8C', 'totalGlobalMem': 8585609216, 'sharedMemPerBlock': 49152, 'regsPerBlock': 65536, 'warpSize': 32, 'maxThreadsPerBlock': 1024, 'maxThreadsDim': (1024, 1024, 64), 'maxGridSize': (2147483647, 65535, 65535), 'clockRate': 1410000, 'totalConstMem': 65536, 'major': 8, 'minor': 0, 'textureAlignment': 512, 'texturePitchAlignment': 32, 'multiProcessorCount': 108, 'kernelExecTimeoutEnabled': 0, 'integrated': 0, 'canMapHostMemory': 1, 'computeMode': 0, 'maxTexture1D': 131072, 'maxTexture2D': (131072, 65536), 'maxTexture3D': (16384, 16384, 16384), 'concurrentKernels': 1, 'ECCEnabled': 1, 'pciBusID': 4, 'pciDeviceID': 0, 'pciDomainID': 0, 'tccDriver': 0, 'memoryClockRate': 1215000, 'memoryBusWidth': 5120, 'l2CacheSize': 41943040, 'maxThreadsPerMultiProcessor': 2048, 'isMultiGpuBoard': 0, 'cooperativeLaunch': 1, 'cooperativeMultiDeviceLaunch': 1, 'deviceOverlap': 1, 'maxTexture1DMipmap': 32768, 'maxTexture1DLinear': 268435456, 'maxTexture1DLayered': (32768, 2048), 'maxTexture2DMipmap': (32768, 32768), 'maxTexture2DLinear': (131072, 65000, 2097120), 'maxTexture2DLayered': (32768, 32768, 2048), 'maxTexture2DGather': (32768, 32768), 'maxTexture3DAlt': (8192, 8192, 32768), 'maxTextureCubemap': 32768, 'maxTextureCubemapLayered': (32768, 2046), 'maxSurface1D': 32768, 'maxSurface1DLayered': (32768, 2048), 'maxSurface2D': (131072, 65536), 'maxSurface2DLayered': (32768, 32768, 2048), 'maxSurface3D': (16384, 16384, 16384), 'maxSurfaceCubemap': 32768, 'maxSurfaceCubemapLayered': (32768, 2046), 'surfaceAlignment': 512, 'asyncEngineCount': 5, 'unifiedAddressing': 1, 'streamPrioritiesSupported': 1, 'globalL1CacheSupported': 1, 'localL1CacheSupported': 1, 'sharedMemPerMultiprocessor': 167936, 'regsPerMultiprocessor': 65536, 'managedMemory': 0, 'multiGpuBoardGroupID': 0, 'hostNativeAtomicSupported': 0, 'singleToDoublePrecisionPerfRatio': 2, 'pageableMemoryAccess': 0, 'concurrentManagedAccess': 0, 'computePreemptionSupported': 1, 'canUseHostPointerForRegisteredMem': 0, 'sharedMemPerBlockOptin': 166912, 'pageableMemoryAccessUsesHostPageTables': 0, 'directManagedMemAccessFromHost': 0, 'uuid': b'_:\x16\x9f_\xd6\x11\xef\xbex\x9d\x11\x11\x8e+\xa9', 'luid': b'', 'luidDeviceNodeMask': 0, 'persistingL2CacheMaxSize': 26214400, 'maxBlocksPerMultiProcessor': 32, 'accessPolicyMaxWindowSize': 134213632, 'reservedSharedMemPerBlock': 1024}
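
The same property can be queried programmatically through CuPy (a minimal sketch; this is not necessarily how gpu4pyscf performs the check internally):

import cupy

# managedMemory == 0 means the device/driver combination exposes no unified
# (managed) memory, which is what the vGPU slice above reports.
props = cupy.cuda.runtime.getDeviceProperties(0)
print('managedMemory:', props['managedMemory'])
print('concurrentManagedAccess:', props['concurrentManagedAccess'])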

@markperri
Author

Oh I see. The way their hypervisor works with vGPUs doesn't allow unified memory. Looks like this package won't be compatible with their system then.
Thanks,
Mark

@wxj6000
Collaborator

wxj6000 commented Aug 25, 2024

@markperri The issue has been fixed in v1.0.1. Most tasks can be executed on ChemCompute now. But due to the limited memory of a GPU slice, some tasks, such as Hessian calculations, may still run out of memory.

Thank you for your feedback and your cooperation!

@markperri
Author

Thanks! It works great now. I increased the instance size to use the entire GPU and the out-of-memory problems are fixed. However, I had to install it from GitHub; there is something wrong with the package on PyPI. It just downloads every version and then gives up:

(base) jovyan@7db95487cf10:/tmp$ pip install gpu4pyscf
Collecting gpu4pyscf
  Downloading gpu4pyscf-1.0.1.tar.gz (206 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 206.8/206.8 kB 6.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  WARNING: Generating metadata for package gpu4pyscf produced metadata for project name gpu4pyscf-cuda12x. Fix your #egg=gpu4pyscf fragments.
Discarding https://files.pythonhosted.org/packages/3d/68/07452d97f874c77d622e42969fb54c265a734d4f7be86f18944400625bb2/gpu4pyscf-1.0.1.tar.gz (from https://pypi.org/simple/gpu4pyscf/): Requested gpu4pyscf-cuda12x from https://files.pythonhosted.org/packages/3d/68/07452d97f874c77d622e42969fb54c265a734d4f7be86f18944400625bb2/gpu4pyscf-1.0.1.tar.gz has inconsistent name: expected 'gpu4pyscf', but metadata has 'gpu4pyscf-cuda12x'
  Downloading gpu4pyscf-1.0.tar.gz (204 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 205.0/205.0 kB 19.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  WARNING: Generating metadata for package gpu4pyscf produced metadata for project name gpu4pyscf-cuda12x. Fix your #egg=gpu4pyscf fragments.
Discarding https://files.pythonhosted.org/packages/17/00/a9bfefd38206230cd4542106b13cc1d08dcdc6b76f0be112bb4be5fb23f4/gpu4pyscf-1.0.tar.gz (from https://pypi.org/simple/gpu4pyscf/): Requested gpu4pyscf-cuda12x from https://files.pythonhosted.org/packages/17/00/a9bfefd38206230cd4542106b13cc1d08dcdc6b76f0be112bb4be5fb23f4/gpu4pyscf-1.0.tar.gz has inconsistent name: expected 'gpu4pyscf', but metadata has 'gpu4pyscf-cuda12x'
  Downloading gpu4pyscf-0.8.2.tar.gz (204 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 204.9/204.9 kB 13.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  WARNING: Generating metadata for package gpu4pyscf produced metadata for project name gpu4pyscf-cuda12x. Fix your #egg=gpu4pyscf fragments.
Discarding https://files.pythonhosted.org/packages/bb/dc/b33d96a33a406758cf9cd0ea14e5654c3d1310ee9ae7ff466ed6567816ae/gpu4pyscf-0.8.2.tar.gz (from https://pypi.org/simple/gpu4pyscf/): Requested gpu4pyscf-cuda12x from https://files.pythonhosted.org/packages/bb/dc/b33d96a33a406758cf9cd0ea14e5654c3d1310ee9ae7ff466ed6567816ae/gpu4pyscf-0.8.2.tar.gz has inconsistent name: expected 'gpu4pyscf', but metadata has 'gpu4pyscf-cuda12x'

It continues to download older versions of gpu4pyscf and then errors out.

@wxj6000
Collaborator

wxj6000 commented Aug 25, 2024

@markperri pip3 install gpu4pyscf-cuda12x will resolve the issue.

@markperri
Author

Oh yes, sorry. Forgot that part!
