Segmentation error on scvi-tools #2388

Closed
alexanderchang1 opened this issue Jan 10, 2024 · 22 comments
@alexanderchang1

I'm working on a large set of 900,000 cells, trying to run scVI on an HPC A100 GPU, but I keep getting a segmentation fault.

import gc
import os
import sys

import anndata as ad
import numpy as np
import pandas as pd
import scanpy as sc
import scipy
import scvi
import torch
from anndata import AnnData
from scipy.io import mmread

# `adata_combined` is the AnnData object holding the ~900,000 cells;
# the code that builds it is not shown in this report.
scvi.model.SCVI.setup_anndata(adata_combined, batch_key="donor_id")

vae = scvi.model.SCVI(adata_combined, n_hidden=256, n_latent=50, n_layers=2)

vae.train(accelerator="auto", devices="auto", max_epochs=400, early_stopping=True)

adata_combined.obsm["X_scVI"] = vae.get_latent_representation()
For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.

For creation, use `anndata.experimental.sparse_dataset(X)` instead.

  return _abc_instancecheck(cls, instance)
An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
SLURM auto-requeueing enabled. Setting signal handlers.
FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.

For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.

For creation, use `anndata.experimental.sparse_dataset(X)` instead.

  return _abc_instancecheck(cls, instance)
srun: error: gpu-n41: task 0: Segmentation fault

Versions:

_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
absl-py 2.0.0 pypi_0 pypi
aiohttp 3.9.1 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
anndata 0.10.4 pypi_0 pypi
annotated-types 0.6.0 pypi_0 pypi
anyio 4.2.0 pypi_0 pypi
array-api-compat 1.4 pypi_0 pypi
arrow 1.3.0 pypi_0 pypi
async-timeout 4.0.3 pypi_0 pypi
attrs 23.2.0 pypi_0 pypi
backoff 2.2.1 pypi_0 pypi
beautifulsoup4 4.12.2 pypi_0 pypi
blessed 1.20.0 pypi_0 pypi
boto3 1.34.15 pypi_0 pypi
botocore 1.34.15 pypi_0 pypi
bzip2 1.0.8 hd590300_5 conda-forge
ca-certificates 2023.11.17 hbcca054_0 conda-forge
certifi 2022.12.7 pypi_0 pypi
charset-normalizer 2.1.1 pypi_0 pypi
chex 0.1.7 pypi_0 pypi
click 8.1.7 pypi_0 pypi
contextlib2 21.6.0 pypi_0 pypi
contourpy 1.2.0 pypi_0 pypi
croniter 1.4.1 pypi_0 pypi
cycler 0.12.1 pypi_0 pypi
dateutils 0.6.12 pypi_0 pypi
deepdiff 6.7.1 pypi_0 pypi
dm-tree 0.1.8 pypi_0 pypi
docrep 0.3.2 pypi_0 pypi
editor 1.6.5 pypi_0 pypi
etils 1.5.2 pypi_0 pypi
exceptiongroup 1.2.0 pypi_0 pypi
fastapi 0.108.0 pypi_0 pypi
filelock 3.9.0 pypi_0 pypi
flax 0.7.5 pypi_0 pypi
fonttools 4.47.0 pypi_0 pypi
frozenlist 1.4.1 pypi_0 pypi
fsspec 2023.4.0 pypi_0 pypi
get-annotations 0.1.2 pypi_0 pypi
h11 0.14.0 pypi_0 pypi
h5py 3.10.0 pypi_0 pypi
idna 3.4 pypi_0 pypi
importlib-metadata 7.0.1 pypi_0 pypi
importlib-resources 6.1.1 pypi_0 pypi
inquirer 3.2.1 pypi_0 pypi
itsdangerous 2.1.2 pypi_0 pypi
jax 0.4.23 pypi_0 pypi
jaxlib 0.4.23 pypi_0 pypi
jinja2 3.1.2 pypi_0 pypi
jmespath 1.0.1 pypi_0 pypi
joblib 1.3.2 pypi_0 pypi
kiwisolver 1.4.5 pypi_0 pypi
ld_impl_linux-64 2.40 h41732ed_0 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc-ng 13.2.0 h807b86a_3 conda-forge
libgomp 13.2.0 h807b86a_3 conda-forge
libnsl 2.0.1 hd590300_0 conda-forge
libsqlite 3.44.2 h2797004_0 conda-forge
libuuid 2.38.1 h0b41bf4_0 conda-forge
libxcrypt 4.4.36 hd590300_1 conda-forge
libzlib 1.2.13 hd590300_5 conda-forge
lightning 2.0.9.post0 pypi_0 pypi
lightning-cloud 0.5.57 pypi_0 pypi
lightning-utilities 0.10.0 pypi_0 pypi
llvmlite 0.41.1 pypi_0 pypi
markdown-it-py 3.0.0 pypi_0 pypi
markupsafe 2.1.3 pypi_0 pypi
matplotlib 3.8.2 pypi_0 pypi
mdurl 0.1.2 pypi_0 pypi
ml-collections 0.1.1 pypi_0 pypi
ml-dtypes 0.3.2 pypi_0 pypi
mpmath 1.3.0 pypi_0 pypi
msgpack 1.0.7 pypi_0 pypi
mudata 0.2.3 pypi_0 pypi
multidict 6.0.4 pypi_0 pypi
multipledispatch 1.0.0 pypi_0 pypi
natsort 8.4.0 pypi_0 pypi
ncurses 6.4 h59595ed_2 conda-forge
nest-asyncio 1.5.8 pypi_0 pypi
networkx 3.0 pypi_0 pypi
numba 0.58.1 pypi_0 pypi
numpy 1.24.1 pypi_0 pypi
numpyro 0.13.2 pypi_0 pypi
openssl 3.2.0 hd590300_1 conda-forge
opt-einsum 3.3.0 pypi_0 pypi
optax 0.1.7 pypi_0 pypi
orbax-checkpoint 0.4.8 pypi_0 pypi
ordered-set 4.1.0 pypi_0 pypi
packaging 23.2 pypi_0 pypi
pandas 2.1.4 pypi_0 pypi
patsy 0.5.6 pypi_0 pypi
pillow 9.3.0 pypi_0 pypi
pip 23.3.2 pyhd8ed1ab_0 conda-forge
protobuf 4.25.1 pypi_0 pypi
psutil 5.9.7 pypi_0 pypi
pydantic 2.1.1 pypi_0 pypi
pydantic-core 2.4.0 pypi_0 pypi
pygments 2.17.2 pypi_0 pypi
pyjwt 2.8.0 pypi_0 pypi
pynndescent 0.5.11 pypi_0 pypi
pyparsing 3.1.1 pypi_0 pypi
pyro-api 0.1.2 pypi_0 pypi
pyro-ppl 1.8.6 pypi_0 pypi
python 3.9.18 h0755675_1_cpython conda-forge
python-dateutil 2.8.2 pypi_0 pypi
python-multipart 0.0.6 pypi_0 pypi
pytorch-lightning 2.1.3 pypi_0 pypi
pytz 2023.3.post1 pypi_0 pypi
pyyaml 6.0.1 pypi_0 pypi
readchar 4.0.5 pypi_0 pypi
readline 8.2 h8228510_1 conda-forge
requests 2.28.1 pypi_0 pypi
rich 13.7.0 pypi_0 pypi
runs 1.2.0 pypi_0 pypi
s3transfer 0.10.0 pypi_0 pypi
scanpy 1.9.6 pypi_0 pypi
scikit-learn 1.3.2 pypi_0 pypi
scipy 1.11.4 pypi_0 pypi
scvi-tools 1.0.4 pypi_0 pypi
seaborn 0.13.1 pypi_0 pypi
session-info 1.0.0 pypi_0 pypi
setuptools 69.0.3 pyhd8ed1ab_0 conda-forge
six 1.16.0 pypi_0 pypi
sniffio 1.3.0 pypi_0 pypi
soupsieve 2.5 pypi_0 pypi
sparse 0.15.0 pypi_0 pypi
starlette 0.32.0.post1 pypi_0 pypi
starsessions 1.3.0 pypi_0 pypi
statsmodels 0.14.1 pypi_0 pypi
stdlib-list 0.10.0 pypi_0 pypi
sympy 1.12 pypi_0 pypi
tensorstore 0.1.52 pypi_0 pypi
threadpoolctl 3.2.0 pypi_0 pypi
tk 8.6.13 noxft_h4845f30_101 conda-forge
toolz 0.12.0 pypi_0 pypi
torch 2.1.2+cu118 pypi_0 pypi
torchaudio 2.1.2+cu118 pypi_0 pypi
torchmetrics 1.2.1 pypi_0 pypi
torchvision 0.16.2+cu118 pypi_0 pypi
tqdm 4.66.1 pypi_0 pypi
traitlets 5.14.1 pypi_0 pypi
triton 2.1.0 pypi_0 pypi
types-python-dateutil 2.8.19.20240106 pypi_0 pypi
typing-extensions 4.9.0 pypi_0 pypi
tzdata 2023.4 pypi_0 pypi
umap-learn 0.5.5 pypi_0 pypi
urllib3 1.26.13 pypi_0 pypi
uvicorn 0.25.0 pypi_0 pypi
wcwidth 0.2.13 pypi_0 pypi
websocket-client 1.7.0 pypi_0 pypi
websockets 12.0 pypi_0 pypi
wheel 0.42.0 pyhd8ed1ab_0 conda-forge
xarray 2023.12.0 pypi_0 pypi


@martinkim0
Contributor

Hi, are you getting the segfault during training or when calling get_latent_representation?

@alexanderchang1
Author

During training. I had some issues with different versions of PyTorch, torchaudio, torchvision, and CUDA. I can now get it to set up the VAE model, but as soon as it starts the training loop it hits a segmentation fault. My HPC admin said this might be an issue with conda-installed PyTorch on A100 GPUs, but I reinstalled with pip in a fresh environment and still got the same error.

@martinkim0
Contributor

martinkim0 commented Jan 10, 2024

Are you able to run another PyTorch (non-scvi-tools) model in that environment? Or run any simple PyTorch ops, such as matmul?

@alexanderchang1
Author

Let me try. Give me ten minutes.

@alexanderchang1
Author

alexanderchang1 commented Jan 10, 2024

Hi,

Sorry for the delay; my task got pushed back in the resource queue.

I ran this test and it worked.

import torch

def gpu_test():
    """
    python -c "import uutils; uutils.torch_uu.gpu_test()"
    """
    from torch import Tensor
    if torch.cuda.is_available():
        device_name = lambda: torch.cuda.get_device_name(torch.cuda.current_device())
    else:
        device_name = lambda: "CUDA not available"

    print(f'device name: {device_name()}')
    x: Tensor = torch.randn(2, 4).cuda()
    y: Tensor = torch.randn(4, 1).cuda()
    out: Tensor = (x @ y)
    assert out.size() == torch.Size([2, 1])
    print(f'Success, no Cuda errors means it worked see:\n{out=}')

gpu_test()
device name: NVIDIA A100-PCIE-40GB
Success, no Cuda errors means it worked see:
out=tensor([[-4.4620],
        [ 1.6832]], device='cuda:0')

@martinkim0
Contributor

Thanks! Could you check if the following snippet runs without errors? Trying to determine if there's an issue with the installed libraries or the data itself:

adata = scvi.data.synthetic_iid()
scvi.model.SCVI.setup_anndata(adata)
model = scvi.model.SCVI(adata)
model.train(max_epochs=10)

If this runs without errors, it's likely that you need to perform additional preprocessing on your dataset such as removing cells with low counts.
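
As a rough illustration of the preprocessing mentioned above, the sketch below drops cells whose total count falls under a threshold, using a plain NumPy matrix with cells as rows. The function name `filter_low_count_cells` and the threshold value are hypothetical and chosen only for this example; on an AnnData object, scanpy's `sc.pp.filter_cells(adata, min_counts=...)` performs the equivalent filtering.

```python
import numpy as np


def filter_low_count_cells(counts, min_counts):
    """Keep only the rows (cells) whose total count is at least min_counts.

    `counts` is a cells x genes matrix. The name and the threshold are
    illustrative assumptions, not part of scvi-tools.
    """
    counts = np.asarray(counts)
    totals = counts.sum(axis=1)      # total counts per cell
    keep = totals >= min_counts      # boolean mask over cells
    return counts[keep], keep


# Toy matrix: 3 cells x 3 genes; only the second cell has 2 or more counts.
toy_counts = [[0, 0, 0], [5, 3, 2], [1, 0, 0]]
filtered, mask = filter_low_count_cells(toy_counts, min_counts=2)
```

Returning the mask alongside the filtered matrix makes it possible to apply the same selection to per-cell metadata such as `donor_id`.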

@alexanderchang1
Author

Ok, running now.

@alexanderchang1
Author

alexanderchang1 commented Jan 10, 2024

Hi, complete error message below:

Warning: When compiling code please add the following flags to nvcc:
         -gencode arch=compute_35,code=[compute_35,sm_35] \
         -gencode arch=compute_61,code=[compute_61,sm_61] 
         -gencode arch=compute_70,code=[compute_70,sm_70] 
/bgfs/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu/lib/python3.9/site-packages/scvi/_settings.py:63: UserWarning: Since v1.0.0, scvi-tools no longer uses a random seed by default. Run `scvi.settings.seed = 0` to reproduce results from previous versions.
  self.seed = seed
/bgfs/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu/lib/python3.9/site-packages/scvi/_settings.py:70: UserWarning: Setting `dl_pin_memory_gpu_training` is deprecated in v1.0 and will be removed in v1.1. Please pass in `pin_memory` to the data loaders instead.
  self.dl_pin_memory_gpu_training = (
/bgfs/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu/lib/python3.9/abc.py:119: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.

For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.

For creation, use `anndata.experimental.sparse_dataset(X)` instead.

  return _abc_instancecheck(cls, instance)
An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.
/bgfs/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu/lib/python3.9/site-packages/lightning/fabric/plugins/environments/slurm.py:168: PossibleUserWarning: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /bgfs/alee/LO_LAB/Personal/Alexander_Chang/alc376/ma ...
  rank_zero_warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/bgfs/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu/lib/python3.9/site-packages/lightning/fabric/plugins/environments/slurm.py:168: PossibleUserWarning: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /bgfs/alee/LO_LAB/Personal/Alexander_Chang/alc376/ma ...
  rank_zero_warn(
You are using a CUDA device ('NVIDIA A100-PCIE-40GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/bgfs/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py:281: PossibleUserWarning: The number of training batches (3) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
/var/spool/slurmd/job758614/slurm_script: line 27:  8154 Segmentation fault      /bgfs/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu/bin/python /bgfs/alee/LO_LAB/Personal/Alexander_Chang/alc376/machine_learning/Project/scvi_gpu_test_2.py

@martinkim0
Contributor

Would it be possible to determine which line of code within our library is causing this issue? There doesn't seem to be a full traceback.

@alexanderchang1
Author

Hi,

Apologies for my inexperience, how do I get a full traceback? I'm using SLURM sbatch.

Best,
Alex

@alexanderchang1
Author

Hi,

Sorry for the bother. I just wanted to follow up on this.

Best,
Alex

@martinkim0
Contributor

Hey, do you happen to have a log file that was written by SLURM? This might be your best bet for determining what line of code is causing the segfault.

@alexanderchang1
Author

Hi,

Please find attached.

scvi_processing_gpu_758026.zip

@martinkim0
Contributor

martinkim0 commented Jan 16, 2024

Sorry, doesn't look like the log files provide any useful additional info. It's hard to debug this since we don't know what exactly is causing this segfault, so it really could be one of SLURM, the GPU itself, our library, the other libraries you have installed, etc, or a combination of all of these. Without any additional info, I wouldn't be able to provide any specific pointers. You could try one of the following and see if it fixes your issue or gives you additional pointers as to what could be causing this:

  • Run on the same hardware but without SLURM
  • Run on a different GPU
  • Reinstall libraries after deleting conda and pip caches as there could be a problematic version that's being cached
  • Install previous versions of scvi-tools to see if the issue reproduces

@alexanderchang1
Author

Hi,

Ok thank you, I will let you know how that goes.

@alexanderchang1
Author

Hi,

Just a brief update.

  • I tried running it in a Jupyter notebook with a small test set of 20,000 cells. Still a segmentation fault (or the equivalent, since the kernel dies without reporting one).
  • I tried running it on both an A100 and a GTX 1080; both failed with the same error.
  • I created a brand new environment and installed only what was necessary, although that still had some issues.
  • I tried some previous versions of scvi-tools but those ran into other issues with other packages that have since been updated.

Any help would be greatly appreciated.

Best,
Alex
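
When a notebook kernel dies silently, one way to confirm that a segmentation fault is really the cause is to run the failing code in a child interpreter and inspect its exit status. The following is a minimal sketch under the assumption of a POSIX system, where a negative return code means the child was killed by a signal; `describe_exit` and `run_isolated` are hypothetical helper names, not part of scvi-tools or PyTorch.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Translate a subprocess return code into a readable description."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, a negative return code is the negated signal number.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


def run_isolated(code):
    """Run a snippet of Python source in a fresh interpreter; return its exit code."""
    completed = subprocess.run([sys.executable, "-c", code])
    return completed.returncode
```

For example, passing the synthetic-data test from earlier in the thread as a string to `run_isolated` and feeding the result to `describe_exit` would report `killed by SIGSEGV` if the child segfaults, while the parent process (or notebook kernel) stays alive.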

@martinkim0
Contributor

One last thing I can suggest trying is just going through the code line-by-line using a debugger to determine what line is being executed before the segfault.
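
In a batch job, where an interactive debugger is awkward, a non-interactive substitute is to print every executed Python line so that the last line in the log is the one that ran before the crash. The sketch below uses the standard `sys.settrace` hook; `trace_lines` is a hypothetical helper name, and tracing slows execution considerably, so it is probably only practical around a narrow region such as a single short training call.

```python
import sys


def trace_lines(func, *args, **kwargs):
    """Call func, printing each executed Python line to stderr as it runs.

    Returns (result, executed), where executed lists the traced locations.
    Output is flushed on every line so it survives a hard crash.
    """
    executed = []

    def tracer(frame, event, arg):
        if event == "line":
            code = frame.f_code
            location = f"{code.co_filename}:{frame.f_lineno} ({code.co_name})"
            executed.append(location)
            print(location, file=sys.stderr, flush=True)
        return tracer  # keep tracing inside this frame and nested calls

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(previous)  # always restore the previous trace hook
    return result, executed
```

Only Python-level lines are reported; a crash inside a compiled extension shows up as the last Python line that called into it.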

@canergen
Contributor

I would suggest using faulthandler to get more insight: https://docs.python.org/3/library/faulthandler.html. At least we would then know the exact line. You will likely get an easier-to-read error by following: https://www.cs.jhu.edu/~aadelucia//2021/08/24/pytorch-errors/
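
A minimal sketch of the faulthandler suggestion: enabling it at the top of the training script makes Python dump the traceback of every thread when the process receives a fatal signal such as SIGSEGV. The helper name `enable_crash_traceback` and the log path are assumptions for illustration; the same effect is available without code changes by running `python -X faulthandler script.py` or setting `PYTHONFAULTHANDLER=1` in the sbatch script.

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Write Python tracebacks for all threads to log_path on a fatal signal.

    The returned file object must stay open for as long as faulthandler is
    enabled, so keep a reference to it for the lifetime of the job.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Typical use at the very top of the script, before importing scvi or torch:
#     crash_log = enable_crash_traceback("faulthandler.log")  # hypothetical path
```

Writing to a dedicated file rather than stderr keeps the traceback separate from the library warnings that fill the SLURM log.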

@alexanderchang1
Author

Hi,

We isolated the problem to the latest version of PyTorch specifically. Installing this older version instead resolved it:

pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
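
To confirm that an environment actually picked up these pins, a short check against the installed package metadata can help. This is a sketch using only the standard library; `EXPECTED_PINS`, `version_matches`, and `check_pins` are hypothetical names. A pin without a local tag (such as `0.13.1`) is treated as matching any local build of that version (such as `0.13.1+cu117`), which is an assumption meant to mirror how the pip command above resolves it.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins taken from the pip command above.
EXPECTED_PINS = {
    "torch": "1.13.1+cu117",
    "torchvision": "0.14.1+cu117",
    "torchaudio": "0.13.1",
}


def version_matches(installed, wanted):
    """True if installed equals wanted, or is a local build (+tag) of it."""
    return installed == wanted or installed.startswith(wanted + "+")


def check_pins(expected):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        ok = installed is not None and version_matches(installed, wanted)
        report[name] = (installed, ok)
    return report
```

Calling `check_pins(EXPECTED_PINS)` at the start of a job and printing the result records in the SLURM log which PyTorch build the run used.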

@martinkim0
Contributor

Hmm interesting, do you happen to know what was going on with the newest release of PyTorch that was causing this issue?

@martinkim0
Contributor

I'll also go ahead and close this issue since it was resolved on your end

@alexanderchang1
Author

We never went down to that level, sorry.
