Segmentation error on scvi-tools #2388

alexanderchang1 opened this issue Jan 10, 2024 · 22 comments

I'm working on a large dataset of 900,000 cells and trying to run scVI on an HPC A100 GPU, but I keep getting a segmentation fault.

import gc
import os
import sys

import anndata as ad
import numpy as np
import pandas as pd
import scanpy as sc
import scipy
import scvi
import torch
from anndata import AnnData
from scipy.io import mmread

scvi.model.SCVI.setup_anndata(adata_combined, batch_key='donor_id')

vae = scvi.model.SCVI(adata_combined, n_hidden=256, n_latent=50, n_layers=2)

vae.train(accelerator='auto', devices='auto', max_epochs=400, early_stopping=True)

adata_combined.obsm["X_scVI"] = vae.get_latent_representation()
For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.

For creation, use `anndata.experimental.sparse_dataset(X)` instead.

  return _abc_instancecheck(cls, instance)
An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
SLURM auto-requeueing enabled. Setting signal handlers.
FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.

For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.

For creation, use `anndata.experimental.sparse_dataset(X)` instead.

  return _abc_instancecheck(cls, instance)
srun: error: gpu-n41: task 0: Segmentation fault


Hi, are you getting the segfault during training or when calling get_latent_representation?

During training. I had some issues with different versions of PyTorch, torchaudio, torchvision, and CUDA. I can get it to set up the VAE model, but as soon as it starts the training loop it hits a segmentation fault. My HPC admin said this might be an issue with conda PyTorch on A100 GPUs, but I reinstalled with pip in a fresh environment and still got the same error.

martinkim0 commented Jan 10, 2024

Are you able to run another PyTorch (non scvi-tools) model using that environment? Or do any simple PyTorch ops such as matmul, etc?

Let me try. Give me ten minutes.

alexanderchang1 commented Jan 10, 2024


Sorry for the delay, my task got pushed back in the queue for resources.

I ran this test and it worked.

import torch

def gpu_test():
    """
    python -c "import uutils; uutils.torch_uu.gpu_test()"
    """
    from torch import Tensor
    # Pick a device-name getter depending on whether CUDA is usable.
    if torch.cuda.is_available():
        device_name = lambda: torch.cuda.get_device_name(torch.cuda.current_device())
    else:
        device_name = lambda: "CUDA not available"

    print(f'device name: {device_name()}')
    # Run a small matrix multiplication on the GPU.
    x: Tensor = torch.randn(2, 4).cuda()
    y: Tensor = torch.randn(4, 1).cuda()
    out: Tensor = (x @ y)
    assert out.size() == torch.Size([2, 1])
    print(f'Success, no Cuda errors means it worked see:\n{out=}')

gpu_test()

device name: NVIDIA A100-PCIE-40GB
Success, no Cuda errors means it worked see:
        [ 1.6832]], device='cuda:0')

Thanks! Could you check if the following snippet runs without errors? Trying to determine if there's an issue with the installed libraries or the data itself:

adata =
model = scvi.model.SCVI(adata)

If this runs without errors, it's likely that you need to perform additional preprocessing on your dataset such as removing cells with low counts.

Ok, running now.

alexanderchang1 commented Jan 10, 2024

Hi, complete error message below:

Warning: When compiling code please add the following flags to nvcc:
         -gencode arch=compute_35,code=[compute_35,sm_35] \
         -gencode arch=compute_61,code=[compute_61,sm_61] 
         -gencode arch=compute_70,code=[compute_70,sm_70] 
/bgfs/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu/lib/python3.9/site-packages/scvi/ UserWarning: Since v1.0.0, scvi-tools no longer uses a random seed by default. Run `scvi.settings.seed = 0` to reproduce results from previous versions.
  self.seed = seed
/bgfs/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu/lib/python3.9/site-packages/scvi/ UserWarning: Setting `dl_pin_memory_gpu_training` is deprecated in v1.0 and will be removed in v1.1. Please pass in `pin_memory` to the data loaders instead.
  self.dl_pin_memory_gpu_training = (
/bgfs/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu/lib/python3.9/ FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.

For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.

For creation, use `anndata.experimental.sparse_dataset(X)` instead.

  return _abc_instancecheck(cls, instance)
An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.
/bgfs/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu/lib/python3.9/site-packages/lightning/fabric/plugins/environments/ PossibleUserWarning: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /bgfs/alee/LO_LAB/Personal/Alexander_Chang/alc376/ma ...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/bgfs/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu/lib/python3.9/site-packages/lightning/fabric/plugins/environments/ PossibleUserWarning: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /bgfs/alee/LO_LAB/Personal/Alexander_Chang/alc376/ma ...
You are using a CUDA device ('NVIDIA A100-PCIE-40GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read
/bgfs/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu/lib/python3.9/site-packages/lightning/pytorch/loops/ PossibleUserWarning: The number of training batches (3) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
/var/spool/slurmd/job758614/slurm_script: line 27:  8154 Segmentation fault      /bgfs/alee/LO_LAB/Personal/Alexander_Chang/alc376/envs/scvi_gpu/bin/python /bgfs/alee/LO_LAB/Personal/Alexander_Chang/alc376/machine_learning/Project/

Would it be possible to determine which line of code within our library is causing this issue? There doesn't seem to be a full traceback.

Apologies for my inexperience, how do I get a full traceback? I'm using SLURM sbatch.


Sorry for the bother. I just wanted to follow up on this.


Hey, do you happen to have a log file that was written by SLURM? This might be your best bet for determining what line of code is causing the segfault.

Please find attached.

martinkim0 commented Jan 16, 2024

Sorry, doesn't look like the log files provide any useful additional info. It's hard to debug this since we don't know what exactly is causing this segfault, so it really could be one of SLURM, the GPU itself, our library, the other libraries you have installed, etc, or a combination of all of these. Without any additional info, I wouldn't be able to provide any specific pointers. You could try one of the following and see if it fixes your issue or gives you additional pointers as to what could be causing this:

  • Run on the same hardware but without SLURM
  • Run on a different GPU
  • Reinstall libraries after deleting conda and pip caches as there could be a problematic version that's being cached
  • Install previous versions of scvi-tools to see if the issue reproduces
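
When comparing environments across these attempts, it can help to record exactly which versions are installed in each one. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the package list are assumptions chosen for illustration, not something from this thread's environment.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Illustrative list: adjust to whatever the environment actually contains.
packages = ["scvi-tools", "torch", "torchvision", "torchaudio",
            "lightning", "anndata", "jax", "jaxlib"]

for name, version in installed_versions(packages).items():
    print(f"{name}: {version or 'not installed'}")
```

Pasting this output for a failing and a working environment makes it easier to spot which package differs.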


Ok thank you, I will let you know how that goes.


Just a brief update.

  • I tried running it in a Jupyter notebook with a small test set of 20,000 cells. It still fails with a segmentation fault (or the equivalent, because the kernel dies without reporting one).
  • I tried running it on both an A100 and a GTX 1080; both failed with the same error.
  • I created a brand new environment and installed only what was necessary, although that still had some issues.
  • I tried some previous versions of scvi-tools but those ran into other issues with other packages that have since been updated.

Any help would be greatly appreciated.


One last thing I can suggest trying is just going through the code line-by-line using a debugger to determine what line is being executed before the segfault.

I would suggest using faulthandler to get more insight. At least we would then know the exact line. You will likely get an easier error by following:
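
For readers without the original pointer, here is a minimal sketch of how Python's built-in `faulthandler` module could be enabled at the top of the training script. It is an illustration, not the snippet referenced above; the helper name `enable_fault_log` and the log file name are hypothetical.

```python
import faulthandler


def enable_fault_log(path="segfault_traceback.log"):
    """Enable faulthandler and write crash tracebacks to `path`."""
    # The file must stay open for the lifetime of the process, because
    # faulthandler writes to its file descriptor when the crash happens.
    log_file = open(path, "w")
    # On SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL, dump the Python
    # traceback of every thread before the process dies.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Usage, at the very top of the script, before the model is built:
#   fault_log = enable_fault_log()
#   ... scvi.model.SCVI.setup_anndata(...), vae.train(...), etc.
```

The same handler can be switched on without editing the script by running `python -X faulthandler script.py` or by setting the `PYTHONFAULTHANDLER=1` environment variable in the sbatch file; in that case the traceback goes to stderr, which SLURM normally captures in the job's output file.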

We isolated the problem to the latest version of PyTorch specifically. It must be installed like this.

pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url
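
After pinning, it may be worth confirming that the interpreter used by the job actually picks up that build. A small sketch, assuming nothing beyond PyTorch's public attributes (`torch_build_info` is a hypothetical helper name):

```python
def torch_build_info():
    """Return the PyTorch version and CUDA build seen by this interpreter."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return None
    return {
        "torch": torch.__version__,            # version string of the installed build
        "cuda_build": torch.version.cuda,      # CUDA version torch was built with, or None
        "cuda_available": torch.cuda.is_available(),
    }


print(torch_build_info())
```

If the reported version differs from the pinned one, the job is likely running a different environment than the one that was installed into.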

Hmm interesting, do you happen to know what was going on with the newest release of PyTorch that was causing this issue?

I'll also go ahead and close this issue since it was resolved on your end.

We never went down to that level, sorry.

