Segmentation error on scvi-tools #2388
Comments
Hi, are you getting the segfault during training or when calling |
During training. I had some issues with different versions of PyTorch, torchaudio, torchvision, and CUDA. I could get it to set up the VAE model, but as soon as it starts the training loop it hits a segmentation error. My HPC admin said this might be an issue with conda PyTorch on A100 GPUs, but I reinstalled with pip in a fresh environment and still got the same error. |
Are you able to run another PyTorch (non scvi-tools) model using that environment? Or do any simple PyTorch ops such as matmul, etc? |
Let me try. Give me ten minutes. |
Hi, Sorry for the delay, my task got pushed back in the queue for resources. I ran this test and it worked.
import torch

def gpu_test():
    """
    python -c "import uutils; uutils.torch_uu.gpu_test()"
    """
    from torch import Tensor
    if torch.cuda.is_available():
        device_name = lambda: torch.cuda.get_device_name(torch.cuda.current_device())
    else:
        device_name = lambda: "CUDA not available"
    print(f'device name: {device_name()}')
    x: Tensor = torch.randn(2, 4).cuda()
    y: Tensor = torch.randn(4, 1).cuda()
    out: Tensor = (x @ y)
    assert out.size() == torch.Size([2, 1])
    print(f'Success, no Cuda errors means it worked see:\n{out=}')

gpu_test()
|
Thanks! Could you check if the following snippet runs without errors? Trying to determine if there's an issue with the installed libraries or the data itself:
If this runs without errors, it's likely that you need to perform additional preprocessing on your dataset such as removing cells with low counts. |
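For readers following along, here is a minimal sketch of what "removing cells with low counts" could look like on a raw count matrix. It is not the snippet referred to above and not scvi-tools code; the function name `filter_low_count_cells` and the threshold value are hypothetical and chosen only for illustration. It assumes rows are cells and columns are genes.

```python
import numpy as np


def filter_low_count_cells(counts, min_counts=1):
    """Drop rows (cells) whose total count is below min_counts.

    Hypothetical helper for illustration; the threshold is an assumption
    and should be chosen per dataset.
    """
    counts = np.asarray(counts)
    totals = counts.sum(axis=1)   # total counts per cell
    keep = totals >= min_counts   # boolean mask of cells to keep
    return counts[keep], keep


# Toy matrix: 3 cells x 3 genes; the first cell is empty.
toy = [[0, 0, 0], [5, 1, 0], [0, 2, 0]]
filtered, mask = filter_low_count_cells(toy, min_counts=3)
print(filtered.shape, mask)
```

On an AnnData object, scanpy's `sc.pp.filter_cells(adata, min_counts=...)` applies the same kind of filter in place.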
Ok, running now. |
Hi, complete error message below:
|
Would it be possible to determine which line of code within our library is causing this issue? There doesn't seem to be a full traceback. |
Hi, Apologies for my inexperience, how do I get a full traceback? I'm using SLURM sbatch. Best, |
Hi, Sorry for the bother. I just wanted to follow up on this. Best, |
Hey, do you happen to have a log file that was outputted by SLURM? This might be your best bet for determining what line of code is causing the segfault. |
Hi, Please find attached. |
Sorry, doesn't look like the log files provide any useful additional info. It's hard to debug this since we don't know what exactly is causing this segfault, so it really could be one of SLURM, the GPU itself, our library, the other libraries you have installed, etc, or a combination of all of these. Without any additional info, I wouldn't be able to provide any specific pointers. You could try one of the following and see if it fixes your issue or gives you additional pointers as to what could be causing this:
|
Hi, Ok thank you, I will let you know how that goes. |
Hi, Just a brief update.
Any help would be greatly appreciated. Best, |
One last thing I can suggest trying is just going through the code line-by-line using a debugger to determine what line is being executed before the segfault. |
I would suggest using faulthandler to get more insight: https://docs.python.org/3/library/faulthandler.html. At least we would then know the exact line. You will likely get an easier-to-read error by following https://www.cs.jhu.edu/~aadelucia//2021/08/24/pytorch-errors/ |
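A minimal sketch of the faulthandler suggestion, assuming you can add a few lines to the top of the training script. The helper name `enable_crash_tracebacks` and the log path are hypothetical; `faulthandler.enable` is the standard-library call. When a segfault occurs, Python dumps the stack of every thread to the chosen file, which usually shows the last Python line that was executing.

```python
import faulthandler
import sys


def enable_crash_tracebacks(log_path=None):
    """Turn on faulthandler and return the file that tracebacks are written to.

    Hypothetical helper: with no path it writes to stderr (which SLURM
    normally captures in the job's output file); with a path it writes to
    that file. The file must stay open for as long as faulthandler is on.
    """
    if log_path is None:
        faulthandler.enable(file=sys.stderr, all_threads=True)
        return sys.stderr
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Usage, at the very top of the training script, before importing torch/scvi:
#     crash_log = enable_crash_tracebacks("faulthandler.log")
```

If editing the script is inconvenient, the same handler can be switched on without code changes by setting the `PYTHONFAULTHANDLER=1` environment variable in the sbatch script or by running `python -X faulthandler`.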
Hi, We isolated the problem to the latest version of PyTorch specifically. It must be installed like this:
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117 |
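To confirm that an environment actually picked up the pinned versions from the install command above, something like the following standard-library check may help. It is only a sketch: the helper names `matches` and `check_pins` are hypothetical, and it assumes a pin without a `+cuXXX` suffix should match any local build tag.

```python
from importlib import metadata

# Pins copied from the pip command in this thread.
EXPECTED = {
    "torch": "1.13.1+cu117",
    "torchvision": "0.14.1+cu117",
    "torchaudio": "0.13.1",
}


def matches(installed, pinned):
    """Compare an installed version string against a pin."""
    if installed is None:
        return False
    if "+" in pinned:
        # Pin includes a local build tag (e.g. +cu117): require an exact match.
        return installed == pinned
    # Pin has no build tag: ignore any tag on the installed version.
    return installed.split("+")[0] == pinned


def check_pins(expected):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, pinned in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (installed, matches(installed, pinned))
    return report


for name, (installed, ok) in check_pins(EXPECTED).items():
    print(f"{name}: installed={installed} pinned={EXPECTED[name]} ok={ok}")
```

Running this inside the job itself (rather than on the login node) also shows whether SLURM is picking up the intended environment.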
Hmm interesting, do you happen to know what was going on with the newest release of PyTorch that was causing this issue? |
I'll also go ahead and close this issue since it was resolved on your end |
We never went down to that level, sorry. |
I'm working on a large dataset of 900,000 cells, trying to run scVI on an HPC A100 GPU, but I keep getting a segmentation error.
Versions:
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
absl-py 2.0.0 pypi_0 pypi
aiohttp 3.9.1 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
anndata 0.10.4 pypi_0 pypi
annotated-types 0.6.0 pypi_0 pypi
anyio 4.2.0 pypi_0 pypi
array-api-compat 1.4 pypi_0 pypi
arrow 1.3.0 pypi_0 pypi
async-timeout 4.0.3 pypi_0 pypi
attrs 23.2.0 pypi_0 pypi
backoff 2.2.1 pypi_0 pypi
beautifulsoup4 4.12.2 pypi_0 pypi
blessed 1.20.0 pypi_0 pypi
boto3 1.34.15 pypi_0 pypi
botocore 1.34.15 pypi_0 pypi
bzip2 1.0.8 hd590300_5 conda-forge
ca-certificates 2023.11.17 hbcca054_0 conda-forge
certifi 2022.12.7 pypi_0 pypi
charset-normalizer 2.1.1 pypi_0 pypi
chex 0.1.7 pypi_0 pypi
click 8.1.7 pypi_0 pypi
contextlib2 21.6.0 pypi_0 pypi
contourpy 1.2.0 pypi_0 pypi
croniter 1.4.1 pypi_0 pypi
cycler 0.12.1 pypi_0 pypi
dateutils 0.6.12 pypi_0 pypi
deepdiff 6.7.1 pypi_0 pypi
dm-tree 0.1.8 pypi_0 pypi
docrep 0.3.2 pypi_0 pypi
editor 1.6.5 pypi_0 pypi
etils 1.5.2 pypi_0 pypi
exceptiongroup 1.2.0 pypi_0 pypi
fastapi 0.108.0 pypi_0 pypi
filelock 3.9.0 pypi_0 pypi
flax 0.7.5 pypi_0 pypi
fonttools 4.47.0 pypi_0 pypi
frozenlist 1.4.1 pypi_0 pypi
fsspec 2023.4.0 pypi_0 pypi
get-annotations 0.1.2 pypi_0 pypi
h11 0.14.0 pypi_0 pypi
h5py 3.10.0 pypi_0 pypi
idna 3.4 pypi_0 pypi
importlib-metadata 7.0.1 pypi_0 pypi
importlib-resources 6.1.1 pypi_0 pypi
inquirer 3.2.1 pypi_0 pypi
itsdangerous 2.1.2 pypi_0 pypi
jax 0.4.23 pypi_0 pypi
jaxlib 0.4.23 pypi_0 pypi
jinja2 3.1.2 pypi_0 pypi
jmespath 1.0.1 pypi_0 pypi
joblib 1.3.2 pypi_0 pypi
kiwisolver 1.4.5 pypi_0 pypi
ld_impl_linux-64 2.40 h41732ed_0 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc-ng 13.2.0 h807b86a_3 conda-forge
libgomp 13.2.0 h807b86a_3 conda-forge
libnsl 2.0.1 hd590300_0 conda-forge
libsqlite 3.44.2 h2797004_0 conda-forge
libuuid 2.38.1 h0b41bf4_0 conda-forge
libxcrypt 4.4.36 hd590300_1 conda-forge
libzlib 1.2.13 hd590300_5 conda-forge
lightning 2.0.9.post0 pypi_0 pypi
lightning-cloud 0.5.57 pypi_0 pypi
lightning-utilities 0.10.0 pypi_0 pypi
llvmlite 0.41.1 pypi_0 pypi
markdown-it-py 3.0.0 pypi_0 pypi
markupsafe 2.1.3 pypi_0 pypi
matplotlib 3.8.2 pypi_0 pypi
mdurl 0.1.2 pypi_0 pypi
ml-collections 0.1.1 pypi_0 pypi
ml-dtypes 0.3.2 pypi_0 pypi
mpmath 1.3.0 pypi_0 pypi
msgpack 1.0.7 pypi_0 pypi
mudata 0.2.3 pypi_0 pypi
multidict 6.0.4 pypi_0 pypi
multipledispatch 1.0.0 pypi_0 pypi
natsort 8.4.0 pypi_0 pypi
ncurses 6.4 h59595ed_2 conda-forge
nest-asyncio 1.5.8 pypi_0 pypi
networkx 3.0 pypi_0 pypi
numba 0.58.1 pypi_0 pypi
numpy 1.24.1 pypi_0 pypi
numpyro 0.13.2 pypi_0 pypi
openssl 3.2.0 hd590300_1 conda-forge
opt-einsum 3.3.0 pypi_0 pypi
optax 0.1.7 pypi_0 pypi
orbax-checkpoint 0.4.8 pypi_0 pypi
ordered-set 4.1.0 pypi_0 pypi
packaging 23.2 pypi_0 pypi
pandas 2.1.4 pypi_0 pypi
patsy 0.5.6 pypi_0 pypi
pillow 9.3.0 pypi_0 pypi
pip 23.3.2 pyhd8ed1ab_0 conda-forge
protobuf 4.25.1 pypi_0 pypi
psutil 5.9.7 pypi_0 pypi
pydantic 2.1.1 pypi_0 pypi
pydantic-core 2.4.0 pypi_0 pypi
pygments 2.17.2 pypi_0 pypi
pyjwt 2.8.0 pypi_0 pypi
pynndescent 0.5.11 pypi_0 pypi
pyparsing 3.1.1 pypi_0 pypi
pyro-api 0.1.2 pypi_0 pypi
pyro-ppl 1.8.6 pypi_0 pypi
python 3.9.18 h0755675_1_cpython conda-forge
python-dateutil 2.8.2 pypi_0 pypi
python-multipart 0.0.6 pypi_0 pypi
pytorch-lightning 2.1.3 pypi_0 pypi
pytz 2023.3.post1 pypi_0 pypi
pyyaml 6.0.1 pypi_0 pypi
readchar 4.0.5 pypi_0 pypi
readline 8.2 h8228510_1 conda-forge
requests 2.28.1 pypi_0 pypi
rich 13.7.0 pypi_0 pypi
runs 1.2.0 pypi_0 pypi
s3transfer 0.10.0 pypi_0 pypi
scanpy 1.9.6 pypi_0 pypi
scikit-learn 1.3.2 pypi_0 pypi
scipy 1.11.4 pypi_0 pypi
scvi-tools 1.0.4 pypi_0 pypi
seaborn 0.13.1 pypi_0 pypi
session-info 1.0.0 pypi_0 pypi
setuptools 69.0.3 pyhd8ed1ab_0 conda-forge
six 1.16.0 pypi_0 pypi
sniffio 1.3.0 pypi_0 pypi
soupsieve 2.5 pypi_0 pypi
sparse 0.15.0 pypi_0 pypi
starlette 0.32.0.post1 pypi_0 pypi
starsessions 1.3.0 pypi_0 pypi
statsmodels 0.14.1 pypi_0 pypi
stdlib-list 0.10.0 pypi_0 pypi
sympy 1.12 pypi_0 pypi
tensorstore 0.1.52 pypi_0 pypi
threadpoolctl 3.2.0 pypi_0 pypi
tk 8.6.13 noxft_h4845f30_101 conda-forge
toolz 0.12.0 pypi_0 pypi
torch 2.1.2+cu118 pypi_0 pypi
torchaudio 2.1.2+cu118 pypi_0 pypi
torchmetrics 1.2.1 pypi_0 pypi
torchvision 0.16.2+cu118 pypi_0 pypi
tqdm 4.66.1 pypi_0 pypi
traitlets 5.14.1 pypi_0 pypi
triton 2.1.0 pypi_0 pypi
types-python-dateutil 2.8.19.20240106 pypi_0 pypi
typing-extensions 4.9.0 pypi_0 pypi
tzdata 2023.4 pypi_0 pypi
umap-learn 0.5.5 pypi_0 pypi
urllib3 1.26.13 pypi_0 pypi
uvicorn 0.25.0 pypi_0 pypi
wcwidth 0.2.13 pypi_0 pypi
websocket-client 1.7.0 pypi_0 pypi
websockets 12.0 pypi_0 pypi
wheel 0.42.0 pyhd8ed1ab_0 conda-forge
xarray 2023.12.0 pypi_0 pypi