Multi-GPU workstation crashes during tf.Session() #26653
Comments
@davidenunes mentions the same issue in #18652 (comment), where downgrading the nvidia package to …
I was able to get some detailed error logs. Here is an nvidia-smi call after the crash:
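When comparing reports like this across machines, the relevant field is usually the driver version in the nvidia-smi header. A minimal sketch of pulling it out programmatically; the sample string below is illustrative, not taken from this crash log:

```python
import re

# Illustrative first header line of typical `nvidia-smi` output.
sample = (
    "| NVIDIA-SMI 418.43       Driver Version: 418.43       CUDA Version: 10.1     |"
)

def parse_driver_version(text):
    """Extract the 'Driver Version' field from nvidia-smi header text, or None."""
    match = re.search(r"Driver Version:\s*([\d.]+)", text)
    return match.group(1) if match else None

print(parse_driver_version(sample))  # 418.43
```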
Your kernel seems way too new for CUDA. Please make sure your CUDA version is supported on your Linux distro; otherwise, we cannot tell whether this is a TF issue or not.
TensorFlow only supports CUDA 10.0, so I am not able to upgrade CUDA.
Thank you for reporting this issue, it helped me resolve it on my system. I'm running a nearly identical setup to yours (Arch Linux, 2 GPUs, Linux kernel 5.0.3) and it crashes with the same error. After seeing this issue I downgraded to …
I'm also hitting similar behavior on an up-to-date Debian Sid with two RTX 2080 Ti cards. Currently using …
@sbrodehl I don't know about your case, but in mine, adjusting the batch size in DeepSpeech when training the model helped avoid the crashes. I suspect something is going on with the new driver when the GPU runs out of memory.
OK, never mind, this was noise. In my case the recurrent shutdown was, as I suspected at first, just a power issue. It looks like when installing the new setup I forgot that each RTX needs two PCIe power cables, and I had only wired one, using a Y connector to feed both plugs on each RTX.
Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.
System information
- OS Platform and Distribution: Linux 5.0.0-arch1-1-ARCH #1 SMP PREEMPT Mon Mar 4 14:11:43 UTC 2019 x86_64 GNU/Linux
- TensorFlow installed from: community python-tensorflow-opt-cuda
- TensorFlow version: 1.13.1
- Python version: 3.7.2
- CUDA/cuDNN version: V10.0.130 / 7.5.0
- GPU model and memory: GeForce GTX 1080 Ti 11GB; Driver Version: 418.43
Describe the current behavior
The workstation crashes completely if a tf.Session() is created while multiple GPUs are present. I will roll back the last driver updates and post any updates.
Not sure if this is an error TensorFlow can fix; maybe it is just a faulty driver.
Describe the expected behavior
Workstation should not crash.
Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
or, in short
python -c "import tensorflow as tf; s = tf.Session()"
The following line crashes as well:
CUDA_VISIBLE_DEVICES="0,1" python -c "import tensorflow as tf; s = tf.Session()"
Other info / logs
The problem exists only if multiple GPUs are visible, so the following commands work as expected:
CUDA_VISIBLE_DEVICES="" python -c "import tensorflow as tf; s = tf.Session()"
CUDA_VISIBLE_DEVICES="0" python -c "import tensorflow as tf; s = tf.Session()"
CUDA_VISIBLE_DEVICES="1" python -c "import tensorflow as tf; s = tf.Session()"
I ran the code on a different machine with only one GPU and it worked just fine.
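The per-device checks above can be swept in a single script. A minimal sketch, assuming TensorFlow 1.x is installed (on a machine without it, every setting simply reports "failed"); the probe function and setting list are illustrative, not part of the original report:

```python
import os
import subprocess
import sys

# CUDA_VISIBLE_DEVICES settings to sweep, mirroring the commands above.
SETTINGS = ["", "0", "1", "0,1"]

def probe(devices):
    """Run the minimal tf.Session() test in a child process with the given
    CUDA_VISIBLE_DEVICES value; return True if it exits cleanly."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=devices)
    proc = subprocess.run(
        [sys.executable, "-c", "import tensorflow as tf; tf.Session()"],
        env=env,
        capture_output=True,
    )
    return proc.returncode == 0

for devices in SETTINGS:
    status = "ok" if probe(devices) else "failed"
    print(f'CUDA_VISIBLE_DEVICES="{devices}": {status}')
```

Running each case in a child process keeps a hard crash in one configuration from taking down the sweep itself.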