
Error when running two simultaneous sessions #4196

Closed
JohnRobson opened this issue Sep 4, 2016 · 17 comments

@JohnRobson commented Sep 4, 2016

Can I run two simultaneous sessions?

I start with: "with tf.Graph().as_default(), tf.Session() as sess:"

When I run my script alone, the result is OK, but when I run 2 instances the result is completely wrong, or it throws errors like:

Error max() arg is an empty sequence
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: Quadro K2000M
major: 3 minor: 0 memoryClockRate (GHz) 0.745
pciBusID 0000:01:00.0
Total memory: 1.95GiB
Free memory: 1.47GiB

(and also)
float division by zero

Can I do something in my code to avoid these errors?

Maybe check if TF is already running and hold it?
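The "check if TF is already running and hold it" idea could be sketched as a cross-process file lock (a POSIX-only sketch; the lock path and the use of `fcntl.flock` are my assumptions, not anything TensorFlow provides):

```python
import fcntl
import os

# Hypothetical guard: hold an exclusive file lock for the lifetime of the
# session, so a second instance blocks instead of sharing the GPU.
LOCK_PATH = "/tmp/tf_gpu.lock"  # assumed path

lock_file = open(LOCK_PATH, "w")
fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the other instance releases it
try:
    # with tf.Graph().as_default(), tf.Session() as sess:
    #     ...  # only one process reaches this point at a time
    pass
finally:
    fcntl.flock(lock_file, fcntl.LOCK_UN)
    lock_file.close()
```

This serializes the two runs rather than letting them share the GPU, so the second instance simply waits.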

Thank you very much.

OS: Arch Linux

$ ls -l /opt/cuda/lib/libcud*
/opt/cuda/lib/libcudadevrt.a
/opt/cuda/lib/libcudart.so.7.5.18
/opt/cuda/lib/libcudart_static.a
/opt/cuda/lib/libcudnn.so -> /opt/cuda/lib64/libcudnn.so.4.0.7

PIP: 0.9.0

@tatatodd (Contributor) commented Sep 7, 2016

If I'm understanding you correctly, you are running two instances of the same python program, each in its own process. I believe that should work.

One simple suggestion is to try with the latest release version 0.10.0.

Otherwise, can you try to boil your problem down to a minimal program that exhibits the failure? That will either narrow down the problem or allow me to reproduce it. Thanks!

@vrv (Contributor) commented Sep 7, 2016

Actually you probably don't want to be using the same GPU simultaneously from two TensorFlow processes. It's somewhat safe to pause one while the other uses the GPU, but if they are both actively using the GPU, bad things are likely to happen, I believe.

@kracwarlock commented Sep 22, 2016

I have tried using the same GPU from two TF processes simultaneously. Sometimes both run; sometimes one crashes with messages like these:

E tensorflow/stream_executor/cuda/cuda_blas.cc:367] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
W tensorflow/stream_executor/stream.cc:1334] attempting to perform BLAS operation using StreamExecutor without BLAS support
E tensorflow/stream_executor/cuda/cuda_blas.cc:367] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
W tensorflow/stream_executor/stream.cc:1334] attempting to perform BLAS operation using StreamExecutor without BLAS support
E tensorflow/stream_executor/cuda/cuda_blas.cc:367] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
W tensorflow/stream_executor/stream.cc:1334] attempting to perform BLAS operation using StreamExecutor without BLAS support
E tensorflow/core/client/tensor_c_api.cc:485] Blas SGEMM launch failed : a.shape=(128, 100), b.shape=(100, 16), m=128, n=16, k=100

Why is that?

@ArturoDeza commented Nov 18, 2016

Agree with the post above (and had the same errors complaining about not finding CUBLAS). I had to close my other TensorFlow terminals to re-run a new TensorFlow program that uses a single GPU.

@smhoang commented Jan 20, 2017

I got the same issue when running 2 TensorFlow/Keras programs at the same time, each in its own process. The error is: "failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED". Any suggestions? Thanks a lot.

@jpowie01 commented Feb 25, 2017

I've also encountered this problem today. Any tips?

@yaroslavvb (Contributor) commented Feb 26, 2017

CUBLAS_STATUS_NOT_INITIALIZED can be thrown when you are out of memory, which can happen when you run two GPU TensorFlow processes in parallel. TensorFlow doesn't like sharing GPU memory with another process; the solution is to run only a single TensorFlow process per GPU (using CUDA_VISIBLE_DEVICES=k to restrict the k'th process to the k'th GPU).
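A minimal sketch of this workaround from inside the script itself (the device index `"0"` is just an example; the key constraint is that the variable must be set before TensorFlow is first imported):

```python
import os

# Pin this process to a single GPU *before* importing TensorFlow;
# once TF has initialized CUDA, changing this variable has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # example index; "" disables the GPU

# import tensorflow as tf  # import only after the variable is set
```

Each process then sees exactly one device, which it addresses as `/gpu:0` regardless of the physical index.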

@DanielTakeshi commented Mar 30, 2017

Just to be clear, this means that if I have only one GPU on my machine, I should only have one TensorFlow session (or program) active? No other easy way around it?

To be concrete, I have a file dqn.py which runs the DQN algorithm on my GPU. Can I run python dqn.py in one terminal window, then open the script, change the random seed, and run python dqn.py again in a second terminal window?

I assume I'd need to set the GPU memory allocation to be <50% for this to work.

@yaroslavvb (Contributor) commented Mar 30, 2017

You should have at most one active TensorFlow process per GPU. Use CUDA_VISIBLE_DEVICES to determine which process gets which GPU (CUDA_VISIBLE_DEVICES='' disables the GPU).
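The launcher-side version of the same advice might look like this (`dqn.py` is the script from the question above; a two-GPU machine is assumed):

```shell
# One TensorFlow process per GPU: each process sees its assigned card as /gpu:0.
CUDA_VISIBLE_DEVICES=0 python dqn.py &   # first process -> GPU 0
CUDA_VISIBLE_DEVICES=1 python dqn.py &   # second process -> GPU 1
CUDA_VISIBLE_DEVICES="" python dqn.py    # runs on CPU only
wait
```

The per-command `VAR=value cmd` form sets the variable only in that child process's environment, so the three instances never see each other's GPUs.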

@xolott commented May 26, 2017

I'm trying to run 2 models in 2 different processes (multiprocessing in Python). Each process creates a session; one loads an MTCNN model and the other loads a CNN. Both work fine about 1 time in 20.
If I run either process alone, it works just fine. I set <50% of GPU memory on each process.
My error is:
InternalError: Failed to create session.

Or:
F tensorflow/core/kernels/conv_ops.cc:605] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)

It is always the same process that fails...

@sagardatascientists commented Jan 28, 2018

It's better to use Theano if you want to run two processes on the same GPU.

@ayyappa428 commented May 14, 2018

How do I create multiple instances using Theano? If I change OMP_NUM_THREADS=2, two cores will be enabled. But I want to create more instances, like multiprocessing, instead of having multiple workers. I want to send multiple requests using Theano; is that possible?

@zwfcrazy commented May 23, 2018

By using gpu_options.per_process_gpu_memory_fraction or gpu_options.allow_growth to limit the memory pre-allocation of each TF process, I was able to run multiple TF processes on the same GPU simultaneously (of course, your models need to be small enough to do so).
But I am not sure whether there will be any other problems.
I also don't know how this affects the computing speed of each process.
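A sketch of this setup, using the TF 1.x `ConfigProto` API named above; the `safe_fraction` helper and its 10% headroom figure are my own assumptions, not from the thread:

```python
# Hypothetical helper: pick a per-process memory fraction for N concurrent
# processes, reserving some headroom for CUDA context overhead.
def safe_fraction(n_processes, headroom=0.10):
    return (1.0 - headroom) / n_processes

frac = safe_fraction(2)  # two processes -> 0.45 of GPU memory each

# TF 1.x usage (requires tensorflow; shown commented-out here):
# import tensorflow as tf
# config = tf.ConfigProto()
# config.gpu_options.per_process_gpu_memory_fraction = frac
# config.gpu_options.allow_growth = True  # alternative: allocate on demand
# sess = tf.Session(config=config)
```

`per_process_gpu_memory_fraction` caps the allocation up front, while `allow_growth` starts small and grows on demand; either way, the fractions of all concurrent processes need to fit in the card's memory.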

@xolott commented May 23, 2018

As far as I know, TF can't have more than one session per GPU. It will have problems accessing GPU memory and using the CUDA cores. If you want multiple models, look at TensorFlow Serving.

@twangnh commented Jun 26, 2018

Same problem with multiple processes running on the same GPU. I have some machines with a Titan X and Ubuntu 14, others with a P40 and CentOS 7. With gpu_options.per_process_gpu_memory_fraction and gpu_options.allow_growth, I can run on the Titan X machines, but I always get CUBLAS_STATUS_NOT_INITIALIZED on the P40 machines.

@loretoparisi commented Nov 23, 2018

@tatatodd I think this issue should not be closed, but tagged as a question. In our case we are running Keras and a TensorFlow model, and the same error occurs.

@jotes35 commented Nov 30, 2018

I have a similar problem. I use an HPC server with 6 different compute nodes. I ran TensorFlow simultaneously on two different nodes and still got an error on one of them. Isn't that strange?
