
Session hang issue with python multiprocessing #8220

Closed · rfeinman opened this issue Mar 8, 2017 · 20 comments
@rfeinman commented Mar 8, 2017

Issue summary

I am having trouble allocating GPU devices for a multiprocessing pool. Please see the short code reproduction below. I would like to understand why I am getting the CUDA_ERROR_NOT_INITIALIZED error in Case 4. In that case the program hangs, and I have to stop my Docker container to exit.

Minimal reproducible example

core code:

import tensorflow as tf

def run_session(device):
    gpu_options = tf.GPUOptions(allow_growth=True, visible_device_list=device)
    sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
    print('Using device #%s' % device)
    a = tf.placeholder(tf.int16, name='a')
    y = tf.identity(a, name='y')
    print(sess.run(y, feed_dict={a: 3}))
    sess.close()
    print('Done.')

Case 1 (this works fine):

run_session('0')
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:08:00.0
Total memory: 5.97GiB
Free memory: 5.86GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:08:00.0)
Using device #0
3
Done.

Case 2 (this works fine):

run_session('0')
run_session('1')
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:08:00.0
Total memory: 5.97GiB
Free memory: 5.86GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:08:00.0)
Using device #0
3
Done.
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x24cbbe0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:84:00.0
Total memory: 5.97GiB
Free memory: 5.86GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 1, name: GeForce GTX 980 Ti, pci bus id: 0000:84:00.0)
Using device #1
3
Done.

Case 3 (this works fine):

import multiprocessing as mp

p = mp.Pool(2)
p.map(run_session, ['0', '1'])
p.close()
p.join()
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:84:00.0
Total memory: 5.97GiB
Free memory: 5.86GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 1, name: GeForce GTX 980 Ti, pci bus id: 0000:84:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:08:00.0
Total memory: 5.97GiB
Free memory: 5.86GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:08:00.0)
Using device #1
Using device #0
3
Done.
3
Done.

Case 4 (here, the program hangs):

import multiprocessing as mp

run_session('0')
p = mp.Pool(2)
p.map(run_session, ['0', '1'])
p.close()
p.join()
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:08:00.0
Total memory: 5.97GiB
Free memory: 5.86GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:08:00.0)
Using device #0
3
Done.
E tensorflow/stream_executor/cuda/cuda_driver.cc:1368] could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED
Using device #0
E tensorflow/stream_executor/cuda/cuda_driver.cc:1368] could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED
Using device #1

Environment info

Operating System: Ubuntu 14.04.4 LTS (GNU/Linux 3.19.0-25-generic x86_64)
Docker container: gcr.io/tensorflow/tensorflow:latest-devel-gpu
CUDA version: 8.0.61
cuDNN version: 5.1.10

Related GitHub issues

#1578

@prb12 (Member) commented Mar 8, 2017

@suharshs This seems like it may be down to device creation in DirectSession. Could you please comment?

@suharshs suharshs self-assigned this Mar 8, 2017

@suharshs suharshs assigned zheng-xq and unassigned suharshs Apr 24, 2017

@brandon-white commented May 16, 2017

@prb12 @suharshs @zheng-xq Have any of you taken a look at this one yet? I am getting the same issue for case 4. Any ideas for a temporary fix?

@suharshs suharshs assigned suharshs and unassigned zheng-xq May 17, 2017

@suharshs (Member) commented May 17, 2017

Apologies for the delay; I will take a look at this soon.

@suharshs suharshs changed the title GPU session hang issue with multiprocessing Session hang issue with python multiprocessing May 18, 2017

@suharshs (Member) commented May 18, 2017

Update: I have looked into this a bit more, and have a couple more interesting repro cases :)
Works:

run_session('0')
run_session('0')

Hangs:

run_session('0')
p = mp.Process(target=run_session, args=('0',))
p.start()
p.join()

It looks like there is some shared Python TensorFlow state that interferes when a new Python process is created (multiprocessing creates a new Python process whose degree of state separation I am not too clear on). I plan to look into it soon, but wanted to provide an update in case it suggests any workarounds.

@suharshs (Member) commented May 19, 2017

The python multiprocessing package seems to just call fork when creating a child process. This cannot work safely when the forking process is multithreaded (and TensorFlow is multithreaded). From the POSIX spec for fork:

If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called.

So long story short, don't use python multiprocessing for anything non-trivial and expect it to work :)
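
For readers hitting this pattern: a minimal sketch of a workaround along these lines, assuming Python 3.4+ where multiprocessing.get_context('spawn') is available. With 'spawn', each worker starts a fresh interpreter instead of forking the multithreaded parent, so the Case 4 pattern no longer inherits the parent's CUDA state. run_session is the function from the original repro, assumed to be defined at module level in the same script.

import multiprocessing as mp

if __name__ == '__main__':
    run_session('0')                # the parent now holds CUDA state
    ctx = mp.get_context('spawn')   # fresh interpreter per worker, no fork
    p = ctx.Pool(2)
    p.map(run_session, ['0', '1'])  # workers initialize CUDA themselves
    p.close()
    p.join()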

@Lancerchiang commented Mar 3, 2018

Hi, I had the same issue today, but it can be resolved by putting import tensorflow as tf inside your worker function (and the work is then properly parallelised).
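
A minimal sketch of this tip applied to the original repro (a sketch only, following what the comment above reports): the tensorflow import moves inside the worker, so the parent never initializes CUDA before the pool forks.

import multiprocessing as mp

def run_session(device):
    # imported per-process: the parent never touches CUDA before forking
    import tensorflow as tf
    gpu_options = tf.GPUOptions(allow_growth=True, visible_device_list=device)
    sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
    a = tf.placeholder(tf.int16, name='a')
    y = tf.identity(a, name='y')
    print(sess.run(y, feed_dict={a: 3}))
    sess.close()

if __name__ == '__main__':
    p = mp.Pool(2)
    p.map(run_session, ['0', '1'])
    p.close()
    p.join()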

@Lancerchiang commented Apr 23, 2018

@suharshs Python multiprocessing works fine with tensorflow. The only thing to note is that tensorflow must be imported independently inside each process (you must use multiprocessing rather than multithreading, since tensorflow takes over the entire process). Below is how I achieved multi-GPU, multiprocess inference; I hope it helps:

import os
import multiprocessing


class Predictor(multiprocessing.Process):
    def __init__(self, input_queue, gpu_id):
        multiprocessing.Process.__init__(self)
        self.input_queue = input_queue
        self.gpu_id = gpu_id
    def run(self):
        #set GPU id before importing tensorflow!!!!!!!!!!!!!
        os.environ["CUDA_VISIBLE_DEVICES"] = "{}".format(self.gpu_id)
        #import tensorflow here
        import tensorflow as tf
        sess = tf.Session()
        print('Using device #%s' % self.gpu_id)
        a = tf.placeholder(tf.int16, name='a')
        y = tf.identity(a, name='y')
        while True:
            input = self.input_queue.get()
            if input is None:
                self.input_queue.task_done()
                print("Exiting Process %d" % self.gpu_id)
                break
            else:
                print(sess.run(y, feed_dict={a: input}))
                self.input_queue.task_done()
        sess.close()
        return

if __name__ == "__main__":
    jobs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    num_gpus = 2
    p_list = []
    input_queue = multiprocessing.JoinableQueue()

    for i in range(num_gpus):
        p = Predictor(input_queue, i)
        p_list.append(p)
    for p in p_list:
        p.start()
    for job in jobs:
        input_queue.put(job)
    for i in range(num_gpus):
        input_queue.put(None)

    input_queue.join()
    for p in p_list:
        p.join()

@breckuh commented May 9, 2018

I am also running into this issue. Multiprocessing works unless I first run a session in the parent process. I've tried moving the import tensorflow statement into the worker function as @Lancerchiang suggested, with no luck. Below is my minimal repro with 4 test cases.

import os
import tensorflow
from multiprocessing.pool import Pool

def runInSubprocess(somearg):
    print('Training model on process id {}.'.format(os.getpid()))
    with tensorflow.Session() as sess:
        sess.run(tensorflow.global_variables_initializer())

# This Hangs:
runInSubprocess(2)
Pool(processes=2).map(runInSubprocess, [1,2])

# This works:
runInSubprocess(2)
runInSubprocess(2)

# This works:
Pool(processes=2).map(runInSubprocess, [1,2])
Pool(processes=2).map(runInSubprocess, [1,2])

# This works:
Pool(processes=2).map(runInSubprocess, [1,2])
runInSubprocess(2)

@Lancerchiang commented May 11, 2018

@breckuh If you really need to run a tensorflow session in your parent process, my advice is to launch explicit child processes as I did above instead of using pool mapping, and to import tensorflow in your parent process only after you have done so in your child processes.

import os
import multiprocessing
import time

class Predictor(multiprocessing.Process):
    def __init__(self, input_queue, gpu_id):
        multiprocessing.Process.__init__(self)
        self.input_queue = input_queue
        self.gpu_id = gpu_id
    def run(self):
        #set GPU id before importing tensorflow!!
        #os.environ["CUDA_VISIBLE_DEVICES"] = "{}".format(self.gpu_id)
        import tensorflow as tf
        sess = tf.Session()
        print('Using device #%s' % self.gpu_id)
        a = tf.placeholder(tf.int16, name='a')
        y = tf.identity(a, name='y')
        while True:
            input = self.input_queue.get()
            if input is None:
                self.input_queue.task_done()
                print("Exiting Process %d" % self.gpu_id)
                break
            else:
                print(sess.run(y, feed_dict={a: input}))
                self.input_queue.task_done()
        sess.close()
        return

if __name__ == "__main__":
    works = [4,5]
    num_gpus = 2
    p_list = []
    input_queue = multiprocessing.JoinableQueue()

    for i in range(num_gpus):
        p = Predictor(input_queue, i)
        p_list.append(p)
    for p in p_list:
        p.start()

    time.sleep(2)

    import tensorflow as tf

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

    for work in works:
        input_queue.put(work)
    for i in range(num_gpus):
        input_queue.put(None)

    input_queue.join()
    for p in p_list:
        p.join()

It would give:

2018-05-11 11:01:57.844637: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX AVX2 FMA
2018-05-11 11:01:57.844638: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX AVX2 FMA
Using device #1
Using device #0
2018-05-11 11:01:59.207167: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX AVX2 FMA
4
5
Exiting Process 1
Exiting Process 0

You can see the three tensorflow sessions finished successfully.

@breck7 commented May 11, 2018

Thanks @Lancerchiang, that makes sense. I don't actually know if we'll ever have this use case in practice; it only came up because our test suite was failing when certain tests were run in different orders, and we fell down a rabbit hole isolating this :). In the end we used the workaround of arranging our suite to run the child-process tests first and the parent-process tests after. Not ideal, but good enough for now. What I would like to do is add a line or two to check whether this hang might occur and then throw/alert the user, so no one is left hanging.
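
A hedged sketch of what such a guard could look like; tracked_session and safe_pool are names invented for this sketch, not TensorFlow or multiprocessing API. The idea is to record whether this process has ever opened a Session and refuse to fork workers afterwards, failing loudly instead of hanging.

import multiprocessing as mp

_session_opened = False  # has this process initialized CUDA via a Session?

def tracked_session(*args, **kwargs):
    # hypothetical wrapper: open a tf.Session and remember that we did
    global _session_opened
    import tensorflow as tf
    _session_opened = True
    return tf.Session(*args, **kwargs)

def safe_pool(processes):
    # hypothetical guard: refuse to fork once a Session exists in this process
    if _session_opened:
        raise RuntimeError(
            'A tf.Session was already created in this process; forking '
            'pool workers now is likely to hang (see this issue). Start '
            'the workers first, or use the spawn start method.')
    return mp.Pool(processes)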

@mayankchatteron1 commented Jul 18, 2018

@mrry, I am facing the problem "could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED" due to a memory issue while using Django Celery.
Is there any workaround for Python multiprocessing or Celery in the case of distributed TensorFlow with GPU?

@fengyuan777 commented Aug 21, 2018

import numpy as np
import tensorflow as tf
from multiprocessing import Process, Pool
import os
import time

def run_proc(name, session):
    process_session = session
    process_input = process_session.graph.get_tensor_by_name('input:0')
    process_output = process_session.graph.get_tensor_by_name('output:0')
    res = process_session.run(process_output, feed_dict={process_input: np.ones((10, 2))})

if __name__ == '__main__':
    session = tf.Session()
    with session.as_default():
        input = tf.placeholder(dtype=tf.float32, shape=[None, 2], name='input')
        tmp = tf.ones(shape=[10, 2])
        add_output = tf.add(x=input, y=tmp, name='output')
    print('Parent process %s.' % os.getpid())
    p = Process(target=run_proc, args=('test', session))
    print('Process will start.')
    p.start()
    p.join()
    print('Process end.')

It gets stuck when the new process reaches session.run(..., feed_dict={...}).
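
A sketch of one way around this, under the assumption (consistent with @breckuh's test cases above) that importing tensorflow and building a graph in the parent is safe as long as no Session is created there: serialize only the GraphDef and let the child build its own Session. The tensor names 'input:0' and 'output:0' come from the repro above.

import os
import numpy as np
from multiprocessing import Process

def run_proc(serialized_graph):
    import tensorflow as tf              # CUDA is initialized in the child only
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(serialized_graph)
    with tf.Graph().as_default() as g:
        tf.import_graph_def(graph_def, name='')
        inp = g.get_tensor_by_name('input:0')
        out = g.get_tensor_by_name('output:0')
        with tf.Session(graph=g) as sess:
            print(sess.run(out, feed_dict={inp: np.ones((10, 2))}))

if __name__ == '__main__':
    import tensorflow as tf
    tf.add(tf.placeholder(tf.float32, [None, 2], name='input'),
           tf.ones([10, 2]), name='output')
    # a GraphDef is a protobuf; its serialized bytes cross the process
    # boundary safely, unlike a live Session object
    serialized = tf.get_default_graph().as_graph_def().SerializeToString()
    print('Parent process %s.' % os.getpid())
    p = Process(target=run_proc, args=(serialized,))
    p.start()
    p.join()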

@Wesley-Li commented Sep 20, 2018

(Quoting @mayankchatteron1 above on CUDA_ERROR_NOT_INITIALIZED with Django Celery.)

I just hit the same issue when using a Celery worker to run TensorFlow on GPU. Is this issue solved?

@homedawn commented Feb 14, 2019

I hit the problem: PicklingError: Can't pickle <type 'module'>: attribute lookup __builtin__.module failed. @rfeinman

@zhangjinyangnwpu commented Mar 21, 2019

(Quotes @Lancerchiang's multi-GPU Predictor example from above.)

I wonder, if I want to get a return value from each process, how should I do it? Return it from the run method? And how would I retrieve that return value?

@Lancerchiang commented Mar 21, 2019

I wonder, if I want to get a return value from each process, how should I do it? Return it from the run method? And how would I retrieve that return value?

import os
import multiprocessing


class Predictor(multiprocessing.Process):
    def __init__(self, input_queue, output_queue, gpu_id):
        multiprocessing.Process.__init__(self)
        self.input_queue = input_queue
        self.output_queue = output_queue
        self.gpu_id = gpu_id

    def run(self):
        #set GPU id before importing tensorflow!!!!!!!!!!!!!
        os.environ["CUDA_VISIBLE_DEVICES"] = "{}".format(self.gpu_id)
        #import tensorflow here
        import tensorflow as tf
        sess = tf.Session()
        print('Using device #%s' % self.gpu_id)
        a = tf.placeholder(tf.int16, name='a')
        y = tf.identity(a, name='y')
        while True:
            input = self.input_queue.get()
            if input is None:
                self.input_queue.task_done()
                print("Exiting Process %d" % self.gpu_id)
                break
            else:
                res = sess.run(y, feed_dict={a: input})
                self.input_queue.task_done()
                self.output_queue.put(res)
        sess.close()
        return

if __name__ == "__main__":
    jobs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    num_gpus = 2
    p_list = []
    input_queue = multiprocessing.JoinableQueue()
    output_queue = multiprocessing.Queue()
    for i in range(num_gpus):
        p = Predictor(input_queue, output_queue, i)
        p_list.append(p)

    for p in p_list:
        p.start()

    for job in jobs:
        input_queue.put(job)

    for i in range(num_gpus):
        input_queue.put(None)

    for _ in range(len(jobs)):   # one result per job, not one per GPU
        print(output_queue.get())

    input_queue.join()
    
    for p in p_list:
        p.join()

@zhangjinyangnwpu commented Mar 21, 2019

For the code, I'm confused about why you do:

for i in range(num_gpus):
    input_queue.put(None)

What does this part mean?
@Lancerchiang

@Lancerchiang commented Mar 21, 2019

For the code, I'm confused about why you do:

for i in range(num_gpus):
    input_queue.put(None)

What does this part mean?
@Lancerchiang

@zhangjinyangnwpu The workers won't know the tasks are done if this signal is not broadcast.
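
To make the sentinel ("poison pill") pattern concrete, here is a minimal TensorFlow-free sketch: each worker exits on the first None it dequeues, so the parent must enqueue exactly one None per worker, or some workers never exit and join() hangs.

import multiprocessing

def worker(q):
    while True:
        item = q.get()
        if item is None:      # this worker's own shutdown signal
            q.task_done()
            break
        print('processed', item)
        q.task_done()

if __name__ == '__main__':
    num_workers = 2
    q = multiprocessing.JoinableQueue()
    procs = [multiprocessing.Process(target=worker, args=(q,))
             for _ in range(num_workers)]
    for p in procs:
        p.start()
    for job in [1, 2, 3]:
        q.put(job)
    for _ in range(num_workers):  # one sentinel per worker
        q.put(None)
    q.join()
    for p in procs:
        p.join()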

@xieyi4650 commented Apr 22, 2019

(Quotes @Lancerchiang's explicit child-process example and its output from above.)

I really need to run a session (written in C++, called from Python 3) in the parent process, and the program hangs before the multi-process evaluation on GPU. As far as I know, GPU memory is process-bound, which means only killing the process releases the memory. Does this mean the GPU used by the parent process's session cannot be used for the subsequent multi-process evaluation?
When launching explicit child processes as you described (without input or output queues), the same problem occurred as with pool.apply_async(func, args).
Besides, I am building tools directly on the tensorflow source code, so there is no import tensorflow as tf.
Any ideas how to fix this? @Lancerchiang

@Lancerchiang commented Apr 23, 2019

(Quotes @xieyi4650's question above.)

I currently don't know how to make different processes share the same GPU memory at the Python level. But for the session hang issue, how about trying to reload your customized module in the child processes? fork() is called when Python starts a new process, and the parent process's module state is copied as well; I guess reimporting the module might solve the problem.
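
A sketch of that reimport suggestion; my_tf_tools and run_evaluation are hypothetical names standing in for @xieyi4650's custom C++ bindings, and whether reload actually resets C-level CUDA state after a fork is not guaranteed.

import importlib
import multiprocessing

def child_entry():
    import my_tf_tools              # hypothetical custom module
    importlib.reload(my_tf_tools)   # drop module state copied from the parent
    my_tf_tools.run_evaluation()    # hypothetical entry point

if __name__ == '__main__':
    p = multiprocessing.Process(target=child_entry)
    p.start()
    p.join()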
