
Avoiding blocking of processes due to lack of data #19

Closed
ghost opened this issue Sep 21, 2017 · 2 comments

Comments

@ghost

ghost commented Sep 21, 2017

The code relevant to this issue can be found here.
Situation of the problem
I am using tf.contrib.staging.StagingArea to make efficient use of the GPUs by prefetching data.
To explain the issue better, here is a small snippet taken from the code above:

with tf.device("/gpu:0"):
        runningcorrect = tf.get_variable("runningcorrect", [], dtype=tf.float32, initializer=tf.zeros_initializer(), trainable=False)
        runningnum = tf.get_variable("runningnum", [], dtype=tf.float32, initializer=tf.zeros_initializer(), trainable=False)
    for i in range(numgpus):
        with tf.variable_scope(tf.get_variable_scope(), reuse=i>0) as vscope:
            with tf.device('/gpu:{}'.format(i)):
                with tf.name_scope('GPU-Tower-{}'.format(i)) as scope:
                    stagingarea = tf.contrib.staging.StagingArea([tf.float32, tf.int32], shapes=[[trainbatchsize, 3, 221, 221], [trainbatchsize]], capacity=20)
                    stagingclarify.append(stagingarea.clear())
                    putop = stagingarea.put(input_iterator.get_next())
                    train_put_list.append(putop)
                    getop = stagingarea.get()
                    train_get_list.append(getop)
                    elem = train_get_list[i]
                    net, networksummaries =  overfeataccurate(elem[0],numclasses=1000)

So I am using one tf.contrib.staging.StagingArea per GPU. Each StagingArea takes its input from a tf.contrib.data.Dataset through a tf.contrib.data.Iterator, and on each GPU the input is read back with a StagingArea.get() op.

The Problem
Initially the training works fine. Towards the end of an epoch, however, when a StagingArea does not receive trainbatchsize tensors and the tf.contrib.data.Iterator has raised a tf.errors.OutOfRangeError, the training blocks. It is clear why this happens, but I cannot think of a clean way to fix it.
Can I get some insight into this issue?
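To make the failure mode concrete, here is a self-contained toy version of the same pattern (not code from the repository; the small Dataset.range pipeline, single tower and loop are placeholder assumptions only):

    import tensorflow as tf

    # Toy stand-in for the real pipeline: a tiny dataset feeding one StagingArea.
    dataset = tf.contrib.data.Dataset.range(10).batch(2)
    iterator = dataset.make_one_shot_iterator()

    area = tf.contrib.staging.StagingArea([tf.int64], capacity=20)
    put_op = area.put([iterator.get_next()])
    get_op = area.get()

    with tf.Session() as sess:
        sess.run(put_op)                  # warm-up: stage the first batch
        while True:
            try:
                # each step consumes the previously staged batch and stages the next
                sess.run([get_op, put_op])
            except tf.errors.OutOfRangeError:
                # near the end of the epoch the iterator is exhausted, so the put
                # fails and the StagingArea stops being refilled
                break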

@vahidk
Owner

vahidk commented Sep 22, 2017

Is this during training or at test time? During training you should be able to provide data indefinitely by calling dataset.repeat().
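For illustration, a minimal sketch of that idea (the placeholder data, images, labels and trainbatchsize are assumptions to keep the sketch self-contained, not code from this repository):

    import numpy as np
    import tensorflow as tf

    # Placeholder data and batch size, only to make the sketch runnable.
    images = np.zeros([256, 3, 221, 221], dtype=np.float32)
    labels = np.zeros([256], dtype=np.int32)
    trainbatchsize = 32

    dataset = (tf.contrib.data.Dataset.from_tensor_slices((images, labels))
               .shuffle(buffer_size=256)
               .repeat()                  # cycle over the data indefinitely
               .batch(trainbatchsize))
    input_iterator = dataset.make_one_shot_iterator()
    # with repeat(), input_iterator.get_next() never raises OutOfRangeError during
    # training, so the StagingArea.put ops always have something to stage

With repeat() there is no epoch boundary in the graph any more, so epochs have to be counted in terms of training steps instead.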

@ghost
Author

ghost commented Sep 22, 2017

I am looking into your suggestions. I also found out that the TF commit I was using has a few faults, but I need to investigate more to make sure I am on the right track. Please do not close this issue until then.

@ghost ghost closed this as completed Sep 23, 2017