The code relevant to this issue can be found here.

Situation of the problem
I am using tf.contrib.staging.StagingArea to make efficient use of the GPUs by prefetching.
To explain the issue better, here is a small snippet from the code linked above:
```python
with tf.device("/gpu:0"):
    runningcorrect = tf.get_variable("runningcorrect", [], dtype=tf.float32,
                                     initializer=tf.zeros_initializer(), trainable=False)
    runningnum = tf.get_variable("runningnum", [], dtype=tf.float32,
                                 initializer=tf.zeros_initializer(), trainable=False)

for i in range(numgpus):
    with tf.variable_scope(tf.get_variable_scope(), reuse=i > 0) as vscope:
        with tf.device('/gpu:{}'.format(i)):
            with tf.name_scope('GPU-Tower-{}'.format(i)) as scope:
                # One StagingArea per GPU, buffering up to 20 batches
                stagingarea = tf.contrib.staging.StagingArea(
                    [tf.float32, tf.int32],
                    shapes=[[trainbatchsize, 3, 221, 221], [trainbatchsize]],
                    capacity=20)
                stagingclarify.append(stagingarea.clear())
                putop = stagingarea.put(input_iterator.get_next())
                train_put_list.append(putop)
                getop = stagingarea.get()
                train_get_list.append(getop)
                elem = train_get_list[i]
                net, networksummaries = overfeataccurate(elem[0], numclasses=1000)
```
So I am using a tf.contrib.staging.StagingArea on each GPU. Each StagingArea takes its input from a tf.contrib.data.Dataset using a tf.contrib.data.Iterator. For each GPU the input is taken from the StagingArea using a StagingArea.get() op.
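To make the pattern concrete, here is a plain-Python sketch of the same put/get staging idea, using a background thread and a bounded queue instead of the TF ops (the `prefetch` helper and its names are hypothetical, purely for illustration, not part of the TF API):

```python
import queue
import threading

def prefetch(iterator, capacity=20):
    """Stage items from `iterator` into a bounded buffer on a
    background thread; the consumer pulls them with get(),
    analogous to StagingArea.put()/StagingArea.get()."""
    buf = queue.Queue(maxsize=capacity)  # bounded, like capacity=20 above

    def producer():
        for batch in iterator:
            buf.put(batch)  # analogous to stagingarea.put(...)

    threading.Thread(target=producer, daemon=True).start()
    return buf

batches = prefetch(iter(range(5)))
result = [batches.get() for _ in range(5)]  # analogous to stagingarea.get()
print(result)  # [0, 1, 2, 3, 4]
```

Note that this sketch has the same weakness as the TF graph: if the consumer calls `get()` more times than the producer can supply, it blocks forever.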
The Problem
Initially the training works fine. Towards the end of an epoch, however, when a StagingArea does not receive trainbatchsize tensors and the tf.contrib.data.Iterator has raised a tf.errors.OutOfRangeError, the training blocks. It is clear why this problem is happening, but I am not able to think of a clean way to correct it.
Can I get insights into this issue?
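One possible way to avoid the deadlock, sketched here in plain Python rather than TF ops (the sentinel approach and all names are my own illustration, not the StagingArea API): have the producer push an explicit end-of-epoch marker when the iterator is exhausted, so the consumer's `get()` returns instead of blocking.

```python
import queue
import threading

_SENTINEL = object()  # end-of-epoch marker; plays the role of catching OutOfRangeError

def prefetch_with_sentinel(iterator, capacity=20):
    buf = queue.Queue(maxsize=capacity)

    def producer():
        for batch in iterator:  # a tf Iterator would raise OutOfRangeError when done
            buf.put(batch)
        buf.put(_SENTINEL)      # signal exhaustion instead of leaving get() blocked

    threading.Thread(target=producer, daemon=True).start()
    return buf

buf = prefetch_with_sentinel(iter(range(3)))
out = []
while True:
    item = buf.get()
    if item is _SENTINEL:
        break  # clean end of epoch: stop consuming, no blocked get()
    out.append(item)
print(out)  # [0, 1, 2]
```

In TF terms this corresponds to wrapping the put/get session calls in a `try/except tf.errors.OutOfRangeError` and ending the epoch loop there, rather than issuing another `get()` against a drained StagingArea.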
I am looking into your suggestions. I also found out that the TF commit I was using has a few faults. But I need to investigate more to make sure that I am on the right track. Please do not close this issue till then.