Re-emerged Issue #31509 - BaseCollectiveExecutor::StartAbort Out of range: #32817
Comments
@oracle3001 |
I am having the exact same problem using this mock model. I am using the TF 2.0.0 release.

```python
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

if __name__ == '__main__':
    x = tf.random.normal((14000, 30, 1))
    y = tf.ones_like(x)
    discriminator = tf.keras.models.Sequential([
        tf.keras.layers.LSTM(100, input_shape=(30, 1), return_sequences=True),
        tf.keras.layers.LSTM(100, recurrent_dropout=0.4,
                             dropout=0.4, return_sequences=True)
    ])
    discriminator.compile(loss='binary_crossentropy',
                          optimizer=tf.keras.optimizers.Adam(lr=0.001))
    dataset = tf.data.Dataset.from_tensor_slices((x, y))
    dataset = dataset.batch(64)
    discriminator.fit(dataset, epochs=2)
```
|
I am also having this message feeding a dataset into a 1D convnet. Happens on my Mac with TF version 2.0.0-rc2. Not reproducible on Colab.

```python
import numpy as np
import tensorflow as tf

def create_timeseries_element():
    # returns a random time series of 100 intervals, each with 3 features,
    # and a random one-hot array of 5 entries
    data = np.random.rand(100, 3)
    label = np.eye(5, dtype='int')[np.random.choice(5)]
    return data, label

def data_generator():
    d, l = create_timeseries_element()
    yield (d, l)

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv1D(128, 9, activation='relu', input_shape=(100, 3)),
    tf.keras.layers.Conv1D(128, 9, activation='relu'),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(256, 5, activation='relu'),
    tf.keras.layers.Conv1D(256, 5, activation='relu'),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(5, activation='softmax')])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
ds = tf.data.Dataset.from_generator(data_generator, output_types=(tf.float32, tf.int32),
                                    output_shapes=(tf.TensorShape([100, 3]), tf.TensorShape([5])))
model.fit(ds.batch(32))
```
|
I am having an issue similar to this and tried to run in Colab, just to get an unending runtime. I asked my question in full on SO here. I had some numpy arrays that were trained in Keras in the previous version of TF, and now I have to rewrite my model. I got much worse accuracy, so I am thinking I need to switch to tf.data.Dataset. So I did, along the lines of the sketch below.
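A minimal sketch of that kind of conversion, assuming in-memory numpy arrays; the names and shapes here are illustrative, not the commenter's actual code:

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-ins for arrays previously passed to model.fit(x, y) directly.
x_train = np.random.rand(1000, 32).astype('float32')
y_train = np.random.randint(0, 2, size=(1000, 1)).astype('float32')

# Wrap the arrays in a tf.data.Dataset and batch it before fitting.
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(64)
```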
Once the training starts, I get the same StartAbort Out of range warning shown above.
|
I am having the same issue as above on the TF 2.0 release. Is this a bug in TensorFlow, or is there an issue with the code? |
It seems everybody who is having this issue is using Windows. I presume that must have something to do with it? |
I am having the issue on a Mac with the latest version of macOS. |
I am having the same problem after porting my code from 1.14 to 2.0. I am running on Ubuntu 18.04 (so it's not only a Windows problem). It occurs for me during both training and predict (so it's not linked to the optimizer). I do NOT get the problem if I hide the GPU; I do get the problem if I expose the GPU. Edit: everything seems to run properly, I just get the warnings. Edit: in another case, I get the problem whether I use the GPU or not. |
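For reference, a common way to "hide" the GPU as described, assuming the environment variable is set before TensorFlow initializes:

```python
import os

# An empty/invalid device list makes TensorFlow fall back to CPU only.
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

import tensorflow as tf
print(tf.config.experimental.list_physical_devices('GPU'))  # [] when the GPU is hidden
```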
I think I may have found why it is complaining; however, I have no idea how to fix it. While training, we all get the IteratorGetNext error: sequence out of range. Say I have a dataset of size 60,000 with a batch size of 64; that would require floor(60000/64) = 937 steps to iterate through the entire dataset for one epoch. However, when training with .fit(verbose=1), I noticed that it attempts to iterate through the dataset 938 times (most likely a rounding error, because 60000/64 = 937.5), and thus I get this error. Can someone please confirm this is the case for you as well? Thanks.

Edit: I found a way around this. When building the tf.data.Dataset, make sure to add the .repeat() method (otherwise the program will complain that you ran out of data), and when using .fit(), pass steps_per_epoch explicitly. Here is a full example that got it working.

This will cause the error:

```python
data = tf.random.normal((60000, 30, 4))
ground_truth = tf.ones((60000, 1))
dataset = tf.data.Dataset.from_tensor_slices((data, ground_truth)).batch(64)
# predefined model here: input: [?, 30, 4], output: [?, 1]
model.fit(dataset, epochs=5)
```

```
938/Unknown - 16s 17ms/step - loss: 0.0217
2019-10-07 14:49:49.928619: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
	 [[{{node IteratorGetNext}}]]
	 [[Shape/_2]]
2019-10-07 14:49:49.928619: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
	 [[{{node IteratorGetNext}}]]
938/938 [==============================] - 16s 17ms/step - loss: 0.0217
Epoch 2/5
935/938 [============================>.] - ETA: 0s - loss: 2.2229e-06
2019-10-07 14:49:59.722216: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
	 [[{{node IteratorGetNext}}]]
2019-10-07 14:49:59.722218: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
	 [[{{node IteratorGetNext}}]]
	 [[Shape/_2]]
```

This is the workaround:

```python
batch_size = 64
data = tf.random.normal((60000, 30, 4))
ground_truth = tf.ones((60000, 1))
dataset = tf.data.Dataset.from_tensor_slices((data, ground_truth)).batch(batch_size).repeat()
# predefined model here: input: [?, 30, 4], output: [?, 1]
model.fit(dataset, epochs=5, steps_per_epoch=data.shape[0] // batch_size)
```

```
937/937 [==============================] - 15s 16ms/step - loss: 0.0135
Epoch 2/5
937/937 [==============================] - 10s 10ms/step - loss: 1.4460e-05
Epoch 3/5
937/937 [==============================] - 10s 11ms/step - loss: 4.3097e-06
Epoch 4/5
937/937 [==============================] - 10s 10ms/step - loss: 1.8212e-06
Epoch 5/5
```
|
Good idea. I think you can use the take statement on the dataset to limit yourself to the first 64 * 937 records, avoiding any need to round at the end.
|
I'm experiencing a similar problem to @duysqubix with my code, in that I have a number of samples that doesn't divide neatly by the batch size. @duysqubix's code works for me, and the error disappears if I repeat the dataset and specify steps_per_epoch.

Trying @mtpgva's advice above and using a take() call:

```python
import tensorflow as tf

data = tf.random.normal((60000, 30, 4))
ground_truth = tf.ones((60000, 1))
dataset = tf.data.Dataset.from_tensor_slices((data, ground_truth)).take(512).batch(64)
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1, activation='softmax')
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# predefined model here: input: [?, 30, 4], output: [?, 1]
model.fit(dataset, epochs=5)
```

I also tried using the drop_remainder argument when batching:

```python
import tensorflow as tf

data = tf.random.normal((60000, 30, 4))
ground_truth = tf.ones((60000, 1))
dataset = tf.data.Dataset.from_tensor_slices((data, ground_truth)).batch(64, drop_remainder=True)
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1, activation='softmax')
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# predefined model here: input: [?, 30, 4], output: [?, 1]
model.fit(dataset, epochs=5)
```
|
I think we may need to summon @fchollet |
@duysqubix Your suggestion fixed my issue! |
@oracle3001 Can you please let us know if the issue still persists? Please close the issue if it was resolved already. Thanks! |
Any update? |
This fixes my issue. However, surely the final batch being smaller should not create an issue? And surely repeating part of the data should not be the solution? |
Hey @ravikyram, the solution that @duysqubix posted is a workaround; the underlying problem still exists. The problem is that the number of iterations performed on the dataset is greater than the number of batches in the dataset. I'm not actually sure that this is a bug, considering that Python iteration relies on an end-of-sequence signal to know when to stop. The workaround "fixes" this by giving an explicitly calculated number of iterations to the fit() call via steps_per_epoch, so the iterator is never asked to run past the end. So either the behavior is correct and the warning should be hidden, or the internally calculated number of iterations is faulty and should be changed. |
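A small sketch of the end-of-sequence behavior described here, assuming TF 2.x eager execution:

```python
import tensorflow as tf

ds = tf.data.Dataset.range(3)
it = iter(ds)
for _ in range(3):
    print(next(it).numpy())  # 0, 1, 2

# The fourth request has nothing left to return: the iterator signals
# end-of-sequence, which surfaces as StopIteration in Python (and as the
# "Out of range: End of sequence" log line on the C++ runtime side).
try:
    next(it)
except StopIteration:
    print("end of sequence")
```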
Yes, it still persists...see all the other posts with the same issue. |
Hi, I agree with @BeWe11, the issue is still there. |
@sharkdtu I changed my code to use tf.data.experimental.cardinality; issue here. @npuichigo Why? (Of course I must shuffle the data before fit, but it's only an example.) |
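A sketch of how cardinality can be used to set steps_per_epoch explicitly, assuming a TF version where tf.data.experimental.cardinality is available; the shapes and batch size are illustrative:

```python
import tensorflow as tf

batch_size = 64
data = tf.random.normal((60000, 30, 4))
labels = tf.ones((60000, 1))
dataset = tf.data.Dataset.from_tensor_slices((data, labels)).batch(batch_size)

# cardinality() returns the number of batches when it is statically known.
steps = tf.data.experimental.cardinality(dataset).numpy()
print(steps)  # 938 == ceil(60000 / 64); the final batch is partial

# model.fit(dataset.repeat(), epochs=5, steps_per_epoch=steps)
```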
|
Is this reply related to the original question? |
I just described my understanding of the two parameters epochs and steps_per_epoch of the Keras model, why they cause errors, and my way of handling it. I'm sorry to have confused you.
Splitting the full dataset into multiple batches and training the model on them iteratively is standard practice. First, two concepts:

- batch_size: the size of each batch.
- batches: the number of batches after the full dataset is split by batch_size.

Part of the signature of Keras's fit function is: model.fit(self, x=None, y=None, epochs=1, steps_per_epoch=None)

- epochs: the number of iterations; one iteration can be roughly understood as using one batch of training data to train the model.
- steps_per_epoch: the number of batches consumed by the model per iteration; it can be roughly seen as combining a fixed number of batches into one bigger batch (bigger_batch), then training the model with that bigger_batch. The end of that training completes one epoch.
- My view: this is a problem of insufficient training data during model training, which can be seen as a producer-consumer relationship. TensorFlow 2.0's Dataset integrates generator functionality and can be used directly as a generator that yields training data. The Dataset continuously provides data, and the model's training process continuously consumes it; once the Dataset has no data left to provide while training is not yet finished, an error is reported. Therefore, you need to ensure that epochs * steps_per_epoch <= the number of batches the Dataset can provide. You can determine batch_size and steps_per_epoch from experience, and then use repeat() on the full dataset to avoid running out of data during training. If you don't think the batches need further handling, you can set steps_per_epoch = 1.

Verification:

```python
train_data = tf.random.normal((5, 4))  # 5 four-dimensional feature vectors
label = tf.ones((5, 1))                # 5 class labels
dataset = tf.data.Dataset.from_tensor_slices((train_data, label))
dataset
# <TensorSliceDataset shapes: ((4,), (1,)), types: (tf.float32, tf.float32)>
```

The full dataset is split by batch_size; if the last batch has fewer than batch_size elements, drop_remainder determines whether it is discarded. In the example above, the Dataset built from train_data and label contains 5 tensors used to train the model (call them train tensors), and each train tensor contains 2 tensors: one 4-dimensional feature vector and one label.

```python
dataset = dataset.batch(batch_size, drop_remainder=True).repeat(2)
dataset
# <RepeatDataset shapes: ((2, 4), (2, 1)), types: (tf.float32, tf.float32)>
```

After calling batch() with batch_size = 2 and drop_remainder = True, batches == 2 (each batch contains 2 train tensors; the last batch, of size 1, is discarded); after repeat(2), batches == 4.

```python
model.fit(dataset, epochs=4, steps_per_epoch=1)
# fit's x and y parameters represent features and labels; a Dataset-typed variable can be passed directly
```

The dataset has 4 batches with batch_size == 2; each epoch uses 1 batch of data to train the model (bigger_batch_size == batch_size x 1 == 2), so it can iterate 4 times.

```python
model.fit(dataset, epochs=1, steps_per_epoch=4)
```

The dataset has 4 batches with batch_size == 2; each epoch uses 4 batches of data to train the model (bigger_batch_size == batch_size x 4 == 8), so it can iterate once.

The complete verification code is as follows:

```python
import tensorflow as tf

tf.__version__

def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, activation=tf.nn.relu, input_shape=(4,)),
        tf.keras.layers.Dense(10, activation=tf.nn.relu),
        tf.keras.layers.Dense(3, activation='softmax')])
    model.compile(
        optimizer='Adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

def check_data_batch_size(dataset):
    # iterator = iter(dataset)
    iterator = dataset.__iter__()
    i = 0
    try:
        while i < 100:
            # data = next(iterator)
            data = iterator.get_next()
            i += 1
            print('id:', i)
            print('data:', data)
    except Exception as e:
        print(repr(e))
    return i

batch_size = 2
data = tf.random.normal((5, 4))
label = tf.ones((5, 1))
dataset = tf.data.Dataset.from_tensor_slices((data, label))
dataset = dataset.batch(2, drop_remainder=True).repeat(2)
batches = check_data_batch_size(dataset)
print('batches:', batches)
model = build_model()
model.fit(dataset, epochs=2, steps_per_epoch=2)
```
|
@duysqubix Brilliant, your suggestion fixed my issue. And for other people facing this issue, just for the record:

```python
# loss, acc = net.evaluate(tst_set)  # do not use this when using a repeating dataset
loss, acc = net.evaluate(tst_set, steps=3)  # e.g., 3
```
|
I got the same problem with TensorFlow 2 (GPU) on CentOS. Does anyone know how to fix this? |
This issue should be fixed by the pull request for issue #35314. The warning was actually propagated up from C++, and Python was passing it forward. But there is really no problem here, no issues with training or anything, according to that issue. The solution was that Google lowered the logging level so these warnings are ignored. The change is in the TF nightly build and will be widely available in the next release, but you can use TF nightly to get the benefit now. So this issue can probably be closed. |
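For anyone who cannot move to the nightly build yet, a possible interim sketch for silencing the C++-level warning; note this suppresses all TF C++ warnings, not just this one:

```python
import os

# Must be set before TensorFlow is imported:
# 0 = all logs, 1 = filter INFO, 2 = filter INFO and WARNING, 3 = filter everything but FATAL.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import tensorflow as tf
tf.get_logger().setLevel('ERROR')  # also quiet the Python-side logger
```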
TensorFlow 2.1 (stable) has been released; does anyone know if this warning is fixed in the new version? |
I have the same problem when practicing the code from the official tutorial.
|
@Xiaohui-Z I can also confirm that the issue is not solved. Using the example code from the TF docs still produces the issue. I noted that this only happens during the first epoch, where the total step count seems to be unknown. This is odd too, because the count is evidently known for the subsequent epochs. |
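A small illustration of why the first epoch shows an unknown total: for generator-backed datasets the number of elements cannot be determined statically. A minimal sketch, assuming TF 2.1+ where the cardinality API is available:

```python
import tensorflow as tf

def gen():
    for i in range(5):
        yield i

ds = tf.data.Dataset.from_generator(gen, output_types=tf.int32)
card = tf.data.experimental.cardinality(ds)
print(card == tf.data.experimental.UNKNOWN_CARDINALITY)  # True: Keras can't know the step count up front
```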
Can also confirm that the error (warning) is still being raised on 2.1 (my Docker base image is cuda:10.1-cudnn7-devel-ubuntu18.04).

```python
# Triggers the warning.
dataset = raw_dataset.map(_parse_proto).take(32).batch(8)
model.evaluate(dataset)

# No warning.
dataset = raw_dataset.map(_parse_proto).take(32).batch(8)
model.evaluate(dataset, steps=4)
```
|
I also have this warning in TF 2.1.0.
|
I have a similar error but can't seem to find anywhere else where anyone else is experiencing it. Here is the traceback of my error:

```
Train on 2737611 samples, validate on 2737612 samples
Function call stack:
```
|
Can anyone confirm what the result of this behavior is? I'm confused whether it's a logging error, or whether the final batch does not get trained/evaluated. For example, imagine I had 100 samples with a batch size of 52. Would I be training on batches of 52 and 48 (the expected behavior), or would I train on 52 and then just fail to fill the next batch and move on to the next epoch? This is especially scary for validation, and I would be terrified to find that I have a variable validation set (especially if you shuffle!). There is a lot of discussion in many places, but no clear indication of the significance of this error. Some would have you believe it is just a warning. I am on tensorflow==2.1.0. See the sketch below for one way to check. |
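One way to check directly whether the final partial batch is kept, independent of any model; the 100-sample / batch-size-52 numbers mirror the example above:

```python
import tensorflow as tf

# 100 samples batched by 52: does tf.data keep the final partial batch?
ds = tf.data.Dataset.range(100).batch(52)
for batch in ds:
    print(batch.shape)
# Prints (52,) then (48,): the remainder is kept by default,
# since batch() only drops it when drop_remainder=True.
```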
Hi, I am a freshman in DL from China, and I met the same error as you. Searching the internet, I found the answer: you need to add repeat() (but remember not to pass it an argument), and then you need to add steps_per_epoch in fit(), with the value x_train // batch_size. It worked in my project; I hope it can help you solve your problem. My English is poor, don't mind! |
I tried to run on Colab with TF v2.5 and faced a different error; please find the gist here. Thanks! |
@oracle3001, I've tried reproducing the issue in |
This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you. |
Closing as stale. Please reopen if you'd like to work on this further. |
The previous issue described in #31509 was fixed, but I am now experiencing exactly the same issue, with all the same setup, using the latest nightly build of TF 2.0 when using tf.keras.optimizers.Adam.