
Re-emerged Issue #31509 - BaseCollectiveExecutor::StartAbort Out of range: #32817

Closed
adam-hartshorne opened this issue Sep 25, 2019 · 45 comments
Labels: comp:data (tf.data related issues), comp:keras (Keras related issues), stale (to be closed automatically if no activity), stat:awaiting response (awaiting response from author), TF 2.0 (issues relating to TensorFlow 2.0), type:bug (Bug)

Comments

@adam-hartshorne

adam-hartshorne commented Sep 25, 2019

The previous issue described in #31509 was fixed, but I am now experiencing exactly the same issue, with all the same setup, using the latest nightly build of TF 2.0 when using tf.keras.optimizers.Adam.

@adam-hartshorne adam-hartshorne changed the title Same Issue As Issue #31509 With Adamax - BaseCollectiveExecutor::StartAbort Out of range: Re-emerged Issue #31509 - BaseCollectiveExecutor::StartAbort Out of range: Sep 25, 2019
@ravikyram ravikyram self-assigned this Sep 26, 2019
@ravikyram ravikyram added comp:keras Keras related issues TF 2.0 Issues relating to TensorFlow 2.0 labels Sep 26, 2019
@ravikyram
Contributor

@oracle3001
I am not seeing any issue with tf.keras.optimizers.Adam in the latest TF 2.0.0-rc2 version. Please find the gist here. Thanks!

@ravikyram ravikyram added the stat:awaiting response Status - Awaiting response from author label Sep 26, 2019
@duysPES

duysPES commented Oct 3, 2019

I am having the exact same problem using this mock model. I am using the TF 2.0.0 release, on Windows.

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf



if __name__ == '__main__':

    x = tf.random.normal((14000, 30, 1))
    y = tf.ones_like(x)

    discriminator = tf.keras.models.Sequential([
        tf.keras.layers.LSTM(100, input_shape=(30, 1), return_sequences=True),
        tf.keras.layers.LSTM(100, recurrent_dropout=0.4,
                             dropout=0.4, return_sequences=True)
    ])

    discriminator.compile(loss='binary_crossentropy',
                          optimizer=tf.keras.optimizers.Adam(lr=0.001))

    dataset = tf.data.Dataset.from_tensor_slices((x, y))
    dataset = dataset.batch(64)

    discriminator.fit(dataset, epochs=2)

@ravikyram
Contributor

@duysPES
I am able to execute the code successfully in Colab using TF 2.0.0-rc2. Please find the gist here. Thanks!

@juliangall

I am also getting this message when feeding a dataset into a 1D convnet. It happens on my Mac with TF version 2.0.0-rc2. Not reproducible on Colab.

import numpy as np
import tensorflow as tf
def create_timeseries_element():
    # returns a random time series of 100 intervals, each with 3 features,
    # and a random one-hot array of 5 entries
    data = np.random.rand(100,3)
    label = np.eye(5, dtype='int')[np.random.choice(5)]
    return data, label

def data_generator():
    d, l = create_timeseries_element()
    yield (d, l)

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv1D(128, 9, activation='relu', input_shape=(100, 3)),
    tf.keras.layers.Conv1D(128, 9, activation='relu'),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(256, 5, activation='relu'),
    tf.keras.layers.Conv1D(256, 5, activation='relu'),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(5, activation='softmax')])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

ds = tf.data.Dataset.from_generator(data_generator, output_types=(tf.float32, tf.int32),
                                      output_shapes=(tf.TensorShape([100, 3]), tf.TensorShape([5])))
model.fit(ds.batch(32))

@raceee

raceee commented Oct 5, 2019

I am having an issue similar to this, and tried to run it in Colab only to get a never-ending runtime. I asked my question in full on SO here.

I had some numpy arrays that I trained with Keras in the previous version of TF and now have to rewrite my model. I got much worse accuracy, so I am thinking I need to switch to tf.data.Dataset.

So I did:
train_dataset = tf.data.Dataset.from_tensor_slices((X_train_deleted_nans, y_train_no_nans))
train_dataset = train_dataset.shuffle(SHUFFLE_CONST).batch(BATCH_SIZE)

model.summary() gave me:

BatchDataset shapes: ((None, 2756), (None,)), types: (tf.float64, tf.int64)
Model: sequential

Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 1379)              3801903   
_________________________________________________________________
dropout (Dropout)            (None, 1379)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 1379)              1903020   
_________________________________________________________________
dropout_1 (Dropout)          (None, 1379)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 1379)              1903020   
_________________________________________________________________
dropout_2 (Dropout)          (None, 1379)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 1379)              1903020   
_________________________________________________________________
dropout_3 (Dropout)          (None, 1379)              0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 1380      
=================================================================
Total params: 9,512,343
Trainable params: 9,512,343
Non-trainable params: 0

model.compile(optimizer=adam, loss=bce, metrics=['accuracy'])
model.fit(train_dataset, epochs=1000, verbose=0)

Once the training starts I get this warning:

2019-10-04 23:47:56.691434: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
     [[{{node IteratorGetNext}}]]

@AhmUgEk

AhmUgEk commented Oct 5, 2019

I am having the same issue as above on the TF 2.0 release. Is this a bug in TensorFlow, or is there an issue with the code?

@adam-hartshorne
Author

It seems everybody who is having this issue is using Windows. I presume that must have something to do with it?

@juliangall

I am having the issue on a Mac with the latest version of macOS.

@mtpgva

mtpgva commented Oct 6, 2019

I am having the same problem after porting my code from 1.14 to 2.0.

I am running on Ubuntu 18.04 (so not only a Windows problem). It occurs for me during both training and predict, so it is not linked to the optimiser. I do NOT get the problem if I hide the GPU. I do get the problem if I expose the GPU.

Edit - Note: everything seems to run properly, I just get the warnings.

Edit - In another case, I get the problem whether I use the GPU or not.

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Oct 7, 2019
@duysqubix

duysqubix commented Oct 7, 2019

I think I may have found why it is complaining. However, I have no idea how to fix it. While training, we all get the IteratorGetNext error: sequence out of range.

I noticed that, say I have a dataset of 60,000 samples with a batch size of 64; that would require floor(60000/64) = 937 steps to iterate through the entire dataset for one epoch. However, when training with .fit(verbose=1) I noticed that it attempts to iterate through the dataset 938 times (most likely a rounding issue, because 60000/64 = 937.5), and thus I get this error. Can someone please confirm this is the case for you as well? Thanks
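For reference, a minimal sketch of the batch arithmetic described above (illustrative only; the 60,000-sample / batch-size-64 numbers are taken from this comment). It suggests the 938 shown on the progress bar is the expected total including a partial final batch, rather than a rounding error:

import math

num_samples = 60000
batch_size = 64

full_batches = num_samples // batch_size             # 937 full batches of 64 samples
total_batches = math.ceil(num_samples / batch_size)  # 938 batches in total; the last one holds only 32 samples

print(full_batches, total_batches)  # 937 938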

Edit:

So I found a way around this: when building the tf.data.Dataset, make sure to add the .repeat() method (otherwise the program will complain that you ran out of data), and when using .fit() add steps_per_epoch as shown below.

Here is a full example that got it working.

This will cause the error:

data = tf.random.normal((60000,30,4))
ground_truth = tf.ones((60000,1))
dataset = tf.data.Dataset.from_tensor_slices((data, ground_truth)).batch(64)

#predefined model here: input: [?, 30,4] output: [?,1]
model.fit(dataset, epochs=5)

'''
    938/Unknown - 16s 17ms/step - loss: 0.02172019-10-07 14:49:49.928619: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[Shape/_2]]
2019-10-07 14:49:49.928619: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
938/938 [==============================] - 16s 17ms/step - loss: 0.0217
Epoch 2/5
935/938 [============================>.] - ETA: 0s - loss: 2.2229e-062019-10-07 14:49:59.722216: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
2019-10-07 14:49:59.722218: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[Shape/_2]]
'''

This is the workaround.

batch_size = 64
data = tf.random.normal((60000,30,4))
ground_truth = tf.ones((60000,1))
dataset = tf.data.Dataset.from_tensor_slices((data, ground_truth)).batch(batch_size).repeat()

#predefined model here: input: [?, 30,4] output: [?,1]
model.fit(dataset, epochs=5, steps_per_epoch=data.shape[0]//batch_size)

'''
937/937 [==============================] - 15s 16ms/step - loss: 0.0135
Epoch 2/5
937/937 [==============================] - 10s 10ms/step - loss: 1.4460e-05
Epoch 3/5
937/937 [==============================] - 10s 11ms/step - loss: 4.3097e-06
Epoch 4/5
937/937 [==============================] - 10s 10ms/step - loss: 1.8212e-06
Epoch 5/5
'''


@mtpgva

mtpgva commented Oct 8, 2019 via email

@samueljackson92

I'm experiencing a similar problem to @duysqubix with my code, in that I have a number of samples that doesn't divide evenly by the batch size. @duysqubix's code works for me and the error disappears if I repeat the dataset and specify steps_per_epoch.

  • I'm seeing this on Ubuntu 18.04, so definitely not a Windows-only problem.
  • I see this issue with both the TensorFlow 2 release and the TensorFlow 2 RC2.

Trying @mtpgva's advice above and using .take() to select a number of samples that is divisible by the batch size, I find that I still get the same message, even with the simplified example provided by @duysqubix:

import tensorflow as tf
data = tf.random.normal((60000,30,4))
ground_truth = tf.ones((60000,1))
dataset = tf.data.Dataset.from_tensor_slices((data, ground_truth)).take(512).batch(64)

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1, activation='softmax')
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

#predefined model here: input: [?, 30,4] output: [?,1]
model.fit(dataset, epochs=5)
Epoch 1/5
2019-10-08 09:01:00.212603: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
      8/Unknown - 1s 84ms/step - loss: 102.0359 - accuracy: 1.00002019-10-08 09:01:00.443158: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[IteratorGetNext/_16]]
2019-10-08 09:01:00.443241: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
8/8 [==============================] - 1s 85ms/step - loss: 102.0359 - accuracy: 1.0000
Epoch 2/5
1/8 [==>...........................] - ETA: 0s - loss: 102.0359 - accuracy: 1.00002019-10-08 09:01:00.502043: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[Shape/_4]]
2019-10-08 09:01:00.502100: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
8/8 [==============================] - 0s 7ms/step - loss: 102.0359 - accuracy: 1.0000
Epoch 3/5
1/8 [==>...........................] - ETA: 0s - loss: 102.0359 - accuracy: 1.00002019-10-08 09:01:00.544339: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[IteratorGetNext/_16]]
2019-10-08 09:01:00.544373: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
8/8 [==============================] - 0s 5ms/step - loss: 102.0359 - accuracy: 1.0000
Epoch 4/5
1/8 [==>...........................] - ETA: 0s - loss: 102.0359 - accuracy: 1.00002019-10-08 09:01:00.587002: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[IteratorGetNext/_16]]
2019-10-08 09:01:00.587044: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
8/8 [==============================] - 0s 5ms/step - loss: 102.0359 - accuracy: 1.0000
Epoch 5/5
1/8 [==>...........................] - ETA: 0s - loss: 102.0359 - accuracy: 1.00002019-10-08 09:01:00.631688: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[Shape/_4]]
2019-10-08 09:01:00.631740: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
8/8 [==============================] - 0s 6ms/step - loss: 102.0359 - accuracy: 1.0000

I also tried using the drop_remainder=True argument on .batch but still get the error message:

import tensorflow as tf
data = tf.random.normal((60000,30,4))
ground_truth = tf.ones((60000,1))
dataset = tf.data.Dataset.from_tensor_slices((data, ground_truth)).batch(64, drop_remainder=True)

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1, activation='softmax')
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

#predefined model here: input: [?, 30,4] output: [?,1]
model.fit(dataset, epochs=5)
Epoch 1/5
2019-10-08 09:03:47.431058: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    937/Unknown - 3s 3ms/step - loss: 102.0359 - accuracy: 1.00002019-10-08 09:03:50.275433: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[IteratorGetNext/_2]]
2019-10-08 09:03:50.275587: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
937/937 [==============================] - 3s 3ms/step - loss: 102.0359 - accuracy: 1.0000
Epoch 2/5
919/937 [============================>.] - ETA: 0s - loss: 102.0359 - accuracy: 1.00002019-10-08 09:03:52.891814: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[IteratorGetNext/_2]]
2019-10-08 09:03:52.891940: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
937/937 [==============================] - 3s 3ms/step - loss: 102.0359 - accuracy: 1.0000
Epoch 3/5
931/937 [============================>.] - ETA: 0s - loss: 102.0359 - accuracy: 1.00002019-10-08 09:03:55.506978: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[IteratorGetNext/_2]]
2019-10-08 09:03:55.507100: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
937/937 [==============================] - 3s 3ms/step - loss: 102.0359 - accuracy: 1.0000
Epoch 4/5
918/937 [============================>.] - ETA: 0s - loss: 102.0359 - accuracy: 1.00002019-10-08 09:03:58.045499: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[IteratorGetNext/_2]]
2019-10-08 09:03:58.045610: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
937/937 [==============================] - 3s 3ms/step - loss: 102.0359 - accuracy: 1.0000
Epoch 5/5
932/937 [============================>.] - ETA: 0s - loss: 102.0359 - accuracy: 1.00002019-10-08 09:04:00.654601: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[IteratorGetNext/_2]]
2019-10-08 09:04:00.654715: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
937/937 [==============================] - 3s 3ms/step - loss: 102.0359 - accuracy: 1.0000

@duysqubix

I think we may need to summon @fchollet

@raceee

raceee commented Oct 10, 2019

@duysqubix Your suggestion fixed my issue!

@ravikyram
Contributor

@oracle3001

Can you please let us know if the issue still persists? Please close the issue if it was resolved already. Thanks!

@npuichigo

Any update?

@AhmUgEk

AhmUgEk commented Oct 14, 2019

@oracle3001

Can you please let us know if the issue still persists? Please close the issue if it was resolved already. Thanks!

This fixes my issue; however, surely a reduced final batch size should not create an issue?

Repeating part of the first batch of data should not be the solution, surely?

@BeWe11

BeWe11 commented Oct 14, 2019

Hey @ravikyram,

the solution that @duysqubix posted is a workaround; the underlying problem still exists.

The problem is that the number of iterations performed on the dataset is greater than the number of batches in the dataset. I'm not actually sure that this is a bug, considering that Python iteration uses the StopIteration exception to mark the end of iterables as well. But if that's the case, the warning should not be displayed.

The workaround "fixes" this by giving an explicitly calculated number of iterations to the model.fit method. This should not be necessary and might not even be possible in all cases. For example, when using bucketing, the exact number of batches cannot be easily extracted from the dataset (except by performing a full dataset iteration before training, which would be a workaround as well).

So either the behavior is correct and the warning should be hidden, or the internally calculated number of iterations is faulty and should be changed.
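A minimal sketch of the "count the batches with a full dataset iteration" workaround mentioned above, for pipelines (such as bucketed ones) whose batch count is not known up front. The dataset here is a hypothetical stand-in, not taken from any comment in this thread:

import tensorflow as tf

# Hypothetical stand-in for a pipeline whose batch count is unknown in advance.
data = tf.random.normal((1000, 8))
labels = tf.ones((1000, 1))
dataset = tf.data.Dataset.from_tensor_slices((data, labels)).batch(64)

# One full pass over the dataset just to count its batches.
steps_per_epoch = sum(1 for _ in dataset)  # 16 here: 15 full batches plus one partial batch

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Repeat the dataset and tell fit() exactly how many steps make up one epoch.
model.fit(dataset.repeat(), epochs=2, steps_per_epoch=steps_per_epoch)

The extra pass is wasted work for large datasets, which is exactly the objection raised above; it only avoids hard-coding the step count.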

@adam-hartshorne
Author

@oracle3001

Can you please let us know if the issue still persists? Please close the issue if it was resolved already. Thanks!

Yes, it still persists... see all the other posts with the same issue.

@AndreaRigoni

AndreaRigoni commented Oct 16, 2019

Hi, I agree with @BeWe11, the issue is still there.
Moreover, if you are using a tf.data.Dataset.from_generator() pipeline, the number of actual steps has to be computed during the first epoch.

@ravikyram ravikyram added comp:data tf.data related issues type:bug Bug labels Oct 16, 2019
@ravikyram ravikyram assigned gowthamkpr and unassigned ravikyram Oct 16, 2019
@LaurentBerger

@sharkdtu I changed the code to use tf.data.experimental.cardinality; issue here

@npuichigo Why? (Of course I must shuffle the data before fit, but it's only an example.)
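For reference, a minimal sketch of the tf.data.experimental.cardinality approach mentioned above (assumption: the pipeline's size is statically known; for generator- or filter-based pipelines it returns UNKNOWN_CARDINALITY and the step count still has to be measured or supplied by hand):

import tensorflow as tf

data = tf.random.normal((60000, 30, 4))
ground_truth = tf.ones((60000, 1))
dataset = tf.data.Dataset.from_tensor_slices((data, ground_truth)).batch(64)

n_batches = tf.data.experimental.cardinality(dataset)
print(int(n_batches))  # 938: 937 full batches plus one partial batch of 32

if n_batches != tf.data.experimental.UNKNOWN_CARDINALITY:
    # Safe to pass straight to fit() together with a repeated dataset.
    steps_per_epoch = int(n_batches)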

@600DZY

600DZY commented Dec 11, 2019

Some of the parameters of Keras's fit function are as follows:
model.fit(self, x=None, y=None, epochs=1, steps_per_epoch=None)
  • [epochs: the number of iterations; one iteration can be roughly understood as training the model with one batch of training data.]

  • [steps_per_epoch: the number of batches consumed by the model per iteration. It can be roughly seen as combining a fixed number of batches into one bigger batch, then training the model with that bigger batch; the end of that training completes one epoch.]

  • My view: this is a problem of insufficient training data during model training, which can be seen as a producer/consumer relationship. A TensorFlow 2.0 Dataset integrates the functionality of a generator and can directly serve as a generator that yields training data. The Dataset keeps providing data and the training process keeps consuming it; once the Dataset has no more data to provide while training is not yet finished, an error is reported. Therefore you need to ensure that epochs * steps_per_epoch <= the number of batches the Dataset can provide. You can determine batch_size and steps_per_epoch from experience, and then call repeat() on the full dataset to avoid running out of data during training. If you don't think the batches need further handling, you can set steps_per_epoch=1.

@zhulingchen

Splitting the full dataset into multiple batches and training the model batch by batch is standard practice. First, two concepts:

* [batch_size: the number of samples in each batch]

* [batches: the number of batches obtained after splitting the whole dataset by batch_size]
Some of the parameters of Keras's fit function are as follows:
model.fit(self, x=None, y=None, epochs=1, steps_per_epoch=None)
* [epochs: the number of iterations; one iteration can be roughly understood as training the model with one batch of training data.]

* [steps_per_epoch: the number of batches consumed by the model per iteration. It can be roughly seen as combining a fixed number of batches into one bigger batch, then training the model with that bigger batch; the end of that training completes one epoch.]

* My view: this is a problem of insufficient training data during model training, which can be seen as a producer/consumer relationship. A TensorFlow 2.0 Dataset integrates the functionality of a generator and can directly serve as a generator that yields training data. The Dataset keeps providing data and the training process keeps consuming it; once the Dataset has no more data to provide while training is not yet finished, an error is reported. Therefore you need to ensure that epochs * steps_per_epoch <= the number of batches the Dataset can provide. You can determine batch_size and steps_per_epoch from experience, and then call repeat() on the full dataset to avoid running out of data during training. If you don't think the batches need further handling, you can set steps_per_epoch=1.

Verification:

train_data = tf.random.normal((5,4))  # 5 four-dimensional feature vectors
label = tf.ones((5,1))  # 5 class labels
dataset = tf.data.Dataset.from_tensor_slices((train_data, label))
dataset
<TensorSliceDataset shapes: ((4,), (1,)), types: (tf.float32, tf.float32)>

The full dataset is split according to batch_size; if the last batch contains fewer than batch_size elements, drop_remainder decides whether it is discarded. In the example above, the Dataset built from train_data and label contains 5 tensors used to train the model (call them train tensors), each of which in turn contains 2 tensors: one 4-dimensional feature vector and one label.

dataset = dataset.batch(batch_size, drop_remainder=True).repeat(2)

dataset
<RepeatDataset shapes: ((2, 4), (2, 1)), types: (tf.float32, tf.float32)>

Calling batch() with batch_size=2 and drop_remainder=True gives batches == 2; each batch contains 2 train tensors, and the last batch (of size 1) is dropped. After repeat(2), batches == 4.

model.fit(dataset, epochs=4, steps_per_epoch=1)
# The x and y parameters of fit represent the feature vectors and classes; a Dataset variable can be passed directly.

The dataset has 4 batches with batch_size == 2; training the model with 1 batch per step (bigger_batch_size == batch_size x 1 == 2), it can iterate 4 times.

model.fit(dataset, epochs=1, steps_per_epoch=4)

The dataset has 4 batches with batch_size == 2; training the model with 4 batches per step (bigger_batch_size == batch_size x 4 == 8), it can iterate 1 time.

The complete verification code is as follows:
import tensorflow as tf
tf.__version__

def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, activation=tf.nn.relu, input_shape=(4,)),  
        tf.keras.layers.Dense(10, activation=tf.nn.relu),
        tf.keras.layers.Dense(3, activation='softmax')])
    model.compile(
        optimizer='Adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

def check_data_batch_size(dataset):
    #iterator = iter(dataset)
    iterator = dataset.__iter__()
    i=0
    try:
        while i<100:
            #data = next(iterator)
            data = iterator.get_next()
            i += 1
            print('id:',i)
            print('data:',data)
    except Exception as e:
        print(repr(e))
    return i

batch_size =  2
data = tf.random.normal((5,4))
label = tf.ones((5,1))
dataset = tf.data.Dataset.from_tensor_slices((data, label))
dataset = dataset.batch(2, drop_remainder=True).repeat(2)
batches = check_data_batch_size(dataset)
print('batches:',batches)
model = build_model()
model.fit(dataset, epochs=2, steps_per_epoch=2)

Is this reply related to the original question?

@600DZY

600DZY commented Dec 12, 2019 via email

@eustomaqua

@duysqubix Brilliant, your suggestion fixed my issue.

And for other people facing this issue, just for the record:

# loss, acc = net.evaluate(tst_set)  # do not use this when using a Repeating dataset
loss, acc = net.evaluate(tst_set, steps=3)  # e.g., 3
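If you would rather not hard-code the steps value, a small sketch of deriving it instead (num_test_samples and batch_size are hypothetical names, not from the original comment):

import math

num_test_samples = 160   # hypothetical size of tst_set before batching
batch_size = 64

eval_steps = math.ceil(num_test_samples / batch_size)  # 3: enough steps to see every sample once
# loss, acc = net.evaluate(tst_set, steps=eval_steps)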

@Xiaohui-Z

I get the same problem with TensorFlow 2 (GPU) on CentOS. Does anyone know how to fix this problem?

@00krishna

This issue should be fixed by the pull request for issue #35314. The warning was actually propagated up from C++, and Python was just passing it forward. But there is really no problem here, no issues with training or anything, according to that issue.

The solution was that Google lowered the logging level to ignore these warnings. The change is in the TF 2.0 nightly build and will be widely available in the next release. But you can use TF nightly to get the benefit now.

So this issue can probably be closed.
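Until that release is available, the C++ warning can be silenced manually; a minimal sketch, assuming you only want to hide the log noise (the training itself is unaffected). The environment variable must be set before TensorFlow is imported:

import os

# 0 = all messages, 1 = filter INFO, 2 = filter INFO and WARNING, 3 = filter INFO, WARNING and ERROR.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import tensorflow as tf  # import after setting the variable so the C++ runtime picks it up

# Optionally quiet the Python-side logger as well.
tf.get_logger().setLevel('ERROR')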

@Xiaohui-Z

TensorFlow 2.1 (stable) has been released; does anyone know if this warning is fixed in the new version?

@OvidZheng

OvidZheng commented Jan 13, 2020

I have the same problem when practicing the code from the official tutorial.
I am using Catalina 10.15, Python 3.7.6, TF 2.1.0.

import tensorflow as tf
import tensorflow_hub as hub

embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])

model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.summary()

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=20,
                    validation_data=validation_data.batch(512),
                    verbose=1)

results = model.evaluate(test_data.batch(512), verbose=2)
for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))
29/30 [============================>.] - ETA: 0s - loss: 0.2012 - accuracy: 0.92892020-01-13 13:53:00.393082: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
	 [[{{node IteratorGetNext}}]]

@Flamefire
Contributor

@Xiaohui-Z I can also confirm that the issue is not solved. Using the example code from the TF docs still produces the issue:

import tensorflow_datasets as tfds
import tensorflow as tf
tfds.disable_progress_bar()


def make_datasets_unbatched():
    # Scaling MNIST data from (0, 255] to (0., 1.]
    def scale(image, label):
        image = tf.cast(image, tf.float32)
        image /= 255
        return image, label

    datasets, info = tfds.load(name='mnist',
                               with_info=True,
                               as_supervised=True)

    return datasets['train'].map(scale).cache().shuffle(10000)


def build_and_compile_cnn_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32,
                               3,
                               activation='relu',
                               input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(loss=tf.keras.losses.sparse_categorical_crossentropy,
                  optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
                  metrics=['accuracy'])
    return model


train_datasets = make_datasets_unbatched().batch(64)
model = build_and_compile_cnn_model()

model.fit(x=train_datasets, epochs=2)

I noted that this only happens during the first iteration, where the total count seems to be unknown. This is odd too, because numExamples in the statistics key of dataset_info.json is set correctly.
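A sketch of feeding that known example count to fit() so the step total is not "Unknown" in the first epoch, reusing the helpers from the snippet above (assumption: make_datasets_unbatched and build_and_compile_cnn_model are defined exactly as in that snippet):

import tensorflow_datasets as tfds

BATCH_SIZE = 64

# tfds exposes the same example count that dataset_info.json records.
_, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
num_examples = info.splits['train'].num_examples  # 60000 for MNIST

train_datasets = make_datasets_unbatched().batch(BATCH_SIZE).repeat()
model = build_and_compile_cnn_model()
model.fit(x=train_datasets,
          epochs=2,
          steps_per_epoch=num_examples // BATCH_SIZE)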

@RomainSabathe

RomainSabathe commented Jan 25, 2020

Can also confirm that the error (warning) is still being raised on 2.1 (my docker base image is cuda:10.1-cudnn7-devel-ubuntu18.04).

# Triggers the warning.
dataset = raw_dataset.map(_parse_proto).take(32).batch(8)  
model.evaluate(dataset)

# No warning
dataset = raw_dataset.map(_parse_proto).take(32).batch(8)  
model.evaluate(dataset, steps=4)

@ismael-elatifi

ismael-elatifi commented Mar 11, 2020

I also get this warning in TF 2.1.0. model.predict(ds.batch(1)) works, but gives this warning:

2020-03-11 17:04:24.760612: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
	 [[{{node IteratorGetNext}}]]

@ericvoots

ericvoots commented Aug 21, 2020

I have a similar error but can't seem to find anywhere else where anyone is experiencing it; here is the traceback of my error:

Train on 2737611 samples, validate on 2737612 samples
Epoch 1/123
Epoch 2/123
Epoch 3/123
Epoch 4/123
Epoch 5/123
Epoch 6/123
2020-08-20 22:56:33.810266: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Invalid argument: assertion failed: [predictions must be >= 0] [Condition x >= y did not hold element-wise:] [x (sequential/dense_10/Sigmoid:0) = ] [[nan][nan][nan]...] [y (metrics/tp/Cast_2/x:0) = ] [0]
[[{{node metrics/tp/assert_greater_equal/Assert/AssertGuard/else/_1/Assert}}]]
[[metrics/recall/assert_greater_equal/Assert/AssertGuard/pivot_f/_143/_157]]
2020-08-20 22:56:33.824745: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Invalid argument: assertion failed: [predictions must be >= 0] [Condition x >= y did not hold element-wise:] [x (sequential/dense_10/Sigmoid:0) = ] [[nan][nan][nan]...] [y (metrics/tp/Cast_2/x:0) = ] [0]
[[{{node metrics/tp/assert_greater_equal/Assert/AssertGuard/else/_1/Assert}}]]
WARNING:tensorflow:Can save best model only with val_precision available, skipping.
Traceback (most recent call last):
File "tf_working.py", line 399, in
keras_auto_tuner(training_df, '1week_target_class')
File "tf_working.py", line 382, in keras_auto_tuner
validation_data=(val_features, y_val))
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\kerastuner\engine\base_tuner.py", line 130, in search
self.run_trial(trial, *fit_args, **fit_kwargs)
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\kerastuner\engine\multi_execution_tuner.py", line 96, in run_trial
history = model.fit(*fit_args, **copied_fit_kwargs)
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 819, in fit
use_multiprocessing=use_multiprocessing)
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 342, in fit
total_epochs=epochs)
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 128, in run_one_epoch
batch_outs = execution_function(iterator)
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py", line 98, in execution_function
distributed_function(input_fn))
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 568, in call
result = self._call(*args, **kwds)
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 599, in _call
return self._stateless_fn(*args, **kwds) # pylint: disable=not-callable
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\eager\function.py", line 2363, in call
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\eager\function.py", line 1611, in _filtered_call
self.captured_inputs)
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\eager\function.py", line 1692, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\eager\function.py", line 545, in call
ctx=ctx)
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: assertion failed: [predictions must be >= 0] [Condition x >= y did not hold element-wise:] [x (sequential/dense_10/Sigmoid:0) = ] [[nan][nan][nan]...] [y (metrics/tp/Cast_2/x:0) = ] [0]
[[{{node metrics/tp/assert_greater_equal/Assert/AssertGuard/else/_1/Assert}}]]
[[metrics/recall/assert_greater_equal/Assert/AssertGuard/pivot_f/_143/_157]]
(1) Invalid argument: assertion failed: [predictions must be >= 0] [Condition x >= y did not hold element-wise:] [x (sequential/dense_10/Sigmoid:0) = ] [[nan][nan][nan]...] [y (metrics/tp/Cast_2/x:0) = ] [0]
[[{{node metrics/tp/assert_greater_equal/Assert/AssertGuard/else/_1/Assert}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_distributed_function_222355]

Function call stack:
distributed_function -> distributed_function

@bw4sz

bw4sz commented Nov 6, 2020

Can anyone confirm what the result of this behavior is? I'm confused whether it's a logging error, or whether the final batch does not get trained/evaluated. For example, imagine I had 100 samples with a batch size of 52. Would I be training on batches of 52 and 48 (expected behavior), or would I train on 52 and then just fail to fill the next batch and move on to the next epoch? This is especially scary for a validation batch, and I would be terrified to find that I have a variable validation set (especially if you shuffle!). There is a lot of discussion in many spots, but no clear indication of the significance of this error. Some would have you believe it is just a warning. I am on tensorflow==2.1.0.
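One way to check the data side of this is to count what the batched iterator actually yields; a minimal sketch using the 100-sample / batch-size-52 numbers from the comment above (illustrative only, it says nothing about a particular Keras version):

import tensorflow as tf

data = tf.random.normal((100, 4))
labels = tf.ones((100, 1))
dataset = tf.data.Dataset.from_tensor_slices((data, labels)).batch(52)

batch_sizes = [int(x.shape[0]) for x, _ in dataset]
print(batch_sizes)       # [52, 48] -> the partial final batch is still yielded
print(sum(batch_sizes))  # 100     -> every sample appears exactly once per pass

fit() and evaluate() consume the same iterator, so unless a steps argument cuts the epoch short the partial batch is included; earlier comments in this thread report that training runs fine and the message is only the end-of-sequence signal.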

@GitHub-Of-WangQiang

(Quoting @duysqubix's explanation and workaround from earlier in this thread.)

Hi, I am a deep-learning freshman from China, and I met the same error as you. After searching the internet, I found the answer: you need to add repeat() (but remember not to pass it an argument), and then you need to add steps_per_epoch in fit(), with a value of x_train // batch_size (the number of training samples divided by the batch size). It works in my project; I hope it helps you solve your problem. My English is poor, please don't mind!

@sushreebarsa
Contributor

I tried to run it on Colab with TF v2.5 and faced a different error; please find the gist here. Thanks!

@sanatmpa1

@oracle3001,

I've tried reproducing the issue in TF 2.6.0 and it's working fine now. Please take a look at the gist here. Thanks!

@sanatmpa1 sanatmpa1 added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Sep 8, 2021
@google-ml-butler

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Sep 15, 2021
@google-ml-butler

Closing as stale. Please reopen if you'd like to work on this further.

@google-ml-butler

Are you satisfied with the resolution of your issue?
