
Re-emerged Issue #31509 - BaseCollectiveExecutor::StartAbort Out of range: #32817

Closed
adam-hartshorne opened this issue Sep 25, 2019 · 45 comments
Labels: comp:data (tf.data related issues), comp:keras (Keras related issues), stale (to be closed automatically if no activity), stat:awaiting response (awaiting response from author), TF 2.0 (issues relating to TensorFlow 2.0), type:bug (Bug)

Comments

@adam-hartshorne

adam-hartshorne commented Sep 25, 2019

The previous issue described in #31509 was fixed, but I am now experiencing exactly the same issue, with all the same setup, using the latest nightly build of TF 2.0 when using tf.keras.optimizers.Adam.

@adam-hartshorne adam-hartshorne changed the title Same Issue As Issue #31509 With Adamax - BaseCollectiveExecutor::StartAbort Out of range: Re-emerged Issue #31509 - BaseCollectiveExecutor::StartAbort Out of range: Sep 25, 2019
@ravikyram ravikyram self-assigned this Sep 26, 2019
@ravikyram ravikyram added comp:keras Keras related issues TF 2.0 Issues relating to TensorFlow 2.0 labels Sep 26, 2019
@ravikyram
Contributor

@oracle3001
I am not seeing any issue with tf.keras.optimizers.Adam in the latest TF 2.0.0-rc2 version. Please find the gist here. Thanks!

@ravikyram ravikyram added the stat:awaiting response Status - Awaiting response from author label Sep 26, 2019
@duysPES

duysPES commented Oct 3, 2019

I am having the exact same problem using this mock model. I am using the TF 2.0.0 release, on Windows.

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf



if __name__ == '__main__':

    x = tf.random.normal((14000, 30, 1))
    y = tf.ones_like(x)

    discriminator = tf.keras.models.Sequential([
        tf.keras.layers.LSTM(100, input_shape=(30, 1), return_sequences=True),
        tf.keras.layers.LSTM(100, recurrent_dropout=0.4,
                             dropout=0.4, return_sequences=True)
    ])

    discriminator.compile(loss='binary_crossentropy',
                          optimizer=tf.keras.optimizers.Adam(lr=0.001))

    dataset = tf.data.Dataset.from_tensor_slices((x, y))
    dataset = dataset.batch(64)

    discriminator.fit(dataset, epochs=2)

@ravikyram
Contributor

@duysPES
I am able to execute the code successfully in Colab using TF 2.0.0-rc2. Please find the gist here. Thanks!

@juliangall

I am also getting this message when feeding a dataset into a 1D convnet. It happens on my Mac with TF version 2.0.0-rc2. Not reproducible on Colab.

import numpy as np
import tensorflow as tf
def create_timeseries_element():
    # returns a random time series of 100 intervals, each with 3 features,
    # and a random one-hot array of 5 entries
    data = np.random.rand(100,3)
    label = np.eye(5, dtype='int')[np.random.choice(5)]
    return data, label

def data_generator():
    d, l = create_timeseries_element()
    yield (d, l)

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv1D(128, 9, activation='relu', input_shape=(100, 3)),
    tf.keras.layers.Conv1D(128, 9, activation='relu'),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(256, 5, activation='relu'),
    tf.keras.layers.Conv1D(256, 5, activation='relu'),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(5, activation='softmax')])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

ds = tf.data.Dataset.from_generator(data_generator, output_types=(tf.float32, tf.int32),
                                      output_shapes=(tf.TensorShape([100, 3]), tf.TensorShape([5])))
model.fit(ds.batch(32))

@raceee

raceee commented Oct 5, 2019

I am having an issue similar to this, and tried to run it in Colab only to get a never-ending runtime. I asked my question in full on SO here.

I had some numpy arrays that I trained with Keras in the previous version of TF and now have to rewrite my model. I got much worse accuracy, so I am thinking I need to switch to tf.data.Dataset.

So I did:
train_dataset = tf.data.Dataset.from_tensor_slices((X_train_deleted_nans, y_train_no_nans))
train_dataset = train_dataset.shuffle(SHUFFLE_CONST).batch(BATCH_SIZE)

model.summary() gave me:

BatchDataset shapes: ((None, 2756), (None,)), types: (tf.float64, tf.int64)
Model: sequential

Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 1379)              3801903   
_________________________________________________________________
dropout (Dropout)            (None, 1379)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 1379)              1903020   
_________________________________________________________________
dropout_1 (Dropout)          (None, 1379)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 1379)              1903020   
_________________________________________________________________
dropout_2 (Dropout)          (None, 1379)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 1379)              1903020   
_________________________________________________________________
dropout_3 (Dropout)          (None, 1379)              0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 1380      
=================================================================
Total params: 9,512,343
Trainable params: 9,512,343
Non-trainable params: 0

model.compile(optimizer=adam, loss=bce, metrics=['accuracy'])
model.fit(train_dataset, epochs=1000, verbose=0)

Once the training starts I get this warning:

2019-10-04 23:47:56.691434: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
     [[{{node IteratorGetNext}}]]

@AhmUgEk

AhmUgEk commented Oct 5, 2019

I am having the same issue as above on the TF 2.0 release. Is this a bug in TensorFlow, or is there an issue with the code?

@adam-hartshorne
Author

It seems everybody who is having this issue is using Windows. I presume that must have something to do with it?

@juliangall

I am having the issue on a Mac with the latest version of macOS.

@mtpgva

mtpgva commented Oct 6, 2019

I am having the same problem after porting my code from 1.14 to 2.0.

I am running on Ubuntu 18.04 (so not only a Windows problem). It occurs for me during both training and predict, so it is not linked to the optimiser. I do NOT get the problem if I hide the GPU. I do get the problem if I expose the GPU.

Edit - Note: everything seems to run properly, I just get the warnings.

Edit - In another case, I get the problem whether I use the GPU or not.

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Oct 7, 2019
@duysqubix

duysqubix commented Oct 7, 2019

I think I may have found why it is complaining. However, I have no idea how to fix it. While training, we all get the IteratorGetNext error: sequence out of range.

I noticed that, say I have a dataset of 60,000 samples with a batch size of 64; that would require floor(60000/64) = 937 steps to iterate through the entire dataset for one epoch. However, when training with .fit(verbose=1) I noticed that it attempts to iterate through the dataset 938 times (most likely a rounding issue, because 60000/64 = 937.5), and thus I get this error. Can someone please confirm this is the case for you as well? Thanks
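For reference, a minimal sketch of the batch arithmetic described above (illustrative only; the 60,000-sample / batch-size-64 numbers are taken from this comment). It suggests the 938 shown on the progress bar is the expected total including a partial final batch, rather than a rounding error:

import math

num_samples = 60000
batch_size = 64

full_batches = num_samples // batch_size             # 937 full batches of 64 samples
total_batches = math.ceil(num_samples / batch_size)  # 938 batches in total; the last one holds only 32 samples

print(full_batches, total_batches)  # 937 938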

Edit:

So I found a way around this: when building the tf.data.Dataset, make sure to add the .repeat() method (otherwise the program will complain that you ran out of data), and when using .fit() add steps_per_epoch as shown below.

Here is a full example that got it working.

This will cause the error:

data = tf.random.normal((60000,30,4))
ground_truth = tf.ones((60000,1))
dataset = tf.data.Dataset.from_tensor_slices((data, ground_truth)).batch(64)

#predefined model here: input: [?, 30,4] output: [?,1]
model.fit(dataset, epochs=5)

'''
    938/Unknown - 16s 17ms/step - loss: 0.02172019-10-07 14:49:49.928619: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[Shape/_2]]
2019-10-07 14:49:49.928619: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
938/938 [==============================] - 16s 17ms/step - loss: 0.0217
Epoch 2/5
935/938 [============================>.] - ETA: 0s - loss: 2.2229e-062019-10-07 14:49:59.722216: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
2019-10-07 14:49:59.722218: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[Shape/_2]]
'''

This is the workaround.

batch_size = 64
data = tf.random.normal((60000,30,4))
ground_truth = tf.ones((60000,1))
dataset = tf.data.Dataset.from_tensor_slices((data, ground_truth)).batch(batch_size).repeat()

#predefined model here: input: [?, 30,4] output: [?,1]
model.fit(dataset, epochs=5, steps_per_epoch=data.shape[0]//batch_size)

'''
937/937 [==============================] - 15s 16ms/step - loss: 0.0135
Epoch 2/5
937/937 [==============================] - 10s 10ms/step - loss: 1.4460e-05
Epoch 3/5
937/937 [==============================] - 10s 11ms/step - loss: 4.3097e-06
Epoch 4/5
937/937 [==============================] - 10s 10ms/step - loss: 1.8212e-06
Epoch 5/5
'''


@mtpgva

mtpgva commented Oct 8, 2019 via email

@samueljackson92

I'm experiencing a similar problem to @duysqubix with my code, in that I have a number of samples that doesn't divide evenly by the batch size. @duysqubix's code works for me and the error disappears if I repeat the dataset and specify steps_per_epoch.

  • I'm seeing this on Ubuntu 18.04, so definitely not a Windows-only problem.
  • I see this issue with both the TensorFlow 2 release and the TensorFlow 2 RC2.

Trying @mtpgva's advice above and using .take() to select a number of samples that is divisible by the batch size, I find that I still get the same message, even with the simplified example provided by @duysqubix:

import tensorflow as tf
data = tf.random.normal((60000,30,4))
ground_truth = tf.ones((60000,1))
dataset = tf.data.Dataset.from_tensor_slices((data, ground_truth)).take(512).batch(64)

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1, activation='softmax')
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

#predefined model here: input: [?, 30,4] output: [?,1]
model.fit(dataset, epochs=5)
Epoch 1/5
2019-10-08 09:01:00.212603: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
      8/Unknown - 1s 84ms/step - loss: 102.0359 - accuracy: 1.00002019-10-08 09:01:00.443158: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[IteratorGetNext/_16]]
2019-10-08 09:01:00.443241: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
8/8 [==============================] - 1s 85ms/step - loss: 102.0359 - accuracy: 1.0000
Epoch 2/5
1/8 [==>...........................] - ETA: 0s - loss: 102.0359 - accuracy: 1.00002019-10-08 09:01:00.502043: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[Shape/_4]]
2019-10-08 09:01:00.502100: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
8/8 [==============================] - 0s 7ms/step - loss: 102.0359 - accuracy: 1.0000
Epoch 3/5
1/8 [==>...........................] - ETA: 0s - loss: 102.0359 - accuracy: 1.00002019-10-08 09:01:00.544339: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[IteratorGetNext/_16]]
2019-10-08 09:01:00.544373: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
8/8 [==============================] - 0s 5ms/step - loss: 102.0359 - accuracy: 1.0000
Epoch 4/5
1/8 [==>...........................] - ETA: 0s - loss: 102.0359 - accuracy: 1.00002019-10-08 09:01:00.587002: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[IteratorGetNext/_16]]
2019-10-08 09:01:00.587044: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
8/8 [==============================] - 0s 5ms/step - loss: 102.0359 - accuracy: 1.0000
Epoch 5/5
1/8 [==>...........................] - ETA: 0s - loss: 102.0359 - accuracy: 1.00002019-10-08 09:01:00.631688: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[Shape/_4]]
2019-10-08 09:01:00.631740: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
8/8 [==============================] - 0s 6ms/step - loss: 102.0359 - accuracy: 1.0000

I also tried using the drop_remainder=True argument on .batch but still get the error message:

import tensorflow as tf
data = tf.random.normal((60000,30,4))
ground_truth = tf.ones((60000,1))
dataset = tf.data.Dataset.from_tensor_slices((data, ground_truth)).batch(64, drop_remainder=True)

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1, activation='softmax')
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

#predefined model here: input: [?, 30,4] output: [?,1]
model.fit(dataset, epochs=5)
Epoch 1/5
2019-10-08 09:03:47.431058: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    937/Unknown - 3s 3ms/step - loss: 102.0359 - accuracy: 1.00002019-10-08 09:03:50.275433: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[IteratorGetNext/_2]]
2019-10-08 09:03:50.275587: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
937/937 [==============================] - 3s 3ms/step - loss: 102.0359 - accuracy: 1.0000
Epoch 2/5
919/937 [============================>.] - ETA: 0s - loss: 102.0359 - accuracy: 1.00002019-10-08 09:03:52.891814: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[IteratorGetNext/_2]]
2019-10-08 09:03:52.891940: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
937/937 [==============================] - 3s 3ms/step - loss: 102.0359 - accuracy: 1.0000
Epoch 3/5
931/937 [============================>.] - ETA: 0s - loss: 102.0359 - accuracy: 1.00002019-10-08 09:03:55.506978: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[IteratorGetNext/_2]]
2019-10-08 09:03:55.507100: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
937/937 [==============================] - 3s 3ms/step - loss: 102.0359 - accuracy: 1.0000
Epoch 4/5
918/937 [============================>.] - ETA: 0s - loss: 102.0359 - accuracy: 1.00002019-10-08 09:03:58.045499: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[IteratorGetNext/_2]]
2019-10-08 09:03:58.045610: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
937/937 [==============================] - 3s 3ms/step - loss: 102.0359 - accuracy: 1.0000
Epoch 5/5
932/937 [============================>.] - ETA: 0s - loss: 102.0359 - accuracy: 1.00002019-10-08 09:04:00.654601: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[IteratorGetNext/_2]]
2019-10-08 09:04:00.654715: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
937/937 [==============================] - 3s 3ms/step - loss: 102.0359 - accuracy: 1.0000

@duysqubix

I think we may need to summon @fchollet

@raceee

raceee commented Oct 10, 2019

@duysqubix Your suggestion fixed my issue!

@ravikyram
Contributor

@oracle3001

Can you please let us know if the issue still persists? Please close the issue if it was resolved already. Thanks!

@npuichigo

Any update?

@AhmUgEk

AhmUgEk commented Oct 14, 2019

@oracle3001

Can you please let us know if the issue still persists? Please close the issue if it was resolved already. Thanks!

This fixes my issue; however, surely a reduced final batch size should not create an issue?

Repeating part of the first batch of data should not be the solution, surely?

@BeWe11

BeWe11 commented Oct 14, 2019

Hey @ravikyram,

the solution that @duysqubix posted is a workaround; the underlying problem still exists.

The problem is that the number of iterations performed on the dataset is greater than the number of batches in the dataset. I'm not actually sure that this is a bug, considering that Python iteration uses the StopIteration exception to mark the end of iterables as well. But if that's the case, the warning should not be displayed.

The workaround "fixes" this by giving an explicitly calculated number of iterations to the model.fit method. This should not be necessary and might not even be possible in all cases. For example, when using bucketing, the exact number of batches cannot be easily extracted from the dataset (except by performing a full dataset iteration before training, which would be a workaround as well).

So either the behavior is correct and the warning should be hidden, or the internally calculated number of iterations is faulty and should be changed.
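A minimal sketch of the "count the batches with a full dataset iteration" workaround mentioned above, for pipelines (such as bucketed ones) whose batch count is not known up front. The dataset here is a hypothetical stand-in, not taken from any comment in this thread:

import tensorflow as tf

# Hypothetical stand-in for a pipeline whose batch count is unknown in advance.
data = tf.random.normal((1000, 8))
labels = tf.ones((1000, 1))
dataset = tf.data.Dataset.from_tensor_slices((data, labels)).batch(64)

# One full pass over the dataset just to count its batches.
steps_per_epoch = sum(1 for _ in dataset)  # 16 here: 15 full batches plus one partial batch

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Repeat the dataset and tell fit() exactly how many steps make up one epoch.
model.fit(dataset.repeat(), epochs=2, steps_per_epoch=steps_per_epoch)

The extra pass is wasted work for large datasets, which is exactly the objection raised above; it only avoids hard-coding the step count.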

@adam-hartshorne
Author

@oracle3001

Can you please let us know if the issue still persists? Please close the issue if it was resolved already. Thanks!

Yes, it still persists... see all the other posts with the same issue.

@AndreaRigoni

AndreaRigoni commented Oct 16, 2019

Hi, I agree with @BeWe11, the issue is still there.
Moreover, if you are using a tf.data.Dataset.from_generator() pipeline, the number of actual steps has to be computed during the first epoch.

@ravikyram ravikyram added comp:data tf.data related issues type:bug Bug labels Oct 16, 2019
@ravikyram ravikyram assigned gowthamkpr and unassigned ravikyram Oct 16, 2019
@LaurentBerger

@sharkdtu I changed the code to use tf.data.experimental.cardinality; issue here

@npuichigo Why? (Of course I must shuffle the data before fit, but it's only an example.)
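For reference, a minimal sketch of the tf.data.experimental.cardinality approach mentioned above (assumption: the pipeline's size is statically known; for generator- or filter-based pipelines it returns UNKNOWN_CARDINALITY and the step count still has to be measured or supplied by hand):

import tensorflow as tf

data = tf.random.normal((60000, 30, 4))
ground_truth = tf.ones((60000, 1))
dataset = tf.data.Dataset.from_tensor_slices((data, ground_truth)).batch(64)

n_batches = tf.data.experimental.cardinality(dataset)
print(int(n_batches))  # 938: 937 full batches plus one partial batch of 32

if n_batches != tf.data.experimental.UNKNOWN_CARDINALITY:
    # Safe to pass straight to fit() together with a repeated dataset.
    steps_per_epoch = int(n_batches)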

@600DZY

600DZY commented Dec 11, 2019

Some of the parameters of Keras's fit function are as follows:
model.fit(self, x=None, y=None, epochs=1, steps_per_epoch=None)
  • [epochs: the number of iterations; one iteration can be roughly understood as training the model with one batch of training data.]

  • [steps_per_epoch: the number of batches consumed by the model per iteration. It can be roughly seen as combining a fixed number of batches into one bigger batch, then training the model with that bigger batch; the end of that training completes one epoch.]

  • My view: this is a problem of insufficient training data during model training, which can be seen as a producer/consumer relationship. A TensorFlow 2.0 Dataset integrates the functionality of a generator and can directly serve as a generator that yields training data. The Dataset keeps providing data and the training process keeps consuming it; once the Dataset has no more data to provide while training is not yet finished, an error is reported. Therefore you need to ensure that epochs * steps_per_epoch <= the number of batches the Dataset can provide. You can determine batch_size and steps_per_epoch from experience, and then call repeat() on the full dataset to avoid running out of data during training. If you don't think the batches need further handling, you can set steps_per_epoch=1.

@zhulingchen

Splitting the full dataset into multiple batches and training the model batch by batch is standard practice. First, two concepts:

* [batch_size: the number of samples in each batch]

* [batches: the number of batches obtained after splitting the whole dataset by batch_size]
Some of the parameters of Keras's fit function are as follows:
model.fit(self, x=None, y=None, epochs=1, steps_per_epoch=None)
* [epochs: the number of iterations; one iteration can be roughly understood as training the model with one batch of training data.]

* [steps_per_epoch: the number of batches consumed by the model per iteration. It can be roughly seen as combining a fixed number of batches into one bigger batch, then training the model with that bigger batch; the end of that training completes one epoch.]

* My view: this is a problem of insufficient training data during model training, which can be seen as a producer/consumer relationship. A TensorFlow 2.0 Dataset integrates the functionality of a generator and can directly serve as a generator that yields training data. The Dataset keeps providing data and the training process keeps consuming it; once the Dataset has no more data to provide while training is not yet finished, an error is reported. Therefore you need to ensure that epochs * steps_per_epoch <= the number of batches the Dataset can provide. You can determine batch_size and steps_per_epoch from experience, and then call repeat() on the full dataset to avoid running out of data during training. If you don't think the batches need further handling, you can set steps_per_epoch=1.

Verification:

train_data = tf.random.normal((5,4))  # 5 four-dimensional feature vectors
label = tf.ones((5,1))  # 5 class labels
dataset = tf.data.Dataset.from_tensor_slices((train_data, label))
dataset
<TensorSliceDataset shapes: ((4,), (1,)), types: (tf.float32, tf.float32)>

The full dataset is split according to batch_size; if the last batch contains fewer than batch_size elements, drop_remainder decides whether it is discarded. In the example above, the Dataset built from train_data and label contains 5 tensors used to train the model (call them train tensors), each of which in turn contains 2 tensors: one 4-dimensional feature vector and one label.

dataset = dataset.batch(batch_size, drop_remainder=True).repeat(2)

dataset
<RepeatDataset shapes: ((2, 4), (2, 1)), types: (tf.float32, tf.float32)>

Calling batch() with batch_size=2 and drop_remainder=True gives batches == 2; each batch contains 2 train tensors, and the last batch (of size 1) is dropped. After repeat(2), batches == 4.

model.fit(dataset, epochs=4, steps_per_epoch=1)
# The x and y parameters of fit represent the feature vectors and classes; a Dataset variable can be passed directly.

The dataset has 4 batches with batch_size == 2; training the model with 1 batch per step (bigger_batch_size == batch_size x 1 == 2), it can iterate 4 times.

model.fit(dataset, epochs=1, steps_per_epoch=4)

The dataset has 4 batches with batch_size == 2; training the model with 4 batches per step (bigger_batch_size == batch_size x 4 == 8), it can iterate 1 time.

The complete verification code is as follows:
import tensorflow as tf
tf.__version__

def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, activation=tf.nn.relu, input_shape=(4,)),  
        tf.keras.layers.Dense(10, activation=tf.nn.relu),
        tf.keras.layers.Dense(3, activation='softmax')])
    model.compile(
        optimizer='Adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

def check_data_batch_size(dataset):
    #iterator = iter(dataset)
    iterator = dataset.__iter__()
    i=0
    try:
        while i<100:
            #data = next(iterator)
            data = iterator.get_next()
            i += 1
            print('id:',i)
            print('data:',data)
    except Exception as e:
        print(repr(e))
    return i

batch_size =  2
data = tf.random.normal((5,4))
label = tf.ones((5,1))
dataset = tf.data.Dataset.from_tensor_slices((data, label))
dataset = dataset.batch(2, drop_remainder=True).repeat(2)
batches = check_data_batch_size(dataset)
print('batches:',batches)
model = build_model()
model.fit(dataset, epochs=2, steps_per_epoch=2)

Is this reply related to the original question?

@600DZY

600DZY commented Dec 12, 2019 via email

@eustomaqua

@duysqubix Brilliant, your suggestion fixed my issue.

And for other people facing this issue, just for the record:

# loss, acc = net.evaluate(tst_set)  # do not use this when using a Repeating dataset
loss, acc = net.evaluate(tst_set, steps=3)  # e.g., 3
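If you would rather not hard-code the steps value, a small sketch of deriving it instead (num_test_samples and batch_size are hypothetical names, not from the original comment):

import math

num_test_samples = 160   # hypothetical size of tst_set before batching
batch_size = 64

eval_steps = math.ceil(num_test_samples / batch_size)  # 3: enough steps to see every sample once
# loss, acc = net.evaluate(tst_set, steps=eval_steps)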

@Xiaohui-Z

I get the same problem with TensorFlow 2 (GPU) on CentOS. Does anyone know how to fix this problem?

@00krishna

This issue should be fixed by the pull request for issue #35314. The warning was actually propagated up from C++, and Python was just passing it forward. But there is really no problem here, no issues with training or anything, according to that issue.

The solution was that Google lowered the logging level to ignore these warnings. The change is in the TF 2.0 nightly build and will be widely available in the next release. But you can use TF nightly to get the benefit now.

So this issue can probably be closed.
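Until that release is available, the C++ warning can be silenced manually; a minimal sketch, assuming you only want to hide the log noise (the training itself is unaffected). The environment variable must be set before TensorFlow is imported:

import os

# 0 = all messages, 1 = filter INFO, 2 = filter INFO and WARNING, 3 = filter INFO, WARNING and ERROR.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import tensorflow as tf  # import after setting the variable so the C++ runtime picks it up

# Optionally quiet the Python-side logger as well.
tf.get_logger().setLevel('ERROR')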

@Xiaohui-Z

TensorFlow 2.1 (stable) has been released; does anyone know if this warning is fixed in the new version?

@OvidZheng

OvidZheng commented Jan 13, 2020

I have the same problem when practicing the code from the official tutorial.
I am using Catalina 10.15, Python 3.7.6, TF 2.1.0.

import tensorflow as tf
import tensorflow_hub as hub

embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])

model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.summary()

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=20,
                    validation_data=validation_data.batch(512),
                    verbose=1)

results = model.evaluate(test_data.batch(512), verbose=2)
for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))
29/30 [============================>.] - ETA: 0s - loss: 0.2012 - accuracy: 0.92892020-01-13 13:53:00.393082: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
	 [[{{node IteratorGetNext}}]]

@Flamefire
Contributor

@Xiaohui-Z I can also confirm that the issue is not solved. Using the example code from the TF docs still produces the issue:

import tensorflow_datasets as tfds
import tensorflow as tf
tfds.disable_progress_bar()


def make_datasets_unbatched():
    # Scaling MNIST data from (0, 255] to (0., 1.]
    def scale(image, label):
        image = tf.cast(image, tf.float32)
        image /= 255
        return image, label

    datasets, info = tfds.load(name='mnist',
                               with_info=True,
                               as_supervised=True)

    return datasets['train'].map(scale).cache().shuffle(10000)


def build_and_compile_cnn_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32,
                               3,
                               activation='relu',
                               input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(loss=tf.keras.losses.sparse_categorical_crossentropy,
                  optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
                  metrics=['accuracy'])
    return model


train_datasets = make_datasets_unbatched().batch(64)
model = build_and_compile_cnn_model()

model.fit(x=train_datasets, epochs=2)

I noted that this only happens during the first iteration, where the total count seems to be unknown. This is odd too, because numExamples in the statistics key of dataset_info.json is set correctly.
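A sketch of feeding that known example count to fit() so the step total is not "Unknown" in the first epoch, reusing the helpers from the snippet above (assumption: make_datasets_unbatched and build_and_compile_cnn_model are defined exactly as in that snippet):

import tensorflow_datasets as tfds

BATCH_SIZE = 64

# tfds exposes the same example count that dataset_info.json records.
_, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
num_examples = info.splits['train'].num_examples  # 60000 for MNIST

train_datasets = make_datasets_unbatched().batch(BATCH_SIZE).repeat()
model = build_and_compile_cnn_model()
model.fit(x=train_datasets,
          epochs=2,
          steps_per_epoch=num_examples // BATCH_SIZE)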

@RomainSabathe

RomainSabathe commented Jan 25, 2020

Can also confirm that the error (warning) is still being raised on 2.1 (my docker base image is cuda:10.1-cudnn7-devel-ubuntu18.04).

# Triggers the warning.
dataset = raw_dataset.map(_parse_proto).take(32).batch(8)  
model.evaluate(dataset)

# No warning
dataset = raw_dataset.map(_parse_proto).take(32).batch(8)  
model.evaluate(dataset, steps=4)

@ismael-elatifi

ismael-elatifi commented Mar 11, 2020

I also get this warning in TF 2.1.0. model.predict(ds.batch(1)) works, but gives this warning:

2020-03-11 17:04:24.760612: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
	 [[{{node IteratorGetNext}}]]

@ericvoots

ericvoots commented Aug 21, 2020

I have a similar error but can't seem to find anywhere else where anyone is experiencing it; here is the traceback of my error:

Train on 2737611 samples, validate on 2737612 samples
Epoch 1/123
Epoch 2/123
Epoch 3/123
Epoch 4/123
Epoch 5/123
Epoch 6/123
2020-08-20 22:56:33.810266: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Invalid argument: assertion failed: [predictions must be >= 0] [Condition x >= y did not hold element-wise:] [x (sequential/dense_10/Sigmoid:0) = ] [[nan][nan][nan]...] [y (metrics/tp/Cast_2/x:0) = ] [0]
[[{{node metrics/tp/assert_greater_equal/Assert/AssertGuard/else/_1/Assert}}]]
[[metrics/recall/assert_greater_equal/Assert/AssertGuard/pivot_f/_143/_157]]
2020-08-20 22:56:33.824745: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Invalid argument: assertion failed: [predictions must be >= 0] [Condition x >= y did not hold element-wise:] [x (sequential/dense_10/Sigmoid:0) = ] [[nan][nan][nan]...] [y (metrics/tp/Cast_2/x:0) = ] [0]
[[{{node metrics/tp/assert_greater_equal/Assert/AssertGuard/else/_1/Assert}}]]
WARNING:tensorflow:Can save best model only with val_precision available, skipping.
Traceback (most recent call last):
File "tf_working.py", line 399, in
keras_auto_tuner(training_df, '1week_target_class')
File "tf_working.py", line 382, in keras_auto_tuner
validation_data=(val_features, y_val))
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\kerastuner\engine\base_tuner.py", line 130, in search
self.run_trial(trial, *fit_args, **fit_kwargs)
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\kerastuner\engine\multi_execution_tuner.py", line 96, in run_trial
history = model.fit(*fit_args, **copied_fit_kwargs)
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 819, in fit
use_multiprocessing=use_multiprocessing)
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 342, in fit
total_epochs=epochs)
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 128, in run_one_epoch
batch_outs = execution_function(iterator)
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py", line 98, in execution_function
distributed_function(input_fn))
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 568, in call
result = self._call(*args, **kwds)
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 599, in _call
return self._stateless_fn(*args, **kwds) # pylint: disable=not-callable
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\eager\function.py", line 2363, in call
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\eager\function.py", line 1611, in _filtered_call
self.captured_inputs)
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\eager\function.py", line 1692, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\eager\function.py", line 545, in call
ctx=ctx)
File "C:\Users\evoot\anaconda3\envs\tf_sh\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: assertion failed: [predictions must be >= 0] [Condition x >= y did not hold element-wise:] [x (sequential/dense_10/Sigmoid:0) = ] [[nan][nan][nan]...] [y (metrics/tp/Cast_2/x:0) = ] [0]
[[{{node metrics/tp/assert_greater_equal/Assert/AssertGuard/else/_1/Assert}}]]
[[metrics/recall/assert_greater_equal/Assert/AssertGuard/pivot_f/_143/_157]]
(1) Invalid argument: assertion failed: [predictions must be >= 0] [Condition x >= y did not hold element-wise:] [x (sequential/dense_10/Sigmoid:0) = ] [[nan][nan][nan]...] [y (metrics/tp/Cast_2/x:0) = ] [0]
[[{{node metrics/tp/assert_greater_equal/Assert/AssertGuard/else/_1/Assert}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_distributed_function_222355]

Function call stack:
distributed_function -> distributed_function

@bw4sz

bw4sz commented Nov 6, 2020

Can anyone confirm what the result of this behavior is? I'm confused whether it's a logging error, or whether the final batch does not get trained/evaluated. For example, imagine I had 100 samples with a batch size of 52. Would I be training on batches of 52 and 48 (expected behavior), or would I train on 52 and then just fail to fill the next batch and move on to the next epoch? This is especially scary for a validation batch, and I would be terrified to find that I have a variable validation set (especially if you shuffle!). There is a lot of discussion in many spots, but no clear indication of the significance of this error. Some would have you believe it is just a warning. I am on tensorflow==2.1.0.
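One way to check the data side of this is to count what the batched iterator actually yields; a minimal sketch using the 100-sample / batch-size-52 numbers from the comment above (illustrative only, it says nothing about a particular Keras version):

import tensorflow as tf

data = tf.random.normal((100, 4))
labels = tf.ones((100, 1))
dataset = tf.data.Dataset.from_tensor_slices((data, labels)).batch(52)

batch_sizes = [int(x.shape[0]) for x, _ in dataset]
print(batch_sizes)       # [52, 48] -> the partial final batch is still yielded
print(sum(batch_sizes))  # 100     -> every sample appears exactly once per pass

fit() and evaluate() consume the same iterator, so unless a steps argument cuts the epoch short the partial batch is included; earlier comments in this thread report that training runs fine and the message is only the end-of-sequence signal.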

@GitHub-Of-WangQiang

(Quoting @duysqubix's explanation and workaround from earlier in this thread.)

Hi, I am a deep-learning freshman from China, and I met the same error as you. After searching the internet, I found the answer: you need to add repeat() (but remember not to pass it an argument), and then you need to add steps_per_epoch in fit(), with a value of x_train // batch_size (the number of training samples divided by the batch size). It works in my project; I hope it helps you solve your problem. My English is poor, please don't mind!

@sushreebarsa
Contributor

I tried to run it on Colab with TF v2.5 and faced a different error; please find the gist here. Thanks!

@sanatmpa1

@oracle3001,

I've tried reproducing the issue in TF 2.6.0 and it's working fine now. Please take a look at the gist here. Thanks!

@sanatmpa1 sanatmpa1 added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Sep 8, 2021
@google-ml-butler

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Sep 15, 2021
@google-ml-butler

Closing as stale. Please reopen if you'd like to work on this further.

@google-ml-butler

Are you satisfied with the resolution of your issue?
