
Keras & data.Dataset : "Your dataset iterator ran out of data" #25254

Closed
andsteing opened this issue Jan 28, 2019 · 27 comments
Labels
comp:data (tf.data related issues) · comp:keras (Keras related issues) · type:bug (Bug)

Comments

@andsteing

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Colab
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: n/a
  • TensorFlow installed from (source or binary): Colab
  • TensorFlow version (use command below): 1.12
  • Python version: 3.6.7
  • Bazel version (if compiling from source): n/a
  • GCC/Compiler version (if compiling from source): n/a
  • CUDA/cuDNN version: n/a
  • GPU model and memory: n/a

Describe the current behavior
Keras model.fit() does not reset the validation dataset iterator between epochs. Thus, when validation_steps < validation_dataset_size / batch_size is specified, every evaluation is performed on a different set of examples.

Describe the expected behavior
I would expect model.fit() to restart from the beginning of the validation dataset after every epoch of training. This way the validation dataset could be used without .repeat(), and the evaluation would be performed on the same set of examples.

Code to reproduce the issue
https://colab.research.google.com/drive/1UjKNbX38UC4EG6EPm6xLzQ1AmFV8HWe5

Other info / logs

WARNING:tensorflow:Your dataset iterator ran out of data interrupting testing. Make sure that your dataset can generate at least `steps` batches (in this case, 100 batches). You may need to use the repeat() function when building your dataset.
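
A minimal sketch of the setup described above (the model, dataset sizes, and step counts are hypothetical placeholders, not the contents of the linked Colab):

import tensorflow as tf

def make_ds(n):
    # n examples of a dummy 4-feature input with a scalar label, batched by 10
    return tf.data.Dataset.range(n).map(
        lambda i: (tf.ones([4]), tf.zeros([1]))).batch(10)

train_ds = make_ds(1000).repeat()
val_ds = make_ds(1000)  # 100 validation batches, intentionally without .repeat()

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

# validation_steps=50 < 100 available batches: per the report, the validation
# iterator is not reset between epochs, so each epoch evaluates a different
# 50-batch slice until the iterator runs out and the warning above appears.
model.fit(train_ds, steps_per_epoch=100, epochs=5,
          validation_data=val_ds, validation_steps=50)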
@ghost

ghost commented Jan 29, 2019

I think it would be nice to have tf.keras treat the dataset running out of data as the end of an epoch; that would make steps_per_epoch and validation_steps unneeded.

@jvishnuvardhan jvishnuvardhan self-assigned this Jan 29, 2019
@jvishnuvardhan jvishnuvardhan added comp:keras Keras related issues comp:data tf.data related issues type:bug Bug labels Jan 29, 2019
@jvishnuvardhan jvishnuvardhan added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jan 29, 2019
@suphoff
Contributor

suphoff commented Feb 1, 2019

As a quick workaround, the validation dataset could be trimmed to validation_steps * batch_size examples using .take() before .repeat().
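
A sketch of this workaround (train_ds, raw_val_ds, the model, and the sizes are placeholders; .take() is applied before batching, so it counts individual examples):

batch_size = 32
validation_steps = 100

# Keep exactly validation_steps * batch_size examples, then repeat so the
# iterator never runs out; every epoch evaluates the same fixed subset.
val_ds = (raw_val_ds
          .take(validation_steps * batch_size)
          .repeat()
          .batch(batch_size))

model.fit(train_ds, epochs=10,
          validation_data=val_ds,
          validation_steps=validation_steps)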

@mrry mrry assigned omalleyt12 and unassigned mrry Feb 21, 2019
@mrry
Contributor

mrry commented Feb 21, 2019

Reassigning this to @omalleyt12, since I think he has been improving the validation path lately, and I believe this feature would need to be implemented at the Keras level (but feel free to reassign, Tom!).

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Feb 22, 2019
@omalleyt12
Contributor

We now allow users to not pass validation_steps or steps_per_epoch for datasets, as in @cassianocasagrande's suggestion.

@felixnext

felixnext commented Mar 22, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: n/a
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.13 / 2.0-alpha
  • Python version: 3.5.2
  • Bazel version (if compiling from source): n/a
  • GCC/Compiler version (if compiling from source): n/a
  • CUDA/cuDNN version: 10.0/7.5
  • GPU model and memory: 1080 Ti / 32 GB

I still encounter similar problems with TF 2.0 alpha and TF 1.13.

TF 1.13

gen_train = # ... generator function ...
gen_test = # ... generator function ...
# creation of datasets
types = (tf.float32, tf.int32)
shapes = ((512, 512, 3), (2,))
ds_train = tf.data.Dataset.from_generator(lambda: gen_train, types, shapes).shuffle(1000).repeat().batch(32)
ds_test = tf.data.Dataset.from_generator(lambda: gen_test, types, shapes).shuffle(100).repeat().batch(32)

# usage in model
model.fit(ds_train, steps_per_epoch=188, validation_data=ds_test, validation_steps=20, epochs=10, verbose=True, callbacks=[visualize, tensorboard])

gen_train provides tuples of an image and a one-hot vector. steps_per_epoch is set to the exact number of batches in the dataset.
However, once I reach batch 156 (the one where the dataset would need to load the next iteration for shuffling), the system stalls: Python shows medium CPU usage (25-35%) and there is no progress in training at all.

TF 2.0

gen_train = # ... generator function ...
gen_test = # ... generator function ...
# creation of datasets
types = (tf.float32, tf.int32)
shapes = ((512, 512, 3), (2,))
ds_train = tf.data.Dataset.from_generator(lambda: gen_train, types, shapes).shuffle(1000).batch(32)
ds_test = tf.data.Dataset.from_generator(lambda: gen_test, types, shapes).shuffle(100).batch(32)

# usage in model
model.fit(ds_train, validation_data=ds_test, epochs=10, verbose=True, callbacks=[visualize, tensorboard])

In this case the system completes the first epoch and the evaluation. However, at the beginning of the second epoch I get the following warning:

W0322 18:36:04.919457 140678915827456 training_generator.py:228] Your dataset ran out of data; interrupting training. Make sure that your dataset can generate at least `steps_per_epoch * epochs` batches (in this case, 1880 batches). You may need to use the repeat() function when building your dataset.

If I use .repeat() on the datasets (and provide the steps_per_epoch and validation_steps args), I run into the same problem as with TF 1.13.

@omalleyt12 Am I right in assuming that the change is already included in the tf-2.0-alpha release? (TF 1.13 raises an error if I do not provide steps_per_epoch and validation_steps, while TF 2 does not.)

@omalleyt12
Contributor

I'm guessing the generator runs out of data; I don't think repeat() works with generators.

@felixnext

Thanks for the note.
It actually does work, but the lambda function has to create the generator. So the code would look like this:

# creation of datasets
types = (tf.float32, tf.int32)
shapes = ((512, 512, 3), (2,))
ds_train = tf.data.Dataset.from_generator(lambda: fct_to_create_train_gen(), types, shapes).shuffle(1000).batch(32)
ds_test = tf.data.Dataset.from_generator(lambda: fct_to_create_test_gen(), types, shapes).shuffle(100).batch(32)

# usage in model
model.fit(ds_train, validation_data=ds_test, epochs=10, verbose=True, callbacks=[visualize, tensorboard])

An alternative would be to create a generator that loops infinitely over the data, but that would require providing steps_per_epoch and validation_steps.
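
A sketch of that alternative (reusing the hypothetical fct_to_create_train_gen factory and the types/shapes definitions from the snippets above):

def infinite_train_gen():
    # Loop over the underlying data forever.
    while True:
        for example in fct_to_create_train_gen():
            yield example

ds_train = (tf.data.Dataset
            .from_generator(infinite_train_gen, types, shapes)
            .shuffle(1000)
            .batch(32))

# Because the stream never ends, the number of batches per epoch must be given.
model.fit(ds_train, steps_per_epoch=188, epochs=10)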

@ahmedanis03

ahmedanis03 commented Jul 8, 2019

I think this bug still exists when used in multi-worker distributed mode.
I launched workers using Kubernetes; the first epoch ran correctly, then I got the error message below. If I use steps_per_epoch and repeat(), everything works fine.

TF version: 2.0.0-beta1

2019-07-07 06:48:48.821000: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:111] Filling up shuffle buffer (this may take a while): 28884 of 100000000
2019-07-07 06:48:49.144075: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:162] Shuffle buffer filled.
2019-07-07 06:48:49.161736: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
         [[{{node IteratorGetNext}}]]
2019-07-07 06:48:49.161820: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
         [[{{node IteratorGetNext}}]]
2019-07-07 06:48:49.162250: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
         [[{{node IteratorGetNext}}]]
2019-07-07 06:48:49.162329: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
         [[{{node IteratorGetNext}}]]
2019-07-07 06:48:49.163848: W tensorflow/core/framework/op_kernel.cc:1546] OP_REQUIRES failed at collective_ops.cc:223 : Out of range: [_Derived_]End of sequence
         [[{{node IteratorGetNext}}]]
2019-07-07 06:48:49.163910: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
         [[{{node IteratorGetNext}}]]
         [[metrics/accuracy/div_no_nan/allreduce_1/CollectiveReduce]]
2019-07-07 06:48:49.164582: W tensorflow/core/framework/op_kernel.cc:1546] OP_REQUIRES failed at collective_ops.cc:223 : Out of range: [_Derived_]End of sequence
         [[{{node IteratorGetNext}}]]
2019-07-07 06:48:49.169979: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
         [[{{node IteratorGetNext}}]]
2019-07-07 06:48:49.170069: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
         [[{{node IteratorGetNext}}]]
2019-07-07 06:48:49.170203: W tensorflow/core/framework/op_kernel.cc:1546] OP_REQUIRES failed at collective_ops.cc:223 : Out of range: [_Derived_]End of sequence
         [[{{node IteratorGetNext}}]]
2019-07-07 06:48:49.179226: E tensorflow/core/common_runtime/ring_alg.cc:279] Aborting RingReduce with Out of range: [_Derived_]End of sequence
         [[{{node IteratorGetNext}}]]
2019-07-07 06:48:49.179355: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: [_Derived_]End of sequence
         [[{{node IteratorGetNext}}]]
2019-07-07 06:48:49.179457: W tensorflow/core/framework/op_kernel.cc:1546] OP_REQUIRES failed at collective_ops.cc:223 : Out of range: [_Derived_]End of sequence
         [[{{node IteratorGetNext}}]]
W0707 06:48:49.193020 140662389356352 training_arrays.py:309] Your dataset ran out of data; interrupting training. Make sure that your dataset can generate at least `steps_per_epoch * epochs` batches (in this case, 1404 batches). You may need to use the repeat() function when building your dataset.
train.py

from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow_datasets as tfds
import tensorflow as tf
import argparse
import json
import os
tfds.disable_progress_bar()


parser = argparse.ArgumentParser(description='Welcome')
parser.add_argument('-workers', type=str, default='dummy')
parser.add_argument('-type', type=str, default='dummy')
parser.add_argument('-index', type=str, default='dummy')

args = parser.parse_args()

os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'worker': args.workers.split(','),
    },
    'task': {'type': args.type, 'index': int(args.index)}
})


strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

BUFFER_SIZE = 10000

# Scaling MNIST data from (0, 255] to (0., 1.]
def scale(image, label):
  image = tf.cast(image, tf.float32)
  image /= 255
  return image, label

datasets, info = tfds.load(name='mnist',
                           with_info=True,
                           as_supervised=True)

train_datasets_unbatched = datasets['train'].map(scale).shuffle(BUFFER_SIZE)
#train_datasets = train_datasets_unbatched.batch(BATCH_SIZE)

def build_and_compile_cnn_model():
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(10, activation='softmax')
  ])
  model.compile(
      loss=tf.keras.losses.sparse_categorical_crossentropy,
      optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
      metrics=['accuracy'])
  return model



NUM_WORKERS = 2
# Here the batch size scales up by number of workers since 
# `tf.data.Dataset.batch` expects the global batch size. Previously we used 64, 
# and now this becomes 128.
GLOBAL_BATCH_SIZE = 64 * NUM_WORKERS
train_datasets = train_datasets_unbatched.batch(GLOBAL_BATCH_SIZE, drop_remainder=True)
with strategy.scope():
  multi_worker_model = build_and_compile_cnn_model()
multi_worker_model.fit(train_datasets, epochs=3)

@benhe2011

Yes, even running the "Multi-worker distributed training with Keras" code example from the official TensorFlow documentation produces this error. How do you get this to work with data loaded from MNIST, for example? This is the code example I was talking about, and it's identical to the one shown by ahmedanis03: https://www.tensorflow.org/beta/tutorials/distribute/multi_worker_with_keras.

@edwardyehuang
Contributor

same issue

@snowkylin

same for me.

@Path-A

Path-A commented Jul 23, 2019

I am having the same issue as @ahmedanis03 and @benhe2011, but with a single-machine, two-GPU setup. I modified the old code from the multi_gpu_model documentation and used the required MirroredStrategy.

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10.0.18362 Build 18362
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: n/a
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.0.0-beta1
  • Python version: 3.6.8
  • Bazel version (if compiling from source): n/a
  • GCC/Compiler version (if compiling from source): n/a
  • CUDA/cuDNN version: 10.0/7.6.1
  • GPU model and memory: 2x MSI GeForce RTX 2080 Ti GAMING X TRIO (no NVlink)
import tensorflow as tf
from tensorflow.keras.applications import Xception
import numpy as np

num_samples = 1000
height = 224
width = 224
num_classes = 1000

strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"], 
                                          cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
with strategy.scope():
    parallel_model = Xception(weights=None,
                              input_shape=(height, width, 3),
                              classes=num_classes)
    parallel_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# Generate dummy data.
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))

parallel_model.summary()
# This `fit` call will be distributed on 2 GPUs.
# Since the batch size is 64, each GPU will process 32 samples.
parallel_model.fit(x, y, epochs=10, batch_size=64)

It runs perfectly until the final epoch. It always ends with:

Epoch 8/10
16/16 [==============================] - 5s 334ms/step - loss: 3590.6503
Epoch 9/10
16/16 [==============================] - 5s 332ms/step - loss: 3597.1092
Epoch 10/10
12/16 [=====================>........] - ETA: 1s - loss: 3603.6067
W0723 14:30:47.582621   232 training_arrays.py:309] Your dataset ran out of data; 
interrupting training. Make sure that your dataset can generate at least `steps_per_epoch * epochs` 
batches (in this case, 160 batches). You may need to use the repeat() function when 
building your dataset.
12/16 [=====================>........] - ETA: 1s - loss: 3603.6067

This is only an issue with MirroredStrategy. When I train on a single GPU, there is no issue.

Edit: I'm also getting this output:

2019-07-23 16:52:04.173982: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext_1}}]]
         [[GroupCrossDeviceControlEdges_0/RMSprop/RMSprop/update_0/Const/_355]]
2019-07-23 16:52:04.183310: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext_1}}]]
         [[Identity_1/_376]]
2019-07-23 16:52:04.189139: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext_1}}]]

TEMPORARY SOLUTION: I converted the numpy arrays to a tf Dataset and used .repeat() while providing the proper number of steps per epoch within fit:

train_dataset = tf.data.Dataset.from_tensor_slices((x, y))

BATCH_SIZE = 64
BUFFER_SIZE = 10000

train_dataset = train_dataset.shuffle(BUFFER_SIZE).repeat().batch(BATCH_SIZE)

# Steps per epoch should cover all samples (round up for a partial final batch).
if num_samples % BATCH_SIZE != 0:
    parallel_steps = num_samples // BATCH_SIZE + 1
else:
    parallel_steps = num_samples // BATCH_SIZE

# This `fit` call will be distributed on 2 GPUs.
# Since the batch size is 64, each GPU will process 32 samples.
parallel_model.fit(train_dataset, epochs=10, steps_per_epoch=parallel_steps)

@ghost

ghost commented Sep 2, 2019

Any news on this topic?
I have the same issue.

@samkitjain

I am having a similar issue. @Path-A's solution works like a charm, thanks! I tried it and can confirm it works.

@kkanellis

kkanellis commented Oct 12, 2019

I had a similar issue, and setting drop_remainder=True in the tf.data.Dataset.batch method worked for me.
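
A sketch of that suggestion (the dataset, sizes, and model are placeholders): dropping the final partial batch keeps the batch count per pass exact, so the iterator doesn't come up one batch short.

ds = dataset.shuffle(10000).batch(64, drop_remainder=True)
model.fit(ds, epochs=10)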

@menatte

menatte commented Dec 9, 2019

Any updates? I experience the same with MultiWorkerMirroredStrategy.

.repeat() plus setting the correct number of steps works as a workaround. Nevertheless, it'd be nicer if one could just use the entire validation set without having to provide that number.

@davidlrobinson

I get the same issue when using tensorflow.keras.preprocessing.image.ImageDataGenerator with model.fit() and specifying a steps_per_epoch greater than the length of the generator.
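
A sketch of that constraint (the directory path and image size are placeholders): with a Keras image generator, steps_per_epoch should stay at or below len(generator), the number of batches it yields per pass.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)
train_gen = datagen.flow_from_directory(
    "data/train", target_size=(224, 224), batch_size=32)

# len(train_gen) == ceil(num_images / batch_size)
model.fit(train_gen, steps_per_epoch=len(train_gen), epochs=10)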

@wang1ang

wang1ang commented Apr 7, 2020

Got the same issue on TF 2.1.0

@jackd
Contributor

jackd commented Apr 26, 2020

This still occurs in 2.2 (tf-nightly) if your epochs have varying lengths. I accept this is a rare occurrence, but e.g. for graph neural networks, computation / memory requirements are generally dependent on the number of nodes, so batches can have dynamic sizes to accommodate this, which can lead to slightly varying numbers of batches per epoch. If epochs beyond the first are even one step shorter than the first epoch, this issue still arises.

@liqinglin54951

liqinglin54951 commented Sep 18, 2020

Solution: put .repeat(epochs) before .batch(batch_size).

https://blog.csdn.net/Linli522362242/article/details/108396485
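
A sketch of that ordering (x, y, epochs, and batch_size are placeholders): calling repeat() before batch() lets a final partial pass be topped up from the next repetition, so no short batch cuts the epoch off early.

ds = (tf.data.Dataset.from_tensor_slices((x, y))
      .shuffle(1000)
      .repeat(epochs)      # repeat first ...
      .batch(batch_size))  # ... then batch

model.fit(ds, epochs=epochs, steps_per_epoch=len(x) // batch_size)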

@sorushsaghari

import numpy as np
from pandas import read_csv

def gen():
    df = read_csv(train_file_path,
                  dtype=data_types, chunksize=2048)
    for d in df:
        y = d['like_gt']
        d.drop(['tweet_id', 'engaging_id', 'reply_gt', 'retweet_gt',
                'quote_gt', 'like_gt'], axis=1, inplace=True)

        yield np.asarray(d), np.asarray(y)

model.fit(gen(), batch_size=64, epochs=5)

What's the problem with this code? It skips all epochs after the first.
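
A sketch that applies the generator-factory pattern discussed earlier in the thread to this snippet (column names, dtypes, and chunk size are taken from the code above): from_generator re-creates the generator on every pass, so fit() can run more than one epoch.

import tensorflow as tf

def make_gen():
    for d in read_csv(train_file_path, dtype=data_types, chunksize=2048):
        y = d['like_gt']
        d = d.drop(['tweet_id', 'engaging_id', 'reply_gt', 'retweet_gt',
                    'quote_gt', 'like_gt'], axis=1)
        yield np.asarray(d, dtype=np.float32), np.asarray(y, dtype=np.float32)

# Each yielded chunk already acts as a batch of up to 2048 rows.
ds = tf.data.Dataset.from_generator(
    make_gen,
    output_types=(tf.float32, tf.float32),
    output_shapes=((None, None), (None,)))

model.fit(ds, epochs=5)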

@eromoe

eromoe commented Nov 24, 2020

tf.data.Dataset.from_generator has a big problem when used with the functional API:
it force-converts a list of tensors into a single tensor.
For example, I have a multi-input model:

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

data = pd.DataFrame(np.random.uniform(size=(1000,3)), columns=['Sales', 'SalesDiff7', 'SalesAggMean7'])

multi_inputs = []
multi_outputs = []
window_size = 1

for i in range(data.shape[1]):
    ti = keras.Input(shape=(window_size, 1), name=f't{i}')
    tlstm = layers.LSTM(32)(ti)
    tp = keras.layers.Dense(units=1)(tlstm)
    multi_inputs.append(ti)
    multi_outputs.append(tp)
    
r = tf.concat(multi_outputs, -1)
c = keras.layers.Flatten()(r)
result = keras.layers.Dense(units=1)(c)

model = keras.Model(
    inputs=multi_inputs,
    outputs=result,
)

Here, the model needs a list of 3 tensors as input.

But Dataset.map returns a single tensor rather than a list:

def split_multi_window(features):
  inputs = features[:, slice(0, 1, None), :]
  inputs = tf.split(inputs, num_or_size_splits=features.shape[-1], axis=len(features.shape)-1)
    
  labels = features[:, slice(1, None, None), slice(0,1, None) ]
  return inputs, labels

data = pd.DataFrame(np.random.uniform(size=(1000,3)), columns=['Sales', 'SalesDiff7', 'SalesAggMean7']).to_numpy()
ds = tf.keras.preprocessing.timeseries_dataset_from_array(
  data=data,
  targets=None,
  sequence_length=1,
  sequence_stride=1,
  shuffle=True,
  batch_size=32,)

ds2  = ds.map(split_multi_window)
ds2
#  <MapDataset shapes: ((3, None, None, 1), (None, None, 1)), types: (tf.float64, tf.float64)>

Trying to reformat the dataset:

ds2 = tf.data.Dataset.from_generator(lambda : ((list(x), y) for x, y in ds2), (list, tf.float32))

raises the error: TypeError: Cannot convert value <class 'list'> to a TensorFlow DType.

@mevol

mevol commented Mar 4, 2021

I think the issue with not having enough data for the last batch still exists. For me it happens when running model.predict(), as my data is passed in through a generator. Here is my problem:

len(X_test) = 567
batch_size = 5
number of steps = 567 / 5 = 113.4

With this calculation I get 113 steps, so the 0.4 remainder, i.e. the last two samples, is ignored. This causes the warning below, which is correct:

113/113 [============================>.] - ETA: 0sWARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least steps_per_epoch * epochs batches (in this case, 113.4 batches). You may need to use the repeat() function when building your dataset.
113/113 [============================>.] - 7s 59ms/step

When I increase the step count to 114 by applying steps=math.ceil(len(X_test) / batch_size),
the new step count is still not honoured and only 113 of 114 steps are executed. See the warning below:

113/114 [============================>.] - ETA: 0sWARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least steps_per_epoch * epochs batches (in this case, 114 batches). You may need to use the repeat() function when building your dataset.
113/114 [============================>.] - 7s 60ms/step

I don't see these problems when running model.fit() with steps_per_epoch and validation_steps set.

I am using Singularity container: tensorflow_2.3.2-gpu-jupyter.sif

@samijaba

I got the same issue training RetinaNet. I resolved it by calling .repeat() on the datasets passed to fit():

train_steps_per_epoch = dataset_info.splits["train[:95%]"].num_examples // batch_size
valid_steps_per_epoch = dataset_info.splits["train[95%:]"].num_examples // batch_size

train_steps = 4 * 100000
epochs = train_steps // train_steps_per_epoch
print(valid_steps_per_epoch)

import random

random.seed(10)
history = model.fit(
    train_dataset.repeat(),
    batch_size=batch_size,
    validation_data=val_dataset.repeat(),
    steps_per_epoch=train_steps_per_epoch,
    validation_steps=valid_steps_per_epoch,
    epochs=epochs,
    callbacks=callbacks_list,
    verbose=1,
)

@zhenghh04

I encountered the same issue. I had a Python generator gen and called

tf.data.Dataset.from_generator(lambda: gen, ....)

which does not support repeat. But if I define the lambda outside the call, as in

gen_lambda = lambda: gen
tf.data.Dataset.from_generator(gen_lambda, ....)

the issue is solved; that dataset supports the repeat method.

@samijaba

The issue was resolved in my case. The problem was that my dataset contained some images with no labels: the number of generated samples was smaller than the size of the set, so the generator tried to generate data that doesn't exist!

@jayagami

jayagami commented Oct 10, 2021

Providing a case that may help somebody.

I created a dataset with image_dataset_from_directory, initialized with batch_size.
Then I passed steps_per_epoch and validation_steps into model.fit, and what I got was early stopping.

The right way is to remove the steps_per_epoch and validation_steps parameters from model.fit, because the number of steps is already determined by the batch_size passed to image_dataset_from_directory (see the sketch below).

Chaos......
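
A sketch of that setup ("data/train", "data/val", and the image size are placeholders): image_dataset_from_directory already yields batches, so fit() infers the number of steps from the dataset itself.

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "data/train", image_size=(224, 224), batch_size=32)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "data/val", image_size=(224, 224), batch_size=32)

# No steps_per_epoch / validation_steps: each epoch runs until the dataset is exhausted.
model.fit(train_ds, validation_data=val_ds, epochs=10)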
