Allow datasets to provide the number of examples they contain #36531

Flamefire · 2020-02-07T10:17:22Z

System information

TensorFlow version (you are using): 2.1.0
Are you willing to contribute it (Yes/No): No

Describe the feature and the current behavior/state.

Currently there is no good way to get the number of samples or batches contained in a dataset, although the information is usually available.

What you can do: sum(1 for _ in dataset), but this might not do what one wants:
When the dataset is batched, it returns the number of batches, including the trailing (possibly partial) one. MultiWorkerMirroredStrategy can't handle that.
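
To make the trailing-batch point concrete, here is a minimal sketch. It assumes TensorFlow 2.x in eager mode; the toy `tf.data.Dataset.range(10)` dataset and the batch size of 3 are illustrative choices, not taken from the report.

```python
import tensorflow as tf

# Toy dataset of 10 elements; it stands in for a real dataset (assumption).
dataset = tf.data.Dataset.range(10)

# Counting by iteration walks the whole dataset once.
num_examples = sum(1 for _ in dataset)

# After batching, the count includes the trailing partial batch (3 + 3 + 3 + 1).
num_batches = sum(1 for _ in dataset.batch(3))

# drop_remainder=True discards the partial batch, so only full batches remain.
num_full_batches = sum(1 for _ in dataset.batch(3, drop_remainder=True))

print(num_examples, num_batches, num_full_batches)
```

Note that this approach reads the entire dataset just to count it, which can be expensive for large inputs.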

Usually this information is already available, see e.g. tensorflow/datasets#1403

Will this change the current api? How?

Add a member num_examples and/or an overload for __len__

Who will benefit with this feature?

Everyone using MultiWorkerMirroredStrategy
Everyone using steps_per_epoch
TF itself: if the number of samples/batches is known before executing the training loop, it can avoid status reports like 10/Unknown
This would help to provide correct behavior in 6be131d#diff-f8dd40712ac721c1b363e1a1ec44c1a3R741-R747
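
As a rough illustration of why a known example count matters for steps_per_epoch, the arithmetic can be sketched as below. The helper name `steps_per_epoch_for` and the numbers are hypothetical and not part of any TensorFlow API; the sketch only shows the calculation a user currently has to do by hand.

```python
def steps_per_epoch_for(num_examples: int, batch_size: int,
                        drop_remainder: bool = False) -> int:
    """Hypothetical helper: number of batches in one pass over the data."""
    if num_examples < 0 or batch_size <= 0:
        raise ValueError("num_examples must be >= 0 and batch_size must be > 0")
    full_batches, leftover = divmod(num_examples, batch_size)
    # A non-empty leftover forms one extra partial batch unless it is dropped.
    if leftover and not drop_remainder:
        return full_batches + 1
    return full_batches


# Example numbers (assumed): 60000 examples, batch size 128.
with_partial = steps_per_epoch_for(60000, 128)
without_partial = steps_per_epoch_for(60000, 128, drop_remainder=True)
print(with_partial, without_partial)
```

If the dataset exposed its size directly, this value could be derived automatically instead of being passed in by the user.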

Any Other info.

There is an experimental op, cardinality, which might be closely related. However, it often (always?) returns "Unknown". Tested with MNIST from TFDS.
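
For reference, a small sketch of how tf.data.experimental.cardinality behaves on simple pipelines. It assumes TensorFlow 2.x in eager mode and does not reproduce the MNIST/TFDS case from the report; filter is used here only as a well-known transformation whose output size cannot be determined statically.

```python
import tensorflow as tf

cardinality = tf.data.experimental.cardinality

# A range dataset has a statically known size.
known = cardinality(tf.data.Dataset.range(10)).numpy()

# filter() depends on the data, so the size is reported as unknown.
unknown = cardinality(
    tf.data.Dataset.range(10).filter(lambda x: x % 2 == 0)).numpy()

# repeat() without a count yields an infinite dataset.
infinite = cardinality(tf.data.Dataset.range(10).repeat()).numpy()

print(known)
print(unknown == tf.data.experimental.UNKNOWN_CARDINALITY)
print(infinite == tf.data.experimental.INFINITE_CARDINALITY)
```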

Conchylicultor · 2020-02-07T18:49:13Z

For more context, TFDS cannot provide the tf.data.Dataset cardinality because it is not supported by TFRecordDataset and (maybe) the interleave op. If there were a way to manually overwrite the cardinality of a tf.data.Dataset, we could forward the number of examples to the tf.data.Dataset.

Related issue: tensorflow/datasets#1456

Conchylicultor · 2020-02-18T18:53:19Z

Thanks to jsimsa, this should be fixed in d25235b with tf.data.experimental.assert_cardinality(123)

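import tensorflow as tf

# The size of a TFRecord file is not known up front, so cardinality is unknown.
ds = tf.data.TFRecordDataset("examples.tfrecord")
tf.data.experimental.cardinality(ds)  # tf.data.experimental.UNKNOWN_CARDINALITY

# Manually assert the known number of examples.
ds = ds.apply(tf.data.experimental.assert_cardinality(42))
tf.data.experimental.cardinality(ds).numpy()  # 42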

I'll update the TFDS side. But this issue can be closed.
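
A self-contained variant of the snippet above that needs no TFRecord file, sketched under some assumptions: it requires a TensorFlow release that ships tf.data.experimental.assert_cardinality (newer than the 2.1.0 from the report), and it uses a filtered range dataset as a stand-in for a source with unknown size. It also shows the asserted value carrying through a later batch() call.

```python
import tensorflow as tf

cardinality = tf.data.experimental.cardinality

# filter() makes the size unknown even though every element passes (stand-in
# for a TFRecordDataset; this substitution is an assumption for illustration).
ds = tf.data.Dataset.range(42).filter(lambda x: x >= 0)
before = cardinality(ds).numpy()

# Attach the known number of examples to the dataset.
ds = ds.apply(tf.data.experimental.assert_cardinality(42))
after = cardinality(ds).numpy()

# The asserted size propagates through batching: 42 examples in batches of 10.
batches = cardinality(ds.batch(10)).numpy()
full_batches = cardinality(ds.batch(10, drop_remainder=True)).numpy()

print(before == tf.data.experimental.UNKNOWN_CARDINALITY, after)
print(batches, full_batches)
```

The asserted value is a promise by the caller; it should match the real number of elements in the dataset.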

tensorflow-bot bot assigned ravikyram Feb 7, 2020

Flamefire mentioned this issue Feb 7, 2020

tf.data.Dataset unusable with steps_per_epoch standard training loop #36539

Open

Conchylicultor mentioned this issue Feb 7, 2020

Expose the number of examples directly in the tf.data object once supported by TF tensorflow/datasets#1456

Closed

ravikyram added comp:data tf.data related issues TF 2.1 for tracking issues in 2.1 release type:feature Feature requests labels Feb 10, 2020

ravikyram assigned jvishnuvardhan and unassigned ravikyram Feb 10, 2020

jvishnuvardhan assigned jsimsa and unassigned jvishnuvardhan Feb 10, 2020

jvishnuvardhan added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Feb 10, 2020

jsimsa closed this as completed Feb 18, 2020

bhack mentioned this issue May 19, 2021

infinite dataset while it is actually finite from tf.data.experimental.choose_from_datasets #49276

Closed
