
S3 roll-out #737

Closed
pierrot0 opened this issue Jul 4, 2019 · 2 comments
Assignees
Labels
enhancement New feature or request

Comments

pierrot0 commented Jul 4, 2019

This issue tracks the roll-out of S3, the new sharding/shuffling/slicing mechanism.

Currently, the hash function used is siphash, via the csiphash library, which causes tests to crash on kokoro with py3 and tf13. We don't understand why yet, so we are considering switching to md5: slower, though still faster than a pure-Python siphash implementation, and guaranteed to work everywhere.
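The key property needed from the hash, whichever algorithm is chosen, is that it be deterministic across processes, platforms, and Python versions (unlike the built-in `hash()`). A minimal sketch of md5-based key-to-bucket assignment, with hypothetical helper names not taken from the tfds codebase:

```python
import hashlib


def md5_hash_key(key: str) -> int:
    """Hash a key to an integer deterministically (illustrative sketch).

    md5 from hashlib yields identical results everywhere, which is what
    sharding requires; Python's hash() is salted per process.
    """
    # Take the first 8 bytes of the digest as a big-endian integer.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


def bucket_for_key(key: str, num_buckets: int) -> int:
    """Assign a key to one of num_buckets buckets."""
    return md5_hash_key(key) % num_buckets
```

The same key always maps to the same bucket, so a dataset regenerated on a different machine produces identical shard contents.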

@pierrot0 pierrot0 added the enhancement New feature or request label Jul 4, 2019
@pierrot0 pierrot0 self-assigned this Jul 4, 2019
tfds-copybara pushed a commit that referenced this issue Jul 4, 2019
pierrot0 (Collaborator, Author) commented:
Recep, Chanchal, if you have time, please do not hesitate to make new versions with S3 support for the DatasetBuilder classes that don't have it already. It should be doable for all builders, and for the Beam ones once #677 is fixed.

tfds-copybara pushed a commit that referenced this issue Nov 20, 2019
Create a new version of all Beam datasets, using the S3-compatible tfrecords writer.
Default versions of Beam datasets are not changed, but can no longer be generated at head.

Builders are made picklable, which makes it easier to write Beam operations.

The mechanism for ordering examples is the same as in the non-Beam pipeline:
 1- Hash the keys,
 2- Distribute examples into buckets based on hash(key),
 3- Sort each bucket,
 4- Write the final tfrecord shards.

The main difference is that we don't handle the files storing the buckets ourselves; Beam does.

As in the non-Beam pipeline, not knowing the total size of a split in advance makes it hard to pick the right number of buckets. We went with 100K buckets, which should be fine for splits up to 1PB.

PiperOrigin-RevId: 274156411
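The four-step ordering mechanism in the commit message above can be sketched in plain Python (no Beam). This is an illustration under stated assumptions, not the tfds implementation: the helper names are hypothetical, the bucket count is tiny for readability (the real pipeline reportedly used 100K), and in the actual Beam version the bucket files are managed by Beam rather than in memory.

```python
import hashlib

NUM_BUCKETS = 8  # tiny for illustration; the commit message mentions 100K


def key_hash(key: str) -> int:
    # 1- Hash the key: md5 of the key taken as a big integer,
    # giving a stable, platform-independent ordering.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def order_examples(examples):
    """Return example payloads in a deterministic, shuffled-looking order.

    examples: iterable of (key, payload) pairs.
    """
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key, payload in examples:
        h = key_hash(key)
        # 2- Distribute examples into buckets based on hash(key).
        buckets[h % NUM_BUCKETS].append((h, payload))
    ordered = []
    for bucket in buckets:
        # 3- Sort each bucket by hash value.
        bucket.sort(key=lambda item: item[0])
        ordered.extend(payload for _, payload in bucket)
    # 4- In the real pipeline, `ordered` would now be written out
    # as the final tfrecord shards.
    return ordered
```

Because the output order depends only on the key hashes, regenerating a split yields byte-identical shards regardless of the order in which examples were produced.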
Conchylicultor (Member) commented:
Closing this now that S3 has been rolled out.
#1519 tracks cleaning up the legacy code.
