
S3 roll-out #737

Closed
pierrot0 opened this issue Jul 4, 2019 · 2 comments
Assignees
Labels
enhancement New feature or request

Comments

pierrot0 commented Jul 4, 2019

This issue tracks the roll-out of S3, the new sharding/shuffling/slicing mechanism.

Currently, the hash function used is siphash, via the csiphash library, which causes tests to crash on kokoro with py3 and tf13. We don't understand why yet, so we are considering switching to md5: slower, though still faster than a pure-Python siphash implementation, and guaranteed to work everywhere.
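The key property needed from the hash, whichever algorithm is chosen, is that it be deterministic across processes, platforms, and Python versions (unlike the built-in `hash()`). A minimal sketch of md5-based key-to-bucket assignment, with hypothetical helper names not taken from the tfds codebase:

```python
import hashlib


def md5_hash_key(key: str) -> int:
    """Hash a key to an integer deterministically (illustrative sketch).

    md5 from hashlib yields identical results everywhere, which is what
    sharding requires; Python's hash() is salted per process.
    """
    # Take the first 8 bytes of the digest as a big-endian integer.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


def bucket_for_key(key: str, num_buckets: int) -> int:
    """Assign a key to one of num_buckets buckets."""
    return md5_hash_key(key) % num_buckets
```

The same key always maps to the same bucket, so a dataset regenerated on a different machine produces identical shard contents.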

@pierrot0 pierrot0 added the enhancement New feature or request label Jul 4, 2019
@pierrot0 pierrot0 self-assigned this Jul 4, 2019
tfds-copybara pushed a commit that referenced this issue Jul 4, 2019
pierrot0 (Collaborator, Author) commented:
Recep, Chanchal, if you have time, please do not hesitate to make new versions with S3 support for the DatasetBuilder classes that don't have it already. It should be doable for all builders, and for the Beam ones once #677 is fixed.

tfds-copybara pushed a commit that referenced this issue Nov 20, 2019
Create a new version of all Beam datasets, using the S3-compatible tfrecords writer.
Default versions of Beam datasets are not changed, but can no longer be generated at head.

Builders are made picklable, which makes it easier to write Beam operations.

The mechanism for ordering examples is the same as in the non-Beam pipeline:
 1- Hash the keys,
 2- Distribute examples into buckets based on hash(key),
 3- Sort each bucket,
 4- Write the final tfrecord shards.

The main difference is that we don't handle the files storing the buckets ourselves; Beam does.

As in the non-Beam pipeline, not knowing the total size of a split in advance makes it hard to pick the right number of buckets. We went with 100K buckets, which should be fine for splits up to 1PB.

PiperOrigin-RevId: 274156411
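The four-step ordering mechanism in the commit message above can be sketched in plain Python (no Beam). This is an illustration under stated assumptions, not the tfds implementation: the helper names are hypothetical, the bucket count is tiny for readability (the real pipeline reportedly used 100K), and in the actual Beam version the bucket files are managed by Beam rather than in memory.

```python
import hashlib

NUM_BUCKETS = 8  # tiny for illustration; the commit message mentions 100K


def key_hash(key: str) -> int:
    # 1- Hash the key: md5 of the key taken as a big integer,
    # giving a stable, platform-independent ordering.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def order_examples(examples):
    """Return example payloads in a deterministic, shuffled-looking order.

    examples: iterable of (key, payload) pairs.
    """
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key, payload in examples:
        h = key_hash(key)
        # 2- Distribute examples into buckets based on hash(key).
        buckets[h % NUM_BUCKETS].append((h, payload))
    ordered = []
    for bucket in buckets:
        # 3- Sort each bucket by hash value.
        bucket.sort(key=lambda item: item[0])
        ordered.extend(payload for _, payload in bucket)
    # 4- In the real pipeline, `ordered` would now be written out
    # as the final tfrecord shards.
    return ordered
```

Because the output order depends only on the key hashes, regenerating a split yields byte-identical shards regardless of the order in which examples were produced.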
Conchylicultor (Member) commented:
Closing this now that S3 has been rolled out.
#1519 tracks cleaning up the legacy code.
