S3 roll-out #737
Labels: enhancement (New feature or request)

Comments
tfds-copybara pushed a commit that referenced this issue on Jul 4, 2019:

…sh. (Issue #737) PiperOrigin-RevId: 256531213
Recep, Chanchal, if you have time, please don't hesitate to create new versions with S3 support for the DatasetBuilder classes that don't have it already. It should be doable for all builders, and for Beam ones once #677 is fixed.
tfds-copybara pushed a commit that referenced this issue on Nov 20, 2019:

Create a new version of all Beam datasets, using the S3-compatible tfrecords writer. Default versions of Beam datasets are not changed, but the datasets can no longer be generated at head. Builders are made picklable, which makes it easier to write Beam operations. The mechanism for ordering examples is the same as in the non-Beam pipeline:
1. Hash the keys.
2. Distribute examples into buckets based on hash(key).
3. Sort each bucket.
4. Write the final tfrecord shards.
The main difference is that we don't handle the files storing the buckets ourselves; Beam does. As in the non-Beam pipeline, not knowing the total size of a split in advance makes it hard to pick the right number of buckets. Went with 100K buckets, which should be fine for splits up to 1 PB. PiperOrigin-RevId: 274156411
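The hash-bucket-sort-shard mechanism described in that commit message can be sketched in plain Python. This is only an illustration of the ordering scheme, not the actual TFDS/Beam implementation; `key_hash`, `shard_examples`, and the use of md5 as the hash are assumptions for the sketch.

```python
import hashlib
from collections import defaultdict

NUM_BUCKETS = 100_000  # from the commit message: fine for splits up to ~1 PB


def key_hash(key: str) -> int:
    # Hypothetical stand-in for the real hasher (siphash/md5 per the issue);
    # md5 is deterministic across processes, unlike Python's builtin hash().
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


def shard_examples(examples, num_shards):
    """Deterministically order (key, example) pairs, then cut into shards."""
    # Steps 1 and 2: hash each key and distribute examples into buckets.
    buckets = defaultdict(list)
    for key, ex in examples:
        h = key_hash(key)
        buckets[h % NUM_BUCKETS].append((h, ex))
    # Step 3: sort within each bucket, concatenating buckets in order.
    ordered = []
    for b in sorted(buckets):
        ordered.extend(ex for _, ex in sorted(buckets[b], key=lambda p: p[0]))
    # Step 4: write the final shards (here: just split the ordered list).
    n = len(ordered)
    return [ordered[i * n // num_shards:(i + 1) * n // num_shards]
            for i in range(num_shards)]
```

Because the ordering depends only on the hashed keys, the same set of examples yields the same shards regardless of the order in which the pipeline produced them, which is what makes the scheme safe to run under Beam where buckets are materialized by the runner rather than by TFDS itself.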
tfds-copybara pushed a commit that referenced this issue on Dec 17, 2019, with the same commit message as the Nov 20 commit above. PiperOrigin-RevId: 285730454
Closing this now that S3 has been rolled out.
This is about rolling out S3 (the new sharding/shuffling/slicing mechanism).
Currently, the hash function being used is siphash, via the csiphash library, which causes tests to crash on kokoro with py3 and tf13. We don't understand why at this point, and are considering switching to md5 (slower, but still faster than a pure-Python implementation of siphash, and at least guaranteed to work everywhere).
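The md5 option trades speed for portability: `hashlib.md5` is in the standard library, is identical on every platform and Python version, and (unlike the builtin `hash()`) is not randomized per process, so it yields a stable ordering key. A minimal sketch of how a deterministic integer key could be derived this way (`md5_key` and `bucket_of` are illustrative names, not TFDS API):

```python
import hashlib


def md5_key(key: str) -> int:
    """Deterministic 64-bit integer hash, stable across platforms and runs."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def bucket_of(key: str, num_buckets: int = 100_000) -> int:
    # Assign an example to a bucket from its hashed key.
    return md5_key(key) % num_buckets
```

Any cryptographic or strong non-cryptographic hash with the same determinism guarantee would work; md5's appeal here is simply that it ships everywhere and avoids the C-extension build issues seen with csiphash.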