[Data] Allow automatic handling of string features as byte features during TFRecord serialization #37995
…erialization Signed-off-by: EdwardCuiPeacock <Edward.Cui@nbcuni.com>
Thanks for the contribution @EdwardCuiPeacock! Do you think you can add a small test for this in …? Also cc @scottjlee for review.
Hi, @amogkam. I added the new unit tests. To install the test_requirement.txt though, I had to change …
Thanks for adding this feature!
Thanks for the contribution @EdwardCuiPeacock! It's been merged.
…uring TFRecord serialization (ray-project#37995)
I am attempting to use ray.data.write_tfrecords to convert my features into TFRecords. Since some of my features are string features, the following error is thrown:
ValueError: Value is of type string, which we cannot convert to a supported tf.train.Feature storage type (bytes, float, or int).
Conventionally, string features are serialized as bytes features automatically, as is typically done in TensorFlow and TensorFlow Extended (TFX); see for example: https://github.com/tensorflow/tfx/blob/fd070288ffa5dcb28204810d2bed261c7e1df201/tfx/components/example_gen/csv_example_gen/executor.py#L67. Making the TFRecord writer automatically treat strings as bytes would serialize string features correctly and avoid the hassle of writing additional map functions to convert string features into bytes.
When reading TFRecords, however, we should still keep bytes features as bytes, because the program does not necessarily know that a bytes object represents a string. For instance, in an example provided by TensorFlow, tensors are serialized into bytes using tf.io.serialize_tensor and stored in TFRecords.
Signed-off-by: EdwardCuiPeacock <Edward.Cui@nbcuni.com>
[Sorry, previous PR was kind of messed up...]
Why are these changes needed?
I am attempting to use ray.data.write_tfrecords to convert my features into TFRecords. Since some of my features are string features, the following error is thrown:
ValueError: Value is of type string, which we cannot convert to a supported tf.train.Feature storage type (bytes, float, or int).
Conventionally, string features are serialized as bytes features automatically, as is typically done in TensorFlow and TensorFlow Extended (TFX); see for example: https://github.com/tensorflow/tfx/blob/fd070288ffa5dcb28204810d2bed261c7e1df201/tfx/components/example_gen/csv_example_gen/executor.py#L67. Making the TFRecord writer automatically treat strings as bytes would serialize string features correctly and avoid the hassle of writing additional map functions to convert string features into bytes.
When reading TFRecords, however, we should still keep bytes features as bytes, because the program does not necessarily know that a bytes object represents a string. For instance, in an example provided by TensorFlow, tensors are serialized into bytes using tf.io.serialize_tensor and stored in TFRecords.
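To make the proposed behavior concrete, here is a minimal pure-Python sketch of the write-side rule described above. It is not the code in this PR: the helper name to_feature_storage is hypothetical, and the returned labels merely mirror the bytes_list, float_list, and int64_list fields of tf.train.Feature. It also assumes UTF-8 as the string encoding.

```python
from typing import List, Sequence, Tuple, Union

Scalar = Union[bytes, str, float, int]


def to_feature_storage(values: Sequence[Scalar]) -> Tuple[str, List[Scalar]]:
    """Choose a tf.train.Feature-style storage kind for a list of values.

    Hypothetical helper for illustration only; not Ray's implementation.
    Strings are encoded to UTF-8 bytes (an assumption) instead of raising.
    """
    # Strings are stored as bytes. Note: an empty list also takes this
    # branch because all() of an empty sequence is True.
    if all(isinstance(v, str) for v in values):
        return "bytes_list", [v.encode("utf-8") for v in values]
    # Bytes are stored unchanged.
    if all(isinstance(v, bytes) for v in values):
        return "bytes_list", list(values)
    if all(isinstance(v, int) for v in values):
        return "int64_list", [int(v) for v in values]
    if all(isinstance(v, float) for v in values):
        return "float_list", list(values)
    raise ValueError(
        "Values cannot be converted to a supported storage type "
        "(bytes, float, or int)."
    )


# Write side: strings become bytes.
kind, stored = to_feature_storage(["cat", "héllo"])

# Read side: the stored values stay bytes. Only a caller who knows the
# feature held text should decode it explicitly.
decoded = [b.decode("utf-8") for b in stored]
```

The read side deliberately leaves decoding to the caller: a bytes feature might just as well hold a serialized tensor, which would not be valid text.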
Related issue number
Checks
I've signed off every commit (git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.