Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: String Tensor Unification #91

Merged
merged 2 commits into from May 6, 2019
Merged

Conversation

gharibian
Copy link
Member

@gharibian gharibian commented Apr 12, 2019

Feedback period will be open until 2019-04-26

ABI Stable Unified String Tensors

Status Proposed
Author(s) Dero Gharibian (dero@google.com)
Sponsor Gunhan Gulsoy (gunan@google.com)
Updated 2019-04-11

Objective

To unify and define the byte interface of a string tensor across TensorFlow’s C API (TF_STRING), TF Lite (kTfLiteString), and TF-Core/C++ (DT_STRING) with the purpose of enabling modular TensorFlow and mitigating the performance overhead of string tensor conversions.

@ewilderj ewilderj added this to Needs attention in RFC management via automation Apr 12, 2019
@ewilderj ewilderj added the RFC: Proposed RFC Design Document label Apr 12, 2019
@ewilderj ewilderj moved this from Needs attention to Open reviews in RFC management Apr 12, 2019
For TFLite, we propose a minor change to the layout of a kTfLiteString; the
prototypes for string creation and string accessors in `strings_util.h` and
`strings.h` do not need to change. Existing exported TFLite string tensors will
need to be updated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, we will have to remain backward compatible with the old layout. That is, we'll have to continue supporting execution of models exported with the old layout, as well as the new layout, and be able to distinguish between the two.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the doc to address this.

Since TFLite provides generators and accessors for TFLite string tensors, the
requisite changes needed to have TFLite conform to the `OffsetType` defined above
is on the order of a ~20 line CL. The more difficult task is to find all uses
of string tensors (internally, and potentially) in flat files, and coordinate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Echoing the above point, we cannot rely on updating all models. We'll have to decide on an approach which gracefully handles the existing layout for models that exist in the wild.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I included your solution of adding a new TFLite enum type for tstring, which should gracefully handle existing existing layouts.

@sjamesr
Copy link
Contributor

sjamesr commented Apr 18, 2019

Here are the notes from the design review session:

  • Is the pre-alloc type superfluous? Or can we mark that type as reserved?
    • @mrry: If someone else allocated the memory for us, it would be useful to have pre-alloc. Maybe pre-alloc isn't the best way to do this, however. Possibly not crucial.
  • @alextp: Do you still want to replace all occurrences of string within tensorflow with tstring?
    • DT_STRING tensors use tensorflow::tstring
    • Not all occurrences of string become tensorflow::tstring. The typedef std::string string is not touched.
  • @m3bm3b: suppose you omit a variant, and you discover you need it, how hard is it to add a new?
    • 4 types are allowed, another bit will have to be sacrificed if you want to add a 5th type
    • Backwards compatibility with TF Lite becomes an issue, this is probably more difficult than actually adding the type, need to work with TF Lite to maintain backwards compatibility with serialized tensors.
  • @alextp: wouldn't things be much simpler if lvalue assignment was not supported?
    • Lots of downstream kernels use lvalue assignment, to support those kernels we need lvalue assignment. Doesn't add too much complexity, string already supports it.
    • @alextp: lvalue assignments might cause you to have to convert away from your most efficient representation
    • @edloper: Is there an issue for lvalue assignment for pre-alloc type? No, we just decay into the large string type.
  • @alextp : is this design document externalized? Yes.
  • Can we have a string_view-style implementation? This is tough because tstring assumes that it has allocated the underlying memory. Converting between different tstring types would be awkward.

Apologies if I missed anything.

@ewilderj
Copy link
Contributor

@gharibian please update the RFC with any changes coming out of the review meeting, and update the Status to "Accepted". Thanks!

[Java](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/java/src/main/native/tensor_jni.cc?#L270),
[golang](https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/go/tensor.go?#L395),
etc, are expected to
modify a raw buffer in order to construct and pass a string tensor. We propose
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if it's too late to comment on this I just want to confirm one thing: when we talk about "language bindings expected to modify a raw buffer", we are talking about modifying the buffer directly from the Java code (for example) and not from its bridge to the C API (JNI in this case), I'm I right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite; this is specifically about the C API, and the fact that accessors/mutators do not exist in the C-API for C string tensor construction.

In the Java case, string tensors are created in the bridge JNI code by directly manipulating a buffer (though nothing precludes one from creating it in Java and passing it through via JNI). With the purposed update, the string tensors will still be created in the JNI bridge code, but instead of directly manipulating a buffer as is currently done [1], the JNI bridge will use new methods in the TF C API for string tensor construction.

[1] https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/java/src/main/native/tensor_jni.cc?#L246

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I think it could still be useful though to have a string layout simple enough to let the Java client (and other language bindings) to write them directly into the tensor buffer, using some custom writers. We might achieve better performances by doing this and that is what we will probably do for all other datatypes.

Similarly, the language bindings could read a string value directly from the tensor buffer, useful in eager execution environment. What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the creation of string tensors, of the four string sub-types proposed, the kOffsetType fits the bill of being a "simple" layout which allows for direct writing to a buffer. Creating an kOffsetType string tensor would mitigate the current expensive conversion of TF_STRING -> DT_STRING. The layout is similar to the current C-String/TF_STRING layout (i.e. single block of contiguous memory), but has endianness defined and is compatible with eigen.

Things get a bit complicated for reading and modifying existing string tensors---since in those situations, the string could be any of the four types, and memory ownership comes into play with kLargeType.

Ideally, in order to future proof potential changes in byte layout, it would still be beneficial to utilize the new proposed functions in the C-API when reading/modifying/constructing string tensors. We will provide functions for creating string tensors from lists of strings in order to reduce function call overhead.

Alternatively, since we no longer will be using std::string as a string container, and the layout of strings will be well-defined and reside in TF, we could have duplicate implementations in java for better performance. For full duplication, kLargeTypes may need an additional field for a deallocator.

In short, this design does not preclude your suggestions of direct access to DT_STRING buffers, and, in fact, enables it (since the ABI of std::string is not defined). We can assess the potential performance gains vs additional complexity in having a duplicate Java implementation once the DT_STRING scalar is mapped to tensorflow::tstring instead of std::string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, thanks for the detailed answer

Updated TFLite sections to address feedback from @jdduke.  Marked as
Accepted.
@gharibian
Copy link
Member Author

@gharibian please update the RFC with any changes coming out of the review meeting, and update the Status to "Accepted". Thanks!

@ewilderj Updates are merged-in, and the Status has been set to "Accepted". Ready for review/merge.

@ewilderj ewilderj moved this from Open reviews to Awaiting Committee Notes in RFC management May 6, 2019
@ewilderj ewilderj merged commit 391360e into tensorflow:master May 6, 2019
RFC management automation moved this from Awaiting Committee Notes to Accepted RFCs May 6, 2019
@ewilderj ewilderj added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels May 6, 2019
karllessard added a commit to karllessard/tensorflow-community that referenced this pull request May 10, 2019
* Adding a doc to deprecate collections

* Responding to Karmels comments

* Minor fix to VariableTracker sample code

* RFC for random numbers in TensorFlow 2.0

* Changes after some feedback

* Removed 'global_seed' in the main code and showed the design with 'global_seed' in the Questions section.

* Some changes after feedback

* A tweak

* Change after feedback

* A tweak

* changes

* changes

* fix link

* new-rfc

* changes

* Update rfcs/20181225-tf-backend.md

Co-Authored-By: alextp <apassos@google.com>

* Added some considerations about tf.function

* Renamed the internal name "op_generator" to "global_generator"

* Changed seed size from 256 to 1024 bits

* Initial signpost for community meetings

Adding this so there is basic information about how to find the community calendar and get invited to meetings.

* Add iCal link too

* changes

* Initial version of embedding and partitioned variable RFC.

* Fix one formatting issue.

* Fix another formatting issue.

* Use markdown language for the table instead of HTML.

* Add tensorflow/io R Package CRAN release instructions (tensorflow#53)

* Added Design Review Notes

* Make clear distinction between embedding variables and loadbalancing
variables.

* Added decisions below each question, and "how to use generators with distribution strategies".

* Adopted Dong Lin's suggestions

* Add a paragraph pointing out the problem with the `partition_strategy` argument.

* RFC: Move from tf.contrib to addons (tensorflow#37)

* Checkpoint addons RFC for review

* Add code review to RFC

Add future pull request information to criteria

Update modified date

added some description

RFC Move to addons

* Add weight decay optimizers

* Remove conv2d_in_plane

* Add group_norm

* Accept addons RFC

* Update alternatives since `DynamicPartition` and `DynamicStitch` do have GPU kernels.

* Add a section for saving and restore `PartitionedVariable`.

* Mention that variable types can be nested, attention needs to be paid to their saving and restoring mechanism.

* Create README.md (tensorflow#57)

* Splitted `_state_var` into `_state_var` and `_alg_var` (because of concerns from implementation), and changed status to "Accepted"

* Updated timestamp

* Moved the auto-selection of algorithm from `create_rng_state` to `Generator.__init__`

* Update according to the discussion

* Move performance heuristics in Distribution Strategy level. We will not expose knobs for users to control;
* Emphasize that embedding support in v2 will all be via `Embedding` layer. Users can use `tf.compat.v1` to handle embedding by themselves;
* Mention that default `partition_strategy` in v1 `embedding_lookup` is "mod", which will possibly break users's model when they update to TF 2.0;
* We want to prioritize shuffling embedding after 2.0 release;
* We have plans to serialize and deserialize `Embedding` layer and Distribution Strategies to allow loading a saved model to a different number of partitions.

* Update relese binary build command for sig-io (tensorflow#58)

This PR updates relese binary build command for sig-io

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Add Bryan to SIG IO release team (tensorflow#59)

* Change to accepted

* Add link to TensorFlow IO R package

* Updated link for the friction log. (tensorflow#64)

* Switch DistStrat revised API examples to TensorFlow 2 style. (tensorflow#63)

* RFC: Attention for Dense Networks on Keras (tensorflow#54)

* Design review for "Attention for Dense Networks"

* RFC: Stateful Containers with tf.Module (tensorflow#56)

* Create 20190117-tf-module.md

* Update 20190117-tf-module.md

* Loosen return type for variable properties.

* Use Dense consistently.

Thanks brilee@ for spotting!

* Remove convert_to_tensor from examples.

This wasn't ever required and including it might cause confusion.

h/t pluskid@ gehring@ and awav@

* Remove owned_* methods.

* Document `_flatten`

See tensorflow/tensorflow@5076adf6 for more context.

* Fix typo in module name.

Thanks k-w-w@!

* Update 20190117-tf-module.md

* RFC: New tf.print (tensorflow#14)

* New tf.print proposal

* Attempt to fix table of contents

* Removed not-working TOC label

* Minor updates to the doc.

* Update tf.print to be accepted

* Added design review notes

* Marking doc as accepted

* Update cond_v2 design doc (tensorflow#70)

* Update to bring in line with implementation

* Added the symbol map to the RFC.

* Updated testing section of the Community site.

* Removed the 100%, formatting tweaks.

* Update CHARTER.md

* Change contact email address

I will leave my current company soon, so update my email.

* Create README.md

* Logos for SIGs

* Update README.md

* Update addons owners (tensorflow#85)

Add Yan Facai as another project lead.

* Created a FAQ for TF 2.0. (tensorflow#78)

Adding 2.0 related FAQ to the Testing group.

* Request and charter for SIG JVM (tensorflow#86)

Chartering docs for SIG JVM

* Update CODEOWNERS

Add @karllessard, @sjamesr and @tzolov as code owners for sigs/jvm.

* Update CODEOWNERS

Add missing /

* Update CODEOWNERS

Add @dynamicwebpaige as owner for sigs/testing/

* Update RFC with current information (tensorflow#89)

Make current to SIG Addons

* RFC: TF on Demand Project (tensorflow#69)

* Adding an RFC for TF on Demand Project.

* modified one line in tf-on-demand md file.

* Changing RFC status from PROPOSED to ACCEPTED.

* RFC: SavedModel Save/Load in 2.x (tensorflow#34)

* RFC for SavedModel Save/Load in 2.x

* Minor edits and a discussion topic for load() with multiple MetaGraphs

* Tweak to the "Imported representations of signatures" section

* Update "Importing existing SavedModels" with the .signatures change

* Update RFC and add review notes

* Status -> accepted

* Update CHARTER.md

New leads.

* Update 20180920-unify-rnn-interface.md (tensorflow#81)

Typo fix.

* Update yyyymmdd-rfc-template.md

Adding "user benefit" section into the RFC template, to encourage articulating the benefit to users in a clear way.

* Update while_v2 design doc (tensorflow#71)

* Update while_v2 design doc, include link to implementation

* Update TF 2.0 FAQ to link to TensorBoard TF 2.0 tutorial (tensorflow#94)

* CLN: update sig addons logo png (tensorflow#99)

* Add SIG Keras

Add a reference link to Keras' governance repository for SIG Keras.

* RFC: String Tensor Unification (tensorflow#91)

* RFC: String Tensor Unification

* Updated rfcs/20190411-string-unification.md

Updated TFLite sections to address feedback from @jdduke.  Marked as
Accepted.

* Start RFC for tensor buffers
tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Aug 1, 2019
tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Aug 2, 2019
tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Aug 3, 2019
tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Aug 7, 2019
This is a part of a larger migration effort for tensorflow::tstring.
See: tensorflow/community#91
PiperOrigin-RevId: 262172788
tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Aug 10, 2019
This is a part of a larger migration effort for tensorflow::tstring.
See: tensorflow/community#91
PiperOrigin-RevId: 262737181
tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Aug 10, 2019
Updated char* tstring::data() to conform to C++11.

This is a part of a larger migration effort for tensorflow::tstring.
See: tensorflow/community#91
PiperOrigin-RevId: 262752692
tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Aug 12, 2019
This is a part of a larger migration effort for tensorflow::tstring.
See: tensorflow/community#91
PiperOrigin-RevId: 262850088
tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Aug 12, 2019
This is a part of a larger migration effort for tensorflow::tstring.
See: tensorflow/community#91
PiperOrigin-RevId: 262918772
tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Aug 12, 2019
This is a part of a larger migration effort for tensorflow::tstring.
See: tensorflow/community#91
PiperOrigin-RevId: 262918920
tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Aug 26, 2019
This is a part of a larger migration effort for tensorflow::tstring.
See: tensorflow/community#91
PiperOrigin-RevId: 265475150
tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Aug 26, 2019
This is a part of a larger migration effort for tensorflow::tstring.
See: tensorflow/community#91
PiperOrigin-RevId: 265535878
tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Aug 27, 2019
This is a part of a larger migration effort for tensorflow::tstring.
See: tensorflow/community#91
PiperOrigin-RevId: 265690695
tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Aug 28, 2019
This is a part of a larger migration effort for tensorflow::tstring.
See: tensorflow/community#91
PiperOrigin-RevId: 265811129
tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Aug 28, 2019
This is a part of a larger migration effort for tensorflow::tstring.
See: tensorflow/community#91
PiperOrigin-RevId: 265822025
QrczakMK added a commit to google/riegeli that referenced this pull request Oct 28, 2019
tf-text-github-robot pushed a commit to tensorflow/text that referenced this pull request Nov 18, 2019
Note that during the transition period tstring is typedef'ed to std::string.

See: tensorflow/community#91
PiperOrigin-RevId: 281154990
gharibian added a commit to gharibian/models that referenced this pull request Nov 18, 2019
Note that during the transition period tstring is typedef'ed to
std::string.

See: tensorflow/community#91
gharibian added a commit to gharibian/addons that referenced this pull request Nov 18, 2019
Note that during the transition period tstring is typedef'ed to
std::string.

See: tensorflow/community#91
seanpmorgan pushed a commit to tensorflow/addons that referenced this pull request Nov 19, 2019
Note that during the transition period tstring is typedef'ed to
std::string.

See: tensorflow/community#91
tf-text-github-robot pushed a commit to tensorflow/text that referenced this pull request Nov 19, 2019
Note that during the transition period tstring is typedef'ed to std::string.

See: tensorflow/community#91
PiperOrigin-RevId: 281154990
tf-text-github-robot pushed a commit to tensorflow/text that referenced this pull request Nov 19, 2019
Note that during the transition period tstring is typedef'ed to std::string.

See: tensorflow/community#91
PiperOrigin-RevId: 281154990
tf-text-github-robot pushed a commit to tensorflow/text that referenced this pull request Nov 20, 2019
Note that during the transition period tstring is typedef'ed to std::string.

See: tensorflow/community#91
PiperOrigin-RevId: 281154990
tf-text-github-robot pushed a commit to tensorflow/text that referenced this pull request Nov 20, 2019
Note that during the transition period tstring is typedef'ed to std::string.

See: tensorflow/community#91
PiperOrigin-RevId: 281401138
jonycgn pushed a commit to tensorflow/compression that referenced this pull request Nov 20, 2019
Note that during the transition period tstring is typedef'ed to std::string.

See: tensorflow/community#91
PiperOrigin-RevId: 281163678
Change-Id: I20ac40a58b48113c76103011c5061283d434558b
tensorflow-copybara pushed a commit to tensorflow/serving that referenced this pull request Nov 22, 2019
Note that during the transition period tstring is typedef'ed to std::string.

See: tensorflow/community#91
PiperOrigin-RevId: 281985097
tensorflow-copybara pushed a commit to tensorflow/tensor2tensor that referenced this pull request Nov 22, 2019
Note that during the transition period tstring is typedef'ed to std::string.

See: tensorflow/community#91
PiperOrigin-RevId: 282043204
lingvo-bot added a commit to tensorflow/lingvo that referenced this pull request Nov 26, 2019
Note that during the transition period tstring is typedef'ed to std::string.

See: tensorflow/community#91
PiperOrigin-RevId: 282495664
tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Dec 6, 2019
Note that during the transition period tstring is typedef'ed to std::string.

This is a part of a larger migration effort for tensorflow::tstring.
See: tensorflow/community#91
PiperOrigin-RevId: 284138845
Change-Id: Ia478fde1a34804b54aca402e336e60cd75744852
ThomasColthurst pushed a commit to google/nucleus that referenced this pull request Dec 17, 2019
Note that during the transition period tstring is typedef'ed to std::string.

See: tensorflow/community#91
PiperOrigin-RevId: 285003891
tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Jan 23, 2020
See: tensorflow/community#91
PiperOrigin-RevId: 291049893
Change-Id: I79d809bd3e581da27fc87e8b53921fadce3b7a93
lespeholt pushed a commit to google-research/seed_rl that referenced this pull request Jan 29, 2020
Note that during the transition period tstring is typedef'ed to std::string.

See: tensorflow/community#91
PiperOrigin-RevId: 281300599
lespeholt pushed a commit to google-research/seed_rl that referenced this pull request Jan 29, 2020
Note that during the transition period tstring is typedef'ed to std::string.

See: tensorflow/community#91
PiperOrigin-RevId: 284243848
@goldiegadde goldiegadde added this to Done in TensorFlow 2.2.0 Feb 13, 2020
MarkDaoust pushed a commit to tensorflow/examples that referenced this pull request Mar 17, 2020
Note that during the transition period tstring is typedef'ed to
std::string.

See: tensorflow/community#91
pichuan pushed a commit to google/deepvariant that referenced this pull request Mar 26, 2020
Note that during the transition period tstring is typedef'ed to std::string.

See: tensorflow/community#91
PiperOrigin-RevId: 285003891
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes RFC: Accepted RFC Design Document: Accepted by Review
Projects
RFC management
  
Accepted RFCs
Development

Successfully merging this pull request may close these issues.

None yet

6 participants