From b829407f76f28e79e07c8ed17143904e47a45d55 Mon Sep 17 00:00:00 2001
From: Zhuo Peng <1835738+brills@users.noreply.github.com>
Date: Wed, 10 Mar 2021 09:12:57 -0800
Subject: [PATCH] First commit

---
 rfcs/20210305-tfx-struct2tensor.md            | 221 ++++++++++++++++++
 .../data_view_components.png                  | Bin 0 -> 47234 bytes
 .../graph_to_tensor_tfxio.png                 | Bin 0 -> 42860 bytes
 .../tf_example_vs_elwc.png                    | Bin 0 -> 46128 bytes
 4 files changed, 221 insertions(+)
 create mode 100644 rfcs/20210305-tfx-struct2tensor.md
 create mode 100644 rfcs/20210305-tfx-struct2tensor/data_view_components.png
 create mode 100644 rfcs/20210305-tfx-struct2tensor/graph_to_tensor_tfxio.png
 create mode 100644 rfcs/20210305-tfx-struct2tensor/tf_example_vs_elwc.png

diff --git a/rfcs/20210305-tfx-struct2tensor.md b/rfcs/20210305-tfx-struct2tensor.md
new file mode 100644
index 000000000..b6a19368c
--- /dev/null
+++ b/rfcs/20210305-tfx-struct2tensor.md
@@ -0,0 +1,221 @@
+# Support structured data in TFX through `struct2tensor` and `DataView`
+
+Status        | Proposed
+:------------ | :--------------------------------------------------------------
+**Author(s)** | Zhuo Peng (zhuo@google.com)
+**Sponsor**   | Zhitao Li (zhitaoli@google.com)
+**Updated**   | 2021-03-05
+
+## Objective
+
+This RFC proposes several additions to TFX in order to support building ML
+pipelines that process __structurally richer__ data that TFX has no a priori
+knowledge of how to parse. Such knowledge is provided by the user, through
+__`struct2tensor`__ (showcased in this RFC) or other TensorFlow graphs, and
+made available to all TFX components through __Standardized TFX inputs__ and
+__`DataView`s__.
+
+### Background
+
+### `struct2tensor`
+
+[`struct2tensor`](https://github.com/google/struct2tensor) is a library to
+create TF graphs (a `struct2tensor`
+"[expression](https://github.com/google/struct2tensor/blob/master/g3doc/api_docs/python/s2t/Expression.md)")
+that parse serialized Protocol Buffers (protobuf) into a representation that
+preserves the protobuf structure: a bag of TF (composite) Tensors, for example
+`tf.RaggedTensor`s and `tf.SparseTensor`s. It also allows manipulation of such
+structure.
+
+### Standardized TFX inputs
+
+The
+[Standardized TFX inputs RFC](https://github.com/1025KB/community/blob/875c04645f9029cb3c5d75bfdb8bf63e5560e9d9/rfcs/20191017-tfx-standardized-inputs.md)
+introduced a common in-memory data representation to TFX components and an I/O
+abstraction layer that produces the representation. The chosen representation,
+Apache Arrow, is powerful enough to represent protobuf-like structured data, or
+what the `tf.Tensor`, `tf.RaggedTensor`, or `tf.SparseTensor` logically
+represent.
+
+### Goal
+
+*   Propose a `TFXIO` for `struct2tensor`.
+    *   Note that although designed for `struct2tensor`, this `TFXIO` only sees
+        the TF Graph that `struct2tensor` builds, which means it can support
+        other TF Graphs that decode string records into (composite) Tensors.
+
+*   Propose the orchestration support needed by the proposed `TFXIO`.
+
+### Non Goal
+
+*   Address how components / libraries can handle the new Tensor / Arrow types.
+    For example, TF Transform needs to be able to accept `tf.RaggedTensor`s and
+    output `tf.RaggedTensor`s. These need to be addressed separately in each
+    component, perhaps by separate designs, if needed.
+*   Address how TF Serving can allow serving a model that has a (composite)
+    Tensor-based Predict signature, or any other signatures that do not use
+    `struct2tensor` to parse input protobufs.
+    In this doc, it is assumed that the exported serving graph would take a
+    dense 1-D Tensor of dtype `tf.string` whose values are serialized
+    protobufs.
+    -   This problem might be relevant to this design because, in certain use
+        cases, it might be desirable to use a different format in serving than
+        in training (e.g. using protobufs in training while using JSON in
+        serving -- as long as they parse to the same (composite) tensors fed
+        into the model graph).
+
+
+## Motivation
+
+TFX has historically assumed that `tf.Example` is the data payload format and
+it is the only format fully supported by all the components. `tf.Example`
+naturally represents flat data, while certain ML tasks need *structurally
+richer* logical representations. For example, in the list-wise ranking problem,
+one “example” input to the model consists of a list of documents to rank, and
+each document contains some features. [`tensorflow_ranking`](https://github.com/tensorflow/ranking)
+is a library that helps build such ranking models. Supporting
+`tensorflow_ranking` in TFX has been a hot feature request.
+
+
+Figure: (left) flat data represented by `tf.Example`s; (right) typical data
+for ranking problems, where each “example” contains several “candidates”.
+
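The contrast in the figure above can be sketched in plain Python. This is only an illustration under stated assumptions: the feature names (`age`, `country`, `query`, `score`) and the helper `to_values_and_row_splits` are hypothetical and not part of TFX or `struct2tensor`. The helper mimics the values-plus-row-splits idea that `tf.RaggedTensor` uses to keep the boundaries of variable-length lists.

```python
# Hypothetical toy data; all feature names are made up for illustration.

# Flat data: each record is a flat mapping from feature name to values,
# which is the shape of data that a tf.Example naturally represents.
flat_examples = [
    {"age": [31.0], "country": ["US"]},
    {"age": [45.0], "country": ["CA"]},
]

# Ranking data: one "example" holds context features plus a
# variable-length list of candidates, each with its own features.
ranking_examples = [
    {"context": {"query": ["tfx"]},
     "candidates": [{"score": [0.9]}, {"score": [0.4]}, {"score": [0.1]}]},
    {"context": {"query": ["arrow"]},
     "candidates": [{"score": [0.7]}]},
]


def to_values_and_row_splits(nested):
    """Flattens a list of variable-length lists into (values, row_splits).

    Row i of the input equals values[row_splits[i]:row_splits[i + 1]].
    """
    values, row_splits = [], [0]
    for row in nested:
        values.extend(row)
        row_splits.append(len(values))
    return values, row_splits


# One row per ranking example, one entry per candidate in that example.
scores_per_example = [
    [candidate["score"][0] for candidate in example["candidates"]]
    for example in ranking_examples
]
values, row_splits = to_values_and_row_splits(scores_per_example)
print(values)      # [0.9, 0.4, 0.1, 0.7]
print(row_splits)  # [0, 3, 4]
```

The flat records need no boundary information, while the ranking records need `row_splits` to recover which candidates belong to which example. That extra structure is what a flat `tf.Example` does not naturally carry.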

+
+Figure: (left) a `tf.Example`-based pipeline topology; (right) proposed
+topology of a `struct2tensor`-based pipeline.
+