Skip to content
This repository was archived by the owner on Jul 10, 2025. It is now read-only.

Conversation

@1025KB
Copy link
Contributor

@1025KB 1025KB commented Feb 5, 2020

This RFC will be open for comment until Wednesday, Feb 19th, 2020.
cc @karmel

TFX Generic Trainer

Status Proposed
RFC # 201
Author(s) Jiayi Zhao (jyzhao@google.com)
Sponsor Konstantinos Katsiapis (katsiapis@google.com), Zhitao Li (zhitaoli@google.com), Karmel Allison (karmel@google.com)
Updated 2020-01-17

Objective

Support any TensorFlow Training loop in TFX Trainer in addition to tf.estimator, primarily focused on native Keras model.

This doc will focus on native Keras support (without model_to_estimator) in TFX.
We propose changing the user facing API to be more generic so that users can do
(single node) native Keras model training within TFX.

@theadactyl theadactyl added the RFC: Proposed RFC Design Document label Feb 5, 2020
@theadactyl theadactyl changed the title RFC: TFX Generic Trainer RFC: TFX Generic Trainer / Native Keras Support Feb 5, 2020
@theadactyl
Copy link
Contributor

Notes from today's design meeting at 2pm PT:

  • [Jiayi] Single worker distribution strategy supported
  • [Jiayi] Multi-worker distribution strategy is not supported yet
  • [Jiayi] Need to be able to be able to see if worker is chief
  • [Jiayi] In cloud we're able to bring up the cluster to run multi-worker, but we don't have it on prem right now
  • [Jiayi] Check chief for multiworkermirrored strategy? Used to have to check, but wont need to in the future
  • [azacks] For generic training, good to still have is_chief
  • [azacks] Strategy itself should have to say whether is chief or not
  • [Jiayi] TPU strategy or onedevice? Are these also single worker?
  • [goutham] yes
  • [Jiayi] Run function name - to align with internal, we want to align on naming. Any suggestions?
  • [goutham] trainer evaluate function we've used for trainer
  • [goutham] As long as it's properly documented with many code example it should be fine
  • [Jiayi] Trainer_fn and run_fn are both for trainer, alright reuse TrainerFnArgs dict
  • [Jiayi]After we deprecated trainer_fn it will look better for run_fn to use TrainerFnArgs
  • [azacks] when we do save, directory is part of environment. Is this already part of trainer args?
  • [azacks]We should show an example to make sure people are saving to the right location and more consistent api to avoid those mistakes
  • [tf ranking] To publish tfx template, should we use native keras or model estimator?
  • [azacks] we're not quite there yet for fulling supporting templates in native keras
  • [jiayi] Some of estimator functionality isnt supported in native keras, there's a bug tracking that; if your use case is simple (one worker strategy), go with native keras
  • [jiayi] Model estimator doesn't support tpu strategy - oss example is coming, works internally right now

Copy link
Contributor

@theadactyl theadactyl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@theadactyl theadactyl merged commit d066269 into tensorflow:master Feb 14, 2020
@ematejska ematejska added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Feb 28, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

cla: yes RFC: Accepted RFC Design Document: Accepted by Review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants