Rethink the TFJob CRD #209

Closed
DjangoPeng opened this issue Dec 10, 2017 · 13 comments

DjangoPeng (Member) commented Dec 10, 2017

In general, we often launch more than one TensorFlow training job, each with a different combination of hyperparameters, in order to get the best training result. At the same time, we can use a single TensorBoard instance to visualize the TensorFlow events output by all of those training jobs. This is a very common workflow for data scientists and algorithm engineers. But in the current design, the TensorBoard job is bound to a single TensorFlow job.

To make this possible, we need to decouple the TensorFlow job from the TensorBoard job. I think we can make some changes to the TFJob CRD. First, both kinds of job can share TFReplicaSpec. Moreover, we can use TFReplicaSpec to specify a TensorFlow Serving job as well. Finally, we can add a type field to define the TFJob type (one of Training, TensorBoard, or Serving).

A possible spec might look like the following.

Distributed training job

apiVersion: "tensorflow.org/v1alpha1"
kind: "TFJob"
metadata:
  name: "training-job"
spec:
  type: Training
  tfReplicaSpec:
  - replicas: 2
    tfReplicaType: PS
    template:
      spec:
        containers:
        - name: tensorflow
          image: tensorflow/tensorflow:1.4.0
          command:
          - "/workdir/tensorflow/launch_training.sh"
          volumeMounts:
          - name: workdir
            mountPath: /workdir
        volumes:
        - name: workdir
          glusterfs:
            endpoints: <gluster-cluster>
            path: <gluster_vol_subpath>
        restartPolicy: OnFailure
  - replicas: 4
    tfReplicaType: Worker
    template:
      spec:
        containers:
        - name: tensorflow
          image: tensorflow/tensorflow:1.4.0
          command:
          - "/workdir/tensorflow/launch_training.sh"
          args:
          - "--data_dir=/workdir/data"
          - "--train_dir=/workdir/train"
          - "--model_dir=/workdir/model"
          volumeMounts:
          - name: workdir
            mountPath: /workdir
        volumes:
        - name: workdir
          glusterfs:
            endpoints: <gluster-cluster>
            path: <gluster_vol_subpath>
        restartPolicy: OnFailure

TensorBoard Job

apiVersion: "tensorflow.org/v1alpha1"
kind: "TFJob"
metadata:
  name: "tensorboard-job"
spec:
  type: TensorBoard
  tfReplicaSpec:
  - replicas: 1
    template:
      spec:
        containers:
        - name: tensorboard
          image: tensorflow/tensorflow:1.4.0
          ports:
          - containerPort: <container-port>
          command:
          - "/workdir/tensorflow/launch_tensorboard.sh"
          volumeMounts:
          - name: workdir
            mountPath: /workdir            
        volumes:
        - name: workdir
          glusterfs:
            endpoints: <gluster-cluster>
            path: <gluster_vol_subpath>
        restartPolicy: OnFailure

TensorFlow Serving jobs are similar to TensorBoard jobs. In this way, we can reuse many configurations of Kubernetes built-in resources.
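For illustration, a Serving spec following the same pattern might look like the sketch below; the serving image, port, and launch script are placeholders, not part of the proposal.

apiVersion: "tensorflow.org/v1alpha1"
kind: "TFJob"
metadata:
  name: "serving-job"
spec:
  type: Serving
  tfReplicaSpec:
  - replicas: 2
    template:
      spec:
        containers:
        - name: tensorflow-serving
          image: <serving-image>
          ports:
          - containerPort: <serving-port>
          command:
          - "/workdir/tensorflow/launch_serving.sh"   # hypothetical launch script
          volumeMounts:
          - name: workdir
            mountPath: /workdir
        volumes:
        - name: workdir
          glusterfs:
            endpoints: <gluster-cluster>
            path: <gluster_vol_subpath>
        restartPolicy: Always   # a server should keep running until stopped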

@jlewi WDYT.

jlewi (Contributor) commented Dec 10, 2017

So I agree that tying TensorBoard to a particular job doesn't facilitate comparing multiple experiments.

I think the long term plan is for TensorBoard to become a service with a DB storing the events for all jobs. See tensorflow/tensorboard#92. At that point a TFJob would just specify the location of the DB to send its events to, and there would be a separate, long-running TensorBoard service that would let users select the experiments they want to look at.

Does that seem like the right long term solution?

I think what you are suggesting is that in the meantime we can make this more convenient by taking advantage of the fact that TensorBoard today allows you to load multiple event files at once. Creating a chart or something else to make that more convenient makes sense; but what's the advantage of modifying TfJob controller to directly support that as opposed to creating a separate helm chart to support that?

How scalable is loading multiple experiments into TensorBoard? I think TensorBoard loads all of them into memory, so does that limit the approach?

What do you think of the approach taken by StudioML? They provide a DB that keeps track of all your experiments. You can then go to a UI, select an experiment, and launch TensorBoard for that instance.

pineking (Member) commented Dec 11, 2017 via email

DjangoPeng (Member Author) commented Dec 11, 2017

The TensorBoard upstream may have a bigger picture that includes many new features, such as DB support, caching, productionization, monitoring, security, etc. I agree all of them would benefit us. But I think we first need to overcome the current obstacle: how to help data scientists train and track models easily. For example, if we want to train an InceptionV3 model, we might create a project dir named inception_v3 like the one below:

- inception_v3
  - ps2_worker4_batch32_lr0.1
     - logdir
        - tf_events_{version}
        - ...
     - models
        - {version}
           - saved_model.pb
     - training
        - {version}.checkpoint
  - ps2_worker8_batch32_lr0.1
  - ps2_worker4_batch64_lr0.1
  - ps2_worker8_batch64_lr0.1
  - ps2_worker4_batch32_lr0.2
  - ps2_worker8_batch32_lr0.2
  - ps2_worker4_batch64_lr0.2
  - ps2_worker8_batch64_lr0.2

What we want to do is launch a TensorBoard instance, maybe called inceptionv3-tensorboard-job, to track the inception_v3/ dir. Then we can refresh TensorBoard to visualize the latest tf events in any inception_v3/{hyper_params}/logdir. In this way, data scientists can browse multiple experiments of the InceptionV3 model by hyperparameter name and value.
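As a rough sketch (the mount path and port are just placeholders), the container in the TensorBoard spec above could point TensorBoard at the project root instead of a single run directory, since TensorBoard picks up event files in subdirectories of --logdir as separate runs:

        containers:
        - name: tensorboard
          image: tensorflow/tensorflow:1.4.0
          ports:
          - containerPort: 6006
          command:
          - "tensorboard"
          args:
          - "--logdir=/workdir/inception_v3"   # project root; each {hyper_params}/logdir shows up as its own run
          - "--port=6006"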

IMO, we can make the following improvements as soon as we reach a consensus.

  • Support unified storage for one model. (Actually, Kubernetes already supports this if we mount the same volume.)
  • Support launching an individual TensorBoard instance to track a whole TensorFlow project/model.

Does that seem like the right long term solution?

As you asked in tensorflow/tensorboard#92, I think there are some problems upstream needs to resolve before the long-term solution can be realized. The current TensorFlow summary ops cannot write tf.events to a DB directly, so I think we should think twice about the DB support. As we all know, tf.events files store a lot of serialized data structures defined by protobufs; is it suitable to store them in a DB? If not, we would have to build a service to convert them on each side.

what's the advantage of modifying TfJob controller to directly support that as opposed to creating a separate helm chart to support that?

The current design couples the TensorFlow training job and the TensorBoard job, so it is inconvenient and expensive to track one project/model. IMO, training jobs, TensorBoard jobs, and Serving jobs all belong to TFJob. So if the TFJob controller only supports training jobs, it's incomplete, isn't it?

What do you think of the approach taken by StudioML?

I will try StudioML today and then give feedback. 😄

jlewi (Contributor) commented Dec 11, 2017

So is the proposal to create a TensorBoard instance that can track a directory?

Can't TensorBoard already be pointed at a directory and automatically load multiple runs from that directory?

In your example spec for a TensorBoard job, you have 2 replicas. How does each replica decide which event files in the directory to load?

The current design couples the TensorFlow training job and the TensorBoard job, so it is inconvenient and expensive to track one project/model. IMO, training jobs, TensorBoard jobs, and Serving jobs all belong to TFJob. So if the TFJob controller only supports training jobs, it's incomplete, isn't it?

The intent of the TB replica in TfJob is not to provide a comprehensive solution for running TensorBoard but simply a convenient way to avoid forcing the user to launch and manage a TensorBoard job separately. There's an expectation that users would have to launch TB separately if they wanted to analyze the results after the job finished or do any comparative analysis. I think of TfJob launching TensorBoard more as running a status server useful for monitoring the progress of the job.

I don't think we want to have a single controller for training jobs, tensorboard jobs, and tensorflow serving. Each of these has different semantics and should probably have different controllers. Training jobs for example run to completion but TensorBoard and TensorFlow serving are servers that should run until stopped.

K8s already handles servers pretty well (e.g. via Deployments and ReplicaSets). So do we need custom controllers for TensorBoard and TensorFlow serving?
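To illustrate, a plain Deployment already keeps a TensorBoard server running against a shared volume. A minimal sketch (the image, logdir, and volume details are placeholders, and the exact Deployment API version depends on the cluster):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorboard
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorboard
  template:
    metadata:
      labels:
        app: tensorboard
    spec:
      containers:
      - name: tensorboard
        image: tensorflow/tensorflow:1.4.0
        command: ["tensorboard", "--logdir=/workdir", "--port=6006"]
        ports:
        - containerPort: 6006
        volumeMounts:
        - name: workdir
          mountPath: /workdir
      volumes:
      - name: workdir
        glusterfs:
          endpoints: <gluster-cluster>
          path: <gluster_vol_subpath>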

DjangoPeng (Member Author) commented Dec 11, 2017

@jlewi Sorry for the typo. The TensorBoard job replicas should be 1, and I have fixed that.

K8s already handles servers pretty well (e.g. via Deployments and ReplicaSets). So do we need custom controllers for TensorBoard and TensorFlow serving?

I agree that we can launch TensorFlow Serving jobs via Deployments and ReplicaSets. But if we want to build an e2e ML system on Kubernetes, we have to do more. For example, we might launch one TensorFlow Serving job in the beginning; as the number of requests increases, we have to scale the Serving job out. One way is via the Deployment controller: we don't need to do anything extra, but we can't get the real-time status of the scaling operation either. The other way is to build a controller that reports the status of the scaling operation to the KubeFlow monitor.
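To make the Deployment option concrete, scaling could be delegated to Kubernetes itself, e.g. with a HorizontalPodAutoscaler targeting a hypothetical Serving Deployment (the names and thresholds below are made up); the scaling status then lives only in the Deployment/HPA objects rather than in a KubeFlow-level monitor:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: tf-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tf-serving          # hypothetical TensorFlow Serving Deployment
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80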

Training jobs for example run to completion but TensorBoard and TensorFlow serving are servers that should run until stopped.

In fact, the PS replicas of training jobs are servers as well, so the business logic and callbacks may be similar. Right now we can't automatically manage a distributed training job, because we have no way to know whether all the Workers of a training job have completed their computation successfully.

I think it's worth discussing the design openly, and I can open an issue to explain the KubeFlow Server and Monitor proposal.

bhack commented Dec 11, 2017

@wbuchwalter In your blog post you suggested running TensorBoard as a service over a shared mount point. Are there scalability problems with this workaround (e.g. memory bottlenecks), or is it just an isolation problem between experiments?

jlewi (Contributor) commented Dec 11, 2017

@DjangoPeng A proposal for a KubeFlow server sounds like a good idea.

I agree that we can launch TensorFlow Serving jobs via Deployments and ReplicaSets. But if we want to build an e2e ML system on Kubernetes, we have to do more. For example, we might launch one TensorFlow Serving job in the beginning; as the number of requests increases, we have to scale the Serving job out. One way is via the Deployment controller: we don't need to do anything extra, but we can't get the real-time status of the scaling operation either. The other way is to build a controller that reports the status of the scaling operation to the KubeFlow monitor.

Adding better support for autoscaling and monitoring TensorFlow Serving jobs makes sense to me. For example, @owensk did some initial work to ensure health checks were working. That work should be in google/kubeflow.

I think my confusion is that I tend to think of training jobs, TensorBoard, and serving jobs as separate things because the semantics are very different. So at a high level I'd expect a distinct controller or helm package for each one. If we were to build separate controllers for each one, we could potentially reuse code.

bhack commented Dec 11, 2017

@jlewi Kubeflow and tensorflow/k8s here are confusing me. What is the difference between the two projects?

jlewi (Contributor) commented Dec 11, 2017

@jlewi Kubeflow and tensorflow/k8s here are confusing me. What is the difference between the two projects?

Kubeflow is the name for the effort to make deploying an ML stack on K8s easy.

The fact that the code is split across two repos google/kubeflow and tensorflow/k8s in two different GitHub organizations is an unfortunate consequence of history and corporate policies.

At some point I hope we will fix this and create a more organized repository.

bhack commented Dec 11, 2017

@jlewi Yes, I hope that will be resolved as the tools mature. I understand the policies, but having two entry points is impractical from a user's point of view.

DjangoPeng (Member Author) commented:
I think the current design of the CRD is dedicated to TensorFlow training jobs, and a single CRD for multiple TFJob types might not be a good choice. So I'd like to close this issue, and we can open another PR to discuss CRDs for TensorBoard and TensorFlow Serving. @jlewi WDYT?

jlewi (Contributor) commented Dec 18, 2017

@DjangoPeng Makes sense to me. Thanks.
