Rethink the TFJob CRD #209

Closed
DjangoPeng opened this issue Dec 10, 2017 · 13 comments

DjangoPeng (Member) commented Dec 10, 2017

In general, we often launch more than one TensorFlow training job, each with a different combination of hyperparameters, in order to get the best training result. At the same time, we can use a single TensorBoard instance to visualize the TensorFlow events output by all of those training jobs. This is a very common workflow for data scientists and algorithm engineers. But in the current design, the TensorBoard job is bound to a single TensorFlow job.

To make this possible, we need to decouple the TensorFlow job from the TensorBoard job. I think we can make some changes to the TFJob CRD. First, both kinds of job can share TFReplicaSpec. Moreover, we can use TFReplicaSpec to specify a TensorFlow Serving job as well. Finally, we can add a type field to define the TFJob type (one of Training, TensorBoard, or Serving).

A possible spec might look like the following.

Distributed training job

apiVersion: "tensorflow.org/v1alpha1"
kind: "TFJob"
metadata:
  name: "training-job"
spec:
  type: Training
  tfReplicaSpec:
  - replicas: 2
    tfReplicaType: PS
    template:
      spec:
        containers:
        - name: tensorflow
          image: tensorflow/tensorflow:1.4.0
          command:
          - "/workdir/tensorflow/launch_training.sh"
          volumeMounts:
          - name: workdir
            mountPath: /workdir
        volumes:
        - name: workdir
          glusterfs:
            endpoints: <gluster-cluster>
            path: <gluster_vol_subpath>
        restartPolicy: OnFailure
  - replicas: 4
    tfReplicaType: Worker
    template:
      spec:
        containers:
        - name: tensorflow
          image: tensorflow/tensorflow:1.4.0
          command:
          - "/workdir/tensorflow/launch_training.sh"
          args:
          - "--data_dir=/workdir/data"
          - "--train_dir=/workdir/train"
          - "--model_dir=/workdir/model"
          volumeMounts:
          - name: workdir
            mountPath: /workdir
        volumes:
        - name: workdir
          glusterfs:
            endpoints: <gluster-cluster>
            path: <gluster_vol_subpath>
        restartPolicy: OnFailure

TensorBoard Job

apiVersion: "tensorflow.org/v1alpha1"
kind: "TFJob"
metadata:
  name: "tensorboard-job"
spec:
  type: TensorBoard
  tfReplicaSpec:
  - replicas: 1
    template:
      spec:
        containers:
        - name: tensorboard
          image: tensorflow/tensorflow:1.4.0
          ports:
          - containerPort: <container-port>
          command:
          - "/workdir/tensorflow/launch_tensorboard.sh"
          volumeMounts:
          - name: workdir
            mountPath: /workdir            
        volumes:
        - name: workdir
          glusterfs:
            endpoints: <gluster-cluster>
            path: <gluster_vol_subpath>
        restartPolicy: OnFailure

TensorFlow Serving jobs are similar to TensorBoard jobs. In this way, we can reuse many configurations of Kubernetes built-in resources.
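For illustration, a Serving spec following the same pattern might look like the sketch below; the serving image, port, and launch script are placeholders, not part of the proposal.

apiVersion: "tensorflow.org/v1alpha1"
kind: "TFJob"
metadata:
  name: "serving-job"
spec:
  type: Serving
  tfReplicaSpec:
  - replicas: 2
    template:
      spec:
        containers:
        - name: tensorflow-serving
          image: <serving-image>
          ports:
          - containerPort: <serving-port>
          command:
          - "/workdir/tensorflow/launch_serving.sh"   # hypothetical launch script
          volumeMounts:
          - name: workdir
            mountPath: /workdir
        volumes:
        - name: workdir
          glusterfs:
            endpoints: <gluster-cluster>
            path: <gluster_vol_subpath>
        restartPolicy: Always   # a server should keep running until stopped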

@jlewi WDYT.

jlewi (Contributor) commented Dec 10, 2017

So I agree that tying TensorBoard to a particular job doesn't facilitate comparing multiple experiments.

I think the long term plan is for TensorBoard to become a service with a DB storing the events for all jobs. See tensorflow/tensorboard#92. At that point a TFJob would just specify the location of the DB to send its events to, and there would be a separate, long-running TensorBoard service that would let users select the experiments they want to look at.

Does that seem like the right long term solution?

I think what you are suggesting is that in the meantime we can make this more convenient by taking advantage of the fact that TensorBoard today allows you to load multiple event files at once. Creating a chart or something else to make that more convenient makes sense; but what's the advantage of modifying TfJob controller to directly support that as opposed to creating a separate helm chart to support that?

How scalable is loading multiple experiments into TensorBoard? I think TensorBoard loads all of them into memory, so does that limit the approach?

What do you think of the approach taken by StudioML? They provide a DB that keeps track of all your experiments. You can then go to a UI, select an experiment, and launch TensorBoard for that instance.

pineking (Member) commented Dec 11, 2017 via email

DjangoPeng (Member Author) commented Dec 11, 2017

The TensorBoard upstream may have a bigger picture that includes many new features, such as DB support, caching, productionization, monitoring, security, etc. I agree all of them would benefit us. But I think we first need to overcome the current obstacle: how to help data scientists train and track models easily. For example, if we want to train an InceptionV3 model, we might create a project dir named inception_v3 like the one below:

- inception_v3
  - ps2_worker4_batch32_lr0.1
     - logdir
        - tf_events_{version}
        - ...
     - models
        - {version}
           - saved_model.pb
     - training
        - {version}.checkpoint
  - ps2_worker8_batch32_lr0.1
  - ps2_worker4_batch64_lr0.1
  - ps2_worker8_batch64_lr0.1
  - ps2_worker4_batch32_lr0.2
  - ps2_worker8_batch32_lr0.2
  - ps2_worker4_batch64_lr0.2
  - ps2_worker8_batch64_lr0.2

What we want to do is launch a TensorBoard instance, maybe called inceptionv3-tensorboard-job, to track the inception_v3/ dir. Then we can refresh TensorBoard to visualize the latest tf events in any inception_v3/{hyper_params}/logdir. In this way, data scientists can browse multiple experiments of the InceptionV3 model by hyperparameter name and value.
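As a rough sketch (the mount path and port are just placeholders), the container in the TensorBoard spec above could point TensorBoard at the project root instead of a single run directory, since TensorBoard picks up event files in subdirectories of --logdir as separate runs:

        containers:
        - name: tensorboard
          image: tensorflow/tensorflow:1.4.0
          ports:
          - containerPort: 6006
          command:
          - "tensorboard"
          args:
          - "--logdir=/workdir/inception_v3"   # project root; each {hyper_params}/logdir shows up as its own run
          - "--port=6006"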

IMO, we can make the following improvements as soon as we reach a consensus.

  • Support unified storage for one model. (Actually, Kubernetes already supports this if we mount the same volume.)
  • Support launching an individual TensorBoard instance to track a whole TensorFlow project/model.

Does that seem like the right long term solution?

As you asked in tensorflow/tensorboard#92, I think there are some problems upstream needs to resolve before the long-term solution can be realized. The current TensorFlow summary ops cannot write tf.events to a DB directly, so I think we should think twice about the DB support. As we all know, tf.events files store a lot of serialized data structures defined by protobufs; is it suitable to store them in a DB? If not, we would have to build a service to convert them on each side.

what's the advantage of modifying TfJob controller to directly support that as opposed to creating a separate helm chart to support that?

The current design couples the TensorFlow training job and the TensorBoard job, so it is inconvenient and expensive to track one project/model. IMO, training jobs, TensorBoard jobs, and Serving jobs all belong to TFJob. So if the TFJob controller only supports training jobs, it's incomplete, isn't it?

What do you think of the approach taken by StudioML?

I will try StudioML today and then give feedback. 😄

jlewi (Contributor) commented Dec 11, 2017

So is the proposal to create a TensorBoard instance that can track a directory?

Can't TensorBoard already be pointed at a directory and automatically load multiple runs from that directory?

In your example spec for a TensorBoard job, you have 2 replicas. How does each replica decide which event files in the directory to load?

The current design couples the TensorFlow training job and the TensorBoard job, so it is inconvenient and expensive to track one project/model. IMO, training jobs, TensorBoard jobs, and Serving jobs all belong to TFJob. So if the TFJob controller only supports training jobs, it's incomplete, isn't it?

The intent of the TB replica in TfJob is not to provide a comprehensive solution for running TensorBoard but simply a convenient way to avoid forcing the user to launch and manage a TensorBoard job separately. There's an expectation that users would have to launch TB separately if they wanted to analyze the results after the job finished or do any comparative analysis. I think of TfJob launching TensorBoard more as running a status server useful for monitoring the progress of the job.

I don't think we want to have a single controller for training jobs, tensorboard jobs, and tensorflow serving. Each of these has different semantics and should probably have different controllers. Training jobs for example run to completion but TensorBoard and TensorFlow serving are servers that should run until stopped.

K8s already handles servers pretty well (e.g. via Deployments and ReplicaSets). So do we need custom controllers for TensorBoard and TensorFlow serving?
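To illustrate, a plain Deployment already keeps a TensorBoard server running against a shared volume. A minimal sketch (the image, logdir, and volume details are placeholders, and the exact Deployment API version depends on the cluster):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorboard
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorboard
  template:
    metadata:
      labels:
        app: tensorboard
    spec:
      containers:
      - name: tensorboard
        image: tensorflow/tensorflow:1.4.0
        command: ["tensorboard", "--logdir=/workdir", "--port=6006"]
        ports:
        - containerPort: 6006
        volumeMounts:
        - name: workdir
          mountPath: /workdir
      volumes:
      - name: workdir
        glusterfs:
          endpoints: <gluster-cluster>
          path: <gluster_vol_subpath>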

DjangoPeng (Member Author) commented Dec 11, 2017

@jlewi Sorry for the typo. The TensorBoard job replicas should be 1, and I have fixed that.

K8s already handles servers pretty well (e.g. via Deployments and ReplicaSets). So do we need custom controllers for TensorBoard and TensorFlow serving?

I agree that we can launch TensorFlow Serving jobs via Deployments and ReplicaSets. But if we want to build an e2e ML system on Kubernetes, we have to do more. For example, we might launch one TensorFlow Serving job in the beginning; as the number of requests increases, we have to scale the Serving job out. One way is via the Deployment controller: we don't need to do anything extra, but we can't get the real-time status of the scaling operation either. The other way is to build a controller that reports the status of the scaling operation to the KubeFlow monitor.
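To make the Deployment option concrete, scaling could be delegated to Kubernetes itself, e.g. with a HorizontalPodAutoscaler targeting a hypothetical Serving Deployment (the names and thresholds below are made up); the scaling status then lives only in the Deployment/HPA objects rather than in a KubeFlow-level monitor:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: tf-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tf-serving          # hypothetical TensorFlow Serving Deployment
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80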

Training jobs for example run to completion but TensorBoard and TensorFlow serving are servers that should run until stopped.

In fact, the PS replicas of training jobs are servers as well, so the business logic and callbacks may be similar. Right now we can't automatically manage a distributed training job, because we have no way to know whether all the Workers of a training job have completed their computation successfully.

I think it's worth discussing the design openly, and I can open an issue to explain the KubeFlow Server and Monitor proposal.

bhack commented Dec 11, 2017

@wbuchwalter In your blog post you suggested running TensorBoard as a service over a shared mount point. Are there scalability problems with this workaround (e.g. memory bottlenecks), or is it just an isolation problem between experiments?

jlewi (Contributor) commented Dec 11, 2017

@DjangoPeng A proposal for a KubeFlow server sounds like a good idea.

I agree that we can launch TensorFlow Serving jobs via Deployments and ReplicaSets. But if we want to build an e2e ML system on Kubernetes, we have to do more. For example, we might launch one TensorFlow Serving job in the beginning; as the number of requests increases, we have to scale the Serving job out. One way is via the Deployment controller: we don't need to do anything extra, but we can't get the real-time status of the scaling operation either. The other way is to build a controller that reports the status of the scaling operation to the KubeFlow monitor.

Adding better support for autoscaling and monitoring TensorFlow Serving jobs makes sense to me. For example, @owensk did some initial work to ensure health checks were working. That work should be in google/kubeflow.

I think my confusion is that I tend to think of training jobs, TensorBoard, and serving jobs as separate things because the semantics are very different. So at a high level I'd expect a distinct controller or helm package for each one. If we were to build separate controllers for each one, we could potentially reuse code.

bhack commented Dec 11, 2017

@jlewi Kubeflow and tensorflow/k8s here are confusing me. What is the difference between the two projects?

jlewi (Contributor) commented Dec 11, 2017

@jlewi Kubeflow and tensorflow/k8s here are confusing me. What is the difference between the two projects?

Kubeflow is the name for the effort to make deploying an ML stack on K8s easy.

The fact that the code is split across two repos google/kubeflow and tensorflow/k8s in two different GitHub organizations is an unfortunate consequence of history and corporate policies.

At some point I hope we will fix this and create a more organized repository.

bhack commented Dec 11, 2017

@jlewi Yes, I hope that will be resolved as the tools mature. I understand the policies, but having two entry points is impractical from a user's point of view.

DjangoPeng (Member Author) commented:
I think the current design of the CRD is dedicated to TensorFlow training jobs, and a single CRD for multiple TFJob types might not be a good choice. So I'd like to close this issue, and we can open another PR to discuss CRDs for TensorBoard and TensorFlow Serving. @jlewi WDYT?

jlewi (Contributor) commented Dec 18, 2017

@DjangoPeng Makes sense to me. Thanks.
