Rethink the TFJob CRD #209
So I agree that tying TensorBoard to a particular job doesn't facilitate comparing multiple experiments.

I think the long term plan is for TensorBoard to become a service with a DB storing the events for all jobs. See tensorflow/tensorboard#92. At that point TfJobs would just specify the location of the DB to send the events to, and there would be a separate TensorBoard service running that would allow users to select the experiments they want to look at. Does that seem like the right long term solution?

I think what you are suggesting is that in the meantime we can make this more convenient by taking advantage of the fact that TensorBoard today allows you to load multiple event files at once. Creating a chart or something else to make that more convenient makes sense; but what's the advantage of modifying the TfJob controller to directly support that, as opposed to creating a separate helm chart for it?

How scalable is loading multiple experiments into TensorBoard? I think TensorBoard loads all of them into memory, so does that limit the approach?

What do you think of the approach taken by StudioML? They provide a DB that keeps track of all your experiments. You can then go to a UI, select an experiment, and launch TensorBoard for that instance?
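For context on the "load multiple event files at once" point: TensorBoard treats each subdirectory of its logdir as a separate run, so several training jobs can write side by side and be compared in one UI. A minimal sketch of that layout (directory names are illustrative):

```shell
# Each training job writes its event files into its own subdirectory
# of a shared parent, e.g. one directory per hyperparameter setting.
mkdir -p /tmp/tb-experiments/lr-0.01
mkdir -p /tmp/tb-experiments/lr-0.1

# A single TensorBoard instance pointed at the parent then shows each
# subdirectory as a separate, comparable run:
#   tensorboard --logdir=/tmp/tb-experiments
```

This is the layout a shared helm chart or a directory-tracking TensorBoard instance would rely on.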
I think it's a better solution that the user can select the experiments they'd like to show in TensorBoard.
The TensorBoard upstream may have a big picture including many new features, such as DB and cache support, productionization, monitoring, security, etc. I agree all of them would benefit us. But I think we first need to overcome the current obstacle: helping data scientists train and track models easily. For example, if we want to train an InceptionV3 model, we might create a project directory for it, with all the training jobs for that model writing their events under it.
What we want to do is launch a TensorBoard instance over that project directory. IMO, we can make the following improvements immediately after we reach a consensus.
As you asked in tensorflow/tensorboard#92, I think there are some problems upstream needs to resolve before the long term solution is realized. The current TensorFlow summary ops cannot write tf.events to a DB directly. I also think we should think twice about the DB support: as we know, tf.events files store serialized data structures defined by protobufs, so is it suitable to save them in a DB? If not, we would have to build a service to convert them on each side.
The current design couples the TensorFlow training job and the TensorBoard job, so it's inconvenient and expensive to track one project/model. IMO, training jobs, TensorBoard jobs, and Serving jobs all belong to TFJob. So if the TFJob controller only supports training jobs, it's incomplete, isn't it?
I will try StudioML today and then give feedback. 😄
So is the proposal to create a TensorBoard instance that can track a directory? Can't TensorBoard already be pointed at a directory and automatically load multiple runs from it? In your example spec for a TensorBoard job, you have 2 replicas. How does each replica decide which event files in the directory to load?
The intent of the TB replica in TfJob is not to provide a comprehensive solution for running TensorBoard but simply a convenient way to avoid forcing the user to launch and manage a TensorBoard job separately. There's an expectation that users would have to launch TB separately if they wanted to analyze the results after the job finished or do any comparative analysis. I think of TfJob launching TensorBoard more as running a status server useful for monitoring the progress of the job.

I don't think we want to have a single controller for training jobs, TensorBoard jobs, and TensorFlow Serving. Each of these has different semantics and should probably have different controllers. Training jobs, for example, run to completion, but TensorBoard and TensorFlow Serving are servers that should run until stopped. K8s already handles servers pretty well (e.g. via Deployments and ReplicaSets). So do we need custom controllers for TensorBoard and TensorFlow Serving?
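To make the "K8s already handles servers" point concrete: a standalone TensorBoard can be run with a plain Deployment and no custom controller. A minimal sketch, assuming the event files live on a shared PersistentVolumeClaim (image tag, claim name, and paths are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorboard
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorboard
  template:
    metadata:
      labels:
        app: tensorboard
    spec:
      containers:
      - name: tensorboard
        image: tensorflow/tensorflow:1.4.0   # illustrative tag
        command: ["tensorboard", "--logdir=/logs", "--port=6006"]
        ports:
        - containerPort: 6006
        volumeMounts:
        - name: logs
          mountPath: /logs
      volumes:
      - name: logs
        persistentVolumeClaim:
          claimName: training-logs   # hypothetical PVC shared with the training jobs
```

The Deployment controller already handles restarts and updates here, which is the argument against a custom TensorBoard controller.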
@jlewi Sorry for the typo. The TensorBoard job replicas should be 1, and I have fixed that.
I agree that we can launch a TensorFlow Serving job via Deployments and ReplicaSets. But if we want to build an e2e ML system based on Kubernetes, we have to do more. For example, we might launch one TensorFlow Serving job in the beginning; as the number of requests increases, we have to scale out the TensorFlow Serving job. One way, via the Deployment controller, we don't need to do anything, but we also can't get the real-time status of the scaling operation. Another way, we write a controller that reports the status of the scaling operation to the KubeFlow monitor.
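For the scale-out case described above, the declarative side can already be expressed with a HorizontalPodAutoscaler targeting the Serving Deployment. A sketch (names and thresholds are illustrative); as noted, this alone does not report scaling progress back to a KubeFlow monitor, which is where a custom controller would add value:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: tf-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tf-serving          # hypothetical TensorFlow Serving Deployment
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80
```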
In fact, the PS replicas of training jobs are servers as well, so the business logic and callbacks may be similar. Right now we can't automatically manage a distributed training job, because we have no way to know whether all Workers of a training job have finished their computation. I think it's open and beneficial to discuss the design, and I can open an issue to explain the KubeFlow Server and Monitor proposal.
@wbuchwalter In your blog post you suggested running TensorBoard as a service over a shared mount point. Are there scalability problems with this workaround (e.g. memory bottlenecks), or is it just an isolation problem between experiments?
@DjangoPeng A proposal for a KubeFlow server sounds like a good idea.
Adding better support for autoscaling and monitoring TensorFlow Serving jobs makes sense to me. For example, @owensk did some initial work to ensure health checks were working; that work should be in google/kubeflow. I think my confusion is that I tend to think of training jobs, TensorBoard, and serving jobs as separate things because their semantics are very different. So at a high level I'd expect a distinct controller or helm package for each one. If we were to build separate controllers for each one, we could potentially reuse code.
@jlewi Kubeflow and k8s here are confusing me. What is the difference between the two projects?
Kubeflow is the name for the effort to make deploying an ML stack on K8s easy. The fact that the code is split across two repos, google/kubeflow and tensorflow/k8s, in two different GitHub organizations is an unfortunate consequence of history and corporate policies. At some point I hope we will fix this and create a more organized repository.
@jlewi Yes, I hope that will be solved as the tools mature. I understand the policies, but it's impractical from a user's point of view to have two entry points.
I think the current CRD design is dedicated to TensorFlow training jobs, and a single CRD for multiple TFJob types might not be a good choice. So I'd like to close this issue, and we can open another PR to discuss CRDs for TensorBoard and TensorFlow Serving. @jlewi WDYT
@DjangoPeng Makes sense to me. Thanks.
In general, we often launch more than one TensorFlow training job, with multiple hyperparameter combinations, to get the best training results. At the same time, we want to use one TensorBoard instance to visualize all of the TensorFlow events output by those training jobs. This is a really common case for data scientists and algorithm engineers. But in the current design, the TensorBoard job is bound to one TensorFlow job.
To make the above possible, we need to decouple the TensorFlow job and the TensorBoard job. I think we can make some changes to the TFJob CRD. First, both jobs can share TFReplicaSpec. Moreover, we can use TFReplicaSpec to specify the TensorFlow Serving job. At the same time, we can add a type field to define the TFJob type (one of Training, TensorBoard, or Serving). A possible spec may look like the following...
Distributed training job
TensorBoard Job
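A minimal sketch of what the two specs above might look like with the proposed type field (apiVersion, names, and field layout are illustrative, not the exact specs from the proposal):

```yaml
# Hypothetical TFJob with type: Training
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: mnist-train
spec:
  type: Training            # proposed field: one of Training, TensorBoard, Serving
  replicaSpecs:
  - tfReplicaType: MASTER
    replicas: 1
  - tfReplicaType: WORKER
    replicas: 2
  - tfReplicaType: PS
    replicas: 1
---
# Hypothetical TFJob with type: TensorBoard, tracking a shared log directory
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: mnist-tensorboard
spec:
  type: TensorBoard
  replicaSpecs:
  - tfReplicaType: TENSORBOARD
    replicas: 1             # per the discussion above, this should be 1
```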
TensorFlow Serving jobs are similar to TensorBoard jobs. In this way, we can reuse many configurations of Kubernetes built-in resources.
@jlewi WDYT.