
[Request] Add Kubeflow as the way to run TensorFlow distributed training jobs on Kubernetes #79

Closed
gaocegege opened this issue Apr 20, 2018 · 4 comments

gaocegege commented Apr 20, 2018

Hi, I am from the Kubeflow community, an open source community dedicated to making ML stacks on Kubernetes easy, fast, and extensible.

We implemented an operator for TensorFlow on Kubernetes. Personally, I think the CRD defined by the operator is easier to use and understand than the Jinja template. Here is one example:

apiVersion: "kubeflow.org/v1alpha1"
kind: "TFJob"
metadata:
  name: "example-job"
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: PS
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
          restartPolicy: OnFailure
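
For comparison with a template-based approach, here is a rough sketch of how the same TFJob manifest could be built programmatically. It only mirrors the fields shown in the example above; the helper names `replica_spec` and `tf_job` are hypothetical and are not part of the operator.

```python
import json

# Image taken from the example manifest above.
IMAGE = "gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff"


def replica_spec(replica_type, replicas, image=IMAGE):
    """Build one entry of spec.replicaSpecs (hypothetical helper)."""
    return {
        "replicas": replicas,
        "tfReplicaType": replica_type,
        "template": {
            "spec": {
                "containers": [{"image": image, "name": "tensorflow"}],
                "restartPolicy": "OnFailure",
            }
        },
    }


def tf_job(name, replica_specs):
    """Wrap the replica specs in a TFJob object (hypothetical helper)."""
    return {
        "apiVersion": "kubeflow.org/v1alpha1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {"replicaSpecs": replica_specs},
    }


# Same topology as the example: 1 master, 1 worker, 2 parameter servers.
job = tf_job(
    "example-job",
    [replica_spec("MASTER", 1), replica_spec("WORKER", 1), replica_spec("PS", 2)],
)
print(json.dumps(job, indent=2))
```

Kubernetes accepts JSON manifests as well as YAML, so the printed output could be passed to `kubectl create -f -`, assuming the TFJob CRD is already installed in the cluster.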

I am wondering whether we could add our implementation to this repository to enrich the TensorFlow and Kubernetes ecosystem.

Suggestions are welcome!

/cc @ddysher @jlewi @DjangoPeng @ScorpioCPH

Contributor

jlewi commented May 15, 2018

/cc @jhseu

Contributor

jhseu commented May 15, 2018

Yeah, this seems reasonable. Feel free to send a pull request and add @jlewi and me as reviewers.


ddysher commented Jun 30, 2018

Should we close this per #89?

@gaocegege
Author

I think so. I am closing it.
