Custom Trainer Executor for Multi-Worker GPU Training on Kubernetes #2248
Conversation
CC: @1025KB
charlesccychen left a comment:
Thanks!
R: @chuanyu

    @@ -0,0 +1,245 @@
    # Copyright 2020 Google LLC. All Rights Reserved.
Could you move everything to be under tfx/extensions/experimental/kubernetes/trainer?

    command=_COMMAND,
    args=job_args,
    security_context=client.V1SecurityContext(
        privileged=True,
What happens when we don't use privileged?
privileged=True grants the container the privilege needed for users' training scripts to use the TF profiler. Since many training scripts include the profiler callback by default (e.g., the cifar10 example), I think it is most convenient to grant this by default.
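For reference, a minimal sketch of the container spec in question, assuming the standard `kubernetes` Python client; `_COMMAND` and `job_args` are the names from the diff above, but the concrete values and the image here are illustrative, not taken from this PR.

```python
from kubernetes import client

# Illustrative values; in the executor these come from the component invocation.
_COMMAND = ['python', '-m', 'tfx.scripts.run_executor']
job_args = ['--executor_class_path', 'path.to.TrainerExecutor']
num_gpus_per_worker = 1

# Running the container privileged lets the TF profiler access GPU performance
# counters; without it, profiler callbacks in the user's training script may
# fail with permission errors.
container = client.V1Container(
    name='worker',
    image='tensorflow/tensorflow:2.3.0-gpu',  # illustrative image
    command=_COMMAND,
    args=job_args,
    security_context=client.V1SecurityContext(privileged=True),
    resources=client.V1ResourceRequirements(
        limits={'nvidia.com/gpu': num_gpus_per_worker},
    ) if num_gpus_per_worker > 0 else None,
)
```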

    ) if num_gpus_per_worker > 0 else None,
    ),
    ],
    restart_policy=kube_utils.RestartPolicy.NEVER.value,
What's the rationale for not allowing restarts?
I think Kubernetes restarts completed containers by default, so in the event of successful training we want to prevent the pod from running twice; if the pod fails, we do not need to restart it either, since the component would fail as well.
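For context, a minimal sketch of a worker pod with restarts disabled, assuming the standard `kubernetes` Python client; the sketch uses the literal `'Never'` string rather than the `kube_utils.RestartPolicy` enum, and the pod name and image are illustrative.

```python
from kubernetes import client

# With restart_policy='Never', a pod that reaches Succeeded or Failed is not
# restarted by the kubelet: a successful training run is not executed twice,
# and a failed run simply surfaces as a failed Trainer component.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name='trainer-worker-0'),  # illustrative name
    spec=client.V1PodSpec(
        restart_policy='Never',
        containers=[
            client.V1Container(
                name='worker',
                image='tensorflow/tensorflow:2.3.0-gpu',  # illustrative image
            ),
        ],
    ),
)
```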

    training_inputs = training_inputs.copy()

    json_inputs = artifact_utils.jsonify_artifact_dict(input_dict)
    logging.info('json_inputs=\'%s\'.', json_inputs)
Is it useful to keep these logging lines?
These were kept to be consistent with the CAIP trainer's runner. My personal opinion is that they are not that useful.

    return None


    def wait_pod(core_api: k8s_client.CoreV1Api,
Could you please rebase this once you are done with the other PRs, since I think this is shared.
Right, I think it would be best to rebase this once the Kubernetes Dag Runner is merged, since it contains a lot of changes in kube_utils.
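In the meantime, here is a rough sketch of the kind of polling helper `wait_pod` presumably provides, assuming the standard `kubernetes` Python client; the signature, timeout values, and error handling are illustrative and not the shared `kube_utils` implementation.

```python
import time
from typing import Callable

from kubernetes import client as k8s_client


def wait_pod(core_api: k8s_client.CoreV1Api,
             pod_name: str,
             namespace: str,
             exit_condition: Callable[[k8s_client.V1Pod], bool],
             timeout_sec: int = 600,
             poll_interval_sec: int = 10) -> k8s_client.V1Pod:
  """Polls a pod until `exit_condition` holds or the timeout expires."""
  deadline = time.time() + timeout_sec
  while time.time() < deadline:
    resp = core_api.read_namespaced_pod(name=pod_name, namespace=namespace)
    if exit_condition(resp):
      return resp
    time.sleep(poll_interval_sec)
  raise RuntimeError('Timed out waiting for pod %s.' % pod_name)
```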

    i) for i in range(num_workers)]


    def _pod_is_done(resp: client.V1Pod):
Return type?
This should be addressed with the rebase.
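For reference, the annotated predicate would presumably look something like this; the phase strings are the standard Kubernetes pod phases, and the exact helper may change with the rebase.

```python
from kubernetes import client


def _pod_is_done(resp: client.V1Pod) -> bool:
  # A pod is done once it reaches a terminal phase.
  return resp.status.phase in ('Succeeded', 'Failed')
```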

    else:
      absl.logging.warning(
          "Missing unique_id in executor, using a random id instead.")
      unique_id = test_utils.random_id()
We shouldn't depend on test_utils for non-test code
I'm considering copying the function over or finding another way to generate the random id. @charlesccychen, any suggestions?
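One possibility, sketched here rather than taken from this PR, is to derive the id from `uuid` so that non-test code does not import `test_utils`:

```python
import uuid


def random_id() -> str:
  """Returns a short random suffix suitable for naming worker pods."""
  # Similar in spirit to test_utils.random_id(), but without the
  # test-only dependency.
  return uuid.uuid4().hex[:10]
```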
This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Would be interesting to have this one as an alternative to AI Platform Training.
While I don't think we have the expertise to do this from a first-party perspective, I think this can be a good candidate for a project in TFX Addons [1]. @rcrowe-google WDYT?

[1] https://github.com/tensorflow/tfx-addons
I would agree with that. I'm happy to promote it as well. That said, this would require that the Executor interface be a stable interface in 1.0, which it currently is not.
(tagging GH usernames: CC @1025KB @charlesccychen)
Adds experimental support for multi-worker GPU training on Kubernetes through a custom executor for the Trainer component.
Current support is limited to GKE's GPU node pools and would need to be extended to support other Kubernetes execution platforms.
Sample usage in taxi_pipeline_native_keras.py:

    trainer = Trainer(
        module_file=module_file,
        custom_executor_spec=executor_spec.ExecutorClassSpec(
            kubernetes_trainer_executor.GenericExecutor),
        examples=transform.outputs['transformed_examples'],
        transform_graph=transform.outputs['transform_graph'],
        schema=schema_gen.outputs['schema'],
        train_args=trainer_pb2.TrainArgs(num_steps=1000),
        eval_args=trainer_pb2.EvalArgs(num_steps=150),
        custom_config={
            kubernetes_trainer_executor.TRAINING_ARGS_KEY: {
                'num_workers': 4,
                'num_gpus_per_worker': 1
            }
        })
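Note that multi-worker training with `tf.distribute.MultiWorkerMirroredStrategy` relies on a `TF_CONFIG` environment variable on each worker. The sketch below shows the general shape of what each worker pod needs; the pod names, port, and the exact mechanism this executor uses to provide it are illustrative assumptions, not taken from the PR.

```python
import json
import os

# Illustrative TF_CONFIG for worker 0 of a 4-worker job; one of these would be
# provided per pod, with 'index' ranging over the workers.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'worker': [
            'trainer-worker-0:5000',
            'trainer-worker-1:5000',
            'trainer-worker-2:5000',
            'trainer-worker-3:5000',
        ],
    },
    'task': {'type': 'worker', 'index': 0},
})
```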