Custom Trainer Executor for Multi-Worker GPU Training on Kubernetes #2248
Conversation
CC: @1025KB
charlesccychen left a comment:
Thanks!
R: @chuanyu

    @@ -0,0 +1,245 @@
    # Copyright 2020 Google LLC. All Rights Reserved.
Could you move everything to be under tfx/extensions/experimental/kubernetes/trainer?

    command=_COMMAND,
    args=job_args,
    security_context=client.V1SecurityContext(
        privileged=True,
What happens when we don't use privileged?
privileged=True grants the container the privilege needed for users' training scripts to use the TF profiler. Since many training scripts include the profiler callback by default (e.g., the cifar10 example), I think it is most convenient to grant this by default.
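For reference, a minimal sketch of the container spec in question, assuming the standard `kubernetes` Python client; `_COMMAND` and `job_args` are the names from the diff above, but the concrete values and the image here are illustrative, not taken from this PR.

```python
from kubernetes import client

# Illustrative values; in the executor these come from the component invocation.
_COMMAND = ['python', '-m', 'tfx.scripts.run_executor']
job_args = ['--executor_class_path', 'path.to.TrainerExecutor']
num_gpus_per_worker = 1

# Running the container privileged lets the TF profiler access GPU performance
# counters; without it, profiler callbacks in the user's training script may
# fail with permission errors.
container = client.V1Container(
    name='worker',
    image='tensorflow/tensorflow:2.3.0-gpu',  # illustrative image
    command=_COMMAND,
    args=job_args,
    security_context=client.V1SecurityContext(privileged=True),
    resources=client.V1ResourceRequirements(
        limits={'nvidia.com/gpu': num_gpus_per_worker},
    ) if num_gpus_per_worker > 0 else None,
)
```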

    ) if num_gpus_per_worker > 0 else None,
    ),
    ],
    restart_policy=kube_utils.RestartPolicy.NEVER.value,
What's the rationale for not allowing restarts?
I think Kubernetes restarts completed containers by default, so in the event of successful training we want to prevent the pod from running twice; if the pod fails, we do not need to restart it either, since the component would fail as well.
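For context, a minimal sketch of a worker pod with restarts disabled, assuming the standard `kubernetes` Python client; the sketch uses the literal `'Never'` string rather than the `kube_utils.RestartPolicy` enum, and the pod name and image are illustrative.

```python
from kubernetes import client

# With restart_policy='Never', a pod that reaches Succeeded or Failed is not
# restarted by the kubelet: a successful training run is not executed twice,
# and a failed run simply surfaces as a failed Trainer component.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name='trainer-worker-0'),  # illustrative name
    spec=client.V1PodSpec(
        restart_policy='Never',
        containers=[
            client.V1Container(
                name='worker',
                image='tensorflow/tensorflow:2.3.0-gpu',  # illustrative image
            ),
        ],
    ),
)
```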

    training_inputs = training_inputs.copy()

    json_inputs = artifact_utils.jsonify_artifact_dict(input_dict)
    logging.info('json_inputs=\'%s\'.', json_inputs)
Is it useful to keep these logging lines?
These were kept to be consistent with the CAIP trainer's runner. My personal opinion is that they are not that useful.

    return None


    def wait_pod(core_api: k8s_client.CoreV1Api,
Could you please rebase this once you are done with the other PRs, since I think this is shared.
Right, I think it would be best to rebase this once the Kubernetes Dag Runner is merged, since it contains a lot of changes in kube_utils.
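In the meantime, here is a rough sketch of the kind of polling helper `wait_pod` presumably provides, assuming the standard `kubernetes` Python client; the signature, timeout values, and error handling are illustrative and not the shared `kube_utils` implementation.

```python
import time
from typing import Callable

from kubernetes import client as k8s_client


def wait_pod(core_api: k8s_client.CoreV1Api,
             pod_name: str,
             namespace: str,
             exit_condition: Callable[[k8s_client.V1Pod], bool],
             timeout_sec: int = 600,
             poll_interval_sec: int = 10) -> k8s_client.V1Pod:
  """Polls a pod until `exit_condition` holds or the timeout expires."""
  deadline = time.time() + timeout_sec
  while time.time() < deadline:
    resp = core_api.read_namespaced_pod(name=pod_name, namespace=namespace)
    if exit_condition(resp):
      return resp
    time.sleep(poll_interval_sec)
  raise RuntimeError('Timed out waiting for pod %s.' % pod_name)
```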

    i) for i in range(num_workers)]


    def _pod_is_done(resp: client.V1Pod):
Return type?
This should be addressed with the rebase.
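For reference, the annotated predicate would presumably look something like this; the phase strings are the standard Kubernetes pod phases, and the exact helper may change with the rebase.

```python
from kubernetes import client


def _pod_is_done(resp: client.V1Pod) -> bool:
  # A pod is done once it reaches a terminal phase.
  return resp.status.phase in ('Succeeded', 'Failed')
```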

    else:
      absl.logging.warning(
          "Missing unique_id in executor, using a random id instead.")
      unique_id = test_utils.random_id()
We shouldn't depend on test_utils for non-test code
I'm considering copying the function over or finding another way to generate the random id. @charlesccychen, any suggestions?
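One possibility, sketched here rather than taken from this PR, is to derive the id from `uuid` so that non-test code does not import `test_utils`:

```python
import uuid


def random_id() -> str:
  """Returns a short random suffix suitable for naming worker pods."""
  # Similar in spirit to test_utils.random_id(), but without the
  # test-only dependency.
  return uuid.uuid4().hex[:10]
```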
This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Would be interesting to have this one as an alternative to AI Platform Training.
While I don't think we have the expertise to do this from a first-party perspective, I think this can be a good candidate for a project in TFX Addons [1]. @rcrowe-google WDYT?

[1] https://github.com/tensorflow/tfx-addons
I would agree with that. I'm happy to promote it as well. That said, this would require that the Executor interface be a stable interface in 1.0, which it currently is not.
(tagging GH usernames: CC @1025KB @charlesccychen)
Adds experimental support for multi-worker GPU training on Kubernetes through a custom executor for the Trainer component.
Current support is limited to GKE's GPU node pools and would need to be extended to support other Kubernetes execution platforms.
Sample usage in taxi_pipeline_native_keras.py:

    trainer = Trainer(
        module_file=module_file,
        custom_executor_spec=executor_spec.ExecutorClassSpec(
            kubernetes_trainer_executor.GenericExecutor),
        examples=transform.outputs['transformed_examples'],
        transform_graph=transform.outputs['transform_graph'],
        schema=schema_gen.outputs['schema'],
        train_args=trainer_pb2.TrainArgs(num_steps=1000),
        eval_args=trainer_pb2.EvalArgs(num_steps=150),
        custom_config={
            kubernetes_trainer_executor.TRAINING_ARGS_KEY: {
                'num_workers': 4,
                'num_gpus_per_worker': 1
            }
        })
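Note that multi-worker training with `tf.distribute.MultiWorkerMirroredStrategy` relies on a `TF_CONFIG` environment variable on each worker. The sketch below shows the general shape of what each worker pod needs; the pod names, port, and the exact mechanism this executor uses to provide it are illustrative assumptions, not taken from the PR.

```python
import json
import os

# Illustrative TF_CONFIG for worker 0 of a 4-worker job; one of these would be
# provided per pod, with 'index' ranging over the workers.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'worker': [
            'trainer-worker-0:5000',
            'trainer-worker-1:5000',
            'trainer-worker-2:5000',
            'trainer-worker-3:5000',
        ],
    },
    'task': {'type': 'worker', 'index': 0},
})
```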