
Conversation

@Eric-Le-Ge (Contributor) commented Jul 29, 2020

Adds experimental support for multi-worker training with GPUs on Kubernetes, using a custom executor for the Trainer component.
Support is currently limited to GKE's GPU node pools and would need to be extended to cover other Kubernetes execution platforms.

Sample usage in taxi_pipeline_native_keras.py:

trainer = Trainer(
    module_file=module_file,
    custom_executor_spec=executor_spec.ExecutorClassSpec(
        kubernetes_trainer_executor.GenericExecutor),
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=schema_gen.outputs['schema'],
    train_args=trainer_pb2.TrainArgs(num_steps=1000),
    eval_args=trainer_pb2.EvalArgs(num_steps=150),
    custom_config={
        kubernetes_trainer_executor.TRAINING_ARGS_KEY: {
            'num_workers': 4,
            'num_gpus_per_worker': 1
        }
    })

@Eric-Le-Ge (Contributor Author):

@charlesccychen @chuanyu

@charlesccychen (Contributor):

CC: @1025KB

@charlesccychen (Contributor) left a comment:

Thanks!

R: @chuanyu

@@ -0,0 +1,245 @@
# Copyright 2020 Google LLC. All Rights Reserved.
Contributor:

Could you move everything to be under tfx/extensions/experimental/kubernetes/trainer?

command=_COMMAND,
args=job_args,
security_context=client.V1SecurityContext(
privileged=True,
Contributor:

What happens when we don't use privileged?

Contributor Author:

privileged=True grants the container privilege that users' training scripts need in order to use the TF profiler. Since many training scripts include the profiler callback by default (e.g. the cifar10 example), I think it is most convenient to grant this by default.
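For reference, here is a minimal sketch (not the executor's exact code) of how a privileged worker container can be declared with the Kubernetes Python client; the image, command, and GPU limit below are placeholders:

from kubernetes import client

# Sketch of a per-worker container spec. privileged=True grants the container
# the elevated permissions the TF profiler needs; the GPU limit mirrors the
# 'num_gpus_per_worker' setting from the sample usage above.
worker_container = client.V1Container(
    name='worker',
    image='tensorflow/tfx:latest',  # placeholder image
    command=['python', '-m', 'trainer_entrypoint'],  # placeholder command
    security_context=client.V1SecurityContext(privileged=True),
    resources=client.V1ResourceRequirements(limits={'nvidia.com/gpu': 1}),
)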

) if num_gpus_per_worker > 0 else None,
),
],
restart_policy=kube_utils.RestartPolicy.NEVER.value,
Contributor:

What's the rationale for not allowing restarts?

Contributor Author:

I think Kubernetes restarts completed containers by default, so in the event of successful training we want to prevent the pod from running twice; if the pod fails, there is also no need to restart it, since the component would fail as well.
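To illustrate the behavior being discussed, a minimal sketch of a pod spec with restarts disabled (the container details are placeholders):

from kubernetes import client

# Sketch of a worker pod spec. The Pod-level default restartPolicy is 'Always',
# which would re-run the container after it exits; 'Never' keeps a completed
# or failed worker from being started a second time.
pod_spec = client.V1PodSpec(
    containers=[client.V1Container(name='worker', image='tensorflow/tfx:latest')],
    restart_policy='Never',
)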

training_inputs = training_inputs.copy()

json_inputs = artifact_utils.jsonify_artifact_dict(input_dict)
logging.info('json_inputs=\'%s\'.', json_inputs)
Contributor:

Is it useful to keep these logging lines?

Contributor Author:

These were kept for consistency with the CAIP trainer's runner. My personal opinion is that they are not that useful.

return None


def wait_pod(core_api: k8s_client.CoreV1Api,
Contributor:

Could you please rebase this once you are done with the other PRs, since I think this is shared.

Contributor Author:

Right, I think it would be best to rebase this once the Kubernetes Dag Runner is merged, since it contains a lot of changes in kube_utils.

i) for i in range(num_workers)]


def _pod_is_done(resp: client.V1Pod):

Return type?

Contributor Author:

This should be addressed with the rebase.
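For illustration, the annotated signature the reviewer is asking for could look like the following sketch (the body shown here is an assumption about how pod completion is checked, not the PR's actual implementation):

from kubernetes import client

def _pod_is_done(resp: client.V1Pod) -> bool:
  # A pod is considered done once it reaches a terminal phase.
  return resp.status.phase in ('Succeeded', 'Failed')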

else:
absl.logging.warning(
"Missing unique_id in executor, using a random id instead.")
unique_id = test_utils.random_id()

We shouldn't depend on test_utils for non-test code.

Contributor Author:

I am considering copying the function over or finding another way to generate the random ID. @charlesccychen, any suggestions?
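One possible alternative that drops the test_utils dependency, as a sketch (assuming any short unique suffix is acceptable for naming the training pods):

import uuid

def _random_id() -> str:
  # Hypothetical replacement for test_utils.random_id(): a short random
  # hex suffix suitable for making pod names unique.
  return uuid.uuid4().hex[:8]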

@zhitaoli requested review from 1025KB and goutham on August 26, 2020 00:06
@github-actions (bot):

This PR is stale because it has been open for 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.

@github-actions bot added the stale label on Sep 28, 2020
@github-actions bot closed this on Oct 3, 2020
@charlesccychen reopened this on Oct 3, 2020
@github-actions bot closed this on Oct 9, 2020
@casassg (Member) commented Apr 12, 2021

Would be interesting to have this one as an alternative to AI Platform Training.

@zhitaoli (Contributor):

> Would be interesting to have this one as an alternative to AI Platform Training.

While I don't think we have the expertise to do this from a first-party perspective, I think this can be a good candidate for a project in TFX AddOns [1].

[1] https://github.com/tensorflow/tfx-addons

@rcrowe-google WDYT?

@casassg (Member) commented Apr 13, 2021

I would agree with that, and I'm happy to promote it as well. That said, this would require the Executor interface to be a stable interface in 1.0, which it currently is not.

@zhitaoli (Contributor) commented Apr 13, 2021 via email

@casassg (Member) commented Apr 13, 2021

(tagging GH usernames: CC @1025KB @charlesccychen)
