Parallelize membership discovery and train step, in order to improve the elastic performance. #60
Conversation
Thanks for the design doc. This makes sense. We are currently redesigning the architecture of torchelastic for our upcoming release.
This is a much simpler model to understand than what we currently have (which is an in-process agent running inside the user space). Take a sneak peek at this unit test: https://github.com/pytorch/elastic/blob/master/test/agent/server/local_elastic_agent_test.py. In this new design, re-rendezvous naturally runs in parallel to the train_step.
Do you have a timeline for the new release? I am eager to read about the new design and integrate it into our framework.
The agent is already committed (https://github.com/pytorch/elastic/blob/master/torchelastic/agent/server/local_elastic_agent.py#L52), and I just published a PR with the launcher, which is completely compatible with and similar in usage to torch.distributed.launch. Check out the docs here: https://github.com/pytorch/elastic/pull/65/files#diff-d337650690ddced88d1c0c7187c979f9R17. Would love your feedback. Would you be open to sharing your use-case? I'm curious about the setup (cloud: AWS, GCP, or Azure; or on-prem; using k8s or not), the scale of your jobs, and what makes elasticity important for you. It would really help us prioritize features and improve the user experience.
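For a concrete picture of the agent API, here is a rough sketch adapted from the docstring in the linked local_elastic_agent.py. Treat the parameter names (fn, args, rdzv_handler, max_restarts, monitor_interval) as assumptions about the API at that commit rather than a stable interface:

```python
# Hedged sketch based on the linked local_elastic_agent.py; parameter names
# are assumptions about the API at the time and may have changed since.
from torchelastic.agent.server.api import WorkerSpec
from torchelastic.agent.server.local_elastic_agent import LocalElasticAgent


def trainer(arg):
    # User-defined training entrypoint; the agent spawns local_world_size
    # copies of this function as worker processes on the node.
    pass


rdzv_handler = ...  # assume an etcd-backed rendezvous handler created elsewhere

spec = WorkerSpec(
    role="trainer",
    local_world_size=8,        # e.g. one worker per GPU on the node
    fn=trainer,
    args=("foo",),
    rdzv_handler=rdzv_handler,
    max_restarts=3,
    monitor_interval=5,
)
agent = LocalElasticAgent(spec, start_method="spawn")
agent.run()  # blocks until workers finish or max_restarts is exhausted
```

As I understand it, the launcher in the PR essentially wraps this setup behind a torch.distributed.launch-style command line.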
Hi @kiukchung, I've just read about your agent feature, and I have some questions/concerns in mind.
Sure, I am glad to share some info about our use-case. I am on the Scheduling Team of an internal cluster service at Microsoft. We used to offer scheduling for traditional/general jobs; nowadays, we are moving to support AI workloads. It's still in the POC phase. IMO, elasticity is one of the most important features we need, since it could offer us a lot of capabilities. I will list some examples.
There are many more benefits to elasticity, but we have to consider the following things.
Ideally, for a given job you would use homogeneous nodes (even though the cluster itself is heterogeneous). This is especially true with GPUs, as you never want to mix GPU architectures or the number of GPUs per node.
Yes, this is for performance reasons. The two most important ones are:
The existing APIs:
Thanks for sharing your use-case!
Happy to discuss more; feel free to PM me at kiuk@fb.com so that we can set up a meeting. Thanks!
According to my performance tests, the overhead of coordinator.rendezvous_barrier is non-negligible; it makes it impractical for a cluster scheduler to scale workers within minutes.
It is possible to parallelize the rendezvous_barrier and the train_step because, in principle, they are independent of each other.
I propose a simple solution for it, sketched below.
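A minimal sketch of the idea, assuming the 0.1-style coordinator API discussed above: the rendezvous barrier runs on a background thread while training continues with the current membership, and the process group is re-initialized only once discovery completes. BackgroundRendezvous and the membership_changed / reinit_process_group / train_step helpers are hypothetical names used for illustration, not existing torchelastic APIs.

```python
# Illustrative sketch only: coordinator.rendezvous_barrier() and train_step()
# stand in for the torchelastic 0.1-style calls discussed above; everything
# else here is a hypothetical wrapper, not existing torchelastic code.
import threading


class BackgroundRendezvous:
    """Runs membership discovery on a background thread, off the training path."""

    def __init__(self, coordinator):
        self.coordinator = coordinator
        self._result = None
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        # Kick off the (potentially slow) rendezvous without blocking train_step.
        self._thread.start()

    def _run(self):
        self._result = self.coordinator.rendezvous_barrier()

    def poll(self):
        # Non-blocking check: returns the new membership once discovery finishes.
        return self._result if not self._thread.is_alive() else None


def train_loop(coordinator, state):
    rdzv = None
    while not state.done():
        if rdzv is None and membership_changed(coordinator):  # hypothetical signal
            rdzv = BackgroundRendezvous(coordinator)
            rdzv.start()
        if rdzv is not None:
            new_world = rdzv.poll()
            if new_world is not None:
                reinit_process_group(new_world)  # hypothetical re-init helper
                rdzv = None
        # Keep training on the existing membership while discovery proceeds.
        train_step(state)
```

Note that this only hides the discovery latency; workers still pause briefly when the process group is actually re-initialized on the new membership.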