Add dry run mode #2012
Conversation
[ghstack-poisoned]
Summary: The current configuration validation requires torchx and GPUs. It can waste time, resources, and energy. Polar bears are crying. Let's fix this by providing a dry run mode. This PR doesn't verify everything; in theory, we should be able to verify parallelism settings as well. This PR is just a start, but at least it lets us catch typos quickly. ghstack-source-id: 8ffb03d Pull-Request: #2012
tianyu-l left a comment
Wish we could keep train.py and run_train.sh simple..
Okay, there is one minor issue with having a separate DryRunTrainer. More and more applications are subclassing Trainer, so if we create another DryRunTrainer, those applications (e.g., full DTensor) won't benefit. So I guess we will eventually merge this back once we figure out how to do the full dry run with the fake backend (after the DeviceMesh PR is landed).
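Not part of this PR, but a minimal sketch of what the fake-backend path could look like: `FakeStore` and the `"fake"` backend live under `torch.testing._internal`, so treat this as a testing-only illustration, and the world size and mesh shape below are made up.

```python
# Minimal sketch (not from this PR): stand up torch.distributed with the
# "fake" backend so device meshes can be built and validated without GPUs
# or extra processes. FakeStore is an internal testing utility.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.testing._internal.distributed.fake_pg import FakeStore

WORLD_SIZE = 8  # pretend topology; nothing is actually launched

dist.init_process_group(
    backend="fake",
    rank=0,
    world_size=WORLD_SIZE,
    store=FakeStore(),
)

# With the fake group in place, mesh construction (and hence the parallelism
# layout) can be exercised on CPU; an inconsistent dp/tp split fails right here.
mesh = init_device_mesh("cpu", (2, 4), mesh_dim_names=("dp", "tp"))
print(mesh)

dist.destroy_process_group()
```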
Summary: The current configuration validation requires torchx and GPUs. It can waste time, resources, and energy. Polar bears are crying. Let's fix this by providing a dry run mode. This PR doesn't verify everything; in theory, we should be able to verify parallelism settings as well. This PR is just a start, but at least it lets us catch typos quickly. ghstack-source-id: 72f4316 Pull-Request: #2012
[ghstack-poisoned]
Summary: The current configuration validation requires torchx and GPUs. It can waste time, resources, and energy. Polar bears are crying. Let's fix this by providing a dry run mode. This PR doesn't verify everything; in theory, we should be able to verify parallelism settings as well. This PR is just a start, but at least it lets us catch typos quickly. ghstack-source-id: 4e599a7 Pull-Request: #2012
tianyu-l left a comment
Had a suggestion on where to put the dry_run.py file.
Agree that we should switch to fake backend if that works.
Also maybe LocalTensor can be helpful?
nit: Would you consider putting this under torchtitan/tools/dry_run.py (or another second-level directory under torchtitan), or scripts/dry_run.py?
Okay, let's put it under scripts/dry_run.py for now. We should investigate how to merge it back into train.py with LocalTensor or the fake backend anyway.
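For reference, a hypothetical sketch of what a standalone scripts/dry_run.py could do (the actual file in this PR isn't shown here): load the TOML job config on CPU and flag unknown sections and keys, so typos fail fast without torchx or GPUs. The schema below is a made-up subset, a real version would derive it from the JobConfig dataclasses, and tomllib requires Python 3.11+.

```python
# Hypothetical sketch of a standalone config dry run; not the PR's actual
# dry_run.py. It only checks for misspelled sections/keys in the TOML file.
import sys
import tomllib

# Illustrative subset of allowed sections/keys; a real tool would derive this
# from the JobConfig dataclasses rather than hard-coding it.
KNOWN_SCHEMA = {
    "job": {"dump_folder", "description"},
    "training": {"local_batch_size", "seq_len", "steps"},
    "parallelism": {"data_parallel_shard_degree", "tensor_parallel_degree"},
}


def validate(path: str) -> int:
    with open(path, "rb") as f:
        cfg = tomllib.load(f)

    errors = []
    for section, keys in cfg.items():
        if section not in KNOWN_SCHEMA:
            errors.append(f"unknown section [{section}]")
            continue
        for key in keys:
            if key not in KNOWN_SCHEMA[section]:
                errors.append(f"unknown key '{key}' in [{section}]")

    for err in errors:
        print(f"dry-run: {err}", file=sys.stderr)
    return 1 if errors else 0


if __name__ == "__main__":
    sys.exit(validate(sys.argv[1]))
```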
[ghstack-poisoned]
Summary: The current configuration validation requires torchx and GPUs. It can waste time, resources, and energy. Polar bears are crying. Let's fix this by providing a dry run mode. This PR doesn't verify everything; in theory, we should be able to verify parallelism settings as well. This PR is just a start, but at least it lets us catch typos quickly. ghstack-source-id: 99e031f Pull-Request: #2012
[ghstack-poisoned]
Summary: The current configuration validation requires torchx and GPUs. It can waste time, resources, and energy. Polar bears are crying. Let's fix this by providing a dry run mode. This PR doesn't verify everything; in theory, we should be able to verify parallelism settings as well. This PR is just a start, but at least it lets us catch typos quickly. ghstack-source-id: 7c01387 Pull-Request: #2012
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom):
* #2012
* __->__ #2011

It is not correct as JobConfig has changed.
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom):
* __->__ pytorch#2012
* pytorch#2011

Summary: The current configuration validation requires torchx and GPUs. It can waste time, resources, and energy. Polar bears are crying. Let's fix this by providing a dry run mode. This PR doesn't verify everything; in theory, we should be able to verify parallelism settings as well. This PR is just a start, but at least it lets us catch typos quickly.
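As a hedged illustration of the "verify parallelism settings" point above (not code from this PR, and simplified relative to torchtitan's real validation, which for example allows -1 to mean "infer" for some degrees): a dry run could check on CPU that the configured degrees actually factor the intended world size.

```python
# Hypothetical parallelism sanity check for a dry run; the names and the rule
# "dp_shard * tp * pp == world_size" are simplifications, not torchtitan's
# exact validation logic.
def check_parallel_degrees(world_size: int, dp_shard: int, tp: int, pp: int = 1) -> list[str]:
    errors = []
    for name, degree in (("dp_shard", dp_shard), ("tp", tp), ("pp", pp)):
        if degree < 1:
            errors.append(f"{name} degree must be >= 1, got {degree}")
    if not errors and dp_shard * tp * pp != world_size:
        errors.append(
            f"dp_shard * tp * pp = {dp_shard * tp * pp} does not match world_size = {world_size}"
        )
    return errors


# Example: 8 GPUs intended, but dp_shard=3 with tp=4 cannot tile them.
print(check_parallel_degrees(world_size=8, dp_shard=3, tp=4))
# -> ["dp_shard * tp * pp = 12 does not match world_size = 8"]
```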