
Conversation

Contributor

@fegin fegin commented Nov 10, 2025

Stack from ghstack (oldest at bottom):

Summary:
The current configuration validation requires torchx and GPUs. It can waste time, resources, and energy. Polar bears are crying. Let's fix this by providing a dry run mode. This PR doesn't verify everything; in theory, we should be able to verify parallelism settings as well. This PR is just a start, but it at least lets us catch typos quickly.
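
As a rough illustration of the idea (not the code added by this PR), a dry run of this kind can catch unknown config keys with nothing but CPU-side parsing; the section name, keys, and config path below are hypothetical:

```python
# Rough illustration of the dry-run idea, not this PR's implementation:
# load a TOML job config and flag keys that don't exist in the schema, so a
# typo is caught without launching torchx or touching a GPU.
import tomllib
from dataclasses import dataclass, fields


@dataclass
class TrainingSection:
    # Hypothetical subset of a [training] schema, for illustration only.
    steps: int = 100
    local_batch_size: int = 8


def unknown_keys(section_name: str, section: dict, schema) -> list[str]:
    known = {f.name for f in fields(schema)}
    return [f"[{section_name}] unknown key: {k!r}" for k in section if k not in known]


with open("debug_model.toml", "rb") as f:  # hypothetical config path
    cfg = tomllib.load(f)

errors = unknown_keys("training", cfg.get("training", {}), TrainingSection)
print("\n".join(errors) if errors else "dry run passed: no unknown keys")
```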

[ghstack-poisoned]
[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 10, 2025
Summary:
The current configuration validation requires torchx and GPUs. It can waste time, resources, and energy. Polar bears are crying. Let's fix this by providing a dry run mode. This PR doesn't verify everything; in theory, we should be able to verify parallelism settings as well. This PR is just a start, but it at least lets us catch typos quickly.


ghstack-source-id: 8ffb03d
Pull-Request: #2012
@meta-cla meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Nov 10, 2025
Contributor

@tianyu-l tianyu-l left a comment

Wish we could keep train.py and run_train.sh simple...

Contributor Author

fegin commented Nov 11, 2025

Okay, there is one minor issue with having a separate DryRunTrainer. More and more applications are subclassing Trainer, so if we create another DryRunTrainer, those applications (e.g., full DTensor) don't benefit from it. So I guess we will eventually merge this back once we figure out how to do the full dry run with the fake backend (after the DeviceMesh PR lands).
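
For reference, a minimal sketch of the fake-backend direction mentioned above (an assumption about the eventual approach, not part of this PR): PyTorch's fake process group lets distributed init and device-mesh construction run in a single CPU process.

```python
# Sketch of the fake-backend direction (an assumption, not this PR's code):
# initialize torch.distributed with the fake process group so mesh setup can
# be exercised on one CPU process, without GPUs or real collectives.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.testing._internal.distributed.fake_pg import FakeStore  # importing this registers the "fake" backend

dist.init_process_group(backend="fake", store=FakeStore(), rank=0, world_size=8)

# Collectives are no-ops on the fake backend, so building a 2x4 mesh here only
# exercises the bookkeeping (group creation, dimension sizes), not real comms.
mesh = init_device_mesh("cpu", (2, 4), mesh_dim_names=("dp", "tp"))
print(mesh)

dist.destroy_process_group()
```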

[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 11, 2025
Summary:
The current configuration validation requires torchx and GPUs. It can waste time, resources, and energy. Polar bears are crying. Let's fix this by providing a dry run mode. This PR doesn't verify everything; in theory, we should be able to verify parallelism settings as well. This PR is just a start, but it at least lets us catch typos quickly.

ghstack-source-id: 72f4316
Pull-Request: #2012
[ghstack-poisoned]
[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 11, 2025
Summary:
The current configuration validation requires torchx and GPUs. It can waste time, resources, and energy. Polar bears are crying. Let's fix this by providing a dry run mode. This PR doesn't verify everything; in theory, we should be able to verify parallelism settings as well. This PR is just a start, but it at least lets us catch typos quickly.

ghstack-source-id: 4e599a7
Pull-Request: #2012
@fegin fegin requested a review from tianyu-l November 11, 2025 02:01
Contributor

@tianyu-l tianyu-l left a comment

Had a suggestion on where to put the dry_run.py file.

Agree that we should switch to fake backend if that works.

Also, maybe LocalTensor could be helpful?

Contributor

nit: Would you consider putting this under torchtitan/tools/dry_run.py (or another 2nd-level directory under torchtitan), or under scripts/dry_run.py?

Contributor Author

Okay, let's put it under scripts/dry_run.py for now. We should investigate how to merge it back into train.py with LocalTensor or the fake backend anyway.
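
Beyond typo checking, the summary's point about verifying parallelism settings could also be done on CPU. A toy sketch of such a check (the degree names and values below are illustrative, not from this PR):

```python
# Toy sketch of a CPU-only parallelism check (illustrative names and values,
# not this PR's code): the product of the parallel degrees must match the
# world size the job is launched with.
import math

degrees = {"dp_replicate": 2, "dp_shard": 2, "tp": 2, "pp": 1}  # example values
world_size = 8

product = math.prod(degrees.values())
if product != world_size:
    raise SystemExit(
        f"parallelism mismatch: {' x '.join(map(str, degrees.values()))} = {product}, "
        f"but world_size = {world_size}"
    )
print("parallelism degrees are consistent with world_size")
```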

[ghstack-poisoned]
[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 11, 2025
Summary:
The current configuration validation requires torchx and GPUs. It can waste time, resources, and energy. Polar bears are crying. Let's fix this by providing a dry run mode. This PR doesn't verify everything; in theory, we should be able to verify parallelism settings as well. This PR is just a start, but it at least lets us catch typos quickly.

ghstack-source-id: 99e031f
Pull-Request: #2012
[ghstack-poisoned]
[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 11, 2025
Summary:
The current configuration validation requires torchx and GPUs. It can waste time, resources, and energy. Polar bears are crying. Let's fix this by providing a dry run mode. This PR doesn't verify everything; in theory, we should be able to verify parallelism settings as well. This PR is just a start, but it at least lets us catch typos quickly.

ghstack-source-id: 7c01387
Pull-Request: #2012
fegin added a commit that referenced this pull request Nov 11, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2012
* __->__ #2011

It is not correct as JobConfig has changed.
@fegin fegin changed the base branch from gh/fegin/29/base to main November 11, 2025 17:44
@fegin fegin merged commit f5d2b18 into main Nov 11, 2025
7 checks passed
ahoffman-aws pushed a commit to drcanchi-aws/torchtitan that referenced this pull request Nov 11, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* pytorch#2012
* __->__ pytorch#2011

It is not correct as JobConfig has changed.
ahoffman-aws pushed a commit to drcanchi-aws/torchtitan that referenced this pull request Nov 11, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ pytorch#2012
* pytorch#2011

Summary:
The current configuration validation requires torchx and GPUs. It can waste time, resources, and energy. Polar bears are crying. Let's fix this by providing a dry run mode. This PR doesn't verify everything; in theory, we should be able to verify parallelism settings as well. This PR is just a start, but it at least lets us catch typos quickly.
