
Conversation

Contributor

@fegin fegin commented Nov 10, 2025

Stack from ghstack (oldest at bottom):

Summary:
The current configuration validation requires torchx and GPUs. It can waste time, resources, and energy. Polar bears are crying. Let's fix this by providing a dry run mode. This PR doesn't verify everything; in theory, we should be able to verify parallelism settings as well. This PR is just a start, but it at least lets us catch typos quickly.
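
As a rough illustration of the idea (not the code added by this PR), a dry run of this kind can catch unknown config keys with nothing but CPU-side parsing; the section name, keys, and config path below are hypothetical:

```python
# Rough illustration of the dry-run idea, not this PR's implementation:
# load a TOML job config and flag keys that don't exist in the schema, so a
# typo is caught without launching torchx or touching a GPU.
import tomllib
from dataclasses import dataclass, fields


@dataclass
class TrainingSection:
    # Hypothetical subset of a [training] schema, for illustration only.
    steps: int = 100
    local_batch_size: int = 8


def unknown_keys(section_name: str, section: dict, schema) -> list[str]:
    known = {f.name for f in fields(schema)}
    return [f"[{section_name}] unknown key: {k!r}" for k in section if k not in known]


with open("debug_model.toml", "rb") as f:  # hypothetical config path
    cfg = tomllib.load(f)

errors = unknown_keys("training", cfg.get("training", {}), TrainingSection)
print("\n".join(errors) if errors else "dry run passed: no unknown keys")
```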

[ghstack-poisoned]
[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 10, 2025
Summary:
The current configuration validation requires torchx and GPUs. It can waste time, resources, and energy. Polar bears are crying. Let's fix this by providing a dry run mode. This PR doesn't verify everything; in theory, we should be able to verify parallelism settings as well. This PR is just a start, but it at least lets us catch typos quickly.


ghstack-source-id: 8ffb03d
Pull-Request: #2012
@meta-cla meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Nov 10, 2025
Contributor

@tianyu-l tianyu-l left a comment

Wish we could keep train.py and run_train.sh simple...

Contributor Author

fegin commented Nov 11, 2025

Okay, there is one minor issue with having a separate DryRunTrainer. More and more applications are subclassing Trainer, so if we create another DryRunTrainer, those applications (e.g., full DTensor) don't benefit from it. So I guess we will eventually merge this back once we figure out how to do the full dry run with the fake backend (after the DeviceMesh PR lands).
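
For reference, a minimal sketch of the fake-backend direction mentioned above (an assumption about the eventual approach, not part of this PR): PyTorch's fake process group lets distributed init and device-mesh construction run in a single CPU process.

```python
# Sketch of the fake-backend direction (an assumption, not this PR's code):
# initialize torch.distributed with the fake process group so mesh setup can
# be exercised on one CPU process, without GPUs or real collectives.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.testing._internal.distributed.fake_pg import FakeStore  # importing this registers the "fake" backend

dist.init_process_group(backend="fake", store=FakeStore(), rank=0, world_size=8)

# Collectives are no-ops on the fake backend, so building a 2x4 mesh here only
# exercises the bookkeeping (group creation, dimension sizes), not real comms.
mesh = init_device_mesh("cpu", (2, 4), mesh_dim_names=("dp", "tp"))
print(mesh)

dist.destroy_process_group()
```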

[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 11, 2025
Summary:
The current configuration validation requires torchx and GPUs. It can waste time, resources, and energy. Polar bears are crying. Let's fix this by providing a dry run mode. This PR doesn't verify everything; in theory, we should be able to verify parallelism settings as well. This PR is just a start, but it at least lets us catch typos quickly.

ghstack-source-id: 72f4316
Pull-Request: #2012
[ghstack-poisoned]
[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 11, 2025
Summary:
The current configuration validation requires torchx and GPUs. It can waste time, resources, and energy. Polar bears are crying. Let's fix this by providing a dry run mode. This PR doesn't verify everything; in theory, we should be able to verify parallelism settings as well. This PR is just a start, but it at least lets us catch typos quickly.

ghstack-source-id: 4e599a7
Pull-Request: #2012
@fegin fegin requested a review from tianyu-l November 11, 2025 02:01
Contributor

@tianyu-l tianyu-l left a comment

Had a suggestion on where to put the dry_run.py file.

Agree that we should switch to fake backend if that works.

Also, maybe LocalTensor could be helpful?

Contributor

nit: Would you consider putting this under torchtitan/tools/dry_run.py (or another 2nd-level directory under torchtitan), or under scripts/dry_run.py?

Contributor Author

Okay, let's put it under scripts/dry_run.py for now. We should investigate how to merge it back into train.py with LocalTensor or the fake backend anyway.
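
Beyond typo checking, the summary's point about verifying parallelism settings could also be done on CPU. A toy sketch of such a check (the degree names and values below are illustrative, not from this PR):

```python
# Toy sketch of a CPU-only parallelism check (illustrative names and values,
# not this PR's code): the product of the parallel degrees must match the
# world size the job is launched with.
import math

degrees = {"dp_replicate": 2, "dp_shard": 2, "tp": 2, "pp": 1}  # example values
world_size = 8

product = math.prod(degrees.values())
if product != world_size:
    raise SystemExit(
        f"parallelism mismatch: {' x '.join(map(str, degrees.values()))} = {product}, "
        f"but world_size = {world_size}"
    )
print("parallelism degrees are consistent with world_size")
```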

[ghstack-poisoned]
[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 11, 2025
Summary:
The current configuration validation requires torchx and GPUs. It can waste time, resources, and energy. Polar bears are crying. Let's fix this by providing a dry run mode. This PR doesn't verify everything; in theory, we should be able to verify parallelism settings as well. This PR is just a start, but it at least lets us catch typos quickly.

ghstack-source-id: 99e031f
Pull-Request: #2012
[ghstack-poisoned]
[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 11, 2025
Summary:
The current configuration validation requires torchx and GPUs. It can waste time, resources, and energy. Polar bears are crying. Let's fix this by providing a dry run mode. This PR doesn't verify everything; in theory, we should be able to verify parallelism settings as well. This PR is just a start, but it at least lets us catch typos quickly.

ghstack-source-id: 7c01387
Pull-Request: #2012
fegin added a commit that referenced this pull request Nov 11, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2012
* __->__ #2011

It is not correct as JobConfig has changed.
@fegin fegin changed the base branch from gh/fegin/29/base to main November 11, 2025 17:44
@fegin fegin merged commit f5d2b18 into main Nov 11, 2025
7 checks passed
ahoffman-aws pushed a commit to drcanchi-aws/torchtitan that referenced this pull request Nov 11, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* pytorch#2012
* __->__ pytorch#2011

It is not correct as JobConfig has changed.
ahoffman-aws pushed a commit to drcanchi-aws/torchtitan that referenced this pull request Nov 11, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ pytorch#2012
* pytorch#2011

Summary:
The current configuration validation requires torchx and GPUs. It can waste time, resources, and energy. Polar bears are crying. Let's fix this by providing a dry run mode. This PR doesn't verify everything; in theory, we should be able to verify parallelism settings as well. This PR is just a start, but it at least lets us catch typos quickly.
