Conversation

@fegin (Contributor) commented Nov 17, 2025

Stack from ghstack (oldest at bottom):

This PR provides a skeleton for fully DTensor-based training.

This PR introduces an initial prototype and skeleton for fully DTensor-based training. The current code builds on SimpleFSDP, but we anticipate developing our own parameterization to better serve our use case, as SimpleFSDP's parameterization is insufficient in several ways. For instance, the `parallelize_buffers()` implementation in this PR will not function correctly once additional parallelization strategies are applied (see the sketch below). Despite these limitations, this PR provides a starting point for experimenting with a full DTensor trainer.
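To illustrate the buffer issue: a naive buffer-parallelization pass simply replicates every buffer over the device mesh as a DTensor. Below is a minimal sketch of that idea, not the PR's actual code; the function name, the `mesh` argument, and the replicate-everything policy are assumptions for illustration.

```python
import torch
from torch.distributed.tensor import Replicate, distribute_tensor


def parallelize_buffers_naive(module: torch.nn.Module, mesh) -> None:
    """Replace every buffer with a DTensor replicated over `mesh`.

    Hypothetical sketch: once another strategy (e.g. tensor parallelism)
    expects some buffers to be sharded on specific dims, a single
    replicate-everything pass like this stops being correct.
    """
    for submodule in module.modules():
        for name, buf in list(submodule.named_buffers(recurse=False)):
            dbuf = distribute_tensor(buf, mesh, [Replicate()] * mesh.ndim)
            submodule.register_buffer(name, dbuf)
```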

Accuracy verification (HSDP, SimpleFSDP vs. FSDP2):

```
python3 scripts/loss_compare.py . . \
--baseline-options='--activation_checkpoint.mode="none" --parallelism.data_parallel_replicate_degree=2' \
--test-options='--model.name full_dtensor.llama3 --activation_checkpoint.mode="none" --parallelism.data_parallel_replicate_degree=2' \
--test-train-file=torchtitan.experiments.full_dtensor.train \
--steps=10 --assert-equal --no-seed-checkpoint
```
```
[LOSS_COMPARE]
[LOSS_COMPARE] Asserting losses are equal...
[LOSS_COMPARE] Baseline log: /tmp/baseline_training.log
[LOSS_COMPARE] Test log: /tmp/test_training.log
[LOSS_COMPARE] Extracted 100 steps from baseline log
[LOSS_COMPARE] Extracted 100 steps from test log
test_losses_equal
(__main__.assert_losses_equal.<locals>.LossEqualityTest.test_losses_equal)
... ok

----------------------------------------------------------------------
Ran 1 test in 0.000s

OK
```
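The `--assert-equal` path shown in the log wraps the comparison in a small unittest. Below is a rough sketch of the idea, based only on the names visible in the output above (`assert_losses_equal`, `LossEqualityTest`, `test_losses_equal`); the log-parsing regex and exact loss format are assumptions, not the script's real implementation.

```python
import re
import unittest

# Assumed log line format, e.g. "step: 10 loss: 7.1234"; the real
# loss_compare.py may parse its training logs differently.
LOSS_RE = re.compile(r"step:\s*(\d+).*?loss:\s*([\d.]+)")


def extract_losses(log_path: str) -> dict[int, float]:
    """Map training step -> loss, parsed from a training log."""
    losses: dict[int, float] = {}
    with open(log_path) as f:
        for line in f:
            if m := LOSS_RE.search(line):
                losses[int(m.group(1))] = float(m.group(2))
    return losses


def assert_losses_equal(baseline_log: str, test_log: str) -> None:
    baseline = extract_losses(baseline_log)
    test = extract_losses(test_log)

    class LossEqualityTest(unittest.TestCase):
        def test_losses_equal(self):
            self.assertEqual(baseline.keys(), test.keys())
            for step, loss in baseline.items():
                # Exact equality, not a tolerance check.
                self.assertEqual(loss, test[step], f"mismatch at step {step}")

    suite = unittest.TestLoader().loadTestsFromTestCase(LossEqualityTest)
    unittest.TextTestRunner(verbosity=2).run(suite)
```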

Note that `--no-seed-checkpoint` is used because we observed an accuracy mismatch when a seed checkpoint was used.

fegin added a commit that referenced this pull request Nov 17, 2025
ghstack-source-id: 67cd703
Pull-Request: #2049
@meta-cla bot added the CLA Signed label Nov 17, 2025
@fegin fegin marked this pull request as draft November 17, 2025 22:25
fegin added a commit that referenced this pull request Nov 17, 2025
ghstack-source-id: 6cf9b5e
Pull-Request: #2049
fegin added a commit that referenced this pull request Nov 18, 2025
ghstack-source-id: 0d3e3f0
Pull-Request: #2049