Conversation

@fegin (Contributor) commented Nov 17, 2025

Stack from ghstack (oldest at bottom):

This PR provides a skeleton for fully DTensor-based training.

This PR introduces an initial prototype and skeleton for fully DTensor-based training. The current code builds on SimpleFSDP, but we anticipate developing our own parameterization to better serve our use case, as SimpleFSDP's parameterization is insufficient in several ways. For instance, the `parallelize_buffers()` implementation in this PR will not function correctly once additional parallelization strategies are applied (see the sketch below). Despite these limitations, this PR provides a starting point for experimenting with a full DTensor trainer.
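To illustrate the buffer issue: a naive buffer-parallelization pass simply replicates every buffer over the device mesh as a DTensor. Below is a minimal sketch of that idea, not the PR's actual code; the function name, the `mesh` argument, and the replicate-everything policy are assumptions for illustration.

```python
import torch
from torch.distributed.tensor import Replicate, distribute_tensor


def parallelize_buffers_naive(module: torch.nn.Module, mesh) -> None:
    """Replace every buffer with a DTensor replicated over `mesh`.

    Hypothetical sketch: once another strategy (e.g. tensor parallelism)
    expects some buffers to be sharded on specific dims, a single
    replicate-everything pass like this stops being correct.
    """
    for submodule in module.modules():
        for name, buf in list(submodule.named_buffers(recurse=False)):
            dbuf = distribute_tensor(buf, mesh, [Replicate()] * mesh.ndim)
            submodule.register_buffer(name, dbuf)
```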

Accuracy verification (HSDP, SimpleFSDP vs. FSDP2):

```
python3 scripts/loss_compare.py . . \
--baseline-options='--activation_checkpoint.mode="none" --parallelism.data_parallel_replicate_degree=2' \
--test-options='--model.name full_dtensor.llama3 --activation_checkpoint.mode="none" --parallelism.data_parallel_replicate_degree=2' \
--test-train-file=torchtitan.experiments.full_dtensor.train \
--steps=10 --assert-equal --no-seed-checkpoint
```
```
[LOSS_COMPARE]
[LOSS_COMPARE] Asserting losses are equal...
[LOSS_COMPARE] Baseline log: /tmp/baseline_training.log
[LOSS_COMPARE] Test log: /tmp/test_training.log
[LOSS_COMPARE] Extracted 100 steps from baseline log
[LOSS_COMPARE] Extracted 100 steps from test log
test_losses_equal
(__main__.assert_losses_equal.<locals>.LossEqualityTest.test_losses_equal)
... ok

----------------------------------------------------------------------
Ran 1 test in 0.000s

OK
```
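The `--assert-equal` path shown in the log wraps the comparison in a small unittest. Below is a rough sketch of the idea, based only on the names visible in the output above (`assert_losses_equal`, `LossEqualityTest`, `test_losses_equal`); the log-parsing regex and exact loss format are assumptions, not the script's real implementation.

```python
import re
import unittest

# Assumed log line format, e.g. "step: 10 loss: 7.1234"; the real
# loss_compare.py may parse its training logs differently.
LOSS_RE = re.compile(r"step:\s*(\d+).*?loss:\s*([\d.]+)")


def extract_losses(log_path: str) -> dict[int, float]:
    """Map training step -> loss, parsed from a training log."""
    losses: dict[int, float] = {}
    with open(log_path) as f:
        for line in f:
            if m := LOSS_RE.search(line):
                losses[int(m.group(1))] = float(m.group(2))
    return losses


def assert_losses_equal(baseline_log: str, test_log: str) -> None:
    baseline = extract_losses(baseline_log)
    test = extract_losses(test_log)

    class LossEqualityTest(unittest.TestCase):
        def test_losses_equal(self):
            self.assertEqual(baseline.keys(), test.keys())
            for step, loss in baseline.items():
                # Exact equality, not a tolerance check.
                self.assertEqual(loss, test[step], f"mismatch at step {step}")

    suite = unittest.TestLoader().loadTestsFromTestCase(LossEqualityTest)
    unittest.TextTestRunner(verbosity=2).run(suite)
```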

Note that `--no-seed-checkpoint` is used because we observed an accuracy mismatch when a seed checkpoint was used.

fegin added a commit that referenced this pull request Nov 17, 2025
ghstack-source-id: 67cd703
Pull-Request: #2049
@meta-cla bot added the CLA Signed label Nov 17, 2025
@fegin fegin marked this pull request as draft November 17, 2025 22:25
fegin added a commit that referenced this pull request Nov 17, 2025
ghstack-source-id: 6cf9b5e
Pull-Request: #2049
fegin added a commit that referenced this pull request Nov 18, 2025
ghstack-source-id: 0d3e3f0
Pull-Request: #2049