Conversation

Contributor

@wwwjn wwwjn commented Oct 7, 2025

Benchmarking

| Step | Time | Log |
| -- | -- | -- |
| to_hf() | 0.1103s | [trainer0\|0]:[titan] 2025-10-03 17:07:45,697 - root - INFO - Completed to_hf conversion, generated 189 keys, duration: 0.1103s |
| Split local GroupedExperts DTensor into individual experts' weights | 0.008s per layer per matrix (58 MoE layers * 3 weight matrices per layer) | [trainer0\|0]:[titan] 2025-10-03 17:07:45,697 - root - INFO - Completed _get_local_experts_weights for layer 6, abstract_key: model.layers.{}.mlp.experts.{}.up_proj.weight, duration: 0.0082s |
| dcp.load() (thread count = 4) | 193.20s | [trainer0\|0]:[titan] 2025-10-03 17:10:58,899 - root - INFO - dcp.load with HuggingFaceStorageReader completed in 193.20 seconds |
| from_hf() | 0.48s | [trainer0\|0]:[titan] 2025-10-03 17:10:59,378 - root - INFO - Completed from_hf conversion, processed 189 keys, duration: 0.4787s |
| Concatenate individual expert weights into GroupedExperts weight | 0.01s per layer per matrix (58 MoE layers * 3 weight matrices) | [trainer0\|0]:[titan] 2025-10-03 17:10:59,120 - root - INFO - Completed _concatenate_expert_weights_dtensor for layer 5, abstract_key: layers.{}.moe.experts.w2, duration: 0.0142s |
| Total | 193.87s | [trainer0\|0]:[titan] 2025-10-03 17:10:59,458 - root - INFO - Finished loading the checkpoint in 193.87 seconds. |
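The split and concatenate steps in the table convert between a stacked GroupedExperts weight (one tensor holding all experts) and the per-expert 2-D weights that HF checkpoints store. A minimal sketch of the idea, with illustrative function names that are not torchtitan's actual API:

```python
import torch

# Hypothetical sketch: a GroupedExperts weight stacks all experts into one
# tensor of shape (num_experts, out_dim, in_dim). HF checkpoints store one
# (out_dim, in_dim) weight per expert, so load/save needs a split and a
# concatenate step, analogous to _get_local_experts_weights and
# _concatenate_expert_weights_dtensor in the table above.

def split_grouped_experts(grouped: torch.Tensor) -> list[torch.Tensor]:
    # One (out_dim, in_dim) weight per expert; unbind returns views, no copy.
    return list(grouped.unbind(dim=0))

def concatenate_expert_weights(experts: list[torch.Tensor]) -> torch.Tensor:
    # Rebuild the stacked (num_experts, out_dim, in_dim) tensor.
    return torch.stack(experts, dim=0)

grouped = torch.randn(8, 4, 3)  # 8 experts, toy dimensions
per_expert = split_grouped_experts(grouped)
rebuilt = concatenate_expert_weights(per_expert)
assert torch.equal(rebuilt, grouped)  # round-trips losslessly
```

The real code additionally has to handle DTensor sharding across EP ranks, which is where the per-layer cost in the table comes from.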

End-to-End verification for 671B model

Parallelism: FSDP=32, PP=8, 1F1B, EP=32

[Screenshot 2025-10-06 at 8 32 37 PM] [Screenshot 2025-10-06 at 8 32 54 PM]
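For reference, the parallelism degrees above would be expressed in a torchtitan-style TOML config roughly as follows (the section and key names here are assumptions for illustration; check the actual config schema):

```toml
[parallelism]
data_parallel_shard_degree = 32     # FSDP=32
pipeline_parallel_degree = 8        # PP=8
pipeline_parallel_schedule = "1F1B"
expert_parallel_degree = 32         # EP=32
```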

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 7, 2025
@wwwjn wwwjn requested a review from tianyu-l October 7, 2025 19:47
Comment on lines +90 to +91
target_dtype=torch.float32,
block_size=BLOCK_SIZE,
Contributor
Should these two be configurable? If not, we can remove these two lines and use the defaults.

Contributor Author

Do you mean block_size and target_dtype? The PyTorch default values are

thread_count: int = 1,
target_dtype: torch.dtype = torch.float32,
block_size: int = 128,

I explicitly left block_size here to make the dequantize algorithm less mysterious: the user can easily see that it is block-wise dequantized with block size 128.
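To make the block-wise idea concrete, here is a minimal sketch of block-wise dequantization, not PyTorch's actual HuggingFaceStorageReader implementation: each block_size x block_size tile of the quantized weight is scaled by its own per-tile scale factor.

```python
import torch

BLOCK_SIZE = 128  # the block size left explicit in the PR

def dequantize_blockwise(q: torch.Tensor, scales: torch.Tensor,
                         block_size: int = BLOCK_SIZE,
                         target_dtype: torch.dtype = torch.float32) -> torch.Tensor:
    # q: (rows, cols) quantized weight; scales: (n_row_blocks, n_col_blocks),
    # one scale per (block_size x block_size) tile. Illustrative only.
    rows, cols = q.shape
    out = q.to(target_dtype).clone()
    n_row_blocks = (rows + block_size - 1) // block_size
    n_col_blocks = (cols + block_size - 1) // block_size
    for i in range(n_row_blocks):
        for j in range(n_col_blocks):
            r0, r1 = i * block_size, min((i + 1) * block_size, rows)
            c0, c1 = j * block_size, min((j + 1) * block_size, cols)
            # Rescale this tile by its per-block scale factor.
            out[r0:r1, c0:c1] *= scales[i, j]
    return out
```

With block_size=128, a 256x256 weight carries a 2x2 grid of scales; leaving the parameter visible at the call site documents exactly that granularity.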

@wwwjn wwwjn merged commit a6f0cfc into main Oct 8, 2025
8 checks passed
@tianyu-l tianyu-l deleted the remove-dequant branch October 8, 2025 18:35

Was this intended to stay here? It looks like a debugging change left in this PR by mistake. The correct number of layers looks like 61 to me.

Contributor Author

My bad, you are right; let me fix this configuration. Thanks for pointing it out.

wwwjn added a commit that referenced this pull request Oct 9, 2025
Fix the number of layer issue introduced by #1804
githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request Oct 13, 2025
…ch#1804)

githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request Oct 13, 2025
Fix the number of layer issue introduced by pytorch#1804
githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request Oct 15, 2025
…ch#1804)
