Conversation

Contributor

@wwwjn wwwjn commented Oct 7, 2025

Benchmarking

| Step | Time | Log |
| -- | -- | -- |
| to_hf() | 0.1103s | [trainer0\|0]:[titan] 2025-10-03 17:07:45,697 - root - INFO - Completed to_hf conversion, generated 189 keys, duration: 0.1103s |
| Split local GroupedExperts DTensor into individual experts' weights | 0.008s per layer per matrix (58 MoE layers * 3 weight matrices per layer) | [trainer0\|0]:[titan] 2025-10-03 17:07:45,697 - root - INFO - Completed _get_local_experts_weights for layer 6, abstract_key: model.layers.{}.mlp.experts.{}.up_proj.weight, duration: 0.0082s |
| dcp.load() (thread count = 4) | 193.20s | [trainer0\|0]:[titan] 2025-10-03 17:10:58,899 - root - INFO - dcp.load with HuggingFaceStorageReader completed in 193.20 seconds |
| from_hf() | 0.48s | [trainer0\|0]:[titan] 2025-10-03 17:10:59,378 - root - INFO - Completed from_hf conversion, processed 189 keys, duration: 0.4787s |
| Concatenate individual expert weights into GroupedExperts weight | 0.01s per layer per matrix (58 MoE layers * 3 weight matrices) | [trainer0\|0]:[titan] 2025-10-03 17:10:59,120 - root - INFO - Completed _concatenate_expert_weights_dtensor for layer 5, abstract_key: layers.{}.moe.experts.w2, duration: 0.0142s |
| Total | 193.87s | [trainer0\|0]:[titan] 2025-10-03 17:10:59,458 - root - INFO - Finished loading the checkpoint in 193.87 seconds. |
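The split and concatenate steps in the table convert between a stacked GroupedExperts weight (one tensor holding all experts) and the per-expert 2-D weights that HF checkpoints store. A minimal sketch of the idea, with illustrative function names that are not torchtitan's actual API:

```python
import torch

# Hypothetical sketch: a GroupedExperts weight stacks all experts into one
# tensor of shape (num_experts, out_dim, in_dim). HF checkpoints store one
# (out_dim, in_dim) weight per expert, so load/save needs a split and a
# concatenate step, analogous to _get_local_experts_weights and
# _concatenate_expert_weights_dtensor in the table above.

def split_grouped_experts(grouped: torch.Tensor) -> list[torch.Tensor]:
    # One (out_dim, in_dim) weight per expert; unbind returns views, no copy.
    return list(grouped.unbind(dim=0))

def concatenate_expert_weights(experts: list[torch.Tensor]) -> torch.Tensor:
    # Rebuild the stacked (num_experts, out_dim, in_dim) tensor.
    return torch.stack(experts, dim=0)

grouped = torch.randn(8, 4, 3)  # 8 experts, toy dimensions
per_expert = split_grouped_experts(grouped)
rebuilt = concatenate_expert_weights(per_expert)
assert torch.equal(rebuilt, grouped)  # round-trips losslessly
```

The real code additionally has to handle DTensor sharding across EP ranks, which is where the per-layer cost in the table comes from.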

End-to-End verification for 671B model

Parallelism: FSDP=32, PP=8, 1F1B, EP=32

[Screenshot 2025-10-06 at 8 32 37 PM] [Screenshot 2025-10-06 at 8 32 54 PM]
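For reference, the parallelism degrees above would be expressed in a torchtitan-style TOML config roughly as follows (the section and key names here are assumptions for illustration; check the actual config schema):

```toml
[parallelism]
data_parallel_shard_degree = 32     # FSDP=32
pipeline_parallel_degree = 8        # PP=8
pipeline_parallel_schedule = "1F1B"
expert_parallel_degree = 32         # EP=32
```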

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 7, 2025
@wwwjn wwwjn requested a review from tianyu-l October 7, 2025 19:47
Comment on lines +90 to +91
target_dtype=torch.float32,
block_size=BLOCK_SIZE,
Contributor
Should these two be configurable? If not, we can remove these two lines and use the defaults.

Contributor Author

Do you mean block_size and target_dtype? The PyTorch default values are

thread_count: int = 1,
target_dtype: torch.dtype = torch.float32,
block_size: int = 128,

I explicitly left block_size here to make the dequantize algorithm less mysterious: the user can easily see that it is block-wise dequantized with block size 128.
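To make the block-wise idea concrete, here is a minimal sketch of block-wise dequantization, not PyTorch's actual HuggingFaceStorageReader implementation: each block_size x block_size tile of the quantized weight is scaled by its own per-tile scale factor.

```python
import torch

BLOCK_SIZE = 128  # the block size left explicit in the PR

def dequantize_blockwise(q: torch.Tensor, scales: torch.Tensor,
                         block_size: int = BLOCK_SIZE,
                         target_dtype: torch.dtype = torch.float32) -> torch.Tensor:
    # q: (rows, cols) quantized weight; scales: (n_row_blocks, n_col_blocks),
    # one scale per (block_size x block_size) tile. Illustrative only.
    rows, cols = q.shape
    out = q.to(target_dtype).clone()
    n_row_blocks = (rows + block_size - 1) // block_size
    n_col_blocks = (cols + block_size - 1) // block_size
    for i in range(n_row_blocks):
        for j in range(n_col_blocks):
            r0, r1 = i * block_size, min((i + 1) * block_size, rows)
            c0, c1 = j * block_size, min((j + 1) * block_size, cols)
            # Rescale this tile by its per-block scale factor.
            out[r0:r1, c0:c1] *= scales[i, j]
    return out
```

With block_size=128, a 256x256 weight carries a 2x2 grid of scales; leaving the parameter visible at the call site documents exactly that granularity.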

@wwwjn wwwjn merged commit a6f0cfc into main Oct 8, 2025
8 checks passed
@tianyu-l tianyu-l deleted the remove-dequant branch October 8, 2025 18:35

Was this intended to stay here? It looks like a debugging change left in this PR by mistake. The correct number of layers looks like 61 to me.

Contributor Author

My bad, you are right; let me fix this configuration. Thanks for pointing it out.

wwwjn added a commit that referenced this pull request Oct 9, 2025
Fix the number of layer issue introduced by #1804
githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request Oct 13, 2025
…ch#1804)

githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request Oct 13, 2025
Fix the number of layer issue introduced by pytorch#1804
githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request Oct 15, 2025
…ch#1804)
