Memory optimization for DSD for TorchTune LoRA #134025
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134025
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 Unrelated Failures)
As of commit 80cef74 with merge base 333890b:
FLAKY - The following job failed but was likely due to flakiness present on trunk.
UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
…2323)

Summary:

**Context:** Currently we have a helper to print out AtenTensor in [shim_common.cpp](https://github.com/pytorch/pytorch/blob/v2.4.0-rc4/torch/csrc/inductor/aoti_torch/shim_common.cpp#L866). The way we were using this function was a "manual" process: we inject this function into the generated output.cpp file, then recompile and reload the file. This diff automates the value-printing process.

**Changes:**
1. Added a simple initial debug printer helper to print out tensor values.
2. Added a filter option to selectively dump tensor values.

**Usage:**

Sample cmd:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, +schedule, output_code" python test/inductor/test_aot_inductor.py -k test_addmm_abi_compatible_cuda
```

Sample outputs:
```
[ before_launch - triton_poi_fused_0 - buf0 ]:
0.6331 1.6358 -0.3459 1.0196 -0.4122 1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0

[ after_launch - triton_poi_fused_0 - buf0 ]:
0.6331 1.6358 -0.3459 1.0196 -0.4122 1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0

[ before_launch - aoti_torch_cuda_addmm_out - buf1 ]:
Min value: -2.25655
Max value: 2.32996
Device: cuda:0
Size: [16, 6]
Stride: [6, 1]
Dtype: float
Layout: Strided
Number of elements: 96
Is contiguous: 1
Requires grad: 0

[ before_launch - aoti_torch_cuda_addmm_out - buf0 ]:
0.6331 1.6358 -0.3459 1.0196 -0.4122 1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0

[ after_launch - aoti_torch_cuda_addmm_out - buf1 ]:
Min value: -12.0839
Max value: 11.6878
Device: cuda:0
Size: [16, 6]
Stride: [6, 1]
Dtype: float
Layout: Strided
Number of elements: 96
Is contiguous: 1
Requires grad: 0

[ after_launch - aoti_torch_cuda_addmm_out - buf0 ]:
0.6331 1.6358 -0.3459 1.0196 -0.4122 1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0

stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('extern_calls', 2)]
.
----------------------------------------------------------------------
Ran 1 test in 10.867s

OK
```

The user can filter which kernels' values are printed by setting the env var `AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT`; the available kernel names are listed in a log message like the one below:
```
torch/_inductor/graph.py:1642] Finished codegen for all nodes. The list of kernel names available: ['triton_poi_fused_0', 'aoti_torch_cuda_addmm_out']
```

In a follow-up diff, `torch.save()` will be added to dump the intermediate tensors into individual `.pt` files that can then be loaded back with `torch.load()`.

Test Plan: Run unit tests in OSS (similar cmd to the usage section above):
`AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, output_code" python test/inductor/test_aot_inductor.py -k test_addmm_abi_compatible_cuda`

Differential Revision: D60538496

Pull Request resolved: pytorch#132323
Approved by: https://github.com/ColinPeppler
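A minimal sketch of driving the same printer from Python rather than from the shell. The environment variable names come from the commit message above; the model, the kernel name used for filtering, the assumption that the filter accepts a plain name string, and the `torch._export.aot_compile` entry point (which may differ across versions) are illustrative only:

```python
import os

# Enable the AOTInductor intermediate-value printer before compilation so the
# generated wrapper prints tensor stats around each kernel launch.
os.environ["AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER"] = "1"
# Optionally restrict printing to specific kernels. Kernel names are listed in
# the "Finished codegen for all nodes" log line; the exact value format for
# this variable is an assumption, not documented behavior.
os.environ["AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT"] = "aoti_torch_cuda_addmm_out"

import torch


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(6, 6)

    def forward(self, x):
        return self.linear(x)


if torch.cuda.is_available():
    model = Model().cuda()
    example_inputs = (torch.randn(16, 6, device="cuda"),)
    # AOT-compile the model; the debug prints are emitted when the compiled
    # artifact is built and run.
    so_path = torch._export.aot_compile(model, example_inputs)
    print(so_path)
```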
if local_state is None:
    continue
elif isinstance(local_state, DTensor):
    local_state_dict[key] = (local_state, full_tensor)
Postpone the `full_tensor` generation to avoid overlapping memory cost.
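A minimal sketch of the idea behind this comment, not the actual distributed state_dict code: rather than calling `full_tensor()` on every DTensor while building the state dict (which keeps all gathered full tensors alive at once), keep the DTensor and materialize each full tensor only where it is consumed. The helper names below are hypothetical, and the DTensor import path varies by release.

```python
from torch.distributed._tensor import DTensor  # torch.distributed.tensor in newer releases


def gather_lazily(local_state_dict):
    """Keep DTensors as-is; do not materialize full tensors eagerly."""
    gathered = {}
    for key, local_state in local_state_dict.items():
        if local_state is None:
            continue
        gathered[key] = local_state  # full_tensor() is postponed
    return gathered


def iterate_full_tensors(gathered):
    """Materialize one full tensor at a time, right before it is used."""
    for key, value in gathered.items():
        if isinstance(value, DTensor):
            value = value.full_tensor()  # all-gather happens here, per tensor
        yield key, value
```

This way the peak memory is roughly one full tensor plus the sharded state, instead of the full unsharded state dict.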
_IncompatibleKeys,

_state_dict_fn(model, "load_state_dict")(
-    state_dict=state_dict, strict=info.strict
+    state_dict=state_dict, strict=info.strict, assign=assign
`assign=True` avoids the memory cost of `model.to_empty(device=device)` for models on the meta device. TorchTune uses `assign=True` already. `load_state_dict` takes `assign=False` by default, so we only set `assign=True` when the device is found to be the meta device.
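A minimal sketch of the `assign=True` path described above, using a plain `nn.Linear` rather than a TorchTune LoRA model: a module built on the meta device has no parameter storage, and `load_state_dict(..., assign=True)` adopts the incoming tensors directly instead of requiring `model.to_empty(device=...)` followed by an in-place copy.

```python
import torch

# Build the module on the meta device: parameters have shapes but no storage.
with torch.device("meta"):
    model = torch.nn.Linear(1024, 1024)

state_dict = {
    "weight": torch.randn(1024, 1024),
    "bias": torch.randn(1024),
}

# With the default assign=False we would first need real storage to copy into,
# e.g. model.to_empty(device="cpu"). assign=True instead swaps the meta
# parameters for the loaded tensors, avoiding the extra allocation.
model.load_state_dict(state_dict, assign=True)
print(model.weight.device)  # cpu: the parameter now shares storage with state_dict
```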
@pytorchbot merge
Merge failed. Reason: This PR needs a `release notes:` label. If not, please add the `topic: not user facing` label. To add a label, you can comment to pytorchbot, for example `@pytorchbot label "topic: not user facing"`. For more information, see the labeling guidance in the PyTorch wiki. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Optimize memory cost at [PR#129635](pytorch#129635)

There are two main parts to the optimization here:
1. Optimize the tensor-distributing part by postponing the full_tensor generation, which avoids the memory overlap and saves around 50% of peak memory in the 2-param test case.
2. Apply `assign=True` for `load_state_dict`, which saves memory during state-dict loading by assigning the input param; around 50% of peak memory at the loading part.

Future work: memory optimization for the optimizer will be conducted in the next PR.

Pull Request resolved: pytorch#134025
Approved by: https://github.com/fegin

Co-authored-by: Rachel Guo <guorachel@meta.com>
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn