[TPU] Call torch._sync(param) during weight loading #9437

WoosukKwon · 2024-10-17T00:10:28Z

During weight loading, we often do something like:

narrowed_tensor = param.data.narrow(0, offset, len)
narrowed_tensor.copy_(real_weight)

expecting narrowed_tensor and param.data to share the same storage. However, on TPUs, the in-place update to narrowed_tensor propagates to the base tensor, param.data, only lazily, which leads to redundant memory usage. This sometimes causes OOM errors during model loading.
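For reference, the expected eager behavior can be checked on CPU with a small sketch. The shapes and values here are made up for illustration and are not taken from vLLM; the sketch only shows the view semantics the loader relies on, not the lazy TPU behavior.

```python
import torch

# Hypothetical stand-ins for a model parameter and one shard of a checkpoint weight.
param = torch.nn.Parameter(torch.zeros(4, 2), requires_grad=False)
real_weight = torch.ones(2, 2)

offset, length = 1, 2
# narrow() returns a view of rows [offset, offset + length) along dim 0.
narrowed_tensor = param.data.narrow(0, offset, length)
# In eager mode the in-place copy writes straight into param's storage.
narrowed_tensor.copy_(real_weight)

# Both tensors point at the same underlying storage (no extra allocation).
shares_storage = (narrowed_tensor.untyped_storage().data_ptr() ==
                  param.data.untyped_storage().data_ptr())
```

After this runs, rows 1 and 2 of param hold the copied values and no second buffer exists. On TPU the same two lines are only recorded lazily until a sync happens.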

This PR addresses this problem by adding a post-hook that calls torch._sync(param) after each param's weight loader is called.
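The post-hook idea can be sketched roughly as follows. This is not the code from the PR: with_post_sync, fake_loader, and fake_sync are hypothetical names, and sync_fn stands in for torch._sync so the sketch runs without a TPU.

```python
import functools
from typing import Any, Callable


def with_post_sync(weight_loader: Callable[..., Any],
                   sync_fn: Callable[[Any], None]) -> Callable[..., Any]:
    """Wrap a weight loader so that sync_fn(param) runs right after it."""

    @functools.wraps(weight_loader)
    def wrapped(param: Any, *args: Any, **kwargs: Any) -> Any:
        out = weight_loader(param, *args, **kwargs)
        # On TPU this is where torch._sync(param) would be called, so the
        # pending in-place update is materialized before the next load.
        sync_fn(param)
        return out

    return wrapped


# Usage with stand-in functions that only record the call order.
calls = []


def fake_loader(param, loaded_weight):
    calls.append(("load", param))


def fake_sync(param):
    calls.append(("sync", param))


loader = with_post_sync(fake_loader, fake_sync)
loader("qkv_proj.weight", [1.0, 2.0])
```

Wrapping the loader, rather than editing every loader body, keeps the sync in one place and guarantees it runs once per loader call.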

When loading Llama3-8B (bf16) on v5e-8:

Before this PR: 3.4 GB allocated after weight loading
After this PR: 2.0 GB allocated after weight loading
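As a rough sanity check on these numbers, the arithmetic below assumes (this is not stated in the PR) that the figures are per chip, that the model has roughly 8 billion parameters, and that the weights are split evenly across the 8 chips.

```python
# Assumptions, not from the PR: ~8e9 parameters, even sharding over 8 chips,
# and decimal gigabytes (1 GB = 1e9 bytes).
num_params = 8.0e9
bytes_per_param = 2      # bf16 uses 2 bytes per value
num_chips = 8            # v5e-8

total_gb = num_params * bytes_per_param / 1e9
per_chip_gb = total_gb / num_chips

# Figures reported in the PR description.
before_gb, after_gb = 3.4, 2.0
saved_gb = before_gb - after_gb
saved_fraction = saved_gb / before_gb

print(f"ideal per-chip weights: {per_chip_gb:.1f} GB")
print(f"saved: {saved_gb:.1f} GB ({saved_fraction:.0%} of the 'before' figure)")
```

Under those assumptions the "after" figure is close to the size of the weights alone, which would be consistent with the redundant copies being gone.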

github-actions · 2024-10-17T00:10:40Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

WoosukKwon · 2024-10-17T00:11:21Z

Thanks @JackCaoG for finding the bug and providing the solution.

JackCaoG · 2024-10-17T00:19:46Z

vllm/model_executor/utils.py

        assert not hasattr(
            weight, key), (f"Overwriting existing tensor attribute: {key}")
+
+        # NOTE(woosuk): For TPU, param.data.copy_(weight) happens lazily,


To be more accurate, this is because in vLLM we do:

narrowed_tensor = param.data.narrow(0, offset, len)
narrowed_tensor.copy_(real_weight)

narrowed_tensor and param.data share the same storage. With functionalization, the in-place update on narrowed_tensor lazily propagates to the base tensor, which is param.data.

Thanks for the elaboration. Fixed the comment!

JackCaoG · 2024-10-17T00:19:59Z

lgtm

mgoin

Thanks for referencing the CT issue, LGTM!

WoosukKwon added 2 commits October 16, 2024 23:57

[TPU] Ensure torch._sync(param) is called after param.data.copy_()

bb7c741

yapf

cf842bd

WoosukKwon added the tpu Related to Google TPUs label Oct 17, 2024

JackCaoG reviewed Oct 17, 2024

JackCaoG approved these changes Oct 17, 2024

This was referenced Oct 17, 2024

[Quantization][TPU] compressed-tensors integration for TPU #9301

Closed

[TPU] Correctly profile peak memory usage & Upgrade PyTorch XLA #9438

Merged

WoosukKwon changed the title ~~[TPU] Ensure torch._sync(param) is called after param.data.copy_()~~ [TPU] Call torch._sync(param) during weight loading Oct 17, 2024

Update comment

f5d8d91

mgoin approved these changes Oct 17, 2024

WoosukKwon merged commit 8e1cddc into main Oct 17, 2024
30 checks passed

WoosukKwon deleted the tpu-sync branch October 17, 2024 16:00

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024

[TPU] Call torch._sync(param) during weight loading (vllm-project#9437)

c2ab3eb

Signed-off-by: Alvant <alvasian@yandex.ru>

garg-amit pushed a commit to garg-amit/vllm that referenced this pull request Oct 28, 2024

[TPU] Call torch._sync(param) during weight loading (vllm-project#9437)

aabc4d1

Signed-off-by: Amit Garg <mitgarg17495@gmail.com>

FerdinandZhong pushed a commit to FerdinandZhong/vllm that referenced this pull request Oct 29, 2024

[TPU] Call torch._sync(param) during weight loading (vllm-project#9437)

441ec16

Signed-off-by: qishuai <ferdinandzhong@gmail.com>

sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024

[TPU] Call torch._sync(param) during weight loading (vllm-project#9437)

7c3ddbb

Signed-off-by: Sumit Dubey <sumit.dubey2@ibm.com>

LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025

[TPU] Call torch._sync(param) during weight loading (vllm-project#9437)

dfc340a

Signed-off-by: LeiWang1999 <leiwang1999@outlook.com>
