enable OffloadedCache on XPU from PyTorch 2.7 #36654

yao-matrix · 2025-03-12T04:11:44Z

XPU is aligning its PyTorch features with CUDA. Since PyTorch 2.6, a device-agnostic torch.Stream is supported, and XPU supports this API, so I enabled OffloadedCache on XPU.

Why start from 2.7? OffloadedCache needs StreamContext, and the PR that adds __enter__ support to StreamContext was not merged in 2.6 but will be in 2.7.
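The version gate described here can be sketched as follows. This is only an illustration of the decision, not the code in this PR: `pick_stream_api` is a hypothetical helper, and it returns the name of the API as a string so the sketch runs without PyTorch installed.

```python
def pick_stream_api(device_type: str, torch_release: tuple) -> str:
    """Name the stream API a prefetching cache would use (illustrative only)."""
    if torch_release >= (2, 7):
        # Per the description above: from 2.7 the device-agnostic stream
        # supports __enter__, so CUDA and XPU can share one code path.
        return "torch.Stream"
    if device_type == "cuda":
        # Older releases: keep the CUDA-specific stream.
        return "torch.cuda.Stream"
    # Assumption of this sketch: other devices are simply unsupported
    # before 2.7.
    raise RuntimeError(
        f"stream-based offloading on {device_type!r} needs PyTorch >= 2.7"
    )
```

With this gate, XPU on 2.6 is rejected up front, while CUDA keeps working on older releases through the legacy path.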

Tested with the PyTorch 2.7 dev package (pip install --pre torch==2.7.0.dev20250306 --index-url https://download.pytorch.org/whl/nightly/xpu).

…tionalGeneration model

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

github-actions · 2025-03-12T04:11:57Z

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

Signed-off-by: root <root@a4bf01945cfe.jf.intel.com>

yao-matrix · 2025-03-12T07:20:24Z

The failed CI cases seem unrelated to my changes.

ydshieh · 2025-03-13T11:27:23Z

Hi @yao-matrix, thank you for adding this support.

Hi @n17s, would you be interested in taking a first look? cc @gante

SunMarc

Looks fine to me overall !

src/transformers/cache_utils.py

tests/utils/test_cache_utils.py

Signed-off-by: N <matrix.yao@intel.com>

n17s

Looks good to me

gante

LGTM, thank you for adding support! 🤗

Added a minor nit with a more recent import guard practice, happy to merge when it's sorted

src/transformers/cache_utils.py

Signed-off-by: root <root@a4bf01945cfe.jf.intel.com>

This reverts commit acf1484.

Signed-off-by: root <root@a4bf01945cfe.jf.intel.com>

SunMarc

Thanks ! LGTM !

loadams · 2025-03-19T21:54:05Z

Hi @yao-matrix and @SunMarc - it looks like running this PR with torch 2.5.0a0+b465a5843b.nv24.9 (from nvcr.io/nvidia/pytorch:24.09-py3) I see the following error:

ImportError: cannot import name 'Replicate' from 'torch.distributed.tensor' (/usr/local/lib/python3.10/dist-packages/torch/distributed/tensor/__init__.py)

Perhaps the guards are on the wrong version of pytorch?
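A version guard is only a proxy for whether a symbol exists. As a minimal sketch (not what this PR or transformers does; `has_symbol` is a hypothetical helper), one could probe for the name directly instead of trusting the version string:

```python
import importlib


def has_symbol(module_name: str, attr: str) -> bool:
    """Return True if `module_name` imports cleanly and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module (or one of its parents) is missing in this environment.
        return False
    return hasattr(module, attr)


# Hypothetical use for the failure reported here (not executed in this sketch):
#   if has_symbol("torch.distributed.tensor", "Replicate"):
#       ...take the code path that imports Replicate...
```

The probe costs an import, but it gives the same answer for release, dev, and vendor builds regardless of how they label themselves.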

yao-matrix · 2025-03-19T23:24:15Z

> Hi @yao-matrix and @SunMarc - it looks like running this PR with torch 2.5.0a0+b465a5843b.nv24.9 (from nvcr.io/nvidia/pytorch:24.09-py3) I see the following error:
> ImportError: cannot import name 'Replicate' from 'torch.distributed.tensor' (/usr/local/lib/python3.10/dist-packages/torch/distributed/tensor/__init__.py)
> Perhaps the guards are on the wrong version of pytorch?

It's weird: my version check is on 2.7, which means that if the version is >= 2.7 it takes the new API, otherwise the old one. But I can see that in your PR you changed PyTorch from 2.5 to 2.6, and both versions take the old path.

loadams · 2025-03-19T23:40:15Z

> > Hi @yao-matrix and @SunMarc - it looks like running this PR with torch 2.5.0a0+b465a5843b.nv24.9 (from nvcr.io/nvidia/pytorch:24.09-py3) I see the following error:
> > ImportError: cannot import name 'Replicate' from 'torch.distributed.tensor' (/usr/local/lib/python3.10/dist-packages/torch/distributed/tensor/__init__.py)
> > Perhaps the guards are on the wrong version of pytorch?
>
> it's weird, my version checks on 2.7, which means if version >= 2.7, goes the new API; else the old. But I can see in your PR you changed the pytorch from 2.5 to 2.6, both versions go the old path.

@yao-matrix - yes, that is quite odd, but I was able to bisect the failure to this PR, so perhaps I'm hitting another code path that this PR enables? It does seem to be resolved by updating the torch version, though.

Changes from huggingface/transformers#36654 in transformers cause issues with the torch 2.5 version we were using. This just updated us to use a newer version.

Signed-off-by: Logan Adams <loadams@microsoft.com>

gante · 2025-03-20T10:29:12Z

@yao-matrix I'm going to revert part of the changes in is_torch_greater_or_equal, as they are breaking other parts of the library. In a nutshell, we can't confirm that all dev versions of 2.X contain the features that will be released in 2.X, which is the error @loadams is seeing (2.5.0a0+b465a5843b.nv24.9 is a dev version of 2.5.0).

@yao-matrix to enable your use case I'm going to add an accept_dev flag to is_torch_greater_or_equal
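To make the trade-off concrete, here is a minimal sketch of an opt-in dev-version check. It is not the implementation that landed: `version_at_least` is a hypothetical stand-in for is_torch_greater_or_equal, it takes the installed version as a string rather than reading it from the environment, and it uses a simplified parser (post-releases are not handled; a real implementation would more likely rely on a PEP 440 parser).

```python
import re

# Leading numeric release segment (e.g. "2.5.0"), then whatever follows it.
_VERSION_RE = re.compile(r"^(\d+(?:\.\d+)*)(.*)$")


def _split(version: str):
    """Return (release_tuple, is_prerelease) for a PEP 440-style string."""
    match = _VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    release = tuple(int(part) for part in match.group(1).split("."))
    suffix = match.group(2)
    # A suffix such as "a0", "rc1" or ".dev20250306" marks a pre-release or
    # dev build; a bare local tag such as "+cu124" does not.
    is_prerelease = bool(suffix) and not suffix.startswith("+")
    return release, is_prerelease


def version_at_least(installed: str, required: str, accept_dev: bool = False) -> bool:
    """Hypothetical check: is `installed` at least `required`?"""
    have, is_prerelease = _split(installed)
    want, _ = _split(required)
    # Pad with zeros so "2.5" and "2.5.0" compare as equal releases.
    width = max(len(have), len(want))
    have += (0,) * (width - len(have))
    want += (0,) * (width - len(want))
    if have != want:
        return have > want
    # Same release number: a dev/pre-release build sorts before the final
    # release, so it only passes when the caller opts in.
    return accept_dev or not is_prerelease
```

Under this sketch, 2.5.0a0+b465a5843b.nv24.9 does not satisfy "2.5" by default, which keeps vendor dev builds on the old code path, while a caller that knows its nightly has the feature (2.7.0.dev20250306 against "2.7") can pass accept_dev=True.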

yao-matrix and others added 6 commits March 6, 2025 15:01

fix "Cannot copy out of meta tensor; no data!" issue for BartForCondi…

d56974c

…tionalGeneration model

Merge branch 'huggingface:main' into main

5451453

Merge branch 'main' into main

ddd9443

follow Marc's suggestion to use _tie_weights to fix

9298d45

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

Merge branch 'huggingface:main' into main

aaf748d

enable OffloadedCache on XPU since PyTorch 2.7

da6901d

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

github-actions bot marked this pull request as draft March 12, 2025 04:11

yao-matrix and others added 2 commits March 12, 2025 12:12

Merge branch 'main' into offloadedcache

6177fa7

fix style

62e3991

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

yao-matrix marked this pull request as ready for review March 12, 2025 04:31

github-actions bot requested review from Rocketknight1 and ydshieh March 12, 2025 04:31

don't change bart

36293e3

Signed-off-by: root <root@a4bf01945cfe.jf.intel.com>

yao-matrix and others added 2 commits March 13, 2025 09:19

Merge branch 'main' into offloadedcache

f5b58e3

Merge branch 'main' into offloadedcache

97beafc

Merge branch 'main' into offloadedcache

5851f66

SunMarc approved these changes Mar 13, 2025

n17s reviewed Mar 13, 2025

src/transformers/cache_utils.py (outdated, resolved)

src/transformers/cache_utils.py (resolved)

tests/utils/test_cache_utils.py (resolved)

SunMarc requested a review from gante March 14, 2025 13:25

make code more concise per review comments

5d28624

Signed-off-by: N <matrix.yao@intel.com>

n17s approved these changes Mar 14, 2025

Merge branch 'main' into offloadedcache

b6b323c

gante approved these changes Mar 17, 2025

src/transformers/cache_utils.py (outdated, resolved)

yao-matrix and others added 4 commits March 18, 2025 08:03

Merge branch 'main' into offloadedcache

fa15c53

fix review comments

acf1484

Signed-off-by: root <root@a4bf01945cfe.jf.intel.com>

Revert "fix review comments"

9148e76

This reverts commit acf1484.

fix review comments

3d0a158

Signed-off-by: root <root@a4bf01945cfe.jf.intel.com>

fix style

48af80e

Signed-off-by: root <root@a4bf01945cfe.jf.intel.com>

SunMarc approved these changes Mar 18, 2025

SunMarc requested a review from gante March 18, 2025 17:35

Merge branch 'main' into offloadedcache

1ee5788

SunMarc merged commit b11050d into huggingface:main Mar 19, 2025
21 checks passed

loadams mentioned this pull request Mar 19, 2025

Update container version that runs on A6000 tests. deepspeedai/DeepSpeed#7153

Merged

yao-matrix deleted the offloadedcache branch March 19, 2025 23:49

gante mentioned this pull request Mar 20, 2025

[Utils] torch version checks optionally accept dev versions #36847

Merged
