enable OffloadedCache on XPU from PyTorch 2.7 #36654

Merged: 20 commits, Mar 19, 2025

Conversation

yao-matrix
Contributor

@yao-matrix yao-matrix commented Mar 12, 2025

XPU is aligning its PyTorch feature set with CUDA. Since PyTorch 2.6, a device-agnostic torch.Stream is supported, and XPU supports this API, so I enabled OffloadedCache on XPU.

Why start from 2.7? OffloadedCache needs StreamContext, and the PR that adds the __enter__ attribute to StreamContext was not merged in time for 2.6 but will be in 2.7.

Tested with a PyTorch 2.7 dev package (pip install --pre torch==2.7.0.dev20250306 --index-url https://download.pytorch.org/whl/nightly/xpu).
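
The version gate described above can be sketched roughly as follows. This is a torch-free illustration under stated assumptions, not the code in this PR: `pick_stream_api` and the string labels it returns are hypothetical, and the assumption that CUDA keeps using a CUDA-specific stream path below 2.7 is inferred from the description rather than confirmed by it.

```python
def pick_stream_api(torch_release: tuple, device_type: str) -> str:
    """Return a label for the stream API a caller would pick (hypothetical helper).

    torch_release is the numeric release as a tuple, e.g. (2, 7, 0).
    """
    if torch_release >= (2, 7):
        # Assumption: from 2.7 on, the device-agnostic stream can be entered
        # as a context manager, so CUDA and XPU can share one code path.
        return "torch.Stream"
    if device_type == "cuda":
        # Assumption: older releases fall back to the CUDA-only path.
        return "torch.cuda.Stream"
    raise RuntimeError(
        f"OffloadedCache on {device_type!r} would need PyTorch >= 2.7 in this sketch"
    )


print(pick_stream_api((2, 7, 0), "xpu"))
print(pick_stream_api((2, 6, 0), "cuda"))
```

The point of the design is that only the pre-2.7 branch needs to know which device it is running on.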

@github-actions github-actions bot marked this pull request as draft March 12, 2025 04:11

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

yao-matrix and others added 2 commits March 12, 2025 12:12
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
@yao-matrix yao-matrix marked this pull request as ready for review March 12, 2025 04:31
Signed-off-by: root <root@a4bf01945cfe.jf.intel.com>
@yao-matrix
Contributor Author

The failed CI cases seem unrelated to my changes.

@ydshieh
Collaborator

ydshieh commented Mar 13, 2025

Hi @yao-matrix, thank you for adding this support.

Hi @n17s, would you be interested in taking a first look? cc @gante

Member

@SunMarc SunMarc left a comment


Looks fine to me overall !

@SunMarc SunMarc requested a review from gante March 14, 2025 13:25
Signed-off-by: N <matrix.yao@intel.com>
Contributor

@n17s n17s left a comment


Looks good to me

Member

@gante gante left a comment


LGTM, thank you for adding support! 🤗

Added a minor nit with a more recent import guard practice, happy to merge when it's sorted

yao-matrix and others added 4 commits March 18, 2025 08:03
Signed-off-by: root <root@a4bf01945cfe.jf.intel.com>
This reverts commit acf1484.
Signed-off-by: root <root@a4bf01945cfe.jf.intel.com>
Signed-off-by: root <root@a4bf01945cfe.jf.intel.com>
Member

@SunMarc SunMarc left a comment


Thanks ! LGTM !

@SunMarc SunMarc requested a review from gante March 18, 2025 17:35
@SunMarc SunMarc merged commit b11050d into huggingface:main Mar 19, 2025
21 checks passed
@loadams
Contributor

loadams commented Mar 19, 2025

Hi @yao-matrix and @SunMarc - when running this PR with torch 2.5.0a0+b465a5843b.nv24.9 (from nvcr.io/nvidia/pytorch:24.09-py3), I see the following error:

ImportError: cannot import name 'Replicate' from 'torch.distributed.tensor' (/usr/local/lib/python3.10/dist-packages/torch/distributed/tensor/__init__.py)

Perhaps the guards are on the wrong version of pytorch?
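
One way to sidestep this class of failure is to probe for the symbol itself rather than infer its presence from a version number. The sketch below uses only the standard library. `symbol_available` is a hypothetical helper, not a transformers or PyTorch API, and it is demonstrated on the `math` module so that it runs without torch installed.

```python
import importlib


def symbol_available(module_name: str, symbol: str) -> bool:
    """Return True if module_name imports cleanly and exposes symbol."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module (or one of its parents) is missing in this environment.
        return False
    # Note: importing the module runs its top-level code, as any import does.
    return hasattr(module, symbol)


print(symbol_available("math", "sqrt"))
print(symbol_available("math", "no_such_symbol"))
```

Applied to the error above, the same probe would be called with "torch.distributed.tensor" and "Replicate" before attempting the import.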

@yao-matrix
Contributor Author

That's weird. My version check is against 2.7: if the version is >= 2.7, it takes the new API; otherwise the old one. But I can see that in your PR you changed PyTorch from 2.5 to 2.6, and both versions take the old path.

@loadams
Contributor

loadams commented Mar 19, 2025

@yao-matrix - yes, that is quite odd, but I was able to bisect the failure to this PR, so perhaps I'm hitting another code path that this PR enables? It does seem to be resolved by updating the torch version.

github-merge-queue bot pushed a commit to deepspeedai/DeepSpeed that referenced this pull request Mar 19, 2025
Changes from huggingface/transformers#36654 in
transformers cause issues with the torch 2.5 version we were using. This
just updated us to use a newer version.

---------

Signed-off-by: Logan Adams <loadams@microsoft.com>
@yao-matrix yao-matrix deleted the offloadedcache branch March 19, 2025 23:49
@gante
Member

gante commented Mar 20, 2025

@yao-matrix I'm going to revert part of the changes in is_torch_greater_or_equal, as they are breaking other parts of the library. In a nutshell, we can't confirm that all dev versions for 2.X contain the features that will be released in 2.X, which is the error @loadams is seeing (2.5.0a0+b465a5843b.nv24.9 is a dev version of 2.5.0).

@yao-matrix To enable your use case, I'm going to add an accept_dev flag to is_torch_greater_or_equal.
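
To make the dev-version problem concrete, here is a minimal standard-library sketch of a version check with an opt-in flag. It is not the transformers implementation: `is_at_least` and `parse` are hypothetical names, the parsing is a simplified subset of PEP 440, and the flag's semantics (treat a pre-release of X.Y as satisfying ">= X.Y" only when the caller opts in) are an assumption about what accept_dev is meant to do.

```python
import re


def parse(version: str):
    """Split a version string into (release tuple, is_prerelease)."""
    public = version.split("+", 1)[0]  # drop a local label such as "+b465a5843b.nv24.9"
    match = re.match(r"(\d+(?:\.\d+)*)(.*)$", public)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    release = tuple(int(part) for part in match.group(1).split("."))
    # Simplification: only a/b/rc/dev suffixes are treated as pre-releases.
    is_pre = re.match(r"\.?(a|b|rc|dev)\d*", match.group(2)) is not None
    return release, is_pre


def is_at_least(installed: str, minimum: str, accept_dev: bool = False) -> bool:
    """Check installed >= minimum; pre-releases of minimum count only if accept_dev."""
    release, is_pre = parse(installed)
    min_release, _ = parse(minimum)
    width = max(len(release), len(min_release))
    release += (0,) * (width - len(release))          # pad so (2, 7) == (2, 7, 0)
    min_release += (0,) * (width - len(min_release))
    if release != min_release:
        return release > min_release
    # Same numeric release: a pre-release sorts before the final release.
    return accept_dev or not is_pre


nightly = "2.7.0.dev20250306"
ngc_build = "2.5.0a0+b465a5843b.nv24.9"
print(is_at_least(nightly, "2.7"), is_at_least(nightly, "2.7", accept_dev=True))
print(is_at_least(ngc_build, "2.5"), is_at_least(ngc_build, "2.5", accept_dev=True))
```

With the flag off, both builds fail their respective checks, which keeps a 2.5.0a0 build away from code that expects the final 2.5 feature set. With it on, the 2.7 nightly passes the 2.7 check, which is the XPU use case in this PR.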

mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Mar 20, 2025
loadams added a commit to deepspeedai/DeepSpeed that referenced this pull request Mar 25, 2025
6 participants