
Conversation

heidongxianhua
Contributor

Fixes #ISSUE_NUMBER
1. Add checkpoint support for custom devices.
2. Add a device argument. I wanted to add a device="cuda" parameter to the forward func of CheckpointFunction so that the device type can be specified when using it, but the apply func of torch.autograd.Function does not support kwargs, so I added a variable named _device.
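
Below is a minimal sketch (not the code from this PR) of the workaround described in point 2: since torch.autograd.Function.apply only forwards positional arguments, the device type has to travel as a positional argument and the public wrapper consumes the keyword.

import torch

class CheckpointFunction(torch.autograd.Function):
    # Sketch only: the real forward in torch/utils/checkpoint.py takes more
    # arguments and stashes RNG state; the point here is that the device type
    # is passed positionally, because Function.apply() does not accept kwargs.
    @staticmethod
    def forward(ctx, run_function, preserve_rng_state, device, *args):
        ctx.run_function = run_function
        ctx.preserve_rng_state = preserve_rng_state
        ctx.device = device  # e.g. "cuda", "xpu", or another backend name
        with torch.no_grad():
            outputs = run_function(*args)
        return outputs

def checkpoint(function, *args, device="cuda", preserve_rng_state=True):
    # Keyword arguments are consumed here and re-passed positionally to apply().
    return CheckpointFunction.apply(function, preserve_rng_state, device, *args)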

@pytorch-bot

pytorch-bot bot commented Apr 20, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/99626

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure

As of commit 3eaa287:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@heidongxianhua
Contributor Author

heidongxianhua commented Apr 20, 2023

@albanD sorry to bother you, could you take a look? The failed check (No module named 'triton') seems unrelated to this change.

@heidongxianhua
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased fix_checkpoint_main onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix_checkpoint_main && git pull --rebase)

Collaborator

albanD left a comment

I am curious what @soulitzer thinks but my personal feeling is that this should always work on all devices and not expect the user to specify the device?

albanD requested a review from soulitzer, April 21, 2023 18:24
@heidongxianhua
Contributor Author

I am curious what @soulitzer thinks but my personal feeling is that this should always work on all devices and not expect the user to specify the device?

Yes, it should always work on all devices. But the current implementation has several device-related funcs hard-coded to CUDA, such as with torch.cuda.device, so I added a func to specify the device up front so that it can also work on other devices.

@soulitzer
Contributor

@heidongxianhua I think what Alban means here is that it should work for all devices without the user having to explicitly pass in the device at all (for example maybe infer the device somehow from the inputs?)

@heidongxianhua
Contributor Author

heidongxianhua commented Apr 21, 2023

@heidongxianhua I think what Alban means here is that it should work for all devices without the user having to explicitly pass in the device at all (for example maybe infer the device somehow from the inputs?)

Yeah, thanks for your reply @soulitzer. The inputs here are not restricted to tensors, so there may be no tensor at all and we cannot infer the device from the inputs. That is why I added a func to specify the device.

@heidongxianhua
Contributor Author

@soulitzer I have reviewed the code; the inputs may contain no tensors or may be empty, so we cannot get a device type from the inputs. I have made the device arg a static argument of CheckpointFunction, and these changes will not affect the existing funcs. Could you have a look again?

Contributor

It appears that in this comment we already make this tradeoff, so I think it's okay to make the same tradeoff here.

We should just have a note on the checkpoint docs that device state is only preserved for devices of the Tensor args, the workaround if there are no Tensor args is just to explicitly pass in a dummy tensor on the correct device.
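
For reference, a hedged usage sketch of that workaround (the roll function below is made up for illustration): when the checkpointed function takes no meaningful Tensor arguments, an otherwise unused dummy tensor on the target device tells checkpoint which device state to preserve.

import torch
from torch.utils.checkpoint import checkpoint

def roll(batch_size, dummy):
    # Hypothetical checkpointed function: no real Tensor inputs, but it uses
    # device randomness, so device RNG state matters on recomputation.
    return torch.randn(batch_size, 16, device=dummy.device)

# The dummy tensor only carries the device information.
dummy = torch.empty(0, device="cuda", requires_grad=True)
out = checkpoint(roll, 8, dummy)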

Contributor Author

It appears that in this comment we already make this tradeoff, so I think it's okay to make the same tradeoff here.

We should just have a note on the checkpoint docs that device state is only preserved for devices of the Tensor args, the workaround if there are no Tensor args is just to explicitly pass in a dummy tensor on the correct device.

Yeah, thanks for the reminder; I hadn't noticed that note. I made some modifications to extract the device information from the input parameters. @soulitzer

mikaylagawarecki added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module), Apr 25, 2023
@soulitzer
Contributor

Thanks for the update. From talking with @albanD, I don't think we should need the constraint that there must only be a single non-CPU device. Let's just loop through all devices and get/set state for all of them?

@heidongxianhua
Contributor Author

heidongxianhua commented Apr 27, 2023

Thanks for the update. From talking with @albanD, I don't think we should need the constraint that there must only be a single non-CPU device. Let's just loop through all devices and get/set state for all of them?

Yeah, thanks for your comments @albanD @soulitzer. A few points: 1. I added a func to extract the device type info from the args; if there are no tensors, or no non-CPU tensors, in the args, it returns the default device. 2. I added a func to set/get the default device referenced above. These changes do not affect the existing code, and various other device types (MPS/XPU/HPU and so on) can also be supported. 3. It does not yet support mixed device types (such as CUDA tensors and XPU tensors simultaneously). Supporting that would require refactoring many funcs, such as CheckpointFunction.backward and _checkpoint_without_reentrant; it would be a big change, but we can do it later if you want.

@soulitzer
Contributor

@heidongxianhua I guess this is bc-breaking because previously one was allowed to pass in tensors on multiple devices. The device state wouldn't be saved, but that may not matter unless the checkpointed functions have randomness.

One way to not make it bc-breaking is just to remove the error (and therefore just take the first device). To make behavior identical to what it was before we'd have to check if cuda is one of the devices and prioritize saving the device state of cuda even if cuda is not the first device type.

Ideally we'd just support saving state of devices from multiple device types though, could you clarify what is difficult about that?

@heidongxianhua
Contributor Author

heidongxianhua commented Apr 27, 2023

@heidongxianhua I guess this is bc-breaking because previously one was allowed to pass in tensors on multiple devices. The device state wouldn't be saved, but that may not matter unless the checkpointed functions have randomness.

One way to not make it bc-breaking is just to remove the error (and therefore just take the first device). To make behavior identical to what it was before we'd have to check if cuda is one of the devices and prioritize saving the device state of cuda even if cuda is not the first device type.

Ideally we'd just support saving state of devices from multiple device types though, could you clarify what is difficult about that?

Yes, I got it. If there are multiple devices, we use the first device type, and if CUDA is among them, using CUDA is ok. That is a good solution, thank you. It now works for multiple devices, and I have updated the changes as you suggested; maybe you could review again. @soulitzer
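
A rough sketch of that selection logic, as an illustration of the behavior agreed on above rather than the exact code in this PR: gather the non-CPU device types from the tensor args, fall back to a default when none are found, and prefer CUDA when several types are present.

import warnings
import torch

_default_device_type = "cuda"  # fallback when args contain no non-CPU tensors

def infer_device_type(*args):
    device_types = [arg.device.type for arg in args
                    if isinstance(arg, torch.Tensor) and arg.device.type != "cpu"]
    if not device_types:
        return _default_device_type
    unique_types = list(dict.fromkeys(device_types))  # keep first-seen order
    if len(unique_types) > 1:
        warnings.warn("Tensor args (excluding CPU tensors) are on more than one "
                      "device type; device state will only be saved for one of them.")
        # Preserve the old behavior: prefer CUDA if it is among the devices.
        if "cuda" in unique_types:
            return "cuda"
    return unique_types[0]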

@soulitzer
Contributor

Thanks, one more thing we want is for it to warn when one of the devices is getting ignored - so just changing that error into a warning would be good.

Contributor

soulitzer left a comment

Looks good, beyond just adding back the check you had as a warning, just had some small comments.

def get_device_type():
    return DefaultDevice._default_device_type

def infer_device_type(*args):
Contributor

let's make this a private function (by prepending underscore to its name?)
or we should add its name to the __all__ list in this file

same applies to DefaultDevice class above

I guess I can imagine DefaultDevice being used elsewhere, but curious what the reasoning would be for infer_device_type.
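
For illustration, the two options mentioned above would look roughly like this; the exact contents of __all__ in torch/utils/checkpoint.py are an assumption here.

# Option 1: keep the helper private to the module
def _infer_device_type(*args):
    ...

# Option 2: export it as part of the module's public surface
__all__ = ["checkpoint", "checkpoint_sequential", "DefaultDevice", "infer_device_type"]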

device_module = _get_device_module(device)
device_autocast_kwargs = {"enabled": device_module.is_autocast_enabled(),
                          "dtype": device_module.get_autocast_dtype(),
                          "cache_enabled": torch.is_autocast_cache_enabled()}
Contributor

Just to make sure, this is intentional right? (is there no such thing as device_module.is_autocast_cache_enabled())

# Cuda was not initialized before running the forward, so we didn't
# stash the CUDA state.
if device_module._initialized and preserve_rng_state and not had_device_in_fwd:
# Deivce was not initialized before running the forward, so we didn't
Contributor

Deivce -> Device

device_types = list({arg.device.type for arg in args
                     if isinstance(arg, torch.Tensor) and not arg.device.type == "cpu"})
if len(device_types) > 1:
    warnings.warn("Tensor args except CPU tensor are on at least two devices ", device_types,
Contributor

soulitzer Apr 28, 2023

Maybe something along the lines of:

Tensor arguments, excluding CPU tensors, are detected on at least two 
types of devices. Device state will only be saved for devices of a single
device type, and the remaining devices will be ignored.  Consequently,
if any checkpointed functions involve randomness, this may result in
incorrect gradients. (Note that if CUDA devices are among the devices
detected, it will be prioritized; otherwise, the first device encountered will
be selected.)

Contributor

soulitzer left a comment

LGTM, just had a suggestion on how to word the warning. Also don't forget about lint.

@linux-foundation-easycla

linux-foundation-easycla bot commented Apr 28, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@heidongxianhua
Contributor Author

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request), Apr 28, 2023
heidongxianhua force-pushed the fix_checkpoint_main branch from 0fe93fe to 7ea1acf, May 3, 2023 09:30
@heidongxianhua
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@heidongxianhua
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

The merge job was canceled. If you believe this is a mistake, then you can re-trigger it through pytorch-bot.

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here


@staticmethod
def _get_device_type():
    return _DefaultDevice._default_device_type
Contributor

soulitzer May 3, 2023

Maybe we could do something like this?

class DefaultDevice:
    """
    A class that manages the default device type for checkpointing. 
    If no non-CPU tensors are present, the default device type will be used.
    The default value is 'cuda'. The device type is used in the checkpointing 
    process when determining which device states to save and restore
    for recomputation.
    """

    _default_device_type = "cuda"

    @staticmethod
    def set_device_type(device: str = "cuda"):
        """
        Set the default device type for checkpointing.

        Args:
            device (str): The device type to be set as default. Default is 'cuda'.
        """
        DefaultDevice._default_device_type = device

    @staticmethod
    def get_device_type() -> str:
        """
        Get the current default device type for checkpointing.

        Returns:
            str: The current default device type.
        """
        return DefaultDevice._default_device_type
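
A brief usage sketch of the suggested class (assuming it ends up importable from torch.utils.checkpoint under this name; it is renamed DefaultDeviceType later in this thread):

from torch.utils.checkpoint import DefaultDevice  # assumed import path

# For a custom backend whose tensors may never appear in the checkpointed
# args, tell checkpointing which device type's state to preserve by default.
DefaultDevice.set_device_type("xpu")
assert DefaultDevice.get_device_type() == "xpu"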

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: windows-binary-libtorch-debug / libtorch-cpu-shared-with-deps-debug-test

Details for Dev Infra team (raised by workflow job)

@soulitzer
Contributor

soulitzer commented May 3, 2023

Thanks for the update @heidongxianhua, I just had one final comment on the docs

Would also be good to have a look at the doc build before land, to make sure things render properly.

@heidongxianhua
Contributor Author

Thanks for the update @heidongxianhua, I just had one final comment on the docs

Would also be good to have a look at the doc build before land, to make sure things render properly.

Thanks for your comment; I have added the detailed docstring @soulitzer. I thought this function was simple and clear, so I did not add detailed comments initially. But your suggested comment is very detailed and user friendly.

@soulitzer
Contributor

Thanks for adding that! Sorry for the back and forth on this one - I just had another random comment on naming - maybe it would be better if DefaultDevice were named DefaultDeviceType? I think it's a super subtle difference, but it might make things slightly clearer.

@heidongxianhua
Contributor Author

Thanks for adding that! Sorry for the back and forth on this one - I just had another random comment on naming - maybe it would be better if DefaultDevice were named DefaultDeviceType? I think it's a super subtle difference, but it might make things slightly clearer.

Yeah, it is better named DefaultDeviceType. I didn't pay attention to these details; thank you for your careful comments making the code better. @soulitzer

@heidongxianhua
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here


Labels

ciflow/trunk (Trigger trunk jobs on your pull request), Merged, open source, topic: not user facing (topic category), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
