[wip] Loading nn.Module from checkpoint tutorial #2519
Conversation
[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2519
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure. As of commit 1e0c60e, the following job has failed.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thanks, @mikaylagawarecki! Can you please resubmit as a regular PR? We don't support ghstack in this repo.
@svekars I will open a proper non-ghstack PR for landing; just using this one as scratch space for review for now, if that's alright!
# 2. The user does not want to wait for the entire checkpoint to be loaded
# into RAM before doing, for example, some per-tensor processing
#
# The `mmap` keyword argument to `torch.load` attempts to solve the above two
How to better explain this:
- should we go into the internals of how the zip file is structured?
- should we mention that users should not expect `mmap` behavior on CUDA?
###############################################################################
# The [`torch.device()`](https://pytorch.org/docs/main/tensor_attributes.html#torch-device)
# context manager makes sure that factory calls will be performed as if they
# were passed device as an argument. However, it does not affect factory
There is a new PR (not landed yet) that lets this behavior (factory calls with an explicit device argument are not affected by the context manager) be overridden with TorchFunctionMode. Should that be mentioned?
I think this tutorial should assume the user is using the latest version of main (so, after the PR has landed).
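A small sketch of the context-manager behavior being discussed: inside the block, factory calls behave as if `device` were passed explicitly, so the module's parameters below are created on the meta device and no real storage is allocated.

```python
import torch
import torch.nn as nn

# Factory calls inside this block default to the meta device.
with torch.device("meta"):
    layer = nn.Linear(8, 8)

print(layer.weight.device)   # meta
print(layer.weight.is_meta)  # True
```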
2. The `torch.device()` context manager
3. The `assign` keyword argument on `nn.Module.load_state_dict()`

The following snippet of code illustrates the use of the above three utilities.
I like the idea of having a "tl;dr" at the top of the tutorial; we can make it more explicit: tl;dr: if you're loading a checkpoint and want to reduce compute and memory as much as possible, do the following:
#
# The `mmap` keyword argument to `torch.load` attempts to solve the above two
# problems by using an [`mmap` call](https://man7.org/linux/man-pages/man2/mmap.2.html)
# on the checkpoint, so that tensor storages are memory-mapped and when they are
nit: I think this is a bit too much "in the middle": if you assume the user doesn't know `mmap`, add one sentence explaining what it does (map the file on disk into virtual memory and let the OS handle loading/unloading into physical memory automatically). If you assume they already know, simplify this paragraph.
# on the checkpoint, so that tensor storages are memory-mapped and the OS
# manages when they are fetched from disk into memory.
#
# Next, we consider the creation of the module.
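If we do go with the one-sentence explanation, a short illustration could back it up. Nothing torch-specific here, just a sketch of what an `mmap(2)` call does using Python's `mmap` module (the filename is made up):

```python
import mmap

# Write one page of data to a file on disk.
with open("data.bin", "wb") as f:
    f.write(b"\x07" * 4096)

# Map the file into virtual memory. Mapping does not read the file; the OS
# faults pages in from disk only when the bytes are actually touched.
with open("data.bin", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        first = mapped[0]  # this access is what pages the byte in

print(first)  # 7
```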
Have titles? Not sure how this renders
# other metadata a tensor carries such as `.size()` and `.stride()`,
# `.requires_grad` etc.
#
# Next, we consider the loading of the state dictionary.
paragraph
Stack from ghstack (oldest at bottom):