
Conversation

@jamesjwu jamesjwu (Contributor) commented May 6, 2024

Stack from ghstack (oldest at bottom):

This is the first PR in a series where I try to organize our runtime wrappers a bit: specifically, I'd like to separate wrappers into objects that have (up to) 2 methods:
A pre-compile function, which takes in flat_fn and flat_args (inputs to the compiler) and wraps/modifies them
A post-compile function, which takes in a compiled_fn and runtime args and wraps the compiled_fn.

Extra metadata necessary to run the compile functions can be stored as attributes on the class. This way, when we think about caching, the set of attributes on the class should be the exact set of metadata that we need to serialize and save in the cache (along with common data, like fw_metadata).
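For concreteness, here is a minimal sketch of that shape (illustrative only: the pre_compile/post_compile names follow the description above, but the ExampleWrapper class and its signatures are simplified assumptions, not the exact interface in the PR):

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class ExampleWrapper:
    # Any metadata the wrapper needs later lives on the instance, so the set
    # of attributes is exactly what a cache entry would have to serialize.
    some_flag: bool = False

    def pre_compile(self, flat_fn: Callable, flat_args: List[Any]):
        # Wrap/modify the function and args handed to the compiler.
        def wrapped_fn(args: List[Any]):
            return flat_fn(args)
        return wrapped_fn, flat_args

    def post_compile(self, compiled_fn: Callable) -> Callable:
        # Wrap the compiled function with the matching runtime behavior.
        def runtime_fn(runtime_args: List[Any]):
            return compiled_fn(runtime_args)
        return runtime_fn
```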

pytorch-bot bot commented May 6, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125595

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 2d11a83 (failed to retrieve merge base, please contact dev infra):

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jamesjwu added 2 commits May 6, 2024 11:12
jamesjwu added 2 commits May 6, 2024 12:28
@jamesjwu jamesjwu requested a review from bdhirsh May 7, 2024 15:16
@jamesjwu jamesjwu marked this pull request as draft May 7, 2024 15:18
@jamesjwu jamesjwu marked this pull request as ready for review May 7, 2024 15:46
trace_joint=False,
keep_input_mutations=aot_config.keep_inference_input_mutations,
disable_amp=disable_amp,
).post_compile(
Contributor commented:

Mostly just a callout: for the most part we have a 1:1 relationship between "pre_compile" and "post_compile" transformations. But for this particular runtime_wrapper bit, the transformation is something like:

in the inference path: aot_dispatch_base() makes the compile-time change, and uses create_runtime_wrapper to make the runtime change

in the training path: aot_dispatch_autograd() makes the compile-time change, and uses a combination of CompiledFunction + create_runtime_wrapper to make the runtime change.

So the actual "post_compile" runtime wrapper changes needed in the inference and training paths have some commonalities, which are shared in the create_runtime_wrapper helper.
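A rough sketch of that shape (hypothetical function names; this is not the actual aot_dispatch_base/aot_dispatch_autograd code, just an illustration of two compile paths sharing one runtime-wrapping helper):

```python
from typing import Any, Callable, List


def shared_runtime_wrapper(compiled_fn: Callable) -> Callable:
    # Stands in for the common runtime logic that create_runtime_wrapper
    # provides to both paths (greatly simplified here).
    def runtime_fn(args: List[Any]):
        return compiled_fn(args)
    return runtime_fn


def inference_compile(flat_fn: Callable) -> Callable:
    compiled_fn = flat_fn  # placeholder for the inference compile-time work
    return shared_runtime_wrapper(compiled_fn)


def training_compile(flat_fn: Callable) -> Callable:
    # The real training path also layers an autograd.Function (CompiledFunction)
    # on top of the compiled forward/backward; omitted here.
    compiled_fn = flat_fn  # placeholder for the autograd compile-time work
    return shared_runtime_wrapper(compiled_fn)
```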

Contributor commented:

If we are refactoring all of the runtime wrappers to use CompilerWrapper, then I'm mostly just calling out that I would have originally expected every instance to have both a pre_compile and a post_compile, and I guess we have just a post_compile here since it lines up with how the runtime wrappers are used today / makes the refactor easier.

Do you imagine a better end state being that we eventually refactor the wrapper logic so there is more of a 1:1 between pre-compile and post-compile for each layer? Or do you think this is OK as an end state? (Either way, I think this refactor looks good to me; happy to stamp.)

jamesjwu (Contributor, Author) commented:

Hmm, so from what I can tell, the runtime change is only made in create_runtime_wrapper, is it not? Do you mind giving an example of somewhere where CompiledFunction is making a runtime change related to what create_runtime_wrapper is doing? I had thought that the only runtime change create_runtime_wrapper is responsible for was in the two places it's called in aot_dispatch_autograd and aot_dispatch_base, both of which are refactored here.

Just based on my own understanding, I had thought that create_runtime_wrapper was a runtime-only wrapper (in that it does not seem to affect the flat_fn state), but if you could point me to what you'd consider the "pre_compile" step of create_runtime_wrapper (i.e., where it modifies the flat_args or flat_fn), I can follow up and refactor that bit too!

jamesjwu (Contributor, Author) commented:

Thinking about this more, I think the point you're making is that aot_dispatch_base and aot_dispatch_autograd are, in a sense, themselves just big wrappers, and create_runtime_wrapper is actually just a post_compile for the pre-compile step that exists somewhere in the code of aot_dispatch_*.

I do think it would be valuable to disentangle the logic in aot_dispatch_* that relates to what create_runtime_wrapper is doing, so that it can be expressed as a pre_compile for the new wrapper. Happy to work on that, though I might prioritize it a bit lower just because it's not necessary for caching.

I think the same can be said about the other "post compile only" wrappers that I define in the next PR: there's probably some pre-compile logic in aot_dispatch_* that could be put into a pre compile step, but it would require a more involved refactor. I even tried to do it for rng functionalization, but found it to be pretty challenging to get right just due to the sheer number of tangled dependencies within aot_dispatch_autograd.
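For reference, a "post compile only" wrapper in this scheme would just leave pre_compile as a passthrough until the matching compile-time logic is pulled out of aot_dispatch_*. A hedged sketch (hypothetical class name, not the actual RNG functionalization or runtime wrapper code):

```python
from typing import Any, Callable, List


class PostCompileOnlyWrapper:
    # pre_compile is a passthrough for now: the corresponding compile-time
    # logic still lives inside aot_dispatch_* and hasn't been disentangled.
    def pre_compile(self, flat_fn: Callable, flat_args: List[Any]):
        return flat_fn, flat_args

    def post_compile(self, compiled_fn: Callable) -> Callable:
        # The runtime-only behavior (the part that exists today) goes here.
        def runtime_fn(runtime_args: List[Any]):
            return compiled_fn(runtime_args)
        return runtime_fn
```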

Contributor commented:

> but found it to be pretty challenging to get right just due to the sheer number of tangled dependencies within aot_dispatch_autograd.

yeah... agreed 😛

flat_args: List[Tensor],
aot_config: AOTConfig,
*,
fw_metadata: ViewAndMutationMeta,
Contributor commented:

Just another callout - right now, we have:

(1) multiple transformations (each of which will eventually get its own CompilerWrapper according to this PR)
(2) each transformation requires some metadata

And some of that metadata is shared across transformations (like aot_config and fw_metadata, which you're uniformly passing into every instance of CompilerWrapper), while other metadata is specific to a certain transformation (like indices_of_inps_to_detach).

I definitely wouldn't block your current refactor on this extra refactor, but I do wonder if we should aim to go more fully in one of those two directions: either each CompilerWrapper gets exactly the metadata it cares about and nothing is shared, or we glob all metadata into a single shared object that is plumbed around everywhere (we definitely lean more in this direction today).

jamesjwu (Contributor, Author) commented:

Hmm, this is a bit tricky to get right, I think: I don't really think it's possible to make each CompilerWrapper completely independent, simply because the fields that are shared are also modified by each CompilerWrapper's pre_compile function, so they're not actually independent.

Having a single metadata structure represent all of the shared info would definitely be nice, and it's also pretty much what a cache entry would look like, so I might end up doing that refactor when I actually create cache entries (basically, passing along a single data structure with all the fields shared between CompilerWrappers).

Though it's also kind of nice, code-readability-wise, for CompilerWrappers to have statically defined parameters for fields like indices_of_inps_to_detach instead of a single "metadata" field: if you change the metadata, it's easier to see which wrappers consume it, instead of having to hunt through each usage in the code.

I think the compromise here might be that there is a single object we plumb through AOT autograd, but the CompilerWrappers themselves still take individual arguments. Note that none of these CompilerWrapper objects should need to be directly plumbed outside the function they're defined in: in a perfect world or end state, we would just have a series of pre_compile steps, followed by actual compilation (aot_dispatch_base/autograd), followed by a series of post_compile steps.
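As a rough illustration of that end state (all names and signatures here are hypothetical, not the actual AOTAutograd code): a single shared metadata object is threaded through the driver, each wrapper is constructed with just the fields it needs, and compilation becomes a pipeline of pre_compile steps, the compiler, then post_compile steps applied in reverse order so the wrapping nests symmetrically.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class SharedMeta:
    # Hypothetical stand-in for info shared across wrappers
    # (think fw_metadata / aot_config style data).
    indices_of_inps_to_detach: List[int] = field(default_factory=list)


@dataclass
class NoopWrapper:
    # Constructed with only the fields it cares about, pulled off the
    # shared object by the driver.
    indices: List[int] = field(default_factory=list)

    def pre_compile(self, flat_fn: Callable, flat_args: List[Any]):
        return flat_fn, flat_args

    def post_compile(self, compiled_fn: Callable) -> Callable:
        return compiled_fn


def compile_with_wrappers(flat_fn, flat_args, wrappers, compiler):
    # Pre-compile steps in order, then the compiler, then post-compile
    # steps in reverse order.
    for w in wrappers:
        flat_fn, flat_args = w.pre_compile(flat_fn, flat_args)
    compiled_fn = compiler(flat_fn, flat_args)
    for w in reversed(wrappers):
        compiled_fn = w.post_compile(compiled_fn)
    return compiled_fn


if __name__ == "__main__":
    meta = SharedMeta(indices_of_inps_to_detach=[0])
    wrappers = [NoopWrapper(indices=meta.indices_of_inps_to_detach)]
    fn = compile_with_wrappers(
        lambda args: sum(args), [1, 2], wrappers, lambda f, a: f
    )
    print(fn([1, 2]))  # 3
```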

@bdhirsh bdhirsh (Contributor) left a comment:

lgtm

@jamesjwu jamesjwu added the ciflow/trunk and topic: not user facing labels and removed the release notes: AO frontend label May 8, 2024
@jamesjwu jamesjwu (Contributor, Author) commented May 8, 2024

@pytorchbot merge -i

pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged while ignoring the following 2 checks: inductor / cuda12.1-py3.10-gcc9-sm86 / test (dynamic_inductor_timm, 2, 2, linux.g5.4xlarge.nvidia.gpu), inductor / rocm6.1-py3.8-inductor / test (inductor, 1, 1, linux.rocm.gpu.2)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@github-actions github-actions bot deleted the gh/jamesjwu/21/head branch June 8, 2024 01:54

Labels: ciflow/inductor, ciflow/trunk, Merged, topic: not user facing
