
Fix inductor <> ddp_optimizer issue #108081

Closed
wants to merge 4 commits

Conversation

@bdhirsh (Contributor) commented Aug 28, 2023

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @anijain2305 @shunting314 this is a tentative fix for #107362. More discussion at #107362 (comment).

@wconstab pointed out that inductor found a graph with 6 input mutations and only 1 output, and seemed to be (incorrectly) chopping off the first "6" outputs from the graph (even though there is only 1). It looks like this is because:

(1) AOTAutograd has special handling for input mutations in inference vs. training graphs. In a training graph, whenever AOTAutograd sees an input mutation, it will add an extra output to the graph, corresponding to the updated input (and then at runtime, it will grab the updated input, and perform the actual mutation outside of the graph).

In inference, AOTAutograd is smarter and can leave the input mutations directly in the graph for inductor to optimize (doing this in training is harder). In inference, AOTAutograd will not add any extra graph outputs for input mutations.

It looks like inductor was unconditionally assuming that input mutations counted as extra outputs in the graph, which is wrong for the inference case.
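
To make the two cases concrete, here is a minimal sketch (the function below and the paraphrased graphs in its comments are illustrative only, not captured from a real run):

```
import torch

def f(x):
    x.mul_(2)        # input mutation
    return x + 1     # the single "real" output

# Training-style tracing (autograd enabled): AOTAutograd returns the updated
# input as an EXTRA graph output and replays the mutation outside the graph,
# roughly:
#     def forward(arg0):
#         mul = arg0 * 2
#         add = mul + 1
#         return (mul, add)    # 2 outputs: updated input + real output
#
# Inference-style tracing (e.g. under torch.no_grad()): the mutation can stay
# in the graph as a copy_ for inductor to optimize, and no extra output is
# added, roughly:
#     def forward(arg0):
#         mul = arg0 * 2
#         copy_ = arg0.copy_(mul)
#         add = mul + 1
#         return (add,)        # 1 output only
compiled_f = torch.compile(f)
with torch.no_grad():
    print(compiled_f(torch.ones(3)))   # tensor([3., 3., 3.])
```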


@pytorch-bot (bot) commented Aug 28, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/108081

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Merge Blocking SEV

There is 1 active merge blocking SEV.

If you must merge, use @pytorchbot merge -f.

✅ No Failures

As of commit c33fed0 with merge base a16b0aa:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

bdhirsh added a commit that referenced this pull request Aug 28, 2023
ghstack-source-id: 3404582932e2237cabc5c88c46eb896a11c81a0e
Pull Request resolved: #108081
@wconstab (Contributor) left a comment

thanks for PR'ing this fix! would be good to add a test, i can turn the orig PR repro into one in a follow up.

@bdhirsh (Contributor, Author) commented Aug 28, 2023

Sounds good - let me try to make a simpler test (should be doable just by using inductor + inference graph + input mutations + layout optimization)

@bdhirsh (Contributor, Author) commented Aug 28, 2023

Oh I'll also add an assertion that we don't go beyond the number of graph outputs (which would also have caught this issue instead of silently not preserving strides for certain outputs)
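
To make the shape of that check concrete, a minimal sketch (the variable names and values are hypothetical, chosen to mirror the 6-mutations / 1-output case above; this is not inductor's actual code):

```
# Hypothetical values mirroring the reported case: an inference graph with
# 1 real output, where inductor wrongly assumed 6 extra "mutation" outputs.
num_graph_outputs = 1
output_indices_to_preserve = range(6)

# The proposed guard: never try to preserve strides for an output index that
# does not exist. With the values above this assert fires, instead of
# silently skipping stride preservation for some outputs.
assert all(idx < num_graph_outputs for idx in output_indices_to_preserve), (
    "trying to preserve strides for a graph output index that does not exist"
)
```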

@bdhirsh (Contributor, Author) commented Aug 28, 2023

Hmm I tried creating a repro by taking the example from the linked issue, removing the DDP code, and manually inserting a graph break at the same place that the ddp optimizer would. I confirmed that between the original repro and my modified version the AOTAutograd graphs that we send to inductor are identical, but for some reason I can't trigger the issue.

I'll try inspecting the generated inductor code and see if it's any different.
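
For reference, this is the kind of manual graph-break experiment described above; a minimal sketch, where the toy module and the split point are made up rather than taken from the original repro:

```
import torch
import torch._dynamo

class TwoStage(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(128, 128, 3, padding=1)
        self.conv2 = torch.nn.Conv2d(128, 128, 3, padding=1)

    def forward(self, x):
        x = self.conv1(x)
        # Force a graph break roughly where the ddp optimizer would split the
        # module, so each half is compiled as a separate inductor graph.
        torch._dynamo.graph_break()
        return self.conv2(x)

with torch.no_grad():
    out = torch.compile(TwoStage())(torch.randn(2, 128, 8, 8))
```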

@shunting314 (Contributor) commented

@bdhirsh that may be expected, since the graph break caused by DDPOptimizer has a subtle difference compared to other graph breaks in dynamo. Check #102591 for a bit more detail.

@bdhirsh (Contributor, Author) commented Aug 29, 2023

Hmm @shunting314 - Will pointed out to me that the issue you linked only affects code that's run under TORCHINDUCTOR_KEEP_OUTPUT_STRIDE=0. But my repro isn't using that env var, so that issue is probably unrelated?

@shunting314 (Contributor) commented

@bdhirsh my point is that for manually injected graph breaks, the downstream graph already gets the output of the OPTIMIZED upstream graph, so whether we pick the correct output tensors to preserve strides for does not matter. Even if the compiler changes an output tensor's strides, the downstream graph will get tensors with those updated strides. I just want to explain why you could not repro earlier; I think we have to use DDPOptimizer to repro the issue.

@wconstab (Contributor) commented

I'm not sure how Brian did the repro, but I would expect you could repro without DDPOptimizer and without graph breaks, just by:

  1. creating a model with some input mutations
  2. running its forward inside no_grad, so inductor incorrectly ignores preserving output strides
  3. asserting that the output tensors from the model have the same strides as eager had when running the model

Why is DDPOptimizer needed for the repro?

I do understand that when graph breaks are introduced by dynamo, dynamo works around this during compilation, ensuring the downstream graph will match expectations.

```
@@ -3270,6 +3270,29 @@ def f():

self.assertEqual(f(), _GLOBAL_CPU_TENSOR + _GLOBAL_CPU_TENSOR)

@torch._inductor.config.patch("layout_optimization", True)
```
@bdhirsh (Contributor, Author) commented on this line:
@wconstab got a repro! It just took a bit more work than I thought in order to get inductor to actually change the output stride of the convolution. I needed:

  • layout_optimization = True
  • channels dim to be > 64 (from the inductor code)
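
Putting those two conditions together, a minimal sketch of this kind of repro test (the module, shapes, and device handling below are illustrative assumptions, not the actual test added in this PR):

```
import torch
import torch._inductor.config as inductor_config

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # channels > 64 so inductor's layout optimization switches the conv
        # to channels_last and the output strides can differ from eager.
        self.conv = torch.nn.Conv2d(128, 128, 3, padding=1)

    def forward(self, x, buf):
        buf.add_(1)            # input mutation, kept in the inference graph
        return self.conv(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
m = M().to(device)
x = torch.randn(2, 128, 16, 16, device=device)
buf = torch.zeros(2, 128, 16, 16, device=device)

with torch.no_grad():
    eager_out = m(x, buf.clone())
    with inductor_config.patch("layout_optimization", True):
        compiled_out = torch.compile(m)(x, buf.clone())

# Before this fix, inductor could miscount the mutated inputs as extra graph
# outputs and fail to restore the eager strides on the real output.
assert eager_out.stride() == compiled_out.stride()
```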

Contributor reply:

great, thanks @bdhirsh!!

Contributor reply:
yea, we may disable layout optimization on the fly if certain patterns are found.

pytorchmergebot pushed a commit that referenced this pull request Sep 5, 2023
…stride=False together (#108235)

From talking to @wconstab, we agreed that because of the way DDPOptimizer is written, it is (sort of) incompatible with inductor's `keep_output_stride=False` optimizations (and will cause silent correctness problems if you use them together). Added an assertion.

Pull Request resolved: #108235
Approved by: https://github.com/wconstab
ghstack dependencies: #108081
pytorchmergebot pushed a commit that referenced this pull request Sep 5, 2023
Fixes #105327. The problem is that `lift_fresh_copy()`'s functionalization implementation currently assumes that the input is always not functional. This is apparently too limiting: when you have "user" code like this (which can potentially come from exporting a model and then running compile on the resulting graph):
```
tensor_constant0 = torch.tensor(2)
lift_fresh = torch.ops.aten.lift_fresh_copy.default(tensor_constant0)
```

When we run this through AOTAutograd, the first call (torch.tensor(2)) will **already** be lifted into a functional tensor wrapper - so the `lift_fresh_copy` call doesn't need to do any "lifting" anymore - it just needs to do a clone.
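
A minimal sketch of that pattern end to end (the function name is made up; this only illustrates the shape of the exported-style code, not the original repro from #105327):

```
import torch

def exported_style_fn():
    # The kind of code an exported graph can contain: a tensor constant
    # followed by an explicit lift_fresh_copy on it.
    tensor_constant0 = torch.tensor(2)
    lift_fresh = torch.ops.aten.lift_fresh_copy.default(tensor_constant0)
    return lift_fresh + 1

# Under torch.compile, AOTAutograd has already wrapped the constant in a
# functional tensor by the time lift_fresh_copy runs, so the op only needs
# to clone (the behavior this follow-up PR implements).
print(torch.compile(exported_style_fn)())   # tensor(3)
```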

Pull Request resolved: #108243
Approved by: https://github.com/albanD
ghstack dependencies: #108081, #108235
@facebook-github-bot deleted the gh/bdhirsh/452/head branch September 9, 2023 14:21