Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[inductor] layout optimization for conv #99773

Closed
wants to merge 53 commits into from

Conversation

shunting314
Copy link
Contributor

@shunting314 shunting314 commented Apr 21, 2023

Stack from ghstack (oldest at bottom):

convolution kernel with channels last runs much faster then kernel with contiguous inputs. The PR leverage that to optimize tensor layouts so we provide 'channels last' inputs to convolution. Some care need to be taken to not convert tensor layout between contiguous and channels last back and forth. Those extra copies hurt performance quite much.

Latest perf number here

  • TB: 1.64x -> 1.69x
  • HF: 1.79x -> 1.78x (random noise)
  • TIMM: 1.51x -> 1.65x

Right now we disable layout optimization for dynamic shape since there is perf loss in that combination. Here is a GH issue to followup: #102670

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @anijain2305 @soumith @desertfire

@pytorch-bot
Copy link

pytorch-bot bot commented Apr 21, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/99773

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 94c2012:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

shunting314 added a commit that referenced this pull request Apr 21, 2023
ghstack-source-id: 4d7cc9cae903c92e14de207c7c5b50d6a10088f4
Pull Request resolved: #99773
@github-actions
Copy link

This PR needs a label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@shunting314 shunting314 changed the title [inductor] layout optimization for conv [WIP][inductor] layout optimization for conv Apr 21, 2023
@ngimel
Copy link
Collaborator

ngimel commented Apr 22, 2023

This looks fine but I"m interested in the full results for timm suite.

@shunting314
Copy link
Contributor Author

shunting314 commented Apr 22, 2023

This looks fine but I"m interested in the full results for timm suite.

The CI run is ready.

We see there are quite a few torchench/timm_models get 10-20% speedup

There are also a lot of things need to dig.

  • some models have slow down
  • some passed model failed
  • for some model having improved perf, they get worse memory compression ratio. E.g. the 3 resnet models in torchbench

My goal here is figure these things out so that we get the speedup by using channels last layout for convolution but don't slow-down or fail any passed model. I'll debug memory compression ratio after that.

@jansel
Copy link
Contributor

jansel commented Apr 22, 2023

Yup makes sense, the tricky part will be coming up with a heuristic to harvest all those wins without taking on the slowdowns.

convolution kernel with channels last runs much faster then kernel with contiguous inputs. The PR leverage that to optimize tensor layouts so we provide 'channels last' inputs to convolution. Some care need to be taken to not convert tensor layout between contiguous and channels last back and forth. Those extra copies hurt performance quite much.

# Example command
- with the optimization disabled

TORCHINDUCTOR_LAYOUT_OPT=0 python benchmarks/dynamo/torchbench.py --backend inductor --amp --perform
ance --dashboard --only resnet18 --training

- with the optimization enabled

TORCHINDUCTOR_LAYOUT_OPT=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --perform
ance --dashboard --only resnet18 --training


# Result
I'll do some local runs first and then create a report from CI.

- resnet18: 1.269x -> 1.446x
- resnet50: 1.100x -> 1.263x
- resnet152: 1.048x -> 1.218x
- vgg16: 1.266x -> 1.500x
- alexnet: 1.116x -> 1.263x
- timm_resnest: 1.582x -> FAIL (NEED DEBUG)
- resmlp_12_224: 1.265x -> 1.290x
- convmixer_768_32: 0.994x -> 2.958x

- hf_Bert: 1.489x -> 1.487x (neutral as expected as the model does not have convolution)



cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire

[ghstack-poisoned]
shunting314 added a commit that referenced this pull request Apr 25, 2023
ghstack-source-id: 7c0ea5bbfe09d4fae81839c4b30ddfba599b391a
Pull Request resolved: #99773
convolution kernel with channels last runs much faster then kernel with contiguous inputs. The PR leverage that to optimize tensor layouts so we provide 'channels last' inputs to convolution. Some care need to be taken to not convert tensor layout between contiguous and channels last back and forth. Those extra copies hurt performance quite much.

# Example command
- with the optimization disabled

TORCHINDUCTOR_LAYOUT_OPT=0 python benchmarks/dynamo/torchbench.py --backend inductor --amp --perform
ance --dashboard --only resnet18 --training

- with the optimization enabled

TORCHINDUCTOR_LAYOUT_OPT=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --perform
ance --dashboard --only resnet18 --training


# Result
I'll do some local runs first and then create a report from CI.

- resnet18: 1.269x -> 1.446x
- resnet50: 1.100x -> 1.263x
- resnet152: 1.048x -> 1.218x
- vgg16: 1.266x -> 1.500x
- alexnet: 1.116x -> 1.263x
- timm_resnest: 1.582x -> FAIL (NEED DEBUG)
- resmlp_12_224: 1.265x -> 1.290x
- convmixer_768_32: 0.994x -> 2.958x

- hf_Bert: 1.489x -> 1.487x (neutral as expected as the model does not have convolution)



cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire

[ghstack-poisoned]
shunting314 added a commit that referenced this pull request Apr 25, 2023
ghstack-source-id: e928b06a63441578b99e5f1f7df4b230a8be0666
Pull Request resolved: #99773
@shunting314
Copy link
Contributor Author

Unlike regular convolution, grouped convolution (groups argument > 1) prefers channels last. Using channels last for grouped convolution can results in 1.65x slow down (https://github.com/pytorch/pytorch/pull/99971/files#diff-7e3515959c8570fe48dcbeb882b15bc9079729b3da365c50f622ec9d4349adc9R25 ).

Here are some tests for timm_regnet. Previously I used channels last for all convolutions. That even slows down inference. Later on, I use channels last for regular convolution while contiguous for grouped convolution, we improve inference from 1.453x to 1.710x (number measured on my dev environment). But with this strategy training is still slowed down (slow down from 1.168x to 0.847x).

For now, I just disable the convolution layout optimization if a model uses grouped convolution.

@shunting314
Copy link
Contributor Author

New round of perf result

We can see previous failed models like timm_resnest/timm_nfnet now pass. Also previously we see timm_regnet slows down from 1.23x to 0.88x. Now it's neutral since the model contains grouped convolution. And we skip layout optimization for such models for now.

Still need debug why some other models get slow down.

convolution kernel with channels last runs much faster then kernel with contiguous inputs. The PR leverage that to optimize tensor layouts so we provide 'channels last' inputs to convolution. Some care need to be taken to not convert tensor layout between contiguous and channels last back and forth. Those extra copies hurt performance quite much.

# Example command
- with the optimization disabled

TORCHINDUCTOR_LAYOUT_OPT=0 python benchmarks/dynamo/torchbench.py --backend inductor --amp --perform
ance --dashboard --only resnet18 --training

- with the optimization enabled

TORCHINDUCTOR_LAYOUT_OPT=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --perform
ance --dashboard --only resnet18 --training


# Result
I'll do some local runs first and then create a report from CI.

- resnet18: 1.269x -> 1.446x
- resnet50: 1.100x -> 1.263x
- resnet152: 1.048x -> 1.218x
- vgg16: 1.266x -> 1.500x
- alexnet: 1.116x -> 1.263x
- timm_resnest: 1.582x -> FAIL (NEED DEBUG)
- resmlp_12_224: 1.265x -> 1.290x
- convmixer_768_32: 0.994x -> 2.958x

- hf_Bert: 1.489x -> 1.487x (neutral as expected as the model does not have convolution)



cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire

[ghstack-poisoned]
shunting314 added a commit that referenced this pull request Apr 25, 2023
ghstack-source-id: 212df6e3d38bc757576dbf628a6817aa4fb148a7
Pull Request resolved: #99773
convolution kernel with channels last runs much faster then kernel with contiguous inputs. The PR leverage that to optimize tensor layouts so we provide 'channels last' inputs to convolution. Some care need to be taken to not convert tensor layout between contiguous and channels last back and forth. Those extra copies hurt performance quite much.

# Example command
- with the optimization disabled

TORCHINDUCTOR_LAYOUT_OPT=0 python benchmarks/dynamo/torchbench.py --backend inductor --amp --perform
ance --dashboard --only resnet18 --training

- with the optimization enabled

TORCHINDUCTOR_LAYOUT_OPT=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --perform
ance --dashboard --only resnet18 --training


# Result
I'll do some local runs first and then create a report from CI.

- resnet18: 1.269x -> 1.446x
- resnet50: 1.100x -> 1.263x
- resnet152: 1.048x -> 1.218x
- vgg16: 1.266x -> 1.500x
- alexnet: 1.116x -> 1.263x
- timm_resnest: 1.582x -> FAIL (NEED DEBUG)
- resmlp_12_224: 1.265x -> 1.290x
- convmixer_768_32: 0.994x -> 2.958x

- hf_Bert: 1.489x -> 1.487x (neutral as expected as the model does not have convolution)



cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire

[ghstack-poisoned]
shunting314 added a commit that referenced this pull request Apr 25, 2023
ghstack-source-id: 458c2db61563640867511917e50e36867663d738
Pull Request resolved: #99773
convolution kernel with channels last runs much faster then kernel with contiguous inputs. The PR leverage that to optimize tensor layouts so we provide 'channels last' inputs to convolution. Some care need to be taken to not convert tensor layout between contiguous and channels last back and forth. Those extra copies hurt performance quite much.

# Example command
- with the optimization disabled

TORCHINDUCTOR_LAYOUT_OPT=0 python benchmarks/dynamo/torchbench.py --backend inductor --amp --perform
ance --dashboard --only resnet18 --training

- with the optimization enabled

TORCHINDUCTOR_LAYOUT_OPT=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --perform
ance --dashboard --only resnet18 --training


# Result
I'll do some local runs first and then create a report from CI.

- resnet18: 1.269x -> 1.446x
- resnet50: 1.100x -> 1.263x
- resnet152: 1.048x -> 1.218x
- vgg16: 1.266x -> 1.500x
- alexnet: 1.116x -> 1.263x
- timm_resnest: 1.582x -> FAIL (NEED DEBUG)
- resmlp_12_224: 1.265x -> 1.290x
- convmixer_768_32: 0.994x -> 2.958x

- hf_Bert: 1.489x -> 1.487x (neutral as expected as the model does not have convolution)



cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire

[ghstack-poisoned]
shunting314 added a commit that referenced this pull request Apr 26, 2023
ghstack-source-id: 359c5157f611d6e031a2fcc6aa47382450d74e33
Pull Request resolved: #99773
convolution kernel with channels last runs much faster then kernel with contiguous inputs. The PR leverage that to optimize tensor layouts so we provide 'channels last' inputs to convolution. Some care need to be taken to not convert tensor layout between contiguous and channels last back and forth. Those extra copies hurt performance quite much.

# Example command
- with the optimization disabled

TORCHINDUCTOR_LAYOUT_OPT=0 python benchmarks/dynamo/torchbench.py --backend inductor --amp --perform
ance --dashboard --only resnet18 --training

- with the optimization enabled

TORCHINDUCTOR_LAYOUT_OPT=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --perform
ance --dashboard --only resnet18 --training


# Result
I'll do some local runs first and then create a report from CI.

- resnet18: 1.269x -> 1.446x
- resnet50: 1.100x -> 1.263x
- resnet152: 1.048x -> 1.218x
- vgg16: 1.266x -> 1.500x
- alexnet: 1.116x -> 1.263x
- timm_resnest: 1.582x -> FAIL (NEED DEBUG)
- resmlp_12_224: 1.265x -> 1.290x
- convmixer_768_32: 0.994x -> 2.958x

- hf_Bert: 1.489x -> 1.487x (neutral as expected as the model does not have convolution)



cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire

[ghstack-poisoned]
shunting314 added a commit that referenced this pull request May 1, 2023
ghstack-source-id: f45e8024b431e6e3b7c0d7647b518b146ce13373
Pull Request resolved: #99773
convolution kernel with channels last runs much faster then kernel with contiguous inputs. The PR leverage that to optimize tensor layouts so we provide 'channels last' inputs to convolution. Some care need to be taken to not convert tensor layout between contiguous and channels last back and forth. Those extra copies hurt performance quite much.

# Example command
- with the optimization disabled

TORCHINDUCTOR_LAYOUT_OPT=0 python benchmarks/dynamo/torchbench.py --backend inductor --amp --perform
ance --dashboard --only resnet18 --training

- with the optimization enabled

TORCHINDUCTOR_LAYOUT_OPT=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --perform
ance --dashboard --only resnet18 --training


# Result
I'll do some local runs first and then create a report from CI.

- resnet18: 1.269x -> 1.446x
- resnet50: 1.100x -> 1.263x
- resnet152: 1.048x -> 1.218x
- vgg16: 1.266x -> 1.500x
- alexnet: 1.116x -> 1.263x
- timm_resnest: 1.582x -> FAIL (NEED DEBUG)
- resmlp_12_224: 1.265x -> 1.290x
- convmixer_768_32: 0.994x -> 2.958x

- hf_Bert: 1.489x -> 1.487x (neutral as expected as the model does not have convolution)



cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire

[ghstack-poisoned]
@shunting314
Copy link
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 2, 2023
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

The merge job was canceled. If you believe this is a mistake,then you can re trigger it through pytorch-bot.

@shunting314
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 23fc8d626446b8d2308d39e0586ef1b5c5e14113 returned non-zero exit code 1

Auto-merging benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_training.csv
CONFLICT (content): Merge conflict in benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_training.csv
Auto-merging test/test_fake_tensor.py
CONFLICT (content): Merge conflict in test/test_fake_tensor.py
Auto-merging torch/_functorch/aot_autograd.py
CONFLICT (content): Merge conflict in torch/_functorch/aot_autograd.py
Auto-merging torch/_inductor/codegen/triton.py
CONFLICT (content): Merge conflict in torch/_inductor/codegen/triton.py
Auto-merging torch/_inductor/codegen/wrapper.py
Auto-merging torch/_inductor/fx_passes/post_grad.py
Auto-merging torch/_inductor/ir.py
Auto-merging torch/_meta_registrations.py
error: could not apply 23fc8d62644... [inductor] layout optimization for conv
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
Details for Dev Infra team Raised by workflow job

convolution kernel with channels last runs much faster then kernel with contiguous inputs. The PR leverage that to optimize tensor layouts so we provide 'channels last' inputs to convolution. Some care need to be taken to not convert tensor layout between contiguous and channels last back and forth. Those extra copies hurt performance quite much.

Latest perf number [here](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2024%20May%202023%2023%3A40%3A37%20GMT&stopTime=Wed%2C%2031%20May%202023%2023%3A40%3A37%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=shunting-layout-opt-19&lCommit=baa797fc100688dfb044fbcbdebcfd2591710f78&rBranch=main&rCommit=999bae0f54108ffc5b7cf2524a02a83901554b16)
- TB: 1.64x -> 1.69x
- HF: 1.79x -> 1.78x (random noise)
- TIMM: 1.51x -> 1.65x

Right now we disable layout optimization for dynamic shape since there is perf loss in that combination. Here is a GH issue to followup: #102670



cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel anijain2305 soumith desertfire

[ghstack-poisoned]
shunting314 added a commit that referenced this pull request Jun 2, 2023
ghstack-source-id: af44d77ecc46147afa880d6dfdc63b1151ed7fa4
Pull Request resolved: #99773
convolution kernel with channels last runs much faster then kernel with contiguous inputs. The PR leverage that to optimize tensor layouts so we provide 'channels last' inputs to convolution. Some care need to be taken to not convert tensor layout between contiguous and channels last back and forth. Those extra copies hurt performance quite much.

Latest perf number [here](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2024%20May%202023%2023%3A40%3A37%20GMT&stopTime=Wed%2C%2031%20May%202023%2023%3A40%3A37%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=shunting-layout-opt-19&lCommit=baa797fc100688dfb044fbcbdebcfd2591710f78&rBranch=main&rCommit=999bae0f54108ffc5b7cf2524a02a83901554b16)
- TB: 1.64x -> 1.69x
- HF: 1.79x -> 1.78x (random noise)
- TIMM: 1.51x -> 1.65x

Right now we disable layout optimization for dynamic shape since there is perf loss in that combination. Here is a GH issue to followup: #102670



cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel anijain2305 soumith desertfire

[ghstack-poisoned]
shunting314 added a commit that referenced this pull request Jun 2, 2023
ghstack-source-id: ad58d381595366fe374618ca5f6f530cd18ea917
Pull Request resolved: #99773
@shunting314
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

alimoezzi pushed a commit to alimoezzi/pytorch that referenced this pull request Jun 3, 2023
convolution kernel with channels last runs much faster then kernel with contiguous inputs. The PR leverage that to optimize tensor layouts so we provide 'channels last' inputs to convolution. Some care need to be taken to not convert tensor layout between contiguous and channels last back and forth. Those extra copies hurt performance quite much.

Latest perf number [here](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2024%20May%202023%2023%3A40%3A37%20GMT&stopTime=Wed%2C%2031%20May%202023%2023%3A40%3A37%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=shunting-layout-opt-19&lCommit=baa797fc100688dfb044fbcbdebcfd2591710f78&rBranch=main&rCommit=999bae0f54108ffc5b7cf2524a02a83901554b16)
- TB: 1.64x -> 1.69x
- HF: 1.79x -> 1.78x (random noise)
- TIMM: 1.51x -> 1.65x

Right now we disable layout optimization for dynamic shape since there is perf loss in that combination. Here is a GH issue to followup: pytorch#102670

Pull Request resolved: pytorch#99773
Approved by: https://github.com/jansel
@ezyang
Copy link
Contributor

ezyang commented Jun 3, 2023

Dynamic shape perf metric is neutral now since we disable layout optimization for it dashboard

Following up on this, do we have a plan for dynamic shapes? It seems like layout optimization should help even with dynamic batch size. Is the problem tuning the heuristics?

cc @voznesenskym

@voznesenskym
Copy link
Contributor

Dynamic shape perf metric is neutral now since we disable layout optimization for it dashboard

Following up on this, do we have a plan for dynamic shapes? It seems like layout optimization should help even with dynamic batch size. Is the problem tuning the heuristics?

cc @voznesenskym

It would be nice if we stopped doing changes where we ignored dyn shapes. I don't know what the best way to educate folks is, other than asking @jansel to reject PRs that carve out dynamic shapes / don't take them into account.

That being said, @shunting314 has done a lot of due diligence and has a very nice followup task linked in the code #102670 - I think, at the very least, we should start by replacing the if dynamic check here with a check for if all the inputs are static, (after we reject conv > 1 channels, smaller out channels, sdpa, we can iterate over the nodes and I think if they are all static it will be identical to if dynamic=False for this feature).

@shunting314 let's followup, I am happy to help with the change I described above, and also with actual proper layout optimization for dyn afterwords, not just static-in-dyn.

@shunting314
Copy link
Contributor Author

Thanks @ezyang and @voznesenskym for following up on dynamic shape support.

In #102670, we've found that the main reason we see slow down when enabling both dynamic shape and layout optimization is because of the disabling of split reduction. Currently, when dynamic shape is enabled, we may disable split reduction in the code (if any dimension has dynamic shape). It turns out that layout optimization get more penalty by disabling split reduction (we have experiment results support this in the tracking issue).

So if we want to let layout optimization brings gain for dynamic shape as well, we should try to enable split reduction. Some ideas we have is to generate multiple kernels for reduction - each kernel corresponds to different ranges of values for the dynamic dimension. Then at runtime we can pick the proper one based on the runtime value of the dynamic dimension. We are building this multi-kernel support anyway for something else, so it looks promising to build it in general and apply it here as well.

@ezyang
Copy link
Contributor

ezyang commented Jun 5, 2023

OK. Generating multiple kernels is the PoR. We just have to do it.

Who is working on multi-kernel support?

@shunting314
Copy link
Contributor Author

Who is working on multi-kernel support?

I'll work on that after I finish a couple of other followups on this PR.

# saved activations can have different stride to eager if
# the compiler does layout optimization. We should restride the
# tensor passed in for compiling the backward graph using the
# saved tensor's stride.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why layout optimization cannot be done prior to partitioning? If it could be done at that point in time, then you would have all the strides correct and you wouldn't have to write this code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So layout optimization currently is applied in inductor IR (i.e. apply certain strides in ir.Layout) so it has to be done after partitioning. If we were adding some Fx nodes to represent layout change, then maybe we can do that early before the partitioning. But that's a completely different way to make it work. I also think this introduce extra complexities on AOTAutograd side.

real_arg = all_args[i]
if not isinstance(ph_arg, torch.Tensor):
continue
if ph_arg.stride() != real_arg.stride():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This forces specializations because ph_arg is symbolic, but real_arg is not symbolic, and an equality test between not symbolic and symbolic does specialization. Then you run into the other problem which is we silently discard backward guards (hopefully you got a warning about this, which helped you diagnose the problem.)

There is no way to do this stride test. What you should do instead is a permutation test. It should be possible to extract the layout permutation without causing extra guards and then do an equality test.

But I am also skeptical that you should be doing this logic here anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ezyang for explanation! One question I have is, can I get the stride hint for ph_arg directly and use that to compare with real_arg.stride() ?

Hopefully this also makes sure we don't drop dynamic dimensions in ph_arg.

@facebook-github-bot facebook-github-bot deleted the gh/shunting314/52/head branch June 8, 2023 18:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants