
[inductor] Enable multilayer reductions with dynamic shapes #106747

Closed · wants to merge 6 commits

Conversation

peterbell10 (Collaborator) commented Aug 8, 2023

Stack from ghstack (oldest at bottom):

Currently, multilayer reductions (aka split reductions) are only used with static
shapes, which results in worse performance and accuracy when dynamic shapes are
enabled. Instead, this PR only requires that the shape has a hint value.
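
For context, a minimal sketch of what a two-layer ("multilayer", or split) reduction computes, written here with plain PyTorch ops rather than Inductor's generated kernels; the chunk count is illustrative:

```python
import torch

def split_sum(x: torch.Tensor, num_splits: int) -> torch.Tensor:
    # Layer 1: reduce each chunk independently; in the generated kernel these
    # partial reductions run in parallel across thread blocks.
    partials = torch.stack([chunk.sum() for chunk in x.chunk(num_splits)])
    # Layer 2: reduce the partial results.
    return partials.sum()

x = torch.randn(1 << 20)
print(split_sum(x, 64).item(), x.sum().item())  # close; the chunked order is often *more* accurate
```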

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov

pytorch-bot bot commented Aug 8, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/106747

Note: Links to docs will display an error until the docs builds have been completed.

✅ 3 Unrelated Failures

As of commit 27705b3:

UNSTABLE - The following jobs failed, likely due to flakiness present on trunk, and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Comment on lines +611 to +612
and _is_static(reduction_numel_hint)
and _is_static(numel_hint)
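
For readers outside the diff, a sketch of the gate being relaxed; the body of `_is_static` below is an assumption about what the helper checks, not a quote of the source:

```python
import sympy

def _is_static(x) -> bool:
    # Assumed meaning: x is a concrete integer, not a symbolic expression.
    return isinstance(x, (int, sympy.Integer))

# Before this PR the split-reduction path required both hints to satisfy
# _is_static; after it, a hint (the example value recorded for a dynamic
# dimension) is enough to give the split heuristics a concrete number.
```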
Collaborator commented:

Can it be the case that numel is static but reduction_numel is not?

Collaborator commented:

In fact, we probably just want _is_static(reduction_numel_hint) for the general case. Then if we also have numel_hint static, even better, but I don't think it's 100% necessary for the main optimisation.

peterbell10 (Author) commented:

> Can it be the case that numel is static but reduction_numel is not?

numel here is actually the number of output elements, so it doesn't include the reduced dimensions.

> In fact, we probably just want _is_static(reduction_numel_hint) for the general case. Then if we also have numel_hint static, even better, but I don't think it's 100% necessary for the main optimisation.

numel_hint is used in conditionals like numel_hint >= num_sm * 2 * 32 when deciding on the number of splits, so we need to have a concrete value.
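
For illustration, a hedged sketch of the kind of occupancy condition referenced above; the num_sm default and the final threshold are assumptions, not the exact Inductor heuristic:

```python
def wants_split(reduction_numel_hint: int, numel_hint: int, num_sm: int = 108) -> bool:
    # If the output already has enough elements to saturate the GPU (each SM
    # covering a couple of warps of work), a single-kernel reduction keeps
    # every SM busy and no split is needed.
    if numel_hint >= num_sm * 2 * 32:
        return False
    # Otherwise, split a sufficiently large reduced dimension across blocks,
    # trading a second (small) reduction layer for much more parallelism.
    return reduction_numel_hint >= 4096  # illustrative threshold
```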

peterbell10 (Author) commented:

btw, as a further step we could use bound_sympy to deal with unbacked SymInts, but that's more than I need at this point to get cumsum working.

Collaborator commented:

bound_sympy does not currently deal with unbacked SymInts, but may be able to do something in some cases when #106568 is merged.

peterbell10 (Author) commented:

Say I had an expression s0*100: wouldn't bound_sympy give a lower bound of 100, since shape variables are positive?

Collaborator commented:

That currently happens if the symbol is marked as non-negative. When that PR is merged, we'll leverage all the other information we may have as part of the value range analysis, and the constraints that are put in place during tracing time.
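
To make the s0*100 example concrete, the underlying interval reasoning can be shown in plain sympy (a sketch; bound_sympy itself lives in PyTorch's value-range analysis and also uses tracing-time constraints):

```python
import sympy

s0 = sympy.Symbol("s0", positive=True, integer=True)
expr = 100 * s0

# Shape variables are positive integers, so s0 >= 1. Since expr is
# monotonically increasing in s0, its infimum is attained at s0 = 1:
print(expr.subs(s0, 1))  # 100; there is no finite upper bound
```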

peterbell10 added two commits to peterbell10/pytorch that referenced this pull request Aug 8, 2023
peterbell10 marked this pull request as ready for review August 9, 2023 14:00
peterbell10 changed the title from "WIP: [inductor] Enable multilayer reductions with dynamic shapes" to "[inductor] Enable multilayer reductions with dynamic shapes" Aug 9, 2023
peterbell10 added the "topic: not user facing" label Aug 9, 2023
peterbell10 added a commit to peterbell10/pytorch that referenced this pull request Aug 9, 2023
lezcano (Collaborator) left a comment:

Great stuff! Perhaps also open an issue to follow up on the bound_sympy point?

peterbell10 (Author) commented:

@pytorchbot merge

pytorch-bot added the "ciflow/trunk" (Trigger trunk jobs on your pull request) label Aug 10, 2023
pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team.

Advanced Debugging: check the merge workflow status here.

Comment on lines +598 to +600
reduction_numel_hint = V.graph.sizevars.symbolic_hint(reduction_numel)
numel_hint = V.graph.sizevars.symbolic_hint(sympy_product(ranges))

Contributor commented:

Maybe as a follow-up, add guards based on the heuristics here:

def inner_reduction_splits(reduction_numel_hint, numel_hint):
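
A hedged sketch of what such a guard could look like; guard_leq and guard_lt are hypothetical sizevars helper names used to show the idea, not necessarily the real API:

```python
def guard_split_heuristic(sizevars, numel, numel_hint, num_sm=108):
    # Hypothetical sketch: pin the heuristic branch that the hint selected, so
    # runtime sizes on the other side of the threshold trigger a recompile
    # rather than reusing a kernel that was split for the wrong regime.
    threshold = num_sm * 2 * 32
    if numel_hint >= threshold:
        sizevars.guard_leq(threshold, numel)  # hypothetical helper name
    else:
        sizevars.guard_lt(numel, threshold)   # hypothetical helper name
```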

facebook-github-bot deleted the gh/peterbell10/593/head branch August 14, 2023 14:17
shunting314 (Contributor) commented:

I vaguely remember Voz had something similar to this but it did not land? I guess the reason is that split reduction based on a hint may not work well when the dynamic dimension changes later. cc @voznesenskym @ezyang

peterbell10 (Author) commented:

I couldn't find Voz's PR, but in my own testing I'm seeing a pretty reasonable trade-off.

When close to the hint size, x.sum() is significantly faster when split, up to 10x for very large tensors. Then if the runtime size is tiny, the split kernel is only ~1.5x slower than the non-split kernel. On the other hand, if the runtime size is larger than the hint, then it's still a huge win. So this only presents an issue if the hint grossly overstates the actual runtime sizes.
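
As an illustration of how such a comparison can be reproduced (a sketch; the sizes, CUDA device, and use of dynamic=True are my assumptions, not the exact methodology used here):

```python
import torch
from torch.utils.benchmark import Timer

@torch.compile(dynamic=True)
def f(x):
    return x.sum()

# Compile once at a large "hint" size, then time the same compiled function
# at sizes far below and above the hint.
f(torch.randn(1 << 26, device="cuda"))
for n in (1 << 10, 1 << 26, 1 << 28):
    x = torch.randn(n, device="cuda")
    t = Timer("f(x)", globals={"f": f, "x": x}).blocked_autorange()
    print(n, t.median)
```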

eellison (Contributor) commented:

@peterbell10 this is true for the case where we run with large tensors, emit multilayer reductions, and then run with small tensors. But at present, if we run with small tensors and emit a single layer, then when we run with large tensors later we will still hit the single-layer path.

peterbell10 (Author) commented:

Yes, I agree this has no effect when the runtime size is larger than the hint. My impression was that guarding purely for performance reasons was discouraged, though maybe the accuracy implications would change that. I'm not sure; maybe @voznesenskym could weigh in, as he removed the old maybe_guard performance guards.

ezyang (Contributor) commented Aug 18, 2023

We really need some way to indicate "please use this alternate size as the hint when compiling the dynamic shape". This is related to #105634 (comment).
