
[inductor] Enable multilayer reductions with dynamic shapes #106747

Closed · wants to merge 6 commits

Conversation

peterbell10 (Collaborator) commented Aug 8, 2023

Stack from ghstack (oldest at bottom):

Currently, multilayer reductions (aka split reductions) are only used with static
shapes, which results in worse performance and accuracy when dynamic shapes are
enabled. Instead, this PR only requires that the shape has a hint value.
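
For context, a minimal sketch of what a two-layer ("multilayer", or split) reduction computes, written here with plain PyTorch ops rather than Inductor's generated kernels; the chunk count is illustrative:

```python
import torch

def split_sum(x: torch.Tensor, num_splits: int) -> torch.Tensor:
    # Layer 1: reduce each chunk independently; in the generated kernel these
    # partial reductions run in parallel across thread blocks.
    partials = torch.stack([chunk.sum() for chunk in x.chunk(num_splits)])
    # Layer 2: reduce the partial results.
    return partials.sum()

x = torch.randn(1 << 20)
print(split_sum(x, 64).item(), x.sum().item())  # close; the chunked order is often *more* accurate
```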

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov

pytorch-bot bot commented Aug 8, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/106747

Note: Links to docs will display an error until the docs builds have been completed.

✅ 3 Unrelated Failures

As of commit 27705b3:

UNSTABLE - The following jobs failed, likely due to flakiness present on trunk, and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Comment on lines +611 to +612
and _is_static(reduction_numel_hint)
and _is_static(numel_hint)
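
For readers outside the diff, a sketch of the gate being relaxed; the body of `_is_static` below is an assumption about what the helper checks, not a quote of the source:

```python
import sympy

def _is_static(x) -> bool:
    # Assumed meaning: x is a concrete integer, not a symbolic expression.
    return isinstance(x, (int, sympy.Integer))

# Before this PR the split-reduction path required both hints to satisfy
# _is_static; after it, a hint (the example value recorded for a dynamic
# dimension) is enough to give the split heuristics a concrete number.
```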
Collaborator commented:

Can it be the case that numel is static but reduction_numel is not?

Collaborator commented:

In fact, we probably just want _is_static(reduction_numel_hint) for the general case. Then if we also have numel_hint static, even better, but I don't think it's 100% necessary for the main optimisation.

peterbell10 (Author) commented:

> Can it be the case that numel is static but reduction_numel is not?

numel here is actually the number of output elements, so it doesn't include the reduced dimensions.

> In fact, we probably just want _is_static(reduction_numel_hint) for the general case. Then if we also have numel_hint static, even better, but I don't think it's 100% necessary for the main optimisation.

numel_hint is used in conditionals like numel_hint >= num_sm * 2 * 32 when deciding on the number of splits, so we need to have a concrete value.
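
For illustration, a hedged sketch of the kind of occupancy condition referenced above; the num_sm default and the final threshold are assumptions, not the exact Inductor heuristic:

```python
def wants_split(reduction_numel_hint: int, numel_hint: int, num_sm: int = 108) -> bool:
    # If the output already has enough elements to saturate the GPU (each SM
    # covering a couple of warps of work), a single-kernel reduction keeps
    # every SM busy and no split is needed.
    if numel_hint >= num_sm * 2 * 32:
        return False
    # Otherwise, split a sufficiently large reduced dimension across blocks,
    # trading a second (small) reduction layer for much more parallelism.
    return reduction_numel_hint >= 4096  # illustrative threshold
```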

peterbell10 (Author) commented:

btw, as a further step we could use bound_sympy to deal with unbacked SymInts, but that's more than I need at this point to get cumsum working.

Collaborator commented:

bound_sympy does not currently deal with unbacked SymInts, but may be able to do something in some cases when #106568 is merged.

peterbell10 (Author) commented:

Say I had an expression s0*100: wouldn't bound_sympy give a lower bound of 100, since shape variables are positive?

Collaborator commented:

That currently happens if the symbol is marked as non-negative. When that PR is merged, we'll leverage all the other information we may have as part of the value range analysis, and the constraints that are put in place during tracing time.
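
To make the s0*100 example concrete, the underlying interval reasoning can be shown in plain sympy (a sketch; bound_sympy itself lives in PyTorch's value-range analysis and also uses tracing-time constraints):

```python
import sympy

s0 = sympy.Symbol("s0", positive=True, integer=True)
expr = 100 * s0

# Shape variables are positive integers, so s0 >= 1. Since expr is
# monotonically increasing in s0, its infimum is attained at s0 = 1:
print(expr.subs(s0, 1))  # 100; there is no finite upper bound
```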

peterbell10 added two commits to peterbell10/pytorch that referenced this pull request Aug 8, 2023
peterbell10 marked this pull request as ready for review August 9, 2023 14:00
peterbell10 changed the title from "WIP: [inductor] Enable multilayer reductions with dynamic shapes" to "[inductor] Enable multilayer reductions with dynamic shapes" Aug 9, 2023
peterbell10 added the "topic: not user facing" label Aug 9, 2023
peterbell10 added a commit to peterbell10/pytorch that referenced this pull request Aug 9, 2023
lezcano (Collaborator) left a comment:

Great stuff! Perhaps also open an issue to follow up on the bound_sympy point?

peterbell10 (Author) commented:

@pytorchbot merge

pytorch-bot added the "ciflow/trunk" (Trigger trunk jobs on your pull request) label Aug 10, 2023
pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team.

Advanced Debugging: check the merge workflow status here.

Comment on lines +598 to +600
reduction_numel_hint = V.graph.sizevars.symbolic_hint(reduction_numel)
numel_hint = V.graph.sizevars.symbolic_hint(sympy_product(ranges))

Contributor commented:

Maybe as a follow-up, add guards based on the heuristics here:

def inner_reduction_splits(reduction_numel_hint, numel_hint):
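
A hedged sketch of what such a guard could look like; guard_leq and guard_lt are hypothetical sizevars helper names used to show the idea, not necessarily the real API:

```python
def guard_split_heuristic(sizevars, numel, numel_hint, num_sm=108):
    # Hypothetical sketch: pin the heuristic branch that the hint selected, so
    # runtime sizes on the other side of the threshold trigger a recompile
    # rather than reusing a kernel that was split for the wrong regime.
    threshold = num_sm * 2 * 32
    if numel_hint >= threshold:
        sizevars.guard_leq(threshold, numel)  # hypothetical helper name
    else:
        sizevars.guard_lt(numel, threshold)   # hypothetical helper name
```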

facebook-github-bot deleted the gh/peterbell10/593/head branch August 14, 2023 14:17
shunting314 (Contributor) commented:

I vaguely remember Voz had something similar to this but it did not land? I guess the reason is that split reduction based on a hint may not work well when the dynamic dimension changes later. cc @voznesenskym @ezyang

peterbell10 (Author) commented:

I couldn't find Voz's PR, but in my own testing I'm seeing a pretty reasonable trade-off.

When close to the hint size, x.sum() is significantly faster when split, up to 10x for very large tensors. Then if the runtime size is tiny, the split kernel is only ~1.5x slower than the non-split kernel. On the other hand, if the runtime size is larger than the hint, then it's still a huge win. So this only presents an issue if the hint grossly overstates the actual runtime sizes.
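
As an illustration of how such a comparison can be reproduced (a sketch; the sizes, CUDA device, and use of dynamic=True are my assumptions, not the exact methodology used here):

```python
import torch
from torch.utils.benchmark import Timer

@torch.compile(dynamic=True)
def f(x):
    return x.sum()

# Compile once at a large "hint" size, then time the same compiled function
# at sizes far below and above the hint.
f(torch.randn(1 << 26, device="cuda"))
for n in (1 << 10, 1 << 26, 1 << 28):
    x = torch.randn(n, device="cuda")
    t = Timer("f(x)", globals={"f": f, "x": x}).blocked_autorange()
    print(n, t.median)
```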

eellison (Contributor) commented:

@peterbell10 this is true for the case where we run with large tensors, emit multilayer reductions, and then run with small tensors. But at present, if we run with small tensors and emit a single layer, then when we run with large tensors later we will still hit the single-layer path.

peterbell10 (Author) commented:

Yes, I agree this has no effect when the runtime size is larger than the hint. My impression was that guarding purely for performance reasons was discouraged, though maybe the accuracy implications would change that. I'm not sure; maybe @voznesenskym could weigh in, as he removed the old maybe_guard performance guards.

ezyang (Contributor) commented Aug 18, 2023

We really need some way to indicate "please use this alternate size as the hint when compiling the dynamic shape". This is related to #105634 (comment).
