[inductor] Enable multilayer reductions with dynamic shapes #106747
Conversation
Currently multilayer reduction (aka split reductions) are only used with static shapes, which results in worse performance and accuracy when dynamic shapes are enabled. Instead, this only requires that the shape has a hint value.
CI status as of commit 27705b3: 3 unrelated failures, likely due to flakiness on trunk and marked as unstable (artifacts and rendered test results at hud.pytorch.org/pr/106747).
and _is_static(reduction_numel_hint)
and _is_static(numel_hint)
Can it be the case that `numel` is static but `reduction_numel` is not?
In fact, we probably just want `_is_static(reduction_numel_hint)` for the general case. Then, if we also have `numel_hint` static, even better, but it's not 100% necessary for the main optimisation, I don't think.
> Can it be the case that `numel` is static but `reduction_numel` is not?

`numel` here is actually the number of output elements, so it doesn't include the reduced dimensions.
> In fact, we probably just want `_is_static(reduction_numel_hint)` for the general case. Then, if we also have `numel_hint` static, even better, but it's not 100% necessary for the main optimisation, I don't think.

`numel_hint` is used in conditionals like `numel_hint >= num_sm * 2 * 32` when deciding on the number of splits, so we need to have a concrete value.
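As an illustration of why a concrete value is needed, consider this hedged sketch of a split heuristic. Only the quoted `num_sm * 2 * 32` comparison comes from the thread; the function name, constants, and fallback formula are invented:

```python
# A free sympy symbol cannot be compared against hardware-derived constants,
# so the hint must be a plain integer before this heuristic can run.
def pick_num_splits(reduction_numel_hint, numel_hint, num_sm=108):
    if numel_hint >= num_sm * 2 * 32:
        # Enough independent output elements to fill the GPU: no split needed.
        return 1
    # Otherwise split the reduction dimension to expose more parallelism
    # (the real inductor heuristics are considerably more involved).
    return max(1, min(reduction_numel_hint // 512, num_sm))

print(pick_num_splits(4096, 10_000))  # 1
print(pick_num_splits(4096, 128))     # 8
```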
By the way, as a further step we could use `bound_sympy` to deal with unbacked SymInts, but that's more than I need at this point to get `cumsum` working.
`bound_sympy` does not currently deal with unbacked SymInts, but it may be able to do something in some cases once #106568 is merged.
Say I had an expression `s0*100`; would `bound_sympy` not give a lower bound of 100, since shape variables are positive?
That currently happens if the symbol is marked as non-negative. When that PR is merged, we'll leverage all the other information we may have as part of the value range analysis and the constraints that are put in place during tracing time.
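The positivity point can be illustrated with plain sympy interval arithmetic — a hedged sketch of the same idea, not torch's `bound_sympy` itself:

```python
import sympy

s0 = sympy.Symbol("s0", integer=True)
expr = s0 * 100

# Assume tracing established s0 in [1, oo); propagating that interval
# through the expression yields a lower bound of 100 for s0*100.
bound = expr.subs(s0, sympy.AccumBounds(1, sympy.oo))
print(bound)  # AccumBounds(100, oo)
```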
I'm seeing geomean speedups in cudagraphs_dynamic for all benchmarks: 12% in torchbench, 17% in huggingface, and 15% in Timm models.
Great stuff! Perhaps also open an issue to follow up on the `bound_sympy` point?
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
reduction_numel_hint = V.graph.sizevars.symbolic_hint(reduction_numel)
numel_hint = V.graph.sizevars.symbolic_hint(sympy_product(ranges))
Maybe as a follow-up, add guards based on the heuristics here (line 624 in 27705b3):

def inner_reduction_splits(reduction_numel_hint, numel_hint):
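One way to read "add guards based on the heuristics" is to record which heuristic branch the hint selected and invalidate the compilation if a later size would pick a different branch. This is a hedged sketch with invented names, not inductor's actual guard API; only the `num_sm * 2 * 32` threshold comes from the thread:

```python
# Record a predicate that a future runtime size must satisfy for the compiled
# split choice to remain valid; a failing predicate signals recompilation.
def choose_split_with_guard(numel_hint, num_sm, guards):
    threshold = num_sm * 2 * 32
    if numel_hint >= threshold:
        guards.append(lambda numel: numel >= threshold)
        return 1  # enough independent outputs: no split
    guards.append(lambda numel: numel < threshold)
    return 2  # split (the real heuristic picks a tuned factor)

guards = []
splits = choose_split_with_guard(10_000, 108, guards)
# A later run with numel=100 fails the recorded guard.
```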
I vaguely remember Voz had something similar but it did not land? I guess the reason is that split reduction based on a hint may not work well when the dynamic dimension changes later. cc @voznesenskym @ezyang
I couldn't find Voz's PR, but with my own testing I'm seeing a pretty reasonable trade-off. When close to the hint size,
@peterbell10 this is true for the case where we run with large tensors, emit multilayer reductions, and then run with small tensors. But at present, if we run with small tensors and emit a single layer, then when we run with large tensors later we will still hit the single-layer path.
Yes, I agree this has no effect when the runtime size is larger than the hint. My impression was that guarding purely for performance reasons was discouraged, though maybe the accuracy implications would change that. I'm not sure; maybe @voznesenskym could weigh in, as he removed the old
We really need some way to indicate "please use this alternate size as the hint when compiling the dynamic shape". This is related to #105634 (comment).
Stack from ghstack (oldest at bottom):

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov