Conversation

@davidberard98 (Contributor) commented Mar 19, 2024

Stack from ghstack (oldest at bottom):

**Motivation**: #112771

**Summary**: Inductor generates Triton code that assumes inputs are going to be 16-byte aligned. If the inputs aren't aligned, Inductor clones them. This PR introduces a config option to turn that behavior off: when assume_aligned_inputs=False, Inductor will _not_ pass inputs as being divisible_by_16, and it will not make clones. This can generate code that might be a bit slower, but the tradeoff can be worth it in scenarios where you would otherwise make a lot of clones.
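For concreteness, here is a minimal sketch (not from the PR itself) of how the flag could be toggled, assuming the `torch._inductor.config` entry added by this PR:

```python
import torch
import torch._inductor.config as inductor_config

# Optimize for potentially-unaligned inputs: Inductor will neither mark
# inputs as divisible_by_16 nor clone them to force 16-byte alignment.
inductor_config.assume_aligned_inputs = False

@torch.compile
def f(x):
    return x * 2 + 1

base = torch.randn(1025)
out = f(base[1:])  # float32 view at a 4-byte storage offset: not 16-byte aligned
```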

Ideally, we could do this on a per-tensor basis. But that would be a lot of work, and attempts to automatically add guards on storage offsets have run into issues: recompilations and excessive time to generate and check guards.

**Tests**: #122159 flips this to False. It didn't run through all the errors, but the ones we see are all expected failures: divisible_by_16 changes; Triton kernel caching fails when the same Triton kernel is called multiple times (this makes sense because the first call has unaligned inputs but subsequent calls have aligned inputs); and some xfailed tests start passing.
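As a hypothetical illustration (not taken from #122159) of the caching scenario above, the same compiled function can see an unaligned view on the first call and an aligned tensor of the same shape later:

```python
import torch

g = torch.compile(lambda x: x + 1)

buf = torch.randn(1025)
g(buf[1:])   # first call: input starts 4 bytes into storage (unaligned)
g(buf[:-1])  # second call: same shape (1024,), but the input is 16-byte aligned
```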

**Alternatives/RFC**:

* Is this the right thing to do with cudagraphs?
* Elias and Jason mentioned that we probably still want to make clones if we're dealing with unaligned inputs to matmuls. Is this something we should add in this config option? (In the use case I'm targeting, it seems like we don't need this optimization right now)

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang

Differential Revision: [D55079094](https://our.internmc.facebook.com/intern/diff/D55079094)


pytorch-bot bot commented Mar 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122158

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 9d0fef9 with merge base 4b53590:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

…aligned"

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

Differential Revision: [D55079094](https://our.internmc.facebook.com/intern/diff/D55079094)

[ghstack-poisoned]
…aligned"


**Motivation**: #112771

**Summary**: Inductor generates triton that assumes that inputs are going to be 16-byte aligned. If the inputs aren't aligned, Inductor clones the inputs. This PR introduces a config option to not do this: when assume_aligned_inputs=False, Inductor will _not_ pass inputs as being divisible_by_16, and Inductor will not make clones. This an can generate code that might be a bit slower, but this tradeoff can be worth it in some scenarios where you might otherwise make a lot of clones.

Ideally, we could do this on a per-tensor basis. But this would be a lot of work, and attempts to add guards on storage offsets to do this automatically have run into issues: recompilations and excessive time to generate/check guards.

**Tests** #122159 flips this to False. It didn't run through all errors, but the ones we see are all expected failures: divisible_by_16 changes; triton kernel caching fails if we call the same triton kernel multiple times (this makes sense because the first call will have unaligned inputs, but subsequent calls have aligned inputs); and some xfailed tests start passing.

**Alternatives/RFC**:
* Is this the right thing to do with cudagraphs?
* Elias and Jason mentioned that we probably still want to make clones if we're dealing with unaligned inputs to matmuls. Is this something we should add in this config option? (In the use case I'm targeting, it seems like we don't need this optimization right now)

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

Differential Revision: [D55079094](https://our.internmc.facebook.com/intern/diff/D55079094)

[ghstack-poisoned]
davidberard98 added a commit that referenced this pull request Mar 22, 2024
@davidberard98 marked this pull request as ready for review March 22, 2024 14:30
@davidberard98 added the `topic: not user facing` label Mar 22, 2024
@ezyang (Contributor) left a comment:

Very nice, we should figure out a policy but this is a great start

@pytorch-bot added the `ciflow/trunk` label Mar 22, 2024
@eellison (Contributor) left a comment:

Nice! We should be able to assume alignment on parameters / static_inputs. If folks are going to be testing perf on this internally, it might be worth doing?


@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@davidberard98 (Contributor, Author) commented:

@eellison in the internal target this is apparently showing sufficiently good QPS. I think it’s reasonable to leave this as is until we have a use case where we expect the additional considerations to be necessary for QPS?

@davidberard98 (Contributor, Author) commented:

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

```python
# assume_aligned_inputs means that we assume that inputs will be aligned; we generate
# code using this assumption, and clone tensors before use if they aren't aligned.
# In the common case, most inputs will be aligned.
assume_aligned_inputs: bool = True
```
A contributor commented:
I'm a little confused by the naming. If we are cloning unaligned tensors, aren't we forcing the tensors to be aligned, not assuming they are going to be aligned? Or is this an inductor vs rest of the system perspective thing?

@davidberard98 (Contributor, Author) replied:
Whoops, didn't see this comment until now.

I guess my thought process was that, when this flag is turned on, inductor generates triton code that assumes its inputs will be aligned.

But I can see why this is confusing - do you have a suggestion for a better name?

@davidberard98 (Contributor, Author) added:

Or, put another way: assume_aligned_inputs=True -> optimize for the case where inputs are aligned; assume_aligned_inputs=False -> optimize for the case where inputs may be unaligned.
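As an illustration (not part of the PR), alignment here is a property of the tensor's data pointer, which already includes the storage offset; a minimal sketch of the check:

```python
import torch

def is_16_byte_aligned(t: torch.Tensor) -> bool:
    # data_ptr() accounts for the storage offset, so a sliced view can be
    # unaligned even though the underlying allocation is aligned.
    return t.data_ptr() % 16 == 0

x = torch.randn(64)
print(is_16_byte_aligned(x))      # allocations are typically well-aligned -> True
print(is_16_byte_aligned(x[1:]))  # offset by one float32 (4 bytes) -> False
```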

pytorch-bot bot pushed a commit that referenced this pull request Apr 22, 2024
Pull Request resolved: #122158
Approved by: https://github.com/ezyang
@github-actions github-actions bot deleted the gh/davidberard98/279/head branch May 4, 2024 01:55